CN112069795B - Corpus detection method, device, equipment and medium based on mask language model - Google Patents

Corpus detection method, device, equipment and medium based on mask language model Download PDF

Info

Publication number
CN112069795B
CN112069795B
Authority
CN
China
Prior art keywords
word
corpus
trained
generator
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010888877.4A
Other languages
Chinese (zh)
Other versions
CN112069795A (en)
Inventor
邓悦
郑立颖
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010888877.4A priority Critical patent/CN112069795B/en
Priority to PCT/CN2020/117434 priority patent/WO2021151292A1/en
Publication of CN112069795A publication Critical patent/CN112069795A/en
Application granted granted Critical
Publication of CN112069795B publication Critical patent/CN112069795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to artificial intelligence, and particularly discloses a corpus detection method, device, equipment and medium based on a mask language model, wherein the method comprises the following steps: inputting the corpus word to be trained into the generator for training to obtain probability distribution corresponding to the corpus word; inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not, and the prediction result is stored in a blockchain node; inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector; and detecting the state of the corpus word to be trained according to the context vector. The training efficiency of the model is effectively improved, and the abnormal condition of the log file can be efficiently and accurately judged.

Description

Corpus detection method, device, equipment and medium based on mask language model
Technical Field
The present disclosure relates to the field of intelligent decision making technologies, and in particular, to a corpus detection method, device, computer device, and medium based on a mask language model.
Background
In text processing, anomaly detection of log files plays an important role in the management of modern large distributed systems, which widely use logs to record system runtime information. Currently, operators typically use keyword searches and rule matching to examine and match logs. However, as workloads and business requirements increase, the time required for manual detection also increases, becoming ever more time-consuming and labor-intensive. In order to reduce the manual workload and improve detection accuracy, log anomaly detection methods based on deep learning are increasingly being applied in this direction.
The currently popular text processing models are based on mask pre-training language models, but modifying and training such models is constrained by their high demands on computing resources, training cost and running time.
Disclosure of Invention
The present application provides a corpus detection method, device, computer equipment and medium based on a mask language model for intelligent decision-making, which effectively improves model training efficiency and can efficiently and accurately judge abnormal conditions of log files.
In a first aspect, the present application provides a corpus detection method based on a mask language model, where the method is applied to the mask language model, and the mask language model includes a generator and a discriminator; the method comprises the following steps:
inputting the corpus word to be trained into the generator for training to obtain probability distribution corresponding to the corpus word;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector;
detecting the state of the corpus word to be trained according to the context vector; the categories of the corpus words comprise log file categories.
In a second aspect, the present application further provides a corpus detection apparatus, where the apparatus includes:
the first training module is used for inputting the corpus word to be trained into the generator for training to obtain probability distribution corresponding to the corpus word;
the second training module is used for inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
the adjustment module is used for inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result based on the classification label and the corpus word through the discriminator to obtain a context vector;
and the detection module is used for detecting the state of the corpus word to be trained according to the context vector.
In a third aspect, the present application also provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the corpus detection method based on the mask language model when executing the computer program.
In a fourth aspect, the present application further provides a computer readable storage medium storing a computer program, where the computer program when executed by a processor causes the processor to implement a corpus detection method based on a mask language model as described above.
The present application discloses a corpus detection method, device, computer equipment and storage medium based on a mask language model. A brand-new mask language model is adopted, comprising a generator and a discriminator. During training, the corpus words to be trained are input into the generator for training to obtain the probability distribution corresponding to the corpus words, and the probability distribution is then input into the discriminator for training to obtain the prediction result corresponding to the probability distribution, thereby determining the prediction result of the mask language model, where the prediction result includes whether the corpus words are replaced. Because all corpus words to be trained are input to the generator, the generator and the discriminator train together and the parameters of the discriminator and the generator are shared with each other, so the training time of the model is greatly reduced and training efficiency is higher. When the model is used, only the discriminator is employed, with classification labels input according to the category of the corpus words, so the efficiency of testing the model is greatly improved and the testing time is effectively shortened. After the context vector is obtained, the state of the corpus word to be trained is detected according to the context vector, for example the abnormal condition of the log file of the operation and maintenance server, making the anomaly detection result more efficient and rapid and greatly improving the detection speed for daily detection tasks.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an architecture diagram of a mask language model provided by an embodiment of the present application;
FIG. 2 is a diagram of a generator architecture provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of input word vectors for a generator provided by an embodiment of the present application;
FIG. 4 is a block diagram of a discriminator provided by an embodiment of the present application;
FIG. 5 is a flowchart of a corpus detection method based on a mask language model according to an embodiment of the present application;
FIG. 6 is a diagram of the discriminator when detecting log files provided by an embodiment of the present application;
FIG. 7 is a schematic block diagram of a corpus detection device according to an embodiment of the present application;
fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiment of the application provides a corpus detection method, device, computer equipment and storage medium based on a mask language model. The corpus detection method based on the mask language model effectively improves model training efficiency, and can efficiently and accurately judge abnormal conditions of the log file.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 shows the training architecture of a mask language model provided in an embodiment of the present application. The training architecture of the mask language model includes a generator and a discriminator; during training the generator and the discriminator are trained together, and during detection only the discriminator is used, which effectively improves training efficiency.
As shown in fig. 2, the generator comprises a series of encoders (Encoder) with corresponding inputs and outputs. The inputs of the generator are word vectors, and the outputs are probability distributions, i.e., for each position a probability distribution over words giving the probability of each word; the most probable word can be selected by taking the maximum probability.
Specifically, as shown in FIG. 3, the input of the encoder of the generator is a vector corresponding to each word: w_1, w_2, …, w_n. Each word vector is generated by superposing 3 partial vectors, which may include a word dimension vector, a sentence dimension vector, and a position dimension vector.
The task of the generator is simple in nature and its parameters are not particularly numerous; the architecture is designed so that the more complex task is handed to the discriminator.
The architecture of the discriminator is shown in FIG. 4. It is generally similar to the generator architecture and also includes a series of encoders, inputs and outputs. In FIG. 4, O_1 … O_n are the outputs from the generator; likewise, the inputs O_1 … O_n pass through the word vector layer (Embedding) of the discriminator and are then input into its Encoder structure. The difference from the generator is that a Classifier layer is added after the output of the discriminator's Encoder to judge whether each word has been replaced; the corresponding outputs R_1 … R_n are 0/1 classifications determined as probabilities, i.e., whether the word was replaced.
Based on the framework of the mask language model, a corpus detection method based on the mask language model is provided.
Referring to fig. 5, fig. 5 is a schematic flowchart of a corpus detection method based on a mask language model according to an embodiment of the present application. The corpus detection method based on the mask language model can be applied to the mask language model in fig. 1, so that the model training efficiency is effectively improved, and the abnormal condition of the log file can be efficiently and accurately judged.
As shown in fig. 5, the corpus detection method based on the mask language model specifically includes steps S101 to S104.
S101, inputting corpus words to be trained into the generator for training, and obtaining probability distribution corresponding to the corpus words.
When a user needs to partially mask a certain sentence, the whole sentence is input into the mask language model as the corpus word to be trained and fed into the generator of the mask language model for training, which specifically comprises the following steps:
s11, inputting word vectors corresponding to the corpus words to be trained into a generator, wherein the word vectors comprise word dimensions, sentence dimensions and position dimensions.
As shown in FIG. 3, the input of the encoder of the generator is a vector corresponding to each word: w_1, w_2, …, w_n. Each word vector is generated by superposing 3 partial vectors, which may include a word dimension vector, a sentence dimension vector, and a position dimension vector.
As shown in fig. 4, the corpus word to be trained is a sentence; taking "the value is high" and "it is urgent now" as examples, these sentences are input to the generator, and the input includes the word dimension, the sentence dimension and the position dimension.
S12, inputting the word vectors with the word dimensions, the sentence dimensions and the position dimensions which are overlapped into an encoder of the generator for encoding to obtain word vectors with each dimension, wherein the encoder comprises a plurality of encoders.
The generator superimposes the input word vectors, which include the word dimension, the sentence dimension and the position dimension, to obtain superimposed word vectors, and then inputs them into the encoder of the generator for encoding to obtain the word vectors of each dimension; the encoder of the generator has multiple layers, and the word vectors of each dimension are obtained through layer-by-layer encoding of the generator.
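A minimal sketch of this superposition follows (in Python/PyTorch, which is an assumption of this description, as are the module names, the 768-dimensional hidden size and the vocabulary size; the patent publishes no code, and a learned position table is used here for brevity while the sinusoidal position coding of the patent is sketched further below):

```python
import torch
import torch.nn as nn

class GeneratorEmbedding(nn.Module):
    """Superposes the word, sentence (segment) and position vectors into one input vector."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)      # word dimension vector
        self.sentence = nn.Embedding(n_segments, hidden)  # sentence dimension vector
        self.position = nn.Embedding(max_len, hidden)     # position dimension vector

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # the three partial vectors are superposed (summed) element-wise
        return self.word(token_ids) + self.sentence(segment_ids) + self.position(positions)

emb = GeneratorEmbedding()
tokens = torch.randint(0, 30522, (1, 6))        # a sentence of 6 words
segments = torch.zeros(1, 6, dtype=torch.long)  # all words belong to the first sentence
print(emb(tokens, segments).shape)              # torch.Size([1, 6, 768])
```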
The generator may be a pre-trained model, or the model may be trained at input time. For example, the overall length of the word vector may be 768; for the word dimension, if the input corpus word to be trained corresponds to 6 words, the word-dimension output of the generator model is (6, 768).
In some embodiments, for the sentence dimension, when the input corpus word to be trained includes two sentences, the generator may add different word vectors (embeddings) to different sentences, with the first sentence corresponding to (1, 768) and the second sentence to (2, 768).
In some embodiments, when the corpus word to be trained includes the same word at different positions, the position dimension needs to take the position information of each word into account. For example, if the input sentence is "I come and I watch", the two occurrences of "I" are different for the generator. For position information, the generator adopts a sinusoidal coding manner, with the following formulas:
$$PE_{(pos,\,2i)}=\sin\left(\frac{pos}{10000^{2i/d}}\right)\qquad(1)$$

$$PE_{(pos,\,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d}}\right)\qquad(2)$$
where pos is the position index, representing the position of the word in the sequence; i is an index within the vector; and d is the dimension of the generator model, which uses 768 dimensions. This formula encodes position information with a sine function at the even positions of the vector and a cosine function at the odd positions, so that each dimension of the position-coded vector is a waveform of a different frequency, with each value between -1 and 1, yielding the position dimension.
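The sinusoidal coding of Equations 1 and 2 can be sketched as follows (a Python/NumPy illustration; the function name and the toy sequence length are assumptions, not from the patent):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model=768):
    """Equations 1 and 2: sine at even vector positions, cosine at odd positions;
    each dimension is a waveform of a different frequency with values in [-1, 1]."""
    pos = np.arange(seq_len)[:, None]                # position of the word in the sequence
    i = np.arange(d_model // 2)[None, :]             # index within the vector
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even positions: sine
    pe[:, 1::2] = np.cos(angles)                     # odd positions: cosine
    return pe

pe = sinusoidal_position_encoding(seq_len=6)
print(pe.shape, float(pe.min()), float(pe.max()))    # (6, 768), values within [-1, 1]
```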
S13, randomly replacing part of words in the dimension word vectors according to a preset replacement rule, and obtaining probability distribution corresponding to each dimension word vector.
When the generator performs masking, it replaces part of the words of the input corpus word to be trained according to a preset rule, for example replacing the token at a mask position; at output time it predicts the masked words from the context of the unmasked words in the sentence. For example, as shown in fig. 1, in the input corpus word to be trained "the value is high", the mask position corresponds to the word "value"; the masked word "value" is then predicted from the unmasked context "the … is high".
In some embodiments, when masking, a preset replacement rule is used: for example, in addition to randomly selecting 20% of the input as mask positions, the generator applies the following rules within that 20%:
1. 10% are replaced with arbitrary words;
2. 10% of words do not change;
3. 80% of the words are replaced with [MASK].
And randomly replacing part of words in the dimension word vectors according to a preset replacement rule to obtain probability distribution corresponding to each dimension word vector.
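A minimal sketch of this preset replacement rule (Python; the token string, helper name and toy vocabulary are illustrative assumptions):

```python
import random

MASK_TOKEN = "[MASK]"

def apply_mask(tokens, vocab, select_prob=0.20, seed=None):
    """Randomly selects ~20% of input positions; within the selection,
    80% become [MASK], 10% become an arbitrary word, 10% stay unchanged."""
    rng = random.Random(seed)
    out, selected = list(tokens), []
    for idx in range(len(tokens)):
        if rng.random() >= select_prob:
            continue                       # position not selected for masking
        r = rng.random()
        if r < 0.8:
            out[idx] = MASK_TOKEN          # 80%: replace with [MASK]
        elif r < 0.9:
            out[idx] = rng.choice(vocab)   # 10%: replace with an arbitrary word
        # remaining 10%: the word does not change
        selected.append(idx)
    return out, selected

print(apply_mask("the value is high".split(), vocab=["key", "low", "price"], seed=7))
```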
It will be appreciated that the Encoder of the generator uses a self-attention mechanism, which aims, when processing a single word, to find the words most relevant to it in the sentence and fuse them into the word being processed, so as to achieve a better encoding effect. In the multi-head attention mechanism, the Encoder performs the self-attention computation 16 times and obtains the final output through a linear mapping. Through the multi-head attention mechanism, different positions are captured by the model, so that information from different dimensions of the sentence is captured.
By using a multi-head attention mechanism and a brand new word embedding method, triple-dimension (position dimension, sentence dimension and word dimension) coding information is introduced, so that more dimensions are brought to understanding words.
In some embodiments, after inputting the corpus word to be trained into the generator for training, and obtaining the probability distribution corresponding to the corpus word, the method may further include:
a loss function of the generator is calculated, and the generator is adjusted according to the loss function of the generator.
The loss function of the generator measures whether the prediction from context is correct for the words that are [MASK], as given by Equation 3:
$$L_{MLM}(x,\theta_G)=\mathbb{E}\left(\sum_{i\in masked}-\log\, p_G\!\left(x_i\mid x^{masked}\right)\right)\qquad(3)$$
where L_MLM is the loss function of the generator, x is the sample, x^masked is the sample masked during Embedding, θ_G is a parameter of the generator, and p_G(x_i | x^masked) is the conditional distribution of the word x_i given the masked sample.
The corpus words to be trained are thus superposed into word vectors and encoded by the generator to implement masking, and each word vector and its corresponding probability distribution are output.
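Equation 3 can be sketched in code as follows (PyTorch; the function name and toy shapes are assumptions — it computes the negative log-likelihood of the original words at the masked positions):

```python
import torch
import torch.nn.functional as F

def generator_mlm_loss(logits, target_ids, masked_positions):
    """Equation 3 in code: negative log-likelihood of the original word x_i
    at each masked position, given the masked sample x^masked.
    logits: (seq_len, vocab_size) generator outputs; target_ids: (seq_len,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs[masked_positions, target_ids[masked_positions]]
    return nll.mean()

logits = torch.randn(5, 10)                  # toy 5-word sentence, 10-word vocabulary
targets = torch.randint(0, 10, (5,))
print(generator_mlm_loss(logits, targets, masked_positions=torch.tensor([1, 3])))
```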
In some embodiments, when the user performs some business operations, the business system will generally generate some corresponding log files, and when the log is detected, the mask language model of the scheme is applied. If the category of the corpus word to be trained is a log file category, before inputting the log file to be trained into the generator for training to obtain the probability distribution corresponding to the corpus word, the method may include:
and preprocessing the log file to be trained.
Specifically, the preprocessing may be, after converting the uppercase format in the log file to lowercase, to filter out fixed text that is identical in structure and to replace unimportant information (address/time/date).
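A minimal sketch of such preprocessing (Python; the regular expressions and placeholder tokens are illustrative assumptions for a typical log format, not patterns given in the patent):

```python
import re

def preprocess_log_line(line):
    """Converts a log line to lowercase and replaces unimportant fields
    (address/time/date) with placeholder tokens."""
    line = line.lower()
    line = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", line)                      # 2020-08-28
    line = re.sub(r"\d{2}:\d{2}:\d{2}", "<time>", line)                      # 12:30:45
    line = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}(?::\d+)?\b", "<addr>", line)   # 10.0.0.1:8080
    return line

print(preprocess_log_line("2020-08-28 12:30:45 ERROR Connect to 10.0.0.1:8080 FAILED"))
# -> '<date> <time> error connect to <addr> failed'
```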
After the log file is preprocessed, the processed log file is input into a generator for training.
For different corpus words to be trained, different preprocessing needs to be performed first, and in this case, a log file is taken as an example, but the method is not limited to the log file.
S102, inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not.
Specifically, as shown in fig. 4, inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word is replaced, may include:
and replacing the word vectors corresponding to the probability distribution by the discriminator according to a preset replacement probability so as to predict whether the word vectors corresponding to the probability distribution are replaced or not, and obtaining a prediction result.
In some embodiments, after the discriminator receives the output of the generator, it also replaces the input word vectors with a certain probability in order to predict whether the words output by the generator have been replaced. Specifically, O_1 … O_n correspond to the outputs of the generator; each is input through the word vector layer into the Encoder structure, and finally, unlike the generator, a Classifier layer is added to judge whether each word is original or replaced, with corresponding outputs R_1 … R_n. The prediction result includes whether each word vector was replaced, i.e., the two outcomes "replaced" and "not replaced".
For example, the corpus word to be trained "the value is high" is input into the generator, subjected to the three-dimension superposition and partial masking, and the generator outputs "the key is high"; evidently "value" was masked and replaced. The output "the key is high" is then input into the discriminator for discrimination; discriminating according to the preset replacement probability, the discriminator judges "the", "is" and "high" as original and "key" as replaced, i.e., the discriminator identifies the specific replaced word.
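The Classifier layer added on top of the discriminator's Encoder can be sketched as follows (PyTorch; the class name and sizes are assumptions):

```python
import torch
import torch.nn as nn

class ReplacedTokenClassifier(nn.Module):
    """The extra Classifier layer on the discriminator's Encoder output: for each
    position it outputs the probability that the word there was replaced (0/1)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.proj = nn.Linear(hidden, 1)

    def forward(self, encoder_states):               # (batch, seq_len, hidden)
        return torch.sigmoid(self.proj(encoder_states)).squeeze(-1)

clf = ReplacedTokenClassifier()
states = torch.randn(1, 4, 768)                      # e.g. encodings of "the key is high"
print((clf(states) > 0.5).long())                    # 1 marks a word judged as replaced
```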
In some embodiments, after detecting the state of the corpus word to be trained from the context vector, the method further comprises:
and calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator. The loss function of the arbiter is as in equation 4:
$$L_{Disc}(x,\theta_D)=\mathbb{E}\left(\sum_{t=1}^{n}-\mathbb{1}\!\left(x_t^{corrupt}=x_t\right)\log D\!\left(x^{corrupt},t\right)-\mathbb{1}\!\left(x_t^{corrupt}\neq x_t\right)\log\!\left(1-D\!\left(x^{corrupt},t\right)\right)\right)\qquad(4)$$
where L_Disc is the loss function of the discriminator, 𝟙(·) is the indicator function, x is the sample, t is the time step, x^corrupt is the replaced sample, θ_D is a parameter of the discriminator, and D is the discriminator.
And superposing the loss function of the generator and the loss function of the discriminator to obtain an overall loss function so as to adjust the mask language model.
Specifically, the loss function of the generator is superimposed with the loss function of the discriminator, giving the model the overall loss function of Equation 5:
$$\min_{\theta_G,\,\theta_D}\ \sum_{x\in X} L_{MLM}(x,\theta_G)+\lambda\,L_{Disc}(x,\theta_D)\qquad(5)$$

where X is the set of training samples and λ weights the discriminator loss.
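Equations 4 and 5 can be sketched together as follows (PyTorch; here the model is assumed to output the probability that a word was replaced, i.e. 1 - D in Equation 4, which yields the same loss, and the weighting constant lam is an assumed value, since the patent states only that the two losses are superposed):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(replaced_probs, is_replaced):
    """Equation 4 as token-level binary cross-entropy: the indicator
    1(x_t^corrupt = x_t) selects the 'original' vs 'replaced' term."""
    return F.binary_cross_entropy(replaced_probs, is_replaced.float())

def total_loss(l_mlm, l_disc, lam=50.0):
    """Equation 5: the generator and discriminator losses are superposed;
    lam is an assumed weighting constant, not a value stated in the patent."""
    return l_mlm + lam * l_disc

probs = torch.tensor([0.1, 0.9, 0.2, 0.1])   # discriminator outputs for 4 words
labels = torch.tensor([0, 1, 0, 0])          # only the second word was replaced
print(total_loss(torch.tensor(2.3), discriminator_loss(probs, labels)))
```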
since the generator and the arbiter are identical structures, model training can be made more efficient by parameter sharing of the model for parameters contained in the generator and the arbiter in the model. Moreover, when training, the generator and the discriminant are trained together, and when the model is used, only the discriminant is put into use, so that more parameters of the model can be reduced, and better training efficiency can be achieved.
It is emphasized that to further guarantee the privacy and security of the prediction, the prediction may also be stored in a blockchain node.
S103, inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector.
In some embodiments, when the corpus word is a log file, the log file is preprocessed, then the preprocessed log file is input into a generator and a discriminator for training, and when the corpus word is detected, a prediction result obtained after training is input into the discriminator of the model, and the input of the model is the word corresponding to each log text.
Specifically, when the category of the corpus word is a log file category, inputting a classification label in the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector, which may include:
s31, in the discriminator, replacing the first word corresponding to the log file with a classification label.
Specifically, the length of the input may be set to 512, and the start O_1 of each sentence is replaced with a [CLS] classification tag at input time, corresponding to whether the log is abnormal or not.
S32, after all words corresponding to the log file are input to an encoder for training, vectors corresponding to the classification labels are input to a classified neural network for training, and context vectors are output, wherein the first position of the context vectors corresponds to the classification labels.
Specifically, considering that anomaly detection is a two-class task, after the layer-by-layer Encoder training each layer obtains a vector, whose length may be set to 768. The vector of the [CLS] classification label is directly input into a two-class neural network (corresponding to the Classifier in the figure above) for judging whether an anomaly exists; the output result is 0/1, corresponding to the judgment of abnormal or non-abnormal.
In some embodiments, if the detection is a multi-classification task, a multi-classification neural network may be used in place of the two-class classifier; a SoftMax logistic regression function yields the probability of each class, and classification is completed by assigning the sample to the class with the maximum probability, obtaining the classification result.
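A minimal sketch of this [CLS]-based classification head (PyTorch; the class name and sizes are assumptions), covering both the two-class case and the SoftMax multi-classification case:

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Feeds the vector at the first ([CLS]) position into a small classification
    network. n_classes=2 gives the 0/1 abnormal/non-abnormal judgment; n_classes>2
    with SoftMax handles the multi-classification case."""
    def __init__(self, hidden=768, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, encoder_states):        # (batch, seq_len, hidden)
        cls_vec = encoder_states[:, 0]        # only the first ([CLS]) position is used
        return torch.softmax(self.fc(cls_vec), dim=-1)

head = ClsHead()
states = torch.randn(1, 512, 768)             # one log sentence, input length 512
print(head(states).argmax(dim=-1))            # class with the maximum probability
```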
S104, detecting the state of the corpus word to be trained according to the context vector; the categories of the corpus words comprise log file categories.
In some embodiments, when the category of the corpus word is a log file category, detecting the state of the corpus word to be trained according to the context vector may include: and judging the abnormal condition of the log file according to the first position of the context vector.
When an abnormal log file is detected, since the vector of the [CLS] classification tag is directly input into the two-class neural network, only the output vector at the first position, the [CLS] position, is attended to as the context vector.
As shown in fig. 6, which is a diagram of the discriminator architecture when detecting abnormal log files, the input for log anomaly detection is the sentence corresponding to each log file, with the input length set to 512. At input time, the beginning of each sentence is replaced with [CLS] (the classification label), yielding the abnormal or non-abnormal detection result for the corresponding log.
Because only the discriminator is needed during detection, the CPU and memory load of the server judging operation and maintenance abnormality information is reduced; meanwhile, the anomaly detection result is obtained more efficiently and rapidly, greatly improving the detection speed for daily detection tasks.
This embodiment provides a corpus detection method based on a mask language model, adopting a brand-new mask language model comprising a generator and a discriminator. During training, the corpus words to be trained are input into the generator for training to obtain the probability distribution corresponding to the corpus words, and the probability distribution is then input into the discriminator for training to obtain the prediction result corresponding to the probability distribution, thereby determining the prediction result of the mask language model, where the prediction result includes whether the corpus words are replaced. When the model is used, only the discriminator is employed, with classification labels input according to the category of the corpus words, so the efficiency of testing the model is greatly improved and the testing time is effectively shortened. After the context vector is obtained, the state of the corpus word to be trained is detected according to the context vector, for example the abnormal condition of the log file of the operation and maintenance server, making the anomaly detection result more efficient and rapid and greatly improving the detection speed for daily detection tasks.
Referring to fig. 7, fig. 7 is a schematic block diagram of a corpus detection device according to an embodiment of the present application, where the corpus detection device is configured to perform the foregoing corpus detection method based on a mask language model. The corpus detection device can be configured at a terminal or a server.
As shown in fig. 7, the corpus detection apparatus 400 includes: a first training module 401, a second training module 402, an adjusting module 403, and a detecting module 404.
The first training module 401 is configured to input a corpus word to be trained into the generator for training, so as to obtain probability distribution corresponding to the corpus word;
a second training module 402, configured to input the probability distribution to the arbiter for training, to obtain a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word is replaced;
an adjustment module 403, configured to input a classification label in the arbiter according to the category of the corpus word, and adjust the prediction result by the arbiter based on the classification label and the corpus word, so as to obtain a context vector;
and the detection module 404 is configured to detect a state of the corpus word to be trained according to the context vector.
It should be noted that, for convenience and brevity of description, the specific working process of the apparatus and each module described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server.
With reference to FIG. 8, the computer device includes a processor, memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of a number of corpus detection methods based on a mask language model.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by the processor, causes the processor to perform any of a variety of corpus detection methods based on a mask language model.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein in one embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:
inputting the corpus word to be trained into the generator for training to obtain probability distribution corresponding to the corpus word;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector;
and detecting the state of the corpus word to be trained according to the context vector.
In some embodiments, the processor implements the inputting the corpus word to be trained into the generator for training, to obtain a probability distribution corresponding to the corpus word, including:
inputting word vectors corresponding to the corpus words to be trained into a generator, wherein the word vectors comprise word dimensions, sentence dimensions and position dimensions;
inputting the word vectors with the word dimension, the sentence dimension and the position dimension overlapped into an encoder of the generator for encoding to obtain word vectors with each dimension, wherein the encoder comprises a plurality of encoders;
and randomly replacing part of words in the dimension word vectors according to a preset replacement rule, and obtaining probability distribution corresponding to each dimension word vector.
In some embodiments, the processor realizes that the category of the corpus word to be trained is a log file category, and before inputting the corpus word to be trained into the generator for training, obtaining the probability distribution corresponding to the corpus word, the processor includes:
and preprocessing the log file to be trained.
In some embodiments, after the processor performs training by inputting the corpus word to be trained into the generator, obtaining a probability distribution corresponding to the corpus word, the method includes:
a loss function of the generator is calculated, and the generator is adjusted according to the loss function of the generator.
In some embodiments, after the processor implements the detecting the state of the corpus word to be trained according to the context vector, the method includes:
and calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator.
In some embodiments, the processor implementation includes:
and superposing the loss function of the generator and the loss function of the discriminator to obtain an overall loss function so as to adjust the mask language model.
In some embodiments, the processor further performs inputting the probability distribution to the arbiter for training, to obtain a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word is replaced, and includes:
and replacing word vectors corresponding to the probability distribution by the discriminator according to preset replacement probability to predict whether the word vectors corresponding to the probability distribution are replaced or not, so as to obtain a prediction result, wherein the prediction result is stored in a block chain node.
In some embodiments, the processor further inputs a classification label into the arbiter according to the category of the corpus word, adjusts the prediction result based on the classification label and the corpus word by the arbiter, and obtains a context vector, including:
in the discriminator, replacing the first word corresponding to the log file with a classification label;
after all words corresponding to the log file are input to an encoder for training, vectors corresponding to the classification labels are input to a classified neural network for training, and context vectors are output, wherein the first position of the context vectors corresponds to the classification labels;
the detecting the state of the corpus word to be trained according to the context vector comprises the following steps:
and judging the abnormal condition of the log file according to the first position of the context vector.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, the computer program comprises program instructions, and the processor executes the program instructions to realize any corpus detection method based on the mask language model.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A corpus detection method based on a mask language model, which is characterized in that the method is applied to the mask language model, and the mask language model comprises a generator and a discriminator; the method comprises the following steps:
inputting the corpus word to be trained into the generator for training to obtain probability distribution corresponding to the corpus word;
inputting the probability distribution into the discriminator for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result by the discriminator based on the classification label and the corpus word to obtain a context vector;
detecting the state of the corpus word to be trained according to the context vector;
inputting the corpus word to be trained into the generator for training to obtain probability distribution corresponding to the corpus word, wherein the method comprises the following steps:
inputting word vectors corresponding to the corpus words to be trained into a generator, wherein the word vectors comprise word dimensions, sentence dimensions and position dimensions;
inputting the word vectors with the word dimension, the sentence dimension and the position dimension overlapped into an encoder of the generator for encoding to obtain word vectors with each dimension, wherein the encoder comprises a plurality of encoders;
randomly replacing part of words in the dimension word vectors according to a preset replacement rule to obtain probability distribution corresponding to each dimension word vector;
under the condition that the dimension of the word vector corresponding to the corpus word to be trained is the sentence dimension and the input corpus word to be trained comprises two sentences, the word vector input generator adds different word vectors to different sentences;
and under the condition that the dimension of the word vector corresponding to the corpus word to be trained is the position dimension and the corpus word to be trained comprises the same word at different positions, adding different word vectors according to the position information of the word, encoding by using a sine function at even positions of the word vectors and encoding by using a cosine function at odd positions of the word vectors.
2. The method according to claim 1, wherein the category of the corpus word to be trained is a log file category, and before the corpus word to be trained is input into the generator for training, and a probability distribution corresponding to the corpus word is obtained, the method comprises:
and preprocessing the log file to be trained.
3. The method according to claim 1, wherein after inputting the corpus word to be trained into the generator for training, and obtaining the probability distribution corresponding to the corpus word, the method further comprises:
calculating a loss function of the generator, and adjusting the generator according to the loss function of the generator;
after the state of the corpus word to be trained is detected according to the context vector, the method further comprises:
and calculating a loss function of the discriminator, and adjusting the discriminator according to the loss function of the discriminator.
4. A method according to claim 3, characterized in that the method further comprises:
and superposing the loss function of the generator and the loss function of the discriminator to obtain an overall loss function so as to adjust the mask language model.
5. The method according to claim 1, wherein the inputting the probability distribution into the arbiter for training, and obtaining a prediction result corresponding to the probability distribution, where the prediction result includes whether the corpus word is replaced, includes:
and replacing word vectors corresponding to the probability distribution by the discriminator according to preset replacement probability to predict whether the word vectors corresponding to the probability distribution are replaced or not, so as to obtain a prediction result, wherein the prediction result is stored in a block chain.
6. The method according to claim 1, wherein the category of the corpus word to be trained is a log file category, the classifying label is input in the discriminator according to the category of the corpus word, the predicting result is adjusted by the discriminator based on the classifying label and the corpus word, and a context vector is obtained, including:
in the discriminator, replacing the first word corresponding to the log file with a classification label;
after all words corresponding to the log file are input to an encoder for training, vectors corresponding to the classification labels are input to a classified neural network for training, and context vectors are output, wherein the first position of the context vectors corresponds to the classification labels;
the detecting the state of the corpus word to be trained according to the context vector comprises the following steps:
and judging the abnormal condition of the log file according to the first position of the context vector.
7. A corpus detection apparatus, comprising:
the first training module is used for inputting corpus words to be trained into a generator of a mask language model for training, and obtaining probability distribution corresponding to the corpus words;
the second training module is used for inputting the probability distribution into a discriminator of the mask language model for training to obtain a prediction result corresponding to the probability distribution, wherein the prediction result comprises whether the corpus word is replaced or not;
the adjustment module is used for inputting a classification label into the discriminator according to the category of the corpus word, and adjusting the prediction result based on the classification label and the corpus word through the discriminator to obtain a context vector;
the detection module is used for detecting the state of the corpus word to be trained according to the context vector;
inputting the corpus word to be trained into a generator of a mask language model for training to obtain probability distribution corresponding to the corpus word, wherein the training comprises the following steps:
inputting word vectors corresponding to the corpus words to be trained into a generator, wherein the word vectors comprise word dimensions, sentence dimensions and position dimensions;
inputting the word vectors with the word dimension, the sentence dimension and the position dimension overlapped into an encoder of the generator for encoding to obtain word vectors with each dimension, wherein the encoder comprises a plurality of encoders;
randomly replacing part of words in the dimension word vectors according to a preset replacement rule to obtain probability distribution corresponding to each dimension word vector;
under the condition that the dimension of the word vector corresponding to the corpus word to be trained is the sentence dimension and the input corpus word to be trained comprises two sentences, the word vector input generator adds different word vectors to different sentences;
and under the condition that the dimension of the word vector corresponding to the corpus word to be trained is the position dimension and the corpus word to be trained comprises the same word at different positions, adding different word vectors according to the position information of the word, encoding by using a sine function at even positions of the word vectors and encoding by using a cosine function at odd positions of the word vectors.
8. A computer device, the computer device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and implement the corpus detection method based on a mask language model according to any one of claims 1 to 6 when the computer program is executed.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the corpus detection method based on a mask language model according to any one of claims 1 to 6.
CN202010888877.4A 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model Active CN112069795B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010888877.4A CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model
PCT/CN2020/117434 WO2021151292A1 (en) 2020-08-28 2020-09-24 Corpus monitoring method based on mask language model, corpus monitoring apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010888877.4A CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model

Publications (2)

Publication Number Publication Date
CN112069795A CN112069795A (en) 2020-12-11
CN112069795B true CN112069795B (en) 2023-05-30

Family

ID=73660536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010888877.4A Active CN112069795B (en) 2020-08-28 2020-08-28 Corpus detection method, device, equipment and medium based on mask language model

Country Status (2)

Country Link
CN (1) CN112069795B (en)
WO (1) WO2021151292A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011177B (en) * 2021-03-15 2023-09-29 北京百度网讯科技有限公司 Model training and word vector determining method, device, equipment, medium and product
CN113094482B (en) * 2021-03-29 2023-10-17 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN114049662B (en) * 2021-10-18 2024-05-28 天津大学 Facial feature transfer learning-based expression recognition network device and method
CN114723073B (en) * 2022-06-07 2023-09-05 阿里健康科技(杭州)有限公司 Language model pre-training method, product searching method, device and computer equipment
CN114936327B (en) * 2022-07-22 2022-10-28 腾讯科技(深圳)有限公司 Element recognition model acquisition method and device, computer equipment and storage medium
CN115495314A (en) * 2022-09-30 2022-12-20 中国电信股份有限公司 Log template identification method and device, electronic equipment and readable medium
CN116662579B (en) * 2023-08-02 2024-01-26 腾讯科技(深圳)有限公司 Data processing method, device, computer and storage medium
CN117786104B (en) * 2023-11-17 2024-06-21 中信建投证券股份有限公司 Model training method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009628A (en) * 2017-10-30 2018-05-08 杭州电子科技大学 A kind of method for detecting abnormality based on generation confrontation network
CN108734276A (en) * 2018-04-28 2018-11-02 同济大学 A kind of learning by imitation dialogue generation method generating network based on confrontation

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8589163B2 (en) * 2009-12-04 2013-11-19 At&T Intellectual Property I, L.P. Adapting language models with a bit mask for a subset of related words
CN110196894B (en) * 2019-05-30 2021-06-08 北京百度网讯科技有限公司 Language model training method and language model prediction method
CN111028206A (en) * 2019-11-21 2020-04-17 万达信息股份有限公司 Prostate cancer automatic detection and classification system based on deep learning
CN111414772B (en) * 2020-03-12 2023-09-26 北京小米松果电子有限公司 Machine translation method, device and medium
CN111241291B (en) * 2020-04-24 2023-01-03 支付宝(杭州)信息技术有限公司 Method and device for generating countermeasure sample by utilizing countermeasure generation network
CN111539223B (en) * 2020-05-29 2023-08-18 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009628A (en) * 2017-10-30 2018-05-08 杭州电子科技大学 A kind of method for detecting abnormality based on generation confrontation network
CN108734276A (en) * 2018-04-28 2018-11-02 同济大学 A kind of learning by imitation dialogue generation method generating network based on confrontation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Construction and Application of a Chinese Joke Corpus; Ren Lu; Yang Liang; Xu Linhong; Fan Xiaochao; Diao Yufeng; Lin Hongfei; Journal of Chinese Information Processing (No. 07); pp. 25-34 *

Also Published As

Publication number Publication date
CN112069795A (en) 2020-12-11
WO2021151292A1 (en) 2021-08-05

Similar Documents

Publication Publication Date Title
CN112069795B (en) Corpus detection method, device, equipment and medium based on mask language model
Nedelkoski et al. Self-attentive classification-based anomaly detection in unstructured logs
Tay et al. Compare, compress and propagate: Enhancing neural architectures with alignment factorization for natural language inference
CN111859960B (en) Semantic matching method, device, computer equipment and medium based on knowledge distillation
US11763091B2 (en) Automated content tagging with latent dirichlet allocation of contextual word embeddings
Gomes et al. BERT-and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: a comparative study
US20230252139A1 (en) Efficient transformer for content-aware anomaly detection in event sequences
CN115204886A (en) Account identification method and device, electronic equipment and storage medium
CN116305119A (en) APT malicious software classification method and device based on predictive guidance prototype
CN116663539A (en) Chinese entity and relationship joint extraction method and system based on Roberta and pointer network
Karlsen et al. Benchmarking Large Language Models for Log Analysis, Security, and Interpretation
Otsubo et al. Compiler provenance recovery for multi-cpu architectures using a centrifuge mechanism
CN116526670A (en) Information fusion method for power big data visualization
Nagashima Simple dataset for proof method recommendation in isabelle/hol
Althar et al. Software Source Code: Statistical Modeling
Gaykar et al. A Hybrid Supervised Learning Approach for Detection and Mitigation of Job Failure with Virtual Machines in Distributed Environments.
Bernardelli et al. The BeMi stardust: a structured ensemble of binarized neural networks
Bobek et al. Framework for benchmarking rule-based inference engines
Shen et al. EDP-BGCNN: effective defect prediction via BERT-based graph convolutional neural network
Bisi et al. CNN-BPSO approach to select optimal values of CNN parameters for software requirements classification
Bodyanskiy et al. Semantic annotation of text documents using evolving neural network based on principle “Neurons at Data Points”
Al Saidat et al. Exploring the Interpretability of Sequential Predictions Through Rationale Model
Xu et al. Security monitoring data fusion method based on ARIMA and LS-SVM
Sarbakysh et al. A1BERT: A Language-Agnostic Graph Neural Network Model for Vulnerability Detection
Lin et al. Accelerating transmission-constrained unit commitment via a data-driven learning framework

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040160

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant