CN110659175A - Log trunk extraction method, log trunk classification method, log trunk extraction equipment and log trunk storage medium

Info

Publication number: CN110659175A
Application number: CN201810703614.4A
Authority: CN
Inventors: 韩静; 李苏南; 刘建伟; 董辛酉; 樊元元; 双锴; 吕志恒; 李怡雯
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2018-06-30
Filing date: 2018-06-30
Publication date: 2020-01-07

Abstract

The embodiments of the invention disclose a log trunk extraction method, a log classification method, a device, and a storage medium, belonging to the technical field of communication. The classification method comprises: after receiving log data, classifying the log data according to a classification template; when the classification is unsuccessful, performing trunk extraction on the log data that failed classification to generate a new classification template; and classifying the failed log data according to the new classification template. The embodiments enable automatic trunk extraction and classification of logs with high accuracy and without manual intervention, which improves the accuracy and efficiency of log classification and abnormal pattern recognition and speeds up subsequent log processing.

Description

Log trunk extraction method, log trunk classification method, log trunk extraction equipment and log trunk storage medium

Technical Field

The invention relates to the technical field of computers, and in particular to a log trunk extraction method, a log classification method, a device, and a storage medium.

Background

Modern computer systems generate large amounts of log data. Log data records the internal operating state of a system, and system administrators or domain experts use it to understand and optimize system behavior and to detect anomalies. Given the enormous volume of log data produced by IaaS/PaaS/SaaS platforms, manual analysis alone is far from sufficient. If the trunk of each log can be extracted automatically and massive numbers of logs can be classified effectively, administrators can analyze logs more efficiently. How to classify logs quickly and accurately has therefore become an urgent problem.

Most existing log classification methods classify log texts by analyzing their statistical regularities. At present, log classification generally clusters a large number of raw logs directly, without extracting the trunks of the logs. If fairly complete trunks could be extracted, subsequent log processing, abnormal pattern identification, and similar work would be much easier for administrators. Most existing trunk extraction algorithms that perform well still require manual assistance and are not sufficiently automatic. In addition, operating directly on large numbers of raw logs without preprocessing occupies system resources and slows processing.

Disclosure of Invention

In view of the above, the present invention provides a method, an apparatus, and a computer storage medium for classifying log files, so as to solve the technical problem of insufficient automation in current log classification.

The technical scheme adopted by the embodiment of the invention for solving the technical problems is as follows:

According to an aspect of the present invention, there is provided a method for extracting a trunk of a log, the method including:

preprocessing log data;

clustering the preprocessed log data according to a preset initial clustering number;

performing adaptive optimization on the current clustering result according to a preset clustering acceptance condition;

extracting a trunk from each cluster.

According to another aspect of the present invention, there is provided a method for classifying logs, the method including:

after receiving the log data, classifying the log data according to a classification template;

when the classification is unsuccessful, performing trunk extraction on the log data that failed classification to generate a new classification template; and classifying the failed log data according to the new classification template.

The log trunk extraction method, log classification method, device, and storage medium provided by the embodiments of the invention search for the number of clusters adaptively and treat trunk extraction as an intermediate goal on the way to log classification. Trunk extraction and classification of logs are thereby automated and achieve high accuracy without manual intervention, which improves the accuracy and efficiency of log classification and abnormal pattern recognition as well as the efficiency of subsequent log processing.

Drawings

Fig. 1 is a flowchart of a method for extracting a trunk of a log according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a log data preprocessing method according to the first embodiment of the present invention;

Fig. 3 is a flowchart of a preliminary clustering method according to the first embodiment of the present invention;

Fig. 4 is a flowchart of a cluster optimization method according to the first embodiment of the present invention;

Fig. 5 is a flowchart of a log classification template training method according to a second embodiment of the present invention;

Fig. 6 is a flowchart of log classification according to a third embodiment of the present invention;

Fig. 7 is a flowchart of a method for classifying log data by using a classification template according to the third embodiment of the present invention;

Fig. 8 is a flowchart of a method for generating a log vector according to the third embodiment of the present invention;

Fig. 9 is a flowchart of a method for matching a log vector and a template vector to obtain a category according to the third embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

In order to make the technical problems to be solved, the technical solutions, and the advantageous effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Example one

As shown in fig. 1, a method for extracting a main trunk of a log according to an embodiment of the present invention includes:

S11, preprocessing the log data.

This step compresses the log data and deletes redundant data, which reduces the extra system resources consumed by data redundancy and increases the running speed.

As shown in fig. 2, the step S11 may be implemented in the following manner:

S111, removing parameters of known form from the logs by using regular expressions.

S112, constructing a dictionary by word frequency statistics, and constructing an n-dimensional vector for each log according to the dictionary.

Specifically, an n-dimensional vector may be constructed for each log from the dictionary using a bag-of-words model. Words in short logs can also be assigned greater weight, so that good results are achieved on different types of log data sets.

S113, deleting logs with identical vectors, so that only one log is kept for each distinct vector.

As a result, the number of logs entering trunk extraction is greatly reduced, so that efficiency remains high when large-scale log data is processed and the consumption of system resources is reduced.
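Steps S111 to S113 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the regular expressions in `PARAM_PATTERNS`, the whitespace tokenization, and all function names are assumptions, and the optional extra weighting of words in short logs is omitted.

```python
import re
from collections import Counter

# Hypothetical patterns for "known-form" parameters (assumptions, not from the source).
PARAM_PATTERNS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),  # IPv4 addresses
    re.compile(r"\b0x[0-9a-fA-F]+\b"),           # hexadecimal values
    re.compile(r"\b\d+\b"),                      # plain integers
]

def strip_parameters(line):
    """S111: blank out known-form parameters, then split on whitespace."""
    for pattern in PARAM_PATTERNS:
        line = pattern.sub(" ", line)
    return line.split()

def build_dictionary(token_lists):
    """S112: map each word to a vector index, most frequent words first."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word: i for i, (word, _) in enumerate(counts.most_common())}

def vectorize(tokens, dictionary):
    """S112: n-dimensional bag-of-words count vector for one log."""
    vec = [0] * len(dictionary)
    for word in tokens:
        vec[dictionary[word]] += 1
    return tuple(vec)

def preprocess(lines):
    """S111-S113: keep one representative log per distinct vector."""
    token_lists = [strip_parameters(line) for line in lines]
    dictionary = build_dictionary(token_lists)
    unique = {}
    for tokens in token_lists:
        unique.setdefault(vectorize(tokens, dictionary), tokens)
    return dictionary, unique

logs = [
    "Connection from 10.0.0.1 port 22",
    "Connection from 10.0.0.2 port 8080",
    "Disk usage at 91 percent",
]
dictionary, unique = preprocess(logs)
```

In this toy input the first two lines differ only in their parameters, so after S111 they produce the same vector and only one of them is kept.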

S12, clustering the preprocessed log data according to the preset initial clustering number.

As shown in fig. 3, step S12 can be implemented in the following manner:

S121, randomly assigning each log to one of K classes according to the preset initial clustering number K.

S122, performing a transfer gain calculation for each log, and moving the log into the cluster with the maximum gain value.

Specifically, at the beginning of clustering, each log is randomly assigned to one of K classes, where K is the preset initial clustering number. The transfer gain function is then computed for each log, and the log is moved to the cluster for which the gain is largest. The clustering is considered stable when every log already belongs to the cluster with the maximum gain.
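The loop of S121 and S122 can be sketched as below. The text does not define the transfer gain function, so the `gain` used here (the average share of cluster members that contain each word of the log) is purely a placeholder assumption; the seed, the `max_rounds` cap, and the function names are likewise hypothetical.

```python
import random
from collections import Counter

def gain(words, members, logs):
    """Placeholder gain (an assumption): average share of cluster members
    that contain each word of the log. An empty cluster yields 0."""
    if not members or not words:
        return 0.0
    counts = Counter(w for i in members for w in logs[i])
    return sum(counts[w] / len(members) for w in words) / len(words)

def cluster_logs(logs, k, seed=0, max_rounds=100):
    """logs: list of word sets. Returns a cluster index for every log."""
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in logs]          # S121: random start
    for _ in range(max_rounds):
        moved = False
        for idx, words in enumerate(logs):             # S122: greedy transfer
            gains = []
            for c in range(k):
                members = [i for i, a in enumerate(assign) if a == c and i != idx]
                gains.append(gain(words, members, logs))
            best = max(range(k), key=lambda c: gains[c])
            if gains[best] > gains[assign[idx]]:       # move only on strict gain
                assign[idx] = best
                moved = True
        if not moved:                                  # stable: nobody moved
            break
    return assign

logs = [{"connection", "closed"}, {"connection", "closed"},
        {"disk", "full"}, {"disk", "full"}]
assignment = cluster_logs(logs, k=2)
```

The stopping rule mirrors the stability criterion in the text: the loop ends once every log already sits in the cluster with its maximum gain. The round cap is only a safeguard for the sketch.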

S13, performing adaptive optimization on the current clustering result according to the preset clustering acceptance condition.

Specifically, if the clustering number K meets the requirement, clustering is finished. If the number of clusters is too large or too small, the value of K is adjusted and the logs are re-clustered. Because the appropriate clustering number K differs, sometimes greatly, between log sets, and because the number of successfully identified trunks is to some extent positively correlated with the clustering number, the clustering number can be searched for adaptively by binary search.

As shown in fig. 4, the present step S13 can be implemented in the following manner:

S131, when the current clustering meets the preset clustering acceptance condition, reducing the clustering number by binary search until the smallest clustering number that still meets the condition is found;

S132, when the current clustering does not meet the preset clustering acceptance condition, increasing the clustering number by binary search until the condition is met;

S133, adjusting the current clustering result according to the clustering number.

For example, the preset cluster acceptance condition (i.e. the condition for considering the stem extraction success) in this embodiment is:

(1) the number of parameters in the log is less than X;

(2) the proportion of parameters in the log is less than R1, where a word is judged to be a parameter if it appears in less than a proportion R2 of the log samples in its cluster;

(3) the proportion R of logs whose trunk extraction succeeded, out of the total number of logs, is higher than R3.
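One possible reading of conditions (1) to (3) is sketched below. The default values chosen for X, R1, R2, and R3 are illustrative assumptions, as is the interpretation that conditions (1) and (2) decide whether trunk extraction succeeded for a single log while condition (3) applies to the clustering as a whole.

```python
from collections import Counter

def is_accepted(clusters, x=3, r1=0.5, r2=0.5, r3=0.9):
    """clusters: list of clusters, each a list of token lists.
    Returns True when the clustering meets the acceptance condition."""
    total = ok = 0
    for cluster in clusters:
        # Number of logs in this cluster that contain each word.
        doc_freq = Counter(w for tokens in cluster for w in set(tokens))
        for tokens in cluster:
            total += 1
            if not tokens:
                continue
            # A word is a parameter if it is rare within the cluster (ratio < R2).
            params = [w for w in tokens if doc_freq[w] / len(cluster) < r2]
            # (1) fewer than X parameters, (2) parameter share below R1.
            if len(params) < x and len(params) / len(tokens) < r1:
                ok += 1
    # (3) share R of successfully extracted logs must exceed R3.
    return total > 0 and ok / total > r3

good = [[["disk", "full", "sda"], ["disk", "full", "sdb"], ["disk", "full", "sdc"]]]
bad = [[["disk", "full"], ["user", "login", "ok"], ["link", "down"]]]
accepted_good = is_accepted(good)
accepted_bad = is_accepted(bad)
```

In `good`, each log has a single rare word (the device name), so every log passes. In `bad`, unrelated logs share a cluster, every word counts as a parameter, and the clustering is rejected.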

Once such a condition is set, the K value can be located automatically by binary search, because K and R are correlated to a certain degree.

To obtain a stable K value, the binary search may be performed multiple times and the average of the results taken as the final K value. In this way a suitable clustering number K can be obtained for different types of log sets, so that good results are achieved when different types of log data are analyzed and the method generalizes well.
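The search described in S131 and S132, together with the averaging over several runs, might look like the following sketch. It assumes that acceptance is monotonic in K (once a K is accepted, every larger K is accepted too), which the text only suggests through the stated correlation; `accepts`, `make_accepts`, and `k_max` are hypothetical names standing in for "cluster with K classes and test the acceptance condition".

```python
def find_min_k(accepts, k_init, k_max):
    """Smallest K in [1, k_max] for which accepts(K) is true,
    assuming accepts is monotonic in K."""
    if accepts(k_init):
        lo, hi = 1, k_init                 # S131: shrink towards the minimum
    else:
        if k_init >= k_max or not accepts(k_max):
            raise ValueError("no acceptable K up to k_max")
        lo, hi = k_init + 1, k_max         # S132: grow until accepted
    while lo < hi:                         # invariant: accepts(hi) is true
        mid = (lo + hi) // 2
        if accepts(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

def stable_k(make_accepts, k_init, k_max, runs=5):
    """Repeat the search (e.g. with different random seeds) and average."""
    results = [find_min_k(make_accepts(seed), k_init, k_max) for seed in range(runs)]
    return round(sum(results) / len(results))

# Toy stand-in: pretend run `seed` needs at least 6 + seed % 3 clusters.
k_from_below = find_min_k(lambda k: k >= 7, k_init=4, k_max=64)
k_from_above = find_min_k(lambda k: k >= 7, k_init=20, k_max=64)
k_final = stable_k(lambda seed: (lambda k: k >= 6 + seed % 3), k_init=4, k_max=64)
```

Because clustering starts from a random assignment, different runs can return different K values, which is why the results are averaged.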

S14, extracting a trunk from each cluster.

Specifically, after the clustering is stable, for each log cluster, the number of logs in which each word appears is counted. When the proportion of logs containing the word is greater than a set threshold, the word is kept in the final log trunk; otherwise it is deleted. For example, with a threshold of 5%, words that occur in more than 5% of the logs in the cluster belong to the trunk and are retained, and the remaining words are removed or replaced with <p>.
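A minimal sketch of this thresholding step follows. The function name and the choice to return the distinct trunks of a cluster are assumptions. The example call uses a threshold of 0.5 rather than 5% only because the toy cluster has three logs.

```python
from collections import Counter

def extract_trunks(cluster, threshold=0.05, placeholder="<p>"):
    """cluster: list of token lists. Words found in more than `threshold`
    of the cluster's logs are kept; all other words become the placeholder."""
    doc_freq = Counter(w for tokens in cluster for w in set(tokens))
    n = len(cluster)
    trunks = (
        tuple(w if doc_freq[w] / n > threshold else placeholder for w in tokens)
        for tokens in cluster
    )
    return list(dict.fromkeys(trunks))  # distinct trunks, original order

cluster = [["disk", "sda", "full"], ["disk", "sdb", "full"], ["disk", "sdc", "full"]]
trunks = extract_trunks(cluster, threshold=0.5)
```

Here the device names each occur in one log out of three, fall below the threshold, and are replaced, leaving a single trunk for the cluster.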

The method of this embodiment overcomes the need for manual assistance in the related art, and extracts the trunks of logs automatically and accurately by adaptively searching for the clustering number. Words in short logs can be assigned greater weight, and a suitable clustering number can be obtained for different types of log sets, so that good results are achieved on different types of log data sets and the method generalizes well. In addition, the log data is compressed and redundant data is deleted before trunk extraction, which reduces the amount of data that must be processed in nonlinear time and improves extraction efficiency.

Example two

As shown in fig. 5, a method for extracting a trunk of a log according to an embodiment of the present invention includes:

S51, preprocessing the log data;

S52, clustering the preprocessed log data according to the preset initial clustering number;

S53, performing adaptive optimization on the current clustering result according to the preset clustering acceptance condition;

S54, extracting a trunk from each cluster;

S55, generating a classification template according to the extracted trunks.

Specifically, after trunk extraction is performed on the logs, each extracted trunk is added to the existing classification template as a classification template. The existing classification template can be represented as a set of categories; each category comprises several log templates, and logs whose representation matches one of those log templates belong to the category.

The embodiment can be used for offline log classification template training and can also be applied to online log classification.

Example three

As shown in fig. 6, a method for classifying logs provided by an embodiment of the present invention includes:

S61, after receiving the log data, classifying the log data according to the classification template.

The log data includes but is not limited to log files or records in a database. As shown in fig. 7, the following method can be used for classification:

S611, extracting a template keyword dictionary, and generating a template vector according to the template keyword dictionary and the classification template.

For a given classification template, the template keyword dictionary is fixed. A single log classification template can therefore be turned into an m-dimensional vector, where m is the number of words in the template keyword dictionary, each dimension represents one word in the dictionary, and the initial value of every dimension is False. The words of the template keyword dictionary are then checked one by one, and if a word appears in the single log classification template, the value of its dimension is updated to True, yielding the classification template vector.
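The construction of the m-dimensional Boolean vector can be sketched as follows. How the template keyword dictionary itself is extracted is not detailed in the text, so collecting every distinct non-placeholder word of the templates is an assumption, as are the function names.

```python
def build_keyword_dictionary(templates, placeholder="<p>"):
    """Ordered list of the distinct words used by the templates (assumed rule)."""
    words = (w for template in templates for w in template if w != placeholder)
    return list(dict.fromkeys(words))

def to_bool_vector(words, keyword_dictionary):
    """One dimension per dictionary word: True if the word occurs, else False."""
    present = set(words)
    return tuple(word in present for word in keyword_dictionary)

templates = [["disk", "<p>", "full"], ["link", "<p>", "down"]]
keywords = build_keyword_dictionary(templates)
template_vectors = [to_bool_vector(t, keywords) for t in templates]
```

The same `to_bool_vector` helper would also cover the log vector of steps S6123 and S6124, with the log keyword dictionary in place of the template keyword dictionary.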

S612, extracting a log keyword dictionary, and generating a log vector according to the log keyword dictionary and the classification template.

Referring to fig. 8, the step S612 further includes:

S6121, counting the frequency with which each word in the log data appears across all logs, and adding the word to the preliminary keyword list when its frequency is greater than a preset frequency threshold.

Specifically, after the frequency of each word across all logs is counted, words whose frequency is greater than the threshold are regarded as candidate keywords. The preset frequency threshold can be adjusted for different domains. The result is a set of words considered meaningful for classification, which forms the preliminary keyword list.

S6122, screening the preliminary keyword list according to preset domain criteria, and removing words that do not meet the criteria to obtain the log keyword dictionary.

Specifically, after the previous step, if the keywords are to be extracted more accurately, each word in the preliminary keyword list can be checked against the preset domain criteria so that only meaningful words are kept.

Specifically, assume that the following three categories of words serve as the domain criteria:

(1) Words that conform to basic English semantics, such as: apple, banana

(2) Specialized vocabulary in the computer domain, such as: dns, dhcp

(3) Special expressions that occur frequently in a particular log set, such as: ZTE

If a word does not belong to any of the three categories, it is removed from the keyword list.
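Steps S6121 and S6122 can be sketched as a frequency filter followed by a vocabulary filter. The three word sets below are tiny stand-ins for real resources (an English word list, a computing glossary, and terms specific to the log set); their contents, the lower-casing of words, and the threshold value are assumptions made for illustration.

```python
from collections import Counter

# Stand-in vocabularies for the three domain criteria (assumed contents).
ENGLISH_WORDS = {"connection", "failed", "request", "timeout"}
COMPUTER_TERMS = {"dns", "dhcp"}
LOGSET_TERMS = {"zte"}

def build_log_keyword_dictionary(logs, min_frequency=1):
    """logs: list of token lists. Returns the filtered keyword list."""
    counts = Counter(w.lower() for tokens in logs for w in tokens)
    # S6121: keep words whose frequency exceeds the preset threshold.
    preliminary = [w for w, c in counts.items() if c > min_frequency]
    # S6122: drop words that meet none of the domain criteria.
    allowed = ENGLISH_WORDS | COMPUTER_TERMS | LOGSET_TERMS
    return [w for w in preliminary if w in allowed]

logs = [
    ["dns", "request", "x9f3"],
    ["dns", "timeout", "x9f3"],
    ["dns", "request", "x9f3"],
]
log_keywords = build_log_keyword_dictionary(logs)
```

In this toy input, `x9f3` passes the frequency filter but matches none of the three categories, while `timeout` is a valid word that is too rare, so neither reaches the dictionary.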

S6123, turning a single log into an n-dimensional vector, where n is the number of words in the log keyword dictionary, each dimension represents one word in the log keyword dictionary, and the initial value of every dimension is set to false.

S6124, checking the words of the log keyword dictionary, and if a word appears in the single log, updating the value of its dimension to true to obtain the log vector.

S613, matching the log vector with the template vector to obtain a category.

Referring to fig. 9, the step S613 further includes:

S6131, applying the same hash operation to the template vectors and the log vector to obtain their respective hash tables;

S6132, if the hash value of the log vector is the same as that of a template vector, classifying the log under the label corresponding to that template vector's hash value.

Specifically, whether the obtained vector_msg is the same as the vector of an already classified sample is judged through the hash value: if the hash value of msg is the same as that of an existing label, msg is classified under that label and the classification succeeds; if the matching fails, msg is classified into the [unidentified] class and the trunk extraction algorithm is then applied to it. It should be noted that the hash operation in this embodiment is only an example, and the actual matching method is not limited to hashing.
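The hash-based matching of S6131 and S6132 can be sketched with Python's built-in `hash` on tuples. The choice of hash function is an assumption, as are the function names and the extra equality check that guards against hash collisions; the sketch also presumes that log vectors and template vectors are built over the same keyword dictionary.

```python
UNIDENTIFIED = "[unidentified]"

def build_hash_table(template_vectors):
    """S6131: map hash(vector) -> (vector, label) for every template.
    If two template vectors collide, the later one wins in this sketch."""
    return {hash(vec): (vec, label) for label, vec in template_vectors.items()}

def classify(log_vector, table):
    """S6132: look the log vector up by hash; fall back to [unidentified]."""
    entry = table.get(hash(log_vector))
    if entry is not None and entry[0] == log_vector:  # guard against collisions
        return entry[1]
    return UNIDENTIFIED

template_vectors = {
    "disk_full": (True, True, False),
    "link_down": (False, False, True),
}
table = build_hash_table(template_vectors)
matched = classify((True, True, False), table)
unmatched = classify((True, False, False), table)
```

Logs that come back as `[unidentified]` are the ones handed on to trunk extraction in step S63.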

S62, judging whether the classification is successful, if so, going to step S65, otherwise, executing step S63.

Specifically, when the classification template is used to classify logs, logs that cannot be matched with any classification template are placed in the [unidentified] class (that is, the logs that failed classification). These logs subsequently go through the clustering process, their trunks are extracted, and they are classified according to the newly extracted trunks.

S63, performing trunk extraction on the log data that failed classification, and generating a new classification template.

It should be noted that, the manner of performing the backbone extraction on the log data with failed classification in step S63 may be performed by the method of the first embodiment, and will not be repeated here.

After trunk extraction is performed on the logs that failed classification, the newly extracted trunk is added to the existing classification template as a new classification template. The existing classification template can be represented as a set of categories; each category comprises several log templates, and logs whose representation matches one of those log templates belong to the category. The new trunk is added to a category in the classification template if and only if, for all log templates in that category, the ratio of the length of the longest common subsequence shared with the newly extracted trunk to the length of the original log is greater than a set threshold RT.

S64, classifying the failed log data according to the new classification template.
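The longest-common-subsequence test for adding a new trunk to a category can be sketched as follows. The text does not say precisely which length forms the denominator of the ratio, so dividing by the longer of the two word sequences is an assumption, as are the function names and the example value of RT.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def can_join_category(new_trunk, category_templates, rt=0.7):
    """True if the LCS ratio against every template in the category exceeds RT.
    Denominator (assumed): the longer of the two sequences."""
    if not new_trunk or not category_templates:
        return False
    return all(
        lcs_length(new_trunk, t) / max(len(new_trunk), len(t)) > rt
        for t in category_templates
    )

new_trunk = ["disk", "<p>", "full"]
category = [["disk", "<p>", "is", "full"]]
shared = lcs_length(new_trunk, category[0])
joins_loose = can_join_category(new_trunk, category, rt=0.7)
joins_strict = can_join_category(new_trunk, category, rt=0.8)
```

With a ratio of 3/4, the new trunk joins the category when RT is 0.7 and forms a separate category when RT is 0.8.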

The method of step S64 is the same as the method of step S61, and the method of step S61 is applicable in this step, except that the classification template is different, and will not be repeated here.

S65, ending the flow.

In the embodiment of the invention, when logs are classified with the classification template, trunks are extracted from the logs that fail classification to generate a new classification template, and the new template is used to classify those logs. The whole process requires no manual participation, helps technical personnel quickly and accurately group logs of the same type within a large volume of logs, and facilitates subsequent log processing.
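The overall flow of S61 to S65 can be summarized in the sketch below. Both helpers are deliberately simplified stand-ins: `match` compares a log with templates position by position instead of through keyword vectors and hashing, and `toy_extract_trunks` replaces the clustering-based trunk extraction of the first embodiment with a trivial rule. All names and the generated label format are hypothetical.

```python
PLACEHOLDER = "<p>"

def match(tokens, templates):
    """Return the label of the first matching template, else None.
    A template matches when lengths agree and every non-placeholder word is equal."""
    for label, template in templates.items():
        if len(template) == len(tokens) and all(
            t == PLACEHOLDER or t == w for t, w in zip(template, tokens)
        ):
            return label
    return None

def toy_extract_trunks(failed_logs):
    """Stand-in for trunk extraction: numeric tokens become placeholders."""
    trunks = {}
    for tokens in failed_logs:
        trunk = tuple(PLACEHOLDER if w.isdigit() else w for w in tokens)
        label = "auto_" + "_".join(w for w in trunk if w != PLACEHOLDER)
        trunks.setdefault(label, trunk)
    return trunks

def classify_logs(logs, templates, extract_trunks=toy_extract_trunks):
    labels, failed = {}, []
    for i, tokens in enumerate(logs):                    # S61: classify
        label = match(tokens, templates)
        if label is None:                                # S62: success?
            failed.append(i)
        else:
            labels[i] = label
    if failed:
        new_templates = extract_trunks([logs[i] for i in failed])  # S63
        templates.update(new_templates)
        for i in failed:                                 # S64: reclassify
            labels[i] = match(logs[i], new_templates) or "[unidentified]"
    return labels                                        # S65: end

templates = {"disk_full": ("disk", PLACEHOLDER, "full")}
logs = [["disk", "sda", "full"], ["retry", "3", "failed"], ["retry", "7", "failed"]]
labels = classify_logs(logs, templates)
```

After the call, `templates` also contains the newly generated template, so later logs of the same shape are classified in S61 without another extraction pass.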

In addition, an embodiment of the present invention further provides an apparatus, where the apparatus includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is executed by the processor, the steps of the above-mentioned log trunk extraction method are implemented.

In addition, an embodiment of the present invention further provides an apparatus, where the apparatus includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is executed by the processor, the steps of the log classification method are implemented.

In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for extracting a main trunk of a log and/or the steps of the method for classifying a log are/is implemented.

It should be noted that the above-mentioned device and computer-readable storage medium belong to the same concept as the log file classification method embodiment, and specific implementation processes thereof are detailed in the method embodiment, and technical features in the method embodiment are applicable to both the device and the computer-readable storage medium, which are not described herein again.

The log trunk extraction method, log classification method, device, and storage medium overcome the need for manual assistance in the related art, and automatically extract trunks from and classify logs by adaptively searching for the clustering number. In addition, words in short logs can be assigned greater weight, so that good results are achieved on different types of log data sets. Furthermore, the log data is compressed before trunk extraction, which substantially reduces the amount of data that must be processed in nonlinear time and improves classification efficiency.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.

In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and are not to be construed as limiting the scope of the invention. Any modifications, equivalents and improvements which may occur to those skilled in the art without departing from the scope and spirit of the present invention are intended to be within the scope of the claims.

Claims

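1. A method for extracting a trunk of a log, comprising the following steps:

preprocessing log data;

clustering the preprocessed log data according to a preset initial clustering number;

performing adaptive optimization on the current clustering result according to a preset clustering acceptance condition;

extracting a trunk from each cluster.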

2. The method according to claim 1, wherein the preprocessing the log data specifically includes:

removing parameters in a known form in the log by using a regular expression;

constructing a dictionary by using a word frequency statistical mode, and constructing a multi-dimensional vector for each log according to the dictionary;

and deleting the logs with the same vector, and reserving one log with the same vector.

3. The method according to claim 1, wherein the clustering the preprocessed log texts according to the preset initial clustering number specifically comprises:

randomly classifying each log into any one of K classes according to a preset initial clustering number K;

and performing transfer gain calculation on each log, and dividing the log into the cluster with the maximum gain value.

4. The method according to claim 1, wherein the optimizing the current clustering result according to a preset clustering acceptance condition specifically comprises:

when the current cluster meets a preset cluster acceptance condition, reducing the cluster number by using a binary search method until the cluster number is minimum and meets the preset cluster acceptance condition;

when the current cluster does not meet the preset cluster acceptance condition, increasing the cluster number by using a binary search method until the cluster number meets the preset cluster acceptance condition;

and adjusting the current clustering result according to the clustering number.

5. The method of any one of claims 1 to 4, further comprising, after the extracting a trunk from each cluster: generating a classification template according to the extracted trunks.

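6. A method of classifying logs, the method comprising:

after receiving log data, classifying the log data according to a classification template;

when the classification is unsuccessful, performing trunk extraction on the log data that failed classification to generate a new classification template; and classifying the failed log data according to the new classification template.

7. The method for classifying logs according to claim 6, wherein the trunk extraction on the log data that failed classification is specifically performed by the method for extracting a trunk of a log according to any one of claims 1 to 5.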

8. The method of classifying a log according to claim 6, wherein the classifying the log data using the classification template specifically comprises:

extracting a template keyword dictionary, and generating a template vector according to the template keyword dictionary and the classification template;

extracting a log keyword dictionary, and generating a log vector according to the log keyword dictionary and the classification template;

and matching the log vector with the template vector to obtain a category.

9. The method of classifying a log according to claim 8, wherein the generating a template vector from the template keyword dictionary and the classification template specifically comprises:

making a single log classification template into an m-dimensional vector, wherein m is the number of words in a template keyword dictionary, each dimensional vector represents one word in the template keyword dictionary, and the initial value is set as a false value;

and counting the words of the template keyword dictionary, and if a certain word appears in the single log classification template, updating the value of the corresponding dimension of the word into a true value to obtain a classification template vector.

10. The method for classifying a log according to claim 8, wherein the extracting a log keyword dictionary and generating a log vector according to the log keyword dictionary and the classification template specifically include:

counting the frequency of each word in the log data in the whole log, and adding the word into the preliminary keyword list when the frequency of the word is greater than a preset frequency threshold;

screening the preliminary keyword list according to a preset field standard, and removing words which do not belong to the preset field standard from the preliminary keyword list to obtain a log keyword dictionary;

making a single log into an n-dimensional vector, wherein n is the number of words in a log keyword dictionary, each dimensional vector represents one word in the log keyword dictionary, and the initial value is set as a false value;

and counting the words of the log keyword dictionary, and if a certain word appears in the single log, updating the value of the corresponding dimension of the word into a true value to obtain a log vector.

11. The method for classifying a log according to claim 8, wherein the matching the log vector with the template vector into a category further comprises:

respectively carrying out the same hash operation on the log template vector and the log vector to obtain respective hash tables;

and when the hash value of the log vector is the same as that of a certain template vector, classifying the log under a label corresponding to the hash value of the template vector.

12. An apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the method of stem extraction of a log according to any one of claims 1 to 5 or implementing the steps of the method of classification of a log according to any one of claims 6 to 11.

13. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for stem extraction of a log according to any one of claims 1 to 5, or the steps of the method for classification of a log according to any one of claims 6 to 11.