CN116541728A - Fault diagnosis method and device based on density clustering - Google Patents

Info

Publication number
CN116541728A
Authority
CN
China
Prior art keywords
log
log data
fault
preset
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310494693.3A
Other languages
Chinese (zh)
Inventor
杨虎
孙雅伦
张芳
耿志成
郭锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Jinan data Technology Co ltd
Original Assignee
Inspur Jinan data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Jinan data Technology Co ltd filed Critical Inspur Jinan data Technology Co ltd
Priority to CN202310494693.3A priority Critical patent/CN116541728A/en
Publication of CN116541728A publication Critical patent/CN116541728A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault diagnosis method and device based on density clustering, relating to the field of fault diagnosis. When a server fails, first log data are obtained from the log files of each device in the server for a period before and after the current moment; second log data are obtained from the log files for a period before and after each moment at which a fault previously occurred; and finally the current working parameters of each device in the server are obtained. All three are sent to a clustering model to determine the cause of the server fault and a strategy for resolving it, and the fault cause and the fault resolution strategy are sent to a user terminal. Because several kinds of log data are acquired, multiple log files can be linked together and the fault cause can be determined more accurately; the clustering model is simple in structure and easy to apply, which favors wide adoption; in addition, the fault resolution strategy is given directly by the clustering model, so manual diagnosis is not needed and the technical threshold for staff is lowered.

Description

Fault diagnosis method and device based on density clustering
Technical Field
The invention relates to the field of fault diagnosis, in particular to a fault diagnosis method and device based on density clustering.
Background
The devices in a server continuously generate logs while they work. As the number of servers grows, so does the number of logs generated in a given time, which makes daily operation and maintenance of the servers harder for staff. When a server fails, a large number of logs exist, and many of them are useless and act as interference, so staff cannot locate the fault and determine the corresponding solution in time, which further increases the operation and maintenance difficulty. To improve operation and maintenance efficiency, a neural network model is usually added; after training, the model can screen out, from all logs generated by the server, the logs that indicate faults for staff to consult.
Disclosure of Invention
The invention aims to provide a fault diagnosis method and device based on density clustering, which can link a plurality of log files, can determine the cause of a fault more accurately, has lower application difficulty, is favorable for wide use, does not need to be diagnosed manually, and reduces the technical threshold of staff.
In order to solve the technical problems, the invention provides a fault diagnosis method based on density clustering, which comprises the following steps:
judging whether the server fails at the current moment according to log files generated by all devices in the server;
if yes, acquiring first log data of all the log files in a first preset duration range centering on the current moment;
acquiring second log data in a second preset duration range centering on each fault historical moment in the log file;
acquiring current working parameters of all the devices in the server;
transmitting all the first log data, the second log data and all the current working parameters to a preset clustering model so as to determine the fault type and the fault resolution strategy at the current moment;
and sending the fault type and the fault resolution strategy to a user terminal.
Preferably, determining whether the server fails at the current moment according to the log file generated by each device in the server includes:
acquiring log files generated by all the devices in the server;
judging whether a log row containing a preset keyword is generated in any one log file at the current moment;
if yes, judging that the server fails at the current moment.
Preferably, the obtaining second log data in the second preset duration range centered on each fault-occurring historical time in the log file includes:
and for any log file generating a preset keyword at the current moment, acquiring all second log data in the log file in a second preset duration range centering on the moment of generating the preset keyword.
Preferably, in the preset cluster model, determining the fault type and the fault resolution strategy at the current moment includes:
for any one of the log files generating a preset keyword at the current moment, determining a first distance between the first log data in the log file and the first log data of all other log files;
for any one of the log files generating a preset keyword at the current time, determining a second distance between the first log data in the log file and the second log data in all other log files generating preset keywords at the current time;
clustering the first log data and all log data with the distance close to the corresponding preset distance to obtain a first cluster corresponding to the first log data of the log file generating preset keywords at the current moment;
and determining the fault type and the fault solving strategy at the current moment according to each first cluster and the current working parameter.
Preferably, before obtaining the first log data of all the log files within the first preset duration range centering on the current time, the method further includes:
for any one of the log files, deleting the log lines without the timestamp in the log file.
Preferably, before all the first log data, the second log data and all the current working parameters are sent to a preset cluster model, the method further includes:
dividing words of all log data respectively to obtain a plurality of words;
judging whether the similarity between the log data is greater than a preset similarity or not according to the words;
if the similarity is greater than the preset similarity, changing the neighborhood parameter corresponding to one of the log data, and keeping the neighborhood parameter corresponding to the other log data as a default parameter;
transmitting all the first log data, the second log data and all the current working parameters to a preset clustering model, wherein the method comprises the following steps of:
and sending all the first log data, the second log data, the neighborhood parameters of the two log data and all the current working parameters to a preset clustering model.
Preferably, determining whether the similarity between the log data is greater than a preset similarity according to the word includes:
in any one of the log data, determining, for each word, the proportion of the total number of words in the log data that the word accounts for;
for any two of the log data, the following steps are performed:
determining a ratio difference between ratios of respective identical words between two of said log data;
taking all words with the proportion difference value smaller than a preset difference value as similar words between the two log data;
judging whether the sum of the proportions of all the similar words is larger than a preset proportion or not;
if yes, judging that the similarity of the two log data is larger than the preset similarity.
Preferably, in the preset cluster model, determining the fault type and the fault resolution strategy of the current fault of the server includes:
clustering all the first log data and the second log data, multiplying the clustering result by a preset multiple and multiplying the neighborhood parameters corresponding to the two log data to obtain a plurality of second clustering clusters;
and determining the fault type and the fault solving strategy at the current moment according to each second cluster and the working parameters.
Preferably, the pre-training of the preset clustering model includes:
acquiring N log files;
dividing N log files into M log packets, wherein N and M are positive integers and M is not more than N;
correspondingly adding the current working parameters obtained when the preset keywords are detected each time into the log package where the log files for generating the preset keywords are located;
acquiring a fault type and a corresponding fault resolution strategy sent by a user, and adding the fault type and the corresponding fault resolution strategy into the appointed log packet;
and for any one of the log files, training the log file by taking M log packets as M training sets of the log file respectively.
The application also provides a fault diagnosis device based on density clustering, which comprises:
a memory for storing a computer program;
and a processor for implementing the steps of the density clustering-based fault diagnosis method as described above when executing the computer program.
The application provides a fault diagnosis method and device based on density clustering, relating to the field of fault diagnosis. When a server fails, first log data are obtained from the log files of each device in the server for a period before and after the current moment; second log data are obtained from the log files for a period before and after each moment at which a fault previously occurred; and finally the current working parameters of each device in the server are obtained. All three are sent to a clustering model to determine the cause of the server fault and a strategy for resolving it, and the fault cause and the fault resolution strategy are sent to a user terminal. Because several kinds of log data are acquired, multiple log files can be linked together and the fault cause can be determined more accurately; the clustering model is simple in structure and easy to apply, which favors wide adoption; in addition, the fault resolution strategy is given directly by the clustering model, so manual diagnosis is not needed and the technical threshold for staff is lowered.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the prior art and the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a fault diagnosis method based on density clustering provided by the application;
FIG. 2 is a schematic structural diagram of a fault diagnosis model provided in the present application;
FIG. 3 is a block diagram of a fault diagnosis device based on density clustering.
Detailed Description
The core of the invention is to provide a fault diagnosis method and device based on density clustering that can link multiple log files, determine the cause of a fault more accurately, are easy to apply and therefore suited to wide use, require no manual diagnosis, and lower the technical threshold for staff.
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
With the advent of the big data era, storage demands on services keep growing and so does the number of servers, which makes daily operation and maintenance of a server cluster increasingly difficult. Server operation and maintenance has to pay attention to the logs in each server; as each server generates more and more logs and the number of servers increases, the log volume of the whole cluster becomes huge, and it becomes increasingly difficult to carry out this work manually.
To improve the efficiency and speed of operation and maintenance for a single server, the prior art usually adds a neural network model in place of manual work. The model's strong processing and analysis capability is used to pick out, in real time, the logs that indicate faults from a large number of logs and to separate fault logs from normal logs, so that staff only need to attend to the fault logs and face far fewer logs. This improves efficiency, but it only performs classification: it does not give the cause of the fault, and it does not link multiple log files to diagnose the fault jointly. A server problem arises through a complex process, and whether a problem exists cannot be diagnosed from the logs in a single file; a combined decision over multiple log files and multiple kinds of log data is needed. The prior art therefore either cannot determine faults accurately or needs a large number of training samples to train the model; the former has low accuracy, and the latter is difficult to apply widely.
To solve the above technical problems, refer to FIG. 1, which is a flow chart of the fault diagnosis method based on density clustering provided in the present application. The method includes:
S1: judging whether the server fails at the current moment according to log files generated by all devices in the server;
To resolve faults in the server promptly, the log files of each device in the server must first be acquired, either continuously or periodically. Each device in the server has corresponding log files; while working, a device writes its behaviors, the results of executed tasks, its current state and similar data into them. Log files generally come in two forms: in one, the device writes continuously into a single log file; in the other, the device writes into several log files classified by behavior, task and so on. In either case, the content of a log file generally covers multiple actions at different moments. Therefore, judging whether the server fails at the current moment mainly means determining whether a newly generated log line in an existing log file contains content indicating a fault, and whether a log file newly generated at the current moment contains such content.
S2: if yes, acquiring first log data of all log files in a first preset duration range centering on the current moment;
S3: acquiring second log data in a second preset duration range centering on each fault-occurring historical moment in the log file;
To determine accurately the fault occurring in a server, the judgment cannot be made one-sidedly from a single log file; for example, the cause of the fault cannot be determined solely from the log file whose newly generated line contains content indicating the fault. Therefore, once it is determined that the server has failed, two kinds of log data need to be acquired. The first log data is all log data of every log file in the server within a period before and after the current moment; for example, the data within 3 minutes before and after the current moment may be taken, which usually yields about 100 log lines per log file. The second log data concerns earlier faults: because a log file holds the logs of a period of time, it usually also contains several earlier fault logs, and the second log data is all log data in the log file within a period before and after the moment of each of those fault logs. In short, the first and second log data together amount to the logs of each log file in a period before and after the moment of every fault of the server. Further, to reduce the amount of data processed when acquiring the second log data, the second log data may be acquired only for log files that have fault content at the current moment, that is, only for log files whose most recently generated log line contains content indicating a fault.
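The window extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the timestamp layout `TS_FORMAT`, the helper names `parse_ts`, `window` and `first_and_second_log_data`, and the `is_fault` callback are all assumptions introduced here; only the 3-minute half-width comes from the text.

```python
from datetime import datetime, timedelta

# Assumption: every line starts with a 19-character timestamp in this layout.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_ts(line):
    """Parse the timestamp assumed to occupy the first 19 characters."""
    return datetime.strptime(line[:19], TS_FORMAT)


def window(lines, center, half_width):
    """Return the lines whose timestamp lies within half_width of center."""
    return [ln for ln in lines if abs(parse_ts(ln) - center) <= half_width]


def first_and_second_log_data(lines, now, is_fault,
                              half_width=timedelta(minutes=3)):
    """Collect the window around `now` (first log data) and one window per
    earlier fault line in the same file (second log data)."""
    first = window(lines, now, half_width)
    fault_times = [parse_ts(ln) for ln in lines
                   if is_fault(ln) and parse_ts(ln) < now]
    second = [window(lines, t, half_width) for t in fault_times]
    return first, second


lines = [
    "2023-05-04 09:00:00 error disk timeout",
    "2023-05-04 09:02:00 info disk recovered",
    "2023-05-04 09:30:00 info heartbeat",
    "2023-05-04 10:00:00 error disk timeout",
]
now = datetime(2023, 5, 4, 10, 0, 0)
first, second = first_and_second_log_data(lines, now,
                                          lambda ln: " error " in ln)
```

Here `first` holds only the 10:00 line, while `second` holds one window containing the 09:00 fault and the 09:02 recovery line that followed it.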
S4: acquiring current working parameters of all devices in a server;
To determine the fault of the server still more accurately, the current working parameters of each device in the server also need to be obtained. These parameters include the model of the device, its operating temperature, frequency and voltage, and parameters tied to its actual operation (such as the rotation speed of a fan or the utilization of a processor). It will be appreciated that when a server fails, some devices on the server must be behaving abnormally, and the working parameters of a device are the manifestation of its "symptoms". The working parameters may be acquired through the various sensors provided in the server or through the BMC (Baseboard Management Controller) in the server; the present application does not limit this.
S5: all the first log data, the second log data and all the current working parameters are sent to a preset clustering model so as to determine the fault type and the fault resolution strategy at the current moment;
S6: sending the fault type and the fault resolution strategy to the user terminal.
The clustering model is mainly a density clustering model (DBSCAN), which groups similar log data into one cluster. When the model is trained in advance, staff diagnose server faults from the log files and the collected working parameters, and the log files, the working parameters, the diagnosed fault types and the fault resolution strategies are uploaded to the model together through the system. When the fault type is actually determined, the log data acquired at the current moment are first clustered into some cluster. Then, mainly for the cluster containing the log file that has fault content at the current moment, it is judged whether the working parameters associated with that cluster are close to the working parameters acquired at the current moment. If they are, the fault type and fault resolution strategy diagnosed in advance are sent to the staff together. If they are not, the fault type and fault resolution strategy diagnosed in advance may still be sent to the staff, and the model is then optimized according to the staff's feedback: if the feedback indicates that the fault type and the fault resolution strategy are correct, the current working parameters are associated with the cluster; otherwise a new cluster needs to be generated.
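For readers unfamiliar with DBSCAN, the following is a minimal pure-Python sketch of the generic algorithm applied to log text. It is not the patent's model: the Jaccard distance over whitespace-separated tokens, the function names, and the values of `eps` and `min_pts` are assumptions chosen for illustration only.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over whitespace tokens (an assumed metric)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def dbscan(items, eps, min_pts, dist=jaccard_distance):
    """Return one label per item: 0, 1, ... for clusters, -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(items)

    def neighbours(i):
        # The eps-neighbourhood of item i, including i itself.
        return [j for j in range(len(items))
                if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise            # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:     # j is a core point: expand through it
                queue.extend(nbrs)
    return labels


logs = [
    "fan speed low error",
    "fan speed low warning",
    "fan speed low error retry",
    "disk read timeout",
]
labels = dbscan(logs, eps=0.45, min_pts=2)
```

With these assumed settings the three fan messages fall into one cluster and the disk message is labelled as noise. In the scheme described above, each cluster would additionally carry the working parameters, fault type and resolution strategy uploaded by staff during pre-training.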
It should be noted that the acquired first and second log data necessarily contain normal logs. If the same fault occurs several times and, in each occurrence, the log data of some device's log file contains only normal content but is very similar or even identical across the occurrences, this indicates that the log data is definitely associated with the fault; normal log data therefore also needs to be taken into account. Furthermore, a keyword that frequently appears in such normal log data can be used as a new fault identifier, which enriches the means of fault detection.
In summary, when the server fails, first log data are obtained from the log files of each device in the server for a period before and after the current moment; second log data are obtained from the log files for a period before and after each moment at which a fault previously occurred; and finally the current working parameters of each device in the server are obtained. All three are sent to a clustering model to determine the cause of the server fault and a strategy for resolving it, and the fault cause and the fault resolution strategy are sent to the user terminal. Because several kinds of log data are acquired, multiple log files can be linked together and the fault cause can be determined more accurately; the clustering model is simple in structure and easy to apply, which favors wide adoption; in addition, the fault resolution strategy is given directly by the clustering model, so manual diagnosis is not needed and both the technical threshold and the workload of staff are reduced.
Based on the above embodiments:
as a preferred embodiment, determining whether the server has failed at the current time according to the log files generated by the respective devices in the server includes:
acquiring log files generated by all devices in a server;
judging whether a log line containing a preset keyword is generated in any log file at the current moment;
if yes, judging that the server fails at the current moment.
To determine simply whether a server has failed, the present application takes into account that each device marks the type of a log entry when generating it, for example info, error, fail, warning or debug. Of these, the marks error, fail and warning all indicate that the log reports some problem. Therefore, to judge whether the server has failed, it is only necessary to check whether a newly generated log line in an existing log file carries one of the problem marks error, fail or warning, and whether a log file newly generated at the current moment carries one of these marks. In this way, detecting the keywords gives a simple way to determine whether the server has failed.
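The keyword check can be sketched in a few lines. This is an illustrative sketch only: the whole-word, case-insensitive matching rule and the names `FAULT_KEYWORDS`, `has_fault_keyword` and `files_with_fault` are assumptions introduced here, as is the idea of passing the newly generated lines as a dictionary keyed by file name.

```python
import re

# The three problem marks named in the text; matching them as whole words,
# case-insensitively, is an assumption of this sketch.
FAULT_KEYWORDS = ("error", "fail", "warning")
_PATTERN = re.compile(r"\b(" + "|".join(FAULT_KEYWORDS) + r")\b", re.IGNORECASE)


def has_fault_keyword(log_line):
    """Return True if the line carries one of the preset fault marks."""
    return _PATTERN.search(log_line) is not None


def files_with_fault(new_lines_by_file):
    """Return the names of log files whose newly generated lines contain
    a fault mark; a non-empty result means the server is judged to have failed."""
    return [name for name, lines in new_lines_by_file.items()
            if any(has_fault_keyword(line) for line in lines)]


new_lines = {
    "fan.log": ["2023-05-04 10:00:00 info fan speed ok"],
    "cpu.log": ["2023-05-04 10:00:01 ERROR core 3 overheated"],
}
faulty = files_with_fault(new_lines)
```

Returning the file names rather than a bare boolean is a design choice of the sketch: the later steps need to know which log files produced the keyword in order to restrict the second log data to those files.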
As a preferred embodiment, obtaining second log data in a second preset duration range centered on each of the historical time points of occurrence of the fault in the log file includes:
and for any log file generating the preset keywords at the current moment, acquiring all second log data in a second preset duration range taking the moment of generating the preset keywords as the center in the log file.
To reduce the amount of data processed, in the present application second log data with little relevance to the fault need not be acquired. Specifically, the purpose of acquiring the first log data is to determine whether the various log files are related to one another at the current moment, so the first log data of every log file needs to be acquired. The purpose of acquiring the second log data is mainly to determine whether earlier faults are related to the current fault. For a log file without fault content at the current moment, the randomness of log data makes it difficult to associate its historical fault log data with the log data at the current moment, so the second log data may be acquired only for log files that have fault content at the current moment. This effectively reduces the amount of data processed.
As a preferred embodiment, in the preset cluster model, determining the fault type and the fault resolution strategy at the current moment includes:
for any one log file generating a preset keyword at the current moment, determining a first distance between first log data in the log file and first log data of all other log files;
for any one log file generating the preset keyword at the current moment, determining a second distance between the first log data in the log file and the second log data in all other log files generating the preset keyword at the current moment;
clustering the first log data and all log data with the distance close to the corresponding preset distance to obtain a first cluster corresponding to the first log data of each log file generating the preset keyword at the current moment;
and determining the fault type and the fault solving strategy at the current moment according to each first cluster and the current working parameters.
To cluster the log data, the present application clusters mainly according to the distance between two pieces of log data. Here, a log file that has fault content at the current moment is called a fault log file, and a log file that has no fault content at the current moment is called a normal log file. Specifically, a distance is calculated between the first log data of a fault log file and the first log data of every other log file, fault or normal; this is the first distance. A distance is also calculated between the first log data of the fault log file and the second log data of every other fault log file; this is the second distance. The first distance amounts to measuring the correlation between different log files at the current moment, and the second distance amounts to determining whether the effect on these devices is similar or identical each time a fault occurs. Further, for each fault log file, the distance between its first log data at the time of the fault and its own second log data may be calculated as a third distance, which amounts to checking whether the fault has occurred before.
In actual clustering, the first preset distance may be set smaller, for example to 0.4, while the second preset distance needs to be set larger, for example to 0.6. The reason given is that whether the historical fault log data of different files is related to the current fault has to be analyzed carefully, whereas the log data at the current moment is necessarily related to the current fault, so the judgment accuracy required for the first log data is lower than that required for the second log data.
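The two-threshold grouping can be sketched as below. Several points are assumptions of this sketch rather than statements of the patent: the distance is a Jaccard distance over word sets, "close to the preset distance" is read as "not greater than the preset distance", timestamps are taken to have been stripped from the lines, and the names `tokens`, `distance` and `first_cluster` are hypothetical. Only the example thresholds 0.4 and 0.6 come from the text.

```python
def tokens(lines):
    """Word set of a block of log lines (timestamps assumed already stripped)."""
    return {word for line in lines for word in line.split()}


def distance(a, b):
    """Jaccard distance between two word sets (an assumed metric)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def first_cluster(fault_file, first_data, second_data, d1=0.4, d2=0.6):
    """Group, around the first log data of `fault_file`, every other file's
    first log data within d1 and every other fault file's second log data
    within d2. `first_data` maps every file to its lines; `second_data`
    maps fault files only."""
    ref = tokens(first_data[fault_file])
    members = [("first", fault_file)]
    for name, lines in first_data.items():
        if name != fault_file and distance(ref, tokens(lines)) <= d1:
            members.append(("first", name))
    for name, lines in second_data.items():
        if name != fault_file and distance(ref, tokens(lines)) <= d2:
            members.append(("second", name))
    return members


first_data = {
    "cpu.log": ["error core overheated", "fan speed raised"],
    "fan.log": ["fan speed raised", "error core overheated throttle"],
    "net.log": ["link up", "dhcp lease renewed"],
}
second_data = {
    "cpu.log": ["error core overheated"],
    "fan.log": ["error fan stalled", "core overheated"],
}
cluster = first_cluster("cpu.log", first_data, second_data)
```

In this toy data the historical block of `fan.log` lies at distance 3/7 (about 0.43) from the reference, so it is admitted by the second threshold of 0.6 although it would have been rejected by the first threshold of 0.4.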
As a preferred embodiment, before acquiring the first log data of all log files within the first preset duration range centered on the current time, the method further includes:
for any one log file, deleting in full each log line in the log file that has no timestamp.
To obtain valid log data, the present application takes into account that servers generate many kinds of logs, some of which carry no timestamp (for example, logs that merely report the execution state of a task). Such logs cannot be used for server operation and maintenance, and the time at which they were generated cannot be inferred, so they have no reference value for fault detection. Therefore, every log line without a timestamp needs to be deleted from the log file before the log data is acquired, and when the first log data is acquired, the first preset duration range is enlarged according to the number of deleted lines so that the number of acquired log lines does not become too small. In this way, deleting the log lines without timestamps removes the part of the log file that has no reference value, and valid log data can be obtained.
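A minimal sketch of this preprocessing step follows. The timestamp pattern, the helper names `drop_untimestamped` and `widened`, and in particular the proportional widening rule are assumptions made for illustration; the text only says that the range is enlarged according to the number of deleted lines and does not give a formula.

```python
import re
from datetime import timedelta

# Assumption: a timestamped line starts with "YYYY-MM-DD HH:MM:SS"
# (a "T" separator is also accepted).
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def drop_untimestamped(lines):
    """Delete whole lines that lack a leading timestamp.
    Returns the kept lines and the number of deleted lines."""
    kept = [ln for ln in lines if TIMESTAMP.match(ln)]
    return kept, len(lines) - len(kept)


def widened(base, removed, total):
    """Hypothetical rule: enlarge the window in proportion to the share
    of lines that were deleted."""
    if total == 0:
        return base
    return base * (1 + removed / total)


raw = [
    "2023-05-04 10:00:00 info task started",
    "task 17 running",
    "2023-05-04 10:00:05 error task 17 aborted",
    "2023-05-04 10:00:06 info cleanup done",
]
kept, removed = drop_untimestamped(raw)
half_width = widened(timedelta(minutes=3), removed, len(raw))
```

With one of four lines deleted, the assumed rule stretches the 3-minute half-width to 3 minutes 45 seconds.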
As a preferred embodiment, before all the first log data, the second log data and all the current operating parameters are sent to the preset cluster model, the method further comprises:
segmenting each of the log data into words to obtain a plurality of words;
judging, according to the words, whether the similarity between the log data is greater than a preset similarity;
if the similarity is greater than the preset similarity, changing the neighborhood parameter corresponding to one of the log data, and keeping the neighborhood parameter corresponding to the other log data as the default parameter;
sending all the first log data, the second log data and all the current working parameters to the preset clustering model comprises:
sending all the first log data, the second log data, the neighborhood parameters of the two log data and all the current working parameters to the preset clustering model.
To ensure the diversity of the clusters, the present application takes into account that different log files may be very similar or even identical. Identical log files could yield several identical clusters, leaving useless clusters among those generated. The neighborhood parameter of one of the log files therefore needs to be adjusted so that the clustering algorithm produces different clusters when it clusters similar or identical log files. Specifically, the similarity of two log files is judged in two dimensions. In the first, the log files are segmented into words to obtain a word set for each file; if the word sets of the two files overlap heavily, the files are similar; for example, two files whose similarity exceeds 50% are considered very similar. In the second, the files are compared line by line; if a line of data is identical in the two files, the files are relatively similar. For two similar log files, the neighborhood parameter of one file is kept at its default value and the neighborhood parameter of the other is changed so that the two files can be told apart. In this way, the diversity of the clusters is ensured.
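The two similarity dimensions and the resulting parameter adjustment can be sketched as follows. Several details are assumptions: the word-set overlap is measured as intersection over union, the two dimensions are combined with a logical OR, the default neighborhood parameter is 1.0, and the adjusted value 0.8 is borrowed from the example value the description gives for a changed neighborhood parameter. All function names are hypothetical.

```python
from typing import List, Tuple

def word_set_overlap(a_lines: List[str], b_lines: List[str]) -> float:
    """Share of distinct words common to both files (intersection over union)."""
    wa = {w for line in a_lines for w in line.split()}
    wb = {w for line in b_lines for w in line.split()}
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def share_identical_line(a_lines: List[str], b_lines: List[str]) -> bool:
    """True if at least one non-empty line appears verbatim in both files."""
    a = {line.strip() for line in a_lines if line.strip()}
    b = {line.strip() for line in b_lines if line.strip()}
    return bool(a & b)

def neighborhood_params_for_pair(
    a_lines: List[str],
    b_lines: List[str],
    default: float = 1.0,    # assumed default neighborhood parameter
    adjusted: float = 0.8,   # example value for the changed parameter
    threshold: float = 0.5,  # "exceeds 50%" in the description
) -> Tuple[float, float]:
    similar = (
        word_set_overlap(a_lines, b_lines) > threshold
        or share_identical_line(a_lines, b_lines)
    )
    # One file keeps the default; the other is changed only when the pair is similar.
    return (default, adjusted) if similar else (default, default)
```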
As a preferred embodiment, determining whether the similarity between the log data is greater than a preset similarity according to the word includes:
for any one of the log data, determining, for each word in the log data, the proportion of the total number of words in the log data that the word accounts for;
for any two log data, the following steps are performed:
determining a ratio difference between ratios of respective identical words between the two log data;
taking all words with the proportion difference value smaller than the preset difference value as similar words between the two log data;
judging whether the sum of the proportions of all similar words is larger than a preset proportion;
if yes, judging that the similarity of the two log data is larger than the preset similarity.
To determine the similarity between two log files accurately, the present application first calculates, for each word, its proportion in the log file in which it occurs, that is, the number of occurrences of the word divided by the total number of words in that log file. Next, the proportions of the same word in different log files are compared; if the proportions of a word in two log files are almost the same, the word is a similar word of the two log files, and all similar words between the two log files are found in this way. Then the sum of the proportions of all similar words is calculated, that is, the share of the total number of words in the log file that the similar words account for; because the proportions of a similar word in the two log files are almost the same, the sum only needs to be calculated in one of the two files. If the sum of the proportions of all similar words is greater than the preset proportion, the two log files are similar. For example, if the similar words together account for more than 50% of the whole log file, at least half of the content of the two log files is similar. In this way, the similarity between two log files can be determined accurately.
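The proportion-based comparison described above can be sketched in a few lines. The whitespace tokenization and the preset difference of 0.05 are assumptions for illustration; the 50% preset proportion follows the example in the text, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict

def word_proportions(text: str) -> Dict[str, float]:
    """Occurrences of each word divided by the total number of words."""
    words = text.split()
    total = len(words)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(words).items()}

def similar_word_share(a: str, b: str, max_diff: float = 0.05) -> float:
    """Sum (taken in file a) of the proportions of words whose proportions
    in the two files differ by less than max_diff."""
    pa, pb = word_proportions(a), word_proportions(b)
    similar = [w for w in pa.keys() & pb.keys() if abs(pa[w] - pb[w]) < max_diff]
    return sum(pa[w] for w in similar)

def is_similar(a: str, b: str, max_diff: float = 0.05, min_share: float = 0.5) -> bool:
    """True when the similar words account for more than min_share of the file."""
    return similar_word_share(a, b, max_diff) > min_share
```

With `a = "error disk error fan"` and `b = "error disk error net"`, the words `error` and `disk` have equal proportions in both files and together account for 75% of `a`, so the pair is judged similar.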
As a preferred embodiment, in the preset cluster model, determining the fault type and the fault resolution policy of the current fault of the server includes:
clustering all the first log data and the second log data, and multiplying the clustering result by a preset multiple and by the neighborhood parameters corresponding to the two log data, to obtain a plurality of second clusters;
and determining the fault type and the fault solving strategy at the current moment according to each second cluster and the working parameters.
For accurate clustering, the present application applies the DBSCAN density clustering algorithm to each log file, initializes the ε-neighborhood parameters (ε, MinPts), and calculates the tightness between two log data. The basic formula is $\min(X_i, X_j) \times 0.4$, where $X_i$ is the log data in one log file, $X_j$ is the log data in another log file, and 0.4 is the preset multiple. For two log data that are not similar, the neighborhood parameter $P$ is 1, so the final result is $\min(X_i, X_j) \times 0.4$; for two similar log data, the neighborhood parameter $P$ of one of them is not 1 and may be, for example, 0.8, so the final result is $\min(X_i, X_j) \times 0.4 \times 0.8$. The same applies to ε: if two log data are similar, $\varepsilon = \varepsilon \times \eta$, where $\eta$ is the sum of the proportions of all similar words; otherwise ε keeps its default value.
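The arithmetic of the tightness formula and the ε adjustment can be illustrated as follows. The description does not say how a log data item is reduced to the numeric value that enters the `min`, so this sketch simply assumes that `x_i` and `x_j` are already numbers; the function names are hypothetical.

```python
def tightness(x_i: float, x_j: float,
              p_i: float = 1.0, p_j: float = 1.0,
              multiple: float = 0.4) -> float:
    """min(X_i, X_j) * preset multiple * neighborhood parameters of both log data."""
    return min(x_i, x_j) * multiple * p_i * p_j

def effective_eps(eps: float, similar: bool, eta: float) -> float:
    """Scale eps by eta (sum of similar-word proportions) only for similar log data."""
    return eps * eta if similar else eps
```

For two dissimilar log data both parameters are 1, so the result reduces to `min(x_i, x_j) * 0.4`; for a similar pair in which one parameter has been changed to 0.8, the result is `min(x_i, x_j) * 0.4 * 0.8`.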
After each round of clustering, the log data that are close to one another are grouped together and are no longer considered in later rounds. As a simple example, suppose five log data A, B, C, D and E currently need to be clustered. The distances between A and each of B, C, D and E are calculated first; if B is close to A, A and B are clustered together. In the second round, the distances between C and each of D and E are calculated directly, and A and B are not considered. Clusters are formed in this way until every log data has been clustered, and the clusters are then labeled with the fault types and fault resolution strategies uploaded by operation and maintenance personnel. In this way, clustering can be performed accurately.
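The round-by-round procedure in this example can be sketched as below. Note that this follows the simplified description above (one seed per round, no density-reachability expansion and no MinPts check), so it is not a full DBSCAN implementation. The function name, the distance callback and the threshold are assumptions.

```python
from typing import Callable, List, Sequence

def greedy_cluster(
    items: Sequence[str],
    distance: Callable[[str, str], float],
    max_distance: float,
) -> List[List[str]]:
    remaining = list(items)
    clusters: List[List[str]] = []
    while remaining:
        seed = remaining.pop(0)          # first item not yet clustered
        cluster, rest = [seed], []
        for other in remaining:
            if distance(seed, other) <= max_distance:
                cluster.append(other)    # close to the seed: joins this cluster
            else:
                rest.append(other)       # considered again in a later round
        remaining = rest                 # clustered items are never revisited
        clusters.append(cluster)
    return clusters
```

A usage example with made-up one-dimensional positions standing in for log data:

```python
pos = {"A": 0.0, "B": 0.1, "C": 5.0, "D": 5.2, "E": 9.0}
print(greedy_cluster(list(pos), lambda a, b: abs(pos[a] - pos[b]), 0.5))
```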
As a preferred embodiment, the pre-training of the preset cluster model comprises:
acquiring N log files;
dividing N log files into M log packets, wherein N and M are positive integers and M is not more than N;
correspondingly adding the current working parameters obtained when the preset keywords are detected each time into a log packet in which a log file for generating the preset keywords is located;
acquiring a fault type and a corresponding fault resolution strategy sent by a user, and adding the fault type and the corresponding fault resolution strategy into a specified log packet;
for any one log file, training the log file by using M log packets as M training sets of the log file respectively.
To improve the accuracy of the clustering model, the present application may randomly select some log files from all historical log files and package them into a plurality of log packets; the log data contained in each log packet may differ. For example, millions of log files may be randomly selected and packaged, without a fixed rule, into 100,000 log packets. For each of these log files, the log packets serve as its training sets, that is, each log file has 100,000 training sets, with the fault causes and fault resolution strategies serving as the associated information.
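One way to divide N log files into M log packets at random is sketched below. The description says only that the packaging is irregular and that M does not exceed N; the choice to guarantee that every packet is non-empty, the fixed seed, and the function name are assumptions.

```python
import random
from typing import List, Sequence

def make_log_packets(files: Sequence[str], m: int, seed: int = 0) -> List[List[str]]:
    """Randomly split the files into m non-empty packets (requires 1 <= m <= len(files))."""
    if not 1 <= m <= len(files):
        raise ValueError("m must satisfy 1 <= m <= number of files")
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)
    # Seed each packet with one file so that no packet is empty.
    packets = [[f] for f in shuffled[:m]]
    # Distribute the remaining files over the packets without a fixed rule.
    for f in shuffled[m:]:
        rng.choice(packets).append(f)
    return packets
```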
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of the fault diagnosis model provided in the present application. The block diagram on the left shows the training steps. Log packets in the server are input as training sets into the log analysis device, that is, the device in which the clustering model resides. Useless data in the log packets is then cleaned, the timestamps are standardized, and the logs are segmented into sets of words. A corresponding model is then built for each log file and trained with the log packets as its training sets, so as to determine the association rules between the logs and the corresponding fault causes and solutions. The block diagram on the right shows the flow in actual use. Logs are collected from the server; if a fault is found in the logs, the various log data described in the above embodiments are sent to the diagnosis device, which may be the same device as the log analysis device or a different one. Based on the clustering model, the diagnosis device processes and clusters the received log data together with the earlier training results to find the fault cause and resolution strategy corresponding to the fault, and outputs them to staff as the diagnosis result.
As for the specific training steps, take a log file indicating a fault as an example. When, at a certain moment, a new log file indicating a fault is added to a log packet, that log file is clustered with all other log files in the same log packet. The other log packets are then checked for log files added at the same moment as this log file; if any exist, this log file is also clustered with each log file in those other log packets. To reduce the amount of computation, the clustering may of course be restricted to those other log packets to which a log file indicating a fault was added at the same moment. As a simple example, suppose there are four log files A, B, C and D, with A and B in log packet x, and C and D in log packets y and z respectively. If at a certain moment a log file E indicating a fault is added to log packet x and a log file F indicating a fault is added to log packet z, E is first clustered with A and B, and E is then also clustered with D and F.
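The selection of clustering partners in this example (the reduced-computation variant) can be sketched as follows. The data layout, with one dictionary of existing packet contents and one dictionary of fault log files added at the same moment, is an assumption, as is the function name.

```python
from typing import Dict, List

def clustering_partners(
    packets: Dict[str, List[str]],  # packet name -> log files already in the packet
    additions: Dict[str, str],      # packet name -> fault log file added at this moment
) -> Dict[str, List[str]]:
    """For each newly added fault log file, list the log files it is clustered with."""
    partners: Dict[str, List[str]] = {}
    for pkt, new_file in additions.items():
        # First: all other log files in the same packet.
        group = list(packets[pkt])
        # Then: only packets that also received a fault log file at the same moment.
        for other_pkt, other_new in additions.items():
            if other_pkt != pkt:
                group.extend(packets[other_pkt])
                group.append(other_new)
        partners[new_file] = group
    return partners
```

With packets `{"x": ["A", "B"], "y": ["C"], "z": ["D"]}` and additions `{"x": "E", "z": "F"}`, file E is paired with A, B, D and F, while C in packet y is left out because no fault log file was added to y at that moment.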
Referring to fig. 3, fig. 3 is a schematic structural diagram of a fault diagnosis device based on density clustering provided in the present application, including:
a memory 21 for storing a computer program;
a processor 22 for implementing the steps of the density cluster based fault diagnosis method as described above when executing the computer program.
For a detailed description of the fault diagnosis device based on density clustering provided in the present application, refer to the embodiments of the fault diagnosis method based on density clustering above; the details are not repeated here.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for parts that are identical or similar the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A fault diagnosis method based on density clustering, comprising:
judging whether the server fails at the current moment according to log files generated by all devices in the server;
if yes, acquiring first log data of all the log files in a first preset duration range centering on the current moment;
acquiring second log data in a second preset duration range centering on each fault historical moment in the log file;
acquiring current working parameters of all the devices in the server;
transmitting all the first log data, the second log data and all the current working parameters to a preset clustering model so as to determine the fault type and the fault resolution strategy at the current moment;
and sending the fault type and the fault resolution strategy to a user terminal.
2. The density cluster-based fault diagnosis method according to claim 1, wherein determining whether the server has a fault at a current time based on log files generated by respective devices in the server comprises:
acquiring log files generated by all the devices in the server;
judging whether a log row containing a preset keyword is generated in any one log file at the current moment;
if yes, judging that the server fails at the current moment.
3. The density-cluster-based fault diagnosis method according to claim 2, wherein obtaining second log data in a second preset duration range centered on each of the failed historical moments in the log file, comprises:
and for any log file generating a preset keyword at the current moment, acquiring all second log data in the log file in a second preset duration range centering on the moment of generating the preset keyword.
4. The density-cluster-based fault diagnosis method as claimed in claim 3, wherein determining the fault type and fault resolution strategy at the current time in the preset cluster model comprises:
for any one of the log files generating a preset keyword at the current moment, determining a first distance between the first log data in the log file and the first log data of all other log files;
for any one of the log files generating a preset keyword at the current time, determining a second distance between the first log data in the log file and the second log data in all other log files generating preset keywords at the current time;
clustering the first log data and all log data with the distance close to the corresponding preset distance to obtain a first cluster corresponding to the first log data of the log file generating preset keywords at the current moment;
and determining the fault type and the fault solving strategy at the current moment according to each first cluster and the current working parameter.
5. The density-cluster-based fault diagnosis method according to claim 1, further comprising, before acquiring first log data of all the log files within a first preset duration range centered on the current time, the steps of:
for any one of the log files, deleting the log lines without the timestamp in the log file.
6. The density-cluster-based fault diagnosis method according to claim 1, further comprising, before transmitting all of said first log data, said second log data, and all of said current operating parameters into a preset cluster model:
segmenting each of the log data into words to obtain a plurality of words;
judging, according to the words, whether the similarity between the log data is greater than a preset similarity;
if the similarity is greater than the preset similarity, changing the neighborhood parameter corresponding to one of the log data, and keeping the neighborhood parameter corresponding to the other log data as the default parameter;
sending all the first log data, the second log data and all the current working parameters to the preset clustering model comprises:
sending all the first log data, the second log data, the neighborhood parameters of the two log data and all the current working parameters to the preset clustering model.
7. The density-cluster-based fault diagnosis method according to claim 6, wherein determining whether the similarity between the respective log data is greater than a preset similarity according to the word, comprises:
for any one of the log data, determining, for each of the words in the log data, the proportion of the total number of words in the log data that the word accounts for;
for any two of the log data, the following steps are performed:
determining a ratio difference between ratios of respective identical words between two of said log data;
taking all words with the proportion difference value smaller than a preset difference value as similar words between the two log data;
judging whether the sum of the proportions of all the similar words is larger than a preset proportion or not;
if yes, judging that the similarity of the two log data is larger than the preset similarity.
8. The method for diagnosing a failure based on density clustering as claimed in claim 6, wherein determining the failure type and the failure resolution strategy of the present failure of the server in the preset clustering model comprises:
clustering all the first log data and the second log data, and multiplying the clustering result by a preset multiple and by the neighborhood parameters corresponding to the two log data, to obtain a plurality of second clusters;
and determining the fault type and the fault solving strategy at the current moment according to each second cluster and the working parameters.
9. The density-cluster-based fault diagnosis method according to any one of claims 1-8, wherein pre-training the preset cluster model comprises:
acquiring N log files;
dividing N log files into M log packets, wherein N and M are positive integers and M is not more than N;
correspondingly adding the current working parameters obtained when the preset keywords are detected each time into the log package where the log files for generating the preset keywords are located;
acquiring a fault type and a corresponding fault resolution strategy sent by a user, and adding the fault type and the corresponding fault resolution strategy into the appointed log packet;
and for any one of the log files, training the log file by taking M log packets as M training sets of the log file respectively.
10. A density clustering-based fault diagnosis apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the density cluster based fault diagnosis method according to any one of claims 1 to 9 when executing the computer program.
CN202310494693.3A 2023-04-28 2023-04-28 Fault diagnosis method and device based on density clustering Pending CN116541728A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310494693.3A CN116541728A (en) 2023-04-28 2023-04-28 Fault diagnosis method and device based on density clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310494693.3A CN116541728A (en) 2023-04-28 2023-04-28 Fault diagnosis method and device based on density clustering

Publications (1)

Publication Number Publication Date
CN116541728A (en) 2023-08-04

Family

ID=87455488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310494693.3A Pending CN116541728A (en) 2023-04-28 2023-04-28 Fault diagnosis method and device based on density clustering

Country Status (1)

Country Link
CN (1) CN116541728A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118211170A (en) * 2024-05-22 2024-06-18 苏州元脑智能科技有限公司 Server failure diagnosis method, product, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination