CN106571960B

CN106571960B - Log collection management system and method

Info

Publication number: CN106571960B
Application number: CN201610953544.9A
Authority: CN
Inventors: 李玉福; 宋春喜; 易有涛
Original assignee: Beijing Nongxin Internet Technology Co ltd
Current assignee: BEIJING NONGXIN INTERNET DATA TECHNOLOGY Co.,Ltd.; BEIJING NONGXIN INTERNET TECHNOLOGY GROUP Co.,Ltd.
Priority date: 2016-11-03
Filing date: 2016-11-03
Publication date: 2020-05-22
Anticipated expiration: 2036-11-03
Also published as: CN106571960A

Abstract

The invention provides a log collection management system and a method, wherein the system comprises an application layer subsystem, a data transmission layer subsystem, a data processing layer subsystem and a data storage layer subsystem, wherein the application layer subsystem comprises more than one service system and is used for collecting logs of a single machine and transmitting the collected logs to the data transmission layer subsystem; the data transmission layer subsystem consists of a distributed publish-subscribe message system and is used for receiving the logs sent by the application layer subsystem; the data processing layer subsystem consists of a distributed computing system, subscribes to logs from the data transmission layer subsystem and analyzes the logs; the data storage layer subsystem stores the original log into the distributed file system, and stores the analysis result of the data processing layer subsystem into the distributed database system and the relational database unit. The log collection management system and method place low requirements on the service systems and offer good universality.

Description

Log collection management system and method

Technical Field

The invention relates to a distributed online service technology, in particular to the technical field of distributed log collection management.

Background

Various services for internet applications are typically implemented using complex, large-scale distributed clusters. These internet applications are built on different sets of software modules, which may be developed by different teams, implemented in different programming languages, and distributed over thousands of servers spanning multiple data centers. Therefore, tools that help understand system behavior and analyze performance problems are needed.

A distributed online service consists of hundreds of distributed services. Each request is routed through multiple business systems, leaving footprints there and accessing various caches or databases, but such scattered data is of limited help for troubleshooting or process optimization. A request passes through many modules in the system, and a single request may be completed through the cooperation of hundreds of machines, so in a typical scenario the performance overhead of each stage of the request cannot be grasped by manual effort alone, and performance bottlenecks in the system cannot be located quickly. For such cross-process and cross-thread scenarios, collecting and analyzing massive logs in aggregate is very important. The goal of distributed tracing is to trace the complete call chain of each request, collect performance data for each service on the chain, compute and compare performance indicators, and in the longer term even feed the results back into service governance.

The requirements such a system should meet include:

1. Low invasiveness: as a non-business component, it should intrude on other business systems as little as possible or not at all, remain transparent to users, and reduce the burden on developers.

2. Flexible application strategy: the scope and granularity of the collected data can be decided, preferably at any time.

3. Timeliness: the path from data collection and generation, through computation and processing, to final presentation should be as fast as possible.

4. Decision support: whether the data can play a role at the decision-support level, particularly from the perspective of integrated development and operations (DevOps).

5. Visualization: the requirement of visual presentation is met.

Existing log collection management systems do not meet these requirements. For example, some require major changes to the original service system, and some have poor timeliness and generate data slowly.

Disclosure of Invention

The invention aims to provide a log collection management system that places low requirements on the service system and has good universality, addressing the problems that prior-art log collection management systems require major changes to the original service system, have poor timeliness, and generate data slowly.

In order to solve the technical problems, the invention adopts the following technical scheme:

a log collection management system comprises an application layer subsystem, a data transmission layer subsystem, a data processing layer subsystem and a data storage layer subsystem, wherein,

the application layer subsystem comprises more than one service system and is used for collecting the logs of the single machine and transmitting the collected logs to the data transmission layer subsystem;

the data transmission layer subsystem consists of a distributed publishing and subscribing message system and is used for receiving the logs sent by the application layer subsystem;

the data processing layer subsystem consists of a distributed computing system, subscribes logs from the data transmission layer subsystem and analyzes the logs;

the data storage layer subsystem stores the original log into the distributed file system, and stores the analysis result of the data processing layer subsystem into the distributed database system and the relational database unit.

The application layer subsystem organizes the collected logs by adopting a data exchange format and unifies the log formats obtained by different service systems.

In addition, the application layer subsystem sets a buried point in the more than one service system, and acquires log contents including a tracking identification number, a remote procedure call identification number, call start time, a call type, a protocol type, a calling party network address and port, service name information, call time consumption, a call result and exception information.

And the data transmission layer subsystem acquires the logs issued by the application layer subsystem in a mode of combining offline storage and real-time storage.

In addition, the application layer subsystem sends all logs to all the child nodes of the data transmission layer subsystem in a balanced manner so as to realize load balance, and when a single child node of the data transmission layer subsystem fails, other child nodes supplement the function of publishing and subscribing messages.

Particularly, when a service system is abnormal, the application layer subsystem collects an alarm log, wherein the alarm log comprises an abnormal service system identification number, an alarm identification, a notification mode field, a notification object and alarm notification content.

The data processing layer subsystem comprises an alarm monitoring unit, when the alarm log is collected by the application layer subsystem, if the alarm monitoring unit of the data processing layer subsystem monitors that the alarm log is sent repeatedly, the alarm log is sent according to a preset interval, and after the alarm log is sent successfully, the alarm log is stored and is not sent again until the alarm log exceeds the preset effective time.

A log collection management method, the method comprising the steps of:

A. collecting logs of a single machine and issuing the collected logs;

B. receiving a log by using a distributed publish-subscribe message system;

C. subscribing logs by using a distributed computing system and analyzing;

D. and storing the original log in a distributed file system, and storing the analysis result of the data processing layer subsystem in a distributed database system and a relational database unit.

Specifically, the step of collecting the logs of a single machine in step A comprises: setting a buried point in a service system, and acquiring log contents including a tracking identification number, a remote procedure call identification number, call start time, a call type, a protocol type, a calling party network address and port, service name information, call time consumption, a call result and exception information.

The steps of publishing the collected log and receiving the log by using the distributed publish-subscribe message system include: the application layer subsystem sends all logs to all the child nodes of the data transmission layer subsystem in a balanced manner so as to realize load balance, and when a single child node of the data transmission layer subsystem fails, other child nodes take over its publish-subscribe message function.

The log collection management system and the log collection management method adopt a data exchange format to organize the collected logs and unify the log formats obtained by different service systems, so the log collection management system and the log collection management method have low requirement on the service systems and better universality, and the log files among the different service systems can be compatible.

The log collection management system and the log collection management method adopt an anti-avalanche mode for the alarm log, thereby improving the stability of the system and preventing the crash of a service system.

Drawings

Fig. 1 is a schematic structural diagram of a log collection management system according to an embodiment of the present invention.

Fig. 2 is a flowchart illustrating a log collection management method according to an embodiment of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings.

Detailed exemplary embodiments are disclosed below. However, specific structural and functional details disclosed herein are merely for purposes of describing example embodiments.

It should be understood, however, that the intention is not to limit the invention to the particular exemplary embodiments disclosed, but to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure. Like reference numerals refer to like elements throughout the description of the figures.

The structures, proportions, sizes, and the like shown in the drawings are intended only to accompany the content disclosed in this specification so that those skilled in the art can read and understand it; they are not intended to limit the conditions under which the invention can be implemented and therefore have no substantive technical significance. Any structural modification, change of proportion, or adjustment of size that does not affect the efficacy and achievable purpose of the invention still falls within the scope of the disclosure. Likewise, the positional terms used in this specification are for clarity of description only and are not intended to limit the scope of the invention; changes or adjustments of the relative relationships they describe, without substantive change to the technical content, are also regarded as within the scope of the invention.

It will also be understood that the term "and/or" as used herein includes any and all combinations of one or more of the associated listed items. It will be further understood that when an element or unit is referred to as being "connected" or "coupled" to another element or unit, it can be directly connected or coupled to the other element or unit or intervening elements or units may also be present. Moreover, other words used to describe the relationship between components or elements should be understood in the same manner (e.g., "between" versus "directly between," "adjacent" versus "directly adjacent," etc.).

Before describing the embodiments of the present invention, some related art will be briefly described.

A buried point (an instrumentation or tracking point, also rendered as "embedded point") records context information of the system at the current node. Buried points can be divided into client-side buried points, server-side buried points, and bidirectional client-and-server buried points.

Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity stream data of a consumer-scale website. Such activity (web browsing, searching, and other user actions) is a key ingredient of many social features on the modern web. Because of throughput requirements, this data is usually handled through log processing and log aggregation. For systems such as Hadoop that handle log data and offline analysis but also require real-time processing, Kafka is a viable solution. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time consumption across a cluster of machines.

Hadoop is a distributed system infrastructure developed by the Apache Foundation. Hadoop is made up of many elements. At the bottom is a Hadoop Distributed File System (HDFS) that stores files on all storage nodes in a Hadoop cluster.

Storm is a distributed, fault-tolerant, real-time computing system. Storm differs from other big data solutions in the way it processes data. Hadoop is essentially a batch system: data is loaded into the Hadoop file system (HDFS) and distributed to the nodes for processing, and when processing is complete, the result data is returned to HDFS for use by the originator. Storm supports building topologies that transform unbounded data streams; unlike Hadoop jobs, these transformations never stop and keep processing data as it arrives.

A Storm cluster is composed of a master node and a plurality of worker nodes. The master node runs a daemon named "Nimbus" for code distribution, task assignment, and fault detection. Each worker node runs a daemon named "Supervisor" that listens for assigned work and starts and stops worker processes. Both Nimbus and Supervisor are fail-fast and stateless, which makes them very robust; coordination between the two is done through Apache ZooKeeper.

HBase is a distributed, column-oriented open source database whose technology derives from the Google paper "Bigtable: A Distributed Storage System for Structured Data." Just as Bigtable takes advantage of the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of the Apache Hadoop project. Unlike a general relational database, HBase is suited to storing unstructured data. Another difference is that HBase uses a column-based rather than a row-based model.

Log4j is an open source project of Apache. With Log4j, we can direct log output to destinations such as the console, files, GUI components, socket servers, NT event loggers, and UNIX Syslog daemons; we can also control the output format of each log entry; and by defining a level for each log message, we can control log generation at a finer granularity. Most conveniently, all of this can be configured flexibly through a configuration file, without modifying the application code.

Log4j has three main components: Logger, Appender, and Layout. Logger is called by client code through methods such as debug(Object msg), info(Object msg), warn(Object msg), and error(Object msg). Appender is responsible for outputting the log; Log4j already implements Appenders for various targets, so a log can be written to a file, to the console, to a socket, and so on. Layout is responsible for formatting log information.

JSON (JavaScript Object Notation) is a lightweight data exchange format. It is based on a subset of ECMAScript. JSON uses a text format that is completely language independent, but follows conventions familiar from the C family of languages (including C, C++, C#, Java, JavaScript, Perl, Python, etc.). These properties make JSON an ideal data exchange language.

As shown in fig. 1, the embodiment of the present invention provides a log collection management system and method, where the system includes an application layer subsystem, a data transmission layer subsystem, a data processing layer subsystem, and a data storage layer subsystem, where the application layer subsystem includes more than one service system, and is used to collect logs of a single machine and transmit the collected logs to the data transmission layer subsystem; the data transmission layer subsystem consists of a distributed publishing and subscribing message system and is used for receiving the logs sent by the application layer subsystem; the data processing layer subsystem consists of a distributed computing system, subscribes logs from the data transmission layer subsystem and analyzes the logs; the data storage layer subsystem stores the original log into the distributed file system, and stores the analysis result of the data processing layer subsystem into the distributed database system and the relational database unit.

In particular, each business system of the application layer subsystem configures an Appender with Log4j, which is responsible for log collection on a single machine.
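As a rough analogy only, the per-machine collection role of an Appender can be sketched with Python's standard `logging` module, where a `Handler` plays a similar part. This is not Log4j code; the class `PublishHandler`, the helper `build_logger`, and the service name `order-service` are hypothetical names chosen for illustration.

```python
import logging


class PublishHandler(logging.Handler):
    """Hypothetical handler that forwards each formatted log line to a callable.

    It plays the role the text assigns to a Log4j Appender: the business code
    only calls the logger, and the handler decides where the log goes.
    """

    def __init__(self, publish):
        super().__init__()
        self._publish = publish  # callable that accepts one formatted line

    def emit(self, record):
        try:
            self._publish(self.format(record))
        except Exception:
            self.handleError(record)


def build_logger(name, publish):
    """Create a logger whose only output target is the publish callable."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    for existing in list(logger.handlers):
        logger.removeHandler(existing)
    handler = PublishHandler(publish)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger


sent = []  # stands in for the transport to the data transmission layer
order_logger = build_logger("order-service", sent.append)
order_logger.info("order created")
order_logger.debug("below the INFO threshold, so not forwarded")
```

The business code stays unchanged apart from its logging configuration, which is the low-invasiveness property the description emphasizes.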

Specifically, the application layer subsystem sets a buried point in the more than one service system, and acquires log contents including a tracking identification number TraceId, a remote procedure call identification number RPCId, call start time, a call type, a protocol type, a calling party network address and port, service name information, call time consumption, a call result, and exception information. In addition, the log can reserve an extensible field to prepare for the next step of extension.
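A minimal sketch of such a buried point, assuming it is implemented as a Python decorator around a service call. The decorator `traced`, the key names (`traceId`, `rpcId`, `elapsedMs`, and so on), and the list `collected` are illustrative assumptions; the patent names the fields but does not prescribe identifiers or an implementation.

```python
import time
import uuid
from functools import wraps

collected = []  # hypothetical sink standing in for the log collector


def traced(service_name, call_type="RPC", protocol="HTTP"):
    """Wrap a call and record the fields listed in the description."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, trace_id=None, rpc_id="0", caller="unknown", **kwargs):
            entry = {
                "traceId": trace_id or uuid.uuid4().hex,
                "rpcId": rpc_id,
                "startTime": time.time(),
                "callType": call_type,
                "protocol": protocol,
                "caller": caller,            # calling party address and port
                "serviceName": service_name,
                "result": "OK",
                "exception": None,
                "ext": {},                   # reserved extensible field
            }
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                entry["result"] = "FAIL"
                entry["exception"] = repr(exc)
                raise
            finally:
                # Record elapsed time whether the call succeeded or failed.
                entry["elapsedMs"] = (time.perf_counter() - started) * 1000.0
                collected.append(entry)

        return wrapper

    return decorator


@traced("inventory-service")
def reserve(item, qty):
    if qty <= 0:
        raise ValueError("qty must be positive")
    return f"{item}x{qty}"


reserve("apple", 2, trace_id="t-1", rpc_id="0.1", caller="10.0.0.5:8080")
try:
    reserve("apple", 0, trace_id="t-1", rpc_id="0.2", caller="10.0.0.5:8080")
except ValueError:
    pass
```

Both calls share the same trace identifier, so a later analysis step can reassemble them into one call chain.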

To process logs uniformly, a unified, standardized log format is very important. The PatternLayout commonly used in the prior art is intended mainly for human reading and is very inconvenient for automated analysis and processing, especially for splitting a log line into fields, as follows:

It can be seen that this way of organizing logs is conducive neither to unified management nor to automatic analysis by a computer. Therefore, the application layer subsystem organizes the collected logs by adopting a data exchange format and unifies the log formats obtained by different service systems. For example, in one embodiment, the JSON format is used to uniformly organize the required log format.
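To illustrate why a data exchange format simplifies field splitting, the following sketch serializes one log entry as a single JSON line and parses it back. The key names in `FIELDS`, the functions `to_json_line` and `parse_json_line`, and the sample values are assumptions made for illustration; the patent lists the fields but gives no concrete schema.

```python
import json

# Hypothetical key names for the log contents listed in the description.
FIELDS = (
    "traceId", "rpcId", "startTime", "callType", "protocol",
    "caller", "serviceName", "elapsedMs", "result", "exception",
)


def to_json_line(entry, **extra):
    """Serialize one log entry as a single JSON line with a fixed set of keys."""
    missing = [name for name in FIELDS if name not in entry]
    if missing:
        raise KeyError(f"missing log fields: {missing}")
    record = {name: entry[name] for name in FIELDS}
    record["ext"] = extra  # reserved extensible field
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def parse_json_line(line):
    """Recover the fields without any pattern- or delimiter-based splitting."""
    return json.loads(line)


sample = {
    "traceId": "t-1", "rpcId": "0.1", "startTime": 1478131200.0,
    "callType": "RPC", "protocol": "HTTP", "caller": "10.0.0.5:8080",
    "serviceName": "inventory-service", "elapsedMs": 12.5,
    "result": "OK", "exception": None,
}
line = to_json_line(sample, region="bj")
parsed = parse_json_line(line)
```

Because every service emits the same keys, the processing layer can read logs from different business systems with one parser.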

Therefore, the log collection management system and the log collection management method adopt the data exchange format to organize the collected logs and unify the log formats obtained by different service systems, so the log collection management system and the log collection management method have low requirement on the service systems and better universality, and the log files among the different service systems can be compatible.

In addition, the data transmission layer subsystem acquires the logs issued by the application layer subsystem by combining offline storage and real-time storage. For example, in one embodiment, logs are pushed directly to the distributed publish-subscribe message system using a Log4j-based MQ component.

For the data transmission layer subsystem, the distributed publish-subscribe message system may be Kafka. Kafka accepts the logs published by the application layer subsystem and delivers them to the data processing layer subsystem that subscribes to them.
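The publish-subscribe behavior can be illustrated with a small in-memory stand-in. This is not the Kafka client API; the class `InMemoryTopic` and its methods are hypothetical. It only shows the idea that messages are retained in order and each subscriber reads at its own position, which is what lets a real-time consumer and an offline consumer share the same log stream.

```python
from collections import defaultdict


class InMemoryTopic:
    """Hypothetical append-only topic with one read offset per subscriber."""

    def __init__(self):
        self._messages = []
        self._offsets = defaultdict(int)  # subscriber name -> next offset

    def publish(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, subscriber, max_messages=10):
        """Return the next batch for this subscriber and advance its offset."""
        start = self._offsets[subscriber]
        batch = self._messages[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch


topic = InMemoryTopic()
for text in ("log-1", "log-2", "log-3"):
    topic.publish(text)

realtime_first = topic.poll("realtime-analysis", max_messages=2)
offline_all = topic.poll("offline-archive")
realtime_rest = topic.poll("realtime-analysis")
```

Here the real-time subscriber reads in small batches while the offline subscriber drains everything at once, and neither affects the other's position.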

The distributed computing system of the data processing layer subsystem may be a Storm cluster, which is composed of a master node and a plurality of worker nodes. The master node runs a daemon named "Nimbus" for code distribution, task assignment, and fault detection.
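The patent does not specify what analysis the processing layer performs. As one plausible example, the sketch below groups log entries by trace identifier and derives a call chain, the longest call, and a failure count per request. It is plain Python rather than a Storm topology, and the function `summarize_traces`, the key names, and the sample data are assumptions.

```python
from collections import defaultdict


def summarize_traces(entries):
    """Group log entries by trace and compute simple per-request statistics."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["traceId"]].append(entry)

    summary = {}
    for trace_id, calls in grouped.items():
        ordered = sorted(calls, key=lambda c: c["startTime"])
        longest = max(calls, key=lambda c: c["elapsedMs"])
        summary[trace_id] = {
            "callChain": [c["serviceName"] for c in ordered],
            "longestCall": longest["serviceName"],
            "failedCalls": sum(1 for c in calls if c["result"] != "OK"),
        }
    return summary


sample_entries = [
    {"traceId": "t-1", "serviceName": "gateway", "startTime": 1.00,
     "elapsedMs": 120.0, "result": "OK"},
    {"traceId": "t-1", "serviceName": "inventory", "startTime": 1.02,
     "elapsedMs": 60.0, "result": "FAIL"},
    {"traceId": "t-1", "serviceName": "order", "startTime": 1.01,
     "elapsedMs": 80.0, "result": "OK"},
    {"traceId": "t-2", "serviceName": "gateway", "startTime": 2.00,
     "elapsedMs": 15.0, "result": "OK"},
]
trace_summary = summarize_traces(sample_entries)
```

In a streaming system the same grouping would be done incrementally as entries arrive; the batch form here is only meant to show the computation.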

In a specific embodiment, the application layer subsystem sends all logs to all child nodes of the data transmission layer subsystem in a balanced manner to achieve load balancing, and when a single child node of the data transmission layer subsystem fails, other child nodes take over its publish-subscribe message function.
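One way to picture balanced sending with failover is a round-robin sender that skips a failed child node. The patent does not state which balancing strategy is used, so round-robin is an assumption here, and `FakeNode` and `BalancedSender` are hypothetical names.

```python
class FakeNode:
    """Hypothetical child node of the data transmission layer."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.received = []

    def send(self, message):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.received.append(message)


class BalancedSender:
    """Round-robin sender that falls through to the next node on failure."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._next = 0

    def send(self, message):
        total = len(self._nodes)
        for attempt in range(total):
            index = (self._next + attempt) % total
            node = self._nodes[index]
            try:
                node.send(message)
            except ConnectionError:
                continue  # this node failed; try the next one
            self._next = (index + 1) % total
            return node.name
        raise RuntimeError("no child node available")


nodes = [FakeNode("a"), FakeNode("b"), FakeNode("c")]
sender = BalancedSender(nodes)
balanced_targets = [sender.send(f"log-{i}") for i in range(6)]

nodes[1].alive = False  # simulate the failure of a single child node
failover_targets = [sender.send(f"log-{i}") for i in range(6, 9)]
```

While all nodes are healthy each receives an equal share; after node `b` fails, its share is absorbed by the remaining nodes and no message is lost.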

In addition, when a service system is abnormal, the application layer subsystem collects an alarm log, which comprises the identification number of the abnormal service system, an alarm indication, a notification mode field, a notification object, and alarm notification content.

For example, the fields of the alarm log are:

sysid: system identification number

alarm indication: whether an alarm is raised; default false

msgtype [sms | email]: notification mode field

sysidto: the objects to be notified

content: alarm notification content

If some systems keep sending alarm logs continuously under certain conditions and the messaging side does not reject messages in a planned manner, the message system may crash, or an avalanche may even spread across a series of other business systems. To overcome this problem, in a specific embodiment of the present invention, the data processing layer subsystem includes an alarm monitoring unit. After the application layer subsystem collects an alarm log, if the alarm monitoring unit of the data processing layer subsystem detects that the alarm log is being sent repeatedly, the alarm log is sent at a predetermined interval; after the alarm log is sent successfully, it is stored, and the same alarm log is not sent again until the stored alarm log exceeds a preset valid time.

Expressed by way of pseudo-code as:

Therefore, the log collection management system and the log collection management method adopt an anti-avalanche mode for the alarm log, thereby improving the stability of the system and preventing the crash of a service system.
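The anti-avalanche behavior can be sketched as follows. This is an illustrative reading of the description, not the pseudo-code of the patent: the class `AlarmThrottle`, the choice of deduplication key, and the 300-second valid time are assumptions. A successfully sent alarm is remembered, and identical alarms are suppressed until the stored record is older than the valid time.

```python
import time


class AlarmThrottle:
    """Suppress repeated alarms until the stored copy exceeds its valid time."""

    def __init__(self, notify, valid_seconds=300.0, clock=time.monotonic):
        self._notify = notify              # callable returning True on success
        self._valid_seconds = valid_seconds
        self._clock = clock
        self._sent_at = {}                 # alarm key -> time of last success

    def handle(self, alarm):
        """Return True if the alarm was sent, False if suppressed or failed."""
        # Assumed key: two alarms with the same values count as the same alarm.
        key = (alarm["sysid"], alarm["msgtype"], alarm["sysidto"], alarm["content"])
        now = self._clock()
        last = self._sent_at.get(key)
        if last is not None and now - last < self._valid_seconds:
            return False                   # still within the valid time
        if self._notify(alarm):
            self._sent_at[key] = now       # store only after a successful send
            return True
        return False


delivered = []
fake_now = [0.0]  # controllable clock so the example is deterministic


def record_delivery(alarm):
    delivered.append(alarm)
    return True


throttle = AlarmThrottle(record_delivery, valid_seconds=300.0,
                         clock=lambda: fake_now[0])
alarm = {"sysid": "pay-01", "msgtype": "sms",
         "sysidto": "oncall", "content": "db timeout"}

first = throttle.handle(alarm)    # sent
repeat = throttle.handle(alarm)   # suppressed: same alarm within valid time
fake_now[0] = 301.0
later = throttle.handle(alarm)    # sent again after the valid time has passed
```

Storing the record only after a successful send means a failed notification is retried on the next occurrence rather than silently dropped.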

Corresponding to the log collection management system of the present invention, the specific embodiment of the present invention further includes a log collection management method, as shown in fig. 2, the method includes the steps of:

A. collecting logs of a single machine and issuing the collected logs;

B. receiving a log by using a distributed publish-subscribe message system;

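C. subscribing logs by using a distributed computing system and analyzing;

D. storing the original log in a distributed file system, and storing the analysis result of the data processing layer subsystem in a distributed database system and a relational database unit.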

In particular, the step of collecting the logs of a single machine in step A comprises: setting a buried point in a service system, and acquiring log contents including a tracking identification number, a remote procedure call identification number, call start time, a call type, a protocol type, a calling party network address and port, service name information, call time consumption, a call result and exception information.

In addition, the steps of publishing the collected log and receiving the log using the distributed publish-subscribe messaging system include: the application layer subsystem sends all logs to all the child nodes of the data transmission layer subsystem in a balanced manner so as to realize load balance, and when a single child node of the data transmission layer subsystem fails, other child nodes take over its publish-subscribe message function.

It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the scope of the present invention, and any minor changes and modifications to the present invention are within the scope of the present invention without departing from the spirit of the present invention.

Claims

1. A log collection management system comprises an application layer subsystem, a data transmission layer subsystem, a data processing layer subsystem and a data storage layer subsystem, wherein,

the data transmission layer subsystem consists of a distributed publishing and subscribing message system and is used for receiving the log transmitted by the application layer subsystem;

the data processing layer subsystem consists of a distributed computing system and is used for subscribing logs from the data transmission layer subsystem and analyzing the logs;

the data storage layer subsystem stores original logs into a distributed file system, and stores analysis results of the data processing layer subsystem into a distributed database system and a relational database unit, wherein when a service system is abnormal, the application layer subsystem collects alarm logs;

2. The log collection management system of claim 1, wherein the application layer subsystem organizes the collected logs in a data exchange format to unify the log formats obtained by different business systems.

3. The log collection management system of claim 1, wherein the application layer subsystem sets buried points in the more than one service system and obtains log content including a tracking identification number, a remote procedure call identification number, a call start time, a call type, a protocol type, a caller network address and port, service name information, call time consumption, a call result, and exception information.

4. The log collection management system of claim 1, wherein the data transmission layer subsystem is configured to obtain the log transmitted by the application layer subsystem in an offline and real-time storage manner.

5. The log collection management system of claim 1, wherein the application layer subsystem is configured to transmit all logs to all child nodes of the data transmission layer subsystem in a balanced manner to achieve load balancing, and when a single child node of the data transmission layer subsystem fails, other child nodes supplement a publish-subscribe message function.

6. The log collection management system as claimed in claim 1, wherein the application layer subsystem collects an alarm log when a service system abnormality occurs, the alarm log including an abnormal service system identification number, an alarm indication, a notification mode field, a notification object, and an alarm notification content.

7. A log collection management method, the method comprising the steps of:

A. collecting logs of a single machine and issuing the collected logs;

B. receiving a log by using a distributed publish-subscribe message system;

C. subscribing logs by using a distributed computing system and analyzing;

D. storing the original log into a distributed file system, and storing the analysis result of the data processing layer subsystem into a distributed database system and a relational database unit;

when the service system is abnormal, the application layer subsystem collects an alarm log;

after the application layer subsystem collects the alarm logs, if the distributed computing system monitors that the alarm logs are sent repeatedly, the alarm logs are sent according to a preset interval, and after the alarm logs are sent successfully, the alarm logs are stored and the alarm logs are not sent until the alarm logs exceed the preset effective time.

8. The log collection management method of claim 7, wherein the step of collecting the logs of a single machine in step A comprises: setting a buried point in a service system, and acquiring log contents including a tracking identification number, a remote procedure call identification number, call start time, a call type, a protocol type, a calling party network address and port, service name information, call time consumption, a call result and exception information.

9. The log collection management method of claim 7, wherein the steps of publishing the collected log and receiving the log using a distributed publish-subscribe messaging system comprise: the application layer subsystem sends all logs to all the child nodes of the data transmission layer subsystem in a balanced manner so as to realize load balance, and when a single child node of the data transmission layer subsystem fails, other child nodes supplement the function of issuing the subscription message.