CN111241078B - Data analysis system, data analysis method and device - Google Patents


Info

Publication number
CN111241078B
CN111241078B (application CN202010014747.8A)
Authority
CN
China
Prior art keywords
data
log
log data
original
data analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010014747.8A
Other languages
Chinese (zh)
Other versions
CN111241078A (en)
Inventor
刘晶晶 (Liu Jingjing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202010014747.8A priority Critical patent/CN111241078B/en
Publication of CN111241078A publication Critical patent/CN111241078A/en
Application granted granted Critical
Publication of CN111241078B publication Critical patent/CN111241078B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a data analysis system, a data analysis method and a data analysis device. The data analysis system includes: a log collection module, configured to collect original log data reported by at least one target application; a data splitting module, configured to clean and split the original log data according to preset configuration information to obtain a splitting result, where the preset configuration information includes the correspondence between category information of the log data and category information created on a message processor cluster; and a data analysis module, configured to perform serialization processing and logic operations on the splitting result to obtain an operation result and store it. The invention solves the technical problem that real-time business analysis in the related art requires operators to have substantial domain knowledge and is therefore difficult to implement.

Description

Data analysis system, data analysis method and device
Technical Field
The invention relates to the technical field of data analysis and processing, in particular to a data analysis system, a data analysis method and a data analysis device.
Background
With the development of the Hadoop data warehouse ecosystem, offline analysis based mainly on Hive became widespread. Businesses then began to demand real-time analysis, such as real-time statistics of program effects, online learning for recommendation systems, and real-time feature systems. Investigation shows that the flink computation engine can meet the business requirement for real-time analysis; however, flink requires specialized domain knowledge to develop against, and is not friendly enough to algorithm engineers or newcomers.
In view of the above problem that real-time business analysis in the related art requires operators to have substantial domain knowledge and is difficult to implement, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide a data analysis system, a data analysis method and a data analysis device, which at least solve the technical problem that real-time business analysis in the related art requires operators to have substantial domain knowledge and is difficult to implement.
According to an aspect of an embodiment of the present invention, there is provided a data analysis system including: a log collection module, configured to collect original log data reported by at least one target application; a data splitting module, configured to clean and split the original log data according to preset configuration information to obtain a splitting result, where the preset configuration information includes the correspondence between category information of the log data and category information created on the message processor cluster; and a data analysis module, configured to perform serialization processing and logic operations on the splitting result to obtain an operation result and store it.
Optionally, the log collection module is further configured to, after receiving the original log data reported by the at least one target application, perform format conversion on the original log data to obtain original log data in a predetermined format.
Optionally, the log collection module is further configured to determine, by means of buried points, the at least one target application and the category information of the log data it needs to collect, so that the at least one target application is triggered to report the original log data after collecting it.
Optionally, the at least one target application collects the client's log data by at least one of the following: a logging module, the nginx access log.
Optionally, the log collection module includes: the log acquisition sub-module is used for acquiring original log data reported by the at least one target application; and the log monitoring sub-module is connected with the log acquisition sub-module and is used for triggering the log acquisition sub-module to send the original log data to the message processor cluster when the original log data are stored in the log acquisition sub-module.
Optionally, the data splitting module is further configured to configure a log type of original log data to be split through MySQL, so as to create a plurality of log lists corresponding to the log type in the message processor cluster.
Optionally, the data splitting module is further configured to use the flink distributed computing system to obtain, through a database connection pool, the splitting rule updated into the local cache, so that the original log data is split using that rule to obtain the split log data.
Optionally, the data splitting module is further configured to extract a log type field of the split log data, perform dimension information expansion on the split log data to obtain expanded log data, and distribute the expanded log data to a plurality of log lists of the message processor cluster to form a data source of a data warehouse.
Optionally, the data analysis module is configured to analyze the received SQL statement to obtain an analyzed SQL statement, and perform serialization processing and logic operation on the data in the data warehouse based on the service requirement corresponding to the analyzed SQL statement to obtain the operation result.
Optionally, the data analysis module is configured to obtain, after receiving the SQL statement, data in the data warehouse through a table source provided by the flink distributed computing system, and submit the obtained data in the data warehouse to a stream operator of the flink distributed computing system, so as to perform statistical analysis on the received data in the data warehouse by using the stream operator, thereby obtaining the statistical analysis result.
Optionally, the data analysis module is further configured to use a table sink provided by the flink distributed computing system to write the operation result obtained through the stream operator's serialization processing and logic operations into a database for use by the service party.
Optionally, the message processor cluster is a kafka cluster.
According to another aspect of the embodiments of the present invention, there is provided a data analysis method applied to any one of the above data analysis systems, including: collecting original log data reported by at least one target application; cleaning and splitting the original log data according to preset configuration information to obtain a splitting result, where the preset configuration information includes the correspondence between category information of the log data and category information created on the message processor cluster; and performing serialization processing and logic operations on the splitting result to obtain an operation result, and storing the operation result.
According to another aspect of the embodiments of the present invention, there is also provided a data analysis apparatus that performs data analysis using the above method, including: a collecting unit, configured to collect the original log data reported by at least one target application; a processing unit, configured to clean and split the original log data according to preset configuration information to obtain a splitting result, where the preset configuration information includes the correspondence between category information of the log data and category information created on the message processor cluster; and an obtaining unit, configured to perform serialization processing and logic operations on the splitting result to obtain an operation result and store it.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, where the program performs the method of data analysis described above.
According to another aspect of the embodiments of the present invention, there is also provided a processor configured to run a program, where the program, when running, executes the method of data analysis described above.
In the embodiment of the invention, a log collection module is used to collect the original log data reported by at least one target application; a data splitting module is used to clean and split the original log data according to preset configuration information to obtain a splitting result, where the preset configuration information includes the correspondence between category information of the log data and category information created on the message processor cluster; and a data analysis module performs serialization processing and logic operations on the splitting result to obtain and store an operation result. The data analysis system in the embodiment of the invention thus splits the original log data in real time and processes the splitting result through serialization and logic operations to obtain the operation result, achieving the technical effect of reducing the difficulty of business analysis and thereby solving the technical problem that real-time business analysis in the related art requires operators to have substantial domain knowledge and is difficult to implement.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a data analysis system according to an embodiment of the invention;
FIG. 2 is a schematic diagram of log data flow in a data analysis system according to an embodiment of the present invention;
FIG. 3 is a functional diagram of flink SQL according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of data warehouse hierarchy in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of a web UI interface according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a lineage relationship between data according to an embodiment of the invention;
FIG. 7 is a flow chart of a method of data analysis according to an embodiment of the invention; and
Fig. 8 is a schematic diagram of an apparatus for data analysis according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate understanding, some of the terms or expressions that appear in the embodiments of the invention are described in detail below.
Serial peripheral interface (SPI): a high-speed, full-duplex, synchronous communication bus that occupies only four wires on the chip pins, saving space in the PCB layout.
User datagram protocol (UDP): provides a method for an application to send encapsulated IP datagrams without establishing a connection.
Operational data store (ODS): an optional part of the data warehouse architecture; a collection of subject-oriented, integrated, current or near-current, constantly changing detail data.
Access control list (ACL): a packet-filtering-based access control technology that filters data packets on an interface according to configured conditions, allowing each packet to pass or be discarded.
Directed acyclic graph (DAG): a graph in which every edge is directed and no cycles are present.
Hadoop is an infrastructure of a distributed system developed by the Apache foundation.
ETL is a process for extracting, converting, and loading data from a source to a destination, and is commonly used in a data warehouse.
Rsyslog: able to accept input from a variety of sources and output to different destinations; can deliver over one million messages per second to a destination file.
Filebeat: a log data collector for local files that can monitor log directories or specific log files and forward them to Elasticsearch, Logstash, Kafka, etc. for indexing. Its internal modules simplify the collection, parsing and visualization of common log formats with a single command. It has two components, a finder and a collector, which read files and send event data to the specified outputs.
Flink: an open source stream processing framework that executes arbitrary dataflow programs in a data-parallel and pipelined fashion; flink's pipelined runtime system can execute both batch and stream processing programs.
JSON: a lightweight data interchange format that stores and presents data in a text format completely independent of the programming language. A JSON text is a sequence of tokens drawn from six structural characters, strings, numbers and three literal names, and represents a serialized object or array.
Example 1
According to an aspect of an embodiment of the present invention, there is provided a data analysis system, fig. 1 is a schematic diagram of the data analysis system according to an embodiment of the present invention, as shown in fig. 1, the data analysis system includes:
the log collection module 11 is configured to collect raw log data reported by at least one target application.
Optionally, the log collection module collects the original log data reported by the at least one target application at a predetermined interval, where the interval may be as small as 1 minute, 5 minutes, etc., so as to meet users' different service analysis requirements.
The data splitting module 13 is configured to perform cleaning and splitting processing on the original log data according to preset configuration information to obtain a splitting result, where the preset configuration information includes: correspondence between category information of log data and category information created on the message processor cluster.
It should be noted that, in the embodiment of the present invention, the message processor cluster is a kafka-based cluster. Kafka is an open source stream processing platform: a high-throughput, distributed publish-subscribe messaging system that can handle all the consumer action stream data of a website. Such actions are a key factor in many social functions on the modern web, and due to throughput requirements this data is usually handled through log processing and log aggregation.
Because the original log data is collected onto the message processor cluster through a unified channel, the volume of messages on the cluster is huge and puts great pressure on the processing capacity of the service. Moreover, for different services, not all log categories are of interest to the user. Therefore, to reduce the pressure on the message processor cluster, the original log data may optionally be pre-processed, for example cleaned and split according to the category information of the log data.
The original log data is collected by the log collection module and then reported to an original layer of the data analysis system, and the original log data collected by the original layer can be split. For example, ETL splitting may be performed on the original log data.
The data analysis module 15 is configured to perform serialization processing and logic operation on the split result, obtain an operation result, and store the operation result.
Optionally, the split original log data may be serialized to obtain service ranking data; logic operations may also be performed on the split original log data to obtain, for example, the popularity of a live-stream anchor.
FIG. 2 is a schematic diagram of log data flow in a data analysis system according to an embodiment of the present invention. As shown in FIG. 2, the log data is collected into the original layer of the data analysis system; the original log data received by the original layer is then split and distributed to different topics of kafka, forming a real-time data warehouse, namely the ODS data source layer; logic aggregation is then performed through the flink system to obtain an operation result, forming the theme convergence layer; finally, the theme convergence layer can be combined per service to obtain the service application layer, which is stored according to the different service combinations.
As can be seen from the above, in the embodiment of the present invention, the log collection module may be used to collect the original log data reported by at least one target application; the data splitting module may be used to clean and split the original log data according to preset configuration information to obtain a splitting result, where the preset configuration information includes the correspondence between category information of the log data and category information created on the message processor cluster; and the data analysis module then performs serialization processing and logic operations on the splitting result to obtain an operation result and stores it. The purposes of splitting the original log data in real time and processing the splitting result through serialization and logic operations to obtain the operation result are thereby realized.
It is easy to see that, in the embodiment of the present invention, the log collection module first collects the original log data reported by at least one target application; the data splitting module then cleans and splits the original log data according to the preset configuration information to obtain a splitting result; and the data analysis module performs serialization processing and logic operations on the splitting result to obtain an operation result and stores it. The original log data is thus split in real time and the splitting result processed to yield the operation result, achieving the technical effect of reducing the difficulty of business analysis.
The data analysis system provided by the embodiment of the invention thus solves the technical problem that real-time business analysis in the related art requires operators to have substantial domain knowledge and is difficult to implement.
In an optional embodiment, the log collection module is further configured to, after receiving the original log data reported by the at least one target application, perform format conversion on the original log data to obtain original log data in a predetermined format.
Alternatively, the at least one target application may be an APP1 installed on the clients of different users. For example, APP1 is installed on user A's mobile phone; when the user downloads an APP2 through an application on the phone, APP1 is triggered to collect a piece of log data. In addition, when the user opens APP2, log data may also be collected for the user's browsing behavior within APP2.
That is, in the embodiment of the present invention, in order to facilitate structural analysis of the original log data, the json log data may be normalized at the log source with the format: [logtime] [logtype], json; for example, [2013-04-10 11:00:09] [Click], {"urs": 12344343, "server": "1001"}. This format design facilitates real-time extraction of the event time and log identifier by the data analysis system, and the json body facilitates serialization and extension.
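As an illustrative sketch (the function name and regular expression are our own, not from the patent), the normalized [logtime] [logtype], json format described above can be parsed as follows:

```python
import json
import re

# Matches the normalized source format described above: [logtime] [logtype], json
LOG_PATTERN = re.compile(
    r"^\[(?P<logtime>[^\]]+)\]\s*\[(?P<logtype>[^\]]+)\]\s*,\s*(?P<payload>.*)$"
)

def parse_log_line(line: str) -> dict:
    """Split a raw log line into event time, log identifier and JSON body."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        raise ValueError(f"line does not match [logtime] [logtype], json: {line!r}")
    return {
        "logtime": m.group("logtime").strip(),
        "logtype": m.group("logtype").strip(),
        "payload": json.loads(m.group("payload")),
    }

record = parse_log_line('[2013-04-10 11:00:09] [Click], {"urs": 12344343, "server": "1001"}')
print(record["logtype"])         # Click
print(record["payload"]["urs"])  # 12344343
```

The fixed-position fields make extraction of the event time and log identifier cheap: the json body only needs to be deserialized after the two bracketed fields have been read.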
In an alternative embodiment, the log collection module is further configured to determine, in a buried point manner, at least one target application and category information of log data that needs to be collected by the at least one target application, so as to trigger the at least one target application to report the original log data after the original log data is collected by the at least one target application.
The content of the buried points mainly depends on which information is to be obtained from the user, and is generally divided into the user's basic attribute information and behavior information. The basic attribute information mainly includes: city, address, age, gender, latitude and longitude, account type, operator, network, device, etc. Behavior information covers the user's click and browse behavior, for example which button the user clicked at what time, which page was browsed, the browsing duration, and which operations were performed on the browsed page.
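A buried-point event combining the two groups of information might look like the following sketch (every field name and value here is hypothetical, chosen only to mirror the attributes listed above):

```python
# Hypothetical buried-point event: basic user attributes plus one behavior record.
event = {
    "user": {  # basic attribute information
        "city": "Hangzhou",
        "age": 28,
        "gender": "F",
        "account_type": "regular",
        "operator": "CMCC",
        "network": "4G",
        "device": "phone-model-x",
    },
    "behavior": {  # click/browse behavior information
        "action": "click",
        "button": "download",
        "page": "home",
        "browse_duration_s": 12.5,
        "timestamp": "2013-04-10 11:00:09",
    },
}

print(sorted(event))  # ['behavior', 'user']
```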
In an alternative embodiment, at least one target application gathers log data for a client by at least one of: logging, nginx access log.
For example, the at least one target application may collect the client's log data through a logging module or through the nginx access log, and store it locally. Tools such as rsyslog, filebeat or a scribe agent then send the collected log data in real time to a topic of the message processor cluster kafka.
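The forwarding step can be sketched with a stub standing in for the Kafka producer client (the class, function and topic name here are illustrative, not from the patent; a real deployment would use rsyslog/filebeat or a Kafka client library):

```python
class StubProducer:
    """Stands in for a Kafka producer client; records what would be sent."""
    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))

def forward_logs(lines, producer, topic="raw_logs"):
    # Each locally stored log line is shipped, in order, to the raw-log topic.
    for line in lines:
        producer.send(topic, line.encode("utf-8"))

producer = StubProducer()
forward_logs(['[2013-04-10 11:00:09] [Click], {"urs": 1}'], producer)
print(len(producer.sent))  # 1
```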
In an alternative embodiment, the log collection module includes: the log acquisition sub-module is used for acquiring original log data reported by at least one target application; the log monitoring sub-module is connected with the log acquisition sub-module and is used for triggering the log acquisition sub-module to send the original log data to the message processor cluster when the original log data are stored in the log acquisition sub-module.
In an alternative embodiment, the data splitting module is further configured to configure a log type of the original log data to be split through MySQL, so as to create a plurality of log lists corresponding to the log types in the message processor cluster.
The above configuration information is obtained asynchronously from MySQL, stored in memory as a map[logtype] = kafka_topic (the log type of the log data to be split mapped to the name of the kafka topic it is sent to), and updated periodically. After the log type of a piece of log data is extracted, it is processed according to this configuration: if a kafka topic name exists for the corresponding logtype, the data is sent to that kafka topic; if not, it is discarded directly.
It should be noted that, in the embodiment of the present invention, after determining the log type of the original log data, the log data of the same log type is stored in the corresponding position of the same topic, so as to reduce the data processing amount of the downstream task.
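The map[logtype] = kafka_topic routing with discard-on-miss described above can be modeled with a small sketch (the topic names and function are illustrative, not from the patent):

```python
# map[logtype] = kafka_topic, as described: configured in MySQL and
# periodically refreshed into local memory (topic names are hypothetical).
logtype_to_topic = {
    "Click": "ods_click",
    "Browse": "ods_browse",
}

def route(records, mapping):
    """Send each record to the topic configured for its logtype;
    records whose logtype has no configured topic are discarded."""
    routed, discarded = {}, []
    for rec in records:
        topic = mapping.get(rec["logtype"])
        if topic is None:
            discarded.append(rec)
        else:
            routed.setdefault(topic, []).append(rec)
    return routed, discarded

routed, discarded = route(
    [{"logtype": "Click", "urs": 1}, {"logtype": "Debug", "urs": 2}],
    logtype_to_topic,
)
print(sorted(routed))  # ['ods_click']
print(len(discarded))  # 1
```

Because same-logtype records land on the same topic, each downstream task subscribes only to the categories it needs.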
In an optional embodiment, the data splitting module is further configured to obtain a splitting rule updated to the local cache by using the flink distributed computing system through the database connection pool, so as to split the original log data by using the splitting rule, and obtain the split log data.
In an optional embodiment, the data splitting module is further configured to extract a log type field of the split log data, perform dimension information expansion on the split log data, obtain expanded log data, and distribute the expanded log data to a plurality of log lists of the message processor cluster to form a data source of the data warehouse.
The log data is collected uniformly into a kafka topic; all log data is reported through a unified channel. Since all of the APP's buried points report data to this channel, the kafka topic carries a huge volume of messages and puts great pressure on the services' processing capacity, while each service only cares about a few log categories; ETL splitting at the original layer is therefore required. The general approach is: configure, via MySQL, the log types (logtype) of the log data to be split and create the corresponding kafka topics; flink then asynchronously pulls the updated configuration through a database connection pool into its local cache, cleans the original log efficiently in real time, extracts the logtype field, expands the dimension information, and distributes the data to the different kafka topics, forming the data source layer of the real-time data warehouse. Subsequent processing tasks then only need to handle the data the user is interested in, which greatly improves performance. After splitting, ACL management is applied to the topics in combination with the Kafka SASL mechanism, so the data is well isolated and protected.
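The dimension-expansion step mentioned above can be illustrated as a join against a static lookup table (the dimension table, its key and its fields are all hypothetical):

```python
# Hypothetical dimension table keyed by server id; in the system described,
# such lookup data would come from the configuration database.
SERVER_DIM = {"1001": {"region": "east", "cluster": "a"}}

def expand_dimensions(record, dim_table):
    """Join static dimension attributes onto a split log record."""
    enriched = dict(record)
    enriched.update(dim_table.get(record.get("server"), {}))
    return enriched

row = expand_dimensions({"logtype": "Click", "server": "1001"}, SERVER_DIM)
print(row["region"])  # east
```

Records whose key has no dimension entry pass through unchanged, so enrichment never blocks the stream.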
In an alternative embodiment, the data analysis module is configured to parse the received SQL statement to obtain a parsed SQL statement, and perform serialization processing and logic operation on the data in the data warehouse based on the service requirement corresponding to the parsed SQL statement to obtain an operation result.
The data analysis module is used for acquiring data in the data warehouse through a table source provided by the flink distributed computing system after receiving the SQL sentence, and submitting the acquired data in the data warehouse to a stream operator of the flink distributed computing system so as to perform statistical analysis on the received data in the data warehouse by using the stream operator to obtain a statistical analysis result.
For example, after the log data has been split, flink SQL may be used as the compute engine for real-time analysis. Specifically, after an SQL statement is submitted to the data analysis system, it is parsed by the SQL parser into a logical plan; the logical plan is then optimized into a JobGraph and submitted through a restful interface to the Dispatcher for execution; containers running the JobManager and TaskManager are requested from Yarn, and the job then runs in parallel on the slots.
The processing logic of the JobGraph may be abstracted into three parts: TableSource, StreamOperator, and TableSink. The TableSource is used to obtain the source log data, for example by reading data from Kafka or a MySQL table; the data is then serialized through a custom schema, converted into row data of a data table, and delivered to the stream operator. The StreamOperator mainly performs logical computations such as summation, averaging, and sorting. The TableSink mainly writes the computation result of the stream operator to a database such as Redis or MySQL for use by the service party.
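The three-part abstraction above can be sketched as a simple pipeline. This is an illustrative stand-in only: `table_source`, `stream_operator`, and `table_sink` are hypothetical functions that mimic the roles of the Flink TableSource, StreamOperator, and TableSink interfaces, with an in-memory dict standing in for Redis/MySQL.

```python
def table_source(records):
    """Deserialize source messages into row tuples for the operator."""
    for rec in records:
        yield (rec["server"], rec["value"])

def stream_operator(rows):
    """Logical computation: sum values per key (e.g., per server)."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

def table_sink(result, store):
    """Materialize the operator's result into a store (standing in for Redis/MySQL)."""
    store.update(result)

store = {}
table_sink(stream_operator(table_source(
    [{"server": 1001, "value": 2}, {"server": 1001, "value": 3}])), store)
```

The same source-operator-sink shape is what the JobGraph executes in parallel on the Yarn slots.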
In an alternative embodiment, the data analysis module is further configured to store, through a table sink provided by the Flink distributed computing system, the operation result obtained by the stream operator's serialization processing and logic operation into a database for use by a service party.
Optionally, the Flink system itself provides only limited TableSource, StreamOperator, and TableSink capabilities; through rich extensions, a service system adapted to one's own needs is built, mainly as follows: 1) Log serialization format: message parsing of the internal format is implemented. For example, [2013-04-10 11:00:09] [Click] {"urs":12344343, "server":"1001"} is mapped to a real-time table whose schema contains logtime, logtype, urs, and server, realizing the duality of stream data and tables and thereby making it convenient for SQL to operate on the fields. 2) TableSource and TableSink connectors: operations on Redis and MySQL database connection pools are implemented; results are read by a cursor and sent to the StreamOperator for processing, or received from the StreamOperator and written into a database, so that statistical data such as a micro-service business ranking list can conveniently be obtained in real time.
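The custom serialization format in item 1) can be sketched as follows. The regex and the function name are illustrative assumptions, not the system's actual parser; the sketch only shows how a line in the internal format maps onto a row of the (logtime, logtype, urs, server) schema.

```python
import json
import re

# Pattern for the internal format: [logtime] [logtype] {json body}
LOG_PATTERN = re.compile(r'^\[(?P<logtime>[^\]]+)\]\s*\[(?P<logtype>[^\]]+)\]\s*(?P<body>\{.*\})$')

def parse_log_line(line):
    """Map one internal-format log message onto the real-time table schema."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None                      # malformed line: filtered out upstream
    body = json.loads(match.group("body"))
    return {"logtime": match.group("logtime"),
            "logtype": match.group("logtype"),
            "urs": body.get("urs"),
            "server": body.get("server")}

row = parse_log_line('[2013-04-10 11:00:09] [Click] {"urs":12344343, "server":"1001"}')
```

Once each message is a row, SQL field operations over the stream become straightforward, which is the stream/table duality the passage refers to.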
In addition, in the embodiment of the invention, users can implement their own logic functions through UDFs, extending the capability of the StreamOperator, for example with an anchor popularity algorithm; this is suitable for computing complex business logic.
Fig. 3 is a functional diagram of Flink SQL according to an embodiment of the invention. The processing logic of the JobGraph may be abstracted into three parts: TableSource, StreamOperator, and TableSink. During job operation, metrics are sent to Prometheus at buried points defined in the TableSource and TableSink, and are displayed using Grafana. The custom extensions described above are compiled into Jar packages; clients then take effect by loading the corresponding Jar packages into the JVM through a ClassLoader, and at runtime the data analysis system can find the custom format, connector, and UDF Jar packages through the Java SPI extension mechanism. Therefore, there is no need to be concerned with the underlying details of Kafka, Flink, Redis and the like; SQL can be used safely, which greatly lowers the threshold of real-time analysis.
It should be noted that, in the embodiment of the present invention, the message processor cluster is a kafka cluster.
In the embodiment of the invention, in order to facilitate job submission and organization, a function supporting batch submission of parsed SQL files is developed, and two modes are supported: a session mode and a job cluster mode. The session mode can efficiently run a group of strongly related tasks in one cluster, saving resources and facilitating task management. In job cluster mode, a cluster is started for each task, which strengthens the isolation between tasks but carries a higher resource cost. Business tasks can flexibly choose between the two according to their own requirements. The physical execution plan is submitted for distributed execution on Yarn. By configuring a task checkpoint mechanism and task-failure retry, stable operation of tasks can be effectively ensured, and the concurrency can later be changed dynamically as traffic changes, so that more resources can be applied for dynamically. During job operation, metrics are sent to Prometheus at buried points defined in the TableSource and TableSink and are displayed using Grafana. The metrics granularity is divided into system indicators, task indicators, and latency indicators, displaying the health state of a job in an all-around, multi-dimensional way, and alarms are raised in combination with AlertManager.
FIG. 4 is a schematic diagram of the data warehouse hierarchy. In implementation, real-time processing of data streams is the main concern, and to facilitate management we abstract a real-time data warehouse model. With Kafka as the storage engine, Flink SQL as the compute engine, and JSON as the data format, the data warehouse is divided into layers, each with its own scope, so that users can conveniently locate and understand a table when using it. The main classification, as shown in FIG. 4, is: ODS layer: mainly stores the detailed flow logs after splitting and cleaning, shielding upper-layer services from abnormalities in the original data. Dimension layer: uses MySQL to store information such as the time dimension, region dimension, and service dimension, for Flink SQL joins to expand information. Theme layer: at a higher logical level, abstracts familiar business concepts such as user portraits, anchor portraits, user behavior, and financial topics; the main implementation logic computes statistical indicators over intervals such as 1 minute, 5 minutes, and 1 hour through SQL in real time, developing general-purpose middle-layer data that can greatly reduce repeated computation. Service layer: according to service requirements, a complex task is decomposed into several subtasks, and the data of the theme layer is cross-combined to complete the service logic. Each layer handles only a single step, is relatively simple and easy to understand, and makes it convenient to maintain the accuracy of the data.
In the embodiment of the invention, for metadata, a metadata system needs to be established to connect the abstracted data warehouse model to the SQL real-time analysis system and to describe the model; five tables, namely catalogs, tables, table_configs, columns, and tbl_privs, are mainly defined for information management.
1) The database table catalogs logically corresponds to categorizing tables, representing different topics and uses, and physically corresponds to a database. The tables correspond to the different Kafka topics, Redis data sources, and MySQL data tables, holding the log function data. The table configuration table table_configs describes connection methods for Kafka, Redis, MySQL and the like, and information such as whether a table is an input or an output; it is used by the TableSource and TableSink for connecting to external systems and for the serialization of data. The columns table is used to describe the fields of the data, corresponding to the schema of the table; the corresponding fields are extracted from the JSON format.
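The metadata model named above (catalogs, tables, table_configs, columns, tbl_privs) can be sketched as follows. The field names, types, and sample values here are illustrative assumptions made for the example, not the patent's actual table definitions.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str        # field name extracted from the JSON format
    dtype: str       # e.g. "BIGINT", "VARCHAR"

@dataclass
class Table:
    name: str                     # physical source: a Kafka topic, Redis key space, or MySQL table
    catalog: str                  # logical grouping by topic/use, as in the catalogs table
    config: dict = field(default_factory=dict)   # connection info (kafka/redis/mysql, input|output), as in table_configs
    columns: list = field(default_factory=list)  # schema of the table, as in the columns table
    privs: dict = field(default_factory=dict)    # user -> permission, as in tbl_privs

click_table = Table(
    name="topic_click", catalog="user_behavior",
    config={"connector": "kafka", "direction": "input"},
    columns=[Column("logtime", "VARCHAR"), Column("urs", "BIGINT")],
    privs={"analyst": "SELECT"})
```

At startup, records of this shape would be assembled into connector, format, and schema information for registration with the analysis system.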
The rights table tbl_privs is used to control a user's access rights to tables. Taking columns and tbl_privs as examples:
tbl_privs table structure:
When the real-time SQL system is started, a MySQL connection pool is used to obtain information such as databases, table configurations, and fields from the metadata according to the user's permissions; the connector, format, and schema information is assembled, custom Flink ExternalCatalog and ExternalCatalogTable implementations are realized, and the external metadata is then injected into the real-time analysis system through the TableEnvironment registerCatalog interface. When an SQL statement is parsed, information can be obtained through the metadata provider interfaces, finally forming the corresponding TableSource, StreamOperator, and TableSink DAG physical plan, which is scheduled for execution on Yarn.
In the embodiment of the present invention, in order to facilitate operation, a web UI is provided for configuring the data system, making it convenient to define the tables of the real-time data warehouse. FIG. 5 is a schematic diagram of the web UI according to the embodiment of the invention; specifically, the data system may be configured as shown in FIG. 5.
In addition, the data analysis system provided by the embodiment of the invention also constructs the lineage relationships between data. FIG. 6 is a schematic diagram of data lineage according to an embodiment of the invention. As shown in FIG. 6, through sqlParse it can be analyzed from which table each SQL statement reads and into which table it inserts; the data lineage between the tables is then built, with the tables as nodes and the SQL statements as edges, forming a directed acyclic graph whose nodes and edges are stored in a graph database. The figure shows the lineage graph formed by the click-log cleaning flow. With the data lineage diagram, the definition of each table, the source and destination of each table, and the processing logic of each field can conveniently be checked, making it easy to understand the data business flow and to locate data problems.
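The lineage construction described above can be sketched as follows. The single regex here is a deliberate simplification, an assumption for illustration only; the real system parses full Flink SQL with sqlParse rather than pattern matching.

```python
import re

def add_lineage(graph, sql):
    """Record table-level lineage: tables are nodes, the SQL statement is the edge.

    graph: dict mapping (source_table, target_table) -> sql statement.
    """
    m = re.search(r'insert\s+into\s+(\w+).*?\bfrom\s+(\w+)', sql,
                  re.IGNORECASE | re.DOTALL)
    if m:
        target, source = m.group(1), m.group(2)
        graph[(source, target)] = sql   # directed edge: source -> target
    return graph

lineage = {}
add_lineage(lineage,
            "INSERT INTO dws_click_stats SELECT urs, COUNT(*) FROM ods_click GROUP BY urs")
```

Accumulating such edges over all jobs yields the directed acyclic graph that is stored in the graph database.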
The data analysis system provided by the embodiment of the invention mainly comprises a log buried-point real-time collection module, a real-time distribution module, the Flink SQL real-time analysis system ecology, and a real-time data warehouse system. The log buried-point real-time collection module is mainly responsible for standardizing the log format and sending the logs to the Kafka system in real time. The real-time data distribution system mainly splits the original data of the unified log collection channel, for fine-grained control and data access. For the Flink SQL real-time analysis system, we extend the TableSource and TableSink Connector, Schema, and UDF ecology, and build job fault-tolerance and monitoring mechanisms. The real-time warehouse system is mainly oriented to business-model data layering, the metadata system, and data lineage, so that real-time data is managed in a standardized way. A high-performance, easily extensible, and stable millisecond-level real-time analysis ecosystem is thus established.
In addition, by building an SQL-on-Hadoop system on the basis of Flink, the threshold of real-time analysis can be greatly lowered: planned statistical analysis can be completed conveniently and quickly using SQL, and challenging tasks such as real-time warehousing and real-time features no longer require concern for the underlying details. At the same time, peripheral plugins such as Redis, MySQL, and log formats are extended to perfect the service environment. A metadata system, data lineage, and a wiki system are established to effectively manage real-time data, facilitating the maintenance and reuse of data and reducing the workload of repeated development. A sound environment for real-time analysis, including monitoring, resource scaling, and state recovery, is established, well ensuring the stability and efficiency of the system. This provides a solid foundation for upper-layer applications such as real-time feature systems and real-time reports.
The data analysis system provided by the embodiment of the invention achieves the following beneficial effects: 1) real-time performance can reach the millisecond level; 2) business personnel only need to write SQL for analysis, which greatly lowers the analysis threshold and improves working efficiency; 3) complete peripheral components and ecosystem; 4) high extensibility and stability; 5) the splitting and real-time warehouse systems facilitate real-time data management and reuse.
Example 2
According to an embodiment of the present invention, there is provided a method embodiment of a data analysis method, applied to any one of the data analysis systems described above. It should be noted that the steps shown in the flowchart of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that herein.
FIG. 7 is a flow chart of a method of data analysis according to an embodiment of the present invention; as shown in FIG. 7, the method comprises the following steps:
Step S702, collecting original log data reported by at least one target application.
Step S704, cleaning and splitting the original log data according to preset configuration information to obtain a splitting result, wherein the preset configuration information comprises: correspondence between category information of log data and category information created on the message processor cluster.
Step S706, performing serialization processing and logic operation on the segmentation result to obtain an operation result, and storing the operation result.
As can be seen from the above, in the embodiment of the present invention, the original log data reported by at least one target application may be collected; the original log data is cleaned and split according to preset configuration information to obtain a splitting result, wherein the preset configuration information comprises the correspondence between the category information of the log data and the category information created on the message processor cluster; and serialization processing and logic operations are performed on the splitting result to obtain an operation result, which is stored. The purposes of splitting the original log data in real time and of performing serialization processing and logic operations on the splitting result to obtain an operation result are thereby achieved.
It is easy to notice that, in the embodiment of the present invention, the original log data reported by at least one target application may be collected; the original log data is cleaned and split according to preset configuration information to obtain a splitting result; and serialization processing and logic operations are performed on the splitting result to obtain an operation result, which is stored. This achieves the purposes of splitting the original log data in real time and of performing serialization processing and logic operations on the splitting result to obtain an operation result, producing the technical effect of reducing the difficulty of analyzing the service.
The data analysis method in the embodiment of the invention thus solves the technical problems in the related art that real-time analysis of a business requires operators to have sufficient domain knowledge and is difficult to implement.
Example 3
According to another aspect of the embodiment of the present invention, there is provided an apparatus for data analysis using the method of data analysis described above. FIG. 8 is a schematic diagram of an apparatus for data analysis according to an embodiment of the present invention; as shown in FIG. 8, the apparatus comprises: a collection unit 81, a processing unit 83, and an acquisition unit 85. The apparatus for data analysis is described in detail below.
And the collecting unit 81 is configured to collect raw log data reported by at least one target application.
The processing unit 83 is configured to perform cleaning and splitting processing on the original log data according to preset configuration information, so as to obtain a splitting result, where the preset configuration information includes: correspondence between category information of log data and category information created on the message processor cluster.
The obtaining unit 85 is configured to perform serialization processing and logic operation on the split result, obtain an operation result, and store the operation result.
Here, the collection unit 81, the processing unit 83, and the acquisition unit 85 correspond to steps S702 to S706 in embodiment 2; the examples and application scenarios implemented by these units are the same as those of the corresponding steps, but are not limited to the disclosure of embodiment 2. It should be noted that the above units may be implemented as part of an apparatus in a computer system, such as a set of computer-executable instructions.
As can be seen from the above, in the above embodiment of the present application, the collecting unit may be used to collect the original log data reported by at least one target application; the processing unit cleans and splits the original log data according to preset configuration information to obtain a splitting result, wherein the preset configuration information comprises the correspondence between the category information of the log data and the category information created on the message processor cluster; and the acquisition unit performs serialization processing and logic operations on the splitting result to obtain an operation result, which is stored. The data analysis apparatus in the embodiment of the present application thus achieves the purposes of splitting the original log data in real time and of performing serialization processing and logic operations on the splitting result to obtain an operation result, produces the technical effect of reducing the difficulty of analyzing the service, and solves the technical problems in the related art that real-time analysis of a business requires operators to have sufficient domain knowledge and is difficult to implement.
Example 4
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the program performs the method of data analysis described above.
Example 5
According to another aspect of the embodiment of the present invention, there is provided a processor, configured to execute a program, where the program executes the method for data analysis described above.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention; such modifications and adaptations are intended to fall within the scope of the present invention.

Claims (16)

1. A data analysis system, comprising:
the log collection module is used for collecting original log data reported by at least one target application;
the data distribution module is used for cleaning and distributing the original log data according to preset configuration information to obtain a distribution result, wherein the preset configuration information comprises: the corresponding relation between the log type of the original log data and the category information created on the message processor cluster;
The data analysis module is used for carrying out serialization processing and logic operation on the shunting result to obtain an operation result, and storing the operation result;
the data distribution module is further configured to create a plurality of log lists corresponding to the log types in the message processor cluster by using the preset configuration information and the log types of the original log data; cleaning the original log data to obtain cleaned log data; extracting a log type field of the cleaned log data, using the log type field as field information of dimension to be expanded, expanding dimension information of the cleaned log data to obtain expanded log data, and distributing the expanded log data to the plurality of log lists to form the shunting result.
2. The data analysis system according to claim 1, wherein the log collection module is further configured to, after receiving the original log data reported by the at least one target application, perform format conversion on the original log data to obtain original log data in a predetermined format.
3. The data analysis system of claim 1, wherein the log collection module is further configured to determine the at least one target application and category information of log data that the at least one target application needs to collect in a buried point manner, so as to trigger the at least one target application to report the original log data after the at least one target application collects the original log data.
4. A data analysis system according to claim 3, wherein the at least one target application gathers log data for the client by at least one of: logging, nginx access log.
5. The data analysis system of claim 1, wherein the log collection module comprises:
the log acquisition sub-module is used for acquiring original log data reported by the at least one target application;
And the log monitoring sub-module is connected with the log acquisition sub-module and is used for triggering the log acquisition sub-module to send the original log data to the message processor cluster when the original log data are stored in the log acquisition sub-module.
6. The data analysis system of claim 1, wherein the data splitting module is further configured to configure a log type of raw log data that needs to be split by MySQL to create a plurality of log lists corresponding to the log type in the message processor cluster.
7. The data analysis system of claim 6, wherein the data splitting module is further configured to obtain a splitting rule updated to the local cache by using flink distributed computing system through a database connection pool, so as to split the original log data by using the splitting rule, and obtain split log data.
8. The data analysis system of claim 7, wherein the data splitting module is further configured to extract a log type field of the split log data, expand dimension information of the split log data to obtain expanded log data, and distribute the expanded log data to a plurality of log lists of the message processor cluster to form a data source of a data warehouse.
9. The data analysis system of claim 8, wherein the data analysis module is configured to parse the received SQL statement to obtain a parsed SQL statement, and perform serialization processing and logic operation on the data in the data warehouse based on the service requirement corresponding to the parsed SQL statement to obtain the operation result.
10. The data analysis system according to claim 9, wherein the data analysis module is configured to obtain, after receiving the SQL statement, data in the data warehouse through a table source provided by the flink distributed computing system, and submit the obtained data in the data warehouse to a stream operator of the flink distributed computing system, so as to perform statistical analysis on the received data in the data warehouse by using the stream operator, and obtain the statistical analysis result.
11. The data analysis system according to claim 10, wherein the data analysis module is further configured to store an operation result obtained by the stream operator serialization process and the logical operation through a table sink provided by the flink distributed operation system, and store the operation result into a database for a service party to use.
12. The data analysis system of any of claims 1 to 11, wherein the message processor cluster is a kafka cluster.
13. A method of data analysis as claimed in any one of claims 1 to 12, applied to a data analysis system comprising:
Collecting original log data reported by at least one target application;
Cleaning and splitting the original log data according to preset configuration information to obtain a splitting result, wherein the preset configuration information comprises: the corresponding relation between the log type of the original log data and the category information created on the message processor cluster;
Carrying out serialization processing and logic operation on the shunting result to obtain an operation result, and storing the operation result;
Cleaning and splitting the original log data according to the preset configuration information, and obtaining the splitting result comprises the following steps: creating a plurality of log lists corresponding to the log types in the message processor cluster by utilizing the preset configuration information and the log types of the original log data; cleaning the original log data to obtain cleaned log data; extracting a log type field of the cleaned log data, using the log type field as field information of dimension to be expanded, expanding dimension information of the cleaned log data to obtain expanded log data, and distributing the expanded log data to the plurality of log lists to form the shunting result.
14. An apparatus for data analysis, characterized in that the method for data analysis according to claim 13 is used, comprising:
the collecting unit is used for collecting the original log data reported by at least one target application;
The processing unit is used for cleaning and splitting the original log data according to preset configuration information to obtain a splitting result, wherein the preset configuration information comprises: the corresponding relation between the log type of the original log data and the category information created on the message processor cluster;
The acquisition unit is used for carrying out serialization processing and logic operation on the shunting result to obtain an operation result, and storing the operation result;
The processing unit is further configured to create a plurality of log lists corresponding to the log types in the message processor cluster by using the preset configuration information and the log types of the original log data; cleaning the original log data to obtain cleaned log data; extracting a log type field of the cleaned log data, using the log type field as field information of dimension to be expanded, expanding dimension information of the cleaned log data to obtain expanded log data, and distributing the expanded log data to the plurality of log lists to form the shunting result.
15. A storage medium comprising a stored program, wherein the program performs the method of data analysis of claim 13.
16. A processor for running a program, wherein the program when run performs the method of data analysis as claimed in claim 13.
CN202010014747.8A 2020-01-07 2020-01-07 Data analysis system, data analysis method and device Active CN111241078B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010014747.8A CN111241078B (en) 2020-01-07 2020-01-07 Data analysis system, data analysis method and device


Publications (2)

Publication Number Publication Date
CN111241078A CN111241078A (en) 2020-06-05
CN111241078B true CN111241078B (en) 2024-06-21


Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111935266A (en) * 2020-08-03 2020-11-13 成都深思科技有限公司 Data distribution system
CN112000636A (en) * 2020-08-31 2020-11-27 民生科技有限责任公司 User behavior statistical analysis method based on Flink streaming processing
CN112181678A (en) * 2020-09-10 2021-01-05 珠海格力电器股份有限公司 Service data processing method, device and system, storage medium and electronic device
CN112162903A (en) * 2020-09-24 2021-01-01 常州微亿智造科技有限公司 Method and system for monitoring state of service system based on Flink
CN112215568A (en) * 2020-09-28 2021-01-12 国网山东省电力公司冠县供电公司 Universal electric power enterprise data analysis system based on state network index system
CN112035425B (en) * 2020-10-27 2021-11-09 南京星云数字技术有限公司 Log storage method and device and computer system
CN112506743A (en) * 2020-12-09 2021-03-16 天津狮拓信息技术有限公司 Log monitoring method and device and server
CN112596997A (en) * 2020-12-29 2021-04-02 科技谷(厦门)信息技术有限公司 Automatic flow control method based on Flink real-time calculation
CN112686697A (en) * 2020-12-29 2021-04-20 百果园技术(新加坡)有限公司 Multi-dimension-based user behavior data processing method and device
CN112749236A (en) * 2020-12-29 2021-05-04 食亨(上海)科技服务有限公司 Data maintenance method of data warehouse
CN112597170B (en) * 2020-12-31 2024-02-06 平安银行股份有限公司 Redis database optimization method and system
CN113015203B (en) * 2021-03-22 2022-08-16 Oppo广东移动通信有限公司 Information acquisition method, device, terminal, system and storage medium
CN113179302B (en) * 2021-04-19 2022-09-16 杭州海康威视***技术有限公司 Log system, and method and device for collecting log data
CN113672601A (en) * 2021-07-22 2021-11-19 北京明略软件***有限公司 Streaming data supplementing method and system, electronic device and storage medium
CN113672685A (en) * 2021-09-03 2021-11-19 携程商旅信息服务(上海)有限公司 Information processing method, apparatus, and medium
CN114143269A (en) * 2021-11-12 2022-03-04 上海途虎信息技术有限公司 HTTP request distribution method, device, equipment and medium
CN114385140B (en) * 2021-12-29 2023-03-24 武汉达梦数据库股份有限公司 Method and device for processing multiple different outputs of ETL flow assembly based on flink framework
CN114819458A (en) * 2021-12-31 2022-07-29 第四范式(北京)技术有限公司 Simulation model construction method and simulation model construction device
CN114116266B (en) * 2022-01-27 2022-05-17 北京华品博睿网络技术有限公司 Method and system for automatically splitting messages based on stream computing
CN114153823B (en) * 2022-02-09 2022-05-17 北京华品博睿网络技术有限公司 Distributed computing job log data processing method and system
CN115203336A (en) * 2022-09-19 2022-10-18 平安银行股份有限公司 Database data real-time synchronization method, system, computer terminal and storage medium
CN115645905B (en) * 2022-10-21 2023-06-20 圣名科技(广州)有限责任公司 Cursor display method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577805A (en) * 2017-09-26 2018-01-12 华南理工大学 Business service system for log big data analysis
CN107918621A (en) * 2016-10-10 2018-04-17 阿里巴巴集团控股有限公司 Log data processing method, device and business system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108734437A (en) * 2017-04-13 2018-11-02 普天信息技术有限公司 Business system optimization method and device
CN110362544B (en) * 2019-05-27 2024-04-02 中国平安人寿保险股份有限公司 Log processing system, log processing method, terminal and storage medium
CN110659307A (en) * 2019-09-06 2020-01-07 西安交大捷普网络科技有限公司 Event stream correlation analysis method and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918621A (en) * 2016-10-10 2018-04-17 阿里巴巴集团控股有限公司 Log data processing method, device and business system
CN107577805A (en) * 2017-09-26 2018-01-12 华南理工大学 Business service system for log big data analysis

Also Published As

Publication number Publication date
CN111241078A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111241078B (en) Data analysis system, data analysis method and device
CN107577805B (en) Business service system for log big data analysis
US11182098B2 (en) Optimization for real-time, parallel execution of models for extracting high-value information from data streams
US9946593B2 (en) Recovery strategy for a stream processing system
CN110362544B (en) Log processing system, log processing method, terminal and storage medium
Zadrozny et al. Big data analytics using Splunk: Deriving operational intelligence from social media, machine data, existing data warehouses, and other real-time streaming sources
US12008027B2 (en) Optimization for real-time, parallel execution of models for extracting high-value information from data streams
CN111400326A (en) Smart city data management system and method thereof
CN113360554B (en) Method and equipment for extracting, converting and loading ETL (extract transform load) data
CN111339071A (en) Method and device for processing multi-source heterogeneous data
CN108052679A Hadoop-based log analysis system
CN109753596B (en) Information source management and configuration method and system for large-scale network data acquisition
CN109814992A (en) Distributed dynamic dispatching method and system for the acquisition of large scale network data
WO2022165168A1 (en) Configuring an instance of a software program using machine learning
CN104993957A Method for providing a cloud log service for distributed applications using Log4j
Anderson et al. Architectural Implications of Social Media Analytics in Support of Crisis Informatics Research.
Terzi et al. Evaluations of big data processing
CN107480189A Multi-dimensional real-time analysis device and method
Mishra et al. Challenges in big data application: a review
CN112579552A (en) Log storage and calling method, device and system
Martínez-Castaño et al. Polypus: a big data self-deployable architecture for microblogging text extraction and real-time sentiment analysis
EP3380906A1 (en) Optimization for real-time, parallel execution of models for extracting high-value information from data streams
Wang et al. Towards an efficient platform for social big data analytics
CN111008084B (en) Multi-input and multi-output message format conversion method and device
Peña et al. A “Fast Data” architecture: Dashboard for anomalous traffic analysis in data networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant