CN106294866B - Log processing method and device - Google Patents

Log processing method and device

Info

Publication number
CN106294866B
CN106294866B (application CN201610710011.8A)
Authority
CN
China
Prior art keywords
metadata
log
key
value
time
Prior art date
Legal status
Expired - Fee Related
Application number
CN201610710011.8A
Other languages
Chinese (zh)
Other versions
CN106294866A (en)
Inventor
徐胜国
王义辉
王素梅
沈迪
李铮
Current Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201610710011.8A
Publication of CN106294866A
Application granted
Publication of CN106294866B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a log processing method and a log processing device. The method comprises the following steps: receiving a segment of data stream from an input source at intervals of a preset time period, wherein each segment of data stream comprises a plurality of logs generated in the previous preset time period; after each segment of data stream is received, performing primary analysis on each log in the segment of data stream to obtain specified structure data corresponding to each log, performing initial aggregation on the plurality of specified structure data corresponding to the plurality of logs in the segment of data stream, and caching the specified structure data after the initial aggregation; and when the calculation opportunity of a preset statistical period is reached, performing calculation processing on the cached specified structure data to obtain a log processing result corresponding to the statistical period. In this process, no intermediate storage medium is needed, so the delay caused by starting a process to read and write a storage medium is avoided; the primary analysis, initial aggregation and calculation processing of the logs are all completed in one pass, the processing is efficient and stable, intermediate data is not easily lost, and the real-time performance of log processing is ensured as far as possible.

Description

Log processing method and device
Technical Field
The invention relates to the technical field of internet, in particular to a log processing method and device.
Background
With the continuous development of internet technology, the trend toward internet big data has become increasingly remarkable: every internet service line continuously generates new data, and further processing the generated data to feed back into the operation of internet services is one of the important tasks. In the prior art, when a user wants to process the data output by a certain data source, the user needs to manually write a data processing program according to the corresponding processing requirements; different programs need to be rewritten for different data processing requirements, and different users each need to write the programs they require. This is time-consuming and labor-intensive, data processing efficiency is low, and the approach does not fit the development trend of big data.
Disclosure of Invention
In view of the above, the present invention has been made to provide a log processing method and apparatus that overcome or at least partially solve the above problems.
According to an aspect of the present invention, there is provided a log processing method, including:
receiving a section of data stream from an input source at intervals of a preset time period, wherein each section of data stream comprises a plurality of logs generated in the previous preset time period;
after each section of data stream is received, performing primary analysis on each log in the section of data stream to obtain specified structure data corresponding to each log; performing initial aggregation on a plurality of specified structure data corresponding to a plurality of logs in the data stream, and caching the specified structure data after the initial aggregation;
and when the calculation time of the preset statistical period is reached, calculating the cached specified structure data to obtain a log processing result corresponding to the statistical period.
Optionally, the performing the primary analysis on each log in the segment of the data stream to obtain the specified structure data corresponding to each log includes:
and extracting one or more specified fields from each log, taking the set of the one or more specified fields as a key, and taking the number of times of the set of the one or more fields appearing in the log as a value, so as to obtain metadata in the form of a key-value pair corresponding to the log.
Optionally, the performing initial aggregation on a plurality of pieces of specified structure data corresponding to a plurality of logs in the segment of data stream includes:
and for a plurality of metadata corresponding to a plurality of logs in the data stream, primarily aggregating values of the metadata according to keys of the metadata to obtain one or more metadata after primary aggregation.
Optionally, the caching the primarily aggregated specified structure data includes:
writing the metadata after the initial aggregation into a distributed file system for caching;
when each metadata is written into the distributed file system, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset updating rule; and if cached metadata which is the same as the key of the metadata to be written does not exist in the distributed file system, directly writing the metadata to be written into the distributed file system.
Optionally, the extracting one or more specified fields from each log, taking a set of the one or more specified fields as a key, and taking the number of times that the set of the one or more fields appears in the log as a value, and obtaining the metadata in the form of a key-value pair corresponding to the log includes:
and extracting a field indicating the identification information of the user and a field indicating the identification information of the service accessed by the user from each log, taking a set of the field indicating the identification information of the user and the field indicating the identification information of the service accessed by the user as a key, taking '1' as a value, and forming metadata corresponding to the log by using the key and the value.
Optionally, the field indicating the identification information of the service accessed by the user includes: URL address, version information, signature information, and/or channel information.
Optionally, the extracting, from each log, a field indicating identification information of an accessing user and a field indicating identification information of a service accessed by the user, taking a set of the field indicating identification information of the accessing user and the field indicating identification information of the service accessed by the user as a key, and taking "1" as a value, and the key and the value forming metadata corresponding to the log further includes:
and extracting a field indicating time information of the user for accessing the service from each log, and recording the time corresponding to the metadata corresponding to each log according to the field.
Optionally, the performing, for a plurality of metadata corresponding to a plurality of logs in the segment of data stream, the initial aggregation of the values of the metadata according to the key of the metadata includes:
acquiring the statistical period into which each metadata falls according to the time corresponding to the metadata; for the metadata falling into the same statistical period, performing primary aggregation on the values of the metadata according to the keys of the metadata; and for a plurality of metadata with the same key falling into the same statistical period, taking the largest corresponding time as the time corresponding to the metadata obtained after the initial aggregation of the plurality of metadata.
Optionally, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset update rule further includes:
judging whether the time corresponding to the cached metadata and the time corresponding to the written metadata fall into the same statistical period or not, if so, performing cumulative updating on the value of the cached metadata by using the value of the metadata to be written, and taking the larger time of the time corresponding to the metadata to be written and the time corresponding to the cached metadata as the time corresponding to the updated metadata; otherwise, the metadata to be written is directly written into the distributed file system.
Optionally, when the calculation opportunity of the preset statistical period is reached, performing calculation processing on the cached specified structure data, and obtaining a log processing result corresponding to the statistical period includes:
when a first preset time after the end time of the current statistical period is reached, reading metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata;
performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking the field indicating the identification information of the service accessed by the user as a key of the metadata after the secondary analysis, and taking the value of the metadata as the value of the metadata after the secondary analysis;
performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation;
and the value of each secondary aggregated metadata represents the total access times of the service indicated by the key of the secondary aggregated metadata in the current statistical period.
Optionally, when the calculation opportunity of the preset statistical period is reached, performing calculation processing on the cached specified structure data, and obtaining a log processing result corresponding to the statistical period includes:
when a first preset time after the end time of the current statistical period is reached, reading metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata;
performing secondary analysis on each read metadata, removing the field indicating the identification information of the user from the key of the metadata, taking the field indicating the identification information of the service accessed by the user as the key of the metadata after the secondary analysis, and taking '1' as the value of the metadata after the secondary analysis;
performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation;
the value of each twice aggregated metadata represents the number of independent visitors to the service indicated by the key of the twice aggregated metadata in the current statistical period.
Optionally, the method further comprises:
and filtering out the expired metadata cached in the distributed file system when a second preset time before the starting time of the current statistical period is reached.
Optionally, filtering out the expired metadata cached in the distributed file system includes:
comparing the time corresponding to each metadata cached in the distributed file system with the starting time of the current statistical period, and if the time corresponding to a metadata is earlier than the starting time of the current statistical period, determining that the metadata is expired metadata and setting the value of the metadata to null;
and deleting the metadata with null values in the distributed file system.
Optionally, the method further comprises:
scheduling a thread to read configuration information input by a user at preset time intervals; the configuration information includes: time configuration information, input source information, parsing rules, and/or computing rules;
the receiving a data stream from an input source at every preset time period comprises: determining a preset time period according to time configuration information in the configuration information, determining an input source according to input source information in the configuration information, and receiving a section of data stream from the corresponding input source at intervals of the preset time period;
the primary analysis of each log in the data stream comprises: performing primary analysis on each log according to an analysis rule in the configuration information;
the calculation processing of the cached specified structure data comprises: and calculating the cached specified structure data according to the calculation rule in the configuration information.
Optionally, the configuration information further includes: computing platform information; the log processing method is executed based on the computing platform indicated by the computing platform information;
the computing platform is a Spark Streaming computing platform.
Optionally, the method further comprises:
storing the log processing result corresponding to the statistical period in a storage medium;
the storage medium includes: a Redis database, a MySQL database, and/or a Greenplum database.
According to another aspect of the present invention, there is provided a log processing apparatus including:
the data receiving unit is suitable for receiving a section of data stream from an input source at intervals of a preset time period, and each section of data stream comprises a plurality of logs generated in the previous preset time period;
the computing processing unit is suitable for performing primary analysis on each log in each segment of data stream after each segment of data stream is received to obtain specified structure data corresponding to each log; performing initial aggregation on a plurality of specified structure data corresponding to a plurality of logs in the data stream, and caching the specified structure data after the initial aggregation; and when the calculation time of the preset statistical period is reached, calculating the cached specified structure data to obtain the log processing result corresponding to the statistical period.
Optionally, the calculation processing unit is adapted to extract one or more specified fields from each log, obtain metadata in the form of a key-value pair corresponding to the log by using a set of the one or more specified fields as a key and using the number of times that the set of the one or more fields appears in the log as a value.
Optionally, the calculation processing unit is adapted to, for a plurality of metadata corresponding to a plurality of logs in the segment of data stream, perform initial aggregation on values of the metadata according to a key of the metadata, to obtain one or more metadata after initial aggregation.
Optionally, the computing processing unit is adapted to write the metadata after the initial aggregation into a distributed file system for caching; when each metadata is written into the distributed file system, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset updating rule; and if cached metadata which is the same as the key of the metadata to be written does not exist in the distributed file system, directly writing the metadata to be written into the distributed file system.
Optionally, the calculation processing unit is adapted to extract, from each log, a field indicating identification information of the user and a field indicating identification information of a service accessed by the user, take a set of the field indicating identification information of the user and the field indicating identification information of the service accessed by the user as a key, take "1" as a value, and form metadata corresponding to the log by the key and the value.
Optionally, the field indicating the identification information of the service accessed by the user includes: URL address, version information, signature information, and/or channel information.
Optionally, the computing processing unit is further adapted to extract a field indicating time information of the user accessing the service from each log, and record a time corresponding to the metadata corresponding to each log according to the field.
Optionally, the calculation processing unit is adapted to obtain, according to the time corresponding to each metadata, the statistical period into which the metadata falls; for the metadata falling into the same statistical period, perform primary aggregation on the values of the metadata according to the keys of the metadata; and for a plurality of metadata with the same key falling into the same statistical period, take the largest corresponding time as the time corresponding to the metadata obtained after the initial aggregation of the plurality of metadata.
Optionally, the calculation processing unit is further adapted to, when it is determined that cached metadata that is the same as a key of metadata to be written exists in the distributed file system, further determine whether a time corresponding to the cached metadata and a time corresponding to the metadata to be written fall within a same statistical period, if yes, perform cumulative update on a value of the cached metadata using the value of the metadata to be written, and use a larger time of the time corresponding to the metadata to be written and the time corresponding to the cached metadata as a time corresponding to the updated metadata; otherwise, the metadata to be written is directly written into the distributed file system.
Optionally, the calculation processing unit is adapted to, when a first preset time after the end time of the current statistics period is reached, read metadata falling into the current statistics period from the distributed file system according to a time corresponding to each metadata; performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking the field indicating the identification information of the service accessed by the user as a key of the metadata after the secondary analysis, and taking the value of the metadata as the value of the metadata after the secondary analysis; performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation; and the value of each secondary aggregated metadata represents the total access times of the service indicated by the key of the secondary aggregated metadata in the current statistical period.
Optionally, the calculation processing unit is adapted to, when a first preset time after the end time of the current statistics period is reached, read metadata falling into the current statistics period from the distributed file system according to the time corresponding to each metadata; perform secondary analysis on each read metadata, removing the field indicating the identification information of the user from the key of the metadata, taking the field indicating the identification information of the service accessed by the user as the key of the metadata after the secondary analysis, and taking '1' as the value of the metadata after the secondary analysis; and perform secondary aggregation on the values of the metadata subjected to secondary analysis according to the keys of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation; the value of each twice aggregated metadata represents the number of independent visitors to the service indicated by the key of the twice aggregated metadata in the current statistical period.
Optionally, the apparatus further comprises:
and the filtering unit is suitable for filtering out the expired metadata cached in the distributed file system when a second preset time before the starting time of the current statistical period is reached.
Optionally, the filtering unit is adapted to compare a time corresponding to each metadata cached in the distributed file system with a start time of a current statistics period, and if the time corresponding to one metadata is less than the start time of the current statistics period, determine that the metadata is expired metadata, and set a value of the metadata to null; and deleting the metadata with null values in the distributed file system.
Optionally, the apparatus further comprises: configuring a reading unit;
the configuration reading unit is suitable for scheduling a thread to read configuration information input by a user at preset time intervals; the configuration information includes: time configuration information, input source information, parsing rules, and/or computing rules;
the data receiving unit is suitable for determining a preset time period according to the time configuration information in the configuration information, determining an input source according to the input source information in the configuration information, and receiving a section of data stream from the corresponding input source at intervals of the preset time period;
the computing processing unit is suitable for performing primary analysis on each log according to an analysis rule in the configuration information; and the method is suitable for performing calculation processing on the cached specified structure data according to the calculation rule in the configuration information.
Optionally, the configuration information further includes: computing platform information; the log processing device runs on the computing platform indicated by the computing platform information;
the computing platform is a Spark Streaming computing platform.
Optionally, the apparatus further comprises:
the storage processing unit is suitable for storing the log processing result corresponding to the statistical period into a storage medium;
the storage medium includes: a Redis database, a MySQL database, and/or a Greenplum database.
As can be seen from the above, in the technical solution provided by the present invention, logs are received as segments of data streams; the logs in each received segment are subjected to primary analysis and initial aggregation, and the resulting specified structure data is further subjected to calculation processing to obtain the log processing result. The whole process takes place in memory: the specified structure data obtained from the primary processing of the continuously inflowing logs is temporarily backed up in a cache, so that the cached specified structure data can be processed when the calculation opportunity arrives. No intermediate storage medium is needed, which avoids the delay caused by starting a process to read and write a storage medium and to compute over the data stored there. The primary analysis, initial aggregation and calculation processing of the logs are all completed in one pass, yielding a log processing scheme that is efficient, runs stably, is unlikely to lose intermediate data, and guarantees real-time performance as far as possible.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a flow diagram of a method of log processing according to one embodiment of the invention;
FIG. 2 shows a schematic diagram of a log processing apparatus according to an embodiment of the invention;
fig. 3 shows a schematic diagram of a log processing apparatus according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 shows a flow chart of a log processing method according to an embodiment of the invention. As shown in fig. 1, the method includes:
step S110, receiving a segment of data stream from an input source every a preset time period, where each segment of data stream includes a plurality of logs generated in a previous preset time period.
Step S120, after each segment of data stream is received, performing primary analysis on each log in the segment of data stream to obtain the specified structure data corresponding to each log; and performing initial aggregation on the plurality of specified structure data corresponding to the plurality of logs in the segment of data stream, and caching the specified structure data after the initial aggregation.
Step S130, when the calculation time of the preset statistical period is reached, calculating the cached specified structure data to obtain the log processing result corresponding to the statistical period.
It can be seen that, in the method shown in fig. 1, logs are received as segments of data streams; the logs in each received segment are subjected to primary analysis and initial aggregation, and the resulting specified structure data is further subjected to calculation processing to obtain the log processing result. The whole process takes place in memory: the specified structure data obtained from the primary processing of the continuously inflowing logs is temporarily backed up in a cache, so that the cached specified structure data can be processed when the calculation opportunity arrives. No intermediate storage medium is needed, which avoids the delay caused by starting a process to read and write a storage medium and to compute over the data stored there. The primary analysis, initial aggregation and calculation processing of the logs are all completed in one pass, yielding a log processing scheme that is efficient, runs stably, is unlikely to lose intermediate data, and guarantees real-time performance as far as possible.
In an embodiment of the present invention, after receiving each segment of data stream, step S120 of the method shown in fig. 1 performs primary analysis on each log in the segment of data stream, and obtaining the specified structure data corresponding to each log includes: extracting one or more specified fields from each log, taking the set of the one or more specified fields as a key, and taking the number of times the set of the one or more fields appears in the log as a value, so as to obtain metadata in the form of a key-value pair corresponding to the log. For example, the specified fields include a field indicating a user ID and a field indicating the URL address clicked by the user. In one log A, the field indicating the user ID is "xiaoming" and the field indicating the URL address clicked by the user is "www.soopat.com"; the set (xiaoming, www.soopat.com) is then used as the key, the number of times "1" that the set appears in log A is used as the value, and the metadata corresponding to log A is obtained as ((xiaoming, www.soopat.com), 1).
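For illustration only, the following is a minimal Python sketch of such a primary parsing step. The log line layout, the regular expression and the field names user/url are assumptions made for the sketch and are not prescribed by the patent.

```python
import re

# Illustrative log line format (an assumption for this sketch):
#   "... user=xiaoming url=www.soopat.com ..."
LOG_PATTERN = re.compile(r"user=(?P<user>\S+)\s+url=(?P<url>\S+)")

def primary_parse(log_line):
    """Primary analysis: extract the specified fields from one log and
    build key-value metadata ((user, url), 1)."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None  # the log does not contain the specified fields
    key = (match.group("user"), match.group("url"))
    return (key, 1)  # the set of fields occurs once in this log

# Example: primary_parse("... user=xiaoming url=www.soopat.com ...")
# -> (("xiaoming", "www.soopat.com"), 1)
```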
After the metadata in the form of the key-value pair corresponding to each log is obtained from the initial analysis, the step S120 of performing the initial aggregation on the plurality of specified structure data corresponding to the plurality of logs in the segment of data stream includes: for the plurality of metadata corresponding to the plurality of logs in the data stream, initially aggregating the values of the metadata according to the keys of the metadata to obtain one or more metadata after initial aggregation. For example, in the initial aggregation, for a plurality of metadata having the same key, the initially aggregated metadata corresponding to them is formed by using the sum of their values as the value and the shared key as the key; metadata whose key differs from all other metadata is taken as its own initially aggregated metadata.
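A minimal Python sketch of the initial aggregation described above, summing the values of metadata that share a key within one segment of data stream; the helper name and the list-of-tuples representation are assumptions.

```python
from collections import defaultdict

def primary_aggregate(metadata_list):
    """Initial aggregation: sum the values of metadata that share a key.
    metadata_list is a list of ((user, url), count) pairs from one data-stream segment."""
    aggregated = defaultdict(int)
    for key, value in metadata_list:
        aggregated[key] += value
    return list(aggregated.items())

# Example:
# primary_aggregate([(("xiaoming", "www.soopat.com"), 1),
#                    (("xiaoming", "www.soopat.com"), 1),
#                    (("lihua", "www.soopat.com"), 1)])
# -> [(("xiaoming", "www.soopat.com"), 2), (("lihua", "www.soopat.com"), 1)]
```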
After obtaining the one or more metadata after the initial aggregation, the caching the specified structure data after the initial aggregation in step S120 includes: writing the metadata after the initial aggregation into a distributed file system (HDFS) for caching; when each metadata is written into the distributed file system, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset updating rule; and if cached metadata which is the same as the key of the metadata to be written does not exist in the distributed file system, directly writing the metadata to be written into the distributed file system.
For example, suppose one piece of specified structure data after the initial aggregation is ((xiaoming, www.soopat.com), 1). When writing this specified structure data into the HDFS, if the metadata ((xiaoming, www.soopat.com), 5) already exists in the HDFS, the cached value "5" is updated according to the value "1" of the metadata to be written and the preset update rule; if the preset update rule is accumulation, the updated cached metadata is ((xiaoming, www.soopat.com), 6), which indicates that the specified structure data has been written into the HDFS. If no cached metadata with the key (xiaoming, www.soopat.com) exists in the HDFS, then ((xiaoming, www.soopat.com), 1) is written into the HDFS directly.
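The caching step can be sketched in Python as follows; a plain dictionary stands in for the HDFS-backed cache, and the default update rule is accumulation as in the example above. The function name and dictionary representation are assumptions made for the sketch.

```python
def write_to_cache(cache, metadata, update_rule=lambda old, new: old + new):
    """Write one piece of initially aggregated metadata into the cache.
    `cache` is a dict standing in for the HDFS-backed cache; the default
    update rule is accumulation."""
    key, value = metadata
    if key in cache:
        cache[key] = update_rule(cache[key], value)  # same key already cached: update it
    else:
        cache[key] = value                           # no cached entry: write directly
    return cache

# Example:
# cache = {("xiaoming", "www.soopat.com"): 5}
# write_to_cache(cache, (("xiaoming", "www.soopat.com"), 1))
# -> {("xiaoming", "www.soopat.com"): 6}
```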
It can be seen that, in the embodiment, data generated in the middle of the log processing process is cached through interaction with the HDFS, and since the read-write process of the HDFS is a cache read-write process, the latency is small, the efficiency is high, the real-time performance of the log processing process is ensured as much as possible, and the caching of the HDFS provides data backup for the log processing process, and when an abnormality of data loss occurs in the log processing process, the data can still be recovered as much as possible through the HDFS, so as to reduce the influence on the log processing result.
In an embodiment of the present invention, the log processing process is executed according to a log processing task configured by the user on a front-end page, where the log processing task configured by the user includes configuration information; the method shown in fig. 1 then further includes: scheduling a thread to read the configuration information input by the user at preset time intervals; the configuration information includes: time configuration information, input source information, parsing rules, and/or computing rules. The time configuration information sets the times/opportunities used in the log processing process, the input source information indicates the source of the received data stream, the parsing rule indicates which fields in a log are parsed and into which format, and the calculation rule indicates the calculation opportunity of each statistical period, which data is calculated, which form of calculation is required, and so on. In other words, the user can configure the log processing process through the configuration information so as to meet the actual service requirements.
Then, in the process of performing the log processing, the step S110 of receiving a data stream from the input source at intervals of a preset time period includes: determining the preset time period according to the time configuration information in the configuration information, determining the input source according to the input source information in the configuration information, and receiving a segment of data stream from the corresponding input source at intervals of the preset time period. The step S120 of performing the primary analysis on each log in the segment of data stream includes: performing primary analysis on each log according to the parsing rule in the configuration information. The calculation processing of the cached specified structure data includes: performing calculation processing on the cached specified structure data according to the calculation rule in the configuration information.
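A hedged sketch of the periodic configuration reading described above; the JSON file layout, the key names and the 60-second reload interval are assumptions, and a real deployment could use any scheduler.

```python
import json
import threading

def schedule_config_reload(config_path, state, interval_seconds=60):
    """Periodically re-read the user-supplied configuration so that changes to the
    time settings, input source, parsing rules or computing rules take effect."""
    def reload_config():
        with open(config_path) as f:
            # e.g. {"batch_interval": 60, "input": "kafka://...", "parse_rule": ..., "compute_rule": ...}
            state["config"] = json.load(f)
        timer = threading.Timer(interval_seconds, reload_config)
        timer.daemon = True
        timer.start()

    reload_config()

# Example usage:
# state = {}
# schedule_config_reload("log_task.json", state)
# state["config"] now always reflects the most recently read configuration.
```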
In this way, the configuration information input by the user is read periodically, and after the user updates it, the log processing process is adjusted according to the newly read configuration information, so that the processing requirements continue to be met.
Further, the configuration information input by the user also includes: computing platform information; the log processing method provided by this scheme is executed on the computing platform indicated by the computing platform information. In this embodiment, the computing platform is a Spark Streaming computing platform: the data receiving manner supported by Spark Streaming is a segmented, periodic inflow rather than a continuous streaming inflow, and Spark Streaming interacts naturally with the HDFS, so the step of caching the intermediate data of log processing can be carried out more smoothly on the basis of this characteristic.
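Below is a minimal PySpark Streaming skeleton showing how the segmented, periodic inflow and the per-segment primary analysis and initial aggregation could be wired together. It is a sketch under assumptions: a socket text source stands in for the configured input source (e.g. Kafka), the log format is invented, and the in-memory dictionary stands in for the HDFS-backed cache; the patent does not prescribe this exact API usage.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def parse(line):
    """Primary analysis of one log line (the "key=value" layout is an assumption)."""
    fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
    return ((fields.get("user"), fields.get("url")), 1)

cache = {}  # stands in for the HDFS-backed cache of initially aggregated metadata

def merge_batch(rdd):
    """Merge one batch's initially aggregated metadata into the cache (accumulation rule)."""
    for key, value in rdd.collect():
        cache[key] = cache.get(key, 0) + value

sc = SparkContext(appName="LogProcessingSketch")
ssc = StreamingContext(sc, 60)  # one data-stream segment (micro-batch) per preset time period

# A socket source stands in for whatever input source the configuration names.
logs = ssc.socketTextStream("localhost", 9999)

(logs.map(parse)                          # primary analysis: log -> ((user, url), 1)
     .reduceByKey(lambda a, b: a + b)     # initial aggregation within the segment
     .foreachRDD(merge_batch))            # cache the initially aggregated metadata

ssc.start()
ssc.awaitTermination()
```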
In one embodiment of the present invention, after step S130, the method shown in fig. 1 further comprises: storing the log processing result corresponding to the statistical period in a storage medium; the storage medium includes: a Redis database, a MySQL database, and/or a Greenplum database. It can be seen that in this embodiment only the final log processing result, obtained in a single pass, is stored in the corresponding storage medium, which is completely different from the prior art, where intermediate data is stored in a storage medium and a process is then invoked to read that data back for secondary calculation.
in a specific embodiment of the present invention, the log processing scheme is to calculate the total access times (PV) of each service and/or the number of independent visitors (UV) of each service according to logs generated by a plurality of services, and the log processing flow is executed based on a Spark Streaming computing platform, specifically as follows:
Logs generated by a plurality of services are continuously published to a specified input source, and a segment of data stream is received from the specified input source at intervals of a preset time period, where each segment of data stream comprises a plurality of logs generated by the plurality of services in the previous preset time period. After each segment of data stream is received, primary analysis is performed on each log in the segment: a field indicating identification information of the user and a field indicating identification information of the service accessed by the user are extracted from each log; the set of these two fields is taken as the key, "1" is taken as the value (each log corresponds to exactly one access behavior), and the key and the value form the metadata corresponding to the log, so that the metadata corresponding to each log represents one act of the user accessing the service. The field indicating the identification information of the service accessed by the user includes: URL address, version information, signature information, and/or channel information. For example, a data stream is obtained in segments from a Kafka input source. For a segment of data stream X, the field guid1 indicating the identification information of a user and the field url1 indicating the identification information of the service accessed by that user are extracted from the first log, so the metadata corresponding to the first log is ((guid1, url1), 1); guid1 and url3 are extracted from the second log, so its metadata is ((guid1, url3), 1); guid3 and url1 are extracted from the third log, so its metadata is ((guid3, url1), 1); and so on until the metadata corresponding to the last log in the segment of data stream X is obtained. That is, the segment of data stream X corresponds to a plurality of metadata [((guid1, url1), 1), ((guid1, url3), 1), ((guid3, url1), 1), ……].
Next, the plurality of metadata corresponding to the plurality of logs in the data stream X are initially aggregated according to their keys to obtain one or more initially aggregated metadata. Continuing the example above, the plurality of metadata [((guid1, url1), 1), ((guid1, url3), 1), ((guid3, url1), 1), ……] corresponding to the segment of data stream X are initially aggregated: ((guid1, url1), 1) occurs 10 times in total among the metadata corresponding to data stream X, ((guid1, url3), 1) occurs 5 times in total, ((guid3, url1), 1) occurs 3 times in total, and similarly for the others, so the metadata obtained after the initial aggregation are [((guid1, url1), 10), ((guid1, url3), 5), ((guid3, url1), 3), ……].
The metadata corresponding to each segment of data stream is written into the HDFS for caching after the initial aggregation. When each initially aggregated metadata corresponding to a data stream is written into the HDFS, for example when ((guid1, url1), 10) corresponding to data stream X is written, the HDFS may already hold initially aggregated metadata corresponding to earlier data streams; if the metadata ((guid1, url1), 90) already exists in the HDFS, the value 10 to be written is accumulated onto it, changing it to ((guid1, url1), 100), which indicates that ((guid1, url1), 10) has been written into the HDFS. The writing process for the other metadata is the same. After all the initially aggregated metadata corresponding to data stream X has been written into the HDFS, the metadata cached in the HDFS are: [((guid1, url1), 100), ((guid1, url3), 50), ((guid3, url1), 70), ……].
When the calculation opportunity is reached, the metadata are read from the HDFS and each read metadata is subjected to secondary analysis: the field indicating the identification information of the user is removed from the key of the metadata, the field indicating the identification information of the service accessed by the user is taken as the key of the metadata after secondary analysis, the value of the metadata is taken as the first value of the metadata after secondary analysis, and "1" is taken as the second value of the metadata after secondary analysis. Following the example above, the metadata obtained by performing secondary analysis on each metadata read from the HDFS are: [(url1, (100, 1)), (url3, (50, 1)), (url1, (70, 1)), ……].
Then the values of the secondarily analyzed metadata are aggregated a second time according to their keys to obtain the twice aggregated metadata, where the first value of each twice aggregated metadata represents the total number of accesses (PV) of the service indicated by its key, and the second value represents the number of independent visitors (UV) of that service. In the example above, the metadata obtained after the secondary aggregation of [(url1, (100, 1)), (url3, (50, 1)), (url1, (70, 1)), ……] are: [(url1, (170, 2)), (url3, (70, 15)), ……], which states that the cumulative total number of accesses of the service indicated by url1 is 170, and those 170 accesses correspond to 2 distinct visiting users, while the cumulative total number of accesses of the service indicated by url3 is 70, and those 70 accesses correspond to 15 distinct visiting users.
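The secondary analysis and secondary aggregation described above can be sketched in Python as follows; the function names and the dictionary result shape {url: (PV, UV)} are assumptions made for the sketch.

```python
from collections import defaultdict

def secondary_parse(metadata):
    """Secondary analysis: drop the user field from the key, keep the URL as the new
    key, keep the cached count as the first value and use 1 as the second value."""
    (user, url), count = metadata
    return (url, (count, 1))

def secondary_aggregate(cached_metadata):
    """Secondary aggregation: per URL, sum the first values (PV, total accesses)
    and the second values (UV, number of distinct user keys that accessed it)."""
    totals = defaultdict(lambda: [0, 0])
    for metadata in cached_metadata:
        url, (pv, uv) = secondary_parse(metadata)
        totals[url][0] += pv
        totals[url][1] += uv
    return {url: tuple(t) for url, t in totals.items()}

# Example from the description above:
# secondary_aggregate([(("guid1", "url1"), 100), (("guid1", "url3"), 50), (("guid3", "url1"), 70)])
# -> {"url1": (170, 2), "url3": (50, 1)}
```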
It can be seen that, in this embodiment, the logs in the received data streams are subjected to primary analysis and primary aggregation to obtain primary aggregated metadata corresponding to each segment of data stream, each primary aggregated metadata corresponding to each segment of data stream is cached, metadata in the cache represents a comprehensive result of the metadata corresponding to each segment of data stream and metadata corresponding to other segments of data streams received before, and the metadata in the cache is subjected to secondary analysis and secondary aggregation when the calculation time arrives, so as to obtain the accumulated total access times and the independent visitor number of each service until the calculation time. In some cases, the total access times and the number of independent visitors of each service in a preset statistical period need to be obtained, for example, the total access times and the number of independent visitors of a certain website in a day are calculated to monitor the operation state of the website, and therefore, the log processing scheme is further optimized in this embodiment:
when each log in each received data stream is analyzed for the first time, a field indicating the identification information of an access user and a field indicating the identification information of a service accessed by the user are extracted from each log, a set of the field indicating the identification information of the access user and the field indicating the identification information of the service accessed by the user is used as a key, 1 is used as a value, the key and the value form metadata corresponding to the log, a field indicating the time information of the service accessed by the user is also extracted from each log, and the time corresponding to the metadata corresponding to each log is recorded according to the field. On the basis of the above example, the data stream X corresponds to a plurality of metadata [ ((guid1, url1), 1, time1), ((guid1, url3), 1, time2), ((guid3, url1), 1, time3), … … ], in this example, the time corresponding to the metadata corresponding to each log is recorded after the value of the metadata.
Then the values of the metadata are initially aggregated according to their keys: the statistical period into which each metadata falls is obtained according to the time corresponding to the metadata; for the metadata falling into the same statistical period, the values of the metadata are initially aggregated according to the keys of the metadata; and for a plurality of metadata with the same key falling into the same statistical period, the largest corresponding time is taken as the time of the metadata obtained after their initial aggregation. For example, take 0:00 to 24:00 of each day as the statistical period, and suppose that among the plurality of metadata [((guid1, url1), 1, time1), ((guid1, url3), 1, time2), ((guid3, url1), 1, time3), ……] corresponding to data stream X, ((guid1, url1), 1) occurs 5 times in total, with times of August 15 23:00, August 15 16:30, August 16 1:00, August 16 2:00 and August 16 3:30. The three occurrences whose times fall within the statistical period of August 16 are aggregated into the initially aggregated metadata ((guid1, url1), 3, August 16 3:30); the initial aggregation of the other metadata proceeds in the same way.
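A Python sketch of the time-aware initial aggregation described above, assuming a daily statistical period and datetime timestamps; the names and data shapes are illustrative.

```python
from datetime import datetime

def statistical_period(ts):
    """Daily statistical period (0:00-24:00), as in the example above."""
    return ts.date()

def primary_aggregate_with_time(metadata_list):
    """Initial aggregation of (key, value, time) metadata: values are summed per
    (key, statistical period), and the largest time is kept for each aggregate."""
    aggregated = {}
    for key, value, ts in metadata_list:
        bucket = (key, statistical_period(ts))
        if bucket in aggregated:
            old_value, old_ts = aggregated[bucket]
            aggregated[bucket] = (old_value + value, max(old_ts, ts))
        else:
            aggregated[bucket] = (value, ts)
    return [(key, value, ts) for (key, _period), (value, ts) in aggregated.items()]

# Example: the three ((guid1, url1), 1, ...) entries from August 16 aggregate to
# (("guid1", "url1"), 3, datetime(2016, 8, 16, 3, 30)).
```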
Further, the initially aggregated metadata corresponding to each segment of data stream is written into the HDFS, where, if cached metadata with the same key as the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and the preset update rule further includes: judging whether the time corresponding to the cached metadata and the time corresponding to the metadata to be written fall into the same statistical period; if so, cumulatively updating the value of the cached metadata with the value of the metadata to be written, and taking the larger of the time corresponding to the metadata to be written and the time corresponding to the cached metadata as the time corresponding to the updated metadata; otherwise, writing the metadata to be written directly into the distributed file system. For example, when the metadata ((guid1, url1), 3, August 16 3:30) corresponding to data stream X is written into the HDFS, if metadata whose key is (guid1, url1) and whose time falls within the August 16 statistical period exists in the HDFS, such as ((guid1, url1), 100, August 16 15:00), that metadata is updated to ((guid1, url1), 103, August 16 15:00); if not, ((guid1, url1), 3, August 16 3:30) is written directly into the HDFS.
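The period-aware cache update rule can be sketched as follows; here the cache is keyed by (key, daily statistical period) so that entries from different periods coexist, which is an assumption about the cache layout rather than the patent's storage format.

```python
from datetime import datetime

def write_to_cache_with_time(cache, metadata):
    """Write (key, value, time) metadata into the cache, keyed here by (key, daily
    statistical period). Same key and same period: accumulate the value and keep
    the larger time; otherwise the metadata is stored as a new entry."""
    key, value, ts = metadata
    bucket = (key, ts.date())  # daily statistical period (0:00-24:00), as in the example
    if bucket in cache:
        cached_value, cached_ts = cache[bucket]
        cache[bucket] = (cached_value + value, max(cached_ts, ts))
    else:
        cache[bucket] = (value, ts)
    return cache

# Example from the description above:
# cache = {(("guid1", "url1"), datetime(2016, 8, 16).date()): (100, datetime(2016, 8, 16, 15, 0))}
# write_to_cache_with_time(cache, (("guid1", "url1"), 3, datetime(2016, 8, 16, 3, 30)))
# -> value becomes 103, and the time stays August 16 15:00 (the larger of the two).
```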
When the calculation opportunity of the preset statistical period is reached, performing calculation processing on the cached specified structure data to obtain the log processing result corresponding to the statistical period includes: when a first preset time after the end time of the current statistical period is reached, reading the metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata; performing secondary analysis on each read metadata, removing the field indicating the identification information of the user from the key of the metadata, taking the field indicating the identification information of the service accessed by the user as the key of the metadata after secondary analysis, taking the value of the metadata as the first value of the metadata after secondary analysis, and taking "1" as the second value of the metadata after secondary analysis; and performing secondary aggregation on the values of the secondarily analyzed metadata according to their keys to obtain the twice aggregated metadata. The first value of each twice aggregated metadata represents the total number of accesses of the service indicated by its key in the current statistical period, and the second value represents the number of independent visitors of that service in the current statistical period. For example, at 2:00 on August 17, i.e., two hours after the end of August 16, the metadata falling within the August 16 statistical period are read from the HDFS: [((guid1, url1), 103, August 16 15:00), ((guid1, url3), 50, August 16 3:00), ((guid3, url1), 70, August 16 11:00), ……]; after secondary analysis they become [(url1, (103, 1)), (url3, (50, 1)), (url1, (70, 1)), ……]; and after secondary aggregation they become [(url1, (173, 2)), (url3, (66, 12)), ……]. It follows that within the August 16 statistical period, the total number of accesses of the service indicated by url1 is 173 and its number of independent visitors is 2, while the total number of accesses of the service indicated by url3 is 66 and its number of independent visitors is 12.
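A sketch of the period-end calculation, assuming the (key, period)-keyed cache layout of the previous sketch; the returned {url: (PV, UV)} mapping is an illustrative result shape, not the patent's prescribed output format.

```python
from collections import defaultdict
from datetime import date

def compute_period_result(cache, period):
    """At the calculation opportunity (a first preset time after the statistical period
    ends), read the cached metadata falling into `period`, do the secondary analysis
    and secondary aggregation, and return {url: (PV, UV)} for that period.
    Assumes the cache layout {((user, url), period): (count, time)}."""
    totals = defaultdict(lambda: [0, 0])
    for ((user, url), entry_period), (count, ts) in cache.items():
        if entry_period != period:
            continue                # only metadata falling into the current statistical period
        totals[url][0] += count     # first value: total number of accesses (PV)
        totals[url][1] += 1         # second value: one distinct user key contributes 1 (UV)
    return {url: tuple(pv_uv) for url, pv_uv in totals.items()}

# Example: compute_period_result(cache, date(2016, 8, 16)) might return
# {"url1": (173, 2), "url3": (66, 12)}, as in the description above.
```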
It can be seen that, by additionally recording the time corresponding to each metadata, the PV value and the UV value in each statistical period can be calculated. The reason why a first preset time after the end time of the current statistical period is taken as the calculation opportunity of that period is as follows: in general there is a delay between the moment each service generates a log and the moment the log flows into the log processing device; for example, a log of a user accessing a specified website at 23:00 on August 16 may only flow in, as part of a data stream, at 1:00 on August 17. A first preset time is therefore reserved as a buffer so that as many logs of the statistical period as possible are collected and calculated, giving a more accurate result. On the other hand, the real-time performance of log processing must also be guaranteed as far as possible, so the reserved buffer time should not be too long; balancing these considerations, two hours are chosen as the buffer time in the example above.
In addition, when calculation processing is performed on the logs generated in the current statistical period, logs generated before the current statistical period are expired data for the current calculation. If expired data were retained in the HDFS, the data volume in the HDFS would gradually increase, slowing down HDFS reads and writes and hurting log processing efficiency, and any expired data mixed into the calculation would make the result inaccurate. The expired data in the HDFS therefore needs to be cleaned up regularly; that is, the method shown in fig. 1 further includes: filtering out the expired metadata cached in the distributed file system when a second preset time before the start time of the current statistical period is reached. Specifically, filtering out the expired metadata cached in the distributed file system comprises: comparing the time corresponding to each metadata cached in the distributed file system with the start time of the current statistical period; if the time corresponding to a metadata is earlier than the start time of the current statistical period, determining that the metadata is expired metadata and setting its value to null; and deleting the metadata with null values from the distributed file system. For example, after the PV value and the UV value of the August 16 statistical period have been calculated, when the PV value and the UV value of the August 17 statistical period are about to be calculated, the metadata falling within the August 16 statistical period in the HDFS become expired data; the value of each such expired metadata is set to "none", and the metadata whose value is "none" are deleted from the HDFS.
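Finally, a sketch of the expiry filtering described above, again assuming the same in-memory cache layout; setting the value to None mirrors the "set the value to null, then delete" behaviour in the text.

```python
from datetime import datetime

def filter_expired(cache, current_period_start):
    """At a second preset time before the current statistical period starts, mark
    cached metadata whose time is earlier than the period start as expired (value
    set to None), then delete the null-valued entries.
    Assumes the cache layout {((user, url), period): (count, time)}."""
    for bucket, (count, ts) in cache.items():
        if ts < current_period_start:
            cache[bucket] = (None, ts)   # expired: set the value to null
    expired = [b for b, (count, _) in cache.items() if count is None]
    for bucket in expired:
        del cache[bucket]                # delete metadata whose value is null
    return cache

# Example: filter_expired(cache, datetime(2016, 8, 17, 0, 0)) removes all metadata
# recorded before August 17, e.g. the August 16 entries, before the August 17 period is computed.
```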
Fig. 2 shows a schematic diagram of a log processing apparatus according to an embodiment of the present invention. As shown in fig. 2, the log processing apparatus 200 includes:
the data receiving unit 210 is adapted to receive a data stream from an input source every preset time period, wherein each data stream comprises a plurality of logs generated in the previous preset time period.
The calculation processing unit 220 is adapted to, after receiving each segment of data stream, perform primary analysis on each log in the segment of data stream to obtain specified structure data corresponding to each log; performing initial aggregation on a plurality of specified structure data corresponding to a plurality of logs in the data stream, and caching the specified structure data after the initial aggregation; and when the calculation time of the preset statistical period is reached, calculating the cached specified structure data to obtain the log processing result corresponding to the statistical period.
It can be seen that the apparatus shown in fig. 2 receives logs as segments of data streams, subjects the logs in each received segment to primary analysis and initial aggregation, and further subjects the resulting specified structure data to calculation processing to obtain the log processing result. The whole process takes place in memory: the specified structure data obtained from the primary processing of the continuously inflowing logs is temporarily backed up in a cache, so that the cached specified structure data can be processed when the calculation opportunity arrives. No intermediate storage medium is needed, which avoids the delay caused by starting a process to read and write a storage medium and to compute over the data stored there. The primary analysis, initial aggregation and calculation processing of the logs are all completed in one pass, yielding a log processing scheme that is efficient, runs stably, is unlikely to lose intermediate data, and guarantees real-time performance as far as possible.
In an embodiment of the present invention, the calculation processing unit 220 is adapted to extract one or more specified fields from each log, and obtain the metadata in the form of the key-value pair corresponding to the log by using a set of the one or more specified fields as a key and using the number of times the set of the one or more fields appears in the log as a value.
Specifically, the calculation processing unit 220 is adapted to perform initial aggregation on the values of a plurality of metadata corresponding to a plurality of logs in the segment of data stream according to the keys of the metadata to obtain one or more metadata after the initial aggregation. The calculation processing unit 220 is further adapted to write the metadata after the initial aggregation into the distributed file system for caching; when each metadata is written into the distributed file system, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, update the value of the cached metadata according to the value of the metadata to be written and a preset update rule; and if no cached metadata identical to the key of the metadata to be written exists in the distributed file system, write the metadata to be written directly into the distributed file system. The calculation processing unit 220 is further adapted to extract, from each log, a field indicating identification information of the user and a field indicating identification information of the service accessed by the user, with the set of these two fields as a key and "1" as a value, the key and the value constituting the metadata corresponding to the log.
Wherein, the field indicating the identification information of the service accessed by the user includes: URL address, version information, signature information, and/or channel information.
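Purely as an illustration of what such metadata looks like and how the initial aggregation within one segment behaves, assuming a user id field and a URL field as the service identification (both names are assumptions):

```python
# Two logs from the same user to the same URL in one segment, plus one log
# from another user. Each log yields metadata ((user id, URL), 1); initial
# aggregation within the segment sums the values per key.
logs = [
    ("u123", "/api/query"),   # (user id, URL) extracted from log 1
    ("u123", "/api/query"),   # same pair extracted from log 2
    ("u456", "/api/query"),   # a different user
]

segment = {}
for key in logs:
    segment[key] = segment.get(key, 0) + 1

# segment == {("u123", "/api/query"): 2, ("u456", "/api/query"): 1}
```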
In an embodiment of the present invention, the calculation processing unit 220 is further adapted to extract a field indicating time information of the user accessing the service from each log, and record a time corresponding to the metadata corresponding to each log according to the field.
The calculation processing unit 220 is adapted to obtain the statistical period into which each metadata falls according to the time corresponding to the metadata; for metadata falling into the same statistical period, perform initial aggregation on the values of the metadata according to the keys of the metadata; and, regarding a plurality of metadata with the same key falling into the same statistical period, take the corresponding maximum time as the time corresponding to the metadata obtained after the initial aggregation of the plurality of metadata.
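The following sketch shows one way to bucket metadata by statistical period and aggregate only within a period; the five-minute period length is an assumption, since the patent leaves the statistical period configurable.

```python
PERIOD_SECONDS = 300   # assumed 5-minute statistical period (configurable in practice)

def period_of(ts):
    """Start time of the statistical period a timestamp falls into."""
    return ts - ts % PERIOD_SECONDS

def aggregate_by_period(metadata_list):
    """Merge metadata sharing both key and statistical period; values are
    summed and the latest timestamp among the merged metadata is kept."""
    aggregated = {}
    for key, value, ts in metadata_list:
        bucket = (period_of(ts), key)
        if bucket in aggregated:
            old_value, old_ts = aggregated[bucket]
            aggregated[bucket] = (old_value + value, max(old_ts, ts))
        else:
            aggregated[bucket] = (value, ts)
    return aggregated
```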
Specifically, the calculation processing unit 220 is further adapted to, when it is determined that cached metadata identical to a key of metadata to be written exists in the distributed file system, further determine whether a time corresponding to the cached metadata and a time corresponding to the metadata to be written fall within a same statistical period, if yes, perform cumulative update on a value of the cached metadata by using the value of the metadata to be written, and use a larger time of the time corresponding to the metadata to be written and the time corresponding to the cached metadata as a time corresponding to the updated metadata; otherwise, the metadata to be written is directly written into the distributed file system.
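A sketch of this update rule, using a plain dictionary to stand in for the distributed file system cache; `period_of` is the helper from the sketch above, and how "directly written" is realised on an actual distributed file system is left open here.

```python
def merge_into_cache(cache, key, value, ts):
    """Write one initially aggregated metadata item into the cache.

    cache maps key -> (value, timestamp). If a cached entry with the same key
    falls into the same statistical period, its value is accumulated and the
    later timestamp is kept; otherwise the incoming metadata is written as-is.
    """
    if key in cache:
        cached_value, cached_ts = cache[key]
        if period_of(cached_ts) == period_of(ts):   # same statistical period
            cache[key] = (cached_value + value, max(cached_ts, ts))
            return
    cache[key] = (value, ts)
```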
Specifically, the calculation processing unit 220 is adapted to, when a first preset time after the end time of the current statistics period is reached, read metadata falling into the current statistics period from the distributed file system according to the time corresponding to each metadata; performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking the field indicating the identification information of the service accessed by the user as a key of the metadata after the secondary analysis, and taking the value of the metadata as the value of the metadata after the secondary analysis; performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation; and the value of each secondary aggregated metadata represents the total access times of the service indicated by the key of the secondary aggregated metadata in the current statistical period.
Specifically, the calculation processing unit 220 is adapted to, when a first preset time after the end time of the current statistics period is reached, read metadata falling into the current statistics period from the distributed file system according to the time corresponding to each metadata; performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking a field indicating identification information of a service accessed by the user as a key of the metadata after the secondary analysis, and taking '1' as a value of the metadata after the secondary analysis; performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation; the value of each twice aggregated metadata represents the number of independent visitors to the service indicated by the key of the twice aggregated metadata in the current statistical period.
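For both calculation paths just described (total access count and independent visitors), a combined sketch follows; it assumes each cached key is a (user identification, service identification) pair as in the earlier sketches, with the service identification being whatever combination of URL, version, signature and channel fields was configured.

```python
def compute_period_results(cache, period_start, period_seconds=300):
    """Secondary analysis and aggregation over cached metadata of one period.

    cache maps (user_id, service) -> (value, timestamp).
    Returns (pv, uv): per-service total accesses and independent visitors.
    """
    pv, uv = {}, {}
    for (user_id, service), (value, ts) in cache.items():
        if ts - ts % period_seconds != period_start:
            continue                                 # outside the current period
        pv[service] = pv.get(service, 0) + value     # keep the value   -> total accesses
        uv[service] = uv.get(service, 0) + 1         # count "1" per key -> unique visitors
    return pv, uv
```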
Fig. 3 shows a schematic diagram of a log processing apparatus according to another embodiment of the present invention. As shown in fig. 3, the log processing apparatus 300 includes: a data receiving unit 310, a calculation processing unit 320, a filtering unit 330, a configuration reading unit 340, and a storage processing unit 350.
The data receiving unit 310 and the calculation processing unit 320 have the same functions as the data receiving unit 210 and the calculation processing unit 220 shown in fig. 2, and the description of the same parts is omitted.
The filtering unit 330 is adapted to filter out expired metadata cached in the distributed file system when a second preset time before the start time of the current statistical period is reached.
Specifically, the filtering unit 330 is adapted to compare a time corresponding to each metadata cached in the distributed file system with a start time of the current statistical period, determine that one metadata is expired metadata if the time corresponding to the one metadata is less than the start time of the current statistical period, and set a value of the one metadata to null; and deleting the metadata with null values in the distributed file system.
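A sketch of this two-step expiry filtering, again with a dictionary standing in for the distributed file system cache; marking the value with None plays the role of setting it to null.

```python
def filter_expired(cache, current_period_start):
    """Mark metadata older than the current statistical period as null,
    then delete all null-valued metadata from the cache."""
    for key, (value, ts) in cache.items():
        if ts < current_period_start:
            cache[key] = (None, ts)          # expired: set value to null
    for key in [k for k, (v, _) in cache.items() if v is None]:
        del cache[key]                       # remove null-valued metadata
```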
The configuration reading unit 340 is adapted to schedule a thread to read configuration information input by a user at preset time intervals; the configuration information includes: time configuration information, input source information, parsing rules, and/or computing rules.
The data receiving unit 310 is adapted to determine a preset time period according to the time configuration information in the configuration information, determine an input source according to the input source information in the configuration information, and receive a segment of data stream from a corresponding input source every preset time period.
The calculation processing unit 320 is adapted to perform primary analysis on each log according to an analysis rule in the configuration information; and the method is suitable for performing calculation processing on the cached specified structure data according to the calculation rule in the configuration information.
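As a sketch of the periodic configuration reading that drives the units above: the JSON file, its field names and the 60-second interval are assumptions; the patent only requires that a thread re-reads the user-supplied configuration at preset intervals and that it carries time, input-source, parsing and computing rules.

```python
import json
import threading

CONFIG = {}   # shared, consumed by the receiving and calculation logic

def reload_config(path="log_job.json", interval=60):
    """Re-read user-supplied configuration on a background timer thread."""
    def _reload():
        try:
            with open(path) as f:
                CONFIG.update(json.load(f))
        except (OSError, ValueError):
            pass                              # keep the previous configuration
        threading.Timer(interval, _reload).start()
    _reload()

# Example contents of log_job.json (field names are assumptions):
# {"batch_seconds": 10, "input_source": "kafka://host:9092/logs",
#  "parse_rule": "tab_v1", "compute_rule": "pv_uv"}
```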
In an embodiment of the present invention, the configuration information further includes: computing platform information; the log processing device runs on the computing platform indicated by the computing platform information; the computing platform is a Spark Streaming computing platform.
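Since the embodiment names Spark Streaming as the computing platform, a heavily simplified PySpark Streaming skeleton is sketched below; the socket input source, the 10-second batch interval and the helper functions are placeholders for the configured input source and for the parsing, aggregation and cache logic sketched earlier, not the patented implementation itself.

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def to_metadata(line):
    # stand-in for the primary analysis sketched earlier
    fields = line.split("\t")
    return ((fields[1], fields[2]), 1)        # ((user id, service), 1)

def write_batch(rdd):
    # stand-in for merging one segment's aggregated metadata into the cache
    for key, value in rdd.collect():
        pass                                  # e.g. merge_into_cache(cache, key, value, ts)

conf = SparkConf().setAppName("log-processing-sketch")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)                # one segment of data stream every 10 s

(ssc.socketTextStream("localhost", 9999)      # placeholder input source
    .map(to_metadata)
    .reduceByKey(lambda a, b: a + b)          # initial aggregation within the segment
    .foreachRDD(write_batch))

ssc.start()
ssc.awaitTermination()
```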
A storage processing unit 350 adapted to store the log processing result corresponding to the statistical period in a storage medium; the storage medium includes: a Redis database, a Mysql database, and/or a GreenPlum database.
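Finally, a sketch of persisting the per-period results into one of the listed storage media; the local Redis instance and the "pv:<period>:<service>" key layout are assumptions made for the example.

```python
import redis

def store_results(pv, uv, period_start):
    """Persist per-service PV/UV counts for one statistical period."""
    r = redis.Redis(host="localhost", port=6379, db=0)
    for service, count in pv.items():
        r.set(f"pv:{period_start}:{service}", count)
    for service, count in uv.items():
        r.set(f"uv:{period_start}:{service}", count)
```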
It should be noted that the embodiments of the apparatus shown in fig. 2-3 are the same as the embodiments of the method shown in fig. 1, and the detailed description is given in the foregoing, and will not be repeated herein.
In summary, the technical solution provided by the present invention receives logs as segmented data streams, performs primary analysis and initial aggregation on the logs in each received segment, and then performs calculation processing on the resulting specified structure data to obtain a log processing result. The whole process takes place in memory: the specified structure data obtained by primary processing of the continuously inflowing logs is temporarily backed up in the cache, so that the cached specified structure data can be calculated when the calculation time arrives. No storage medium is needed in between, which avoids the delay caused by starting a process to read and write a storage medium and to calculate on the data stored there. Primary analysis, initial aggregation and calculation processing of the logs are all completed in one pass, which realizes a log processing scheme that is efficient, runs stably, is unlikely to lose intermediate data, and guarantees real-time performance as far as possible.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in a log processing apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera, does not indicate any ordering. These words may be interpreted as names.
The invention discloses A1 and a log processing method, wherein the method comprises the following steps:
receiving a section of data stream from an input source every other preset time period, wherein each section of data stream comprises a plurality of logs generated in the previous preset time period;
after each section of data stream is received, performing primary analysis on each log in the section of data stream to obtain designated structure data corresponding to each log; performing initial aggregation on a plurality of specified structure data corresponding to a plurality of logs in the data stream, and caching the specified structure data after the initial aggregation;
and when the calculation time of the preset statistical period is reached, calculating the cached specified structure data to obtain a log processing result corresponding to the statistical period.
A2, the method as in a1, wherein the performing the primary analysis on each log in the segment of data stream to obtain the specified structure data corresponding to each log includes:
and extracting one or more specified fields from each log, taking the set of the one or more specified fields as a key, and taking the number of times of the set of the one or more fields appearing in the log as a value, so as to obtain metadata in the form of a key-value pair corresponding to the log.
A3, the method as in a2, wherein the primarily aggregating the multiple pieces of specified structure data corresponding to the multiple pieces of logs in the segment of data stream includes:
and for a plurality of metadata corresponding to a plurality of logs in the data stream, primarily aggregating values of the metadata according to keys of the metadata to obtain one or more metadata after primary aggregation.
A4, the method as in A3, wherein the caching the primarily aggregated specified structure data includes:
writing the metadata after the initial aggregation into a distributed file system for caching;
when each metadata is written into the distributed file system, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset updating rule; and if cached metadata which is the same as the key of the metadata to be written does not exist in the distributed file system, directly writing the metadata to be written into the distributed file system.
A5, the method as in a4, wherein the extracting one or more specified fields from each log, with the set of one or more specified fields as keys and the number of times the set of one or more fields appears in the log as values, and the obtaining metadata in the form of key-value pairs corresponding to the log comprises:
and extracting a field indicating the identification information of the user and a field indicating the identification information of the service accessed by the user from each log, taking a set of the field indicating the identification information of the user and the field indicating the identification information of the service accessed by the user as a key, taking '1' as a value, and forming metadata corresponding to the log by using the key and the value.
A6, the method as in a5, wherein the field indicating identification information of the service accessed by the user includes: URL address, version information, signature information, and/or channel information.
A7, the method as in a5, wherein the extracting, from each log, a field indicating identification information of an accessing user and a field indicating identification information of a service accessed by the user, with a set of the field indicating identification information of the accessing user and the field indicating identification information of the service accessed by the user as keys and "1" as a value, and the key and the value forming the metadata corresponding to the log further comprises:
and extracting a field indicating time information of the user for accessing the service from each log, and recording the time corresponding to the metadata corresponding to each log according to the field.
A8, the method as in a7, wherein the primarily aggregating values of metadata according to a key of the metadata for a plurality of metadata corresponding to a plurality of logs in the segment of data stream includes:
acquiring the statistical period into which each metadata falls according to the time corresponding to the metadata; for metadata falling into the same statistical period, performing initial aggregation on the values of the metadata according to the keys of the metadata; and regarding a plurality of metadata with the same key falling into the same statistical period, taking the corresponding maximum time as the time corresponding to the corresponding metadata obtained after the initial aggregation of the plurality of metadata.
A9, the method as in A8, wherein if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset update rule further includes:
judging whether the time corresponding to the cached metadata and the time corresponding to the written metadata fall into the same statistical period or not, if so, performing cumulative updating on the value of the cached metadata by using the value of the metadata to be written, and taking the larger time of the time corresponding to the metadata to be written and the time corresponding to the cached metadata as the time corresponding to the updated metadata; otherwise, the metadata to be written is directly written into the distributed file system.
A10, the method as in a9, wherein the obtaining of the log processing result corresponding to the preset statistical period includes:
when a first preset time after the end time of the current statistical period is reached, reading metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata;
performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking the field indicating the identification information of the service accessed by the user as a key of the metadata after the secondary analysis, and taking the value of the metadata as the value of the metadata after the secondary analysis;
performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation;
and the value of each secondary aggregated metadata represents the total access times of the service indicated by the key of the secondary aggregated metadata in the current statistical period.
A11, the method as in a9, wherein the obtaining of the log processing result corresponding to the preset statistical period includes:
when a first preset time after the end time of the current statistical period is reached, reading metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata;
performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking a field indicating identification information of a service accessed by the user as a key of the metadata after the secondary analysis, and taking '1' as a value of the metadata after the secondary analysis;
performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation;
the value of each twice aggregated metadata represents the number of independent visitors to the service indicated by the key of the twice aggregated metadata in the current statistical period.
A12, the method of a10 or a11, wherein the method further comprises:
and filtering out the expired metadata cached in the distributed file system when a second preset time before the starting time of the current statistical period is reached.
A13, the method as in A12, wherein the filtering out expired metadata cached in the distributed file system comprises:
comparing the time corresponding to each metadata cached in the distributed file system with the starting time of the current statistical period, if the time corresponding to one metadata is less than the starting time of the current statistical period, determining that the metadata is overdue metadata, and setting the value of the metadata to be null;
and deleting the metadata with null values in the distributed file system.
A14, the method of a1, wherein the method further comprises:
scheduling a thread to read configuration information input by a user at preset time intervals; the configuration information includes: time configuration information, input source information, parsing rules, and/or computing rules;
the receiving a data stream from an input source at every preset time period comprises: determining a preset time period according to time configuration information in the configuration information, determining an input source according to input source information in the configuration information, and receiving a section of data stream from a corresponding input source every other preset time period;
the primary analysis of each log in the data stream comprises: performing primary analysis on each log according to an analysis rule in the configuration information;
the calculation processing of the cached specified structure data comprises: and calculating the cached specified structure data according to the calculation rule in the configuration information.
A15, the method as in A14, wherein the configuration information further includes: computing platform information; the log processing method is executed based on the computing platform indicated by the computing platform information;
the computing platform is a Spark Streaming computing platform.
A16, the method of a1, wherein the method further comprises:
storing the log processing result corresponding to the statistical period in a storage medium;
the storage medium includes: a Redis database, a Mysql database, and/or a GreenPlum database.
The invention also discloses B17, a log processing device, wherein, the device includes:
the data receiving unit is suitable for receiving a section of data stream from an input source at intervals of a preset time period, and each section of data stream comprises a plurality of logs generated in the previous preset time period;
the computing processing unit is suitable for performing primary analysis on each log in each segment of data stream after each segment of data stream is received to obtain specified structure data corresponding to each log; performing initial aggregation on a plurality of specified structure data corresponding to a plurality of logs in the data stream, and caching the specified structure data after the initial aggregation; and when the calculation time of the preset statistical period is reached, calculating the cached specified structure data to obtain the log processing result corresponding to the statistical period.
B18, the device of B17, wherein,
the calculation processing unit is suitable for extracting one or more specified fields from each log, taking a set of the one or more specified fields as a key, and taking the number of times of the set of the one or more fields appearing in the log as a value, so as to obtain the metadata in the form of the key value pair corresponding to the log.
B19, the device of B18, wherein,
and the calculation processing unit is suitable for performing initial aggregation on a plurality of metadata corresponding to a plurality of logs in the segment of data stream according to the key of the metadata to obtain one or more metadata after the initial aggregation.
B20, the device of B19, wherein,
the computing processing unit is suitable for writing the metadata after the initial aggregation into a distributed file system for caching; when each metadata is written into the distributed file system, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset updating rule; and if cached metadata which is the same as the key of the metadata to be written does not exist in the distributed file system, directly writing the metadata to be written into the distributed file system.
B21, the device of B20, wherein,
the calculation processing unit is suitable for extracting a field indicating the identification information of the user and a field indicating the identification information of the service accessed by the user from each log, taking a set of the field indicating the identification information of the user and the field indicating the identification information of the service accessed by the user as a key, taking '1' as a value, and forming the metadata corresponding to the log by using the key and the value.
B22, the apparatus as in B21, wherein the field indicating the identification information of the service accessed by the user comprises: URL address, version information, signature information, and/or channel information.
B23, the device of B21, wherein,
the computing processing unit is further adapted to extract a field indicating time information of the user accessing the service from each log, and record the time corresponding to the metadata corresponding to each log according to the field.
B24, the device of B23, wherein,
the computing processing unit is suitable for acquiring the statistical period into which each metadata falls according to the time corresponding to the metadata; for metadata falling into the same statistical period, performing initial aggregation on the values of the metadata according to the keys of the metadata; and regarding a plurality of metadata with the same key falling into the same statistical period, taking the corresponding maximum time as the time corresponding to the corresponding metadata obtained after the initial aggregation of the plurality of metadata.
B25, the device of B24, wherein,
the calculation processing unit is further adapted to, when it is determined that cached metadata identical to a key of metadata to be written exists in the distributed file system, further determine whether time corresponding to the cached metadata and time corresponding to the written metadata fall within a same statistical period, if yes, perform cumulative update on a value of the cached metadata by using the value of the metadata to be written, and use a larger time of the time corresponding to the metadata to be written and the time corresponding to the cached metadata as time corresponding to the updated metadata; otherwise, the metadata to be written is directly written into the distributed file system.
B26, the device of B25, wherein,
the calculation processing unit is suitable for reading the metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata when the first preset time after the end time of the current statistical period is reached; performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking the field indicating the identification information of the service accessed by the user as a key of the metadata after the secondary analysis, and taking the value of the metadata as the value of the metadata after the secondary analysis; performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation; and the value of each secondary aggregated metadata represents the total access times of the service indicated by the key of the secondary aggregated metadata in the current statistical period.
B27, the device of B25, wherein,
the calculation processing unit is suitable for reading the metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata when the first preset time after the end time of the current statistical period is reached; performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking a field indicating identification information of a service accessed by the user as a key of the metadata after the secondary analysis, and taking '1' as a value of the metadata after the secondary analysis; performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation; the value of each twice aggregated metadata represents the number of independent visitors to the service indicated by the key of the twice aggregated metadata in the current statistical period.
B28, the device according to B26 or B27, wherein the device further comprises:
and the filtering unit is suitable for filtering out the expired metadata cached in the distributed file system when a second preset time before the starting time of the current statistical period is reached.
B29, the device of B28, wherein,
the filtering unit is suitable for comparing the time corresponding to each metadata cached in the distributed file system with the starting time of the current statistical period, if the time corresponding to one metadata is less than the starting time of the current statistical period, the metadata is determined to be overdue metadata, and the value of the metadata is set to be null; and deleting the metadata with null values in the distributed file system.
B30, the apparatus of B17, wherein the apparatus further comprises: configuring a reading unit;
the configuration reading unit is suitable for scheduling a thread to read configuration information input by a user at preset time intervals; the configuration information includes: time configuration information, input source information, parsing rules, and/or computing rules;
the data receiving unit is suitable for determining a preset time period according to the time configuration information in the configuration information, determining an input source according to the input source information in the configuration information, and receiving a section of data stream from the corresponding input source at intervals of the preset time period;
the computing processing unit is suitable for performing primary analysis on each log according to an analysis rule in the configuration information; and the method is suitable for performing calculation processing on the cached specified structure data according to the calculation rule in the configuration information.
B31, the apparatus of B30, wherein the configuration information further includes: computing platform information; the log processing device runs on the computing platform indicated by the computing platform information;
the computing platform is a Spark Streaming computing platform.
B32, the apparatus of B17, wherein the apparatus further comprises:
the storage processing unit is suitable for storing the log processing result corresponding to the statistical period into a storage medium;
the storage medium includes: a Redis database, a Mysql database, and/or a GreenPlum database.

Claims (32)

1. A method of log processing, wherein the method comprises:
receiving a section of data stream from an input source every other preset time period, wherein each section of data stream comprises a plurality of logs generated in the previous preset time period;
after each section of data stream is received, performing primary analysis on each log in the section of data stream to obtain designated structure data corresponding to each log; performing initial aggregation on a plurality of specified structure data corresponding to a plurality of logs in the data stream, and caching the specified structure data after the initial aggregation;
when the calculation time of a preset statistical period is reached, calculating the cached specified structure data to obtain a log processing result corresponding to the statistical period;
scheduling a thread to read configuration information input by a user at preset time intervals; the configuration information includes: time configuration information, input source information, parsing rules, and/or computing rules;
the receiving a data stream from an input source at every preset time period comprises: and determining a preset time period according to the time configuration information in the configuration information, determining an input source according to the input source information in the configuration information, and receiving a section of data stream from the corresponding input source every other preset time period.
2. The method of claim 1, wherein the performing a primary parsing on each log in the segment of the data stream to obtain the specified structure data corresponding to each log comprises:
and extracting one or more specified fields from each log, taking the set of the one or more specified fields as a key, and taking the number of times of the set of the one or more fields appearing in the log as a value, so as to obtain metadata in the form of a key-value pair corresponding to the log.
3. The method of claim 2, wherein the initially aggregating the specified structure data corresponding to the logs in the segment of the data stream comprises:
and for a plurality of metadata corresponding to a plurality of logs in the data stream, primarily aggregating values of the metadata according to keys of the metadata to obtain one or more metadata after primary aggregation.
4. The method of claim 3, wherein the caching the primarily aggregated specified structure data comprises:
writing the metadata after the initial aggregation into a distributed file system for caching;
when each metadata is written into the distributed file system, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset updating rule; and if cached metadata which is the same as the key of the metadata to be written does not exist in the distributed file system, directly writing the metadata to be written into the distributed file system.
5. The method of claim 4, wherein the extracting one or more specified fields from each log, with the set of one or more specified fields as a key and the number of times the set of one or more fields appears in the log as a value, and obtaining the metadata in the form of key-value pairs corresponding to the log comprises:
and extracting a field indicating the identification information of the user and a field indicating the identification information of the service accessed by the user from each log, taking a set of the field indicating the identification information of the user and the field indicating the identification information of the service accessed by the user as a key, taking '1' as a value, and forming metadata corresponding to the log by using the key and the value.
6. The method of claim 5, wherein the field indicating identification information of the service accessed by the user comprises: URL address, version information, signature information, and/or channel information.
7. The method of claim 5, wherein the extracting, from each log, a field indicating identification information of an accessing user and a field indicating identification information of a service accessed by the user, with a set of the field indicating identification information of the accessing user and the field indicating identification information of the service accessed by the user as a key and "1" as a value, and with the key and the value constituting metadata corresponding to the log further comprises:
and extracting a field indicating time information of the user for accessing the service from each log, and recording the time corresponding to the metadata corresponding to each log according to the field.
8. The method of claim 7, wherein the primarily aggregating values of the metadata according to the key of the metadata for a plurality of metadata corresponding to a plurality of logs in the segment of data stream comprises:
acquiring the statistical period into which each metadata falls according to the time corresponding to the metadata; for metadata falling into the same statistical period, performing initial aggregation on the values of the metadata according to the keys of the metadata; and regarding a plurality of metadata with the same key falling into the same statistical period, taking the corresponding maximum time as the time corresponding to the corresponding metadata obtained after the initial aggregation of the plurality of metadata.
9. The method of claim 8, wherein if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset update rule further comprises:
judging whether the time corresponding to the cached metadata and the time corresponding to the written metadata fall into the same statistical period or not, if so, performing cumulative updating on the value of the cached metadata by using the value of the metadata to be written, and taking the larger time of the time corresponding to the metadata to be written and the time corresponding to the cached metadata as the time corresponding to the updated metadata; otherwise, the metadata to be written is directly written into the distributed file system.
10. The method of claim 9, wherein when the computing opportunity of the preset statistical period is reached, performing computing processing on the cached specified structure data, and obtaining the log processing result corresponding to the statistical period includes:
when a first preset time after the end time of the current statistical period is reached, reading metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata;
performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking the field indicating the identification information of the service accessed by the user as a key of the metadata after the secondary analysis, and taking the value of the metadata as the value of the metadata after the secondary analysis;
performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation;
and the value of each secondary aggregated metadata represents the total access times of the service indicated by the key of the secondary aggregated metadata in the current statistical period.
11. The method of claim 9, wherein when the computing opportunity of the preset statistical period is reached, performing computing processing on the cached specified structure data, and obtaining the log processing result corresponding to the statistical period includes:
when a first preset time after the end time of the current statistical period is reached, reading metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata;
performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking a field indicating identification information of a service accessed by the user as a key of the metadata after the secondary analysis, and taking '1' as a value of the metadata after the secondary analysis;
performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation;
the value of each twice aggregated metadata represents the number of independent visitors to the service indicated by the key of the twice aggregated metadata in the current statistical period.
12. The method of claim 10 or 11, wherein the method further comprises:
and filtering out the expired metadata cached in the distributed file system when a second preset time before the starting time of the current statistical period is reached.
13. The method of claim 12, wherein filtering out stale metadata cached in the distributed file system comprises:
comparing the time corresponding to each metadata cached in the distributed file system with the starting time of the current statistical period, if the time corresponding to one metadata is less than the starting time of the current statistical period, determining that the metadata is overdue metadata, and setting the value of the metadata to be null;
and deleting the metadata with null values in the distributed file system.
14. The method of claim 1, wherein the method further comprises:
the primary analysis of each log in the data stream comprises: performing primary analysis on each log according to an analysis rule in the configuration information;
the calculation processing of the cached specified structure data comprises: and calculating the cached specified structure data according to the calculation rule in the configuration information.
15. The method of claim 14, wherein the configuration information further comprises: computing platform information; the log processing method is executed based on the computing platform indicated by the computing platform information;
the computing platform is a Spark Streaming computing platform.
16. The method of claim 1, wherein the method further comprises:
storing the log processing result corresponding to the statistical period in a storage medium;
the storage medium includes: a Redis database, a Mysql database, and/or a GreenPlum database.
17. A log processing apparatus, wherein the apparatus comprises:
the data receiving unit is suitable for receiving a section of data stream from an input source at intervals of a preset time period, and each section of data stream comprises a plurality of logs generated in the previous preset time period;
the computing processing unit is suitable for performing primary analysis on each log in each segment of data stream after each segment of data stream is received to obtain specified structure data corresponding to each log; performing initial aggregation on a plurality of specified structure data corresponding to a plurality of logs in the data stream, and caching the specified structure data after the initial aggregation; when the calculation time of a preset statistical period is reached, calculating the cached specified structure data to obtain a log processing result corresponding to the statistical period;
the apparatus further comprises: configuring a reading unit;
the configuration reading unit is suitable for scheduling a thread to read configuration information input by a user at preset time intervals; the configuration information includes: time configuration information, input source information, parsing rules, and/or computing rules;
the data receiving unit is suitable for determining a preset time period according to the time configuration information in the configuration information, determining an input source according to the input source information in the configuration information, and receiving a section of data stream from the corresponding input source at intervals of the preset time period.
18. The apparatus of claim 17, wherein,
the calculation processing unit is suitable for extracting one or more specified fields from each log, taking a set of the one or more specified fields as a key, and taking the number of times of the set of the one or more fields appearing in the log as a value, so as to obtain the metadata in the form of the key value pair corresponding to the log.
19. The apparatus of claim 18, wherein,
and the calculation processing unit is suitable for performing initial aggregation on a plurality of metadata corresponding to a plurality of logs in the segment of data stream according to the key of the metadata to obtain one or more metadata after the initial aggregation.
20. The apparatus of claim 19, wherein,
the computing processing unit is suitable for writing the metadata after the initial aggregation into a distributed file system for caching; when each metadata is written into the distributed file system, if cached metadata identical to the key of the metadata to be written exists in the distributed file system, updating the value of the cached metadata according to the value of the metadata to be written and a preset updating rule; and if cached metadata which is the same as the key of the metadata to be written does not exist in the distributed file system, directly writing the metadata to be written into the distributed file system.
21. The apparatus of claim 20, wherein,
the calculation processing unit is suitable for extracting a field indicating the identification information of the user and a field indicating the identification information of the service accessed by the user from each log, taking a set of the field indicating the identification information of the user and the field indicating the identification information of the service accessed by the user as a key, taking '1' as a value, and forming the metadata corresponding to the log by using the key and the value.
22. The apparatus of claim 21, wherein the field indicating identification information of the service accessed by the user comprises: URL address, version information, signature information, and/or channel information.
23. The apparatus of claim 21, wherein,
the computing processing unit is further adapted to extract a field indicating time information of the user accessing the service from each log, and record the time corresponding to the metadata corresponding to each log according to the field.
24. The apparatus of claim 23, wherein,
the computing processing unit is suitable for acquiring the statistical period into which each metadata falls according to the time corresponding to the metadata; for metadata falling into the same statistical period, performing initial aggregation on the values of the metadata according to the keys of the metadata; and regarding a plurality of metadata with the same key falling into the same statistical period, taking the corresponding maximum time as the time corresponding to the corresponding metadata obtained after the initial aggregation of the plurality of metadata.
25. The apparatus of claim 24, wherein,
the calculation processing unit is further adapted to, when it is determined that cached metadata identical to a key of metadata to be written exists in the distributed file system, further determine whether time corresponding to the cached metadata and time corresponding to the written metadata fall within a same statistical period, if yes, perform cumulative update on a value of the cached metadata by using the value of the metadata to be written, and use a larger time of the time corresponding to the metadata to be written and the time corresponding to the cached metadata as time corresponding to the updated metadata; otherwise, the metadata to be written is directly written into the distributed file system.
26. The apparatus of claim 25, wherein,
the calculation processing unit is suitable for reading the metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata when the first preset time after the end time of the current statistical period is reached; performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking the field indicating the identification information of the service accessed by the user as a key of the metadata after the secondary analysis, and taking the value of the metadata as the value of the metadata after the secondary analysis; performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation; and the value of each secondary aggregated metadata represents the total access times of the service indicated by the key of the secondary aggregated metadata in the current statistical period.
27. The apparatus of claim 25, wherein,
the calculation processing unit is suitable for reading the metadata falling into the current statistical period from the distributed file system according to the time corresponding to each metadata when the first preset time after the end time of the current statistical period is reached; performing secondary analysis on each read metadata, removing a field indicating the identification information of a user in a key of the metadata, taking a field indicating identification information of a service accessed by the user as a key of the metadata after the secondary analysis, and taking '1' as a value of the metadata after the secondary analysis; performing secondary aggregation on the value of the metadata subjected to secondary analysis according to the key of the metadata subjected to secondary analysis to obtain metadata subjected to secondary aggregation; the value of each twice aggregated metadata represents the number of independent visitors to the service indicated by the key of the twice aggregated metadata in the current statistical period.
28. The apparatus of claim 26 or 27, wherein the apparatus further comprises:
and the filtering unit is suitable for filtering out the expired metadata cached in the distributed file system when a second preset time before the starting time of the current statistical period is reached.
29. The apparatus of claim 28, wherein,
the filtering unit is suitable for comparing the time corresponding to each metadata cached in the distributed file system with the starting time of the current statistical period, if the time corresponding to one metadata is less than the starting time of the current statistical period, the metadata is determined to be overdue metadata, and the value of the metadata is set to be null; and deleting the metadata with null values in the distributed file system.
30. The apparatus of claim 17, wherein,
the computing processing unit is suitable for performing primary analysis on each log according to an analysis rule in the configuration information; and the method is suitable for performing calculation processing on the cached specified structure data according to the calculation rule in the configuration information.
31. The apparatus of claim 30, wherein the configuration information further comprises: computing platform information; the log processing device runs on the computing platform indicated by the computing platform information;
the computing platform is a Spark Streaming computing platform.
32. The apparatus of claim 17, wherein the apparatus further comprises:
the storage processing unit is suitable for storing the log processing result corresponding to the statistical period into a storage medium;
the storage medium includes: a Redis database, a Mysql database, and/or a GreenPlum database.
CN201610710011.8A 2016-08-23 2016-08-23 Log processing method and device Expired - Fee Related CN106294866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610710011.8A CN106294866B (en) 2016-08-23 2016-08-23 Log processing method and device

Publications (2)

Publication Number Publication Date
CN106294866A CN106294866A (en) 2017-01-04
CN106294866B (en) 2020-02-11

Family

ID=57615561

Country Status (1)

Country Link
CN (1) CN106294866B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140215495A1 (en) * 2013-01-25 2014-07-31 Matt Erich Task-specific application monitoring and analysis

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101018121A (en) * 2007-03-15 2007-08-15 杭州华为三康技术有限公司 Log convergence processing method and convergence processing device
CN101697168A (en) * 2009-10-22 2010-04-21 中国科学技术大学 Method and system for dynamically managing metadata of distributed file system
CN102902775A (en) * 2012-09-27 2013-01-30 新浪网技术(中国)有限公司 Internet real-time computing method and internet real-time computing system
CN103200046A (en) * 2013-03-28 2013-07-10 青岛海信传媒网络技术有限公司 Method and system for monitoring network cell device performance
CN104978256A (en) * 2014-04-10 2015-10-14 阿里巴巴集团控股有限公司 Log output method and equipment

Also Published As

Publication number Publication date
CN106294866A (en) 2017-01-04

Similar Documents

Publication Publication Date Title
CN106294866B (en) Log processing method and device
JP7269980B2 (en) User grouping method, apparatus, computer device, medium and computer program
US10956422B2 (en) Integrating event processing with map-reduce
CN103854214B (en) Method and system for processing auction data
CN102902775B (en) The method and system that internet calculates in real time
WO2015024474A1 (en) Rapid calculation method for electric power reliability index based on multithread processing of cache data
JP2014522004A (en) Send product information based on determined preference values
CN108520440B (en) Method for determining total consumption stock demand of reading user, electronic equipment and computer storage medium
WO2017031837A1 (en) Disk capacity prediction method, device and apparatus
CN105630934A (en) Data statistic method and system
CN106326261A (en) Pre-reading method and device for webpage and intelligent terminal device
CN102880615A (en) Data storage method and device
CN111143158A (en) Monitoring data real-time storage method and system, electronic equipment and storage medium
WO2021016623A1 (en) Systems, methods, and devices for generating real-time analytics
CN110209798B (en) Data display method and device of redis database
US20170359398A1 (en) Efficient Sorting for a Stream Processing Engine
CN103177080A (en) File pre-reading method and file pre-reading device
CN103365870A (en) Method and system for sorting search results
US9465866B2 (en) Task context recovery
CN109992469A (en) A kind of method and device merging log
US20170068682A1 (en) FILE MANAGEMENT FOR eDISCOVERY
CN103236938A (en) Method and system for user action collection based on cache memory and asynchronous processing technology
US10430115B2 (en) System and method for optimizing multiple packaging operations in a storage system
US9501426B1 (en) Dynamic two-tier data storage utilization
CN110888909B (en) Data statistical processing method and device for evaluation content

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (Granted publication date: 20200211)