CN112527881A

CN112527881A - Hive-based data aggregation method

Info

Publication number: CN112527881A
Application number: CN202011488387.1A
Authority: CN
Inventors: 盛妍; 田诺; 张明杰; 宋灿; 顾立涛; 王丽; 牛逸明; 张云志; 徐雨申
Original assignee: CHINA REALTIME DATABASE CO LTD; State Grid Co ltd Customer Service Center; NARI Group Corp
Current assignee: State Grid Co ltd Customer Service Center
Priority date: 2020-12-16
Filing date: 2020-12-16
Publication date: 2021-03-19

Abstract

The invention discloses a Hive-based data aggregation method comprising the following steps: collecting data, describing the entity characteristics of the same data uniformly by means of labels, dividing the labels into first-level and second-level labels according to granularity, and refining them into third-level labels according to actual requirements; creating a corresponding original data table for each attribute of the label index system and storing the data classified under that attribute; performing ETL processing on the original data tables stored in different partitions to generate a full label aggregation table for batch queries and data management operations; performing row-column conversion on the full label aggregation table, grouping and compressing the third-level labels belonging to each first-level label, and generating a full label result table; and removing the date field from the full label result table and importing the table into ElasticSearch, where applications use it for interactive and batch search queries. The method decouples the labels from one another and from the label result table, and improves the refresh efficiency of the label result table.

Description

Hive-based data aggregation method

Technical Field

The invention relates to a Hive-based data aggregation method and belongs to the field of data processing.

Background

Data aggregation is the process of screening, storing and developing acquired raw data. It yields a unified data center and provides the raw material for subsequent value mining of data assets.

On the Hive platform, data is usually associated and merged during computation and aggregation (ETL) by means of the MapReduce algorithm. However, MapReduce makes complex query logic difficult to develop, is inefficient when processing multi-table joins, and, in batch-computation scenarios, either does not support deleting data or executes deletions inefficiently. As a result, the aggregated label result table and the label process tables are tightly coupled, which hinders iterative development of label data: the label management process is hard to decouple, the back-end code is complex, the iteration cycle is long, and the requirements of the business side cannot be met.

Disclosure of Invention

Purpose of the invention: to address the shortcomings of the prior art, the invention provides a data aggregation method that decouples the labels from one another and decouples the labels from the result table.

Technical scheme: the Hive-based data aggregation method comprises the following steps: collecting data, describing the entity characteristics of the same data uniformly by means of labels, dividing the labels into first-level and second-level labels according to granularity, and refining the first-level and second-level labels into third-level labels according to actual requirements; creating a corresponding original data table for each attribute of the label index system and storing the data classified under that attribute; performing ETL processing on the original data tables stored in different partitions to generate a full label aggregation table for batch queries and data management operations; performing row-column conversion on the full label aggregation table, grouping and compressing the third-level labels belonging to each first-level label, and generating a full label result table; and removing the date field from the full label result table and importing the table into ElasticSearch, where applications use it for interactive and batch search queries.
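The three-level label index system described above can be pictured as a small tree. The following is a minimal sketch rather than the patented implementation: the class names, the `build_index` helper and the primary label name `demographic_attributes` are hypothetical, while the secondary and tertiary IDs are the ones given in the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SecondaryLabel:
    """A second-level label and the third-level labels refined from it."""
    label_id: str
    tertiary_ids: List[str] = field(default_factory=list)


@dataclass
class PrimaryLabel:
    """A first-level label; the method keeps one original data table per primary label."""
    name: str  # hypothetical identifier, not given in the source
    secondary: List[SecondaryLabel] = field(default_factory=list)


def build_index(primaries: List[PrimaryLabel]) -> Dict[str, Tuple[str, str]]:
    """Map each tertiary label ID to its (primary name, secondary ID) pair."""
    index: Dict[str, Tuple[str, str]] = {}
    for primary in primaries:
        for sec in primary.secondary:
            for tertiary_id in sec.tertiary_ids:
                index[tertiary_id] = (primary.name, sec.label_id)
    return index


demographic = PrimaryLabel(
    name="demographic_attributes",
    secondary=[
        SecondaryLabel("user_info_age", [
            "user_info_age_01",  # teenager
            "user_info_age_02",  # youth
            "user_info_age_03",  # middle-aged
            "user_info_age_04",  # elderly
        ]),
        SecondaryLabel("user_info_sex"),  # tertiary IDs omitted in this sketch
    ],
)

index = build_index([demographic])
print(index["user_info_age_02"])
```

Keeping the hierarchy as data rather than as table columns is one way to let a new tertiary label be added without touching the tables of other labels, which is the kind of decoupling the method aims at.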

Before the full aggregation operation, the partitions of the secondary labels are cleared. A Hive UDF function performs the row-column conversion on the full label aggregation table. The ETL processing queries each primary label classification table by its partition field and writes the resulting secondary label content into the partition of the full label aggregation table whose partition fields are the corresponding date and secondary label ID.

The fields of the data labels comprise a numerical class, a date class, an enumeration class and a text class. The data date serves as the main partition field of the full label aggregation table, and the secondary label ID serves as its sub-partition field.
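To make the two-level partitioning concrete, the sketch below builds the partition specification for one date and one secondary label, using the `key=value` directory convention that Hive applies to partitioned tables. The function name `partition_path` and the enum `LabelFieldClass` are hypothetical; the column names `data_date` and `tag_level2` are taken from the structure of the full label aggregation table.

```python
from datetime import date
from enum import Enum


class LabelFieldClass(Enum):
    """The four field classes the method distinguishes for data labels."""
    NUMERICAL = "numerical"
    DATE = "date"
    ENUMERATION = "enumeration"
    TEXT = "text"


def partition_path(data_date: date, tag_level2: str) -> str:
    """Return the relative partition path: main partition first, then sub-partition."""
    return f"data_date={data_date.isoformat()}/tag_level2={tag_level2}"


path = partition_path(date(2020, 2, 3), "user_info_age")
print(path)
```

Because the secondary label ID is its own sub-partition, one label's data for a given day can be located, cleared or rewritten without scanning the data of any other label.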

Beneficial effects: compared with the prior art, the invention has the following notable advantages: the label data model is optimized, the labels are decoupled from one another and from the label result table, and the refresh efficiency of the label result table is improved.

Detailed Description

The technical solution of the present invention is further illustrated by the following examples.

In this embodiment, the data aggregation method of the present invention is used to aggregate user data, and includes the following steps:

Collecting data, describing the entity characteristics of the same data uniformly by means of labels, dividing the labels into first-level and second-level labels according to granularity, and refining the first-level and second-level labels into third-level labels according to actual requirements. For user data, the primary classification includes user information labels, user access behavior labels, user payment labels, user risk control labels, and so on. All label records of a user are stored in different partitions under different primary label tables. The primary labels of this embodiment are demographic attributes, behavioral attributes, consumption characteristic attributes and risk control attributes. In the original data table of demographic attributes, the secondary labels include an age group (user_info_age), a gender group (user_info_sex), and the like; the tertiary labels under the age group include teenager (user_info_age_01), youth (user_info_age_02), middle-aged (user_info_age_03) and elderly (user_info_age_04). Corresponding original data tables are created according to the attributes of the label index system and store the data classified under each attribute. The structure of the original data table of a primary label is as follows:

Clearing the partitions of the secondary labels and performing ETL processing on the original data tables stored in different partitions: each primary label classification table is queried by its partition field, and the resulting secondary label content is written into the partition of the full label aggregation table whose partition fields are the corresponding date and secondary label ID. This generates the full label aggregation table, which is used for batch queries and data management operations. The fields of the data labels comprise a numerical class, a date class, an enumeration class and a text class. The full label aggregation table takes the data date as its main partition field and the secondary label ID as its sub-partition field; its structure is as follows:

Field definition	Field	Example	Remarks
User ID	userid	4590087	
Primary label ID	tag_level1		
Secondary label ID	tag_level2	user_info_age,user_info_sex	Sub-partition field
Tertiary label ID	tag_level3	sex_01,sex_02,hy_01,hy_02	
Date of data	data_date	2020-02-03	Main partition field
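The clear-then-write step can be illustrated with an in-memory stand-in for the partitioned table. This is a sketch under stated assumptions: the dictionary keyed by `(data_date, tag_level2)` merely imitates Hive partitions, and the function `refresh_secondary_label` and the primary label value `demographic` are hypothetical. In Hive itself a comparable effect is commonly obtained by overwriting a specific partition; the source does not state which statement the authors use.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
PartitionKey = Tuple[str, str]  # (data_date, tag_level2)


def refresh_secondary_label(
    agg: Dict[PartitionKey, List[Row]],
    data_date: str,
    tag_level1: str,
    tag_level2: str,
    source_rows: Iterable[Tuple[int, str]],
) -> None:
    """Clear one (date, secondary label) partition, then rewrite it from the source rows."""
    key = (data_date, tag_level2)
    agg[key] = []  # clearing first keeps a re-run from duplicating rows
    for userid, tag_level3 in source_rows:
        agg[key].append({
            "userid": userid,
            "tag_level1": tag_level1,
            "tag_level2": tag_level2,
            "tag_level3": tag_level3,
            "data_date": data_date,
        })


agg: Dict[PartitionKey, List[Row]] = {}
refresh_secondary_label(agg, "2020-02-03", "demographic", "user_info_age",
                        [(4590087, "user_info_age_02")])
refresh_secondary_label(agg, "2020-02-03", "demographic", "user_info_sex",
                        [(4590087, "sex_01")])
# Re-running the age label replaces only its own partition.
refresh_secondary_label(agg, "2020-02-03", "demographic", "user_info_age",
                        [(4590087, "user_info_age_03")])

for key in sorted(agg):
    print(key, [row["tag_level3"] for row in agg[key]])
```

The second run for `user_info_age` leaves the `user_info_sex` partition untouched, which illustrates why partitioning by secondary label ID lets each label be refreshed independently.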

To meet the requirements of interactive queries and batch searches in ElasticSearch, a Hive UDF function performs row-column conversion on the full label aggregation table: the third-level labels belonging to all first-level labels are grouped and compressed to generate the full label result table. The third-level labels of the same user are converted into an array without duplicates and stored in a single field, so the processed full label result table is compressed into a table that contains only the user ID and the set of third-level labels. Because this table holds only the latest label data of each user, the date field is no longer needed; it is removed from the full label result table, and the table is imported into ElasticSearch, where applications use it for interactive and batch search queries.
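The row-column conversion and the hand-off to ElasticSearch can be sketched as follows. This is an illustration only: the source performs the conversion with a Hive UDF whose code is not given, so the plain Python grouping here stands in for it (its effect resembles collecting a distinct set of values per user). The function names, the document field `tags` and the index name `user_tags` are hypothetical. The output follows the newline-delimited JSON layout of the Elasticsearch bulk API, in which an action line precedes each document.

```python
import json
from typing import Dict, List

Row = Dict[str, object]


def to_result_table(agg_rows: List[Row]) -> List[Row]:
    """Group rows by userid, keep distinct tertiary labels in order, drop data_date."""
    tags_by_user: Dict[object, List[object]] = {}
    for row in agg_rows:
        tags = tags_by_user.setdefault(row["userid"], [])
        if row["tag_level3"] not in tags:
            tags.append(row["tag_level3"])
    return [{"userid": uid, "tags": tags} for uid, tags in tags_by_user.items()]


def to_bulk_ndjson(docs: List[Row], index_name: str = "user_tags") -> str:
    """Serialize documents as bulk-API NDJSON: one action line, then one source line."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": str(doc["userid"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


agg_rows = [
    {"userid": 4590087, "tag_level3": "sex_01", "data_date": "2020-02-03"},
    {"userid": 4590087, "tag_level3": "hy_02", "data_date": "2020-02-03"},
    {"userid": 4590087, "tag_level3": "sex_01", "data_date": "2020-02-03"},
]

result = to_result_table(agg_rows)
payload = to_bulk_ndjson(result)
print(payload, end="")
```

Using the user ID as the document `_id` means a later import overwrites the user's previous document, which matches the statement that the result table holds only the latest label data.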

Claims

1. A Hive-based data aggregation method, characterized by comprising the following steps: collecting data, describing the entity characteristics of the same data uniformly by means of labels, dividing the labels into first-level and second-level labels according to granularity, and refining the first-level and second-level labels into third-level labels according to actual requirements; creating a corresponding original data table for each attribute of the label index system and storing the data classified under that attribute; performing ETL processing on the original data tables stored in different partitions to generate a full label aggregation table for batch queries and data management operations; performing row-column conversion on the full label aggregation table, grouping and compressing the third-level labels belonging to each first-level label, and generating a full label result table; and removing the date field from the full label result table and importing the table into ElasticSearch, where applications use it for interactive and batch search queries.

2. The Hive-based data aggregation method according to claim 1, wherein the partitions of the secondary labels are cleared before the full aggregation operation.

3. The Hive-based data aggregation method according to claim 1, wherein a Hive UDF function is used to perform the row-column conversion on the full label aggregation table.

4. The Hive-based data aggregation method according to claim 1, wherein the ETL processing queries each primary label classification table by its partition field and writes the resulting secondary label content into the partition of the full label aggregation table whose partition fields are the corresponding date and secondary label ID.

5. The Hive-based data aggregation method according to claim 1, wherein the fields of the data labels comprise a numerical class, a date class, an enumeration class and a text class.

6. The Hive-based data aggregation method according to claim 4, wherein the full label aggregation table takes the data date as its main partition field and the secondary label ID as its sub-partition field.