CN112527881A - Hive-based data aggregation method - Google Patents


Info

Publication number
CN112527881A
Authority
CN
China
Prior art keywords
label
data
labels
full
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011488387.1A
Other languages
Chinese (zh)
Inventor
盛妍
田诺
张明杰
宋灿
顾立涛
王丽
牛逸明
张云志
徐雨申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Co ltd Customer Service Center
Original Assignee
CHINA REALTIME DATABASE CO LTD
State Grid Co ltd Customer Service Center
NARI Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHINA REALTIME DATABASE CO LTD, State Grid Co ltd Customer Service Center, NARI Group Corp
Priority to CN202011488387.1A
Publication of CN112527881A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hive-based data aggregation method comprising the following steps: collecting data, describing the entity characteristics of the same data uniformly with labels, dividing the labels into primary and secondary labels by granularity, and refining them into tertiary labels according to actual requirements; creating a corresponding original data table for each attribute of the label index system and storing the data classified by attribute; performing ETL processing on the original data tables stored in different partitions to generate a full label aggregation table used for batch queries and data management operations; performing row-column conversion on the full label aggregation table, grouping and compressing the tertiary labels belonging to each primary label, and generating a full label result table; and removing the date field from the full label result table and importing the table into Elasticsearch, where applications use it for interactive and batch search queries. The method decouples the labels from one another and from the label result table, and improves the refresh efficiency of the label result table.

Description

Hive-based data aggregation method
Technical Field
The invention relates to a Hive-based data aggregation method and belongs to the field of data processing.
Background
Data aggregation is the process of screening, storing, and developing collected raw data. It produces a unified data center that supplies the raw material for subsequent value mining of data assets.
On the Hive platform, data is usually associated and merged during computation and aggregation (ETL) with the MapReduce algorithm. However, MapReduce makes complex query logic difficult to develop, is inefficient for multi-table joins, and in batch computing scenarios either deletes data inefficiently or does not support deletion at all. As a result, the aggregated label result table and the label process tables are tightly coupled, which hinders iterative development of label data: the label management process is hard to decouple, the back-end code is complex, the iteration cycle is long, and the requirements of the business side cannot be met.
Disclosure of Invention
Purpose of the invention: to address the shortcomings of the prior art, the invention provides a data aggregation method that decouples the labels from one another and from the result table.
Technical scheme: the Hive-based data aggregation method comprises the following steps: collecting data, describing the entity characteristics of the same data uniformly with labels, dividing the labels into primary and secondary labels by granularity, and refining them into tertiary labels according to actual requirements; creating a corresponding original data table for each attribute of the label index system and storing the data classified by attribute; performing ETL processing on the original data tables stored in different partitions to generate a full label aggregation table used for batch queries and data management operations; performing row-column conversion on the full label aggregation table, grouping and compressing the tertiary labels belonging to each primary label, and generating a full label result table; and removing the date field from the full label result table and importing the table into Elasticsearch, where applications use it for interactive and batch search queries.
Before the full aggregation operation, the partitions of the secondary labels are cleared. Row-column conversion of the full label aggregation table is performed with a Hive UDF. The ETL processing queries each primary label classification table by its partition field and writes the secondary label contents found there into the partition of the full label aggregation table identified by the corresponding date and secondary label ID.
The fields of the data labels fall into numeric, date, enumeration, and text classes. The data date is the main partition field of the full label aggregation table, and the secondary label ID is its sub-partition field.
Beneficial effects: compared with the prior art, the invention has the following notable advantages: the label data model is optimized, the labels are decoupled from one another and from the label result table, and the refresh efficiency of the label result table is improved.
Detailed Description
The technical solution of the present invention is further illustrated by the following examples.
In this embodiment, the data aggregation method of the present invention is used to aggregate user data, and includes the following steps:
collecting data, describing the entity characteristics of the same data uniformly with labels, dividing the labels into primary and secondary labels by granularity, and refining them into tertiary labels according to actual requirements. For user data, the primary classification includes user information labels, user access behavior labels, user payment labels, user risk control labels, and so on, and all label records of a user are stored in different partitions under the different primary label tables. The primary labels of this embodiment are demographic attributes, behavioral attributes, consumption characteristic attributes, and risk control attributes. In the original data table for demographic attributes, the secondary labels include age group (user_info_age), gender (user_info_sex), and so on, and the tertiary labels under age group include teenager (user_info_age_01), youth (user_info_age_02), middle-aged (user_info_age_03), and elderly (user_info_age_04). A corresponding original data table is created for each attribute of the label index system to store the data classified by attribute; the structure of the original data table for a primary label is as follows:
[Figure BDA0002840020700000021: structure of the original data table for a primary label (image not reproduced)]
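The three-level label hierarchy described in this step can be pictured as a small nested structure. The following Python sketch is only an illustration: the class names, the `demographic_attributes` identifier, and the helper `tertiary_ids` are assumptions made for the example, while the secondary and tertiary label IDs are the ones named in this embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SecondaryLabel:
    """A secondary label and the tertiary labels refined from it."""
    label_id: str
    tertiary: Dict[str, str] = field(default_factory=dict)  # tertiary id -> meaning


@dataclass
class PrimaryLabel:
    """A primary label (one original data table) and its secondary labels."""
    name: str  # hypothetical identifier, not given in the source
    secondary: Dict[str, SecondaryLabel] = field(default_factory=dict)


# Demographic attributes, using the label IDs from the embodiment.
demographic = PrimaryLabel(
    name="demographic_attributes",
    secondary={
        "user_info_age": SecondaryLabel(
            "user_info_age",
            {
                "user_info_age_01": "teenager",
                "user_info_age_02": "youth",
                "user_info_age_03": "middle-aged",
                "user_info_age_04": "elderly",
            },
        ),
        # The embodiment names this secondary label but not its tertiary labels here.
        "user_info_sex": SecondaryLabel("user_info_sex"),
    },
)


def tertiary_ids(primary: PrimaryLabel) -> List[str]:
    """Flatten all tertiary label IDs that belong to one primary label."""
    return [t for s in primary.secondary.values() for t in s.tertiary]
```

Calling `tertiary_ids(demographic)` lists the four age-group labels in insertion order.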
emptying the partitions of the secondary labels and performing ETL processing on the original data tables stored in different partitions: each primary label classification table is queried by its partition field, and the secondary label contents found are written into the partition of the full label aggregation table identified by the corresponding date and secondary label ID, generating the full label aggregation table used for batch queries and data management operations. The fields of the data labels fall into numeric, date, enumeration, and text classes. The full label aggregation table uses the data date as its main partition field and the secondary label ID as its sub-partition field; its structure is as follows:
| Field definition   | Field      | Example                        | Remarks              |
|--------------------|------------|--------------------------------|----------------------|
| User id            | userid     | 4590087                        |                      |
| Primary label id   | tag_level1 |                                |                      |
| Secondary label id | tag_level2 | user_info_age, user_info_sex   | Sub-partition field  |
| Tertiary label id  | tag_level3 | sex_01, sex_02, hy_01, hy_02   |                      |
| Data date          | data_date  | 2020-02-03                     | Main partition field |
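The clear-then-write behaviour of the partitioned aggregation table can be sketched in plain Python. This is a minimal in-memory model rather than Hive code: the dictionary keyed by (data_date, tag_level2), the function name `refresh_partitions`, the `user_info` primary label value, and the sample rows are all assumptions made for illustration. It shows why refreshing one secondary label leaves the partitions of the other labels untouched.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
Partition = Tuple[str, str]  # (data_date, tag_level2)


def refresh_partitions(
    agg_table: Dict[Partition, List[Row]],
    source_rows: Iterable[Row],
    data_date: str,
    secondary_ids: Iterable[str],
) -> Dict[Partition, List[Row]]:
    """Empty the target partitions, then rewrite them from the source rows."""
    targets = set(secondary_ids)
    # Step 1: clear only the partitions that are about to be refreshed.
    for tag2 in targets:
        agg_table[(data_date, tag2)] = []
    # Step 2: route each matching source row to its (date, secondary label) partition.
    for row in source_rows:
        if row["data_date"] == data_date and row["tag_level2"] in targets:
            agg_table[(data_date, row["tag_level2"])].append(row)
    return agg_table


# Existing state: a stale age partition and an up-to-date sex partition.
agg = {
    ("2020-02-03", "user_info_age"): [
        {"userid": 4590087, "tag_level1": "user_info", "tag_level2": "user_info_age",
         "tag_level3": "user_info_age_01", "data_date": "2020-02-03"},
    ],
    ("2020-02-03", "user_info_sex"): [
        {"userid": 4590087, "tag_level1": "user_info", "tag_level2": "user_info_sex",
         "tag_level3": "sex_01", "data_date": "2020-02-03"},
    ],
}

# Fresh rows queried from a primary label table (hypothetical data).
fresh = [
    {"userid": 4590087, "tag_level1": "user_info", "tag_level2": "user_info_age",
     "tag_level3": "user_info_age_02", "data_date": "2020-02-03"},
]

refresh_partitions(agg, fresh, "2020-02-03", ["user_info_age"])
```

After the call, the age partition holds only the fresh row, and the sex partition is unchanged.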
To meet the requirements of online queries and of batch search in Elasticsearch, a Hive UDF is used to perform row-column conversion on the full label aggregation table: the tertiary labels belonging to all primary labels are grouped and compressed to generate the full label result table, in which the tertiary labels of each user are converted into a duplicate-free array stored in a single field, so that the result table contains only the user id and the set of tertiary labels. Because this table holds only the latest label data for each user, the date field is no longer needed; it is removed, and the full label result table is imported into Elasticsearch, where applications use it for interactive and batch search queries.
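The row-column conversion and the hand-off to Elasticsearch can be sketched as follows. This is a plain-Python stand-in, not the Hive UDF itself, which the source does not show. The function names, the `user_tags` index name, the `tags` field name, and the sample rows are assumptions; the newline-delimited action/source pairs follow the general shape of the Elasticsearch bulk request body.

```python
import json
from typing import Dict, Iterable, List

Row = Dict[str, object]


def to_result_table(agg_rows: Iterable[Row]) -> Dict[object, List[str]]:
    """Collapse one row per (user, tertiary label) into one array per user.

    Duplicates are dropped and data_date is discarded, so the result holds
    only the user id and its tertiary label set.
    """
    result: Dict[object, List[str]] = {}
    for row in agg_rows:
        labels = result.setdefault(row["userid"], [])
        if row["tag_level3"] not in labels:
            labels.append(row["tag_level3"])
    return result


def to_bulk_ndjson(result: Dict[object, List[str]], index: str = "user_tags") -> str:
    """Render the result table as a newline-delimited bulk request body."""
    lines = []
    for userid, labels in result.items():
        # Action line: index the document under the user id.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(userid)}}))
        # Source line: the user id and its duplicate-free tertiary labels.
        lines.append(json.dumps({"userid": userid, "tags": labels}))
    return "\n".join(lines) + "\n"


rows = [
    {"userid": 4590087, "tag_level2": "user_info_sex", "tag_level3": "sex_01",
     "data_date": "2020-02-03"},
    {"userid": 4590087, "tag_level2": "user_info_hy", "tag_level3": "hy_01",
     "data_date": "2020-02-03"},
    # Repeated label for the same user: must not appear twice in the array.
    {"userid": 4590087, "tag_level2": "user_info_sex", "tag_level3": "sex_01",
     "data_date": "2020-02-03"},
]

result_table = to_result_table(rows)
bulk_body = to_bulk_ndjson(result_table)
```

Storing the labels as one array per user means a single document lookup returns every label of that user, and a search on the array field finds all users carrying a given label.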

Claims (6)

1. A Hive-based data aggregation method, characterized by comprising the following steps: collecting data, describing the entity characteristics of the same data uniformly with labels, dividing the labels into primary and secondary labels by granularity, and refining them into tertiary labels according to actual requirements; creating a corresponding original data table for each attribute of the label index system and storing the data classified by attribute; performing ETL processing on the original data tables stored in different partitions to generate a full label aggregation table used for batch queries and data management operations; performing row-column conversion on the full label aggregation table, grouping and compressing the tertiary labels belonging to each primary label, and generating a full label result table; and removing the date field from the full label result table and importing the table into Elasticsearch, where applications use it for interactive and batch search queries.
2. The Hive-based data aggregation method according to claim 1, wherein the partitions of the secondary labels are emptied before the full aggregation operation.
3. The Hive-based data aggregation method according to claim 1, wherein a Hive UDF is used to perform the row-column conversion on the full label aggregation table.
4. The Hive-based data aggregation method according to claim 1, wherein the ETL processing queries each primary label classification table by its partition field and writes the secondary label contents found there into the partition of the full label aggregation table that uses the corresponding date and secondary label ID as partition fields.
5. The Hive-based data aggregation method according to claim 1, wherein the fields of the data labels comprise a numeric class, a date class, an enumeration class, and a text class.
6. The Hive-based data aggregation method according to claim 4, wherein the full label aggregation table uses the data date as its main partition field and the secondary label ID as its sub-partition field.
CN202011488387.1A 2020-12-16 2020-12-16 Hive-based data aggregation method Pending CN112527881A (en)

Publications (1)

Publication Number Publication Date
CN112527881A true CN112527881A (en) 2021-03-19

Family

ID=75000746

Country Status (1)

Country Link
CN (1) CN112527881A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6983291B1 (en) * 1999-05-21 2006-01-03 International Business Machines Corporation Incremental maintenance of aggregated and join summary tables
CN105976161A (en) * 2016-04-29 2016-09-28 随身云(北京)信息技术有限公司 Time axis-based intelligent recommendation calendar and user-based presentation method
CN108764984A (en) * 2018-05-17 2018-11-06 国网冀北电力有限公司电力科学研究院 A kind of power consumer portrait construction method and system based on big data
CN109101652A (en) * 2018-08-27 2018-12-28 宜人恒业科技发展(北京)有限公司 A kind of creation of label and management system
CN109376161A (en) * 2018-08-22 2019-02-22 中国平安人寿保险股份有限公司 Label data update method, device, medium and electronic equipment based on big data
CN111159276A (en) * 2018-11-08 2020-05-15 北京航天长峰科技工业集团有限公司 Holographic image system construction method based on hybrid storage mode
CN111475509A (en) * 2020-04-03 2020-07-31 李俊宏 Big data-based user portrait and multidimensional analysis system
CN111506621A (en) * 2020-03-31 2020-08-07 新华三大数据技术有限公司 Data statistical method and device
CN111881221A (en) * 2020-07-07 2020-11-03 上海中通吉网络技术有限公司 Method, device and equipment for customer portrait in logistics service

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵宏田: "《用户画像:方法论与工程化解决方案》", 31 May 2020, 机械工业出版社 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Sheng Yan, Tian Nuo, Zhang Mingjie, Song Can, Gu Litao, Wang Li, Niu Yiming
Inventor before: Sheng Yan, Tian Nuo, Zhang Mingjie, Song Can, Gu Litao, Wang Li, Niu Yiming, Zhang Yunzhi, Xu Yushen
TA01 Transfer of patent application right
Effective date of registration: 20220616
Address after: No.21, Lihu Ring Road, Dongli District, Tianjin 300309
Applicant after: STATE GRID Co.,Ltd. CUSTOMER SERVICE CENTER
Address before: No.21, Lihu Ring Road, Dongli District, Tianjin 300309
Applicant before: STATE GRID Co.,Ltd. CUSTOMER SERVICE CENTER; CHINA REALTIME DATABASE Co.,Ltd.; NARI Group Corp.
RJ01 Rejection of invention patent application after publication
Application publication date: 20210319