CN109086293B - Hive file reading and writing method and device - Google Patents

Info

Publication number
CN109086293B
CN109086293B CN201810593791.1A CN201810593791A CN109086293B CN 109086293 B CN109086293 B CN 109086293B CN 201810593791 A CN201810593791 A CN 201810593791A CN 109086293 B CN109086293 B CN 109086293B
Authority
CN
China
Prior art keywords
data
hive file
reading
server
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810593791.1A
Other languages
Chinese (zh)
Other versions
CN109086293A (en)
Inventor
吴强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiufu Jinke Holding Group Co ltd
Original Assignee
Jiufu Jinke Holding Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiufu Jinke Holding Group Co ltd
Priority to CN201810593791.1A
Publication of CN109086293A
Application granted
Publication of CN109086293B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a Hive file reading and writing method and device. The method comprises the following steps: reading a data access table to acquire server information and parallelism information; generating, according to the server information, an execution program that connects to the server where the Hive file is located; determining the number of reading threads, the number of processing threads and the batch size for the Hive file according to the server information and the parallelism information; and accessing the Hive file data for data consumption. The method further comprises configuring service grouping for the Hive file data, assembling the read Hive file data according to that configuration, and determining the data consumption order according to a data consumption priority configuration. Compared with the traditional Hive file reading and writing method, the technical scheme avoids the cumbersome HQL compilation process and addresses the problems of low reading efficiency and an uncontrollable reading process.

Description

Hive file reading and writing method and device
Technical Field
The invention relates to a Hive file reading and writing method and device, and belongs to the field of Hadoop data warehouse application.
Background
Hive is a data warehouse tool built on top of Hadoop. Data queries are expressed in the HQL language, and all Hive file data is stored in a Hadoop-compatible file system. In the prior art, reading a Hive partition with HQL requires starting a Hive client, issuing a select statement that specifies the partition, and returning the data through Hive's built-in MapReduce model. This approach must access Hive metadata and occupies MapReduce computing resources, so it cannot read data efficiently. It also cannot avoid the cumbersome HQL compilation process, the data blocks it reads are large, and the reading process cannot be controlled.
Disclosure of Invention
In order to alleviate the above technical problems, an object of the present invention is to provide a Hive file reading and writing method and device, which achieve efficient and portable reading of Hive files and can control the reading process.
In a first aspect, the present invention provides a Hive file reading and writing method, including: reading a data access table and acquiring server information and parallelism information; generating, according to the server information, an execution program to connect to the server where the Hive file is located; determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information; and accessing the Hive file data for data consumption.
Further, the server information comprises at least one server node, and the server nodes are deployed in a centralized or distributed manner; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.
Further, after reading the data access table, if the server pointed to by the server information is unavailable and/or the server information contains errors, the data access table is configured and saved.
Further, the number of the reading threads is determined by the reading path of the Hive file; the number of the processing threads is determined by the upper limit of remaining threads of the server node and the estimated size of a single row of data of the Hive file; the batch size is determined by the estimated size of a single row of data of the Hive file.
Further, the data access table also includes service grouping configuration; reading service grouping configuration when reading the data access table; before accessing the Hive file data, the method also comprises the step of assembling the read Hive file data according to the service grouping configuration.
Further, the data access table also comprises a priority; reading the priority when reading the data access table; before the data consumption, the method also comprises the step of determining the data consumption sequence according to the priority.
Further, if an exception occurs while accessing the Hive file data, exception information or breakpoint information is recorded, and access to the Hive file data is continued or stopped.
Further, the data consumption further comprises: additionally calculating the blocking interval according to the computation scale and/or the network delay.
Further, after the data consumption, the method also comprises the following step: returning to process the Hive file after all the reading threads and the processing threads have finished.
In a second aspect, the present invention further provides a Hive file reading and writing device, including: a configuration module, which reads the data access table and acquires server information and parallelism information; a connection module, which generates an execution program according to the server information so as to connect to the server where the Hive file is located; an execution module, which determines the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information; and a data consumption module, which accesses the Hive file data to perform data consumption.
The invention has the advantage that system resources are allocated automatically according to the server information and the parallelism information during Hive file reading and writing, which improves the reading and writing efficiency of the Hive file. In addition, the invention introduces service grouping configuration and priority setting, so that the Hive file reading and writing process is flexible and controllable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show one embodiment of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a Hive file reading and writing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an effect of a Hive file reading and writing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Hive file reading and writing device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and the described embodiments are some, but not all embodiments of the present invention.
The first embodiment is as follows:
FIG. 1 is a flowchart of a Hive file reading and writing method according to an embodiment of the present invention, and FIG. 2 is an effect diagram of the same method. As shown in FIG. 1 and FIG. 2, the embodiment is implemented by the following nine steps.
S101: Read the data access table. The data access table includes server information and parallelism information. In an optional embodiment, the server information includes: the server node, the Hive table, the Hive partition 11, the Hive bucket 13, and the CPU and memory of the server where the execution program is located. The parallelism information is determined by the network delay of the server nodes and the routing condition of each server node.
Optionally, the data access table further includes a service grouping configuration, which is read when the data access table is read.
Optionally, the data access table further includes a priority, which is read when the data access table is read.
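By way of non-limiting illustration, the data access table read in S101 could be modelled as a small configuration record. The embodiment does not specify a storage format, so every class and field name below (ServerNode, DataAccessTable, group_rules, priorities and so on) is a hypothetical choice made only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServerNode:
    host: str  # IP address or host name of the server node
    port: int  # service port number


@dataclass
class DataAccessTable:
    nodes: List[ServerNode]  # server information: at least one node
    table: str               # target Hive table
    partition: str           # Hive partition, e.g. "dt=2018-06-11"
    bucket_count: int        # number of Hive buckets
    parallelism: int         # parallelism information
    # Optional parts of the table (hypothetical shapes):
    group_rules: Dict[str, List[int]] = field(default_factory=dict)
    priorities: Dict[str, int] = field(default_factory=dict)

    def validate(self) -> None:
        """Reject a table that cannot describe a readable Hive file."""
        if not self.nodes:
            raise ValueError("server information needs at least one node")
        if self.parallelism < 1:
            raise ValueError("parallelism must be positive")


config = DataAccessTable(
    nodes=[ServerNode("10.0.0.1", 8020)],
    table="orders",
    partition="dt=2018-06-11",
    bucket_count=4,
    parallelism=2,
)
config.validate()
```

The optional service grouping configuration and priority default to empty mappings, which mirrors their optional status in the embodiment.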
S102: Determine whether the server is available, and if not, configure and save the data access table. Specifically, after reading the data access table, if the server pointed to by the server information is unavailable and/or the server information contains errors, the data access table is configured and saved. In an optional embodiment, a visual configuration interface is entered, and the data access table is configured and saved so that the new data access table can be read for subsequent Hive file reading and writing work.
Preferably, the server information, parallelism information, service grouping configuration and priority in the data access table are configured.
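One possible way to perform the availability check of S102 is a plain TCP connection attempt against each configured node, sketched below. The embodiment does not state how availability is judged, so the probe itself, the timeout value and the helper names server_available and unavailable_nodes are assumptions made for illustration.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def server_available(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def unavailable_nodes(
    nodes: Iterable[Tuple[str, int]],
    probe: Callable[[str, int], bool] = server_available,
) -> List[Tuple[str, int]]:
    """List the (host, port) pairs that fail the availability probe."""
    return [(host, port) for host, port in nodes if not probe(host, port)]
```

If unavailable_nodes returns a non-empty list, the caller would fall through to the reconfiguration branch described above; passing a custom probe makes the check easy to exercise without a live cluster.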
S103: Generate an execution program and connect to the server where the Hive file is located.
The execution program is generated according to the IP address, service port number and similar details of the server node where the Hive file is located, and it connects to that server. In an alternative embodiment, the execution program is generated using a Java API to connect to the server where the Hive file is located.
S104: Determine the number of reading threads, the number of processing threads and the batch size of the Hive file. Specifically, the number of the reading threads is determined by the reading path of the Hive file; the number of the processing threads is determined by the upper limit of remaining threads of the server node and the estimated size of a single row of data of the Hive file; the batch size is determined by the estimated size of a single row of data of the Hive file.
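The sizing rules of S104 are stated only qualitatively, so the following sketch fills in concrete formulas purely as assumptions: one reading thread per read path, a batch sized to fit a fixed byte budget, and processing threads capped both by the remaining-thread limit and by how many batches fit in a memory budget. The budgets and the names ExecutionPlan and plan_execution are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ExecutionPlan:
    read_threads: int
    process_threads: int
    batch_size: int  # rows per batch


def plan_execution(
    read_paths: Sequence[str],
    remaining_thread_limit: int,
    est_row_bytes: int,
    batch_budget_bytes: int = 1 << 20,    # assumed 1 MiB per batch
    memory_budget_bytes: int = 64 << 20,  # assumed 64 MiB for all batches
) -> ExecutionPlan:
    if est_row_bytes <= 0:
        raise ValueError("est_row_bytes must be positive")
    # Assumption: one reading thread per read path.
    read_threads = max(1, len(read_paths))
    # Larger rows give smaller batches, never fewer than one row.
    batch_size = max(1, batch_budget_bytes // est_row_bytes)
    # Processing threads are limited by the node's remaining threads
    # and by the number of batches that fit in memory at once.
    batches_in_memory = max(
        1, memory_budget_bytes // (batch_size * est_row_bytes)
    )
    process_threads = max(1, min(remaining_thread_limit, batches_in_memory))
    return ExecutionPlan(read_threads, process_threads, batch_size)
```

With these assumed budgets, two read paths, a remaining-thread limit of 8 and 1 KiB rows yield 2 reading threads, 8 processing threads and 1024 rows per batch.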
As shown in FIG. 2, the read path 14 of the Hive file is formed by concatenating the server node, the target table, the partition 11 and the bucket 13.
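The concatenation of node, table, partition and bucket into a read path can be sketched as follows. The hdfs:// scheme, the /user/hive/warehouse directory and the 000000_0 style bucket file names follow common Hive conventions but are assumptions here, since the embodiment does not give a concrete path layout; build_read_paths is a hypothetical helper.

```python
from typing import List


def build_read_paths(
    host: str,
    port: int,
    table: str,
    partition: str,
    bucket_count: int,
    warehouse: str = "/user/hive/warehouse",  # assumed warehouse root
) -> List[str]:
    """Build one read path per bucket: node + table + partition + bucket."""
    base = f"hdfs://{host}:{port}{warehouse}/{table}/{partition}"
    # Assumed bucket file naming: zero-padded bucket index plus "_0".
    return [f"{base}/{bucket:06d}_0" for bucket in range(bucket_count)]


paths = build_read_paths("10.0.0.1", 8020, "orders", "dt=2018-06-11", 2)
```

Each resulting path can then be handed to its own reading thread.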
S105: Assemble the read Hive file data according to the service grouping configuration. As shown in FIG. 2, groups are determined and grouped data 17 is assembled using the bucket 13 and the service grouping hash rule, according to the configuration in the data access table.
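A minimal sketch of the grouping in S105 is given below. The embodiment names a hash rule over the bucket and the service grouping but does not define it, so hashing the bucket id together with a business key using CRC32 is an assumption, as are the names assign_group and assemble_groups and the (bucket_id, business_key, payload) row shape.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def assign_group(bucket_id: int, business_key: str,
                 group_names: Sequence[str]) -> str:
    """Map a (bucket, business key) pair to a group name."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash().
    digest = zlib.crc32(f"{bucket_id}:{business_key}".encode("utf-8"))
    return group_names[digest % len(group_names)]


def assemble_groups(
    rows: Iterable[Tuple[int, str, Any]],
    group_names: Sequence[str],
) -> Dict[str, List[Any]]:
    """Collect row payloads into their service groups."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for bucket_id, business_key, payload in rows:
        groups[assign_group(bucket_id, business_key, group_names)].append(payload)
    return dict(groups)
```

Rows that share a bucket and business key always land in the same group, which keeps the later consumption step deterministic.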
S106: Determine the order of data consumption according to the priority. As shown in FIG. 2, the consumption priority 18 of the grouped data 17 is determined according to the configuration in the data access table.
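The priority ordering of S106 can be illustrated with a heap. The convention that a lower number means earlier consumption, the default priority for groups missing from the configuration, and the function name consumption_order are all assumptions of this sketch.

```python
import heapq
from typing import Any, Dict, Iterator, List, Tuple


def consumption_order(
    groups: Dict[str, List[Any]],
    priorities: Dict[str, int],
    default_priority: int = 100,  # assumed fallback for unlisted groups
) -> Iterator[Tuple[str, List[Any]]]:
    """Yield (group name, data) pairs, lowest priority number first."""
    heap = [(priorities.get(name, default_priority), name) for name in groups]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        yield name, groups[name]


order = [
    name
    for name, _ in consumption_order(
        {"a": [1], "b": [2], "c": [3]}, {"a": 2, "b": 1}
    )
]
```

Groups with equal priority are ordered by name, because the heap compares the group name as a tie-breaker.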
S107: Start data access, and handle any exception that occurs. During data access, if an exception occurs while accessing the Hive file data, exception information or breakpoint information is recorded, and access to the Hive file data is continued or stopped. Specifically, as shown in FIG. 2, an exception may occur in the grouped data 17, in the consumption priority 18, or in the data reading and writing steps. If an exception occurs, the read path 14 and the line number 16 of the Hive file are recorded, and data access continues with other data lines 15 or with other paths 14 of the file 12. In an alternative embodiment, data access may instead be suspended.
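The record-and-continue behaviour of S107 might look like the sketch below, which stores the read path and line number of each failing row and either moves on or stops. The Breakpoint record, the access_rows function and the stop_on_error flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Breakpoint:
    path: str         # read path of the Hive file
    line_number: int  # 1-based line number of the failing row
    error: str        # textual form of the exception


def access_rows(
    path: str,
    rows: Iterable[Any],
    handle: Callable[[Any], None],
    stop_on_error: bool = False,
) -> List[Breakpoint]:
    """Process rows, recording a breakpoint for every row that fails."""
    breakpoints: List[Breakpoint] = []
    for line_number, row in enumerate(rows, start=1):
        try:
            handle(row)
        except Exception as exc:
            breakpoints.append(Breakpoint(path, line_number, repr(exc)))
            if stop_on_error:
                break  # suspend access, as in the alternative embodiment
    return breakpoints
```

The returned breakpoints give a later run enough information to resume or to re-read only the failed lines.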
S108: Perform data consumption and additionally calculate the blocking interval. Specifically, data consumption includes: writing the data stream to files, data computation, data analysis and network transmission of data. The blocking interval is additionally calculated according to the computation scale and/or the network delay. After the data consumption, the parallel computing process idles until a specified time limit and then ends.
S109: Return to process the Hive file. Specifically, after all the reading threads and the processing threads have finished, the method returns to process the Hive file.
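Waiting for every reading thread and processing thread before returning, as S109 requires, can be sketched with two thread pools joined through a queue. The embodiment mentions a Java API; Python is used here only to keep the illustration short, and run_pipeline, read_batches and process_batch are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

_DONE = object()  # sentinel telling a processing thread to stop


def run_pipeline(read_paths, read_batches, process_batch, process_threads):
    """Read every path in parallel, process batches, return all results."""
    queue = Queue()
    results = []
    lock = threading.Lock()

    def reader(path):
        for batch in read_batches(path):
            queue.put(batch)

    def processor():
        while True:
            batch = queue.get()
            if batch is _DONE:
                return
            value = process_batch(batch)
            with lock:
                results.append(value)

    with ThreadPoolExecutor(max_workers=process_threads) as processors:
        proc_futures = [processors.submit(processor)
                        for _ in range(process_threads)]
        try:
            with ThreadPoolExecutor(max_workers=max(1, len(read_paths))) as readers:
                read_futures = [readers.submit(reader, p) for p in read_paths]
                for future in read_futures:
                    future.result()  # re-raise any reader error
        finally:
            # Always release the processing threads, even on reader failure.
            for _ in range(process_threads):
                queue.put(_DONE)
        for future in proc_futures:
            future.result()  # wait for, and surface errors from, processors
    return results
```

The function returns only after both pools have drained, so the caller sees the complete set of results or the first error raised by a thread.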
The second embodiment is as follows:
The embodiment of the invention also provides a Hive file reading and writing device, which is mainly used for executing the Hive file reading and writing method provided by the embodiment of the invention.
Fig. 3 is a schematic diagram of a Hive file reading/writing apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes a configuration module 21, a connection module 22, an execution module 23, and a data consumption module 24;
the configuration module 21 reads the data access table and acquires server information and parallelism information;
the connection module 22 is used for generating an execution program to connect the server where the Hive file is located according to the server information;
the execution module 23 is used for determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
and the data consumption module 24 accesses the Hive file data to perform data consumption.
Parts of the present invention that are not described in detail are known in the art.

Claims (9)

1. A Hive file reading and writing method is characterized by comprising the following steps:
reading a data access table, and acquiring server information and parallelism information;
generating an execution program to connect with a server where the Hive file is located according to the server information;
determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
accessing the Hive file data for data consumption,
the server information at least comprises one server node, and the server nodes are deployed in a centralized mode or a distributed mode; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.
2. The method of claim 1, comprising: after the data access table is read, if the server pointed to by the server information is unavailable and/or the server information contains errors, configuring and saving the data access table.
3. The method of claim 1, comprising: the number of the reading threads is determined by the reading path of the Hive file; the number of the processing threads is determined by the upper limit of remaining threads of the server nodes and the estimated size of a single row of data of the Hive file; the batch size is determined by the estimated size of a single row of data of the Hive file.
4. The method of claim 1, comprising: the data access table further comprises service grouping configuration; reading the service grouping configuration when reading the data access table; before accessing the Hive file data, the method also comprises assembling the read Hive file data according to the service grouping configuration.
5. The method of claim 1, comprising: the data access table further comprises a priority; reading the priority when reading the data access table; and before the data consumption, determining the order of the data consumption according to the priority.
6. The method according to claim 1, wherein the accessing Hive file data further comprises: if an exception occurs while accessing the Hive file data, recording exception information or breakpoint information, and continuing or stopping the access to the Hive file data.
7. The method of claim 1, wherein said data consumption further comprises: additionally calculating the blocking interval according to the computation scale and/or the network delay.
8. The method of claim 1, wherein after said data consumption, further comprising: returning to process the Hive file after all the reading threads and the processing threads have finished.
9. A Hive file reading and writing device is characterized by comprising:
the configuration module reads the data access table and acquires server information and parallelism information;
the connection module is used for generating an execution program to connect the server where the Hive file is located according to the server information;
the execution module is used for determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
the data consumption module accesses the Hive file data to consume the data,
the server information at least comprises one server node, and the server nodes are deployed in a centralized mode or a distributed mode; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810593791.1A CN109086293B (en) 2018-06-11 2018-06-11 Hive file reading and writing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810593791.1A CN109086293B (en) 2018-06-11 2018-06-11 Hive file reading and writing method and device

Publications (2)

Publication Number Publication Date
CN109086293A CN109086293A (en) 2018-12-25
CN109086293B CN109086293B (en) 2020-11-27

Family

ID=64839865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810593791.1A Active CN109086293B (en) 2018-06-11 2018-06-11 Hive file reading and writing method and device

Country Status (1)

Country Link
CN (1) CN109086293B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103188346A (en) * 2013-03-05 2013-07-03 北京航空航天大学 Distributed decision making supporting massive high-concurrency access I/O (Input/output) server load balancing system
CN103218210A (en) * 2013-04-28 2013-07-24 北京航空航天大学 File level partitioning system suitable for big data high concurrence access
CN104484447A (en) * 2014-12-22 2015-04-01 国云科技股份有限公司 Large-level text file processing system and running method thereof
US9514188B2 (en) * 2009-04-02 2016-12-06 Pivotal Software, Inc. Integrating map-reduce into a distributed relational database
US9594782B2 (en) * 2013-12-23 2017-03-14 Ic Manage, Inc. Hierarchical file block variant store apparatus and method of operation
CN106547614A (en) * 2016-11-01 2017-03-29 山东浪潮商用***有限公司 A kind of mass data based on message queue postpones deriving method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110119369A1 (en) * 2009-11-13 2011-05-19 International Business Machines,Corporation Monitoring computer system performance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Automatic Adjustment Approach of Thread Quantity to Optimize Resource Usage;Zou Lida et al;《The Open Automation and Control System Journal》;20141231;第296-301页 *
基于Storm的数据分析***设计与实现;孙朝华;《中国优秀硕士学位论文全文数据库 信息科技辑》;20150415;第2015年卷(第04期);第I138-623页 *

Also Published As

Publication number Publication date
CN109086293A (en) 2018-12-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant