CN109086293B - Hive file reading and writing method and device - Google Patents
Hive file reading and writing method and device
- Publication number
- CN109086293B · CN201810593791.1A
- Authority
- CN
- China
- Prior art keywords
- data
- hive file
- reading
- server
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention provides a Hive file reading and writing method and device. The method comprises the following steps: reading a data access table and acquiring server information and parallelism information; generating an execution program according to the server information to connect to the server where the Hive file is located; determining the number of reading threads, the number of processing threads, and the batch size for the Hive file according to the server information and the parallelism information; and accessing the Hive file data for data consumption. The method further comprises a service grouping configuration for Hive file data, assembling the read Hive file data, and determining the data consumption order according to the consumption priority configuration. Compared with traditional Hive file reading and writing methods, the technical scheme provided by the invention avoids the cumbersome compilation process and solves the problems of low reading efficiency and an uncontrollable reading process.
Description
Technical Field
The invention relates to a Hive file reading and writing method and device, and belongs to the field of Hadoop data warehouse applications.
Background
Hive is a data warehouse processing tool that wraps Hadoop at the bottom layer; data queries are implemented in the HQL language, and all Hive file data is stored in a Hadoop-compatible file system. In the prior art, a Hive partition is read using HQL: a Hive client must be started, the query is performed with a select statement that specifies the partition, and the data is returned through Hive's built-in MapReduce model. This scheme must access Hive metadata and occupies MapReduce computing resources; it cannot read data efficiently, cannot avoid the cumbersome HQL compilation process, and suffers from the technical problems that the read data blocks are large and the reading process cannot be controlled.
Disclosure of Invention
To alleviate the above technical problems, an object of the present invention is to provide a Hive file reading and writing method and device that achieve efficient, portable reading of Hive files with a controllable reading process.
In a first aspect, the present invention provides a Hive file reading and writing method, including: reading a data access table and acquiring server information and parallelism information; generating an execution program according to the server information to connect to the server where the Hive file is located; determining the number of reading threads, the number of processing threads, and the batch size for the Hive file according to the server information and the parallelism information; and accessing the Hive file data for data consumption.
Furthermore, the server information comprises at least one server node, and the server nodes are deployed in a centralized or distributed mode; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.
Further, after the data access table is read, if the server pointed to by the server information is unavailable and/or the server information contains errors, the data access table is configured and saved.
Furthermore, the number of reading threads is determined by the read paths of the Hive file; the number of processing threads is determined by the upper limit of remaining threads on the server node and the estimated size of a single line of Hive file data; the batch size is determined by the estimated size of a single line of Hive file data.
Further, the data access table also includes service grouping configuration; reading service grouping configuration when reading the data access table; before accessing the Hive file data, the method also comprises the step of assembling the read Hive file data according to the service grouping configuration.
Further, the data access table also comprises a priority; reading the priority when reading the data access table; before the data consumption, the method also comprises the step of determining the data consumption sequence according to the priority.
Further, if accessing the Hive file data raises an exception, the exception information or breakpoint information is recorded, and access to the Hive file data continues or stops.
Further, the data consumption further comprises: additionally calculating the blocking interval according to the computation scale and/or the network delay.
Further, after the data consumption, the method further comprises: returning to process the Hive file after all reading threads and processing threads have finished.
In a second aspect, the present invention further provides a Hive file reading and writing apparatus, including: a configuration module, which reads the data access table and acquires server information and parallelism information; a connection module, which generates an execution program according to the server information to connect to the server where the Hive file is located; an execution module, which determines the number of reading threads, the number of processing threads, and the batch size for the Hive file according to the server information and the parallelism information; and a data consumption module, which accesses the Hive file data for data consumption.
The invention has the advantages that system resources are allocated automatically according to the server information and parallelism information during Hive file reading and writing, which improves the reading and writing efficiency of the Hive file. In addition, the invention introduces the service grouping configuration and the priority setting, making the Hive file reading and writing process flexible and controllable.
Drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings required for their description are briefly introduced below. The drawings described below show one embodiment of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a Hive file reading and writing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating an effect of a Hive file reading and writing method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a Hive file reading/writing device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention.
Embodiment one:
Fig. 1 is a flowchart of a Hive file reading and writing method according to an embodiment of the present invention, and fig. 2 is an effect diagram of the method. As shown in figs. 1 and 2, this embodiment is implemented in the following nine steps.
S101: read the data access table. The data access table includes server information and parallelism information. In an optional embodiment, the server information includes: the server node, the Hive table, the Hive partition 11, the Hive bucket 13, and the CPU and memory of the server where the execution program runs. The parallelism information is determined by the network delay of the server nodes and refined by the routing condition of each server node.
Optionally, the data access table further includes a service grouping configuration; when the data access table is read, the service grouping configuration is read as well.
Optionally, the data access table further includes a priority, and when the data access table is read, the priority is read.
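The contents of the data access table described above (including the optional grouping configuration and priority) can be sketched as a plain record. This is a hypothetical model: the patent describes the table's contents but not a schema, so every field name below is an assumption for illustration.

```java
// Hypothetical model of one row of the data access table from S101.
// All field names are assumptions; the patent lists the contents (server node,
// Hive table/partition/bucket, parallelism, optional service grouping
// configuration and priority) without fixing a concrete schema.
public class DataAccessEntry {
    public final String serverNode;   // e.g. "10.0.0.1:10000" (illustrative)
    public final String hiveTable;    // target Hive table
    public final String partition;    // Hive partition 11
    public final String bucket;       // Hive bucket 13
    public final int parallelism;     // from network delay / per-node routing
    public final String groupConfig;  // optional service grouping configuration
    public final int priority;        // optional consumption priority

    public DataAccessEntry(String serverNode, String hiveTable, String partition,
                           String bucket, int parallelism,
                           String groupConfig, int priority) {
        this.serverNode = serverNode;
        this.hiveTable = hiveTable;
        this.partition = partition;
        this.bucket = bucket;
        this.parallelism = parallelism;
        this.groupConfig = groupConfig;
        this.priority = priority;
    }
}
```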
S102: judge whether the server is available; if not, configure and save the data access table. Specifically, after reading the data access table, if the server pointed to by the server information is unavailable and/or the server information contains errors, the data access table is configured and saved. In an optional embodiment, a visual configuration interface is entered, and the data access table is configured and saved so that a new data access table can be read for subsequent Hive file reading and writing.
Preferably, the server information, parallelism information, service grouping configuration, and priority in the data access table are configured.
S103: generate an execution program and connect to the server where the Hive file is located.
The execution program is generated to connect to the server where the Hive file is located according to the IP address, service port number, and other details of the server node. In an alternative embodiment, the execution program is generated using the Java API to connect to the server where the Hive file is located.
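As a hedged sketch of this step: assuming the execution program reaches the server through the standard HiveServer2 JDBC interface of the Java API (the patent names the Java API but not a specific interface), the connection could be generated as follows. The host, port, and database values are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch of S103: build a connection to the server where the Hive file is
// located from the server node's IP address and service port.
public class HiveConnector {
    // Standard HiveServer2 JDBC URL form: jdbc:hive2://host:port/database
    public static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Opening the connection requires the Hive JDBC driver on the classpath;
    // this is the part that actually touches the cluster.
    public static Connection connect(String host, int port, String database)
            throws SQLException {
        return DriverManager.getConnection(jdbcUrl(host, port, database));
    }
}
```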
S104: determine the number of reading threads, the number of processing threads, and the batch size for the Hive file. Specifically, the number of reading threads is determined by the read paths of the Hive file; the number of processing threads is determined by the upper limit of remaining threads on the server node and the estimated size of a single line of Hive file data; the batch size is determined by the estimated size of a single line of Hive file data.
As shown in fig. 2, the read path 14 of the Hive file is spliced together from the server node (node), the target table, the partition 11, and the bucket 13.
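The sizing rules of S104 and the read-path splicing above can be sketched together. The patent fixes only the dependencies (read paths determine reading threads; the remaining thread limit and estimated single-line size determine processing threads and batch size); the concrete constants and the '/' separator below are assumptions, not the claimed formulas.

```java
// Sketch of S104. The relationships follow the text; the constants
// (4 MiB batch target, 1 KiB "wide row" threshold) are assumed.
public class SizingPlan {
    // One reading thread per Hive file read path.
    public static int readThreads(int readPathCount) {
        return readPathCount;
    }

    // Processing threads bounded by the node's remaining thread limit,
    // halved (assumed heuristic) when a single line is estimated to be wide.
    public static int processThreads(int remainingThreadLimit, int estRowBytes) {
        return estRowBytes > 1024 ? Math.max(1, remainingThreadLimit / 2)
                                  : remainingThreadLimit;
    }

    // Batch size chosen so one batch stays near an assumed 4 MiB buffer.
    public static int batchSize(int estRowBytes) {
        return Math.max(1, (4 * 1024 * 1024) / estRowBytes);
    }

    // Read path 14 from fig. 2: node + target table + partition 11 + bucket 13,
    // spliced with an assumed '/' separator.
    public static String readPath(String node, String table,
                                  String partition, String bucket) {
        return String.join("/", node, table, partition, bucket);
    }
}
```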
S105: assemble the read Hive file data according to the service grouping configuration. As shown in fig. 2, according to the configuration in the data access table, groups are determined and group data 17 is assembled using the bucket 13 and the service grouping hash rule.
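A minimal sketch of this assembly step, using a modulo hash over the bucket id as a stand-in for the configured service grouping hash rule (the actual rule comes from the data access table):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of S105: assemble the rows read per bucket 13 into group data 17.
// Math.floorMod over the bucket id's hash stands in for the configured rule.
public class GroupAssembler {
    public static Map<Integer, List<String>> assemble(
            Map<String, List<String>> rowsByBucket, int groupCount) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, List<String>> e : rowsByBucket.entrySet()) {
            int g = Math.floorMod(e.getKey().hashCode(), groupCount);
            groups.computeIfAbsent(g, k -> new ArrayList<>())
                  .addAll(e.getValue());
        }
        return groups;
    }
}
```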
S106: determine the order of data consumption according to the priority. As shown in fig. 2, the consumption priority 18 of the group data 17 is determined according to the configuration in the data access table.
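This ordering step can be sketched as a simple sort by the configured priority; the convention that a smaller number is consumed first is an assumption:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of S106: order group data 17 for consumption by its priority 18
// from the data access table (assumed: lower number consumed first).
public class PriorityOrder {
    public static List<String> consumptionOrder(Map<String, Integer> priorityByGroup) {
        List<String> order = new ArrayList<>(priorityByGroup.keySet());
        order.sort(Comparator.comparingInt(priorityByGroup::get));
        return order;
    }
}
```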
S107: start data access and handle exceptions if they occur. During data access, if accessing the Hive file data raises an exception, the exception information or breakpoint information is recorded, and access continues or stops. Specifically, as shown in fig. 2, exceptions may occur in the group data 17, the consumption priority 18, and the data read/write links. If an exception occurs, the read path 14 and line number 16 of the Hive file are recorded, and data access continues with other data lines 15 of the file 12 or other paths 14. In an alternative embodiment, data access may instead be suspended.
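The breakpoint recording in this step can be sketched as follows; the class and method names are illustrative assumptions, not the patent's terminology:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of S107's breakpoint recording. When reading a data line fails, the
// read path 14 and line number 16 are recorded so access can later resume
// from the breakpoint or be suspended.
public class BreakpointLog {
    public static final class Breakpoint {
        public final String readPath;
        public final long lineNumber;
        public Breakpoint(String readPath, long lineNumber) {
            this.readPath = readPath;
            this.lineNumber = lineNumber;
        }
    }

    private final Deque<Breakpoint> records = new ArrayDeque<>();

    // Called from the reading thread's catch block.
    public void record(String readPath, long lineNumber) {
        records.push(new Breakpoint(readPath, lineNumber));
    }

    // Most recent breakpoint: where to resume, or null if none was recorded.
    public Breakpoint last() {
        return records.peek();
    }
}
```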
S108: perform data consumption and supplement the blocking-interval calculation. Specifically, data consumption includes: writing data streams to files, data computation, data analysis, and data network transmission. The blocking interval is additionally calculated according to the computation scale and/or the network delay. After data consumption, the parallel computing process idles until a specified time limit and then ends.
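The supplementary blocking-interval calculation can be sketched as below. The patent states only that the interval follows from the computation scale and/or the network delay; the linear form and the per-unit cost constant are assumptions for illustration.

```java
// Sketch of S108's supplementary blocking-interval calculation.
// The linear combination and the constant below are assumed, not claimed.
public class BlockingInterval {
    private static final long COST_PER_UNIT_MS = 2; // assumed per-unit cost

    // Interval grows with per-batch compute cost and round-trip delay.
    public static long intervalMs(long computeScale, long networkDelayMs) {
        return computeScale * COST_PER_UNIT_MS + networkDelayMs;
    }
}
```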
S109: return to process the Hive file. Specifically, after all reading threads and processing threads have finished, control returns to process the Hive file.
Embodiment two:
the embodiment of the invention also provides a Hive file reading and writing device which is mainly used for executing the Hive file reading and writing method provided by the embodiment of the invention.
Fig. 3 is a schematic diagram of a Hive file reading/writing apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes a configuration module 21, a connection module 22, an execution module 23, and a data consumption module 24;
the configuration module 21 reads the data access table and acquires server information and parallelism information;
the connection module 22 is used for generating an execution program to connect the server where the Hive file is located according to the server information;
the execution module 23 is used for determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
and the data consumption module 24 accesses the Hive file data to perform data consumption.
Parts of the present invention that are well known in the art are not described in detail.
Claims (9)
1. A Hive file reading and writing method is characterized by comprising the following steps:
reading a data access table, and acquiring server information and parallelism information;
generating an executive program to connect with a server where the Hive file is located according to the server information;
determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
accessing the Hive file data for data consumption,
the server information comprises at least one server node, and the server nodes are deployed in a centralized or distributed mode; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.
2. The method of claim 1, comprising: after the data access table is read, if the server pointed to by the server information is unavailable and/or the server information contains errors, configuring and saving the data access table.
3. The method of claim 1, comprising: the number of the reading threads is determined by the reading path of the Hive file; the number of the processing threads is determined by the upper limit of the residual threads of the server nodes and the estimated size of single-line data of the Hive file; the batch size is determined by the estimated size of single-line data of the Hive file.
4. The method of claim 1, comprising: the data access table further comprises service grouping configuration; reading the service grouping configuration when reading the data access table; before accessing the Hive file data, the method also comprises assembling the read Hive file data according to the service grouping configuration.
5. The method of claim 1, comprising: the data access table further comprises a priority; reading the priority when reading the data access table; and before the data consumption, determining the order of the data consumption according to the priority.
6. The method according to claim 1, wherein the accessing of the Hive file data further comprises: if accessing the Hive file data raises an exception, recording the exception information or breakpoint information, and continuing or stopping access to the Hive file data.
7. The method of claim 1, wherein said data consuming further comprises: additionally calculating the blocking interval according to the computation scale and/or the network delay.
8. The method of claim 1, wherein after said data consumption, further comprising: and returning to process the Hive file after all the reading threads and the processing threads are finished.
9. A Hive file reading and writing device is characterized by comprising:
the configuration module reads the data access table and acquires server information and parallelism information;
the connection module is used for generating an executive program to connect the server where the Hive file is located according to the server information;
the execution module is used for determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
the data consumption module accesses the Hive file data to consume the data,
the server information comprises at least one server node, and the server nodes are deployed in a centralized or distributed mode; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810593791.1A CN109086293B (en) | 2018-06-11 | 2018-06-11 | Hive file reading and writing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109086293A CN109086293A (en) | 2018-12-25 |
CN109086293B true CN109086293B (en) | 2020-11-27 |
Family
ID=64839865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810593791.1A Active CN109086293B (en) | 2018-06-11 | 2018-06-11 | Hive file reading and writing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109086293B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103188346A (en) * | 2013-03-05 | 2013-07-03 | 北京航空航天大学 | Distributed decision making supporting massive high-concurrency access I/O (Input/output) server load balancing system |
CN103218210A (en) * | 2013-04-28 | 2013-07-24 | 北京航空航天大学 | File level partitioning system suitable for big data high concurrence access |
CN104484447A (en) * | 2014-12-22 | 2015-04-01 | 国云科技股份有限公司 | Large-level text file processing system and running method thereof |
US9514188B2 (en) * | 2009-04-02 | 2016-12-06 | Pivotal Software, Inc. | Integrating map-reduce into a distributed relational database |
US9594782B2 (en) * | 2013-12-23 | 2017-03-14 | Ic Manage, Inc. | Hierarchical file block variant store apparatus and method of operation |
CN106547614A (A) * | 2016-11-01 | 2017-03-29 | 山东浪潮商用***有限公司 | A message-queue-based delayed export method for mass data
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110119369A1 (en) * | 2009-11-13 | 2011-05-19 | International Business Machines Corporation | Monitoring computer system performance
- 2018: 2018-06-11, CN application CN201810593791.1A, patent CN109086293B (en), status Active
Non-Patent Citations (2)
Title |
---|
An Automatic Adjustment Approach of Thread Quantity to Optimize Resource Usage; Zou Lida et al.; The Open Automation and Control Systems Journal; 2014-12-31; pp. 296-301 *
Design and Implementation of a Storm-based Data Analysis ***; Sun Zhaohua; China Master's Theses Full-text Database, Information Science and Technology; 2015-04-15; Vol. 2015, No. 04; pp. I138-623 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||