CN109086293B - Hive file reading and writing method and device - Google Patents
Hive file reading and writing method and device
- Publication number
- CN109086293B · CN201810593791.1A
- Authority
- CN
- China
- Prior art keywords
- data
- hive file
- reading
- server
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention provides a Hive file reading and writing method and device. The method comprises the following steps: reading a data access table and acquiring server information and parallelism information; generating an execution program according to the server information to connect to the server where the Hive file is located; determining the number of reading threads, the number of processing threads, and the batch size for the Hive file according to the server information and the parallelism information; and accessing the Hive file data for data consumption. The method further comprises a service grouping configuration for Hive file data, assembling the read Hive file data, and determining the data consumption order according to the consumption priority configuration. Compared with traditional Hive file reading and writing methods, the technical scheme provided by the invention avoids the cumbersome compilation process and solves the problems of low reading efficiency and an uncontrollable reading process.
Description
Technical Field
The invention relates to a Hive file reading and writing method and device, and belongs to the field of Hadoop data warehouse applications.
Background
Hive is a data warehouse processing tool that wraps Hadoop at the bottom layer; data queries are implemented in the HQL language, and all Hive file data is stored in a Hadoop-compatible file system. In the prior art, a Hive partition is read using HQL: a Hive client must be started, the query is performed with a select statement that specifies the partition, and the data is returned through Hive's built-in MapReduce model. This scheme must access Hive metadata and occupies MapReduce computing resources; it cannot read data efficiently, cannot avoid the cumbersome HQL compilation process, and suffers from the technical problems that the read data blocks are large and the reading process cannot be controlled.
Disclosure of Invention
To alleviate the above technical problems, an object of the present invention is to provide a Hive file reading and writing method and device that achieve efficient, portable reading of Hive files with a controllable reading process.
In a first aspect, the present invention provides a Hive file reading and writing method, including: reading a data access table and acquiring server information and parallelism information; generating an execution program according to the server information to connect to the server where the Hive file is located; determining the number of reading threads, the number of processing threads, and the batch size for the Hive file according to the server information and the parallelism information; and accessing the Hive file data for data consumption.
Furthermore, the server information comprises at least one server node, and the server nodes are deployed in a centralized or distributed mode; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.
Further, after the data access table is read, if the server pointed to by the server information is unavailable and/or the server information contains errors, the data access table is configured and saved.
Furthermore, the number of reading threads is determined by the read paths of the Hive file; the number of processing threads is determined by the upper limit of remaining threads on the server node and the estimated size of a single line of Hive file data; the batch size is determined by the estimated size of a single line of Hive file data.
Further, the data access table also includes service grouping configuration; reading service grouping configuration when reading the data access table; before accessing the Hive file data, the method also comprises the step of assembling the read Hive file data according to the service grouping configuration.
Further, the data access table also comprises a priority; reading the priority when reading the data access table; before the data consumption, the method also comprises the step of determining the data consumption sequence according to the priority.
Further, if accessing the Hive file data raises an exception, the exception information or breakpoint information is recorded, and access to the Hive file data continues or stops.
Further, the data consumption further comprises: additionally calculating the blocking interval according to the computation scale and/or the network delay.
Further, after the data consumption, the method further comprises: returning to process the Hive file after all reading threads and processing threads have finished.
In a second aspect, the present invention further provides a Hive file reading and writing apparatus, including: a configuration module, which reads the data access table and acquires server information and parallelism information; a connection module, which generates an execution program according to the server information to connect to the server where the Hive file is located; an execution module, which determines the number of reading threads, the number of processing threads, and the batch size for the Hive file according to the server information and the parallelism information; and a data consumption module, which accesses the Hive file data for data consumption.
The invention has the advantages that system resources are allocated automatically according to the server information and parallelism information during Hive file reading and writing, which improves the reading and writing efficiency of the Hive file. In addition, the invention introduces the service grouping configuration and the priority setting, making the Hive file reading and writing process flexible and controllable.
Drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the drawings required for their description are briefly introduced below. The drawings described below show one embodiment of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a Hive file reading and writing method according to an embodiment of the present invention;
fig. 2 is a schematic diagram illustrating an effect of a Hive file reading and writing method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a Hive file reading/writing device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention.
Embodiment one:
Fig. 1 is a flowchart of a Hive file reading and writing method according to an embodiment of the present invention, and fig. 2 is an effect diagram of the method. As shown in figs. 1 and 2, this embodiment is implemented in the following nine steps.
S101: read the data access table. The data access table includes server information and parallelism information. In an optional embodiment, the server information includes: the server node, the Hive table, the Hive partition 11, the Hive bucket 13, and the CPU and memory of the server where the execution program runs. The parallelism information is determined by the network delay of the server nodes and refined by the routing condition of each server node.
Optionally, the data access table further includes a service grouping configuration; when the data access table is read, the service grouping configuration is read as well.
Optionally, the data access table further includes a priority, and when the data access table is read, the priority is read.
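The contents of the data access table described above (including the optional grouping configuration and priority) can be sketched as a plain record. This is a hypothetical model: the patent describes the table's contents but not a schema, so every field name below is an assumption for illustration.

```java
// Hypothetical model of one row of the data access table from S101.
// All field names are assumptions; the patent lists the contents (server node,
// Hive table/partition/bucket, parallelism, optional service grouping
// configuration and priority) without fixing a concrete schema.
public class DataAccessEntry {
    public final String serverNode;   // e.g. "10.0.0.1:10000" (illustrative)
    public final String hiveTable;    // target Hive table
    public final String partition;    // Hive partition 11
    public final String bucket;       // Hive bucket 13
    public final int parallelism;     // from network delay / per-node routing
    public final String groupConfig;  // optional service grouping configuration
    public final int priority;        // optional consumption priority

    public DataAccessEntry(String serverNode, String hiveTable, String partition,
                           String bucket, int parallelism,
                           String groupConfig, int priority) {
        this.serverNode = serverNode;
        this.hiveTable = hiveTable;
        this.partition = partition;
        this.bucket = bucket;
        this.parallelism = parallelism;
        this.groupConfig = groupConfig;
        this.priority = priority;
    }
}
```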
S102: judge whether the server is available; if not, configure and save the data access table. Specifically, after reading the data access table, if the server pointed to by the server information is unavailable and/or the server information contains errors, the data access table is configured and saved. In an optional embodiment, a visual configuration interface is entered, and the data access table is configured and saved so that a new data access table can be read for subsequent Hive file reading and writing.
Preferably, the server information, parallelism information, service grouping configuration, and priority in the data access table are configured.
S103: generate an execution program and connect to the server where the Hive file is located.
The execution program is generated to connect to the server where the Hive file is located according to the IP address, service port number, and other details of the server node. In an alternative embodiment, the execution program is generated using the Java API to connect to the server where the Hive file is located.
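As a hedged sketch of this step: assuming the execution program reaches the server through the standard HiveServer2 JDBC interface of the Java API (the patent names the Java API but not a specific interface), the connection could be generated as follows. The host, port, and database values are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch of S103: build a connection to the server where the Hive file is
// located from the server node's IP address and service port.
public class HiveConnector {
    // Standard HiveServer2 JDBC URL form: jdbc:hive2://host:port/database
    public static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Opening the connection requires the Hive JDBC driver on the classpath;
    // this is the part that actually touches the cluster.
    public static Connection connect(String host, int port, String database)
            throws SQLException {
        return DriverManager.getConnection(jdbcUrl(host, port, database));
    }
}
```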
S104: determine the number of reading threads, the number of processing threads, and the batch size for the Hive file. Specifically, the number of reading threads is determined by the read paths of the Hive file; the number of processing threads is determined by the upper limit of remaining threads on the server node and the estimated size of a single line of Hive file data; the batch size is determined by the estimated size of a single line of Hive file data.
As shown in fig. 2, the read path 14 of the Hive file is spliced together from the server node (node), the target table, the partition 11, and the bucket 13.
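The sizing rules of S104 and the read-path splicing above can be sketched together. The patent fixes only the dependencies (read paths determine reading threads; the remaining thread limit and estimated single-line size determine processing threads and batch size); the concrete constants and the '/' separator below are assumptions, not the claimed formulas.

```java
// Sketch of S104. The relationships follow the text; the constants
// (4 MiB batch target, 1 KiB "wide row" threshold) are assumed.
public class SizingPlan {
    // One reading thread per Hive file read path.
    public static int readThreads(int readPathCount) {
        return readPathCount;
    }

    // Processing threads bounded by the node's remaining thread limit,
    // halved (assumed heuristic) when a single line is estimated to be wide.
    public static int processThreads(int remainingThreadLimit, int estRowBytes) {
        return estRowBytes > 1024 ? Math.max(1, remainingThreadLimit / 2)
                                  : remainingThreadLimit;
    }

    // Batch size chosen so one batch stays near an assumed 4 MiB buffer.
    public static int batchSize(int estRowBytes) {
        return Math.max(1, (4 * 1024 * 1024) / estRowBytes);
    }

    // Read path 14 from fig. 2: node + target table + partition 11 + bucket 13,
    // spliced with an assumed '/' separator.
    public static String readPath(String node, String table,
                                  String partition, String bucket) {
        return String.join("/", node, table, partition, bucket);
    }
}
```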
S105: assemble the read Hive file data according to the service grouping configuration. As shown in fig. 2, according to the configuration in the data access table, groups are determined and group data 17 is assembled using the bucket 13 and the service grouping hash rule.
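A minimal sketch of this assembly step, using a modulo hash over the bucket id as a stand-in for the configured service grouping hash rule (the actual rule comes from the data access table):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of S105: assemble the rows read per bucket 13 into group data 17.
// Math.floorMod over the bucket id's hash stands in for the configured rule.
public class GroupAssembler {
    public static Map<Integer, List<String>> assemble(
            Map<String, List<String>> rowsByBucket, int groupCount) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (Map.Entry<String, List<String>> e : rowsByBucket.entrySet()) {
            int g = Math.floorMod(e.getKey().hashCode(), groupCount);
            groups.computeIfAbsent(g, k -> new ArrayList<>())
                  .addAll(e.getValue());
        }
        return groups;
    }
}
```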
S106: determine the order of data consumption according to the priority. As shown in fig. 2, the consumption priority 18 of the group data 17 is determined according to the configuration in the data access table.
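This ordering step can be sketched as a simple sort by the configured priority; the convention that a smaller number is consumed first is an assumption:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of S106: order group data 17 for consumption by its priority 18
// from the data access table (assumed: lower number consumed first).
public class PriorityOrder {
    public static List<String> consumptionOrder(Map<String, Integer> priorityByGroup) {
        List<String> order = new ArrayList<>(priorityByGroup.keySet());
        order.sort(Comparator.comparingInt(priorityByGroup::get));
        return order;
    }
}
```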
S107: start data access and handle exceptions if they occur. During data access, if accessing the Hive file data raises an exception, the exception information or breakpoint information is recorded, and access continues or stops. Specifically, as shown in fig. 2, exceptions may occur in the group data 17, the consumption priority 18, and the data read/write links. If an exception occurs, the read path 14 and line number 16 of the Hive file are recorded, and data access continues with other data lines 15 of the file 12 or other paths 14. In an alternative embodiment, data access may instead be suspended.
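The breakpoint recording in this step can be sketched as follows; the class and method names are illustrative assumptions, not the patent's terminology:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of S107's breakpoint recording. When reading a data line fails, the
// read path 14 and line number 16 are recorded so access can later resume
// from the breakpoint or be suspended.
public class BreakpointLog {
    public static final class Breakpoint {
        public final String readPath;
        public final long lineNumber;
        public Breakpoint(String readPath, long lineNumber) {
            this.readPath = readPath;
            this.lineNumber = lineNumber;
        }
    }

    private final Deque<Breakpoint> records = new ArrayDeque<>();

    // Called from the reading thread's catch block.
    public void record(String readPath, long lineNumber) {
        records.push(new Breakpoint(readPath, lineNumber));
    }

    // Most recent breakpoint: where to resume, or null if none was recorded.
    public Breakpoint last() {
        return records.peek();
    }
}
```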
S108: perform data consumption and supplement the blocking-interval calculation. Specifically, data consumption includes: writing data streams to files, data computation, data analysis, and data network transmission. The blocking interval is additionally calculated according to the computation scale and/or the network delay. After data consumption, the parallel computing process idles until a specified time limit and then ends.
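The supplementary blocking-interval calculation can be sketched as below. The patent states only that the interval follows from the computation scale and/or the network delay; the linear form and the per-unit cost constant are assumptions for illustration.

```java
// Sketch of S108's supplementary blocking-interval calculation.
// The linear combination and the constant below are assumed, not claimed.
public class BlockingInterval {
    private static final long COST_PER_UNIT_MS = 2; // assumed per-unit cost

    // Interval grows with per-batch compute cost and round-trip delay.
    public static long intervalMs(long computeScale, long networkDelayMs) {
        return computeScale * COST_PER_UNIT_MS + networkDelayMs;
    }
}
```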
S109: return to process the Hive file. Specifically, after all reading threads and processing threads have finished, control returns to process the Hive file.
Embodiment two:
the embodiment of the invention also provides a Hive file reading and writing device which is mainly used for executing the Hive file reading and writing method provided by the embodiment of the invention.
Fig. 3 is a schematic diagram of a Hive file reading/writing apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes a configuration module 21, a connection module 22, an execution module 23, and a data consumption module 24;
the configuration module 21 reads the data access table and acquires server information and parallelism information;
the connection module 22 is used for generating an execution program to connect the server where the Hive file is located according to the server information;
the execution module 23 is used for determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
and the data consumption module 24 accesses the Hive file data to perform data consumption.
Parts of the present invention that are well known in the art are not described in detail.
Claims (9)
1. A Hive file reading and writing method is characterized by comprising the following steps:
reading a data access table, and acquiring server information and parallelism information;
generating an executive program to connect with a server where the Hive file is located according to the server information;
determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
accessing the Hive file data for data consumption,
the server information comprises at least one server node, and the server nodes are deployed in a centralized or distributed mode; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.
2. The method of claim 1, comprising: after the data access table is read, if the server pointed to by the server information is unavailable and/or the server information contains errors, configuring and saving the data access table.
3. The method of claim 1, comprising: the number of the reading threads is determined by the reading path of the Hive file; the number of the processing threads is determined by the upper limit of the residual threads of the server nodes and the estimated size of single-line data of the Hive file; the batch size is determined by the estimated size of single-line data of the Hive file.
4. The method of claim 1, comprising: the data access table further comprises service grouping configuration; reading the service grouping configuration when reading the data access table; before accessing the Hive file data, the method also comprises assembling the read Hive file data according to the service grouping configuration.
5. The method of claim 1, comprising: the data access table further comprises a priority; reading the priority when reading the data access table; and before the data consumption, determining the order of the data consumption according to the priority.
6. The method according to claim 1, wherein the accessing of the Hive file data further comprises: if accessing the Hive file data raises an exception, recording the exception information or breakpoint information, and continuing or stopping access to the Hive file data.
7. The method of claim 1, wherein said data consuming further comprises: additionally calculating the blocking interval according to the computation scale and/or the network delay.
8. The method of claim 1, wherein after said data consumption, further comprising: and returning to process the Hive file after all the reading threads and the processing threads are finished.
9. A Hive file reading and writing device is characterized by comprising:
the configuration module reads the data access table and acquires server information and parallelism information;
the connection module is used for generating an executive program to connect the server where the Hive file is located according to the server information;
the execution module is used for determining the number of reading threads, the number of processing threads and the batch size of the Hive file according to the server information and the parallelism information;
the data consumption module accesses the Hive file data to consume the data,
the server information comprises at least one server node, and the server nodes are deployed in a centralized or distributed mode; the Hive file contains at least one table; the parallelism information is determined by the network delay of the server nodes and/or the routing condition of each server node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810593791.1A CN109086293B (en) | 2018-06-11 | 2018-06-11 | Hive file reading and writing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109086293A CN109086293A (en) | 2018-12-25 |
CN109086293B true CN109086293B (en) | 2020-11-27 |
Family
ID=64839865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810593791.1A Active CN109086293B (en) | 2018-06-11 | 2018-06-11 | Hive file reading and writing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109086293B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103188346A (en) * | 2013-03-05 | 2013-07-03 | 北京航空航天大学 | Distributed decision making supporting massive high-concurrency access I/O (Input/output) server load balancing system |
CN103218210A (en) * | 2013-04-28 | 2013-07-24 | 北京航空航天大学 | File level partitioning system suitable for big data high concurrence access |
CN104484447A (en) * | 2014-12-22 | 2015-04-01 | 国云科技股份有限公司 | Large-level text file processing system and running method thereof |
US9514188B2 (en) * | 2009-04-02 | 2016-12-06 | Pivotal Software, Inc. | Integrating map-reduce into a distributed relational database |
US9594782B2 (en) * | 2013-12-23 | 2017-03-14 | Ic Manage, Inc. | Hierarchical file block variant store apparatus and method of operation |
CN106547614A (A) * | 2016-11-01 | 2017-03-29 | 山东浪潮商用***有限公司 | A message-queue-based delayed export method for mass data
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110119369A1 (en) * | 2009-11-13 | 2011-05-19 | International Business Machines Corporation | Monitoring computer system performance
- 2018: 2018-06-11, CN application CN201810593791.1A, patent CN109086293B (en), status Active
Non-Patent Citations (2)
Title |
---|
An Automatic Adjustment Approach of Thread Quantity to Optimize Resource Usage; Zou Lida et al.; The Open Automation and Control Systems Journal; 2014-12-31; pp. 296-301 *
Design and Implementation of a Storm-based Data Analysis ***; Sun Zhaohua; China Master's Theses Full-text Database, Information Science and Technology; 2015-04-15; Vol. 2015, No. 04; pp. I138-623 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||