CN104408147A

CN104408147A - Multithreading data uploading method

Info

Publication number: CN104408147A
Application number: CN201410722793.8A
Authority: CN
Inventors: 金洪殿; 辛国茂; 刘伟; 卢军佐
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-12-02
Filing date: 2014-12-02
Publication date: 2015-03-11

Abstract

The invention provides a multithreading data uploading method. The multithreading data uploading method comprises the steps of configuring the information of a source path of a file needing to be uploaded, the information of a destination path for uploading the file to an HDFS system and the information of the number of available threads, determining the data range needing to be processed by each thread according to the data size of the file needing to be uploaded and the configured information of the thread number, and performing multithreading parallel data uploading based on the configured information and the determined data range. The multithreading data uploading method is capable of dividing a large text file into a plurality of files for uploading to the HDFS system, and therefore, the writing speed is increased and the file uploading time is greatly reduced.

Description

A multithreaded data uploading method

Technical field

The present invention relates to the technical field of information storage, and in particular to a multithreaded data uploading method.

Background art

With human society fully entering the information age, data has become a strategic resource as important as water and oil. Mining massive data allows the operational decisions of governments and enterprises to rest on a more scientific basis, improving decision-making efficiency, crisis response capability, and the level of public service. Big data, also called massive data, refers to data sets so large that current mainstream software tools cannot capture, manage, process, and organize them within a reasonable time into information that supports enterprise management decisions.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is suited to deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. Programs that run on HDFS have very large data sets. A typical HDFS file is on the order of terabytes in size, so HDFS is tuned to support large files: it should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support millions of files in a single cluster.

Files on HDFS come from many sources, and existing files on a file server (for example, NFS) are an important one. For example, in a banking system the data tables exported from a database may produce an incremental file on the file server every day; before these files can be mined and analyzed, they must first be uploaded to HDFS. Some of the files required for analysis can be very large. The traditional approach uploads a file from a single machine. On the one hand, the bandwidth of the file server is not fully used; on the other hand, the data nodes of HDFS are not fully used. Uploading massive data this way often takes too long to be practical. A new scheme is therefore needed that makes full use of the file server's bandwidth and improves file upload efficiency.

Summary of the invention

The present invention provides a multithreaded data uploading method that takes full account of the characteristics of HDFS, makes full use of resources (bandwidth, disk I/O, and so on), greatly improves the efficiency of uploading massive data, and guarantees file line atomicity, so that no line of the file is split between two uploaded parts. The method comprises:

S1: configuring the source path of the file to be uploaded, the destination path in the HDFS system to which the file is to be uploaded, and the number of threads that may be used;

S2: determining the data range that each thread needs to process, according to the data size of the file to be uploaded and the configured number of threads;

S3: performing multithreaded parallel data upload based on the information configured in step S1 and the data ranges determined in step S2.

In particular:

The data range in step S2 comprises the start offset and the end offset of the file data that each thread needs to upload.
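As an illustration only, the range determination of step S2 might be sketched as follows in Python. The function name `compute_ranges` is hypothetical, and two details are assumptions rather than statements of the disclosure: the end offset is treated as exclusive, and the last thread absorbs the remainder when the file size is not evenly divisible by the number of threads.

```python
from typing import List, Tuple


def compute_ranges(file_size: int, num_threads: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into num_threads contiguous (start, end) byte ranges."""
    if file_size < 0 or num_threads < 1:
        raise ValueError("file_size must be >= 0 and num_threads >= 1")
    chunk = file_size // num_threads  # default share: file size / thread count
    ranges = []
    for i in range(num_threads):
        start = i * chunk
        # Assumption: the last thread takes whatever the integer division left over.
        end = file_size if i == num_threads - 1 else start + chunk
        ranges.append((start, end))
    return ranges


print(compute_ranges(100, 3))  # [(0, 33), (33, 66), (66, 100)]
```

These raw offsets generally fall in the middle of a line, which is why each thread later adjusts its own start and end to line boundaries.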

In particular:

The multithreaded parallel data upload specifically comprises the following steps:

S31: the thread first determines whether its upload start offset is 0; if so, step S32 is performed, otherwise step S33 is performed;

S32: the thread uploads the data from the start offset to the end offset to the HDFS system, and then performs step S34;

S33: the thread reads forward byte by byte from the start offset until the byte read is a newline, and uploads the data from after that newline up to the end offset to the HDFS system;

S34: the thread reads forward byte by byte from the end offset and uploads each byte, until the byte read is a newline, at which point the flow ends.
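The boundary adjustment in steps S31 to S34 can be sketched, under stated assumptions, as a pure function over an in-memory byte string. The names `next_line_start` and `line_aligned_span` are hypothetical. The sketch assumes that the newline ending a thread's last line is uploaded by that thread, and that a thread's end offset equals the next thread's start offset, so both threads stop at the same newline.

```python
def next_line_start(data: bytes, pos: int) -> int:
    """Return the index just past the first newline at or after pos (or len(data))."""
    i = pos
    while i < len(data) and data[i:i + 1] != b"\n":
        i += 1  # byte-by-byte scan, as in steps S33 and S34
    return min(i + 1, len(data))


def line_aligned_span(data: bytes, start: int, end: int) -> tuple:
    """Map a raw (start, end) range to the line-aligned slice a thread uploads."""
    # S31/S32: a thread whose start offset is 0 begins at the start of the file.
    # S33: any other thread skips forward past the first newline it reads.
    begin = 0 if start == 0 else next_line_start(data, start)
    # S34: continue from the end offset until a newline has been read.
    finish = next_line_start(data, end)
    return begin, finish


data = b"alpha\nbravo\ncharlie\ndelta\n"   # 26 bytes, 4 lines
first = line_aligned_span(data, 0, 13)    # raw range ends inside "charlie"
second = line_aligned_span(data, 13, 26)  # raw range starts inside "charlie"
print(first, second)                      # (0, 20) (20, 26)
```

Because both adjustments use the same forward scan, the point where one thread finishes is the point where the next one begins, so the parts neither overlap nor leave gaps.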

The beneficial effect of the invention is that a large text file is divided into multiple files that are uploaded to the HDFS system in parallel, which increases the write speed and greatly reduces the file upload time.

Brief description of the drawings

Figure 1 is a flowchart of the multithreaded data uploading method proposed by the present invention.

Figure 2 is a flowchart of the multithreaded data uploading method proposed by the present invention that guarantees HDFS file line atomicity.

Detailed description of the embodiments

The multithreaded data uploading method proposed by the present invention, which guarantees HDFS file line atomicity, is described in detail below with reference to the accompanying drawings.

The present invention is mainly concerned with uploading data in parallel while guaranteeing line atomicity, so as to make full use of network I/O and system resources. By default, the amount of data each thread uploads is: file size / total number of threads. Before a thread starts uploading, it checks its start offset. If the start offset is 0, no newline check is needed, and the thread reads data and starts uploading immediately. Otherwise, the thread reads forward byte by byte from the start offset until it reads a newline, and uploads the file content starting from after that newline. When a thread has uploaded its allotted content, it must go on to check whether the next character after the end offset is a newline; if it is not, the thread continues uploading until the last character read is a newline. In this way each thread shifts its start position forward and also shifts its end position forward when uploading, which guarantees file line atomicity.

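Referring to Figure 1, the multithreaded data uploading method proposed by the present invention comprises:

S1: configuring the source path of the file to be uploaded, the destination path in the HDFS system to which the file is to be uploaded, and the number of threads that may be used;

S2: determining the data range that each thread needs to process, according to the data size of the file to be uploaded and the configured number of threads;

S3: performing multithreaded parallel data upload based on the information configured in step S1 and the data ranges determined in step S2.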

The data range comprises the start offset and the end offset of the data that each thread needs to upload.

Referring to Figure 2, which shows the data upload steps performed by each thread in order to guarantee the line atomicity of the uploaded file, the flow comprises:

S31: the thread first determines whether its upload start offset is 0; if so, step S32 is performed, otherwise step S33 is performed;

In this step, a start offset of 0 indicates that this thread begins uploading at the beginning of the whole file.

S32: the thread uploads the data from the start offset to the end offset to the HDFS system, and then performs step S34;

S33: the thread reads forward byte by byte from the start offset until the byte read is a newline, and uploads the data from after that newline up to the end offset to the HDFS system;

In this step, the thread reads one byte at a time forward from the start offset. After each read it checks whether the byte is a newline; if not, it reads the next byte and checks again, until the byte read is a newline.

S34: the thread reads forward byte by byte from the end offset and uploads each byte, until the byte read is a newline, at which point the flow ends.

In this step, the thread reads one byte at a time forward from the end offset. After each read it checks whether the byte is a newline; if not, it uploads the byte, then reads the next byte and repeats the check and upload, until the byte read is a newline.
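For illustration, the per-thread flow of Figure 2 might be combined with a thread pool roughly as follows. This is a sketch under stated assumptions, not the disclosed implementation: the HDFS write is replaced by a write to a local directory, the part-file naming (`part-00000` and so on) is hypothetical, each part is buffered in memory for brevity, and the terminating newline is assumed to be uploaded together with the line it ends.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def skip_past_newline(f, pos, size):
    """Read one byte at a time from pos; return the offset just after the first newline."""
    f.seek(pos)
    while pos < size:
        pos += 1
        if f.read(1) == b"\n":
            break
    return pos


def upload_part(src, dst_dir, index, start, end):
    """Upload the line-aligned slice belonging to the raw range (start, end)."""
    size = os.path.getsize(src)
    with open(src, "rb") as f:
        # S31-S33: start at 0, or just after the first newline at or past start.
        begin = 0 if start == 0 else skip_past_newline(f, start, size)
        # S34: extend past the end offset until a newline has been read.
        finish = skip_past_newline(f, end, size)
        f.seek(begin)
        payload = f.read(finish - begin)  # buffered whole for brevity only
    # Stand-in for the HDFS write: a real uploader would call an HDFS client here.
    part_path = os.path.join(dst_dir, f"part-{index:05d}")
    with open(part_path, "wb") as out:
        out.write(payload)
    return part_path


def parallel_upload(src, dst_dir, num_threads):
    """S2 and S3: compute raw ranges, then upload all parts in parallel."""
    size = os.path.getsize(src)
    chunk = size // num_threads
    ranges = [(i * chunk, size if i == num_threads - 1 else (i + 1) * chunk)
              for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(upload_part, src, dst_dir, i, s, e)
                   for i, (s, e) in enumerate(ranges)]
        return [fut.result() for fut in futures]


def demo():
    """Upload a 1000-line sample with 4 threads; return (original, list of part contents)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "input.txt")
        original = b"".join(b"row %d\n" % i for i in range(1000))
        with open(src, "wb") as f:
            f.write(original)
        out_dir = os.path.join(tmp, "hdfs_stand_in")
        os.mkdir(out_dir)
        parts = parallel_upload(src, out_dir, 4)
        contents = []
        for p in parts:
            with open(p, "rb") as f:
                contents.append(f.read())
        return original, contents
```

Each worker opens its own file handle, so the threads never share a seek position. Concatenating the parts in index order reproduces the source file, and every non-empty part ends on a line boundary.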

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make corresponding changes and variations according to the invention, and all such changes and variations shall fall within the scope of protection of the claims of the invention.

Claims

1. A multithreaded data uploading method, characterized by comprising:

2. The method of claim 1, characterized in that:

3. The method of claim 2, characterized in that:

The multithreaded parallel data upload specifically comprises the following steps: