CN104408147A - Multithreading data uploading method - Google Patents

Publication number
CN104408147A
CN104408147A CN201410722793.8A CN201410722793A CN104408147A CN 104408147 A CN104408147 A CN 104408147A CN 201410722793 A CN201410722793 A CN 201410722793A CN 104408147 A CN104408147 A CN 104408147A
Authority
CN
China
Prior art keywords
data
offset
file
upload
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410722793.8A
Other languages
Chinese (zh)
Inventor
金洪殿
辛国茂
刘伟
卢军佐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201410722793.8A
Publication of CN104408147A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multithreading data uploading method. The multithreading data uploading method comprises the steps of configuring the information of a source path of a file needing to be uploaded, the information of a destination path for uploading the file to an HDFS system and the information of the number of available threads, determining the data range needing to be processed by each thread according to the data size of the file needing to be uploaded and the configured information of the thread number, and performing multithreading parallel data uploading based on the configured information and the determined data range. The multithreading data uploading method is capable of dividing a large text file into a plurality of files for uploading to the HDFS system, and therefore, the writing speed is increased and the file uploading time is greatly reduced.

Description

A multithreaded data upload method
Technical field
The present invention relates to the technical field of information storage, and in particular to a multithreaded data upload method.
Background art
As human society fully enters the information age, data has become a strategic resource as important as water and oil. Mining massive data allows the operational decisions of governments and enterprises to rest on a more scientific basis, improving decision-making efficiency, crisis response capability, and the level of public service. Big data, also called massive data, refers to data sets so large that current mainstream software tools cannot capture, manage, process, and organize them within a reasonable time into information that supports business decisions.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and suited to deployment on low-cost machines. It provides high-throughput data access and is well suited to applications with large data sets. Programs that run on HDFS work with very large data sets. A typical HDFS file is on the order of terabytes, so HDFS is tuned to support large files. It should provide very high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support millions of files in a single cluster.
Files on HDFS come from many sources, and existing files on a file server (for example, NFS) are an important one. For example, in a banking system the data tables unloaded from a database produce an incremental file on the file server every day; before these files can be mined and analyzed, they must first be uploaded to HDFS. The files needed for some analyses can be very large. The traditional method uploads files from a single machine. On the one hand, the bandwidth of the file server is not fully used; on the other hand, the data nodes of HDFS are not fully used. Uploading massive data this way often takes so long that it cannot be used in practice. A new scheme is therefore needed that makes full use of the file server's bandwidth and improves file upload efficiency.
Summary of the invention
The present invention provides a multithreaded data upload method that takes the characteristics of HDFS fully into account, makes full use of resources (bandwidth, disk I/O, and so on), greatly improves the efficiency of uploading massive data, and guarantees file line atomicity. The method comprises:
S1: configuring the source path of the file to be uploaded, the destination path in the HDFS system to which the file is to be uploaded, and the number of threads that may be used;
S2: determining the data range that each thread needs to process according to the data size of the file to be uploaded and the configured number of threads;
S3: performing multithreaded parallel data upload based on the information configured in step S1 and the data range determined in step S2.
In particular:
The data range in step S2 comprises the start offset and the end offset of the file data that each thread needs to upload.
In particular:
The multithreaded parallel data upload specifically comprises the following steps:
S31: the thread first determines whether its upload start offset is 0; if so, it performs step S32, otherwise it performs step S33;
S32: the thread uploads the data from the start offset to the end offset to the HDFS system, and then performs step S34;
S33: the thread reads bytes one at a time forward from the start offset until the byte read is a newline character, and uploads the data from just after that newline up to the end offset to the HDFS system;
S34: the thread reads bytes one at a time forward from the end offset and uploads them until the byte read is a newline character, at which point the flow ends.
The beneficial effect of the invention is that a large text file is divided into multiple files that are uploaded to the HDFS system in parallel, which increases the write speed and greatly reduces the file upload time.
Brief description of the drawings
Figure 1 is a flowchart of the multithreaded data upload method proposed by the present invention.
Figure 2 is a flowchart of the multithreaded data upload method proposed by the present invention that guarantees HDFS file line atomicity.
Detailed description of the embodiments
The multithreaded data upload method proposed by the present invention is described in detail below with reference to the drawings. The method guarantees HDFS file line atomicity.
The main idea of the invention is to upload data in parallel while guaranteeing line atomicity, making full use of network I/O and system resources. By default, the amount of data each thread uploads is the file size divided by the total number of threads. Before a thread starts uploading, it checks its start offset. If the start offset is 0, no newline check is needed, and the thread reads and uploads from there. Otherwise, the thread reads forward byte by byte from the start offset until it reads a newline character, and begins uploading from the content after that newline. When the thread has uploaded its allotted content, it must also check whether the next character after the end offset is a newline; if not, it continues uploading until the last character uploaded is a newline. In this way each thread shifts toward the end of the file at both the start and the end of its range, which guarantees file line atomicity.
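As a rough illustration of the paragraph above, the following Python sketch uploads one file with several threads while keeping every line whole. It is not the patent's implementation: the names `upload_part` and `parallel_upload` are hypothetical, a local directory stands in for the HDFS destination, `readline()` replaces the byte-by-byte newline scan, and the check that skips a range lying entirely inside one long line is an added assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def upload_part(src_path, dest_dir, index, start, end):
    """Write the line-aligned part of the nominal range [start, end) to a part file."""
    with open(src_path, "rb") as src:
        src.seek(start)
        if start != 0:
            # Skip through the first newline at or after the start offset;
            # that partial line belongs to the previous thread.
            src.readline()
        data = b""
        # Assumption: if the skipped line ran past the end offset,
        # this thread has nothing to upload.
        if src.tell() <= end:
            data = src.read(end - src.tell())  # body of the range
            data += src.readline()             # finish the line at the end offset
    # Assumption: a local directory stands in for the HDFS destination path.
    part_path = os.path.join(dest_dir, f"part-{index:05d}")
    with open(part_path, "wb") as dst:
        dst.write(data)
    return part_path


def parallel_upload(src_path, dest_dir, num_threads):
    """Split the file by size and upload the parts in parallel."""
    size = os.path.getsize(src_path)
    # Cap the thread count so that every range starts at a distinct offset.
    n = max(1, min(num_threads, size))
    chunk = size // n
    ranges = [(i * chunk, size if i == n - 1 else (i + 1) * chunk)
              for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(upload_part, src_path, dest_dir, i, s, e)
                   for i, (s, e) in enumerate(ranges)]
        return [f.result() for f in futures]
```

Each worker writes its own part file, so no two threads share a destination, and concatenating the part files in index order should reproduce the source file. The sketch reads a whole range into memory for brevity; a real uploader would more likely stream it in blocks.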
Referring to Figure 1, the multithreaded data upload method proposed by the present invention comprises:
S1: configuring the source path of the file to be uploaded, the destination path in the HDFS system to which the file is to be uploaded, and the number of threads that may be used;
S2: determining the data range that each thread needs to process according to the data size of the file to be uploaded and the configured number of threads;
S3: performing multithreaded parallel data upload based on the information configured in step S1 and the data range determined in step S2.
The data range comprises the start offset and the end offset of the data that each thread needs to upload.
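The default split described in this document (file size divided by the number of threads) can be sketched as follows. The name `compute_ranges` is hypothetical, and two details are assumptions rather than part of the source: the last thread absorbs the remainder of the integer division, and the thread count is capped at the file size so that no two ranges begin at the same offset.

```python
def compute_ranges(file_size, num_threads):
    """Return one nominal (start_offset, end_offset) pair per thread."""
    if file_size < 0 or num_threads < 1:
        raise ValueError("file_size must be >= 0 and num_threads >= 1")
    # Assumption: never use more threads than there are bytes in the file.
    n = max(1, min(num_threads, file_size))
    chunk = file_size // n
    ranges = []
    for i in range(n):
        start = i * chunk
        # Assumption: the last range runs to the end of the file.
        end = file_size if i == n - 1 else start + chunk
        ranges.append((start, end))
    return ranges
```

For example, `compute_ranges(10, 3)` yields `[(0, 3), (3, 6), (6, 10)]`. These offsets are only nominal: each thread still shifts its own start and end to the next newline before uploading.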
Referring to Figure 2, which shows the data upload steps performed by each thread to guarantee line atomicity of the uploaded file, the steps comprise:
S31: the thread first determines whether its upload start offset is 0; if so, it performs step S32, otherwise it performs step S33;
In this step, a start offset of 0 means that the thread starts uploading at the beginning of the whole file.
S32: the thread uploads the data from the start offset to the end offset to the HDFS system, and then performs step S34;
S33: the thread reads bytes one at a time forward from the start offset until the byte read is a newline character, and uploads the data from just after that newline up to the end offset to the HDFS system;
In this step, the thread reads one byte at a time forward from the start offset and checks after each read whether the byte is a newline character; if it is not, the thread reads the next byte and checks again, until the byte read is a newline.
S34: the thread reads bytes one at a time forward from the end offset and uploads them until the byte read is a newline character, at which point the flow ends.
In this step, the thread reads one byte at a time forward from the end offset and checks after each read whether the byte is a newline character; if it is not, the thread uploads the byte, reads the next byte, and repeats the check and upload, until the byte read is a newline.
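Steps S31 to S34 can be sketched for a single thread as below. This is a minimal Python sketch under stated assumptions, not the patent's code: `read_line_aligned_chunk` is a hypothetical name, the function returns the bytes the thread would upload instead of writing to HDFS, the newline is taken to be the single byte `\n` and is uploaded together with its line, and the guard for a range that lies entirely inside one long line is an addition the source does not describe.

```python
NEWLINE = b"\n"


def read_line_aligned_chunk(path, start, end):
    """Return the bytes one thread would upload for the nominal range [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        pos = start
        # S31: a start offset of 0 is the beginning of the file, so nothing is skipped.
        if start != 0:
            # S33: read byte by byte until a newline has been read.
            while True:
                byte = f.read(1)
                if not byte:
                    return b""  # end of file reached without finding a newline
                pos += 1
                if byte == NEWLINE:
                    break
        # Assumption (not in the source): the skipped line ran past the end
        # offset, so the next thread owns everything that follows.
        if pos > end:
            return b""
        # S32 / S33: take the data from the aligned start up to the end offset.
        data = f.read(end - pos)
        # S34: keep reading byte by byte from the end offset until a newline
        # has been read, so the line at the boundary is uploaded whole.
        while True:
            byte = f.read(1)
            if not byte:
                break  # end of file
            data += byte
            if byte == NEWLINE:
                break
        return data
```

Because the thread that ends at an offset and the thread that starts at the same offset both scan to the same newline, the line that straddles the boundary is uploaded exactly once, by the earlier thread.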
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the invention, and all such changes and modifications shall fall within the protection scope of the claims of the invention.

Claims (3)

1. A multithreaded data upload method, characterized in that it comprises:
S1: configuring the source path of the file to be uploaded, the destination path in the HDFS system to which the file is to be uploaded, and the number of threads that may be used;
S2: determining the data range that each thread needs to process according to the data size of the file to be uploaded and the configured number of threads;
S3: performing multithreaded parallel data upload based on the information configured in step S1 and the data range determined in step S2.
2. The method according to claim 1, characterized in that:
The data range in step S2 comprises the start offset and the end offset of the file data that each thread needs to upload.
3. The method according to claim 2, characterized in that:
The multithreaded parallel data upload specifically comprises the following steps:
S31: the thread first determines whether its upload start offset is 0; if so, it performs step S32, otherwise it performs step S33;
S32: the thread uploads the data from the start offset to the end offset to the HDFS system, and then performs step S34;
S33: the thread reads bytes one at a time forward from the start offset until the byte read is a newline character, and uploads the data from just after that newline up to the end offset to the HDFS system;
S34: the thread reads bytes one at a time forward from the end offset and uploads them until the byte read is a newline character, at which point the flow ends.
CN201410722793.8A 2014-12-02 2014-12-02 Multithreading data uploading method Pending CN104408147A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410722793.8A CN104408147A (en) 2014-12-02 2014-12-02 Multithreading data uploading method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410722793.8A CN104408147A (en) 2014-12-02 2014-12-02 Multithreading data uploading method

Publications (1)

Publication Number Publication Date
CN104408147A (en) 2015-03-11

Family

ID=52645778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410722793.8A Pending CN104408147A (en) 2014-12-02 2014-12-02 Multithreading data uploading method

Country Status (1)

Country Link
CN (1) CN104408147A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831211A (en) * 2012-08-14 2012-12-19 中山大学 Data sheet migration method based on sheet relation analysis
CN103106068A (en) * 2013-02-28 2013-05-15 江苏物联网研究发展中心 Internet of things big data fast calibration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
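马登邑: "Research and Implementation of a File Management System Based on Hadoop Storage", China Master's Theses Full-text Database, Information Science and Technology *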

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049524A (en) * 2015-08-13 2015-11-11 浙江鹏信信息科技股份有限公司 Hadhoop distributed file system (HDFS) based large-scale data set loading method
CN105049524B (en) * 2015-08-13 2019-02-05 浙江鹏信信息科技股份有限公司 A method of the large-scale dataset based on HDFS loads
CN109325002A (en) * 2018-09-03 2019-02-12 北京京东金融科技控股有限公司 Text file processing method, device, system, electronic equipment, storage medium
CN109325002B (en) * 2018-09-03 2021-03-05 北京京东金融科技控股有限公司 Text file processing method, device and system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103136243B (en) File system duplicate removal method based on cloud storage and device
US9851917B2 (en) Method for de-duplicating data and apparatus therefor
CN105205154B (en) Data migration method and device
CN106599091B (en) RDF graph structure storage and index method based on key value storage
CN104361065A (en) Orderly sequence number generating method of Zookeeper-based distributed system
CN105446893A (en) Data storage method and device
CN106611035A (en) Retrieval algorithm for deleting repetitive data in cloud storage
CN105260464B (en) The conversion method and device of data store organisation
KR20190075962A (en) Data processing method and data processing apparatus
CN106844682A (en) Method for interchanging data, apparatus and system
CN104462389A (en) Method for implementing distributed file systems on basis of hierarchical storage
CN104699723A (en) Data exchange adapter and system and method for synchronizing data among heterogeneous systems
CN102456076A (en) Massive fragment data aggregation system and method
CN104123237A (en) Hierarchical storage method and system for massive small files
CN104572679A (en) Public opinion data storage method and device
CN104965835B (en) A kind of file read/write method and device of distributed file system
CN104572505A (en) System and method for ensuring eventual consistency of mass data caches
CN111444192A (en) Method, device and equipment for generating Hash of global state in block chain type account book
CN105049524B (en) A method of the large-scale dataset based on HDFS loads
CN103473258A (en) Cloud storage file system
CN104391961A (en) Read-write solution strategy for tens of millions of small file data
CN103793468A (en) Data storage method and device and data reading method and device
US9952771B1 (en) Method and system for choosing an optimal compression algorithm
CN102479211B (en) Mass data processing system and method on basis of database
CN104408147A (en) Multithreading data uploading method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150311
