TW202001618A - File processing method and device - Google Patents

File processing method and device Download PDF

Info

Publication number
TW202001618A
TW202001618A TW108107394A TW108107394A TW202001618A TW 202001618 A TW202001618 A TW 202001618A TW 108107394 A TW108107394 A TW 108107394A TW 108107394 A TW108107394 A TW 108107394A TW 202001618 A TW202001618 A TW 202001618A
Authority
TW
Taiwan
Prior art keywords
file
block
line
segment
starting
Prior art date
Application number
TW108107394A
Other languages
Chinese (zh)
Other versions
TWI711935B (en
Inventor
王玉潑
吳連亮
Original Assignee
香港商阿里巴巴集團服務有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 香港商阿里巴巴集團服務有限公司 filed Critical 香港商阿里巴巴集團服務有限公司
Publication of TW202001618A publication Critical patent/TW202001618A/en
Application granted granted Critical
Publication of TWI711935B publication Critical patent/TWI711935B/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A file processing method and device, the method comprising: acquiring row capacity by means of an initial file block; according to a preset number of rows of fragments and the row capacity, determining a delineation file block; and acquiring a row separator in the delineation file block by means of downloading same, thereby obtaining index data corresponding to a fragment file at least on the basis of the row separator, which is used for a parsing device to parse said fragment file from a cloud storage server according to the index data, and thus the effectiveness of file processing is improved.

Description

檔案處理的方法及裝置File processing method and device

本說明書一個或多個實施例有關電腦技術領域,尤其有關透過電腦檔案處理的方法和裝置。One or more embodiments of this specification relate to the field of computer technology, and in particular to methods and devices for processing files through computers.

雲端運算(Cloud Computing) 是分散式處理(Distributed Computing)、平行處理(Parallel Computing) 和網格運算(Grid Computing) 的發展,透過網路將龐大的運算處理程式拆分成若干較小的子程式,將這些小程式分別交由多台伺服器所組成的系統進行運算,並輸出運算結果。雲端儲存是在雲端運算上延伸出來的概念,一般是指透過叢集應用、網格技術或分散式檔案系統等功能,將網路中大量不同類型的儲存設備透過應用軟體集合在一起,實現協同工作,共同對外提供資料儲存和業務存取功能。即雲端儲存系統就是一個以資料儲存和管理為核心的雲端運算系統。雲端儲存系統可以透過一定的應用軟體或應用介面,為用戶提供一定類型的儲存服務和存取服務。 通常,需要解析檔案的情況下,例如需要將檔案從其他格式解析成內部可以處理的格式時,如果檔案較大,往往需要將大檔案切割為較小的切片檔案,然後由解析設備叢集對各個切片檔案進行解析。這個過程通常涉及大檔案和切割好的切片檔案的下載和上傳,產生較多耗時。因此,希望能有改進的方案,在解析大檔案時,透過有效的檔案分割,減少耗時,提高檔案處理的有效性。Cloud computing is the development of distributed computing, parallel computing, and grid computing. It divides a huge computing program into several smaller subprograms through the network ,Send these small programs to a system composed of multiple servers for calculation, and output the calculation results. Cloud storage is a concept that extends from cloud computing. It generally refers to the use of cluster applications, grid technology, or distributed file systems to integrate a large number of different types of storage devices in the network through application software to achieve collaborative work. , Jointly provide external data storage and business access functions. The cloud storage system is a cloud computing system with data storage and management as the core. The cloud storage system can provide users with certain types of storage services and access services through certain application software or application interfaces. Generally, when files need to be parsed, for example, when files need to be parsed from other formats into a format that can be processed internally, if the file is large, the large file often needs to be cut into smaller slice files, and then the parsing equipment clusters each Slice files for analysis. This process usually involves the downloading and uploading of large files and cut slice files, which is more time-consuming. Therefore, it is hoped that there can be an improved solution to reduce the time and improve the efficiency of file processing through effective file division when parsing large files.

本說明書一個或多個實施例描述了一種方法和裝置,可以選擇性地下載部分待處理檔案,透過確定各個片段檔案的索引資訊確定待處理檔案的劃分方案,而無需下載整個檔案並真實切割,從而減少耗時,提高檔案處理的有效性。 根據第一態樣,提供了一種檔案處理的方法,適用於透過解析設備叢集針對待雲端儲存伺服器中的處理檔案進行解析的情況,包括:從雲端儲存伺服器下載起始檔案塊,以獲取所述待處理檔案的首個行分隔符的位置,所述起始檔案塊是所述待處理檔案中從起始位置開始的、包括所述首個行分隔符的檔案塊;基於所述首個行分隔符的位置確定所述待處理檔案的行容量;根據預設片段行數和所述行容量,下載所述待處理檔案中的第一分界檔案塊,所述第一分界檔案塊包括,當按照所述預設片段行數將所述待處理檔案進行劃分時,劃分得到的多個片段檔案中第一片段檔案的結束位置的行分隔符;至少基於所述第一分界檔案塊中的行分隔符的位置,確定所述第一片段檔案的第一索引資料,所述第一索引資料包括第一開始索引和第一結束索引,所述第一索引資料用於所述解析設備叢集中的解析設備按照所述第一索引資料,從所述雲端儲存伺服器解析所述第一片段檔案。 在一些實施例中,所述從雲端儲存伺服器下載起始檔案塊包括:從起始位置開始下載預定大小的檔案塊作為起始檔案塊,並從所述起始檔案塊中查找行分隔符;在未查找到行分隔符的情況下,向後增加一個預定大小的檔案塊以更新所述起始檔案塊,直到從中查找到首個行分隔符。 在一些實施例中,基於所述首個行分隔符的位置確定所述待分割檔案的行容量包括:將所述行容量確定為,所述待處理檔案的起始位置至所述首個分隔符的位置所包含的位元組數。 在一些實施例中,根據預設片段行數和所述行容量下載所述待處理檔案中的第一分界檔案塊包括:確定所述第一片段檔案的檔案開始位置;確定所述第一分界檔案塊的塊開始位置為,所述檔案開始位置加上片段容量的位置,所述第一分界檔案塊的大小為一個行容量,其中,所述片段容量為,所述預設片段檔案行數與所述行容量的乘積。 在一些實施例中,根據預設片段行數和所述行容量,下載所述待處理檔案中的第一分界檔案塊還包括:下載所述第一分界檔案塊,並從所述第一分界檔案塊中查找行分隔符;在未查找到行分隔符的情況下,向後增加一個行容量大小的檔案塊以更新所述第一分界檔案塊,並下載更新後的第一分界檔案塊,直到從中查找到行分隔符。 在一些實施例中,其中,確定所述第一片段檔案的檔案開始位置包括:在所述第一片段檔案是所述待處理檔案的第一個片段檔案的情況下,將所述待處理檔案的所述起始位置作為所述第一片段檔案的檔案開始位置;否則,將所述第一片段檔案的前一個片段檔案的結束位置作為所述檔案開始位置。 在一些實施例中,所述第一片段檔案為所述待處理檔案的第一個片段檔案,所述確定所述第一片段檔案的第一索引資料包括:確定所述第一開始索引指向所述待處理檔案的所述起始位置;確定所述第一結束索引指向所述第一分界檔案塊中的行分隔符的位置。 在一些實施例中,所述方法還包括:獲取所述待處理檔案的檔案大小資訊;以及,根據預設片段行數和所述行容量,下載所述待處理檔案中的第一分界檔案塊還包括:獲取所述待處理檔案的檔案大小資訊;基於所述檔案大小資訊檢測所述第一分界檔案塊是否超出所述待處理檔案的檔案大小範圍;在超出的情況下,確定最後一個片段檔案的所述第一結束索引指向所述待處理檔案的結束位置。 在一些實施例中,所述確定所述第一片段檔案的第一索引資料包括:將所述第一片段檔案的前一個片段檔案的結束索引作為所述第一開始索引;確定所述第一結束索引指向所述第一分界檔案塊中的行分隔符的位置。 在一些實施例中,所述方法還包括,將第一索引資料添加到用於所述多個片段檔案的索引資訊中。 在一些實施例中,所述方法還包括,利用所述索引資訊更新所述雲端儲存伺服器中的任務配置表,以供所述雲端儲存伺服器按照所述任務配置表的分發規則向所述解析設備叢集分發所述索引資訊。 在一些實施例中,所述方法還包括,將所述第一索引資料作為傳遞參數,透過參數呼叫的方式發送至所述解析設備叢集。 根據第二態樣,提供一種檔案處理的裝置,適用於透過解析設備叢集針對待處理檔案進行解析的情況,包括: 起始檔案塊下載單元,配置為從所述雲端儲存伺服器下載起始檔案塊,以獲取所述待處理檔案的首個行分隔符的位置,所述起始檔案塊是所述待處理檔案中從起始位置開始的、包括所述首個行分隔符的檔案塊;行容量確定單元,配置為基於所述首個行分隔符的位置確定所述待處理檔案的行容量;分界檔案塊下載單元,配置為根據預設片段行數和所述行容量,下載所述待處理檔案中的第一分界檔案塊,所述第一分界檔案塊包括,當按照所述預設片段行數將所述待處理檔案進行劃分時,劃分得到的多個片段檔案中第一片段檔案的結束位置的行分隔符;索引資料確定單元,配置為至少基於所述第一分界檔案塊中的行分隔符的位置,確定所述第一片段檔案的第一索引資料,所述第一索引資料包括第一開始索引和第一結束索引,所述第一索引資料用於解析設備按照所述第一索引資料,從所述雲端儲存伺服器解析所述第一片段檔案。 根據第三態樣,提供了一種電腦可讀媒體儲存媒體,其上儲存有電腦程式,當所述電腦程式在電腦中執行時,令電腦執行第一態樣的方法。 根據第四態樣,提供了一種運算設備,包括記憶體和處理器,其特徵在於,所述記憶體中儲存有可執行碼,所述處理器執行所述可執行碼時,實現第一態樣的方法。 透過本說明書實施例提供的方法和裝置,首先從待處理檔案的起始位置下載起始檔案塊,並根據起始檔案塊中首個行分隔符的位置確定行容量,然後基於行容量和預設片段行數下載分界檔案塊,透過讀取分界檔案塊獲取其中的行分隔符,從而至少基於該行分隔符得到片段檔案的索引資料,用於解析設備根據索引資料從雲端儲存伺服器解析片段檔案。如此,僅僅需要從雲端儲存伺服器獲取起始檔案塊和分界檔案塊,由於選擇性地下載部分待處理檔案,透過確定各個片段檔案的索引資訊確定待處理檔案的劃分方案,而不用下載整個檔案並對檔案進行真實切割,可以減少耗時,提高檔案處理的有效性。One or more embodiments of this specification describe a method and device that can selectively download some files to be processed, and determine the division scheme of the files to be processed by determining the index information of each fragment file without downloading the entire file and cutting it down, Thereby reducing time consumption and improving the effectiveness of file processing. According to the first aspect, a file processing method is provided, which is suitable for parsing the processing files in the cloud storage server through the parsing device cluster, including: downloading the starting file block from the cloud storage server to obtain The position of the first line separator of the to-be-processed file, the starting file block is the file block including the first line delimiter from the starting position in the to-be-processed file; based on the first The position of each line separator determines the line capacity of the to-be-processed file; according to the preset segment line number and the line capacity, download the first delimited file block in the to-be-processed file, the first delimited file block includes , When dividing the to-be-processed file according to the preset segment line number, the line separator at the end position of the first segment file among the divided segment files; at least based on the first boundary file block The position of the line separator of the, determines the first index data of the first fragment file, the first index data includes a first start index and a first end index, the first index data is used for the parsing device cluster The parsing device in parses the first segment file from the cloud storage server according to the first index data. In some embodiments, the downloading of the starting file block from the cloud storage server includes: downloading a file block of a predetermined size from the starting position as the starting file block, and searching for a line separator from the starting file block ; When no line separator is found, add a file block of a predetermined size backward to update the starting file block until the first line separator is found from it. In some embodiments, determining the line capacity of the file to be divided based on the position of the first line separator includes: determining the line capacity as the starting position of the file to be processed to the first partition The number of bytes contained in the position of the character. In some embodiments, downloading the first boundary file block in the to-be-processed file according to the preset segment line number and the line capacity includes: determining a file start position of the first segment file; determining the first boundary The block start position of the file block is: the file start position plus the segment capacity, and the size of the first boundary file block is a line capacity, wherein the segment capacity is the preset segment file line number The product of the row capacity. In some embodiments, downloading the first boundary file block in the to-be-processed file according to the preset segment line number and the line capacity further includes: downloading the first boundary file block, and removing the first boundary file from the first boundary Find the line separator in the file block; if no line separator is found, add a file block with a line size backward to update the first boundary file block, and download the updated first boundary file block until Find the line separator from it. In some embodiments, wherein determining the file start position of the first clip file includes: when the first clip file is the first clip file of the to-be-processed file, storing the to-be-processed file The start position of is taken as the file start position of the first clip file; otherwise, the end position of the previous clip file of the first clip file is taken as the file start position. In some embodiments, the first segment file is the first segment file of the to-be-processed file, and determining the first index data of the first segment file includes: determining that the first start index points to the Describe the starting position of the file to be processed; determine that the first end index points to the position of the line separator in the first delimited file block. In some embodiments, the method further includes: obtaining file size information of the file to be processed; and, according to a preset number of segment lines and the line capacity, downloading the first boundary file block in the file to be processed It also includes: obtaining file size information of the pending file; detecting whether the first boundary file block exceeds the file size range of the pending file based on the file size information; in the case of exceeding, determining the last segment The first end index of the file points to the end position of the file to be processed. In some embodiments, the determining the first index data of the first segment file includes: using the end index of the previous segment file of the first segment file as the first start index; determining the first The end index points to the position of the line separator in the first boundary archive block. In some embodiments, the method further includes adding the first index data to the index information for the plurality of clip files. In some embodiments, the method further includes using the index information to update a task configuration table in the cloud storage server for the cloud storage server to send the task configuration table to the The parsing device cluster distributes the index information. In some embodiments, the method further includes sending the first index data as a transfer parameter to the cluster of parsing devices through a parameter call. According to the second aspect, a file processing device is provided, which is suitable for the case of parsing a file to be processed through a cluster of parsing equipment, including: A starting file block downloading unit configured to download the starting file block from the cloud storage server to obtain the position of the first line separator of the pending file, the starting file block is the pending file In the file block starting from the starting position, including the first line separator; a line capacity determination unit configured to determine the line capacity of the file to be processed based on the position of the first line separator; the delimited file block The downloading unit is configured to download the first boundary file block in the to-be-processed file according to the preset segment line number and the line capacity, the first boundary file block includes, when When the to-be-processed file is divided, a line separator at the end position of the first clip file in the divided clip files; an index data determination unit configured to be based at least on the line separator in the first delimited file block Determine the first index data of the first segment file, the first index data includes a first start index and a first end index, and the first index data is used by the parsing device according to the first index data , Parse the first segment file from the cloud storage server. According to a third aspect, there is provided a computer-readable medium storage medium on which a computer program is stored, and when the computer program is executed in a computer, a method of causing the computer to execute the first aspect. According to a fourth aspect, a computing device is provided, including a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, the first state is realized Kind of method. Through the method and device provided by the embodiments of the present specification, the starting file block is first downloaded from the starting position of the file to be processed, and the line capacity is determined according to the position of the first line separator in the starting file block, and then based on the line capacity and the Set the number of fragment lines to download the delimited file block, and obtain the line delimiter by reading the delimited file block, so as to obtain the index data of the fragment file based on the line delimiter at least, which is used by the parsing device to analyze the fragment from the cloud storage server according to the index data file. In this way, it is only necessary to obtain the starting file block and the delimited file block from the cloud storage server. Since some of the pending files are selectively downloaded, the division scheme of the pending files is determined by determining the index information of each fragment file without downloading the entire file And the real cut of the file can reduce the time and improve the effectiveness of file processing.

下面結合圖式,對本說明書提供的方案進行描述。 圖1為本說明書揭露的一個實施例的實施場景示意圖。如圖1所示,雲端儲存伺服器110可以為運算平台130等提供分散式儲存服務,例如某個雲端物件儲存服務平台OSS(Object Storage Service)。運算平台130、解析設備121、122、123等都可以按照一定規則向雲端儲存伺服器110儲存資料,或者從雲端儲存伺服器110獲取資料。例如,運算平台130或解析設備121、122、123等可以透過發送包含指定欄位(如第10位元組至第100位元組)的請求,從雲端儲存伺服器110下載該指定欄位的資料(如從第10位元組開始至第100位元組結束的資料)。運算平台130與解析設備121、122、123等也可以透過各種有線或無線網路交互資料。運算平台130是具有一定資料處理能力的電子設備,如為客戶端應用提供支持的伺服器,比如支付寶伺服器、釘釘伺服器、購物應用伺服器等。 當運算平台130中的檔案處理器要處理的檔案儲存在雲端儲存伺服器110中,且雲端儲存伺服器110中儲存的檔案格式(如html格式)不是該檔案處理器能夠處理的檔案格式(如txt格式)時,如果要處理的檔案較大,運算平台130自身無法解析,或解析需要耗費時間較長,造成資料處理壓力的情況下,可以將解析任務分發給解析設備121、122、123等來完成。 通常,運算平台130從雲端儲存伺服器110下載待處理檔案,保存到本地,然後從待處理檔案中每讀取預設行數(如5000行)保存成一個片段檔案,再將各個片段檔案上傳至雲端儲存伺服器110。雲端儲存伺服器110用各個片段檔案資訊更新任務配置表,然後將各個片段檔案的檔案資訊作為分發參數分發給解析設備121、122、123等,解析設備121、122、123等根據收到的檔案資訊解析相應的片段檔案,產生運算平台130中的檔案處理器能夠處理的檔案格式的檔案。這種方法的檔案解析總耗時至少包括:運算平台130從雲端儲存伺服器110下載待處理檔案的耗時t11 、運算平台130針對下載到本地的待處理檔案逐行讀取以分割成各個片段檔案的耗時t12 、運算平台130向雲端儲存伺服器110上傳各個片段檔案的耗時t13 、雲端儲存伺服器110用各個片段檔案資訊更新任務配置表的耗時t14 、雲端儲存伺服器110向解析設備121、122、123等分發解析任務的耗時t15 、解析設備121、122、123等解析相應片段檔案耗時t16 。 在本說明書的實施例中,運算平台130可以首先從待處理檔案的起始位置下載起始檔案塊,並根據起始檔案塊中首個行分隔符的位置確定行容量,然後基於行容量和預設片段行數下載分界檔案塊用於確定片段檔案的索引資料。這裡,運算平台130從雲端儲存伺服器110是選擇性地下載待處理檔案中的檔案塊,而非全部下載,所確定的是片段檔案的索引資料,而沒有對待處理檔案進行真實切割。之後,運算平台130可以將這些索引資訊上傳至雲端儲存伺服器110,更新任務配置表,由雲端儲存伺服器110根據任務配置表向解析設備121、122、123等分發解析任務,也可以直接將索引資訊作為參數傳遞至解析設備121、122、123等,以分發解析任務。 這樣,由雲端儲存伺服器110分發解析任務的情況下,檔案解析的耗時包括:運算平台130從雲端儲存伺服器110選擇性下載部分檔案塊的耗時t21 、從所下載的檔案塊中確定各個片段檔案索引資訊的耗時t22 、向雲端儲存伺服器110上傳各個片段檔案的索引資訊的耗時t23 、雲端儲存伺服器110用各個片段檔案資訊更新任務配置表的耗時t24 、雲端儲存伺服器110向解析設備121、122、123等分發解析任務的耗時t25 、解析設備121、122、123等解析相應片段檔案耗時t26 。可以理解,從雲端儲存伺服器110選擇性下載部分檔案塊的耗時t21 遠遠小於從雲端儲存伺服器110下載整個待處理檔案的耗時t11 ,從所下載的檔案塊中確定出索引資訊的耗時t22 小於針對下載到本地的待處理檔案逐行讀取以分割成各個片段檔案的耗時t12 ,運算平台130向雲端儲存伺服器110上傳各個片段檔案的索引資訊的耗時t23 遠遠小於上傳各個片段檔案的耗時t13 ,其他耗時t14 、t15 、t16 與t24 、t25 、t26 基本一致,因此,可以大大減少檔案解析耗時。 由雲端運算平台130分發解析任務的情況下,檔案解析的耗時包括:運算平台130從雲端儲存伺服器110選擇性下載部分檔案塊的耗時t31 、從所下載的檔案塊中確定各個片段檔案索引資訊的耗時t32 、運算平台130向解析設備121、122、123等分發解析任務的耗時t33 、解析設備121、122、123等解析相應片段檔案耗時t34 。與現有技術相比,從雲端儲存伺服器110選擇性下載部分檔案塊的耗時t21 小於從雲端儲存伺服器110下載待處理檔案的耗時t11 ,從所下載的檔案塊中確定出索引資訊的耗時t22 小於針對下載到本地的待處理檔案逐行讀取以分割成各個片段檔案的耗時t12 ,解析設備121、122、123等解析相應片段檔案耗時t34 與t16 基本一致,運算平台130直接向解析設備121、122、123等分發解析任務,耗時t33 ,省去了運算平台130向雲端儲存伺服器110上傳各個片段檔案的耗時t13 、雲端儲存伺服器110用各個片段檔案資訊更新任務配置表的耗時t14 ,小於或等於雲端儲存伺服器110向解析設備121、122、123等分發解析任務的耗時t15 ,因此,耗時大大減小。 以下詳細說明上述運算平台130透過選擇性下載的檔案塊確定各個片段檔案的索引資訊的過程。 圖2示出根據一個實施例的檔案處理的方法流程圖。該方法適用於透過解析設備叢集針對待處理檔案進行解析的情況,其執行主體可以是任何具有運算、處理能力的系統、設備、裝置、平台或伺服器,例如圖1所示的運算平台。 如圖2所示,該方法包括以下步驟:步驟21,從雲端儲存伺服器下載待處理檔案中從起始位置開始的起始檔案塊,以獲取待處理檔案的首個行分隔符的位置;步驟22,基於首個行分隔符的位置確定待處理檔案的行容量;步驟23,根據預設片段行數和上述行容量,確定待處理檔案中的第一分界檔案塊,該第一分界檔案塊包括,當按照預設片段行數將待處理檔案進行劃分時,劃分得到的多個片段檔案中第一片段檔案的結束位置的行分隔符;步驟24,至少基於第一分界檔案塊中的行分隔符的位置,確定第一片段檔案的第一索引資料,該第一索引資料包括第一開始索引和第一結束索引,第一索引資料用於解析設備按照第一索引資料,從雲端儲存伺服器解析上述第一片段檔案。 首先,在步驟21,從雲端儲存伺服器下載起始檔案塊,以獲取待處理檔案的首個行分隔符的位置,其中起始檔案塊是待處理檔案中從起始位置開始的檔案塊,並包括首個行分隔符。值得說明的是,這裡的檔案塊不是檔案儲存時的塊block,而是檔案裡面指定位置和/或大小的一段檔案。例如,指定起始位置為待處理檔案的起始位置、大小為4千位元組(4kb)的檔案塊,就是從待處理檔案起始位置開始的包括4千個位元組的一段檔案。 起始檔案塊的大小可以根據經驗確定,也可以是隨機設置的一個較小數值,以從雲端儲存伺服器下載盡可能小的檔案塊。以根據經驗確定為例,可以透過統計多個檔案的首行檔案大小,取其中的最大值,或大於預定比例(如90%)的首行檔案大小的值,確定為起始檔案塊的大小。如圖3所示,可以根據預定檔案塊大小,下載從位置31到位置32之間的起始檔案塊。 針對下載的起始檔案塊,可以從起始位置開始檢測行分隔符,以查找首個行分隔符的位置。可以理解,由於起始檔案塊是從待處理檔案的起始位置開始的,其到首個行分隔符的位置處的檔案大小就是首行檔案大小。 在一些實施例中,起始檔案塊中可以包括一個或多個行分隔符,此時,將檢測到的第一個行分隔符確定為首個行分隔符。如圖3所示,假設位置32處有一個行分隔符,則將其確定為首個行分隔符。 在另一些實施例中,起始檔案塊中也可能包括0個行分隔符。此時,起始檔案塊的大小還可以根據實際情況進行改變。例如,初始時,從起始位置開始下載預定大小(如4kb)的檔案塊作為起始檔案塊,並從該起始檔案塊查找行分隔符;在未查找到行分隔符的情況下,向後增加一個預定大小(與前面的預定大小4kb可以相同,也可以不同)的檔案塊以更新上述起始檔案塊,直到從中查找到首個行分隔符。假設初始時從待處理檔案的起始位置下載預定大小4kb的一個檔案塊,當從這4kb的檔案塊裡沒有查找到行分隔符時,可以將初始檔案塊的大小更新為(4+4=8)kb,即,可以從待處理檔案的起始位置下載大小為8kb的一段檔案作為初始檔案塊,並從中查找行分隔符。如果查找到行分隔符,則可以確定首個行分隔符的位置。否則,繼續更新初始檔案塊,初始檔案塊大小為12kb,並從中查找行分隔符。以此類推,直到從初始檔案塊中查找到行分隔符,確定出首個行分隔符的位置。 步驟22,基於首個行分隔符的位置確定待處理檔案的行容量。可以理解,行容量可以用於表示待處理檔案的一行檔案的大小,如圖3所示,位置31和位置32 之間的檔案塊的大小。在一個實施例中,可以將行容量確定為,待處理檔案的起始位置(如位置31)至首個行分隔符的位置(如位置32)所包含的位元組數(如4kb)。 在一些可選的實施例中,待處理檔案的第一行檔案可能較特殊,例如只有一個檔案頭等,此時,可以透過預先設定排除條件(如小於10位元組等)將其排除,然後查找下一個行分隔符,將首個行分隔符與下一個行分隔符之間的檔案塊的大小確定為行容量。 在另一些可選的實施例中,還可以根據待處理檔案的起始位置和首個行分隔符之間的檔案段大小確定第一行容量,根據首個行分隔符和第二個行分隔符之間的檔案大小確定第二行容量,將第一行容量和第二行容量的平均值或較大值作為本步驟22所確定的行容量。 如此,可以排除待處理檔案第一行較特殊的情況。 步驟23,根據預設片段行數和上述行容量,下載待處理檔案中的第一分界檔案塊。該第一分界檔案塊包括,當按照預設片段行數將待處理檔案進行劃分時,劃分得到的多個片段檔案中第一片段檔案的結束位置的行分隔符。值得說明的是,第一片段檔案和第一分解檔案塊中的“第一”並不表示順序,而是表示某一個、任一個,為了便於描述片段檔案和分界檔案塊的對應關係,透過“第一”來表示。 其中,預設片段行數可以是人為設置的,也可以是根據待處理檔案大小、行容量和解析設備數量確定的。例如,待處理檔案大小為1000兆位元組(Mb),行容量為4kb,則預估行數為1000Mb/4kb=1000×1024/4=256000,假設空閒解析設備為64個,則可以將待處理檔案按照4000行為一個片段檔案,得到64個片段檔案,以保證每個解析設備都可以分配到片段檔案,加快處理速度。可以理解,此處僅為示例,實踐中可以根據情況確定預設片段行數,本申請案對此不作限定。 根據預設片段行數和上述行容量,可以確定待處理檔案中的第一分界檔案塊。該第一分界檔案塊可以表示第一片段檔案結束的大概位置。容易理解,由於針對待處理檔案劃分片段檔案時,是按照行數劃分,因此,一個片段檔案的結束位置是一行結束的位置,因此,可以根據預估的分界位置,查找相應的行分隔符,來精確確定一個片段檔案的結束位置。由此,可以將包括有用於劃分第一片段檔案的結束位置的行分隔符的一段檔案作為預估的分界位置,並確定為第一分界檔案塊進行下載。 實踐中,可以先確定第一片段檔案的檔案開始位置,例如當第一片段檔案為整個待處理檔案的第一個片段檔案時,第一片段檔案的檔案開始位置為待處理檔案的起始位置,當第一片段檔案為其他片段檔案時,第一片段檔案的檔案開始位置為前一個片段檔案的結束位置。 為了便於說明,可以將預設片段檔案行數與行容量的乘積稱之為片段容量。則每一個片段檔案的大小在一個片段容量上下浮動。可以理解,如果預設片段行數為5000行,上述行容量為4kb,對於待處理檔案的第一個片段檔案,其結束位置大概在距離待處理檔案起始位置5000×4kb的位置附近。實踐中,為了盡可能保證所下載的檔案塊中能夠找到行分隔符,分界檔案塊的大小一般取為一個行容量大小。當然,分界檔案塊的大小還可以取為其他值,例如1.5個行容量大小,本申請案對此不作限定。在這裡僅以取為一個行容量大小為例進行說明。 根據一態樣的實施例,可以以距離某個片段檔案起始位置5000×4kb處為中心,前後各取2kb,共4kb大小的檔案塊作為該片段檔案對應的分界檔案塊。即,將分界檔案塊的塊開始位置確定為,片段檔案起始位置加上片段容量並減去行容量的一半的位置。以第一個片段檔案為例,第一個分界檔案塊為:(5000×4kb-2kb)處至(5000×4kb+2kb)處的一段檔案。 根據另一態樣的實施例,也可以將從距離片段檔案起始位置5000×4kb處,向後取大小為4kb大小的檔案塊作為該片段檔案對應的分界檔案塊。即,確定第一分界檔案塊的塊開始位置為,檔案開始位置加上片段容量的位置。以第一個片段檔案為例,對應的第一個分界檔案塊可以為:(5000×4kb)處至(5001×4kb)處的一段檔案。如圖3所示,假設片段檔案的起始位置加上一個片段容量的位置在位置33,則可以取位置33至位置34之間的、包含一個行容量大小的一段檔案作為對應的分界檔案塊。 接下來,還需要從分界檔案塊中查找行分隔符,以進一步確定片段檔案的具體結束位置。可以理解,與起始檔案塊類似地,分界檔案塊中可能包含1個或多個行分隔符,也可能未包含行分隔符。 在一個實施例中,分界檔案塊包含1個或多個行分隔符,可以將其中第一個或最後一個行分隔符的位置確定為對應片段檔案的結束位置。如在以位置33開始,位置34結束的分界檔案塊中查找到位置35處具有行分隔符,則將位置35確定為該片段檔案的結束位置。 在另一個實施例中,分界檔案塊包含0個行分隔符,則可以向後增加一個行容量大小的檔案塊以更新該分界檔案塊,並下載更新後的分界檔案塊,直到從中查找到行分隔符。仍以第一個片段檔案,第一個分界檔案塊為:(5000×4kb)處至(5001×4kb)處的一段檔案為例,如果在該第一個分界檔案塊(5000×4kb)處至(5001×4kb)處的一段檔案中未查找到行分隔符,則將第一個分界檔案塊更新為(5000×4kb)處至(5002×4kb)處的一段檔案,下載更新後的分界檔案塊,從中查找行分隔符。以此類推,直到在第一個分界檔案塊中查找到行分隔符。 在一個實施例中,分界檔案塊包含0個行分隔符,還可以向前後各增加一段檔案,共增加一個行容量大小的檔案段以更新該分界檔案塊,並下載更新後的分界檔案塊,直到從中查找到行分隔符。例如分界檔案塊為(5000×4kb-2kb)處至(5000×4kb+2kb)處的一段檔案,更新為(5000×4kb-4kb)處至(5000×4kb+4kb)處的一段檔案。在此不再贅述。 容易理解,當第一片段檔案是整個待處理檔案的最後一個片段檔案時,從前一個片段檔案的結束位置到待處理檔案的結束位置的大小可能不足一個片段容量的大小,此時,如果仍按照前述確定第一分界檔案塊的位置的方法,將會導致錯誤。因此,根據一些可能的實施例,在該流程步驟21開始之前,還可以包括以下步驟:獲取待處理檔案的檔案大小資訊。其中,該檔案大小資訊可以透過從雲端記憶體讀取的待處理檔案的元資訊中獲取,例如1000Mb。 由此,在一個可選的實施方式中,在本步驟23之前,或在步驟23中,還可以進一步包括判斷第一片段檔案是否為最後一個片段檔案的步驟。具體包括:獲取待處理檔案的檔案大小資訊;基於檔案大小資訊檢測第一分界檔案塊是否超出待處理檔案的檔案大小範圍。在超出的情況下,確定第一片段檔案為最後一個片段檔案,無需下載相應的分界檔案塊。 步驟24,至少基於第一分界檔案塊中的行分隔符的位置,確定第一片段檔案的第一索引資料。同理,此處的“第一索引資料”中的“第一”不表示順序,而是和“第一分界檔案塊”、“第一片段檔案”對應,表示同一個。 該第一索引資料可以包括第一開始索引和第一結束索引。如,第i個片段檔案的第一索引資料pi 用(pi0 ,pi1 )表示,其中,第一開始索引pi0 指向第i個片段檔案的開始位置,第一結束索引pi1 指向第i個片段檔案的結束位置。可選地,第一開始索引pi0 可以是第i個片段檔案的開始位置本身,第一結束索引pi1 可以是第i個片段檔案的結束位置本身。 在第一片段檔案為待處理檔案的第一個片段檔案的情況下,可以確定第一開始索引指向待處理檔案的起始位置;確定第一結束索引指向第一分界檔案塊中的行分隔符的位置。如圖3所示,假設位置33至位置34之間的檔案塊是第一個片段檔案的第一分界檔案塊,則第一個片段檔案的第一索引資料p1 中(p10 ,p11 )可以為(位置31,位置35),例如位置31,位置35可以分別為(0,5000×4kb),將待處理檔案起始位置至第5000×4kb位元組位置的一段檔案確定為第一個片段檔案。 對於其他片段檔案,可以將前一個片段檔案的結束索引作為其開始索引,同樣,確定其結束索引指向相對應的分界檔案塊中的行分隔符的位置。例如,在第一個片段檔案的索引資料p1 (p10 ,p11 )為(0,5000×4kb)的情況下,第二個片段檔案的索引資料p2 可以為(p20 ,p21 ),如p20 、p21 分別指向(5000×4kb,10000×4kb),則將第5000×4kb位元組位置至第10000×4kb位置的一段檔案確定為第二個片段檔案。其中,當前片段檔案(第二個片段檔案)的開始索引p20 與前一片段檔案(第一個片段檔案)的結束索引p11 指向相同的位置。 如此,從第一個片段檔案開始,循環步驟23和步驟24,可以按順序確定出各個片段檔案的索引資料。索引資料可以用於解析設備按照該索引資料,從雲端儲存伺服器解析相對應的片段檔案。 在一個可能的實施方式中,第一片段檔案為最後一個片段檔案,且第一分界檔案塊超出待處理檔案的檔案大小範圍,此時,可以將待處理檔案的結束位置確定為第一結束索引指向的位置。作為示例,假設透過檔案大小資訊得到待處理檔案的大小為1000Mb,如果第一片段檔案的起始位置為距離待處理檔案起始位置990Mb處,確定相對應的第一分界檔案塊位置為:(990Mb+5000×4kb)至(990Mb +5001×4kb),而待處理檔案結束位置為(990Mb+ 10240kb)=1000Mb,由此,可以判斷第一分界檔案塊超出待處理檔案的檔案大小範圍,此時,可以直接將待處理檔案結束位置1000Mb確定為第一結束索引指向的位置。 根據一種實施方式,可以將各個片段檔案的索引資料添加到索引資訊中。該索引資訊用於儲存針對待處理檔案劃分的多個片段檔案的索引資料。該索引資訊可以保存為表格、數組、集合等等。例如:索引資訊[p1 (p10 ,p11 );p2 (p20 ,p21 )……],用於將待處理檔案劃分時,劃分為[(0,5000×4kb);(5000×4kb,10000×4kb)……]的片段檔案。 在一個實施例中,可以利用上述索引資訊更新雲端儲存伺服器中的任務配置表,以供雲端儲存伺服器按照任務配置表的分發規則向多個解析設備分發索引資訊中的各個索引資料。其中,任務配置表的分發規則可以是,諸如運算各個解析設備的當前任務量,向當前任務量較少的解析設備分發較多的解析任務,等等一切公知技術可以提供的分發規則,在此不再贅述。 在一個實施例中,還可以不向雲端儲存伺服器上傳索引資訊,而是直接將索引資訊中的索引資料作為傳遞參數,透過參數呼叫的方式(如RPC呼叫)發送至解析設備叢集中的某個解析設備。可選地,還可以每得到一個索引資料,就直接將該索引資料作為傳遞參數,透過參數呼叫的方式發送至一個解析設備。 解析設備得到索引資料後,可以根據索引資料指向的位置,解析對應的片段檔案,如解析圖3中以位置31開始、以位置35結束的片段檔案。 回顧以上過程,透過起始檔案塊獲取行容量,再根據預設片段行數和行容量確定分界檔案塊,透過下載分界檔案塊獲取其中的行分隔符,從而至少基於該行分隔符得到片段檔案的索引資料,用於解析設備根據索引資料從雲端儲存伺服器解析片段檔案,如此,可以僅僅從雲端儲存伺服器獲取起始檔案塊和分界檔案塊,由於選擇性地下載部分待處理檔案,透過確定各個片段檔案的索引資訊確定待處理檔案的劃分方案,而不用下載整個檔案並對檔案進行真實切割,可以減少耗時,提高檔案處理的有效性。 根據另一態樣的實施例,還提供一種檔案處理的裝置,適用於透過解析設備叢集針對待處理檔案進行解析的情況。圖4示出根據一個實施例的用於檔案處理的裝置的示意性方塊圖。如圖4所示,用於檔案處理的裝置400包括:包括:起始檔案塊下載單元41,配置為從所述雲端儲存伺服器下載起始檔案塊,以獲取所述待處理檔案的首個行分隔符的位置,該起始檔案塊是待處理檔案中從起始位置開始的、包括首個行分隔符的檔案塊;行容量確定單元42,配置為基於首個行分隔符的位置確定待處理檔案的行容量;分界檔案塊下載單元43,配置為根據預設片段行數和行容量,下載待處理檔案中的第一分界檔案塊,第一分界檔案塊包括,當按照預設片段行數將待處理檔案進行劃分時,劃分得到的多個片段檔案中第一片段檔案的結束位置的行分隔符;索引資料確定單元44,配置為至少基於第一分界檔案塊中的行分隔符的位置,確定第一片段檔案的第一索引資料,第一索引資料包括第一開始索引和第一結束索引,第一索引資料用於解析設備按照第一索引資料,從雲端儲存伺服器解析第一片段檔案。 根據一態樣的實施例,起始檔案塊下載單元41進一步配置為:從起始位置開始下載預定大小的檔案塊作為起始檔案塊,並從起始檔案塊中查找行分隔符;在未查找到行分隔符的情況下,向後增加一個預定大小的檔案塊以更新起始檔案塊,直到從中查找到首個行分隔符。 在一個實施例中,行容量確定單元42進一步配置為:將行容量確定為,待處理檔案的起始位置至首個分隔符的位置所包含的位元組數。 根據一種可能的設計,分界檔案塊下載單元43包括:第一確定模組,配置為確定第一片段檔案的檔案開始位置;第二確定模組,配置為確定第一分界檔案塊的塊開始位置為,檔案開始位置加上片段容量的位置,所述第一分界檔案塊的塊結束位置為,所述塊開始位置加上一個行容量的位置,其中,片段容量為,預設片段檔案行數與行容量的乘積。 在進一步的實施例中,分界檔案塊下載單元43還包括:下載模組,配置為下載第一分界檔案塊;查找模組,配置為從第一分界檔案塊中查找行分隔符;在未查找到行分隔符的情況下,上述第二確定模組還可以向後增加一個行容量大小的檔案塊以更新第一分界檔案塊,並透過上述下載模組下載更新後的第一分界檔案塊,直到上述查找模組從中查找到行分隔符。 在一個實施例中,第一確定模組進一步可以配置為:在第一片段檔案是待處理檔案的第一個片段檔案的情況下,將待處理檔案的起始位置作為第一片段檔案的檔案開始位置;否則,將第一片段檔案的前一個片段檔案的結束位置作為檔案開始位置。 當第一片段檔案為待處理檔案的第一個片段檔案時,索引資料確定單元44進一步可以配置為:確定第一開始索引指向待處理檔案的起始位置;確定第一結束索引指向第一分界檔案塊中的行分隔符的位置。 當第一片段檔案為待處理檔案的其他片段檔案時,索引資料確定單元44進一步配置為:將第一片段檔案的前一個片段檔案的結束索引作為第一開始索引;確定第一結束索引指向第一分界檔案塊中的行分隔符的位置。 在一些實現中,裝置400還可以包括: 獲取單元(未示出),配置為獲取所述待處理檔案的檔案大小資訊;獲取單元,配置為獲取所述待處理檔案的檔案大小資訊;以及 檢測單元(未示出),配置為基於檔案大小資訊檢測第一分界檔案塊是否超出待處理檔案的檔案大小範圍; 在超出的情況下,索引資料確定單元44還可以配置為,確定所述第一結束索引指向所述待處理檔案的結束位置。 根據一個可能的設計,裝置400還可以包括:添加單元(未示出),配置為將第一索引資料添加到用於多個片段檔案的索引資訊中。 在進一步的實施例中,裝置400還可以包括,更新單元(未示出),配置為利用上述索引資訊更新雲端儲存伺服器中的任務配置表,以供雲端儲存伺服器按照任務配置表的分發規則向解析設備叢集分發索引資訊。 在一些實施例中,裝置400還可以包括,呼叫單元(未示出),配置為將第一索引資料作為傳遞參數,透過參數呼叫的方式發送至解析設備叢集。 透過以上裝置400,可以僅僅從雲端儲存伺服器獲取起始檔案塊、至少一個分界檔案塊,由於選擇性地下載部分待處理檔案,透過確定各個片段檔案的索引資訊確定待處理檔案的劃分方案,而不用下載整個檔案並對檔案進行真實切割,可以減少耗時,提高檔案處理的有效性。 根據另一態樣的實施例,還提供一種電腦可讀媒體儲存媒體,其上儲存有電腦程式,當所述電腦程式在電腦中執行時,令電腦執行結合圖2所描述的方法。 根據再一態樣的實施例,還提供一種運算設備,包括記憶體和處理器,所述記憶體中儲存有可執行碼,所述處理器執行所述可執行碼時,實現結合圖2所述的方法。 本領域技術人員應該可以意識到,在上述一個或多個示例中,本發明所描述的功能可以用硬體、軟體、韌體或它們的任意組合來實現。當使用軟體實現時,可以將這些功能儲存在電腦可讀媒體中或者作為電腦可讀媒體上的一個或多個指令或碼進行傳輸。 以上所述的實施例,對本發明的目的、技術方案和有益效果進行了進一步詳細說明,所應理解的是,以上所述僅為本發明的實施例而已,並不用於限定本發明的保護範圍,凡在本發明的技術方案的基礎之上,所做的任何修改、等同替換、改進等,均應包括在本發明的保護範圍之內。The scheme provided in this specification will be described below in conjunction with the drawings. FIG. 1 is a schematic diagram of an implementation scenario disclosed in this specification. As shown in FIG. 1, the cloud storage server 110 can provide a distributed storage service for the computing platform 130 and the like, such as a cloud object storage service platform OSS (Object Storage Service). The computing platform 130, the parsing devices 121, 122, 123, etc. can all store data to the cloud storage server 110 or obtain data from the cloud storage server 110 according to certain rules. For example, the computing platform 130 or the parsing devices 121, 122, 123, etc. can download the specified field from the cloud storage server 110 by sending a request containing the specified field (such as the 10th byte to the 100th byte) Data (such as data from the 10th byte to the 100th byte). The computing platform 130 and the analysis devices 121, 122, 123, etc. can also exchange data through various wired or wireless networks. The computing platform 130 is an electronic device with certain data processing capabilities, such as a server that provides support for client applications, such as Alipay server, nail server, shopping application server, and so on. When the file to be processed by the file processor in the computing platform 130 is stored in the cloud storage server 110, and the file format (such as html format) stored in the cloud storage server 110 is not a file format that the file processor can handle (such as txt format), if the file to be processed is large, the computing platform 130 cannot parse itself, or the analysis takes a long time, and the data processing pressure is caused, the analysis task can be distributed to the analysis devices 121, 122, 123, etc. To be done. Generally, the computing platform 130 downloads the pending file from the cloud storage server 110, saves it locally, and then saves a preset file for each preset line number (such as 5000 lines) read from the pending file, and then uploads each segment file To the cloud storage server 110. The cloud storage server 110 updates the task configuration table with the information of each fragment file, and then distributes the file information of each fragment file as distribution parameters to the parsing devices 121, 122, 123, etc. The parsing devices 121, 122, 123, etc. according to the received files The information parses the corresponding fragment files to generate files in a file format that the file processor in the computing platform 130 can process. The total time required for file analysis in this method includes at least: the time t 11 for the computing platform 130 to download the pending file from the cloud storage server 110, and the computing platform 130 reads the pending file downloaded to the local line by line to divide it into each Processed t fragment file 12, computing platform 130 of each segment storage server 110 to upload files to the cloud 13 is Processed t, t cloud storage server 110 Processed with each segment profile information update task configuration table 14, the servo cloud storage It takes time t 15 for the parser 110 to distribute parsing tasks to the parsing devices 121, 122, 123, etc., and t 16 for parsing the corresponding segment files by the parsing devices 121, 122, 123, etc. In the embodiment of the present specification, the computing platform 130 may first download the starting file block from the starting position of the file to be processed, and determine the line capacity according to the position of the first line separator in the starting file block, and then based on the line capacity and The preset segment line number download boundary file block is used to determine the index data of the segment file. Here, the computing platform 130 selectively downloads the file blocks in the file to be processed from the cloud storage server 110, rather than downloading all of them. What is determined is the index data of the fragment file, and the file to be processed is not actually cut. After that, the computing platform 130 can upload the index information to the cloud storage server 110 to update the task configuration table, and the cloud storage server 110 distributes the analysis task to the analysis devices 121, 122, 123, etc. according to the task configuration table, or directly The index information is passed as parameters to the analysis devices 121, 122, 123, etc. to distribute the analysis tasks. In this way, when the cloud storage server 110 distributes the parsing task, the time for file parsing includes: the time t 21 for the computing platform 130 to selectively download part of the file blocks from the cloud storage server 110, and from the downloaded file blocks Time t 22 for determining the index information of each clip file, time t 23 for uploading the index information of each clip file to the cloud storage server 110, and time t 24 for the cloud storage server 110 to update the task configuration table with each piece file information , consuming cloud storage server 110 to resolve the distribution device 121, 122 parsing tasks like t 25, 121, 122 and other analytical parsing device file corresponding fragment Processed t 26. It can be understood that the time t 21 for selectively downloading some file blocks from the cloud storage server 110 is much less than the time t 11 for downloading the entire pending file from the cloud storage server 110, and the index is determined from the downloaded file blocks The time t 22 for the information is less than the time t 12 for reading the file to be downloaded to the local line-by-line to divide it into each segment file. The computing platform 130 uploads the index information of each segment file to the cloud storage server 110 t 23 is far less than the time-consuming t 13 for uploading individual clip files. The other time-consuming t 14 , t 15 , and t 16 are basically the same as t 24 , t 25 , and t 26. Therefore, the time for analyzing files can be greatly reduced. When the cloud computing platform 130 distributes the parsing task, the time for file parsing includes: the time t 31 for the computing platform 130 to selectively download part of the file blocks from the cloud storage server 110, and determining each segment from the downloaded file blocks Processed encyclopedia information t 32, computing platform 130 takes to resolve the distribution device 121, 122 and other analytical tasks t 33, 121, 122 and other analytical parsing device file corresponding fragment Processed t 34. Compared with the prior art, the time t 21 for selectively downloading some file blocks from the cloud storage server 110 is less than the time t 11 for downloading pending files from the cloud storage server 110, and the index is determined from the downloaded file blocks The time t 22 of the information is less than the time t 12 for reading the file to be downloaded to the local line by line to divide it into each segment file. The parsing devices 121, 122, 123, etc. parse the corresponding segment file t 34 and t 16 consistent, computing platform 130 directly to the parsing device 121 to distribute the like parsing tasks, time-consuming t 33, the computing platform 130 eliminates the need for storage server 110 to upload files each segment Processed t 13 to the cloud, the cloud storage servo The time t 14 for the task configuration table to be updated by the device 110 with each piece of file information is less than or equal to the time t 15 for the cloud storage server 110 to distribute the analysis task to the analysis devices 121, 122, 123, etc. Therefore, the time consumption is greatly reduced . The following is a detailed description of the process by which the computing platform 130 determines the index information of each segment file through the selectively downloaded file blocks. FIG. 2 shows a flowchart of a file processing method according to an embodiment. This method is suitable for parsing files to be processed through a cluster of parsing equipment. Its execution subject may be any system, equipment, device, platform, or server with computing and processing capabilities, such as the computing platform shown in FIG. 1. As shown in FIG. 2, the method includes the following steps: Step 21: Download the starting file block from the starting position in the pending file from the cloud storage server to obtain the position of the first line separator of the pending file; Step 22, determine the line capacity of the file to be processed based on the position of the first line separator; Step 23, determine the first boundary file block in the file to be processed according to the preset segment line number and the above line capacity, the first boundary file The block includes, when dividing the to-be-processed file according to the preset segment line number, the line separator of the end position of the first segment file among the divided segment files; step 24, based at least on the first boundary file block The position of the line separator determines the first index data of the first segment file. The first index data includes a first start index and a first end index. The first index data is used by the parsing device to store from the cloud according to the first index data The server parses the first fragment file. First, in step 21, download the starting file block from the cloud storage server to obtain the position of the first line separator of the pending file, where the starting file block is the file block from the starting position in the pending file, And include the first line separator. It is worth noting that the file block here is not a block block when the file is stored, but a section of the file at a specified location and/or size in the file. For example, specifying the starting position as the starting position of the file to be processed and the file block with a size of 4 kilobytes (4 kb) is a section of file containing 4 thousand bytes starting from the starting position of the file to be processed. The size of the starting file block can be determined based on experience, or it can be a randomly set smaller value to download the smallest possible file block from the cloud storage server. Taking the determination based on experience as an example, the size of the first file can be determined as the size of the starting file block by counting the file size of the first line of multiple files, taking the maximum value, or the value of the file size of the first line greater than a predetermined ratio (such as 90%) . As shown in FIG. 3, the starting file block from position 31 to position 32 can be downloaded according to the predetermined file block size. For the starting file block of the download, the line separator can be detected from the starting position to find the position of the first line separator. It can be understood that, since the starting file block starts from the starting position of the file to be processed, the file size from the position to the first line separator is the file size of the first line. In some embodiments, one or more line separators may be included in the starting file block. In this case, the first line separator detected is determined as the first line separator. As shown in FIG. 3, assuming that there is a line separator at position 32, it is determined as the first line separator. In other embodiments, the starting file block may also include 0 line separators. At this time, the size of the starting file block can also be changed according to the actual situation. For example, initially, download a file block of a predetermined size (such as 4kb) from the starting position as the starting file block, and find the line separator from the starting file block; if no line separator is found, go backward Add a file block of a predetermined size (which can be the same or different from the previous predetermined size of 4kb) to update the above-mentioned starting file block until the first line separator is found from it. Assuming that an archive block with a predetermined size of 4 kb is downloaded from the beginning of the file to be processed initially, when no line separator is found in the 4 kb archive block, the size of the initial archive block can be updated to (4+4= 8) KB, that is, you can download a file of 8 kb in size from the beginning of the file to be processed as the initial file block, and find the line separator from it. If a line separator is found, the position of the first line separator can be determined. Otherwise, continue to update the initial archive block, the initial archive block size is 12kb, and find the line separator from it. And so on, until the line separator is found from the initial file block, and the position of the first line separator is determined. Step 22: Determine the line capacity of the file to be processed based on the position of the first line separator. It can be understood that the line capacity can be used to indicate the size of a line of files to be processed, as shown in FIG. 3, the size of the file block between position 31 and position 32. In one embodiment, the line capacity may be determined as the number of bytes (eg 4 kb) contained in the starting position of the file to be processed (eg position 31) to the position of the first line separator (eg position 32). In some optional embodiments, the first line of the file to be processed may be more special, for example, there is only one file header, etc. At this time, it can be excluded by preset exclusion conditions (such as less than 10 bytes, etc.), Then find the next line separator, and determine the size of the file block between the first line separator and the next line separator as the line capacity. In other optional embodiments, the capacity of the first line can also be determined according to the size of the file segment between the starting position of the file to be processed and the first line separator, and the first line separator and the second line are separated The file size between the characters determines the capacity of the second line, and the average or larger value of the capacity of the first line and the capacity of the second line is used as the line capacity determined in step 22. In this way, you can rule out the more special case of the first line of the pending file. Step 23: Download the first boundary file block in the file to be processed according to the preset segment line number and the line capacity. The first boundary file block includes a line separator at the end position of the first clip file among the divided clip files when the file to be processed is divided according to the preset clip line number. It is worth noting that the "first" in the first fragment file and the first decomposition file block does not indicate the order, but represents a certain one or any one. In order to facilitate the description of the correspondence between the fragment file and the boundary file block, through ""First" to express. Among them, the preset segment line number may be set manually, or may be determined according to the size of the file to be processed, the line capacity, and the number of analysis devices. For example, if the size of the file to be processed is 1000 megabytes (Mb) and the line capacity is 4 kb, the estimated line number is 1000 Mb/4 kb=1000×1024/4=256000. Assuming that there are 64 idle resolution devices, you can change The to-be-processed file is a fragment file in accordance with 4000 behaviors, and 64 fragment files are obtained to ensure that each parsing device can be assigned to the fragment file to speed up processing. It can be understood that this is only an example, and in practice, the preset number of segment lines can be determined according to the situation, which is not limited in this application. According to the preset segment line number and the above-mentioned line capacity, the first boundary file block in the file to be processed can be determined. The first boundary file block may indicate the approximate location of the end of the first segment file. It is easy to understand that because the segment file is divided according to the number of lines when the segment file is to be processed, the end position of a segment file is the end position of a line. Therefore, you can find the corresponding line separator according to the estimated boundary position. To accurately determine the end position of a clip file. Therefore, a section of the file including the line separator for dividing the end position of the first segment file can be used as the estimated boundary position, and determined as the first boundary file block for downloading. In practice, the file start position of the first clip file can be determined first, for example, when the first clip file is the first clip file of the entire file to be processed, the file start position of the first clip file is the start position of the file to be processed When the first clip file is another clip file, the file start position of the first clip file is the end position of the previous clip file. For the convenience of description, the product of the preset segment file line number and line capacity may be called segment capacity. Then the size of each clip file fluctuates up and down in the capacity of a clip. It can be understood that if the preset number of segment lines is 5000 lines and the above-mentioned line capacity is 4 kb, the end position of the first segment file of the to-be-processed file is approximately 5000×4 kb from the start position of the to-be-processed file. In practice, in order to ensure that the line separator can be found in the downloaded file block as much as possible, the size of the delimited file block is generally taken as a line capacity. Of course, the size of the demarcation file block can also be taken as other values, such as a size of 1.5 lines, which is not limited in this application. Here, only one row capacity is taken as an example for description. According to an exemplary embodiment, a file block with a size of 4 kb can be taken as the demarcation file block corresponding to the clip file, centered at 5000×4 kb from the start position of a clip file, and 2 kb each before and after. That is, the block start position of the delimited file block is determined as the position where the clip file start position is added to the clip capacity and minus half the line capacity. Taking the first segment file as an example, the first boundary file block is: a file from (5000×4kb-2kb) to (5000×4kb+2kb). According to another embodiment, a file block with a size of 4 kb may be taken backward from the starting position of the clip file at 5000×4 kb as the boundary file block corresponding to the clip file. That is, it is determined that the block start position of the first boundary file block is the file start position plus the position of the clip capacity. Taking the first segment file as an example, the corresponding first boundary file block may be: a file from (5000×4kb) to (5001×4kb). As shown in FIG. 3, assuming that the starting position of the segment file plus a segment capacity is at position 33, a segment of file containing a line size between position 33 and position 34 can be taken as the corresponding delimited file block . Next, you need to find the line separator from the demarcation file block to further determine the specific end position of the clip file. It can be understood that, similar to the starting file block, the boundary file block may contain one or more line separators, or may not contain line separators. In one embodiment, the demarcation file block contains one or more line separators, and the position of the first or last line separator can be determined as the end position of the corresponding segment file. If a line separator is found at position 35 in the boundary archive block starting at position 33 and ending at position 34, then position 35 is determined as the end position of the segment archive. In another embodiment, if the demarcation file block contains 0 line separators, a file block with a line size can be added backward to update the demarcation file block, and the updated demarcation file block can be downloaded until the line delimiter is found from it symbol. Still taking the first segment file, the first boundary file block is: a section from (5000×4kb) to (5001×4kb) as an example, if it is in the first boundary file block (5000×4kb) If the line separator is not found in a file from (5001×4kb), update the first delimited file block to a file from (5000×4kb) to (5002×4kb), download the updated delimiter File block to find the line separator. And so on, until the line separator is found in the first delimited file block. In one embodiment, the demarcation file block contains 0 line separators, and a segment of files can be added to the front and back to increase the demarcation file block by updating the demarcation file block. Until the line separator is found. For example, the boundary file block is a section of files from (5000×4kb-2kb) to (5000×4kb+2kb), which is updated to a section of files from (5000×4kb-4kb) to (5000×4kb+4kb). I will not repeat them here. It is easy to understand that when the first clip file is the last clip file of the entire pending file, the size from the end position of the previous clip file to the end position of the pending file may be less than the size of a clip capacity. The aforementioned method of determining the location of the first boundary file block will cause an error. Therefore, according to some possible embodiments, before step 21 of the process starts, the following step may also be included: acquiring file size information of the file to be processed. Among them, the file size information can be obtained from the meta information of the pending file read from the cloud memory, for example, 1000Mb. Therefore, in an optional embodiment, before this step 23 or in step 23, it may further include a step of determining whether the first clip file is the last clip file. Specifically, it includes: obtaining file size information of the pending file; detecting whether the first boundary file block exceeds the file size range of the pending file based on the file size information. In the case of exceeding, it is determined that the first segment file is the last segment file, and there is no need to download the corresponding boundary file block. Step 24: Determine the first index data of the first segment file based at least on the position of the line separator in the first boundary file block. Similarly, the "first" in the "first index data" here does not indicate the order, but corresponds to the "first boundary archive block" and "first segment archive", indicating the same. The first index data may include a first start index and a first end index. For example, the first index data p i of the i-th clip file is represented by (p i0 , p i1 ), where the first start index p i0 points to the start position of the i-th clip file, and the first end index p i1 points to the End position of i clip files. Alternatively, the first start index p i0 may be the start position of the i-th segment file itself, and the first end index p i1 may be the end position of the i-th segment file itself. In the case where the first segment file is the first segment file of the to-be-processed file, it can be determined that the first start index points to the starting position of the to-be-processed file; and the first end index is determined to point to the line separator in the first delimited file block s position. As shown in FIG. 3, assuming that the file block between position 33 and position 34 is the first boundary file block of the first clip file, the first index data p 1 (p 10 , p 11 ) Can be (position 31, position 35), for example, position 31, position 35 can be (0, 5000×4kb) respectively, and a section of the file to be processed from the starting position of the file to be processed to the position of 5000×4kb byte is determined as the first A clip file. For other clip files, the end index of the previous clip file can be used as its start index. Similarly, the end index points to the position of the line separator in the corresponding delimited file block. For example, in the case where the index data p 1 (p 10 , p 11 ) of the first clip file is (0, 5000×4 kb), the index data p 2 of the second clip file may be (p 20 , p 21 ), if p 20 and p 21 respectively point to (5000×4kb, 10000×4kb), then the segment file from the position of 5000×4kb byte to the position of 10000×4kb is determined as the second segment file. Among them, the start index p 20 of the current clip file (second clip file) and the end index p 11 of the previous clip file (first clip file) point to the same position. In this way, starting from the first clip file, looping steps 23 and 24, the index data of each clip file can be determined in sequence. The index data can be used by the parsing device to parse the corresponding fragment file from the cloud storage server according to the index data. In a possible implementation manner, the first segment file is the last segment file, and the first boundary file block exceeds the file size range of the file to be processed. At this time, the end position of the file to be processed can be determined as the first end index Where to point. As an example, assuming that the size of the file to be processed is 1000Mb through the file size information, if the starting position of the first fragment file is 990Mb away from the starting position of the file to be processed, determine the position of the corresponding first boundary file block as: ( 990Mb+5000×4kb) to (990Mb+5001×4kb), and the end position of the pending file is (990Mb+10240kb)=1000Mb, from this, it can be judged that the first boundary file block exceeds the file size range of the pending file. , The end position of the file to be processed 1000Mb can be directly determined as the position pointed by the first end index. According to one embodiment, the index data of each segment file can be added to the index information. The index information is used to store index data for multiple segment files divided into files to be processed. The index information can be saved as tables, arrays, collections, etc. For example: index information [p 1 (p 10 , p 11 ); p 2 (p 20 , p 21 )...], used to divide the file to be processed into [(0, 5000×4kb); (5000 ×4kb, 10000×4kb)......] clip files. In one embodiment, the above index information may be used to update the task configuration table in the cloud storage server, so that the cloud storage server distributes each index data in the index information to multiple parsing devices according to the distribution rules of the task configuration table. Among them, the distribution rules of the task configuration table may be, for example, calculating the current task amount of each analysis device, distributing more analysis tasks to the analysis device with less current task amount, and so on, all the distribution rules that can be provided by all known technologies, here No longer. In one embodiment, the index information in the index information may not be uploaded to the cloud storage server, but the index data in the index information may be directly used as a transmission parameter and sent to a certain cluster in the resolution device cluster through a parameter call (such as an RPC call). Resolution devices. Optionally, each time the index data is obtained, the index data can be directly used as a transmission parameter and sent to a parsing device through a parameter call. After the parsing device obtains the index data, it can parse the corresponding segment file according to the location pointed by the index data, such as parsing the segment file starting at position 31 and ending at position 35 in FIG. 3. Recalling the above process, the line capacity is obtained through the starting file block, and then the boundary file block is determined according to the preset segment line number and line capacity, and the line separator is obtained by downloading the boundary file block, so as to obtain the segment file based at least on the line separator The index data is used by the parsing device to parse the fragment file from the cloud storage server according to the index data. In this way, the starting file block and the boundary file block can be obtained only from the cloud storage server. Due to the selective download of some pending files, through Determine the index information of each fragment file to determine the division plan of the file to be processed, instead of downloading the entire file and cutting the file in real time, which can reduce time consumption and improve the effectiveness of file processing. According to another aspect of the embodiment, a file processing device is also provided, which is suitable for the case of parsing a file to be processed through a cluster of parsing equipment. FIG. 4 shows a schematic block diagram of an apparatus for file processing according to an embodiment. As shown in FIG. 4, the device 400 for file processing includes: a start file block downloading unit 41 configured to download the start file block from the cloud storage server to obtain the first file to be processed The position of the line separator, the starting file block is the file block that includes the first line separator from the starting position in the file to be processed; the line capacity determination unit 42 is configured to determine based on the position of the first line separator Line capacity of the to-be-processed file; the boundary file block download unit 43 is configured to download the first boundary file block in the file to be processed according to the preset segment line number and line capacity, the first boundary file block includes, when When dividing the to-be-processed file by the number of lines, the line separator at the end position of the first clip file in the divided clip files; the index data determination unit 44 is configured to be based at least on the line separator in the first delimited file block Determine the first index data of the first fragment file. The first index data includes the first start index and the first end index. The first index data is used by the parsing device to parse the first index data from the cloud storage server according to the first index data. A clip file. According to an exemplary embodiment, the starting file block downloading unit 41 is further configured to: download a file block of a predetermined size from the starting position as the starting file block, and find a line separator from the starting file block; When a line separator is found, an archive block of a predetermined size is added backward to update the starting archive block until the first line separator is found from it. In one embodiment, the line capacity determination unit 42 is further configured to determine the line capacity as the number of bytes contained in the position from the start position of the file to be processed to the position of the first separator. According to a possible design, the boundary file block downloading unit 43 includes: a first determination module configured to determine the file start position of the first segment file; a second determination module configured to determine the block start position of the first boundary file block Is the file start position plus the position of the clip capacity, the block end position of the first boundary file block is the position of the block start position plus a line capacity, where the clip capacity is the default number of clip file lines The product of the row capacity. In a further embodiment, the demarcation file block download unit 43 further includes: a download module configured to download the first demarcation file block; a search module configured to search the line delimiter from the first demarcation file block; In the case of a line separator, the second determination module can also add a file block with a line size backward to update the first boundary file block, and download the updated first boundary file block through the download module until The above search module finds the line separator. In one embodiment, the first determining module may be further configured to: when the first segment file is the first segment file of the file to be processed, use the starting position of the file to be processed as the file of the first segment file Start position; otherwise, the end position of the previous clip file of the first clip file is taken as the file start position. When the first segment file is the first segment file of the to-be-processed file, the index data determination unit 44 may be further configured to: determine that the first start index points to the start position of the to-be-processed file; and determine that the first end index points to the first boundary The position of the line separator in the file block. When the first segment file is another segment file to be processed, the index data determination unit 44 is further configured to: use the end index of the previous segment file of the first segment file as the first start index; determine that the first end index points to the first The position of the line separator in a delimited file block. In some implementations, the device 400 may further include: an acquisition unit (not shown) configured to acquire file size information of the pending file; an acquisition unit configured to acquire file size information of the pending file; and detection A unit (not shown) configured to detect whether the first boundary file block exceeds the file size range of the file to be processed based on the file size information; in the case of exceeding, the index data determining unit 44 may also be configured to determine the first The end index points to the end position of the file to be processed. According to a possible design, the device 400 may further include: an adding unit (not shown) configured to add the first index data to the index information for multiple clip files. In a further embodiment, the device 400 may further include an update unit (not shown) configured to update the task configuration table in the cloud storage server using the above index information for the cloud storage server to distribute according to the task configuration table The rule distributes index information to the cluster of parsing devices. In some embodiments, the apparatus 400 may further include a calling unit (not shown) configured to send the first index data as a transmission parameter to the cluster of parsing devices through a parameter call. Through the above device 400, it is possible to obtain only the starting file block and at least one boundary file block from the cloud storage server. Due to the selective download of some pending files, the division plan of the pending files is determined by determining the index information of each fragment file. Instead of downloading the entire file and cutting the file in real time, it can reduce time consumption and improve the effectiveness of file processing. According to another embodiment, there is also provided a computer-readable medium storage medium on which a computer program is stored, and when the computer program is executed in a computer, the computer is caused to perform the method described in conjunction with FIG. 2. According to yet another embodiment, there is also provided a computing device, including a memory and a processor, where executable code is stored in the memory, and when the processor executes the executable code, the implementation is combined with FIG. 2. Described method. Those skilled in the art should be aware that in the above one or more examples, the functions described in the present invention may be implemented by hardware, software, firmware, or any combination thereof. When implemented using software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or codes on the computer-readable medium. The above-mentioned embodiments provide further detailed descriptions of the purpose, technical solutions and beneficial effects of the present invention. It should be understood that the above are only embodiments of the present invention and are not intended to limit the scope of protection of the present invention Any modification, equivalent replacement, and improvement made on the basis of the technical solution of the present invention should be included in the protection scope of the present invention.

21‧‧‧方法步驟 22‧‧‧方法步驟 23‧‧‧方法步驟 24‧‧‧方法步驟 41‧‧‧起始檔案塊下載單元 42‧‧‧行容量確定單元 43‧‧‧分界檔案塊下載單元 44‧‧‧索引資料確定單元 110‧‧‧雲端儲存伺服器 121‧‧‧解析設備 122‧‧‧解析設備 123‧‧‧解析設備 130‧‧‧運算平台 400‧‧‧裝置21‧‧‧Method steps 22‧‧‧Method steps 23‧‧‧Method steps 24‧‧‧Method steps 41‧‧‧Starting file block download unit 42‧‧‧Line capacity determination unit 43‧‧‧Demarcation file block download unit 44‧‧‧ Index data determination unit 110‧‧‧ cloud storage server 121‧‧‧Analysis equipment 122‧‧‧Analysis equipment 123‧‧‧Analysis equipment 130‧‧‧ Computing platform 400‧‧‧device

為了更清楚地說明本發明實施例的技術方案,下面將對實施例描述中所需要使用的圖式作簡單地介紹,顯而易見地,下面描述中的圖式僅僅是本發明的一些實施例,對於本發明所屬技術領域中具有通常知識者來講,在不付出創造性勞動的前提下,還可以根據這些圖式獲得其它的圖式。 圖1示出本說明書揭露的一個實施例的實施場景示意圖; 圖2示出根據一個實施例的檔案處理的方法流程圖; 圖3示出確定待處理檔案的行分隔符和分界檔案塊的一個具體例子; 圖4示出根據一個實施例的用於檔案處理的裝置的示意性方塊圖。In order to more clearly explain the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention. For those with ordinary knowledge in the technical field to which the present invention belongs, on the premise of not paying creative labor, other drawings can also be obtained according to these drawings. FIG. 1 shows a schematic diagram of an implementation scenario disclosed in this specification; 2 shows a flowchart of a file processing method according to an embodiment; FIG. 3 shows a specific example of determining the line separator and delimited file block of the file to be processed; FIG. 4 shows a schematic block diagram of an apparatus for file processing according to an embodiment.

Claims (26)

一種檔案處理的方法,適用於透過解析設備叢集針對雲端儲存伺服器中儲存的待處理檔案進行解析的情況,包括: 從該雲端儲存伺服器下載起始檔案塊,以獲取該待處理檔案的首個行分隔符的位置,該起始檔案塊是該待處理檔案中從起始位置開始的、包括該首個行分隔符的檔案塊; 基於該首個行分隔符的位置確定該待處理檔案的行容量; 根據預設片段行數和該行容量,下載該待處理檔案中的第一分界檔案塊,該第一分界檔案塊包括,當按照該預設片段行數將該待處理檔案進行劃分時,劃分得到的多個片段檔案中第一片段檔案的結束位置的行分隔符;以及 至少基於該第一分界檔案塊中的行分隔符的位置,確定該第一片段檔案的第一索引資料,該第一索引資料包括第一開始索引和第一結束索引,該第一索引資料用於該解析設備叢集中的解析設備按照該第一索引資料,從該雲端儲存伺服器解析該第一片段檔案。A file processing method is suitable for parsing the to-be-processed files stored in the cloud storage server through the parsing device cluster, including: Download the starting file block from the cloud storage server to obtain the position of the first line separator of the pending file, the starting file block is the starting line from the starting position in the pending file, including the first line Delimiter file block; Determine the line capacity of the pending file based on the position of the first line separator; Download the first boundary file block in the to-be-processed file according to the preset segment line number and the line capacity, the first boundary file block includes, when the to-be-processed file is divided according to the preset segment line number, divide The line separator at the end position of the first clip file in the obtained multiple clip files; and The first index data of the first segment file is determined based at least on the position of the line separator in the first boundary file block. The first index data includes a first start index and a first end index. The first index data is used The parsing device in the parsing device cluster parses the first segment file from the cloud storage server according to the first index data. 根據申請專利範圍第1項所述的方法,其中,該從雲端儲存伺服器下載起始檔案塊包括: 從起始位置開始下載預定大小的檔案塊作為起始檔案塊,並從該起始檔案塊中查找行分隔符;以及 在未查找到行分隔符的情況下,向後增加一個預定大小的檔案塊以更新該起始檔案塊,直到從中查找到首個行分隔符。The method according to item 1 of the patent application scope, wherein the downloading of the starting file block from the cloud storage server includes: Download a file block of a predetermined size from the starting position as the starting file block, and find the line separator from the starting file block; and When no line separator is found, a file block of a predetermined size is added backward to update the starting file block until the first line separator is found from it. 根據申請專利範圍第1項所述的方法,其中,基於該首個行分隔符的位置確定該待分割檔案的行容量包括: 將該行容量確定為,該待處理檔案的起始位置至該首個分隔符的位置所包含的位元組數。The method according to item 1 of the patent application scope, wherein determining the line capacity of the file to be divided based on the position of the first line separator includes: The capacity of the line is determined as the number of bytes contained in the position from the starting position of the file to be processed to the position of the first separator. 根據申請專利範圍第1項所述的方法,其中,根據預設片段行數和該行容量,下載該待處理檔案中的第一分界檔案塊包括: 確定該第一片段檔案的檔案開始位置; 確定該第一分界檔案塊的塊開始位置為,該檔案開始位置加上片段容量的位置,該第一分界檔案塊的塊結束位置為,該塊開始位置加上一個行容量的位置,其中,該片段容量為,該預設片段檔案行數與該行容量的乘積。The method according to item 1 of the patent application scope, wherein, according to the preset number of segment lines and the capacity of the line, downloading the first boundary file block in the file to be processed includes: Determine the file start position of the first segment file; The block start position of the first boundary archive block is determined as the position of the file start position plus the segment capacity, and the block end position of the first boundary archive block is the position of the block start position plus a line capacity, where The segment capacity is the product of the preset segment file line number and the line capacity. 根據申請專利範圍第4項所述的方法,其中,根據預設片段行數和該行容量,下載該待處理檔案中的第一分界檔案塊還包括: 下載該第一分界檔案塊,並從該第一分界檔案塊中查找行分隔符;以及 在未查找到行分隔符的情況下,向後增加一個行容量大小的檔案塊以更新該第一分界檔案塊,並下載更新後的第一分界檔案塊,直到從中查找到行分隔符。The method according to item 4 of the patent application scope, wherein, according to the preset number of segment lines and the capacity of the line, downloading the first boundary file block in the file to be processed further includes: Download the first demarcation file block, and find the line delimiter from the first demarcation file block; and When no line separator is found, an archive block with a line capacity is added backwards to update the first boundary file block, and the updated first boundary file block is downloaded until the line separator is found there. 根據申請專利範圍第4項所述的方法,其中,確定該第一片段檔案的檔案開始位置包括: 在該第一片段檔案是該待處理檔案的第一個片段檔案的情況下,將該待處理檔案的該起始位置作為該第一片段檔案的檔案開始位置; 否則,將該第一片段檔案的前一個片段檔案的結束位置作為該檔案開始位置。The method according to item 4 of the patent application scope, wherein determining the file start position of the first segment file includes: In the case that the first segment file is the first segment file of the to-be-processed file, the starting position of the to-be-processed file is taken as the file start position of the first segment file; Otherwise, the end position of the previous clip file of the first clip file is used as the start position of the file. 根據申請專利範圍第1項所述的方法,其中,該第一片段檔案為該待處理檔案的第一個片段檔案, 該確定該第一片段檔案的第一索引資料包括: 確定該第一開始索引指向該待處理檔案的該起始位置;以及 確定該第一結束索引指向該第一分界檔案塊中的行分隔符的位置。The method according to item 1 of the patent application scope, wherein the first segment file is the first segment file of the pending file, The first index data that determines the first segment file includes: Determine that the first starting index points to the starting position of the pending file; and It is determined that the first end index points to the position of the line separator in the first boundary archive block. 根據申請專利範圍第1項所述的方法,其中,該方法還包括: 獲取該待處理檔案的檔案大小資訊; 基於該檔案大小資訊檢測該第一分界檔案塊是否超出該待處理檔案的檔案大小範圍;以及 在超出的情況下,確定最後一個片段檔案的該第一結束索引指向該待處理檔案的結束位置。The method according to item 1 of the patent application scope, wherein the method further comprises: Obtain the file size information of the pending file; Detecting whether the first boundary file block exceeds the file size range of the file to be processed based on the file size information; and In the case of exceeding, it is determined that the first end index of the last clip file points to the end position of the file to be processed. 根據申請專利範圍第1項所述的方法,其中,該確定該第一片段檔案的第一索引資料包括: 將該第一片段檔案的前一個片段檔案的結束索引作為該第一開始索引;以及 確定該第一結束索引指向該第一分界檔案塊中的行分隔符的位置。The method according to item 1 of the patent application scope, wherein the first index data for determining the first segment file includes: Use the end index of the previous segment file of the first segment file as the first start index; and It is determined that the first end index points to the position of the line separator in the first boundary archive block. 根據申請專利範圍第1項所述的方法,該方法還包括,將第一索引資料添加到用於該多個片段檔案的索引資訊中。According to the method described in item 1 of the patent application scope, the method further includes adding the first index data to the index information for the plurality of clip files. 根據申請專利範圍第10項所述的方法,該方法還包括, 利用該索引資訊更新該雲端儲存伺服器中的任務配置表,以供該雲端儲存伺服器按照該任務配置表的分發規則向該解析設備叢集分發該索引資訊。According to the method described in item 10 of the patent application scope, the method further includes, Use the index information to update the task configuration table in the cloud storage server for the cloud storage server to distribute the index information to the parsing device cluster according to the distribution rules of the task configuration table. 根據申請專利範圍第1或10項所述的方法,其中,該方法還包括,將該第一索引資料作為傳遞參數,透過參數呼叫的方式發送至該解析設備叢集。The method according to item 1 or 10 of the patent application scope, wherein the method further includes sending the first index data as a transfer parameter to the cluster of parsing devices through a parameter call. 一種檔案處理的裝置,適用於透過解析設備叢集針對雲端儲存伺服器中儲存的待處理檔案進行解析的情況,包括: 起始檔案塊下載單元,配置為從該雲端儲存伺服器下載起始檔案塊,以獲取該待處理檔案的首個行分隔符的位置,該起始檔案塊是該待處理檔案中從起始位置開始的、包括該首個行分隔符的檔案塊; 行容量確定單元,配置為基於該首個行分隔符的位置確定該待處理檔案的行容量; 分界檔案塊下載單元,配置為根據預設片段行數和該行容量,下載該待處理檔案中的第一分界檔案塊,該第一分界檔案塊包括,當按照該預設片段行數將該待處理檔案進行劃分時,劃分得到的多個片段檔案中第一片段檔案的結束位置的行分隔符;以及 索引資料確定單元,配置為至少基於該第一分界檔案塊中的行分隔符的位置,確定該第一片段檔案的第一索引資料,該第一索引資料包括第一開始索引和第一結束索引,該第一索引資料用於該解析設備叢集中的解析設備按照該第一索引資料,從該雲端儲存伺服器解析該第一片段檔案。A file processing device suitable for parsing files to be processed stored in a cloud storage server through a cluster of parsing equipment, including: The starting file block download unit is configured to download the starting file block from the cloud storage server to obtain the position of the first line separator of the pending file. The starting file block is the starting file block in the pending file The file block starting at the position and including the first line separator; The line capacity determination unit is configured to determine the line capacity of the file to be processed based on the position of the first line separator; The boundary file block download unit is configured to download the first boundary file block in the to-be-processed file according to the preset segment line number and the line capacity, the first boundary file block includes, when the predetermined segment line number is When the to-be-processed file is divided, the line separator at the end position of the first clip file among the divided clip files; and The index data determining unit is configured to determine the first index data of the first segment file based on at least the position of the line separator in the first boundary file block, the first index data including a first start index and a first end index The first index data is used by the analysis devices in the analysis device cluster to parse the first segment file from the cloud storage server according to the first index data. 根據申請專利範圍第13項所述的裝置,其中,該起始檔案塊下載單元進一步配置為: 從起始位置開始下載預定大小的檔案塊作為起始檔案塊,並從該起始檔案塊中查找行分隔符;以及 在未查找到行分隔符的情況下,向後增加一個預定大小的檔案塊以更新該起始檔案塊,直到從中查找到首個行分隔符。The device according to item 13 of the patent application scope, wherein the starting file block downloading unit is further configured to: Download a file block of a predetermined size from the starting position as the starting file block, and find the line separator from the starting file block; and When no line separator is found, a file block of a predetermined size is added backward to update the starting file block until the first line separator is found from it. 根據申請專利範圍第13項所述的裝置,其中,該行容量確定單元進一步配置為: 將該行容量確定為,該待處理檔案的起始位置至該首個分隔符的位置所包含的位元組數。The device according to item 13 of the patent application scope, wherein the row capacity determination unit is further configured to: The capacity of the line is determined as the number of bytes contained in the position from the starting position of the file to be processed to the position of the first separator. 根據申請專利範圍第13項所述的裝置,其中,分界檔案塊下載單元包括: 第一確定模組,配置為確定該第一片段檔案的檔案開始位置; 第二確定模組,配置為確定該第一分界檔案塊的塊開始位置為,該檔案開始位置加上片段容量的位置,該第一分界檔案塊的塊結束位置為,該塊開始位置加上一個行容量的位置,其中,該片段容量為,該預設片段檔案行數與該行容量的乘積。The device according to item 13 of the patent application scope, wherein the demarcation file block download unit includes: The first determining module is configured to determine the file start position of the first segment file; The second determination module is configured to determine the block start position of the first boundary file block as the file start position plus the segment capacity, and the block end position of the first boundary file block as the block start position plus The location of a line capacity, where the segment capacity is the product of the number of lines in the preset segment file and the line capacity. 根據申請專利範圍第16項所述的裝置,其中,分界檔案塊下載單元還包括: 下載模組,配置為下載該第一分界檔案塊; 查找模組,配置為從該第一分界檔案塊中查找行分隔符;以及 在未查找到行分隔符的情況下,該第二確定模組向後增加一個行容量大小的檔案塊以更新該第一分界檔案塊,並透過該下載模組下載更新後的第一分界檔案塊,直到該查找模組從中查找到行分隔符。The device according to item 16 of the patent application scope, wherein the demarcation file block download unit further includes: Download module, configured to download the first boundary file block; A search module configured to search for a line separator from the first boundary file block; and When no line separator is found, the second determination module adds a file block with a line size backward to update the first boundary file block, and downloads the updated first boundary file block through the download module Until the search module finds the line separator. 根據申請專利範圍第16項所述的裝置,其中,第一確定模組進一步配置為: 在該第一片段檔案是該待處理檔案的第一個片段檔案的情況下,將該待處理檔案的該起始位置作為該第一片段檔案的檔案開始位置;以及 否則,將該第一片段檔案的前一個片段檔案的結束位置作為該檔案開始位置。The device according to item 16 of the patent application scope, wherein the first determination module is further configured to: In the case where the first clip file is the first clip file of the file to be processed, the starting position of the file to be processed is taken as the file start position of the first clip file; and Otherwise, the end position of the previous clip file of the first clip file is used as the start position of the file. 根據申請專利範圍第13項所述的裝置,其中,該第一片段檔案為該待處理檔案的第一個片段檔案, 該索引資料確定單元進一步配置為: 確定該第一開始索引指向該待處理檔案的該起始位置;以及 確定該第一結束索引指向該第一分界檔案塊中的行分隔符的位置。The device according to item 13 of the patent application scope, wherein the first segment file is the first segment file of the pending file, The index data determination unit is further configured as: Determine that the first starting index points to the starting position of the pending file; and It is determined that the first end index points to the position of the line separator in the first boundary archive block. 根據申請專利範圍第13項所述的裝置,其中,該裝置還包括: 獲取單元,配置為獲取該待處理檔案的檔案大小資訊;以及 檢測單元,配置為基於該檔案大小資訊檢測該第一分界檔案塊是否超出該待處理檔案的檔案大小範圍; 其中,在超出的情況下,該索引資料確定單元還配置為,確定該第一結束索引指向該待處理檔案的結束位置。The device according to item 13 of the patent application scope, wherein the device further comprises: An obtaining unit configured to obtain file size information of the pending file; and The detection unit is configured to detect whether the first boundary file block exceeds the file size range of the file to be processed based on the file size information; Wherein, in case of exceeding, the index data determining unit is further configured to determine that the first end index points to the end position of the file to be processed. 根據申請專利範圍第13項所述的裝置,其中,該索引資料確定單元進一步配置為: 將該第一片段檔案的前一個片段檔案的結束索引作為該第一開始索引;以及 確定該第一結束索引指向該第一分界檔案塊中的行分隔符的位置。The device according to item 13 of the patent application scope, wherein the index data determining unit is further configured to: Use the end index of the previous segment file of the first segment file as the first start index; and It is determined that the first end index points to the position of the line separator in the first boundary archive block. 根據申請專利範圍第13項所述的裝置,該裝置還包括: 添加單元,配置為將第一索引資料添加到用於該多個片段檔案的索引資訊中。According to the device described in item 13 of the patent application scope, the device further includes: The adding unit is configured to add the first index data to the index information for the plurality of clip files. 根據申請專利範圍第22項所述的裝置,該裝置還包括, 更新單元,配置為利用該索引資訊更新該雲端儲存伺服器中的任務配置表,以供該雲端儲存伺服器按照該任務配置表的分發規則向該解析設備叢集分發所述索引資訊。According to the device described in item 22 of the patent application scope, the device further includes, The updating unit is configured to update the task configuration table in the cloud storage server using the index information, so that the cloud storage server distributes the index information to the parsing device cluster according to the distribution rules of the task configuration table. 根據申請專利範圍第13或22項所述的裝置,其中,該裝置還包括,呼叫單元,配置為將該第一索引資料作為傳遞參數,透過參數呼叫的方式發送至該解析設備。The device according to item 13 or 22 of the patent application scope, wherein the device further includes a calling unit configured to send the first index data as a transfer parameter to the parsing device through a parameter call. 一種電腦可讀媒體儲存媒體,其上儲存有電腦程式,當該電腦程式在電腦中執行時,令電腦執行申請專利範圍第1至12項中任一項所述的方法。A computer-readable medium storage medium on which a computer program is stored, and when the computer program is executed in a computer, causes the computer to execute the method described in any one of the items 1 to 12 of the patent application scope. 一種運算設備,包括記憶體和處理器,其特徵在於,該記憶體中儲存有可執行碼,該處理器執行該可執行碼時,實現申請專利範圍第1至12項中任一項所述的方法。A computing device, including a memory and a processor, characterized in that an executable code is stored in the memory, and when the processor executes the executable code, it implements any one of items 1 to 12 of the patent application range Methods.
TW108107394A 2018-06-22 2019-03-06 File processing method and device TWI711935B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810652326.0 2018-06-22
CN201810652326.0A CN109086307B (en) 2018-06-22 2018-06-22 File processing method and device

Publications (2)

Publication Number Publication Date
TW202001618A true TW202001618A (en) 2020-01-01
TWI711935B TWI711935B (en) 2020-12-01

Family

ID=64839745

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108107394A TWI711935B (en) 2018-06-22 2019-03-06 File processing method and device

Country Status (3)

Country Link
CN (1) CN109086307B (en)
TW (1) TWI711935B (en)
WO (1) WO2019242359A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086307B (en) * 2018-06-22 2020-04-14 阿里巴巴集团控股有限公司 File processing method and device
CN110532237B (en) * 2019-09-05 2022-02-08 恒生电子股份有限公司 Concurrent processing method, device and system for format data file
CN110955515A (en) * 2019-10-21 2020-04-03 量子云未来(北京)信息科技有限公司 File processing method and device, electronic equipment and storage medium
CN110955637A (en) * 2019-11-27 2020-04-03 集奥聚合(北京)人工智能科技有限公司 Method for realizing ordering of oversized files based on low memory
CN111752946B (en) * 2020-06-22 2021-04-30 上海众言网络科技有限公司 Method and device for preprocessing research data based on fragmentation mode

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100477698B1 (en) * 2003-01-13 2005-03-18 삼성전자주식회사 An IPv6 header receiving apparatus and an IPv6 header processing method
US9268801B2 (en) * 2013-03-11 2016-02-23 Business Objects Software Ltd. Automatic file structure and field data type detection
TWI531219B (en) * 2014-07-21 2016-04-21 元智大學 A method and system for transferring real-time audio/video stream
US10216783B2 (en) * 2014-10-02 2019-02-26 Microsoft Technology Licensing, Llc Segmenting data with included separators
CN104750846B (en) * 2015-04-10 2017-12-08 浪潮集团有限公司 A kind of substring lookup method and device
CN106156197A (en) * 2015-04-22 2016-11-23 中兴通讯股份有限公司 The querying method of a kind of data base and device
CN106649403B (en) * 2015-11-04 2020-07-28 深圳市腾讯计算机***有限公司 Index implementation method and system in file storage
CN106919553A (en) * 2016-08-24 2017-07-04 阿里巴巴集团控股有限公司 Document analysis method and apparatus
CN107633102A (en) * 2017-10-25 2018-01-26 郑州云海信息技术有限公司 A kind of method, apparatus, system and equipment for reading metadata
CN109086307B (en) * 2018-06-22 2020-04-14 阿里巴巴集团控股有限公司 File processing method and device

Also Published As

Publication number Publication date
CN109086307B (en) 2020-04-14
CN109086307A (en) 2018-12-25
WO2019242359A1 (en) 2019-12-26
TWI711935B (en) 2020-12-01

Similar Documents

Publication Publication Date Title
TWI711935B (en) File processing method and device
CN109040252B (en) File transmission method, system, computer device and storage medium
CN109218133B (en) Network speed testing system, method, device and computer readable storage medium
CN110636340B (en) Video file uploading method, storage device, terminal device and storage medium
CN108390933B (en) Message distribution method, device, server and storage medium
CN111447248A (en) File transmission method and device
CN110019239B (en) Storage method and device of reported data, electronic equipment and storage medium
US20140359066A1 (en) System, method and device for offline downloading resource and computer storage medium
CN108337100B (en) Cloud platform monitoring method and device
CN114048201A (en) Distributed stream computing engine Flink-based key field real-time deduplication method
CN104503983A (en) Method and device for providing website certification data for search engine
US11847219B2 (en) Determining a state of a network
CN112269726A (en) Data processing method and device
CN112395337B (en) Data export method and device
CN110620722A (en) Order processing method and device
CN116132448B (en) Data distribution method based on artificial intelligence and related equipment
CN112035413A (en) Metadata information query method and device and storage medium
CN109144991B (en) Method and device for dynamic sub-metering, electronic equipment and computer-storable medium
CN103428231B (en) Offline download method and system
CN107977381B (en) Data configuration method, index management method, related device and computing equipment
CN112202895B (en) Method and system for collecting monitoring index data, electronic equipment and storage medium
CN113760876A (en) Data filtering method and device
US20210117406A1 (en) System and method for consistency checks in cloud object stores using microservices
CN112764988A (en) Data segmentation acquisition method and device
CN111190858A (en) Software information storage method, device, equipment and storage medium