CN109446165B

CN109446165B - File merging method and device for big data platform

Info

Publication number: CN109446165B
Application number: CN201811182327.XA
Authority: CN
Inventors: 毛恒
Original assignee: Zhongying Youchuang Information Technology Co Ltd
Current assignee: Zhongying Youchuang Information Technology Co Ltd
Priority date: 2018-10-11
Filing date: 2018-10-11
Publication date: 2021-05-07
Anticipated expiration: 2038-10-11
Also published as: CN109446165A

Abstract

The invention provides a file merging method and device for a big data platform. The method comprises: monitoring directory changes of the big data platform and judging whether the number of files under a changed directory has changed; if the number of files under the changed directory has changed, grouping the files with similar characteristics under that directory; judging whether a set number of small files, whose size is smaller than a set integer multiple of the data block size, exist within the same group; and, if such small files exist within the same group, acquiring those small files and merging them. With this scheme, the number of small files can be reduced, the memory occupation of the namenode is optimized, and the big data platform can accommodate more files.

Description

File merging method and device for big data platform

Technical Field

The invention relates to the technical field of computers, in particular to a file merging method and device for a big data platform.

Background

When data analysis is performed on a big data platform such as a Hadoop cluster, a large number of small files often exist in a data directory. These small files put great pressure on the namenode, so that the computing efficiency of the cluster is reduced several-fold or even tens of times. In the prior art, functional components need to be developed separately for each data directory or each type of target data in order to merge files.

However, existing file merging schemes can only arrange scheduling plans according to time. This maintenance mode has several drawbacks. First, the development work is fragmented and the development cost is high. Second, the scheduling plan cannot be arranged according to actual data conditions: there may be no small files when a task starts, which wastes cluster computing resources, or new files may be written into the directory after the task has run, so that the small-file problem is not properly solved. Third, the computing resources for each file processing job cannot be requested dynamically according to the actual situation.

Disclosure of Invention

In view of this, the present invention provides a file merging method and apparatus for a big data platform, so as to reduce small files, optimize memory usage of a namenode, and enable the big data platform to accommodate more files.

In order to achieve the purpose, the invention adopts the following scheme:

in an embodiment of the present invention, a file merging method for a big data platform includes:

monitoring the directory change of the big data platform, and judging whether the number of files under the changed directory changes;

under the condition that the number of the files under the changed directory is changed, grouping the files with similar characteristics under the changed directory;

judging whether a set number of small files, whose size is smaller than a set integer multiple of the data block size, exist within the same group of files;

and under the condition that the small files exist in the same group of files, acquiring the small files in the same group, and merging the small files in the same group.

In an embodiment of the present invention, a file merging device for a big data platform includes:

the file monitoring unit is used for monitoring the directory change of the big data platform and judging whether the number of files under the changed directory changes;

the file grouping unit is used for grouping the files with similar characteristics under the changed directory under the condition that the number of the files under the changed directory is changed;

the small file judging unit is used for judging whether a set number of small files with the size smaller than the integral multiple of the set data blocks exist in the same group of files;

and the file merging unit is used for acquiring the small files of the same group and merging the small files of the same group under the condition that the small files exist in the files of the same group.

In one embodiment of the present invention, a computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the method of the above-mentioned embodiment.

In an embodiment of the invention, a computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of the above-mentioned embodiment.

According to the file merging method of the big data platform, the file merging device of the big data platform, the computer equipment and the computer readable storage medium, files with similar characteristics under changed directories are grouped by monitoring the directory change of the big data platform, and small files smaller than the integral multiple size of the set data block in the same group of files are merged, so that the small files can be reduced, the memory occupation of the namenode is optimized, the big data platform (such as a cluster) can contain more files, and the fine control of the big data platform on the file size can be realized. The files are merged based on the monitored directory change of the big data platform, and merging can be completed within the shortest time after the small files are generated, so that the real-time performance of file merging can be improved, and the file merging efficiency is improved. The grouping and merging are carried out under the condition that small files exist in the same group of files, so that the resources of a large data platform can be greatly saved, and the resource distribution is more reasonable.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:

FIG. 1 is a flowchart illustrating a file merging method of a big data platform according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for monitoring a change in a directory of a big data platform and determining whether the number of files in the changed directory changes according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a method for monitoring a change in a directory of a big data platform and determining whether the number of files in the changed directory changes according to another embodiment of the present invention;

FIG. 4 is a flowchart illustrating a method for grouping files with similar characteristics under a changed directory according to an embodiment of the present invention;

FIG. 5 is a flowchart illustrating a file merging method of a big data platform according to another embodiment of the present invention;

FIG. 6 is a flowchart illustrating a method for merging small files in the same group according to an embodiment of the present invention;

FIG. 7 is a flowchart illustrating a file merging method of a big data platform according to another embodiment of the present invention;

FIG. 8 is a flow chart illustrating a method for grouping files with similar characteristics under a changed directory according to another embodiment of the present invention;

FIG. 9 is an interaction diagram of a file merging method of a big data platform according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a file merging device of a big data platform according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.

Fig. 1 is a schematic flowchart of a file merging method of a big data platform according to an embodiment of the present invention. As shown in fig. 1, the file merging method of the big data platform of some embodiments may include:

step S110: monitoring the directory change of the big data platform, and judging whether the number of files under the changed directory changes;

step S120: under the condition that the number of the files under the changed directory is changed, grouping the files with similar characteristics under the changed directory;

step S130: judging whether a set number of small files, whose size is smaller than a set integer multiple of the data block size, exist within the same group of files;

step S140: and under the condition that the small files exist in the same group of files, acquiring the small files in the same group, and merging the small files in the same group.

In step S110, the big data platform may be, for example, a Hadoop cluster, and directory changes of the Hadoop cluster may be monitored by monitoring HDFS. Changes to a main directory of the big data platform may be monitored, or changes to leaf directories of the main directory may be monitored at the same time; the specific directories to monitor can be pre-configured as required. A changed directory may be a modified directory, for example, a directory in which a file has been modified or added, or in which a stabilization time has elapsed since a file was modified. The number of files under a directory can be obtained from its file list.

In step S120, when the number of files in the changed directory changes, for example, when it increases, redundant files may be present in the changed directory. Similarity may be judged according to file names, file contents, file schemas and the like. Files with similar characteristics may include files that share some commonality in file name, file name suffix, file format, etc., for example, files with the same file name suffix, the same file name length, or characters in the file name that follow a specific rule. Files with the same file name suffix may be placed in the same group, or files that have the same file name suffix and whose file name length equals a set length may be placed in the same group.

In step S130, the data block may be the smallest storage unit configured for the big data platform. The set number and the set integer multiple of the data block size can be configured as required. For example, the integer multiple is determined according to the storage mode of the big data platform, and the set number is determined by combining the pattern of file sizes, the integer multiple of the data block size and the storage mode of the big data platform.

In step S140, there may be many files under the changed directory, and the files under the same changed directory may be divided into one or more groups. The big data platform may return each group of small files so that the file data can be merged group by group. File merging may be performed using existing or specially designed merging programs. For example, for small files on a Hadoop cluster, the API (application programming interface) provided by Hive/Spark SQL can be used directly: the data of several small files is read at one time and then rewritten into a file according to a predetermined rule, thereby implementing the file merging operation.

The main goal of conventional file merging is to reduce the number of files and the space they occupy. The main goal of file merging on a big data platform, by contrast, is not to reduce the number of files or the total disk space occupied, but to finely control the size of each file, so that each merged file is exactly an integer multiple of the platform's storage block, such as the cluster BLOCK size (usually 64M/128M/256M). If the cluster BLOCK is 128M, a 128M file occupies only one BLOCK, while a 129M file occupies two. A big data platform does not simply seek to reduce the number of files. For example, if 1G of data is placed in a single file and the number of cluster copies is 3, the data exists on only 3 different data nodes, and computation can run concurrently on those three nodes; otherwise, the network overhead of pulling data must be taken into account. If the same 1G of data is placed in four 256M files, the data may be spread over 10 or more nodes, and the concurrency during computation can be higher. Of course, too many files increase the memory load of the namenode, so merging small files on a big data platform is more complex.

Assume that the storage block size configured for a platform is 128M. If a file on the platform is only 40M, the file is much smaller than the data block size but still occupies one data block, so it needs to be merged. If another file is 150M, it is larger than one data block but much smaller than two data blocks, which wastes storage in the second data block, so it also needs to be merged.
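The block accounting in these two examples can be sketched as follows. This is a minimal illustration that follows the document's own accounting (a file is counted as occupying whole blocks); the function names and the `min_fill` threshold are hypothetical and are not specified by the described method.

```python
MIB = 1024 * 1024


def blocks_occupied(size_bytes: int, block_bytes: int) -> int:
    """Whole blocks a file is counted as occupying (ceiling division, at least 1)."""
    return max(1, -(-size_bytes // block_bytes))


def fill_ratio(size_bytes: int, block_bytes: int) -> float:
    """Fraction of the occupied blocks that actually holds data."""
    return size_bytes / (blocks_occupied(size_bytes, block_bytes) * block_bytes)


def is_merge_candidate(size_bytes: int, block_bytes: int, min_fill: float = 0.9) -> bool:
    """Hypothetical rule: flag files that leave too much of their blocks unused."""
    return fill_ratio(size_bytes, block_bytes) < min_fill
```

With a 128M block, the 40M file fills about 31% of its single block and the 150M file fills about 59% of its two blocks, so both are flagged under this illustrative threshold, while an exact 128M file is not.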

In this embodiment, by monitoring the directory change of the big data platform, files with similar characteristics under the changed directory are grouped under the condition that the number of files under the changed directory changes, and small files in the same group of files smaller than the integral multiple of the set data block are merged, so that the cluster directory can be monitored, the small files are automatically analyzed and merged, the small files are reduced, the memory occupation of the namenode is optimized, a big data platform (such as a cluster) can accommodate more files, and the fine control of the big data platform on the file size can be realized. Moreover, the files are merged based on the monitored directory change of the big data platform, and merging can be completed within the shortest time after the small files are generated, so that the real-time performance of file merging can be improved, and the file merging efficiency is improved. Under the condition that small files exist in the same group of files, the files with similar characteristics are grouped and merged, rather than simply allocating a scheduling plan for file merging according to a timing task or setting an execution interval, so that the resources of a large data platform can be greatly saved, and the resource allocation is more reasonable.

Fig. 2 is a flowchart illustrating a method for monitoring a directory change of a big data platform and determining whether the number of files in the changed directory changes according to an embodiment of the present invention. As shown in fig. 2, the step S110 of monitoring the directory change of the big data platform and determining whether the number of files in the changed directory changes may include:

step S111: acquiring a directory to be monitored; polling and inquiring directory information of a big data platform according to the directory to be monitored, and acquiring a current file list of a changed directory;

step S112: acquiring a history file list of the changed directory; and judging whether the number of the files under the changed directory is increased or not by comparing the current file list with the historical file list so as to judge whether the number of the files under the changed directory is changed or not.

In step S111, the directory to be monitored may be configured as needed; for example, it may include a main directory of the big data platform, or also include leaf directories of that main directory or other main directories. Directory information of the big data platform can be queried in turn at regular intervals by polling; for example, the directory information of a Hadoop cluster can be queried through the namenode/API/HDFS client and the like. The directory information may reflect the current status of the directory to be monitored, and may include, for example, the file directory, the most recent modification time of the directory, a list of files under the directory, and so on.

In step S112, the current file list may be a list of file names of the current directory and its sub-directories. The history file list may be a list of file names of the history directory and its subdirectories. The historical state of the directory of the big data platform, including historical modification information, a historical file list and the like, can be recorded in the recording center of the big data platform, so that a corresponding directory can be found in the recording center according to the changed directory, and then a corresponding historical file list can be obtained according to the found directory. For each changed directory, whether a new file exists can be judged by comparing the corresponding current file list with the corresponding history file list. In other embodiments, the existence of a new data file in the directory may be determined by comparing the current directory state with the historical state registered by the record center.

In this embodiment, a change in the number of files can be easily determined by comparing the current file list with the history file list. An increase in the number of files is the most likely indication that files are redundant or that file sizes need fine control.
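The comparison of the current file list with the history file list can be sketched as a set difference. This is a minimal sketch; the function names are hypothetical, and the two lists are assumed to have already been obtained from the platform and the recording center respectively.

```python
def new_files(current: list[str], history: list[str]) -> list[str]:
    """Files present in the current listing but absent from the recorded one."""
    known = set(history)
    return sorted(name for name in current if name not in known)


def has_new_files(current: list[str], history: list[str]) -> bool:
    """True when the directory has gained at least one file since the last record."""
    return len(new_files(current, history)) > 0
```

For example, `new_files(["a.orc", "b.orc", "c.orc"], ["a.orc", "b.orc"])` yields `["c.orc"]`, which would mark the directory as having received new data.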

Fig. 3 is a flowchart illustrating a method for monitoring a directory change of a big data platform and determining whether the number of files in the changed directory changes according to another embodiment of the present invention. As shown in fig. 3, the method for monitoring a directory change of the big data platform and determining whether the number of files in the changed directory changes as shown in fig. 2 may further include:

step S113: acquiring the current modification time and the latest historical modification time of the changed catalog;

step S114: and judging whether the difference value between the current modification time and the latest historical modification time is greater than a set time length so as to judge whether the number of the files in the changed directory changes.

In step S113, directory information of a big data platform may be polled and queried according to the directory to be monitored, and current modification time of the changed directory may be obtained. The current modification time of the changed directory may be obtained together when the current file list of the changed directory is obtained. The last historical modification time of the changed directory may be obtained after obtaining the current modification time of the changed directory. The latest historical modification time, which may refer to the time of the latest modification, may be recorded in the recording center in advance and acquired from the recording center when needed.

In step S114, the set time length may be configured as needed. The latest modification can be identified by judging whether the difference between the current modification time and the latest historical modification time is greater than the set time length.

In this embodiment, by determining whether the difference between the current modification time and the latest historical modification time is greater than a set time, it may be determined whether the file in the changed directory has been modified for the latest time and has been stabilized for a period of time, so that resources of the big data platform may be reasonably used according to the file modification condition, and resource waste caused by executing a file merging operation under an unnecessary condition may be avoided.
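The time comparison of step S114 can be sketched as below. This is a minimal sketch that follows the step as worded; the function and parameter names are hypothetical, and the two timestamps are assumed to come from the polled directory information and the recording center.

```python
from datetime import datetime, timedelta


def exceeds_set_duration(current_mtime: datetime,
                         last_recorded_mtime: datetime,
                         set_duration: timedelta) -> bool:
    """Compare the gap between the two modification times with the set time length."""
    return current_mtime - last_recorded_mtime > set_duration
```

With a current modification time of 12:00, a recorded time of 11:50 and a set time length of 5 minutes, the check passes; with a set time length of 15 minutes it does not.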

Fig. 4 is a flowchart illustrating a method for grouping files with similar characteristics under a changed directory according to an embodiment of the present invention. As shown in fig. 4, in step S120, grouping the files with similar characteristics under the changed directory may include:

step S121: determining whether the files under the changed directory have similar characteristics according to a file naming rule; or reading partial data of the files under the changed directory, and determining whether the files under the changed directory have similar characteristics according to schema information contained in the read partial data; the file naming rule comprises one or more of a rule of file name length, a rule of characters contained in the file name and consistency of file name suffixes;

step S122: and dividing files with similar characteristics under the changed directories into the same group.

In step S121, the rule of file name length may be that the file names of any two files have the same length. The rule of characters contained in the file name may be that the file names of two files contain the same letters or numbers. The consistency of file name suffixes may be that the file name suffixes of two files are identical. By determining whether the files under the changed directory have similar characteristics according to the file naming rule, files that can be merged are found based on their file names.

Part of the data of a file under the changed directory is read, for example, the schema information contained in files such as json and orc. When analysis shows that the schema information in the data read is consistent, the files under the changed directory can be considered to have similar characteristics, and the file merging action is triggered. In this way, files that can be merged are found based on file content.

In this embodiment, by grouping files according to file names or file contents, merging of files can be easily achieved.
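Grouping by a file naming rule can be sketched as follows, using the suffix and file name length mentioned in the text as the grouping key. This is a minimal sketch; the function name and the choice of key are hypothetical, and other rules (for example, character patterns) would need a different key.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_naming_rule(paths: list[str]) -> dict[tuple[str, int], list[str]]:
    """Group file paths by (file name suffix, file name length)."""
    groups: dict[tuple[str, int], list[str]] = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        groups[(p.suffix, len(p.name))].append(path)
    return dict(groups)
```

Files such as `/data/a01.orc` and `/data/a02.orc` land in the same group, while `/data/b.json` and `/data/long_name.orc` each form their own group.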

Fig. 5 is a flowchart illustrating a file merging method of a big data platform according to another embodiment of the present invention. As shown in fig. 5, the file merging method of the big data platform shown in fig. 1 may further include, before step S130, that is, before judging whether a set number of small files smaller than the set integer multiple of the data block size exist in the same group of files:

step S150: and determining the integral multiple size of the set data block according to the size of the storage block configured by the big data platform.

In a big data platform such as a cluster, files are stored in blocks, in integer multiples of the cluster BLOCK size, and a file smaller than one cluster BLOCK still occupies a cluster BLOCK. The set integer multiple of the data block size is determined according to the storage block size configured for the big data platform, for example, the cluster BLOCK size. Small files smaller than the set integer multiple of the data block size, or files larger than a single BLOCK but not larger than the integer multiple of the BLOCK size, can then be found for merging. In this way, small files that occupy a cluster BLOCK without fully using it can be merged into a larger file, the number of files can be reduced, and the size of each file can be finely controlled.

In this embodiment, the size of each file can be finely controlled by determining the set integer multiple of the data block size according to the storage block size configured for the big data platform.

In some embodiments, the file merging method of the big data platform shown in fig. 5 may further include: splitting the merged small files according to the storage block size configured for the big data platform. For example, the data in all orc files with a consistent schema under the xxxx directory is loaded and rewritten in orc format under the yyy directory, with a new file started every 256M. By splitting the merged small files according to the storage block size configured for the big data platform, the number of nodes can be reduced, and the size of the files can be finely controlled.
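The effect of starting a new file every 256M can be sketched by planning the output file sizes for a given amount of merged data. This is a minimal sketch of the size arithmetic only; the function name is hypothetical, and the actual rewriting of orc data is assumed to be done by the platform's own tooling.

```python
MIB = 1024 * 1024


def plan_output_sizes(total_bytes: int, block_bytes: int) -> list[int]:
    """Sizes of the output files when merged data is cut every block_bytes."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    full, remainder = divmod(total_bytes, block_bytes)
    sizes = [block_bytes] * full
    if remainder:
        sizes.append(remainder)  # the last file holds whatever is left over
    return sizes
```

For 600M of merged data and a 256M split size, the plan is two full 256M files followed by one 88M file.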

Fig. 6 is a flowchart illustrating a method for merging small files in the same group according to an embodiment of the present invention. As shown in fig. 6, the step S140, that is, when the small files exist in the same group of files, acquiring the small files in the same group and merging the small files in the same group, may include:

step S141: under the condition that the small files exist in the same group of files, acquiring the small files in the same group, and applying for resources to the big data platform according to the number of the small files in the group and the size of the files;

step S142: and utilizing the applied resource calling file merging program to merge the small files of the group.

In step S141, a resource list is selected according to the number and size of the files in the small file group, and resources are requested from the cluster. In step S142, the small file merging operation may be performed using a designated merging program: if the configuration center provides a processing module for the directory or file type, the configured module is used; otherwise, a default module is called according to the file type to generate a merging program.

In this embodiment, the resources required for the merging operation may be dynamically applied according to the actual data situation, and the most reasonable resource allocation is automatically selected.
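One possible way to derive a resource request from the number and size of the files is sketched below. The text does not specify a sizing rule, so everything here is an assumption made for illustration: the `ResourceRequest` type, the cap of 8 executors, the memory tiers and the 1000-file cutoff are all hypothetical.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class ResourceRequest:
    executors: int
    memory_mib_per_executor: int


def choose_resources(file_sizes: list[int], block_bytes: int = 128 * MIB) -> ResourceRequest:
    """Hypothetical rule: one executor per block of input, capped at 8,
    with more memory per executor when there are many input files."""
    total = sum(file_sizes)
    executors = min(8, max(1, -(-total // block_bytes)))  # ceiling division
    memory = 2048 if len(file_sizes) > 1000 else 1024
    return ResourceRequest(executors, memory)
```

Under these assumed tiers, ten 40M files (400M in total) map to 4 executors with 1024 MiB each, while two thousand 1M files hit the cap of 8 executors and the larger memory tier.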

Fig. 7 is a flowchart illustrating a file merging method of a big data platform according to another embodiment of the present invention. As shown in fig. 7, the file merging method of the big data platform shown in fig. 1 may further include:

step S160: updating and recording the directory information of the merged small files, and recording corresponding merging information; the merging information includes file names before and after the file merging.

In step S160, the file names before and after the file merging are the file name of each file before the merging and the file name of the file after the merging. The processed directory information may be re-registered in the recording center. The merging information may further include merging time and the like.

In this embodiment, the merged file can be conveniently found by updating the directory information. By recording the merging information, the modification history of the merged file can be conveniently found.

Fig. 8 is a flow chart illustrating a method for grouping files with similar characteristics under a changed directory according to another embodiment of the present invention. As shown in fig. 8, the method for grouping files under the changed directory having similar characteristics shown in fig. 4 may further include, before step S121, that is, before determining whether the files under the changed directory have similar characteristics according to the file naming rules:

step S123: acquiring a file naming rule; the file naming rule is generated by configuration or by classification or cluster learning by using historical merging information.

In the above step S123, the history merging information may include a file name of the file before merging, a file name of the file after merging, merging time, directory, and the like, and may be obtained by recording the merging information after each completion of the merging process.

The file naming rule for merging groups may be custom-configured, and multiple rules can be configured directly. For example, it may be specified that, under the XX directory and its subdirectories, files with the suffix xxx and a file name length of 10 are to be monitored and automatically merged. The configuration rules need to support regular expressions.
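A configured rule of this kind can be sketched with a regular expression. This is a minimal sketch under stated assumptions: the root path `/XX`, the function name and the pattern are hypothetical, and the file name length of 10 is assumed to count the whole name including the `.xxx` suffix, which the text does not state.

```python
import re
from pathlib import PurePosixPath

# Assumption: the whole file name is 10 characters and ends in ".xxx",
# so the part before the suffix is 6 characters long.
RULE = re.compile(r"[^/]{6}\.xxx")


def matches_rule(path: str, root: str = "/XX") -> bool:
    """True if the file lies under root (or a subdirectory) and its name fits RULE."""
    p = PurePosixPath(path)
    return PurePosixPath(root) in p.parents and RULE.fullmatch(p.name) is not None
```

Here `/XX/abc123.xxx` and `/XX/sub/abc123.xxx` match, while a file with a longer name, a different suffix or a different root directory does not.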

The file naming rule for merging groups may also be obtained by self-learning, in which similar-file rules are analyzed automatically through machine learning algorithms. One approach relies on classification: for example, if the files in a directory follow close naming rules (e.g., same suffix, same file name length, a regular distribution of letters, numbers and symbols in the file name, etc.), it is determined that the group of files can be merged. Another approach relies on analogy with historical processing: for example, if merging of a certain type of file has been performed under directory A, either manually or by configuration, and directory B is detected to contain similar files, it is inferred that the files under directory B can also be merged. A further approach is file content analysis based on the file format: for example, for files containing schema information such as json and orc, merging may be triggered after part of the file data is read and the schemas of the files are found to be consistent.

In order that those skilled in the art will better understand the present invention, the following description illustrates the practice of the invention in a specific embodiment. For the sake of illustration, the file merging method of the big data platform is described by taking a Hadoop cluster as an example, but the method is not limited to this big data platform.

Fig. 9 is an interaction diagram of a file merging method of a big data platform according to an embodiment of the present invention. Referring to fig. 9, the file merging method of the big data platform according to the embodiment may include:

Step 1: The main control program reads the configuration information of the monitored directories from the configuration center. The configuration information includes, but is not limited to, the main directory and leaf directories to be monitored, the file wildcard, the file block size, the threshold number of files to be merged, and the directory stabilization time.

Step 2: Poll the directory information for each configured directory (for example, by way of the namenode, an API, an HDFS client, or the like), and obtain the latest modification time of the directory and its file list.

Step 3: Compare the current directory state with the historical state registered in the recording center, and judge whether a new data file exists in the directory.

Step 4: Group the small files according to the original configuration and the file naming rules self-learned in step 8, and return the small file groups whose content and format are consistent and in which the number of files smaller than the block size reaches the configured threshold.

Step 5: Process the small file group and generate a merging plan. If the configuration center provides a processing module for the directory or file type, use the configured module; otherwise, call a default module according to the file type to generate a merging program.

Step 6: Select a resource list according to the number and size of the files in the small file group, apply for resources from the cluster, and execute the small file merging operation using the merging program specified in step 5.

Step 7: Re-register the processed directory information in the recording center.

Step 8: Based on the historical merging records, the recording center calls a data algorithm for classification or clustering, analyzes the file classification rules, and solidifies them for use in step 4.
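As an illustration of the configuration information read in step 1, the following is a minimal sketch only. The record name `MonitorConfig`, its field names, the helper `monitored_paths`, and the example paths and values are all assumptions made for illustration; the embodiment does not specify a configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorConfig:
    """Hypothetical configuration record for one monitored directory (step 1)."""
    main_dir: str         # main directory to be monitored
    leaf_dirs: tuple      # leaf directories under the main directory
    file_wildcard: str    # wildcard selecting candidate files
    block_size: int       # file block size, in bytes
    merge_threshold: int  # number of small files that triggers a merge
    stable_seconds: int   # directory stabilization time, in seconds


def monitored_paths(cfg):
    """Return the full path of every leaf directory to be polled in step 2."""
    base = cfg.main_dir.rstrip("/")
    return [f"{base}/{leaf.strip('/')}" for leaf in cfg.leaf_dirs]


# Example values are illustrative only.
cfg = MonitorConfig(
    main_dir="/data/logs/",
    leaf_dirs=("2018-10-11", "2018-10-12"),
    file_wildcard="*.csv",
    block_size=128 * 1024 * 1024,
    merge_threshold=100,
    stable_seconds=300,
)
paths = monitored_paths(cfg)
```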

In this embodiment, directory changes in HDFS are monitored. When the number of files in a directory changes and then remains stable for X minutes, it is judged whether N files smaller than M exist in the directory (X, N, and M are configurable), and whether the group of small files comes from the same data source is judged from the file names (for example, by file suffix, segmentation rule and length, wildcard or regular expression, or classification analysis of historically processed file names). If the small files reach the threshold, resources are applied for dynamically according to the number and size of the files, and the files are merged. Merging the small files of the cluster requires only advance planning, without separate development or manual intervention, which reduces maintenance cost; different merging rules can be specified for cold and hot data, so that the configuration of the cluster files is more reasonable. File merging is also more timely: merging can be completed in the shortest time after the small files are generated, which greatly saves cluster resources. The resources required by the merging operation can be applied for dynamically according to the actual data, and the most reasonable resource ratio is selected automatically, so that efficiency is higher and resource allocation is more reasonable.
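The check on "N files smaller than M" can be sketched as follows. This is only an illustration: the function names and the representation of a file listing as `(name, size)` pairs are assumptions, and the sizes are example values.

```python
def small_files(files, max_size):
    """Return the (name, size) pairs whose size is below max_size (the 'M' value)."""
    return [(name, size) for name, size in files if size < max_size]


def should_merge(files, max_size, min_count):
    """True when at least min_count (the 'N' value) small files are present."""
    return len(small_files(files, max_size)) >= min_count


MB = 1024 * 1024
# Illustrative directory listing: three small files and one large file.
listing = [("a.csv", 3 * MB), ("b.csv", 5 * MB), ("c.csv", 200 * MB), ("d.csv", 1 * MB)]
```

With this listing and M set to 128 MB, a threshold of N = 3 triggers a merge, while N = 4 does not.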

Based on the same inventive concept as the file merging method of the big data platform shown in fig. 1, the embodiment of the present invention further provides a file merging device of the big data platform, as described in the following embodiments. Because the principle by which the file merging device of the big data platform solves the problem is similar to that of the file merging method of the big data platform, the implementation of the device can refer to the implementation of the method, and repeated details are not described again.

Fig. 10 is a schematic structural diagram of a file merging device of a big data platform according to an embodiment of the present invention. As shown in fig. 10, the file merging device of the big data platform of some embodiments may include: a file monitoring unit 210, a file grouping unit 220, a small file judging unit 230, and a file merging unit 240, which are connected in sequence.

a file monitoring unit 210, configured to monitor directory changes of the big data platform and determine whether the number of files in a changed directory changes;

a file grouping unit 220, configured to group files with similar characteristics in the changed directory if the number of files in the changed directory changes;

a small file judging unit 230, configured to determine whether the same group of files contains a set number of small files smaller than an integer multiple of the set data block size;

a file merging unit 240, configured to, if the small files exist in the same group of files, obtain the small files in the same group and merge them.

In some embodiments, the file monitoring unit 210 may include a current file list acquisition module and a file quantity judgment module, which are connected with each other. The current file list acquisition module is used for acquiring a directory to be monitored, polling the directory information of the big data platform according to the directory to be monitored, and acquiring a current file list of a changed directory. The file quantity judgment module is used for acquiring a history file list of the changed directory, and for judging whether the number of files under the changed directory has increased by comparing the current file list with the history file list, so as to judge whether the number of files under the changed directory has changed.
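The comparison of the current file list with the history file list could be sketched as below. This is one possible reading, in which "increased" is taken to mean that the current list contains names absent from the history list; the function names and the example lists are assumptions.

```python
def new_files(current, history):
    """Return the names present in the current list but not in the history list."""
    return sorted(set(current) - set(history))


def file_count_increased(current, history):
    """True when at least one file has appeared since the history list was recorded."""
    return len(new_files(current, history)) > 0


# Illustrative lists for one monitored directory.
history_list = ["part-0001.csv", "part-0002.csv"]
current_list = ["part-0001.csv", "part-0002.csv", "part-0003.csv"]
```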

In some embodiments, the file monitoring unit 210 may further include a modification time acquisition module and a change time judgment module, which are connected with each other. The modification time acquisition module is used for acquiring the current modification time and the latest historical modification time of the changed directory. The change time judgment module is used for judging whether the difference between the current modification time and the latest historical modification time is greater than a set time length, so as to judge whether the number of files in the changed directory has changed.
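The time comparison performed by the change time judgment module can be sketched as follows. The function name and the example timestamps are assumptions; how the two modification times are obtained from the platform is not shown.

```python
from datetime import datetime, timedelta


def exceeds_set_duration(current_mtime, last_recorded_mtime, set_duration):
    """True when the current modification time is later than the latest
    historical modification time by more than set_duration."""
    return (current_mtime - last_recorded_mtime) > set_duration


# Illustrative values: latest historical modification time and a 5-minute set time length.
recorded = datetime(2018, 10, 11, 12, 0, 0)
window = timedelta(minutes=5)
```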

In some embodiments, the file grouping unit 220 may include an identity judging module and a file group dividing module, which are connected with each other. The identity judging module is used for determining whether the files under the changed directory have similar characteristics according to a file naming rule, or for reading partial data of the files under the changed directory and determining whether the files have similar characteristics according to the mode information contained in the read partial data. The file naming rule comprises one or more of: a rule on file name length, a rule on the characters contained in the file name, and consistency of the file name suffix. The file group dividing module is used for dividing the files with similar characteristics under the changed directory into the same group.
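Grouping by a file naming rule might look like the following sketch. The particular signature used here (suffix, name length, and a "shape" in which letter runs become `A` and digit runs become `N`) is an assumption chosen to illustrate the three kinds of rule mentioned above; it is not the rule used by the embodiment.

```python
import os
import re
from collections import defaultdict


def name_signature(filename):
    """Build a hypothetical naming signature: (suffix, name length, character shape)."""
    stem, suffix = os.path.splitext(filename)
    # Collapse each run of digits to 'N' and each run of letters to 'A' in one pass.
    shape = re.sub(
        r"[0-9]+|[A-Za-z]+",
        lambda m: "N" if m.group()[0].isdigit() else "A",
        stem,
    )
    return (suffix.lower(), len(filename), shape)


def group_by_name(filenames):
    """Place files that share the same naming signature in the same group."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name_signature(name)].append(name)
    return dict(groups)


# Illustrative file names.
names = ["log_20181011_01.csv", "log_20181011_02.csv", "users-1.json"]
groups = group_by_name(names)
```

Here the two `log_...csv` files share one signature and fall into the same group, while `users-1.json` forms a group of its own.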

In some embodiments, the file merging device of the big data platform may further include a unit for determining the integer multiple size of the set data block, connected to the small file judging unit 230. This unit is used for determining the integer multiple size of the set data block according to the size of the storage block configured by the big data platform.
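Deriving the integer multiple size from the configured storage block size is simple arithmetic, sketched below. The function names are assumptions, and the 128 MB block size is only an example value.

```python
def target_merge_size(block_size, multiple=1):
    """Return the size equal to an integer multiple of the configured block size."""
    if block_size <= 0 or multiple < 1:
        raise ValueError("block_size must be positive and multiple must be >= 1")
    return block_size * multiple


def merged_file_count(total_bytes, target_size):
    """Number of merged files needed to hold total_bytes at target_size each
    (ceiling division using integers only)."""
    return -(-total_bytes // target_size)


MB = 1024 * 1024
target = target_merge_size(128 * MB, multiple=2)  # 256 MB
```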

In some embodiments, the file merging unit 240 may include a resource application module and a file merging module, which are connected with each other. The resource application module is used for acquiring the small files in the same group if the small files exist in the files of that group, and for applying for resources from the big data platform according to the number and size of the small files in the group. The file merging module is used for calling a file merging program with the applied-for resources to merge the small files of the group.
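How the number and size of the small files might drive the resource request is sketched below. The greedy packing strategy, the one-task-per-batch assumption, and the `memory_per_task_mb` parameter are all hypothetical; the embodiment does not state how the resource list is selected.

```python
def plan_batches(files, target_size):
    """Greedily pack (name, size) pairs, largest first, into batches whose
    total size does not exceed target_size (a single oversized file gets its own batch)."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def estimate_resources(files, target_size, memory_per_task_mb=1024):
    """Hypothetical resource estimate: one merge task per batch."""
    tasks = len(plan_batches(files, target_size))
    return {"tasks": tasks, "memory_mb": tasks * memory_per_task_mb}


# Illustrative sizes in arbitrary units, with a target size of 100 units.
group = [("a", 60), ("b", 50), ("c", 40), ("d", 30)]
```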

In some embodiments, the file merging device of the big data platform may further include a merging recording unit connected with the file merging unit 240. The merging recording unit is used for updating and recording the directory information of the merged small files and recording the corresponding merging information; the merging information includes the file names before and after the files are merged.
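A merging record holding the file names before and after merging could take the following form. The record name `MergeRecord`, its fields, the JSON serialization, and the example values are assumptions for illustration; the storage format of the recording center is not specified.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MergeRecord:
    """Hypothetical merging record kept by the recording center."""
    directory: str       # directory whose information is re-registered
    source_files: list   # file names before merging
    merged_file: str     # file name after merging
    merged_at: float     # merge time, as seconds since the epoch


def to_json(record):
    """Serialize a record as JSON so that it can be stored and later analyzed."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative record.
record = MergeRecord(
    directory="/data/logs/2018-10-11",
    source_files=["log_01.csv", "log_02.csv"],
    merged_file="log_merged_0001.csv",
    merged_at=1539259200.0,
)
restored = json.loads(to_json(record))
```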

In some embodiments, the file grouping unit 220 may further include a rule learning module connected with the identity judging module. The rule learning module is used for acquiring a file naming rule; the file naming rule is generated by configuration, or by classification or cluster learning using historical merging information.
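As a deliberately simplified stand-in for the classification or cluster learning mentioned above, the sketch below derives naming rules from historical merging information by frequency counting. It is not a clustering algorithm; the shape function, the `min_support` parameter, and the example history are assumptions.

```python
import re
from collections import Counter


def name_shape(filename):
    """Replace every run of digits with '#' to obtain a coarse naming pattern."""
    return re.sub(r"[0-9]+", "#", filename)


def learn_rules(history, min_support=2):
    """Keep the naming patterns seen at least min_support times among the
    source file names of past merges."""
    counts = Counter(name_shape(name) for merged in history for name in merged)
    return sorted(shape for shape, n in counts.items() if n >= min_support)


# Illustrative history: each inner list holds the source file names of one past merge.
history = [["log_01.csv", "log_02.csv"], ["evt-2018.json"], ["log_03.csv"]]
rules = learn_rules(history)
```

With this history, only the pattern `log_#.csv` occurs often enough to be kept as a learned rule.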

The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the steps of the method described in the above embodiment are implemented.

Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method described in the above embodiments.

In summary, according to the file merging method for a big data platform, the file merging device for a big data platform, the computer device, and the computer-readable storage medium of the embodiments of the present invention, directory changes of the big data platform are monitored; when the number of files in a changed directory changes, files with similar characteristics in that directory are grouped, and the small files in the same group that are smaller than an integer multiple of the set data block are merged. In this way, the cluster directory can be monitored and small files can be automatically analyzed and merged, which reduces the number of small files and optimizes the memory occupation of the namenode, so that the big data platform (e.g., a cluster) can accommodate more files and can control file sizes more finely. Moreover, because files are merged based on the monitored directory changes of the big data platform, merging can be completed within the shortest time after the small files are generated, which improves the real-time performance and efficiency of file merging. Because files with similar characteristics are grouped and merged when small files exist in the same group of files, rather than merged according to a time-based scheduling plan, resources of the big data platform can be greatly saved and resource allocation is more reasonable.

In the description herein, reference to the description of the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," "an example," "a particular example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The sequence of steps involved in the various embodiments is provided to schematically illustrate the practice of the invention, and the sequence of steps is not limited and can be suitably adjusted as desired.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A file merging method for a big data platform is characterized by comprising the following steps:

under the condition that the small files exist in the same group of files, acquiring the small files in the same group, and merging the small files in the same group to enable the size of each merged file to be equal to the integral multiple of the set data block; wherein the data block is a minimum storage unit of the big data platform configuration;

the monitoring of the change of the directory of the big data platform and the judgment of whether the number of the files under the changed directory changes comprises the following steps:

acquiring the current modification time and the latest historical modification time of the changed directory;

and judging whether the difference value between the current modification time and the latest historical modification time is greater than a set time length so as to judge whether the number of the files in the changed directory changes.

2. The method for merging files of a big data platform according to claim 1, wherein monitoring the directory change of the big data platform and determining whether the number of files under the changed directory changes comprises:

acquiring a directory to be monitored; polling and inquiring directory information of a big data platform according to the directory to be monitored, and acquiring a current file list of a changed directory;

acquiring a history file list of the changed directory; and judging whether the number of the files under the changed directory is increased or not by comparing the current file list with the historical file list so as to judge whether the number of the files under the changed directory is changed or not.

3. The file merging method for the big data platform according to claim 1, wherein grouping files with similar characteristics under the changed directory comprises:

determining whether the files under the changed directory have similar characteristics according to a file naming rule; or reading partial data of the file under the changed directory, and determining whether the file under the changed directory has similar characteristics according to mode information contained in the read partial data; the file naming rule comprises one or more of a rule of file name length, a rule of characters contained in the file name and consistency of suffix of the file name;

and dividing files with similar characteristics under the changed directories into the same group.

4. The method as claimed in claim 1, wherein before determining whether there are a set number of small files in the same group that are smaller than an integer multiple of the set data block size, the method further comprises:

and determining the integral multiple size of the set data block according to the size of the storage block configured by the big data platform.

5. The file merging method of the big data platform according to claim 1, wherein in a case that the small files exist in the same group of files, acquiring the small files of the same group and merging the small files of the same group, comprises:

under the condition that the small files exist in the same group of files, acquiring the small files in the same group, and applying for resources to the big data platform according to the number of the small files in the group and the size of the files;

and calling a file merging program with the applied-for resources to merge the small files of the group.

6. The file merging method for the big data platform according to claim 3, wherein the method further comprises:

updating and recording the directory information of the merged small files, and recording corresponding merging information; the merging information comprises file names before and after the files are merged;

grouping the files with similar characteristics under the changed directory further comprises, before determining whether the files under the changed directory have similar characteristics according to a file naming rule:

acquiring a file naming rule; the file naming rule is generated by configuration or by classification or cluster learning by using historical merging information.

7. A file merging device of a big data platform is characterized by comprising:

the file merging unit is used for acquiring the small files in the same group under the condition that the small files exist in the files in the same group, and merging the small files in the same group to enable the size of each merged file to be equal to the integral multiple of the set data block; wherein the data block is a minimum storage unit of the big data platform configuration;

the file monitoring unit is specifically configured to obtain the current modification time and the latest historical modification time of the changed directory, and to judge whether the difference between the current modification time and the latest historical modification time is greater than a set time length, so as to judge whether the number of files in the changed directory changes.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the program is executed by the processor.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.