CN111881092B - Method and device for merging files based on cassandra database - Google Patents

Method and device for merging files based on cassandra database

Info

Publication number
CN111881092B
Authority
CN
China
Prior art keywords
file
merging
data
disk
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010576064.1A
Other languages
Chinese (zh)
Other versions
CN111881092A (en)
Inventor
叶志钢
王化民
张本军
王赟
谭国权
赵雨佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Greenet Information Service Co Ltd
Original Assignee
Wuhan Greenet Information Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Greenet Information Service Co Ltd
Priority to CN202010576064.1A
Publication of CN111881092A
Application granted
Publication of CN111881092B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of databases, and in particular to a method and a device for merging files based on a Cassandra database. The method mainly comprises: receiving the data files generated by the database and generating a merge file list for each disk; having the merging process of each disk obtain the merge file list of its disk and the sizes of the data files to be merged in that list; starting a parallel merging process of the database, which calculates the sum of the data file sizes obtained by the per-disk merging processes; and, when that sum reaches a merge file threshold, having the parallel merging process merge the data files to be merged on all disks in a single pass. The method and device merge small files promptly while using fewer merge levels and temporary files, which reduces the number of merges, the disk space occupied by files, the number of disk IO operations and disk IO contention, and thereby improves file merging performance and the read-write stability of the database.

Description

Method and device for merging files based on cassandra database
[ Field of technology ]
The invention relates to the field of databases, in particular to a method and a device for merging files based on cassandra databases.
[ Background Art ]
The Cassandra database is an open-source distributed hybrid storage scheme characterized by decentralization, scalability, high availability, fault tolerance and configurable consistency. Because Cassandra sequentially flushes buffered data to disk, it generates many data files (Sorted String Table, abbreviated as sstable) of about 1-10 MB each. When the volume of data processed by a Cassandra database is large, the huge number of files can seriously affect the stability of the database and slow down queries.
To reduce the number of files to be processed, the Cassandra database provides a file merging (Compaction) mechanism that merges many small files into a small number of large files. The file merging strategies in common use are:
(1) Size Tiered Compaction Strategy
This strategy performs multi-level merging based on file size: small files are merged in a small-file merging thread and large files in a large-file merging thread until the required file size is finally reached. The number of files merged at a time is determined by the max_threshold parameter, which in theory lets the user control the merge size, but the effect in a real production environment is as follows. With a smaller max_threshold (e.g. 32), the number of files drops quickly, but the merged files are still very small and trigger a secondary merge; the secondary merge needs more disk IO resources and often runs at the same time as the writing of the next batch, so the two contend for disk IO. With a larger max_threshold (e.g. 128), the number of files cannot be reduced quickly, and by the time the merge condition is reached the next batch is often being written, so the two again contend for disk IO and write performance jitters severely. The disadvantage of this strategy is that merging a number of small files does not reach the required file size; a file of the required size is obtained only after several rounds of rolling merges, so the same data is merged several times and the repeated reads and writes multiply the read-write load on the disk.
(2) Leveled Compaction Strategy
Scanning starts from the highest level to determine whether compaction is needed, and if so, the sstables of that level are formed into a task. Compacting the sstables of higher levels first effectively reduces the number of sstables merged when a lower level is compacted into a higher level. The disadvantages of this strategy are: 1. Compacting high-level sstables first does reduce the number of sstables involved in a compaction, but it lets sstables accumulate in the bottom level L0, so that a single point manages too many sstables and read and write performance suffers. 2. Until the merging of several large files has completed, the data being merged remains on disk; in the extreme case twice the disk space is needed, and disk space may run out.
(3) Time Window Compaction Strategy
This strategy merges on the basis of a time window: within the window, data is merged according to strategy 1, and files that fall outside the window are no longer merged, which essentially means that merging is abandoned when it cannot be completed in time. The disadvantages of this strategy are: 1. Because strategy 1 is used, it has the same disadvantages as strategy 1. 2. When data is ingested frequently, for example once every 5 minutes, small files increase rapidly. With a larger time window (1 day), the advantage of the window strategy is lost and the same heavy merge IO pressure arises as in strategy 1, which seriously affects data ingestion and makes the service unavailable. With a smaller time window (1 hour), writing is given priority, but files not merged within the window are excluded from merging, so the number of files grows rapidly and the number of data files per day still approaches 100,000; when data is retained for a long time, the number of open files eventually exceeds the limit the database process can handle and the database process stops.
In view of this, overcoming the defects of the existing file merging strategies is a problem to be solved in this technical field.
[ Summary of the Invention ]
In view of the above defects or improvement needs of the prior art, the invention addresses the heavy disk read-write load, large disk space usage and high merge IO pressure that arise when existing file merging strategies trigger merging according to the number of files, the merge level or a time window.
The embodiment of the invention adopts the following technical scheme:
In a first aspect, the invention provides a method for merging files based on a Cassandra database, specifically: receiving the data files generated by the database and generating a merge file list for each disk; the merging process of each disk obtains the merge file list of its disk and the sizes of the data files to be merged in that list; starting a parallel merging process of the database, which calculates the sum of the data file sizes obtained by the per-disk merging processes; when the sum of the data file sizes reaches a merge file threshold, the parallel merging process merges the data files to be merged on all disks in a single pass.
Preferably, if the total size of the files to be merged does not reach the merge file threshold, no merge is performed, and the method waits to receive the data files generated by the database next time.
Preferably, it is judged whether the size of each data file is smaller than a file data amount threshold; if it is smaller than the file data amount threshold, a first merge file threshold is used as the merge file threshold; if it is larger than the file data amount threshold, a second merge file threshold is used as the merge file threshold; wherein the first merge file threshold is greater than the second merge file threshold.
Preferably, the system's own merge policies are disabled before the generated data files are received.
Preferably, if the size of the data files to be merged exceeds the file size threshold, the data files to be merged are merged into at least one file according to the file size threshold, and the remainder smaller than the file size threshold is not merged.
Preferably, the directories to be merged are designated, and the merging process is started to merge files only for those directories.
Preferably, after all the data files to be merged have been merged in a single pass, the method further comprises: the parallel merging process measures the merge time; judging whether the merge time exceeds a merge time threshold; and raising an alarm for any disk whose merge time exceeds the merge time threshold.
Preferably, if the merge time exceeds the merge time threshold, the database process is automatically restarted.
Preferably, the source files of the data files to be merged are marked as deleted, and after all the data files to be merged have been merged in a single pass, the data files marked as deleted are deleted.
In a second aspect, the invention provides a device for the file merging method based on a Cassandra database, specifically comprising: at least one processor and a memory connected via a data bus, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the processor, carry out the Cassandra-based file merging method of the first aspect.
Compared with the prior art, the embodiments of the invention have the following beneficial effects: by merging dynamically and using the merge file threshold as the trigger for merging, the generated small files are merged once, promptly, as soon as their total size reaches the merge file threshold, which provides a fast and stable method for merging small files. In the preferred schemes, adjusting the merge file threshold yields the best combination of post-merge file count and write performance; merge time detection and alarms report abnormalities in the merging process promptly; and a garbage collection mechanism optimizes the file storage structure and IO efficiency.
[ Description of the drawings ]
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below. It is evident that the drawings described below are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flowchart of a method for file merging based on cassandra databases according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for file merging based on cassandra databases according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for file merging based on cassandra databases according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for file merging based on cassandra databases according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a file merging device based on cassandra database according to an embodiment of the present invention.
[ Detailed Description of the Invention ]
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The present invention is an architecture of a specific functional system, so that in a specific embodiment, functional logic relationships of each structural module are mainly described, and specific software and hardware implementations are not limited.
In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other. The invention will be described in detail below with reference to the drawings and examples.
Example 1:
In a Cassandra database, when a client writes data, the client program determines the server node to which the data should be sent according to the token ranges of the cluster; the server side receives the data in parallel with multiple threads, and each thread sorts the data it receives and generates data files smaller than 10 MB. In some implementation scenarios, when a single Cassandra machine processes 4 TB of data per day, the number of data files the database process must open can exceed 200,000, and when 7 days of data must be retained the number reaches 1.4 million, so the data files must be merged to reduce their number. This embodiment therefore provides a new dynamic small-file merging method that avoids the defects of the existing file merging strategies.
Cassandra is a distributed NoSQL database that adopts the Log-Structured Merge-Tree (abbreviated as LSM Tree) architecture. The LSM tree is a layered, ordered, disk-oriented data structure that exploits the fact that sequential batch writes to disk perform far better than random writes, and it reads and writes in layered batches. To achieve batch reading and writing of files, the LSM tree has two important processes: sequentially flushing data to disk to generate data files (sstables), and merging data files (Compaction). The Cassandra database takes in the data written by the user and caches it in memory. When the cache is full, the data in the cache is flushed to disk and sstable files are generated. On the one hand, the write cache allows data to be written in batches, which exploits the better sequential write performance of the disk. On the other hand, the data in the cache is sorted before being written to disk, so it can be found quickly by binary search, which improves lookup efficiency.
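The write path described above can be illustrated with a minimal Python sketch. It is not Cassandra's implementation: the `MemTable` class, its `capacity` parameter and the `lookup` helper are hypothetical names, and an sstable is modeled simply as an in-memory sorted list of key-value pairs.

```python
import bisect
from typing import Dict, List, Optional, Tuple

SSTable = List[Tuple[str, str]]  # hypothetical model: sorted (key, value) pairs


class MemTable:
    """In-memory write cache that is flushed as one sorted batch when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # assumed limit on cached entries
        self.entries: Dict[str, str] = {}

    def put(self, key: str, value: str) -> Optional[SSTable]:
        """Cache a write; return a sorted sstable if the cache became full."""
        self.entries[key] = value
        if len(self.entries) >= self.capacity:
            flushed = sorted(self.entries.items())  # sort before writing out
            self.entries = {}
            return flushed
        return None


def lookup(sstable: SSTable, key: str) -> Optional[str]:
    """Binary search over the sorted keys of one sstable."""
    keys = [k for k, _ in sstable]  # rebuilt here only for brevity
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sstable[i][1]
    return None
```

Every flush produces one more sorted file, which is why the number of small sstables grows with the write volume and a merge step is needed.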
As shown in fig. 1, the method for merging files based on cassandra database according to the embodiment of the present invention specifically includes the following steps:
Step 101: Receive the data files generated by the database, and generate a merge file list for each disk.
The Cassandra database uses a distributed storage engine to store massive structured data in a distributed manner across a large number of ordinary commodity servers, so as to avoid data loss or service anomalies caused by single points of failure. Each time a client writes data, the process of sequentially flushing data to disk to generate data files is started: the Cassandra client program obtains the token range information of the cluster and determines from the token ranges the server node to which the data should be sent; the server receives the data in parallel, and each thread sorts the data it receives and generates a large number of sstables to be written to disk. Before files are merged, the files to be merged on each disk of each node in the distributed storage must be identified and a merge file list generated for each disk; the files are then fetched according to this list during the subsequent merge. In a specific embodiment, the files to be merged may be index files, data files, bloom filter files, and so on.
In the existing LSM architecture, once data has been sequentially written to disk to generate data files, merging of the data files under the existing merge strategy starts automatically. So that merging in the Cassandra database is performed only by the file merging method of this embodiment, all of the system's own merge policies must be disabled before the process of sequentially flushing data to disk is started, that is, before the generated data files are received, in order to avoid conflicts between merge strategies. Specifically, all of the system's own merge policies can be disabled by specifying the parameter compaction = {'enabled': 'false'} when the database table is created.
In some embodiments, only files under some directories need to be merged, or the merged files are stored under specified directories. Therefore, before generating the merged file list of each disk, the directories to be merged can be designated according to the actual needs of the implementation scene, and the merging process is started only for the directories to be merged to merge files, so that the number of files to be processed is reduced, and the file processing efficiency is improved.
Step 102: the merging process of each disk obtains a merging file list of the corresponding disk, and obtains the size of a data file to be merged in the merging file list of each disk.
Because the Cassandra database organizes its stored files as a distributed file system (DFS for short), the physical storage resources used by a client or server are not necessarily attached directly to the local node; they may be connected to the node over a computer network, or form a complete hierarchical file system assembled from several different logical disk partitions or volumes. The DFS presents resources located anywhere on the network as a logical tree-structured file system, which makes it easy for users to access shared files distributed over the network. To merge files that reside on different nodes, the merge file directories of the different disks in each node must be read. In this embodiment, a merging process is started for each disk; it reads the merge file list of that disk and obtains the sizes of the data files to be merged for use in the subsequent steps.
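As a rough illustration of Step 102, the sketch below lists candidate files and their sizes for each disk. It assumes, purely for illustration, that each disk is represented by one local directory and that candidate files are recognized by a name suffix; the function name `build_merge_lists` and the default suffix are hypothetical and not taken from the patent.

```python
import os
from typing import Dict, List, Tuple


def build_merge_lists(disk_dirs: List[str],
                      suffix: str = "-Data.db") -> Dict[str, List[Tuple[str, int]]]:
    """Return, per disk directory, the (path, size in bytes) of candidate files."""
    merge_lists: Dict[str, List[Tuple[str, int]]] = {}
    for disk in disk_dirs:
        files = []
        for entry in os.scandir(disk):  # top level only, no recursion
            if entry.is_file() and entry.name.endswith(suffix):
                files.append((entry.path, entry.stat().st_size))
        merge_lists[disk] = sorted(files)
    return merge_lists
```

The per-disk lists returned here are the input that a central process would aggregate in the next step.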
Step 103: Start a parallel merging process of the database, and calculate the sum of the data file sizes obtained by the merging process of each disk.
To manage and merge the files to be merged on all disks of the cluster in a unified way, a single parallel merging process is started in the database system. The parallel merging process scans the merging process of each disk, aggregates the data file sizes that each has obtained, and calculates their sum, which serves as the basis for deciding whether to start merging the data files. In the actual usage scenario of this embodiment, the data files on each disk change only after a client writes data and data files are generated, so the sum of the data file sizes obtained by the per-disk merging processes needs to be calculated, and the need for merging judged, only after each client write that generates data files.
Step 104: When the sum of the data file sizes reaches the merge file threshold, the parallel merging process merges the data files to be merged on all disks in a single pass.
To reduce the number of merges and strictly ensure that each piece of data is merged only once, the size of the merged target file, that is, the merge file threshold, must be specified. Each time a client writes data and data files are generated, the parallel merging process of the database calculates the sum of the data file sizes obtained by the per-disk merging processes and dynamically judges whether the files on the disks have reached the merge file threshold. As soon as the size of the files to be merged reaches the merge file threshold, the merging of the data files corresponding to that write is started immediately, the generated data files are merged, and the merged files are moved to the next level. If, on the other hand, the total size of the files to be merged does not reach the merge file threshold, no merge is performed this time; the method waits to receive the data files generated by the database next time, calculates whether the sum of the sizes of the data files generated in the two rounds reaches the merge file threshold, and merges if it does. In some implementations the amount of data written by the client each time is small, and the sum of the data file sizes reaches the merge file threshold only after several writes.
Through steps 101-104, the sstables generated on each disk are scanned each time data written by a client is received, and the total file size is compared with the merge file threshold; the received data is merged only when the threshold is reached and is otherwise left unmerged, which avoids repeated rolling merges and reduces the read-write load on the disk.
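The trigger logic of steps 101-104 can be sketched as follows. This is only an illustrative model under stated assumptions: the class name `ParallelMerger` and its methods are hypothetical, sizes are plain integers in an arbitrary unit, and the actual merge is represented by clearing the pending lists.

```python
from typing import Dict, List


class ParallelMerger:
    """Accumulates per-disk file sizes and triggers one merge at the threshold."""

    def __init__(self, merge_threshold: int) -> None:
        self.merge_threshold = merge_threshold
        self.pending: Dict[str, List[int]] = {}  # disk -> sizes awaiting merge

    def total_pending(self) -> int:
        """Sum of the sizes reported for all disks (Step 103)."""
        return sum(sum(sizes) for sizes in self.pending.values())

    def on_files_generated(self, disk: str, sizes: List[int]) -> bool:
        """Record new files; merge everything once if the threshold is reached."""
        self.pending.setdefault(disk, []).extend(sizes)
        if self.total_pending() < self.merge_threshold:
            return False  # wait for the next batch of data files
        self.pending.clear()  # stands in for the actual one-pass merge
        return True
```

For example, with a threshold of 100, a first batch of two files of size 30 does not trigger a merge, while a following batch of one file of size 50 brings the total to 110 and triggers a single merge of all three files.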
In a specific implementation manner of this embodiment, in order to ensure that the size of the merged file reaches the merged file threshold, as shown in fig. 2, the process of merging the data files may be implemented by the following steps:
Step 201: Receive the data files generated by the database, and generate a merge file list for each disk.
Step 202: The merging process of each disk obtains the merge file list of its disk and the sizes of the data files to be merged in that list.
Step 203: Start a parallel merging process of the database, and calculate the sum of the data file sizes obtained by the merging process of each disk.
Step 204: Judge whether the sum of the sizes of the data files to be merged is smaller than the merge file threshold.
Step 205: If the sum of the sizes of the data files to be merged is smaller than the merge file threshold, wait to receive the data files generated by the database next time.
Step 206: Otherwise, the parallel merging process merges the data files to be merged on all disks in a single pass.
Steps 201 to 203 correspond to steps 101 to 103, and step 206 corresponds to step 104. Through steps 201-206, the sum of the sizes of the data files to be merged is compared with the merge file threshold to decide dynamically whether to start merging the data files. This reduces the number of merges and guarantees that every merged file reaches the required size, which avoids secondary merges caused by merged data files that are too small.
Further, if the amount of data written by the client each time is not an integer multiple of the file size threshold, merging the generated sstables according to the file size threshold leaves a file smaller than the file size threshold, which would then need a secondary merge. Therefore, if the size of the data files to be merged exceeds the file size threshold, the data files to be merged are merged into at least one file according to the file size threshold, and the remainder smaller than the file size threshold is not merged. In a specific implementation scenario, the sum of the sizes of the data files to be merged obtained by the per-disk merging processes is 2.6 GB and the file size threshold is 1 GB. The parallel merging process then merges the data files to be merged into two 1 GB files; the remaining 0.6 GB is not merged this time and is merged after the data files generated by the database next time have been received. This partial merging ensures that every merged file is no smaller than the file size threshold, and it also reduces the amount of unmerged data retained on disk, which saves disk space.
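The partial merge in this scenario can be sketched as a simple greedy grouping. The function `plan_partial_merge` is a hypothetical name, and the grouping rule (close a group as soon as it reaches the threshold) is an assumption made for illustration; the patent does not specify how files are assigned to output files.

```python
from typing import List, Tuple


def plan_partial_merge(sizes_mb: List[int],
                       threshold_mb: int) -> Tuple[List[List[int]], List[int]]:
    """Group files into merge batches of at least threshold_mb each.

    Returns (groups to merge now, leftover files kept for the next round).
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sizes_mb:
        current.append(size)
        current_size += size
        if current_size >= threshold_mb:  # this group is large enough
            groups.append(current)
            current, current_size = [], 0
    return groups, current  # leftover is smaller than the threshold


# 26 files of 100 MB (2.6 GB in round numbers) with a 1000 MB threshold:
groups, leftover = plan_partial_merge([100] * 26, 1000)
print(len(groups), sum(leftover))  # 2 600
```

With unequal file sizes a group can slightly exceed the threshold, which still satisfies the requirement that no merged file is smaller than the threshold.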
In order to adapt to different practical use scenarios, an optimal balance between the number of files and the writing performance can be achieved by adjusting the merge file threshold. As shown in fig. 3, the specific steps are as follows:
Step 301: Determine whether the size of each data file is smaller than a file data amount threshold.
Step 302: If the size is smaller than the file data amount threshold, the first merge file threshold is used as the merge file threshold.
Step 303: If the size is larger than the file data amount threshold, the second merge file threshold is used as the merge file threshold, wherein the first merge file threshold is larger than the second merge file threshold.
Through steps 301-303, the merge file threshold is adjusted according to the file data amount threshold in order to balance the number of files against write performance. When the volume of business file data received is smaller than the file data amount threshold, the larger first merge file threshold is used so that more small files are merged into one large file and the number of merged files is reduced further; when the volume of business data received is larger than the file data amount threshold, the smaller second merge file threshold is used, and although the number of files increases, write performance is given priority, disk read-write performance stays stable, and the database service runs stably. In a specific implementation of this embodiment, each data server processes 4 TB of raw data per day, which is 1.3 TB after compression. With the merge file threshold set to 1 GB, the number of merged files per day = 1.3 TB × 1024 GB/TB ÷ 1 GB ≈ 1331; with business data files retained for 15 days, the total number of files = 15 × 1331 = 19965. The number of merged files is well within a controllable range, merging completes stably within the specified time, and the database runs stably.
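The threshold selection of steps 301-303 and the file-count arithmetic above can be checked with a short sketch. The function names are hypothetical, and treating a size exactly equal to the file data amount threshold like the "larger" case is an assumption, since the text leaves that case open.

```python
import math


def choose_merge_threshold(size: float, data_amount_threshold: float,
                           first_threshold: float,
                           second_threshold: float) -> float:
    """Pick the larger first threshold for small inputs, else the second."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    if size < data_amount_threshold:
        return first_threshold
    return second_threshold  # assumption: equality is handled like 'larger'


def merged_files_per_day(compressed_tb: float, threshold_gb: float) -> int:
    """Whole merged files produced per day at the given merge file threshold."""
    return math.floor(compressed_tb * 1024 / threshold_gb)


per_day = merged_files_per_day(1.3, 1.0)
print(per_day, 15 * per_day)  # 1331 19965
```

Halving the merge file threshold in this model doubles the daily file count, which is the trade-off between file count and write performance that the two thresholds are meant to balance.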
In order to further ensure the running stability of the database and avoid the performance problems of excessive memory occupation, increased inquiry time delay and the like caused by file merging abnormality, the small file merging method in the embodiment also comprises a merging performance detection and alarm mechanism.
In a specific implementation scenario of the present embodiment, as shown in fig. 4, the steps of performing the combination performance detection and alarm are as follows:
Step 401: The parallel merging process measures the merge time.
Step 402: Judge whether the merge time exceeds a merge time threshold.
Step 403: Raise an alarm for any disk whose merge time exceeds the merge time threshold.
In actual use, when system performance is stable the time taken to merge files is also stable; if the actual merge time is far longer than the theoretical merge time or the historical average merge time, the merging process is abnormal and the abnormality must be handled. In a specific implementation scenario, the merge time threshold may be determined by estimating the theoretical merge time or by calculating the average of historical merge times.
To handle an abnormal merging process promptly, when the parallel merging process finds that the merge time of some merging process exceeds the merge time threshold, that is, that the merging process is abnormal, it must raise an alarm for the abnormality found. In a specific usage scenario, the alarm information may be displayed through the management interface of the Cassandra database. To avoid waiting on other functions of the Cassandra database, so that alarm information is displayed promptly, a dedicated monitoring process may instead be used to display the merge status and alarm information. So that administrators and users of the database can obtain more detailed information about the merge abnormality, the details of the abnormal merging thread can be output through an exception log or similar means, providing specific abnormality data for the administrator or user to act on.
To increase the degree of automation of exception handling, merging threads in which an abnormality occurred can be handled automatically by a preset exception handling program. Depending on the needs of the specific implementation scenario, targeted handling can be performed according to the specific abnormality data, or the database process can simply be restarted, which avoids performance problems such as excessive memory usage and increased query latency caused by merge abnormalities and improves the running stability of the database.
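The detection in steps 401-403 can be sketched as below, assuming the merge time threshold is derived from the historical average. The multiplier `factor` and both function names are hypothetical; the patent states only that the threshold may be based on a theoretical estimate or on the historical average.

```python
from statistics import mean
from typing import Dict, List


def merge_time_threshold(history_seconds: List[float],
                         factor: float = 2.0) -> float:
    """Threshold = historical average merge time times an assumed factor."""
    return mean(history_seconds) * factor


def disks_to_alarm(merge_seconds_by_disk: Dict[str, float],
                   threshold: float) -> List[str]:
    """Return the disks whose measured merge time exceeds the threshold."""
    return sorted(disk for disk, seconds in merge_seconds_by_disk.items()
                  if seconds > threshold)


limit = merge_time_threshold([100, 120, 80])  # average 100 s, limit 200.0 s
print(disks_to_alarm({"disk1": 150, "disk2": 450}, limit))  # ['disk2']
```

The returned list could feed a monitoring display, an exception log entry, or an automatic restart, depending on which of the handling options described above is chosen.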
In the storage management mechanism of cassandra, a garbage collection mechanism is used to reduce the number of disk IO operations: data to be deleted is not deleted immediately but only marked as deleted while remaining on disk, and the data marked as deleted is physically removed in one pass when the merge is performed. The small file merging method provided in this embodiment likewise uses a garbage collection mechanism. During merging, the source files of the data files to be merged are marked as deleted, and after all the data files to be merged have been merged at one time, the files marked as deleted are deleted. This avoids the repeated disk IO that deleting a large number of files individually would cause, reduces disk load, and improves merging efficiency.
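The mark-then-delete behaviour described above can be sketched as follows. This is an illustrative assumption of how such bookkeeping might look, not the implementation of the embodiment or of cassandra itself; the class name `DeferredDeleter` and its methods are hypothetical.

```python
import os


class DeferredDeleter:
    """Collects source files marked as deleted and removes them in one pass."""

    def __init__(self):
        self._marked = []

    def mark(self, path):
        # Marking only records the path; the file stays on disk for now.
        self._marked.append(path)

    def sweep(self):
        """Delete every marked file after the merge has finished.

        Returns the number of files actually removed.
        """
        removed = 0
        for path in self._marked:
            try:
                os.remove(path)
                removed += 1
            except FileNotFoundError:
                # The file is already gone, so there is nothing to do.
                pass
        self._marked.clear()
        return removed
```

In use, `mark` would be called for each source file as it is merged, and `sweep` would be called once after all the data files to be merged have been merged.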
The cassandra database-based file merging method provided in this embodiment dynamically monitors the sizes of the received files and merges immediately once the total size of the files to be merged reaches the merge size threshold. Small files are merged promptly while using fewer merge levels and fewer temporary files, which reduces the number of merges, the disk space occupied by the files, the number of disk IO operations, and disk IO contention, and improves both file merging performance and the read-write stability of the database.
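The threshold-triggered merge summarized above might be planned as in the following sketch. It assumes, in line with claim 1, that the pending files of all disks are summed, that nothing is merged below the merge threshold, and that files are grouped into outputs of at least the file size threshold, with any smaller remainder left unmerged. The function name, the parameter names and the simple in-order grouping are assumptions made for illustration.

```python
def plan_merge(files_by_disk, merge_threshold, file_size_threshold):
    """Decide which pending files to merge in this round.

    `files_by_disk` maps a disk name to a list of (file_name, size) pairs.
    Returns (groups, leftover): each group becomes one merged file, and
    `leftover` holds files that wait for the next round.
    """
    pending = [f for files in files_by_disk.values() for f in files]
    total = sum(size for _, size in pending)
    if total < merge_threshold:
        # Threshold not reached: do not merge, wait for more data files.
        return [], pending

    groups, current, current_size = [], [], 0
    for name, size in pending:
        current.append((name, size))
        current_size += size
        if current_size >= file_size_threshold:
            groups.append(current)
            current, current_size = [], 0
    # Whatever remains is smaller than the file size threshold and is
    # deliberately left unmerged.
    return groups, current
```

The sketch only produces a plan; reading and writing the sstable contents is outside its scope.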
Compared with the Size Tiered Compaction Strategy, in the small file merging method provided in this embodiment every merged file must reach the required size, so the problem of merged files still being very small does not occur and repeated rolling merges are not needed. This reduces disk IO contention and avoids the added performance jitter and disk read-write load that such repeated merges would cause.
Compared with the Leveled Compaction Strategy, the small file merging method provided in this embodiment treats files of different levels equally, which avoids the accumulation of sstables at the bottom L0 level. The number of sstables that a single node needs to manage stays stable, sstables that meet the merge criteria are merged promptly, and excessive disk space is not occupied.
Compared with the Time Window Compaction Strategy, the small file merging method provided in this embodiment processes the received data files in sequence, so merging is never abandoned because a time window has ended; all files that need merging are processed, and rapid growth in the number of files is avoided. Further, abnormal merging processes are handled through the anomaly detection and alarm mechanism, which ensures system stability; the number of merged files is optimized by adjusting the merge file threshold, which improves merging performance; and the garbage collection mechanism further reduces the number of disk IO operations and improves merging efficiency.
Example 2:
On the basis of the cassandra database-based file merging method provided in embodiment 1 above, the present invention further provides a cassandra database-based file merging device that can be used to implement the method. Fig. 5 is a schematic diagram of the device architecture in an embodiment of the present invention. The cassandra database-based file merging device of this embodiment includes one or more processors 21 and a memory 22. In fig. 5, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or in another manner; connection by a bus is taken as an example in fig. 5.
The memory 22, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as those for the cassandra database-based file merging method in embodiment 1. The processor 21 executes the various functional applications and data processing of the cassandra database-based file merging device, that is, implements the cassandra database-based file merging method of embodiment 1, by running the non-volatile software programs, instructions, and modules stored in the memory 22.
The memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 22 may optionally include memory located remotely from processor 21, which may be connected to processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Program instructions/modules are stored in the memory 22 and, when executed by the one or more processors 21, perform the cassandra database-based file merging method of embodiment 1 above, for example the steps shown in figs. 1-4 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the embodiments may be implemented by a program that instructs the associated hardware. The program may be stored on a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (7)

1. A method for merging files based on a cassandra database, characterized by comprising:
receiving data files generated by the database, and generating a merge file list for each disk;
obtaining, by the merging process of each disk, the merge file list of the corresponding disk, and obtaining the sizes of the data files to be merged in the merge file list of each disk;
starting a unique parallel merging process of the database, and calculating the sum of the sizes of the data files obtained by the merging processes of the disks;
if the sum of the sizes of the files to be merged does not reach the file merging threshold, not merging this time, and waiting to receive the data files next generated by the database;
when the sum of the sizes of the data files reaches the file merging threshold, merging, by the parallel merging process, the data files to be merged on all the disks at one time, wherein the data files to be merged are merged into at least one file according to a file size threshold, and the remaining part smaller than the file size threshold is not merged;
judging whether the size of each data file is smaller than a file data amount threshold;
if the data amount of the file is smaller than the file data amount threshold, using a first merge file threshold as the file merging threshold;
if the data amount of the file is larger than the file data amount threshold, using a second merge file threshold as the file merging threshold;
wherein the first merge file threshold is greater than the second merge file threshold.
2. The method for merging files based on a cassandra database as claimed in claim 1, characterized in that the merging strategy of the system is disabled before the generated data files are received.
3. The method for merging files based on a cassandra database as claimed in claim 1, characterized in that the directories to be merged are designated, and merging processes for file merging are started only for the designated directories.
4. The method for merging files based on a cassandra database as claimed in claim 1, further comprising, after merging all the data files to be merged at one time:
counting the merge time by the parallel merging process;
judging whether the merge time exceeds a merge time threshold;
and raising an alarm for any disk whose merge time exceeds the merge time threshold.
5. The method for merging files based on a cassandra database as claimed in claim 1, characterized in that the database process is automatically restarted if the merge time exceeds the merge time threshold.
6. The method for merging files based on a cassandra database as claimed in claim 1, characterized in that the source files of the data files to be merged are marked as deleted, and the data files marked as deleted are deleted after all the data files to be merged have been merged at one time.
7. A cassandra database-based file merging device, characterized by:
comprising at least one processor and a memory connected by a data bus, the memory storing instructions executable by the at least one processor, the instructions, when executed by the processor, being used to perform the cassandra database-based file merging method of any one of claims 1-6.
CN202010576064.1A 2020-06-22 2020-06-22 Method and device for merging files based on cassandra database Active CN111881092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010576064.1A CN111881092B (en) 2020-06-22 2020-06-22 Method and device for merging files based on cassandra database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010576064.1A CN111881092B (en) 2020-06-22 2020-06-22 Method and device for merging files based on cassandra database

Publications (2)

Publication Number Publication Date
CN111881092A CN111881092A (en) 2020-11-03
CN111881092B true CN111881092B (en) 2024-07-09

Family

ID=73156949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010576064.1A Active CN111881092B (en) 2020-06-22 2020-06-22 Method and device for merging files based on cassandra database

Country Status (1)

Country Link
CN (1) CN111881092B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant