CN117908794A - Garbage recycling method in computer distributed storage system


Publication number
CN117908794A
Authority
CN
China
Prior art keywords
data
garbage
application file
storage system
file
Prior art date
Legal status
Pending
Application number
CN202410085602.5A
Other languages
Chinese (zh)
Inventor
熊健辉
瞿国庆
顾林强
顾俐俐
金飞
Current Assignee
Nantong Zhida Information Technology Co ltd
Jiangsu Vocational College of Business
Original Assignee
Nantong Zhida Information Technology Co ltd
Jiangsu Vocational College of Business

Abstract

The invention discloses a garbage recycling method in a computer distributed storage system and relates to the field of garbage recycling. The method comprises the following steps: step one: acquiring the data structure and actual data size of each application file; step two: monitoring the running status of all application files in real time; step three: scanning the application files in the idle state in real time; step four: judging whether to recycle garbage; step five: recycling the junk files according to the judging result. The invention solves the problem that existing garbage data recovery may cause an application file to stall or be suspended while it is in use, which reduces the stability of the distributed storage system.

Description

Garbage recycling method in computer distributed storage system
Technical Field
The invention relates to the technical field of garbage recovery, in particular to a garbage recovery method in a computer distributed storage system.
Background
Current memory pool management in distributed storage systems is mainly limited to self-management. Taking a data read as an example, the system receives the data from the underlying storage device and places it in its own memory pool according to its own data organization. To send the data from its own memory pool to the network card memory pool and on to the remote end through RDMA, a memory copy is used.
The Chinese patent with publication number CN114442949A discloses a garbage data recovery method and system, wherein the method comprises the steps of obtaining an object and the actual data size of the object; traversing the data structure blobs of the objects, and calculating the actual disk space of the distributed storage system occupied by all the data structure blobs; when the actual data size and the occupied actual disk space of the distributed storage system meet the preset recovery threshold, creating a new data structure blob, rewriting the object into the allocated new disk space, and releasing the original data and the storage space thereof after the object is written. The object with the garbage data can be effectively identified and processed, the space occupied by the garbage data is released, and the space amplification of a storage system is reduced. And because a single thread is used and starting conditions are set, the influence of garbage data recovery on normal business can be reduced.
In practical use of the above patent, garbage data recovery may cause application files in the distributed storage system to stall or be suspended while they are in use, which affects normal use and reduces the stability of the distributed storage system. The existing approach therefore does not meet current requirements, and a garbage recycling method in a computer distributed storage system is proposed.
Disclosure of Invention
The invention aims to provide a garbage recycling method in a computer distributed storage system. Data scanning and garbage recycling are performed only on application files in the idle state, which ensures the normal operation of application files and improves the stability of the distributed storage system. Logs are recorded during garbage recycling, so that when garbage remains, the application files with residual garbage can be located quickly and recycled a second time in good time. This ensures the completeness of garbage recycling, preserves the storage space of the distributed storage system, and solves the problems in the prior art.
In order to achieve the above purpose, the present invention provides the following technical solutions: the garbage recycling method in the computer distributed storage system comprises the following steps:
step one: the method comprises the steps of obtaining a data structure of an application file in a distributed storage system and the actual data size of the application file, calculating the actual disk space of the distributed storage system occupied by all data structures, and presetting a recovery threshold of garbage generated in the distributed storage system.
Step two: the monitoring system monitors the running conditions of all application files in the distributed storage system in real time and judges whether the application files are in an idle state or not.
Step three: the scanning system scans application files in an idle state in the distributed storage system in real time to acquire junk data of each application file, sorts and marks the generated junk files according to the sizes from small to large, and judges whether the application files comprise junk files or not.
Step four: the recycling system receives the cleaned junk files, judges whether to recycle the junk, acquires the read-write quantity of the junk files to be recycled, and judges the junk files according to the read-write quantity.
Step five: and formulating a corresponding recycling strategy for the junk file according to the judging result, and recycling the junk file according to the recycling strategy.
Preferably, the scanning system specifically includes:
The information acquisition module is used for acquiring information of the distributed storage system, namely all application files in the distributed storage system and the actual data size and data structure of each application file.
The information scanning module is used for scanning all application files in the distributed storage system and obtaining the garbage of each application file.
The information marking module is used for sorting the names of the application files in descending order of the size of the garbage generated and marking the garbage in that order.
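The sorting and marking performed by the information marking module can be sketched as follows. This is a minimal illustration rather than the patent's implementation; the `ScanResult` record, its field names, and the use of a rank number as the mark are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    """Hypothetical record produced by the scan for one application file."""
    name: str            # application file name
    garbage_bytes: int   # size of the garbage found in the file
    rank: int = 0        # mark assigned after sorting (1 = most garbage)


def rank_by_garbage(results):
    """Sort scan results by garbage size, largest first, and mark each with its rank."""
    ordered = sorted(results, key=lambda r: r.garbage_bytes, reverse=True)
    for position, result in enumerate(ordered, start=1):
        result.rank = position
    return ordered
```

Files at the top of the returned list are the ones where recycling frees the most space.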
Preferably, the recovery system specifically includes:
The calculation module is used for calculating, according to the memory occupation information, the actually used memory value of the application files currently running in the distributed storage system.
The judging and recycling module is used for judging, according to the acquired information, whether the actually used memory value is larger than a preset threshold value, and for raising an alarm if it is.
The garbage recycling module is used for receiving the alarm information and executing a system garbage recycling command according to the judging result.
Preferably, the garbage recycling module specifically includes:
The execution module is used for executing the system garbage recycling command according to the judging result and creating a new data structure.
The deleting module is used for rewriting the application file into the newly allocated disk space and, after the application file has been written, releasing the original data and its storage space.
The analysis module is used for recording the log corresponding to the garbage recycling and analyzing, from the recycling status, whether the garbage has been fully recycled.
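The rewrite-then-release sequence of the execution, deleting, and analysis modules could look like the sketch below. The block list, the `is_live` predicate, and the log record layout are hypothetical; the only point illustrated is the ordering: the new structure is fully written before the original is released, and the outcome is logged.

```python
def rewrite_and_release(blocks, is_live, log):
    """Copy live blocks into a new structure, then release the original."""
    # Execution module: create the new data structure and write the live data into it.
    new_blocks = [block for block in blocks if is_live(block)]
    released = len(blocks) - len(new_blocks)
    # Deleting module: release the original data only after the rewrite has finished.
    blocks.clear()
    # Analysis module: record the outcome so residual garbage can be traced later.
    log.append({"kept": len(new_blocks), "released": released})
    return new_blocks
```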
Preferably, the monitoring system specifically includes:
The real-time monitoring module is used for monitoring the running status of each application file in the distributed storage system.
The real-time judging module is used for judging, from the monitored running status, whether each application file is in the idle state and, if so, starting the garbage data scan.
Preferably, the scanning system scans application files in an idle state in the distributed storage system in real time to obtain garbage data of each application file, including:
Acquiring an application file in an idle state, extracting a file identifier of the application file, and generating a data access request based on the file identifier;
Accessing an operation parameter library of the application file according to the data access request based on the application program interface to obtain a script file and file execution parameters of the application file in an idle state, and analyzing the script file and the file execution parameters to obtain service characteristics of the application file in the idle state;
determining basic parameters of internal data accessed when the application file in the idle state normally operates based on the service characteristics, and determining measurement indexes of the junk data based on the basic parameters;
Splitting the application data in the idle application file into groups of equal size to obtain data groups to be scanned, and extracting, for each piece of data to be scanned in each group, its historical access parameters within a preset time period;
Determining a reference count value of each piece of data to be scanned within the preset time period based on the historical access parameters, and performing a first measurement on the reference count value based on the measurement index to obtain first garbage data; determining a reachability parameter of each piece of data to be scanned based on the historical access parameters, and performing a second measurement on the reachability parameter based on the measurement index to obtain second garbage data;
determining garbage data grades of the first garbage data and the second garbage data based on the measurement indexes, and performing difference marking on the first garbage data and the second garbage data by adopting a difference marker based on the garbage data grades;
and generating a junk data attribute list based on the difference marking result, and recording and storing the junk data attribute list in the file header of the application file in the idle state.
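The two measurements described above resemble reference counting and reachability analysis. The sketch below illustrates that reading under stated assumptions: the reference graph (`edges`), the root set, the `min_refs` cutoff, and the marker strings are all hypothetical, and the text does not say how the measurement index is actually derived.

```python
from collections import deque


def split_equal(items, group_size):
    """Split data into groups of equal size (the last group may be shorter)."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


def find_garbage(ref_counts, edges, roots, min_refs=1):
    """Return a marker per garbage item: 'refcount', 'unreachable', or 'both'."""
    # First measurement: items referenced fewer than min_refs times in the window.
    first = {item for item, count in ref_counts.items() if count < min_refs}

    # Second measurement: items not reachable from any root (breadth-first search).
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    second = set(ref_counts) - seen

    # Difference marking: a distinct marker for each kind of finding.
    marks = {}
    for item in first | second:
        if item in first and item in second:
            marks[item] = "both"
        elif item in first:
            marks[item] = "refcount"
        else:
            marks[item] = "unreachable"
    return marks
```

Using both measurements matters because an item can keep a non-zero reference count while still being unreachable, for example when it is referenced only by other unreachable items.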
Preferably, whether a junk file is hot-spot data is judged according to its read-write quantity.
If the read-write quantity exceeds the set threshold, the junk file is recorded as hot data and recovered to a first data area, and the data in that data area is deleted collectively.
If the set threshold is not exceeded, the junk file is recorded as cold data and recovered to a second data area for storage.
The junk files in the second data area are monitored in real time and, when the set threshold is exceeded, transferred to the first data area and deleted.
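The hot and cold routing described above can be sketched as follows. The function names, the dictionary of read-write counts, and the representation of the two data areas as lists are assumptions made for illustration.

```python
def route_junk(io_counts, threshold):
    """Split junk files into the first (hot, to delete) and second (cold, to hold) areas."""
    first_area = [name for name, count in io_counts.items() if count > threshold]
    second_area = [name for name, count in io_counts.items() if count <= threshold]
    return first_area, second_area


def promote_and_delete(second_area, io_counts, threshold):
    """Re-check cold junk; return (files moved to the first area and deleted, files still held)."""
    deleted = [name for name in second_area if io_counts[name] > threshold]
    remaining = [name for name in second_area if io_counts[name] <= threshold]
    return deleted, remaining
```

Calling `promote_and_delete` periodically with fresh counts plays the role of the real-time monitoring of the second data area.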
Preferably, the step of making a corresponding recycling policy for the junk file according to the judging result specifically includes:
Acquiring historical operation data of the distributed storage system, and determining the performance indexes and service pressure of the distributed storage system from the historical operation data.
Determining the garbage recycling strategies according to the determined performance indexes and service pressure, wherein the garbage recycling strategies comprise a full-block garbage recycling strategy and a partial garbage recycling strategy.
Setting the invalid data occupancy thresholds corresponding to the full-block garbage recycling strategy and the partial garbage recycling strategy.
Detecting, in a time-sharing manner according to the liveness level of each application file, the invalid data occupancy rate of that application file, and selecting a garbage recycling strategy according to the invalid data occupancy rate.
Determining garbage recycling parameters according to the selected strategy, and recycling the garbage according to those parameters.
Preferably, the determining whether to recycle the garbage specifically includes:
and judging whether a preset recovery threshold is met or not according to the actual data size of the application file and the occupied actual disk space of the distributed storage system.
If yes, triggering and executing the garbage collection of the distributed storage system, creating a new data structure, rewriting the application file into the allocated new disk space, releasing the original data and the storage space thereof after the application file is written, and monitoring the log record corresponding to the garbage collection of the distributed storage system.
Judging whether the current system garbage collection is finished according to the log records, if the current system garbage collection is finished, updating the storage memory of the distributed storage system, and if the current system garbage collection is not finished, analyzing the garbage collection condition according to the log records, and continuously collecting the incompletely collected garbage.
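The log-driven secondary recycling can be sketched as a retry loop. The `collect` callback, the tuple layout of the log records, and the `max_rounds` limit are hypothetical; the text only states that incomplete recycling is detected from the log and then continued.

```python
def recycle_with_retry(files, collect, max_rounds=3):
    """Recycle each file, log the outcome, and retry the files with residual garbage."""
    log = []
    pending = list(files)
    for round_no in range(1, max_rounds + 1):
        still_pending = []
        for name in pending:
            finished = collect(name)  # True when no garbage remains in the file
            log.append((round_no, name, "done" if finished else "residual"))
            if not finished:
                still_pending.append(name)
        pending = still_pending
        if not pending:
            break  # the log shows every file as done
    return log, pending
```

Any names left in `pending` after the last round are the files that still hold residual garbage.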
Preferably, determining whether the application file is in an idle state includes:
reading the running conditions of all application files, and determining the running parameters of all application files;
Determining the access frequency of each application file in a preset time period and the data access quantity of each application file in single access according to the operation parameters of the application files;
Calculating the average access frequency of the application files based on the access frequency of each application file in the preset time period;
$\phi = \frac{1}{n}\sum_{i=1}^{n} f_i$
Where $\phi$ denotes the average access frequency of the application files; $n$ denotes the total number of application files; $i$ denotes the index of an application file; $f_i$ denotes the access frequency of the i-th application file;
Calculating a first idle evaluation value of an ith application file based on the average access frequency of the application files for access and the access frequency of each application file in a preset time period;
Where $F_{1i}$ denotes the first idle evaluation score of the i-th application file; $\mu_1$ denotes the first impact weight; $\delta_1$ denotes the first error factor, with a value range of (0.01, 0.02); $e$ denotes the natural constant; $\ln(\cdot)$ denotes the logarithm to base $e$;
acquiring a preset access amount threshold, and calculating a second idle evaluation value of the ith application file based on the data access amount of the application file in single access and the access amount threshold;
Where $F_{2i}$ denotes the second idle evaluation score of the i-th application file; $s_i$ denotes the data access amount of the i-th application file in a single access; $\mu_2$ denotes the second impact weight; $\delta_2$ denotes the second error factor, with a value range of (0.01, 0.02); $S$ denotes the preset access amount threshold;
Based on the first idle evaluation score and the second idle evaluation score, calculating a comprehensive evaluation score for evaluating whether the ith application file is in an idle state;
$F_i = F_{1i} + F_{2i}$
Wherein F i represents a comprehensive evaluation score for evaluating whether the ith application file is in an idle state;
Acquiring a set evaluation qualification threshold, comparing the comprehensive evaluation value for evaluating whether the ith application file is in an idle state with the evaluation qualification threshold, and judging whether the ith application file is in the idle state;
When the comprehensive evaluation score of the i-th application file is equal to or greater than the evaluation qualification threshold, the i-th application file is judged to be in the idle state;
Otherwise, the i-th application file is judged not to be in the idle state.
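The averaging and final comparison steps of the idle judgment can be sketched as below. The formulas for the first and second idle evaluation scores are not reproduced in this text, so the sketch takes both scores as inputs rather than guessing them. It also assumes that a comprehensive score at or above the qualification threshold means the file is idle. The function names are hypothetical.

```python
def average_access_frequency(frequencies):
    """Arithmetic mean of the per-file access frequencies f_i."""
    if not frequencies:
        raise ValueError("at least one application file is required")
    return sum(frequencies) / len(frequencies)


def is_idle(first_score, second_score, qualification_threshold):
    """Sum the two idle scores (F_i = F_1i + F_2i) and compare with the threshold."""
    comprehensive = first_score + second_score
    # Assumption: a score at or above the threshold is read as "idle".
    return comprehensive >= qualification_threshold
```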
Compared with the prior art, the invention has the beneficial effects that:
1. The invention monitors the running status of application files in real time and performs data scanning and garbage recycling only on application files in the idle state. Application files in use are not affected, their normal operation is ensured, and the stability of the distributed storage system is improved. Logs are recorded during garbage recycling, so that when garbage remains, the application files with residual garbage can be located quickly and recycled a second time in good time, which ensures the completeness of garbage recycling and preserves the storage space of the distributed storage system.
2. The file identifier of an application file in the idle state is determined, and the script file and file execution parameters of that application file are retrieved from the operation parameter library according to the identifier, so that the service characteristics of the application file can be determined. From these service characteristics, the basic parameters of the internal data accessed during normal operation are analyzed, which yields the measurement index for garbage data. The application data in the idle application file is then measured case by case against this index to lock onto the garbage data. Finally, the garbage data grades of the first garbage data and the second garbage data are determined and marked accordingly, and the resulting garbage data attribute list is recorded in the file header of the application file, so that the garbage data can be processed or recycled accurately and effectively.
3. The access frequency of each application file in a preset time period and the data access amount of each application file in a single access are determined, and the first idle evaluation score and second idle evaluation score of each application file are calculated from them. The two scores are summed to give the comprehensive evaluation score, which is compared with the set evaluation qualification threshold to evaluate the idle state of each application file, ensuring that the judgment of whether an application file is idle is accurate.
Drawings
FIG. 1 is a schematic diagram of a garbage collection method in a distributed storage system of a computer according to the present invention;
FIG. 2 is a flowchart of the garbage collection method in the computer distributed storage system according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To solve the problem that existing garbage data recovery may cause an application file in the distributed storage system to stall or be suspended while it is in use, which affects normal use and reduces the stability of the distributed storage system, and referring to FIGS. 1-2, the present embodiment provides the following technical solution:
the garbage recycling method in the computer distributed storage system comprises the following steps:
Step one: obtain the data structure of each application file in the distributed storage system and the actual data size of the application file, and calculate the actual disk space of the distributed storage system occupied by all the data structures. Calculating the actual data size of an application file and the space it occupies in the distributed storage system makes it easier to calculate the garbage data that the file contains. A recovery threshold for the garbage generated in the distributed storage system is preset, so that garbage is recycled automatically once the recycling condition is met, which ensures both the garbage recycling efficiency and the release of disk space in the distributed storage system.
Step two: the monitoring system monitors the running status of all application files in the distributed storage system in real time and judges whether each application file is in the idle state. Because data scanning and garbage recycling are performed only on idle application files, application files in use are not affected, their normal operation is ensured, and the stability of the distributed storage system is improved.
Step three: the scanning system scans the idle application files in the distributed storage system in real time to acquire the garbage data of each application file, sorts and marks the application files by the size of the garbage generated, from small to large, and judges whether each application file contains junk files. Scanning the data of an application file determines its garbage data, which facilitates recovery of the junk files.
Step four: the recycling system receives the cleaned junk files, judges whether to recycle the garbage, acquires the read-write quantity of the junk files to be recycled, and judges the junk files according to the read-write quantity. This judgment allows the junk files to be recycled quickly and ensures the efficiency and completeness of garbage recycling.
Step five: a corresponding recycling strategy is formulated for the junk files according to the judging result, and the junk files are recycled according to that strategy, so that they are recycled more thoroughly and no residual junk files remain to impair the storage performance of the distributed storage system.
The scanning system specifically comprises:
The information acquisition module is used for acquiring information of the distributed storage system, namely all application files in the distributed storage system and the actual data size and data structure of each application file. This makes the storage status of the distributed storage system easy to know, so that junk files can be scanned and recovered in time and the storage function of the distributed storage system is guaranteed.
The information scanning module is used for scanning all application files in the distributed storage system to obtain the garbage of each application file. Scanning reveals the memory usage of each application file and whether it contains junk files, so that junk files are recovered in time and garbage recycling is both timely and effective.
The information marking module is used for sorting the names of the application files in descending order of the size of the garbage generated and marking the garbage in that order. The marks make the garbage status of the application files quick to determine, so that their garbage can be recovered quickly and the garbage recycling efficiency is ensured.
The recovery system specifically comprises:
The calculation module is used for calculating, according to the memory occupation information, the actually used memory value of the application files currently running in the distributed storage system. Comparing the real-time memory occupied by an application file with the actual data size obtained by the information acquisition module makes it possible to judge quickly whether the application file contains junk files, so that garbage can be recycled in time.
The judging and recycling module is used for judging, according to the acquired information, whether the actually used memory value is larger than a preset threshold value, that is, whether junk files exist. If the threshold is exceeded, it raises an alarm and transmits the result to the garbage recycling module.
The garbage recycling module is used for receiving the alarm information and executing a system garbage recycling command according to the judging result. It recycles the generated junk files and releases the memory of the distributed storage system in time, ensuring normal use of the distributed storage system.
The garbage recycling module specifically comprises:
The execution module is used for executing the system garbage recycling command according to the judging result and creating a new data structure. The recycled junk files are passed to the deleting module for deletion.
The deleting module is used for rewriting the application file into the newly allocated disk space and, after the application file has been written, releasing the original data and its storage space. Once the junk files are deleted, the result is passed to the analysis module, which analyzes whether any garbage remains.
The analysis module is used for recording a log of the garbage recycling and analyzing whether the garbage has been fully recycled. Because a log is kept during garbage recycling, the application files with residual garbage can be located quickly when garbage remains and recycled a second time in good time, which ensures the completeness of garbage recycling and preserves the storage space of the distributed storage system.
The monitoring system specifically comprises:
The real-time monitoring module is used for monitoring the running status of each application file in the distributed storage system. Real-time monitoring ensures that application files in use are not affected when garbage is recycled and identifies the application files not in use in time, which ensures stable use of the distributed storage system.
The real-time judging module is used for judging, from the monitored running status, whether each application file is in the idle state and, if so, starting the garbage data scan. It identifies idle application files in time so that garbage scanning and recycling are applied only to them, which avoids scanning application files in use and disrupting normal use.
Whether a junk file is hot-spot data is judged according to its read-write quantity.
If the read-write quantity exceeds the set threshold, the junk file is recorded as hot data and recovered to a first data area, and the data in that data area is deleted collectively.
If the set threshold is not exceeded, the junk file is recorded as cold data and recovered to a second data area for storage.
The junk files in the second data area are monitored in real time and, when the set threshold is exceeded, transferred to the first data area and deleted. This makes garbage recycling of the application files more thorough, further ensures that disk space in the distributed storage system is released, and improves the storage performance of the distributed storage system.
Making a corresponding recycling strategy for the junk file according to the judging result, wherein the method specifically comprises the following steps:
Acquiring historical operation data of the distributed storage system, and determining the performance indexes and service pressure of the distributed storage system from the historical operation data.
Determining the garbage recycling strategies according to the determined performance indexes and service pressure, wherein the garbage recycling strategies comprise a full-block garbage recycling strategy and a partial garbage recycling strategy.
Setting the invalid data occupancy thresholds corresponding to the full-block garbage recycling strategy and the partial garbage recycling strategy.
Detecting, in a time-sharing manner according to the liveness level of each application file, the invalid data occupancy rate of that application file, and selecting a garbage recycling strategy according to the invalid data occupancy rate: the full-block garbage recycling strategy is selected when the invalid data occupancy rate exceeds 60%, and the partial garbage recycling strategy is selected when it does not exceed 40%.
Determining garbage recycling parameters according to the selected strategy. The parameters are set so that garbage recycling does not interfere with the normal service processing of the distributed storage system, which improves its stability; the garbage is then recycled according to these parameters, which improves the garbage recycling efficiency.
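The strategy selection by invalid data occupancy rate can be sketched with the 60% and 40% figures given above. The function name and the return values are hypothetical, and the text does not say what happens between the two thresholds, so the sketch returns `None` there instead of choosing a strategy.

```python
def select_strategy(invalid_ratio, full_block_above=0.60, partial_up_to=0.40):
    """Pick a recycling strategy from the invalid data occupancy rate (0.0 to 1.0)."""
    if not 0.0 <= invalid_ratio <= 1.0:
        raise ValueError("invalid_ratio must be between 0 and 1")
    if invalid_ratio > full_block_above:
        return "full_block"
    if invalid_ratio <= partial_up_to:
        return "partial"
    # Between the two thresholds the source text does not specify a strategy.
    return None
```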
Judging whether to recycle garbage specifically comprises:
Judging whether the preset recovery threshold is met according to the actual data size of the application file and the actual disk space of the distributed storage system that it occupies.
If so, triggering garbage collection in the distributed storage system: creating a new data structure, rewriting the application file into the newly allocated disk space, releasing the original data and its storage space once the application file has been written, and monitoring the log records that correspond to the garbage collection.
Judging from the log records whether the system garbage collection has finished. If it has, the storage memory of the distributed storage system is updated. If it has not, the garbage collection status is analyzed from the log records and the garbage that was not fully collected continues to be collected, which avoids residual garbage and ensures the garbage collection effect.
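The trigger, rewrite, and log check above can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: the source does not say how the recovery threshold relates the two sizes, so the example assumes it is a minimum fraction of occupied space that holds no live data. The function names and the log tuple format are likewise hypothetical.

```python
def should_collect(actual_data_size: int, occupied_disk_space: int,
                   recovery_threshold: float) -> bool:
    """Assumed rule: collect when the wasted fraction of occupied space reaches the threshold."""
    if occupied_disk_space <= 0:
        return False
    wasted_fraction = 1 - actual_data_size / occupied_disk_space
    return wasted_fraction >= recovery_threshold


def collect(old_blocks: dict, live_keys: set, log: list) -> dict:
    """Rewrite live data into a new structure, then release the original storage."""
    new_blocks = {}
    for key in sorted(live_keys):
        new_blocks[key] = old_blocks[key]  # rewrite into the newly allocated space
        log.append(("copied", key))
    old_blocks.clear()                     # release original data and its space
    log.append(("released", None))
    return new_blocks


def is_finished(log: list, live_keys: set) -> bool:
    """Check the log: every live key copied and the old space released."""
    copied = {key for event, key in log if event == "copied"}
    return copied == set(live_keys) and ("released", None) in log
```

If `is_finished` returns false, a caller could compare the copied keys in the log against `live_keys` and resume with the remainder, which corresponds to continuing to collect the incompletely collected garbage.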
This embodiment provides a garbage recycling method in a computer distributed storage system in which, in step three, the scanning system scanning the application files in an idle state in the distributed storage system in real time to obtain the garbage data of each application file comprises:
Acquiring an application file in an idle state, extracting the file identifier of the application file, and generating a data access request based on the file identifier;
Accessing, through the application program interface and according to the data access request, the operation parameter library of the application file to obtain the script file and file execution parameters of the application file in the idle state, and analyzing the script file and the file execution parameters to obtain the service characteristics of the application file in the idle state;
Determining, based on the service characteristics, the basic parameters of the internal data accessed when the application file in the idle state operates normally, and determining the measurement index for garbage data based on the basic parameters;
Splitting the application data in the idle application file into equal-sized pieces to obtain data groups to be scanned, and extracting the historical access parameters of each item of data to be scanned in each data group within a preset time period;
Determining a reference count value for each item of data to be scanned within the preset time period based on the historical access parameters, and performing a first measurement on the reference count value based on the measurement index to obtain first garbage data; determining a reachability parameter for each item of data to be scanned based on the historical access parameters, and performing a second measurement on the reachability parameter based on the measurement index to obtain second garbage data;
Determining the garbage data levels of the first garbage data and the second garbage data based on the measurement index, and differentially marking the first garbage data and the second garbage data with differential markers based on the garbage data levels;
Generating a garbage data attribute list based on the differential marking result, and recording the garbage data attribute list in the file header of the application file in the idle state.
In this embodiment, the file identifier may be a tag symbol that characterizes the type or purpose of the different application files, which correspond to different file identifiers.
In this embodiment, the data access request is generated based on the file identifier and is used to retrieve the script file and the file execution parameters of the application file in the idle state.
In this embodiment, the application program interface is set in advance, and is an entry for accessing an operation parameter library, where the operation parameter library is known in advance and is used to store operation data corresponding to different application files.
In this embodiment, the script file may be a configuration encoding parameter of an application file in an idle state, including an application environment and an application scenario of the application file.
In this embodiment, the file execution parameter may be an execution policy of an application file in an idle state in a running process or a data reference relationship when accessed, and the like.
In this embodiment, the service characteristics may be the service types that the application file in the idle state invokes during normal operation, together with how the data in the application file is called during operation.
In this embodiment, the basic parameters may be the frequency at which internal data is accessed when the application file in the idle state is operating normally, and the data access amount at each access.
In this embodiment, the metric may be a measurement parameter or a measurement standard that measures whether the data in the application file is junk data, and is determined according to the basic parameter.
In this embodiment, the application data may be all encoded data contained in the application file in the idle state.
In this embodiment, the data set to be scanned may be a part of application data obtained by splitting application data in the application file in the idle state into equal specifications.
In this embodiment, the preset time period is set in advance, and may be one month or the like.
In this embodiment, the historical access parameter may be an access or reference condition parameter of the data to be scanned within a preset period of time.
In this embodiment, the reference count value may be the total number of times each item of data to be scanned is referenced within the preset time period; a larger reference count value indicates that the data to be scanned is not garbage data.
In this embodiment, the first measurement may be an analysis of the reference count values of the data to be scanned against the measurement index, in order to identify data that is not frequently referenced, i.e. the first garbage data.
In this embodiment, the reachability parameter may be a parameter indicating the degree to which an access request can reach the data; a value of 0 indicates that the data to be scanned has never been accessed or referenced.
In this embodiment, the second measurement may be an analysis of the reachability parameters against the measurement index, in order to identify application data that has never been accessed, i.e. the second garbage data.
In this embodiment, the garbage data level is determined according to the access degree of different data, wherein the garbage data level of the first garbage data is lower than the garbage data level of the second garbage data.
In this embodiment, the differential markers may be distinct marking symbols; marking garbage data of different garbage data levels with different symbols constitutes differential marking.
In this embodiment, the garbage data attribute list may be a report generated from all garbage data in the application file in the idle state, which makes it easier to process or recycle the garbage data accordingly.
The working principle and beneficial effects of this technical scheme are as follows. The file identifier of the application file in the idle state is determined first, and the script file and file execution parameters of that file are then retrieved from the operation parameter library according to the identifier, so that the service characteristics of the file can be determined. From these service characteristics, the basic parameters of the internal data accessed during normal operation are analyzed, which yields the measurement index for garbage data. The application data in the idle application file is then measured case by case against this index to identify the garbage data. Finally, the garbage data levels of the first garbage data and the second garbage data are determined, the data is marked according to those levels, and the resulting garbage data attribute list is recorded in the file header of the application file, so that the garbage data can be accurately processed or recycled.
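The two measurements and the differential marking described in this embodiment can be sketched as follows. This is a minimal illustration under assumptions: the measurement index is reduced to a single hypothetical cut-off `min_reference_count`, the marker symbols `"R"` and `"U"` are invented for the example, and the attribute list is returned as a plain list rather than written to a file header.

```python
from dataclasses import dataclass


@dataclass
class ScanItem:
    item_id: str
    reference_count: int  # times referenced within the preset time period
    reachability: int     # 0 means never accessed or referenced


def split_equal(items: list, group_size: int) -> list:
    """Split application data into equal-sized groups (the last group may be shorter)."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


def classify(groups: list, min_reference_count: int) -> list:
    """Build a garbage data attribute list with differential markers per level."""
    attribute_list = []
    for group in groups:
        for item in group:
            if item.reachability == 0:
                # Second garbage data: never accessed, so the higher garbage level.
                attribute_list.append({"id": item.item_id, "level": 2, "marker": "U"})
            elif item.reference_count < min_reference_count:
                # First garbage data: rarely referenced, so the lower garbage level.
                attribute_list.append({"id": item.item_id, "level": 1, "marker": "R"})
    return attribute_list
```

Checking reachability before the reference count means an item that was never accessed is reported once, at the higher level, rather than appearing in both categories.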
This embodiment provides a garbage recycling method in a computer distributed storage system in which, in step two, judging whether an application file is in an idle state comprises:
reading the running conditions of all application files, and determining the running parameters of all application files;
Determining the access frequency of each application file in a preset time period and the data access quantity of each application file in single access according to the operation parameters of the application files;
Calculating the average access frequency of the application files based on the access frequency of each application file in the preset time period:
$\phi = \frac{1}{n}\sum_{i=1}^{n} f_i$
Wherein $\phi$ represents the average access frequency of the application files; $n$ represents the total number of application files; $i$ represents the index of the application file; $f_i$ represents the access frequency of the $i$-th application file;
Calculating a first idle evaluation value of an ith application file based on the average access frequency of the application files for access and the access frequency of each application file in a preset time period;
Wherein $F_{1i}$ represents the first idle evaluation score of the $i$-th application file; $\mu_1$ represents a first influence weight; $\delta_1$ represents a first error factor with a value range of (0.01, 0.02); $e$ represents the natural constant; $\ln(\cdot)$ represents the logarithm to base $e$;
acquiring a preset access amount threshold, and calculating a second idle evaluation value of the ith application file based on the data access amount of the application file in single access and the access amount threshold;
Wherein $F_{2i}$ represents the second idle evaluation score of the $i$-th application file; $s_i$ represents the data access amount of the $i$-th application file in a single access; $\mu_2$ represents a second influence weight; $\delta_2$ represents a second error factor with a value range of (0.01, 0.02); $s$ represents the preset access amount threshold;
Based on the first idle evaluation score and the second idle evaluation score, calculating a comprehensive evaluation score for evaluating whether the ith application file is in an idle state;
$F_i = F_{1i} + F_{2i}$
Wherein $F_i$ represents the comprehensive evaluation score for evaluating whether the $i$-th application file is in an idle state;
Acquiring a set evaluation qualification threshold, comparing the comprehensive evaluation value for evaluating whether the ith application file is in an idle state with the evaluation qualification threshold, and judging whether the ith application file is in the idle state;
When the comprehensive evaluation value for the ith application file is equal to or greater than the evaluation qualification threshold, judging that the ith application file is not in the idle state;
Otherwise, judging that the ith application file is in the idle state.
In this embodiment, the first idle evaluation value may be a measurement parameter that is calculated by using an access frequency of each application file in a preset time period and is capable of indicating whether the application file is in an idle state, where the smaller the value, the more likely the application file is in the idle state.
In this embodiment, the preset access amount threshold is set in advance, and is a criterion for measuring whether the minimum access amount is satisfied, and can be adjusted.
In this embodiment, the second idle evaluation value may be a measurement parameter that is obtained by calculating the data access amount and is capable of indicating whether the application file is in an idle state, where the smaller the value, the more likely the application file is in the idle state.
In this embodiment, the comprehensive evaluation score is a parameter obtained by combining the first idle evaluation score and the second idle evaluation score, and it indicates overall whether the application file is in an idle state.
In this embodiment, the set evaluation pass threshold is set in advance and is adjustable.
The working principle and beneficial effects of this technical scheme are as follows. The access frequency of each application file in the preset time period and the data access amount of each application file in a single access are determined first. A first idle evaluation value and a second idle evaluation value are then calculated for each application file from the access frequency and the data access amount respectively, and the two values are summed to give the comprehensive evaluation value. Finally, the comprehensive evaluation value is compared with the set evaluation qualification threshold to evaluate the idle state of each application file, which ensures that the judgment of whether an application file is idle is accurate.
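The overall flow of the idle-state judgment can be sketched as follows. The formulas for the first and second idle evaluation values are not reproduced in this text, so the sketch substitutes a clearly hypothetical `placeholder_score` (a weighted ratio); only the averaging step, the summation, and the threshold comparison follow the description directly. All function and parameter names are assumptions.

```python
from statistics import mean
from typing import Callable, Sequence


def average_access_frequency(freqs: Sequence[float]) -> float:
    """Arithmetic mean of the per-file access frequencies."""
    return mean(freqs)


def placeholder_score(value: float, reference: float, weight: float) -> float:
    """Hypothetical stand-in for the patent's scoring formulas: a weighted ratio."""
    if reference <= 0:
        return 0.0
    return weight * value / reference


def idle_flags(freqs: Sequence[float], amounts: Sequence[float],
               amount_threshold: float, pass_threshold: float,
               mu1: float = 0.5, mu2: float = 0.5,
               score: Callable[[float, float, float], float] = placeholder_score) -> list:
    """Return True for each application file judged to be in an idle state."""
    phi = average_access_frequency(freqs)
    flags = []
    for f_i, s_i in zip(freqs, amounts):
        first = score(f_i, phi, mu1)                # first idle evaluation value
        second = score(s_i, amount_threshold, mu2)  # second idle evaluation value
        total = first + second                      # comprehensive evaluation value
        # At or above the threshold means not idle; below it means idle.
        flags.append(total < pass_threshold)
    return flags
```

Passing a different `score` callable is the intended way to plug in the actual formulas once they are available; the weights `mu1` and `mu2` default to arbitrary illustrative values.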
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. The garbage collection method in the computer distributed storage system is characterized by comprising the following steps of:
Step one: acquiring a data structure of an application file in a distributed storage system and the actual data size of the application file, calculating the actual disk space of the distributed storage system occupied by all the data structures, and presetting a recovery threshold of garbage generated in the distributed storage system;
step two: the monitoring system monitors the running conditions of all application files in the distributed storage system in real time and judges whether the application files are in an idle state or not;
Step three: the scanning system scans application files in an idle state in the distributed storage system in real time to acquire junk data of each application file, sorts and marks the generated junk files according to the sizes from small to large, and judges whether the application files comprise junk files or not;
step four: the recycling system receives the cleaned junk files, judges whether to recycle the junk, acquires the read-write quantity of the junk files to be recycled, and judges the junk files according to the read-write quantity;
Step five: and formulating a corresponding recycling strategy for the junk file according to the judging result, and recycling the junk file according to the recycling strategy.
2. The method for garbage collection in a computer distributed storage system according to claim 1, wherein: the scanning system specifically comprises:
The information acquisition module is used for acquiring information of the distributed storage system and acquiring all application files in the distributed storage system, actual data sizes and data structures of the application files;
The information scanning module is used for scanning all application files in the distributed storage system and obtaining garbage of each application file;
and the information marking module is used for sorting the names of the plurality of application files in descending order according to the size of the generated garbage and marking the garbage in descending order.
3. The method for garbage collection in a computer distributed storage system according to claim 1, wherein: the recovery system specifically comprises:
The calculation module is used for calculating the actual use memory value of the application file currently operated by the distributed storage system according to the memory occupation information;
the judging and recycling module judges whether the actually used memory value is larger than a preset threshold value according to the acquired information, and if so, an alarm is sent out;
and the garbage recycling module is used for receiving the alarm information and executing a system garbage recycling command according to the judging result.
4. A method for garbage collection in a computer distributed storage system according to claim 3, wherein: the garbage recycling module specifically comprises:
The execution module is used for executing the system garbage recycling command according to the judging result and creating a new data structure;
The deleting module is used for rewriting the application file into the allocated new disk space, and releasing the original data and the storage space thereof after the application file is written;
and the analysis module is used for recording the log corresponding to the garbage recovery and analyzing whether the garbage is recovered according to the garbage recovery condition.
5. The method for garbage collection in a computer distributed storage system according to claim 1, wherein: the monitoring system specifically comprises:
the real-time monitoring module is used for respectively monitoring the running conditions of all application files in the distributed storage system;
And the real-time judging module is used for judging whether the application files are in an idle state according to the operation conditions of all the application files in the distributed storage system obtained by monitoring, and if so, carrying out garbage data scanning.
6. The method for recycling garbage in a distributed storage system of claim 1, wherein in step three, the scanning system scans application files in an idle state in the distributed storage system in real time to obtain garbage data of each application file, and the method comprises the following steps:
Acquiring an application file in an idle state, extracting a file identifier of the application file, and generating a data access request based on the file identifier;
Accessing an operation parameter library of the application file according to the data access request based on the application program interface to obtain a script file and file execution parameters of the application file in an idle state, and analyzing the script file and the file execution parameters to obtain service characteristics of the application file in the idle state;
determining basic parameters of internal data accessed when the application file in the idle state normally operates based on the service characteristics, and determining measurement indexes of the junk data based on the basic parameters;
Splitting application data in the idle application file in an equal specification to obtain data groups to be scanned, and respectively extracting historical access parameters of each data to be scanned in each data group to be scanned in a preset time period;
Determining a reference count value of each piece of data to be scanned in a preset time period based on the historical access parameters, performing first measurement on the reference count value based on the measurement index to obtain first piece of garbage data, determining an reachable parameter of each piece of data to be scanned based on the historical access parameters, and performing second measurement on the reachable parameter based on the measurement index to obtain second piece of garbage data;
determining garbage data grades of the first garbage data and the second garbage data based on the measurement indexes, and performing difference marking on the first garbage data and the second garbage data by adopting a difference marker based on the garbage data grades;
and generating a junk data attribute list based on the difference marking result, and recording and storing the junk data attribute list in the file header of the application file in the idle state.
7. The method for garbage collection in a computer distributed storage system according to claim 1, wherein: judging whether the junk file is hot spot data or not according to the read-write quantity;
If the set threshold value is exceeded, recording the junk file as hot data, recycling the junk file to a first data area, and deleting the data in that data area collectively;
if the set threshold value is not exceeded, recording the junk file as cold data, and recycling the junk file to a second data area for storage;
monitoring the junk files in the second data area in real time, and transferring them to the first data area and deleting them when the set threshold value is exceeded.
8. The method for garbage collection in a computer distributed storage system according to claim 1, wherein: making a corresponding recycling strategy for the junk file according to the judging result, wherein the method specifically comprises the following steps:
acquiring historical operation data of the distributed storage system, and determining performance indexes and service pressure of the distributed storage system according to the historical operation data;
determining a garbage recovery strategy according to the determined performance index and the service pressure of the distributed storage system, wherein the garbage recovery strategy comprises a full garbage recovery strategy and a partial garbage recovery strategy;
setting an invalid data occupancy threshold corresponding to a full-block garbage collection strategy and a partial garbage collection strategy;
Detecting the invalid data occupancy rate of the application files corresponding to the liveness level of each application file in a time-sharing manner, and selecting a garbage collection strategy according to the invalid data occupancy rate;
And determining garbage recycling parameters according to the garbage recycling strategy, and recycling garbage according to the garbage recycling parameters.
9. The method for garbage collection in a computer distributed storage system according to claim 1, wherein: judging whether to recycle garbage or not, specifically comprising:
judging whether a preset recovery threshold is met or not according to the actual data size of the application file and the occupied actual disk space of the distributed storage system;
If yes, triggering and executing garbage collection of the distributed storage system, creating a new data structure, rewriting the application file into the allocated new disk space, releasing the original data and the storage space thereof after the application file is written, and monitoring the log record corresponding to the garbage collection of the distributed storage system;
Judging whether the current system garbage collection is finished according to the log records, if the current system garbage collection is finished, updating the storage memory of the distributed storage system, and if the current system garbage collection is not finished, analyzing the garbage collection condition according to the log records, and continuously collecting the incompletely collected garbage.
10. The method for garbage collection in a distributed storage system according to claim 1, wherein in the second step, determining whether the application file is in an idle state comprises:
reading the running conditions of all application files, and determining the running parameters of all application files;
Determining the access frequency of each application file in a preset time period and the data access quantity of each application file in single access according to the operation parameters of the application files;
Calculating the average access frequency of the application files based on the access frequency of each application file in the preset time period:
$\phi = \frac{1}{n}\sum_{i=1}^{n} f_i$
Wherein $\phi$ represents the average access frequency of the application files; $n$ represents the total number of application files; $i$ represents the index of the application file; $f_i$ represents the access frequency of the $i$-th application file;
Calculating a first idle evaluation value of an ith application file based on the average access frequency of the application files for access and the access frequency of each application file in a preset time period;
Wherein $F_{1i}$ represents the first idle evaluation score of the $i$-th application file; $\mu_1$ represents a first influence weight; $\delta_1$ represents a first error factor with a value range of (0.01, 0.02); $e$ represents the natural constant; $\ln(\cdot)$ represents the logarithm to base $e$;
acquiring a preset access amount threshold, and calculating a second idle evaluation value of the ith application file based on the data access amount of the application file in single access and the access amount threshold;
Wherein $F_{2i}$ represents the second idle evaluation score of the $i$-th application file; $s_i$ represents the data access amount of the $i$-th application file in a single access; $\mu_2$ represents a second influence weight; $\delta_2$ represents a second error factor with a value range of (0.01, 0.02); $s$ represents the preset access amount threshold;
Based on the first idle evaluation score and the second idle evaluation score, calculating a comprehensive evaluation score for evaluating whether the ith application file is in an idle state;
$F_i = F_{1i} + F_{2i}$
Wherein $F_i$ represents the comprehensive evaluation score for evaluating whether the $i$-th application file is in an idle state;
Acquiring a set evaluation qualification threshold, comparing the comprehensive evaluation value for evaluating whether the ith application file is in an idle state with the evaluation qualification threshold, and judging whether the ith application file is in the idle state;
When the comprehensive evaluation value of evaluating whether the ith application file is in the idle state is equal to or greater than the evaluation qualification threshold, judging that the ith application file is not in the idle state;
Otherwise, it is determined that the ith application file is in an idle state.
CN202410085602.5A 2024-01-22 2024-01-22 Garbage recycling method in computer distributed storage system Pending CN117908794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410085602.5A CN117908794A (en) 2024-01-22 2024-01-22 Garbage recycling method in computer distributed storage system

Publications (1)

Publication Number Publication Date
CN117908794A true CN117908794A (en) 2024-04-19

Family

ID=90681769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410085602.5A Pending CN117908794A (en) 2024-01-22 2024-01-22 Garbage recycling method in computer distributed storage system

Country Status (1)

Country Link
CN (1) CN117908794A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160232088A1 (en) * 2014-07-17 2016-08-11 Sandisk Enterprise Ip Llc Garbage Collection in Storage System with Distributed Processors
US20170300410A1 (en) * 2016-04-13 2017-10-19 Nanjing University Method and System for Optimizing Deterministic Garbage Collection in Nand Flash Storage Systems
CN109376094A (en) * 2018-09-27 2019-02-22 郑州云海信息技术有限公司 A kind of method of the cold and hot Data Migration of garbage reclamation in storage system
CN116225319A (en) * 2022-12-29 2023-06-06 天翼云科技有限公司 Method and device for reducing data migration during garbage data recovery
CN116467267A (en) * 2023-03-31 2023-07-21 阿里巴巴(中国)有限公司 Garbage recycling method, device, storage medium and system
US20230237064A1 (en) * 2021-07-08 2023-07-27 Tencent Technology (Shenzhen) Company Limited Data processing method, apparatus, and system, computer device, readable storage medium, and computer program product
CN116700634A (en) * 2023-08-08 2023-09-05 苏州浪潮智能科技有限公司 Garbage recycling method and device for distributed storage system and distributed storage system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵楠楠: "分布式键值存储***高效能数据布局技术研究", 信息科技, vol. 2017, no. 08, 31 August 2017 (2017-08-31), pages 137 - 11 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination