CN109739817B - Method and system for storing data file in big data storage system

Info

Publication number: CN109739817B
Application number: CN201811604490.0A
Authority: CN
Inventors: Name withheld (不公告发明人)
Original assignee: Shenzhen Guangdian Software Technology Co ltd
Current assignee: Shenzhen Guangdian Software Technology Co ltd
Priority date: 2018-12-26
Filing date: 2018-12-26
Publication date: 2023-01-03
Anticipated expiration: 2038-12-26
Also published as: CN109739817A

Abstract

The invention discloses a method and a system for storing data files in a big data storage system, wherein the method comprises the following steps: determining first access record information of the big data storage system in a current operation time interval and second access record information in a previous operation time interval; determining a first total number of accesses FAN of all data files in the current running time interval and a second total number of accesses SAN in the previous running time interval; analyzing device record files stored in recording devices of a big data storage system to determine a first device number FDN of the storage devices effectively operating in a current operation time interval and determine a second device number SDN of the storage devices effectively operating in a previous operation time interval; when it is determined to enter the storage process for the data file, the big data storage system sends a notification message to each of all the storage devices to instruct each storage device to start performing the storage process for the data file.

Description

Method and system for storing data file in big data storage system

Technical Field

The present invention relates to the field of big data storage, and more particularly, to a method and system for storing data files in a big data storage system.

Background

Currently, with the continuing development of information technology, more and more devices are capable of generating and using various types of data. To make better use of data through analysis, it is often necessary to store the data in a big data storage system. However, in current big data storage systems, the storage of data files is poorly managed. For example, when data files in a big data storage system are significantly related by time or topic, the current big data storage system cannot store them in an efficiently classified manner according to that time or topic relevance.

Disclosure of Invention

According to an aspect of the present invention, there is provided a method of storing data files in a large data storage system, the method comprising:

when the current operation time interval of the big data storage system is ended, analyzing an access record file stored in recording equipment of the big data storage system to determine first access record information of the big data storage system in the current operation time interval and second access record information of the big data storage system in a previous operation time interval, wherein the previous operation time interval and the current operation time interval comprise the same number of natural days and are two adjacent operation time intervals in terms of time;

analyzing the first access record information to determine a first accessed total FAN of all data files of the big data storage system in a current running time interval, and analyzing the second access record information to determine a second accessed total SAN of all data files of the big data storage system in a previous running time interval;

analyzing device record files stored in recording devices of a big data storage system to determine a first device number FDN of the storage devices which effectively operate in a current operation time interval and determine a second device number SDN of the storage devices which effectively operate in a previous operation time interval, wherein effective operation means that when the continuous operation time of the storage devices in the big data storage system in a specific operation time interval reaches a predetermined number of natural days, the storage devices are determined to be effectively operated in the specific operation time interval;

when the first device number FDN is larger than the second device number SDN and the first accessed total number FAN is larger than the second accessed total number SAN, determining whether the ratio FDN/SDN is larger than 120%, if yes, determining whether the ratio FAN/SAN of the first accessed total number FAN to the second accessed total number SAN is larger than the ratio FDN/SDN of the first device number to the second device number, and if yes, entering a storage process of the data file;

or when the first device number FDN is smaller than the second device number SDN and the first accessed total number FAN is smaller than the second accessed total number SAN, determining whether the ratio FAN/SAN of the first accessed total number FAN to the second accessed total number SAN is larger than the ratio FDN/SDN of the first device number to the second device number, if so, determining whether the absolute value of the difference between the ratio FAN/SAN and the ratio FDN/SDN is larger than an increase threshold, and if so, entering a storage process of the data file;

when it is determined that the storage process of the data file is entered, the big data storage system sends a notification message to each of all the storage devices to instruct each storage device to start the storage process of the data file:

taking the storage device receiving the notification message as a current storage device, analyzing local record files stored in a record storage area of the current storage device to obtain accessed records of all data files in the current storage device in a current running time interval, and counting the accessed records of all data files in the current storage device in the current running time interval by taking a preset time length as a basic time unit to determine the number of times of accessing each basic time unit of all data files of the current storage device in the current running time interval;

taking a basic time unit with the access times larger than the access time threshold value in the plurality of basic time units as a statistical time unit to obtain at least one statistical time unit, determining a plurality of data files related in each statistical time unit, and acquiring metadata or profile data of each data file of the plurality of data files related in each statistical time unit;

generating summary information of each statistical time unit based on metadata or profile data of a plurality of data files involved in each statistical time unit, and creating an associated storage space for each statistical time unit and using the summary information of each statistical time unit as the summary information of the associated storage space;

when a current storage device receives a new data file, acquiring metadata or profile data of the new data file, performing content matching on the metadata or profile data of the new data file and summary information of each associated storage space to determine the content matching degree of the new data file and each associated storage space, determining the associated storage space with the maximum content matching degree of the new data file, and storing the new data file into the associated storage space with the maximum content matching degree.

The method further comprises: setting a plurality of running time intervals for the running of the big data storage system when the big data storage system initially runs, wherein each running time interval comprises the same number of natural days, and determining the running time interval in which the current time is located as the current running time interval;

wherein each runtime interval includes 50 natural days, 80 natural days, 100 natural days, or 120 natural days;

alternatively, each runtime interval comprises at least 100 natural days;

the previous running time interval and the current running time interval comprise at least 100 natural days;

the time intervals adjacent to the current operation time interval are a previous operation time interval and a next operation time interval;

the recording equipment is used for storing an access record file, the access record file comprises a plurality of access record information, and each access record information is associated with a corresponding running time interval and is used for recording all access records of the big data storage system in the corresponding running time interval;

wherein all access records are a collection of accessed information for all data files within the big data storage system;

wherein each access record is accessed information for a single data file within the big data storage system; each access record is used to record one access of a single data file within the big data storage system;

wherein analyzing the access record file stored in the recording equipment of the big data storage system when the current running time interval of the big data storage system is ended specifically comprises:

when the current running time interval of the big data storage system is ended and a next running time interval is not entered, analyzing an access record file stored in recording equipment of the big data storage system;

wherein any two adjacent running time intervals in the plurality of running time intervals have a transition time period therebetween; the transition time period occupies a period of time of a beginning part of a latter one of any two adjacent operation time intervals, or the transition time period occupies a period of time of an ending part of a former one of any two adjacent operation time intervals.
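The interval bookkeeping described above can be sketched as follows. This is only an illustrative sketch: the function names `interval_index` and `interval_bounds`, the start date, and the 100-day interval length are assumptions chosen for the example, and transition time periods are not modelled here.

```python
from datetime import date, timedelta
from typing import Tuple


def interval_index(system_start: date, days_per_interval: int, today: date) -> int:
    """Zero-based index of the operation time interval that contains `today`."""
    if days_per_interval <= 0:
        raise ValueError("days_per_interval must be positive")
    if today < system_start:
        raise ValueError("today precedes the first operation day")
    return (today - system_start).days // days_per_interval


def interval_bounds(system_start: date, days_per_interval: int, index: int) -> Tuple[date, date]:
    """First and last natural day (both inclusive) of interval `index`."""
    first = system_start + timedelta(days=index * days_per_interval)
    last = first + timedelta(days=days_per_interval - 1)
    return first, last


# Hypothetical system that started on 2018-01-01 with 100-day intervals.
system_start = date(2018, 1, 1)
current = interval_index(system_start, 100, date(2018, 6, 1))
current_bounds = interval_bounds(system_start, 100, current)
previous_bounds = interval_bounds(system_start, 100, current - 1)
print(current, current_bounds, previous_bounds)
```

Because every interval has the same number of natural days, the previous interval is simply the one with index `current - 1`.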

Wherein each access record comprises at least: an identifier and an access start time of the data file;

taking the number of access records with access start times within the current running time interval in all access records of the first access record information as a first accessed total number FAN of all data files of the large data storage system within the current running time interval;

and taking the number of access records with access start times located in the previous running time interval in all the access records of the second access record information as a second accessed total SAN of all the data files of the large data storage system in the previous running time interval.
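As a minimal sketch of this counting step, the example below counts access records whose access start time falls inside a given interval. The `AccessRecord` class and `accessed_total` function are hypothetical names, and a single in-memory list stands in for the first and second access record information.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class AccessRecord:
    file_id: str            # identifier of the data file
    access_start: datetime  # access start time


def accessed_total(records: Iterable[AccessRecord], start: datetime, end: datetime) -> int:
    """Number of records whose access start time lies in [start, end)."""
    return sum(1 for r in records if start <= r.access_start < end)


# Two adjacent 10-day intervals, chosen only for illustration.
previous = (datetime(2018, 10, 1), datetime(2018, 10, 11))
current = (datetime(2018, 10, 11), datetime(2018, 10, 21))
records = [
    AccessRecord("a.csv", datetime(2018, 10, 3, 9, 0)),
    AccessRecord("a.csv", datetime(2018, 10, 12, 9, 0)),
    AccessRecord("b.csv", datetime(2018, 10, 12, 9, 5)),
    AccessRecord("a.csv", datetime(2018, 10, 20, 23, 59)),
]
fan = accessed_total(records, *current)   # first accessed total number FAN
san = accessed_total(records, *previous)  # second accessed total number SAN
print(fan, san)
```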

The recording device is further configured to store a device record file, the device record file including a plurality of device record information, wherein each device record information is associated with a respective run-time interval for recording run records of all storage devices within the big data storage system in the respective run-time interval; wherein the operation records of all the storage devices are a set of operation information of all the storage devices in the big data storage system;

each running record is a single running record of the storage device in the big data storage system, and each running record or single running record at least comprises an identifier of the storage device, the starting running time of the device and the ending running time of the device; each running record or single-time running record is used for recording one-time running of a single storage device;

taking, as the first device number FDN of the storage devices which effectively operate in the current operation time interval, the number of storage devices for which, among all operation records of all operation time intervals in the device record file, the device start operation time is in the current operation time interval and the device operation time length in the current operation time interval is greater than the predetermined number of natural days;

taking, as the second device number SDN of the storage devices which effectively operate in the previous operation time interval, the number of storage devices for which, among all operation records in the device record file, the device start operation time is in the previous operation time interval and the device operation time length in the previous operation time interval is greater than the predetermined number of natural days;

the storage devices which effectively operate in the current operation time interval are the storage devices which continuously operate in the current operation time interval and reach a preset number of natural days in all the storage devices of the big data storage system;

the storage devices which effectively operate in the previous operation time interval are the storage devices which continuously operate in the previous operation time interval and reach a preset number of natural days in all the storage devices of the large data storage system;

wherein the number of natural days included in the specific operation time interval is TD;

when TD ≥ 100, the predetermined number of natural days is [expression not reproduced in the source text] natural days;

or, when 100 > TD ≥ 50, the predetermined number of natural days is [expression not reproduced in the source text] natural days.
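The counting of effectively operating storage devices can be sketched as below. This is a sketch under stated assumptions: `RunRecord` and `count_effective_devices` are hypothetical names, the predetermined number of natural days is passed in as the parameter `required_days` because its defining expression is not available in the text, and "reaches" is read as greater than or equal to.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class RunRecord:
    device_id: str       # identifier of the storage device
    run_start: datetime  # device start operation time
    run_end: datetime    # device end operation time


def count_effective_devices(records: Iterable[RunRecord], start: datetime,
                            end: datetime, required_days: int) -> int:
    """Count devices with one continuous run that starts in [start, end) and
    lasts at least `required_days` natural days inside that interval."""
    effective = set()
    for r in records:
        if not (start <= r.run_start < end):
            continue
        clipped_end = min(r.run_end, end)  # ignore time after the interval
        if clipped_end - r.run_start >= timedelta(days=required_days):
            effective.add(r.device_id)
    return len(effective)


interval_start = datetime(2018, 10, 1)
interval_end = interval_start + timedelta(days=100)
runs = [
    RunRecord("dev-A", datetime(2018, 10, 1), datetime(2018, 12, 15)),  # 75 days
    RunRecord("dev-B", datetime(2018, 10, 1), datetime(2018, 10, 20)),  # 19 days
    RunRecord("dev-B", datetime(2018, 11, 1), datetime(2018, 11, 30)),  # 29 days
    RunRecord("dev-C", datetime(2018, 12, 1), datetime(2019, 2, 1)),    # 39 days in interval
]
fdn = count_effective_devices(runs, interval_start, interval_end, required_days=60)
print(fdn)
```

Two separate runs of the same device are not added together here, since the text speaks of continuous operation.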

The increase threshold is 5%, 10%, 15%, 20%, 25%, 30% or 40%.

Further comprising: entering a data file acquisition process to cause the big data storage system to acquire a new data file from an external device when the first device number FDN is greater than the second device number SDN and the first accessed total number FAN is less than or equal to the second accessed total number SAN;

entering a storage device acquisition process to cause addition of a new storage device within the big data storage system when the first number of devices FDN is less than or equal to the second number of devices SDN and the first total number of accesses FAN is greater than the second total number of accesses SAN.
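The branching on FDN, SDN, FAN and SAN described in this section can be summarised in one function. The sketch below is an illustration only: the `Action` enum and `decide` function are hypothetical names, the default increase threshold of 20% is one of the listed values, and returning `NONE` for combinations the text does not cover is an assumption.

```python
from enum import Enum


class Action(Enum):
    STORE = "storage process of the data file"
    ACQUIRE_FILES = "data file acquisition process"
    ACQUIRE_DEVICES = "storage device acquisition process"
    NONE = "no process entered"


def decide(fdn: int, sdn: int, fan: int, san: int,
           increase_threshold: float = 0.20) -> Action:
    """Map the four counts to the process that should be entered."""
    if sdn <= 0 or san <= 0:
        raise ValueError("SDN and SAN must be positive to form the ratios")
    device_ratio = fdn / sdn   # FDN/SDN
    access_ratio = fan / san   # FAN/SAN
    if fdn > sdn and fan > san:
        if device_ratio > 1.20 and access_ratio > device_ratio:
            return Action.STORE
        return Action.NONE
    if fdn < sdn and fan < san:
        if (access_ratio > device_ratio
                and abs(access_ratio - device_ratio) > increase_threshold):
            return Action.STORE
        return Action.NONE
    if fdn > sdn and fan <= san:
        return Action.ACQUIRE_FILES
    if fdn <= sdn and fan > san:
        return Action.ACQUIRE_DEVICES
    return Action.NONE  # e.g. FDN == SDN and FAN <= SAN: not covered by the text


print(decide(13, 10, 150, 100))  # devices up 30%, accesses up 50%
print(decide(12, 10, 90, 100))   # more devices, no more accesses
```

In both storage branches the trigger is the same idea: accesses grew faster (or shrank more slowly) than the number of effectively operating devices.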

Wherein the current storage device is one of all storage devices within the big data storage system;

wherein each storage device has a record storage area for storing local record files;

the local record file comprises a plurality of local record information, wherein each local record information is associated with a corresponding operation time interval and is used for recording all local records of the storage device in the corresponding operation time interval;

wherein all local records are a collection of accessed information for all data files within the storage device;

wherein each local record is accessed information for a single data file within the storage device;

wherein each local record is for recording a single access of a single data file, each local record comprising at least: an identifier and an access start time of the data file;

analyzing the local record files stored in the record storage area of the current storage device, and taking a plurality of local records of which the access start time is positioned in the current running time interval in all local records in all running time intervals in the local record files as accessed records of all data files in the current storage device in the current running time interval;

the predetermined length of time is 1 minute, 2 minutes, 5 minutes, 10 minutes, 15 minutes, 20 minutes, or 30 minutes; dividing a current running time interval into a plurality of basic time units according to a preset time length, distributing a plurality of (all) local records in the current running time interval into corresponding basic time units according to the access starting time of each local record, and counting the number of the plurality of local records in each basic time unit to determine the number of the plurality of local records in each basic time unit as the number of times of accessing each basic time unit of all data files of the current storage device in the current running time interval;

the threshold number of visits is 20, 30, 40, 50, 60, 80, 100, 200, 300, or 500;

wherein the same data file can belong to at least two different statistical time units;

each data file has metadata or profile data describing content information, function information or topic information of the data file.
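The division into basic time units and the selection of statistical time units can be sketched as follows. The function name `statistical_units`, the tuple layout of the records, and the small threshold used in the example are assumptions made for illustration; records are assumed to lie inside the current running time interval.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, Set, Tuple


def statistical_units(records: Iterable[Tuple[str, datetime]],
                      interval_start: datetime,
                      unit_length: timedelta,
                      access_threshold: int) -> Dict[int, Set[str]]:
    """Return {unit index: file ids} for basic time units whose access count
    exceeds `access_threshold`. Each record is (file_id, access_start)."""
    counts: Dict[int, int] = defaultdict(int)
    files: Dict[int, Set[str]] = defaultdict(set)
    for file_id, access_start in records:
        unit = (access_start - interval_start) // unit_length  # integer unit index
        counts[unit] += 1
        files[unit].add(file_id)
    return {u: files[u] for u, n in counts.items() if n > access_threshold}


interval_start = datetime(2018, 10, 1)
records = [
    ("a", datetime(2018, 10, 1, 0, 1)),
    ("b", datetime(2018, 10, 1, 0, 5)),
    ("a", datetime(2018, 10, 1, 0, 9)),
    ("c", datetime(2018, 10, 1, 0, 12)),
    ("a", datetime(2018, 10, 1, 0, 15)),
]
units = statistical_units(records, interval_start, timedelta(minutes=10), access_threshold=2)
print(units)
```

The same file can appear under several unit indices, which matches the statement that one data file can belong to at least two different statistical time units.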

Wherein generating summary information for each statistical time unit based on metadata or profile data for a plurality of data files involved in each statistical time unit comprises:

clustering metadata or profile data of each data file in a plurality of data files related in each statistical time unit, and generating summary information of each statistical time unit according to category information obtained by clustering; or

selecting a predetermined number of selected data files from the plurality of data files involved in each statistical time unit, and combining metadata or profile data of each selected data file to generate summary information for each statistical time unit; or

classifying metadata or profile data of each data file in a plurality of data files related in each statistical time unit according to a similar meaning word classification mode to generate a plurality of classifications, determining the classification with the largest number of included data files as a selected classification, and combining all metadata or profile data related to the selected classification to generate summary information of each statistical time unit;

after the associated storage space is created for each statistical time unit, all data files related to each statistical time unit are saved in the associated storage space;

wherein any two associated memory spaces of the plurality of associated memory spaces can have at least one identical data file;

alternatively, after creating the associated storage space for each statistical time unit, no data files are saved into the associated storage space associated with each statistical time unit, and the associated storage space is used only for saving new data files.
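One simple way to picture the generation of summary information is sketched below. It is not any of the specific alternatives listed above: as a stand-in for clustering or synonym-based classification, it assumes each file's metadata is already a set of keywords and keeps the keywords shared by the most files. The name `summarize_unit` and the parameter `top_k` are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def summarize_unit(file_keywords: Dict[str, Set[str]], top_k: int = 3) -> Set[str]:
    """Summary of one statistical time unit: the `top_k` keywords that occur
    in the largest number of files (ties broken alphabetically)."""
    counter: Counter = Counter()
    for keywords in file_keywords.values():
        counter.update(set(keywords))  # count each keyword once per file
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {keyword for keyword, _ in ranked[:top_k]}


unit_files = {
    "a": {"weather", "sensor", "2018"},
    "b": {"weather", "sensor"},
    "c": {"weather", "log"},
}
summary = summarize_unit(unit_files, top_k=2)
print(summary)
```

The resulting keyword set would then serve as the summary information of the associated storage space created for that statistical time unit.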

The content matching comprises semantic matching, keyword matching, text matching, content matching or word meaning matching;

content matching metadata or profile data of the new data file with summary information of each associated storage space comprises: matching the content information, the function information or the subject information in the metadata or the profile data of the new data file with the summary information of each associated storage space;

and when at least two associated storage spaces with the maximum content matching degree with the new data file exist, randomly saving the new data file to one of the at least two associated storage spaces with the maximum content matching degree.
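The placement of a new data file can be sketched as follows. The text leaves the matching technique open (semantic, keyword, text matching and so on); this sketch assumes keyword matching scored by Jaccard similarity, which is a choice made here for illustration. The names `jaccard` and `choose_space` are hypothetical.

```python
import random
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Keyword overlap in [0, 1]; used here as the content matching degree."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_space(new_file_keywords: Set[str],
                 space_summaries: Dict[str, Set[str]],
                 rng=random) -> str:
    """Return the id of the associated storage space with the maximum matching
    degree; ties are broken by a random choice among the best spaces."""
    if not space_summaries:
        raise ValueError("no associated storage space exists")
    scores = {sid: jaccard(new_file_keywords, summary)
              for sid, summary in space_summaries.items()}
    best = max(scores.values())
    candidates = sorted(sid for sid, score in scores.items() if score == best)
    return rng.choice(candidates)


spaces = {"s1": {"weather", "sensor"}, "s2": {"finance", "report"}}
target = choose_space({"weather", "forecast"}, spaces)
print(target)
```

Passing a seeded `random.Random` instance as `rng` makes the tie-break reproducible in tests.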

According to another aspect of the present invention, there is provided a system for storing data files in a large data storage system, the system comprising:

the analysis device analyzes the access record file stored in the recording equipment of the big data storage system when the current operation time interval of the big data storage system is finished so as to determine first access record information of the big data storage system in the current operation time interval and second access record information of the big data storage system in the previous operation time interval, wherein the previous operation time interval and the current operation time interval both comprise the same number of natural days and are two adjacent operation time intervals in time; analyzing the first access record information to determine a first accessed total FAN of all data files of the big data storage system in the current running time interval, and analyzing the second access record information to determine a second accessed total SAN of all data files of the big data storage system in the previous running time interval;

analyzing device record files stored in recording devices of a big data storage system to determine a first device number FDN of the storage devices which effectively operate in a current operation time interval and determine a second device number SDN of the storage devices which effectively operate in a previous operation time interval, wherein effective operation refers to that when the continuous operation time of the storage devices in the big data storage system in a specific operation time interval reaches a preset number of natural days, the storage devices are determined to be effectively operated in the specific operation time interval;

a judging device which, when the first device number FDN is greater than the second device number SDN and the first accessed total number FAN is greater than the second accessed total number SAN, determines whether the ratio FDN/SDN is greater than 120%; if yes, determines whether the ratio FAN/SAN of the first accessed total number FAN to the second accessed total number SAN is greater than the ratio FDN/SDN of the first device number to the second device number; and if yes, enters the storage process of the data file;

or when the first device number FDN is smaller than the second device number SDN and the first accessed total number FAN is smaller than the second accessed total number SAN, determining whether the ratio FAN/SAN of the first accessed total number FAN to the second accessed total number SAN is larger than the ratio FDN/SDN of the first device number to the second device number, if so, determining whether the absolute value of the difference between the ratio FAN/SAN and the ratio FDN/SDN is larger than an increase threshold, and if so, entering the storage process of the data file;

and an identification device, configured such that, when it is determined to enter the storage process of the data file, the big data storage system sends a notification message to each of all the storage devices to instruct each storage device to start the storage process of the data file:

taking a basic time unit with the access times larger than the threshold value of the access times in the basic time units as a statistical time unit to obtain at least one statistical time unit, determining a plurality of data files related in each statistical time unit, and acquiring metadata or profile data of each data file of the plurality of data files related in each statistical time unit;

generating summary information of each statistical time unit based on metadata or profile data of a plurality of data files involved in each statistical time unit, and creating an associated storage space for each statistical time unit and taking the summary information of each statistical time unit as the summary information of the associated storage space;

the storage device is used for acquiring metadata or profile data of a new data file when the current storage equipment receives the new data file, performing content matching on the metadata or the profile data of the new data file and the summary information of each associated storage space to determine the content matching degree of the new data file and each associated storage space, determining the associated storage space with the maximum content matching degree of the new data file, and storing the new data file into the associated storage space with the maximum content matching degree.

alternatively, each runtime interval comprises at least 100 natural days;

the recording device is used for storing an access record file, the access record file comprises a plurality of access record information, and each access record information is associated with a corresponding running time interval and is used for recording all access records of the big data storage system in the corresponding running time interval;

when the current operation time interval of the big data storage system is ended and the next operation time interval is not entered, analyzing an access record file stored in recording equipment of the big data storage system;

wherein any two adjacent running time intervals in the plurality of running time intervals have a transition time period therebetween; the transition time period occupies a period of time of a beginning part of a later operation time interval in any two adjacent operation time intervals, or the transition time period occupies a period of time of an ending part of a previous operation time interval in any two adjacent operation time intervals.

each running record is a single running record of a storage device in the big data storage system, and each running record or single running record at least comprises an identifier of the storage device, a device starting running time and a device ending running time; each running record or single-time running record is used for recording one-time running of a single storage device;

taking, as a first device number FDN of the storage devices which effectively operate in the current operation time interval, the number of storage devices for which, among all operation records of all operation time intervals in the device record file, the device start operation time is in the current operation time interval and the device operation time length in the current operation time interval is greater than the predetermined number of natural days;

wherein the storage devices that are actively operating in the current runtime interval are storage devices of all storage devices of the big data storage system that have been continuously operating for a predetermined number of natural days in the current runtime interval;

wherein the storage devices that were actively operating in the previous runtime interval are storage devices of all storage devices of the big data storage system that were continuously operating for the previous runtime interval for a predetermined number of natural days;

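when TD ≥ 100, the predetermined number of natural days is [expression not reproduced in the source text] natural days;

or, when 100 > TD ≥ 50, the predetermined number of natural days is [expression not reproduced in the source text] natural days.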

The increase threshold is 20%.

Further comprising: when the first device number FDN is larger than the second device number SDN and the first accessed total number FAN is smaller than or equal to the second accessed total number SAN, entering a data file acquisition process to cause the big data storage system to acquire a new data file from an external device;

the predetermined length of time is 10 minutes; dividing a current running time interval into a plurality of basic time units according to a preset time length, distributing a plurality of (all) local records in the current running time interval into corresponding basic time units according to the access starting time of each local record, and counting the number of the plurality of local records in each basic time unit to determine the number of the plurality of local records in each basic time unit as the number of times of accessing each basic time unit of all data files of the current storage device in the current running time interval;

the threshold number of accesses is 20.

clustering metadata or profile data of each data file in a plurality of data files related in each statistical time unit, and generating summary information of each statistical time unit according to category information obtained by clustering;

wherein any two associated storage spaces of the plurality of associated storage spaces can have at least one identical data file;

the content matching of the metadata or profile data of the new data file with the summary information of each associated storage space comprises: performing content matching on content information, function information or subject information in metadata or profile data of the new data file and summary information of each associated storage space;

Drawings

A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:

FIG. 1 is a flow chart of a method of storing data files in a big data storage system according to the present invention;

FIG. 2 is a schematic diagram of a data storage method according to the present invention; and

FIG. 3 is a schematic diagram of a system for storing data files in a large data storage system according to the present invention.

Detailed Description

FIG. 1 is a flow chart of a method 100 of storing data files in a big data storage system according to the present invention. As shown in FIG. 1, in step 101, at the end of a current operation time interval of the big data storage system, an access record file stored in a recording device of the big data storage system is parsed to determine first access record information of the big data storage system in the current operation time interval and second access record information of the big data storage system in a previous operation time interval, wherein the previous operation time interval and the current operation time interval both include the same number of natural days and are two operation time intervals adjacent in time.

The method further comprises the steps of setting a plurality of running time intervals for the running of the big data storage system when the big data storage system is initially run, wherein each running time interval comprises the same number of natural days, and determining the running time interval in which the current time is positioned as the current running time interval. Wherein each of the operation time intervals comprises 50 natural days, 80 natural days, 100 natural days, or 120 natural days, or each of the operation time intervals comprises at least 100 natural days. For example, the previous and current run-time intervals each include at least 100 natural days. The time intervals adjacent to the current run time interval are the previous run time interval and the next run time interval.

The recording device is used for storing an access record file. The access record file comprises a plurality of access record information, and each access record information is associated with a corresponding running time interval and records all access records of the big data storage system in that running time interval. All access records are the collection of accessed information for all data files within the big data storage system. Each access record is the accessed information for a single data file within the big data storage system, and records one access of that single data file. Analyzing the access record file stored in the recording equipment of the big data storage system when the current running time interval is ended specifically comprises: when the current operation time interval of the big data storage system is ended and a transition time period is entered, or the effective time (actual operation period) of the next operation time interval is not yet entered, analyzing the access record file stored in the recording equipment of the big data storage system;

wherein any two adjacent running time intervals in the plurality of running time intervals have a transition time period therebetween; the transition time period occupies a period of time at the beginning of the latter of any two adjacent operation time intervals, or a period of time at the end of the former of any two adjacent operation time intervals. For example, suppose each operation time interval is 10 natural days and the current operation time interval runs from October 1, 2018 to October 10, 2018, that is, from 00:00 on October 1, 2018 to 24:00 on October 10, 2018. The latter operation time interval then runs from 00:00 on October 11, 2018 to 24:00 on October 20, 2018. In general, the present application may use the 1 hour starting at 00:00 of the latter operation time interval as the transition time period. Alternatively, the period starting at 23:00 at the end of the current operation time interval may be used. It should be appreciated that the transition time period belongs to the operation time interval in which it is located, that is, either to the latter operation time interval or to the current operation time interval. When the relevant information is counted for any operation time interval, the data in the transition time period is included in the statistics; that is, the transition time period belongs to that operation time interval and participates in its data statistics. However, the transition time period is what is actually used for the processing related to data file storage. That is, whether the length of the transition time period is 1 hour, 2 hours, or another reasonable value, the current operation time interval is 10 complete natural days and the next operation time interval is also 10 complete natural days.
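The example above can be made concrete with a short sketch. It reads the example intervals as October 1 to 10 and October 11 to 20, 2018, and assumes a 1-hour transition time period placed either at the start of the latter interval or at the end of the former one; the names `transition_window` and `in_window` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple

Window = Tuple[datetime, datetime]


def transition_window(next_interval_start: datetime,
                      length: timedelta = timedelta(hours=1),
                      at_start_of_next: bool = True) -> Window:
    """Half-open transition window [begin, end) around an interval boundary."""
    if at_start_of_next:
        return next_interval_start, next_interval_start + length
    return next_interval_start - length, next_interval_start


def in_window(now: datetime, window: Window) -> bool:
    """True while `now` lies inside the transition window."""
    begin, end = window
    return begin <= now < end


boundary = datetime(2018, 10, 11)  # 00:00 on October 11, 2018
leading = transition_window(boundary)                           # 00:00 to 01:00, Oct 11
trailing = transition_window(boundary, at_start_of_next=False)  # 23:00 to 24:00, Oct 10
print(in_window(datetime(2018, 10, 11, 0, 30), leading))
print(in_window(datetime(2018, 10, 10, 23, 30), trailing))
```

Either placement leaves both intervals at 10 complete natural days, because the window only marks when the storage-related processing runs and does not shorten the interval used for statistics.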

In step 102, the first access record information is parsed to determine a first accessed total number FAN of all data files of the big data storage system in the current running time interval, and the second access record information is parsed to determine a second accessed total number SAN of all data files of the big data storage system in the previous running time interval.

Each access record comprises at least an identifier of the data file and an access start time. The number of access records, among all access records of the first access record information, whose access start time lies within the current running time interval is taken as the first accessed total number FAN of all data files of the big data storage system in the current running time interval. The number of access records, among all access records of the second access record information, whose access start time lies within the previous running time interval is taken as the second accessed total number SAN of all data files of the big data storage system in the previous running time interval.
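The counting rule for FAN and SAN can be sketched as below. This is a minimal sketch under stated assumptions: the `AccessRecord` type, its field names and the function `total_accesses` are hypothetical, and running time intervals are treated as half-open ranges [start, end).

```python
from datetime import datetime
from typing import NamedTuple

class AccessRecord(NamedTuple):
    file_id: str            # identifier of the data file
    access_start: datetime  # access start time

def total_accesses(records, interval_start, interval_end):
    """Count the access records whose start time lies inside the interval."""
    return sum(1 for r in records
               if interval_start <= r.access_start < interval_end)

records = [
    AccessRecord("f1", datetime(2018, 10, 2, 9, 0)),
    AccessRecord("f2", datetime(2018, 10, 3, 14, 30)),
    AccessRecord("f1", datetime(2018, 10, 12, 8, 0)),
]
# SAN for the earlier interval, FAN for the later one.
san = total_accesses(records, datetime(2018, 10, 1), datetime(2018, 10, 11))
fan = total_accesses(records, datetime(2018, 10, 11), datetime(2018, 10, 21))
```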

In step 103, a device record file stored in the recording device of the big data storage system is parsed to determine a first device number FDN of storage devices that operated effectively in the current running time interval and a second device number SDN of storage devices that operated effectively in the previous running time interval. Here, effective operation means that a storage device in the big data storage system is determined to have operated effectively in a specific running time interval when the time for which it operated continuously in that interval reaches a predetermined number of natural days.

The recording device is further configured to store a device record file, the device record file including a plurality of device record information, wherein each device record information is associated with a corresponding run-time interval for recording run records of all storage devices within the mass data storage system in the corresponding run-time interval; wherein the operation record of all storage devices is a collection of operation information of all storage devices within the big data storage system. Each running record is a single running record of a storage device in the big data storage system, and each running record or single running record at least comprises an identifier of the storage device, a device starting running time and a device ending running time; each run record or single run record is used to record one run of a single storage device.

The number of storage devices for which, among all running records of all running time intervals in the device record file, the device start running time lies within the current running time interval and the device running duration within the current running time interval is greater than the predetermined number of natural days is taken as the first device number FDN of storage devices that operated effectively in the current running time interval. Likewise, the number of storage devices for which, among all running records of the device record file, the device start running time lies within the previous running time interval and the device running duration within the previous running time interval is greater than the predetermined number of natural days is taken as the second device number SDN of storage devices that operated effectively in the previous running time interval.

The storage devices that operated effectively in the current running time interval are those of all storage devices of the big data storage system that operated continuously for the predetermined number of natural days in the current running time interval. The storage devices that operated effectively in the previous running time interval are those of all storage devices of the big data storage system that operated continuously for the predetermined number of natural days in the previous running time interval. The number of natural days included in the specific running time interval is TD. When TD ≥ 100, the predetermined number of natural days is [formula not reproduced in this text] natural days. Or, when 100 > TD ≥ 50, the predetermined number of natural days is [formula not reproduced in this text] natural days.
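The device count can be sketched as follows. Because the expression for the predetermined number of natural days is not available here, the sketch takes it as a plain parameter `min_days`. The `RunRecord` type and the function name are hypothetical, the running duration is clipped to the end of the interval, and "reaches" is read as greater than or equal to, which is an assumption.

```python
from datetime import datetime, timedelta
from typing import NamedTuple

class RunRecord(NamedTuple):
    device_id: str       # identifier of the storage device
    run_start: datetime  # device start running time
    run_end: datetime    # device end running time

def count_effective_devices(records, interval_start, interval_end, min_days):
    """Count devices with one continuous run of at least min_days in the interval."""
    effective = set()
    for r in records:
        if not (interval_start <= r.run_start < interval_end):
            continue  # only runs that start inside the interval are considered
        clipped_end = min(r.run_end, interval_end)
        if clipped_end - r.run_start >= timedelta(days=min_days):
            effective.add(r.device_id)
    return len(effective)

runs = [
    RunRecord("dev-a", datetime(2018, 10, 1), datetime(2018, 10, 10)),
    RunRecord("dev-b", datetime(2018, 10, 5), datetime(2018, 10, 20)),
    RunRecord("dev-c", datetime(2018, 10, 2), datetime(2018, 10, 4)),
]
fdn = count_effective_devices(runs, datetime(2018, 10, 1),
                              datetime(2018, 10, 11), min_days=8)
```

A set is used so that a device with several qualifying runs is still counted once.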

In step 104, when the first device number FDN is greater than the second device number SDN and the first accessed total number FAN is greater than the second accessed total number SAN, it is determined whether the ratio FDN/SDN is greater than 120%; if so, it is determined whether the ratio FAN/SAN of the first accessed total number FAN to the second accessed total number SAN is greater than the ratio FDN/SDN of the first device number to the second device number; and if so, the storage process for the data file is entered. Or, when the first device number FDN is smaller than the second device number SDN and the first accessed total number FAN is smaller than the second accessed total number SAN, it is determined whether the ratio FAN/SAN of the first accessed total number FAN to the second accessed total number SAN is greater than the ratio FDN/SDN of the first device number to the second device number; if so, it is determined whether the absolute value of the difference between the ratio FAN/SAN and the ratio FDN/SDN is greater than an increase threshold; and if so, the storage process for the data file is entered. When it is determined to enter the storage process for the data file, the big data storage system sends a notification message to each of all the storage devices to instruct each storage device to start performing the storage process for the data file.

The increase threshold is 5%, 10%, 15%, 20%, 25%, 30% or 40%. The method further comprises: entering a data file acquisition process, so that the big data storage system acquires a new data file from an external device, when the first device number FDN is greater than the second device number SDN and the first accessed total number FAN is less than or equal to the second accessed total number SAN; and entering a storage device acquisition process, so that a new storage device is added to the big data storage system, when the first device number FDN is less than or equal to the second device number SDN and the first accessed total number FAN is greater than the second accessed total number SAN.
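The branching of step 104, together with the two acquisition processes, can be sketched as a single decision function. This is an illustrative sketch only: the function name `decide` and the returned labels are hypothetical, and the default increase threshold of 10% is simply one of the listed values.

```python
def decide(fdn, sdn, fan, san, increase_threshold=0.10):
    """Return which process to enter: 'store', 'acquire_files',
    'acquire_devices', or 'none' when no condition is met."""
    if sdn <= 0 or san <= 0:
        raise ValueError("SDN and SAN must be positive to form the ratios")
    device_ratio = fdn / sdn  # FDN/SDN
    access_ratio = fan / san  # FAN/SAN

    if fdn > sdn and fan > san:
        # Devices grew by more than 20% and accesses grew even faster.
        if device_ratio > 1.20 and access_ratio > device_ratio:
            return "store"
        return "none"
    if fdn < sdn and fan < san:
        # Both shrank, but accesses shrank noticeably less than devices.
        if (access_ratio > device_ratio
                and abs(access_ratio - device_ratio) > increase_threshold):
            return "store"
        return "none"
    if fdn > sdn and fan <= san:
        return "acquire_files"    # more devices, accesses did not grow
    if fdn <= sdn and fan > san:
        return "acquire_devices"  # more accesses, devices did not grow
    return "none"
```

For instance, `decide(13, 10, 150, 100)` yields `"store"`, since FDN/SDN is 1.3 and FAN/SAN is 1.5.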

In step 105, the storage device receiving the notification message is taken as the current storage device, the local record file stored in the record storage area of the current storage device is parsed to obtain the accessed records of all data files in the current storage device in the current running time interval, and these accessed records are counted with a predetermined time length as the basic time unit to determine the number of times all data files of the current storage device were accessed in each basic time unit of the current running time interval. That is, each of the plurality of storage devices may be treated as the current storage device for the subsequent steps or processes, e.g., steps 105-108.

Wherein the current storage device is one of all storage devices within the big data storage system, and each of all storage devices is selected as the current storage device for subsequent processes. Wherein each storage device has a record storage area for storing local record files. The local record file includes a plurality of local record information, wherein each local record information is associated with a corresponding runtime interval for recording all local records of the storage device during the corresponding runtime interval. Where all local records are the collection of accessed information for all data files within the storage device. Where each local record is accessed information for a single data file within the storage device. Wherein each local record is for recording a single access of a single data file, each local record comprising at least: an identifier of the data file and an access start time.

And analyzing the local record files stored in the record storage area of the current storage device, and taking a plurality of local records of which the access start time is positioned in the current running time interval in all local records in all running time intervals in the local record files as accessed records of all data files in the current storage device in the current running time interval. The predetermined length of time is 1 minute, 2 minutes, 5 minutes, 10 minutes, 15 minutes, 20 minutes, or 30 minutes. Dividing the current running time interval into a plurality of basic time units according to a preset time length, allocating a plurality of (all) local records in the current running time interval into corresponding basic time units according to the access starting time of each local record, and counting the number of the plurality of local records in each basic time unit so as to determine the number of the plurality of local records in each basic time unit as the number of times that all data files of the current storage device are accessed in each basic time unit in the current running time interval.
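The division into basic time units and the per-unit counting can be sketched as below. This is a minimal sketch, assuming that each basic time unit is identified by its 0-based index from the start of the interval; the function name `count_per_unit` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def count_per_unit(access_times, interval_start, interval_end,
                   unit=timedelta(minutes=10)):
    """Map each basic time unit index to the number of accesses starting in it."""
    counts = Counter()
    for t in access_times:
        if interval_start <= t < interval_end:
            # Floor division of two timedeltas gives the integer unit index.
            counts[(t - interval_start) // unit] += 1
    return counts

times = [
    datetime(2018, 10, 1, 0, 3),
    datetime(2018, 10, 1, 0, 9),
    datetime(2018, 10, 1, 0, 10),
    datetime(2018, 10, 1, 0, 25),
    datetime(2018, 10, 12, 0, 0),  # outside the interval, ignored
]
counts = count_per_unit(times, datetime(2018, 10, 1), datetime(2018, 10, 11))
```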

In step 106, each basic time unit, among the plurality of basic time units, whose number of accesses is greater than an access count threshold is taken as a statistical time unit, so that at least one statistical time unit is obtained; the plurality of data files involved in each statistical time unit are determined, and the metadata or profile data of each of the data files involved in each statistical time unit is obtained. The access count threshold is 20, 30, 40, 50, 60, 80, 100, 200, 300 or 500. The same data file can belong to at least two different statistical time units. Each data file has metadata or profile data describing content information, function information or topic information of the data file.
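The selection of statistical time units and of the data files involved in each can be sketched as follows. This is an illustrative sketch: the function name `statistical_units` is hypothetical, records are assumed to be `(file_id, access_start)` pairs that already lie in the current running time interval, and the threshold of 2 in the example is deliberately far smaller than the listed values so that the data stays short.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def statistical_units(records, interval_start, unit, threshold):
    """Return {unit index: set of file ids} for units whose access count
    is strictly greater than the threshold."""
    counts = defaultdict(int)
    files = defaultdict(set)
    for file_id, access_start in records:
        idx = (access_start - interval_start) // unit
        counts[idx] += 1
        files[idx].add(file_id)
    return {idx: files[idx] for idx, n in counts.items() if n > threshold}

records = [
    ("a", datetime(2018, 10, 1, 0, 1)),
    ("b", datetime(2018, 10, 1, 0, 2)),
    ("a", datetime(2018, 10, 1, 0, 5)),
    ("b", datetime(2018, 10, 1, 0, 15)),
]
units = statistical_units(records, datetime(2018, 10, 1),
                          timedelta(minutes=10), threshold=2)
```

Because the file sets are built per unit, the same file can appear under several statistical time units.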

In step 107, summary information for each statistical time unit is generated based on metadata or profile data of the plurality of data files involved in each statistical time unit, and an associated storage space is created for each statistical time unit and the summary information for each statistical time unit is taken as summary information of the associated storage space.

Generating summary information for each statistical time unit based on the metadata or profile data of the plurality of data files involved in each statistical time unit comprises: clustering the metadata or profile data of each of the data files involved in each statistical time unit, and generating the summary information of each statistical time unit according to the category information obtained by clustering; or selecting a predetermined number of selected data files from the plurality of data files involved in each statistical time unit, and combining the metadata or profile data of each selected data file to generate the summary information of each statistical time unit.

Or, the metadata or profile data of each of the data files involved in each statistical time unit is classified by synonym classification to generate a plurality of classes, the class that includes the largest number of data files is determined as the selected class, and all metadata or profile data belonging to the selected class is combined to generate the summary information of each statistical time unit. After the associated storage space is created for each statistical time unit, all data files involved in each statistical time unit are saved into the associated storage space. Any two associated storage spaces of the plurality of associated storage spaces can have at least one identical data file. Alternatively, after the associated storage space is created for each statistical time unit, no data files are saved into the associated storage space associated with each statistical time unit, and the associated storage space is used only for saving new data files.
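The synonym-based variant can be sketched as below. This is only a sketch under strong simplifying assumptions: each data file's profile is reduced to a single topic word, and synonyms are resolved through a hand-written mapping `synonyms` from a word to a canonical word. The text does not say how synonyms are detected, so both the mapping and the function name are hypothetical.

```python
from collections import defaultdict

def summary_from_largest_class(profiles, synonyms):
    """profiles: file id -> topic word; synonyms: word -> canonical word.
    Returns (selected class, combined profile words of that class)."""
    classes = defaultdict(list)
    for file_id, topic in profiles.items():
        classes[synonyms.get(topic, topic)].append(file_id)
    # The class that contains the most data files is the selected class.
    selected = max(classes, key=lambda c: len(classes[c]))
    return selected, {profiles[fid] for fid in classes[selected]}

profiles = {"a": "car", "b": "automobile", "c": "stock"}
selected, summary = summary_from_largest_class(profiles, {"automobile": "car"})
```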

In step 108, when the current storage device receives a new data file, obtaining metadata or profile data of the new data file, performing content matching between the metadata or profile data of the new data file and the summary information of each associated storage space to determine a content matching degree between the new data file and each associated storage space, determining an associated storage space with the maximum content matching degree with the new data file, and saving the new data file into the associated storage space with the maximum content matching degree.

The content matching comprises semantic matching, keyword matching, text matching, content matching or word meaning matching. The content matching of the metadata or profile data of the new data file with the summary information of each associated storage space comprises: and matching the content information, the function information or the subject information in the metadata or the profile data of the new data file with the summary information of each associated storage space. And when at least two associated storage spaces with the maximum content matching degree with the new data file exist, randomly saving the new data file to one of the at least two associated storage spaces with the maximum content matching degree.
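The placement of a new data file can be sketched as below. The text lists several matching methods without fixing a formula, so this sketch swaps in a simple keyword overlap (Jaccard similarity) as the content matching degree, which is an assumption; the function names and the keyword-set representation of metadata and summary information are hypothetical. Ties for the maximum are broken randomly, as described above.

```python
import random

def match_degree(file_keywords, summary_keywords):
    """Jaccard similarity of two keyword sets, used here as the matching degree."""
    union = file_keywords | summary_keywords
    return len(file_keywords & summary_keywords) / len(union) if union else 0.0

def choose_space(file_keywords, summaries, rng=random):
    """summaries: space id -> keyword set (must be non-empty).
    Returns the id of an associated storage space with maximal matching degree."""
    degrees = {sid: match_degree(file_keywords, kw)
               for sid, kw in summaries.items()}
    best = max(degrees.values())
    candidates = [sid for sid, d in degrees.items() if d == best]
    return rng.choice(candidates)  # random pick when several spaces tie

summaries = {
    "201-1": {"weather", "radar"},
    "201-2": {"finance", "stocks"},
}
target = choose_space({"weather", "forecast"}, summaries)
```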

FIG. 2 is a diagram of a data storage system 200 according to the present invention. In the application, a basic time unit with the access times larger than the access time threshold value in a plurality of basic time units is used as a statistical time unit to obtain at least one statistical time unit. Further, a plurality of data files involved in each statistical time unit is determined, and metadata or profile data of each data file of the plurality of data files involved in each statistical time unit is acquired. The summary information of each statistical time unit is generated based on metadata or profile data of a plurality of data files involved in each statistical time unit, and an associated storage space is created for each statistical time unit and the summary information of each statistical time unit is taken as the summary information of the associated storage space.

As shown in FIG. 2, N statistical time units are determined in storage device 202, and N corresponding associated storage spaces, namely associated storage spaces 201-1, 201-2, 201-3, 201-4, …, 201-N, therefore need to be created. Each associated storage space is used for storing the plurality of data files involved in the associated statistical time unit. It should be appreciated that any two statistical time units or associated storage spaces may or may not have the same data file between them. In addition, each associated storage space has its own summary information or summary data.

Generating summary information for each statistical time unit based on the metadata or profile data of the plurality of data files involved in each statistical time unit comprises: clustering the metadata or profile data of each of the data files involved in each statistical time unit, and generating the summary information of each statistical time unit according to the category information obtained by clustering. Or, a predetermined number (e.g., 5 or 10) of selected data files are selected from the plurality of data files involved in each statistical time unit (either in descending order of the number of times accessed, starting from the most accessed, or randomly), and the metadata or profile data of each selected data file is combined to generate the summary information of each statistical time unit.

Or classifying the metadata or the profile data of each data file in the plurality of data files related in each statistical time unit according to a synonym classification mode to generate a plurality of classifications, determining the classification with the maximum number of included data files as a selected classification, and combining all the metadata or the profile data related to the selected classification to generate summary information of each statistical time unit. Or, the profile data of the data file with the most number of times of access in the affiliated statistical time unit in the plurality of data files involved in each statistical time unit is used as the summary information of each statistical time unit.
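The variant that selects a predetermined number of the most-accessed data files and combines their metadata can be sketched as follows. This is an illustrative sketch, assuming that metadata is represented as a set of keywords per file and that "combining" means taking the union; the function name `summary_from_top_files` is hypothetical.

```python
from collections import Counter

def summary_from_top_files(accessed_file_ids, metadata, top_n=5):
    """accessed_file_ids: one entry per access inside the statistical time unit.
    metadata: file id -> set of keywords. Returns the combined keyword set
    of the top_n most-accessed files."""
    counts = Counter(accessed_file_ids)
    selected = [fid for fid, _ in counts.most_common(top_n)]
    summary = set()
    for fid in selected:
        summary |= metadata[fid]
    return summary

accesses = ["a", "b", "a", "c", "a", "b"]
metadata = {
    "a": {"weather", "2018"},
    "b": {"weather", "radar"},
    "c": {"finance"},
}
summary = summary_from_top_files(accesses, metadata, top_n=2)
```

With `top_n=1` the same function reduces to the last alternative above, where the profile of the single most-accessed data file serves as the summary information.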

After the associated storage space is created for each statistical time unit, all data files involved in each statistical time unit are saved into the associated storage space. Any two associated storage spaces of the plurality of associated storage spaces can have at least one identical data file. Alternatively, after the associated storage space is created for each statistical time unit, no data files are saved into the associated storage space associated with each statistical time unit, and the associated storage space is used only for saving new data files.

Subsequently, when the storage device 202 receives a new data file, the storage manager 203 acquires metadata or profile data of the new data file, performs content matching on the metadata or profile data of the new data file and the summary information of each associated storage space to determine a content matching degree of the new data file and each associated storage space, determines an associated storage space having a maximum content matching degree with the new data file, and saves the new data file into the associated storage space having the maximum content matching degree.

The content matching comprises semantic matching, keyword matching, text matching, content matching or word sense matching. Content matching metadata or profile data of the new data file with summary information of each associated storage space comprises: and matching the content information, the function information or the subject information in the metadata or the profile data of the new data file with the summary information of each associated storage space.

FIG. 3 is a block diagram of a system 300 for storing data files in a big data storage system according to the present invention. The system 300 includes: a parsing device 301, a determining device 302, an identifying device 303 and a saving device 304.

When the current operation time interval of the big data storage system is ended, the parsing device 301 parses an access record file stored in a recording device of the big data storage system to determine first access record information of the big data storage system in the current operation time interval and second access record information of the big data storage system in a previous operation time interval, wherein the previous operation time interval and the current operation time interval both include the same number of natural days and are two adjacent operation time intervals in terms of time; and analyzing the first access record information to determine a first accessed total FAN of all the data files of the big data storage system in the current operation time interval, and analyzing the second access record information to determine a second accessed total SAN of all the data files of the big data storage system in the previous operation time interval.

The parsing device 301 parses the device record file stored in the recording device of the big data storage system to determine a first device number FDN of storage devices that operated effectively in the current running time interval and a second device number SDN of storage devices that operated effectively in the previous running time interval. Here, effective operation means that a storage device in the big data storage system is determined to have operated effectively in a specific running time interval when the time for which it operated continuously in that interval reaches a predetermined number of natural days.

When the first device number FDN is greater than the second device number SDN and the first accessed total number FAN is greater than the second accessed total number SAN, the determining device 302 determines whether the ratio FDN/SDN is greater than 120%; if so, it determines whether the ratio FAN/SAN of the first accessed total number FAN to the second accessed total number SAN is greater than the ratio FDN/SDN of the first device number to the second device number; and if so, the storage processing of the data file is entered;

or, when the first device number FDN is smaller than the second device number SDN and the first accessed total number FAN is smaller than the second accessed total number SAN, it determines whether the ratio FAN/SAN of the first accessed total number FAN to the second accessed total number SAN is greater than the ratio FDN/SDN of the first device number to the second device number; if so, it determines whether the absolute value of the difference between the ratio FAN/SAN and the ratio FDN/SDN is greater than an increase threshold; and if so, the storage processing of the data file is entered;

when it is determined to enter the storage processing of the data file, the big data storage system sends a notification message to each of all the storage devices to instruct each storage device to start the storage processing of the data file, and the identifying device 303 then performs the following:

taking the storage device receiving the notification message as a current storage device, analyzing the local record files stored in the record storage area of the current storage device to obtain accessed records of all data files in the current storage device in the current running time interval, and counting the accessed records of all data files in the current storage device in the current running time interval by taking a preset time length as a basic time unit to determine the accessed times of each basic time unit of all data files of the current storage device in the current running time interval.

The method comprises the steps of taking a basic time unit with the access times larger than a threshold value of the access times in a plurality of basic time units as a statistical time unit to obtain at least one statistical time unit, determining a plurality of data files related in each statistical time unit, and obtaining metadata or profile data of each data file of the plurality of data files related in each statistical time unit.

The summary information of each statistical time unit is generated based on metadata or profile data of a plurality of data files involved in each statistical time unit, and an associated storage space is created for each statistical time unit and the summary information of each statistical time unit is taken as the summary information of the associated storage space.

When the current storage device receives a new data file, the saving device 304 acquires metadata or profile data of the new data file, performs content matching on the metadata or profile data of the new data file and the summary information of each associated storage space to determine a content matching degree of the new data file and each associated storage space, determines an associated storage space with the maximum content matching degree of the new data file, and saves the new data file into the associated storage space with the maximum content matching degree.

Setting a plurality of running time intervals for the running of the big data storage system when the big data storage system initially runs, wherein each running time interval comprises the same number of natural days, and determining the running time interval in which the current time is positioned as the current running time interval; wherein each runtime interval includes 50 natural days, 80 natural days, 100 natural days, or 120 natural days; alternatively, each run time interval comprises at least 100 natural days; the previous operating time interval and the current operating time interval comprise at least 100 natural days; the time intervals adjacent to the current operation time interval are a previous operation time interval and a next operation time interval;

the recording device is used for storing an access record file, the access record file comprises a plurality of access record information, and each access record information is associated with a corresponding running time interval and is used for recording all access records of the big data storage system in the corresponding running time interval; wherein all access records are a collection of accessed information for all data files within the big data storage system; wherein each access record is accessed information for a single data file within the big data storage system; each access record is used to record one access of a single data file within the big data storage system;

analyzing the access record file stored in the recording device of the big data storage system when the current running time interval of the big data storage system has ended comprises: analyzing the access record file stored in the recording device of the big data storage system when the current running time interval has ended and the next running time interval has not yet been entered; wherein any two adjacent running time intervals in the plurality of running time intervals have a transition time period therebetween; the transition time period occupies a period of time at the beginning of the later one of any two adjacent running time intervals, or the transition time period occupies a period of time at the end of the earlier one of any two adjacent running time intervals.

Wherein each access record comprises at least: an identifier and an access start time of the data file; taking the number of access records with access start times within the current running time interval in all access records of the first access record information as a first accessed total number FAN of all data files of the large data storage system within the current running time interval; and taking the number of the access records with the access start time in the previous operation time interval in all the access records of the second access record information as a second accessed total SAN of all the data files of the large data storage system in the previous operation time interval. The recording device is further configured to store a device record file, the device record file including a plurality of device record information, wherein each device record information is associated with a corresponding run-time interval for recording run records of all storage devices within the mass data storage system in the corresponding run-time interval; wherein the operation records of all the storage devices are a set of operation information of all the storage devices in the big data storage system;

each running record is a single running record of one storage device in the big data storage system, and each running record at least comprises an identifier of the storage device, a device start running time and a device end running time; each running record is used for recording one run of a single storage device; the number of storage devices for which, among all running records of all running time intervals in the device record file, the device start running time lies within the current running time interval and the device running duration within the current running time interval is greater than the predetermined number of natural days is taken as the first device number FDN of storage devices that operated effectively in the current running time interval; the number of storage devices for which, among all running records of the device record file, the device start running time lies within the previous running time interval and the device running duration within the previous running time interval is greater than the predetermined number of natural days is taken as the second device number SDN of storage devices that operated effectively in the previous running time interval;

wherein the storage devices that operated effectively in the current running time interval are those of all storage devices of the big data storage system that operated continuously for the predetermined number of natural days in the current running time interval; wherein the storage devices that operated effectively in the previous running time interval are those of all storage devices of the big data storage system that operated continuously for the predetermined number of natural days in the previous running time interval; wherein the number of natural days included in the specific running time interval is TD; when TD ≥ 100, the predetermined number of natural days is [formula not reproduced in this text] natural days; or, when 100 > TD ≥ 50, the predetermined number of natural days is [formula not reproduced in this text] natural days. The increase threshold is 5%, 10%, 15%, 20%, 25%, 30% or 40%.

Further comprising: entering a data file acquisition process to cause the big data storage system to acquire a new data file from an external device when the first device number FDN is greater than the second device number SDN and the first accessed total number FAN is less than or equal to the second accessed total number SAN; entering a storage device acquisition process to cause addition of a new storage device within the big data storage system when the first number of devices FDN is less than or equal to the second number of devices SDN and the first total number of accesses FAN is greater than the second total number of accesses SAN.

Wherein the current storage device is one of all storage devices within the big data storage system; wherein each storage device has a record storage area for storing a local record file; the local record file comprises a plurality of local record information, wherein each local record information is associated with a corresponding runtime interval and is used for recording all local records of the storage device in the corresponding runtime interval; wherein all local records are a collection of accessed information for all data files within the storage device; wherein each local record is accessed information for a single data file within the storage device; wherein each local record is for recording a single access of a single data file, each local record comprising at least: an identifier and an access start time of the data file; analyzing the local record files stored in the record storage area of the current storage device, and taking a plurality of local records of which the access start time is positioned in the current running time interval in all local records in all running time intervals in the local record files as accessed records of all data files in the current storage device in the current running time interval;

the predetermined length of time is 1 minute, 2 minutes, 5 minutes, 10 minutes, 15 minutes, 20 minutes, or 30 minutes; dividing the current running time interval into a plurality of basic time units according to a preset time length, allocating a plurality of (all) local records in the current running time interval into corresponding basic time units according to the access starting time of each local record, and counting the number of the plurality of local records in each basic time unit so as to determine the number of the plurality of local records in each basic time unit as the number of times that all data files of the current storage device are accessed in each basic time unit in the current running time interval; the threshold number of visits is 20, 30, 40, 50, 60, 80, 100, 200, 300, or 500; wherein the same data file can belong to at least two different statistical time units; each data file has metadata or profile data describing content information, function information or topic information of the data file.

Wherein generating summary information for each statistical time unit based on metadata or profile data for a plurality of data files involved in each statistical time unit comprises: clustering metadata or profile data of each data file in the plurality of data files involved in each statistical time unit, and generating summary information of each statistical time unit according to category information obtained by clustering; or selecting a predetermined number of selected data files from the plurality of data files involved in each statistical time unit, and combining the metadata or profile data of each selected data file to generate summary information of each statistical time unit; or classifying metadata or profile data of each data file in the plurality of data files involved in each statistical time unit according to a synonym-based (similar-meaning word) classification scheme to generate a plurality of classifications, determining the classification that includes the largest number of data files as a selected classification, and combining all metadata or profile data related to the selected classification to generate summary information of each statistical time unit; after the associated storage space is created for each statistical time unit, all data files involved in each statistical time unit are saved in the associated storage space; wherein any two associated storage spaces of the plurality of associated storage spaces can have at least one identical data file; alternatively, after creating the associated storage space for each statistical time unit, no data files are saved into the associated storage space associated with each statistical time unit, and the associated storage space is used only for saving new data files.
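The variant that selects the classification containing the most data files can be sketched as follows. This is only an illustration under stated assumptions: the `SYNONYMS` table, the `classify` helper, and the `" | "` separator are hypothetical, and the text does not specify how the similar-meaning word classification is actually performed.

```python
from collections import defaultdict

# Hypothetical similar-meaning word table: maps a word to its classification.
SYNONYMS = {
    "invoice": "billing", "receipt": "billing", "billing": "billing",
    "forecast": "weather", "weather": "weather",
}


def classify(metadata):
    """Return the classification of the first known word, else 'other'."""
    for word in metadata.lower().split():
        if word in SYNONYMS:
            return SYNONYMS[word]
    return "other"


def summarize_unit(file_metadata):
    """Build summary information for one statistical time unit.

    file_metadata maps a data file identifier to its metadata text and is
    assumed to be non-empty. Ties between classifications are resolved in
    favor of the one encountered first (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for metadata in file_metadata.values():
        groups[classify(metadata)].append(metadata)
    selected = max(groups, key=lambda c: len(groups[c]))
    # Combine all metadata belonging to the selected classification.
    return selected, " | ".join(groups[selected])


if __name__ == "__main__":
    files = {"f1": "monthly invoice", "f2": "storm forecast",
             "f3": "payment receipt"}
    print(summarize_unit(files))
```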
The content matching comprises semantic matching, keyword matching, text matching, content matching or word meaning matching; the content matching of the metadata or profile data of the new data file with the summary information of each associated storage space comprises: performing content matching on content information, function information or subject information in metadata or profile data of the new data file and summary information of each associated storage space; and when at least two associated storage spaces with the maximum content matching degree with the new data file exist, randomly saving the new data file to one of the at least two associated storage spaces with the maximum content matching degree.
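The matching and random tie-breaking described above can be sketched as follows. Keyword overlap (Jaccard similarity) is used here purely as a stand-in for the content matching; the function names and the scoring rule are assumptions of this example, not the method defined by the invention.

```python
import random


def keyword_match(metadata, summary):
    """Assumed matching degree: Jaccard overlap of lowercase word sets."""
    a, b = set(metadata.lower().split()), set(summary.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def choose_storage_space(metadata, summaries, rng=random):
    """Pick the associated storage space with the maximum matching degree.

    summaries maps a (hypothetical) storage space identifier to its summary
    information. When several spaces share the maximum degree, one of them
    is chosen at random, as the text describes.
    """
    if not summaries:
        raise ValueError("no associated storage spaces to match against")
    scores = {sid: keyword_match(metadata, s) for sid, s in summaries.items()}
    best = max(scores.values())
    candidates = [sid for sid, score in scores.items() if score == best]
    return rng.choice(candidates)


if __name__ == "__main__":
    spaces = {"a": "weather sensor logs", "b": "billing invoices",
              "c": "weather sensor logs"}
    print(choose_storage_space("weather logs", spaces))
```

Passing a seeded `random.Random` instance as `rng` makes the tie-break reproducible, which can be convenient when testing.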

Claims

1. A method of storing data files in a big data storage system, the method comprising:

when the current operation time interval of the big data storage system is finished, analyzing an access record file stored in a recording device of the big data storage system to determine first access record information of the big data storage system in the current operation time interval and second access record information of the big data storage system in a previous operation time interval, wherein the previous operation time interval and the current operation time interval comprise the same number of natural days and are two adjacent operation time intervals in time;

when the first device number FDN is larger than the second device number SDN and the first accessed total FAN is larger than the second accessed total SAN, determining whether the ratio FDN/SDN is larger than 120%, if yes, determining whether the ratio FAN/SAN of the first accessed total FAN to the second accessed total SAN is larger than the ratio FDN/SDN of the first device number to the second device number, and if yes, entering a storage process of the data file;

or when the first device number FDN is smaller than the second device number SDN and the first accessed total FAN is smaller than the second accessed total SAN, determining whether the ratio FAN/SAN of the first accessed total FAN to the second accessed total SAN is larger than the ratio FDN/SDN of the first device number to the second device number, if yes, determining whether the absolute value of the difference between the ratio FAN/SAN and the ratio FDN/SDN is larger than an increase threshold, and if yes, entering a storage process of the data file;

when determining to enter the storage process of the data file, the big data storage system sends a notification message to each storage device in all the storage devices to instruct each storage device to start the storage process of the data file:

taking the storage device receiving the notification message as a current storage device, analyzing local record files stored in a record storage area of the current storage device to obtain accessed records of all data files in the current storage device in a current running time interval, and counting the accessed records of all data files in the current storage device in the current running time interval by taking a preset time length as a basic time unit to determine the accessed times of all data files of the current storage device in each basic time unit in the current running time interval;
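The two trigger conditions recited in claim 1 can be sketched as a single predicate. This is a minimal sketch, assuming that the ratios are undefined (and the process is not entered) when SDN or SAN is zero; the function name and the 10% default for the increase threshold are choices made for the example.

```python
def should_enter_storage(fdn, sdn, fan, san, increase_threshold=0.10):
    """Decide whether to enter the storage process of the data file.

    fdn, sdn: device numbers for the current and previous intervals.
    fan, san: accessed totals for the current and previous intervals.
    """
    if sdn <= 0 or san <= 0:
        return False  # assumption: ratios are undefined, so do not trigger
    device_ratio = fdn / sdn
    access_ratio = fan / san
    if fdn > sdn and fan > san:
        # Growth case: devices grew by more than 20% and accesses grew faster.
        return device_ratio > 1.20 and access_ratio > device_ratio
    if fdn < sdn and fan < san:
        # Shrink case: accesses fell less than devices, by a clear margin.
        return (access_ratio > device_ratio
                and abs(access_ratio - device_ratio) > increase_threshold)
    return False


if __name__ == "__main__":
    print(should_enter_storage(13, 10, 150, 100))  # growth case
    print(should_enter_storage(8, 10, 95, 100))    # shrink case
```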

2. The method of claim 1, further comprising setting a plurality of runtime intervals for operation of the big data storage system at initial runtime of the big data storage system, wherein each runtime interval includes a same number of natural days, and determining a runtime interval in which the current time is located as the current runtime interval;

3. The method of claim 2, wherein each access record comprises at least: an identifier and an access start time of the data file;

and taking the number of the access records with the access start time in the previous operation time interval in all the access records of the second access record information as a second accessed total SAN of all the data files of the big data storage system in the previous operation time interval.
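Counting access records by access start time, as recited in claim 3, can be sketched as follows. The `AccessRecord` type, the half-open interval convention, and the 7-day interval in the usage lines are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class AccessRecord(NamedTuple):
    file_id: str            # identifier of the data file
    access_start: datetime  # access start time


def total_accesses(records, interval_start, interval_end):
    """Number of records whose access start time lies in the interval.

    Assumption: the interval is half-open, [interval_start, interval_end),
    so adjacent intervals never count the same record twice.
    """
    return sum(1 for r in records
               if interval_start <= r.access_start < interval_end)


if __name__ == "__main__":
    length = timedelta(days=7)  # assumed interval length in natural days
    current_start = datetime(2018, 12, 19)
    previous_start = current_start - length
    records = [
        AccessRecord("f1", datetime(2018, 12, 13, 9)),
        AccessRecord("f1", datetime(2018, 12, 20, 9)),
        AccessRecord("f2", datetime(2018, 12, 21, 9)),
    ]
    fan = total_accesses(records, current_start, current_start + length)
    san = total_accesses(records, previous_start, current_start)
    print(fan, san)
```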

4. The method of claim 3, wherein the recording device is further used to store a device record file comprising a plurality of device record information, wherein each device record information is associated with a respective runtime interval and is used for recording the operation records of all storage devices within the big data storage system in the respective runtime interval; the operation records of all the storage devices are a set of operation information of all the storage devices in the big data storage system;

each running record is a single running record of a storage device in the big data storage system, and each running record or single running record at least comprises an identifier of the storage device, a device starting running time and a device ending running time; each running record or single running record is used for recording one-time running of a single storage device;

wherein the storage devices that were actively operating in the previous runtime interval are those storage devices, among all storage devices of the big data storage system, that were continuously operating for a predetermined number of natural days in the previous runtime interval.
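Deriving a device number such as SDN from the running records of claim 4 can be sketched as follows. Several points are assumptions of this sketch: a natural day is approximated as a 24-hour span, a single run must cover the required duration by itself ("continuously operating"), the comparison is "at least", and the `RunRecord` type and the 3-day default are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class RunRecord(NamedTuple):
    device_id: str   # identifier of the storage device
    start: datetime  # device start running time
    end: datetime    # device end running time


def effective_device_count(run_records, interval_start, interval_end,
                           min_days=3):
    """Count devices with one continuous run of at least min_days days
    inside the given runtime interval."""
    required = timedelta(days=min_days)
    effective = set()
    for r in run_records:
        # Clip the run to the interval; a negative span never qualifies.
        overlap = min(r.end, interval_end) - max(r.start, interval_start)
        if overlap >= required:
            effective.add(r.device_id)
    return len(effective)


if __name__ == "__main__":
    start = datetime(2018, 12, 12)
    end = start + timedelta(days=7)
    runs = [RunRecord("A", start, start + timedelta(days=5)),
            RunRecord("B", start, start + timedelta(days=2))]
    print(effective_device_count(runs, start, end))
```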

5. The method of claim 4, the increase threshold being 5%, 10%, 15%, 20%, 25%, 30%, or 40%, further comprising:

entering a data file acquisition process to cause the big data storage system to acquire a new data file from an external device when the first device number FDN is greater than the second device number SDN and the first accessed total number FAN is less than or equal to the second accessed total number SAN;

6. A system for storing data files in a big data storage system, the system comprising:

the analysis device analyzes the access record file stored in the recording equipment of the big data storage system when the current operation time interval of the big data storage system is finished so as to determine first access record information of the big data storage system in the current operation time interval and second access record information of the big data storage system in the previous operation time interval, wherein the previous operation time interval and the current operation time interval both comprise the same number of natural days and are two adjacent operation time intervals in time; analyzing the first access record information to determine a first accessed total FAN of all data files of the big data storage system in a current running time interval, and analyzing the second access record information to determine a second accessed total SAN of all data files of the big data storage system in a previous running time interval;

the device comprises a judging device and a data file storing device, wherein when the first device number FDN is larger than the second device number SDN and the first accessed total FAN is larger than the second accessed total SAN, the judging device determines whether the ratio FDN/SDN is larger than 120%, if yes, the judging device determines whether the ratio FAN/SAN of the first accessed total FAN to the second accessed total SAN is larger than the ratio FDN/SDN of the first device number to the second device number, and if yes, the judging device enters the storage process of the data file;

7. The system of claim 6, further comprising setting a plurality of runtime intervals for operation of the big data storage system at initial runtime of the big data storage system, wherein each runtime interval includes a same number of natural days, and determining a runtime interval in which the current time is located as the current runtime interval;

8. The system of claim 7, wherein each access record comprises at least: an identifier and an access start time of the data file;

taking the number of access records with access start times located in the current operation time interval in all access records of the first access record information as a first accessed total FAN of all data files of the big data storage system in the current operation time interval;

9. The system of claim 8, wherein the recording device is further used to store a device record file, the device record file comprising a plurality of device record information, wherein each device record information is associated with a respective runtime interval and is used for recording the operation records of all storage devices within the big data storage system in the respective runtime interval; wherein the operation records of all the storage devices are a set of operation information of all the storage devices in the big data storage system;

taking, from all running records of the device record file, the number of storage devices whose device start running time lies in the previous running time interval and whose running duration in the previous running time interval is greater than a predetermined number of natural days as a second device number SDN of the storage devices which effectively run in the previous running time interval;

wherein the storage devices that were actively operating in the previous run-time interval are storage devices of all storage devices of the big data storage system that were continuously operating for a predetermined number of natural days in the previous run-time interval.

10. The system of claim 9, wherein the increase threshold is 15%.