CN111797069B - Hadoop platform-based mass storage and pickup record processing method and device - Google Patents

Hadoop platform-based mass storage and pickup record processing method and device

Info

Publication number
CN111797069B
Authority
CN
China
Prior art keywords
data
target
access
query
hdfs
Prior art date
Legal status
Active
Application number
CN202010949054.8A
Other languages
Chinese (zh)
Other versions
CN111797069A (en)
Inventor
王彪
杨博
张友平
邵杰
Current Assignee
Nanjing cloud cabinet Network Technology Co.,Ltd.
Original Assignee
Jiangsu Cloudbox Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Jiangsu Cloudbox Network Technology Co ltd
Priority to CN202010949054.8A
Publication of CN111797069A
Application granted
Publication of CN111797069B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a method and a device for processing mass storage and pickup records (access piece records) based on the Hadoop platform. First, access piece data is acquired and, while it is being stored, the occupancy rate of the HDFS in a preset Hadoop cluster is checked in real time against a set occupancy rate; if the set occupancy rate is reached, the current data volume corresponding to the access piece data is calculated. Second, when the current data volume is judged to be larger than a set data volume, the target access piece data stored in each first target HDFS is identified to obtain a query record corresponding to the target access piece data. Finally, redundant data is removed from at least part of the target access piece data stored in each first target HDFS according to the query records, and the key data corresponding to that part is retained, so as to release the storage space of each first target HDFS. Redundant data in the Hadoop cluster is thereby removed and storage space released, so that the access piece data can be stored dynamically and its storage efficiency improved.

Description

Hadoop platform-based mass storage and pickup record processing method and device
Technical Field
The invention relates to the technical field of data processing of intelligent express cabinets, in particular to a method and a device for processing mass storage and pickup records based on a Hadoop platform.
Background
With the rapid and continuing growth of the express delivery business, the intelligent express cabinet has emerged as a new facility. Its main function is to assist couriers in receiving and sending parcels, and its users are the large population of couriers and recipients. As a 24-hour unattended cabinet for temporary parcel storage, the intelligent express cabinet interacts with a server through an Internet of Things card. As an application of Internet of Things technology in the modern logistics industry, it is the most effective solution to the last-kilometer problem of e-commerce terminal logistics: it improves couriers' delivery efficiency, reduces delivery cost, and protects users' privacy and safety, and it is gradually being widely recognized and adopted.
The intelligent express cabinet system serves as the server side of the express terminals and provides services such as courier deposit and user pickup. It generates an extremely large volume of data every day, and historical storage and pickup information must be stored and queried quickly and promptly to support later data analysis and queries.
However, existing techniques for storing the storage and pickup data of intelligent express cabinets are inefficient.
Disclosure of Invention
To solve the above problem, the invention provides a method and a device for processing mass storage and pickup records based on the Hadoop platform.
According to a first aspect of the embodiments of the present invention, a method for processing mass storage and pickup records based on a Hadoop platform is provided. The method is applied to a computer device that is communicatively connected to a data source end and to a client, and includes:
acquiring access piece data uploaded by the data source end through a set transmission protocol; the data source end comprises a relational database and an application program, and the access piece data comprises first access piece data uploaded by the relational database and second access piece data uploaded by the application program;
when the access piece data is stored, detecting in real time whether the occupancy rate of the HDFS in a preset Hadoop cluster reaches a set occupancy rate;
if the occupancy rate reaches the set occupancy rate, calculating the current data volume corresponding to the access piece data; the current data volume is used for representing the size of a storage space required for storing the access piece data into the HDFS;
judging whether the current data volume is larger than a set data volume;
when the current data volume is larger than the set data volume, identifying the target access piece data stored in each first target HDFS to obtain a query record corresponding to the target access piece data; the first target HDFS is an HDFS whose storage space is in a saturated state;
and performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, and retaining the key data corresponding to the at least part of the target access piece data, to release the storage space of each first target HDFS.
Optionally, the method further comprises:
after the storage space of each first target HDFS is released, calculating target data volume corresponding to all released storage spaces;
judging whether the target data volume reaches the current data volume;
and if the target data volume does not reach the current data volume, expanding the Hadoop cluster to increase at least one second target HDFS.
Optionally, the method further comprises:
determining a cluster response curve of the Hadoop cluster when the second target HDFS is added;
extracting real-time curve characteristics of the cluster response curve, and judging whether the Hadoop cluster is in an overload state according to the real-time curve characteristics;
and when the Hadoop cluster is in the overload state, returning to the step of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records.
Optionally, the method further comprises:
before returning to the step of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, modifying preset configuration parameters to adjust the identification model used for redundant data elimination.
Optionally, identifying the target access piece data stored in each first target HDFS to obtain a query record corresponding to the target access piece data includes:
acquiring the access piece path distribution and each access piece order of the target access piece data;
when the access piece path distribution indicates that the target access piece data contains a query call label, determining, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, an order correlation coefficient between each access piece order under a sample call label and each access piece order under the query call label, and assigning to the query call label every access piece order under the sample call label whose order correlation coefficient with an access piece order under the query call label is higher than a set coefficient;
when the sample call label corresponding to the target access piece data contains a plurality of access piece orders, determining, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, the order correlation coefficients among the access piece orders under the sample call label, and screening the access piece orders under the sample call label according to those coefficients;
setting a query update priority for the target access piece orders retained by the screening, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, and assigning at least part of the target access piece orders to the query call label according to the query update priority;
and generating the query record corresponding to the target access piece data according to the access piece orders under the query call label.
Optionally, performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and retaining the key data corresponding to the at least part of the target access piece data, includes:
analyzing the data field set of the target access piece data in each first target HDFS according to the query time-interval distribution information of the query records and the order user behavior data of the access piece orders, to obtain the feature pointing data in the field features included in that data field set;
extracting, by skipping the first feature pointing data that has an access-behavior-following property relative to the weight-adjustable access behavior data in the feature pointing data, the second feature pointing data of the field codes in the data field set of the target access piece data in each first target HDFS, the second feature pointing data being data that is not updated when the first feature pointing data changes;
and determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to a set feature identifier; determining the first data fields in the data field set that correspond to the to-be-processed field features as redundant data and removing them; and then determining the second data fields retained in the data field set as the key data.
Optionally, determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to the set feature identifier includes:
calculating the cosine distance between the identification character string of the current feature identifier of each target field feature corresponding to the second feature pointing data and the identification character string of the set feature identifier;
and if the cosine distance is within a set interval, determining that the current feature identifier of the target field feature is a target feature identifier and that the target field feature is a to-be-processed field feature.
According to a second aspect of the embodiments of the present invention, a device for processing mass storage and pickup records based on a Hadoop platform is provided. The device is applied to a computer device that is communicatively connected to a data source end and to a client, and includes:
the data acquisition module is used for acquiring the access piece data uploaded by the data source end through a set transmission protocol; the data source end comprises a relational database and an application program, and the access piece data comprises first access piece data uploaded by the relational database and second access piece data uploaded by the application program;
the real-time detection module is used for detecting in real time, when the access piece data is stored, whether the occupancy rate of the HDFS in the preset Hadoop cluster reaches the set occupancy rate;
if the occupancy rate reaches the set occupancy rate, calculating the current data volume corresponding to the access piece data; the current data volume is used for representing the size of a storage space required for storing the access piece data into the HDFS;
the data judging module is used for judging whether the current data volume is larger than the set data volume;
when the current data volume is larger than the set data volume, identifying the target access piece data stored in each first target HDFS to obtain a query record corresponding to the target access piece data; the first target HDFS is an HDFS whose storage space is in a saturated state;
and the space release module is used for performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, and retaining the key data corresponding to the at least part of the target access piece data, to release the storage space of each first target HDFS.
Optionally, the data judging module is specifically configured for:
acquiring the access piece path distribution and each access piece order of the target access piece data;
when the access piece path distribution indicates that the target access piece data contains a query call label, determining, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, an order correlation coefficient between each access piece order under a sample call label and each access piece order under the query call label, and assigning to the query call label every access piece order under the sample call label whose order correlation coefficient with an access piece order under the query call label is higher than a set coefficient;
when the sample call label corresponding to the target access piece data contains a plurality of access piece orders, determining, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, the order correlation coefficients among the access piece orders under the sample call label, and screening the access piece orders under the sample call label according to those coefficients;
setting a query update priority for the target access piece orders retained by the screening, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, and assigning at least part of the target access piece orders to the query call label according to the query update priority;
and generating the query record corresponding to the target access piece data according to the access piece orders under the query call label.
Optionally, the space release module is specifically configured for:
analyzing the data field set of the target access piece data in each first target HDFS according to the query time-interval distribution information of the query records and the order user behavior data of the access piece orders, to obtain the feature pointing data in the field features included in that data field set;
extracting, by skipping the first feature pointing data that has an access-behavior-following property relative to the weight-adjustable access behavior data in the feature pointing data, the second feature pointing data of the field codes in the data field set of the target access piece data in each first target HDFS, the second feature pointing data being data that is not updated when the first feature pointing data changes;
and determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to a set feature identifier; determining the first data fields in the data field set that correspond to the to-be-processed field features as redundant data and removing them; and then determining the second data fields retained in the data field set as the key data.
In the method and device for processing mass storage and pickup records based on the Hadoop platform provided by the embodiments of the invention, access piece data is first acquired and, while it is being stored, the occupancy rate of the HDFS in a preset Hadoop cluster is checked in real time against a set occupancy rate; if the set occupancy rate is reached, the current data volume corresponding to the access piece data is calculated. Second, when the current data volume is judged to be larger than the set data volume, the target access piece data stored in each first target HDFS is identified to obtain a query record corresponding to the target access piece data. Finally, redundant data is removed from at least part of the target access piece data stored in each first target HDFS according to the query records, and the key data corresponding to that part is retained, so as to release the storage space of each first target HDFS. Redundant data in the Hadoop cluster is thereby removed and storage space released, so that the access piece data can be stored dynamically and its storage efficiency improved.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a flowchart of a method for processing mass storage and pickup records based on a Hadoop platform according to an embodiment of the present invention.
Fig. 2 is a functional block diagram of a device for processing mass storage and pickup records based on a Hadoop platform according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
To aid understanding, the technical solutions of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments. The specific features of the embodiments and examples are detailed illustrations of the technical solutions of the present invention and do not limit them, and the technical features of the embodiments and examples may be combined with one another where they do not conflict.
Referring first to Fig. 1, a flowchart of a method for processing mass storage and pickup records based on a Hadoop platform is shown. The method may be applied to a computer device that is communicatively connected to a data source end and to a client, and may include the following steps S110 to S160.
Step S110, obtaining the access piece data uploaded by the data source end through the set transmission protocol.
In this embodiment, the set transmission protocol mentioned in step S110 may be the Datagram Transport Layer Security (DTLS) protocol. The data source end includes a relational database and an application program, and the access piece data includes first access piece data uploaded by the relational database and second access piece data uploaded by the application program.
Step S120, detecting in real time, when the access piece data is stored, whether the occupancy rate of the HDFS in the preset Hadoop cluster reaches the set occupancy rate.
In this embodiment, the Hadoop cluster described in step S120 includes a plurality of Hadoop Distributed File Systems (HDFS), and the set occupancy rate may be modified according to the actual situation, which is not limited herein.
Step S130, if the occupancy rate reaches the set occupancy rate, calculating the current data volume corresponding to the access piece data.
In this embodiment, the current data volume is used to characterize the size of the storage space required for storing the access piece data into the HDFS.
Step S140, judging whether the current data volume is larger than a set data volume.
Step S150, when the current data volume is larger than the set data volume, identifying the target access piece data stored in each first target HDFS to obtain the query record corresponding to the target access piece data.
In this embodiment, the first target HDFS is an HDFS whose storage space is in a saturated state.
Step S160, performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, and retaining the key data corresponding to the at least part of the target access piece data, to release the storage space of each first target HDFS.
It can be understood that, based on steps S110 to S160 above, access piece data is first acquired and, while it is being stored, the occupancy rate of the HDFS in the preset Hadoop cluster is checked in real time against the set occupancy rate; if the set occupancy rate is reached, the current data volume corresponding to the access piece data is calculated. Second, when the current data volume is judged to be larger than the set data volume, the target access piece data stored in each first target HDFS is identified to obtain a query record corresponding to the target access piece data. Finally, redundant data is removed from at least part of the target access piece data stored in each first target HDFS according to the query records, and the key data corresponding to that part is retained, so as to release the storage space of each first target HDFS. Redundant data in the Hadoop cluster is thereby removed and storage space released, so that the access piece data can be stored dynamically and its storage efficiency improved.
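As an illustration only, the control flow of steps S120 to S160 can be sketched as follows. The sketch is not the disclosed implementation: the class HdfsInstance, the function names, the saturation threshold and the cull callback (which stands in for the query-record-based redundant data elimination of step S160) are all assumptions introduced here, and the cluster occupancy rate is assumed to be total used space divided by total capacity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HdfsInstance:
    """Hypothetical stand-in for one HDFS in the Hadoop cluster."""
    name: str
    capacity_bytes: int
    used_bytes: int

    @property
    def occupancy(self) -> float:
        return self.used_bytes / self.capacity_bytes


def cluster_occupancy(instances: List[HdfsInstance]) -> float:
    # Assumption: occupancy rate = total used space / total capacity.
    total = sum(i.capacity_bytes for i in instances)
    used = sum(i.used_bytes for i in instances)
    return used / total


def handle_incoming(
    instances: List[HdfsInstance],
    incoming_bytes: int,
    set_occupancy: float,
    set_data_volume: int,
    saturation: float,
    cull: Callable[[HdfsInstance], int],
) -> Dict[str, int]:
    """Return the bytes freed per saturated HDFS (empty if no culling ran)."""
    # S120: has the cluster reached the set occupancy rate?
    if cluster_occupancy(instances) < set_occupancy:
        return {}
    # S130: current data volume = space needed to store the incoming data.
    current_volume = incoming_bytes
    # S140: compare against the set data volume.
    if current_volume <= set_data_volume:
        return {}
    # S150/S160: cull redundant data on every saturated ("first target") HDFS.
    freed: Dict[str, int] = {}
    for inst in instances:
        if inst.occupancy >= saturation:
            freed[inst.name] = cull(inst)
    return freed
```

Here cull is expected to remove redundant data from the given instance and return the number of bytes it released; how it decides what is redundant is described in steps S161 to S163.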
In one implementation, to further improve storage efficiency, the Hadoop cluster may be expanded dynamically as follows: after the storage space of each first target HDFS is released, the target data volume corresponding to all released storage space is calculated; whether the target data volume reaches the current data volume is judged; and if it does not, the Hadoop cluster is expanded to add at least one second target HDFS.
Further, when the second target HDFS is added, a cluster response curve of the Hadoop cluster is determined; real-time curve features of the cluster response curve are extracted, and whether the Hadoop cluster is in an overload state is judged according to them; and when the Hadoop cluster is in the overload state, the method returns to the step of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records. Before returning to that step, the preset configuration parameters are modified to adjust the identification model used for redundant data elimination. In this way, redundant data can be further removed and the storage efficiency of the Hadoop cluster improved.
In a specific embodiment, to obtain the query record accurately and completely, identifying the target access piece data stored in each first target HDFS in step S150 to obtain the corresponding query record may specifically include the following steps S151 to S155.
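The expansion decision and the overload check can be sketched as below, purely as an illustration. The text does not define the real-time curve features, so this sketch assumes two hypothetical ones (the mean of the most recent response times and their average rise per sample); the function names, thresholds and the fixed per-node capacity are likewise assumptions.

```python
from statistics import mean
from typing import Sequence


def nodes_to_add(freed_bytes: Sequence[int], current_volume: int,
                 node_capacity: int) -> int:
    """Number of second target HDFS to add (0 if the freed space suffices)."""
    # Target data volume = total space released across all first target HDFS.
    shortfall = current_volume - sum(freed_bytes)
    if shortfall <= 0:
        return 0
    # Ceiling division: at least one node whenever there is a shortfall.
    return -(-shortfall // node_capacity)


def is_overloaded(response_ms: Sequence[float], window: int = 3,
                  latency_limit_ms: float = 500.0,
                  slope_limit_ms: float = 50.0) -> bool:
    """Judge overload from two assumed features of the response curve."""
    if len(response_ms) < window + 1:
        return False  # not enough samples to extract the features
    recent_mean = mean(response_ms[-window:])
    # Average rise per sample over the last `window` intervals.
    slope = (response_ms[-1] - response_ms[-window - 1]) / window
    return recent_mean > latency_limit_ms or slope > slope_limit_ms
```

When is_overloaded returns True after an expansion, the flow described above would adjust the configuration parameters and return to the redundant data elimination step rather than keep adding nodes.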
Step S151, acquiring the access piece path distribution and each access piece order of the target access piece data.
Step S152, when the access piece path distribution indicates that the target access piece data contains a query call label, determining, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, an order correlation coefficient between each access piece order under a sample call label and each access piece order under the query call label, and assigning to the query call label every access piece order under the sample call label whose order correlation coefficient with an access piece order under the query call label is higher than a set coefficient.
Step S153, when the sample call label corresponding to the target access piece data contains a plurality of access piece orders, determining, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, the order correlation coefficients among the access piece orders under the sample call label, and screening the access piece orders under the sample call label according to those coefficients.
Step S154, setting a query update priority for the target access piece orders retained by the screening, according to the access piece orders of the target access piece data under the query call label and the generation times of those orders, and assigning at least part of the target access piece orders to the query call label according to the query update priority.
Step S155, generating the query record corresponding to the target access piece data according to the access piece orders under the query call label.
By applying the contents described in the above steps S151 to S155, the query record can be accurately and completely obtained.
In an alternative embodiment, to remove the redundant data accurately, performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record and retaining the corresponding key data, as described in step S160, may further include the following steps S161 to S163.
Step S161, analyzing the data field set of the target access piece data in each first target HDFS according to the query time-interval distribution information of the query records and the order user behavior data of the access piece orders, to obtain the feature pointing data in the field features included in that data field set.
Step S162, extracting, by skipping the first feature pointing data that has an access-behavior-following property relative to the weight-adjustable access behavior data in the feature pointing data, the second feature pointing data of the field codes in the data field set of the target access piece data in each first target HDFS, the second feature pointing data being data that is not updated when the first feature pointing data changes.
Step S163, determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to a set feature identifier; determining the first data fields in the data field set that correspond to the to-be-processed field features as redundant data and removing them; and then determining the second data fields retained in the data field set as the key data.
In this way, the redundant data can be accurately culled based on the above steps S161 to S163.
Further, in step S163, determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to the set feature identifier specifically includes: calculating the cosine distance between the identification character string of the current feature identifier of each target field feature corresponding to the second feature pointing data and the identification character string of the set feature identifier; and, if the cosine distance is within a set interval, determining that the current feature identifier of the target field feature is a target feature identifier and that the target field feature is a to-be-processed field feature.
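The similarity test in step S163 can be illustrated with a minimal Python sketch. The text does not say how an identification character string is turned into a vector, so the sketch assumes character bigram counts; the function names, the sample field identifiers, and the interval (0.0, 0.3) are likewise hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams (an assumed vectorization, not from the source)."""
    text = text.lower()
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a: str, b: str, n: int = 2) -> float:
    """Return 1 - cosine similarity of the two strings' n-gram count vectors."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty identifier as maximally distant
    # Clamp to [0, 1] so floating-point rounding cannot yield a tiny negative value.
    return min(1.0, max(0.0, 1.0 - dot / (norm_a * norm_b)))


def split_fields(field_ids, set_id, interval=(0.0, 0.3)):
    """Split field identifiers into (redundant, key) lists.

    A field counts as redundant when its cosine distance to the set
    identifier falls inside the interval; the bounds are illustrative only.
    """
    low, high = interval
    redundant, key = [], []
    for field_id in field_ids:
        distance = cosine_distance(field_id, set_id)
        (redundant if low <= distance <= high else key).append(field_id)
    return redundant, key


if __name__ == "__main__":
    fields = ["order_remark", "order_remark_2", "user_id"]
    print(split_fields(fields, "order_remark"))
```

In this sketch the fields whose identifiers fall inside the interval play the role of the first data fields (redundant data), and the remaining fields play the role of the second data fields (key data).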
Based on the same inventive concept, and referring to FIG. 2, a Hadoop platform-based device 200 for processing mass storage and pickup records is provided. The device is applied to a computer device that is communicatively connected to a data source end and to a client, and the device includes:
a data obtaining module 210, configured to obtain access piece data uploaded by the data source end through a set transport protocol; the data source end comprises a relational database and an application program, and the access piece data comprises first access piece data uploaded by the relational database and second access piece data uploaded by the application program;
a real-time detection module 220, configured to detect in real time, when the access piece data is stored, whether the occupancy rate of the HDFS in the preset Hadoop cluster reaches a set occupancy rate;
if the occupancy rate reaches the set occupancy rate, calculating the current data volume corresponding to the access piece data; the current data volume is used for representing the size of a storage space required for storing the access piece data into the HDFS;
a data determining module 230, configured to determine whether the current data amount is greater than a set data amount;
when the current data volume is larger than the set data volume, identifying the target access piece data stored in each first target HDFS to obtain a query record corresponding to the target access piece data; the first target HDFS is an HDFS whose storage space is in a saturated state;
and a space releasing module 240, configured to perform redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and retain the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS.
Optionally, the data determining module 230 is specifically configured to:
acquiring the access piece path distribution of the target access piece data and each access piece order;
when it is determined, according to the access piece path distribution, that the target access piece data contains a query call label, determining, according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, an order correlation coefficient between each access piece order of the target access piece data under a sample call label and each access piece order of the target access piece data under the query call label, and classifying under the query call label each access piece order under the sample call label whose order correlation coefficient with an access piece order under the query call label is higher than a set coefficient;
when the sample call label corresponding to the target access piece data contains a plurality of access piece orders, determining, according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, order correlation coefficients among the access piece orders of the target access piece data under the sample call label, and screening the access piece orders under the sample call label according to those order correlation coefficients;
setting a query update priority for the screened target access piece orders according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, and classifying at least part of the target access piece orders under the query call label according to the query update priority;
and generating a query record corresponding to the target access piece data according to the access piece orders under the query call label.
Optionally, the space releasing module 240 is specifically configured to:
analyzing the data field set of the target access piece data in each first target HDFS according to the query time interval distribution information of the query record and the order user behavior data of the access piece orders, to obtain feature pointing data in the field features included in the data field set of the target access piece data in each first target HDFS;
skipping, among the feature pointing data, the first feature pointing data that has an access-behavior-following property relative to the weight-adjustable access behavior data, and extracting the second feature pointing data whose field codes in the data field set of the target access piece data in each first target HDFS are not updated when the first feature pointing data changes;
and determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to the set feature identifier, determining the first data fields in the data field set that correspond to the to-be-processed field features as redundant data and removing them, and then determining the second data fields retained in the data field set as the key data.
On the basis of the above, and referring to FIG. 3, a hardware structure diagram of a computer device 300 is provided. The computer device 300 includes a processor 310 and a memory 320 that communicate with each other, and the processor 310 implements the method shown in FIG. 1 by retrieving a computer program from the memory 320 and running it.
In summary, in the above technical solution, access piece data is first obtained; when the access piece data is stored, whether the occupancy rate of the HDFS in the preset Hadoop cluster reaches the set occupancy rate is detected in real time, and if so, the current data volume corresponding to the access piece data is calculated. Next, when the current data volume is judged to be larger than the set data volume, the target access piece data stored in each first target HDFS is identified to obtain the query record corresponding to the target access piece data. Finally, redundant data elimination is performed on at least part of the target access piece data stored in each first target HDFS according to the query record, and the key data corresponding to the at least part of the target access piece data is retained, so as to release the storage space of each first target HDFS. In this way, redundant data in the Hadoop cluster can be removed and storage space released, so that the access piece data is stored dynamically and its storage efficiency is improved.
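The control flow summarized above, together with the expansion fallback of claim 2, might be sketched as follows. This is only an illustrative Python model: the `HdfsNode` class, the byte counts, the thresholds, and the assumption that the reclaimable redundant volume of each store is already known are all hypothetical, and no real Hadoop API is called.

```python
from dataclasses import dataclass


@dataclass
class HdfsNode:
    """Hypothetical stand-in for one HDFS store; all sizes are in bytes."""
    name: str
    capacity: int
    used: int
    redundant: int  # bytes that redundancy removal could reclaim (assumed known)

    @property
    def occupancy(self) -> float:
        return self.used / self.capacity


def release_space(node: HdfsNode) -> int:
    """Remove the redundant bytes from a node and return how many were freed."""
    freed = node.redundant
    node.used -= freed
    node.redundant = 0
    return freed


def plan_storage(nodes, incoming_bytes, set_occupancy=0.8,
                 set_data_volume=1_000, saturation=0.95):
    """Return (decision, freed_bytes) for one batch of incoming access piece data."""
    used = sum(n.used for n in nodes)
    capacity = sum(n.capacity for n in nodes)
    if used / capacity < set_occupancy:
        return "store", 0              # occupancy has not reached the set occupancy
    current_volume = incoming_bytes    # storage space the new data would need
    if current_volume <= set_data_volume:
        return "store", 0              # assumption: small batches need no cleanup
    freed = 0
    for node in nodes:
        if node.occupancy >= saturation:   # plays the role of a "first target HDFS"
            freed += release_space(node)
    if freed < current_volume:
        return "expand", freed         # claim 2: add at least one second target HDFS
    return "store", freed


if __name__ == "__main__":
    cluster = [HdfsNode("a", 100, 98, 30), HdfsNode("b", 100, 70, 10)]
    print(plan_storage(cluster, incoming_bytes=20, set_data_volume=10))
```

The source does not state what happens when a threshold is not reached, so the two early "store" returns are assumptions made to keep the sketch complete.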
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A Hadoop platform-based method for processing mass storage and pickup records, characterized in that the method is applied to a computer device, the computer device being communicatively connected to a data source end and to a client, and the method comprises the following steps:
acquiring access piece data uploaded by the data source end through a set transmission protocol; the data source end comprises a relational database and an application program, and the access piece data comprises first access piece data uploaded by the relational database and second access piece data uploaded by the application program;
when the access piece data is stored, detecting in real time whether the occupancy rate of the HDFS in a preset Hadoop cluster reaches a set occupancy rate;
if the occupancy rate reaches the set occupancy rate, calculating the current data volume corresponding to the access piece data; the current data volume is used for representing the size of a storage space required for storing the access piece data into the HDFS;
judging whether the current data volume is larger than a set data volume;
when the current data volume is larger than the set data volume, identifying the target access piece data stored in each first target HDFS to obtain a query record corresponding to the target access piece data; the first target HDFS is an HDFS whose storage space is in a saturated state;
performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and retaining the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS;
wherein identifying the target access piece data stored in each first target HDFS to obtain the query record corresponding to the target access piece data comprises:
acquiring the access piece path distribution of the target access piece data and each access piece order;
when it is determined, according to the access piece path distribution, that the target access piece data contains a query call label, determining, according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, an order correlation coefficient between each access piece order of the target access piece data under a sample call label and each access piece order of the target access piece data under the query call label, and classifying under the query call label each access piece order under the sample call label whose order correlation coefficient with an access piece order under the query call label is higher than a set coefficient;
when the sample call label corresponding to the target access piece data contains a plurality of access piece orders, determining, according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, order correlation coefficients among the access piece orders of the target access piece data under the sample call label, and screening the access piece orders under the sample call label according to those order correlation coefficients;
setting a query update priority for the screened target access piece orders according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, and classifying at least part of the target access piece orders under the query call label according to the query update priority;
and generating a query record corresponding to the target access piece data according to the access piece orders under the query call label.
2. The method of claim 1, further comprising:
after the storage space of each first target HDFS is released, calculating target data volume corresponding to all released storage spaces;
judging whether the target data volume reaches the current data volume;
and if the target data volume does not reach the current data volume, expanding the Hadoop cluster to increase at least one second target HDFS.
3. The method of claim 2, further comprising:
determining a cluster response curve of the Hadoop cluster when the second target HDFS is added;
extracting real-time curve characteristics of the cluster response curve, and judging whether the Hadoop cluster is in an overload state according to the real-time curve characteristics;
and when the Hadoop cluster is in the overload state, returning to the step of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record.
4. The method of claim 3, further comprising:
before returning to the step of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, modifying preset configuration parameters to adjust the identification model used for redundant data elimination.
5. The method according to claim 1, wherein performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and retaining the key data corresponding to the at least part of the target access piece data, comprises:
analyzing the data field set of the target access piece data in each first target HDFS according to the query time interval distribution information of the query record and the order user behavior data of the access piece orders, to obtain feature pointing data in the field features included in the data field set of the target access piece data in each first target HDFS;
skipping, among the feature pointing data, the first feature pointing data that has an access-behavior-following property relative to the weight-adjustable access behavior data, and extracting the second feature pointing data whose field codes in the data field set of the target access piece data in each first target HDFS are not updated when the first feature pointing data changes;
and determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to the set feature identifier, determining the first data fields in the data field set that correspond to the to-be-processed field features as redundant data and removing them, and then determining the second data fields retained in the data field set as the key data.
6. The method according to claim 5, wherein determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to the set feature identifier comprises:
calculating the cosine distance between the identification character string of the current feature identifier of each target field feature corresponding to the second feature pointing data and the identification character string of the set feature identifier;
and if the cosine distance is within a set interval, determining that the current feature identifier of the target field feature is a target feature identifier and that the target field feature is a to-be-processed field feature.
7. A Hadoop platform-based device for processing mass storage and pickup records, characterized in that the device is applied to a computer device, the computer device being communicatively connected to a data source end and to a client, and the device comprises:
the data acquisition module is used for acquiring the access piece data uploaded by the data source end through a set transmission protocol; the data source end comprises a relational database and an application program, and the access piece data comprises first access piece data uploaded by the relational database and second access piece data uploaded by the application program;
the real-time detection module is used for detecting in real time, when the access piece data is stored, whether the occupancy rate of the HDFS in the preset Hadoop cluster reaches the set occupancy rate;
if the occupancy rate reaches the set occupancy rate, calculating the current data volume corresponding to the access piece data; the current data volume is used for representing the size of a storage space required for storing the access piece data into the HDFS;
the data judging module is used for judging whether the current data volume is larger than the set data volume;
when the current data volume is larger than the set data volume, identifying the target access piece data stored in each first target HDFS to obtain a query record corresponding to the target access piece data; the first target HDFS is an HDFS whose storage space is in a saturated state;
the space release module is used for performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and retaining the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS;
wherein the data judging module is specifically configured to:
acquiring the access piece path distribution of the target access piece data and each access piece order;
when it is determined, according to the access piece path distribution, that the target access piece data contains a query call label, determining, according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, an order correlation coefficient between each access piece order of the target access piece data under a sample call label and each access piece order of the target access piece data under the query call label, and classifying under the query call label each access piece order under the sample call label whose order correlation coefficient with an access piece order under the query call label is higher than a set coefficient;
when the sample call label corresponding to the target access piece data contains a plurality of access piece orders, determining, according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, order correlation coefficients among the access piece orders of the target access piece data under the sample call label, and screening the access piece orders under the sample call label according to those order correlation coefficients;
setting a query update priority for the screened target access piece orders according to the access piece orders of the target access piece data under the query call label and the order generation times of those orders, and classifying at least part of the target access piece orders under the query call label according to the query update priority;
and generating a query record corresponding to the target access piece data according to the access piece orders under the query call label.
8. The apparatus according to claim 7, wherein the space release module is specifically configured to:
analyzing the data field set of the target access piece data in each first target HDFS according to the query time interval distribution information of the query record and the order user behavior data of the access piece orders, to obtain feature pointing data in the field features included in the data field set of the target access piece data in each first target HDFS;
skipping, among the feature pointing data, the first feature pointing data that has an access-behavior-following property relative to the weight-adjustable access behavior data, and extracting the second feature pointing data whose field codes in the data field set of the target access piece data in each first target HDFS are not updated when the first feature pointing data changes;
and determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features whose target feature identifiers are similar to the set feature identifier, determining the first data fields in the data field set that correspond to the to-be-processed field features as redundant data and removing them, and then determining the second data fields retained in the data field set as the key data.
CN202010949054.8A 2020-09-10 2020-09-10 Hadoop platform-based mass storage and pickup record processing method and device Active CN111797069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010949054.8A CN111797069B (en) 2020-09-10 2020-09-10 Hadoop platform-based mass storage and pickup record processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010949054.8A CN111797069B (en) 2020-09-10 2020-09-10 Hadoop platform-based mass storage and pickup record processing method and device

Publications (2)

Publication Number Publication Date
CN111797069A CN111797069A (en) 2020-10-20
CN111797069B true CN111797069B (en) 2020-12-08

Family

ID=72834243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010949054.8A Active CN111797069B (en) 2020-09-10 2020-09-10 Hadoop platform-based mass storage and pickup record processing method and device

Country Status (1)

Country Link
CN (1) CN111797069B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200936A (en) * 2011-05-11 2011-09-28 杨钧 Intelligent configuration storage backup method suitable for cloud storage
CN104572113A (en) * 2015-01-21 2015-04-29 深圳市中兴移动通信有限公司 Automatic storage space arranging method and mobile terminal
CN105119763A (en) * 2015-09-24 2015-12-02 烽火通信科技股份有限公司 RIA-based Web network management client big data rapid export method and system
CN107092442A (en) * 2017-04-24 2017-08-25 杭州宏杉科技股份有限公司 Storage system resources distribution method and device
CN107092442B (en) * 2017-04-24 2020-08-18 杭州宏杉科技股份有限公司 Storage system resource allocation method and device

Also Published As

Publication number Publication date
CN111797069A (en) 2020-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20210519
Address after: 210000 Room 301, 3rd floor, 138 Fengtai South Road, Yuhuatai District, Nanjing City, Jiangsu Province
Patentee after: Nanjing cloud cabinet Network Technology Co.,Ltd.
Address before: Room 1501, building 1, Tianhe Plaza, 146 Fengtai South Road, Yuhuatai District, Nanjing, Jiangsu Province 210000
Patentee before: JIANGSU CLOUDBOX NETWORK TECHNOLOGY Co.,Ltd.