CN111797069B - Hadoop platform-based mass storage and pickup record processing method and device

Info

Publication number: CN111797069B
Application number: CN202010949054.8A
Authority: CN
Inventors: 王彪; 杨博; 张友平; 邵杰
Original assignee: Jiangsu Cloudbox Network Technology Co ltd
Current assignee: Nanjing cloud cabinet Network Technology Co.,Ltd.
Priority date: 2020-09-10
Filing date: 2020-09-10
Publication date: 2020-12-08
Anticipated expiration: 2040-09-10
Also published as: CN111797069A

Abstract

Embodiments of the invention provide a method and a device for processing mass storage and pickup records (access piece data) based on the Hadoop platform. First, access piece data is acquired; while the access piece data is being stored, whether the occupancy rate of the HDFS in a preset Hadoop cluster has reached a set occupancy rate is detected in real time, and if so, the current data volume corresponding to the access piece data is calculated. Second, when the current data volume is judged to be larger than a set data volume, the target access piece data stored in each first target HDFS is identified to obtain a query record corresponding to that data. Finally, redundant data is removed from at least part of the target access piece data stored in each first target HDFS according to the query records, and the key data corresponding to that data is retained, so as to release the storage space of each first target HDFS. In this way, redundant data in the Hadoop cluster can be removed and storage space released, so that the access piece data is stored dynamically and its storage efficiency is improved.

Description

Hadoop platform-based mass storage and pickup record processing method and device

Technical Field

The invention relates to the technical field of data processing of intelligent express cabinets, in particular to a method and a device for processing mass storage and pickup records based on a Hadoop platform.

Background

With the rapid development of the express delivery business, the intelligent express cabinet has emerged as a new kind of facility. Its main function is to assist couriers in receiving and sending parcels, and its users are the large population of couriers and recipients. As a 24-hour unattended temporary storage cabinet, the intelligent express cabinet interacts with a server through an Internet of Things card. As an application of Internet of Things technology in the modern logistics industry, it is the most effective solution for the last kilometer of e-commerce terminal logistics: it improves the delivery efficiency of couriers, reduces delivery cost, and protects the privacy and safety of users, and it is gradually being widely recognized and adopted.

The intelligent express cabinet system acts as the server side of the express terminal and provides services such as parcel storage by couriers and parcel pickup by users. The volume of data generated every day is extremely large, and historical storage and pickup information must be stored promptly and be quickly queryable for later data analysis, query, and similar uses.

However, existing techniques for storing the storage and pickup data of intelligent express cabinets suffer from low efficiency.

Disclosure of Invention

In order to solve the above problem, the invention provides a method and a device for processing mass storage and pickup records based on a Hadoop platform.

According to a first aspect of the embodiments of the present invention, a method for processing mass storage and pickup records based on a Hadoop platform is provided. The method is applied to a computer device that is in communication connection with a data source end and with a client, and includes:

acquiring access piece data uploaded by the data source end through a set transmission protocol; the data source end comprises a relational database and an application program, and the access piece data comprises first access piece data uploaded by the relational database and second access piece data uploaded by the application program;

when the access data is stored, detecting whether the occupancy rate of the HDFS in a preset Hadoop cluster reaches a set occupancy rate or not in real time;

if the occupancy rate reaches the set occupancy rate, calculating the current data volume corresponding to the access piece data; the current data volume is used for representing the size of a storage space required for storing the access piece data into the HDFS;

judging whether the current data volume is larger than a set data volume;

when the current data volume is larger than the set data volume, identifying target access item data stored in each first target HDFS to obtain a query record corresponding to the target access item data; the first target HDFS is an HDFS with a storage space in a saturated state;

and performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, and retaining the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS.

Optionally, the method further comprises:

after the storage space of each first target HDFS is released, calculating target data volume corresponding to all released storage spaces;

judging whether the target data volume reaches the current data volume;

and if the target data volume does not reach the current data volume, expanding the Hadoop cluster to increase at least one second target HDFS.

Optionally, the method further comprises:

determining a cluster response curve of the Hadoop cluster when the second target HDFS is added;

extracting real-time curve characteristics of the cluster response curve, and judging whether the Hadoop cluster is in an overload state according to the real-time curve characteristics;

and when the Hadoop cluster is in the overload state, returning to the step of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records.

Optionally, the method further comprises:

modifying preset configuration parameters to adjust the identification model used for redundant data elimination, before returning to the step of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records.

Optionally, identifying the target access piece data stored in each first target HDFS to obtain a query record corresponding to the target access piece data includes:

acquiring the access piece path distribution of the target access piece data and each access piece order;

under the condition that the target access piece data contains a query call label according to the access piece path distribution, determining order correlation coefficients between each access piece order of the target access piece data under a sample call label and each access piece order of the target access piece data under the query call label according to an access piece order of the target access piece data under the query call label and an order generation moment of the access piece order of the target access piece data under the query call label, and dividing the access piece order of which the order correlation coefficient between the access piece order of the target access piece data under the sample call label and the access piece order under the query call label is higher than a set coefficient into the query call label;

under the condition that a plurality of access part orders are contained in a sample calling label corresponding to the target access part data, determining order correlation coefficients among the access part orders of the target access part data in the sample calling label according to the access part orders of the target access part data in the query calling label and the order generation time of the access part orders of the target access part data in the query calling label, and screening the access part orders in the sample calling label according to the order correlation coefficients among the access part orders;

setting query updating priority for the screened target access part orders according to the access part orders of the target access part data under the query calling label and the order generating time of the access part orders of the target access part data under the query calling label, and dividing at least part of the target access part orders under the query calling label according to the query updating priority;

and generating a query record corresponding to the target access piece data according to the access piece order under the query calling label.

Optionally, performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and retaining the key data corresponding to the at least part of the target access piece data, includes:

analyzing the data field set of the target access piece data in each first target HDFS according to the query time interval distribution information of the query record and the order user behavior data of the access piece order, to obtain feature pointing data in the field features included in the data field set of the target access piece data in each first target HDFS;

skipping, in the feature pointing data, the first feature pointing data that follows access behavior relative to the weight-adjustable access behavior data, and extracting the second feature pointing data, namely feature pointing data for which the field codes in the data field set of the target access piece data in each first target HDFS are not updated when the first feature pointing data changes;

determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features corresponding to target feature identifiers similar to a set feature identifier; determining the first data fields corresponding to the to-be-processed field features in the data field set as redundant data and removing them; and then determining the second data fields retained in the data field set as the key data.

Optionally, determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features corresponding to target feature identifiers similar to the set feature identifier includes:

calculating the cosine distance between the identification character string of the current characteristic identification of each target field characteristic corresponding to the second characteristic pointing data and the identification character string of the set characteristic identification;

and if the cosine distance is within a set interval, determining that the current feature identifier of the target field feature is a target feature identifier and determining that the target field feature is the to-be-processed field feature.

According to a second aspect of the embodiments of the present invention, a device for processing mass storage and pickup records based on a Hadoop platform is provided. The device is applied to a computer device that is in communication connection with a data source end and with a client, and includes:

the data acquisition module is used for acquiring the access piece data uploaded by the data source end through a set transmission protocol; the data source end comprises a relational database and an application program, and the access piece data comprises first access piece data uploaded by the relational database and second access piece data uploaded by the application program;

the real-time detection module is used for detecting whether the occupancy rate of the HDFS in the preset Hadoop cluster reaches the set occupancy rate or not in real time when the access data is stored;

the data judging module is used for judging whether the current data volume is larger than the set data volume;

and the space release module is used for performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, and retaining the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS.

Optionally, the data determining module is specifically configured to:

Optionally, the space releasing module is specifically configured to:

In the method and the device for processing mass storage and pickup records based on the Hadoop platform provided by the embodiments of the invention, access piece data is first acquired; while the access piece data is being stored, whether the occupancy rate of the HDFS in a preset Hadoop cluster has reached the set occupancy rate is detected in real time, and if so, the current data volume corresponding to the access piece data is calculated. Second, when the current data volume is judged to be larger than the set data volume, the target access piece data stored in each first target HDFS is identified to obtain a query record corresponding to that data. Finally, redundant data is removed from at least part of the target access piece data stored in each first target HDFS according to the query records, and the key data corresponding to that data is retained, so as to release the storage space of each first target HDFS. In this way, redundant data in the Hadoop cluster can be removed and storage space released, so that the access piece data is stored dynamically and its storage efficiency is improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a flowchart of a method for processing mass storage and pickup records based on a Hadoop platform according to an embodiment of the present invention.

Fig. 2 is a functional block diagram of a device for processing mass storage and pickup records based on a Hadoop platform according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.

Detailed Description

To aid understanding of the technical solutions of the present invention, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples are detailed descriptions of the technical solutions, not limitations of them, and that the technical features in the embodiments and examples may be combined with one another where no conflict arises.

Referring first to fig. 1, a flowchart of a method for processing mass storage and pickup records based on a Hadoop platform is shown. The method may be applied to a computer device that is communicatively connected to a data source end and to a client, and may include the following steps S110 to S160.

Step S110, obtaining the access piece data uploaded by the data source end through the set transmission protocol.

In this embodiment, the set transmission protocol mentioned in step S110 may be the Datagram Transport Layer Security (DTLS) protocol. The data source end includes a relational database and an application program, and the access piece data includes first access piece data uploaded by the relational database and second access piece data uploaded by the application program.

Step S120, detecting in real time, when the access piece data is stored, whether the occupancy rate of the HDFS in the preset Hadoop cluster reaches the set occupancy rate.

In this embodiment, the Hadoop cluster described in step S120 includes a plurality of Hadoop Distributed File Systems (HDFS), and the set occupancy rate may be modified according to the actual situation, which is not limited herein.
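The occupancy check of step S120 can be illustrated with a minimal sketch. The `HdfsUsage` structure, the instance names, and the 0.8 threshold are assumptions introduced here for illustration only; the embodiment does not specify how capacity figures are collected or what value the set occupancy rate takes.

```python
from dataclasses import dataclass


@dataclass
class HdfsUsage:
    """Capacity snapshot for one HDFS in the cluster (hypothetical structure)."""
    name: str
    capacity_bytes: int
    used_bytes: int

    @property
    def occupancy(self) -> float:
        # Fraction of capacity in use; a non-positive capacity counts as full.
        if self.capacity_bytes <= 0:
            return 1.0
        return self.used_bytes / self.capacity_bytes


def reached_set_occupancy(usages, set_occupancy=0.8):
    """Return the HDFS snapshots whose occupancy has reached the set occupancy."""
    return [u for u in usages if u.occupancy >= set_occupancy]


# Illustrative values only.
usages = [
    HdfsUsage("hdfs-a", capacity_bytes=1000, used_bytes=850),
    HdfsUsage("hdfs-b", capacity_bytes=1000, used_bytes=400),
]
saturated = reached_set_occupancy(usages, set_occupancy=0.8)
print([u.name for u in saturated])
```

Because the set occupancy rate is a parameter rather than a constant, it can be changed at call time, in line with the statement that it may be modified according to the actual situation.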

Step S130, if the occupancy rate reaches the set occupancy rate, calculating the current data volume corresponding to the access piece data.

In this embodiment, the current data amount is used to characterize the size of the storage space required for storing the access data into the HDFS.
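One possible way to estimate the current data volume is sketched below. It assumes the access piece records are serialized as UTF-8 JSON and that every record is stored `replication_factor` times (HDFS replicates blocks, with a default factor of 3). The record fields, the serialization format, and the function name are hypothetical; the embodiment does not specify how the volume is computed.

```python
import json


def current_data_volume(records, replication_factor=3):
    """Estimate the bytes needed to store `records`, counting every replica.

    Assumption: each record is written as one UTF-8 encoded JSON document.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    raw_bytes = sum(
        len(json.dumps(record, ensure_ascii=False).encode("utf-8"))
        for record in records
    )
    return raw_bytes * replication_factor


# Hypothetical access piece records: one storage event and one pickup event.
records = [
    {"order_id": "A1", "cabinet": "C-17", "action": "store"},
    {"order_id": "A1", "cabinet": "C-17", "action": "pickup"},
]
set_data_volume = 64  # illustrative threshold for step S140, in bytes
exceeds = current_data_volume(records) > set_data_volume
print(current_data_volume(records), exceeds)
```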

Step S140, determining whether the current data amount is greater than a set data amount.

Step S150, when the current data volume is larger than the set data volume, identifying the target access piece data stored in each first target HDFS to obtain the query record corresponding to the target access piece data.

In this embodiment, the first target HDFS is an HDFS whose storage space is in a saturated state.

Step S160, performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, and retaining the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS.

It can be understood that, based on steps S110 to S160 above, the access piece data is first acquired; while the access piece data is being stored, whether the occupancy rate of the HDFS in the preset Hadoop cluster has reached the set occupancy rate is detected in real time, and if so, the current data volume corresponding to the access piece data is calculated. Second, when the current data volume is judged to be larger than the set data volume, the target access piece data stored in each first target HDFS is identified to obtain a query record corresponding to that data. Finally, redundant data is removed from at least part of the target access piece data stored in each first target HDFS according to the query records, and the key data corresponding to that data is retained, so as to release the storage space of each first target HDFS. In this way, redundant data in the Hadoop cluster can be removed and storage space released, so that the access piece data is stored dynamically and its storage efficiency is improved.
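The branching among steps S120 to S160 can be summarized in a short sketch. The `Action` enum and `decide_action` function are hypothetical names. The embodiment does not state what happens when a condition is not met; this sketch assumes the data is then stored directly.

```python
from enum import Enum


class Action(Enum):
    STORE_DIRECTLY = "store directly"
    CULL_REDUNDANT_DATA = "identify query records, cull redundant data, then store"


def decide_action(occupancy, set_occupancy, current_volume, set_volume):
    """Decide how to handle incoming access piece data.

    Assumption: whenever a threshold is not reached, the data is stored
    without any cleanup. The source only describes the branch in which
    both thresholds are exceeded.
    """
    # S120: has the HDFS occupancy rate reached the set occupancy rate?
    if occupancy < set_occupancy:
        return Action.STORE_DIRECTLY
    # S130/S140: is the current data volume larger than the set data volume?
    if current_volume <= set_volume:
        return Action.STORE_DIRECTLY
    # S150/S160: obtain query records and remove redundant data.
    return Action.CULL_REDUNDANT_DATA


print(decide_action(0.9, 0.8, current_volume=500, set_volume=100))
```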

In an implementation manner, in order to further improve the storage efficiency, the Hadoop cluster may be dynamically expanded, specifically as follows: after the storage space of each first target HDFS is released, calculating target data volume corresponding to all released storage spaces; judging whether the target data volume reaches the current data volume; and if the target data volume does not reach the current data volume, expanding the Hadoop cluster to increase at least one second target HDFS.
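The expansion decision described above can be sketched as follows. The comparison between the target data volume (total released space) and the current data volume follows the text. Sizing the expansion by dividing the shortfall by a per-HDFS capacity is an assumption added for illustration, since the embodiment only requires that at least one second target HDFS be added.

```python
import math


def second_target_hdfs_to_add(freed_bytes_per_hdfs, current_volume, new_hdfs_capacity):
    """Return how many second target HDFS to add after redundant data removal.

    freed_bytes_per_hdfs: bytes released on each first target HDFS.
    current_volume: bytes required by the incoming access piece data.
    new_hdfs_capacity: usable bytes of one added HDFS (hypothetical parameter).
    """
    if new_hdfs_capacity <= 0:
        raise ValueError("new_hdfs_capacity must be positive")
    target_volume = sum(freed_bytes_per_hdfs)
    if target_volume >= current_volume:
        return 0  # the released space is sufficient; no expansion needed
    shortfall = current_volume - target_volume
    # At least one HDFS is added, as stated in the text.
    return max(1, math.ceil(shortfall / new_hdfs_capacity))


print(second_target_hdfs_to_add([100, 200], current_volume=900, new_hdfs_capacity=500))
```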

Further, when the second target HDFS is added, a cluster response curve of the Hadoop cluster is determined; real-time curve characteristics of the cluster response curve are extracted, and whether the Hadoop cluster is in an overload state is judged according to those characteristics; and when the Hadoop cluster is in the overload state, the method returns to the step of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records. Before returning to that step, preset configuration parameters are modified to adjust the identification model used for redundant data elimination. In this way, redundant data can be further removed and the storage efficiency of the Hadoop cluster improved.
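The embodiment does not define the cluster response curve or its real-time characteristics. As one hedged illustration, the sketch below treats the curve as a series of response latencies in milliseconds and uses two assumed features, the mean and the average slope over a recent window, with hypothetical thresholds.

```python
from statistics import mean


def curve_features(latencies_ms, window=5):
    """Extract illustrative features from the most recent part of the curve."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    recent = latencies_ms[-window:]
    # Average change per sample across the window; zero for a single sample.
    slope = (recent[-1] - recent[0]) / (len(recent) - 1) if len(recent) > 1 else 0.0
    return {"mean_ms": mean(recent), "slope_ms": slope}


def is_overloaded(latencies_ms, max_mean_ms=200.0, max_slope_ms=20.0, window=5):
    """Flag overload when latency is high or rising quickly (assumed criteria)."""
    features = curve_features(latencies_ms, window)
    return features["mean_ms"] > max_mean_ms or features["slope_ms"] > max_slope_ms


print(is_overloaded([50, 52, 51, 53, 50]))
print(is_overloaded([10, 40, 70, 100, 130]))
```

In this sketch the first series is stable, while the second rises by 30 ms per sample and is therefore flagged even though its mean is still low. Both thresholds stand in for the preset configuration parameters, which the text leaves unspecified.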

In a specific embodiment, in order to obtain the query record accurately and completely, the identifying of the target access piece data stored in each first target HDFS to obtain the corresponding query record, as described in step S150, may specifically include the following steps S151 to S155.

Step S151, acquiring the access piece path distribution of the target access piece data and each access piece order.

Step S152, when it is determined that the target access item data includes the query call label according to the access item path distribution, determining an order correlation coefficient between each access item order of the target access item data in the sample call label and each access item order of the target access item data in the query call label according to the access item order of the target access item data in the query call label and the order generation time of the access item order of the target access item data in the query call label, and dividing the access item order in which the order correlation coefficient between the access item order of the target access item data in the sample call label and the access item order of the target access item data in the query call label is higher than a set coefficient into the query call label.

Step S153, when the sample call label corresponding to the target access item data includes a plurality of access item orders, determining an order correlation coefficient between the access item orders of the target access item data under the sample call label according to the access item orders of the target access item data under the query call label and the order generation time of the access item orders of the target access item data under the query call label, and screening the access item orders under the sample call label according to the order correlation coefficient between the access item orders.

Step S154, setting a query update priority for the screened target access piece orders according to the access piece orders of the target access piece data under the query call label and the order generation time of the access piece orders of the target access piece data under the query call label, and dividing at least part of the target access piece orders into the query call label according to the query update priority.

Step S155, generating a query record corresponding to the target access piece data according to the access piece orders under the query call label.

By applying the contents described in the above steps S151 to S155, the query record can be accurately and completely obtained.

In an alternative embodiment, in order to remove the redundant data accurately, the step described in step S160 of performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and retaining the key data corresponding to that data, may further include the following steps S161 to S163.

Step S161, analyzing the data field set of the target access piece data in each first target HDFS according to the query time interval distribution information of the query record and the order user behavior data of the access piece order, to obtain feature pointing data in the field features included in the data field set of the target access piece data in each first target HDFS.

Step S162, skipping, in the feature pointing data, the first feature pointing data that follows access behavior relative to the weight-adjustable access behavior data, and extracting the second feature pointing data, namely feature pointing data for which the field codes in the data field set of the target access piece data in each first target HDFS are not updated when the first feature pointing data changes.

Step S163, determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features corresponding to target feature identifiers similar to a set feature identifier; determining the first data fields corresponding to the to-be-processed field features in the data field set as redundant data and removing them; and then determining the second data fields retained in the data field set as the key data.

In this way, the redundant data can be accurately culled based on the above steps S161 to S163.

Further, in step S163, determining, from the target field features corresponding to the second feature pointing data, the to-be-processed field features corresponding to target feature identifiers similar to the set feature identifier specifically includes: calculating the cosine distance between the identification character string of the current feature identifier of each target field feature corresponding to the second feature pointing data and the identification character string of the set feature identifier; and if the cosine distance is within a set interval, determining that the current feature identifier of the target field feature is a target feature identifier and that the target field feature is a to-be-processed field feature.
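The cosine-distance comparison can be illustrated with a minimal sketch. The embodiment does not say how an identification character string is turned into a vector; this sketch assumes character bigram counts. The field identifiers and the interval `(0.0, 0.3)` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count character n-grams of a lowercased identifier string."""
    text = text.lower()
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b, n=2):
    """Cosine distance (1 - cosine similarity) between two identifier strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    if norm == 0:
        return 1.0  # an empty identifier is treated as maximally distant
    # Clamp to [0, 1] to absorb floating-point rounding.
    return min(1.0, max(0.0, 1.0 - dot / norm))


def to_be_processed(feature_ids, set_feature_id, interval=(0.0, 0.3)):
    """Return the identifiers whose distance to the set identifier is in the interval."""
    low, high = interval
    return [f for f in feature_ids if low <= cosine_distance(f, set_feature_id) <= high]


# Hypothetical field feature identifiers.
candidates = ["tmp_cache_1", "tmp_cache_2", "order_id"]
print(to_be_processed(candidates, "tmp_cache"))
```

In this example the two `tmp_cache_*` identifiers share most bigrams with the set identifier and fall inside the interval, while `order_id` shares none and has a distance of 1.0. The fields matching the selected identifiers would then be treated as redundant data and the remaining fields as key data, as described in step S163.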

Based on the same inventive concept, and referring also to fig. 2, a device 200 for processing mass storage and pickup records based on a Hadoop platform is provided. The device is applied to a computer device that is in communication connection with a data source end and with a client, and includes:

a data obtaining module 210, configured to obtain access piece data uploaded by the data source end through a set transport protocol; the data source end comprises a relational database and an application program, and the access piece data comprises first access piece data uploaded by the relational database and second access piece data uploaded by the application program;

the real-time detection module 220 is configured to detect whether the occupancy rate of the HDFS in the preset Hadoop cluster reaches a set occupancy rate in real time when the access data is stored;

a data determining module 230, configured to determine whether the current data amount is greater than a set data amount;

and the space releasing module 240 is configured to perform redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and to retain the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS.

Optionally, the data determining module 230 is specifically configured to:

Optionally, the space releasing module 240 is specifically configured to:

On this basis, and referring also to fig. 3, a hardware structure diagram of a computer device 300 is provided. The computer device 300 includes a processor 310 and a memory 320 that communicate with each other, and the processor 310 implements the method shown in fig. 1 by retrieving a computer program from the memory 320 and running it.

In summary, in the above technical solution, access piece data is first acquired; while the access piece data is being stored, whether the occupancy rate of the HDFS in the preset Hadoop cluster has reached the set occupancy rate is detected in real time, and if so, the current data volume corresponding to the access piece data is calculated. Second, when the current data volume is judged to be larger than the set data volume, the target access piece data stored in each first target HDFS is identified to obtain a query record corresponding to that data. Finally, redundant data is removed from at least part of the target access piece data stored in each first target HDFS according to the query records, and the key data corresponding to that data is retained, so as to release the storage space of each first target HDFS. In this way, redundant data in the Hadoop cluster can be removed and storage space released, so that the access piece data is stored dynamically and its storage efficiency is improved.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for processing records of mass access parts based on a Hadoop platform is characterized by being applied to computer equipment, wherein the computer equipment is in communication connection with a data source end and is also in communication connection with a client end, and the method comprises the following steps:

judging whether the current data volume is larger than a set data volume;

performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, and retaining the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS;

the method for identifying the target access item data stored in each first target HDFS to obtain the query record corresponding to the target access item data comprises the following steps:

2. The method of claim 1, further comprising:

judging whether the target data volume reaches the current data volume;

3. The method of claim 2, further comprising:

4. The method of claim 3, further comprising:

5. The method according to claim 1, wherein performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query record, and retaining the key data corresponding to the at least part of the target access piece data, comprises:

6. The method according to claim 5, wherein determining the field feature to be processed corresponding to the target feature identifier similar to the set feature identifier from the target field features corresponding to the second feature pointing data comprises:

7. A mass storage and pickup record processing device based on a Hadoop platform is characterized by being applied to computer equipment, wherein the computer equipment is in communication connection with a data source end and is also in communication connection with a client, and the device comprises:

the space release module is used for performing redundant data elimination on at least part of the target access piece data stored in each first target HDFS according to the query records, and retaining the key data corresponding to the at least part of the target access piece data, so as to release the storage space of each first target HDFS;

the data judgment module is specifically configured to:

8. The apparatus according to claim 7, wherein the space release module is specifically configured to: