CN114385668A - Cold data cleaning method, device, equipment and storage medium - Google Patents


Info

Publication number
CN114385668A
Authority
CN
China
Prior art keywords
data
request
cold data
result
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210038324.9A
Other languages
Chinese (zh)
Inventor
邓成杨
谢小娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202210038324.9A
Publication of CN114385668A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a cold data cleaning method, device, equipment and storage medium for improving the efficiency of cold data cleaning. The cold data cleaning method comprises: preprocessing initial log data to obtain preprocessed log data, and analyzing the preprocessed log data to obtain an analysis result; classifying the analysis result to obtain structured query information, extracting the request time corresponding to each request table in the structured query information, and calculating a plurality of request time difference values; comparing each of the request time difference values with a preset time difference threshold to obtain a comparison result, and determining cold data in the structured query information to obtain initial cold data; and monitoring the initial cold data, determining target cold data based on the monitoring result, and cleaning the target cold data. The invention also relates to blockchain technology, and the cleaned data can be stored in a blockchain.

Description

Cold data cleaning method, device, equipment and storage medium
Technical Field
The invention relates to the field of classification algorithms, in particular to a cold data cleaning method, a cold data cleaning device, cold data cleaning equipment and a storage medium.
Background
In the big data era, new data is generated constantly and grows explosively, and how to store, manage and use it is a difficult problem for modern enterprises. Part of this data can be called "cold data": information that is accessed infrequently but is not deleted immediately, such as large amounts of information stored by users on social media, enterprise backup data, business and operation log data, and call tickets and statistics.
As enterprise data keeps growing, cold data needs to be identified by an intelligent method and cleaned without affecting external systems, thereby reducing storage alarms on the big data platform.
Disclosure of Invention
The invention provides a cold data cleaning method, a device, equipment and a storage medium, which are used for analyzing pre-processed log data through a preset analysis tool to obtain an analysis result, classifying the analysis result to obtain structured query information, extracting request time corresponding to each request table in the structured query information, calculating to obtain a plurality of request time difference values, comparing the plurality of request time difference values with preset time difference threshold values respectively to obtain a comparison result, determining cold data in the structured query information based on the comparison result, and improving the efficiency of cold data cleaning.
The invention provides a cold data cleaning method in a first aspect, which comprises the following steps: acquiring initial log data from a preset database, preprocessing the initial log data to obtain preprocessed log data, and calling a preset analysis tool to analyze the preprocessed log data to obtain an analysis result; classifying the analysis result to obtain structured query information, extracting request time corresponding to each request table in the structured query information, and calculating to obtain a plurality of request time difference values based on the request time corresponding to each request table; comparing the plurality of request time difference values with a preset time difference threshold value respectively to obtain comparison results, and determining cold data in the structured query information based on the comparison results to obtain initial cold data; and calling a preset data monitor to monitor the initial cold data to obtain a monitoring result, determining target cold data based on the monitoring result, and cleaning the target cold data.
Optionally, in a first implementation manner of the first aspect of the present invention, the obtaining initial log data from a preset database, preprocessing the initial log data to obtain preprocessed log data, and invoking a preset analysis tool to analyze the preprocessed log data, where obtaining an analysis result includes: acquiring initial log data from a preset database, and performing missing value completion, abnormal value filtration and repeated value filtration on the initial log data to obtain preprocessed log data, wherein the preprocessed log data comprises a request base table, a request field, an insertion database, an insertion field and a limiting condition; and calling a preset analysis tool, converting the preprocessed log data into an abstract syntax tree, and performing traversal processing on the abstract syntax tree to obtain an analysis result.
Optionally, in a second implementation manner of the first aspect of the present invention, the invoking a preset parsing tool, converting the preprocessed log data into an abstract syntax tree, and performing traversal processing on the abstract syntax tree to obtain a parsing result includes: calling a preset syntax analysis tool, and sequentially performing lexical analysis and syntax analysis on the preprocessed log data to obtain an abstract syntax tree, wherein the syntax analysis tool comprises a lexical analyzer and a syntax analyzer; and calling a preset traversal algorithm to perform traversal processing on the abstract syntax tree to obtain an analysis result.
Optionally, in a third implementation manner of the first aspect of the present invention, the classifying the analysis result to obtain structured query information, extracting a request time corresponding to each request table in the structured query information, and calculating a plurality of request time difference values based on the request time corresponding to each request table includes: calling a preset classification algorithm, and performing classification processing on the analysis result to obtain structured query information, wherein the structured query information comprises an insertion database, an insertion table, a request database, a request table and request time; acquiring request time corresponding to each request table in the structured query information and analysis starting time in the structured query information, and calculating a difference value between the request time corresponding to each request table and the analysis starting time to obtain a plurality of request time difference values.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the comparing the plurality of request time difference values with a preset time difference threshold respectively to obtain a comparison result, and determining cold data in the structured query information based on the comparison result to obtain initial cold data includes: comparing each of the plurality of request time difference values with the preset time difference threshold to obtain comparison results, and extracting target results from the comparison results, wherein the target results are the comparison results corresponding to request time difference values that are greater than or equal to the time difference threshold; and acquiring the request table corresponding to each target result to obtain a target request table, and determining all data corresponding to the target request table as initial cold data.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the obtaining the request table corresponding to the target result to obtain a target request table, and determining all data corresponding to the target request table as initial cold data further includes: acquiring a request table corresponding to the target result, calling a preset modification tool, and modifying the table name of the request table corresponding to the target result to obtain a target request table; and determining all data corresponding to the target request table as initial cold data.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the invoking a preset data monitor to monitor the initial cold data to obtain a monitoring result, determining target cold data based on the monitoring result, and performing cleaning processing on the target cold data includes: calling a preset data monitor, and monitoring the initial cold data within a preset time range to obtain a monitoring result, wherein the monitoring result comprises an alarm result and a no-alarm result; and determining the initial cold data whose monitoring result is the no-alarm result as target cold data, and calling a preset cleaning strategy to clean the target cold data, wherein the cleaning strategy comprises a cleaning condition and a cleaning mode.
A second aspect of the present invention provides a cold data cleansing apparatus, including: a preprocessing module, configured to acquire initial log data from a preset database, preprocess the initial log data to obtain preprocessed log data, and call a preset analysis tool to analyze the preprocessed log data to obtain an analysis result; a classification module, configured to classify the analysis result to obtain structured query information, extract the request time corresponding to each request table in the structured query information, and calculate a plurality of request time difference values based on the request time corresponding to each request table; a comparison module, configured to compare the plurality of request time difference values with a preset time difference threshold respectively to obtain a comparison result, and determine cold data in the structured query information based on the comparison result to obtain initial cold data; and a monitoring module, configured to call a preset data monitor to monitor the initial cold data to obtain a monitoring result, determine target cold data based on the monitoring result, and clean the target cold data.
Optionally, in a first implementation manner of the second aspect of the present invention, the preprocessing module includes: an acquisition unit, configured to acquire initial log data from a preset database, and perform missing value completion, abnormal value filtration and repeated value filtration on the initial log data to obtain preprocessed log data, wherein the preprocessed log data comprises a request base table, a request field, an insertion database, an insertion field and a limiting condition; and a traversal unit, configured to call a preset analysis tool, convert the preprocessed log data into an abstract syntax tree, and perform traversal processing on the abstract syntax tree to obtain an analysis result.
Optionally, in a second implementation manner of the second aspect of the present invention, the traversal unit is specifically configured to: calling a preset syntax analysis tool, and sequentially performing lexical analysis and syntax analysis on the preprocessed log data to obtain an abstract syntax tree, wherein the syntax analysis tool comprises a lexical analyzer and a syntax analyzer; and calling a preset traversal algorithm to perform traversal processing on the abstract syntax tree to obtain an analysis result.
Optionally, in a third implementation manner of the second aspect of the present invention, the classification module includes: the classification unit is used for calling a preset classification algorithm and classifying the analysis result to obtain structured query information, wherein the structured query information comprises an insertion database, an insertion table, a request database, a request table and request time; and the calculating unit is used for acquiring the request time corresponding to each request table in the structured query information and the analysis starting time in the structured query information, and calculating the difference between the request time corresponding to each request table and the analysis starting time to obtain a plurality of request time differences.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the comparison module includes: the comparison unit is used for comparing the plurality of request time difference values with a preset time difference threshold respectively to obtain comparison results, and extracting target results in the comparison results, wherein the target results are the comparison results corresponding to the request time difference values which are greater than or equal to the time difference threshold; and the determining unit is used for acquiring the request table corresponding to the target result, obtaining a target request table, and determining all data corresponding to the target request table as initial cold data.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the determining unit is specifically configured to: acquiring a request table corresponding to the target result, calling a preset modification tool, and modifying the table name of the request table corresponding to the target result to obtain a target request table; and determining all data corresponding to the target request table as initial cold data.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the monitoring module includes: a monitoring unit, configured to call a preset data monitor and monitor the initial cold data within a preset time range to obtain a monitoring result, wherein the monitoring result comprises an alarm result and a no-alarm result; and a cleaning unit, configured to determine the initial cold data whose monitoring result is the no-alarm result as the target cold data, and call a preset cleaning strategy to clean the target cold data, wherein the cleaning strategy comprises a cleaning condition and a cleaning mode.
A third aspect of the present invention provides a cold data cleansing apparatus comprising: a memory and at least one processor, the memory having stored therein a computer program; the at least one processor calls the computer program in the memory to cause the cold data cleansing apparatus to perform the cold data cleansing method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein a computer program which, when run on a computer, causes the computer to execute the above-described cold data cleansing method.
In the technical scheme provided by the invention, initial log data are obtained from a preset database, the initial log data are preprocessed to obtain preprocessed log data, and a preset analysis tool is called to analyze the preprocessed log data to obtain an analysis result; classifying the analysis result to obtain structured query information, extracting request time corresponding to each request table in the structured query information, and calculating to obtain a plurality of request time difference values based on the request time corresponding to each request table; comparing the plurality of request time difference values with a preset time difference threshold value respectively to obtain comparison results, and determining cold data in the structured query information based on the comparison results to obtain initial cold data; and calling a preset data monitor to monitor the initial cold data to obtain a monitoring result, determining target cold data based on the monitoring result, and cleaning the target cold data. In the embodiment of the invention, the preprocessed log data is analyzed through a preset analysis tool to obtain an analysis result, the analysis result is classified to obtain the structured query information, the request time corresponding to each request table in the structured query information is extracted, a plurality of request time difference values are obtained through calculation, the plurality of request time difference values are compared with the preset time difference threshold value respectively to obtain a comparison result, the cold data in the structured query information is determined based on the comparison result, and the cold data cleaning efficiency is improved.
Drawings
FIG. 1 is a diagram of an embodiment of a cold data scrubbing method according to an embodiment of the present invention;
FIG. 2 is a diagram of another embodiment of a cold data scrubbing method according to an embodiment of the present invention;
FIG. 3 is a diagram of an embodiment of a cold data cleaner in accordance with an embodiment of the present invention;
FIG. 4 is a diagram of another embodiment of a cold data cleaner according to an embodiment of the present invention;
FIG. 5 is a diagram of an embodiment of a cold data cleansing apparatus according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a cold data cleaning method, a device, equipment and a storage medium, wherein pre-processed log data are analyzed through a preset analysis tool to obtain an analysis result, the analysis result is classified to obtain structured query information, request time corresponding to each request table in the structured query information is extracted, a plurality of request time difference values are obtained through calculation, the plurality of request time difference values are compared with preset time difference threshold values respectively to obtain a comparison result, cold data in the structured query information are determined based on the comparison result, and the cold data cleaning efficiency is improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a cold data scrubbing method in an embodiment of the present invention includes:
101. Acquiring initial log data from a preset database, preprocessing the initial log data to obtain preprocessed log data, and calling a preset analysis tool to analyze the preprocessed log data to obtain an analysis result.
It is to be understood that the executing subject of the present invention may be a cold data cleaning device, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The server acquires initial log data from a preset database, preprocesses the initial log data to obtain preprocessed log data, and invokes a preset analysis tool to analyze the preprocessed log data to obtain an analysis result. The database may be a store of information that users have saved on social media, an enterprise backup database, a business and operation log database, a call ticket and statistics database, or the like. After the preprocessed log data is obtained, the server uses the open source parsing tool QueryParser to convert the preprocessed log data from string format into a recursive data structure, namely an abstract syntax tree, and calls a preset traversal algorithm to traverse the abstract syntax tree to obtain an analysis result.
102. Classifying the analysis result to obtain structured query information, extracting the request time corresponding to each request table in the structured query information, and calculating a plurality of request time difference values based on the request time corresponding to each request table.
The server classifies the analysis result to obtain structured query information, extracts the request time corresponding to each request table in the structured query information, and calculates a plurality of request time difference values based on the request time corresponding to each request table. The classification algorithm may be a K-nearest-neighbor algorithm or a bex algorithm; it is used to classify the analysis result to obtain the structured query information. During classification, the table name is added before each column name, and the name of the database object set, the schema, is added before the table name, where a schema contains various objects such as tables, views, stored procedures and indexes, and there is typically one schema per user. The server obtains the request times $t_1, t_2, t_3, \ldots, t_n$ corresponding to the request tables in the structured query information, obtains the analysis start time $t_s$, sets the time difference threshold to $T$, and calculates the difference between the request time corresponding to each request table and the analysis start time, obtaining a plurality of request time difference values.
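The difference calculation in this step can be sketched as follows. This is a minimal illustration, not the patented implementation: the table names, the timestamps and the helper `request_time_differences` are hypothetical, and the sketch assumes that the most recent request of each table is the one compared against the analysis start time and that the difference is taken as an absolute elapsed time.

```python
from datetime import datetime, timedelta


def request_time_differences(request_times, analysis_start):
    """Map each request table to the time elapsed between its latest
    request and the analysis start time (assumption: absolute value)."""
    diffs = {}
    for table, times in request_times.items():
        latest = max(times)  # most recent request seen in the logs
        diffs[table] = abs(analysis_start - latest)
    return diffs


# Hypothetical structured query information: schema.table -> request times.
analysis_start = datetime(2022, 1, 13)
request_times = {
    "sales.orders": [datetime(2021, 12, 1), datetime(2022, 1, 10)],
    "sales.legacy_report": [datetime(2021, 6, 1)],
}

diffs = request_time_differences(request_times, analysis_start)
for table, diff in diffs.items():
    print(table, diff.days, "days")
```

Keeping the result as `timedelta` values lets the later threshold comparison use ordinary comparison operators.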
103. Comparing the plurality of request time difference values with a preset time difference threshold respectively to obtain a comparison result, and determining cold data in the structured query information based on the comparison result to obtain initial cold data.
The server compares the plurality of request time difference values with a preset time difference threshold respectively to obtain a comparison result, and determines cold data in the structured query information based on the comparison result to obtain initial cold data. The server compares each request time difference value with the time difference threshold $T$ to obtain a comparison result. When $t_n - t_s < T$, where $t_n$ denotes the request time corresponding to a request table in the structured query information and $t_s$ denotes the analysis start time, the request table has been requested within the specified time period; otherwise, when $t_n - t_s \geq T$, all data in the corresponding request table is determined as initial cold data.
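The threshold comparison can be sketched as below. The function name `find_initial_cold_tables`, the table names and the 90-day threshold are assumptions made for illustration; the comparison uses "greater than or equal to", following the fourth implementation manner of the first aspect.

```python
from datetime import timedelta


def find_initial_cold_tables(time_diffs, threshold):
    """Return the request tables whose request time difference is
    greater than or equal to the threshold (the initial cold data)."""
    return sorted(table for table, diff in time_diffs.items() if diff >= threshold)


# Hypothetical request time differences per request table.
time_diffs = {
    "sales.orders": timedelta(days=3),
    "sales.legacy_report": timedelta(days=226),
    "ops.audit_2019": timedelta(days=90),
}
T = timedelta(days=90)  # hypothetical preset time difference threshold

cold = find_initial_cold_tables(time_diffs, T)
print(cold)
```

A table sitting exactly on the threshold (`ops.audit_2019` here) is treated as cold under this reading.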
104. Calling a preset data monitor to monitor the initial cold data to obtain a monitoring result, determining target cold data based on the monitoring result, and cleaning the target cold data.
The server calls a preset data monitor to monitor the initial cold data to obtain a monitoring result, determines target cold data based on the monitoring result, and cleans the target cold data. After the initial cold data is obtained, the table name corresponding to the initial cold data is modified with the preset modification tool Hive, and the initial cold data is then monitored for a period of time, usually 1 to 3 months, which is not specifically limited in this embodiment. The data monitor checks whether any alarm message reports that an accessed table does not exist, for example a prompt received through the mail system, by short message or by telephone such as "XXX table does not exist, task XX fails to execute, please process as soon as possible!". If the monitoring result is the no-alarm result, the corresponding initial cold data is determined as the target cold data and is cleaned.
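The rename, observe and clean sequence can be sketched with an in-memory stand-in for the table catalog. Everything named here is hypothetical (the `_cold_candidate` suffix, the helper functions, the alarm text); in Hive the rename itself would be an `ALTER TABLE ... RENAME TO ...` statement. The source does not say what happens to a table that does trigger an alarm, so the sketch simply leaves it untouched.

```python
def rename_for_observation(tables, cold_names, suffix="_cold_candidate"):
    """Rename each candidate table so that any remaining consumer fails
    loudly with a 'table does not exist' alarm."""
    renamed = {}
    for name in cold_names:
        new_name = name + suffix
        tables[new_name] = tables.pop(name)
        renamed[name] = new_name
    return renamed


def select_target_cold(renamed, alarm_messages):
    """Keep only candidates that no alarm message mentions (no-alarm result)."""
    return [orig for orig in renamed
            if not any(orig in msg for msg in alarm_messages)]


def clean(tables, renamed, targets):
    """Drop the renamed copies of the target cold tables."""
    for orig in targets:
        del tables[renamed[orig]]


# Hypothetical catalog: table name -> rows.
tables = {
    "ops.audit_2019": [("a", 1)],
    "sales.legacy_report": [("b", 2)],
    "sales.orders": [("c", 3)],
}
renamed = rename_for_observation(tables, ["ops.audit_2019", "sales.legacy_report"])

# Alarms collected during the observation window (hypothetical message).
alarms = ["sales.legacy_report table does not exist, task nightly_export fails to execute"]

targets = select_target_cold(renamed, alarms)
clean(tables, renamed, targets)
print(targets, sorted(tables))
```

Renaming before deleting turns a silent dependency into an explicit alarm, so the destructive step only runs for tables that nobody missed during the observation window.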
In the embodiment of the invention, the preprocessed log data is analyzed through a preset analysis tool to obtain an analysis result, the analysis result is classified to obtain the structured query information, the request time corresponding to each request table in the structured query information is extracted, a plurality of request time difference values are obtained through calculation, the plurality of request time difference values are compared with the preset time difference threshold value respectively to obtain a comparison result, the cold data in the structured query information is determined based on the comparison result, and the cold data cleaning efficiency is improved.
Referring to fig. 2, another embodiment of a cold data scrubbing method according to an embodiment of the present invention includes:
201. Acquiring initial log data from a preset database, and performing missing value completion, abnormal value filtration and repeated value filtration on the initial log data to obtain preprocessed log data, wherein the preprocessed log data comprises a request base table, a request field, an insertion database, an insertion field and a limiting condition.
The server acquires initial log data from a preset database, and performs missing value completion, abnormal value filtration and repeated value filtration on the initial log data to obtain preprocessed log data, wherein the preprocessed log data comprises a request base table, a request field, an insertion database, an insertion field and a limiting condition. The database may be a store of information that users have saved on social media, an enterprise backup database, a business and operation log database, a call ticket and statistics database, or the like, which is not limited in this embodiment. After the initial log data is obtained, it needs to be preprocessed; preprocessing mainly comprises missing value completion, abnormal value filtering and repeated value filtering. Missing values may be filled by multiple imputation. Abnormal value filtering mainly uses the z-score outlier detection algorithm to identify and delete abnormal values. The server also calls the preset data analysis toolkit pandas to identify and filter repeated values: repeated values in the initial log data are found with the query call df.duplicated() and deleted with df.drop_duplicates(). Repeated values may alternatively be handled with the unique() function of numpy, an extension library for the Python language, which returns all distinct values in the input array sorted in ascending order. The preprocessed log data is finally obtained.
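The three preprocessing stages can be sketched without third-party dependencies. This is an illustration under stated assumptions rather than the described pandas pipeline: the record fields `query` and `duration_ms` are hypothetical, a median fill stands in for multiple imputation, and an order-preserving set check stands in for `df.drop_duplicates()`.

```python
from statistics import mean, median, pstdev


def preprocess(records, z_threshold=3.0):
    """Missing value completion, z-score outlier filtering, duplicate filtering."""
    # 1. Missing value completion (assumption: fill with the median of known values).
    known = [r["duration_ms"] for r in records if r["duration_ms"] is not None]
    fill = median(known)
    filled = [dict(r, duration_ms=fill if r["duration_ms"] is None else r["duration_ms"])
              for r in records]

    # 2. Abnormal value filtering: drop records whose |z-score| exceeds the threshold.
    values = [r["duration_ms"] for r in filled]
    mu, sigma = mean(values), pstdev(values)
    kept = [r for r in filled
            if sigma == 0 or abs(r["duration_ms"] - mu) / sigma <= z_threshold]

    # 3. Repeated value filtering: keep the first occurrence of each record.
    seen, unique = set(), []
    for r in kept:
        key = (r["query"], r["duration_ms"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Hypothetical initial log data.
records = [
    {"query": "SELECT a FROM t1", "duration_ms": 10},
    {"query": "SELECT a FROM t1", "duration_ms": 10},     # duplicate
    {"query": "SELECT b FROM t2", "duration_ms": None},   # missing value
    {"query": "SELECT c FROM t3", "duration_ms": 12},
    {"query": "SELECT d FROM t4", "duration_ms": 11},
    {"query": "SELECT e FROM t5", "duration_ms": 9},
    {"query": "SELECT f FROM t6", "duration_ms": 10000},  # outlier
]

# With only seven samples a z-score cannot exceed sqrt(6), so the example uses 2.0.
cleaned = preprocess(records, z_threshold=2.0)
print(len(cleaned))
```

The z-score of a single extreme value is bounded by the sample size, which is why the small example lowers the threshold; on real log volumes a threshold around 3 is the more common choice.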
202. And calling a preset analysis tool, converting the preprocessed log data into an abstract syntax tree, and performing traversal processing on the abstract syntax tree to obtain an analysis result.
The server calls a preset analysis tool, converts the preprocessed log data into an abstract syntax tree, and performs traversal processing on the abstract syntax tree to obtain an analysis result. Specifically, the server calls a preset syntax parsing tool and sequentially performs lexical analysis and syntax analysis on the preprocessed log data to obtain the abstract syntax tree, wherein the syntax parsing tool comprises a lexical analyzer and a syntax analyzer; the server then calls a preset traversal algorithm to traverse the abstract syntax tree to obtain the analysis result. After the preprocessed log data is obtained, the server uses the open-source parsing tool QueryParser to convert the preprocessed log data from character string format into a recursive data structure, namely the abstract syntax tree. The traversal algorithm may be depth-first search (DFS), and the traversal may be any one or a combination of pre-order, in-order and post-order traversal. The basic function of QueryParser is to convert a character string that conforms to a specific syntax into a corresponding query object, so that a compound logical query need not be constructed manually from combined objects, and a query that would otherwise take several lines of code can be expressed as a single-line character string.
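The lexical analysis, syntax analysis and depth-first traversal described above can be illustrated with a toy example. The sketch below does not use QueryParser; the tiny grammar, the `Node` class and the functions `tokenize`, `parse`, `preorder` and `extract` are all hypothetical and only handle flat statements of the form INSERT INTO ... SELECT ... FROM ... WHERE ....

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterator, List

# Identifiers (optionally schema-qualified), numbers, comparison operators, punctuation.
TOKEN_RE = re.compile(r"[A-Za-z_][\w.]*|\d+|[<>=!]+|[,*()]")
CLAUSE_KEYWORDS = {"INSERT", "SELECT", "FROM", "WHERE"}


@dataclass
class Node:
    kind: str                                   # "query", a clause name, or "token"
    value: str = ""
    children: List["Node"] = field(default_factory=list)


def tokenize(sql: str) -> List[str]:
    """Lexical analysis: split the statement into tokens."""
    return TOKEN_RE.findall(sql)


def parse(tokens: List[str]) -> Node:
    """Syntax analysis: attach the tokens after each keyword to one clause node."""
    root = Node("query")
    clause = None
    for tok in tokens:
        upper = tok.upper()
        if upper == "INTO":
            continue                            # INTO only qualifies the preceding INSERT
        if upper in CLAUSE_KEYWORDS:
            clause = Node(upper.lower())
            root.children.append(clause)
        elif clause is not None and tok != ",":
            clause.children.append(Node("token", tok))
    return root


def preorder(node: Node) -> Iterator[Node]:
    """Depth-first, pre-order traversal: visit a node before its children."""
    yield node
    for child in node.children:
        yield from preorder(child)


def extract(root: Node) -> Dict[str, List[str]]:
    """Collect the tokens of every clause node met during the traversal."""
    result: Dict[str, List[str]] = {}
    for node in preorder(root):
        if node.kind in {"insert", "select", "from", "where"}:
            result[node.kind] = [child.value for child in node.children]
    return result


sql = "INSERT INTO dw.report SELECT id, amount FROM ods.orders WHERE amount > 100"
parsed = extract(parse(tokenize(sql)))
```

Here `parsed` separates the insertion table, the requested fields, the request table and the limiting condition, which is the kind of information the later steps rely on. A production parser would also need to handle joins, subqueries and quoting.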
203. And classifying the analysis result to obtain structured query information, extracting the request time corresponding to each request table in the structured query information, and calculating to obtain a plurality of request time difference values based on the request time corresponding to each request table.
The server classifies the analysis result to obtain structured query information, extracts the request time corresponding to each request table in the structured query information, and calculates a plurality of request time difference values based on the request time corresponding to each request table. Specifically, the server calls a preset classification algorithm to classify the analysis result to obtain structured query information, wherein the structured query information comprises an insertion database, an insertion table, a request database, a request table and request time; the server obtains the request time corresponding to each request table in the structured query information and the analysis starting time in the structured query information, and calculates the difference value between the request time corresponding to each request table and the analysis starting time to obtain a plurality of request time difference values. The classification algorithm may be a K-nearest-neighbor algorithm or a Bayesian algorithm. During classification, the table name is added before each column name, and the name of the database object set (schema) is added before the table name, where a schema contains various objects such as tables, views, stored procedures and indexes; generally, one user corresponds to one schema. For example, the structured query information obtained by classifying the analysis result includes insertion databases database A, database B, database C, ...; insertion tables table A, table B, table C, ...; request databases database 1, database 2, ...; request tables table 1, table 2, ...; and request times t_1, t_2, t_3, ..., t_n. The server obtains the request times t_1, t_2, t_3, ..., t_n corresponding to the request tables in the structured query information, obtains the analysis start time t_s, sets the time difference threshold to T, and calculates the difference t_i - t_s (i = 1, 2, ..., n) between the request time corresponding to each request table and the analysis starting time to obtain a plurality of request time difference values.
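The calculation of the request time difference values can be sketched as below. Two assumptions are made that the text does not state: when a table is requested several times, only its most recent request is used, and the difference is measured as elapsed time (analysis start time minus request time) so that it is non-negative. The function name and the sample table names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def request_time_differences(
    requests: Iterable[Tuple[str, datetime]],
    analysis_start: datetime,
) -> Dict[str, timedelta]:
    """Map each request table to the time elapsed since its most recent request."""
    latest: Dict[str, datetime] = {}
    for table, requested_at in requests:
        # Assumption: only the latest request per table matters.
        if table not in latest or requested_at > latest[table]:
            latest[table] = requested_at
    # Assumption: elapsed time is analysis start minus request time.
    return {table: analysis_start - t for table, t in latest.items()}


requests = [
    ("database1.table1", datetime(2022, 1, 10)),
    ("database1.table1", datetime(2021, 6, 1)),
    ("database2.table2", datetime(2021, 1, 1)),
]
diffs = request_time_differences(requests, analysis_start=datetime(2022, 1, 13))
```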
204. And comparing the plurality of request time difference values with a preset time difference threshold value respectively to obtain a comparison result, and determining cold data in the structured query information based on the comparison result to obtain initial cold data.
The server compares the plurality of request time difference values with a preset time difference threshold value respectively to obtain a comparison result, and determines cold data in the structured query information based on the comparison result to obtain initial cold data. Specifically, the server compares the plurality of request time difference values with a preset time difference threshold value respectively to obtain comparison results, and extracts target results in the comparison results, wherein the target results are the comparison results corresponding to the request time difference values which are greater than or equal to the time difference threshold value; and the server acquires a request table corresponding to the target result to obtain a target request table, and determines all data corresponding to the target request table as initial cold data.
The server compares each of the plurality of request time difference values with the time difference threshold T to obtain a comparison result. When the request time difference is less than the threshold, i.e. t_n - t_s < T, the request table has been requested within the specified time period; otherwise, when t_n - t_s ≥ T, all data in the corresponding request table is determined as initial cold data.
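A minimal sketch of the threshold comparison follows. The 180-day threshold and the table names are illustrative assumptions, and the differences are assumed to be non-negative elapsed times.

```python
from datetime import timedelta
from typing import Dict, List


def select_cold_tables(diffs: Dict[str, timedelta], threshold: timedelta) -> List[str]:
    """Return the request tables whose time difference is >= the threshold T."""
    return sorted(table for table, diff in diffs.items() if diff >= threshold)


sample_diffs = {
    "database1.table1": timedelta(days=3),     # requested recently: not cold
    "database2.table2": timedelta(days=377),   # not requested for over a year
    "database2.table3": timedelta(days=180),   # exactly at the threshold: cold
}
initial_cold = select_cold_tables(sample_diffs, threshold=timedelta(days=180))
```

The comparison uses `>=` so that a difference exactly equal to the threshold counts as cold, matching the "greater than or equal to" wording used for the target result.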
Obtaining the request table corresponding to the target result to obtain the target request table, and determining all data corresponding to the target request table as initial cold data, further comprises: acquiring the request table corresponding to the target result, calling a preset modification tool, and modifying the table name of the request table corresponding to the target result to obtain the target request table; and determining all data corresponding to the target request table as initial cold data. After the initial cold data is obtained, the table name of the request table corresponding to the target result is modified through the preset modification tool Hive. The table is renamed rather than deleted so that data does not become unrecoverable through table deletion, which could cause an unforeseeable production accident.
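The renaming step can be sketched by generating HiveQL `ALTER TABLE ... RENAME TO ...` statements. The `_to_be_deleted` suffix and both function names are assumptions chosen for illustration; the text does not specify a naming convention, and executing the statements against Hive is left out.

```python
def rename_statement(table: str, suffix: str = "_to_be_deleted") -> str:
    """Build a HiveQL statement that renames `table` instead of dropping it."""
    database, _, name = table.rpartition(".")
    new_name = f"{name}{suffix}"
    qualified = f"{database}.{new_name}" if database else new_name
    return f"ALTER TABLE {table} RENAME TO {qualified}"


def restore_statement(renamed: str, suffix: str = "_to_be_deleted") -> str:
    """Build the inverse statement, for a table that turns out to be still in use."""
    if not renamed.endswith(suffix):
        raise ValueError(f"{renamed!r} does not carry the suffix {suffix!r}")
    return f"ALTER TABLE {renamed} RENAME TO {renamed[:-len(suffix)]}"


stmt = rename_statement("ods.orders")
undo = restore_statement("ods.orders_to_be_deleted")
```

Because the rename is reversible, a table that was wrongly marked as cold can be brought back under its original name without any data loss.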
205. And calling a preset data monitor to monitor the initial cold data to obtain a monitoring result, determining target cold data based on the monitoring result, and cleaning the target cold data.
The server calls a preset data monitor to monitor the initial cold data to obtain a monitoring result, determines target cold data based on the monitoring result, and cleans the target cold data. Specifically, the server calls the preset data monitor to monitor the initial cold data within a preset time range to obtain a monitoring result, wherein the monitoring result comprises an alarm result and a non-alarm result; the server determines the initial cold data whose monitoring result is the non-alarm result as the target cold data, and calls a preset cleaning strategy to clean the target cold data, wherein the cleaning strategy comprises a cleaning condition and a cleaning mode.
After the initial cold data is obtained, it is monitored for a period of time, usually 1 to 3 months. This embodiment does not specifically limit the data monitor, which is used to monitor whether any alarm message indicates that an accessed table does not exist, for example a prompt received by mail, short message or telephone such as "XXX table does not exist, the task XX fails to be executed, please process as soon as possible!". If the monitoring result is a non-alarm result, the corresponding initial cold data is determined as target cold data, and a preset cleaning strategy is called to clean the target cold data. The cleaning strategy comprises a cleaning condition and a cleaning mode: the cleaning condition comprises cleaning all the target cold data, and the cleaning mode comprises deleting the target cold data or transferring it to a target storage device.
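The monitoring and cleaning decision can be sketched as follows. The alarm wording matched by the regular expression, the function names and the `keep` bucket for tables that raised an alarm are assumptions; the text only states that non-alarm data becomes target cold data and is deleted or transferred.

```python
import re
from typing import Dict, Iterable, List, Set

# Hypothetical alarm wording, modelled on the example prompt in the text.
MISSING_TABLE = re.compile(r"(\S+) table does not exist")


def tables_with_alarms(messages: Iterable[str]) -> Set[str]:
    """Extract the table names mentioned in 'table does not exist' alarms."""
    found: Set[str] = set()
    for message in messages:
        match = MISSING_TABLE.search(message)
        if match:
            found.add(match.group(1))
    return found


def plan_cleanup(
    initial_cold: Iterable[str],
    messages: Iterable[str],
    mode: str = "delete",
) -> Dict[str, List[str]]:
    """Split initial cold tables into those to keep and those to clean."""
    if mode not in {"delete", "transfer"}:
        raise ValueError("mode must be 'delete' or 'transfer'")
    alarmed = tables_with_alarms(messages)
    plan: Dict[str, List[str]] = {"keep": [], mode: []}
    for table in initial_cold:
        # An alarm means some task still accesses the table, so it is not cold.
        plan["keep" if table in alarmed else mode].append(table)
    return plan


alarms = [
    "ods.orders table does not exist, the task daily_report fails to be executed, "
    "please process as soon as possible!",
]
plan = plan_cleanup(["ods.orders", "ods.users"], alarms, mode="delete")
```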
In the embodiment of the invention, the preprocessed log data is analyzed through a preset analysis tool to obtain an analysis result, the analysis result is classified to obtain the structured query information, the request time corresponding to each request table in the structured query information is extracted, a plurality of request time difference values are obtained through calculation, the plurality of request time difference values are compared with the preset time difference threshold value respectively to obtain a comparison result, the cold data in the structured query information is determined based on the comparison result, and the cold data cleaning efficiency is improved.
The cold data cleaning method in the embodiment of the present invention is described above. The cold data cleansing apparatus in the embodiment of the present invention is described below with reference to fig. 3. An embodiment of the cold data cleansing apparatus includes:
the preprocessing module 301 is configured to obtain initial log data from a preset database, preprocess the initial log data to obtain preprocessed log data, and invoke a preset analysis tool to analyze the preprocessed log data to obtain an analysis result;
the classification module 302 is configured to perform classification processing on the analysis result to obtain structured query information, extract request time corresponding to each request table in the structured query information, and calculate a plurality of request time difference values based on the request time corresponding to each request table;
a comparison module 303, configured to compare the multiple request time difference values with a preset time difference threshold respectively to obtain a comparison result, and determine cold data in the structured query information based on the comparison result to obtain initial cold data;
and the monitoring module 304 is configured to invoke a preset data monitor to monitor the initial cold data, obtain a monitoring result, determine target cold data based on the monitoring result, and clean the target cold data.
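The four modules above form a linear pipeline, which can be sketched as a small class. This is only an architectural sketch: the class name `ColdDataCleaner`, the callable-based design and the stand-in lambdas are assumptions, and each real module would carry the logic described in the method embodiment.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ColdDataCleaner:
    preprocess: Callable[[Any], Any]   # role of preprocessing module 301
    classify: Callable[[Any], Any]     # role of classification module 302
    compare: Callable[[Any], Any]      # role of comparison module 303
    monitor: Callable[[Any], Any]      # role of monitoring module 304

    def run(self, source: Any) -> Any:
        """Chain the four modules in the order given in the description."""
        parsed = self.preprocess(source)
        diffs = self.classify(parsed)
        initial_cold = self.compare(diffs)
        return self.monitor(initial_cold)


# Stand-in callables: entries are (table, days since last request) pairs.
cleaner = ColdDataCleaner(
    preprocess=lambda logs: [entry for entry in logs if entry],
    classify=lambda parsed: {table: days for table, days in parsed},
    compare=lambda diffs: [t for t, days in diffs.items() if days >= 180],
    monitor=lambda cold: [t for t in cold if t != "t_alarm"],
)
target_cold = cleaner.run([("t1", 400), None, ("t2", 3), ("t_alarm", 500)])
```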
In the embodiment of the invention, the preprocessed log data is analyzed through a preset analysis tool to obtain an analysis result, the analysis result is classified to obtain the structured query information, the request time corresponding to each request table in the structured query information is extracted, a plurality of request time difference values are obtained through calculation, the plurality of request time difference values are compared with the preset time difference threshold value respectively to obtain a comparison result, the cold data in the structured query information is determined based on the comparison result, and the cold data cleaning efficiency is improved.
Referring to fig. 4, another embodiment of the cold data cleansing apparatus according to the embodiment of the present invention includes:
the preprocessing module 301 is configured to obtain initial log data from a preset database, preprocess the initial log data to obtain preprocessed log data, and invoke a preset analysis tool to analyze the preprocessed log data to obtain an analysis result;
the classification module 302 is configured to perform classification processing on the analysis result to obtain structured query information, extract request time corresponding to each request table in the structured query information, and calculate a plurality of request time difference values based on the request time corresponding to each request table;
a comparison module 303, configured to compare the multiple request time difference values with a preset time difference threshold respectively to obtain a comparison result, and determine cold data in the structured query information based on the comparison result to obtain initial cold data;
and the monitoring module 304 is configured to invoke a preset data monitor to monitor the initial cold data, obtain a monitoring result, determine target cold data based on the monitoring result, and clean the target cold data.
Optionally, the preprocessing module 301 includes:
an obtaining unit 3011, configured to obtain initial log data from a preset database, and perform missing value completion, abnormal value filtering, and repeated value filtering on the initial log data to obtain preprocessed log data, where the preprocessed log data includes a request base table, a request field, an insertion database, an insertion field, and a limitation condition;
and the traversing unit 3012 is configured to invoke a preset parsing tool, convert the preprocessed log data into an abstract syntax tree, and perform traversal processing on the abstract syntax tree to obtain a parsing result.
Optionally, the traversal unit 3012 may be further specifically configured to:
calling a preset syntax analysis tool, and sequentially performing lexical analysis and syntax analysis on the preprocessed log data to obtain an abstract syntax tree, wherein the syntax analysis tool comprises a lexical analyzer and a syntax analyzer; and calling a preset traversal algorithm to perform traversal processing on the abstract syntax tree to obtain an analysis result.
Optionally, the classification module 302 includes:
a classification unit 3021, configured to invoke a preset classification algorithm, and perform classification processing on the analysis result to obtain structured query information, where the structured query information includes an insertion database, an insertion table, a request database, a request table, and a request time;
the calculating unit 3022 is configured to obtain a request time and an analysis start time corresponding to each request table in the structured query information, and calculate a difference between the request time corresponding to each request table and the analysis start time in the structured query information to obtain a plurality of request time differences.
Optionally, the comparison module 303 includes:
a comparison unit 3031, configured to compare the multiple request time difference values with preset time difference thresholds respectively to obtain comparison results, and extract a target result in the comparison results, where the target result is a comparison result corresponding to the request time difference value being greater than or equal to the time difference threshold;
the determining unit 3032 is configured to obtain a request table corresponding to the target result, obtain a target request table, and determine all data corresponding to the target request table as initial cold data.
Optionally, the determining unit 3032 may further be specifically configured to:
acquiring a request table corresponding to a target result, calling a preset modification tool, and modifying the table name of the request table corresponding to the target result to obtain a target request table; and determining all data corresponding to the target request table as initial cold data.
Optionally, the monitoring module 304 includes:
a monitoring unit 3041, configured to invoke a preset data monitor to monitor the initial cold data within a preset time range, so as to obtain a monitoring result, where the monitoring result includes an alarm result and a non-alarm result;
the cleaning unit 3042 is configured to determine the initial cold data whose monitoring result is the non-alarm result as the target cold data, and invoke a preset cleaning policy to perform cleaning processing on the target cold data, where the cleaning policy includes a cleaning condition and a cleaning manner.
In the embodiment of the invention, the preprocessed log data is analyzed through a preset analysis tool to obtain an analysis result, the analysis result is classified to obtain the structured query information, the request time corresponding to each request table in the structured query information is extracted, a plurality of request time difference values are obtained through calculation, the plurality of request time difference values are compared with the preset time difference threshold value respectively to obtain a comparison result, the cold data in the structured query information is determined based on the comparison result, and the cold data cleaning efficiency is improved.
Fig. 3 and fig. 4 describe the cold data cleansing apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the cold data cleansing device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a cold data cleansing device 500 according to an embodiment of the present invention. The cold data cleansing device 500 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage medium 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of computer program operations for the cold data cleansing device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the cold data cleansing device 500, the series of computer program operations in the storage medium 530.
The cold data cleansing device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the structure shown in fig. 5 does not constitute a limitation of the cold data cleansing device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
The present application further provides a cold data cleaning device, including: a memory having a computer program stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor calls the computer program in the memory to cause the cold data cleansing apparatus to perform the steps of the cold data cleansing method described above.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored thereon a computer program, which, when run on a computer, causes the computer to perform the steps of the cold data cleansing method.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several computer programs to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A cold data cleaning method, comprising:
acquiring initial log data from a preset database, preprocessing the initial log data to obtain preprocessed log data, and calling a preset analysis tool to analyze the preprocessed log data to obtain an analysis result;
classifying the analysis result to obtain structured query information, extracting request time corresponding to each request table in the structured query information, and calculating to obtain a plurality of request time difference values based on the request time corresponding to each request table;
comparing the plurality of request time difference values with a preset time difference threshold value respectively to obtain comparison results, and determining cold data in the structured query information based on the comparison results to obtain initial cold data;
and calling a preset data monitor to monitor the initial cold data to obtain a monitoring result, determining target cold data based on the monitoring result, and cleaning the target cold data.
2. The cold data cleaning method according to claim 1, wherein the obtaining initial log data from a preset database, preprocessing the initial log data to obtain preprocessed log data, and invoking a preset parsing tool to parse the preprocessed log data to obtain a parsing result comprises:
acquiring initial log data from a preset database, and performing missing value completion, abnormal value filtration and repeated value filtration on the initial log data to obtain preprocessed log data, wherein the preprocessed log data comprises a request base table, a request field, an insertion database, an insertion field and a limiting condition;
and calling a preset analysis tool, converting the preprocessed log data into an abstract syntax tree, and performing traversal processing on the abstract syntax tree to obtain an analysis result.
3. The method of claim 2, wherein the invoking a preset parsing tool to convert the preprocessed log data into an abstract syntax tree, and performing traversal processing on the abstract syntax tree to obtain a parsing result comprises:
calling a preset syntax analysis tool, and sequentially performing lexical analysis and syntax analysis on the preprocessed log data to obtain an abstract syntax tree, wherein the syntax analysis tool comprises a lexical analyzer and a syntax analyzer;
and calling a preset traversal algorithm to perform traversal processing on the abstract syntax tree to obtain an analysis result.
4. The method of claim 1, wherein the classifying the parsing result to obtain structured query information, extracting a request time corresponding to each request table in the structured query information, and calculating a plurality of request time difference values based on the request time corresponding to each request table comprises:
calling a preset classification algorithm, and performing classification processing on the analysis result to obtain structured query information, wherein the structured query information comprises an insertion database, an insertion table, a request database, a request table and request time;
acquiring request time corresponding to each request table in the structured query information and analysis starting time in the structured query information, and calculating a difference value between the request time corresponding to each request table and the analysis starting time to obtain a plurality of request time difference values.
5. The method according to claim 1, wherein the comparing the plurality of request time difference values with a preset time difference threshold value respectively to obtain a comparison result, and determining the cold data in the structured query information based on the comparison result to obtain initial cold data comprises:
comparing the plurality of request time difference values with a preset time difference threshold respectively to obtain comparison results, and extracting a target result in the comparison results, wherein the target result is the comparison result corresponding to the request time difference value which is greater than or equal to the time difference threshold;
and acquiring a request table corresponding to the target result to obtain a target request table, and determining all data corresponding to the target request table as initial cold data.
6. The method according to claim 5, wherein the obtaining the request table corresponding to the target result to obtain a target request table, and determining all data corresponding to the target request table as initial cold data further comprises:
acquiring a request table corresponding to the target result, calling a preset modification tool, and modifying the table name of the request table corresponding to the target result to obtain a target request table;
and determining all data corresponding to the target request table as initial cold data.
7. The cold data cleaning method according to any one of claims 1 to 6, wherein the calling of a preset data monitor monitors the initial cold data to obtain a monitoring result, the target cold data is determined based on the monitoring result, and the cleaning of the target cold data comprises:
calling a preset data monitor, and monitoring the initial cold data within a preset time range to obtain a monitoring result, wherein the monitoring result comprises an alarm result and a non-alarm result;
and determining the initial cold data whose monitoring result is the non-alarm result as target cold data, and calling a preset cleaning strategy to clean the target cold data, wherein the cleaning strategy comprises a cleaning condition and a cleaning mode.
8. A cold data cleansing apparatus, comprising:
the preprocessing module is used for acquiring initial log data from a preset database, preprocessing the initial log data to obtain preprocessed log data, and calling a preset analysis tool to analyze the preprocessed log data to obtain an analysis result;
the classification module is used for classifying the analysis result to obtain structured query information, extracting the request time corresponding to each request table in the structured query information, and calculating to obtain a plurality of request time difference values based on the request time corresponding to each request table;
the comparison module is used for comparing the plurality of request time difference values with a preset time difference threshold value respectively to obtain a comparison result, and determining cold data in the structured query information based on the comparison result to obtain initial cold data;
and the monitoring module is used for calling a preset data monitor to monitor the initial cold data to obtain a monitoring result, determining target cold data based on the monitoring result and cleaning the target cold data.
9. A cold data cleansing device, comprising: a memory and at least one processor, the memory having stored therein a computer program;
the at least one processor calls the computer program in the memory to cause the cold data cleansing device to perform the cold data cleansing method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the cold data cleaning method according to any one of claims 1 to 7.
CN202210038324.9A 2022-01-13 2022-01-13 Cold data cleaning method, device, equipment and storage medium Pending CN114385668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210038324.9A CN114385668A (en) 2022-01-13 2022-01-13 Cold data cleaning method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210038324.9A CN114385668A (en) 2022-01-13 2022-01-13 Cold data cleaning method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114385668A true CN114385668A (en) 2022-04-22

Family

ID=81201157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210038324.9A Pending CN114385668A (en) 2022-01-13 2022-01-13 Cold data cleaning method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114385668A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996306A (en) * 2022-08-04 2022-09-02 北京首信科技股份有限公司 Data management method and system based on multiple dimensions
CN115934849A (en) * 2023-03-13 2023-04-07 安徽中科晶格技术有限公司 Method, device, node and storage medium for identifying workload certification of blocks


Similar Documents

Publication Publication Date Title
CN109034993B (en) Account checking method, account checking equipment, account checking system and computer readable storage medium
EP3251031B1 (en) Techniques for compact data storage of network traffic and efficient search thereof
CN109656999B (en) Method, device, storage medium and apparatus for synchronizing large data volume data
CN113676464A (en) Network security log alarm processing method based on big data analysis technology
CN111027615B (en) Middleware fault early warning method and system based on machine learning
CN110502509B (en) Traffic big data cleaning method based on Hadoop and Spark framework and related device
CN108521339B (en) Feedback type node fault processing method and system based on cluster log
CN114385668A (en) Cold data cleaning method, device, equipment and storage medium
CN111447224A (en) Web vulnerability scanning method and vulnerability scanner
US11568344B2 (en) Systems and methods for automated pattern detection in service tickets
CN114721856A (en) Service data processing method, device, equipment and storage medium
CN115269438A (en) Automatic testing method and device for image processing algorithm
CN107871055B (en) Data analysis method and device
CN115858504A (en) Multidimensional data fusion management system and method for Internet of things platform and storage medium
CN111967885A (en) Intelligent outbound call processing method and device
CN113242157B (en) Centralized data quality monitoring method under distributed processing environment
CN114465875B (en) Fault processing method and device
CN111414567A (en) Data processing method and device
CN112269879B (en) Method and equipment for analyzing middle station log based on k-means algorithm
CN112597498A (en) Webshell detection method, system and device and readable storage medium
KR20220069229A (en) The method of coupling with heterogeneous data using relation of fields in data
CN111611483A (en) Object portrait construction method, device, equipment and storage medium
KR20210060829A (en) Big data platform managing method and device
CN116795663B (en) Method for tracking and analyzing execution performance of trino engine
CN114860847B (en) Data link processing method, system and medium applied to big data platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination