CN114443580A - Data cleaning method, device, medium and computing equipment - Google Patents

Data cleaning method, device, medium and computing equipment

Info

Publication number
CN114443580A
Authority
CN
China
Prior art keywords
application
component
cleaning
data
target folder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210106424.0A
Other languages
Chinese (zh)
Inventor
杨斌杰
余利华
蒋鸿翔
姚琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202210106424.0A
Publication of CN114443580A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162Delete operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3055Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44521Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F9/44526Plug-ins; Add-ons
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022Mechanisms to release resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45562Creating, deleting, cloning virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present disclosure provides a data cleaning method, including: monitoring the working state of an application, the working state comprising at least an application end event; in response to the application end event, sending an application identifier of the application to a first component, so that the first component sends a cleaning request, comprising the application identifier, to a second component; and, in response to the second component finding a target folder containing the application identifier, performing a cleaning operation on the target folder, the target folder having been created based on the application identifier when the application wrote its data into a container. In this process, on one hand, targeted cleaning of the application is achieved, avoiding the disk input/output consumed by large-batch cleaning operations and preventing impact on other services; on the other hand, timely and efficient cleaning is achieved, accelerating resource recovery and minimizing the impact of the space occupied by the data.

Description

Data cleaning method, device, medium and computing equipment
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a data cleaning method, a data cleaning device, a data cleaning medium and a computing device.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
During operation, an application usually generates a lot of data. For example, while a user browses with a browser, temporary files are generated so that the browser can respond quickly when the user uses it again.
However, these temporary files keep accumulating. If they are not cleaned in time, the data may occupy too much space and leave insufficient storage, which is likely to cause the application to stall. Performing a data cleaning operation on the data therefore alleviates application stalling and prevents the application's running speed from degrading.
Disclosure of Invention
In this context, embodiments of the present disclosure are intended to provide a data cleaning method and apparatus.
In a first aspect of the disclosed embodiments, there is provided a data cleaning method, applied to a resource scheduler on which a first component, a second component and a containerized application are deployed, the method including:
monitoring the working state of the application, the working state comprising at least an application end event;
in response to the application end event, sending an application identifier of the application to the first component, so that the first component sends a cleaning request to the second component, the cleaning request comprising the application identifier;
in response to the second component finding a target folder containing the application identifier, performing a cleaning operation on the target folder, the target folder having been created based on the application identifier when the application wrote its data into a container.
In a second aspect of the disclosed embodiments, there is provided a data cleaning apparatus, applied to a resource scheduler on which a first component, a second component and a containerized application are deployed, the apparatus including:
a monitoring module, configured to monitor the working state of the application, the working state comprising at least an application end event;
a sending module, configured to, in response to the application end event, send an application identifier of the application to the first component, so that the first component sends a cleaning request to the second component, the cleaning request comprising the application identifier;
a cleaning module, configured to, in response to the second component finding the target folder containing the application identifier, perform a cleaning operation on the target folder, the target folder having been created based on the application identifier when the application wrote its data into a container.
In a third aspect of embodiments of the present disclosure, there is provided a storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the following method:
monitoring the working state of the application, the working state comprising at least an application end event;
in response to the application end event, sending an application identifier of the application to the first component, so that the first component sends a cleaning request to the second component, the cleaning request comprising the application identifier;
in response to the second component finding a target folder containing the application identifier, performing a cleaning operation on the target folder, the target folder having been created based on the application identifier when the application wrote its data into a container.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising:
a processor; and a memory for storing processor-executable instructions;
wherein the processor, by executing the executable instructions, implements the steps of the following method:
monitoring the working state of the application, the working state comprising at least an application end event;
in response to the application end event, sending an application identifier of the application to the first component, so that the first component sends a cleaning request to the second component, the cleaning request comprising the application identifier;
in response to the second component finding a target folder containing the application identifier, performing a cleaning operation on the target folder, the target folder having been created based on the application identifier when the application wrote its data into a container.
The above embodiments of the present disclosure have at least the following advantages:
By monitoring the working state of the application, sending the application identifier to the first component when the application ends, and having the first component send a cleaning request for the application to the second component, the second component then finds the folder corresponding to the application and performs the cleaning operation. Through this technical solution, on one hand, targeted cleaning of the application can be achieved, avoiding the disk input/output consumed by large-batch cleaning operations and preventing impact on the normal operation of other services; on the other hand, data can be cleaned timely and efficiently, accelerating resource recovery, minimizing the impact of the space occupied by the data, and saving cost.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a flowchart of a data cleaning method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flowchart of a data cleaning method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a block diagram of a data cleaning apparatus according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of a data cleaning medium according to an embodiment of the present disclosure;
fig. 5 schematically shows a schematic diagram of an electronic device capable of implementing the above method according to an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
One skilled in the art will appreciate that embodiments of the present disclosure can be implemented as a system, apparatus, device, method, or computer-readable storage medium. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the disclosure, a data cleaning method, a data cleaning device, a data cleaning medium and a computing device are provided.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of The Invention
The inventors found that data cleaning is usually performed only on a timed schedule: a cleaning operation is executed on all data after a fixed time interval. With such timed cleaning, if the time interval is set too long, a single cleaning covers too much data and occupies a large amount of disk input/output resources, which may affect the normal operation of other services. Conversely, although a short time interval avoids cleaning too much data at once, it may delete data that the application still needs, affecting normal use of the application, inconveniencing the user, and causing duplicated resource consumption.
In view of this, the present specification provides a technical solution that implements targeted cleaning of an application: the working state of the application is monitored; when the application ends, its application identifier is sent to a first component; the first component sends a cleaning request for the application to a second component; and the second component then finds the folder corresponding to the application and performs the cleaning operation.
The core technical concept of the specification is as follows:
and creating a folder for storing the application data according to the application identifier corresponding to the application while writing the data corresponding to the application into the container. Further, when the life cycle of the application is finished, the resource scheduler may monitor the working state of the application, and in response to an application end event, send the application identifier to the first component, and then the first component controls the second component to clean the folder containing the application identifier.
Through this technical solution, on one hand, targeted cleaning of the application can be achieved, avoiding the disk input/output consumed by large-batch cleaning operations and preventing impact on the normal operation of other services; on the other hand, data can be cleaned timely and efficiently, accelerating resource recovery, minimizing the impact of the space occupied by the data, and saving cost.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Overview of the Application Scenario
As noted above, while using various application clients, a user's device usually accumulates temporary files in its storage space so that the application can be invoked quickly the next time it is used; such temporary files are also called cache data. For example, the cache data may be temporary files generated when the user accesses the network with a browser, temporary files generated when the user views pictures or videos with social software, files downloaded when the user updates an application, and so on.
With the rapid development of cloud computing, applications no longer need to be installed on a local device; a user can manage and use applications through a resource scheduler provided by a cloud platform, on which various containerized applications are typically deployed.
For example, the resource scheduler may be Kubernetes, abbreviated "k8s," a system for deploying, scaling, and managing containerized applications. In Kubernetes, multiple containers may be created, with an application instance running inside each container. Containers run inside a Pod, the smallest computing unit in Kubernetes; a Pod may hold one or more relatively tightly coupled containers. In Kubernetes, the Pod is the carrier of the application, and the application can be accessed through the Pod's IP.
Kubernetes is essentially a server cluster. A Kubernetes cluster mainly comprises a control node (Master) and working nodes (Nodes), each with different components installed. The Master is the gateway and hub of the cluster and is responsible for cluster decisions; a Node receives working instructions from the Master, creates and destroys Pod objects accordingly, and adjusts network rules for proper routing and traffic forwarding.
Containerized applications likewise leave behind residual temporary files that need to be cleaned in time. Moreover, unlike devices used by ordinary consumers, not only the performance impact of the occupied space but also the cost it incurs must be considered.
Taking Kubernetes as an example, timed cleaning can be performed only after the container is closed. If the time interval is too short, application data required by a cloud computing task may be deleted, forcing part of the task to be recomputed, hurting computing efficiency and raising computing cost. If the time interval is too long, too much data is cleaned at once, occupying a large amount of disk input/output resources, wasting precious computing resources on cleaning and again raising the cost of cloud computing. In addition, when clearing cache data, Kubernetes is not application-aware and cannot distinguish the cache data belonging to each application.
It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
Exemplary method
The technical idea of the present specification is described in detail below through specific examples.
In this technical solution, the working state of the application is monitored; when the application ends, the application identifier is sent to the first component; the first component sends a cleaning request for the application to the second component; and the second component finds the folder corresponding to the application and performs the cleaning operation, thereby achieving targeted cleaning of the application.
Referring to FIG. 1, FIG. 1 is a flowchart of a data cleaning method according to an exemplary embodiment; the method is applied to a resource scheduler and includes the following steps:
Step 101: the resource scheduler monitors the working state of the application, the working state comprising at least an application end event; the first component, the second component and the containerized application are deployed on the resource scheduler.
for example, the resource scheduler may determine the working state of the application by monitoring an event log corresponding to the application to determine whether the life cycle of the application is finished.
Step 102: in response to the application end event, the application identifier of the application is sent to the first component, so that the first component sends a cleaning request to the second component, the cleaning request comprising the application identifier.
For example, when the application's task ends, a preset function may be called to execute the corresponding logic and send the application identifier of the application to a specified address, where it is received by the first component; the first component then sends a cleaning request for the application to the second component based on the application identifier.
Step 103: in response to the second component finding the target folder containing the application identifier, a cleaning operation is performed on the target folder; the target folder is created based on the application identifier when the application writes its data into a container.
For example, after receiving the cleaning request, the second component may search the directory it monitors for a target folder whose name contains the application identifier and, if one exists, perform a cleaning operation on it.
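The patent does not disclose source code, so the following Python sketch of the second component's lookup-and-clean step is an illustration only; the directory layout and function name are assumptions:

```python
import shutil
from pathlib import Path

def clean_application_folders(monitored_dir: str, application_id: str) -> list:
    """Delete every folder under monitored_dir whose name contains the
    given application identifier; return the paths that were removed."""
    removed = []
    for entry in Path(monitored_dir).iterdir():
        if entry.is_dir() and application_id in entry.name:
            shutil.rmtree(entry)          # targeted: only this application's data
            removed.append(str(entry))
    return removed
```

Because only folders matching the identifier are touched, other applications' data on the same node is left untouched, which is the "targeted cleaning" property claimed above.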
Through this technical solution, on one hand, targeted cleaning of the application can be achieved, avoiding the disk input/output consumed by large-batch cleaning operations and preventing impact on the normal operation of other services; on the other hand, data can be cleaned timely and efficiently, accelerating resource recovery, minimizing the impact of the space occupied by the data, and saving cost.
In the above step 101, the first component, the second component and the containerized application are deployed on the resource scheduler, and the resource scheduler may monitor the working state of the application.
Wherein the operating state comprises at least an application end event.
The resource scheduler may be the Kubernetes mentioned above, or another system used on a cloud platform to manage online applications.
In one embodiment shown, the application is built by a compute engine and used to perform data computations.
Specifically, the application may be built by a computing engine that supports large-scale data processing and be used to run large-scale data analysis workloads, such as SQL queries, text processing, or machine learning.
Obviously, since a data analysis application handles a large volume of data during large-scale processing, timely, efficient and targeted cleaning of its data is all the more desirable.
The computing engine may be, for example, the Apache Spark computing engine or the Apache Hadoop computing engine.
Taking Spark as an example, the computing engine defines running events for an application to represent its complete life cycle, covering the whole process from application submission to application termination; it also provides a history log service, so the application's event logs can be stored and analyzed to further optimize application performance.
In step 102, when the application's task ends, the resource scheduler may monitor the application end event and, in response, send the application identifier of the application to the first component, so that the first component, which provides the auxiliary service, sends a cleaning request containing the application identifier to the second component.
As can be seen from the foregoing, the first component and the second component are deployed on the resource scheduler. It should be noted that the first component does not itself execute the cleaning operation; it acts as a bridge between the resource scheduler and the second component, notifying the second component to execute the cleaning operation for the application.
For example, the running event emitted when a Spark-built application finishes its task is the application end event. By monitoring the running events of the Spark application, the resource scheduler may, in response to the Spark application end event, send the Spark application's identifier to the first component; the first component then sends a cleaning request containing the Spark application identifier to the second component.
In one embodiment, a preset function may be called in response to the application end event to send an end message to the first component via a specified address, the end message including the application identifier of the application.
For example, to send the application identifier from within the Spark application, a listener class ExtraListener may be defined that extends the SparkListener class in the Spark source code; ExtraListener supplements Spark's built-in listening for the application, so that when the Spark application's task ends, calling ExtraListener sends an end message containing the ApplicationId to the service at the specified address, i.e., to the first component.
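An actual implementation would be a Scala class extending Spark's SparkListener and overriding its application-end callback; as a hedged, language-agnostic sketch of that callback shape (class name and message format are assumptions, and the transport is abstracted into an injected sender), it might look like:

```python
import json

class ExtraListenerSketch:
    """Loose analogue of a SparkListener hook: when the application ends,
    an end message carrying the ApplicationId is handed to a sender that
    delivers it to the first component's specified address."""

    def __init__(self, application_id: str, send_to_first_component):
        self.application_id = application_id
        self.send = send_to_first_component   # e.g. an HTTP or UDP client

    def on_application_end(self) -> str:
        # Build the end message containing the ApplicationId and deliver it.
        message = json.dumps({"event": "applicationEnd",
                              "applicationId": self.application_id})
        self.send(message)
        return message
```

Injecting the sender keeps the listener decoupled from the transport, mirroring how the patent leaves the delivery mechanism to the deployment.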
As can be seen from the foregoing, the resource scheduler may comprise control nodes and computing nodes, with the computing nodes executing the application's computation; the second component, which cleans the application data, may therefore be deployed on the computing nodes.
In one embodiment shown, the second component is deployed on a plurality of compute nodes in the resource scheduler;
further, the first component may send the cleaning request to a second component according to a second component list stored in advance; the second component list records address information corresponding to the second component.
Because the window for sending the task end message when a task finishes is short, it is not suitable for the resource scheduler to send the cleaning request directly to the second component; the first component is therefore deployed as a bridge between the resource scheduler and the second component.
For example, after receiving the application end message, the first component may send a cleaning request to the second component deployed on each computing node, according to the address information recorded in the pre-stored second-component list. When a second component is deployed, the host and IP information of its node can be registered with the first component, so that the first component adds the second component's information to the list.
It should be noted that, because daily service may involve many nodes, sending the cleaning request over per-node TCP connections would cause significant overhead and affect the resource stability of the online cluster. The message can therefore be transmitted by UDP broadcast, reducing the overhead.
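As an illustrative sketch of the broadcast idea (the message format, port, and function name are assumptions, not disclosed by the patent), the first component could emit the cleaning request as a single UDP datagram instead of opening one TCP connection per node:

```python
import json
import socket

def broadcast_clean_request(application_id: str,
                            addr: tuple = ("255.255.255.255", 9876)) -> bytes:
    """Send the cleaning request as one UDP datagram; with a broadcast
    address, every second component listening on the port receives it."""
    payload = json.dumps({"action": "clean",
                          "applicationId": application_id}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, addr)   # one datagram, no per-node handshake
    return payload
```

UDP is connectionless, so the per-request cost stays constant as nodes are added; the trade-off is that delivery is best-effort, which is acceptable here because a missed request only delays cleanup.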
In one embodiment, copies of the second component are created, in an image-based (mirrored) fashion, on multiple computing nodes in the resource scheduler;
further, the second-component list records the address information of the computing node where each copy resides, registered with the first component when the copy is created.
For example, when deployed, the second component may be replicated from one computing node to multiple computing nodes as an image, so that a copy of the second component is created on each computing node. When created, each copy registers the address information of its computing node with the first component.
In one illustrated embodiment, in response to a new computing node being added to the resource scheduler, a copy of the second component is created on the new node in the same image-based fashion.
For example, when a new computing node joins the resource scheduler, another copy of the second component can be created on it from the image; when created, this copy likewise registers the address information of the new computing node with the first component.
Taking Kubernetes as an example: in Kubernetes, a DaemonSet (daemon set) is started, and a DaemonSet ensures that a copy of a Pod runs on all (or some) nodes. When a node joins the cluster, a Pod is added to it; when the node is removed from the cluster, its Pods are reclaimed; and if the DaemonSet is deleted, all the Pods it created are deleted.
In step 103, after receiving the cleaning request sent by the first component, the second component may search whether a target folder containing the application identifier exists, and if so, perform a cleaning operation on the target folder.
It should be noted that the target folder is created by the application based on the application identifier when the application writes data of the application into the container.
Taking an application built with Spark as an example: because no application identifier is attached when Shuffle data is written locally, after the Spark executor exits the resource scheduler cannot determine which application the local data belongs to. Therefore, by modifying createLocalDirs in Spark's DiskBlockManager, the name of the Shuffle folder generated by the Spark executor can be given the Spark ApplicationId, i.e., the application identifier, as a prefix.
Continuing the example, the second component can look up whether a folder whose name contains the ApplicationId exists in the local Shuffle directory, and if so, perform a cleaning operation on that folder.
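The lookup-and-delete step above can be sketched in Python. The directory layout and folder naming are assumptions for illustration; the sketch only shows the matching rule (folder name contains the application id) and the directed removal.

```python
import os
import shutil

def clean_shuffle_dirs(base_dir: str, app_id: str) -> list:
    """Remove folders under base_dir whose names contain app_id.

    Mirrors the cleaner behaviour described above: scan one directory,
    match on the application identifier, and delete only the matching
    folders, leaving other applications' data untouched.
    """
    removed = []
    for name in os.listdir(base_dir):
        path = os.path.join(base_dir, name)
        if app_id in name and os.path.isdir(path):
            shutil.rmtree(path)  # directed cleanup: only this app's data
            removed.append(name)
    return sorted(removed)
```

Because the deletion is scoped to one application's folders, it avoids the disk I/O burst of a bulk sweep over all cached data.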
In an illustrated embodiment, in response to the second component finding the target folder containing the application identifier in the specified folder directory, a cleaning operation on the target folder may be performed.
For example, after receiving the cleaning request sent by the first component, the second component may scan whether a target folder containing the application identifier exists under the directory specified for the second component, and if so, perform a cleaning operation on the target folder.
As can be seen from the foregoing, the first component may send the cleaning request to the second component by UDP broadcast, and UDP broadcasts can suffer packet loss; a timer may therefore be set for the second component to prevent data from going uncleaned because a request was lost.
In one embodiment shown, a timer is preset in the second component;
and further, in response to the timer being triggered, the second component is invoked to perform a cleaning operation on folders whose existing duration exceeds a threshold.
For example, a timer may be set in the second component; when the timer fires, folders that have existed longer than the threshold are deleted, so that residual cache data is cleared even if a cleaning request was lost.
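The age-based fallback sweep can be sketched as follows. The threshold value and the use of the folder's modification time as its age are assumptions; the patent only specifies that folders existing longer than a threshold are cleaned.

```python
import os
import shutil
import time

MAX_AGE_SECONDS = 24 * 3600  # assumed threshold; tune per workload

def sweep_stale_dirs(base_dir: str, now: float = None,
                     max_age: float = MAX_AGE_SECONDS) -> list:
    """Timer-driven fallback: delete folders older than max_age.

    Guards against a lost UDP cleaning request leaving cache data behind.
    Folder age is approximated by modification time (an assumption).
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(base_dir):
        path = os.path.join(base_dir, name)
        if os.path.isdir(path) and now - os.path.getmtime(path) > max_age:
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

In a deployment this function would run on a periodic timer inside each cleaner instance; here it is called directly.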
In one embodiment shown, the container is mounted to a designated disk;
further, when the application writes the data of the application into a container, a target folder is created in the disk mounted by the resource scheduler based on the application identification.
Taking Kubernetes as an example: in daily business, the generated cache data can be very large. If the storage space provided by Kubernetes itself is used, first, that space is limited and insufficient to support a huge data volume, so that when a large amount of data is written the Pod may crash, causing data loss; second, its read/write rate is lower than that of a physical disk.
Thus, the container may be mounted to a designated disk, and that disk used to cache the data. Further, when the application writes its data into the container, a folder may be created on the mounted designated disk based on the application identifier.
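The folder-creation step on the mounted disk can be sketched as follows; the naming convention (app id plus a suffix) is an illustrative assumption, the key point being that the application identifier appears in the folder name so the cleaner can later attribute the data.

```python
import os

def create_app_dir(mount_point: str, app_id: str) -> str:
    """Create the application's cache folder on the mounted disk.

    The folder name carries the application id so that a cleaner scanning
    this directory can map the data back to its application. The "-shuffle"
    suffix is illustrative only.
    """
    path = os.path.join(mount_point, f"{app_id}-shuffle")
    os.makedirs(path, exist_ok=True)  # idempotent: safe if it already exists
    return path
```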
Through the technical solution above, on one hand, directed cleaning per application can be achieved, avoiding the disk I/O consumption caused by large-batch cleaning operations and the impact on the normal operation of other services; on the other hand, data can be cleaned in a timely and efficient manner, speeding up resource reclamation, minimizing the impact of the space the data occupies, and saving cost.
Next, taking Spark on Kubernetes as an example, the process of cleaning residual Shuffle files according to the Spark Application status is described. Referring to fig. 2, fig. 2 is a flowchart of a data cleansing method according to an exemplary embodiment, including the following steps:
In step 201, a first component Cleaner-GateWay is started on a Kubernetes cluster.
The first component Cleaner-GateWay can listen on two pre-configured ports: one port communicates with the resource scheduler and receives the application end message; the other communicates with the second component, receiving its registration information and sending cleaning requests to it.
In step 202, a second component Block-Cleaner is started on the Kubernetes cluster.
The second component Block-Cleaner may take the form of a DaemonSet, with one Block-Cleaner instance generated on each schedulable Kubernetes node. When each Block-Cleaner instance is generated, the Host and IP information of the node where it is located is registered with the configured first component Cleaner-GateWay.
In step 203, the Spark application starts, creating a folder containing the Spark AppId prefix.
When the Spark application starts, a preset listener ExtraListener is loaded, and Spark executors are pulled up in the Kubernetes cluster; further, when the Spark executors start, a folder used during the Shuffle stage is created on the disk of the node where each executor is located, with the Spark AppId serving as the application identifier in the folder name.
ExtraListener inherits from the SparkListener class in the Spark source code, and sends the application Id to the service at the specified address by overriding the onApplicationEnd function.
In step 204, the Spark application is ended, and an application end event is generated.
When the Spark application ends, an application end event, i.e., a Spark application-end event, is generated. When ExtraListener hears the application end event, it can send an application end message carrying the Spark AppId to the first component Cleaner-GateWay.
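The listener-to-gateway notification can be sketched in Python. The gateway address, port, and message format ("END " plus the app id) are assumptions for illustration; in the described system this logic lives in the Scala ExtraListener's onApplicationEnd override.

```python
import socket

GATEWAY_ADDR = ("cleaner-gateway", 9875)  # assumed host name and port

def build_end_message(app_id: str) -> bytes:
    # Assumed wire format: a tag followed by the bare application id.
    return ("END " + app_id).encode("utf-8")

def notify_gateway(app_id: str, addr=GATEWAY_ADDR) -> None:
    """Send the end message to the gateway as a single UDP datagram.

    Connectionless send: no handshake, fire-and-forget, matching the
    low-overhead messaging the scheme relies on.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(build_end_message(app_id), addr)
    finally:
        sock.close()
```

`notify_gateway` requires a resolvable gateway address; `build_end_message` can be exercised on its own.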
In step 205, the first component Cleaner-GateWay sends a clean request to the second component Block-Cleaner.
After receiving the application end message, the first component Cleaner-GateWay may send a cleaning request to all registered second components Block-Cleaner according to the pre-stored second component list, where the cleaning request includes the Spark AppId.
In step 206, the second component Block-Cleaner finds the target folder.
After receiving the cleaning request, the second component Block-Cleaner scans the specified directory on its node, searching for a folder whose name contains the Spark AppId prefix, and if one exists, performs a cleaning operation on that target folder.
Step 207, the timer is triggered, and folders whose existing duration exceeds the threshold are cleaned.
A timer can be preset in the second component Block-Cleaner; when the timer is triggered, a cleaning operation can be performed on folders that have existed longer than the threshold, preventing cache data from going uncleaned because a cleaning request was lost.
In the above process, on one hand, directed cleaning per application can be achieved, avoiding the disk I/O consumption caused by large-batch cleaning operations and the impact on the normal operation of other services; on the other hand, data can be cleaned in a timely and efficient manner, speeding up resource reclamation, minimizing the impact of the space the data occupies, avoiding waste of cloud computing platform resources, and saving cost.
Exemplary devices
Having described the method of the exemplary embodiment of the present disclosure, referring next to fig. 3, fig. 3 is a block diagram of a data cleansing apparatus according to an exemplary embodiment.
The implementation process of the functions and actions of each module in the following device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again. For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points.
As shown in fig. 3, the data cleansing apparatus 300 may include: a monitoring module 301, a sending module 302 and a cleaning module 303. Wherein:
the monitoring module 301 is configured to monitor the working state of the application; the working state comprises at least an application end event;
the sending module 302 is configured to send an application identification of the application to the first component in response to the application end event, so that a cleaning request is sent by the first component to the second component; the cleaning request comprises the application identification;
the cleaning module 303 is configured to, in response to the second component finding the target folder containing the application identifier, perform a cleaning operation on the target folder; the target folder is created based on the application identification when the application writes the data of the application into a container.
In one embodiment, the application is built by a compute engine and used to perform data computations.
In one embodiment, the sending module is further configured to:
responding to the application ending event, calling a preset function, and sending an ending message to the first component through a specified address; the end message includes an application identification of the application.
In an embodiment, the second component is deployed on a plurality of compute nodes in the resource scheduler;
the first component sending a cleaning request to the second component, comprising:
the first component sends the cleaning request to a second component according to a second component list stored in advance; the second component list records address information corresponding to the second component.
In one embodiment, the second component creates copies of the second component on a plurality of compute nodes in the resource scheduler, respectively, in a mirror manner;
the second component list records address information corresponding to the second component, and includes:
the second component list records address information of a computing node where the copy of the second component is registered with the first component when the copy of the second component is created.
In one embodiment, the apparatus 300 further comprises:
a mirroring module 304, configured to, in response to a new compute node being added to the resource scheduler, create a copy of the second component on the new compute node in a mirrored manner.
In one embodiment, the cleaning module 303 is further configured to:
and in response to the second component finding the target folder containing the application identifier under the appointed folder directory, executing a cleaning operation on the target folder.
In one embodiment, the container is mounted to a designated disk;
when the application writes the data of the application into a container, creating a target folder based on the application identification, wherein the creating comprises the following steps:
and when the application writes the data of the application into a container, creating a target folder in the disk mounted by the resource scheduler based on the application identification.
In one embodiment, a timer is preset in the second component;
the apparatus 300 further comprises:
the timing module 305, configured to, in response to the timer being triggered, call the second component to perform a cleaning operation on folders whose existing duration exceeds the threshold.
The details of each module of the data cleansing apparatus 300 have been described in detail in the foregoing description of the data cleaning method, and therefore are not repeated here.
It should be noted that although several modules or units of the data cleansing device 300 are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Exemplary Medium
Having described the apparatus of the exemplary embodiments of the present disclosure, reference is next made to fig. 4, where fig. 4 is a schematic diagram of a data cleaning medium according to an exemplary embodiment.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 4, a readable storage medium 400 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the readable storage medium of the present disclosure is not limited thereto, and in this document, the readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Exemplary computing device
Having described the method, medium, and apparatus of the exemplary embodiments of the present disclosure, reference is next made to fig. 5, where fig. 5 is a schematic diagram of an electronic device capable of implementing the method according to an exemplary embodiment.
An electronic device 500 according to such an embodiment of the present disclosure is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: at least one processing unit 501, at least one storage unit 502, and a bus 503 connecting the various system components (including the storage unit 502 and the processing unit 501).
Wherein the storage unit stores program code that can be executed by the processing unit 501 to cause the processing unit 501 to perform the steps of the various embodiments described above in this specification.
The storage unit 502 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 5021 and/or a cache memory unit 5022, and may further include a read-only memory unit (ROM) 5023.
The storage unit 502 may also include a program/utility tool 5024 having a set (at least one) of program modules 5025, such program modules 5025 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, and in some combination, may comprise a representation of a network environment.
Bus 503 may be any type or number of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 504 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may be through input/output (I/O) interfaces 505. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 506. As shown, the network adapter 506 communicates with the other modules of the electronic device 500 over the bus 503. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the apparatus are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A data cleaning method is applied to a resource scheduler, wherein a first component, a second component and a containerized application are deployed on the resource scheduler, and the method comprises the following steps:
monitoring the working state of the application; the working state comprises at least an application end event;
sending an application identification of the application to the first component in response to the application end event to send a cleaning request to the second component by the first component; the cleaning request comprises the application identification;
responding to the second component to find a target folder containing the application identification, and executing cleaning operation on the target folder; the target folder is created based on the application identification when the application writes the data of the application into a container.
2. The method of claim 1, the application being built by a compute engine and used to perform data computations.
3. The method of claim 1, the sending an application identification of the application to the first component in response to the application end event, comprising:
responding to the application ending event, calling a preset function, and sending an ending message to the first component through a specified address; the end message includes an application identification of the application.
4. The method of claim 1, the second component deployed on a plurality of compute nodes in the resource scheduler;
the first component sending a cleaning request to the second component, comprising:
the first component sends the cleaning request to a second component according to a pre-stored second component list; the second component list records address information corresponding to the second component.
5. The method of claim 4, the second component creating copies of the second component, respectively, on a plurality of compute nodes in the resource scheduler in a mirrored fashion;
the second component list records address information corresponding to the second component, and includes:
the second component list records the address information of the computing node where the copy of the second component is registered with the first component when the copy of the second component is created.
6. The method of claim 5, further comprising:
in response to the addition of a new compute node to the resource scheduler, creating a copy of the second component on the new compute node in a mirrored manner with the second component.
7. The method of claim 1, wherein the performing a cleaning operation on the target folder in response to the second component finding the target folder containing the application identification comprises:
and in response to the second component finding the target folder containing the application identifier under the appointed folder directory, executing a cleaning operation on the target folder.
8. The method of claim 1, the container mounted to a designated disk;
when the application writes the data of the application into a container, creating a target folder based on the application identifier, wherein the creating comprises the following steps:
and when the application writes the data of the application into a container, creating a target folder in the disk mounted by the resource scheduler based on the application identification.
9. The method of claim 1, wherein a timer is preset in the second component;
the method further comprises the following steps:
and responding to the triggering of the timer, calling the second component, and executing cleaning operation on the folder with the existing time length exceeding a threshold value.
10. A data cleaning device is applied to a resource scheduler, wherein a first component, a second component and a containerized application are deployed on the resource scheduler, and the data cleaning device comprises:
the monitoring module monitors the working state of the application; the working state comprises at least an application end event;
the sending module is used for responding to the application ending event and sending the application identification of the application to the first component so as to send a cleaning request to the second component by the first component; the cleaning request comprises the application identification;
the cleaning module is used for responding to the second component to find the target folder containing the application identifier and executing the cleaning operation of the target folder; the target folder is created based on the application identification when the application writes the data of the application into a container.
CN202210106424.0A 2022-01-28 2022-01-28 Data cleaning method, device, medium and computing equipment Pending CN114443580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210106424.0A CN114443580A (en) 2022-01-28 2022-01-28 Data cleaning method, device, medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210106424.0A CN114443580A (en) 2022-01-28 2022-01-28 Data cleaning method, device, medium and computing equipment

Publications (1)

Publication Number Publication Date
CN114443580A true CN114443580A (en) 2022-05-06

Family

ID=81371772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210106424.0A Pending CN114443580A (en) 2022-01-28 2022-01-28 Data cleaning method, device, medium and computing equipment

Country Status (1)

Country Link
CN (1) CN114443580A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048343A (en) * 2022-06-30 2022-09-13 深圳市瑞云科技有限公司 File isolation method based on process granularity under Windows
CN115048343B (en) * 2022-06-30 2024-04-16 深圳市瑞云科技有限公司 File isolation method based on process granularity under Windows


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination