CN112671613B - Federated learning cluster monitoring method, device, equipment and medium

Info

Publication number: CN112671613B
Application number: CN202011585022.0A
Authority: CN
Inventors: 王国彬; 牟锟伦; 杨行榜
Original assignee: Shenzhen Bincent Technology Co Ltd
Current assignee: Shenzhen Bincent Technology Co Ltd
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2022-08-23
Anticipated expiration: 2040-12-28
Also published as: CN112671613A

Abstract

The invention relates to the technical field of monitoring and discloses a method, a device, equipment and a medium for monitoring a federated learning cluster. The method comprises the following steps: receiving a first monitoring request from the federated learning service center in a federated learning cluster, registering the federated learning service center to a monitoring service center, creating a master node and a master path, then sending a creation success instruction to the federated learning service center and creating a master monitor; receiving the second monitoring requests sent by all federated learning participants, and acquiring all first registration slave information and first monitoring slave information; registering each federated learning participant that sent a request to the monitoring service center, creating a first slave node in the monitoring service center, and creating a slave monitor after creating a slave path; and starting the master monitor and all the slave monitors through a monitoring mechanism of the monitoring service center. The invention monitors abnormal conditions of the federated learning cluster effectively, quickly and accurately, and improves monitoring quality.

Description

Federated learning cluster monitoring method, device, equipment and medium

Technical Field

The invention relates to the technical field of monitoring, in particular to a method, a device, equipment and a medium for monitoring a federated learning cluster.

Background

A federated learning system is a distributed cluster system with multiple servers, and the stopping or downtime of any one server can cause the federated learning task to fail, so keeping all the servers of the federated learning system in a normal working state is important during federated learning. In the prior art, each server in the federated learning system is usually checked by a traditional heartbeat detection method, and when a server is found to have stopped or crashed, operation and maintenance personnel are notified to handle it.

Disclosure of Invention

The invention provides a method and a device for monitoring a federated learning cluster, computer equipment and a storage medium, which realize light-weight monitoring, can effectively, quickly and accurately monitor the occurrence of abnormal conditions of the federated learning cluster, and improve the monitoring quality.

A method for monitoring a federated learning cluster comprises the following steps:

receiving a first monitoring request from a federated learning service center in a federated learning cluster, and acquiring first registration master information and first monitoring master information in the first monitoring request; the federated learning cluster includes one federated learning service center and a plurality of federated learning participants;

registering the federated learning service center to a monitoring service center according to the first registration master information, and after a master node and a master path corresponding to the first registration master information are created in the monitoring service center, sending a creation success instruction corresponding to the master node to the federated learning service center, and simultaneously creating a master monitor corresponding to the first monitoring master information;

receiving second monitoring requests sent by all the federated learning participants, and acquiring the first registration slave information and the first monitoring slave information in all the second monitoring requests; each second monitoring request includes one piece of first registration slave information and one piece of first monitoring slave information; the second monitoring requests are triggered and generated by all the federated learning participants after the federated learning service center receives the creation success instruction;

registering each federated learning participant to the monitoring service center according to each piece of first registration slave information, creating a first slave node corresponding to each piece of first registration slave information in the monitoring service center, creating a slave path in one-to-one correspondence with each first slave node under the master path, and then creating a slave monitor corresponding to the first monitoring slave information generated by each federated learning participant;

and starting the master monitor and all the slave monitors through a monitoring mechanism of the monitoring service center, so as to monitor the federated learning cluster through the master monitor and all the slave monitors.

A federated learning cluster monitoring device comprises:

a receiving module, configured to receive a first monitoring request from a federated learning service center in a federated learning cluster, and acquire first registration master information and first monitoring master information in the first monitoring request; the federated learning cluster includes one federated learning service center and a plurality of federated learning participants;

a creating module, configured to register the federated learning service center to a monitoring service center according to the first registration master information, send a creation success instruction corresponding to the master node to the federated learning service center after the master node and the master path corresponding to the first registration master information are created in the monitoring service center, and simultaneously create a master monitor corresponding to the first monitoring master information;

an acquisition module, configured to receive second monitoring requests sent by all the federated learning participants, and acquire the first registration slave information and the first monitoring slave information in all the second monitoring requests; each second monitoring request includes one piece of first registration slave information and one piece of first monitoring slave information; the second monitoring requests are triggered and generated by all the federated learning participants after the federated learning service center receives the creation success instruction;

a registration module, configured to register each federated learning participant to the monitoring service center according to each piece of first registration slave information, create a first slave node in the monitoring service center corresponding to each piece of first registration slave information, create a slave path in one-to-one correspondence with each first slave node under the master path, and then create a slave monitor corresponding to the first monitoring slave information generated by each federated learning participant;

and a starting module, configured to start the master monitor and all the slave monitors through a monitoring mechanism of the monitoring service center, so as to monitor the federated learning cluster through the master monitor and all the slave monitors.

A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above federated learning cluster monitoring method when executing the computer program.

A computer readable storage medium storing a computer program which, when executed by a processor, carries out the steps of the above federated learning cluster monitoring method.

The invention provides a federated learning cluster monitoring method, a device, computer equipment and a storage medium. A first monitoring request is received from the federated learning service center in a federated learning cluster, and the first registration master information and the first monitoring master information in the first monitoring request are acquired. The federated learning service center is registered to a monitoring service center according to the first registration master information; after a master node and a master path corresponding to the first registration master information are created in the monitoring service center, a creation success instruction corresponding to the master node is sent to the federated learning service center, and a master monitor corresponding to the first monitoring master information is created at the same time. The second monitoring requests sent by all the federated learning participants are received, and the first registration slave information and the first monitoring slave information in all the second monitoring requests are acquired. Each federated learning participant is registered to the monitoring service center according to its first registration slave information; a first slave node corresponding to each piece of first registration slave information is created in the monitoring service center, a slave path in one-to-one correspondence with each first slave node is created under the master path, and a slave monitor corresponding to the first monitoring slave information generated by each federated learning participant is then created. The master monitor and all the slave monitors are started through the monitoring mechanism of the monitoring service center, so that the federated learning cluster is monitored through the master monitor and all the slave monitors. In this way, the federated learning service center and all the federated learning participants in the federated learning cluster are monitored through the monitoring service center and its monitoring mechanism; the monitoring is lightweight, abnormal conditions of the federated learning cluster can be detected effectively, quickly and accurately, and monitoring quality is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive labor.

FIG. 1 is a schematic diagram of an application environment of a federated learning cluster monitoring method in an embodiment of the present invention;

FIG. 2 is a flow diagram of a federated learning cluster monitoring method in an embodiment of the present invention;

FIG. 3 is a flowchart of step S10 of the federated learning cluster monitoring method in an embodiment of the present invention;

FIG. 4 is a schematic block diagram of a federated learning cluster monitoring device in an embodiment of the present invention;

FIG. 5 is a schematic diagram of a computer device in an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The federated learning cluster monitoring method provided by the present invention can be applied in the application environment as shown in fig. 1, wherein a client (computer device) communicates with a server through a network. The client (computer device) includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.

In an embodiment, as shown in fig. 2, a method for monitoring a federated learning cluster is provided, which mainly includes the following steps S10-S50:

S10, receiving a first monitoring request from a federated learning service center in a federated learning cluster, and acquiring first registration master information and first monitoring master information in the first monitoring request; the federated learning cluster includes one federated learning service center and a plurality of federated learning participants.

Understandably, the federated learning cluster is a cluster built as a distributed application system. Federated learning refers to efficient machine learning among multiple participants or multiple computing nodes on the premise of guaranteeing information security during big data exchange, protecting the privacy of terminal data and personal data, and guaranteeing legal compliance. One federated learning cluster comprises one federated learning service center and multiple federated learning participants, and the federated learning cluster manages the federated learning service center and all the federated learning participants. The federated learning service center is the center of the federated learning cluster, namely the coordinator in federated learning, and mainly coordinates each of the federated learning participants in performing machine learning. The federated learning participants are the computers or servers that take part in federated learning. The first monitoring request is a request that the federated learning service center is notified to trigger before the federated learning cluster begins federated learning.

The first monitoring request includes the first registration master information and the first monitoring master information. The first registration master information is the information about the federated learning service center that is used for registration, such as the IP address or the computer name of the federated learning service center, and the first monitoring master information is a set of monitoring parameters related to the federated learning service center.
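As a minimal sketch of the request structure just described, the following Python models a first monitoring request as two parts, registration information and monitoring information. The field names (`ip`, `hostname`, `watch_type`, `session_timeout_ms`) and the dictionary layout are illustrative assumptions; the text above only names an IP address or computer name and a set of monitoring parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistrationInfo:
    """Registration information identifying the sender (hypothetical fields)."""
    ip: str
    hostname: str


@dataclass(frozen=True)
class MonitoringInfo:
    """A set of monitoring parameters (hypothetical fields)."""
    watch_type: str            # assumed values: "one_time" or "persistent"
    session_timeout_ms: int


@dataclass(frozen=True)
class MonitoringRequest:
    registration: RegistrationInfo
    monitoring: MonitoringInfo


def parse_request(raw: dict) -> MonitoringRequest:
    """Split a raw request into its registration and monitoring parts."""
    return MonitoringRequest(
        registration=RegistrationInfo(**raw["registration"]),
        monitoring=MonitoringInfo(**raw["monitoring"]),
    )


# Example request as the federated learning service center might send it.
first_request = parse_request({
    "registration": {"ip": "10.0.0.1", "hostname": "fl-center"},
    "monitoring": {"watch_type": "one_time", "session_timeout_ms": 5000},
})
```

The same shape could serve for the second monitoring requests sent by the participants, with slave information in place of master information.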

In an embodiment, as shown in FIG. 3, before the step S10, that is, before receiving the first monitoring request from the federated learning service center in the federated learning cluster, the method includes:

S101, constructing an initial cluster based on Kubernetes.

Understandably, the initial cluster comprises a plurality of computers or servers. Kubernetes is installed on the initial cluster, the Master node and the Worker nodes of Kubernetes are deployed, then Docker is installed and the Docker containers are deployed.

S102, establishing the federated learning service center and the federated learning participants in the initial cluster by applying a federated learning framework based on PaddlePaddle, and determining the initial cluster after establishment as the federated learning cluster.

Understandably, the federated learning framework is also referred to as PaddleFL, i.e., a horizontal federated learning framework based on PaddlePaddle. In the initial cluster, the Master node is deployed as the federated learning service center and the Worker nodes are deployed as the federated learning participants, and the initial cluster deployed in this way is marked as the federated learning cluster.

The invention thus constructs an initial cluster based on Kubernetes, applies the PaddlePaddle-based federated learning framework to establish the federated learning service center and the federated learning participants in the initial cluster, and determines the initial cluster after establishment as the federated learning cluster.

S20, registering the federated learning service center to a monitoring service center according to the first registration master information, and after a master node and a master path corresponding to the first registration master information are created in the monitoring service center, sending a creation success instruction corresponding to the master node to the federated learning service center, and simultaneously creating a master monitor corresponding to the first monitoring master information.

Understandably, the federated learning service center is registered to the monitoring service center according to the first registration master information in the form of a ZooKeeper temporary (ephemeral) node. The monitoring service center is a management center server built on ZooKeeper and used for managing all monitoring. ZooKeeper encapsulates complex and error-prone key services and provides a simple, easy-to-use interface and a system with high performance and stable functions. The monitoring service center provides a distributed application coordination service and has N+1 nodes, one of which plays the role of leader while the others play the role of follower. The monitoring service center has an election mechanism: when one node in the monitoring service center goes down, another node is automatically elected to take over, so that continuous service is ensured. The master node and the master path are created in the monitoring service center at the time of registration; the master node plays the role of leader, and the master path is associated with the master node. The creation success instruction is triggered after the master node and the master path are created and is sent to the federated learning service center, and the master monitor corresponding to the first monitoring master information is created. The master monitor is a monitoring mechanism set according to the relevant parameters in the first monitoring master information.
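The registration step can be illustrated with a small in-memory stand-in. This is only a sketch of the temporary-node behaviour described above, not the ZooKeeper client API: the class `MonitoringServiceCenter`, its methods, the session identifier and the shape of the returned instruction are all hypothetical.

```python
class MonitoringServiceCenter:
    """In-memory stand-in for the ZooKeeper-based monitoring service center."""

    def __init__(self):
        # Maps each temporary node path to the session that owns it.
        self.nodes = {}

    def register_center(self, session_id: str, node_name: str) -> dict:
        """Create the master node and master path, then report success."""
        master_path = "/" + node_name
        self.nodes[master_path] = session_id
        # Stands in for the creation success instruction sent back
        # to the federated learning service center.
        return {"type": "creation_success", "master_path": master_path}

    def close_session(self, session_id: str) -> list:
        """Temporary-node semantics: nodes vanish when their session ends."""
        gone = [path for path, owner in self.nodes.items() if owner == session_id]
        for path in gone:
            del self.nodes[path]
        return gone


center = MonitoringServiceCenter()
reply = center.register_center(session_id="session-center", node_name="FlCenter")
```

Calling `center.close_session("session-center")` removes `/FlCenter` again, which mirrors how a temporary node disappears when the registered machine goes down.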

S30, receiving second monitoring requests sent by all the federated learning participants, and acquiring the first registration slave information and the first monitoring slave information in all the second monitoring requests; each second monitoring request includes one piece of first registration slave information and one piece of first monitoring slave information; the second monitoring requests are triggered and generated by all the federated learning participants after the federated learning service center receives the creation success instruction.

Understandably, after the federated learning service center receives the creation success instruction, all the federated learning participants trigger their second monitoring requests. Each second monitoring request includes one piece of first registration slave information and one piece of first monitoring slave information. The first registration slave information is the information about the federated learning participant that is used for registration, and the first monitoring slave information is a set of monitoring parameters related to the federated learning participant.

S40, registering each federated learning participant to the monitoring service center according to each piece of first registration slave information, creating a first slave node corresponding to each piece of first registration slave information in the monitoring service center, creating a slave path in one-to-one correspondence with each first slave node under the master path, and then creating a slave monitor corresponding to the first monitoring slave information generated by each federated learning participant.

Understandably, each federated learning participant that has sent a second monitoring request is registered to the monitoring service center according to its first registration slave information in the form of a ZooKeeper temporary (ephemeral) node. A first slave node corresponding to each piece of first registration slave information is created in the monitoring service center, the first slave node plays the role of a follower of the monitoring service center, and the slave path is created under the master path. For example, the master node is named "FlCenter", the master path is "/FlCenter", the first slave nodes are named "FlPartyA", "FlPartyB" and "FlPartyC", and the first slave paths are "/FlCenter/FlPartyA", "/FlCenter/FlPartyB" and "/FlCenter/FlPartyC"; that is, the first slave node "FlPartyA" corresponds to the first slave path "/FlCenter/FlPartyA", the first slave node "FlPartyB" corresponds to the first slave path "/FlCenter/FlPartyB", and the first slave node "FlPartyC" corresponds to the first slave path "/FlCenter/FlPartyC". A slave monitor corresponding to the first monitoring slave information of each federated learning participant is then created.
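The path naming in the example above can be sketched in a few lines of Python. The helper name `slave_path` is hypothetical, and the master path "/FlCenter" is taken from the slave paths given in the example.

```python
import posixpath


def slave_path(master_path: str, slave_node: str) -> str:
    """Build the slave path for one slave node under the master path."""
    return posixpath.join(master_path, slave_node)


MASTER_PATH = "/FlCenter"
slave_nodes = ["FlPartyA", "FlPartyB", "FlPartyC"]

# One slave path per first slave node, in one-to-one correspondence.
slave_paths = {name: slave_path(MASTER_PATH, name) for name in slave_nodes}
```

Keeping every slave path directly under the master path means that a single child listing of the master path enumerates all registered participants.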

S50, starting the master monitor and all the slave monitors through a monitoring mechanism of the monitoring service center, so as to monitor the federated learning cluster through the master monitor and all the slave monitors.

Understandably, the monitoring (watch) mechanism uses watches to monitor the states of the master node and each of the first slave nodes, that is, it starts the master monitor and all the slave monitors. This is achieved by watching for the NodeChildrenChanged event on the master path corresponding to the master node; the NodeChildrenChanged event is an execution event of the monitoring performed by the monitoring service center. Because the registered nodes are temporary (ephemeral) nodes, when a federated learning participant becomes abnormal its temporary node disappears immediately, so it is detected that a node in the federated learning cluster is abnormal. Then, once the NodeChildrenChanged event has been triggered, the getChildren method can be called to identify which computer or server (the federated learning service center or a federated learning participant) is abnormal (down, disconnected, and so on). The getChildren method obtains the list of all slave monitors under the master monitor, and the abnormal computer or server can be identified from that list, so that the federated learning cluster is monitored through the master monitor and all the slave monitors. The monitoring mechanism may use one-time triggered monitoring or permanently triggered monitoring, chosen according to the monitoring requirements. Which state changes are monitored can be set through the master monitor and the slave monitors, and only the event type and the node information are transmitted, not the specific content of the state change, so the event itself is lightweight and the monitoring performed by the monitoring mechanism is lightweight as well.
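A minimal sketch of this watch flow is given below. It uses an in-memory stand-in rather than a real ZooKeeper client, so the class `WatchedTree`, its methods and the event dictionary are hypothetical; only the event name NodeChildrenChanged and the idea of listing children come from the text. The sketch assumes one-time triggered watches that the handler re-arms on every call.

```python
class WatchedTree:
    """In-memory stand-in for child watches on a ZooKeeper-like tree."""

    def __init__(self):
        self.children = {}   # parent path -> set of child node names
        self.watches = {}    # parent path -> list of one-time callbacks

    def create(self, parent, child):
        self.children.setdefault(parent, set()).add(child)
        self._fire(parent)

    def delete(self, parent, child):
        self.children[parent].discard(child)
        self._fire(parent)

    def get_children(self, parent, watch=None):
        """List child names; optionally leave a one-time child watch."""
        if watch is not None:
            self.watches.setdefault(parent, []).append(watch)
        return sorted(self.children.get(parent, set()))

    def _fire(self, parent):
        # One-time trigger: remove the callbacks before invoking them.
        for callback in self.watches.pop(parent, []):
            # The event carries only its type and path, not the new state.
            callback({"type": "NodeChildrenChanged", "path": parent})


tree = WatchedTree()
for name in ("FlPartyA", "FlPartyB", "FlPartyC"):
    tree.create("/FlCenter", name)

state = {"known": set()}
abnormal = []


def on_children_changed(event):
    """List the children again (re-arming the watch) and diff the result."""
    current = set(tree.get_children(event["path"], watch=on_children_changed))
    abnormal.extend(sorted(state["known"] - current))
    state["known"] = current


state["known"] = set(tree.get_children("/FlCenter", watch=on_children_changed))
tree.delete("/FlCenter", "FlPartyB")   # simulates FlPartyB going down
```

Because the event itself names only the path, the handler has to fetch the child list and compare it with the previous one to learn which participant disappeared; this is what keeps the notification lightweight.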

In this embodiment, a first monitoring request is received from the federated learning service center in the federated learning cluster, and the first registration master information and the first monitoring master information in the first monitoring request are acquired. The federated learning service center is registered to the monitoring service center according to the first registration master information; after a master node and a master path corresponding to the first registration master information are created in the monitoring service center, a creation success instruction corresponding to the master node is sent to the federated learning service center, and a master monitor corresponding to the first monitoring master information is created at the same time. The second monitoring requests sent by all the federated learning participants are received, and the first registration slave information and the first monitoring slave information in all the second monitoring requests are acquired. Each federated learning participant is registered to the monitoring service center according to its first registration slave information; a first slave node corresponding to each piece of first registration slave information is created in the monitoring service center, a slave path in one-to-one correspondence with each first slave node is created under the master path, and a slave monitor corresponding to the first monitoring slave information generated by each federated learning participant is then created. The master monitor and all the slave monitors are started through the monitoring mechanism of the monitoring service center, so that the federated learning cluster is monitored through the master monitor and all the slave monitors. In this way, the federated learning service center and all the federated learning participants in the federated learning cluster are monitored through the monitoring service center and its monitoring mechanism; the monitoring is lightweight, abnormal conditions of the federated learning cluster can be detected effectively, quickly and accurately, and monitoring quality is improved.

In an embodiment, the step S50, that is, starting the master monitor and all the slave monitors through the monitoring mechanism of the monitoring service center so as to monitor the federated learning cluster through the master monitor and all the slave monitors, includes:

S501, generating a sending thread and an event thread according to the master monitor and all the slave monitors.

Understandably, ZooKeeper maintains two lists of watches, data watches and child watches, which correspond to the master monitor and all the slave monitors. The getData() event and the exists() event set data watches, and the getChildren() event sets child watches; that is, the different watches set in ZooKeeper return different data. The getData() event and the exists() event return the relevant information of the master node, and the getChildren() event returns the list of slave nodes. The getData() event is an execution event for acquiring an object set, and the exists() event is an execution event for monitoring a one-time data change of the corresponding node. The sending thread and the event thread are generated on this basis: the sending thread is a thread for monitoring the information of all nodes, and the event thread is a thread for monitoring the type of each event.

S502, starting the sending thread and the event thread in an asynchronous mode.

Understandably, the asynchronous mode is a mode that does not require synchronous execution, and the sending thread and the event thread are started in this mode.

S503, monitoring the sending thread and the event thread by using the monitoring mechanism.
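Steps S501 to S503 can be sketched with Python's standard `threading` and `queue` modules. The function names, the queue-based hand-off between the two threads and the sentinel value are assumptions made for illustration; the text only states that a sending thread and an event thread are generated and started asynchronously.

```python
import queue
import threading

events = queue.Queue()
handled = []


def send_loop(node_updates):
    """Sending thread: forwards node information as lightweight events."""
    for path, event_type in node_updates:
        events.put({"type": event_type, "path": path})
    events.put(None)  # sentinel: no more events


def event_loop():
    """Event thread: consumes events and dispatches on their type."""
    while True:
        event = events.get()
        if event is None:
            break
        handled.append((event["type"], event["path"]))


updates = [("/FlCenter", "NodeChildrenChanged"), ("/FlCenter", "NodeDataChanged")]
send_thread = threading.Thread(target=send_loop, args=(updates,), daemon=True)
event_thread = threading.Thread(target=event_loop, daemon=True)

# start() returns immediately, so neither thread blocks the caller.
send_thread.start()
event_thread.start()

# Joining here only makes the example deterministic.
send_thread.join()
event_thread.join()
```

Separating the two loops lets event handling run at its own pace without delaying the collection of node information.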

In an embodiment, after the step S50, that is, after starting the master monitor and all the slave monitors through the monitoring mechanism of the monitoring service center so as to monitor the federated learning cluster through the master monitor and all the slave monitors, the method includes:

S60, when the master monitor detects that the federated learning service center is down, sending a first downtime instruction corresponding to the federated learning service center to the federated learning cluster.

Understandably, when the downtime of the federated learning service center is detected, the federated learning service center is recognized as abnormal, and the first downtime instruction corresponding to the federated learning service center is sent to the federated learning cluster. The first downtime instruction includes the related information of the federated learning service center and the time at which the downtime occurred.

In an embodiment, after the step S60, that is, after sending the first downtime instruction corresponding to the federated learning service center to the federated learning cluster, the method further includes:

S601, receiving a fourth monitoring request, and acquiring third registration master information and third monitoring master information in the fourth monitoring request; the fourth monitoring request is a request corresponding to the first downtime instruction, generated by a backup center corresponding to the federated learning service center after the federated learning cluster receives the first downtime instruction and fails to restart the federated learning service center.

Understandably, when the federated learning cluster receives the first downtime instruction and fails to restart the federated learning service center, the soft restart of the federated learning service center has been unsuccessful, and the backup center needs to be started. The backup center is a computer or server kept synchronized with the federated learning service center through a data synchronization technology, so that when the federated learning service center is abnormal, the backup center can be started immediately to replace it with a seamless handover.

S602, associating the third registration master information with the master node and registering it through the monitoring service center.

S603, updating the master monitor according to the third monitoring master information, starting the updated master monitor through the monitoring mechanism, and monitoring the started backup center through the updated master monitor.

Understandably, the updating is a process of resetting the time interval of monitoring according to the third monitoring master information.

In this embodiment, the fourth monitoring request is received, and the third registration master information and the third monitoring master information in the fourth monitoring request are acquired; the third registration master information is associated with the master node and registered through the monitoring service center; the master monitor is updated according to the third monitoring master information, the updated master monitor is started through the monitoring mechanism, and the started backup center is monitored through the updated master monitor. In this way, when restarting the federated learning service center fails, the backup center can immediately replace it with a seamless handover, so that the federated learning of the federated learning cluster continues and the timeliness of federated learning is guaranteed.

S70, receiving a third monitoring request, and acquiring second registration master information and second monitoring master information in the third monitoring request; the third monitoring request is generated by the restarted federated learning service center after the federated learning cluster receives the first downtime instruction and restarts the federated learning service center.

Understandably, after the federated learning cluster receives the first downtime instruction, it sends a restart instruction corresponding to the federated learning service center in a soft-start manner to restart the federated learning service center. After being restarted, the federated learning service center generates the third monitoring request, which is a request to resume monitoring of the federated learning service center. The third monitoring request includes the second registration master information and the second monitoring master information. The second registration master information may be the same as or different from the first registration master information, depending on whether the relevant information of the federated learning service center changed after the restart. The second monitoring master information may likewise be the same as or different from the first monitoring master information, that is, the monitoring-related parameters may be changed.

S80, associating the second registration master information with the master node and registering it through the monitoring service center.

Understandably, the monitoring service center associates the second registration master information of the restarted federated learning service center with the master node that the federated learning service center had before the restart, and registers it.

S90, updating the master monitor according to the second monitoring master information, starting the updated master monitor through the monitoring mechanism, and monitoring the restarted federated learning service center through the updated master monitor.

Understandably, after the master monitor is updated, the restarted federated learning service center is monitored again through the monitoring mechanism.

The invention thus realizes the following: when the main monitor detects that the federal learning service center is down, a first downtime instruction corresponding to the federal learning service center is sent to the federal learning cluster; a third monitoring request is received, and the second registration main information and second monitoring main information in it are acquired; the second registration main information is associated with the main node and registered through the monitoring service center; and the main monitoring is updated according to the second monitoring main information, started through the monitoring mechanism, and used to monitor the restarted federal learning service center. In this way, when the federal learning service center goes down, it is restarted once the downtime is detected, which removes the need for manual intervention to restart it, improves the continuity and stability of federal learning, reduces the delay caused by manual intervention, and improves the efficiency of federal learning.
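As a minimal illustrative sketch of steps S70 to S90, the following Python code keeps the existing main node and main path and only replaces the registration and monitoring information attached to them. The class names, field names, request layout and the `interval_s` parameter are assumptions made for illustration and are not defined by the invention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MasterNode:
    """Hypothetical record held by the monitoring service center for the main node."""
    path: str
    registration: dict
    monitor_params: dict
    monitor_running: bool = False


class MonitoringServiceCenter:
    def __init__(self) -> None:
        self.master: Optional[MasterNode] = None

    def register_master(self, registration: dict, monitor_params: dict) -> MasterNode:
        """First registration: create the main node, main path and main monitoring."""
        self.master = MasterNode(
            path=f"/{registration['name']}",
            registration=dict(registration),
            monitor_params=dict(monitor_params),
            monitor_running=True,
        )
        return self.master

    def on_master_down(self) -> dict:
        """Stop the main monitoring and build the first downtime instruction."""
        if self.master is None:
            raise RuntimeError("no main node registered")
        self.master.monitor_running = False
        return {"type": "first_downtime", "target": self.master.path}

    def handle_third_request(self, request: dict) -> MasterNode:
        """S70 to S90: re-associate the restarted center with the existing main node."""
        if self.master is None:
            raise RuntimeError("no main node to associate with")
        # S70: extract the second registration and monitoring main information.
        registration = request["registration"]
        monitor_params = request["monitoring"]
        # S80: keep the node and path, replace only the registration they point to.
        self.master.registration = dict(registration)
        # S90: update the monitoring parameters and start the main monitoring again.
        self.master.monitor_params.update(monitor_params)
        self.master.monitor_running = True
        return self.master
```

In this sketch the main path is unchanged by the restart, so slave paths created under it would remain valid; whether a real monitoring service center behaves this way depends on its implementation.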

In an embodiment, after the step S90, namely after monitoring the restarted federal learning service center, the method includes:

S901, recording a restart event corresponding to the first downtime instruction into a log.

Understandably, the log is a collection of the restart events that occur in the federated learning cluster. Each restart event records restart event information; the restart event corresponding to the first downtime instruction describes when the federated learning service center went down and when it was restarted.

S902, performing downtime analysis on the logs to obtain a downtime distribution map.

Understandably, the downtime analysis performs a time-interval analysis of all the restart events in the log: features (such as frequency and time-interval features) are extracted from the restart events in each time interval, and the distribution of these features is used to determine which computers or servers in the federal learning cluster have a high probability of downtime. The result is a probability distribution graph of downtime, that is, the downtime distribution map.
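The downtime analysis can be illustrated by a simple frequency count. The sketch below groups logged restart events by server and by time interval and normalizes the counts into an empirical probability distribution. The event layout (server name plus ISO-8601 timestamp), the function name and the 6-hour interval width are assumptions for illustration only; the invention does not prescribe a particular feature extraction.

```python
from collections import Counter
from datetime import datetime


def downtime_distribution(restart_events, bucket_hours=6):
    """Return {server: {bucket_index: probability}} from logged restart events.

    Each event is a (server_name, ISO-8601 timestamp) pair. The bucket index is
    the hour of day divided by bucket_hours, so the default gives four buckets.
    """
    counts = Counter()
    for server, timestamp in restart_events:
        hour = datetime.fromisoformat(timestamp).hour
        counts[(server, hour // bucket_hours)] += 1

    total = sum(counts.values())
    distribution = {}
    for (server, bucket), n in counts.items():
        # Share of all restart events that fall on this server in this interval.
        distribution.setdefault(server, {})[bucket] = n / total
    return distribution
```

A server whose entries carry most of the probability mass is the one most likely to go down in the corresponding time interval.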

S903, formulating a balancing strategy according to the downtime distribution diagram, generating adjusting data, and sending the adjusting data to the federal learning cluster for implementation.

Understandably, the downtime distribution diagram is input into a strategy model, which may be a neural network model or a preset matching model. The strategy model performs strategy mapping on the downtime distribution diagram to obtain the high-efficiency time sequence diagram of the federated learning cluster that corresponds to the downtime distribution diagram, and the balancing strategy is predicted from this high-efficiency time sequence diagram. The adjusting data is then generated according to the balancing strategy; it indicates, for each time period, the proportion of resource load and transmitted data assigned to each computer or server in the federated learning cluster. The federated learning cluster executes according to the adjusting data, so that the downtime distribution of the federated learning cluster can be effectively identified, its resources can be fully coordinated, and the cluster is kept in an efficient state for continued federal learning.
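As a hedged illustration of how adjusting data might be derived, the sketch below uses a simple fixed rule as a stand-in for the strategy model: for one time period, each server receives a load share that shrinks as its downtime probability grows. The rule itself, the function name and the `floor` parameter are assumptions; the strategy model described above may be a neural network or a preset matching model and need not behave like this.

```python
def adjustment_data(downtime_prob, floor=0.05):
    """Map per-server downtime probabilities for one time period to load shares.

    downtime_prob: {server: probability in [0, 1]}.
    Servers that go down more often receive a smaller share of the load;
    `floor` keeps every server above a minimum weight before normalization.
    """
    weights = {server: max(1.0 - p, floor) for server, p in downtime_prob.items()}
    total = sum(weights.values())
    # Normalize so that the shares for the time period sum to 1.
    return {server: w / total for server, w in weights.items()}
```

The returned shares sum to 1 for the time period, which matches the description of the adjusting data as proportion data.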

According to the invention, the restart event corresponding to the first downtime instruction is recorded into the log; downtime analysis is performed on the log to obtain a downtime distribution map; and a balancing strategy is formulated according to the downtime distribution map, from which adjusting data is generated and sent to the federal learning cluster for implementation. Restart events are therefore recorded and analyzed, and the resulting balancing strategy keeps the federal learning cluster in efficient and continuous federal learning, which improves the quality of federal learning.

In an embodiment, after the step S50, that is, after the monitoring mechanism of the monitoring service center starts the master monitoring and all the slave monitoring, so as to monitor the federal learning cluster through the master monitoring and all the slave monitoring, the method further includes:

S100, when monitoring that any one of the federal learning participants goes down through the slave monitoring, sending a second downtime instruction corresponding to the federal learning participant that went down to the federal learning cluster.

Understandably, when any one of the federal learning participants goes down, the second downtime instruction corresponding to that participant is sent; the second downtime instruction contains the information related to the federal learning participant that went down and the time at which it went down.

S110, receiving a fifth monitoring request, and acquiring second registration slave information and second monitoring slave information in the fifth monitoring request; the fifth monitoring request is generated by the restarted federal learning participants after the federal learning cluster receives the second downtime instruction and restarts the federal learning participants corresponding to the second downtime instruction.

The second registration slave information is the registration information of the federal learning participant that went down, and the second monitoring slave information resets the monitoring-related parameters for that participant, for example by increasing the monitoring frequency.

S120, registering to the monitoring service center according to each second registration slave information, creating a second slave node corresponding to the second registration slave information, and creating a slave path corresponding to the second slave node under the master path.

Understandably, the second registered slave information is registered in the monitoring service center, and one of the second slave nodes is created according to the second registered slave information, and a slave path corresponding to the second slave node is created.

S130, a slave monitor corresponding to the second monitoring slave information is created, the created slave monitor is started through the monitoring mechanism, and the restarted federal learning participator is monitored through the created slave monitor.

Understandably, a new slave monitoring is created according to the second monitoring slave information and then started, so that the restarted federal learning participant is monitored; the monitoring can be strengthened by increasing the monitoring frequency.

The invention thus realizes the following: when the slave monitoring detects that any one of the federal learning participants is down, a second downtime instruction corresponding to that participant is sent to the federal learning cluster; a fifth monitoring request is received, and the second registration slave information and second monitoring slave information in it are acquired, the fifth monitoring request being generated by the restarted federal learning participant after the federal learning cluster receives the second downtime instruction and restarts the corresponding participant; registration to the monitoring service center is performed according to the second registration slave information, a second slave node corresponding to the second registration slave information is created, and a slave path corresponding to the second slave node is created under the master path; and a slave monitoring corresponding to the second monitoring slave information is created, started through the monitoring mechanism, and used to monitor the restarted federal learning participant. In this way, a slave monitoring is created again after a crashed federal learning participant is restarted, with new monitoring parameters where needed, which ensures the normal operation of the federal learning participants and improves the quality of federal learning.
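Steps S100 to S130 can be sketched as a small registry in which every slave path is created under the master path. The class, the dictionary layout, the node names and the `interval_s` parameter below are hypothetical and serve only to illustrate the path hierarchy and the re-creation of a slave monitoring with a shorter interval.

```python
import posixpath


class SlaveRegistry:
    """Hypothetical registry of slave nodes kept under one master path."""

    def __init__(self, master_path: str) -> None:
        self.master_path = master_path
        # slave path -> registration info, monitoring parameters, running flag
        self.slaves = {}

    def register_slave(self, registration: dict, monitoring: dict) -> str:
        """Create a slave node, its path under the master path, and its monitoring."""
        path = posixpath.join(self.master_path, registration["name"])
        self.slaves[path] = {
            "registration": dict(registration),
            "monitoring": dict(monitoring),
            "running": True,
        }
        return path

    def on_slave_down(self, path: str) -> dict:
        """S100: stop the slave monitoring and build the second downtime instruction."""
        self.slaves[path]["running"] = False
        return {"type": "second_downtime", "target": path}

    def handle_fifth_request(self, request: dict) -> str:
        """S110 to S130: register the restarted participant and start a new slave monitoring."""
        return self.register_slave(request["registration"], request["monitoring"])
```

For example, a participant first monitored every 10 seconds could send a fifth monitoring request with `interval_s` set to 2 after its restart, which corresponds to the increased monitoring frequency mentioned above.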

In an embodiment, a federated learning cluster monitoring device is provided, and the federated learning cluster monitoring device corresponds one to one to the federated learning cluster monitoring method in the above embodiments. As shown in fig. 4, the federal learning cluster monitoring apparatus includes a receiving module 11, a creating module 12, an obtaining module 13, a registering module 14 and a starting module 15. The detailed description of each functional module is as follows:

a receiving module 11, configured to receive a first monitoring request from a federal learning service center in a federal learning cluster, and acquire first registration main information and first monitoring main information in the first monitoring request; the federated learning cluster includes one federated learning service center and a plurality of federated learning participants;

a creating module 12, configured to register the federal learning service center to a monitoring service center according to the first registered master information, and after a master node and a master path corresponding to the first registered master information are created in the monitoring service center, send a creation success instruction corresponding to the master node to the federal learning service center, and create a master monitor corresponding to the first monitored master information at the same time;

an obtaining module 13, configured to receive second monitoring requests sent by each federal learning participant, and obtain first registration slave information and first monitoring slave information in each second monitoring request; a second monitoring request including a first registration slave information and a first monitoring slave information; the second monitoring request is generated by being triggered by all the federal learning participants after the federal learning service center receives the successful creation instruction;

a registration module 14, configured to register each federal learning participant to the monitoring service center according to each first registered slave information, create a first slave node corresponding to each first registered slave information in the monitoring service center, and create a slave path corresponding to each first slave node one to one under the master path, and then create a slave monitor corresponding to the first monitoring slave information generated by each federal learning participant;

and a starting module 15, configured to start the master monitoring and all the slave monitoring through a monitoring mechanism of the monitoring service center, so as to monitor the federal learning cluster through the master monitoring and all the slave monitoring.

For the specific definition of the federal learning cluster monitoring apparatus, see the above definition of the federal learning cluster monitoring method, which is not repeated here. All or part of each module in the federal learning cluster monitoring device can be realized by software, hardware or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
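As one possible software-only arrangement, the five modules could be written as methods of a single class, called in the order of the method steps. The following Python sketch is illustrative only; the class name, method names, request layout and path format are assumptions and do not limit how the modules are divided or implemented.

```python
class ClusterMonitoringDevice:
    """Hypothetical software-only arrangement of modules 11 to 15."""

    def __init__(self) -> None:
        self.master_path = None
        self.paths = {}       # path -> registration information
        self.monitors = {}    # path -> monitoring parameters
        self.started = set()  # paths whose monitoring is running

    def receive(self, request):
        """Receiving module 11: split the first monitoring request."""
        return request["registration"], request["monitoring"]

    def create(self, registration, monitoring):
        """Creating module 12: master node, master path and master monitoring."""
        self.master_path = "/" + registration["name"]
        self.paths[self.master_path] = registration
        self.monitors[self.master_path] = monitoring
        return {"type": "creation_success", "node": self.master_path}

    def obtain(self, requests):
        """Obtaining module 13: split each second monitoring request."""
        return [(r["registration"], r["monitoring"]) for r in requests]

    def register(self, slave_infos):
        """Registration module 14: slave nodes and slave paths under the master path."""
        for registration, monitoring in slave_infos:
            path = f"{self.master_path}/{registration['name']}"
            self.paths[path] = registration
            self.monitors[path] = monitoring

    def start(self):
        """Starting module 15: start the master monitoring and all slave monitoring."""
        self.started = set(self.monitors)
        return sorted(self.started)
```

A caller would invoke `receive`, `create`, `obtain`, `register` and `start` in that order, mirroring the sequence of the method embodiment.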

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a federated learning cluster monitoring method.

In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the federal learning cluster monitoring method in the foregoing embodiments is implemented.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the federated learning cluster monitoring method in the above-described embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the above described functions.

The above-mentioned embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.

Claims

1. A method for monitoring a federated learning cluster is characterized by comprising the following steps:

2. The federal learning cluster monitoring method as claimed in claim 1, wherein after the master monitoring and all the slave monitoring are started through the monitoring mechanism of the monitoring service center so as to monitor the federal learning cluster through the master monitoring and all the slave monitoring, the method comprises:

when the main monitor monitors that the federal learning service center is down, a first down instruction corresponding to the federal learning service center is sent to the federal learning cluster;

receiving a third monitoring request, and acquiring second registration main information and second monitoring main information in the third monitoring request; the third monitoring request is generated by the restarted federal learning service center after the federal learning cluster receives the first downtime instruction and restarts the federal learning service center;

associating the second registration main information with the main node, and registering the federal learning service center corresponding to the second registration main information after restarting to the monitoring service center;

and updating the main monitoring according to the second monitoring main information, starting the updated main monitoring through the monitoring mechanism, and monitoring the restarted federal learning service center through the updated main monitoring.

3. The federal learning cluster monitoring method as claimed in claim 2, wherein after the monitoring of the restarted federal learning service center, the method comprises:

recording a restarting event corresponding to the first downtime instruction into a log;

performing downtime analysis on the logs to obtain a downtime distribution map;

and formulating a balance strategy according to the downtime distribution diagram, generating regulation data, and sending the regulation data to the federal learning cluster for implementation.

4. The federal learning cluster monitoring method of claim 2, wherein after sending the first downtime instruction corresponding to the federal learning service center to the federal learning cluster, the method further comprises:

receiving a fourth monitoring request, and acquiring third registration main information and third monitoring main information in the fourth monitoring request; the fourth monitoring request is a request, corresponding to the first downtime instruction, that is generated by a backup center corresponding to the federal learning service center after the federal learning cluster receives the first downtime instruction and fails to restart the federal learning service center;

associating the third registered main information with the main node, and registering the backup center to the monitoring service center;

and updating the main monitoring according to the third monitoring main information, starting the updated main monitoring through the monitoring mechanism, and monitoring the restarted backup center through the updated main monitoring.

5. The federal learning cluster monitoring method of claim 1, wherein before receiving the first monitoring request from the federal learning service center in the federal learning cluster, the method comprises:

constructing an initial cluster based on Kubernetes; Kubernetes refers to an open source system for deploying, scaling, and managing containerized applications;

and building the federal learning service center and the federal learning participants in the initial cluster by using a Paddle-based federal learning framework, and determining the built initial cluster as the federal learning cluster.

6. The federal learning cluster monitoring method as claimed in claim 1, wherein, after the master monitor and all of the slave monitors are started by a monitoring mechanism of the monitoring service center so as to monitor the federal learning cluster by the master monitor and all of the slave monitors, the method further comprises:

when monitoring that any one of the federal learning participants goes down through the slave monitoring, sending a second downtime instruction corresponding to the federal learning participant that went down to the federal learning cluster;

receiving a fifth monitoring request, and acquiring second registration slave information and second monitoring slave information in the fifth monitoring request; the fifth monitoring request is generated by the restarted federal learning participant after the federal learning cluster receives the second downtime instruction and restarts the federal learning participant corresponding to the second downtime instruction;

registering each federal learning participant to the monitoring service center according to each second registration slave information, creating a second slave node corresponding to the second registration slave information, and creating a slave path corresponding to the second slave node under the master path;

and creating a slave monitor corresponding to the second monitoring slave information, starting the created slave monitor through the monitoring mechanism, and monitoring the restarted federal learning participator through the created slave monitor.

7. The federal learning cluster monitoring method as claimed in claim 1, wherein said initiating the master monitor and all of the slave monitors through a monitoring mechanism of the monitoring service center to monitor the federal learning cluster through the master monitor and all of the slave monitors comprises:

generating a sending thread and an event thread according to the master monitoring and all the slave monitoring;

starting the sending thread and the event thread in an asynchronous mode;

and monitoring the sending thread and the event thread by using the monitoring mechanism.

8. A federal learning cluster monitoring device, characterized by comprising:

a receiving module, configured to receive a first monitoring request from a federal learning service center in a federal learning cluster, and acquire first registration main information and first monitoring main information in the first monitoring request; the federated learning cluster includes one federated learning service center and a plurality of federated learning participants;

a creating module, configured to register the federal learning service center to a monitoring service center according to the first registered master information, and after creating a master node and a master path corresponding to the first registered master information in the monitoring service center, send a creation success instruction corresponding to the master node to the federal learning service center, and create a master monitor corresponding to the first monitored master information at the same time;

the acquisition module is used for receiving second monitoring requests sent by all the federal learning participants and acquiring first registration slave information and first monitoring slave information in all the second monitoring requests; a second monitoring request including a first registration slave information and a first monitoring slave information; the second monitoring request is generated by being triggered by all the federal learning participants after the federal learning service center receives the successful creation instruction;

9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the federal learning cluster monitoring method of any of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the federal learning cluster monitoring method as claimed in any of claims 1 to 7.