CN110995488B - Multi-mechanism collaborative learning system and method based on hierarchical parameter server - Google Patents


Info

Publication number
CN110995488B
Authority
CN
China
Prior art keywords
global
parameter server
model
domain
working node
Prior art date
Legal status
Active
Application number
CN201911220964.6A
Other languages
Chinese (zh)
Other versions
CN110995488A (en
Inventor
虞红芳
李宗航
李晴
孙罡
周华漫
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911220964.6A priority Critical patent/CN110995488B/en
Publication of CN110995488A publication Critical patent/CN110995488A/en
Application granted granted Critical
Publication of CN110995488B publication Critical patent/CN110995488B/en

Classifications

    • H04L 41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L 41/0826 Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability, for reduction of network costs
    • H04L 41/0893 Assignment of logical groups to network elements
    • H04L 63/20 Network architectures or network communication protocols for network security, for managing network security; network security policies in general
    • H04L 67/51 Discovery or management of network services, e.g. service location protocol [SLP] or web services


Abstract

The invention discloses a multi-mechanism collaborative learning system based on a hierarchical parameter server, which comprises a central mechanism and a plurality of participating mechanisms connected with the central mechanism through a WAN (wide area network). Based on this system, the invention also discloses a multi-mechanism collaborative learning method based on the hierarchical parameter server. The invention solves the data island problem of big data, the data privacy and security problem in multi-party cooperation, and the problems of high communication cost, high maintenance cost, high security risk and low resource utilization of existing systems. On the premise of ensuring data privacy and security, the invention realizes multi-party collaborative learning with high communication efficiency and high computation efficiency, and is suitable for the cross-domain interconnection of multiple independent mechanisms and multiple data centers. The system supports a platform mode and a participation mode: it can be used as a platform to provide multi-party knowledge fusion services, or as a tool to support shared cooperation among a plurality of independent mechanisms.

Description

Multi-mechanism collaborative learning system and method based on hierarchical parameter server
Technical Field
The invention belongs to the technical field of electronics, and particularly relates to a multi-mechanism collaborative learning system and method based on a hierarchical parameter server.
Background
In the 5G era of high-speed interconnection of everything, the speed of data acquisition and the amount of accumulated data are growing explosively, marking human society's true entry into the big data era. Big data places higher requirements on data mining capability, and the rapid development of artificial intelligence provides strong data mining and analysis capability for many advanced scientific fields, so that intelligent applications can extract core knowledge from huge data volumes and organically combine this knowledge to execute complex tasks such as detection, identification, prediction, decision making and generation, for example face recognition in Alipay, face detection at China Customs, and human pose recognition in Douyin short videos. For deep learning, the core technology of artificial intelligence, more data usually means better application performance and generalization capability, which also improves the reliability and competitiveness of artificial intelligence applications.
However, the development of artificial intelligence faces the contradiction between big data and data islands. A data island refers to the phenomenon that massive data is scattered, like dust, across various organizations (such as enterprises, schools, research institutes, hospitals, etc.). Because of data islands, these numerous and independent organizations lack enough data to train a high-performance model, and their data exhibit preferences (biases) caused by factors such as geographic location, business type and data acquisition time, which finally results in inefficient or unusable models. Data islands therefore force these independent mechanisms to cooperate with each other to improve the performance of artificial intelligence applications.
Therefore, the data island problem must be solved: data fusion application channels across industries must be opened, data barriers between different fields must be broken, and the aggregation and value-adding functions of big data must be brought into full play, laying a firm data foundation for the application of artificial intelligence in all fields. At the same time, artificial intelligence technology should be applied steadily, establishing a new form of intelligent application characterized by data driving, cross-border fusion and co-creation sharing.
Data sharing is the simplest and most direct solution to the data island problem: the island data of multiple organizations are collected into a trusted organization or a shared cross-domain distributed database for data cleaning and data analysis. However, this solution violates the principle of data privacy protection and faces extremely high risks of data leakage and data abuse.
For the cross-domain multi-center scenario, the invention analyzes two types of domains (inter-domain and intra-domain), proposes a novel Multi-Party Collaborative Learning concept that isolates inter-domain and intra-domain communication, and provides a multi-party collaborative learning architecture based on a hierarchical Parameter Server (HiPS). The architecture inherits the privacy-protection characteristics and advantages of federated learning; compared with the single-layer architecture of federated learning, in the cross-domain multi-center scenario (multiple computing domains, each containing multiple computing nodes) it can greatly reduce network pressure, reduce security risks and improve resource utilization, thereby accelerating the training process of artificial intelligence.
Disclosure of Invention
Aiming at the above defects in the prior art, the multi-mechanism collaborative learning system and method based on a hierarchical parameter server provided by the invention solve the data island problem of big data, the data privacy and security problem in multi-party cooperation, and the problems of high communication cost, high maintenance cost, high security risk and low resource utilization of existing systems.
In order to achieve the above purpose, the invention adopts the technical scheme that:
the scheme provides a multi-mechanism collaborative learning system based on a hierarchical parameter server, which comprises a central mechanism and a plurality of participating mechanisms connected with the central mechanism through a WAN (wide area network);
the domain of each participating mechanism and the domain of the central mechanism both adopt a Parameter Server architecture, and the domain of each participating mechanism and the domain of the central mechanism are respectively set as 1L-PS layers;
and a Parameter Server architecture is adopted between the central mechanism and each participating mechanism, and a 2L-PS layer is set between the central mechanism and each participating mechanism.
Further, when the central mechanism is in the platform mode:
the central mechanism comprises a master control working node MW, a global parameter server GS connected with the master control working node MW through a LAN (local area network), a local scheduler LC respectively connected with the master control working node MW and the global parameter server GS through the LAN, and a global scheduler GC connected with the global parameter server GS through the LAN, wherein the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN (wide area network);
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN (local area network) and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN (wide area network);
when the central mechanism is in the participation mode:
the central mechanism comprises a global parameter server GS, a global scheduler GC, a working node group consisting of a master control working node MW and a plurality of working nodes W and a local scheduler LC; the global parameter server GS, the local scheduler LC and the working node group are connected with each other through a LAN network, and the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN network;
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN.
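For orientation, the following Python sketch enumerates the node composition described above for the two deployment modes. It is an illustrative summary only; the role names follow the description, while the data structures and function names are assumptions introduced here and are not part of the patent.

```python
from dataclasses import dataclass
from typing import List

# Node roles of the HiPS architecture: master worker, global parameter server,
# global scheduler, local scheduler, (domain) parameter server, worker.
MW, GS, GC, LC, S, W = "MW", "GS", "GC", "LC", "S", "W"

@dataclass
class Domain:
    name: str
    lan_nodes: List[str]   # roles deployed inside this mechanism's domain (LAN)
    wan_roles: List[str]   # roles of this domain that connect to other domains (WAN)

def central_mechanism(mode: str, n_workers: int = 0) -> Domain:
    """Node composition of the central mechanism in 'platform' or 'participation' mode."""
    if mode == "platform":
        # MW only configures and initializes; no training workers in the central domain.
        return Domain("central", lan_nodes=[MW, GS, LC, GC], wan_roles=[GS, GC])
    # Participation mode: MW and additional workers also train on the central data.
    return Domain("central", lan_nodes=[MW, GS, LC, GC] + [W] * n_workers, wan_roles=[GS, GC])

def participating_mechanism(name: str, n_workers: int) -> Domain:
    """Each participating mechanism deploys S, LC and a worker group; S reaches GS and GC over the WAN."""
    return Domain(name, lan_nodes=[S, LC] + [W] * n_workers, wan_roles=[S])
```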
Further, in the participation mode, the master control working node MW is configured to send configuration information to the global parameter server GS and to initialize the global model parameters; it is further configured to calculate model updates using the training set data and computing resources of the mechanism where it is located, upload the model updates to the intra-domain global parameter server GS, and send pull requests to the intra-domain global parameter server GS;
in the participation mode, the global parameter server GS is used for aggregating the model updates of the intra-domain working node group in the 1L-PS layer, and the aggregated model updates are used for the aggregation of the global model updates in the 2L-PS layer; it is further configured to respond to the pull requests of the intra-domain working node group, including the master control working node MW, and to issue the latest model parameters to them;
in the platform mode, the global parameter server GS is used for aggregating global model updates in the 2L-PS layer, updating global model parameters, responding to a pull request of the parameter server S, and issuing the model parameters to each working node W along a request path;
in the platform mode, the master control working node MW is configured to send configuration information to the global parameter server GS, and initialize global model parameters;
the parameter server S is used for aggregating model updates of the working node groups in the domain in the 1L-PS layer, uploading the aggregated model updates to the global parameter server GS in the 2L-PS layer, and sending a pull request to the global parameter server GS;
in the participation mode, a working node W in a participating mechanism is used for calculating model updates using the training set data and computing resources of the mechanism where it is located, and for uploading the model updates to the intra-domain parameter server S; it is further configured to send pull requests to the intra-domain parameter server S;
in the participation mode, a working node W in the central mechanism is used for calculating model update by using training set data and calculation resources of the mechanism where the working node W is located, uploading the model update to a global parameter server GS and pulling the latest model parameters from the global parameter server GS;
the local scheduler LC is used for establishing the intra-domain communication architecture of its mechanism at cluster startup, and for the registration, identification and state configuration of the other nodes in the mechanism's domain;
the global scheduler GC is used for establishing the inter-domain communication architecture at cluster startup, and for the registration, identification and state configuration of the global parameter server GS and the parameter servers S.
Still further, the platform mode and the participation mode both include a full synchronization mode;
the full synchronization mode is as follows: the parameter server S (or, in the participation mode, the global parameter server GS) must wait until all nodes in its intra-domain working node group have uploaded their model updates before executing the aggregation operation, and only when the global parameter server GS has collected the aggregated model updates of all mechanisms does it perform the aggregation of the global model update and the update of the global model.
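The full synchronization rule can be pictured with the minimal sketch below: an aggregator (a parameter server S, or the global parameter server GS in the participation mode) buffers incoming model updates and aggregates only once every expected sender has reported. Class and method names are illustrative assumptions, not part of the patent, and the averaging rule is a placeholder.

```python
class SyncAggregator:
    """Barrier-style aggregation: wait for all expected senders, then aggregate once."""

    def __init__(self, expected_senders):
        self.expected = set(expected_senders)
        self.buffer = {}

    def push(self, sender_id, update):
        # Store the update (e.g. a NumPy array) and report whether the barrier is full.
        self.buffer[sender_id] = update
        return self.ready()

    def ready(self):
        return set(self.buffer) == self.expected

    def aggregate(self, weights=None):
        # Plain or weighted average of the buffered updates, then reset the barrier.
        ids = sorted(self.buffer)
        if weights is None:
            agg = sum(self.buffer[i] for i in ids) / len(ids)
        else:
            total = sum(weights[i] for i in ids)
            agg = sum(weights[i] / total * self.buffer[i] for i in ids)
        self.buffer.clear()
        return agg
```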
Based on the system, the invention also discloses a multi-mechanism collaborative learning method based on the hierarchical parameter server, which comprises the following steps:
s1, starting and initializing the cluster;
s2, model update computation: each working node W of each participating mechanism trains a local model on its training set using the Mini-Batch SGD (mini-batch stochastic gradient descent) method, and judges whether it has traversed the local training set for E rounds; if so, it computes the model update from its current model parameters and the initial global model parameters it pulled from the global parameter server GS through the intra-domain parameter server S, and proceeds to step S3; otherwise it continues traversing the local training set until E rounds have been completed;
meanwhile, in the participation mode, the master control working node MW and the other working nodes W in the central mechanism train local models on their training sets using the Mini-Batch SGD method and judge whether they have traversed the local training set for E rounds; if so, each such node computes its model update from its current model parameters and the initial global model parameters it pulled from the intra-domain global parameter server GS, and proceeds to step S3; otherwise it continues traversing the local training set until E rounds have been completed;
wherein E is a hyper-parameter;
s3, updating and aggregating intra-domain models: when the platform mode is adopted, model updating obtained by each participating mechanism working node W is uploaded to an intra-domain parameter server S to perform intra-domain model updating aggregation;
when in the participation mode, uploading the model update obtained by each participation mechanism working node W to an intra-domain parameter server S for intra-domain model update aggregation, and uploading the model update to an intra-domain global parameter server GS for intra-domain model update aggregation by a main control working node MW and other working nodes W in a central mechanism;
s4, global model updating and aggregating: when the model is in the platform mode, forwarding the aggregated model update to a global parameter server GS of a central institution by a parameter server S of each participating institution, and performing global aggregation on the model update by the global parameter server GS;
when in the participation mode, the parameter server S of each participating agency forwards the aggregated model update to the global parameter server GS of the central agency, and the global parameter server GS performs global aggregation on the model update submitted by the participating agency and the model update obtained in the central agency in step S3.
S5, global model parameter updating: updating a global model by a global parameter server (GS) according to the model update of the global aggregation;
s6, model synchronization: when in the platform mode, the parameter server S of each participating mechanism initiates a pull request to the global parameter server GS to obtain the latest model parameters, and responds to the pull requests of the working nodes W in its own domain by issuing the latest model parameters to each working node, thereby completing the model synchronization of the global working nodes;
when in the participation mode, the parameter server S of each participating mechanism initiates a pull request to the global parameter server GS to obtain the latest model parameters and responds to the pull requests of the working nodes W in its own domain by issuing the latest model parameters to each working node; the global parameter server GS of the central mechanism responds to the pull requests of the master control working node MW and the other working nodes W in its domain and sends them the latest model parameters, thereby completing the model synchronization of the master control working node MW, of each working node W in the central mechanism's domain, and of the global working nodes;
s7, iterative training: judging whether the current iteration number t has reached the preset number of iterations T; if so, the multi-mechanism collaborative learning process based on the hierarchical parameter server is finished; otherwise, returning to S2 until the preset number of iterations is reached.
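To make the control flow of steps S1 to S7 easier to follow, the sketch below condenses one training run into a single loop, restricted to the platform mode with full synchronization. Function names and dictionary keys are hypothetical placeholders, and the aggregation rules follow the equations reconstructed later in the description; this is a schematic sketch, not the patented implementation.

```python
import numpy as np

def hips_training_platform_mode(init_global_model, institutions, T, E):
    """One HiPS run in platform mode with full synchronization (steps S1-S7).

    institutions: {name: {"workers": [callable local_train(w, E) -> np.ndarray update],
                          "num_samples": n_s}}
    """
    w = init_global_model()                                   # S1: cluster start-up and init
    for _ in range(T):                                        # S7: T communication rounds
        domain_updates, n = {}, {}
        for name, inst in institutions.items():               # each participating mechanism
            worker_updates = [train(w, E) for train in inst["workers"]]      # S2
            domain_updates[name] = np.mean(worker_updates, axis=0)           # S3
            n[name] = inst["num_samples"]
        total = sum(n.values())
        delta_w = sum(n[k] / total * domain_updates[k] for k in domain_updates)  # S4
        w = w + delta_w                                        # S5: global parameter update
        # S6: model synchronization: each S pulls w from GS and each W pulls w from its S;
        # in this single-process sketch the new w is simply reused in the next round.
    return w
```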
Further, the step S1 includes the following steps:
s101, a master control working node MW of a central mechanism sends configuration information to a global parameter server GS, an operation mode is set to be a platform mode or a participation mode, a synchronous mode is set to be a full synchronous mode or an inter-domain asynchronous mode, an optimization algorithm is set, a compression mode is set, initial global model parameters are sent to the global parameter server GS, and cluster configuration and global model initialization are completed;
s102, the working node W0 (the working node marked as 0) of each participating mechanism initializes the model storage space of the intra-domain parameter server S;
s103, the working nodes of all participating mechanisms pull the initial global model parameters from the global parameter server through the intra-domain parameter servers S to complete global model synchronization, thereby completing cluster startup and initialization.
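A minimal sketch of the configuration that the master control working node MW might send to the global parameter server GS in step S101 follows. The field names and values are assumptions introduced for illustration; the patent does not specify a concrete message format.

```python
# Hypothetical S101 configuration payload sent from MW to GS at cluster start-up.
cluster_config = {
    "run_mode": "platform",          # "platform" or "participation"
    "sync_mode": "full_sync",        # "full_sync" or "inter_domain_async"
    "optimizer": "mini_batch_sgd",   # optimization algorithm used by the workers
    "compression": None,             # inter-domain compression mode (disabled here)
    "epochs_per_round": 5,           # hyper-parameter E
    "total_rounds": 100,             # preset number of communication rounds T
}
# S101 also ships the initial global model parameters to GS; in S102 the worker W0 of each
# participating mechanism initializes the model storage of its intra-domain server S; in
# S103 every worker pulls the initial global model through its intra-domain server S.
```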
Still further, in step S2, the expression for training the local model by the Mini-Batch SGD (mini-batch stochastic gradient descent) method is as follows:

$$w_{sr}^{(t),k+1} = w_{sr}^{(t),k} - \eta \cdot \frac{1}{B} \sum_{(x_j,\, y_j) \in \mathcal{B}_{sr}} \nabla_{w_{sr}^{(t),k}} \ell\big(w_{sr}^{(t),k};\, (x_j, y_j)\big) \qquad (1)$$

where $w_{sr}^{(t),k}$ denotes the model parameters of working node $W_{sr}$ after completing $k$ local updates in the $t$-th communication round, $W_{sr}$ is the $r$-th working node in mechanism $p_s$, $\eta$ is the learning rate, $\nabla_{w_{sr}^{(t),k}} \ell(\cdot)$ is the gradient of the average loss with respect to the model parameters $w_{sr}^{(t),k}$, $B$ is the number of samples contained in the batch data $\mathcal{B}_{sr}$, $j$ indexes the $j$-th training sample $(x_j, y_j)$ in a batch containing $B$ training samples, and $\ell(\cdot)$ is the loss function, which measures the error of the model parameters $w_{sr}^{(t),k}$ on the training sample $(x_j, y_j)$;

the expression for computing the model update is as follows:

$$\Delta w_{sr}^{(t)} = w_{sr}^{(t),\,E L_{sr}} - w_{sr}^{(t),\,0} \qquad (2)$$

where $\Delta w_{sr}^{(t)}$ is the model update obtained by $W_{sr}$ after iterating over the local training set for $E$ rounds in the $t$-th communication round, $W_{sr}$ is the $r$-th working node in mechanism $p_s$, $E$ is the number of rounds of traversal of the local training set, $w_{sr}^{(t),0}$ is the initial model parameter pulled by $W_{sr}$ in the $t$-th communication round, $w_{sr}^{(t),E L_{sr}}$ is the model parameter after $W_{sr}$ completes $E L_{sr}$ local updates, and $L_{sr}$ is the number of local updates required to iterate through one round of the local training set.
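A small NumPy sketch of equations (1) and (2) as reconstructed above follows. The linear model, squared loss and learning rate are illustrative assumptions; the patent does not fix a particular model or loss function.

```python
import numpy as np

def local_train(w0, X, y, E, B, lr=0.01):
    """Equations (1)-(2): E epochs of mini-batch SGD starting from the pulled global
    parameters w0, returning the model update delta_w = w_final - w0.
    Uses a linear model with squared loss purely for illustration."""
    w = w0.copy()
    n = len(X)
    for _ in range(E):                          # traverse the local training set E times
        order = np.random.permutation(n)
        for start in range(0, n, B):            # L_sr = ceil(n / B) local updates per epoch
            batch = order[start:start + B]
            Xb, yb = X[batch], y[batch]
            grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)   # gradient of the average loss
            w -= lr * grad                      # eq. (1): one mini-batch SGD step
    return w - w0                               # eq. (2): update after E * L_sr local steps
```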
Still further, the expression for intra-domain model update aggregation in step S3 is as follows:

$$\Delta w_{s}^{(t)} = \frac{1}{m_s} \sum_{r=1}^{m_s} \Delta w_{sr}^{(t)} \qquad (3)$$

where $\Delta w_{s}^{(t)}$ is the aggregated intra-domain model update of mechanism $p_s$ in the $t$-th communication round, $m_s$ is the number of nodes in the working node group of the $s$-th mechanism, $\Delta w_{sr}^{(t)}$ is the model update obtained by $W_{sr}$ after iterating over the local training set for $E$ rounds in the $t$-th communication round, and $W_{sr}$ is the $r$-th working node in mechanism $p_s$.
Still further, the expression for global model update aggregation in the platform mode in step S4 is as follows:

$$\Delta w^{(t)} = \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=2}^{|p|} n_{s'}} \, \Delta w_{s}^{(t)} \qquad (4)$$

where $\Delta w^{(t)}$ is the globally aggregated model update in the $t$-th communication round, $|p|$ is the total number of central and participating mechanisms, $s$ is the index of participating mechanism $p_s$, $n_s$ is the total number of samples in the training set of participating mechanism $p_s$, and $\Delta w_{s}^{(t)}$ is the aggregated intra-domain model update of participating mechanism $p_s$ in the $t$-th communication round;

the expression for global model update aggregation in the participation mode is as follows:

$$\Delta w^{(t)} = \frac{n_1}{\sum_{s'=1}^{|p|} n_{s'}} \cdot \frac{1}{m_1} \sum_{r=1}^{m_1} \Delta w_{1r}^{(t)} + \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=1}^{|p|} n_{s'}} \, \Delta w_{s}^{(t)} \qquad (5)$$

where $\Delta w^{(t)}$ is the globally aggregated model update in the $t$-th communication round, $|p|$ is the total number of central and participating mechanisms, $s$ is the index of mechanism $p_s$, $n_s$ is the total number of samples in the training set of mechanism $p_s$, $\Delta w_{s}^{(t)}$ is the aggregated intra-domain model update of participating mechanism $p_s$ in the $t$-th communication round, $m_1$ is the total number of master control working nodes MW and working nodes W in the central mechanism, $r$ is the index of working node $W_{1r}$ in the central mechanism $p_1$ (with $r = 1$ denoting the master control working node MW), and $\Delta w_{1r}^{(t)}$ is the model update uploaded by working node $W_{1r}$ of the central mechanism $p_1$ in the $t$-th communication round.
Still further, the expression for the global model parameter update in step S5 is as follows:

$$w^{(t+1)} \leftarrow w^{(t)} + \Delta w^{(t)} \qquad (6)$$

where $w^{(t+1)}$ is the latest global model parameter after completing the global update in the $t$-th communication round, $w^{(t)}$ is the original global model parameter in the $t$-th communication round, and $\Delta w^{(t)}$ is the globally aggregated model update in the $t$-th communication round.
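The sketch below combines the reconstructed equations (4) to (6): a sample-count-weighted aggregation of the intra-domain updates (with the central mechanism's worker updates averaged first in the participation mode), followed by the global parameter update. The weighting scheme follows the reconstruction above and should be read as an assumption rather than the definitive formula.

```python
import numpy as np

def global_aggregate(domain_updates, sample_counts, central_worker_updates=None,
                     central_samples=0):
    """Equations (4)/(5): weighted aggregation of intra-domain model updates.

    domain_updates:         {mechanism: aggregated intra-domain update (np.ndarray)}
    sample_counts:          {mechanism: n_s}
    central_worker_updates: updates of the central working node group (participation
                            mode only; None in platform mode)
    """
    total = sum(sample_counts.values())
    if central_worker_updates:
        total += central_samples
    delta = sum(sample_counts[s] / total * u for s, u in domain_updates.items())
    if central_worker_updates:                   # eq. (5): add the central mechanism's share
        central_avg = np.mean(np.stack(central_worker_updates), axis=0)
        delta = delta + central_samples / total * central_avg
    return delta

def global_update(w, delta_w):
    """Equation (6): w_(t+1) <- w_(t) + delta_w_(t)."""
    return w + delta_w
```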
The invention has the beneficial effects that:
(1) the invention solves the data island problem. The system and the method provided by the invention break through the data barriers among a plurality of independent mechanisms, provide a solution with high calculation efficiency and high communication efficiency for multi-party knowledge fusion, and finally promote the construction of a new data fusion application schema with data driving, cross-border fusion and co-creation sharing;
(2) the invention has endogenous data privacy security capability. The system provided by the invention interacts highly abstract model data among a plurality of independent mechanisms instead of the data, so that the original data is prevented from being uploaded to an unsafe network and an untrusted third-party mechanism, and data leakage and data abuse are effectively prevented;
(3) the invention is suitable for platform as a service business model. The system provided by the invention is operated in a Platform mode, namely, the business mode corresponds to a Platform as a Service (PaaS), under the mode, a holder of the system serves as a central mechanism to provide a safe and efficient multi-party knowledge fusion Platform and Service, other mechanisms serve as participating mechanisms to search cooperation mechanisms on the Platform, and the multi-party knowledge fusion Service provided by the Platform is utilized to complete multi-party collaborative learning;
(4) the invention is suitable for business model of software as service. The system provided by the invention is operated in a participation mode, namely, the business mode of corresponding Software as a Service (SaaS), under the mode, a holder does not participate in multi-party collaborative learning as a central mechanism or a participation mechanism, but the system and the method provided by the invention are provided as tools to support the multi-party collaborative learning of other independent mechanisms;
(5) the invention has a wide application range. Compared with other classical distributed deep learning frameworks in the industry, the system and method provided by the invention cover a wider range of scenarios, including: distributed deep learning within a single-mechanism single data center, distributed deep learning across the multiple data centers of a single mechanism, distributed deep learning across the multiple data centers of multiple mechanisms, and cloud-edge-device distributed deep learning across wide area networks;
(6) the access objects of the invention are more targeted. Compared with existing federated learning systems in the industry, the system and method provided by the invention are better suited to connecting the data centers of real organizations, rather than connecting individual users or logical individuals within an organization;
(7) the invention has low communication cost. Compared with a common single-layer parameter server architecture, the system provided by the invention comprehensively analyzes the characteristics of two types of domains (intra-domain and inter-domain) in a multi-mechanism multi-data-center multi-party collaborative learning scene and isolates inter-domain and intra-domain, so that the multi-party collaborative learning system based on the layered parameter server architecture provided by the invention can greatly reduce the communication flow of the whole and a central mechanism, thereby greatly reducing the communication cost;
(8) the system provided by the invention can greatly reduce the number of WAN-crossing network connections between participating mechanisms and a central mechanism, and reduce the complexity of cluster management personnel of each mechanism in managing and maintaining cluster communication connections, thereby reducing the cost of cluster management and maintenance;
(9) the system provided by the invention is low in safety risk, and only needs to participate in 2 network connections exposed by mechanisms to the external network, so that not only is the occupation of communication resources greatly reduced, but also the mechanisms are prevented from exposing too many ports to the external network, thereby facilitating the monitoring and safety precaution of cluster management personnel of each mechanism on the cluster running state, and further reducing the risk of the cluster suffering from network safety attack;
(10) the system provided by the invention is low in deployment cost, the system is suitable for a general server cluster and a GPU cluster, the mechanism only needs to deploy a software environment without replacing server equipment and network equipment, cross-domain communication between the mechanisms also depends on the existing wide area network hardware to realize interconnection and intercommunication, and extra hardware facility cost investment is not needed.
Drawings
Fig. 1 is a deployment architecture diagram in the platform mode in the present embodiment.
Fig. 2 is a deployment architecture diagram in the participation mode in the present embodiment.
FIG. 3 is a flow chart of the method of the present invention.
Fig. 4 is a schematic view of a traffic model of the HiPS framework in the full synchronization mode and the platform mode in this embodiment.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate the understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of the embodiments, and it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined and defined in the appended claims, and all matters produced by the invention using the inventive concept are protected.
Examples
In the cross-domain multi-center scenario, the intra-domain environment features high bandwidth, low delay and homogeneous computing and communication resources, and is secure and reliable, whereas the inter-domain environment features low bandwidth, high delay and heterogeneous computing and communication resources, and is insecure and unreliable. Isolating intra-domain and inter-domain communication can therefore maximize the intra-domain resource utilization and minimize the inter-domain communication pressure, and gives each mechanism the flexibility to select a suitable communication topology according to its own computing cluster environment. The invention provides HiPS, a multi-party collaborative learning architecture based on a hierarchical parameter server, which isolates intra-domain and inter-domain data interaction through the layered parameter servers. The intra-domain parameter server sends the intra-domain fused model update to the central mechanism, and the global parameter server of the central mechanism performs the aggregation of the global model updates and the update and synchronization of the global model.
As shown in fig. 1-2, the present invention discloses a multi-institution collaborative learning system based on a hierarchical parameter server, which comprises a central institution and a plurality of participating institutions connected with the central institution through a WAN network; the domain of each participating mechanism and the domain of the central mechanism both adopt a Parameter Server architecture, and the domain of each participating mechanism and the domain of the central mechanism are respectively set as 1L-PS layers; and a ParameterServer architecture is adopted between the central mechanism and each participating mechanism, and a 2L-PS layer is set between the central mechanism and each participating mechanism.
As shown in fig. 1, when the central mechanism is in the platform mode:
the central mechanism comprises a master control working node MW, a global parameter server GS connected with the master control working node MW through a LAN (local area network), a local scheduler LC respectively connected with the master control working node MW and the global parameter server GS through the LAN, and a global scheduler GC connected with the global parameter server GS through the LAN, wherein the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN (wide area network);
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN (local area network) and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN (wide area network).
As shown in fig. 2, when the central mechanism is in the participation mode:
the central mechanism comprises a global parameter server GS, a global scheduler GC, a working node group consisting of a master control working node MW and a plurality of working nodes W and a local scheduler LC; the global parameter server GS, the local scheduler LC and the working node group are connected with each other through a LAN network, and the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN network;
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN.
In this embodiment, in the participation mode, the master control working node MW is configured to send configuration information to the global parameter server GS and to initialize the global model parameters; it is further configured to calculate model updates using the training set data and computing resources of the mechanism where it is located, upload the model updates to the intra-domain global parameter server GS, and send pull requests to the intra-domain global parameter server GS;
in the participation mode, the global parameter server GS is used for aggregating the model updates of the intra-domain working node group in the 1L-PS layer, and the aggregated model updates are used for the aggregation of the global model updates in the 2L-PS layer; it is further configured to respond to the pull requests of the intra-domain working node group, including the master control working node MW, and to issue the latest model parameters to them;
in the platform mode, the global parameter server GS is used for aggregating global model updates in the 2L-PS layer, updating global model parameters, responding to a pull request of the parameter server S, and issuing the model parameters to each working node W along a request path;
in the platform mode, the master control working node MW is configured to send configuration information to the global parameter server GS, and initialize global model parameters;
the parameter server S is used for aggregating model updates of the working node groups in the domain in the 1L-PS layer, uploading the aggregated model updates to the global parameter server GS in the 2L-PS layer, and sending a pull request to the global parameter server GS;
in the participation mode, a working node W in a participating mechanism is used for calculating model updates using the training set data and computing resources of the mechanism where it is located, and for uploading the model updates to the intra-domain parameter server S; it is further configured to send pull requests to the intra-domain parameter server S;
in the participation mode, a working node W in the central mechanism is used for calculating model update by using training set data and calculation resources of the mechanism where the working node W is located, uploading the model update to a global parameter server GS and pulling the latest model parameters from the global parameter server GS;
the local scheduler LC is used for establishing the intra-domain communication architecture of its mechanism at cluster startup, and for the registration, identification and state configuration of the other nodes in the mechanism's domain;
the global scheduler GC is used for establishing the inter-domain communication architecture at cluster startup, and for the registration, identification and state configuration of the global parameter server GS and the parameter servers S.
The platform mode and the participation mode both include a full synchronization mode; the full synchronization mode is as follows: the parameter server S (or, in the participation mode, the global parameter server GS) must wait until all nodes in its intra-domain working node group have uploaded their model updates before executing the aggregation operation, and only when the global parameter server GS has collected the aggregated model updates of all mechanisms does it perform the aggregation of the global model update and the update of the global model.
In this embodiment, as shown in fig. 1, the multi-party collaborative learning architecture based on the hierarchical parameter server includes six types of nodes: a working node w (worker), a parameter server s (server), a Global parameter server gs (Global server), a master working node mw (master worker), a Global Scheduler (GC) and a local Scheduler (LC). The central mechanism operates in a platform mode, and a Parameter Server architecture is adopted in the domain and the inter-domain, which are respectively called as a 1L-PS layer and a 2L-PS layer. The working node W only participates in the 1L-PS layer training, the working node W uses training set data and computing resource computing model updating of the mechanism where the working node W is located, and model updating is uploaded to the intra-domain parameter server S. And the intra-domain parameter server S aggregates the model updates of the intra-domain working node groups in the 1L-PS layer and continuously uploads the aggregated model updates to the global parameter server GS in the 2L-PS layer. The global parameter server GS aggregates the global model updates and updates the global model parameters in the 2L-PS layer. And then, each working node W sends a pulling request to the intra-domain parameter server S, each intra-domain parameter server S sends a pulling request to the global parameter server GS, and the global parameter server GS responds to the pulling request and sends the latest model parameters to each working node W along the request path. In particular, the master working node MW of the central authority is only used for configuring training modes (such as full synchronous/inter-domain asynchronous mode, platform/participating mode, central aggregation/central update mode, turning on/off inter-domain compression, etc.) and initializing global model parameters, and it exits after completing configuration and initialization operations, does not participate in model training, and does not contribute to data and computational power. In the figure, a thick solid line represents transmission model parameters, a thick dotted line represents transmission model updating, a thin solid line represents transmission configuration information, inter-domain transmission is WAN network transmission, and intra-domain transmission is LAN network transmission.
In this embodiment, as shown in fig. 2, the multi-party collaborative learning architecture based on the hierarchical parameter server includes six types of nodes: a working node w (worker), a parameter server s (server), a Global parameter server gs (Global server), a master working node mw (master worker), a Global Scheduler (GC) and a local Scheduler (LC). The central mechanism operates in a participation mode, and a Parameter Server architecture is adopted between domains, namely a 1L-PS layer and a 2L-PS layer. The central organization not only needs to provide the multi-party knowledge fusion service as a platform, but also needs to provide data and computing power. Besides the global parameter server GS and the master control working node MW node, a plurality of working nodes W are also deployed in the central organization, and are responsible for training a local model and updating a calculation model based on data owned by the organization together with the master control working node MW. The participating mechanism comprises a working node group with a plurality of working nodes W and is responsible for data contribution, computing power and model training, the central mechanism comprises a working node group with a master control working node MW and a plurality of working nodes W and is also responsible for training a local model and updating a computing model based on data owned by the central mechanism. In the figure, a thick solid line represents transmission model parameters, a thick dotted line represents transmission model updating, a thin solid line represents transmission configuration information, inter-domain transmission is WAN network transmission, and intra-domain transmission is LAN network transmission. Supplementary description of data communication between master working node MW and global parameter server GS in the central authority:
master working node MW → global parameter server GS:
in the cluster configuration phase, the master control working node MW sends configuration information to the global parameter server GS.
Master working node MW → global parameter server GS:
in the global parameter initialization phase, the master working node MW sends the initial global model parameters to the global parameter server GS.
Master working node MW → global parameter server GS:
in the global aggregation update and synchronization stage in the participation mode, the master control working node MW also sends model updates to the global parameter server GS, as do the other working nodes W in the central authority.
Master working node MW ← global parameter server GS:
in the global aggregation update and synchronization stage in the participation mode, the global parameter server GS sends the model parameters to the master control working node MW.
In this embodiment, the local scheduler LC and the global scheduler GC are only used for cluster startup, for example, each node needs to register itself with the scheduler to obtain information such as an identifier and communication addresses of other nodes.
In this embodiment, in the platform mode, the master control working node MW only has a control function (e.g., cluster mode configuration and global model initialization), but in the participation mode, the master control working node MW needs to assume the same function as the working node (e.g., update based on local data calculation model, upload/pull-down model) in addition to the control function. Thus, in the participating mode, the master working node MW also belongs to the working node group.
In the embodiment, in the platform mode, only the working node W of the participating mechanism participates in training; in the participation mode, except for the participation mechanism, the master control working node MW in the central mechanism and the working node W in the central mechanism participate in training.
In this embodiment, in the participation mode, the master control working node MW and other working nodes W in the central authority directly upload the model update to the global parameter server GS in the central authority for aggregation. In the participation mode, the global parameter server GS replaces the role of the parameter server S and fulfills its function, since the central authority has no parameter server S.
In this embodiment, in the platform mode, the master control work node MW has only the two functions described above, and only in the participation mode, the master control work node MW further undertakes the functions of model update calculation and upload and model parameter pull-down.
In this embodiment, in the participation mode, the master control working node MW and the plurality of working nodes W of the central mechanism also participate in training; in the participation mode, model updates generated by a work node group (including a master work node MW and a plurality of work nodes W) in the central authority are directly uploaded to a global parameter server GS in the domain (because the central authority does not have a parameter server S, and the global parameter server GS replaces the function of the parameter server S), and the pull request is also sent to the global parameter server GS.
In this embodiment, the central authority also uploads model updates to the global parameter server GS in the participation mode, the model updates in the central authority are first intra-domain aggregated by the global parameter server GS, and then the aggregated model updates are used for aggregation of global model updates. Then, the global parameter server GS updates the global model parameters, responds to the pull-down requests of the working node groups in the central organization in addition to the pull-down requests of other organizations, and directly issues the latest model parameters to each node (including the master working node MW and the plurality of working nodes W) in the working node groups.
In this embodiment, the mechanism where the global parameter server GS and the master work node MW are located is referred to as a central mechanism. In the present invention, the central mechanism supports the following two modes of operation:
1. platform mode. In the platform mode, the central authority does not provide data and computing power, but only provides a multi-party knowledge fusion service as a platform. The central authority needs to deploy a global parameter server GS, a master work node MW, a local scheduler LC and a global scheduler GC. The master control working node MW is responsible for configuring a cluster training mode and initializing global model parameters; the global parameter server GS is responsible for the aggregation of global model updates and the update and synchronization of global model parameters.
2. Participation mode. In the participation mode, the central authority needs to provide not only the multi-party knowledge fusion service, but also data and computing power. In the participation mode, the master work node MW, together with zero or more work nodes W, is responsible for training the local model and for computing and uploading model updates based on the data owned by the organization.
In this embodiment, intra-domain training only supports the synchronous mode, that is, the parameter server S in a participating mechanism's domain can execute the aggregation and forwarding operations only after all working nodes W in its domain have uploaded their model updates; likewise, the global parameter server GS in the central mechanism in the participation mode must wait for the intra-domain working node group (including the master control working node MW and the plurality of working nodes W) to upload their model updates before executing the aggregation operation. In the present invention, the central mechanism supports the following two synchronization modes:
1. full synchronization mode. In the full synchronization mode, the synchronization mode is adopted both in the intra-domain and the inter-domain. On the basis of an intra-domain synchronization mode, in a platform mode, after the parameter servers S of all participating institutions upload model updates to the global parameter server GS, the global parameter server GS performs aggregation of global model updates and update of a global model; in the participation mode, when the parameter servers S of all participating institutions and all nodes in the central institution work node group upload model updates to the global parameter server GS, the global parameter server GS performs aggregation of global model updates and update of the global model once.
2. Inter-domain asynchronous mode. In this embodiment, in addition to the full synchronization mode, an inter-domain asynchronous mode may be included, and according to the difference between the calculation and communication capabilities of the mechanisms and the difference between the requirements of the mechanisms on the convergence accuracy and speed of the model, in the inter-domain asynchronous mode, a synchronous mode is used in the domain, and an asynchronous mode is used in the inter-domain. The main difference between inter-domain asynchronous mode and fully synchronous mode is that inter-domain asynchronous mode does not require inter-domain aggregation, i.e., intra-domain aggregated model updates submitted by any authority will be used directly to update the global model without waiting for other authorities. In the platform mode, when receiving intra-domain aggregation model update submitted by a parameter server S from any participating organization, a global parameter server GS is used for updating global model parameters and responding to a latest model requested to be pulled by a source organization; in the participation mode, the processing flow of the participation mechanism is the same as that in the platform mode, in the central mechanism, the global parameter server GS performs aggregation of intra-domain model updates after collecting all model updates of the intra-domain work node group, and immediately uses the intra-domain aggregation model updates for updating the global model, and then the global parameter server GS immediately responds to the latest model requested to be pulled by all nodes (including the master work node MW and the plurality of work nodes W) of the intra-domain work node group.
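In contrast to the SyncAggregator sketch given earlier, the following sketch illustrates how the global parameter server might behave in the inter-domain asynchronous mode: each intra-domain aggregated update is applied to the global model as soon as it arrives, and the latest model is returned to the submitting mechanism without waiting for the other mechanisms. The class, its sample-count weighting and its method names are assumptions for illustration only.

```python
import numpy as np

class AsyncGlobalServer:
    """Inter-domain asynchronous mode: apply each domain's aggregated update immediately."""

    def __init__(self, w_init, sample_counts):
        self.w = np.array(w_init, dtype=float)
        total = sum(sample_counts.values())
        self.weights = {s: n / total for s, n in sample_counts.items()}

    def push_and_pull(self, mechanism, domain_update):
        # No inter-domain barrier: fold the update into the global model right away ...
        self.w = self.w + self.weights[mechanism] * np.asarray(domain_update)
        # ... and immediately answer the source mechanism's pull request with the latest model.
        return self.w
```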
Table 1 summarizes the main functions of the hierarchical-parameter-server-based multi-party collaborative learning platform HiPS in the platform/participation modes and the fully synchronous/inter-domain asynchronous modes:
TABLE 1 (the table content is reproduced only as an image in the original publication)
Based on the system, the invention also discloses a multi-mechanism collaborative learning method based on the hierarchical parameter server, as shown in fig. 3, comprising the following steps:
s1, starting and initializing the cluster, wherein the implementation method comprises the following steps:
s101, a master control working node MW of a central mechanism sends configuration information to a global parameter server GS, an operation mode is set to be a platform mode or a participation mode, a synchronous mode is set to be a full synchronous mode or an inter-domain asynchronous mode, an optimization algorithm is set, a compression mode is set, initial global model parameters are sent to the global parameter server GS, and cluster configuration and global model initialization are completed;
s102, the working node W0 (the working node marked as 0) of each participating mechanism initializes the model storage space of the intra-domain parameter server S;
s103, the working nodes of all participating mechanisms pull the initial global model parameters from the global parameter server through the intra-domain parameter servers S to complete global model synchronization, thereby completing cluster startup and initialization;
s2, model update computation: each working node W of each participating mechanism trains a local model on its training set using the Mini-Batch SGD (mini-batch stochastic gradient descent) method, and judges whether it has traversed the local training set for E rounds; if so, it computes the model update from its current model parameters and the initial global model parameters it pulled from the global parameter server GS through the intra-domain parameter server S, and proceeds to step S3; otherwise it continues traversing the local training set until E rounds have been completed;
meanwhile, in the participation mode, the master control working node MW and the other working nodes W in the central mechanism train local models on their training sets using the Mini-Batch SGD method and judge whether they have traversed the local training set for E rounds; if so, each such node computes its model update from its current model parameters and the initial global model parameters it pulled from the intra-domain global parameter server GS, and proceeds to step S3; otherwise it continues traversing the local training set until E rounds have been completed;
wherein E is a hyper-parameter; each working node uploads its model update to its intra-domain parameter server immediately after finishing local training, without waiting for the other working nodes in its working node group.
In the step S2, the expression for training the local model by using the Mini-Batch SGD mini-batch stochastic gradient descent method is as follows:

w_{sr}^{(t),k+1} = w_{sr}^{(t),k} - \eta \cdot \frac{1}{B} \sum_{j=1}^{B} \nabla_{w_{sr}^{(t),k}} l(x_{sr}^{k,j}, y_{sr}^{k,j}, w_{sr}^{(t),k})    (1)

wherein w_{sr}^{(t),k} is the model parameter of working node W_{sr} after completing k local updates in the t-th communication round, and W_{sr} is the r-th working node in institution p_s; \nabla_{w_{sr}^{(t),k}} l(\cdot) is the gradient of the average loss with respect to the model parameter w_{sr}^{(t),k}; B is the number of samples contained in the batch data {X_{sr}^{(t),k}, Y_{sr}^{(t),k}}; j is the index of the j-th training sample {x_{sr}^{k,j}, y_{sr}^{k,j}} in a batch containing B training samples; l(x, y, w) is the loss function, which measures the error of the model parameter w on the training sample {x, y};
the expression for computing the model update is as follows:

\Delta w_{sr}^{(t)} = w_{sr}^{(t),EL_{sr}} - w_{sr}^{(t),0}    (2)

wherein \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, W_{sr} is the r-th working node in institution p_s, and E is the number of rounds of traversing the local training set; w_{sr}^{(t),0} is the initial model parameter pulled by W_{sr} in the t-th communication round; w_{sr}^{(t),EL_{sr}} is the model parameter of W_{sr} after completing EL_{sr} local updates in the t-th communication round, and L_{sr} is the number of local updates required to iterate over one round of the local training set.
s3, updating and aggregating intra-domain models: when the platform mode is adopted, model updating obtained by each participating mechanism working node W is uploaded to an intra-domain parameter server S to perform intra-domain model updating aggregation;
when in the participation mode, uploading the model update obtained by each participation mechanism working node W to an intra-domain parameter server S for intra-domain model update aggregation, and uploading the model update to an intra-domain global parameter server GS for intra-domain model update aggregation by a main control working node MW and other working nodes W in a central mechanism;
the expression of the intra-domain model update aggregation in step S3 is as follows:

\Delta w_{s}^{(t)} = \frac{1}{m_s} \sum_{r=1}^{m_s} \Delta w_{sr}^{(t)}    (3)

wherein \Delta w_{s}^{(t)} is the model update aggregated within the domain of institution p_s in the t-th communication round, m_s is the number of nodes in the working node group of the s-th institution, \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, and W_{sr} is the r-th working node in institution p_s.
S4, global model updating and aggregating: when the model is in the platform mode, forwarding the aggregated model update to a global parameter server GS of a central institution by a parameter server S of each participating institution, and performing global aggregation on the model update by the global parameter server GS;
when in the participation mode, the parameter server S of each participating mechanism forwards the aggregated model update to the global parameter server GS of the central mechanism, and the global parameter server GS performs global aggregation on the model update submitted by the participating mechanism and the model update obtained in the central mechanism in the step S3;
the expression of the global model update aggregation in step S4 when in the platform mode is as follows:

\Delta w^{(t)} = \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=2}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (4)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of participating institution p_s, n_s is the total number of samples in the training set of participating institution p_s, and \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round;
the expression of the global model update aggregation when in the participation mode is as follows:

\Delta w^{(t)} = \frac{n_1}{\sum_{s=1}^{|p|} n_s} \cdot \frac{1}{m_1} \sum_{r=1}^{m_1} \Delta w_{1r}^{(t)} + \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=1}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (5)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of institution p_s, n_s is the total number of samples in the training set of institution p_s, \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round, m_1 is the total number of master control working nodes MW and working nodes W in the central institution, r indexes the r-th working node W_{1r} in the central institution p_1 (r = 1 denotes the master control working node MW), and \Delta w_{1r}^{(t)} is the model update uploaded by working node W_{1r} of the central institution p_1 in the t-th communication round.
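The two aggregation rules can be illustrated as follows. This is a hedged sketch: the dictionary-based interface is an assumption, and the participation-mode branch reflects one plausible reading of equation (5) in which the central institution's per-node updates are first averaged and then weighted by its sample count.

```python
import numpy as np

def aggregate_global(inst_updates, inst_samples, central_node_updates=None, n_central=0):
    """Equations (4)/(5): sample-weighted aggregation of intra-domain updates.

    inst_updates / inst_samples : per participating institution, the aggregated
        update and the training-set size n_s (keyed by institution id)
    central_node_updates        : in participation mode, the updates uploaded by
        the central institution's MW and workers (None in platform mode)
    n_central                   : n_1, the central institution's sample count
    """
    total = sum(inst_samples.values()) + (n_central if central_node_updates else 0)
    dw = sum((n / total) * np.asarray(inst_updates[s], dtype=float)
             for s, n in inst_samples.items())           # platform-mode term, equation (4)
    if central_node_updates:                             # participation mode, equation (5)
        central_avg = sum(np.asarray(u, dtype=float)
                          for u in central_node_updates) / len(central_node_updates)
        dw = dw + (n_central / total) * central_avg
    return dw
```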
s5, global model parameter updating: updating a global model by a global parameter server (GS) according to the model update of the global aggregation;
the expression of the global model parameter update in step S5 is as follows:

w^{(t+1)} ← w^{(t)} + \Delta w^{(t)}    (6)

wherein w^{(t+1)} is the latest global model parameter after completing the global update in the t-th communication round, w^{(t)} is the original global model parameter in the t-th communication round, and \Delta w^{(t)} is the globally aggregated model update in the t-th communication round;
s6, model synchronization: when in the platform mode, the parameter server S of each participating institution initiates a pull request to the global parameter server GS to obtain the latest model parameters, then responds to the pull requests of the working nodes W in its domain and sends the latest model parameters to each working node, completing model synchronization of all working nodes;
when in the participation mode, the parameter server S of each participating institution initiates a pull request to the global parameter server GS to obtain the latest model parameters and then responds to the pull requests of the working nodes W in its domain by sending them the latest model parameters; the global parameter server GS of the central institution additionally responds to the pull requests of the master control working node MW and of the other working nodes W in its domain and sends them the latest model parameters, completing model synchronization of the master control working node MW, of each working node W in the central institution's domain, and of all working nodes globally;
s7, iterative training: judging whether the current iteration count t has reached the preset iteration count T; if so, the multi-institution collaborative learning process of the hierarchical parameter server ends; otherwise, return to S2 until the preset iteration count is reached.
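Putting steps S2-S7 together, a single sequence of communication rounds could be sketched as below in the platform mode, reusing the helper functions sketched above; the institution/worker objects and their attribute names are hypothetical.

```python
def run_rounds(w_global, institutions, T, E, eta):
    """Hedged sketch of the outer loop S2-S7 in platform mode.
    institutions: mapping s -> object with .workers (each holding .batches and
    .grad_fn) and .num_samples; these attribute names are assumptions."""
    w = w_global
    for t in range(T):                                              # S7: T communication rounds
        inst_updates, inst_samples = {}, {}
        for s, inst in institutions.items():
            node_updates = [local_train(w, wk.batches, wk.grad_fn, eta, E)
                            for wk in inst.workers]                 # S2: local model updates
            inst_updates[s] = aggregate_intra_domain(node_updates)  # S3: aggregation on S_s
            inst_samples[s] = inst.num_samples
        dw = aggregate_global(inst_updates, inst_samples)           # S4: aggregation on the GS
        w = w + dw                                                  # S5: equation (6)
        # S6: every working node pulls w before the next round (implicit here,
        # since the same w is passed back into local_train)
    return w
```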
In this embodiment, in the platform mode, the master control working node MW may directly exit the cluster because it does not participate in the training; however, in the participation mode, the master work node MW may participate in training together with other work nodes W in its domain, and at this time, the master work node MW may not exit the cluster.
In this embodiment, in the central authority, the initial model parameters submitted to the global parameter server GS by the master control working node MW are used for global model synchronization; in other participating institutions, however, only the worker node W0 identified as 0 will initialize the model storage space of the parameter servers S within its domain without assigning values to that space, and then each parameter server S will use the global initial model parameter assignment model storage space pulled from the global parameter server GS.
In this embodiment, the model training process is modeled based on the basic flow. Suppose |p| institutions participate in the collaborative training, and the institution set is p = {p_1, ..., p_s, ..., p_{|p|}}, where p_1 is the central institution. The s-th participating institution p_s comprises 1 parameter server node S_s and m_s working nodes, and its set of working nodes is W_s = {W_{s1}, ..., W_{sr}, ..., W_{s m_s}}, where W_{sr} is the r-th working node in participating institution p_s. Suppose the training set {X_s, Y_s} of participating institution p_s contains n_s training samples, and the training set {X_{sr}, Y_{sr}} of W_{sr} is the subset of {X_s, Y_s} held by the r-th working node and contains n_{sr} training samples; for one and the same participating institution p_s, the local training-set sample count n_{sr} is the same for all working nodes in its domain. The batch data obtained by the k-th sequential sampling of W_{sr} from {X_{sr}, Y_{sr}} is {X_{sr}^{(t),k}, Y_{sr}^{(t),k}} with batch size B, where {x_{sr}^{k,j}, y_{sr}^{k,j}} denotes the j-th training sample of that batch. W_{sr} needs L_{sr} = n_{sr}/B sequential samplings to traverse one round of {X_{sr}, Y_{sr}}, and completes EL_{sr} local updates after iterating over E rounds of the local training set. \eta is the learning rate of the working node's local optimizer, and l(x, y, w) is the loss function representing the error that model w produces on the training sample {x, y}. The Mini-Batch SGD stochastic gradient descent method is used below to simplify the analysis.
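The sequential sampling into L_sr = n_sr / B mini-batches described above might be written as follows; this is a sketch in which the function name is an assumption and n_sr is taken to be a multiple of B.

```python
def sequential_batches(X, Y, B):
    """Split the local training set {X_sr, Y_sr} into the L_sr = n_sr / B
    mini-batches obtained by sequential sampling with batch size B."""
    n = len(X)
    return [(X[k * B:(k + 1) * B], Y[k * B:(k + 1) * B]) for k in range(n // B)]
```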
In the present embodiment, on the r-th working node W_{sr} of participating institution p_s, the k-th local update is executed according to the following formula:

w_{sr}^{(t),k+1} = w_{sr}^{(t),k} - \eta \cdot \frac{1}{B} \sum_{j=1}^{B} \nabla_{w_{sr}^{(t),k}} l(x_{sr}^{k,j}, y_{sr}^{k,j}, w_{sr}^{(t),k})    (1)

wherein w_{sr}^{(t),k} is the model parameter of working node W_{sr} after completing k local updates in the t-th communication round, and W_{sr} is the r-th working node in institution p_s; \nabla_{w_{sr}^{(t),k}} l(\cdot) is the gradient of the average loss with respect to the model parameter w_{sr}^{(t),k} of working node W_{sr} at the t-th communication round and k-th local iteration; B is the number of samples contained in the batch data {X_{sr}^{(t),k}, Y_{sr}^{(t),k}}; j is the index of the j-th training sample {x_{sr}^{k,j}, y_{sr}^{k,j}} in a batch containing B training samples; l(x, y, w) is the loss function, which measures the error of the model parameter w on the training sample {x, y}.
In the present embodiment, after the r-th working node W_{sr} of participating institution p_s completes E rounds of traversal of its local data set, the working node computes the model update according to the following formula:

\Delta w_{sr}^{(t)} = w_{sr}^{(t),EL_{sr}} - w_{sr}^{(t),0}    (2)

wherein \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, W_{sr} is the r-th working node in institution p_s, and E is the number of rounds of traversing the local training set; w_{sr}^{(t),0} is the initial model parameter pulled by W_{sr} in the t-th communication round; w_{sr}^{(t),EL_{sr}} is the model parameter of W_{sr} after completing EL_{sr} local updates in the t-th communication round, and L_{sr} is the number of local updates required to iterate over one round of the local training set.
Subsequently, the model updates \Delta w_{sr}^{(t)} computed by all working nodes W_{sr} (r ∈ [1, m_s]) of participating institution p_s are uploaded to the intra-domain parameter server S_s, which performs the aggregation of the intra-domain model updates according to the following formula:

\Delta w_{s}^{(t)} = \frac{1}{m_s} \sum_{r=1}^{m_s} \Delta w_{sr}^{(t)}    (3)

wherein \Delta w_{s}^{(t)} is the model update aggregated within the domain of institution p_s in the t-th communication round, m_s is the number of nodes in the working node group of the s-th institution, \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, and W_{sr} is the r-th working node in institution p_s.
In this embodiment, the parameter servers S = {S_2, ..., S_s, ..., S_{|p|}} of the participating institutions forward the intra-domain aggregated model updates to the global parameter server GS, and the global parameter server GS performs the aggregation of the global model update.
In the present embodiment, in the platform mode, the central institution p_1 neither trains the model nor uploads model updates to the global parameter server GS, and the aggregation formula of the global model update is as follows:

\Delta w^{(t)} = \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=2}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (4)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of participating institution p_s, n_s is the total number of samples in the training set of participating institution p_s, and \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round.
In this embodiment, in the participation mode, the model updates submitted to the global parameter server GS by the master control working node and the working nodes W_{1r} (r ∈ [1, m_1]) of the central institution p_1 also need to be considered, and the global model aggregation formula is as follows:

\Delta w^{(t)} = \frac{n_1}{\sum_{s=1}^{|p|} n_s} \cdot \frac{1}{m_1} \sum_{r=1}^{m_1} \Delta w_{1r}^{(t)} + \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=1}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (5)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of institution p_s, n_s is the total number of samples in the training set of institution p_s, \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round, m_1 is the total number of master control working nodes MW and working nodes W in the central institution, r indexes the r-th working node W_{1r} in the central institution p_1 (r = 1 denotes the master control working node MW), and \Delta w_{1r}^{(t)} is the model update uploaded by working node W_{1r} of the central institution p_1 in the t-th communication round.
In this embodiment, after the global model update aggregation is completed, the global parameter server GS uses the globally aggregated model update \Delta w^{(t)} to update the global model w^{(t)} according to the following formula:

w^{(t+1)} ← w^{(t)} + \Delta w^{(t)}    (6)

wherein w^{(t+1)} is the latest global model parameter after completing the global update in the t-th communication round, w^{(t)} is the original global model parameter in the t-th communication round, and \Delta w^{(t)} is the globally aggregated model update in the t-th communication round.
For further explanation of the present invention, the following description will be given by taking the traffic model and communication process of HiPS in the multi-party collaborative learning framework based on the hierarchical parameter server as an example:
the traffic model of the multi-party collaborative learning framework HiPS based on the hierarchical parameter server developed by the invention is shown in FIG. 4. The central institution typically needs to be started before the participating institutions. The multi-party collaborative learning framework HiPS based on the hierarchical parameter server comprises four stages: cluster start-up and initialization, local training within an institution, global aggregation update and synchronization, and cluster stop and destruction. To simplify the analysis, the flow model diagram only shows the three main stages: cluster start-up and initialization, local training within an institution, and global aggregation update and synchronization. In the full synchronization mode and the platform mode, the detailed communication process of the HiPS framework developed by the invention is described as follows. As shown in fig. 4, a thick solid line represents the transmission of model parameters or model updates, a thin solid line represents the transmission of control information, a thick dotted line represents model parameters or model updates exchanged with other institutions not shown, and a thin dotted line represents control information exchanged with other institutions not shown. Fig. 4 shows the flow model of one central institution and one participating institution; the other participating institutions are simplified and hidden, and their model data/control information interaction is indicated by dashed lines.
Stage one: global model parameter initialization
Step one: the master control working node MW in the central institution p_1 uploads the initial model parameters w^{(1)} to the global parameter server GS, and the global parameter server GS uses w^{(1)} to initialize the global model parameters. At the same time, the first working node W_{s1} of each participating institution p_s (s ∈ [2, |p|]) uploads model parameters of arbitrary values to its intra-domain parameter server S_s; these are used to initialize the model parameter storage space of the parameter server within each participating institution without assigning model parameter values. The other working nodes W_{sr} (r ∈ [2, m_s]) of participating institution p_s (s ∈ [2, |p|]) are not responsible for initializing their intra-domain parameter server S_s and directly initiate pull requests to S_s to pull the global initial model parameters.
Step two: after the global parameter server GS completes the initialization of the global model parameters, it sends an ACK to the master control working node MW. The master control working node MW exits the cluster after confirming that the global model initialization is completed. At the same time, the parameter server S_s in each participating institution p_s (s ∈ [2, |p|]) sends an ACK to its working node W_{s1}. Working node W_{s1} continues with the subsequent steps after confirming that the initialization of the model parameter space is completed.
Step three: parameter server S ═ S in each participating institution2,...,Ss,...,S|p|Sending a pulling request to a global parameter server GS for obtaining a global initialization model parameter w(1)
Step four: after the global parameter server GS completes the model parameter initialization of step one, the global parameter server GS responds to the parameter server S ═ S of each participating organization2,...,Ss,...,S|p|Get request and return initial model parameters w(1)
Step five: participating institutions ps(s∈[2,|p|]) Inner working node Ws1Upon receipt of its intradomain parameter server SsAfter the returned ACK, continue to SsA pull request is initiated to pull global initial model parameters.
Step six: participating institutions ps(s∈[2,|p|]) Parameter server SsAfter step four is completed, responding each work node W in the domainsr(r∈[1,ms]) Pull request and return initial model parameters w(1). Each working node Wsr(r∈[1,ms]) Using w(1)Initializing local model parameters, and finally enabling all working nodes to have the same initial model parameters
Figure BDA0002300828820000211
Model initialization and model synchronization of the global working node are completed.
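Steps one to six can be summarized with the following hypothetical push/pull interface; none of these method names (store, allocate_like, pull) come from the patent, they merely illustrate the ordering of the messages.

```python
def initialize_cluster(gs, servers, workers, w_init):
    """Stage one (steps one to six): initialize and synchronize the global model.
    gs         : global parameter server of the central institution
    servers[s] : intra-domain parameter server S_s of participating institution p_s
    workers[s] : list of working nodes of p_s (hypothetical objects)"""
    gs.store(w_init)                      # step one: MW uploads w(1) to the GS
    for s, server in servers.items():
        server.allocate_like(w_init)      # step one: W_s1 initializes S_s's storage space
        server.store(gs.pull())           # steps three/four: S_s pulls w(1) from the GS
        for wk in workers[s]:
            wk.model = server.pull()      # steps five/six: workers pull w(1) from S_s
```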
Stage two: local training within an institution
Step seven: participating in institution p when performing in-institution local training for the tth communication rounds(s∈[2,|p|]) Inner working node Wsr(r∈[1,ms]) Based on local model parameters
Figure BDA0002300828820000212
And local training set data { Xsr,YsrComputing model updates, where { Xsr,YsrAre sequentially divided into Lsr=nsrB small batch data. Working node WsrLocal training formula (1) is carried out on the small Batch of data by using a Mini-Batch SGD optimizer with the learning rate of eta (or an optimizer such as Adam, RMSProp and the like) until the local training set E wheel is circularly traversed, and local updating EL is carried out together at the momentsrNext, the process is carried out. Finally, the working node WsrModel updating by calculation of equation (2)
Figure BDA0002300828820000213
Stage three: global aggregation update and synchronization
Step eight: participating institutions ps(s∈[2,|p|]) Inner working node Wsr(r∈[1,ms]) Updating the model calculated in the step seven
Figure BDA0002300828820000221
Upload to parameter server S within its domains
Step nine: participating institutions ps(s∈[2,|p|]) Parameter server SsAll work nodes W in its domain are collectedsr(r∈[1,ms]) After updating the model of (3), the model in the aggregation domain is updated according to the formula
Figure BDA0002300828820000222
Subsequently, the parameter server S updates the intra-domain aggregated model
Figure BDA0002300828820000223
And continuously uploading to a global parameter server GS.
Step ten: after the global parameter server GS has collected the model updates uploaded by the parameter servers S = {S_2, ..., S_s, ..., S_{|p|}} of all institutions, it aggregates the global model update according to formula (4) to obtain \Delta w^{(t)} (if the central institution p_1 runs in the participation mode, the global model update is aggregated according to formula (5) to obtain \Delta w^{(t)}). Then, the global parameter server GS uses \Delta w^{(t)} to update the global model parameters w^{(t)} according to formula (6), obtaining the latest global model parameters w^{(t+1)}. Finally, the global parameter server GS returns ACKs to the parameter servers S = {S_2, ..., S_s, ..., S_{|p|}} of the institutions, allowing them to perform a pull operation to pull the latest global model parameters w^{(t+1)}.
Step eleven: parameter server S ═ { S ═ S for each participating institution2,...,Ss,...,S|p|After receiving the ACK response of the global parameter server GS, initiating a pull request to the global parameter server GS to pull the latest global model parameter w(t+1)
Step twelve: the global parameter server GS responds to each participating agency parameter server S ═ S2,...,Ss,...,S|p|Get request and return the latest global model parameter w(t+1)
Step thirteen: participating institutions ps(s∈[2,|p|]) Parameter server SsAfter pulling the latest global model parameter w(t+1)Then all the working nodes W in the domainsr(r∈[1,ms]) Sending ACK responses, allowing them to perform a pull operation to pull up the latest global model parameters w(t+1)
Fourteen steps: participating institutions ps(s∈[2,|p|]) Working node W ofsr(r∈[1,ms]) Upon receipt of its intradomain parameter server SsAfter ACK response, continue to SsInitiating a pull request to pull the latest global model parameter w(t+1)
Step fifteen: participating institutions ps(s∈[2,|p|]) Parameter server SsReceives its working node W in domainsr(r∈[1,ms]) Pull request and return the latest global model parameters w(t+1). Working node Wsr(r∈[1,ms]) Latest global model parameters w using pull(t+1)The local model parameters are overlaid. Thus, the aggregation of the global model update, the update of the global model and the synchronization of the global model of the t-th communication turn are completed.
Step sixteen: judge whether the current communication round t has reached the specified number of training rounds T, or whether the current global model accuracy has reached the specified accuracy threshold. If the stop condition is met, training stops and the cluster enters the stop and destruction stage; otherwise, let t ← t + 1 and continue from step seven until the stop condition is reached.
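The stop condition of step sixteen can be expressed compactly; the accuracy_threshold argument is an assumption (the patent allows either a round budget or an accuracy target).

```python
def should_stop(t, T, accuracy=None, accuracy_threshold=None):
    """Step sixteen: stop when the round budget T is reached or, if configured,
    when the global model accuracy reaches the specified threshold."""
    if t >= T:
        return True
    return (accuracy is not None and accuracy_threshold is not None
            and accuracy >= accuracy_threshold)
```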
When implemented in a deployment environment, the invention can be deployed in a cross-domain, multi-institution data center cluster comprising one central institution and several participating institutions. The technology supports the central institution running in two modes, namely the platform mode and the participation mode. In the platform mode, the central institution serves as a service platform providing multi-party knowledge fusion services for the participating institutions, but contributes neither data nor computing power. In the participation mode, the institutions negotiate and select one of them as the central institution, which must both provide the multi-party knowledge fusion service and contribute data and computing power to the model training. Multiple nodes within a participating institution's data center can take part in training at the same time; these nodes can be interconnected in any physical topology, but it must be ensured that at least two nodes can intercommunicate with the other nodes. The HiPS framework based on the hierarchical parameter server developed by this technology can run on general-purpose servers as well as in GPU clusters. The recommended configuration for a single data center is: computing nodes deployed on GPU clusters with the same or similar computing capacity, and scheduler nodes and parameter server nodes deployed on general-purpose servers; in the platform mode, the recommended configuration for the central institution is that all nodes are deployed on a general-purpose server.
In this embodiment, as shown in fig. 1, in the platform mode, the central authority serves as a platform to provide the multi-party knowledge fusion service. The central mechanism needs to deploy a global parameter server node, a global scheduler node, a local scheduler node and a master control working node on a general server, and expose two network ports to the external network, which correspond to the communication ports of the global parameter server node and the global scheduler node respectively. Three types of nodes within the central authority need to communicate with each other. A central authority may deploy multiple global parameter server nodes for load balancing. The participating mechanism needs to deploy parameter server nodes, local scheduler nodes and working nodes, wherein the participating mechanism can only deploy one parameter server node and one local scheduler node on the general server, but can deploy a plurality of working nodes on the GPU server. The local scheduler node needs to be interoperable with other nodes in the participating enterprise, and the parameter server node needs to be interoperable with the working node. The parameter servers of the participating institutions need to expose two network ports to the external network for establishing network connections with the global parameter server and the global scheduler of the central institution. As shown in fig. 2, in the participating mode, the central authority may additionally deploy a plurality of working nodes on the GPU server, which need to interwork with the local scheduler node and the parameter server node. The local scheduler is connected with a local parameter server node (in a central mechanism, namely a global parameter server node) and all nodes in the intra-domain working node group; the global scheduler is connected with the global parameter server node and the parameter server nodes in all the participating institutions.
In this embodiment, the central mechanism and the participating mechanisms may be located in different WAN network environments, the data center clusters of the central mechanism and the participating mechanisms in the participating mode are recommended to be configured as clusters having larger-scale computing cards of the same type or similar computing power and interconnected with a high-speed stable network, and the central mechanism in the platform mode may be deployed in a general server cluster. If the requirements are not met, the system can be directly deployed in a general server cluster.
In this embodiment, the central mechanism may select different combinations of modes according to actual requirements, and the modes include: full synchronous mode/inter-domain asynchronous mode, platform mode/participation mode, central aggregation mode/central update mode. Multiple global parameter servers may also be configured to achieve load balancing for the central authority. Optionally, compression of the parameters may be turned on to reduce traffic. These functions can be quickly and easily turned on or off upon invoking the interface provided by the framework of the present invention. If the stopping condition is set, the frame can automatically execute the cluster stopping process when the stopping condition is met, so that all related processes can be conveniently and quickly closed without manually stopping the cluster.
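For illustration only, a mode selection along the lines described here might be expressed as a configuration dictionary; every key below, and the start_cluster entry point, are hypothetical and do not reflect the framework's actual interface.

```python
hips_config = {
    "run_mode": "platform",                 # or "participation"
    "sync_mode": "full_sync",               # or "inter_domain_async"
    "central_mode": "central_aggregate",    # or "central_update"
    "num_global_parameter_servers": 2,      # several GS instances for load balancing
    "compress_parameters": True,            # optionally reduce WAN traffic
    "stop_condition": {"max_rounds": 100},  # lets the framework stop the cluster automatically
}
# start_cluster(hips_config)  # hypothetical entry point
```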
Through the above design, the invention solves the data island problem of big data, the data privacy and security problem in multi-party cooperation, and the problems of high communication cost, high maintenance cost, high security risk and low resource utilization of existing systems. On the premise of guaranteeing data privacy and security, the invention achieves multi-party collaborative learning with high communication efficiency and high computation efficiency, and is suitable for cross-domain interconnection of multiple independent institutions and multiple data centers. The system provided by the invention supports the platform mode and the participation mode: it can serve as a platform providing multi-party knowledge fusion services, or as a tool supporting shared cooperation among multiple independent institutions.
The layering thought in the HiPS framework based on the layering parameter server provided by the invention has the following advantages:
1) direct network connection between the cloud center parameter server and all computing nodes of each mechanism is avoided, the communication pressure of the cloud center parameter server can be greatly reduced, and the communication bottleneck is relieved;
2) the access number of the cloud center parameter servers is reduced from the number of global computing nodes to the number of mechanisms, and the number of the mechanisms is usually small, so that the cloud center parameter servers are suitable for using a parameter server framework;
3) the computing and communication environment within a domain is more ideal and homogeneous, and an institution can flexibly select different communication topologies according to its cluster scale without being tied to specific computing equipment, so that the computing and communication resources within the institution can be fully utilized.

Claims (9)

1. A multi-mechanism collaborative learning system based on a hierarchical parameter server is characterized by comprising a central mechanism and a plurality of participating mechanisms connected with the central mechanism through a WAN (wide area network);
the domain of each participating mechanism and the domain of the central mechanism both adopt a Parameter Server architecture, and the domain of each participating mechanism and the domain of the central mechanism are respectively set as 1L-PS layers;
a Parameter Server architecture is adopted between the central mechanism and each participating mechanism, and a 2L-PS layer is arranged between the central mechanism and each participating mechanism;
when the central mechanism is in the platform mode:
the central mechanism comprises a master control working node MW, a global parameter server GS connected with the master control working node MW through a LAN (local area network), a local scheduler LC respectively connected with the master control working node MW and the global parameter server GS through the LAN, and a global scheduler GC connected with the global parameter server GS through the LAN, wherein the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN (wide area network);
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN (local area network) and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN (wide area network);
when the central mechanism is in the participation mode:
the central mechanism comprises a global parameter server GS, a global scheduler GC, a working node group consisting of a master control working node MW and a plurality of working nodes W and a local scheduler LC; the global parameter server GS, the local scheduler LC and the working node group are connected with each other through a LAN network, and the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN network;
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN.
2. The multi-mechanism collaborative learning system based on hierarchical parameter servers as claimed in claim 1, wherein in the participation mode, the master control working node MW is used for sending configuration information to the global parameter server GS and initializing the global model parameters; and is further used for computing model updates by using the training set data and computing resources of the institution where it is located, uploading the model updates to the global parameter server GS in its domain, and sending pull requests to the global parameter server GS in its domain;
in the participation mode, the global parameter server GS is used for aggregating the model updates of the working node group in its domain in the 1L-PS layer and for aggregating the global model update in the 2L-PS layer from the aggregated model updates; and is further used for responding to the pull requests of the parameter servers S, of the master control working node MW and of the working node group in its domain;
in the platform mode, the global parameter server GS is used for aggregating global model updates in the 2L-PS layer, updating global model parameters, responding to a pull request of the parameter server S, and issuing the model parameters to each working node W along a request path;
in the platform mode, the master control working node MW is configured to send configuration information to the global parameter server GS, and initialize global model parameters;
the parameter server S is used for aggregating model updates of the working node groups in the domain in the 1L-PS layer, uploading the aggregated model updates to the global parameter server GS in the 2L-PS layer, and sending a pull request to the global parameter server GS;
in the participation mode, a working node W in a participating institution is used for computing model updates by using the training set data and computing resources of the institution where it is located and uploading the model updates to the parameter server S in its domain; and is further used for sending pull requests to the parameter server S in its domain;
in the participation mode, a working node W in the central mechanism is used for calculating model update by using training set data and calculation resources of the mechanism where the working node W is located, uploading the model update to a global parameter server GS and pulling the latest model parameters from the global parameter server GS;
the local scheduler LC is used for establishing a communication architecture in the organization domain when the cluster is started, and for registering, identifying and configuring the state of other nodes in the organization domain;
the global scheduler GC is used for the cluster to start the establishment of the inter-domain communication architecture and for the registration, identification and state configuration of the global parameter server GS and the parameter server S.
3. The hierarchical parameter server-based multi-institution collaborative learning system of claim 1, wherein the platform mode and the participation mode each comprise a fully synchronized mode;
the full synchronization mode is as follows: the parameter server S (or, in the participation mode, the global parameter server GS) must wait until all nodes in the working node group in its domain have uploaded their model updates before executing the aggregation operation, and the global parameter server GS performs the aggregation of the global model update and the update of the global model only after it has collected the aggregated model updates of all institutions.
4. A multi-mechanism collaborative learning method based on a hierarchical parameter server is characterized by comprising the following steps:
s1, starting and initializing the cluster;
s2, model update calculation: each participating institution working node W trains a local model on its training set using the Mini-Batch SGD (mini-batch stochastic gradient descent) method and judges whether it has traversed E rounds of the local training set; if so, the working node W computes a model update from its current model parameters and the initial global model parameters it pulled from the global parameter server GS through the intra-domain parameter server S, and proceeds to step S3; otherwise, it continues traversing the local training set until E rounds have been traversed;
meanwhile, in the participation mode, the master control working node MW and the other working nodes W in the central institution likewise train a local model on their training sets using the Mini-Batch SGD method and judge whether they have traversed E rounds of the local training set; if so, each node computes a model update from its current model parameters and the initial global model parameters it pulled from the global parameter server GS in its domain, and proceeds to step S3; otherwise, it continues traversing the local training set until E rounds have been traversed;
wherein E is a hyper-parameter;
s3, updating and aggregating intra-domain models: when the platform mode is adopted, model updating obtained by each participating mechanism working node W is uploaded to an intra-domain parameter server S to perform intra-domain model updating aggregation;
when in the participation mode, uploading the model update obtained by each participation mechanism working node W to an intra-domain parameter server S for intra-domain model update aggregation, and uploading the model update to an intra-domain global parameter server GS for intra-domain model update aggregation by a main control working node MW and other working nodes W in a central mechanism;
s4, global model updating and aggregating: when the model is in the platform mode, forwarding the aggregated model update to a global parameter server GS of a central institution by a parameter server S of each participating institution, and performing global aggregation on the model update by the global parameter server GS;
when in the participation mode, the parameter server S of each participating mechanism forwards the aggregated model update to the global parameter server GS of the central mechanism, and the global parameter server GS performs global aggregation on the model update submitted by the participating mechanism and the model update obtained in the central mechanism in the step S3;
s5, global model parameter updating: updating a global model by a global parameter server (GS) according to the model update of the global aggregation;
s6, model synchronization: when in the platform mode, the parameter server S of each participating institution initiates a pull request to the global parameter server GS to obtain the latest model parameters, then responds to the pull requests of the working nodes W in its domain and sends the latest model parameters to each working node, completing model synchronization of all working nodes;
when in the participation mode, the parameter server S of each participating institution initiates a pull request to the global parameter server GS to obtain the latest model parameters and then responds to the pull requests of the working nodes W in its domain by sending them the latest model parameters; the global parameter server GS of the central institution additionally responds to the pull requests of the master control working node MW and of the other working nodes W in its domain and sends them the latest model parameters, completing model synchronization of the master control working node MW, of each working node W in the central institution's domain, and of all working nodes globally;
s7, iterative training: judging whether the current iteration count t has reached the preset iteration count T; if so, the multi-institution collaborative learning process of the hierarchical parameter server ends; otherwise, return to S2 until the preset iteration count is reached.
5. The hierarchical parameter server-based multi-mechanism collaborative learning method according to claim 4, wherein the step S1 includes the steps of:
s101, a master control working node MW of a central mechanism sends configuration information to a global parameter server GS, an operation mode is set to be a platform mode or a participation mode, a synchronous mode is set to be a full synchronous mode or an inter-domain asynchronous mode, an optimization algorithm is set, a compression mode is set, initial global model parameters are sent to the global parameter server GS, and cluster configuration and global model initialization are completed;
s102, the working node W_0 identified as 0 in each participating institution initializes the model storage space of the parameter server S in its domain;
s103, the working nodes of all participating mechanisms pull initial global model parameters from the global parameter server through the intra-domain parameter server S to complete global model synchronization, so that cluster starting and initialization are completed.
6. The multi-mechanism collaborative learning method based on the hierarchical parameter server of claim 4, wherein the expression for training the local model by Mini-Batch SGD stochastic gradient descent method in the step S2 is as follows:
w_{sr}^{(t),k+1} = w_{sr}^{(t),k} - \eta \cdot \frac{1}{B} \sum_{j=1}^{B} \nabla_{w_{sr}^{(t),k}} l(x_{sr}^{k,j}, y_{sr}^{k,j}, w_{sr}^{(t),k})    (1)

wherein w_{sr}^{(t),k} is the model parameter of working node W_{sr} after completing k local updates in the t-th communication round, and W_{sr} is the r-th working node in institution p_s; \nabla_{w_{sr}^{(t),k}} l(\cdot) is the gradient of the average loss with respect to the model parameter w_{sr}^{(t),k}; B is the number of samples contained in the batch data {X_{sr}^{(t),k}, Y_{sr}^{(t),k}}; j is the index of the j-th training sample {x_{sr}^{k,j}, y_{sr}^{k,j}} in a batch containing B training samples; l(x, y, w) is the loss function, which measures the error of the model parameter w on the training sample {x, y};
the expression for computing the model update is as follows:

\Delta w_{sr}^{(t)} = w_{sr}^{(t),EL_{sr}} - w_{sr}^{(t),0}    (2)

wherein \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, W_{sr} is the r-th working node in institution p_s, and E is the number of rounds of traversing the local training set; w_{sr}^{(t),0} is the initial model parameter pulled by W_{sr} in the t-th communication round; w_{sr}^{(t),EL_{sr}} is the model parameter of W_{sr} after completing EL_{sr} local updates in the t-th communication round, and L_{sr} is the number of local updates required to iterate over one round of the local training set.
7. The hierarchical parameter server-based multi-mechanism collaborative learning method according to claim 4, wherein the expression of intra-domain model update aggregation in the step S3 is as follows:
\Delta w_{s}^{(t)} = \frac{1}{m_s} \sum_{r=1}^{m_s} \Delta w_{sr}^{(t)}    (3)

wherein \Delta w_{s}^{(t)} is the model update aggregated within the domain of institution p_s in the t-th communication round, m_s is the number of nodes in the working node group of the s-th institution, \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, and W_{sr} is the r-th working node in institution p_s.
8. The hierarchical parameter server-based multi-mechanism collaborative learning method according to claim 4, wherein the expression of the global model update aggregation in the step S4 when in the platform mode is as follows:

\Delta w^{(t)} = \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=2}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (4)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of participating institution p_s, n_s is the total number of samples in the training set of participating institution p_s, and \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round;
the expression of the global model update aggregation when in the participation mode is as follows:

\Delta w^{(t)} = \frac{n_1}{\sum_{s=1}^{|p|} n_s} \cdot \frac{1}{m_1} \sum_{r=1}^{m_1} \Delta w_{1r}^{(t)} + \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=1}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (5)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of institution p_s, n_s is the total number of samples in the training set of institution p_s, \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round, m_1 is the total number of master control working nodes MW and working nodes W in the central institution, r indexes the r-th working node W_{1r} in the central institution p_1 (r = 1 denotes the master control working node MW), and \Delta w_{1r}^{(t)} is the model update uploaded by working node W_{1r} of the central institution p_1 in the t-th communication round.
9. The hierarchical parameter server-based multi-mechanism collaborative learning method according to claim 4, wherein the global model parameter update in the step S5 is expressed as follows:
w^{(t+1)} ← w^{(t)} + \Delta w^{(t)}    (6)

wherein w^{(t+1)} is the latest global model parameter after completing the global update in the t-th communication round, w^{(t)} is the original global model parameter in the t-th communication round, and \Delta w^{(t)} is the globally aggregated model update in the t-th communication round.
CN201911220964.6A 2019-12-03 2019-12-03 Multi-mechanism collaborative learning system and method based on hierarchical parameter server Active CN110995488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911220964.6A CN110995488B (en) 2019-12-03 2019-12-03 Multi-mechanism collaborative learning system and method based on hierarchical parameter server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911220964.6A CN110995488B (en) 2019-12-03 2019-12-03 Multi-mechanism collaborative learning system and method based on hierarchical parameter server

Publications (2)

Publication Number Publication Date
CN110995488A CN110995488A (en) 2020-04-10
CN110995488B true CN110995488B (en) 2020-11-03

Family

ID=70089563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911220964.6A Active CN110995488B (en) 2019-12-03 2019-12-03 Multi-mechanism collaborative learning system and method based on hierarchical parameter server

Country Status (1)

Country Link
CN (1) CN110995488B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111580970B (en) * 2020-05-07 2023-02-03 电子科技大学 Transmission scheduling method for model distribution and aggregation of federated learning
CN111898137A (en) * 2020-06-30 2020-11-06 深圳致星科技有限公司 Private data processing method, equipment and system for federated learning
CN112465043B (en) * 2020-12-02 2024-05-14 平安科技(深圳)有限公司 Model training method, device and equipment
CN113626687A (en) * 2021-07-19 2021-11-09 浙江师范大学 Online course recommendation method and system taking federal learning as core
CN114429223B (en) * 2022-01-26 2023-11-07 上海富数科技有限公司 Heterogeneous model building method and device
CN114500642A (en) * 2022-02-25 2022-05-13 百度在线网络技术(北京)有限公司 Model application method and device and electronic equipment
CN115174404B (en) * 2022-05-17 2024-06-21 南京大学 Multi-device federal learning system based on SDN networking

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330516A (en) * 2016-04-29 2017-11-07 腾讯科技(深圳)有限公司 Model parameter training method, apparatus and system
WO2019111118A1 (en) * 2017-12-04 2019-06-13 International Business Machines Corporation Robust gradient weight compression schemes for deep learning applications
CN110380917A (en) * 2019-08-26 2019-10-25 深圳前海微众银行股份有限公司 Control method, device, terminal device and the storage medium of federal learning system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102006026631B4 (en) * 2006-06-08 2011-06-22 ZF Friedrichshafen AG, 88046 Device for driving an oil pump
CN102196372B (en) * 2010-03-01 2014-12-10 ***通信集团公司 Method, device, portable terminal and system for movably monitoring network alarm in real-time

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330516A (en) * 2016-04-29 2017-11-07 腾讯科技(深圳)有限公司 Model parameter training method, apparatus and system
WO2019111118A1 (en) * 2017-12-04 2019-06-13 International Business Machines Corporation Robust gradient weight compression schemes for deep learning applications
CN110380917A (en) * 2019-08-26 2019-10-25 深圳前海微众银行股份有限公司 Control method, device, terminal device and the storage medium of federal learning system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
parameter_server architecture; Zhang Yushi; https://blog.csdn.net/stdcoutzyx/article/details/51241868; 2016-04-25; full text *
Research on distributed machine learning based on parameter server; Li Pei; Wanfang Database; 2017-06-21; full text *

Also Published As

Publication number Publication date
CN110995488A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110995488B (en) Multi-mechanism collaborative learning system and method based on hierarchical parameter server
CN103856480B (en) User datagram protocol packet moving method and device in virtual machine (vm) migration
CN108122032A (en) A kind of neural network model training method, device, chip and system
CN105684357A (en) Management of addresses in virtual machines
CN114584581B (en) Federal learning system and federal learning training method for intelligent city internet of things (IOT) letter fusion
CN110222005A (en) Data processing system and its method for isomery framework
CN104113596A (en) Cloud monitoring system and method for private cloud
CN107145673B (en) Joint simulation system and method
CN114710330B (en) Anomaly detection method based on heterogeneous layered federated learning
CN102982209A (en) Space network visual simulation system and method based on HLA (high level architecture)
CN105681474A (en) System architecture for supporting upper layer applications based on enterprise-level big data platform
CN109819032A (en) A kind of base station selected cloud robot task distribution method with computation migration of joint consideration
CN112104491A (en) Service-oriented network virtualization resource management method
CN110689174B (en) Personnel route planning method and device based on public transportation
CN116382843A (en) Industrial AI power-calculating PaaS platform based on Kubernetes container technology
Kim et al. Reducing model cost based on the weights of each layer for federated learning clustering
He et al. Beamer: stage-aware coflow scheduling to accelerate hyper-parameter tuning in deep learning clusters
CN106227465A (en) A kind of data placement method of ring structure
CN115841197A (en) Path planning method, device, equipment and storage medium
Jin et al. Adaptive and optimized agent placement scheme for parallel agent‐based simulation
CN107231291A (en) A kind of micro services partition method and device suitable for electric network information physical system
Ning et al. A data oriented analysis and design method for smart complex software systems of IoT
Liu et al. Accelerated dual averaging methods for decentralized constrained optimization
Sengupta et al. Collaborative learning-based schema for predicting resource usage and performance in F2C paradigm
Pattnayak et al. Green IoT based technology for sustainable smart cities

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant