CN110995488B - Multi-mechanism collaborative learning system and method based on hierarchical parameter server - Google Patents


Info

Publication number
CN110995488B
Authority
CN
China
Prior art keywords
global
parameter server
model
domain
working node
Prior art date
Legal status
Active
Application number
CN201911220964.6A
Other languages
Chinese (zh)
Other versions
CN110995488A (en
Inventor
虞红芳
李宗航
李晴
孙罡
周华漫
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911220964.6A priority Critical patent/CN110995488B/en
Publication of CN110995488A publication Critical patent/CN110995488A/en
Application granted granted Critical
Publication of CN110995488B publication Critical patent/CN110995488B/en

Classifications

    • H04L 41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L 41/0826 Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability, for reduction of network costs
    • H04L 41/0893 Assignment of logical groups to network elements
    • H04L 63/20 Network architectures or network communication protocols for network security, for managing network security; network security policies in general
    • H04L 67/51 Discovery or management of network services, e.g. service location protocol [SLP] or web services


Abstract

The invention discloses a multi-mechanism collaborative learning system based on a hierarchical parameter server, which comprises a central mechanism and a plurality of participating mechanisms connected with the central mechanism through a WAN (wide area network). Based on this system, the invention also discloses a multi-mechanism collaborative learning method based on the hierarchical parameter server. The invention solves the data island problem of big data, the data privacy and security problem in multi-party cooperation, and the problems of high communication cost, high maintenance cost, high security risk and low resource utilization of existing systems. On the premise of ensuring data privacy and security, the invention realizes multi-party collaborative learning with high communication efficiency and high computation efficiency, and is suitable for the cross-domain interconnection of multiple independent mechanisms and multiple data centers. The system supports a platform mode and a participation mode: it can be used as a platform to provide multi-party knowledge fusion services, or as a tool to support shared cooperation among a plurality of independent mechanisms.

Description

Multi-mechanism collaborative learning system and method based on hierarchical parameter server
Technical Field
The invention belongs to the technical field of electronics, and particularly relates to a multi-mechanism collaborative learning system and method based on a hierarchical parameter server.
Background
In the 5G era of high-speed interconnection of everything, the speed of data acquisition and the amount of accumulated data are growing explosively, marking human society's true entry into the big data era. Big data places higher requirements on data mining capability, and the rapid development of artificial intelligence provides strong data mining and analysis capability for many advanced scientific fields, so that intelligent applications can extract core knowledge from huge data volumes and organically combine this knowledge to execute complex tasks such as detection, identification, prediction, decision making and generation, for example face recognition in Alipay, face detection at China Customs, and human pose recognition in Douyin short videos. For deep learning, the core technology of artificial intelligence, more data usually means better application performance and generalization capability, which also improves the reliability and competitiveness of artificial intelligence applications.
However, the development of artificial intelligence faces the contradiction between big data and data islands. A data island refers to the phenomenon that massive data is scattered, like dust, across various organizations (such as enterprises, schools, research institutes, hospitals, etc.). Because of data islands, these numerous and independent organizations lack enough data to train a high-performance model, and their data exhibit preferences (biases) caused by factors such as geographic location, business type and data acquisition time, which finally results in inefficient or unusable models. Data islands therefore force these independent mechanisms to cooperate with each other to improve the performance of artificial intelligence applications.
Therefore, the data island problem must be solved: data fusion application channels across industries must be opened, data barriers between different fields must be broken, and the aggregation and value-adding functions of big data must be brought into full play, laying a firm data foundation for the application of artificial intelligence in all fields. At the same time, artificial intelligence technology should be applied steadily, establishing a new form of intelligent application characterized by data driving, cross-border fusion and co-creation sharing.
Data sharing is the simplest and most direct solution to the data island problem: the island data of multiple organizations are collected into a trusted organization or a shared cross-domain distributed database for data cleaning and data analysis. However, this solution violates the principle of data privacy protection and faces extremely high risks of data leakage and data abuse.
For the cross-domain multi-center scenario, the invention analyzes two types of domains (inter-domain and intra-domain), proposes a novel Multi-Party Collaborative Learning concept that isolates inter-domain and intra-domain communication, and provides a multi-party collaborative learning architecture based on a hierarchical Parameter Server (HiPS). The architecture inherits the privacy-protection characteristics and advantages of federated learning; compared with the single-layer architecture of federated learning, in the cross-domain multi-center scenario (multiple computing domains, each containing multiple computing nodes) it can greatly reduce network pressure, reduce security risks and improve resource utilization, thereby accelerating the training process of artificial intelligence.
Disclosure of Invention
Aiming at the above defects in the prior art, the multi-mechanism collaborative learning system and method based on a hierarchical parameter server provided by the invention solve the data island problem of big data, the data privacy and security problem in multi-party cooperation, and the problems of high communication cost, high maintenance cost, high security risk and low resource utilization of existing systems.
In order to achieve the above purpose, the invention adopts the technical scheme that:
the scheme provides a multi-mechanism collaborative learning system based on a hierarchical parameter server, which comprises a central mechanism and a plurality of participating mechanisms connected with the central mechanism through a WAN (wide area network);
the domain of each participating mechanism and the domain of the central mechanism both adopt a Parameter Server architecture, and the domain of each participating mechanism and the domain of the central mechanism are respectively set as 1L-PS layers;
and a Parameter Server architecture is adopted between the central mechanism and each participating mechanism, and a 2L-PS layer is set between the central mechanism and each participating mechanism.
Further, when the central mechanism is in the platform mode:
the central mechanism comprises a master control working node MW, a global parameter server GS connected with the master control working node MW through a LAN (local area network), a local scheduler LC respectively connected with the master control working node MW and the global parameter server GS through the LAN, and a global scheduler GC connected with the global parameter server GS through the LAN, wherein the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN (wide area network);
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN (local area network) and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN (wide area network);
when the central mechanism is in the participation mode:
the central mechanism comprises a global parameter server GS, a global scheduler GC, a working node group consisting of a master control working node MW and a plurality of working nodes W and a local scheduler LC; the global parameter server GS, the local scheduler LC and the working node group are connected with each other through a LAN network, and the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN network;
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN.
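For orientation, the following Python sketch enumerates the node composition described above for the two deployment modes. It is an illustrative summary only; the role names follow the description, while the data structures and function names are assumptions introduced here and are not part of the patent.

```python
from dataclasses import dataclass
from typing import List

# Node roles of the HiPS architecture: master worker, global parameter server,
# global scheduler, local scheduler, (domain) parameter server, worker.
MW, GS, GC, LC, S, W = "MW", "GS", "GC", "LC", "S", "W"

@dataclass
class Domain:
    name: str
    lan_nodes: List[str]   # roles deployed inside this mechanism's domain (LAN)
    wan_roles: List[str]   # roles of this domain that connect to other domains (WAN)

def central_mechanism(mode: str, n_workers: int = 0) -> Domain:
    """Node composition of the central mechanism in 'platform' or 'participation' mode."""
    if mode == "platform":
        # MW only configures and initializes; no training workers in the central domain.
        return Domain("central", lan_nodes=[MW, GS, LC, GC], wan_roles=[GS, GC])
    # Participation mode: MW and additional workers also train on the central data.
    return Domain("central", lan_nodes=[MW, GS, LC, GC] + [W] * n_workers, wan_roles=[GS, GC])

def participating_mechanism(name: str, n_workers: int) -> Domain:
    """Each participating mechanism deploys S, LC and a worker group; S reaches GS and GC over the WAN."""
    return Domain(name, lan_nodes=[S, LC] + [W] * n_workers, wan_roles=[S])
```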
Further, in the participation mode, the master control working node MW is configured to send configuration information to the global parameter server GS and to initialize the global model parameters; it is further configured to calculate model updates using the training set data and computing resources of the mechanism where it is located, upload the model updates to the intra-domain global parameter server GS, and send pull requests to the intra-domain global parameter server GS;
in the participation mode, the global parameter server GS is used for aggregating the model updates of the intra-domain working node group in the 1L-PS layer, and the aggregated model updates are used for the aggregation of the global model updates in the 2L-PS layer; it is further configured to respond to the pull requests of the intra-domain working node group, including the master control working node MW, and to issue the latest model parameters to them;
in the platform mode, the global parameter server GS is used for aggregating global model updates in the 2L-PS layer, updating global model parameters, responding to a pull request of the parameter server S, and issuing the model parameters to each working node W along a request path;
in the platform mode, the master control working node MW is configured to send configuration information to the global parameter server GS, and initialize global model parameters;
the parameter server S is used for aggregating model updates of the working node groups in the domain in the 1L-PS layer, uploading the aggregated model updates to the global parameter server GS in the 2L-PS layer, and sending a pull request to the global parameter server GS;
in the participation mode, a working node W in a participating mechanism is used for calculating model updates using the training set data and computing resources of the mechanism where it is located, and for uploading the model updates to the intra-domain parameter server S; it is further configured to send pull requests to the intra-domain parameter server S;
in the participation mode, a working node W in the central mechanism is used for calculating model update by using training set data and calculation resources of the mechanism where the working node W is located, uploading the model update to a global parameter server GS and pulling the latest model parameters from the global parameter server GS;
the local scheduler LC is used for establishing the intra-domain communication architecture of its mechanism at cluster startup, and for the registration, identification and state configuration of the other nodes in the mechanism's domain;
the global scheduler GC is used for establishing the inter-domain communication architecture at cluster startup, and for the registration, identification and state configuration of the global parameter server GS and the parameter servers S.
Still further, the platform mode and the participation mode both include a full synchronization mode;
the full synchronization mode is as follows: the parameter server S (or, in the participation mode, the global parameter server GS) must wait until all nodes in its intra-domain working node group have uploaded their model updates before executing the aggregation operation, and only when the global parameter server GS has collected the aggregated model updates of all mechanisms does it perform the aggregation of the global model update and the update of the global model.
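The full synchronization rule can be pictured with the minimal sketch below: an aggregator (a parameter server S, or the global parameter server GS in the participation mode) buffers incoming model updates and aggregates only once every expected sender has reported. Class and method names are illustrative assumptions, not part of the patent, and the averaging rule is a placeholder.

```python
class SyncAggregator:
    """Barrier-style aggregation: wait for all expected senders, then aggregate once."""

    def __init__(self, expected_senders):
        self.expected = set(expected_senders)
        self.buffer = {}

    def push(self, sender_id, update):
        # Store the update (e.g. a NumPy array) and report whether the barrier is full.
        self.buffer[sender_id] = update
        return self.ready()

    def ready(self):
        return set(self.buffer) == self.expected

    def aggregate(self, weights=None):
        # Plain or weighted average of the buffered updates, then reset the barrier.
        ids = sorted(self.buffer)
        if weights is None:
            agg = sum(self.buffer[i] for i in ids) / len(ids)
        else:
            total = sum(weights[i] for i in ids)
            agg = sum(weights[i] / total * self.buffer[i] for i in ids)
        self.buffer.clear()
        return agg
```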
Based on the system, the invention also discloses a multi-mechanism collaborative learning method based on the hierarchical parameter server, which comprises the following steps:
s1, starting and initializing the cluster;
s2, model update computation: each working node W of each participating mechanism trains a local model on its training set using the Mini-Batch SGD (mini-batch stochastic gradient descent) method, and judges whether it has traversed the local training set for E rounds; if so, it computes the model update from its current model parameters and the initial global model parameters it pulled from the global parameter server GS through the intra-domain parameter server S, and proceeds to step S3; otherwise it continues traversing the local training set until E rounds have been completed;
meanwhile, in the participation mode, the master control working node MW and the other working nodes W in the central mechanism train local models on their training sets using the Mini-Batch SGD method and judge whether they have traversed the local training set for E rounds; if so, each such node computes its model update from its current model parameters and the initial global model parameters it pulled from the intra-domain global parameter server GS, and proceeds to step S3; otherwise it continues traversing the local training set until E rounds have been completed;
wherein E is a hyper-parameter;
s3, updating and aggregating intra-domain models: when the platform mode is adopted, model updating obtained by each participating mechanism working node W is uploaded to an intra-domain parameter server S to perform intra-domain model updating aggregation;
when in the participation mode, uploading the model update obtained by each participation mechanism working node W to an intra-domain parameter server S for intra-domain model update aggregation, and uploading the model update to an intra-domain global parameter server GS for intra-domain model update aggregation by a main control working node MW and other working nodes W in a central mechanism;
s4, global model updating and aggregating: when the model is in the platform mode, forwarding the aggregated model update to a global parameter server GS of a central institution by a parameter server S of each participating institution, and performing global aggregation on the model update by the global parameter server GS;
when in the participation mode, the parameter server S of each participating agency forwards the aggregated model update to the global parameter server GS of the central agency, and the global parameter server GS performs global aggregation on the model update submitted by the participating agency and the model update obtained in the central agency in step S3.
S5, global model parameter updating: updating a global model by a global parameter server (GS) according to the model update of the global aggregation;
s6, model synchronization: when in the platform mode, the parameter server S of each participating mechanism initiates a pull request to the global parameter server GS to obtain the latest model parameters, and responds to the pull requests of the working nodes W in its own domain by issuing the latest model parameters to each working node, thereby completing the model synchronization of the global working nodes;
when in the participation mode, the parameter server S of each participating mechanism initiates a pull request to the global parameter server GS to obtain the latest model parameters and responds to the pull requests of the working nodes W in its own domain by issuing the latest model parameters to each working node; the global parameter server GS of the central mechanism responds to the pull requests of the master control working node MW and the other working nodes W in its domain and sends them the latest model parameters, thereby completing the model synchronization of the master control working node MW, of each working node W in the central mechanism's domain, and of the global working nodes;
s7, iterative training: judging whether the current iteration number t has reached the preset number of iterations T; if so, the multi-mechanism collaborative learning process based on the hierarchical parameter server is finished; otherwise, returning to S2 until the preset number of iterations is reached.
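To make the control flow of steps S1 to S7 easier to follow, the sketch below condenses one training run into a single loop, restricted to the platform mode with full synchronization. Function names and dictionary keys are hypothetical placeholders, and the aggregation rules follow the equations reconstructed later in the description; this is a schematic sketch, not the patented implementation.

```python
import numpy as np

def hips_training_platform_mode(init_global_model, institutions, T, E):
    """One HiPS run in platform mode with full synchronization (steps S1-S7).

    institutions: {name: {"workers": [callable local_train(w, E) -> np.ndarray update],
                          "num_samples": n_s}}
    """
    w = init_global_model()                                   # S1: cluster start-up and init
    for _ in range(T):                                        # S7: T communication rounds
        domain_updates, n = {}, {}
        for name, inst in institutions.items():               # each participating mechanism
            worker_updates = [train(w, E) for train in inst["workers"]]      # S2
            domain_updates[name] = np.mean(worker_updates, axis=0)           # S3
            n[name] = inst["num_samples"]
        total = sum(n.values())
        delta_w = sum(n[k] / total * domain_updates[k] for k in domain_updates)  # S4
        w = w + delta_w                                        # S5: global parameter update
        # S6: model synchronization: each S pulls w from GS and each W pulls w from its S;
        # in this single-process sketch the new w is simply reused in the next round.
    return w
```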
Further, the step S1 includes the following steps:
s101, a master control working node MW of a central mechanism sends configuration information to a global parameter server GS, an operation mode is set to be a platform mode or a participation mode, a synchronous mode is set to be a full synchronous mode or an inter-domain asynchronous mode, an optimization algorithm is set, a compression mode is set, initial global model parameters are sent to the global parameter server GS, and cluster configuration and global model initialization are completed;
s102, the working node W0 (the working node marked as 0) of each participating mechanism initializes the model storage space of the intra-domain parameter server S;
s103, the working nodes of all participating mechanisms pull the initial global model parameters from the global parameter server through the intra-domain parameter servers S to complete global model synchronization, thereby completing cluster startup and initialization.
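A minimal sketch of the configuration that the master control working node MW might send to the global parameter server GS in step S101 follows. The field names and values are assumptions introduced for illustration; the patent does not specify a concrete message format.

```python
# Hypothetical S101 configuration payload sent from MW to GS at cluster start-up.
cluster_config = {
    "run_mode": "platform",          # "platform" or "participation"
    "sync_mode": "full_sync",        # "full_sync" or "inter_domain_async"
    "optimizer": "mini_batch_sgd",   # optimization algorithm used by the workers
    "compression": None,             # inter-domain compression mode (disabled here)
    "epochs_per_round": 5,           # hyper-parameter E
    "total_rounds": 100,             # preset number of communication rounds T
}
# S101 also ships the initial global model parameters to GS; in S102 the worker W0 of each
# participating mechanism initializes the model storage of its intra-domain server S; in
# S103 every worker pulls the initial global model through its intra-domain server S.
```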
Still further, in step S2, the expression for training the local model by the Mini-Batch SGD (mini-batch stochastic gradient descent) method is as follows:

$$w_{sr}^{(t),k+1} = w_{sr}^{(t),k} - \eta \cdot \frac{1}{B} \sum_{(x_j,\, y_j) \in \mathcal{B}_{sr}} \nabla_{w_{sr}^{(t),k}} \ell\big(w_{sr}^{(t),k};\, (x_j, y_j)\big) \qquad (1)$$

where $w_{sr}^{(t),k}$ denotes the model parameters of working node $W_{sr}$ after completing $k$ local updates in the $t$-th communication round, $W_{sr}$ is the $r$-th working node in mechanism $p_s$, $\eta$ is the learning rate, $\nabla_{w_{sr}^{(t),k}} \ell(\cdot)$ is the gradient of the average loss with respect to the model parameters $w_{sr}^{(t),k}$, $B$ is the number of samples contained in the batch data $\mathcal{B}_{sr}$, $j$ indexes the $j$-th training sample $(x_j, y_j)$ in a batch containing $B$ training samples, and $\ell(\cdot)$ is the loss function, which measures the error of the model parameters $w_{sr}^{(t),k}$ on the training sample $(x_j, y_j)$;

the expression for computing the model update is as follows:

$$\Delta w_{sr}^{(t)} = w_{sr}^{(t),\,E L_{sr}} - w_{sr}^{(t),\,0} \qquad (2)$$

where $\Delta w_{sr}^{(t)}$ is the model update obtained by $W_{sr}$ after iterating over the local training set for $E$ rounds in the $t$-th communication round, $W_{sr}$ is the $r$-th working node in mechanism $p_s$, $E$ is the number of rounds of traversal of the local training set, $w_{sr}^{(t),0}$ is the initial model parameter pulled by $W_{sr}$ in the $t$-th communication round, $w_{sr}^{(t),E L_{sr}}$ is the model parameter after $W_{sr}$ completes $E L_{sr}$ local updates, and $L_{sr}$ is the number of local updates required to iterate through one round of the local training set.
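A small NumPy sketch of equations (1) and (2) as reconstructed above follows. The linear model, squared loss and learning rate are illustrative assumptions; the patent does not fix a particular model or loss function.

```python
import numpy as np

def local_train(w0, X, y, E, B, lr=0.01):
    """Equations (1)-(2): E epochs of mini-batch SGD starting from the pulled global
    parameters w0, returning the model update delta_w = w_final - w0.
    Uses a linear model with squared loss purely for illustration."""
    w = w0.copy()
    n = len(X)
    for _ in range(E):                          # traverse the local training set E times
        order = np.random.permutation(n)
        for start in range(0, n, B):            # L_sr = ceil(n / B) local updates per epoch
            batch = order[start:start + B]
            Xb, yb = X[batch], y[batch]
            grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)   # gradient of the average loss
            w -= lr * grad                      # eq. (1): one mini-batch SGD step
    return w - w0                               # eq. (2): update after E * L_sr local steps
```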
Still further, the expression for intra-domain model update aggregation in step S3 is as follows:

$$\Delta w_{s}^{(t)} = \frac{1}{m_s} \sum_{r=1}^{m_s} \Delta w_{sr}^{(t)} \qquad (3)$$

where $\Delta w_{s}^{(t)}$ is the aggregated intra-domain model update of mechanism $p_s$ in the $t$-th communication round, $m_s$ is the number of nodes in the working node group of the $s$-th mechanism, $\Delta w_{sr}^{(t)}$ is the model update obtained by $W_{sr}$ after iterating over the local training set for $E$ rounds in the $t$-th communication round, and $W_{sr}$ is the $r$-th working node in mechanism $p_s$.
Still further, the expression for global model update aggregation in the platform mode in step S4 is as follows:

$$\Delta w^{(t)} = \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=2}^{|p|} n_{s'}} \, \Delta w_{s}^{(t)} \qquad (4)$$

where $\Delta w^{(t)}$ is the globally aggregated model update in the $t$-th communication round, $|p|$ is the total number of central and participating mechanisms, $s$ is the index of participating mechanism $p_s$, $n_s$ is the total number of samples in the training set of participating mechanism $p_s$, and $\Delta w_{s}^{(t)}$ is the aggregated intra-domain model update of participating mechanism $p_s$ in the $t$-th communication round;

the expression for global model update aggregation in the participation mode is as follows:

$$\Delta w^{(t)} = \frac{n_1}{\sum_{s'=1}^{|p|} n_{s'}} \cdot \frac{1}{m_1} \sum_{r=1}^{m_1} \Delta w_{1r}^{(t)} + \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=1}^{|p|} n_{s'}} \, \Delta w_{s}^{(t)} \qquad (5)$$

where $\Delta w^{(t)}$ is the globally aggregated model update in the $t$-th communication round, $|p|$ is the total number of central and participating mechanisms, $s$ is the index of mechanism $p_s$, $n_s$ is the total number of samples in the training set of mechanism $p_s$, $\Delta w_{s}^{(t)}$ is the aggregated intra-domain model update of participating mechanism $p_s$ in the $t$-th communication round, $m_1$ is the total number of master control working nodes MW and working nodes W in the central mechanism, $r$ is the index of working node $W_{1r}$ in the central mechanism $p_1$ (with $r = 1$ denoting the master control working node MW), and $\Delta w_{1r}^{(t)}$ is the model update uploaded by working node $W_{1r}$ of the central mechanism $p_1$ in the $t$-th communication round.
Still further, the expression for the global model parameter update in step S5 is as follows:

$$w^{(t+1)} \leftarrow w^{(t)} + \Delta w^{(t)} \qquad (6)$$

where $w^{(t+1)}$ is the latest global model parameter after completing the global update in the $t$-th communication round, $w^{(t)}$ is the original global model parameter in the $t$-th communication round, and $\Delta w^{(t)}$ is the globally aggregated model update in the $t$-th communication round.
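The sketch below combines the reconstructed equations (4) to (6): a sample-count-weighted aggregation of the intra-domain updates (with the central mechanism's worker updates averaged first in the participation mode), followed by the global parameter update. The weighting scheme follows the reconstruction above and should be read as an assumption rather than the definitive formula.

```python
import numpy as np

def global_aggregate(domain_updates, sample_counts, central_worker_updates=None,
                     central_samples=0):
    """Equations (4)/(5): weighted aggregation of intra-domain model updates.

    domain_updates:         {mechanism: aggregated intra-domain update (np.ndarray)}
    sample_counts:          {mechanism: n_s}
    central_worker_updates: updates of the central working node group (participation
                            mode only; None in platform mode)
    """
    total = sum(sample_counts.values())
    if central_worker_updates:
        total += central_samples
    delta = sum(sample_counts[s] / total * u for s, u in domain_updates.items())
    if central_worker_updates:                   # eq. (5): add the central mechanism's share
        central_avg = np.mean(np.stack(central_worker_updates), axis=0)
        delta = delta + central_samples / total * central_avg
    return delta

def global_update(w, delta_w):
    """Equation (6): w_(t+1) <- w_(t) + delta_w_(t)."""
    return w + delta_w
```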
The invention has the beneficial effects that:
(1) the invention solves the data island problem. The system and the method provided by the invention break through the data barriers among a plurality of independent mechanisms, provide a solution with high calculation efficiency and high communication efficiency for multi-party knowledge fusion, and finally promote the construction of a new data fusion application schema with data driving, cross-border fusion and co-creation sharing;
(2) the invention has endogenous data privacy security capability. The system provided by the invention interacts highly abstract model data among a plurality of independent mechanisms instead of the data, so that the original data is prevented from being uploaded to an unsafe network and an untrusted third-party mechanism, and data leakage and data abuse are effectively prevented;
(3) the invention is suitable for platform as a service business model. The system provided by the invention is operated in a Platform mode, namely, the business mode corresponds to a Platform as a Service (PaaS), under the mode, a holder of the system serves as a central mechanism to provide a safe and efficient multi-party knowledge fusion Platform and Service, other mechanisms serve as participating mechanisms to search cooperation mechanisms on the Platform, and the multi-party knowledge fusion Service provided by the Platform is utilized to complete multi-party collaborative learning;
(4) the invention is suitable for business model of software as service. The system provided by the invention is operated in a participation mode, namely, the business mode of corresponding Software as a Service (SaaS), under the mode, a holder does not participate in multi-party collaborative learning as a central mechanism or a participation mechanism, but the system and the method provided by the invention are provided as tools to support the multi-party collaborative learning of other independent mechanisms;
(5) the invention has a wide application range. Compared with other classical distributed deep learning frameworks in the industry, the system and method provided by the invention cover a wider range of scenarios, including: distributed deep learning within a single-mechanism single data center, distributed deep learning across the multiple data centers of a single mechanism, distributed deep learning across the multiple data centers of multiple mechanisms, and cloud-edge-device distributed deep learning across wide area networks;
(6) the access objects of the invention are more targeted. Compared with existing federated learning systems in the industry, the system and method provided by the invention are better suited to connecting the data centers of real organizations, rather than connecting individual users or logical individuals within an organization;
(7) the invention has low communication cost. Compared with a common single-layer parameter server architecture, the system provided by the invention comprehensively analyzes the characteristics of two types of domains (intra-domain and inter-domain) in a multi-mechanism multi-data-center multi-party collaborative learning scene and isolates inter-domain and intra-domain, so that the multi-party collaborative learning system based on the layered parameter server architecture provided by the invention can greatly reduce the communication flow of the whole and a central mechanism, thereby greatly reducing the communication cost;
(8) the system provided by the invention can greatly reduce the number of WAN-crossing network connections between participating mechanisms and a central mechanism, and reduce the complexity of cluster management personnel of each mechanism in managing and maintaining cluster communication connections, thereby reducing the cost of cluster management and maintenance;
(9) the system provided by the invention is low in safety risk, and only needs to participate in 2 network connections exposed by mechanisms to the external network, so that not only is the occupation of communication resources greatly reduced, but also the mechanisms are prevented from exposing too many ports to the external network, thereby facilitating the monitoring and safety precaution of cluster management personnel of each mechanism on the cluster running state, and further reducing the risk of the cluster suffering from network safety attack;
(10) the system provided by the invention is low in deployment cost, the system is suitable for a general server cluster and a GPU cluster, the mechanism only needs to deploy a software environment without replacing server equipment and network equipment, cross-domain communication between the mechanisms also depends on the existing wide area network hardware to realize interconnection and intercommunication, and extra hardware facility cost investment is not needed.
Drawings
Fig. 1 is a deployment architecture diagram in the platform mode in the present embodiment.
Fig. 2 is a deployment architecture diagram in the participation mode in the present embodiment.
FIG. 3 is a flow chart of the method of the present invention.
Fig. 4 is a schematic view of a traffic model of the HiPS framework in the full synchronization mode and the platform mode in this embodiment.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate the understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of the embodiments, and it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined and defined in the appended claims, and all matters produced by the invention using the inventive concept are protected.
Examples
In the cross-domain multi-center scenario, the intra-domain environment features high bandwidth, low delay and homogeneous computing and communication resources, and is secure and reliable, whereas the inter-domain environment features low bandwidth, high delay and heterogeneous computing and communication resources, and is insecure and unreliable. Isolating intra-domain and inter-domain communication can therefore maximize the intra-domain resource utilization and minimize the inter-domain communication pressure, and gives each mechanism the flexibility to select a suitable communication topology according to its own computing cluster environment. The invention provides HiPS, a multi-party collaborative learning architecture based on a hierarchical parameter server, which isolates intra-domain and inter-domain data interaction through the layered parameter servers. The intra-domain parameter server sends the intra-domain fused model update to the central mechanism, and the global parameter server of the central mechanism performs the aggregation of the global model updates and the update and synchronization of the global model.
As shown in fig. 1-2, the present invention discloses a multi-institution collaborative learning system based on a hierarchical parameter server, which comprises a central institution and a plurality of participating institutions connected with the central institution through a WAN network; the domain of each participating mechanism and the domain of the central mechanism both adopt a Parameter Server architecture, and the domain of each participating mechanism and the domain of the central mechanism are respectively set as 1L-PS layers; and a ParameterServer architecture is adopted between the central mechanism and each participating mechanism, and a 2L-PS layer is set between the central mechanism and each participating mechanism.
As shown in fig. 1, when the central mechanism is in the platform mode:
the central mechanism comprises a master control working node MW, a global parameter server GS connected with the master control working node MW through a LAN (local area network), a local scheduler LC respectively connected with the master control working node MW and the global parameter server GS through the LAN, and a global scheduler GC connected with the global parameter server GS through the LAN, wherein the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN (wide area network);
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN (local area network) and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN (wide area network).
As shown in fig. 2, when the central mechanism is in the participation mode:
the central mechanism comprises a global parameter server GS, a global scheduler GC, a working node group consisting of a master control working node MW and a plurality of working nodes W and a local scheduler LC; the global parameter server GS, the local scheduler LC and the working node group are connected with each other through a LAN network, and the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN network;
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN.
In this embodiment, in the participation mode, the master control working node MW is configured to send configuration information to the global parameter server GS and to initialize the global model parameters; it is further configured to calculate model updates using the training set data and computing resources of the mechanism where it is located, upload the model updates to the intra-domain global parameter server GS, and send pull requests to the intra-domain global parameter server GS;
in the participation mode, the global parameter server GS is used for aggregating the model updates of the intra-domain working node group in the 1L-PS layer, and the aggregated model updates are used for the aggregation of the global model updates in the 2L-PS layer; it is further configured to respond to the pull requests of the intra-domain working node group, including the master control working node MW, and to issue the latest model parameters to them;
in the platform mode, the global parameter server GS is used for aggregating global model updates in the 2L-PS layer, updating global model parameters, responding to a pull request of the parameter server S, and issuing the model parameters to each working node W along a request path;
in the platform mode, the master control working node MW is configured to send configuration information to the global parameter server GS, and initialize global model parameters;
the parameter server S is used for aggregating model updates of the working node groups in the domain in the 1L-PS layer, uploading the aggregated model updates to the global parameter server GS in the 2L-PS layer, and sending a pull request to the global parameter server GS;
in the participation mode, a working node W in a participating mechanism is used for calculating model updates using the training set data and computing resources of the mechanism where it is located, and for uploading the model updates to the intra-domain parameter server S; it is further configured to send pull requests to the intra-domain parameter server S;
in the participation mode, a working node W in the central mechanism is used for calculating model update by using training set data and calculation resources of the mechanism where the working node W is located, uploading the model update to a global parameter server GS and pulling the latest model parameters from the global parameter server GS;
the local scheduler LC is used for establishing the intra-domain communication architecture of its mechanism at cluster startup, and for the registration, identification and state configuration of the other nodes in the mechanism's domain;
the global scheduler GC is used for establishing the inter-domain communication architecture at cluster startup, and for the registration, identification and state configuration of the global parameter server GS and the parameter servers S.
The platform mode and the participation mode both include a full synchronization mode; the full synchronization mode is as follows: the parameter server S (or, in the participation mode, the global parameter server GS) must wait until all nodes in its intra-domain working node group have uploaded their model updates before executing the aggregation operation, and only when the global parameter server GS has collected the aggregated model updates of all mechanisms does it perform the aggregation of the global model update and the update of the global model.
In this embodiment, as shown in fig. 1, the multi-party collaborative learning architecture based on the hierarchical parameter server includes six types of nodes: a working node w (worker), a parameter server s (server), a Global parameter server gs (Global server), a master working node mw (master worker), a Global Scheduler (GC) and a local Scheduler (LC). The central mechanism operates in a platform mode, and a Parameter Server architecture is adopted in the domain and the inter-domain, which are respectively called as a 1L-PS layer and a 2L-PS layer. The working node W only participates in the 1L-PS layer training, the working node W uses training set data and computing resource computing model updating of the mechanism where the working node W is located, and model updating is uploaded to the intra-domain parameter server S. And the intra-domain parameter server S aggregates the model updates of the intra-domain working node groups in the 1L-PS layer and continuously uploads the aggregated model updates to the global parameter server GS in the 2L-PS layer. The global parameter server GS aggregates the global model updates and updates the global model parameters in the 2L-PS layer. And then, each working node W sends a pulling request to the intra-domain parameter server S, each intra-domain parameter server S sends a pulling request to the global parameter server GS, and the global parameter server GS responds to the pulling request and sends the latest model parameters to each working node W along the request path. In particular, the master working node MW of the central authority is only used for configuring training modes (such as full synchronous/inter-domain asynchronous mode, platform/participating mode, central aggregation/central update mode, turning on/off inter-domain compression, etc.) and initializing global model parameters, and it exits after completing configuration and initialization operations, does not participate in model training, and does not contribute to data and computational power. In the figure, a thick solid line represents transmission model parameters, a thick dotted line represents transmission model updating, a thin solid line represents transmission configuration information, inter-domain transmission is WAN network transmission, and intra-domain transmission is LAN network transmission.
In this embodiment, as shown in fig. 2, the multi-party collaborative learning architecture based on the hierarchical parameter server includes six types of nodes: a working node w (worker), a parameter server s (server), a Global parameter server gs (Global server), a master working node mw (master worker), a Global Scheduler (GC) and a local Scheduler (LC). The central mechanism operates in a participation mode, and a Parameter Server architecture is adopted between domains, namely a 1L-PS layer and a 2L-PS layer. The central organization not only needs to provide the multi-party knowledge fusion service as a platform, but also needs to provide data and computing power. Besides the global parameter server GS and the master control working node MW node, a plurality of working nodes W are also deployed in the central organization, and are responsible for training a local model and updating a calculation model based on data owned by the organization together with the master control working node MW. The participating mechanism comprises a working node group with a plurality of working nodes W and is responsible for data contribution, computing power and model training, the central mechanism comprises a working node group with a master control working node MW and a plurality of working nodes W and is also responsible for training a local model and updating a computing model based on data owned by the central mechanism. In the figure, a thick solid line represents transmission model parameters, a thick dotted line represents transmission model updating, a thin solid line represents transmission configuration information, inter-domain transmission is WAN network transmission, and intra-domain transmission is LAN network transmission. Supplementary description of data communication between master working node MW and global parameter server GS in the central authority:
master working node MW → global parameter server GS:
in the cluster configuration phase, the master control working node MW sends configuration information to the global parameter server GS.
Master working node MW → global parameter server GS:
in the global parameter initialization phase, the master working node MW sends the initial global model parameters to the global parameter server GS.
Master working node MW → global parameter server GS:
in the global aggregation update and synchronization stage in the participation mode, the master control working node MW also sends model updates to the global parameter server GS, as do the other working nodes W in the central authority.
Master working node MW ← global parameter server GS:
in the global aggregation update and synchronization stage in the participation mode, the global parameter server GS sends the model parameters to the master control working node MW.
In this embodiment, the local scheduler LC and the global scheduler GC are only used for cluster startup, for example, each node needs to register itself with the scheduler to obtain information such as an identifier and communication addresses of other nodes.
In this embodiment, in the platform mode, the master control working node MW only has a control function (e.g., cluster mode configuration and global model initialization), but in the participation mode, the master control working node MW needs to assume the same function as the working node (e.g., update based on local data calculation model, upload/pull-down model) in addition to the control function. Thus, in the participating mode, the master working node MW also belongs to the working node group.
In the embodiment, in the platform mode, only the working node W of the participating mechanism participates in training; in the participation mode, except for the participation mechanism, the master control working node MW in the central mechanism and the working node W in the central mechanism participate in training.
In this embodiment, in the participation mode, the master control working node MW and other working nodes W in the central authority directly upload the model update to the global parameter server GS in the central authority for aggregation. In the participation mode, the global parameter server GS replaces the role of the parameter server S and fulfills its function, since the central authority has no parameter server S.
In this embodiment, in the platform mode, the master control work node MW has only the two functions described above, and only in the participation mode, the master control work node MW further undertakes the functions of model update calculation and upload and model parameter pull-down.
In this embodiment, in the participation mode, the master control working node MW and the plurality of working nodes W of the central mechanism also participate in training; in the participation mode, model updates generated by a work node group (including a master work node MW and a plurality of work nodes W) in the central authority are directly uploaded to a global parameter server GS in the domain (because the central authority does not have a parameter server S, and the global parameter server GS replaces the function of the parameter server S), and the pull request is also sent to the global parameter server GS.
In this embodiment, the central authority also uploads model updates to the global parameter server GS in the participation mode, the model updates in the central authority are first intra-domain aggregated by the global parameter server GS, and then the aggregated model updates are used for aggregation of global model updates. Then, the global parameter server GS updates the global model parameters, responds to the pull-down requests of the working node groups in the central organization in addition to the pull-down requests of other organizations, and directly issues the latest model parameters to each node (including the master working node MW and the plurality of working nodes W) in the working node groups.
In this embodiment, the mechanism where the global parameter server GS and the master work node MW are located is referred to as a central mechanism. In the present invention, the central mechanism supports the following two modes of operation:
1. platform mode. In the platform mode, the central authority does not provide data and computing power, but only provides a multi-party knowledge fusion service as a platform. The central authority needs to deploy a global parameter server GS, a master work node MW, a local scheduler LC and a global scheduler GC. The master control working node MW is responsible for configuring a cluster training mode and initializing global model parameters; the global parameter server GS is responsible for the aggregation of global model updates and the update and synchronization of global model parameters.
2. Participation mode. In the participation mode, the central authority needs to provide not only the multi-party knowledge fusion service, but also data and computing power. In the participation mode, the master work node MW, together with zero or more work nodes W, is responsible for training the local model and for computing and uploading model updates based on the data owned by the organization.
In this embodiment, intra-domain training only supports the synchronous mode, that is, the parameter server S in a participating mechanism's domain can execute the aggregation and forwarding operations only after all working nodes W in its domain have uploaded their model updates; likewise, the global parameter server GS in the central mechanism in the participation mode must wait for the intra-domain working node group (including the master control working node MW and the plurality of working nodes W) to upload their model updates before executing the aggregation operation. In the present invention, the central mechanism supports the following two synchronization modes:
1. full synchronization mode. In the full synchronization mode, the synchronization mode is adopted both in the intra-domain and the inter-domain. On the basis of an intra-domain synchronization mode, in a platform mode, after the parameter servers S of all participating institutions upload model updates to the global parameter server GS, the global parameter server GS performs aggregation of global model updates and update of a global model; in the participation mode, when the parameter servers S of all participating institutions and all nodes in the central institution work node group upload model updates to the global parameter server GS, the global parameter server GS performs aggregation of global model updates and update of the global model once.
2. Inter-domain asynchronous mode. In this embodiment, in addition to the full synchronization mode, an inter-domain asynchronous mode may be included, and according to the difference between the calculation and communication capabilities of the mechanisms and the difference between the requirements of the mechanisms on the convergence accuracy and speed of the model, in the inter-domain asynchronous mode, a synchronous mode is used in the domain, and an asynchronous mode is used in the inter-domain. The main difference between inter-domain asynchronous mode and fully synchronous mode is that inter-domain asynchronous mode does not require inter-domain aggregation, i.e., intra-domain aggregated model updates submitted by any authority will be used directly to update the global model without waiting for other authorities. In the platform mode, when receiving intra-domain aggregation model update submitted by a parameter server S from any participating organization, a global parameter server GS is used for updating global model parameters and responding to a latest model requested to be pulled by a source organization; in the participation mode, the processing flow of the participation mechanism is the same as that in the platform mode, in the central mechanism, the global parameter server GS performs aggregation of intra-domain model updates after collecting all model updates of the intra-domain work node group, and immediately uses the intra-domain aggregation model updates for updating the global model, and then the global parameter server GS immediately responds to the latest model requested to be pulled by all nodes (including the master work node MW and the plurality of work nodes W) of the intra-domain work node group.
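In contrast to the SyncAggregator sketch given earlier, the following sketch illustrates how the global parameter server might behave in the inter-domain asynchronous mode: each intra-domain aggregated update is applied to the global model as soon as it arrives, and the latest model is returned to the submitting mechanism without waiting for the other mechanisms. The class, its sample-count weighting and its method names are assumptions for illustration only.

```python
import numpy as np

class AsyncGlobalServer:
    """Inter-domain asynchronous mode: apply each domain's aggregated update immediately."""

    def __init__(self, w_init, sample_counts):
        self.w = np.array(w_init, dtype=float)
        total = sum(sample_counts.values())
        self.weights = {s: n / total for s, n in sample_counts.items()}

    def push_and_pull(self, mechanism, domain_update):
        # No inter-domain barrier: fold the update into the global model right away ...
        self.w = self.w + self.weights[mechanism] * np.asarray(domain_update)
        # ... and immediately answer the source mechanism's pull request with the latest model.
        return self.w
```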
Table 1 summarizes the main functions of the hierarchical-parameter-server-based multi-party collaborative learning platform HiPS in the platform/participation modes and the fully synchronous/inter-domain asynchronous modes:
TABLE 1 (the table content is reproduced only as an image in the original publication)
Based on the system, the invention also discloses a multi-mechanism collaborative learning method based on the hierarchical parameter server, as shown in fig. 3, comprising the following steps:
s1, starting and initializing the cluster, wherein the implementation method comprises the following steps:
s101, a master control working node MW of a central mechanism sends configuration information to a global parameter server GS, an operation mode is set to be a platform mode or a participation mode, a synchronous mode is set to be a full synchronous mode or an inter-domain asynchronous mode, an optimization algorithm is set, a compression mode is set, initial global model parameters are sent to the global parameter server GS, and cluster configuration and global model initialization are completed;
s102, the working node W0 (the working node marked as 0) of each participating mechanism initializes the model storage space of the intra-domain parameter server S;
s103, the working nodes of all participating mechanisms pull the initial global model parameters from the global parameter server through the intra-domain parameter servers S to complete global model synchronization, thereby completing cluster startup and initialization;
s2, model update computation: each working node W of each participating mechanism trains a local model on its training set using the Mini-Batch SGD (mini-batch stochastic gradient descent) method, and judges whether it has traversed the local training set for E rounds; if so, it computes the model update from its current model parameters and the initial global model parameters it pulled from the global parameter server GS through the intra-domain parameter server S, and proceeds to step S3; otherwise it continues traversing the local training set until E rounds have been completed;
meanwhile, in the participation mode, the master control working node MW and the other working nodes W in the central mechanism train local models on their training sets using the Mini-Batch SGD method and judge whether they have traversed the local training set for E rounds; if so, each such node computes its model update from its current model parameters and the initial global model parameters it pulled from the intra-domain global parameter server GS, and proceeds to step S3; otherwise it continues traversing the local training set until E rounds have been completed;
wherein E is a hyper-parameter; each working node uploads its model update to its intra-domain parameter server immediately after finishing local training, without waiting for the other working nodes in its working node group.
In the step S2, the expression for training the local model by using the Mini-Batch SGD mini-batch stochastic gradient descent method is as follows:

w_{sr}^{(t),k+1} = w_{sr}^{(t),k} - \eta \cdot \frac{1}{B} \sum_{j=1}^{B} \nabla_{w_{sr}^{(t),k}} l(x_{sr}^{k,j}, y_{sr}^{k,j}, w_{sr}^{(t),k})    (1)

wherein w_{sr}^{(t),k} is the model parameter of working node W_{sr} after completing k local updates in the t-th communication round, and W_{sr} is the r-th working node in institution p_s; \nabla_{w_{sr}^{(t),k}} l(\cdot) is the gradient of the average loss with respect to the model parameter w_{sr}^{(t),k}; B is the number of samples contained in the batch data {X_{sr}^{(t),k}, Y_{sr}^{(t),k}}; j is the index of the j-th training sample {x_{sr}^{k,j}, y_{sr}^{k,j}} in a batch containing B training samples; l(x, y, w) is the loss function, which measures the error of the model parameter w on the training sample {x, y};
the expression for computing the model update is as follows:

\Delta w_{sr}^{(t)} = w_{sr}^{(t),EL_{sr}} - w_{sr}^{(t),0}    (2)

wherein \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, W_{sr} is the r-th working node in institution p_s, and E is the number of rounds of traversing the local training set; w_{sr}^{(t),0} is the initial model parameter pulled by W_{sr} in the t-th communication round; w_{sr}^{(t),EL_{sr}} is the model parameter of W_{sr} after completing EL_{sr} local updates in the t-th communication round, and L_{sr} is the number of local updates required to iterate over one round of the local training set.
s3, updating and aggregating intra-domain models: when the platform mode is adopted, model updating obtained by each participating mechanism working node W is uploaded to an intra-domain parameter server S to perform intra-domain model updating aggregation;
when in the participation mode, uploading the model update obtained by each participation mechanism working node W to an intra-domain parameter server S for intra-domain model update aggregation, and uploading the model update to an intra-domain global parameter server GS for intra-domain model update aggregation by a main control working node MW and other working nodes W in a central mechanism;
the expression of the intra-domain model update aggregation in step S3 is as follows:

\Delta w_{s}^{(t)} = \frac{1}{m_s} \sum_{r=1}^{m_s} \Delta w_{sr}^{(t)}    (3)

wherein \Delta w_{s}^{(t)} is the model update aggregated within the domain of institution p_s in the t-th communication round, m_s is the number of nodes in the working node group of the s-th institution, \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, and W_{sr} is the r-th working node in institution p_s.
S4, global model updating and aggregating: when the model is in the platform mode, forwarding the aggregated model update to a global parameter server GS of a central institution by a parameter server S of each participating institution, and performing global aggregation on the model update by the global parameter server GS;
when in the participation mode, the parameter server S of each participating mechanism forwards the aggregated model update to the global parameter server GS of the central mechanism, and the global parameter server GS performs global aggregation on the model update submitted by the participating mechanism and the model update obtained in the central mechanism in the step S3;
the expression of the global model update aggregation in step S4 when in the platform mode is as follows:

\Delta w^{(t)} = \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=2}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (4)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of participating institution p_s, n_s is the total number of samples in the training set of participating institution p_s, and \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round;
the expression of the global model update aggregation when in the participation mode is as follows:

\Delta w^{(t)} = \frac{n_1}{\sum_{s=1}^{|p|} n_s} \cdot \frac{1}{m_1} \sum_{r=1}^{m_1} \Delta w_{1r}^{(t)} + \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=1}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (5)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of institution p_s, n_s is the total number of samples in the training set of institution p_s, \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round, m_1 is the total number of master control working nodes MW and working nodes W in the central institution, r indexes the r-th working node W_{1r} in the central institution p_1 (r = 1 denotes the master control working node MW), and \Delta w_{1r}^{(t)} is the model update uploaded by working node W_{1r} of the central institution p_1 in the t-th communication round.
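The two aggregation rules can be illustrated as follows. This is a hedged sketch: the dictionary-based interface is an assumption, and the participation-mode branch reflects one plausible reading of equation (5) in which the central institution's per-node updates are first averaged and then weighted by its sample count.

```python
import numpy as np

def aggregate_global(inst_updates, inst_samples, central_node_updates=None, n_central=0):
    """Equations (4)/(5): sample-weighted aggregation of intra-domain updates.

    inst_updates / inst_samples : per participating institution, the aggregated
        update and the training-set size n_s (keyed by institution id)
    central_node_updates        : in participation mode, the updates uploaded by
        the central institution's MW and workers (None in platform mode)
    n_central                   : n_1, the central institution's sample count
    """
    total = sum(inst_samples.values()) + (n_central if central_node_updates else 0)
    dw = sum((n / total) * np.asarray(inst_updates[s], dtype=float)
             for s, n in inst_samples.items())           # platform-mode term, equation (4)
    if central_node_updates:                             # participation mode, equation (5)
        central_avg = sum(np.asarray(u, dtype=float)
                          for u in central_node_updates) / len(central_node_updates)
        dw = dw + (n_central / total) * central_avg
    return dw
```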
s5, global model parameter updating: updating a global model by a global parameter server (GS) according to the model update of the global aggregation;
the expression of the global model parameter update in step S5 is as follows:

w^{(t+1)} ← w^{(t)} + \Delta w^{(t)}    (6)

wherein w^{(t+1)} is the latest global model parameter after completing the global update in the t-th communication round, w^{(t)} is the original global model parameter in the t-th communication round, and \Delta w^{(t)} is the globally aggregated model update in the t-th communication round;
s6, model synchronization: when in the platform mode, the parameter server S of each participating institution initiates a pull request to the global parameter server GS to obtain the latest model parameters, then responds to the pull requests of the working nodes W in its domain and sends the latest model parameters to each working node, completing model synchronization of all working nodes;
when in the participation mode, the parameter server S of each participating institution initiates a pull request to the global parameter server GS to obtain the latest model parameters and then responds to the pull requests of the working nodes W in its domain by sending them the latest model parameters; the global parameter server GS of the central institution additionally responds to the pull requests of the master control working node MW and of the other working nodes W in its domain and sends them the latest model parameters, completing model synchronization of the master control working node MW, of each working node W in the central institution's domain, and of all working nodes globally;
s7, iterative training: judging whether the current iteration count t has reached the preset iteration count T; if so, the multi-institution collaborative learning process of the hierarchical parameter server ends; otherwise, return to S2 until the preset iteration count is reached.
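Putting steps S2-S7 together, a single sequence of communication rounds could be sketched as below in the platform mode, reusing the helper functions sketched above; the institution/worker objects and their attribute names are hypothetical.

```python
def run_rounds(w_global, institutions, T, E, eta):
    """Hedged sketch of the outer loop S2-S7 in platform mode.
    institutions: mapping s -> object with .workers (each holding .batches and
    .grad_fn) and .num_samples; these attribute names are assumptions."""
    w = w_global
    for t in range(T):                                              # S7: T communication rounds
        inst_updates, inst_samples = {}, {}
        for s, inst in institutions.items():
            node_updates = [local_train(w, wk.batches, wk.grad_fn, eta, E)
                            for wk in inst.workers]                 # S2: local model updates
            inst_updates[s] = aggregate_intra_domain(node_updates)  # S3: aggregation on S_s
            inst_samples[s] = inst.num_samples
        dw = aggregate_global(inst_updates, inst_samples)           # S4: aggregation on the GS
        w = w + dw                                                  # S5: equation (6)
        # S6: every working node pulls w before the next round (implicit here,
        # since the same w is passed back into local_train)
    return w
```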
In this embodiment, in the platform mode, the master control working node MW may directly exit the cluster because it does not participate in the training; however, in the participation mode, the master work node MW may participate in training together with other work nodes W in its domain, and at this time, the master work node MW may not exit the cluster.
In this embodiment, in the central authority, the initial model parameters submitted to the global parameter server GS by the master control working node MW are used for global model synchronization; in other participating institutions, however, only the worker node W0 identified as 0 will initialize the model storage space of the parameter servers S within its domain without assigning values to that space, and then each parameter server S will use the global initial model parameter assignment model storage space pulled from the global parameter server GS.
In this embodiment, the model training process is modeled based on the basic flow. Suppose |p| institutions participate in the collaborative training, and the institution set is p = {p_1, ..., p_s, ..., p_{|p|}}, where p_1 is the central institution. The s-th participating institution p_s comprises 1 parameter server node S_s and m_s working nodes, and its set of working nodes is W_s = {W_{s1}, ..., W_{sr}, ..., W_{s m_s}}, where W_{sr} is the r-th working node in participating institution p_s. Suppose the training set {X_s, Y_s} of participating institution p_s contains n_s training samples, and the training set {X_{sr}, Y_{sr}} of W_{sr} is the subset of {X_s, Y_s} held by the r-th working node and contains n_{sr} training samples; for one and the same participating institution p_s, the local training-set sample count n_{sr} is the same for all working nodes in its domain. The batch data obtained by the k-th sequential sampling of W_{sr} from {X_{sr}, Y_{sr}} is {X_{sr}^{(t),k}, Y_{sr}^{(t),k}} with batch size B, where {x_{sr}^{k,j}, y_{sr}^{k,j}} denotes the j-th training sample of that batch. W_{sr} needs L_{sr} = n_{sr}/B sequential samplings to traverse one round of {X_{sr}, Y_{sr}}, and completes EL_{sr} local updates after iterating over E rounds of the local training set. \eta is the learning rate of the working node's local optimizer, and l(x, y, w) is the loss function representing the error that model w produces on the training sample {x, y}. The Mini-Batch SGD stochastic gradient descent method is used below to simplify the analysis.
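The sequential sampling into L_sr = n_sr / B mini-batches described above might be written as follows; this is a sketch in which the function name is an assumption and n_sr is taken to be a multiple of B.

```python
def sequential_batches(X, Y, B):
    """Split the local training set {X_sr, Y_sr} into the L_sr = n_sr / B
    mini-batches obtained by sequential sampling with batch size B."""
    n = len(X)
    return [(X[k * B:(k + 1) * B], Y[k * B:(k + 1) * B]) for k in range(n // B)]
```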
In the present embodiment, on the r-th working node W_{sr} of participating institution p_s, the k-th local update is executed according to the following formula:

w_{sr}^{(t),k+1} = w_{sr}^{(t),k} - \eta \cdot \frac{1}{B} \sum_{j=1}^{B} \nabla_{w_{sr}^{(t),k}} l(x_{sr}^{k,j}, y_{sr}^{k,j}, w_{sr}^{(t),k})    (1)

wherein w_{sr}^{(t),k} is the model parameter of working node W_{sr} after completing k local updates in the t-th communication round, and W_{sr} is the r-th working node in institution p_s; \nabla_{w_{sr}^{(t),k}} l(\cdot) is the gradient of the average loss with respect to the model parameter w_{sr}^{(t),k} of working node W_{sr} at the t-th communication round and k-th local iteration; B is the number of samples contained in the batch data {X_{sr}^{(t),k}, Y_{sr}^{(t),k}}; j is the index of the j-th training sample {x_{sr}^{k,j}, y_{sr}^{k,j}} in a batch containing B training samples; l(x, y, w) is the loss function, which measures the error of the model parameter w on the training sample {x, y}.
In the present embodiment, after the r-th working node W_{sr} of participating institution p_s completes E rounds of traversal of its local data set, the working node computes the model update according to the following formula:

\Delta w_{sr}^{(t)} = w_{sr}^{(t),EL_{sr}} - w_{sr}^{(t),0}    (2)

wherein \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, W_{sr} is the r-th working node in institution p_s, and E is the number of rounds of traversing the local training set; w_{sr}^{(t),0} is the initial model parameter pulled by W_{sr} in the t-th communication round; w_{sr}^{(t),EL_{sr}} is the model parameter of W_{sr} after completing EL_{sr} local updates in the t-th communication round, and L_{sr} is the number of local updates required to iterate over one round of the local training set.
Subsequently, the model updates \Delta w_{sr}^{(t)} computed by all working nodes W_{sr} (r ∈ [1, m_s]) of participating institution p_s are uploaded to the intra-domain parameter server S_s, which performs the aggregation of the intra-domain model updates according to the following formula:

\Delta w_{s}^{(t)} = \frac{1}{m_s} \sum_{r=1}^{m_s} \Delta w_{sr}^{(t)}    (3)

wherein \Delta w_{s}^{(t)} is the model update aggregated within the domain of institution p_s in the t-th communication round, m_s is the number of nodes in the working node group of the s-th institution, \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, and W_{sr} is the r-th working node in institution p_s.
In this embodiment, the parameter servers S = {S_2, ..., S_s, ..., S_{|p|}} of the participating institutions forward the intra-domain aggregated model updates to the global parameter server GS, and the global parameter server GS performs the aggregation of the global model update.
In the present embodiment, in the platform mode, the central institution p_1 neither trains the model nor uploads model updates to the global parameter server GS, and the aggregation formula of the global model update is as follows:

\Delta w^{(t)} = \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=2}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (4)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of participating institution p_s, n_s is the total number of samples in the training set of participating institution p_s, and \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round.
In this embodiment, in the participation mode, the model updates submitted to the global parameter server GS by the master control working node and the working nodes W_{1r} (r ∈ [1, m_1]) of the central institution p_1 also need to be considered, and the global model aggregation formula is as follows:

\Delta w^{(t)} = \frac{n_1}{\sum_{s=1}^{|p|} n_s} \cdot \frac{1}{m_1} \sum_{r=1}^{m_1} \Delta w_{1r}^{(t)} + \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=1}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (5)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of institution p_s, n_s is the total number of samples in the training set of institution p_s, \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round, m_1 is the total number of master control working nodes MW and working nodes W in the central institution, r indexes the r-th working node W_{1r} in the central institution p_1 (r = 1 denotes the master control working node MW), and \Delta w_{1r}^{(t)} is the model update uploaded by working node W_{1r} of the central institution p_1 in the t-th communication round.
In this embodiment, after the global model update aggregation is completed, the global parameter server GS uses the globally aggregated model update \Delta w^{(t)} to update the global model w^{(t)} according to the following formula:

w^{(t+1)} ← w^{(t)} + \Delta w^{(t)}    (6)

wherein w^{(t+1)} is the latest global model parameter after completing the global update in the t-th communication round, w^{(t)} is the original global model parameter in the t-th communication round, and \Delta w^{(t)} is the globally aggregated model update in the t-th communication round.
For further explanation of the present invention, the following description will be given by taking the traffic model and communication process of HiPS in the multi-party collaborative learning framework based on the hierarchical parameter server as an example:
the traffic model of the multi-party collaborative learning framework HiPS based on the hierarchical parameter server developed by the invention is shown in FIG. 4. The central institution typically needs to be started before the participating institutions. The multi-party collaborative learning framework HiPS based on the hierarchical parameter server comprises four stages: cluster start-up and initialization, local training within an institution, global aggregation update and synchronization, and cluster stop and destruction. To simplify the analysis, the flow model diagram only shows the three main stages: cluster start-up and initialization, local training within an institution, and global aggregation update and synchronization. In the full synchronization mode and the platform mode, the detailed communication process of the HiPS framework developed by the invention is described as follows. As shown in fig. 4, a thick solid line represents the transmission of model parameters or model updates, a thin solid line represents the transmission of control information, a thick dotted line represents model parameters or model updates exchanged with other institutions not shown, and a thin dotted line represents control information exchanged with other institutions not shown. Fig. 4 shows the flow model of one central institution and one participating institution; the other participating institutions are simplified and hidden, and their model data/control information interaction is indicated by dashed lines.
Stage one: global model parameter initialization
Step one: the master control working node MW in the central institution p_1 uploads the initial model parameters w^{(1)} to the global parameter server GS, and the global parameter server GS uses w^{(1)} to initialize the global model parameters. At the same time, the first working node W_{s1} of each participating institution p_s (s ∈ [2, |p|]) uploads model parameters of arbitrary values to its intra-domain parameter server S_s; these are used to initialize the model parameter storage space of the parameter server within each participating institution without assigning model parameter values. The other working nodes W_{sr} (r ∈ [2, m_s]) of participating institution p_s (s ∈ [2, |p|]) are not responsible for initializing their intra-domain parameter server S_s and directly initiate pull requests to S_s to pull the global initial model parameters.
Step two: after the global parameter server GS completes the initialization of the global model parameters, it sends an ACK to the master control working node MW. The master control working node MW exits the cluster after confirming that the global model initialization is completed. At the same time, the parameter server S_s in each participating institution p_s (s ∈ [2, |p|]) sends an ACK to its working node W_{s1}. Working node W_{s1} continues with the subsequent steps after confirming that the initialization of the model parameter space is completed.
Step three: parameter server S ═ S in each participating institution2,...,Ss,...,S|p|Sending a pulling request to a global parameter server GS for obtaining a global initialization model parameter w(1)
Step four: after the global parameter server GS completes the model parameter initialization of step one, the global parameter server GS responds to the parameter server S ═ S of each participating organization2,...,Ss,...,S|p|Get request and return initial model parameters w(1)
Step five: participating institutions ps(s∈[2,|p|]) Inner working node Ws1Upon receipt of its intradomain parameter server SsAfter the returned ACK, continue to SsA pull request is initiated to pull global initial model parameters.
Step six: participating institutions ps(s∈[2,|p|]) Parameter server SsAfter step four is completed, responding each work node W in the domainsr(r∈[1,ms]) Pull request and return initial model parameters w(1). Each working node Wsr(r∈[1,ms]) Using w(1)Initializing local model parameters, and finally enabling all working nodes to have the same initial model parameters
Figure BDA0002300828820000211
Model initialization and model synchronization of the global working node are completed.
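Steps one to six can be summarized with the following hypothetical push/pull interface; none of these method names (store, allocate_like, pull) come from the patent, they merely illustrate the ordering of the messages.

```python
def initialize_cluster(gs, servers, workers, w_init):
    """Stage one (steps one to six): initialize and synchronize the global model.
    gs         : global parameter server of the central institution
    servers[s] : intra-domain parameter server S_s of participating institution p_s
    workers[s] : list of working nodes of p_s (hypothetical objects)"""
    gs.store(w_init)                      # step one: MW uploads w(1) to the GS
    for s, server in servers.items():
        server.allocate_like(w_init)      # step one: W_s1 initializes S_s's storage space
        server.store(gs.pull())           # steps three/four: S_s pulls w(1) from the GS
        for wk in workers[s]:
            wk.model = server.pull()      # steps five/six: workers pull w(1) from S_s
```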
Stage two: local training within an institution
Step seven: participating in institution p when performing in-institution local training for the tth communication rounds(s∈[2,|p|]) Inner working node Wsr(r∈[1,ms]) Based on local model parameters
Figure BDA0002300828820000212
And local training set data { Xsr,YsrComputing model updates, where { Xsr,YsrAre sequentially divided into Lsr=nsrB small batch data. Working node WsrLocal training formula (1) is carried out on the small Batch of data by using a Mini-Batch SGD optimizer with the learning rate of eta (or an optimizer such as Adam, RMSProp and the like) until the local training set E wheel is circularly traversed, and local updating EL is carried out together at the momentsrNext, the process is carried out. Finally, the working node WsrModel updating by calculation of equation (2)
Figure BDA0002300828820000213
Stage three: global aggregation update and synchronization
Step eight: participating institutions ps(s∈[2,|p|]) Inner working node Wsr(r∈[1,ms]) Updating the model calculated in the step seven
Figure BDA0002300828820000221
Upload to parameter server S within its domains
Step nine: participating institutions ps(s∈[2,|p|]) Parameter server SsAll work nodes W in its domain are collectedsr(r∈[1,ms]) After updating the model of (3), the model in the aggregation domain is updated according to the formula
Figure BDA0002300828820000222
Subsequently, the parameter server S updates the intra-domain aggregated model
Figure BDA0002300828820000223
And continuously uploading to a global parameter server GS.
Step ten: after the global parameter server GS has collected the model updates uploaded by the parameter servers S = {S_2, ..., S_s, ..., S_{|p|}} of all institutions, it aggregates the global model update according to formula (4) to obtain \Delta w^{(t)} (if the central institution p_1 runs in the participation mode, the global model update is aggregated according to formula (5) to obtain \Delta w^{(t)}). Then, the global parameter server GS uses \Delta w^{(t)} to update the global model parameters w^{(t)} according to formula (6), obtaining the latest global model parameters w^{(t+1)}. Finally, the global parameter server GS returns ACKs to the parameter servers S = {S_2, ..., S_s, ..., S_{|p|}} of the institutions, allowing them to perform a pull operation to pull the latest global model parameters w^{(t+1)}.
Step eleven: parameter server S ═ { S ═ S for each participating institution2,...,Ss,...,S|p|After receiving the ACK response of the global parameter server GS, initiating a pull request to the global parameter server GS to pull the latest global model parameter w(t+1)
Step twelve: the global parameter server GS responds to each participating agency parameter server S ═ S2,...,Ss,...,S|p|Get request and return the latest global model parameter w(t+1)
Step thirteen: participating institutions ps(s∈[2,|p|]) Parameter server SsAfter pulling the latest global model parameter w(t+1)Then all the working nodes W in the domainsr(r∈[1,ms]) Sending ACK responses, allowing them to perform a pull operation to pull up the latest global model parameters w(t+1)
Fourteen steps: participating institutions ps(s∈[2,|p|]) Working node W ofsr(r∈[1,ms]) Upon receipt of its intradomain parameter server SsAfter ACK response, continue to SsInitiating a pull request to pull the latest global model parameter w(t+1)
Step fifteen: participating institutions ps(s∈[2,|p|]) Parameter server SsReceives its working node W in domainsr(r∈[1,ms]) Pull request and return the latest global model parameters w(t+1). Working node Wsr(r∈[1,ms]) Latest global model parameters w using pull(t+1)The local model parameters are overlaid. Thus, the aggregation of the global model update, the update of the global model and the synchronization of the global model of the t-th communication turn are completed.
Step sixteen: judge whether the current communication round t has reached the specified number of training rounds T, or whether the current global model accuracy has reached the specified accuracy threshold. If the stop condition is met, training stops and the cluster enters the stop and destruction stage; otherwise, let t ← t + 1 and continue from step seven until the stop condition is reached.
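The stop condition of step sixteen can be expressed compactly; the accuracy_threshold argument is an assumption (the patent allows either a round budget or an accuracy target).

```python
def should_stop(t, T, accuracy=None, accuracy_threshold=None):
    """Step sixteen: stop when the round budget T is reached or, if configured,
    when the global model accuracy reaches the specified threshold."""
    if t >= T:
        return True
    return (accuracy is not None and accuracy_threshold is not None
            and accuracy >= accuracy_threshold)
```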
When implemented in a deployment environment, the invention can be deployed in a cross-domain, multi-institution data center cluster comprising one central institution and several participating institutions. The technology supports the central institution running in two modes, namely the platform mode and the participation mode. In the platform mode, the central institution serves as a service platform providing multi-party knowledge fusion services for the participating institutions, but contributes neither data nor computing power. In the participation mode, the institutions negotiate and select one of them as the central institution, which must both provide the multi-party knowledge fusion service and contribute data and computing power to the model training. Multiple nodes within a participating institution's data center can take part in training at the same time; these nodes can be interconnected in any physical topology, but it must be ensured that at least two nodes can intercommunicate with the other nodes. The HiPS framework based on the hierarchical parameter server developed by this technology can run on general-purpose servers as well as in GPU clusters. The recommended configuration for a single data center is: computing nodes deployed on GPU clusters with the same or similar computing capacity, and scheduler nodes and parameter server nodes deployed on general-purpose servers; in the platform mode, the recommended configuration for the central institution is that all nodes are deployed on a general-purpose server.
In this embodiment, as shown in fig. 1, in the platform mode, the central authority serves as a platform to provide the multi-party knowledge fusion service. The central mechanism needs to deploy a global parameter server node, a global scheduler node, a local scheduler node and a master control working node on a general server, and expose two network ports to the external network, which correspond to the communication ports of the global parameter server node and the global scheduler node respectively. Three types of nodes within the central authority need to communicate with each other. A central authority may deploy multiple global parameter server nodes for load balancing. The participating mechanism needs to deploy parameter server nodes, local scheduler nodes and working nodes, wherein the participating mechanism can only deploy one parameter server node and one local scheduler node on the general server, but can deploy a plurality of working nodes on the GPU server. The local scheduler node needs to be interoperable with other nodes in the participating enterprise, and the parameter server node needs to be interoperable with the working node. The parameter servers of the participating institutions need to expose two network ports to the external network for establishing network connections with the global parameter server and the global scheduler of the central institution. As shown in fig. 2, in the participating mode, the central authority may additionally deploy a plurality of working nodes on the GPU server, which need to interwork with the local scheduler node and the parameter server node. The local scheduler is connected with a local parameter server node (in a central mechanism, namely a global parameter server node) and all nodes in the intra-domain working node group; the global scheduler is connected with the global parameter server node and the parameter server nodes in all the participating institutions.
In this embodiment, the central mechanism and the participating mechanisms may be located in different WAN network environments, the data center clusters of the central mechanism and the participating mechanisms in the participating mode are recommended to be configured as clusters having larger-scale computing cards of the same type or similar computing power and interconnected with a high-speed stable network, and the central mechanism in the platform mode may be deployed in a general server cluster. If the requirements are not met, the system can be directly deployed in a general server cluster.
In this embodiment, the central mechanism may select different combinations of modes according to actual requirements, and the modes include: full synchronous mode/inter-domain asynchronous mode, platform mode/participation mode, central aggregation mode/central update mode. Multiple global parameter servers may also be configured to achieve load balancing for the central authority. Optionally, compression of the parameters may be turned on to reduce traffic. These functions can be quickly and easily turned on or off upon invoking the interface provided by the framework of the present invention. If the stopping condition is set, the frame can automatically execute the cluster stopping process when the stopping condition is met, so that all related processes can be conveniently and quickly closed without manually stopping the cluster.
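For illustration only, a mode selection along the lines described here might be expressed as a configuration dictionary; every key below, and the start_cluster entry point, are hypothetical and do not reflect the framework's actual interface.

```python
hips_config = {
    "run_mode": "platform",                 # or "participation"
    "sync_mode": "full_sync",               # or "inter_domain_async"
    "central_mode": "central_aggregate",    # or "central_update"
    "num_global_parameter_servers": 2,      # several GS instances for load balancing
    "compress_parameters": True,            # optionally reduce WAN traffic
    "stop_condition": {"max_rounds": 100},  # lets the framework stop the cluster automatically
}
# start_cluster(hips_config)  # hypothetical entry point
```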
Through the above design, the invention solves the data island problem of big data, the data privacy and security problem in multi-party cooperation, and the problems of high communication cost, high maintenance cost, high security risk and low resource utilization of existing systems. On the premise of guaranteeing data privacy and security, the invention achieves multi-party collaborative learning with high communication efficiency and high computation efficiency, and is suitable for cross-domain interconnection of multiple independent institutions and multiple data centers. The system provided by the invention supports the platform mode and the participation mode: it can serve as a platform providing multi-party knowledge fusion services, or as a tool supporting shared cooperation among multiple independent institutions.
The layering thought in the HiPS framework based on the layering parameter server provided by the invention has the following advantages:
1) direct network connection between the cloud center parameter server and all computing nodes of each mechanism is avoided, the communication pressure of the cloud center parameter server can be greatly reduced, and the communication bottleneck is relieved;
2) the access number of the cloud center parameter servers is reduced from the number of global computing nodes to the number of mechanisms, and the number of the mechanisms is usually small, so that the cloud center parameter servers are suitable for using a parameter server framework;
3) the computing and communication environment within a domain is more ideal and homogeneous, and an institution can flexibly select different communication topologies according to its cluster scale without being tied to specific computing equipment, so that the computing and communication resources within the institution can be fully utilized.

Claims (9)

1. A multi-mechanism collaborative learning system based on a hierarchical parameter server is characterized by comprising a central mechanism and a plurality of participating mechanisms connected with the central mechanism through a WAN (wide area network);
the domain of each participating mechanism and the domain of the central mechanism both adopt a Parameter Server architecture, and the domain of each participating mechanism and the domain of the central mechanism are respectively set as 1L-PS layers;
a Parameter Server architecture is adopted between the central mechanism and each participating mechanism, and a 2L-PS layer is arranged between the central mechanism and each participating mechanism;
when the central mechanism is in the platform mode:
the central mechanism comprises a master control working node MW, a global parameter server GS connected with the master control working node MW through a LAN (local area network), a local scheduler LC respectively connected with the master control working node MW and the global parameter server GS through the LAN, and a global scheduler GC connected with the global parameter server GS through the LAN, wherein the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN (wide area network);
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN (local area network) and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN (wide area network);
when the central mechanism is in the participation mode:
the central mechanism comprises a global parameter server GS, a global scheduler GC, a working node group consisting of a master control working node MW and a plurality of working nodes W and a local scheduler LC; the global parameter server GS, the local scheduler LC and the working node group are connected with each other through a LAN network, and the global scheduler GC and the global parameter server GS are respectively connected with the participating mechanism through a WAN network;
the participation mechanism comprises a parameter server S, a working node group which is connected with the parameter server S through a LAN and consists of a plurality of working nodes W, and a local scheduler LC which is respectively connected with the parameter server S and the working nodes W through the LAN, wherein the parameter server S is respectively connected with a global parameter server GS and a global scheduler GC through a WAN.
2. The multi-mechanism collaborative learning system based on hierarchical parameter servers as claimed in claim 1, wherein in the participation mode, the master control working node MW is used for sending configuration information to the global parameter server GS and initializing the global model parameters; and is further used for computing model updates by using the training set data and computing resources of the institution where it is located, uploading the model updates to the global parameter server GS in its domain, and sending pull requests to the global parameter server GS in its domain;
in the participation mode, the global parameter server GS is used for aggregating the model updates of the working node group in its domain in the 1L-PS layer and for aggregating the global model update in the 2L-PS layer from the aggregated model updates; and is further used for responding to the pull requests of the parameter servers S, of the master control working node MW and of the working node group in its domain;
in the platform mode, the global parameter server GS is used for aggregating global model updates in the 2L-PS layer, updating global model parameters, responding to a pull request of the parameter server S, and issuing the model parameters to each working node W along a request path;
in the platform mode, the master control working node MW is configured to send configuration information to the global parameter server GS, and initialize global model parameters;
the parameter server S is used for aggregating model updates of the working node groups in the domain in the 1L-PS layer, uploading the aggregated model updates to the global parameter server GS in the 2L-PS layer, and sending a pull request to the global parameter server GS;
in the participation mode, a working node W in a participating institution is used for computing model updates by using the training set data and computing resources of the institution where it is located and uploading the model updates to the parameter server S in its domain; and is further used for sending pull requests to the parameter server S in its domain;
in the participation mode, a working node W in the central mechanism is used for calculating model update by using training set data and calculation resources of the mechanism where the working node W is located, uploading the model update to a global parameter server GS and pulling the latest model parameters from the global parameter server GS;
the local scheduler LC is used for establishing a communication architecture in the organization domain when the cluster is started, and for registering, identifying and configuring the state of other nodes in the organization domain;
the global scheduler GC is used for the cluster to start the establishment of the inter-domain communication architecture and for the registration, identification and state configuration of the global parameter server GS and the parameter server S.
3. The hierarchical parameter server-based multi-institution collaborative learning system of claim 1, wherein the platform mode and the participation mode each comprise a fully synchronized mode;
the full synchronization mode is as follows: the parameter server S (or, in the participation mode, the global parameter server GS) must wait until all nodes in the working node group in its domain have uploaded their model updates before executing the aggregation operation, and the global parameter server GS performs the aggregation of the global model update and the update of the global model only after it has collected the aggregated model updates of all institutions.
4. A multi-mechanism collaborative learning method based on a hierarchical parameter server is characterized by comprising the following steps:
s1, starting and initializing the cluster;
s2, model update calculation: each participating institution working node W trains a local model on its training set using the Mini-Batch SGD (mini-batch stochastic gradient descent) method and judges whether it has traversed E rounds of the local training set; if so, the working node W computes a model update from its current model parameters and the initial global model parameters it pulled from the global parameter server GS through the intra-domain parameter server S, and proceeds to step S3; otherwise, it continues traversing the local training set until E rounds have been traversed;
meanwhile, in the participation mode, the master control working node MW and the other working nodes W in the central institution likewise train a local model on their training sets using the Mini-Batch SGD method and judge whether they have traversed E rounds of the local training set; if so, each node computes a model update from its current model parameters and the initial global model parameters it pulled from the global parameter server GS in its domain, and proceeds to step S3; otherwise, it continues traversing the local training set until E rounds have been traversed;
wherein E is a hyper-parameter;
s3, updating and aggregating intra-domain models: when the platform mode is adopted, model updating obtained by each participating mechanism working node W is uploaded to an intra-domain parameter server S to perform intra-domain model updating aggregation;
when in the participation mode, uploading the model update obtained by each participation mechanism working node W to an intra-domain parameter server S for intra-domain model update aggregation, and uploading the model update to an intra-domain global parameter server GS for intra-domain model update aggregation by a main control working node MW and other working nodes W in a central mechanism;
s4, global model updating and aggregating: when the model is in the platform mode, forwarding the aggregated model update to a global parameter server GS of a central institution by a parameter server S of each participating institution, and performing global aggregation on the model update by the global parameter server GS;
when in the participation mode, the parameter server S of each participating mechanism forwards the aggregated model update to the global parameter server GS of the central mechanism, and the global parameter server GS performs global aggregation on the model update submitted by the participating mechanism and the model update obtained in the central mechanism in the step S3;
s5, global model parameter updating: updating a global model by a global parameter server (GS) according to the model update of the global aggregation;
s6, model synchronization: when in the platform mode, the parameter server S of each participating institution initiates a pull request to the global parameter server GS to obtain the latest model parameters, then responds to the pull requests of the working nodes W in its domain and sends the latest model parameters to each working node, completing model synchronization of all working nodes;
when in the participation mode, the parameter server S of each participating institution initiates a pull request to the global parameter server GS to obtain the latest model parameters and then responds to the pull requests of the working nodes W in its domain by sending them the latest model parameters; the global parameter server GS of the central institution additionally responds to the pull requests of the master control working node MW and of the other working nodes W in its domain and sends them the latest model parameters, completing model synchronization of the master control working node MW, of each working node W in the central institution's domain, and of all working nodes globally;
s7, iterative training: judging whether the current iteration count t has reached the preset iteration count T; if so, the multi-institution collaborative learning process of the hierarchical parameter server ends; otherwise, return to S2 until the preset iteration count is reached.
5. The hierarchical parameter server-based multi-mechanism collaborative learning method according to claim 4, wherein the step S1 includes the steps of:
s101, a master control working node MW of a central mechanism sends configuration information to a global parameter server GS, an operation mode is set to be a platform mode or a participation mode, a synchronous mode is set to be a full synchronous mode or an inter-domain asynchronous mode, an optimization algorithm is set, a compression mode is set, initial global model parameters are sent to the global parameter server GS, and cluster configuration and global model initialization are completed;
s102, the working node W_0 identified as 0 in each participating institution initializes the model storage space of the parameter server S in its domain;
s103, the working nodes of all participating mechanisms pull initial global model parameters from the global parameter server through the intra-domain parameter server S to complete global model synchronization, so that cluster starting and initialization are completed.
6. The multi-mechanism collaborative learning method based on the hierarchical parameter server of claim 4, wherein the expression for training the local model by Mini-Batch SGD stochastic gradient descent method in the step S2 is as follows:
w_{sr}^{(t),k+1} = w_{sr}^{(t),k} - \eta \cdot \frac{1}{B} \sum_{j=1}^{B} \nabla_{w_{sr}^{(t),k}} l(x_{sr}^{k,j}, y_{sr}^{k,j}, w_{sr}^{(t),k})    (1)

wherein w_{sr}^{(t),k} is the model parameter of working node W_{sr} after completing k local updates in the t-th communication round, and W_{sr} is the r-th working node in institution p_s; \nabla_{w_{sr}^{(t),k}} l(\cdot) is the gradient of the average loss with respect to the model parameter w_{sr}^{(t),k}; B is the number of samples contained in the batch data {X_{sr}^{(t),k}, Y_{sr}^{(t),k}}; j is the index of the j-th training sample {x_{sr}^{k,j}, y_{sr}^{k,j}} in a batch containing B training samples; l(x, y, w) is the loss function, which measures the error of the model parameter w on the training sample {x, y};
the expression for computing the model update is as follows:

\Delta w_{sr}^{(t)} = w_{sr}^{(t),EL_{sr}} - w_{sr}^{(t),0}    (2)

wherein \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, W_{sr} is the r-th working node in institution p_s, and E is the number of rounds of traversing the local training set; w_{sr}^{(t),0} is the initial model parameter pulled by W_{sr} in the t-th communication round; w_{sr}^{(t),EL_{sr}} is the model parameter of W_{sr} after completing EL_{sr} local updates in the t-th communication round, and L_{sr} is the number of local updates required to iterate over one round of the local training set.
7. The hierarchical parameter server-based multi-mechanism collaborative learning method according to claim 4, wherein the expression of intra-domain model update aggregation in the step S3 is as follows:
\Delta w_{s}^{(t)} = \frac{1}{m_s} \sum_{r=1}^{m_s} \Delta w_{sr}^{(t)}    (3)

wherein \Delta w_{s}^{(t)} is the model update aggregated within the domain of institution p_s in the t-th communication round, m_s is the number of nodes in the working node group of the s-th institution, \Delta w_{sr}^{(t)} is the model update obtained by W_{sr} after iterating over E rounds of its local training set in the t-th communication round, and W_{sr} is the r-th working node in institution p_s.
8. The hierarchical parameter server-based multi-mechanism collaborative learning method according to claim 4, wherein the expression of the global model update aggregation in the step S4 when in the platform mode is as follows:

\Delta w^{(t)} = \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=2}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (4)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of participating institution p_s, n_s is the total number of samples in the training set of participating institution p_s, and \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round;
the expression of the global model update aggregation when in the participation mode is as follows:

\Delta w^{(t)} = \frac{n_1}{\sum_{s=1}^{|p|} n_s} \cdot \frac{1}{m_1} \sum_{r=1}^{m_1} \Delta w_{1r}^{(t)} + \sum_{s=2}^{|p|} \frac{n_s}{\sum_{s'=1}^{|p|} n_{s'}} \Delta w_{s}^{(t)}    (5)

wherein \Delta w^{(t)} is the globally aggregated model update in the t-th communication round, |p| is the total number of central and participating institutions, s is the index of institution p_s, n_s is the total number of samples in the training set of institution p_s, \Delta w_{s}^{(t)} is the model update aggregated within the domain of participating institution p_s in the t-th communication round, m_1 is the total number of master control working nodes MW and working nodes W in the central institution, r indexes the r-th working node W_{1r} in the central institution p_1 (r = 1 denotes the master control working node MW), and \Delta w_{1r}^{(t)} is the model update uploaded by working node W_{1r} of the central institution p_1 in the t-th communication round.
9. The hierarchical parameter server-based multi-mechanism collaborative learning method according to claim 4, wherein the global model parameter update in the step S5 is expressed as follows:
w^{(t+1)} ← w^{(t)} + \Delta w^{(t)}    (6)

wherein w^{(t+1)} is the latest global model parameter after completing the global update in the t-th communication round, w^{(t)} is the original global model parameter in the t-th communication round, and \Delta w^{(t)} is the globally aggregated model update in the t-th communication round.
CN201911220964.6A 2019-12-03 2019-12-03 Multi-mechanism collaborative learning system and method based on hierarchical parameter server Active CN110995488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911220964.6A CN110995488B (en) 2019-12-03 2019-12-03 Multi-mechanism collaborative learning system and method based on hierarchical parameter server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911220964.6A CN110995488B (en) 2019-12-03 2019-12-03 Multi-mechanism collaborative learning system and method based on hierarchical parameter server

Publications (2)

Publication Number Publication Date
CN110995488A CN110995488A (en) 2020-04-10
CN110995488B true CN110995488B (en) 2020-11-03

Family

ID=70089563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911220964.6A Active CN110995488B (en) 2019-12-03 2019-12-03 Multi-mechanism collaborative learning system and method based on hierarchical parameter server

Country Status (1)

Country Link
CN (1) CN110995488B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111580970B (en) * 2020-05-07 2023-02-03 电子科技大学 Transmission scheduling method for model distribution and aggregation of federated learning
CN111898137A (en) * 2020-06-30 2020-11-06 深圳致星科技有限公司 Private data processing method, equipment and system for federated learning
CN112465043B (en) * 2020-12-02 2024-05-14 平安科技(深圳)有限公司 Model training method, device and equipment
CN113626687A (en) * 2021-07-19 2021-11-09 浙江师范大学 Online course recommendation method and system taking federal learning as core
CN114429223B (en) * 2022-01-26 2023-11-07 上海富数科技有限公司 Heterogeneous model building method and device
CN114500642A (en) * 2022-02-25 2022-05-13 百度在线网络技术(北京)有限公司 Model application method and device and electronic equipment
CN115174404B (en) * 2022-05-17 2024-06-21 南京大学 Multi-device federal learning system based on SDN networking

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330516A (en) * 2016-04-29 2017-11-07 腾讯科技(深圳)有限公司 Model parameter training method, apparatus and system
WO2019111118A1 (en) * 2017-12-04 2019-06-13 International Business Machines Corporation Robust gradient weight compression schemes for deep learning applications
CN110380917A (en) * 2019-08-26 2019-10-25 深圳前海微众银行股份有限公司 Control method, device, terminal device and the storage medium of federal learning system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102006026631B4 (en) * 2006-06-08 2011-06-22 ZF Friedrichshafen AG, 88046 Device for driving an oil pump
CN102196372B (en) * 2010-03-01 2014-12-10 ***通信集团公司 Method, device, portable terminal and system for movably monitoring network alarm in real-time

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330516A (en) * 2016-04-29 2017-11-07 腾讯科技(深圳)有限公司 Model parameter training method, apparatus and system
WO2019111118A1 (en) * 2017-12-04 2019-06-13 International Business Machines Corporation Robust gradient weight compression schemes for deep learning applications
CN110380917A (en) * 2019-08-26 2019-10-25 深圳前海微众银行股份有限公司 Control method, device, terminal device and the storage medium of federal learning system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
parameter_server architecture; Zhang Yushi; https://blog.csdn.net/stdcoutzyx/article/details/51241868; 2016-04-25; full text *
Research on distributed machine learning based on parameter server; Li Pei; Wanfang Database; 2017-06-21; full text *

Also Published As

Publication number Publication date
CN110995488A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110995488B (en) Multi-mechanism collaborative learning system and method based on hierarchical parameter server
CN103856480B (en) User datagram protocol packet moving method and device in virtual machine (vm) migration
CN108122032A (en) A kind of neural network model training method, device, chip and system
CN105684357A (en) Management of addresses in virtual machines
CN114584581B (en) Federal learning system and federal learning training method for intelligent city internet of things (IOT) letter fusion
CN110222005A (en) Data processing system and its method for isomery framework
CN104113596A (en) Cloud monitoring system and method for private cloud
CN107145673B (en) Joint simulation system and method
CN114710330B (en) Anomaly detection method based on heterogeneous layered federated learning
CN102982209A (en) Space network visual simulation system and method based on HLA (high level architecture)
CN105681474A (en) System architecture for supporting upper layer applications based on enterprise-level big data platform
CN109819032A (en) A kind of base station selected cloud robot task distribution method with computation migration of joint consideration
CN112104491A (en) Service-oriented network virtualization resource management method
CN110689174B (en) Personnel route planning method and device based on public transportation
CN116382843A (en) Industrial AI power-calculating PaaS platform based on Kubernetes container technology
Kim et al. Reducing model cost based on the weights of each layer for federated learning clustering
He et al. Beamer: stage-aware coflow scheduling to accelerate hyper-parameter tuning in deep learning clusters
CN106227465A (en) A kind of data placement method of ring structure
CN115841197A (en) Path planning method, device, equipment and storage medium
Jin et al. Adaptive and optimized agent placement scheme for parallel agent‐based simulation
CN107231291A (en) A kind of micro services partition method and device suitable for electric network information physical system
Ning et al. A data oriented analysis and design method for smart complex software systems of IoT
Liu et al. Accelerated dual averaging methods for decentralized constrained optimization
Sengupta et al. Collaborative learning-based schema for predicting resource usage and performance in F2C paradigm
Pattnayak et al. Green IoT based technology for sustainable smart cities

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant