CN111445027B - Training method and device for machine learning model


Info

Publication number
CN111445027B
CN111445027B (application CN201910041301.1A)
Authority
CN
China
Prior art keywords
training
parameter server
parameter
results
server group
Prior art date
Legal status
Active
Application number
CN201910041301.1A
Other languages
Chinese (zh)
Other versions
CN111445027A (en)
Inventor
张强
谈政荣
王栎汉
姚小龙
蔡适择
陈敏
任亚坤
陈军
龚杰文
韩兆鸣
Current Assignee
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN201910041301.1A priority Critical patent/CN111445027B/en
Publication of CN111445027A publication Critical patent/CN111445027A/en
Application granted granted Critical
Publication of CN111445027B publication Critical patent/CN111445027B/en


Abstract

The application discloses a training method and device for a machine learning model. The method comprises: issuing the model parameters and a training data set divided into m parts to (1+a)·m parameter server groups for training, and receiving the training results of the parameter server groups, wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer; the training results are then processed according to the number of training rounds. Because the model parameters and the training data set divided into m parts are issued to (1+a)·m parameter server groups, the number of parameter server groups is increased; when a parameter server group fails and cannot train its data, a standby parameter server group continues to process the data, so that the training process runs normally and the training efficiency of the machine learning model is improved.

Description

Training method and device for machine learning model
Technical Field
The invention relates to the field of information technology, and in particular to a training method and device for a machine learning model.
Background
In modern society, people increasingly rely on express delivery to send and receive items, especially as electronic commerce develops rapidly and online shopping becomes widespread. Moreover, with the arrival of the big-data age, the express delivery business generates massive amounts of data at every moment.
Today, with machine learning and artificial intelligence techniques, models are trained on large amounts of collected sample data, so that newly generated data can be processed conveniently by the trained model. During model training, however, parameter training takes a long time; shortening the parameter training time therefore improves the operating efficiency of the algorithm.
In the related art, the training data is randomly divided into a certain number of parts and issued directly to the same number of trainers for parameter training to obtain training results. However, because of the huge volume of data processed, a trainer is prone to failure during parameter training, which slows down training and thus reduces the training efficiency of the machine learning model.
Disclosure of Invention
In view of the above drawbacks or shortcomings in the prior art, it is desirable to provide a training method and apparatus for a machine learning model that increase the number of configured nodes, so that when a node fails and cannot train its data, a standby node continues to process the data. This keeps the training process running normally and improves the training efficiency of the machine learning model.
In a first aspect, the present application provides a training method of a machine learning model, including:
issuing the model parameters and a training data set divided into m parts to (1+a)·m parameter server groups for training; wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer;
receiving training results of the parameter server groups;
and processing the training results according to the number of training rounds.
In a second aspect, the present application provides a training apparatus for a machine learning model, comprising:
the issuing module is used for issuing the model parameters and a training data set divided into m parts to (1+a)·m parameter server groups for training; wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer;
the receiving module is used for receiving the training results of the parameter server groups;
and the processing module is used for processing the training results according to the number of training rounds.
In summary, after the training data set is divided into m parts, the parts and the model parameters are issued to (1+a)·m parameter server groups for training, wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer. Because the number of parameter server groups is increased, when a parameter server group fails and cannot train its data, a standby parameter server group continues to process the data. The training results of the parameter server groups are then received and processed according to the number of training rounds. On this basis, the embodiments of the present application improve the training efficiency of the machine learning model while ensuring that the training process proceeds normally.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
fig. 1 is a basic flow diagram of a training method of a machine learning model according to an embodiment of the present application;
FIG. 2 is an example of a training method for a machine learning model provided in an embodiment of the present application;
FIG. 3 is a training device for a machine learning model according to an embodiment of the present disclosure;
FIG. 4 is a training device of another machine learning model according to an embodiment of the present application;
FIG. 5 is a training device of yet another machine learning model provided in an embodiment of the present application;
fig. 6 is a schematic diagram of a computer system according to an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The embodiments of the application provide a training method for a machine learning model, which is applied to a terminal. It should be noted that the terminals in the embodiments of the present application may include, but are not limited to, a personal computer (Personal Computer, PC), a personal digital assistant (Personal Digital Assistant, PDA), a tablet computer (Tablet Computer), a wireless handheld device, a mobile phone, and the like.
For easy understanding and explanation, the training method and apparatus of the machine learning model provided in the embodiments of the present application are described in detail below with reference to fig. 1 to 5.
Please refer to fig. 1, which is a basic flow chart of a training method of a machine learning model according to an embodiment of the present application, the method includes the following steps:
and S101, the model parameters and the training data sets divided into m parts are issued to a (1+a) m parameter server group for training.
Wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer.
For example, after the training data set is randomly divided into m parts, the parts are issued together with the model parameters to (1+a)·m parameter server groups for training, and a·m of the parameter server groups serve as standby parameter server groups. Here m is an integer greater than 0, and the results of (1+a)·m and a·m are rounded up to the next integer. For example, when m=1 and a=10%, (1+a)·m is 1.1, which rounds up to 2, i.e. 2 parameter server groups; when m=6 and a=20%, (1+a)·m is 7.2, which rounds up to 8, i.e. 8 parameter server groups. Because the number of parameter server groups is increased, when a parameter server group fails and cannot train its data, a standby parameter server group can continue processing the data according to the model parameters and the training results recorded in the log of the failed parameter server group when it went down, which effectively improves the training efficiency of the machine learning model.
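As an illustration only, the following is a minimal Python sketch of the gateway-side allocation described above; the function and variable names are hypothetical and do not appear in the embodiments. It computes ceil((1+a)·m) parameter server groups, treats the extra a·m groups as standby groups, and randomly divides the training data set into m parts.

```python
import math
import random

def plan_parameter_server_groups(training_data, m, a=0.10):
    # Number of groups to provision: ceil((1 + a) * m); the extras stand by for failover.
    total_groups = math.ceil((1 + a) * m)        # e.g. m=6, a=0.2 -> ceil(7.2) = 8
    standby_groups = total_groups - m            # groups held in reserve

    # Randomly divide the training data set into m parts of roughly equal size.
    data = list(training_data)
    random.shuffle(data)
    shard_size = math.ceil(len(data) / m)
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(m)]
    return shards, total_groups, standby_groups

# 1000 records, m=5, a=10%: 6 groups in total, 1 of them standby
# (the same numbers as the worked example later in the description).
shards, total, standby = plan_parameter_server_groups(range(1000), m=5, a=0.10)
print(total, standby, [len(s) for s in shards])   # 6 1 [200, 200, 200, 200, 200]
```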
It should be noted that each parameter server group in the embodiments of the present application further comprises a master parameter server and training parameter servers. The master parameter server distributes the training parameters and training data and aggregates the training results fed back by each training parameter server, while the training parameter servers train according to the parameters distributed by the master parameter server.
For example, the master parameter server distributes the model parameters and the training data set issued to its group to (1+b)·n training parameter servers in the parameter server group for training, where n is a positive integer, 0<b<1, and (1+b)·n is rounded up to the next integer. For example, when n=1 and b=10%, (1+b)·n is 1.1, which rounds up to 2, i.e. 2 training parameter servers; when n=5 and b=30%, (1+b)·n is 6.5, which rounds up to 7, i.e. 7 training parameter servers. Because the number of training parameter servers is increased, the master parameter server spreads the same total amount of training data over more training parameter servers; each training parameter server therefore receives less data, which improves the training efficiency of the machine learning model.
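Similarly, the following hypothetical sketch illustrates how a master parameter server might split the data shard issued to its group across ceil((1+b)·n) training parameter servers; the helper name and the contiguous chunking strategy are assumptions, not the patent's implementation.

```python
import math

def distribute_within_group(group_shard, n, b=0.20):
    # Number of training parameter servers in the group: ceil((1 + b) * n).
    num_trainers = math.ceil((1 + b) * n)        # e.g. n=5, b=0.3 -> ceil(6.5) = 7
    chunk = math.ceil(len(group_shard) / num_trainers)
    # Every trainer receives the same model parameters; only its data chunk differs.
    return [group_shard[i * chunk:(i + 1) * chunk] for i in range(num_trainers)]

# 200 records, n=3, b=20%: ceil(3.6) = 4 training parameter servers, 50 records each.
chunks = distribute_within_group(list(range(200)), n=3, b=0.20)
print([len(c) for c in chunks])   # [50, 50, 50, 50]
```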
In each parameter server group, one parameter server is designated as the master parameter server of the group, i.e. the master node, and the other parameter servers serve as training parameter servers, i.e. slave nodes. When the master node fails because of a network connection failure, power failure or crash and can no longer communicate normally with the slave nodes, that is, it can no longer receive and respond to the heartbeat messages of the slave nodes, one of the (1+b)·n training parameter servers is designated or randomly selected as the new master parameter server, or an additionally configured standby parameter server takes over as the master parameter server.
For the case in which a designated training parameter server becomes the master parameter server, suppose there are 10 training parameter servers, numbered training parameter server 1, training parameter server 2, …, training parameter server 10. When the master parameter server fails, training parameter server 1 is promoted to master parameter server; its data can either continue to be trained on the new master parameter server or be transmitted to training parameter server 2 for training. Because the original master parameter server did not train data, no data needs to be transferred when it fails; moreover, after the failure is resolved, the original master parameter server can rejoin as a training parameter server and train data. When training parameter server 1 fails, training parameter server 2 is promoted to master parameter server and the data of training parameter server 1 is transmitted to training parameter server 2 for training, so the data of training parameter server 2 comprises the remaining untrained data of training parameter server 1 plus the training data originally distributed to training parameter server 2; alternatively, training parameter server 2 transmits the remaining untrained data of training parameter server 1 together with its own remaining untrained data to training parameter server 3 for training. Since the model parameters on every training parameter server are the same and only the specific training data differ, the embodiments of the present application can transmit the remaining untrained data to a training parameter server that has not failed, thereby keeping the training process running normally. Similarly, when training parameter server 9 fails, training parameter server 10 is promoted to master parameter server and the data of training parameter server 9 is transmitted to training parameter server 10 for training, so the data of training parameter server 10 comprises the remaining untrained data of training parameter server 9 plus the training data originally distributed to training parameter server 10; alternatively, training parameter server 10 transmits the remaining untrained data of training parameter server 9 together with its own remaining untrained data to training parameter server 1 for training.
For the case in which a training parameter server is randomly selected as the master parameter server, suppose again that there are 10 training parameter servers, numbered training parameter server 1, training parameter server 2, …, training parameter server 10. When the master parameter server fails, training parameter server 2 is, for example, randomly selected from the 10 training parameter servers as the new master parameter server; its data can either continue to be trained on itself or be transmitted to training parameter server 3 for training. Similarly, the data of training parameter server 10 can be trained on itself or transmitted to training parameter server 1 for training.
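The in-group failover just described can be illustrated with the following hypothetical sketch. The TrainerNode structure, the promotion rule (first healthy server) and the hand-over rule (next healthy server, wrapping around) are simplifying assumptions made only to make the numbered example concrete, not the patent's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TrainerNode:
    # Minimal stand-in for a training parameter server (hypothetical structure).
    name: str
    alive: bool = True
    pending_data: list = field(default_factory=list)

def promote_new_master(trainers):
    # When the master parameter server fails, promote the first healthy
    # training parameter server; the rest remain training parameter servers.
    healthy = [t for t in trainers if t.alive]
    return healthy[0], healthy[1:]

def hand_over(failed, trainers):
    # Pass the untrained records of a failed training parameter server to the
    # next healthy server (wrapping around); the model parameters are identical
    # on every node, so only the data need to move.
    idx = trainers.index(failed)
    for offset in range(1, len(trainers)):
        successor = trainers[(idx + offset) % len(trainers)]
        if successor.alive:
            successor.pending_data.extend(failed.pending_data)
            failed.pending_data.clear()
            return successor
    raise RuntimeError("no healthy training parameter server left")

# Training parameter server 1 fails; server 2 inherits its remaining data.
nodes = [TrainerNode(f"trainer-{i}", pending_data=[i]) for i in range(1, 11)]
nodes[0].alive = False
print(hand_over(nodes[0], nodes).name)   # trainer-2
```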
For example, fig. 2 shows an example of the training method of a machine learning model provided in an embodiment of the present application. First, the data distribution gateway randomly divides the training data set into m parts and transmits the model parameters and the m parts to (1+10%)·m parameter server groups, where a=10%, e.g. parameter server group 1, parameter server group 2, …, parameter server group 1.1m-1, parameter server group 1.1m. Then the master parameter server corresponding to the master node of each parameter server group distributes the model parameters and the training data issued to that group to the (1+10%)·n training parameter servers corresponding to its slave nodes for training, where b=10%, e.g. slave node 1, slave node 2, …, slave node 1.1n-1, slave node 1.1n. The (1+10%)·n training parameter servers corresponding to the slave nodes then feed their training results back to the master parameter server corresponding to the master node.
After receiving the feedback data of the training parameter servers corresponding to the first 90% of the slave nodes, the master parameter server corresponding to the master node discards the outliers in the training results according to the Euclidean distance, calculates the average of the remaining training results, and feeds that average back to the data distribution gateway as the training result of the group. The number of training parameter servers corresponding to 90% of the slave nodes is rounded down to the nearest integer: for example, with 3 training parameter servers the result is 2.7, rounded down to 2; with 7 training parameter servers the result is 6.3, rounded down to 6. It should be noted that, to balance the calculation accuracy of the training result and training efficiency, the master parameter server in the embodiments of the present application receives the feedback data of the first 90% of training parameter servers; of course, it can also receive feedback from other proportions of training parameter servers, such as 80% or 95%, which is not limited in the embodiments of the present application.
To facilitate understanding, a specific implementation is illustrated. For example, 1000 training records are divided into 5 parts, so each training data set contains 200 records, i.e. m=5. Next, the 5 training data sets are issued to 6 parameter server groups, so 5 of the parameter server groups each obtain one set for training and 1 parameter server group serves as a standby parameter server group, i.e. a=10%. Then the master parameter server of each parameter server group distributes its 200 training records to 4 training parameter servers, so each training parameter server trains 50 records, i.e. b=20% and n=3. The master parameter server corresponding to the master node therefore only needs to obtain the feedback data of the training parameter servers corresponding to the first 3 slave nodes. The Euclidean distance is a commonly used distance measure that represents the true distance between two points in a multidimensional space.
The master parameter server discards outliers in the training results according to the Euclidean distance; the methods adopted may include, but are not limited to, the K-means algorithm and the K-center-point (K-medoids) algorithm. For ease of understanding, the K-means algorithm is illustrated in the embodiments of the present application; since the K-center-point algorithm is a known technique, it is not described in detail here.
The K-means algorithm is a typical distance-based clustering algorithm: it uses distance as the measure of similarity, i.e. the closer two objects are, the more similar they are, and similar data are grouped into the same cluster. Specifically, the K-means algorithm randomly selects k objects from the data objects as initial cluster centres and assigns each remaining data object to the cluster (represented by its initial centre) it is most similar to, according to its distance from the initial cluster centres. Then, after computing the centre of each new cluster (i.e. the mean of all data objects in the cluster), the data objects are re-assigned to the k clusters according to their distances from the new cluster centres. This process is repeated until the clustering criterion function converges; the algorithm uses the sum of squared errors of the data as the clustering criterion function.
Since the K-means algorithm and the K-center-point algorithm are clustering algorithms, they divide all the data into k clusters. In the embodiments of the present application only 1 cluster is needed, i.e. k=1. Specifically, the master parameter server corresponding to the master node calculates the average of the training results fed back by the training parameter servers corresponding to the slave nodes and uses this average as the cluster centre; it then calculates the Euclidean distance between each training result and the cluster centre; if the Euclidean distance exceeds a preset threshold, for example 5%, the corresponding training result is discarded as an outlier.
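The k=1 outlier filter can be illustrated with the following hypothetical sketch: the feedback is treated as numeric vectors, their mean is used as the single cluster centre, and results farther from the centre than a threshold are discarded before averaging. Whether the 5% threshold in the embodiments is absolute or relative is not specified, so the toy threshold below is chosen only to suit the example data.

```python
import math

def aggregate_feedback(results, threshold):
    # k = 1: the cluster centre is simply the mean of all feedback vectors.
    dims = len(results[0])
    centre = [sum(r[i] for r in results) / len(results) for i in range(dims)]

    def distance(r):
        return math.sqrt(sum((r[i] - centre[i]) ** 2 for i in range(dims)))

    # Discard any result whose Euclidean distance from the centre exceeds the
    # preset threshold, then average what remains.
    kept = [r for r in results if distance(r) <= threshold]
    if not kept:
        return centre                      # degenerate case: keep the centre itself
    return [sum(r[i] for r in kept) / len(kept) for i in range(dims)]

# Four 1-dimensional results; the last one is an outlier and is dropped.
print(aggregate_feedback([[0.50], [0.52], [0.51], [0.90]], threshold=0.15))   # ≈ [0.51]
```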
It should be noted that, in the embodiments of the present application, model parallelism is implemented between the parameter server groups and data parallelism is implemented within each parameter server group. By trading resources for time, i.e. increasing the number of configured nodes, a standby node continues to process the data when a node fails and cannot train, which keeps the training process running normally and improves the training efficiency of the machine learning model.
Meanwhile, in the embodiments of the present application each parameter server is deployed in a Docker container, and Kubernetes is used to manage the Docker-based parameter servers and monitor the running state of the containers in real time. Docker is an open-source application container engine that lets developers package their applications and dependencies into a portable container and then publish it to any popular Linux machine; it also supports virtualization. The containers use a sandbox mechanism and have no interfaces to each other. Kubernetes is a tool for orchestrating containers and for managing the whole application life cycle: creating, deploying, publishing, scaling and updating applications are all convenient, and it provides fault self-healing; for example, if a host goes down, the services on that host are automatically scheduled to run on another host without manual intervention.
S102, receiving training results of the parameter server group.
For example, the data distribution gateway receives the training result of each parameter server group, where each training result is the average obtained after the master parameter server corresponding to the master node of the group has received the feedback data of the training parameter servers corresponding to its slave nodes, discarded the outliers in the feedback data according to the Euclidean distance, and averaged the remaining feedback data.
It should be noted that the data distribution gateway in the embodiments of the present application issues the training data set, collects the training results and performs the corresponding processing. Of course, a terminal having the same functions as the data distribution gateway is also possible, and the embodiments of the present application are not limited in this respect.
S103, processing the training results according to the number of training rounds.
Specifically, when the number of training rounds equals the preset number, the average of the training results is calculated and output; when the number of training rounds is smaller than the preset number, the model parameters and the training data set divided into m parts are issued again to the (1+a)·m parameter server groups for training. By controlling the number of training rounds until it reaches the preset number, the embodiments of the present application ensure that the data are sufficiently trained while guaranteeing calculation accuracy.
For example, after receiving the feedback data of the first 90% of the parameter server groups, the data distribution gateway calculates the average of the feedback data as the training result of the model. The number corresponding to 90% of the parameter server groups is rounded down to the nearest integer: for example, with 5 parameter server groups the result is 4.5, so the data distribution gateway obtains the feedback data of the first 4 parameter server groups; with 10 parameter server groups, it obtains the feedback data of the first 9 parameter server groups. It should be noted that, to balance calculation accuracy and training efficiency, the data distribution gateway in the embodiments of the present application receives the feedback data of the first 90% of parameter server groups; of course, it can also receive feedback from other proportions of parameter server groups, such as 80% or 95%, which is not limited in the embodiments of the present application.
Under normal conditions, the training parameter servers in each parameter server group feed their data back to the master parameter server of the group after the number of training rounds has reached the preset number. In special cases, however, for example when a training parameter server in the group goes down, the training result is fed back to the master parameter server directly even though the preset number of rounds has not been reached. Because the master parameter server cannot judge the number of training rounds, the data distribution gateway is given the ability to monitor the parameter servers and judge the number of training rounds; as a precaution mechanism, it therefore checks whether the number of training rounds of the parameter server groups meets the training requirement, so as to ensure that the data are sufficiently trained. If the number of training rounds has reached the preset number, the data distribution gateway stops the training of the parameter server groups and outputs the training result; if not, the data distribution gateway issues the model parameters and the divided training data sets to the parameter server groups again for training, to ensure that the whole training process proceeds in an orderly fashion.
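A hypothetical sketch of this gateway-side decision follows; groups_to_wait_for and process_round are illustrative names, and the re-issue step is represented by a callback rather than the gateway's actual interface.

```python
import math

def groups_to_wait_for(num_groups, fraction=0.9):
    # The gateway only waits for the first floor(0.9 * num_groups) group results.
    return math.floor(fraction * num_groups)

def process_round(group_results, rounds_done, preset_rounds, reissue):
    # Step S103 at the gateway: average and output once enough rounds have run,
    # otherwise re-issue the model parameters and the m data shards.
    if rounds_done >= preset_rounds:
        dims = len(group_results[0])
        return [sum(r[i] for r in group_results) / len(group_results) for i in range(dims)]
    reissue()                              # gateway re-issues to the (1+a)*m groups
    return None

# With 5 parameter server groups, only the first 4 results are awaited.
print(groups_to_wait_for(5))               # 4
```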
It should be noted that, in the embodiments of the present application, when the model parameters and the training data set are issued to the parameter server groups again for training, a training parameter server in a group can also resume training from the model parameters and training results recorded in its log at the time it went down.
Because each parameter server is deployed in a Docker container during the training of the machine learning model, and containers can be unstable and prone to going down, the embodiments of the present application adopt a logging mechanism: from the start of model training, the training parameter servers in each parameter server group write their training results and the number of completed training rounds into a log. After a container restarts, it can locate this log and obtain the model parameters, training results and round count recorded at the time of the crash, so the training state at the time of the crash is recovered quickly and waiting time is effectively reduced.
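This log-based recovery can be sketched as follows; the JSON file name and field names are assumptions, chosen only to illustrate writing and re-reading the parameters, intermediate result and round counter described above.

```python
import json
import os

LOG_PATH = "trainer_state.json"   # hypothetical per-container log file

def write_checkpoint(model_params, partial_result, rounds_done, path=LOG_PATH):
    # Each training parameter server records its parameters, intermediate result
    # and round counter so a restarted container can resume where it stopped.
    with open(path, "w") as fh:
        json.dump({"params": model_params,
                   "result": partial_result,
                   "rounds": rounds_done}, fh)

def read_checkpoint(path=LOG_PATH):
    # Return the last recorded state, or None if this container has no log yet.
    if not os.path.exists(path):
        return None
    with open(path) as fh:
        return json.load(fh)

write_checkpoint([0.1, 0.2], [0.15, 0.25], rounds_done=3)
print(read_checkpoint()["rounds"])        # 3
```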
After the training data set is divided into m parts, the parts and the model parameters are issued together to (1+a)·m parameter server groups for training, wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer. Because the number of parameter server groups is increased, when a parameter server group fails and cannot train its data, a standby parameter server group continues to process the data. The training results of the parameter server groups are then received and processed according to the number of training rounds. On this basis, the embodiments of the present application improve the training efficiency of the machine learning model while ensuring that the training process proceeds normally.
Based on the foregoing embodiments, the embodiments of the present application provide a training device for a machine learning model, which may be applied to the training method for a machine learning model provided in the corresponding embodiment of fig. 1 to 2. Referring to fig. 3, the machine learning model training apparatus 3 includes:
the issuing module 31 is configured to issue the model parameters and the training data set divided into m parts to the (1+a)·m parameter server groups for training; wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer;
a receiving module 32, configured to receive the training results of the parameter server groups;
and a processing module 33, configured to process the training results according to the number of training rounds.
In other embodiments of the present application, as shown in fig. 4, the issuing module 31 further includes:
the distributing module 311 is configured to distribute the model parameters and the training data set issued to the parameter server group to (1+b)·n training parameter servers in the parameter server group for training; wherein n is a positive integer, 0<b<1, and (1+b)·n is rounded up to the next integer.
In other embodiments of the present application, as shown in fig. 5, the issuing module 31 further includes:
a discarding module 312, configured to discard outliers in the training results according to the training results fed back by the training parameter servers and the Euclidean distance;
a calculation module 313, configured to calculate the average of the remaining training results;
wherein the training results are the union of the outliers and the remaining training results.
In other embodiments of the present application, the discarding module 312 is specifically configured to calculate an average value of the training results, where the average value is used as a cluster center;
respectively calculating Euclidean distance between each training result and a clustering center;
if the Euclidean distance is larger than the preset threshold value, discarding the training result corresponding to the Euclidean distance as an abnormal value.
In other embodiments of the present application, the processing module 33 is specifically configured to calculate and output the average of the training results when the number of training rounds equals the preset number.
In other embodiments of the present application, the processing module 33 is further configured to, when the number of training rounds is smaller than the preset number, issue the model parameters and the training data set divided into m parts to the (1+a)·m parameter server groups again for training.
In other embodiments of the present application, the processing module 33 is further configured to train again in the parameter server group according to the log recorded by the parameter server group; the log comprises the model parameters and training results recorded by the parameter server group at the time it went down.
It should be noted that, in this embodiment, the descriptions of the same steps and the same content as those in other embodiments may refer to the descriptions in other embodiments, and are not repeated here.
After the training data set is divided into m parts, the parts and the model parameters are issued together to (1+a)·m parameter server groups for training, wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer. Because the number of parameter server groups is increased, when a parameter server group fails and cannot train its data, a standby parameter server group continues to process the data. The training results of the parameter server groups are then received and processed according to the number of training rounds. On this basis, the embodiments of the present application improve the training efficiency of the machine learning model while ensuring that the training process proceeds normally.
Based on the foregoing embodiments, embodiments of the present application provide a computer system. Referring to fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section into a random access memory (RAM) 603. The RAM 603 also stores the various programs and data required for system operation. The CPU 601, ROM 602 and RAM 603 are connected to each other through a bus 604, and an input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse and the like; an output portion 607 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed, and a removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 610 as needed, so that a computer program read from it is installed into the storage section 608 as needed.
In particular, according to embodiments of the present application, the process described above with reference to the flowchart of fig. 1 may be implemented as a computer software program. For example, an embodiment of the present application includes a computer program product comprising a computer program carried on a computer-readable medium; when executed by the CPU 601, the computer program implements the following steps:
issuing the model parameters and a training data set divided into m parts to (1+a)·m parameter server groups for training; wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer;
receiving training results of the parameter server groups;
and processing the training results according to the number of training rounds.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products for machine learning model training according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware, and the described units or modules may also be provided in a processor, for example described as: a processor comprising a transmitting module, a receiving module and a processing module. The names of these units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present application also provides a computer-readable medium that may be contained in the terminal described in the above embodiment; or may exist alone without being fitted into the terminal. The computer-readable medium carries one or more programs that, when executed by one of the terminals, cause the terminal to implement the training method of the machine learning model as in the above embodiment.
For example, the terminal may implement the steps shown in fig. 1: S101, issuing the model parameters and a training data set divided into m parts to (1+a)·m parameter server groups for training, wherein m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer; S102, receiving training results of the parameter server groups; S103, processing the training results according to the number of training rounds.
It should be noted that although in the above detailed description several modules or units of a terminal for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware.
The foregoing description covers only the preferred embodiments of the present application and illustrates the principles of the technology employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to the specific combinations of the features described above, and is intended to cover other embodiments formed by any combination of those features or their equivalents without departing from the spirit of the invention, for example embodiments in which the above features are replaced with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (5)

1. A method of training a machine learning model, the method comprising:
issuing model parameters and a training data set divided into m parts to (1+a)·m parameter server groups for training; wherein a·m of the parameter server groups serve as standby parameter server groups; m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer; each parameter server group further comprises a master parameter server and training parameter servers;
receiving training results of the parameter server group;
processing the training results according to the number of training rounds;
the step of issuing the model parameters and the training data set divided into m parts to the (1+a)·m parameter server groups for training comprises:
the master parameter server distributes the model parameters and the training data set issued to the parameter server group to (1+b)·n training parameter servers in the parameter server group for training; wherein n is a positive integer, 0<b<1, and (1+b)·n is rounded up to the next integer;
the step of processing the training results according to the number of training rounds comprises:
when the number of training rounds equals a preset number, calculating the average of the training results and outputting the average; when the number of training rounds is smaller than the preset number, issuing the model parameters and the training data set divided into m parts to the (1+a)·m parameter server groups again for training, and training again in the parameter server groups according to the logs recorded by the parameter server groups; a log comprises the model parameters and training results recorded by a parameter server group at the time it went down.
2. The method of training a machine learning model of claim 1, further comprising:
discarding outliers in the training results according to the training results fed back by the training parameter servers and the Euclidean distance;
calculating the average of the remaining training results; the training results are the union of the outliers and the remaining training results.
3. The method for training a machine learning model according to claim 2, wherein discarding outliers in the training results according to the training results fed back by the training parameter servers and the Euclidean distance comprises:
calculating an average value of the training results, wherein the average value is used as a clustering center;
respectively calculating Euclidean distance between each training result and the clustering center;
and if the Euclidean distance is larger than a preset threshold value, discarding the training result corresponding to the Euclidean distance as an abnormal value.
4. A training apparatus for a machine learning model, the apparatus comprising:
the issuing module is used for issuing model parameters and a training data set divided into m parts to (1+a)·m parameter server groups for training; wherein a·m of the parameter server groups serve as standby parameter server groups; m is a positive integer, 0<a<1, and (1+a)·m is rounded up to the next integer; each parameter server group further comprises a master parameter server and training parameter servers;
the receiving module is used for receiving the training result of the parameter server group;
the processing module is used for processing the training results according to the number of training rounds;
the issuing module further comprises:
the distribution module is used for the master parameter server to distribute the model parameters and the training data set issued to the parameter server group to (1+b)·n training parameter servers in the parameter server group for training; wherein n is a positive integer, 0<b<1, and (1+b)·n is rounded up to the next integer;
the processing module is specifically configured to calculate the average of the training results and output the average when the number of training rounds equals a preset number; when the number of training rounds is smaller than the preset number, to issue the model parameters and the training data set divided into m parts to the (1+a)·m parameter server groups again for training; and to train again in the parameter server groups according to the logs recorded by the parameter server groups; a log comprises the model parameters and training results recorded by a parameter server group at the time it went down.
5. The training device of a machine learning model of claim 4 wherein the issuing module further comprises:
the discarding module is used for discarding outliers in the training results according to the training results fed back by the training parameter servers and the Euclidean distance;
the calculation module is used for calculating the average of the remaining training results; the training results are the union of the outliers and the remaining training results.
CN201910041301.1A 2019-01-16 2019-01-16 Training method and device for machine learning model Active CN111445027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910041301.1A CN111445027B (en) 2019-01-16 2019-01-16 Training method and device for machine learning model


Publications (2)

Publication Number Publication Date
CN111445027A CN111445027A (en) 2020-07-24
CN111445027B (en) 2024-01-16

Family

ID=71626888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910041301.1A Active CN111445027B (en) 2019-01-16 2019-01-16 Training method and device for machine learning model

Country Status (1)

Country Link
CN (1) CN111445027B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111641716B (en) * 2020-06-01 2023-05-02 第四范式(北京)技术有限公司 Self-healing method of parameter server, parameter server and parameter service system
CN114936117A (en) * 2021-09-02 2022-08-23 华为技术有限公司 Model training method, server, chip and system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10600002B2 (en) * 2016-08-04 2020-03-24 Loom Systems LTD. Machine learning techniques for providing enriched root causes based on machine-generated data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156810A (en) * 2015-04-26 2016-11-23 阿里巴巴集团控股有限公司 General-purpose machinery learning algorithm model training method, system and calculating node
CN107025205A (en) * 2016-01-30 2017-08-08 华为技术有限公司 A kind of method and apparatus of training pattern in distributed system
CN107819605A (en) * 2016-09-14 2018-03-20 北京百度网讯科技有限公司 Method and apparatus for the switching server in server cluster
CN108665072A (en) * 2018-05-23 2018-10-16 中国电力科学研究院有限公司 A kind of machine learning algorithm overall process training method and system based on cloud framework

Also Published As

Publication number Publication date
CN111445027A (en) 2020-07-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant