KR102296319B1

KR102296319B1 - Distributed deep learning method using packet scheduling in programmable switch

Info

Publication number: KR102296319B1
Application number: KR1020200052711A
Authority: KR
Inventors: 강민구; 양경식; 유혁
Original assignee: 고려대학교 산학협력단
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2021-08-31

Abstract

A distributed deep learning method is provided. The distributed deep learning method comprises the steps of: performing distributed learning of training data in a plurality of worker nodes; classifying parameters of the data learned by the worker nodes into layers and giving priority to the classified layers; marking information on the classified layers in a header of a tensor packet; transmitting the parameters of the layers to a parameter server; updating the parameters by the parameter server; and transmitting the updated parameters to the worker nodes. A programmable network switch is used to parse the header of the tensor packet, such that the layer to which the packet belongs is identified. In accordance with the present invention, communication time is reduced.

Description

Distributed deep learning method using packet scheduling in programmable switch

The present invention relates to deep learning, and more particularly, to a distributed deep learning method using packet scheduling in a programmable switch.

As models and training data sets grow, state-of-the-art deep learning is increasingly performed in a distributed manner. This is called distributed deep learning (DDL). The most widely used approach is data parallelism, in which the training data is split across multiple GPU nodes (workers), each worker trains on its share, and the learned model parameters (such as the model weights) are then updated. Data-parallel DDL is synchronized in one of two main ways: a parameter server (PS) structure or an all-reduce structure. In the PS structure, a separate PS manages the model parameters, and each worker updates the parameters by exchanging its learned model parameters with the PS. In the all-reduce structure there is no PS; instead, each worker transmits its model parameters to all other workers at every training iteration, and the parameters are updated in that way.

Updates are either synchronous or asynchronous. In a synchronous update, the system waits until all workers have finished the current training iteration, then collects and updates the model parameters before the next iteration begins. In an asynchronous update, by contrast, each worker exchanges model parameters as soon as its own iteration ends and proceeds to the next iteration independently.

In a conventional network switch, the supported protocols and functions are fixed before purchase, so new functions cannot be added to suit the user's needs. With the recent emergence of P4, a language for programming switches, switch programmability has been attracting attention. A programmable switch is a switch that operates according to a program written in P4, which allows the user to process packets in any desired way. Arbitrary packet processing was not possible in conventional network switches, but in a programmable switch the user can implement a desired function as a P4 program, using the functions that the hardware supports. The present invention uses the P4 language and a programmable switch to implement a packet scheduling technique for distributed deep learning in the switch.

Meanwhile, because each worker communicates parameters after its training computation, the training time is the sum of the computation time and the communication time. As improvements in GPU performance reduce computation time, communication time is known to become a major bottleneck in distributed deep learning training time.

Korean Patent Application Publication No. 10-2016-0149422 discloses a high-speed packet processing system, but research on reducing training time is still ongoing.

Korean Patent Application Publication No. 10-2016-0149422

An object of the present invention is to provide a packet scheduling method and system in a programmable switch for distributed deep learning that reduces communication time.

To achieve this object, a distributed deep learning method according to embodiments of the present invention comprises the steps of: performing distributed learning of training data in a plurality of worker nodes; classifying parameters of the data learned by the worker nodes into layers and assigning priorities to the classified layers; marking information on the classified layers in a header of a tensor packet; transmitting the parameters of the layers to a parameter server; updating the parameters in the parameter server; and transmitting the updated parameters to the worker nodes, wherein the header of the tensor packet is parsed using a programmable network switch to identify the layer to which the packet belongs.

A distributed deep learning method according to another embodiment comprises the steps of: performing distributed learning of training data in a plurality of worker nodes; performing packet scheduling that classifies parameters of the data learned by the worker nodes into layers and assigns priorities to the classified layers; mapping the priorities of the layers; when a new layer lower than the layers arrives, mapping the new layer to the highest-priority queue and shifting the remaining mappings by one position; transmitting the parameters of the layers to a parameter server; updating the parameters in the parameter server; and transmitting the updated parameters to the worker nodes.

A distributed deep learning method according to yet another embodiment comprises the steps of: performing distributed learning of training data in a plurality of worker nodes; performing packet scheduling that classifies parameters of the data learned by the worker nodes into layers and assigns priorities to the classified layers; mapping the priorities of the layers; processing K bytes of packets in the highest-priority queue among the priorities; when a new layer lower than the layers arrives, changing the mapping so that packets of the new layer are processed in the empty highest-priority queue; transmitting the parameters of the layers to a parameter server; updating the parameters in the parameter server; and transmitting the updated parameters to the worker nodes.

With the distributed deep learning method according to embodiments of the present invention, packets of higher-priority layers experience lower communication delay, and the training time can be improved.

FIG. 1 is a block diagram illustrating a distributed deep learning method according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a distributed deep learning method according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram for describing scheduling.
FIG. 4 is a schematic diagram for explaining a distributed deep learning method according to an embodiment of the present invention.
FIG. 5 is a diagram showing the training times of conventional distributed deep learning and of distributed deep learning according to an embodiment of the present invention when each is operated theoretically.
FIG. 6 is a graph showing the in-network delay when conventional distributed deep learning is actually applied.
FIG. 7 is a graph showing the in-network delay when distributed deep learning according to an embodiment of the present invention is actually applied.

To provide a full understanding of the configuration and effects of the present invention, preferred embodiments are described below with reference to the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below; it may be embodied in various forms and modified in various ways.

In addition, unless otherwise defined, terms used in the embodiments of the present invention are to be interpreted with the meanings commonly known to those of ordinary skill in the art.

Before the distributed deep learning method of the present invention is described, the related terms and techniques are briefly summarized.

Conventional network switches have fixed protocols and functions, whereas a programmable network switch can process packets in whatever way the user wants by means of a programming language (for example, P4). That is, packet parsing and processing logic can be implemented on a programmable network switch. In the present invention, therefore, new logic for distributed deep learning is applied by compiling a P4 program and loading it onto a programmable network switch.

In the present invention, scheduling is applied inside the programmable network switch at the granularity of the packet, which is the processing unit of a network switch. Hereinafter, a packet carrying the parameters of a particular tensor is referred to as a tensor packet.

FIG. 1 is a block diagram illustrating a distributed deep learning method according to an embodiment of the present invention.

Referring to FIG. 1, the training data is split into mini-batches and each worker node trains on its mini-batch. The parameter server then collects the parameter changes, updates the parameters, and delivers the new parameters back to the worker nodes.

In more detail, the training data is distributed to the plurality of worker nodes through the programmable network switch, and forward propagation (FP) is performed from the first layer to the N-th layer to obtain a prediction. Next, using a loss function that computes the gap between the predicted value and the actual value, back propagation (BP) is performed from the N-th layer to the first layer, and the gradient of each layer is computed in order to adjust the model parameters.

Each worker node sends its gradient values to the parameter server (push). The parameter server updates the parameters using the gradient values collected from the workers and then delivers the updated parameters to the workers (pull).

That is, distributed deep learning can be performed by repeating FP, BP, push, and pull in that order.
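The FP, BP, push, and pull cycle can be illustrated with a minimal sketch. The code below is not taken from the specification: it assumes a toy model with a single weight fitted to y = 2x, synchronous updates, and gradient averaging at the parameter server, and all names (`worker_gradient`, `ParameterServer`, `train`) are hypothetical.

```python
# Toy parameter-server loop (illustrative assumptions only).
# Each worker runs FP and BP on its own mini-batch, pushes the gradient,
# and pulls the updated parameter before the next iteration.

def worker_gradient(w, batch):
    """FP then BP for squared error on one mini-batch; returns dLoss/dw."""
    grad = 0.0
    for x, y in batch:
        pred = w * x                    # forward propagation (FP)
        grad += 2.0 * (pred - y) * x    # back propagation (BP)
    return grad / len(batch)


class ParameterServer:
    def __init__(self, w, lr):
        self.w = w
        self.lr = lr

    def push(self, grads):
        """Aggregate the workers' gradients and update the parameter."""
        self.w -= self.lr * sum(grads) / len(grads)

    def pull(self):
        """Return the current parameter to a worker."""
        return self.w


def train(batches, steps=50, lr=0.05):
    ps = ParameterServer(w=0.0, lr=lr)
    worker_w = [ps.pull() for _ in batches]
    for _ in range(steps):
        # Synchronous update: wait for every worker before updating.
        grads = [worker_gradient(w, b) for w, b in zip(worker_w, batches)]
        ps.push(grads)
        worker_w = [ps.pull() for _ in batches]
    return ps.w


# Two workers, each holding one mini-batch of (x, y) pairs with y = 2x.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
final_w = train(batches)
```

In this sketch the learned weight approaches 2.0; a real model would push and pull one tensor per layer rather than a single scalar.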

FIG. 2 is a flowchart illustrating a distributed deep learning method according to an embodiment of the present invention.

Distributed learning of the training data is performed in a plurality of worker nodes (S110).

Scheduling is performed in which the parameters of the data learned by each worker node are classified by layer and priorities are assigned to the classified layers (S120).

When model parameters are exchanged, the worker reorders parameter transmission according to the order in which FP needs them. Because FP always proceeds in order from the first layer, scheduling has the worker send the first layer's parameters first, which brings forward the start of FP for the next training iteration. With this communication scheduling, FP starts earlier and the parameters of layers whose FP runs later are transmitted later, so computation time and communication time overlap and the overall training time is reduced. In scheduling, model parameters are transmitted with a priority assigned per layer, and the unit of scheduling and transmission (push and pull) is called a tensor. That is, the tensor of the N-th layer (layer N) is called tensor N, and scheduling transmits tensor 0 before tensor 1, and tensor 1 before tensor 2.
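The overlap of computation and communication can be made concrete with a small timing sketch. It rests on assumptions that are not in the specification: parameters are received one layer at a time over a single link, FP of a layer starts only after that layer's parameters have arrived and the previous layer's FP has finished, and every per-layer time is an invented unit value. The function name `fp_finish_time` is hypothetical.

```python
def fp_finish_time(send_order, comm_time, fp_time):
    """Time at which FP of the last layer finishes for a given send order.

    send_order: layer indices in the order their parameters are transmitted.
    comm_time[i], fp_time[i]: assumed transfer and FP compute time of layer i.
    """
    arrival = {}
    t = 0.0
    for layer in send_order:            # parameters arrive one layer at a time
        t += comm_time[layer]
        arrival[layer] = t

    done = 0.0
    for layer in range(len(send_order)):    # FP always runs layer 0, 1, 2, ...
        start = max(done, arrival[layer])   # wait for parameters and prior FP
        done = start + fp_time[layer]
    return done


comm = [1.0, 1.0, 1.0, 1.0]   # invented per-layer transfer times
fp = [1.0, 1.0, 1.0, 1.0]     # invented per-layer FP times

unscheduled = fp_finish_time([3, 2, 1, 0], comm, fp)  # last layer sent first
scheduled = fp_finish_time([0, 1, 2, 3], comm, fp)    # tensor 0 sent first
```

With these invented numbers, FP finishes at time 8.0 when layer 3 is sent first and at time 5.0 when tensor 0 is sent first, because in the scheduled case each layer's FP runs while the next layer's parameters are still in transit. The figures are purely illustrative and are not the measurements reported in FIG. 3.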

FIG. 3 is an exemplary diagram for describing scheduling. Referring to FIG. 3, classifying the parameters by layer and transmitting them with per-layer priorities improves the training time by about 40%.

Scheduling in a network switch, however, is limited because it is based on multiple priority queues. In a multi-priority-queue scheme, the switch transmits packets in priority order, starting with the packets in the highest-priority queue. If there are four priority queues, for example, packets can be given only four distinct priorities. A switch typically has four to eight priority queues, while the number of layers to be scheduled typically runs into the tens or hundreds, so it is difficult to schedule packets inside the switch in a way that reflects the layer order. The following steps can be performed to overcome this limitation.
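The limitation can be seen in a short sketch of a strict-priority switch with fewer queues than layers. The static bucketing used here (`static_queue`) is an assumption introduced only to show the problem and is not described in the specification. Queue 0 is taken to be the highest priority, and all packets are assumed to be enqueued before any is transmitted.

```python
from collections import deque


def static_queue(layer, num_layers, num_queues):
    """Assumed static mapping: equal-width buckets, queue 0 = highest priority."""
    return layer * num_queues // num_layers


def drain(packets, num_layers, num_queues):
    """Enqueue packets in arrival order, then transmit by strict priority."""
    queues = [deque() for _ in range(num_queues)]
    for layer in packets:
        queues[static_queue(layer, num_layers, num_queues)].append(layer)
    out = []
    for q in queues:          # highest-priority queue is emptied first
        out.extend(q)         # within one queue the order is FIFO
    return out


arrivals = [7, 6, 5, 4, 3, 2, 1, 0]   # BP produces layer 7 first, layer 0 last
order = drain(arrivals, num_layers=8, num_queues=4)
```

Here eight layers share four queues, so layers 0 and 1 land in the same queue and leave in arrival order: the transmit order is 1, 0, 3, 2, 5, 4, 7, 6, and layer 1 is sent ahead of layer 0. With tens or hundreds of layers per queue the layer order within a queue is lost to the same degree.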

Referring again to FIG. 2, information on the classified layers is marked in the header of a tensor packet (S130). Because the processing unit in a network switch is the packet and packets are processed at high speed, it is difficult for the switch to keep track of state such as that of the switch itself or of individual flows. According to an embodiment of the present invention, in order to apply scheduling for distributed deep learning in the network switch, each worker node, when transmitting a tensor, adds to the IP option header field of each tensor packet information indicating which layer's parameters the packet carries. The network switch can therefore determine which layer a packet belongs to by parsing the header of the tensor packet.
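The specification does not give the byte layout of the layer mark, so the following is only a sketch of one possible encoding. It assumes the usual IPv4 option shape (one type byte, one length byte, then data), a 16-bit layer identifier, and a placeholder option type value `0x9E` chosen for illustration. The names `build_layer_option` and `parse_layer_option` are hypothetical, and a real switch would perform the parsing in its P4 parser rather than in Python.

```python
import struct

LAYER_OPT_TYPE = 0x9E   # placeholder option type, assumed for illustration
LAYER_OPT_LEN = 4       # type (1) + length (1) + 16-bit layer id (2)


def build_layer_option(layer_id):
    """Worker side: encode the layer id as a 4-byte IP option."""
    return struct.pack("!BBH", LAYER_OPT_TYPE, LAYER_OPT_LEN, layer_id)


def parse_layer_option(options):
    """Switch side: walk the option bytes and return the layer id, or None."""
    i = 0
    while i < len(options):
        opt_type = options[i]
        if opt_type == 0:            # End of Option List
            break
        if opt_type == 1:            # No-Operation is a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                    # truncated option
        opt_len = options[i + 1]
        if opt_len < 2 or i + opt_len > len(options):
            break                    # malformed length
        if opt_type == LAYER_OPT_TYPE and opt_len == LAYER_OPT_LEN:
            (layer_id,) = struct.unpack("!H", options[i + 2:i + 4])
            return layer_id
        i += opt_len                 # skip an unrelated option
    return None


marked = build_layer_option(37)
layer = parse_layer_option(marked)
```

A four-byte option keeps the IP header length a multiple of four bytes without extra padding, which is why a 16-bit identifier was chosen for this sketch.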

Even when layer marking reveals which layer a packet belongs to, the packets remain difficult to schedule because the number of priority queues in the network switch is limited. In the present invention, therefore, layers are mapped to priorities (S140) and this mapping is changed dynamically (S150).

First, consider a characteristic of distributed deep learning: each worker node transmits packets in descending layer order, from layer N down to layer 0. After all parameters of layer 0 have been exchanged, packets are transmitted again in ascending order, from layer 0 up to layer N. The present invention exploits this property of deep learning, namely that the layer order increases and decreases monotonically.

FIG. 4 is a schematic diagram for explaining a distributed deep learning method according to an embodiment of the present invention.

Referring to T0 in FIG. 4, when there are four queues and packets of layers 4, 3, 2, and 1 arrive in that order, layers 4, 3, 2, and 1 are mapped in turn from the lowest-priority queue to the highest-priority queue. Referring to T1 in FIG. 4, when a packet of layer 0 arrives, the mapping is shifted by one so that packets of layers 3, 2, 1, and 0 are mapped to the respective priority queues; packets of layer 4 or higher that arrive afterwards are forwarded to the lowest-priority queue. In this way, as packets are transmitted in descending order from layer N to layer 0, the mapping moves reactively from N toward 0. Once the mapping reaches layer 0, packets are subsequently transmitted in ascending order from 0 to N, and the mapping is accordingly shifted in the opposite direction, from 0 toward N.
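The descending-phase behaviour at T0 and T1 can be sketched as a sliding window over layer numbers. This is an illustrative model, not the P4 implementation: queue 0 stands for the highest-priority queue, the class name `LayerQueueMapper` is hypothetical, and the ascending phase (layer 0 back up to layer N) is not modelled because the text does not spell out its mapping rule.

```python
class LayerQueueMapper:
    """Sliding window of layers over a fixed number of priority queues.

    Queue 0 is the highest priority. The lowest layer seen so far always
    sits in queue 0; layers above the window share the lowest-priority queue.
    """

    def __init__(self, num_queues):
        self.num_queues = num_queues
        self.lowest = None   # lowest layer number seen so far

    def queue_for(self, layer):
        """Return the queue index for a packet of `layer`."""
        if self.lowest is None or layer < self.lowest:
            # A new, lower layer takes the highest-priority queue and
            # every other mapping shifts toward lower priority.
            self.lowest = layer
        return min(layer - self.lowest, self.num_queues - 1)

    def mapping(self):
        """First layer mapped to each queue (the last queue also takes higher layers)."""
        return {q: self.lowest + q for q in range(self.num_queues)}


mapper = LayerQueueMapper(num_queues=4)
for layer in [4, 3, 2, 1]:          # T0: layers arrive in descending order
    mapper.queue_for(layer)
t0 = mapper.mapping()               # {0: 1, 1: 2, 2: 3, 3: 4}

q_layer0 = mapper.queue_for(0)      # T1: layer 0 arrives, window shifts
t1 = mapper.mapping()               # {0: 0, 1: 1, 2: 2, 3: 3}
q_layer4 = mapper.queue_for(4)      # layer 4 now falls to the lowest queue
```

After the four initial arrivals the mapping matches T0 (layer 1 in the highest-priority queue, layer 4 in the lowest). When layer 0 arrives it takes queue 0, and a later layer-4 packet is sent to queue 3 together with any higher layers.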

A network switch cannot, however, reschedule packets that are already inside a queue. For example, if several layer 1 packets were inserted into the highest-priority queue just before the mapping changed from T0 to T1, then after the change the layer 0 packets are transmitted only after all of the previously inserted layer 1 packets have been sent. An additional delay can therefore occur.

Referring to T2 in FIG. 4, packets of a new layer must be processed ahead of packets that entered the switch earlier, so the highest-priority queue should be empty. For this reason, the mapping is shifted by one step after K bytes of packets of the layer mapped to the highest-priority queue have been processed. For example, after K bytes of layer 1 packets have been processed in the highest-priority queue, the mapping is lowered by one step and no layer is mapped to the highest-priority queue, which leaves that queue empty. As a result, when a packet of a new layer enters the switch, it can be processed immediately without being affected by packets that arrived earlier.
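The K-byte rule can be sketched by extending the sliding-window model with a byte counter. Several points below are assumptions made for illustration: bytes are counted when a packet is assigned to the highest-priority queue rather than when it leaves it, draining of the queues is not modelled, only the descending phase is covered, and the names `ThresholdMapper` and `enqueue` are hypothetical.

```python
class ThresholdMapper:
    """Layer-to-queue mapping that vacates queue 0 after K bytes.

    Queue 0 is the highest priority. Once K bytes of the lowest layer have
    been assigned to queue 0, the whole mapping shifts down by one so that
    queue 0 stays empty for the next, lower layer.
    """

    def __init__(self, num_queues, k_bytes):
        self.num_queues = num_queues
        self.k_bytes = k_bytes
        self.lowest = None          # lowest layer number seen so far
        self.top_reserved = False   # True when queue 0 is kept empty
        self.top_bytes = 0          # bytes of the lowest layer sent to queue 0

    def enqueue(self, layer, size):
        """Return the queue index for a packet of `layer` with `size` bytes."""
        if self.lowest is None or layer < self.lowest:
            # A new lower layer claims the (empty) highest-priority queue.
            self.lowest = layer
            self.top_reserved = False
            self.top_bytes = 0
        shift = 1 if self.top_reserved else 0
        queue = min(layer - self.lowest + shift, self.num_queues - 1)
        if queue == 0:
            self.top_bytes += size
            if self.top_bytes >= self.k_bytes:
                self.top_reserved = True   # later packets go one step lower
        return queue


m = ThresholdMapper(num_queues=4, k_bytes=3000)
trace = [
    m.enqueue(1, 1500),   # layer 1 in the highest-priority queue
    m.enqueue(1, 1500),   # K bytes reached: queue 0 is vacated afterwards
    m.enqueue(1, 1500),   # layer 1 now maps one step lower
    m.enqueue(2, 1500),   # the other layers shift as well
    m.enqueue(0, 1500),   # new layer 0 finds queue 0 free of newer layer 1 traffic
    m.enqueue(1, 1500),
]
```

In this trace the queue indices are 0, 0, 1, 2, 0, 1. The choice of K trades off how long the current lowest layer enjoys the highest priority against how much earlier traffic a newly arriving layer may still find ahead of it; the specification does not state a value.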

FIG. 5 is a diagram showing the training times of conventional distributed deep learning and of distributed deep learning according to an embodiment of the present invention when each is operated theoretically. Referring to FIG. 5, with the distributed deep learning method of the present invention, packets of higher-priority layers experience lower communication delay and the training time is improved.

FIG. 6 is a graph showing the in-network delay when conventional distributed deep learning is actually applied, and FIG. 7 is a graph showing the in-network delay when distributed deep learning according to an embodiment of the present invention is actually applied.

In-network delay refers to the delay a packet incurs in communication equipment such as network switches and links between leaving the sending host and reaching its destination.

Referring to FIG. 6, when several workers communicate through a conventional network switch without congestion control, the delay increases sharply. Referring to FIG. 7, when the method is applied to a network switch according to an embodiment of the present invention, packets of low-priority layers still experience sharply increasing delay, but packets of high-priority layers experience low delay. The forward propagation (FP) of each layer in distributed deep learning can therefore start earlier.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the invention may be practiced in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A distributed deep learning method comprising:
performing distributed learning of training data in a plurality of worker nodes;
classifying parameters of the data learned by the worker nodes into layers and assigning priorities to the classified layers;
marking information on the classified layers in a header of a tensor packet;
transmitting the parameters of the layers to a parameter server;
updating the parameters in the parameter server; and
transmitting the updated parameters to the worker nodes,
wherein the header of the tensor packet is parsed using a programmable network switch to identify the layer to which the packet belongs.

The method of claim 1, further comprising:
mapping the priorities of the layers; and
when a packet of a new layer arrives and the new layer is lower than the layers, mapping the new layer to the highest-priority queue and shifting the remaining mappings by one position.

The method of claim 1, further comprising:
mapping the priorities of the layers; and
processing K bytes of packets in the highest-priority queue among the priorities.

The method of claim 3, further comprising:
when a new layer lower than the layers arrives,
changing the mapping so that packets of the new layer are processed in the empty highest-priority queue.

A distributed deep learning method comprising:
performing distributed learning of training data in a plurality of worker nodes;
performing packet scheduling that classifies parameters of the data learned by the worker nodes into layers and assigns priorities to the classified layers;
marking information on the classified layers in a header of a tensor packet;
mapping the priorities of the layers;
when a new layer lower than the layers arrives, mapping the new layer to the highest-priority queue and shifting the remaining mappings by one position;
transmitting the parameters of the layers to a parameter server;
updating the parameters in the parameter server; and
transmitting the updated parameters to the worker nodes.

(Deleted)

A distributed deep learning method comprising:
performing distributed learning of training data in a plurality of worker nodes;
performing packet scheduling that classifies parameters of the data learned by the worker nodes into layers and assigns priorities to the classified layers;
marking information on the classified layers in a header of a tensor packet;
mapping the priorities of the layers;
processing K bytes of packets in the highest-priority queue among the priorities;
when a new layer lower than the layers arrives, changing the mapping so that packets of the new layer are processed in the empty highest-priority queue;
transmitting the parameters of the layers to a parameter server;
updating the parameters in the parameter server; and
transmitting the updated parameters to the worker nodes.

(Deleted)