Efficient Distributed DNN Training


The distributed parallelism method should be chosen appropriately according to the characteristics of the DNN and the available computing resources. Hybrid parallelism is used from the preprocessing of the input dataset through the DNN training stage.

Contributions

RAP monitors the resource status in real time and performs parallel preprocessing appropriate to that status by adjusting the degree of parallelism. In addition, distributed preprocessing across multiple nodes is implemented, and the experimental results are presented.

Organization

We propose RALP (Resource-Aware Layer Placement), a scheme that reduces network traffic through layer placement that takes into account the resources used by each layer. Our experimental results show that with four-way distributed preprocessing, preprocessing that took 112 hours to execute serially is reduced by as much as 97%, to 4 hours.

DNN Training

Distributed Parallel Training for DNN

Each GPU performs forward and backward passes for the layers of its assigned partition. When GPU k finishes the forward pass for its partition, the backward pass for the minibatch is performed from GPU k down to GPU 1.
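A minimal sketch of this partitioned forward/backward flow, assuming a toy three-partition model and CPU devices as stand-ins for GPU 1 to GPU 3 (layer shapes and partition boundaries are illustrative, not the thesis configuration):

```python
import torch
import torch.nn as nn

devices = [torch.device("cpu")] * 3        # stand-ins for GPU_1 .. GPU_3
partitions = [
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to(devices[0]),
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()).to(devices[1]),
    nn.Sequential(nn.Linear(64, 10)).to(devices[2]),
]
params = [p for part in partitions for p in part.parameters()]
opt = torch.optim.SGD(params, lr=0.01)

x = torch.randn(16, 32, device=devices[0])
y = torch.randint(0, 10, (16,), device=devices[-1])

# Forward pass: activations flow from partition 1 up to the last partition.
act = x
for part, dev in zip(partitions, devices):
    act = part(act.to(dev))

# Backward pass: autograd propagates gradients from the last partition back down to partition 1.
loss = nn.functional.cross_entropy(act, y)
opt.zero_grad()
loss.backward()
opt.step()
print(f"loss = {loss.item():.4f}")
```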

Training Convolutional Neural Networks

At the end of an iteration, the workers exchange gradients to synchronize (or merge) their model updates. In this architecture, the PS hosts the master copy of the model and is in charge of updating it using the local results sent from all workers.
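A small sketch of the parameter-server role described above, assuming a simple synchronous scheme in which the PS averages the workers' local gradients and applies one SGD step to the master copy (class and dimensions are illustrative):

```python
import numpy as np

class ParameterServer:
    """Hosts the master copy of the model and applies aggregated worker updates."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def update(self, worker_grads):
        # Merge the local gradients sent by all workers (here: simple average),
        # then apply one SGD step to the master copy.
        avg_grad = np.mean(worker_grads, axis=0)
        self.weights -= self.lr * avg_grad
        return self.weights

# Toy usage: 4 workers push local gradients at the end of an iteration,
# then pull the synchronized weights back.
ps = ParameterServer(dim=8)
grads = [np.random.randn(8) for _ in range(4)]
new_weights = ps.update(grads)
```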

CNN Characterization

First, for a number of CNN models, a significant fraction of the memory is used by a few layers that are executed in the last phase of the forward pass. We also show that a large part of the parameters that must be communicated to the PS is concentrated in these latter layers.
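This concentration is easy to verify by counting parameters per layer group. The sketch below uses VGG-16 purely as an illustrative CNN (it is not necessarily one of the characterized models) and splits parameters between the convolutional feature extractor and the fully connected classifier:

```python
import torchvision.models as models

# Count parameters per layer group to see where PS communication volume concentrates.
model = models.vgg16()  # architecture only; no pretrained weights are downloaded
conv_params = sum(p.numel() for p in model.features.parameters())
fc_params = sum(p.numel() for p in model.classifier.parameters())
total = conv_params + fc_params
print(f"convolutional layers:  {conv_params / total:.1%} of parameters")
print(f"fully connected layers: {fc_params / total:.1%} of parameters")
```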

Figure 3: Network interference: S represents single model training, C represents consolidated training with 8 models in a cluster.

Layer Placement with RALP

The natural choice for layer placement is to place layers with small computation but large memory requirements (i.e., parameters) on the PS machine. From now on, we use the term PS worker to explicitly denote the training procedure that runs on the PS machine. Since the memory-intensive fully connected layers on the PS machine are not computationally intensive [56], this processing is typically fast and does not add back pressure.

To facilitate this, we distribute the computations previously performed by the single PS worker so that the PS workers run in parallel while the workers' activations are merged. In CNNs, the output size tends to decrease as the forward pass progresses through the layers, as shown in Figure 6.
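The placement idea can be sketched as a simple heuristic: layers with a heavy parameter footprint but light computation go to the PS machine, everything else stays on the workers. The per-layer numbers and thresholds below are illustrative assumptions, not measured values or the exact RALP policy:

```python
layers = [
    # (name, parameter_bytes, gflops_per_step)  -- illustrative figures only
    ("conv1", 2e5, 3.5),
    ("conv5", 9e6, 4.1),
    ("fc6",   4e8, 0.3),
    ("fc7",   6e8, 0.2),
]

def place_layers(layers, mem_threshold=1e8, compute_threshold=1.0):
    """Assign each layer to the PS or a worker based on memory vs. compute intensity."""
    placement = {}
    for name, param_bytes, gflops in layers:
        heavy_memory = param_bytes >= mem_threshold
        light_compute = gflops <= compute_threshold
        placement[name] = "PS" if (heavy_memory and light_compute) else "worker"
    return placement

print(place_layers(layers))   # fc6/fc7 -> PS, conv layers -> workers
```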

Figure 5: Workflow of resource-aware layer placement for distributed training in RALP

Evaluation

In this section, we investigate the performance of RALP for isolated training across a variety of GPU allocation scenarios among workers and PSs. Our results are especially interesting because Horovod claims to scale well with the number of GPUs [22]. Note that the performance of ResNet-50, the RALP-unfriendly case, also improves by 10% in the second row of the table.

Interestingly, similar to the mixed-workload results in Table 3, the performance of consolidated ResNet-50 improves by about 10% even when running with TensorFlow, as shown in the second row. We compare 2PS-30W with a configuration that uses a single PS worker and the same number of workers (denoted 1PS-30W).

Figure 8: Performance comparison between RALP-N and Horovod while scaling up to 32 GPUs
Table 2: Transfer sizes of a 32-GPU training run for one step

Discussion and Summary

We evaluated it against BytePS [66, 67], the state-of-the-art system for distributed DNN training. All of our systems were (nearly) state of the art at the time of purchase, within the budget we could afford. Some have since become legacy machines that individually are not capable of running the large DNN models that are common today.

Second, lower-class GPUs can be used to improve the performance of even high-class GPUs by incrementally adding the resources of the (old) lower-class systems to the (new) high-class systems. Our experimental results show that HetPipe outperforms state-of-the-art DP via Horovod [22], which uses AllReduce communication [82].

Table 6: Heterogeneous GPUs

Model Parallelism and Pipeline Execution

Compared to Horovod, VGG-19, which has a large parameter set, converges to the desired accuracy 49% faster, and ResNet-152, which is too large to load onto the four weakest GPUs in our benchmark, converges 39% faster when all GPUs (including the weak ones) are used. First, we generalize the PMP of a virtual worker so that it can be used together with DP across virtual workers, increasing the parallelism of DNN model training. Second, we consider a heterogeneous GPU pool, which allows the use of GPUs that otherwise could not be used for training.

Finally, we present a parameter synchronization model that guarantees convergence when training models with PMP combined with DP. In GPipe, DP across multiple virtual workers can be performed using existing synchronization schemes such as BSP, since each virtual worker processes one minibatch at a time.

System Overview

Then, for the given DNN model and allocated GPUs, the model allocator assigns model partitions to each virtual worker so that the performance of the pipeline running on that virtual worker is maximized. Each virtual worker keeps a local copy of the global weights and periodically synchronizes them with the parameter server. While each virtual worker processes minibatches in a pipelined fashion, many minibatches are processed in parallel.

Since, within a wave, not even the updates of the first minibatch are reflected in the weights used by the last minibatch, the local staleness threshold in WSP is Nm − 1. Additionally, each virtual worker pushes only the aggregated updates of all minibatches in a wave, rather than the update of each minibatch, to the parameter server. This results in a significant reduction in overall communication cost.
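A minimal sketch of this wave-level aggregation, assuming a simple additive update and a fixed number Nm of minibatches per wave (class and variable names are illustrative, not HetPipe's actual implementation):

```python
import numpy as np

class VirtualWorker:
    """Aggregates the updates of one wave (Nm minibatches) before pushing to the PS."""
    def __init__(self, n_m, dim):
        self.n_m = n_m                      # minibatches per wave
        self.pending = np.zeros(dim)        # aggregated update of the current wave
        self.count = 0

    def apply_local_update(self, update):
        self.pending += update
        self.count += 1
        if self.count == self.n_m:          # wave finished: one aggregated push per wave
            push, self.pending, self.count = self.pending, np.zeros_like(self.pending), 0
            return push
        return None                         # nothing is sent for intermediate minibatches

worker = VirtualWorker(n_m=4, dim=8)
for _ in range(4):
    pushed = worker.apply_local_update(np.random.randn(8))
print(pushed is not None)                   # True: the 4th minibatch completes the wave
```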

Figure 12: Pipeline execution of minibatches, where Mp,k indicates the execution of minibatch p in partition k, which is executed on GPU k; the yellow and green colors indicate the forward and backward passes, respectively.

Pipelined Model Parallelism Within a VW

Then, initially, we can assume that w0, the initial version of the weights, is given to the virtual worker. When the virtual worker starts processing a new minibatch, it uses the latest value of w_local without waiting for the other minibatches to update the weights. Similarly, when the virtual worker finishes minibatch s_local + 1 and updates w_local with u_(s_local+1), it starts processing minibatch 2 × (s_local + 1) without waiting for the preceding s_local minibatches to complete.

Therefore, with the exception of the initial minibatches 1 to s_local + 1, for minibatch p the virtual worker uses a version of the weights that reflects (at least) all local updates of minibatches 1 to p − (s_local + 1). However, local staleness in HetPipe is inherently caused by processing minibatches in a pipelined manner within a virtual worker.
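The local staleness rule above can be expressed as a small admission check: minibatch p may start only once all minibatches up to p − (s_local + 1) have folded their updates into w_local, so at most s_local earlier minibatches are still in flight. This is a simplified sketch, not the actual scheduler:

```python
s_local = 3                      # Nm - 1, the local staleness threshold
completed = set()                # minibatches whose updates are already in w_local

def can_start(p):
    """Minibatch p may start once minibatches 1 .. p - (s_local + 1) are all completed."""
    required = range(1, p - (s_local + 1) + 1)
    return all(q in completed for q in required)

completed.update({1, 2, 3, 4})
print(can_start(8))   # True:  requires the updates of minibatches 1..4
print(can_start(9))   # False: still missing the update of minibatch 5
```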

Data Parallelism with Multiple VWs

At this point, the virtual worker computes the aggregated updates from minibatch c × (s_local + 1) + 1 to minibatch (c + 1) × (s_local + 1) and pushes the aggregated update ũ to the parameter server. Note that the parameter server only updates its c_global to c + 1 after every virtual worker has pushed the aggregated updates of wave c. In WSP, each virtual worker is allowed to continue training without retrieving the global weights for every wave.

Therefore, a virtual worker with local clock c, where c ≥ D + 1, must use a version of the weights that includes all (aggregated) updates from wave 0 to c − D − 1. The weight version may also include some recent global updates of other virtual workers and some recent local updates within the virtual worker beyond wave c − D − 1.
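The global staleness condition can be sketched as a check against the parameter server's global clock, where c_global = c + 1 means that every virtual worker has pushed the aggregated updates of waves 0 to c. The helper below is an assumed simplification of WSP, not the actual synchronization code:

```python
D = 2  # global staleness bound (illustrative value)

def may_proceed(local_clock_c, c_global):
    """A worker at local clock c needs global weights reflecting waves 0 .. c - D - 1."""
    if local_clock_c < D + 1:
        return True                           # the initial waves are exempt
    return c_global >= local_clock_c - D      # waves 0 .. c - D - 1 are all reflected

print(may_proceed(local_clock_c=5, c_global=3))  # True:  waves 0..2 are in the global weights
print(may_proceed(local_clock_c=5, c_global=2))  # False: wave 2 has not been pushed by everyone
```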

Figure 13: Local and global staleness with WSP

Partitioning Algorithm

Experimental Results

Since the nodes are heterogeneous, the division of layers for a DNN model is different for each virtual worker. On the other hand, since the performance of each virtual worker varies, a straggler (slower virtual worker) can degrade performance with DP. This is possible because the same partition of the model can be assigned to the GPU of the same node in every virtual worker.

Second, the training performance of ResNet-152 and VGG-19 with NP is low because Nm is bounded by the virtual worker with the smallest GPU memory. This is because HetPipe allows each virtual worker to process a large number of minibatches simultaneously.

Table 8: Resource allocation for the three policies considered

Discussion and Summary

We focus on parallelizing preprocessing for DNN-based recommendation to increase the efficiency of data preprocessing. The reasons why data preprocessing for DNN-based recommender systems is essential are as follows. We propose a Resource-Adaptive Preprocessing (RAP) scheme that improves performance by taking computing resources into account.

To develop RAP, we analyze the dataset preprocessing steps and discuss three insights that inform parallel dataset preprocessing. In the next section, we present dataset preprocessing for a DNN-based recommender system as background.

Dataset Pre-Processing for DNN-based Recommendation System

The first step is feature re-initialization, which essentially takes the categorical features and re-inserts them into data structures appropriate for preprocessing. Since the chronological order of the dataset is essential, it has been difficult to parallelize this traditional preprocessing step. In the last step, all the files created in step 2 are combined into one file and then compressed.

However, in practice it is done this way for "checkpointing" purposes, in case a failure occurs during the preprocessing phase. This is because, as previously mentioned, the preprocessing phase can take tens to hundreds of hours depending on the dataset size.
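To make the first step concrete, the sketch below maps categorical values to dense integer IDs in first-seen order, which is why it must scan the raw files chronologically. It assumes a Criteo-style tab-separated layout (label, 13 numerical columns, then categorical columns); the file names and column split are illustrative, not the thesis code:

```python
def build_feature_index(raw_files):
    """Step 1 sketch: assign dense IDs to categorical values in chronological first-seen order."""
    index = {}                                    # categorical value -> dense integer ID
    for path in raw_files:                        # files must be given in chronological order
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                for value in fields[14:]:         # assumed categorical columns (Criteo-style)
                    if value and value not in index:
                        index[value] = len(index)
    return index

# Example (hypothetical file names):
# index = build_feature_index(["day_0.txt", "day_1.txt"])
```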

Three Insights to Parallelizing Dataset Pre-Processing

This requires reading the entire raw dataset, typically from a series of files, in chronological order [1 in Figure 18], because the preprocessed dataset must maintain the order of the raw dataset and because recommender systems must be trained according to the chronological order of the dataset. However, step 1 contains the feature indexing procedure, which requires chronologically ordered, that is, sequential, processing of the entire dataset to generate a feature indexing file. Here we can see that the number of units and the number of chunks to be processed at once differ depending on the resources of the computing node.

Also, considering the memory size of the computing node, it may not be possible to process all units at once. It is therefore essential to determine the number of units X and the number of chunks Y to be processed at a time according to the state of the computing resources.
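A hedged sketch of choosing X (units processed in parallel) and Y (chunks of each unit resident at once) from the node's cores and free memory. The sizing formula is an illustrative assumption rather than the exact RAP policy, and psutil is a third-party library used only to read available memory:

```python
import os
import psutil  # third-party; used only to query free memory

def choose_parallelism(unit_mem_bytes, chunk_mem_bytes, chunks_per_unit):
    cores = os.cpu_count()
    free_mem = psutil.virtual_memory().available
    # X: as many units as the cores allow, capped by what fits in memory.
    x = max(1, min(cores, free_mem // unit_mem_bytes))
    # Y: how many chunks of each unit can be resident at the same time.
    y = max(1, min(chunks_per_unit, (free_mem // x) // chunk_mem_bytes))
    return int(x), int(y)

x, y = choose_parallelism(unit_mem_bytes=8 * 2**30, chunk_mem_bytes=2**30, chunks_per_unit=8)
print(f"process {x} units in parallel, {y} chunks each at once")
```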

Figure 19: The number of units X, and the number of chunks Y of parallel execution

Resource-Adaptive Parallelizing Dataset Pre-Processing

Each node has its own resource manager, and the preprocessing is performed in the same way as on a single node. However, as explained in Insight 1, the feature indexing of step 1 must maintain chronological order. After step 1 is completed, each node sends its feature indexing result to the master node (node 0) and waits.

When the master node receives files from all nodes, it sends the entire feature indexing file in chronological order to all nodes.
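A minimal sketch of this step-1 exchange, with no real networking and with merging logic that is an assumed simplification: each node builds a partial index over its chronological shard, the master merges the partial results in chronological order so that IDs follow first appearance across the whole dataset, and the merged index is then distributed back to every node.

```python
def merge_partial_indexes(partials_in_chronological_order):
    """Master-node sketch: merge per-node feature indexes, preserving chronological first-seen order."""
    merged = {}
    for partial in partials_in_chronological_order:   # node 0's shard first, then node 1, ...
        for value in partial:                          # dict preserves insertion order
            if value not in merged:
                merged[value] = len(merged)
    return merged

# Node-local partial indexes, ordered by the chronological range each node processed.
node_partials = [
    {"ad_1": 0, "ad_7": 1},          # node 0: earliest days
    {"ad_7": 0, "ad_3": 1},          # node 1: later days
]
global_index = merge_partial_indexes(node_partials)
print(global_index)                   # {'ad_1': 0, 'ad_7': 1, 'ad_3': 2}
```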

Performance Evaluation

Finally, parallel preprocessing can be invoked simply by adding an execution option that names the parallelization mode to be used in each step. With unit-parallel execution of the Terabyte dataset on Machine-H in Figure 20 (a), the re-initialization step completed fastest, but preprocessing could not be completed because an out-of-memory (OOM) error occurred in the conversion step. With RAP, however, all preprocessing completes while using the resources efficiently.

When preprocessing the Terabyte dataset on Machine-H and Machine-L, RAP is 93% and 89% faster than serial execution, respectively. In Figure 21 (a) and (b), for the Terabyte dataset, the preprocessing time decreases almost linearly on both machines.

Table 11: Machine specifications

Summary

Ng, "Large Scale Distributed Deep Networks," i Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2012. Liu, "Asynchronous Decentralized Parallel Stochastic Gradient Descent," i Proceedings of the International Conference on Machine Learning (ICML), 2018. Building high-level features Using Large Scale Unsupervised Learning," inProceedings of the International Conference on Machine Learning (ICML), 2012.

Dean, "Device Placement Optimization with Reinforcement Learning," in Proceedings of the International Conference on Machine Learning (ICML), 2017. Hechtman, "Mesh-TensorFlow: Deep Learning for Super-computers," in Proceedings of the Advances in Neural Information Processing Systems ( NIPS), 2018. Rabinovich, "Going Deeper with Convolutions," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Zhou, "Fast Distributed Deep Learning over RDMA," i Proceedings fra European Conference on Computer Systems (EuroSys), 2019.

Figures

Figure 1: CNN model training structure
Figure 2: Distributed training: time breakdown (after 100 step executions) including network transfer time, where the left bars are the average step execution time for all workers and the right bars are the average step execution time for the slowest worker
Figure 3: Network interference: S represents single model training, C represents consolidated training with 8 models in a cluster.
Figure 4: Memory usage and computation of CNN
