
Performance Modeling and Device Placement Technique for Parallel Training Deep Neural Networks


Also, our device placement technique can leverage our performance model to avoid offline profiling. Our prediction model takes less than a few seconds to run, and our device placement algorithm, using the results from the modeling, achieves better training speed than data parallelism as the baseline. Our heterogeneity-aware device placement achieves a 1.12x speedup for the VGG model compared to the baseline partitioning scheme, which does not consider heterogeneous GPU performance.

When heterogeneous groups of GPUs are shared among different DNN jobs, a single job may have to use different types of GPUs simultaneously to train one network. The following questions motivate our work on distributed training in heterogeneous GPU computing environments. However, since the efficiency depends on the model partitioning and the device placement of the partitioned model, both must be optimized carefully.

In order to use heterogeneous GPUs to train a single DNN model, the heterogeneous performance of the GPUs must be taken into account, which makes the problem more complex. In this work, we discuss model partitioning and device placement to improve pipeline parallelism under heterogeneous GPU performance. In pipeline parallelism, the computation and communication times can differ depending on how the layers are placed on devices.

The performance model can help us avoid offline profiling to obtain the performance information needed for device placement algorithms or for configuring distributed executions.

Data parallelism

Model Parallelism

Pipeline Parallelism

Straggler Problem

Motivation

Analysis

Because the parameter server and the worker server are on the same node, the number of remote transfers is w-1 instead of w.

Model parallelism and pipeline parallelism

Communication in model parallelism and pipeline parallelism occurs only between partitioned models. When the activation output computed by one layer is transferred as the input of the next layer, it is transmitted over the network or the PCI-E interconnect.

K is the array size, and W, H, and D are the width, height, and depth of the activation output (the next layer's input data). In inter-layer communication, since only one matrix is transmitted, a performance model for the communication time can be defined in a very simple form.
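A minimal sketch of this simple inter-layer model, assuming the array size is the product of the activation dimensions and the transfer time is linear in that size (the coefficients are fit parameters and assumptions here, not values from the thesis):

```latex
% Hedged sketch of the inter-layer transfer model.
% W, H, D: width, height, depth of the activation output.
% Whether batch size and element size are folded into K or into the
% coefficients is an assumption; \alpha is a per-element transfer cost
% and \beta a fixed latency, both obtained by fitting measurements.
K = W \times H \times D, \qquad
T_{\mathrm{layer\text{-}comm}}(K) \approx \alpha \cdot K + \beta
```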

When the training parameters are asymmetrically distributed over the parameter servers, the communication traffic handled by each parameter server differs. For example, the parameters of Inception3 are distributed almost uniformly because its weight matrices have similar sizes, so each parameter server shows similar load balancing. In the case of Overfeat, however, some parameters are very large compared to the others. Also, since parameter passing in TensorFlow is performed by the gRPC framework with multi-threading, concurrent transfers can interfere with each other.

Therefore, the communication time modeling takes into account the load imbalance and the number of gRPC threads.
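As an illustration, a small sketch of how the per-server load and its imbalance could be estimated; the round-robin placement policy and all names are assumptions for illustration, not the exact scheme used in the thesis:

```python
# Hedged sketch: estimating per-parameter-server traffic and its imbalance.
def per_ps_traffic(param_sizes_bytes, num_ps):
    """Bytes each parameter server exchanges with one worker per iteration."""
    traffic = [0] * num_ps
    for i, size in enumerate(param_sizes_bytes):
        traffic[i % num_ps] += size  # assume parameter i goes to server i mod num_ps
    return traffic

def imbalance_ratio(param_sizes_bytes, num_ps):
    """Max/mean load ratio: close to 1.0 for Inception3-like uniform
    parameters, larger for skewed models such as Overfeat."""
    traffic = per_ps_traffic(param_sizes_bytes, num_ps)
    return max(traffic) / (sum(traffic) / num_ps)
```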

Figure 4: Computation time Breakdown

Methodology

This memory size is divided by the global memory bandwidth to obtain the memory read/write time. The final computation time is obtained by combining the core computation time and the memory read/write time. The formula below represents how the computation time of one iteration of a DL model is calculated.
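A minimal sketch of such a formula, assuming the per-layer core and memory read/write terms are simply summed over all layers of the model (the exact form in the thesis may include additional factors, e.g. for the forward and backward passes):

```latex
% Hedged sketch of the per-device computation-time model.
% Core(l, d): core computation time of layer l on device d.
% RW(l, d): memory read/write time of layer l on device d (defined below).
T_{\mathrm{comp}}(d) = \sum_{l \in \mathrm{Layers}}
  \bigl( \mathrm{Core}(l, d) + \mathrm{RW}(l, d) \bigr)
```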

RW(l, d) represents the memory read/write time, obtained by dividing all requested bytes for the input data, filter data, and activation output data by the memory bandwidth of the device.

Linear regression using synthetic data

Figure 6 shows that, when synthetic data is generated and the actual transmission time is measured, the linear relationship holds as in the analysis above. The time required for one gRPC thread to transfer one matrix can be expressed by the following equation, whose coefficients are approximated by fitting them.
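Given the linear relationship in Figure 6, a plausible form of this equation is the sketch below, where a and b are assumed to be the regression coefficients:

```latex
% Hedged sketch of the single-thread gRPC transfer-time model.
% s: payload size in bytes; a: per-byte cost; b: fixed overhead
% (e.g. serialization and launch latency), both fit by linear
% regression on the synthetic-data measurements.
T_{\mathrm{gRPC}}(s) \approx a \cdot s + b
```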

The only communication is between partitioned models, and the data transfer size equals the size of the activation output of the last layer of a partition. When multiple gRPC threads transmit multiple parameters simultaneously, the number of gRPC threads, the serialization/deserialization time, and the interference between data transfers over shared resources such as the network and PCI-E should be considered. Since serialization/deserialization uses CPU cores, we assume it can be executed concurrently in a multi-core environment.

On the other hand, since the transfer itself uses the network switch or PCI-E, gRPC threads interfere with each other. When multiple threads try to send their own parameters, the 'transfer' parts slow each other down, while the serialization and deserialization parts do not. The default number of gRPC threads in TensorFlow is 8, so the transfer latency can increase by up to 8x.
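A minimal sketch of this interference-aware transfer model, under the stated assumption that only the transfer portion is slowed by concurrent senders; function and parameter names are illustrative assumptions:

```python
# Hedged sketch: serialization/deserialization runs concurrently on CPU
# cores, while the network/PCI-E transfer part is slowed by the number of
# gRPC threads sending at the same time (TensorFlow's default pool size is 8).
GRPC_THREADS = 8

def param_transfer_time(size_bytes, per_byte_cost, fixed_overhead,
                        serde_time, concurrent_senders):
    slowdown = min(concurrent_senders, GRPC_THREADS)
    transfer = (per_byte_cost * size_bytes + fixed_overhead) * slowdown
    return serde_time + transfer  # serde part is not slowed by other threads
```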

We first analyze the architecture of a DL model and its parameter server and worker server configuration, and identify the slowest worker and the parameters it sends to or receives from a parameter server.

Figure 6: Linearity of data transmission time for one data payload

Result

Each communication time was measured from the time the parameter servers sent parameters to each worker until the time each worker finished receiving them. Our communication model predicts these four models well, with an overall average error rate of 13.79%. We did not include small DL models in this result, because small DL models were not trained with distributed GPU execution.

Total iteration time

We combine the computation prediction and the communication prediction into an iteration-time prediction. The iteration-time error for AlexNet is relatively large; we see that this is because the computation time error rate for AlexNet was not negligible. Since communication dominates an iteration of Overfeat, its iteration-time results appear similar to its communication-time results.

Calculation time per layer

Figure 11 compares the estimated calculation time with the actual calculation time. We run this experiment on the Overfeat, VGG16, and Inception4 models using two types of GPUs. The results show larger errors for the front layers and relatively good predictions for the middle and later layers.

We interpret the relatively large error in the front layers as follows: their memory read/write cost is very high because of their large input/activation outputs and filter sizes, so the memory access time becomes a bottleneck. Since the per-layer error averages 25% across these three models and the front-layer error is large, better accuracy can be expected by improving the memory read/write model. Note that in pipeline parallelism only one parameter array is sent by one gRPC thread, so no interference occurs.

Convolutional networks generally show that the size of the data transferred from the front to the back layers is quite large, which confirms that the intuition behind our performance modeling is well suited.

Figure 8: One iteration time of prediction and measurement when each configuration is executed on TitanX

Discussion

Motivation

Methodology

To achieve the maximum efficiency of the pipeline, the execution times of the steps must be balanced, since the execution time of the slowest step becomes the final iteration time for processing a minibatch in pipeline parallelism. This problem can be defined as a min-max problem and formulated with an objective function and constraints.
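A minimal sketch of such a min-max formulation, assuming each stage's step time is its computation time plus the time to ship its boundary activation to the next stage (the exact objective and constraints in the thesis may differ):

```latex
% Hedged sketch of the min-max partitioning objective.
% P_1..P_k: consecutive layer partitions; d_i: device assigned to P_i;
% T_comp, T_comm: predicted computation and inter-stage transfer times.
\min_{\{P_i,\, d_i\}} \; \max_{1 \le i \le k}
  \Bigl( T_{\mathrm{comp}}(P_i, d_i) + T_{\mathrm{comm}}(P_i \rightarrow P_{i+1}) \Bigr)
\quad \text{s.t.}\; k \le \text{number of available GPUs}
```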

We formulated the optimization so that the model can also be partitioned to use fewer GPUs than the user-specified number. This occurs when the communication time of a partition dominates its execution time.

Figure 13: Computations and Communications in Pipeline Parallelism

System Overview

Result

Hetero-Pipeline applies our device placement technique, and its performance modeling also takes into account the communication times and the performance of the heterogeneous GPUs. Equal-Pipeline, on the other hand, shows a better speedup than data parallelism thanks to pipelining, but it cannot further increase pipeline efficiency because it does not skew the model partitioning toward the faster GPUs. For the VGG16 model, Hetero-Pipeline with the best model split was 1.42x and 1.12x faster than data parallelism and Equal-Pipeline, respectively.

Because Overfeat has a small number of layers and each layer has a long computation time, splitting the model was very simple, and it yields the same device placement as Equal-Pipeline. For models with more layers and smaller per-layer computation times, the effect of the Hetero-Pipeline technique will be more critical.

Discussion

Performance Modeling

Pipeline Parallelism

Agarwal, "Cntk: Microsoft's Open Source Deep Learning Toolkit," in Proceedings of 22nd International ACM SIGKDD on Knowledge Discovery and Data Mining, ser. Xing, "Addressing the Straggler problem for iterative convergent parallel ml," in Proceedings of Seventh ACM Symposium on Cloud Computing, ser. 9] “Efficient and robust parallel DNN training through model parallelism on multi-gpu platform”, CoRR, vol.

I would like to express my special thanks to my advisor Young-ri Choi for her patient guidance, enthusiastic encouragement, and helpful criticism throughout this research work. I would also like to express my appreciation to Professor MyeongJae Jeon, who always listened carefully to my thoughts and motivated me to complete this research work.

Secondly, I would also like to express my appreciation to the following companies for their assistance and great help. Finally, I would like to thank my parents for their support and encouragement throughout my studies.

Figures

Figure 1: GPU used and idle ratio
Figure 2: Pipeline Parallelism
Figure 3: One iteration can be divided into four steps.
Figure 4: Computation time Breakdown
