Distributed parallelism methods should be chosen according to the characteristics of the DNN and of the available computing resources. Hybrid parallelism is applied from the preprocessing of the input dataset through the DNN training stage.
Contributions
RAP monitors resource status in real time and adjusts the degree of parallelism so that preprocessing runs in parallel in a manner appropriate to the available resources. In addition, distributed preprocessing across multiple nodes is implemented, and experimental results for it are presented.
Organization
We propose RALP (Resource-Aware Layer Placement), a scheme that reduces network traffic through layer placement that takes into account the resources used by each layer. Our experimental results also show that, with preprocessing distributed across four nodes, preprocessing that took 112 hours to run serially was reduced by as much as 97%, to 4 hours.
DNN Training
Distributed Parallel Training for DNN
Each GPU performs the forward and backward passes for the layers of its assigned partition. When GPU_k finishes the forward pass of its partition, the backward pass of the minibatch is performed from GPU_k down to GPU_1.
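As a rough illustration of this schedule (a toy sketch, not the system's actual implementation), the following fragment walks a minibatch forward through the partitions of GPU_1 through GPU_K and then backward from GPU_K down to GPU_1; plain Python lists and floats stand in for GPU-resident layers, activations, and gradients.

```python
# Toy sketch of the forward/backward schedule over model partitions: lists of
# "layers" stand in for GPU-resident partitions, floats for activations/gradients.

def forward_backward(partitions, minibatch):
    """Run one minibatch forward through all partitions, then backpropagate in reverse."""
    activation = minibatch
    cached = []                                     # input cached per partition for backward
    for part in partitions:                         # GPU_1 ... GPU_K forward
        cached.append(activation)
        for layer in part:
            activation = layer["forward"](activation)
    grad = activation                               # stand-in for the loss gradient
    for k in reversed(range(len(partitions))):      # GPU_K down to GPU_1 backward
        for layer in reversed(partitions[k]):
            grad = layer["backward"](grad, cached[k])
    return grad

# Toy partitions: each "layer" just scales its input.
toy_layer = {"forward": lambda x: 2.0 * x, "backward": lambda g, a: 2.0 * g}
partitions = [[toy_layer, toy_layer], [toy_layer]]  # two partitions (two GPUs)
print(forward_backward(partitions, minibatch=1.0))
```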
Training Convolutional Neural Networks
At the end of each iteration, the workers exchange gradients to synchronize (or merge) their model updates. In this architecture, the PS hosts the master copy of the model and is in charge of updating it using the local results sent by all workers.
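The following is a minimal sketch of this parameter-server pattern, assuming a toy linear model and synchronous averaging of worker gradients; the class and function names are illustrative and not taken from any particular framework.

```python
# Sketch: each worker computes a local gradient on its data shard; the PS keeps
# the master copy of the weights, merges the gradients, and applies the update.
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)        # master copy of the model
        self.lr = lr

    def update(self, worker_grads):
        merged = np.mean(worker_grads, axis=0)   # merge local results from all workers
        self.weights -= self.lr * merged
        return self.weights                       # workers pull the fresh weights back

def worker_gradient(weights, x, y):
    # Toy linear-regression gradient on one worker's minibatch shard.
    pred = x @ weights
    return x.T @ (pred - y) / len(y)

rng = np.random.default_rng(0)
ps = ParameterServer(dim=4)
shards = [(rng.normal(size=(8, 4)), rng.normal(size=8)) for _ in range(3)]
for step in range(5):
    grads = [worker_gradient(ps.weights, x, y) for x, y in shards]
    ps.update(grads)                              # gradients exchanged at end of iteration
print(ps.weights)
```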
CNN Characterization
First, for a number of CNN models, a significant fraction of the memory is used by a few layers that are processed in the last phase of the forward pass. We also show that a large part of the parameters that must be communicated to the PS is concentrated in these latter layers.
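This characterization is easy to reproduce. The snippet below (assuming a recent torchvision; the model choice is only an example) prints each layer's share of the total parameters and shows that the first fully connected layer of VGG-16 alone accounts for the majority of them.

```python
# Count parameters per layer of a CNN to see how heavily the fully connected
# tail dominates; VGG-16 from torchvision is used as an example.
import torchvision.models as models

model = models.vgg16(weights=None)           # random init; no download needed
counts = [(name, p.numel()) for name, p in model.named_parameters()]
total = sum(n for _, n in counts)
for name, n in counts:
    print(f"{name:<25} {n / total:6.1%}")
# On VGG-16, the first FC layer (classifier.0) alone holds roughly three quarters
# of all parameters, which is exactly the communication hot spot discussed above.
```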

Layer Placement with RALP
The natural choice for layer placement would be to place the layers with small computation but large memory requirements (i.e., parameters) on the PS machine. From now on, we use the term PS worker to explicitly denote the training procedure that runs on the PS machine. Since the memory-intensive fully connected layers placed on the PS machine are not computationally intensive [56], this processing is typically fast and does not add back pressure.
To facilitate this, we distribute the computation previously performed by a single PS worker so that multiple PS workers run in parallel while the workers' activations are merged. In CNNs, the output size tends to decrease along the order of the layers in the forward pass, as shown in Figure 6.
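The sketch below illustrates this placement idea only; it is not the RALP implementation. Workers run a stand-in for the convolutional front end and ship their (comparatively small) activations to a PS worker, which holds the fully connected parameters, merges the activations, and runs the FC computation.

```python
# Illustrative split: compute-heavy front end on the workers, memory-heavy FC
# tail with the PS, so workers ship small activations rather than huge FC gradients.
import numpy as np

rng = np.random.default_rng(1)

def worker_front_end(images):
    # Stand-in for the convolutional layers: produces a small feature vector.
    return images.reshape(len(images), -1)[:, :256]          # "activations"

class PSWorker:
    def __init__(self):
        self.fc = rng.normal(scale=0.01, size=(256, 10))      # FC parameters stay on the PS

    def train_step(self, activation_batches):
        merged = np.concatenate(activation_batches)            # merge worker activations
        logits = merged @ self.fc                              # FC forward on the PS
        return logits.shape

ps = PSWorker()
acts = [worker_front_end(rng.normal(size=(32, 3, 32, 32))) for _ in range(4)]
print(ps.train_step(acts))   # (128, 10): four workers' activations merged on the PS
```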

Evaluation
In this section, we investigate the performance of RALP for isolated training across a variety of GPU allocation scenarios among workers and PSs. Our results are especially interesting because Horovod claims to scale well with the number of GPUs [22]. Interestingly, similar to the mixed workload results in Table 3, the performance of consolidated ResNet-50, the RALP-unfriendly case, improves by about 10% even when running with TensorFlow, as shown in the second row of the table. We also compare 2PS-30W with the use of a single PS worker driving the same number of workers (denoted 1PS-30W).

Discussion and Summary
We evaluated it against BytePS [66, 67], a state-of-the-art system for distributed DNN training. The systems we used were all (nearly) state of the art at the time of purchase, within the budget available to us. Some of them have since become legacy systems that, individually, are not capable of running the large DNN models that are common today.
Second, lower-class GPUs can be used to improve the performance of even high-class GPUs by incrementally adding the resources of the (older) lower-class systems to the (newer) high-class systems. Our experimental results show that HetPipe outperforms state-of-the-art DP via Horovod [22], which uses AllReduce communication [82].

Model Parallelism and Pipeline Execution
Compared to Horovod, convergence to a desired accuracy becomes 49% faster for VGG-19, which has a large parameter set, and 39% faster for ResNet-152, which is too large to load on the four weakest GPUs in our benchmark, when all GPUs (including the weak ones) are used. First, we generalize the PMP of a single virtual worker so that it can be used together with DP across virtual workers, increasing the parallelism of DNN model training. Second, we consider a heterogeneous GPU pool, which allows the use of GPUs that otherwise could not be used for training.
Finally, we present a parameter synchronization model that guarantees convergence when training models with PMP combined with DP. In GPipe, DP across multiple virtual workers can be performed with existing synchronization schemes such as BSP, since each virtual worker processes one minibatch at a time.
System Overview
Then, for the given DNN model and the allocated GPUs, the model allocator assigns model partitions to each virtual worker so that the performance of the pipeline running on that virtual worker is maximized. Each virtual worker keeps a local copy of the global weights and periodically synchronizes it with the parameter server. Since each virtual worker processes minibatches in a pipelined fashion, many minibatches are processed in parallel.
Since, within a wave, the virtual worker does not apply the update of even the first minibatch to the weights used by the last minibatch, the local staleness threshold in WSP is N_m − 1. Additionally, each virtual worker pushes only the aggregated updates of all minibatches in a wave, rather than the updates of each minibatch, to the parameter server. This results in a significant reduction in overall communication cost.
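A minimal sketch of this wave-level push, assuming N_m concurrently executed minibatches per wave: the per-minibatch updates are summed locally and only the aggregated update is handed to the parameter server.

```python
# Toy sketch: aggregate the updates of one wave locally and push them once.
import numpy as np

N_M = 4                                   # minibatches executed concurrently per wave
rng = np.random.default_rng(2)
weights = np.zeros(8)                     # stand-in for the PS's global weights

for wave in range(3):
    updates = [rng.normal(scale=0.01, size=8) for _ in range(N_M)]  # one update per minibatch
    aggregated = np.sum(updates, axis=0)  # aggregate within the wave ...
    weights += aggregated                 # ... and push a single update to the PS
print(weights)
```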

Pipelined Model Parallelism Within a VW
Then, initially, we can assume that w_0, the initial version of the weights, is given to the virtual worker. When the virtual worker starts processing a new minibatch, it uses the latest value of w_local without waiting for the other minibatches to update the weights. Similarly, when the virtual worker finishes minibatch s_local + 1 and updates w_local with u_(s_local+1), it starts processing minibatch 2 × (s_local + 1) without waiting for the previous, most recent s_local minibatches to complete.
Therefore, with the exception of the initial minibatches 1 to s_local + 1, for minibatch p the virtual worker uses a version of the weights that reflects (at least) all local updates of minibatches 1 to p − (s_local + 1). However, local staleness in HetPipe is inherent, as it is caused by processing minibatches in a pipelined manner within a virtual worker.
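The staleness rule can be stated compactly in code; the helper below is a hypothetical illustration that, for a minibatch index p and threshold s_local, lists the minibatches whose local updates must already be visible.

```python
# Sketch of the local staleness rule: minibatch p must see the local updates of
# minibatches 1 .. p - (s_local + 1); the s_local most recent ones may still be in flight.
def required_local_updates(p, s_local):
    """Return the minibatches whose updates must be visible when minibatch p starts."""
    return list(range(1, max(0, p - (s_local + 1)) + 1))

s_local = 3                       # e.g. N_m - 1 with N_m = 4 concurrent minibatches
for p in [1, 4, 5, 9]:
    print(p, required_local_updates(p, s_local))
# Minibatches 1 .. s_local + 1 start from the initial weights w_0 (no updates yet);
# minibatch 9 must already see the updates of minibatches 1 .. 5.
```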
Data Parallelism with Multiple VWs
At this point, the virtual worker computes the aggregated updates from minibatch c × (s_local + 1) + 1 to minibatch (c + 1) × (s_local + 1) and pushes the aggregated updates ũ to the parameter server. Note that the parameter server updates its c_global to c + 1 only after every virtual worker has pushed the aggregated updates of wave c. In WSP, each virtual worker is allowed to continue training without retrieving the global weights for every wave.
Therefore, a virtual worker with local clock c, where c ≥ D + 1, must use a version of the weights that includes all (aggregated) updates from wave 0 to wave c − D − 1. The weight version may also include some more recent global updates from other virtual workers, as well as some more recent local updates within the virtual worker, beyond wave c − D − 1.
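The sketch below encodes one plausible reading of this condition, assuming that c_global counts fully reflected waves (so c_global = c means waves 0 to c − 1 are in the global weights) and that D is the global staleness bound; the function name is illustrative.

```python
# Sketch of the global staleness check a virtual worker (or the PS) would apply
# before the worker starts its next wave.
def may_proceed(c_worker, c_global, D):
    """A virtual worker at local clock c_worker may start its next wave only if the
    global weights already include all aggregated updates up to wave c_worker - D - 1."""
    if c_worker < D + 1:
        return True                      # early waves may run against w_0 freely
    return c_global >= c_worker - D      # c_global = c means waves 0 .. c-1 are reflected

D = 2
print(may_proceed(c_worker=1, c_global=0, D=D))   # True: within the bound
print(may_proceed(c_worker=5, c_global=2, D=D))   # False: must wait for wave 2
print(may_proceed(c_worker=5, c_global=3, D=D))   # True: waves 0 .. 2 are reflected
```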

Partitioning Algorithm
Experimental Results
Since the nodes are heterogeneous, the partitioning of the layers of a DNN model differs for each virtual worker. On the other hand, since the performance of each virtual worker varies, a straggler can degrade performance with DP. This is possible since, for each virtual worker, the same partition of the model can be assigned locally to the GPUs of the same node.
Second, the training performance of ResNet-152 and VGG-19 with NP is low because N_m is bounded by the virtual worker with the smallest GPU memory. This is because HetPipe allows each virtual worker to process a large number of minibatches simultaneously.

Discussion and Summary
We focus on parallelizing preprocessing for DNN-based recommendation to increase the efficiency of data preprocessing. The reasons why data preprocessing is essential for a DNN-based recommender system are as follows. We propose a Resource-Adaptive Preprocessing (RAP) scheme that improves performance by taking computing resources into account.
To develop RAP, we analyze the dataset preprocessing steps and discuss three insights that inform parallel dataset preprocessing. In the next section, we discuss dataset preprocessing for a DNN-based recommender system as background.
Dataset Pre-Processing for DNN-based Recommendation System
The first step is feature re-initialization, which essentially takes the categorical features and re-inserts them appropriately into the data structures used for preprocessing. Since the chronological order of the dataset is essential, this step of traditional preprocessing has been difficult to parallelize. In the last step, all the files created in step 2 are combined into one file, which is then compressed.
However, in practice it is done this way for "checkpointing" purposes just in case a failure occurs during the preprocessing phase. This is because, as previously mentioned, the preprocessing phase can take tens to hundreds of hours depending on the dataset size.
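A toy end-to-end sketch of these three steps is shown below; the file names and the CSV-like format are hypothetical, and the real pipeline works on far larger raw files, but the structure (chronological feature indexing, per-file conversion kept as checkpoints, final combine and compress) is the same.

```python
# Toy sketch of the three preprocessing steps; file names and format are
# hypothetical stand-ins for the real raw dataset files.
import gzip, json, pathlib

# Create two tiny "raw" files in chronological order (hypothetical format:
# label followed by categorical feature values).
pathlib.Path("day_0.txt").write_text("1,ad_a,user_x\n0,ad_b,user_y\n")
pathlib.Path("day_1.txt").write_text("0,ad_a,user_z\n")
RAW = ["day_0.txt", "day_1.txt"]

def step1_feature_indexing(raw_files):
    """Step 1: scan the raw data in chronological order and assign an index to
    every categorical feature value (the order-sensitive part)."""
    index = {}
    for path in raw_files:
        for line in pathlib.Path(path).read_text().splitlines():
            for value in line.split(",")[1:]:
                index.setdefault(value, len(index))
    return index

def step2_convert(raw_files, index):
    """Step 2: convert each raw file into an intermediate file; the per-file
    outputs double as checkpoints in case of failure."""
    outputs = []
    for path in raw_files:
        rows = [[index[v] for v in line.split(",")[1:]]
                for line in pathlib.Path(path).read_text().splitlines()]
        out = path + ".converted.json"
        pathlib.Path(out).write_text(json.dumps(rows))
        outputs.append(out)
    return outputs

def step3_combine(intermediate_files, final_path="train.json.gz"):
    """Step 3: combine all intermediate files into one compressed file."""
    combined = []
    for path in intermediate_files:
        combined.extend(json.loads(pathlib.Path(path).read_text()))
    with gzip.open(final_path, "wt") as f:
        json.dump(combined, f)
    return final_path

index = step1_feature_indexing(RAW)
print(step3_combine(step2_convert(RAW, index)))
```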
Three Insights to Parallelizing Dataset Pre-Processing
This requires reading the entire raw dataset, typically a series of files, in chronological order [1 in Figure 18], because the preprocessed dataset must maintain the order of the raw dataset and because recommender systems require training in the chronological order of the dataset. However, step 1 contains the feature indexing procedure, which requires chronologically ordered, that is, sequential, processing of the entire dataset to generate a feature indexing file. Here we can see that the number of units and the number of chunks to be processed at once differ depending on the resources of the computing node.
Also, considering the memory size of the computing node, it may not be possible to process all units at once. It is essential to determine how many units (X) and chunks (Y) to process at once, depending on the state of the computing resources.
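One simple way to make this decision, sketched below under the assumption that the in-memory size of a unit is known, is to query the currently available memory (here via psutil) and cap the number of units processed at once accordingly; the function and its parameters are illustrative, not part of RAP.

```python
# Pick a degree of parallelism from the currently available memory.
import psutil

def pick_parallel_degree(unit_bytes, max_units, safety=0.8):
    """Process as many units at once as fit into a safe fraction of free RAM."""
    free = psutil.virtual_memory().available
    fit = int(free * safety) // unit_bytes
    return max(1, min(max_units, fit))

# e.g. 6 GiB units: a machine with less free RAM gets fewer units in parallel.
print(pick_parallel_degree(unit_bytes=6 * 2**30, max_units=24))
```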

Resource-Adaptive Parallelizing Dataset Pre-Processing
Each node has its own resource manager, and preprocessing is performed in the same way as on a single node. However, it should be noted that, as explained in Insight 1, the feature indexing in step 1 must maintain chronological order. After step 1 is completed, each node sends its feature indexing result to the master node (node 0) and waits.
When the master node has received the files from all nodes, it sends the entire, chronologically ordered feature indexing file to all nodes.
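Below is a hedged sketch of this exchange, assuming mpi4py is used for the node-to-node transfer (the text does not prescribe a specific transport) and that rank order corresponds to the chronological order of the data slices; run with, e.g., mpiexec -n 4.

```python
# Each node indexes its own chronological slice, sends the partial result to the
# master node (rank 0), which merges the parts in order and broadcasts the result.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def index_local_slice(rank):
    # Stand-in for step 1 on this node's slice of the raw data.
    return {f"feature_{rank}_{i}": None for i in range(3)}

partial = index_local_slice(rank)
partials = comm.gather(partial, root=0)          # every node sends its result and waits

full_index = None
if rank == 0:
    full_index = {}
    for part in partials:                        # ranks are assumed chronologically ordered
        for key in part:
            full_index.setdefault(key, len(full_index))
full_index = comm.bcast(full_index, root=0)      # master sends the complete index back
print(rank, len(full_index))
```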
Performance Evaluation
Finally, parallel preprocessing can be invoked simply by adding an execution option that names the parallelization mode to be used in each step. For unit-parallel preprocessing of the Terabyte dataset on Machine-H in Figure 20(a), the re-initialization step completed the fastest, but preprocessing could not be completed because an out-of-memory (OOM) error occurred in the conversion step. With RAP, however, all preprocessing completes while using the resources efficiently.
When preprocessing the Terabyte dataset on Machine-H and Machine-L, RAP is 93% and 89% faster than serial execution, respectively. As shown in Figure 21(a) and (b), for the Terabyte dataset the preprocessing time decreases almost linearly on both machines.

Summary
Ng, "Large Scale Distributed Deep Networks," i Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2012. Liu, "Asynchronous Decentralized Parallel Stochastic Gradient Descent," i Proceedings of the International Conference on Machine Learning (ICML), 2018. Building high-level features Using Large Scale Unsupervised Learning," inProceedings of the International Conference on Machine Learning (ICML), 2012.
Dean, "Device Placement Optimization with Reinforcement Learning," in Proceedings of the International Conference on Machine Learning (ICML), 2017. Hechtman, "Mesh-TensorFlow: Deep Learning for Super-computers," in Proceedings of the Advances in Neural Information Processing Systems ( NIPS), 2018. Rabinovich, "Going Deeper with Convolutions," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Zhou, "Fast Distributed Deep Learning over RDMA," i Proceedings fra European Conference on Computer Systems (EuroSys), 2019.