Real-Time Control for Power Cost Efficient Deep Learning Processing With Renewable Generation


Received June 17, 2019, accepted July 25, 2019, date of publication August 14, 2019, date of current version August 29, 2019. Digital Object Identifier 10.1109/ACCESS.2019.2935389


DONG-KI KANG AND CHAN-HYUN YOUN, (Member, IEEE)

School of Electrical Engineering, KAIST, Daejeon 34141, South Korea

Corresponding author: Chan-Hyun Youn ([email protected])

This work was supported in part by ‘‘The Cross-Ministry Giga KOREA Project’’ grant funded by the Korean Government (MSIT), Development of Mobile Edge Computing Platform Technology for URLLC Services under Grant GK18P0400, and in part by the Institute for Information and Communications Technology Promotion (IITP) grant funded by the Korean Government (MSIT) (Service Mobility Support Distributed Cloud Technology) under Grant 2017-0-00294, 2017-0-00294.

ABSTRACT The explosive increase in deep learning (DL) deployment has made GPU power usage a major factor in the operational cost of modern HPC clusters. The complex mixture of DL processing, fluctuating renewable generation, and dynamic electricity price impedes elaborate GPU power control and leads to undesirable cost. However, most previous studies have been concerned only with the design of power management methods using DL, and have not considered the cost caused by the GPU power consumption of DL processing itself. This paper, in the opposite direction to these trends, proposes a real-time power controller called DeepPow-CTR for cost efficient DL processing in GPU based clusters. We design a GPU frequency scaling algorithm based on model predictive control (MPC) to finely tune the DL power consumption in response to dynamic renewable generation and electricity price. At the same time, we avoid unacceptable DL performance degradation by regulating the memory-access / feed-forward / back-propagation (MFB) time per minibatch in deep neural network (DNN) model training. To solve the designed nonlinear MPC problem rapidly and accurately, we apply the damped Broyden-Fletcher-Goldfarb-Shanno (BFGS) based sequential quadratic programming (SQP) method in DeepPow-CTR. Our experimental results on a lab-scale testbed using real trace data of renewable generation and electricity price demonstrate that the proposed DeepPow-CTR is superior and practical in terms of DL processing power cost and performance, compared to existing methods.

INDEX TERMS Artificial intelligence, power engineering computing, renewable energy sources, frequency control, nonlinear control systems, mathematical programming.

I. INTRODUCTION

Deep learning (DL), which is a form of ‘‘representation learning’’, has emerged as a novel and powerful heuristic for solving complex problems in various industrial fields [1]–[3]. Based on a large sample dataset, the non-polynomial regression approach of the DL method is able to reach a surprising accuracy level for a learning model by scaling up the magnitude of the network hidden layers. For complex applications such as image processing, product recommendation systems, big data analysis and natural language processing, the DL method outperforms existing machine learning techniques such as support vector machine (SVM) and k-nearest neighbors (KNN) [4], [5]. Beyond these conventional purposes, the DL paradigm has been rapidly extended to wide research areas such as data center management [6]–[8], wireless network management [9]–[14], and building comfort management [15], [16].

The associate editor coordinating the review of this article and approving it for publication was Yin Zhang.

Recently, power management has become one of the major industrial problems to which many researchers actively attempt to apply the DL method. Lee et al. [17] proposed a convolutional neural network (CNN) based deep power control (DPC) scheme to maximize the energy cost efficiency of wireless communication systems (WCSs). Their method is able to determine the appropriate amount of power transmission for a WCS in real-time, based on the offline training of a CNN model. Yan and Xu [18] proposed an intelligent load frequency control (LFC) to balance power generation and demand. Instead of using vanilla reinforcement learning (RL), they designed a data-driven deep reinforcement learning (DRL) method so as to enable control actions in a continuous domain. Li et al. [7] also proposed a deep reinforcement learning method to solve the cooling optimization problem for data centers so as to minimize the energy cost. Their method optimizes the temperature flow management in response to various environmental states. Although all of the above approaches outperform conventional ones, we found a critical problem: they do not consider the significant power cost caused by DL processing itself.

VOLUME 7, 2019

2169-3536 2019 IEEE. Translations and content mining are permitted for academic research only.

The GPU worker based computing clusters that are commonly used for DL processing consume a non-negligible amount of power [19], [20]. Table 1 shows the history of NVIDIA GPU architecture evolution with respect to processing capability and thermal design power (TDP) [21]. Even though the flops-to-power ratio of GPUs has improved steadily, the absolute GPU power consumption is still very high (> 200W).

Table 2 shows the power consumption of an NVIDIA DGX-1 cluster [22] for DL processing according to the number of assigned GPU workers and the running DL applications [15]. Both AlexNet and GoogleNet model training require a considerable power level (> 1200W), and the power consumption even for small scale DL processing (CifarNet/CIFAR-10) with 8 GPUs is still high (450W).

At the cluster level, besides GPUs, there are two other major factors that affect the power cost of DL processing: electricity price and renewable power capacity. They may change in response to power demands and weather conditions. In order to achieve power cost efficient DL processing, we need to adaptively adjust the GPU power consumption in response to the states of electricity price and renewable generation. For instance, during hours when the electricity price is high and the renewable power capacity is low, we should temporarily lower the power supply for GPUs, even at the cost of slight DL performance degradation, so as to avoid an unacceptable rise in power cost. Consequently, we need an elaborate and adaptive GPU power control method that suitably balances DL power consumption and performance, given dynamic electricity price and renewable generation.

This paper proposes a real-time GPU power controller called DeepPow-CTR which achieves the cost efficient balancing of DL processing power and performance based on a cluster owner-specified trade-off. DeepPow-CTR tunes the DL processing performance by regulating the memory-access / feed-forward / back-propagation (MFB) time per minibatch in deep neural network (DNN) model training. To do this, we design a GPU core/memory frequency scaling algorithm based on the nonlinear model predictive control (NMPC) formulation, which is suitable for constrained control problems [23].

In order to solve the NMPC formulation rapidly and accurately, we apply the damped Broyden-Fletcher-Goldfarb-Shanno (BFGS) based sequential quadratic programming (SQP) method with the exact penalty merit function in DeepPow-CTR [24], [25]. Compared to other candidates such as sequential linearization programming (SLP) and meta-heuristics (e.g., Genetic Algorithm, Simulated Annealing, and Greedy Algorithm), the SQP method shows superior convergence speed in finding optimal solutions of complex nonlinear problems. By integrating the GPU frequency scaling algorithm with the SQP method for NMPC optimization, DeepPow-CTR achieves real-time GPU power control for cost efficient DL processing.

TABLE 1. Evolution of NVIDIA GPUs w.r.t. processing capability and thermal design power (TDP) [21].

TABLE 2. Power consumption of DGX-1 GPU cluster for DL processing (network model/dataset) [15].

For experiments, we tested the training of multiple DNN models such as ResNet [26], VGGNet [27], and GoogleNet [28]. We used image datasets such as MNIST [29], CIFAR-10 [30], and ImageNet [31], [32]. We deployed 8 workers containing NVIDIA Pascal architecture (GTX1060/1080) [21] based GPU devices for our lab-scale testbed. To verify the practicality of DeepPow-CTR, we applied real trace data of electricity price (retrieved from the Federal Energy Regulatory Commission (FERC) [33]) and renewable generation (retrieved from the U.S. Measurement and Instrumentation Data Center (MIDC) [34]) to the experiments. DeepPow-CTR reduces the economic operational cost by 15% on average, compared to existing cluster power control approaches.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. MINIBATCH BASED SGD CALCULATION TIME MODEL

FIGURE 1. Unfolding the procedures of deep neural network (DNN) training based on stochastic gradient descent (SGD) method.

The purpose of DNN training is to establish the network model function that best reflects the real world based on a sample input dataset. Due to the learning convergence rate and the limited GPU memory size, DNN training is generally performed by using the stochastic gradient descent (SGD) method [35]. In the SGD method, the entire raw dataset is divided into multiple sub-datasets called minibatches, and the DNN model parameters are calculated and updated per minibatch. The atomic procedures of the SGD method can be described as follows.

• minibatch upload: An arbitrary minibatch is selected from the raw dataset in storage according to the pre-defined SGD policy and uploaded to the GPU memory. The GPU memory controller manages the transfer interactions between the main memory and the GPU memory [36].

• feed-forward: On the GPU streaming multiprocessors (SMs), the matrix (network model parameters) / vector (layer output) product is calculated along the DNN hidden layers. The final (output) layer extracts the estimated features for each sample input.

• back-propagation: On the GPU streaming multiprocessors (SMs), based on the estimation error given by the pre-defined loss function, the incremental values of the network model parameters are calculated backward along the DNN layers. This is performed by a chain rule based gradient search. In general, the back-propagation processing time is longer than the feed-forward one.

• gradient transfer: The incremental values of the network model parameters are transferred from the GPU memory to the main memory. As with ‘minibatch upload’, the GPU memory controller is responsible for this work.

• communication: The incremental values of the network model parameters are moved from the GPU worker(s) to the parameter server (PS). The PS sends the updated network model parameters to the GPU worker(s). If the GPU worker(s) and the PS reside in the same machine, then we can neglect the associated communication overhead.

We consider that a training iteration is completed when the above five procedures for a minibatch are finished. Once all the training iterations for the entire raw dataset are completed, a training epoch is considered to be finished. DNN training usually requires multiple training epochs to achieve an acceptable estimation accuracy. Fig 1 shows these flows of DNN training using the SGD method. Based on the concept of training iteration and epoch, the DL processing time can be defined as follows.

$$t^{tot} = t^{iter} \times \left\lceil \frac{ds}{mb} \right\rceil \times ep, \tag{1}$$

$$t^{iter} = t^{up} + t^{fw} + t^{bp} + t^{tr} + t^{comm} = t^{MFB} + t^{comm}, \tag{2}$$

where $t^{tot}$ and $t^{iter}$ represent the total training time and the training iteration time, respectively. $t^{up}$, $t^{fw}$, $t^{bp}$, $t^{tr}$, and $t^{comm}$ represent the completion time for minibatch upload, feed-forward, back-propagation, gradient transfer, and communication, respectively. $t^{MFB}$ represents the MFB time, which is the training iteration time excluding the communication time. $ds$, $mb$, and $ep$ represent the number of samples in the entire input dataset, the minibatch size, and the number of epochs, respectively. The term $\lceil ds/mb \rceil$ denotes the number of training iterations per epoch.

Note that the lengths of $t^{fw}$ and $t^{bp}$ depend on the computing speed of the GPU SMs, while those of $t^{up}$ and $t^{tr}$ depend on the access rate of the GPU memory controller. Therefore, GPU core frequency scaling enables tuning of $t^{fw}$ and $t^{bp}$, while GPU memory frequency scaling enables tuning of $t^{up}$ and $t^{tr}$. In Eq (1), $t^{iter}$ can be regarded as the representative factor for the total DNN training time since $ds$, $mb$, and $ep$ are fixed during the entire DL processing. In addition, for simplicity, we only focus on DL processing in a single GPU worker (that is, we do not consider complex parallel DL processing in this paper). In this case, the communication time can be assumed to be zero ($t^{comm} \approx 0$) since the GPU worker and the PS reside in the same machine.
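As a minimal sketch of Eqs (1) and (2), the following Python functions compute the iteration time and the total training time. The function names and the numeric timings are hypothetical values chosen only for illustration; they are not measurements from the paper.

```python
import math


def iteration_time(t_up, t_fw, t_bp, t_tr, t_comm=0.0):
    """Eq (2): iteration time = MFB time + communication time."""
    t_mfb = t_up + t_fw + t_bp + t_tr
    return t_mfb + t_comm


def total_training_time(t_iter, ds, mb, ep):
    """Eq (1): iteration time x iterations per epoch x number of epochs."""
    iters_per_epoch = math.ceil(ds / mb)
    return t_iter * iters_per_epoch * ep


# Hypothetical per-iteration timings in seconds (single worker, t_comm = 0).
t_iter = iteration_time(t_up=0.25, t_fw=0.5, t_bp=1.0, t_tr=0.25)
# Hypothetical dataset of 50,000 samples, minibatch size 128, 10 epochs.
t_tot = total_training_time(t_iter, ds=50_000, mb=128, ep=10)
print(t_iter, t_tot)
```

Because $ds$, $mb$, and $ep$ are fixed, any change in `t_iter` scales `t_tot` proportionally, which is why the iteration time serves as the control target.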

B. MAPPING OF GPU FREQUENCY VALUES TO DEEP LEARNING PROCESSING POWER AND PERFORMANCE

We now present the statistical model functions that map the GPU frequency values to DL processing power and performance. Let $l^{up}$ (cycles) and $l^{tr}$ (cycles) denote the workload of the GPU memory events for minibatch upload and gradient transfer, respectively. Let $l^{fw}$ (cycles) and $l^{bp}$ (cycles) denote the workload of the GPU core events for feed-forward and back-propagation, respectively. Then, according to [37], a DL core time $t^{core}$ and a DL memory time $t^{mem}$ can be defined as follows.

$$t^{core} = t^{core,0} + \kappa^{core}(l^{fw} + l^{bp})(f^{c})^{-1}, \tag{3}$$

$$t^{mem} = t^{mem,0} + \kappa^{mem}(l^{up} + l^{tr})(f^{m})^{-1}, \tag{4}$$

where $f^{c}$ (cycles/sec) and $f^{m}$ (cycles/sec) denote the GPU core and memory frequency values, respectively. $t^{core,0}$ and $t^{mem,0}$ denote the time constants for GPU core events and memory events, respectively. $\kappa^{core}$ and $\kappa^{mem}$ denote the associated time model coefficients for GPU core events and memory events, respectively. Based on Eq (3) and Eq (4), the MFB time in Eq (2) can be reformulated as follows.

$$t^{MFB} = t^{core} + t^{mem}. \tag{5}$$

The DL processing power consumption is also determined by the GPU frequency values. According to [38] and Eq (2), the DL processing power consumption $p$ can be defined as follows.

$$p = p^{0} + \xi^{mem}(l^{up} + l^{tr}) f^{m} + \xi^{core}(l^{fw} + l^{bp}) v^{2} f^{c}, \tag{6}$$

where $p^{0}$ denotes the power constant. $\xi^{core}$ and $\xi^{mem}$ denote the associated power model coefficients for GPU core events and memory events, respectively. $v$ denotes the GPU core voltage value.

According to [39], $v$ is linearly proportional to the core frequency value $f^{c}$, therefore $v = h \times f^{c}$ where $h$ is a positive coefficient. Then, from Eq (6), we reformulate the DL processing power consumption as follows.

$$p = p^{0} + \xi^{mem}(l^{up} + l^{tr}) f^{m} + h^{2} \xi^{core}(l^{fw} + l^{bp})(f^{c})^{3}. \tag{7}$$

In practice, we can choose between two regression based methods to statistically derive $t^{MFB}$ and $p$: GPU kernel profiling based estimation [40], and processing output based estimation [38]. In the former approach, we establish the relationship between the dependent variables (power and execution time of GPU kernels) and the independent variables (performance counters of GPU kernels) via benchmark profiling, and then estimate the DL processing power and performance from the information of the associated GPU kernel instructions. This approach does not require individual offline profiling for each DL application, but it is usually impractical because it requires analyzing the entire kernel information of each DL source code. In the latter approach, we execute the DL applications over pre-defined sets of GPU frequency pairs and derive the statistical model parameters for each DL application. Although this approach incurs offline profiling overhead, we adopt it since it does not require knowledge of the structure of the DL source code.
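The trade-off expressed by Eqs (3)-(5) and Eq (7) can be sketched as two small Python functions. The lumped coefficients (`k_core`, `k_mem`, `x_mem`, `x_core`) stand for the products of model coefficients and workloads in the equations; their numeric values and the use of GHz as the frequency unit are assumptions made purely for illustration.

```python
def mfb_time(f_c, f_m, t0, k_core, k_mem):
    """Eqs (3)-(5): t_MFB = t0 + k_core / f_c + k_mem / f_m.

    t0     = t_core0 + t_mem0
    k_core = kappa_core * (l_fw + l_bp)
    k_mem  = kappa_mem  * (l_up + l_tr)
    """
    return t0 + k_core / f_c + k_mem / f_m


def dl_power(f_c, f_m, p0, x_mem, x_core):
    """Eq (7): p = p0 + x_mem * f_m + x_core * f_c**3.

    x_mem  = xi_mem * (l_up + l_tr)
    x_core = h**2 * xi_core * (l_fw + l_bp)
    """
    return p0 + x_mem * f_m + x_core * f_c ** 3


# Hypothetical coefficients; frequencies expressed in GHz for readability.
coeff_t = dict(t0=0.1, k_core=1.0, k_mem=2.0)
coeff_p = dict(p0=50.0, x_mem=5.0, x_core=10.0)

t_hi, p_hi = mfb_time(2.0, 4.0, **coeff_t), dl_power(2.0, 4.0, **coeff_p)
t_lo, p_lo = mfb_time(1.0, 4.0, **coeff_t), dl_power(1.0, 4.0, **coeff_p)
print(t_hi, p_hi, t_lo, p_lo)
```

Under these assumed coefficients, halving the core frequency lowers the power draw (cubic term) while lengthening the MFB time (inverse term), which is the tension the controller has to balance.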

Note that Eq (5) and Eq (7) are non-linear functions. Therefore, according to [41], we use the following non-linear regression form to construct the models.

$$Z = \beta^{(0)} + \sum_{l=1}^{2} \beta^{(l)} \phi^{(l)} + \epsilon, \tag{8}$$

where $Z$, $\beta$, $\phi$, and $\epsilon$ denote the model output, the statistical model parameter, the associated non-linear term, and the error value, respectively. For Eq (5), $Z = t^{MFB}$, $\beta^{(0)} = t^{core,0} + t^{mem,0}$, $\beta^{(1)} = \kappa^{core}(l^{fw} + l^{bp})$, $\beta^{(2)} = \kappa^{mem}(l^{up} + l^{tr})$, $\phi^{(1)} = (f^{c})^{-1}$ and $\phi^{(2)} = (f^{m})^{-1}$. For Eq (7), $Z = p$, $\beta^{(0)} = p^{0}$, $\beta^{(1)} = \xi^{mem}(l^{up} + l^{tr})$, $\beta^{(2)} = h^{2}\xi^{core}(l^{fw} + l^{bp})$, $\phi^{(1)} = f^{m}$ and $\phi^{(2)} = (f^{c})^{3}$. By solving Eq (8), we can construct the GPU frequency based DL power and performance models.
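Since Eq (8) is linear in the parameters $\beta^{(0)}, \beta^{(1)}, \beta^{(2)}$ once the terms $\phi^{(1)}, \phi^{(2)}$ are evaluated, one possible way to fit it is ordinary least squares. The sketch below uses the normal equations in plain Python; it is not the paper's solver, and the profiling samples are synthetic values generated from assumed coefficients.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_model(samples, phi1, phi2):
    """Least-squares fit of Z = b0 + b1 * phi1(f_c, f_m) + b2 * phi2(f_c, f_m).

    samples: iterable of (f_c, f_m, Z) profiling measurements.
    """
    rows = [[1.0, phi1(fc, fm), phi2(fc, fm)] for fc, fm, _ in samples]
    z = [s[2] for s in samples]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atz = [sum(r[i] * zi for r, zi in zip(rows, z)) for i in range(3)]
    return solve_linear(ata, atz)


# Synthetic, noise-free "profiling" data from assumed time-model coefficients.
true_beta = (0.1, 1.0, 2.0)
samples = [(fc, fm, true_beta[0] + true_beta[1] / fc + true_beta[2] / fm)
           for fc in (1.0, 1.5, 2.0) for fm in (3.0, 4.0, 5.0)]

# Time model of Eq (5): phi1 = 1 / f_c, phi2 = 1 / f_m.
beta_time = fit_model(samples, lambda fc, fm: 1.0 / fc, lambda fc, fm: 1.0 / fm)
print(beta_time)
```

The power model of Eq (7) can be fitted with the same function by passing `phi1 = f_m` and `phi2 = f_c ** 3`.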

C. POWER COST MODEL FOR DEEP LEARNING WITH RENEWABLE GENERATION

Modern HPC clusters (or data centers) are supplied with power from both the utility grid and on-site renewable generators [42]–[44]. In this case, when the renewable generation does not meet the power capacity requirement, the cluster owner purchases electricity from the utility grid to cover the shortfall. In particular, we consider two major power sources for renewable generation: solar (or photovoltaic) and wind power. The amount of solar power depends on solar radiation and outside air temperature (OAT), while the amount of wind power depends on the wind speed [45]–[47]. The power capacity from renewable generation therefore changes with the time of day due to the intermittency and fluctuation of weather conditions. Meanwhile, the electricity price of the utility grid changes dynamically according to the electricity market condition [48]. These facts indicate that the cost efficient setting of GPU frequency values may change continuously with the renewable power capacity and electricity price.

1) RENEWABLE GENERATION MODEL

At time $h$, the solar power capacity $sp_h$, the wind power capacity $wp_h$, and the associated total renewable generation $rg_h$ can be defined as follows.

$$rg_h = sp_h + wp_h, \tag{9}$$

$$sp_h = \eta \times S \times sr_h \times (1 - 0.005 \times (oat_h - 25)), \tag{10}$$

$$wp_h = \begin{cases} 0, & \text{if } ws_h < ws^{ci} \\ P^{rate} \cdot \dfrac{ws_h - ws^{ci}}{ws^{r} - ws^{ci}}, & \text{if } ws^{ci} \le ws_h \le ws^{r} \\ P^{rate}, & \text{if } ws^{r} \le ws_h \le ws^{co} \\ 0, & \text{if } ws_h > ws^{co}. \end{cases} \tag{11}$$

For details about Eqs (9) - (11), see [45]. The time dependent variables such as solar radiation $sr_h$, outside air temperature $oat_h$, and wind speed $ws_h$ can be predicted by using various prediction algorithms [49]. For simplicity, we assume that each renewable generation $rg_h$ can be accurately predicted for all time steps $h = 1, \ldots$, by using the prediction algorithm.
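Eqs (9)-(11) can be sketched in Python as follows. The text defers the symbol definitions to [45]; reading $\eta$ as a conversion efficiency, $S$ as a panel area, and $ws^{ci}$, $ws^{r}$, $ws^{co}$ as cut-in, rated, and cut-out wind speeds is an assumption of this sketch, as are all numeric values.

```python
def solar_power(eta, area, sr, oat):
    """Eq (10): solar output with a linear temperature derating around 25 degrees."""
    return eta * area * sr * (1 - 0.005 * (oat - 25))


def wind_power(ws, p_rate, ws_ci, ws_r, ws_co):
    """Eq (11): piecewise wind turbine output as a function of wind speed ws."""
    if ws < ws_ci or ws > ws_co:
        return 0.0                      # below cut-in or above cut-out
    if ws <= ws_r:
        return p_rate * (ws - ws_ci) / (ws_r - ws_ci)   # linear ramp
    return p_rate                       # rated output


def renewable_generation(eta, area, sr, oat, ws, p_rate, ws_ci, ws_r, ws_co):
    """Eq (9): total renewable generation rg_h = sp_h + wp_h."""
    return solar_power(eta, area, sr, oat) + wind_power(ws, p_rate, ws_ci, ws_r, ws_co)


# Hypothetical operating point for one time step h.
rg = renewable_generation(eta=0.2, area=10.0, sr=500.0, oat=25.0,
                          ws=7.5, p_rate=1000.0, ws_ci=3.0, ws_r=12.0, ws_co=25.0)
print(rg)
```

At the boundary $ws_h = ws^{r}$ both middle branches of Eq (11) give $P^{rate}$, so the overlapping inequalities in the equation cause no ambiguity.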

2) POWER COST MODEL FOR DEEP LEARNING

FIGURE 2. The structure of DeepPow-CTR: nonlinear MPC based GPU power controller for DL processing with renewable generation and utility grid.

We assume that the marginal cost of renewable generation is zero. In this case, if the renewable generation meets the power capacity requirement for DL processing in the cluster, then the cluster owner does not need to pay any electricity bill. Now, let $I$ and $p_h^{i}$ denote the number of GPU workers and the power consumption of the $i$-th GPU worker at time $h$. Based on Eq (7) and Eq (9), we present the power cost model for DL processing as follows.

$$c_h = \begin{cases} 0, & \text{if } \sum_{i=1}^{I} p_h^{i} \le rg_h \\ \theta_h \times \left( \sum_{i=1}^{I} p_h^{i} - rg_h \right), & \text{otherwise}, \end{cases} \tag{12}$$

where $\theta_h$ represents the electricity price at time $h$. The shortage of required power capacity beyond the renewable generation is converted to the power cost for DL processing. Intuitively, it is desirable to raise the GPU frequency values given high renewable generation and low electricity price, and to lower them in the opposite case. To do this, we present our power controller for DL processing, DeepPow-CTR, in the next subsection.
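A minimal Python sketch of the cost rule in Eq (12) is given below. The function name and the numeric inputs are hypothetical, and the units of power and price are left abstract.

```python
def power_cost(worker_powers, rg, theta):
    """Eq (12): pay only for the demand that exceeds the renewable generation.

    worker_powers: power draw p_h^i of each GPU worker at time h
    rg:            renewable generation rg_h
    theta:         electricity price theta_h
    """
    shortfall = sum(worker_powers) - rg
    return theta * shortfall if shortfall > 0 else 0.0


# Demand covered by renewables: no cost.
print(power_cost([200.0, 250.0], rg=500.0, theta=0.5))
# Demand exceeds renewables by 100 units: cost is price x shortfall.
print(power_cost([300.0, 300.0], rg=500.0, theta=0.5))
```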

D. NONLINEAR MPC FORMULATION FOR COST EFFICIENT DL PROCESSING IN CLUSTER

The objective of DeepPow-CTR is to find the real-time optimal GPU frequency setting for cost-efficient DL processing, in response to the renewable generation and electricity price. Fig 2 shows the detailed structure of DeepPow-CTR. Note that DeepPow-CTR should minimize the degradation of DL processing performance while minimizing the associated power cost for DL. Therefore, the target of the control algorithm naturally becomes achieving the cluster owner-defined trade-off between the DL power consumption and processing performance. In addition, for practicality, we also consider a limited power budget constraint [50].

To achieve real-time, fine-grained power control, we adopt the model predictive control (MPC) approach, which is suitable for constraint-driven complex control problems [23]. Let $\tilde{f} = (f_1^{1,c}, f_1^{1,m}, \ldots, f_1^{I,c}, f_1^{I,m}, f_2^{1,c}, f_2^{1,m}, \ldots, f_H^{I,c}, f_H^{I,m}) \in \mathbb{R}^{2I \cdot H}$ denote the GPU frequency trajectory over the control horizon $H$. Here, $f_h^{i,c}$ and $f_h^{i,m}$ denote the GPU core/memory frequency values of the $i$-th GPU worker at discrete time $h$. Given the current GPU frequency vector $(f_0^{1,c}, f_0^{1,m}, \ldots, f_0^{I,c}, f_0^{I,m}) \in \mathbb{R}^{2I}$ at current time $h = 0$, we design the MPC based GPU power control problem as follows.

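Problem 1 (DL Power Consumption Control Problem):

$$\min_{\tilde{f}} V := \sum_{h=1}^{H} \Big( \omega^{p} c_h + \sum_{i=1}^{I} \omega^{t,i} \psi^{i} t_h^{MFB,i} \Big) + \omega^{c} \sum_{h=0}^{H-1} \sum_{i=1}^{I} \Big( \upsilon^{i,c} (f_{h+1}^{i,c} - f_h^{i,c})^{2} + \upsilon^{i,m} (f_{h+1}^{i,m} - f_h^{i,m})^{2} \Big), \tag{13a}$$

subject to Eq (12) and

$$f_{min}^{i,c} \le f_h^{i,c} \le f_{max}^{i,c}, \quad \forall i = 1, \ldots, I, \ \forall h = 1, \ldots, H, \tag{13b}$$

$$f_{min}^{i,m} \le f_h^{i,m} \le f_{max}^{i,m}, \quad \forall i = 1, \ldots, I, \ \forall h = 1, \ldots, H, \tag{13c}$$

$$\sum_{i=1}^{I} p_h^{i} \le pb_h, \quad \forall h = 1, \ldots, H. \tag{13d}$$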

Here, $V : \mathbb{R}^{2I \cdot H} \to \mathbb{R}$ is the integrated cost function of the DL power consumption cost, the DL performance loss cost, and the GPU frequency control cost. The positive constants $\omega^{p}$, $\omega^{t,i}$, and $\omega^{c}$ denote the associated weight values for each cost term. The positive constants $\psi^{i}$, $\upsilon^{i,c}$, and $\upsilon^{i,m}$ denote the associated price for each cost term. Without loss of generality, we set $\upsilon^{i,c}$ and $\upsilon^{i,m}$ to very small values. $f_{min}^{i,c}$, $f_{max}^{i,c}$, $f_{min}^{i,m}$, and $f_{max}^{i,m}$ denote the GPU core/memory frequency bounds. The constraints (13b) and (13c) enforce the available GPU frequency range of each GPU worker. The constraint (13d) assigns the limited power budget to the cluster for maintenance safety, where $pb_h$ is the power budget determined for time $h$.
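To make the three cost terms of Eq (13a) concrete, the following Python sketch evaluates $V$ for a given frequency trajectory. The data layout (a list of per-step lists of `(f_c, f_m)` pairs) and the callables `cost` and `t_mfb` are hypothetical interfaces introduced for this illustration; they stand in for Eq (12) and Eq (5).

```python
def objective(traj, f0, cost, t_mfb, w_p, w_t, psi, w_c, v_c, v_m):
    """Evaluate V of Eq (13a).

    traj[h-1][i] = (f_c, f_m) of worker i at step h = 1..H
    f0[i]        = current (f_c, f_m) of worker i at step h = 0
    cost(h, freqs)      -> power cost c_h          (stand-in for Eq (12))
    t_mfb(i, f_c, f_m)  -> MFB time of worker i    (stand-in for Eq (5))
    """
    total = 0.0
    # Power cost term and performance loss term, summed over h = 1..H.
    for h, freqs in enumerate(traj, start=1):
        total += w_p * cost(h, freqs)
        total += sum(w_t[i] * psi[i] * t_mfb(i, fc, fm)
                     for i, (fc, fm) in enumerate(freqs))
    # Frequency control term: squared changes between consecutive steps,
    # starting from the current setting f0 (h = 0..H-1).
    prev = f0
    for freqs in traj:
        for i, ((fc, fm), (pc, pm)) in enumerate(zip(freqs, prev)):
            total += w_c * (v_c[i] * (fc - pc) ** 2 + v_m[i] * (fm - pm) ** 2)
        prev = freqs
    return total


# Toy example: one worker, horizon H = 2, constant cost and a 1/f_c time model.
value = objective(
    traj=[[(1.0, 2.0)], [(2.0, 2.0)]], f0=[(1.0, 2.0)],
    cost=lambda h, freqs: 1.0, t_mfb=lambda i, fc, fm: 1.0 / fc,
    w_p=1.0, w_t=[1.0], psi=[1.0], w_c=1.0, v_c=[1.0], v_m=[1.0])
print(value)
```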

In order to facilitate solving Problem 1, we need to remove the if-otherwise term in Eq (12). To do this, we add the artificial trajectory $\tilde{z} = (z_1, \ldots, z_H) \in \mathbb{R}^{H}$ to the problem. Then, we reformulate Problem 1 as follows.

Problem 2 (DL Power Consumption Control Problem With $\tilde{z}$):

$$\min_{\tilde{f}, \tilde{z}} V := \sum_{h=1}^{H} \Big( \omega^{p} z_h + \sum_{i=1}^{I} \omega^{t,i} \psi^{i} t_h^{iter,i} \Big) + \omega^{c} \sum_{h=0}^{H-1} \sum_{i=1}^{I} \Big( \upsilon^{i,c} (f_{h+1}^{i,c} - f_h^{i,c})^{2} + \upsilon^{i,m} (f_{h+1}^{i,m} - f_h^{i,m})^{2} \Big), \tag{14a}$$

subject to constraints (13b), (13c), (13d) and

$$\theta_h \times \Big( \sum_{i=1}^{I} p_h^{i} - rg_h \Big) \le z_h, \quad \forall h = 1, \ldots, H, \tag{14b}$$

$$0 \le z_h, \quad \forall h = 1, \ldots, H. \tag{14c}$$


Obviously, the optimal solution and the associated function value of Problem 2 are equivalent to those of Problem 1. Note that Problem 2 is a nonlinear MPC problem due to Eq (14a) and constraint (14b). Therefore, to solve Problem 2, we use the sequential quadratic programming (SQP) method, which is suitable for optimization problems involving nonlinear functions. In the next section, we present the details of the SQP method used to obtain the optimal solution of Problem 2.
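A short worked argument indicates why the artificial variable reproduces Eq (12). It assumes $\omega^{p} > 0$, as stated for the weights, and non-negative prices $\theta_h \ge 0$, which the text does not state explicitly.

```latex
% For a fixed frequency trajectory \tilde{f}, each z_h appears only in the
% term \omega^{p} z_h of (14a) and in the constraints (14b)-(14c).
% Minimizing over z_h therefore pushes it down to the larger of its two
% lower bounds:
z_h^{*} = \max\Big\{ 0,\; \theta_h \Big( \sum_{i=1}^{I} p_h^{i} - rg_h \Big) \Big\}
        = \begin{cases}
            0, & \text{if } \sum_{i=1}^{I} p_h^{i} \le rg_h \\
            \theta_h \Big( \sum_{i=1}^{I} p_h^{i} - rg_h \Big), & \text{otherwise}
          \end{cases}
        = c_h .
```

The remaining terms of (13a) and (14a) coincide when $t^{comm} \approx 0$, as assumed in Section II-A, since the iteration time then equals the MFB time.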

III. BFGS BASED SQP METHOD FOR DL POWER CONTROL PROBLEM

The SQP method adopts an iterative search procedure which approximates the original problem by a quadratic programming (QP) problem at each iterate step [24]. This method, based on the second-order Taylor series, uses the Newton/Gauss-Newton approach to search for the sequence of approximated optimal solutions. In order to construct QP sub-problems for Problem 2, we prepare the following ingredients.

• Current GPU frequency vector $(f_0^{1,c}, f_0^{1,m}, \ldots, f_0^{I,c}, f_0^{I,m}) \in \mathbb{R}^{2I}$. This vector is used for the second term in Eq (14a).

• Initial GPU frequency trajectory $\tilde{f}^{(k=0)}$ and initial artificial trajectory $\tilde{z}^{(k=0)}$, where $k$ is the SQP iterate step. Note that we easily ensure the existence of feasible solutions by the following setting,

$$\tilde{f}^{(k=0)} := \begin{pmatrix} f_{min} \\ \vdots \\ f_{min} \end{pmatrix} \in \mathbb{R}^{2I \cdot H}, \tag{15}$$

$$f_{min} := \begin{pmatrix} f_{min}^{1,c} \\ f_{min}^{1,m} \\ \vdots \\ f_{min}^{I,c} \\ f_{min}^{I,m} \end{pmatrix} \in \mathbb{R}^{2I}, \tag{16}$$

$$\tilde{z}^{(k=0)} := \begin{pmatrix} \theta_{max} \times \big( \sum_{i=1}^{I} p_{max}^{i} \big) \\ \vdots \\ \theta_{max} \times \big( \sum_{i=1}^{I} p_{max}^{i} \big) \end{pmatrix} \in \mathbb{R}^{H}, \tag{17}$$

where $\theta_{max}$ and $p_{max}^{i}$ denote the maximum possible electricity price and the maximum possible power consumption (with $f_{max}^{i,c}$ and $f_{max}^{i,m}$) of the $i$-th GPU worker, respectively. This setting always complies with constraints (13d), (14b), and (14c) if the cluster owner has a sufficient power budget.

• Gradient vector of the cost function, $\nabla_{\tilde{f},\tilde{z}} \bar{V}$. This is the first-order Taylor series term used for a QP sub-problem.

• Hessian matrix of the Lagrangian function $L$ for Problem 2, $\nabla^{2}_{\tilde{f},\tilde{z}} L$, which is the second-order Taylor series term for a QP sub-problem. This is used to check whether the solution satisfies the sufficient conditions of optimality.

• Gradient vector of constraints (13b), (13c), (13d), (14b) and (14c), $\nabla_{\tilde{f},\tilde{z}} C$. The linearization of the constraints enables using the simplex method for optimization of the QP sub-problem.
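The initial point of Eqs (15)-(17) and its feasibility check can be sketched in Python as follows. The function names, the list-based layout, and all numeric values are assumptions of this sketch; in particular, `p_min` denotes the power each worker is assumed to draw at its minimum frequencies.

```python
def initial_point(f_min, p_max, theta_max, horizon):
    """Eqs (15)-(17): start every step at the minimum frequencies and give
    every z_h the largest cost that could ever occur."""
    f_traj = [list(f_min) for _ in range(horizon)]   # H copies of f_min (length 2I each)
    z_traj = [theta_max * sum(p_max)] * horizon
    return f_traj, z_traj


def is_feasible(z_traj, p_min, theta, rg, pb):
    """Check (13d), (14b) and (14c) at the initial point, where every worker
    runs at f_min and is assumed to draw p_min[i]."""
    load = sum(p_min)
    return all(
        load <= pb[h]                                  # (13d) power budget
        and theta[h] * (load - rg[h]) <= z_traj[h]     # (14b)
        and z_traj[h] >= 0.0                           # (14c)
        for h in range(len(z_traj)))


# Two hypothetical workers (core, memory) in GHz and a horizon of H = 3.
f_traj, z_traj = initial_point(f_min=[0.58, 2.85, 0.58, 3.515],
                               p_max=[120.0, 180.0], theta_max=0.25, horizon=3)
ok = is_feasible(z_traj, p_min=[40.0, 60.0], theta=[0.1, 0.2, 0.25],
                 rg=[50.0, 0.0, 150.0], pb=[300.0, 300.0, 300.0])
print(ok)
```

The bounds (13b) and (13c) hold trivially at this point because every frequency sits exactly on its lower bound.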

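Let $\lambda \in \mathbb{R}^{(4I+3)H}$ denote the Lagrangian multiplier for constraints (13b), (13c), (13d), (14b) and (14c). Then we formulate the associated Lagrangian function $L : \mathbb{R}^{2I \cdot H} \times \mathbb{R}^{H} \times \mathbb{R}^{(4I+3)H} \to \mathbb{R}$ for Problem 2 as follows.

$$L(\tilde{f}, \tilde{z}, \lambda) := \bar{V}(\tilde{f}, \tilde{z}) + \lambda^{T} C(\tilde{f}, \tilde{z}), \tag{18}$$

where

$$C(\tilde{f}, \tilde{z}) = \begin{pmatrix} C_{min}(\tilde{f}) \\ C_{max}(\tilde{f}) \\ C_{pb}(\tilde{f}) \\ C_{\theta}(\tilde{f}, \tilde{z}) \\ C_{posi}(\tilde{z}) \end{pmatrix} \in \mathbb{R}^{(4I+3)H}, \tag{19}$$

$$C_{min}(\tilde{f}) = \begin{pmatrix} -f_1^{1,c} + f_{min}^{1,c} \\ -f_1^{1,m} + f_{min}^{1,m} \\ \vdots \\ -f_H^{I,c} + f_{min}^{I,c} \\ -f_H^{I,m} + f_{min}^{I,m} \end{pmatrix} \in \mathbb{R}^{2I \cdot H}, \tag{20}$$

$$C_{max}(\tilde{f}) = \begin{pmatrix} f_1^{1,c} - f_{max}^{1,c} \\ f_1^{1,m} - f_{max}^{1,m} \\ \vdots \\ f_H^{I,c} - f_{max}^{I,c} \\ f_H^{I,m} - f_{max}^{I,m} \end{pmatrix} \in \mathbb{R}^{2I \cdot H}, \tag{21}$$

$$C_{pb}(\tilde{f}) = \begin{pmatrix} \sum_{i=1}^{I} p_1^{i} - pb_1 \\ \vdots \\ \sum_{i=1}^{I} p_H^{i} - pb_H \end{pmatrix} \in \mathbb{R}^{H}, \tag{22}$$

$$C_{\theta}(\tilde{f}, \tilde{z}) = \begin{pmatrix} \theta_1 \big( \sum_{i=1}^{I} p_1^{i} - rg_1 \big) - z_1 \\ \vdots \\ \theta_H \big( \sum_{i=1}^{I} p_H^{i} - rg_H \big) - z_H \end{pmatrix} \in \mathbb{R}^{H}, \tag{23}$$

$$C_{posi}(\tilde{z}) = \begin{pmatrix} -z_1 \\ \vdots \\ -z_H \end{pmatrix} \in \mathbb{R}^{H}. \tag{24}$$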

Based on Eqs (18) - (24), we formulate the QP sub-problem of the SQP method at the iterate step $k$ as follows.

Problem 3 (QP Sub-Problem of SQP Method at Step k):

$$\min_{d} QP^{(k)} := \nabla \bar{V}(\tilde{f}^{(k)}, \tilde{z}^{(k)})^{T} d + \frac{1}{2} d^{T} B(\tilde{f}^{(k)}, \tilde{z}^{(k)}, \lambda^{(k)}) d, \tag{25a}$$

subject to

$$\nabla C(\tilde{f}^{(k)}, \tilde{z}^{(k)})^{T} d + C(\tilde{f}^{(k)}, \tilde{z}^{(k)}) \le 0. \tag{25b}$$

Here, $d^{(k)} := \arg\min_{d} QP^{(k)} \in \mathbb{R}^{(2I+1)H}$ is the optimal direction at iterate step $k$, and $B(\tilde{f}^{(k)}, \tilde{z}^{(k)}, \lambda^{(k)}) = \nabla^{2}_{\tilde{f},\tilde{z}} L(\tilde{f}^{(k)}, \tilde{z}^{(k)}, \lambda^{(k)})$ is the Hessian matrix of $L$ at step $k$. Eq (25b) is the local affine approximation of the constraint $C(\tilde{f}^{(k)}, \tilde{z}^{(k)})$. Obviously, $C_{min}(\tilde{f}^{(k)})$, $C_{max}(\tilde{f}^{(k)})$, and $C_{posi}(\tilde{z}^{(k)})$ are originally linear constraints, therefore their first-order Taylor approximations are equivalent to themselves.

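We can use well-known QP solving methods such as the active-set method, the interior point method, and the augmented method [25] in order to solve the above QP sub-problem. The optimal solution $d^{(k)}$ at step $k$ is used as a direction to get the next iterate solution $(\tilde{f}^{(k+1)}, \tilde{z}^{(k+1)})$ and the next Lagrange multiplier $\lambda^{(k+1)}$. The update of the iterate solution is defined as follows.

$$\begin{pmatrix} \tilde{f}^{(k+1)} \\ \tilde{z}^{(k+1)} \end{pmatrix} = \begin{pmatrix} \tilde{f}^{(k)} \\ \tilde{z}^{(k)} \end{pmatrix} + \alpha^{(k)} d^{(k)}, \tag{26}$$

$$\lambda^{(k+1)} = \lambda^{(k)} + \alpha^{(k)} (\gamma^{(k)} - \lambda^{(k)}), \tag{27}$$

where $\alpha^{(k)} \in (0, 1]$ denotes the step length for updating solutions, and $\gamma^{(k)}$ denotes the optimal Lagrangian multiplier for the QP sub-problem at step $k$. In order to find the appropriate $\alpha^{(k)}$, we adopt the exact $l_1$ (absolute value) penalty function as follows [24], [51].

$$\varphi(\tilde{f}^{(k)}, \tilde{z}^{(k)}) = \bar{V}(\tilde{f}^{(k)}, \tilde{z}^{(k)}) + \frac{1}{\mu^{(k)}} \| C^{+}(\tilde{f}^{(k)}, \tilde{z}^{(k)}) \|_{1}, \tag{28}$$

where

$$C^{+}(\tilde{f}^{(k)}, \tilde{z}^{(k)}) = \big( C_1^{+}(\tilde{f}^{(k)}, \tilde{z}^{(k)}), \ldots, C_{(4I+3)H}^{+}(\tilde{f}^{(k)}, \tilde{z}^{(k)}) \big), \tag{29}$$

$$C_i^{+}(\tilde{f}^{(k)}, \tilde{z}^{(k)}) = \begin{cases} 0, & \text{if } C_i(\tilde{f}^{(k)}, \tilde{z}^{(k)}) \le 0 \\ C_i(\tilde{f}^{(k)}, \tilde{z}^{(k)}), & \text{if } C_i(\tilde{f}^{(k)}, \tilde{z}^{(k)}) > 0. \end{cases} \tag{30}$$

Here, $\mu^{(k)} > \| \lambda^{(k)} \|_{\infty}$ represents the penalty parameter at step $k$. Due to the non-smoothness of Eq (28), we use the directional derivative $D_{d}(\varphi(\tilde{f}^{(k)}, \tilde{z}^{(k)}))$ in the direction $d$, which is defined as follows.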

Proposition 1: We assume that $pb_h \ge \sum_{i=1}^{I} p_{min}^{i}$, $\forall h = 1, \ldots, H$. Then, given direction $d$, the associated directional derivative $D_{d}(\varphi(\tilde{f}^{(k)}, \tilde{z}^{(k)}))$ is as follows.

$$D_{d}(\varphi(\tilde{f}^{(k)}, \tilde{z}^{(k)})) = \nabla_{\tilde{f},\tilde{z}} \bar{V}(\tilde{f}^{(k)}, \tilde{z}^{(k)})^{T} d - \frac{1}{\mu^{(k)}} \| C^{+}(\tilde{f}^{(k)}, \tilde{z}^{(k)}) \|_{1}. \tag{31}$$

Proof: By assumption, we have at least one feasible solution based on Eqs (15) - (17). Since $\beta^{(1)}$ and $\beta^{(2)}$ in Eq (8) are always non-zero constants ($\ne 0$), there is no zero row vector in $\nabla C$. All of the row vectors are linearly independent since each input variable (all of the GPU frequency values and $z_1, \ldots, z_H$) is determined individually. Therefore, $\nabla C$ has full row rank, and we derive Eq (31) by using the concept of the $l_1$ merit function over the augmented Lagrangian in [24].

The directional derivative $D_{d^{(k)}}(\varphi(\tilde{f}^{(k)}, \tilde{z}^{(k)}))$ is used to get the proper $\alpha^{(k)}$ based on the backtracking line search method with the Armijo condition [25]. This procedure is shown in Algorithm 1.

Algorithm 1 GPU Power Control for DL Processing via the BFGS Based SQP Method

INPUT:
  current GPU core/memory frequency vector $(f_0^{1,c}, f_0^{1,m}, \ldots, f_0^{I,c}, f_0^{I,m}) \in \mathbb{R}^{2I}$,
  statistical model parameters by Eq (8) $\beta_0^{i,pw}, \beta_1^{i,pw}, \beta_2^{i,pw}, \beta_0^{i,t}, \beta_1^{i,t}, \beta_2^{i,t}$, $\forall i = 1, \ldots, I$,
  cluster power budget trajectory $pb_h$, $\forall h = 1, \ldots, H$,
  electricity price trajectory $\theta_h$, $\forall h = 1, \ldots, H$,
  renewable generation trajectory $rg_h$, $\forall h = 1, \ldots, H$.
PARAMETER:
  initial approximated Hessian $B^{(k=0)} = I$,
  initial Lagrangian multipliers $\lambda^{(k=0)}$,
  initial iterate solution $\tilde{f}^{(k=0)}$, $\tilde{z}^{(k=0)}$.
OUTPUT:
  control action $(f_1^{1,c}, f_1^{1,m}, \ldots, f_1^{I,c}, f_1^{I,m}) \in \mathbb{R}^{2I}$ for next time step, from optimal solution $\tilde{f}^{*}$.

01 : select $\eta \in (0, 0.5)$, $\tau \in (0, 1)$
02 : $QP^{(k=0)} \leftarrow \bar{V}(\tilde{f}^{(k=0)}, \tilde{z}^{(k=0)})$
03 : for $k = 1, 2, \ldots$
04 :   get feasible $d^{(k)} = \arg\min_{d} QP^{(k)}$ in Problem 3
05 :   if $k \ge k_{max}$ or $QP^{(k-1)} - QP^{(k)} \le \delta$
06 :     return $\tilde{f}^{(k)}$ as $\tilde{f}^{*}$
07 :   end if
08 :   select arbitrarily $\mu^{(k)} > \| \lambda^{(k)} \|_{\infty}$
09 :   $l \leftarrow 1$
10 :   $\tilde{x} \leftarrow [\tilde{f}^{(k)T} \ \tilde{z}^{(k)T}]^{T}$
11 :   while $\varphi(\tilde{x} + l d^{(k)}) > \varphi(\tilde{x}) + \eta l D_{d^{(k)}}(\varphi(\tilde{x}))$
12 :     $l \leftarrow \tau * l$
13 :   end while
14 :   $\alpha \leftarrow l$
15 :   get $\tilde{f}^{(k+1)}$, $\tilde{z}^{(k+1)}$, $\lambda^{(k+1)}$ by Eq (26) - (27)
16 :   get $B^{(k+1)}$ by Eq (35)
17 : end for
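Lines 09 - 14 of Algorithm 1 can be sketched in Python as a backtracking search on the merit function of Eq (28). The helper names, the default constants, and the `min_step` safeguard are assumptions of this sketch and do not appear in the algorithm; the toy merit function at the end is unconstrained and used only to exercise the loop.

```python
def l1_merit(v, constraints, mu):
    """Eq (28): objective value plus (1 / mu) times the l1 norm of the
    constraint violations C+ (only positive entries count, Eq (30))."""
    return v + sum(max(0.0, c) for c in constraints) / mu


def backtracking_step(phi, x, d, dir_deriv, eta=0.1, tau=0.5, min_step=1e-10):
    """Shrink the trial step until the Armijo condition holds.

    phi:       merit function taking a list of floats
    x, d:      current point and search direction
    dir_deriv: directional derivative of phi at x along d (Eq (31))
    """
    step = 1.0
    phi_x = phi(x)
    while phi([xi + step * di for xi, di in zip(x, d)]) > phi_x + eta * step * dir_deriv:
        step *= tau
        if step < min_step:       # safeguard against an endless loop (not in Algorithm 1)
            break
    return step


# Toy merit function phi(x) = x^2 with direction d = -4 at x = 1,
# so the directional derivative is 2 * 1 * (-4) = -8.
alpha = backtracking_step(lambda x: x[0] ** 2, x=[1.0], d=[-4.0], dir_deriv=-8.0)
print(alpha)
```

A smaller `tau` shrinks the step more aggressively per trial, while a larger `eta` demands more decrease before a step is accepted.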

Note that it is generally not easy to directly calculate $\nabla^{2}_{\tilde{f},\tilde{z}} L$ due to the computational complexity and the possible lack of positive definiteness [52]. Instead, we track an approximated Hessian of $L$ by using a quasi-Newton approach. Based on the damped Broyden-Fletcher-Goldfarb-Shanno (BFGS) update [24], we derive $B^{(k)} := B(\tilde{f}^{(k)}, \tilde{z}^{(k)}, \lambda^{(k)})$ as follows.

$$y^{(k)} = \nabla_{\tilde{f},\tilde{z}} L(\tilde{f}^{(k+1)}, \tilde{z}^{(k+1)}, \lambda^{(k+1)}) - \nabla_{\tilde{f},\tilde{z}} L(\tilde{f}^{(k)}, \tilde{z}^{(k)}, \lambda^{(k)}), \tag{32}$$

$$r^{(k)} = \sigma^{(k)} y^{(k)} + (1 - \sigma^{(k)}) B^{(k)} d^{(k)}, \tag{33}$$

$$\sigma^{(k)} = \begin{cases} 1, & \text{if } d^{(k)T} y^{(k)} \ge 0.2\, d^{(k)T} B^{(k)} d^{(k)} \\ \dfrac{0.8\, d^{(k)T} B^{(k)} d^{(k)}}{d^{(k)T} B^{(k)} d^{(k)} - d^{(k)T} y^{(k)}}, & \text{otherwise}, \end{cases} \tag{34}$$

$$B^{(k+1)} = B^{(k)} - \frac{B^{(k)} d^{(k)} d^{(k)T} B^{(k)}}{d^{(k)T} B^{(k)} d^{(k)}} + \frac{r^{(k)} r^{(k)T}}{d^{(k)T} r^{(k)}}. \tag{35}$$

Note that $B^{(k+1)}$ is a positive definite matrix if $B^{(k)}$ is positive definite [25]. Therefore, we set the initial approximated Hessian matrix as $B^{(0)} = I$ where $I$ is the identity matrix. Obviously, the matrices $B^{(k)}$, $k = 1, 2, \ldots$ updated by Eq (35) are always positive definite since $B^{(0)} = I$ is positive definite.

TABLE 3. Specification of GPU workers and real-time power controller.

In Algorithm 1, we describe the detailed procedures of GPU power control for DL processing via the BFGS based SQP method. As the output of the algorithm, the control action derived from the optimal GPU frequency trajectory $\tilde{f}^{*}$ is applied to the next MPC time step. In line 01, we define the constants $\eta$, $\tau$ for the backtracking line search. In line 02, we set the initial QP sub-function value by using the original cost function value $\bar{V}$ with the initial solution. In lines 03 - 17, by using the iterative SQP method, we approach the approximated optimal GPU frequency trajectory. In line 04, we find the search direction $d^{(k)}$ at step $k$ by solving the QP sub-problem (Problem 3) constructed for Problem 2. In lines 05 - 06, we return the derived GPU frequency trajectory if the SQP step count reaches the limit ($k_{max}$) or the associated optimality is at a target level ($\delta$). In line 08, we set the proper penalty parameter based on the norm of the Lagrangian multiplier. In lines 11 - 14, we find the proper step length used to update the next iterate solution. In lines 15 - 16, we get the next iterate solution, the next Lagrangian multiplier, and the next approximated Hessian matrix.

IV. PERFORMANCE EVALUATION

In this section, we investigate the power cost efficiency for DL processing by using the proposed control algorithm through experiments based on the testbed.

A. TESTBED SPECIFICATION

For the testbed establishment, we deploy 8 heterogeneous GPU workers and the DeepPow-CTR. Table 3 presents the detailed specification of the associated servers. Each GPU worker contains a single GPU chipset based on the NVIDIA Pascal architecture (GeForce series, GTX1060/GTX1080), which is capable of frequency scaling. We use the official tool nvidia-smi to monitor the GPU status (e.g., current frequency values, power usage, and utilization) of the workers in real time [53]. However, nvidia-smi is not available for frequency scaling of our GPU workers, since its automatic boost management invalidates the manual frequency setting. Instead, we use the third-party software nvidia-inspector to adjust the GPU core/memory frequency values in a fine-grained manner [54]. The available range of GPU frequency values depends on the GPU chipset type. For the GTX1060 (6GB), the available (core, memory) frequency range is (580–1850 MHz, 2850–4400 MHz). For the GTX1080 (8GB), it is (580–1950 MHz, 3515–5414 MHz).

Each GPU worker runs our Java [55] based background monitoring agent, which iteratively retrieves the GPU status via nvidia-smi and reports the associated data to the DeepPow-CTR. Each worker also runs our frequency modulating agent, which iteratively adjusts the GPU core/memory frequency values via nvidia-inspector in response to the control messages from the DeepPow-CTR. The DeepPow-CTR contains our SQP optimizer, which iteratively determines the best GPU frequency setting for the testbed. It also iteratively runs the nonlinear regression solver to resolve Eq (8) for the construction of the DL processing power and performance model.
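As a rough illustration of what such a monitoring agent does, the following Python sketch polls nvidia-smi in CSV mode and parses the result. It is an assumption-laden stand-in, not the authors' Java agent: the chosen query fields, the function names, and the sample output line are illustrative, and the parser assumes every field is numeric (a GPU that reports a field as unavailable would need extra handling).

```python
import subprocess

# Query fields passed to nvidia-smi; the selection is an assumption for this sketch.
QUERY_FIELDS = ["clocks.sm", "clocks.mem", "power.draw", "utilization.gpu"]


def parse_gpu_status(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append({name: float(v) for name, v in zip(QUERY_FIELDS, values)})
    return rows


def query_gpu_status():
    """Run nvidia-smi once and return the parsed status (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_status(out)


# Made-up sample output (MHz, MHz, W, %), used so the parser can be tried offline.
sample = "1506, 4004, 87.45, 98\n1911, 5005, 120.10, 100\n"
status = parse_gpu_status(sample)
```

A real agent would call `query_gpu_status()` periodically and forward the dictionaries to the controller; the transport used for that reporting is not specified here.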

B. DEEP LEARNING APPLICATION SPECIFICATION

For evaluation, we train three DNN models: Resnet-152 [26], VGG-19net [27], and Inception-V3 [28]. As sample input datasets, we use MNIST [29], CIFAR-10 [30], and Imagenet [31], [32]. Unlike MNIST (60MB) and CIFAR-10 (300MB), Imagenet is a large-scale image dataset (over 1M raw images, more than 50GB). GPU workers 1-3 run Resnet with Imagenet (minibatch size = 16), workers 4-6 run VGG-19net with CIFAR-10 (minibatch size = 32), and workers 7-8 run Inception-V3 with MNIST (minibatch size = 32). We run the DL applications by using the CNTK framework [56]. To report the MFB time to the DeepPow-CTR, we deploy our Java based DL monitoring agents for each running DL application. The agents parse the DNN model training output through Windows PowerShell scripts [57] and send the parsed data to the DeepPow-CTR.

C. RENEWABLE GENERATION AND ELECTRICITY PRICE

We present two scenarios of renewable generation and electricity price: scenario 1 for the summer season and scenario 2 for the winter season. The scenario graphs are shown in Fig 3 and 4.

For the electricity price trajectory, we use the real trace data of the New York independent system operator (NYISO) electricity market retrieved from the Federal Energy Regulatory Commission (FERC) [33]. We consider the data of three generators in NYISO: West Babylon IC, Syracuse power, and Beacon LESR. Fig 3(a) - 3(c) show the associated trajectory data. Note that the locational based marginal price (LBMP) varies according to the geo-location and the time of day. For example, the LBMP of ‘Syracuse power’ in Fig 4(b) goes down to 13$/MWh (lowest point) at hour 30, while that of ‘Beacon LESR’ stays constant at 35$/MWh (steady point) at the same time of day.

FIGURE 3. Experimental scenario 1 (summer season, Jun 01-03, 2018). (a-c): Utility grid electricity price θ_h (Locational Based Marginal Price, LBMP) in the New York independent system operator (NYISO) [33], and (d-f): Renewable (solar and wind power) generation rg_h over geo-locational regions, derived from the Measurement and Instrumentation Data Center (MIDC) [34].

FIGURE 4. Experimental scenario 2 (winter season, Dec 01-03, 2018). (a-c): Utility grid electricity price θ_h (Locational Based Marginal Price, LBMP) in the New York independent system operator (NYISO) [33], and (d-f): Renewable (solar and wind power) generation rg_h over geo-locational regions, derived from the Measurement and Instrumentation Data Center (MIDC) [34].

For the renewable generation trajectory, we use the real trace data retrieved from the Measurement and Instrumentation Data Center (MIDC) [34]. In order to calculate the solar and wind power capacity, we use the raw data of solar radiation, outside air temperature (OAT), and wind speed in three geo-locations: Natural Energy Laboratory of Hawaii Authority (NELHA), NREL National Wind Technology Center M2, and Oak Ridge National Laboratory. Fig 3(d) - 3(f) show the solar and wind power trajectory at each region. Similar to the case of electricity price, the amount of renewable generation differs according to location and time. In order to apply the trace data to the scale of our testbed, we scale down the unit size of the data (e.g., we multiply the power capacity of renewable generation by 0.001, kW → W).

FIGURE 5. Experimental results for scenario 1 (of Fig 3). (a-c): GPU core frequency trajectory, (d-f): GPU memory frequency trajectory, (g-i): DL processing power consumption cost (θ_h × (Σ_{i=1}^{I} p_i^h − rg_h)) and performance loss cost (or MTB time cost) (Σ_{i=1}^{I} ψ_i t^h_{MTB,i}), and (j-l): Total cost V̄ comparison of DeepPow-CTR and baseline approach with σ = 0.1, 0.4, 0.7, 1.

D. CLOSED-LOOP COST PERFORMANCE OF PROPOSED DEEPPOW-CTR

Fig 5 shows the integrated experimental results for scenario 1 (of Fig 3, summer season). We assume that the MPC closed-loop control starts on the second day (hour 24) for each scenario. We simply set the experimental parameters as ψ_i = 1 ∀i ($), w_t = 5, w_p = 1, w_c = 1, v_{i,c} = 0.001, v_{i,m} = 0.00001 ∀i (cent), pb_h = 1000 ∀h (W), and H = 10. Fig 5(a) - 5(c) show the GPU core frequency values of the 8 GPU workers, in response to the electricity price and renewable generation of Fig 3. The proposed DeepPow-CTR controls the GPU frequency values of the GPU workers in a cost-efficient manner. For example, in Fig 5(a), the DeepPow-CTR increases the GPU core frequency values of all the workers during 11:00-15:00. This is because the renewable power capacity is high (solar power: ≈ 1050kW, wind power: ≈ 200kW) at NELHA during hours 35-41. On the other hand, the DeepPow-CTR drops the GPU frequency values during 17:00 - 23:00, because the amount of renewable generation is zero during that time interval. Fig 5(g) - 5(i) show the derived DL processing power cost and performance loss cost (i.e., MTB time cost) of the 8 GPU workers under the GPU frequency scaling of the DeepPow-CTR. These results are consistent with Fig 5(a) - 5(c). In Fig 5(g), we achieve the smallest MTB time cost at 16:00, since the associated GPU frequency values reach the highest level. Nevertheless, the associated DL power cost is not high at that time, due to the low electricity price and the high renewable power capacity. The results in Fig 5(h) and 5(i) show trends similar to Fig 5(g).

FIGURE 6. Experimental results for scenario 2 (of Fig 4). (a-c): GPU core frequency trajectory, (d-f): GPU memory frequency trajectory, (g-i): DL processing power consumption cost (θ_h × (Σ_{i=1}^{I} p_i^h − rg_h)) and performance loss cost (or MTB time cost) (Σ_{i=1}^{I} ψ_i t^h_{MTB,i}), and (j-l): Total cost V̄ comparison of DeepPow-CTR and baseline approach with σ = 0.1, 0.4, 0.7, 1.

Fig 5(j) - 5(l) show the cost comparison of the proposed DeepPow-CTR and the baseline approach, which pins the GPU frequency value based on a constant σ. In the baseline approach, GPU frequency values are determined by σ as follows:

\[ f = f_{\min} + \sigma (f_{\max} - f_{\min}). \tag{36} \]
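As a worked instance of Eq (36), the short Python sketch below evaluates the baseline frequency for the four σ values used in the experiments, taking the GTX1060 core range of 580–1850 MHz given in the testbed specification. The function name and the range check on σ are assumptions added for illustration.

```python
def baseline_frequency(f_min, f_max, sigma):
    """Eq (36): pin the frequency at a fixed fraction sigma of the available range."""
    if not 0.0 <= sigma <= 1.0:
        # Assumption of this sketch: sigma is a fraction of the range.
        raise ValueError("sigma is expected to lie in [0, 1]")
    return f_min + sigma * (f_max - f_min)


# GTX1060 core frequency range in MHz, as listed in the testbed specification.
F_MIN_CORE, F_MAX_CORE = 580.0, 1850.0

baseline_core = {
    sigma: baseline_frequency(F_MIN_CORE, F_MAX_CORE, sigma)
    for sigma in (0.1, 0.4, 0.7, 1.0)
}
```

For this range the four settings are approximately 707, 1088, 1469, and 1850 MHz, and each stays fixed for the whole horizon regardless of price or renewable generation.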

Obviously, as σ increases, the GPU frequency values also increase. Note that no single constant σ yields the best cost for all cases in Fig 5(j) - 5(l). The proposed DeepPow-CTR outperforms the baseline approach for every value of σ in all cases. This is because the baseline approach cannot react adaptively to the dynamic electricity price and renewable generation. Interestingly, the GPU memory frequency values scarcely change over the hours, as shown in Fig 5(d) - 5(f). This is because the GPU memory frequency does not have a significant effect on the DL processing performance in our experiments. The computational load of DL processing depends heavily on the capacity of the GPU SMs; therefore, GPU core frequency scaling is more critical for DL processing performance than memory frequency scaling (again, this observation is limited to our experiments).

FIGURE 7. Average cost ($) of DeepPow-CTR, baseline (σ = 0.1, 0.4, 0.7, 1), and the modified MPP approach [50] over the control horizon H.

Fig 6 shows the integrated experimental results for scenario 2 (of Fig 4, winter season). Compared to scenario 1, the electricity price in scenario 2 is high and the renewable power capacity is low. In Fig 6(a), the DeepPow-CTR maintains lower GPU frequency values for all GPU workers over the entire period than in scenario 1. Fig 6(g) - 6(i) show the derived DL processing power cost and the MTB time cost under the GPU frequency scaling of the DeepPow-CTR in scenario 2. Consistent with Fig 4, both the DL processing power cost and the MTB time cost are high. However, the proposed DeepPow-CTR derives a control action that still achieves high cost efficiency and timely feedback. Fig 6(j) - 6(l) show the cost comparison of the DeepPow-CTR and the baseline approach in scenario 2. Similar to Fig 5, the DeepPow-CTR achieves the best cost efficiency in all cases.

Subsequently, we compare the proposed DeepPow-CTR to a conventional approach, the maximum possible performance (MPP) based MPC approach [50]. Note that the MPP approach was originally designed for CPU power control in clusters; therefore, we reformulate the cost function in [50] as follows.

\[ J = \sum_{h=1}^{H} \Big( \sum_{i=1}^{I} p_i^h - p_{ref} \Big)^2 + w_t \sum_{h=1}^{H} \sum_{i=1}^{I} \big( t_i^h - t_i^{\max} \big)^2, \tag{37} \]

where p_ref and t_i^max denote the power reference point and the allowable maximum MTB time (i.e., worst DL performance) of the i-th GPU worker, respectively. Although this approach is simple and ensures acceptable DL performance, it does not consider the electricity price and renewable power capacity, which causes operational cost inefficiency for DL processing.
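To make the structure of Eq (37) concrete, the following Python sketch evaluates J for a small hypothetical instance. The function name, the nested-list layout (`power[h][i]`, `mtb_time[h][i]`), and all numeric inputs are assumptions for illustration; only the weight w_t = 5 is taken from the experimental settings.

```python
def mpp_cost(power, mtb_time, p_ref, t_max, w_t):
    """Evaluate Eq (37) over a horizon.

    power[h][i]    : power of worker i at step h
    mtb_time[h][i] : per-minibatch time of worker i at step h
    p_ref          : power reference point
    t_max[i]       : allowable maximum time of worker i
    w_t            : weight on the time-tracking term
    """
    # First term: squared deviation of total cluster power from the reference.
    power_term = sum((sum(p_h) - p_ref) ** 2 for p_h in power)
    # Second term: squared deviation of each worker's time from its allowed maximum.
    time_term = sum(
        (t - t_max[i]) ** 2 for t_h in mtb_time for i, t in enumerate(t_h)
    )
    return power_term + w_t * time_term


# Hypothetical two-step, two-worker instance (values are made up).
power = [[100.0, 150.0], [120.0, 130.0]]      # W
mtb_time = [[0.5, 0.6], [0.4, 0.7]]           # s
J = mpp_cost(power, mtb_time, p_ref=240.0, t_max=[0.5, 0.5], w_t=5.0)
```

Neither term contains the electricity price or the renewable generation, so the value of J is the same whether energy is cheap or expensive at a given step, which is the limitation noted above.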

Fig 7 compares the average cost of the proposed DeepPow-CTR, the baseline approach, and the modified MPP approach based on Eq (37). Note that the DeepPow-CTR outperforms the MPP approach in both scenarios 1 and 2. This is because the MPP approach does not decrease the GPU frequency values when the renewable power capacity is low and the electricity price is high, nor does it raise them when the renewable power capacity is high and the electricity price is low. Overall, the proposed DeepPow-CTR improves the operational cost efficiency by 15% on average compared to the other approaches. The experimental results demonstrate that our algorithm adapts to electricity market conditions and renewable generation, and achieves cost-efficient power control for DL processing.

V. CONCLUSION

In this paper, to achieve cost-efficient DL processing, we proposed a GPU frequency scaling based real-time power controller called DeepPow-CTR. We defined the completion time of memory-access / feed-forward / back-propagation (MFB) for each training iteration as the DL performance metric. Our sequential quadratic programming (SQP) based nonlinear model predictive control (MPC) formulation is able to properly balance DL processing power and performance in response to electricity price and renewable generation. To show the practicality of the DeepPow-CTR, we used the real trace data retrieved from the Measurement and Instrumentation Data Center (MIDC) and the Federal Energy Regulatory Commission (FERC). We tested the training of well-known deep neural network (DNN) models such as Resnet, VGGnet, and Inception, with datasets such as MNIST, CIFAR-10, and Imagenet. The experimental results on a lab-scale testbed consisting of multiple NVIDIA GPU workers (GTX1060/1080) demonstrate that our proposed DeepPow-CTR reduces the operational cost of DL processing by about 15% compared to existing cluster power control approaches.

REFERENCES

[1] S. Pouyanfar, “A survey on deep learning: Algorithms, techniques, and applications,” ACM Comput. Surv. (CSUR), vol. 51, no. 5, p. 92, 2018.
[2] L. Deng, “A tutorial survey of architectures, algorithms, and applications for deep learning,” APSIPA Trans. Signal Inf. Process., vol. 3, p. e2, Jan. 2014.
[3] Z. Ning, P. Dong, X. Wang, J. J. P. C. Rodrigues, and F. Xia, “Deep reinforcement learning for vehicular edge computing: An intelligent offloading system,” IEEE Trans. Ind. Informat., vol. 15, no. 5, pp. 3058–3067, May 2019.
[4] ILSVRC 2012 Competition. Accessed: Apr. 2019. [Online]. Available: http://www.image-net.org/challenges/LSVRC/2012/
[5] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[6] J. Gao, “Machine learning applications for data center optimization,” Google White Paper. Accessed: Apr. 2019. [Online]. Available: https://ai.google/research/pubs/pub42542
[7] Y. Li, Y. Wen, K. Guan, and D. Tao, “Transforming cooling optimization for green data center via deep reinforcement learning,” 2017,

[8] L. Peng and P.-H. Ho, “A novel framework of distributed datacenter networks to support intelligent services: Architecture, operation, and solutions,” IEEE Access, vol. 6, pp. 77485–77493, 2018.
[9] Q. Mao, F. Hu, and Q. Hao, “Deep learning for intelligent wireless networks: A comprehensive survey,” IEEE Commun. Surveys Tuts., vol. 20, no. 4, pp. 2595–2621, 4th Quart., 2018.
[10] G.-S. Kim, J.-H. Kim, and L. Peng, “Asymmetric traffic provisioning in integrated cloud-fog based on flexible multi-flow optical transponder,” China Commun., vol. 16, no. 3, pp. 69–80, Mar. 2019.
[11] L. Peng, A. R. Dhaini, and P.-H. Ho, “Toward integrated Cloud–Fog networks for efficient IoT provisioning: Key challenges and solutions,” Future Gener. Comput. Syst., vol. 88, pp. 606–613, Nov. 2018.
[12] H. Mei, H. Lu, and L. Peng, “Data offloading in cache-enabled cross-haul networks,” Comput. Commun., vols. 142–143, pp. 1–8, Jun. 2019.
[13] S. Xiao, H. Yu, Y. Wu, Z. Peng, and Y. Zhang, “Self-evolving trading strategy integrating Internet of Things and big data,” IEEE Internet Things J., vol. 5, no. 4, pp. 2518–2525, Aug. 2018.
[14] Y. Zhang, “GroRec: A group-centric intelligent recommender system integrating social, mobile and big data technologies,” IEEE Trans. Services Comput., vol. 9, no. 5, pp. 786–795, Sep./Oct. 2016.
[15] Z. Zhang and K. P. Lam, “Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system,” in Proc. 5th Conf. Syst. Built Environments, Nov. 2018, pp. 148–157.
[16] Q. Liu, Y. Ma, M. Alhussein, Y. Zhang, and L. Peng, “Green data center with IoT sensing and cloud-assisted smart temperature control system,” Comput. Netw., vol. 101, pp. 104–112, Jun. 2016.
[17] W. Lee, M. Kim, and D.-H. Cho, “Deep power control: Transmit power control scheme based on convolutional neural network,” IEEE Commun. Lett., vol. 22, no. 6, pp. 1276–1279, Jun. 2018.
[18] Z. Yan and Y. Xu, “Data-driven load frequency control for stochastic power systems: A deep reinforcement learning method with continuous action search,” IEEE Trans. Power Syst., vol. 34, no. 2, pp. 1653–1656, Mar. 2019.
[19] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 267–278.
[20] N. A. Gawande, D. J. Kerbyson, J. B. Landwehr, C. Siegel, N. R. Tallent, and A. Vishnu, “Scaling deep learning workloads: NVIDIA DGX-1/Pascal and intel knights landing,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), May/Jun. 2017, pp. 399–408.
[21] Nvidia Corporation. Accessed: Apr. 2019. [Online]. Available: https://www.nvidia.com/en-us/
[22] Nvidia DGX-1. Accessed: Apr. 2019. [Online]. Available: http://www.nvidia.com/dgx-1
[23] D. Q. Mayne, “Model predictive control: Recent developments and future promise,” Automatica, vol. 50, no. 12, pp. 2967–2986, 2014.
[24] P. T. Boggs and J. W. Tolle, “Sequential quadratic programming,” Acta Numer., vol. 4, pp. 1–51, Jan. 1995.

[25] J. Nocedal and S. Wright, Numerical Optimization. Springer, 2006.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778.
[27] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 2818–2826.
[29] Mnist Dataset. Accessed: Apr. 2019. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[30] Cifar10 Dataset. Accessed: Apr. 2019. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[33] Federal Energy Regulatory Commission. Accessed: Apr. 2019. [Online]. Available: https://www.ferc.gov/
[34] Measurement and Instrumentation Data Center (MIDC). Accessed: Apr. 2019. [Online]. Available: https://midcdmz.nrel.gov/

[35] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training imagenet in 1 hour,” 2017, arXiv:1706.02677. [Online]. Available: https://arxiv.org/abs/1706.02677
[36] NVIDIA GTX1080 Whitepaper. Accessed: Apr. 2019. [Online]. Available: https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf
[37] Y. Abe, H. Sasaki, S. Kato, K. Inoue, M. Edahiro, and M. Peres, “Power and performance characterization and modeling of GPU-accelerated systems,” in Proc. IEEE 28th Int. Parallel Distrib. Process. Symp., May 2014, pp. 113–122.
[38] X. Mei, X. Chu, H. Liu, Y.-W. Leung, and Z. Li, “Energy efficient real-time task scheduling on CPU-GPU hybrid clusters,” in Proc. IEEE Conf. Comput. Commun. (INFOCOM), May 2017, pp. 1–9.
[39] Q. Fang, J. Wang, Q. Gong, and M. Song, “Thermal-aware energy management of an HPC data center via two-time-scale control,” IEEE Trans. Ind. Informat., vol. 13, no. 5, pp. 2260–2269, Oct. 2017.
[40] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, “Statistical power modeling of Gpu kernels using performance counters,” in Proc. Int. Conf. Green Comput., Aug. 2010, pp. 115–122.
[41] J. C. McCullough, Y. Agarwal, J. Chandrashekar, S. Kuppuswamy, A. C. Snoeren, and R. K. Gupta, “Evaluating the effectiveness of model-based power characterization,” in Proc. USENIX Annu. Tech. Conf, vol. 20, Jun. 2011, p. 12.
[42] C. Ren, D. Wang, B. Urgaonkar, and A. Sivasubramaniam, “Carbon-aware energy capacity planning for datacenters,” in Proc. IEEE 20th Int. Symp. Modeling, Anal. Simulation Comput. Telecommun. Syst., Aug. 2012, pp. 391–400.
[43] Z. Ning, X. Wang, J. J. P. C. Rodrigues, and F. Xia, “Joint computation offloading, power allocation, and channel assignment for 5G-enabled traffic management systems,” ACM Trans. Intell. Syst. Technol., vol. 15, no. 5, pp. 3058–3067, May 2019. doi:10.1109/tii.2019.2892767.
[44] Z. Ning, P. Dong, X. Kong, and F. Xia, “A cooperative partial computation offloading scheme for mobile edge computing enabled Internet of Things,” IEEE Internet Things J., vol. 6, no. 3, pp. 4804–4814, Jun. 2019.
[45] D.-K. Kang, E.-J. Yang, and C.-H. Youn, “Deep learning-based sustainable data center energy cost minimization with temporal MACRO/MICRO scale management,” IEEE Access, vol. 7, pp. 5477–5491, 2019.
[46] A. Yona, T. Senjyu, A. Y. Saber, T. Funabashi, H. Sekine, and C.-H. Kim, “Application of neural network to one-day-ahead 24 hours generating power forecasting for photovoltaic system,” in Proc. Int. Conf. Intell. Syst. Appl. Power Syst., Nov. 2007, pp. 1–6.
[47] D. T. Nguyen and L. B. Le, “Optimal bidding strategy for microgrids considering renewable energy and building thermal dynamics,” IEEE Trans. Smart Grid, vol. 5, no. 4, pp. 1608–1620, Jul. 2014.
[48] S. Kwon, L. Ntaimo, and N. Gautam, “Optimal day-ahead power procurement with renewable energy and demand response,” IEEE Trans. Power Syst., vol. 32, no. 5, pp. 3924–3933, Sep. 2017.
[49] S. X. Chen, H. B. Gooi, and M. Q. Wang, “Sizing of energy storage for microgrids,” IEEE Trans. Smart Grid, vol. 3, no. 1, pp. 142–151, Mar. 2012.
[50] H. Lim, A. Kansal, and J. Liu, “Power budgeting for virtualized data centers,” in Proc. USENIX Annu. Tech. Conf. (USENIX)., vol. 59, Jun. 2011, p. 5.
[51] T. F. Edgar, Optimization of Chemical Processes. Hoboken, NJ, USA: Wiley, 2003.
[52] M. Cannon, “Efficient nonlinear model predictive control algorithms,” Annu. Rev. Control, vol. 28, no. 2, pp. 229–237, 2004.

[53] NVIDIA Management Library NVML. Accessed: Apr. 2019. [Online]. Available: https://developer.nvidia.com/nvidia-management-library-nvml
[54] Guru3D NVIDIA Inspector. Accessed: Apr. 2019. [Online]. Available: http://www.guru3d.com/files-details/nvidia-inspector-download.html
[55] Oracle Java SDK. Accessed: Apr. 2019. [Online]. Available: https://www.java.com/en/
[56] Microsoft CNTK. Accessed: Apr. 2019. [Online]. Available: https://github.com/Microsoft/CNTK
[57] Microsoft PowerShell Scripts. Accessed: Apr. 2019. [Online]. Available: https://docs.microsoft.com/en-us/powershell/scripting/overview?view=powershell-6


DONG-KI KANG received the M.S. degree in computer engineering from Chonbuk National University, Jeonju, South Korea, in 2011. He is currently pursuing the Ph.D. degree in electrical engineering with the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His research interests include cloud computing, deep learning, and GPU power control.

CHAN-HYUN YOUN (S’84–M’87) received the B.Sc. and M.Sc. degrees in electronics engineering from Kyungpook National University, Daegu, South Korea, in 1981 and 1985, respectively, and the Ph.D. degree in electrical and communications engineering from Tohoku University, Sendai, Japan, in 1994. From 1986 to 1997, he was the Leader of the High-Speed Networking Team, KT Telecommunications Network Research Laboratories, where he was involved in the research and development of centralized switching maintenance systems, maintenance and operation systems for various ESS systems, high-speed networking, and an ATM network testbed. Since 1997, he has been a Professor at the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, where he was an Associate Vice-President of the Office of Planning and Budgets from 2013 to 2017. His research interests include distributed high performance computing systems, cloud computing systems, edge computing systems, deep learning frameworks, and others. He received the Best Paper Award from CloudComp 2014, and wrote a book on Cloud Broker and Cloudlet for Workflow Scheduling (Springer, 2017). He was a General Chair for the 6th EAI International Conference on Cloud Computing (CloudComp 2015), KAIST, in 2015. He was a Guest Editor of the IEEE WIRELESS COMMUNICATIONS in 2016.