A parallel Boltzmann machine on distributed-memory multiprocessors

(1)

A Parallel Boltzmann Machine on Distributed-Memory Multiprocessors

Jong

H.

Nang. D.

H.

Oh, Hyunsoo

Ycun.

and

S.

R. Maeng Department of Computer Science and Center for AI Research

Korea

Advanced

Institute

of Science and Technology e-mail : j-adam.k&.a~.k~

373-1 Wng-DUIg, Yus~ng-Ku. TaejOn 305-701, Korea

Abstract

In

this

paper, an efficient mapping scheme of Boltzmann

Machine

computations onto a distributed-memory mul-

t@”r,

which exploits the synchronous spatral parallelism, is plesentad. In this scheme, the neurons in a BoltPnann Machine

are

partitioned intop disjoint sets.

and

each set is mapped on a processor of ap -processor sys-

tan.

A parallel convergence

and

leaming algorithms of Balk" Machines, necess~~y communication pattern among

the

p”.and their time complexities when neurons

are

partitioned and mapped onto a distributed- rnemtny multiprocessar

are

investigated An expected p

--

speed-up of the padelizing scheme over a sin- gle pcessm

is

also analyzed theoretically which can be used

as

a basis in detMnining the most costeffective or optimal number of processors according to the given communication capabilities and interconnection topologies.

Introduction

A B d t z ” Machine [3,7] is a pbabiliitic neural network model, in which the individual neurons probabilisti- cally

determine the

next state upon the state of their neighbour neurons and the connection strength to them. It has been widely applied to

the

well-known combinatorial optimization p b l e m s [2,10] and image classification

[3.

One major problem in B o l t ” Machine applications is

that

it requires huge amount of computational resowes, when the

number

of neurons ⁱⁿthe Bolmann Machine becomes large.

or

when it is applied to a real-world applica- tion problem in which a long annealimg steps should be required to get a good solution. A general and natural way to speed-up the computations of Bdtzmann Machine is to use a multiptwesor, in which each processor usually takes charge of multiple neurons in the Boltzmsnn Machine and simulates them in a time-sharing fashion 1131.

The

basic parallelizing method for a Balk" Machine has

been

already proposed in [ 1, lo]. However.

since

they did not d d e r the overheads in

the

intaprocessor communication to exchange the changed neuron states, the actual pexfomance on a parallel canputer would be different from their simulation results if communication overheads

are

not negligible. Another pllnllelizing methods considering underlying parallel architecture were also pro- posed in [6,8]. In these w&, however, only a basic parallel convergence algoxithm and its appropriate intercon-

nection

topology were proposed without analysis in

[a].

and it was highly depended on a specific hardware architec- ture (i.e., Transputer) [81.

In

this paper, we investigate a new paralklizing method for convergence and leaming algorithm of a synchronous B o l t z m a ~

Machine on

a parallel computer, especially DMM (Distributed-Memay Multi-). We follow our previous works [II, 121 on

the ‘on

model. in which the neurons

in

neural network

are

partitioned into p disjoint sets, and each set is mapped

on

a processor of ap-processor system. Ap-processor speed-up is also

analyzed

theoretically.

Boltzmann Machine

A B o l a ” Machine can be repesnted by a undirected graph G

= { U $

} , where

U

= { u o , .

. .

, u m - l }

denotes

the set of m

neurons

and

C

is a

set

of unordered pairs from

U

denoting the connections between neurons [2]. A

connection

{ u i . u j } E

C

connects

the

neucolls ui and U,. ‘Ibe consensus Ck denotes the overall d e s i i i l - ity of

all the

activated

connections

in a given conIiguration

k and is

given by

whete S ( U i . U j ) E

R

is the strength assigned to connection { U i , U j } , and ~ ( u i ) is the state of neuron ui in a

(2)

configuration

k . The

difference in the consemus when the state of neuron ui is changed and the states of all other

neurons

is unchanged in configuration

k

,

Ace@).

is given by [2] :

where C, denotes the set of connections incident with neuron ui ({ ui,ui} d C4). The probability A ~ u ~ c ) of accepting a state transition of neuron ui , given the configuration

k

, is chosen

as

where c denotes the value of the mtrol parameter ( c

E R

+ ), and

In the Boltzmann

Machine

leaming f-on, the set of neurons is divided into three disjoint subsets. Ui, u h ,

and

U,.

With

U

=

UiyuhyU,.

where ui,

uh.

and

U, denote

the sets of input.,

hidden.

and o u w t neurons, respectively.Let

luil =mi. IuhI

=mh,and

Iu,l

=&.

TheconnectionsoftheBoltzmannMachine,assumed

in this paper,

are

chosen

such

that all input neurons

are

mutually ~ ~ ~ e c t e d , all output units are mutually connected,

all

hidden neurons ^arewnnected to all input andoutput neurons, andall n e m s have a biasconnection as shown in Figure l-(a). It is a commonly used connection pattern for the Boltzmann

Machine

when applied to classification

is given by Eq.(2).

output 'Neurons

U0

Hidden

1 Neurons u h Input

(a) A Structure of Boltzmann Machine

0 : a set of neurons sbulated by processor0

.

^aset of neurons Wulated by proce~s~rl

0 ': a set of neurons sunulated by processor2

@) Mapping Boltzmann Machine on DMM Figure 1 : A Smcture and Mapping of the Boltzmann Machine

The objective of learning algorithm in the Bolamann Machine is to adjust the connection strengths such that the Boltvnann Machine in free-running situation tends to be with a large probability in environmental configurations that

are

used

as

example in

the

clamped situation, which is controUed by following equation.

AS(Ui.Uj) =ll(<pij>

-

<Pij>), (4) where ll denotes a constant, <pi,> and <pi;> denote

the

expectation of the probability that the connection s (ui .U,) is activated in the clamped situation and in the free-running situation, respectively.

Mapping Boltzmann Machine on DMM

A DMM in our model consists of p processors, has no shared memory, and communicates only by message passing via point-to-point links. In our scheme, the input neurolls

in

a Boltzmann Machine are equally partitioned into p parts. and each part of

I

^{Ui Ilp}neurons is assigned to each processor. The

I

Ui Ilp neurons allocated t o a processor

are

simulated in a timesharing fashion by

the

processor. This scheme is also applied to hidden and output

neurons

If the connection weight value between two neighbouring neurons assigned to two different processors is stored on only one plocessor, an excessive interprocessor communications can not be avoidable during the computations for consensus difference ACea, because both processors should know it. "herefore, in our mapping scheme, each processor keeps in its local memory the all data necessary to simulate the assigned neurons. It includes a virtual copy of the state of neurons which are neighbours of assigned neurons, and the connection strengths between them.

Although this scheme duplicates the storages for neuron states and connection weights, it allows the processor to a~ shown

in

Figure I-@).

(3)

canpute the

A C , , ,

for assigned neuron i , which is the main computation in the Bolrzmann Machine simulation.

as

independently as possible.

The

inconsistency of

ne-

states stared on different processors can be resolved by

communications during the convergence phase. and the inconsisremy of coNLection weights can be d v e d by the recomputarion as in [12]. l h e y will be explained in more detail in the following.

A Parallel Boltzmann Machine Convergence Algorithm

In our scheme of the virtual implementation of a B o l a ” Machine on a

DMM,

each processor simultaneously selects a neuron from the allocated wbnetwoxk

and

inaependentl y changes its state with the probability given in Eq.(3). which is almost equivalent to the p sequential trials. However, these locally calculated transition pbabili- ties may be incorrect in the global sem,because each pmcesorlrses

the

neuron states storedin itslocal memory, which may be changed at the same

time

by the other processors. As a result, the asymptotic convergence can no longer be g ” t e u i . However,

the

probability of erroneously calculated state transitions k m u e s during the consensus optimimhn process as the value of the con1101 parametex c decreases, which was p v e d via simulations in [21. As soon as the neulon stste alloceted to one processor is changed, it should be informed to other processors which take charge of its neighbour nmrons be& next parallel ITA is started. Since the Bolamann Machine sttuc- ture we assume in this paper is almost fully-umnected (actually 3/4connected)

as

shown in Figure l-(a) and the other p” may

also

change

the

state of a neuron

at

the same time, a communication pattem called all-to-all broadcasting 191 is required to exchange the new

neuron

sate values. Once all-ballb“g ’ is completed, every processor

is

infarmed

of

upto-dnte nwron states so that t h e e ” in the transition probgbility in next parallel trial could be minimized. Algorithm 1 is aparaUel convergence algorithm for the R t h processor on ap-processor

DMM.

w h u t

U k

denotes the set

of

neunrns assigned to

the

Rth processor which will

be

pemabed and m &notes the set of whole neurons in the Bolt.rmsnn

Mahine.

The all-to-ull-broadcast

is

a procedm to broadcast and receive the states of neurons changed by olherp-1 p”.

AlgaSthm 1 : A parallel Coavergence Algorithm procedure PARALLEI-”VERGE ( U k . m)

c = c o

while (network-is_unrtable) do for t = 1 to

MAX-TRIAL,

do

Selects a neurons ui where ui E

U

; Computes AC,,B) using Eq.(2) ;

it(A,(iXc)>RANDOM_PROBO)then TIC&)=

1

- r l ( U i ) ;

.11-toall-bro4dcar(rl~ui)) ; end-for

c = a c ; end-while end

A Parallel Boltzmann Machine Learning Algorithm

In the parallel leaming phase, the connection weights

are adjusted

using the informations collected at clamped and free-nmning phases. which

are

governed by Eq.(4). However, since the same connection weights are stored in two p”.a coherence mechanism is requind to maintain these weights consistently. We already developed a coherence me&”. called a recomputation method [121. to Consistently ”inthe connection weights. In thii method,

the

new connection weights

are calculated

independently by two p” using the informations

collected

at

the

clamped and free-running phases, rather

than one

processor calculates and sends

the

new weights to other

processor

in which thenumber of necescrary messagesisprapomonal to thenumber

of

total connections

in the

Bel-

Machine.

These two independently calculated new CoMectiOn weights

are

exactly the same becarrse the informations

collected

by two processors

are

identical. which is guaranteed by the all-ballbroadcasting in

the

parallel convergence process. A parallel leaming cycle algorithm of Boltzmann Machine for

the k

th processor on a p

-- DMM

can be roughly described as Algaithm 2. in which

U/, U/.

and

U:,

denote the input. hidden, and output neurons

allocated

to the kth processor, mpectively.

Note that IU

Ilp = m / p = IUkl where

U

= U , Q U f i u U j . Without loss of generality, we can assume that the number and order of training examples

610

(4)

tobeclamped.E,ispredefinedbyhost,andlet IE

I

= t .

Algorithm 2 :

A

Parallel Boltzmann Machine Leaming Cycle Algorithm procedure PARALLEL-BOL'IZMANN_MACHI"ING

PARALL=-LEARNING-CYCLE ; whiIe (stq~-criterion are not d e d ) do end-while

end

procedure PARALLEL-LEARNING-CYCLE for T = 1 tot do/* Parallel Clamped Phase */

CLAMl-EXAMPLE(T) ; PARALLEL-CONVERGE(U~, m) ;

COLLECT-STATISTICS(<pi;> where U ~ E

U '

and u j ~ (I) ;

RUN-FREE(U,bUfiyUkl;

PARALLJL-CONVERGE(U~uU/uUi, m) ;

COLLECT-STATISTICS(<ppii>

where U ~ E

U k

and U;'

U )

; end-for

far T = 1 to t do /* Parallel Free-Running Phase */

end-for

ADJUST-CO"ECTION-STRENGTHS(U,C, U/. uj)

;

end

Analysis

To

derive the time complexities of sequential and parallel algorithms, we first assume that the number of total neu- rons assigned to a processor is m = m l + m/

+

^{m i ,}where m!, m/. and m i denote the number of input, hidden, andoutputn~assignedtothekthprocessor,respectively(i.e.. IU'I = m k , IUFl =m$,

IUil

=m/,and I V i

I

=

&. The

time complexities of the convergence algorithm used in the clamped and freerunning phases

are

diffemt f i a n each other because the sx of neurons to be perturbed are different. In the clamped phase, only hid- den

neurons

which

are

connected to input and output neurons

are

candidates to be perturbed. So, the time complexity of sequential convergence algorithm used in clamped phase.

T

1. is given by

T

1 = (Ma .(&+mi)+F).L .M (5)

where Ma is a multiply-and-add time of two floating-point numbers, M is the number of temperature cooling cycles to make a Boltzmann Machine stable in sequential algorithm, and L is the n u m b of trials (state transitions) on each temperature in the clamped phase of sequential algorithm in which only hidden neurons are perturbed. On the other hand, in the free running phase, since all neurons are candidates to be pertury. larger L is quired. The time complexity of sequential convergence algorithm used in the free-running phase,

T

1 , is given by

(6) where 4 , Lh. and Lo denote the number of selecting input neurons, hidden neurons, and output neurons in the free-running phase. respectively. If we assume that the n u m b of neurons in the input, hidden, and output layer are almost the same (i.e., mi

=

^mh

=

^m,

=

^m/3), Eq.(6) can be rewritten

as follows,

since the number of required trials

is

pportionai to the number of neurons to be permrbed (i.e., Li =Lh =Lo

=

L )

Ti

= ((M,.(mi+mh)+F).L,

+

(Ma(mi+mo)+F).Lh+ (Ma ~(m,+mh)+F)&).M

Ti

=(Ma.(mo+mi)+F)3.L.M = 3.T1 (7)

Since the p trials

are

computed concmntly in our parallel convergence algorithm, the number of required steps to stabilize

the

network in clamped and in parallel free-running phases are L / p , and 3-L / p , respectively. On the other hand, since it requires longer Markov chains because each parallel trial contains erroneous computations [l], the number of cooling cycle in the parallel convergence algorithm, M', is P M , where fb 1. The time complexities of parallel convergence algorithms used in the clamped phase, Tp , and in

the

free-running phase, Tp , are given by,

(5)

T i =

(M,*(&+mi)+F+AAB@)).--.pM

3.L = 3-T,

(9)

P

where

AAB@

) denotes the number of steps to

all-to-all-broadcast

a unit of message.

By Eqs.(5),

(a,

^(8).^{and (9).}

the

p -pmcessor speed-up of convergence algorithm in the clamped phase, S @ ), is the S a m e ~ t h e s p e e d - ~ ~ o f f r e e - r u n ~ , S ‘ @ ) , ~ d c o u l d b e r o u g h l ~ d a i ~ e d ~ E q . ( l O ) ,

Ring one -port p - 1

d-port

r k p 1

In Eq.(lO) the most important parameters an? the complexity of AAB@) and the communication/computation ratio which is defined as

A

=

C/Ma

where C is the unit time of

communication

for a processar to send and/or receive a unit of message. The A value lies between 0.5

-

²⁵⁶^[4].In one-port communication,

the

lower bound of AAB@ ) on a p -processor

DMM

is @-1). since each processor needs to receive from every other processor, i.e.. @-1) pro- cessors [12]. This lower bound can be achieved in the ring topology which is the cheapest one to be constructed. If a processor can send and receive on its d -port, d

>

1, concurrently during each time unit, more richly connected topologies would result in less all-to-all b” ’ g time. In d -pat communication,

the

lower bound

for

all-&all brocdcastin g on a p -processor

DMM

is @-l)/d, since each p ” r needs to receive from every other processor

and

the processor can receive

the

d

messages

concurrently. Tabk-1 summarizes the time complexities of the AAB@ ) for the processor communication capebilities and some well-known interconnection network topologies.

Table-1 : ’Ihe Lower Bound Complexities of AAB@) for Well-known Topologies

Mesh

I

^Hypercube Complete Connection

p - 1 p - 1 p - 1

r %pi V- 1

¹

The

thecxetical convergence speed-up aspects for well-known interconnection topologies

are

shown in Figure 2-(a) graphically, when m = 2048.8 = 40,p = 1.1,

A

= 8, and each processor communicates with d-port Figure 2-@) shows the speed-up ratio vetsus the number of “ m s in Boltzmann Machine when p is fixed as 64 and they have 2-pelt communication capability and intuwnnected by ring topology.

Note

that the speed-up ratio is heavily influenced by the

A

value when the the number of neurons in the Boltzmann Machine is small, and it approaches U)

lim S (p) = p

/p

⁼^64/1.1

=

^{58 as}shown in Figure 2-@). F a the parallel convergence algorithm of Bollzmann as+-

Machine. it is seen from Figure 2 that there is a cost-effective number of processors depending on the A value, underlying topology. and the number of

neumns

in

the

Bolamenn Machine, such that though more processors are added to the parallel computations.

the

speed-up ratio is not increased as much.

N u m k of Rocessor @) (a) The p - mspeed-up of convcsgencc Algorithm

Number of Ne& (m)

@) The ppmce- Speed-up of Canvergence Algorithm for some well-known topologies when ~ 2 0 4 8 w h p = 6 4

Figure 2 :

The

p -processor Speed-up of Parallel Convergence Algorithm

612

(6)

The time complexity to adjust the connection strengths in the sequential algorithm,

T;,

is proportional to the number of dl co~e!ctim in the Boltzmann Machine. whereas the time to adjust connection strengths in the parallel algorithm. T i , is propomonal to the number of connections incident with the neurons allocated to a processor. If we ignore the ovaheads to clamp the e x p p l e and to collect the statistics. thep -processor speed-up of the parallel leaming cycle over a single processor, S (p ), can be derived as follows ;

As shown in Eq.(ll), the p -processor speed-up of the leaming cycle is very complex because it is d e p d e n t on the ratio of times to m v e r g e and times to adjust weights which is an application dependent feature. When the number of kaming cycles in padel learning is increased

h

times compared to the sequential one because of

the

erroneous computations. the speed-up may be degraded by

h

times (h>l).

Summary

Since the Boltz.mann Machine requires huge amount of computational resources when the size is large or applied to a real application, it has been a main obstacle which prevents it to be used widely. To overcome this obstacle, we proposed a mapping scheme of Bolkmann Machine onto a Distributed-Memory Multiprocessor. and derive the parallel convergence and learning Boltzmann Machine algorithms. The time complexities of them are theoretically analyzed upon the underlying intemnnection topologies and communication capabilities of processors. The p

-

processor speed-ups are also derived by comparing them with a single processor's case. It is identified from our analysis that all-tw~U broadcasting is the primary communication pattern in OUT parallel algorithm, and there is the cost-effective number of processors corresponding to the characteristics of underlying hardware architecture.

Ref er en c es

[l]

[2]

[3]

[4]

E. H. L.

Aarts

and J.

H.

M. Korst, "Simulation of Leaming in Parallel Networks Based on the Boltzmann Machine,", Proc. of the 2nd European Simulation Congress, 1986, pp.391-398.

E.

H.

L.

Aarts

and J.

H.

M Korst, Simulated Annealing and Boltzmnn Machines, John Wiley & Sons, 1989.

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A Leaming Algorithm for Bolamann Machines," Cognitive Science, Vo1.9, 1985, pp.147-169.

M. Annaratone, C. Pommerell, and R. Ruhl, "Interpmxssor Communication Speed and Performance in Distributed-Memory Parallel Processofi," Proc. of the 16th Annual Int'l Symp. on Computer Architecture, R. Azencott and A. Doutriaux, "Synchronous Boltzmann Machines and Outline Based Image Classification,"

Proc. of Int'l Conference on Neural Network, Paris, 1990, pp.7-10.

A. Rrscha,

"A

Parallel Bolkmann Machine Simulator for Distributed Memory Multiprocessor Systems,"

Proc. of Int'l Conference on Neural Network, Paris, 1990, pp.647-650.

G. E. Hinton and TJ. Sejnowski, "Learning and Relearning in Boltzmann Machines," in D.E. Rumelhart &

J.L. McClelland (E&.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition.

Vo1.1: Foundation, Cambridge, MA, MIT

Press,

1985.

[8] A.

Iazzetta.

R. Vaccam, and U. Villano, "A Transputer Implementation of Boltzmann Machines,'' Proc. of 1st Italian Worhhop on Parallel Architecture and Neural Nehvork, pp.128-145, 1988.

[9]

S.

L. Johsson and C. T. Ho, "Optimum Broadcasting and Personalized Communication in Hypercubes," _IEEE Transactions on Computer, V01.38, No.9.1989, pp.1249-1268.

[lo] J.

H.

M. Korst, "Combinatorial Optimization on a Boltzmann Machine," Journal of Parallel and Distributed Computing, Vol.6.1989, pp.331-357.

[ll] H.

Yoon

and Jong

H.

Nang, "Multilayered Neural Network on Distributed-Memory Multiprocessors." Proc.

of Int'l Conference on Neural Network, Paris, 1990. pp.669-672.

1121 H. Yoon, Jong

H.

Nang, and S. R. Maeng, "Parallel Simulation of Multilayered Neural Networks on Distributed-Memory Multiprocessors." Microprocessing and Microprogramming, Vol. 29,1990, pp.185-195.

U31 H. Yoon, Jong

H.

Nang, and

S.

R. Maeng. "Neural Networks on Parallel Computers," in BranLo Soucek (E&.), Suth Generation Computer Technologies, Wiley, 1991 (in press).

pp.315-324,1989.

[5]

[6]

[7]