A Parallel Boltzmann Machine on Distributed-Memory Multiprocessors
Jong
H.
Nang. D.H.
Oh, HyunsooYcun.
andS.
R. Maeng Department of Computer Science and Center for AI ResearchKorea
AdvancedInstitute
of Science and Technology e-mail : j-adam.k&.a~.k~373-1 Wng-DUIg, Yus~ng-Ku. TaejOn 305-701, Korea
Abstract
In
this
paper, an efficient mapping scheme of BoltzmannMachine
computations onto a distributed-memory mul-t@”r,
which exploits the synchronous spatral parallelism, is plesentad. In this scheme, the neurons in a BoltPnann Machineare
partitioned intop disjoint sets.and
each set is mapped on a processor of ap -processor sys-tan.
A parallel convergenceand
leaming algorithms of Balk" Machines, necess~~y communication pattern amongthe
p”.and their time complexities when neuronsare
partitioned and mapped onto a distributed- rnemtny multiprocessarare
investigated An expected p--
speed-up of the padelizing scheme over a sin- gle pcessmis
also analyzed theoretically which can be usedas
a basis in detMnining the most costeffective or optimal number of processors according to the given communication capabilities and interconnection topologies.Introduction
A B d t z ” Machine [3,7] is a pbabiliitic neural network model, in which the individual neurons probabilisti- cally
determine the
next state upon the state of their neighbour neurons and the connection strength to them. It has been widely applied tothe
well-known combinatorial optimization p b l e m s [2,10] and image classification[3.
One major problem in B o l t ” Machine applications is
that
it requires huge amount of computational resowes, when thenumber
of neurons in the Bolmann Machine becomes large.or
when it is applied to a real-world applica- tion problem in which a long annealimg steps should be required to get a good solution. A general and natural way to speed-up the computations of Bdtzmann Machine is to use a multiptwesor, in which each processor usually takes charge of multiple neurons in the Boltzmsnn Machine and simulates them in a time-sharing fashion 1131.The
basic parallelizing method for a Balk" Machine hasbeen
already proposed in [ 1, lo]. However.since
they did not d d e r the overheads inthe
intaprocessor communication to exchange the changed neuron states, the actual pexfomance on a parallel canputer would be different from their simulation results if communication over- headsare
not negligible. Another pllnllelizing methods considering underlying parallel architecture were also pro- posed in [6,8]. In these w&, however, only a basic parallel convergence algoxithm and its appropriate intercon-nection
topology were proposed without analysis in[a].
and it was highly depended on a specific hardware architec- ture (i.e., Transputer) [81.In
this paper, we investigate a new paralklizing method for convergence and leaming algorithm of a synchronous B o l t z m a ~Machine on
a parallel computer, especially DMM (Distributed-Memay Multi-). We follow our previous works [II, 121 onthe ‘on
model. in which the neuronsin
neural networkare
partitioned into p disjoint sets, and each set is mappedon
a processor of ap-processor system. Ap-processor speed-up is alsoanalyzed
theoretically.Boltzmann Machine
A B o l a ” Machine can be repesnted by a undirected graph G
= { U $
} , whereU
= { u o , .. .
, u m - l }denotes
the set of mneurons
andC
is aset
of unordered pairs fromU
denoting the connections between neurons [2]. Aconnection
{ u i . u j } EC
connectsthe
neucolls ui and U,. ‘Ibe consensus Ck denotes the overall d e s i i i l - ity ofall the
activatedconnections
in a given conIigurationk and is
given bywhete S ( U i . U j ) E
R
is the strength assigned to connection { U i , U j } , and ~ ( u i ) is the state of neuron ui in aconfiguration
k . The
difference in the consemus when the state of neuron ui is changed and the states of all otherneurons
is unchanged in configurationk
,Ace@).
is given by [2] :where C, denotes the set of connections incident with neuron ui ({ ui,ui} d C4). The probability A ~ u ~ c ) of accepting a state transition of neuron ui , given the configuration
k
, is chosenas
where c denotes the value of the mtrol parameter ( c
E R
+ ), andIn the Boltzmann
Machine
leaming f-on, the set of neurons is divided into three disjoint subsets. Ui, u h ,and
U,.
WithU
=UiyuhyU,.
where ui,uh.
andU, denote
the sets of input.,hidden.
and o u w t neurons, respectively.Letluil =mi. IuhI
=mh,andIu,l
=&.TheconnectionsoftheBoltzmannMachine,assumed
in this paper,are
chosensuch
that all input neuronsare
mutually ~ ~ ~ e c t e d , all output units are mutually connected,all
hidden neurons are wnnected to all input andoutput neurons, andall n e m s have a biasconnection as shown in Figure l-(a). It is a commonly used connection pattern for the BoltzmannMachine
when applied to classificationis given by Eq.(2).
output 'Neurons
U0
Hidden
1 Neurons u h Input
(a) A Structure of Boltzmann Machine
0 : a set of neurons sbulated by processor0
.
a set of neurons Wulated by proce~s~rl0 ': a set of neurons sunulated by processor2
@) Mapping Boltzmann Machine on DMM Figure 1 : A Smcture and Mapping of the Boltzmann Machine
The objective of learning algorithm in the Bolamann Machine is to adjust the connection strengths such that the Boltvnann Machine in free-running situation tends to be with a large probability in environmental configurations that
are
usedas
example inthe
clamped situation, which is controUed by following equation.AS(Ui.Uj) =ll(<pij>
-
<Pij>), (4) where ll denotes a constant, <pi,> and <pi;> denotethe
expectation of the probability that the connection s (ui .U,) is activated in the clamped situation and in the free-running situation, respectively.Mapping Boltzmann Machine on DMM
A DMM in our model consists of p processors, has no shared memory, and communicates only by message passing via point-to-point links. In our scheme, the input neurolls
in
a Boltzmann Machine are equally partitioned into p parts. and each part ofI
Ui Ilp neurons is assigned to each processor. TheI
Ui Ilp neurons allocated t o a processorare
simulated in a timesharing fashion bythe
processor. This scheme is also applied to hidden and outputneurons
If the connection weight value between two neighbouring neurons assigned to two different processors is stored on only one plocessor, an excessive interprocessor communications can not be avoidable during the computations for consensus difference ACea, because both processors should know it. "herefore, in our mapping scheme, each pro- cessor keeps in its local memory the all data necessary to simulate the assigned neurons. It includes a virtual copy of the state of neurons which are neighbours of assigned neurons, and the connection strengths between them.Although this scheme duplicates the storages for neuron states and connection weights, it allows the processor to a~ shown
in
Figure I-@).canpute the
A C , , ,
for assigned neuron i , which is the main computation in the Bolrzmann Machine simulation.as
independently as possible.The
inconsistency ofne-
states stared on different processors can be resolved bycommunications during the convergence phase. and the inconsisremy of coNLection weights can be d v e d by the recomputarion as in [12]. l h e y will be explained in more detail in the following.
A Parallel Boltzmann Machine Convergence Algorithm
In our scheme of the virtual implementation of a B o l a ” Machine on a
DMM,
each processor simultaneously selects a neuron from the allocated wbnetwoxkand
inaependentl y changes its state with the probability given in Eq.(3). which is almost equivalent to the p sequential trials. However, these locally calculated transition pbabili- ties may be incorrect in the global sem,because each pmcesorlrsesthe
neuron states storedin itslocal memory, which may be changed at the sametime
by the other processors. As a result, the asymptotic convergence can no longer be g ” t e u i . However,the
probability of erroneously calculated state transitions k m u e s during the con- sensus optimimhn process as the value of the con1101 parametex c decreases, which was p v e d via simulations in [21. As soon as the neulon stste alloceted to one processor is changed, it should be informed to other processors which take charge of its neighbour nmrons be& next parallel ITA is started. Since the Bolamann Machine sttuc- ture we assume in this paper is almost fully-umnected (actually 3/4connected)as
shown in Figure l-(a) and the other p” mayalso
changethe
state of a neuronat
the same time, a communication pattem called all-to-all broadcasting 191 is required to exchange the newneuron
sate values. Once all-ballb“g ’ is completed, every processoris
infarmedof
upto-dnte nwron states so that t h e e ” in the transition probgbility in next parallel trial could be minimized. Algorithm 1 is aparaUel convergence algorithm for the R t h processor on ap-processorDMM.
w h u tU k
denotes the setof
neunrns assigned tothe
Rth processor which willbe
pemabed and m ¬es the set of whole neurons in the Bolt.rmsnnMahine.
The all-to-ull-broadcastis
a procedm to broadcast and receive the states of neurons changed by olherp-1 p”.AlgaSthm 1 : A parallel Coavergence Algorithm procedure PARALLEI-”VERGE ( U k . m)
c = c o
while (network-is_unrtable) do for t = 1 to
MAX-TRIAL,
doSelects a neurons ui where ui E
U
; Computes AC,,B) using Eq.(2) ;it(A,(iXc)>RANDOM_PROBO)then TIC&)=
1
- r l ( U i ) ;.11-toall-bro4dcar(rl~ui)) ; end-for
c = a c ; end-while end
A Parallel Boltzmann Machine Learning Algorithm
In the parallel leaming phase, the connection weights
are adjusted
using the informations collected at clamped and free-nmning phases. whichare
governed by Eq.(4). However, since the same connection weights are stored in two p”.a coherence mechanism is requind to maintain these weights consistently. We already developed a coherence me&”. called a recomputation method [121. to Consistently ”inthe connection weights. In thii method,the
new connection weightsare calculated
independently by two p” using the informationscol- lected
atthe
clamped and free-running phases, ratherthan one
processor calculates and sendsthe
new weights to otherprocessor
in which thenumber of necescrary messagesisprapomonal to thenumberof
total connectionsin the
Bel-Machine.
These two independently calculated new CoMectiOn weightsare
exactly the same becarrse the informationscollected
by two processorsare
identical. which is guaranteed by the all-ballbroadcasting inthe
parallel convergence process. A parallel leaming cycle algorithm of Boltzmann Machine forthe k
th processor on a p-- DMM
can be roughly described as Algaithm 2. in whichU/, U/.
andU:,
denote the input. hidden, and output neuronsallocated
to the kth processor, mpectively.Note that IU
Ilp = m / p = IUkl whereU
= U , Q U f i u U j . Without loss of generality, we can assume that the number and order of training examples610
tobeclamped.E,ispredefinedbyhost,andlet IE
I
= t .Algorithm 2 :
A
Parallel Boltzmann Machine Leaming Cycle Algorithm procedure PARALLEL-BOL'IZMANN_MACHI"INGPARALL=-LEARNING-CYCLE ; whiIe (stq~-criterion are not d e d ) do end-while
end
procedure PARALLEL-LEARNING-CYCLE for T = 1 tot do/* Parallel Clamped Phase */
CLAMl-EXAMPLE(T) ; PARALLEL-CONVERGE(U~, m) ;
COLLECT-STATISTICS(<pi;> where U ~ E
U '
and u j ~ (I) ;RUN-FREE(U,bUfiyUkl;
PARALLJL-CONVERGE(U~uU/uUi, m) ;
COLLECT-STATISTICS(<ppii>
where U ~ EU k
and U;'U )
; end-forfar T = 1 to t do /* Parallel Free-Running Phase */
end-for
ADJUST-CO"ECTION-STRENGTHS(U,C, U/. uj)
;end
Analysis
To
derive the time complexities of sequential and parallel algorithms, we first assume that the number of total neu- rons assigned to a processor is m = m l + m/+
m i , where m!, m/. and m i denote the number of input, hidden, andoutputn~assignedtothekthprocessor,respectively(i.e.. IU'I = m k , IUFl =m$,IUil
=m/,and I V iI
=&. The
time complexities of the convergence algorithm used in the clamped and freerunning phasesare
diffemt f i a n each other because the sx of neurons to be perturbed are different. In the clamped phase, only hid- denneurons
whichare
connected to input and output neuronsare
candidates to be perturbed. So, the time complex- ity of sequential convergence algorithm used in clamped phase.T
1. is given byT
1 = (Ma .(&+mi)+F).L .M (5)where Ma is a multiply-and-add time of two floating-point numbers, M is the number of temperature cooling cycles to make a Boltzmann Machine stable in sequential algorithm, and L is the n u m b of trials (state transitions) on each temperature in the clamped phase of sequential algorithm in which only hidden neurons are perturbed. On the other hand, in the free running phase, since all neurons are candidates to be pertury. larger L is quired. The time complexity of sequential convergence algorithm used in the free-running phase,
T
1 , is given by(6) where 4 , Lh. and Lo denote the number of selecting input neurons, hidden neurons, and output neurons in the free-running phase. respectively. If we assume that the n u m b of neurons in the input, hidden, and output layer are almost the same (i.e., mi
=
mh=
m,=
m /3), Eq.(6) can be rewrittenas follows,
since the number of required trialsis
pportionai to the number of neurons to be permrbed (i.e., Li =Lh =Lo=
L )Ti
= ((M,.(mi+mh)+F).L,+
(Ma(mi+mo)+F).Lh+ (Ma ~(m,+mh)+F)&).MTi
=(Ma.(mo+mi)+F)3.L.M = 3.T1 (7)Since the p trials
are
computed concmntly in our parallel convergence algorithm, the number of required steps to stabilizethe
network in clamped and in parallel free-running phases are L / p , and 3-L / p , respectively. On the other hand, since it requires longer Markov chains because each parallel trial contains erroneous computations [l], the number of cooling cycle in the parallel convergence algorithm, M', is P M , where fb 1. The time complexities of parallel convergence algorithms used in the clamped phase, Tp , and inthe
free-running phase, Tp , are given by,T i =
(M,*(&+mi)+F+AAB@)).--.pM
3.L = 3-T,(9)
Pwhere
AAB@
) denotes the number of steps toall-to-all-broadcast
a unit of message.By Eqs.(5),
(a,
(8). and (9).the
p -pmcessor speed-up of convergence algorithm in the clamped phase, S @ ), is the S a m e ~ t h e s p e e d - ~ ~ o f f r e e - r u n ~ , S ‘ @ ) , ~ d c o u l d b e r o u g h l ~ d a i ~ e d ~ E q . ( l O ) ,Ring one -port p - 1
d-port
r k p 1
In Eq.(lO) the most important parameters an? the complexity of AAB@) and the communication/computation ratio which is defined as
A
=C/Ma
where C is the unit time ofcommunication
for a processar to send and/or receive a unit of message. The A value lies between 0.5-
256 [4]. In one-port communication,the
lower bound of AAB@ ) on a p -processorDMM
is @-1). since each processor needs to receive from every other processor, i.e.. @-1) pro- cessors [12]. This lower bound can be achieved in the ring topology which is the cheapest one to be constructed. If a processor can send and receive on its d -port, d>
1, concurrently during each time unit, more richly connected topologies would result in less all-to-all b” ’ g time. In d -pat communication,the
lower boundfor
all-&all brocdcastin g on a p -processorDMM
is @-l)/d, since each p ” r needs to receive from every other processorand
the processor can receivethe
dmessages
concurrently. Tabk-1 summarizes the time complexities of the AAB@ ) for the processor communication capebilities and some well-known interconnection network topologies.Table-1 : ’Ihe Lower Bound Complexities of AAB@) for Well-known Topologies
Mesh
I
Hypercube Complete Connectionp - 1 p - 1 p - 1
r %pi V- 1
1The
thecxetical convergence speed-up aspects for well-known interconnection topologiesare
shown in Figure 2-(a) graphically, when m = 2048.8 = 40,p = 1.1,A
= 8, and each processor communicates with d-port Figure 2-@) shows the speed-up ratio vetsus the number of “ m s in Boltzmann Machine when p is fixed as 64 and they have 2-pelt communication capability and intuwnnected by ring topology.Note
that the speed-up ratio is heavily influenced by theA
value when the the number of neurons in the Boltzmann Machine is small, and it approaches U)lim S (p) = p
/p
= 64/1.1=
58 as shown in Figure 2-@). F a the parallel convergence algorithm of Bollzmann as+-Machine. it is seen from Figure 2 that there is a cost-effective number of processors depending on the A value, underlying topology. and the number of
neumns
inthe
Bolamenn Machine, such that though more processors are added to the parallel computations.the
speed-up ratio is not increased as much.N u m k of Rocessor @) (a) The p - mspeed-up of convcsgencc Algorithm
Number of Ne& (m)
@) The ppmce- Speed-up of Canvergence Algorithm for some well-known topologies when ~ 2 0 4 8 w h p = 6 4
Figure 2 :
The
p -processor Speed-up of Parallel Convergence Algorithm612
The time complexity to adjust the connection strengths in the sequential algorithm,
T;,
is proportional to the number of dl co~e!ctim in the Boltzmann Machine. whereas the time to adjust connection strengths in the parallel algorithm. T i , is propomonal to the number of connections incident with the neurons allocated to a processor. If we ignore the ovaheads to clamp the e x p p l e and to collect the statistics. thep -processor speed-up of the parallel leaming cycle over a single processor, S (p ), can be derived as follows ;As shown in Eq.(ll), the p -processor speed-up of the leaming cycle is very complex because it is d e p d e n t on the ratio of times to m v e r g e and times to adjust weights which is an application dependent feature. When the number of kaming cycles in padel learning is increased
h
times compared to the sequential one because ofthe
erroneous computations. the speed-up may be degraded byh
times (h>l).Summary
Since the Boltz.mann Machine requires huge amount of computational resources when the size is large or applied to a real application, it has been a main obstacle which prevents it to be used widely. To overcome this obstacle, we proposed a mapping scheme of Bolkmann Machine onto a Distributed-Memory Multiprocessor. and derive the parallel convergence and learning Boltzmann Machine algorithms. The time complexities of them are theoretically analyzed upon the underlying intemnnection topologies and communication capabilities of processors. The p
-
processor speed-ups are also derived by comparing them with a single processor's case. It is identified from our analysis that all-tw~U broadcasting is the primary communication pattern in OUT parallel algorithm, and there is the cost-effective number of processors corresponding to the characteristics of underlying hardware architecture.
Ref er en c es
[l]
[2]
[3]
[4]
E. H. L.
Aarts
and J.H.
M. Korst, "Simulation of Leaming in Parallel Networks Based on the Boltzmann Machine,", Proc. of the 2nd European Simulation Congress, 1986, pp.391-398.E.
H.
L.Aarts
and J.H.
M Korst, Simulated Annealing and Boltzmnn Machines, John Wiley & Sons, 1989.D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A Leaming Algorithm for Bolamann Machines," Cognitive Science, Vo1.9, 1985, pp.147-169.
M. Annaratone, C. Pommerell, and R. Ruhl, "Interpmxssor Communication Speed and Performance in Distributed-Memory Parallel Processofi," Proc. of the 16th Annual Int'l Symp. on Computer Architecture, R. Azencott and A. Doutriaux, "Synchronous Boltzmann Machines and Outline Based Image Classification,"
Proc. of Int'l Conference on Neural Network, Paris, 1990, pp.7-10.
A. Rrscha,
"A
Parallel Bolkmann Machine Simulator for Distributed Memory Multiprocessor Systems,"Proc. of Int'l Conference on Neural Network, Paris, 1990, pp.647-650.
G. E. Hinton and TJ. Sejnowski, "Learning and Relearning in Boltzmann Machines," in D.E. Rumelhart &
J.L. McClelland (E&.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition.
Vo1.1: Foundation, Cambridge, MA, MIT
Press,
1985.[8] A.
Iazzetta.
R. Vaccam, and U. Villano, "A Transputer Implementation of Boltzmann Machines,'' Proc. of 1st Italian Worhhop on Parallel Architecture and Neural Nehvork, pp.128-145, 1988.[9]
S.
L. Johsson and C. T. Ho, "Optimum Broadcasting and Personalized Communication in Hypercubes," IEEE Transactions on Computer, V01.38, No.9.1989, pp.1249-1268.[lo] J.
H.
M. Korst, "Combinatorial Optimization on a Boltzmann Machine," Journal of Parallel and Distributed Computing, Vol.6.1989, pp.331-357.[ll] H.
Yoon
and JongH.
Nang, "Multilayered Neural Network on Distributed-Memory Multiprocessors." Proc.of Int'l Conference on Neural Network, Paris, 1990. pp.669-672.
1121 H. Yoon, Jong
H.
Nang, and S. R. Maeng, "Parallel Simulation of Multilayered Neural Networks on Distributed-Memory Multiprocessors." Microprocessing and Microprogramming, Vol. 29,1990, pp.185-195.U31 H. Yoon, Jong
H.
Nang, andS.
R. Maeng. "Neural Networks on Parallel Computers," in BranLo Soucek (E&.), Suth Generation Computer Technologies, Wiley, 1991 (in press).pp.315-324,1989.
[5]
[6]
[7]