
NeuroImage

Journal homepage: www.elsevier.com/locate/neuroimage

White matter hyperintensities segmentation using the ensemble U-Net with multi-scale highlighting foregrounds

Gilsoon Park^a, Jinwoo Hong^a, Ben A. Duffy^b, Jong-Min Lee^a,∗, Hosung Kim^b

^a Department of Biomedical Engineering, Hanyang University, Seoul, South Korea

^b USC Mark and Mary Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, CA 90033, USA

Article info

Keywords:
White matter hyperintensities
Segmentation
Deep learning
U-Net
Multi-scale highlighting foregrounds

Abstract

White matter hyperintensities (WMHs) are abnormal signals within the white matter region on human brain MRI and have been associated with aging processes, cognitive decline, and dementia. In the current study, we proposed a U-Net with multi-scale highlighting foregrounds (HF) for WMH segmentation. Our method, U-Net with HF, is designed to improve the detection of WMH voxels with partial volume effects. We evaluated the segmentation performance of the proposed approach using the Challenge training dataset. We then assessed the clinical utility of the WMH volumes that were automatically computed using our method on the Alzheimer’s Disease Neuroimaging Initiative database. We demonstrated, quantitatively and qualitatively, that the U-Net with HF significantly improved the detection of WMH voxels at the boundaries of WMHs or in small WMH clusters. To date, the proposed method has achieved the best overall evaluation score, the highest Dice similarity index, and the best F1-score among the 39 methods submitted to the WMH Segmentation Challenge, which was initially hosted by MICCAI 2017 and continues to accept new challengers. The evaluation of clinical utility showed that the WMH volume automatically computed using the U-Net with HF was significantly associated with cognitive performance and improved the classification between cognitively normal and Alzheimer’s disease subjects and between patients with mild cognitive impairment and those with Alzheimer’s disease. The implementation of our proposed method is publicly available on Docker Hub (https://hub.docker.com/r/wmhchallenge/pgs).

1. Introduction

White matter hyperintensities (WMHs) appear as abnormal hyperintense signals within the white matter region on human brain MRI, including T2-weighted (T2w), proton density (PD), and fluid-attenuated inversion recovery (FLAIR) imaging. These atypical signals mostly result from aging processes such as demyelination and axonal loss, both of which are consequences of cerebral small vessel disease (Prins and Scheltens, 2015). They are frequently observed in the elderly and tend to increase in size and number with age (Habes et al., 2016), and they are associated with several potential vascular risk factors, particularly hypertension (Abraham et al., 2016).

Based on quantitative analyses of WMHs (Barber et al., 1999; Dubois et al., 2014; Habes et al., 2016; Lee et al., 2016; Prins and Scheltens, 2015), previous studies showed that the presence and severity of WMHs are associated with dementia (Dubois et al., 2014) and an increased risk of conditions such as Alzheimer’s disease (Habes et al., 2016; Lee et al., 2016), vascular dementia, and dementia with Lewy bodies (Barber et al., 1999). These studies suggest that the quantitative characterization of WMHs plays an important role in clinical research into neurological disorders.

∗ Corresponding author.
E-mail address: [email protected] (J.-M. Lee).

Manual delineation of WMHs provides the ground truth for volumetric quantification of WMHs. However, it is a laborious, tedious, and time-consuming task and requires a high level of expertise to avoid unacceptable levels of intra- and inter-rater variability. Moreover, the burden grows with the size of the dataset, which motivates automated segmentation.

A number of methods have been proposed to segment WMHs automatically. Jeon et al. (2011) attempted WMH segmentation based on a Markov random field and an intensity thresholding method. Other studies have developed k-nearest neighbors-based clustering approaches (Griffanti et al., 2016; Jiang et al., 2018; Steenwijk et al., 2013). These methods used various features (e.g., spatial information, intensity information, and texture) from T1w and FLAIR images as the input of the clustering algorithm. Dadar et al. (2017) evaluated various classifiers such as logistic regression, support vector machines, decision trees, and random forests. They observed that the random forest achieved the best performance. Recently, convolutional neural networks (CNNs), a class of deep neural networks, have rapidly become a primary method in medical image segmentation and have shown remarkable performance (Litjens et al., 2017). For the segmentation of WMHs, Moeskops et al. (2018) proposed a CNN model based on multi-scale patches extracted from T1w, T1w inversion-recovery, and FLAIR images. Rachmadi et al. (2018) added global spatial information to the patch that was used as an input to a CNN to improve the segmentation of WMHs.

https://doi.org/10.1016/j.neuroimage.2021.118140
Received 15 May 2020; Received in revised form 13 April 2021; Accepted 16 April 2021. Available online 3 May 2021.

The use of different evaluation metrics (e.g., Dice, Jaccard, and Hausdorff indices), as well as evaluations against manual WMH delineations by different experts, makes it difficult to systematically compare the performance of segmentation methods from various studies. To address these issues, the WMH Segmentation Challenge 2017 was held for a standardized comparison of automatic WMH segmentation methods (Kuijf et al., 2019) in conjunction with the 20th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) (Descoteaux et al., 2017). The Challenge provided a public platform to standardize the evaluation of WMH segmentation methods based on a unified dataset of MRI, evaluation metrics, and expert labeling. Twenty teams proposed new methods and performed training and testing on the dataset provided. Li et al. (2018) achieved the best performance by combining an ensemble approach, which is a good way to reduce the over-fitting of deep learning algorithms, with the U-Net, a deep encoder-decoder architecture with skip connections that concatenate the feature maps in the encoder to the feature maps in the decoder.

Partial volume effects (PVE) on MRI images arise from limited spatial resolution, where one pixel/voxel represents the signal of a mixture of different brain tissues. This effect becomes strong at voxels on the boundary between structures with different tissue characteristics or between a lesion and the surrounding brain tissue. Failure to identify voxels with PVE may result in under-segmentation of the target lesion, in particular for small lesions and lesion boundaries.

Deep neural network approaches are not immune to this problem. Indeed, we tested a standard U-Net on WMH segmentation and observed misclassification of voxels located at the edges of WMHs or in small WMHs. The predicted foreground probability is low for WMH voxels with strong PVE because of the uncertainty of their features, and the networks tend to fail to identify those partial volume WMHs (PV-WMHs). In general, the volume of the PV-WMHs is small relative to the volume of the overall WMHs. The PV-WMHs therefore contribute little to the network loss, so the network learns them insufficiently.

We propose to train the network using a multi-scale approach of highlighting foregrounds (HF). In the standard U-Net, training is accomplished by minimizing the Dice loss at the output layer, which is computed by comparing the posterior probability map with the ground truth label. To emphasize the label voxels on WMH boundaries, which are likely to be PV-WMH voxels, we propose to compute and minimize Dice losses at particular intermediate decoder layers by comparing their output probability maps with the corresponding labels generated using the proposed HF method. The HF approach downsamples the ground truth labels sequentially by applying 2 × 2 max-pooling with stride 2, resizing the labels to the size of each of the decoder layers. The approach emphasizes the influence of voxels lying on lesion boundaries or belonging to small lesions on the training of the network.
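The sequential label downsampling described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' TensorFlow code: the function names and the toy 8 × 8 label are hypothetical, and the sketch assumes even image sizes at every level (odd sizes would need padding, which is not described here).

```python
import numpy as np

def max_pool_2x2(label):
    """2 x 2 max-pooling with stride 2 on a 2D binary label image."""
    h, w = label.shape
    assert h % 2 == 0 and w % 2 == 0, "sketch assumes even height and width"
    # Group pixels into non-overlapping 2 x 2 blocks and keep the block maximum.
    return label.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def highlight_foregrounds(label, n_scales):
    """Return [l_0, l_1, ..., l_n], each level max-pooled from the previous one."""
    pyramid = [label]
    for _ in range(n_scales):
        pyramid.append(max_pool_2x2(pyramid[-1]))
    return pyramid

# Toy 8 x 8 ground truth with a single-pixel lesion (hypothetical data).
toy_label = np.zeros((8, 8), dtype=np.uint8)
toy_label[3, 5] = 1
hf_pyramid = highlight_foregrounds(toy_label, n_scales=3)
hf_shapes = [level.shape for level in hf_pyramid]
hf_foreground_counts = [int(level.sum()) for level in hf_pyramid]
```

In this toy case the one-pixel lesion is still present at every scale, so its relative share of the label image grows as the resolution drops, which is the sense in which small lesions and boundary voxels are emphasized.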

We used the labels generated using the proposed HF method to train auxiliary classifiers in the intermediate decoder layers. Training by inserting auxiliary classifiers in the intermediate layers is known as deep supervision. In natural image classification, GoogLeNet (Szegedy et al., 2015) is based on this method. However, that study did not systematically evaluate the use of auxiliary classifiers. Independently, Lee et al. (2015) proposed deeply-supervised networks that were combined with auxiliary classifiers at all intermediate layers for image classification. They showed that the deep supervision method improved the convergence rate of networks by alleviating the vanishing or exploding gradient problem. It also improved the discriminative ability of the learned features by directly driving the low- and mid-level features in intermediate layers toward very high-level features (i.e., the target output). Wang et al. (2015) applied the deep supervision method to deeper convolutional networks. They proposed adding auxiliary classifiers after certain intermediate layers for better classification performance in deeper convolutional networks. Chen et al. (2016a) used the deep supervision framework for neuronal structure segmentation on electron microscopy images. Their method added deconvolutional layers to the intermediate encoder layers to train auxiliary classifiers and fused the outputs. Several variants of deeply-supervised networks were further introduced to segment brain tissue (Chen et al., 2018), liver, whole heart and great vessels (Dou et al., 2017), and retinal vessels (Lin et al., 2018). Zhu et al. (2017) proposed using a U-Net with deep supervision for prostate segmentation in MR images. These previous works (Chen et al., 2018; Chen et al., 2016a; Dou et al., 2017; Lin et al., 2018; Zhu et al., 2017) reported that deep supervision could improve segmentation accuracy.

Though the proposed method is similar to these previous works (Chen et al., 2018; Chen et al., 2016a; Dou et al., 2017; Lin et al., 2018; Zhu et al., 2017), our approach differs in several respects. We generate label images at different resolutions from the ground truth labeling while emphasizing the foreground voxels using the proposed highlighting foregrounds (HF) method. The generated multi-scale label images are used so that the network learns features focusing on the foreground area at different resolutions. In contrast, the previous works used the ground truth labels without modification at the original image resolution. During training, they upsampled feature maps to the original image resolution in order to generate the output. The proposed method instead computes the loss functions without upsampling the network outputs, reducing GPU memory usage.

The MICCAI WMH Segmentation Challenge has continued to accept new submissions since its initial opening. Of the 39 methods submitted to the challenge as of March 1st, 2020, our team, which proposed the ensemble U-Net with multi-scale HF, currently achieves the best performance.

In the following sections, we outline the methods we used to achieve state-of-the-art performance on this task. We first outline the U-Net with HFs model. Then, we show that the proposed method significantly improved WMH segmentation performance compared to the standard U-Net. In the results section, we evaluate the proposed method in comparison to other methods and assess the influence of the multi-scale HFs and the effect of their organization. Finally, using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, we investigate the potential clinical utility of the HF method by automatically segmenting WMHs and associating the WMH volumes with cognitive performance scores that are used for the diagnosis of mild cognitive impairment and Alzheimer’s disease (i.e., Mini-Mental State Examination (MMSE), Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog), and Clinical Dementia Rating Scale sum (CDR sum)).

2. Materials and methods

2.1. Challenge dataset and pre-processing

We validated our method using the dataset and evaluation framework of the WMH Segmentation Challenge 2017 because it provides a standardized assessment of WMH segmentation performance (Kuijf et al., 2019). The challenge organizers provided a training dataset and a test dataset consisting of 170 subjects in total. Details of the datasets are given in Table 1. The training dataset included 60 subjects and was publicly available for download after registration at https://wmh.isi.uu.nl/data/. We used this dataset to train our network and investigate the effect of the HF method. The test dataset, comprising the remaining 110 subjects, was only available for evaluating predictions submitted to the challenge.

For each subject, the dataset included the brain MR images before and after pre-processing for the T1w and FLAIR images and a manual delineation of WMHs. The images were acquired from five different MR scanners at three different institutes. The images acquired with three of the MR scanners (i.e., 3T Philips Achieva, 3T Siemens TrioTim, and 3T GE Signa HDxt) were used for both training and testing. The images acquired with the other two MR scanners (i.e., 1.5T GE Signa HDxt and 3T Philips Ingenuity) were used only for testing. All the 3D FLAIR images from the VU Amsterdam institute were resampled into the axial direction with 3 mm slice thickness. The WMHs were labeled on the FLAIR images by two experts, based on the STandards for ReportIng Vascular changes on nEuroimaging (STRIVE) criteria (Wardlaw et al., 2013). The organizers provided pre-processed data: 1) T1w images that were registered to the FLAIR images using the Elastix toolbox (Klein et al., 2009); and 2) T1w and FLAIR images that underwent correction for intensity non-uniformity using SPM12 (Ashburner and Friston, 2000).

Table 1
The overview of the characteristics of the WMH Segmentation Challenge 2017 dataset.

| Institute | Scanner | T1 voxel size (TR/TE(/TI)) | FLAIR voxel size (TR/TE/TI) | Train | Test |
|---|---|---|---|---|---|
| UMC Utrecht | 3T Philips Achieva | 1.00 × 1.00 × 1.00 mm³ (7.9/4.5 ms) | 0.96 × 0.95 × 3.00 mm³ (11,000/125/2800 ms) | 20 | 30 |
| NUHS Singapore | 3T Siemens TrioTim | 1.00 × 1.00 × 1.00 mm³ (2300/1.9/900 ms) | 1.00 × 1.00 × 3.00 mm³ (9000/82/2500 ms) | 20 | 30 |
| VU Amsterdam | 3T GE Signa HDxt | 0.94 × 0.94 × 1.00 mm³ (7.8/3.0 ms) | 0.98 × 0.98 × 1.20 mm³ (8000/126/2340 ms) | 20 | 30 |
| | 1.5T GE Signa HDxt | 0.98 × 0.98 × 1.50 mm³ (12.3/5.2 ms) | 1.21 × 1.21 × 1.30 mm³ (6500/117/1987 ms) | 0 | 10 |
| | 3T Philips Ingenuity | 0.87 × 0.87 × 1.00 mm³ (9.9/4.6 ms) | 1.04 × 1.04 × 0.56 mm³ (4800/279/1650 ms) | 0 | 10 |

∗ Abbreviations: TR: repetition time; TE: echo time; TI: inversion time.

We further pre-processed these data for training and testing our method. First, to reduce false positives, we removed non-brain tissue using ROBEX (Iglesias et al., 2011). Second, we performed intensity normalization to match the intensity distributions across the training data. For each image, we calculated the mean and variance using intensities ranging from the 2nd to the 98th percentile in the brain region. We then normalized the intensities within the brain in each image using the z-score transformation. Finally, to equalize the size of the input data to the network, the axial slices in each 3D image were cropped or padded to a size of 200 × 200. We used 2D slices to train our 2D CNN model.
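The normalization and resizing steps can be sketched as follows. This is an illustrative NumPy version under several assumptions that the text does not state: the crop and pad are centered, padding uses zeros, voxels outside the brain mask are set to zero, and the z-score is applied to all brain voxels using statistics from the 2nd to 98th percentile range. All function names and the toy data are hypothetical.

```python
import numpy as np

def normalize_intensity(image, brain_mask, lower=2.0, upper=98.0):
    """Z-score brain voxels using mean/std from the 2nd-98th percentile range."""
    brain = image[brain_mask]
    lo, hi = np.percentile(brain, [lower, upper])
    core = brain[(brain >= lo) & (brain <= hi)]
    normalized = np.zeros(image.shape, dtype=np.float64)
    # Assumption: voxels outside the brain mask stay at zero.
    normalized[brain_mask] = (brain - core.mean()) / core.std()
    return normalized

def crop_or_pad(slice_2d, size=200):
    """Center-crop or zero-pad a 2D slice to size x size."""
    out = np.zeros((size, size), dtype=slice_2d.dtype)
    h, w = slice_2d.shape
    # Source window (used when cropping) and destination window (used when padding).
    src_y, src_x = max((h - size) // 2, 0), max((w - size) // 2, 0)
    dst_y, dst_x = max((size - h) // 2, 0), max((size - w) // 2, 0)
    ch, cw = min(h, size), min(w, size)
    out[dst_y:dst_y + ch, dst_x:dst_x + cw] = slice_2d[src_y:src_y + ch, src_x:src_x + cw]
    return out

# Toy axial slice and brain mask (hypothetical data).
rng = np.random.default_rng(0)
toy_slice = rng.normal(100.0, 20.0, size=(240, 180))
toy_mask = np.zeros((240, 180), dtype=bool)
toy_mask[40:200, 30:150] = True
prepared = crop_or_pad(normalize_intensity(toy_slice, toy_mask))
```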

2.2. Network architecture

In the current study, we propose to combine multi-scale HFs with a U-Net architecture (Ronneberger et al., 2015). The main idea of the U-Net is the skip connections between the encoder and decoder, which allow the network to reuse the feature maps of the encoder. This generally helps the network to predict dense segmentation results and alleviates the vanishing gradient problem. Variants of the U-Net have been used in diverse medical image segmentation problems and have demonstrated outstanding performance (Çiçek et al., 2016; Drozdzal et al., 2018; Guerrero et al., 2018). In the original Challenge in 2017, four of the top teams, including the top team (Li et al., 2018), exploited the U-Net architecture (Kuijf et al., 2019). Here, we advance a 2D U-Net for WMH segmentation (Fig. 1) with two novel components: modified details of the network architecture and the inclusion of the multi-scale HFs.

2.2.1. Details of the network configuration

As seen in Fig. 1, the modified network is based on the encoder-decoder structure. During both training and testing, the axial slices of the FLAIR and T1w modalities are fed into the encoder as a two-channel input (i.e., the input size is 200 × 200 × 2). We chose the axial slice for the 2D input data because the FLAIR images had the best resolution in the axial plane. In our network, we adopt the following configurations as suggested in recent works: 1) Instead of the 3 × 3 kernel convolutions of the standard U-Net, we use 5 × 5 kernel convolutions in the first two layers for handling different transformations, as in Li et al. (2018). 2) Batch normalization is added to each of the 18 convolutional layers, which accelerates the training process and improves the network performance by reducing internal covariate shift (Ioffe and Szegedy, 2015). 3) Finally, we use an exponential linear unit (ELU) (Clevert et al., 2015) as the non-linear activation function instead of the rectified linear unit (ReLU) of the standard U-Net. ReLU is neither activated nor updated at negative values, whereas ELU retains the strengths of ReLU and is also activated and updated at negative values, improving the learning characteristics (Clevert et al., 2015). The encoder contains four 2 × 2 max-pooling layers with stride 2, one after every two convolutional layers, for downsampling (Fig. 1). Upsampling layers based on nearest-neighbor interpolation are applied after every two convolutional layers in the decoder (Fig. 1). The feature maps in the encoder, taken prior to downsampling, are concatenated to the feature maps right after upsampling in the decoder. At the output convolutional layers, a 1 × 1 convolution with a softmax function is used to convert the feature maps into the label space with a depth of two (i.e., two classes: WMHs and non-WMHs).
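The contrast between ReLU and ELU at negative inputs can be made concrete with a short NumPy sketch. The scale parameter `alpha = 1.0` is an assumption (the value used in the network is not given here), and the function names are illustrative only.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: zero output (and zero gradient) for negative inputs."""
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    # Clipping to the negative part keeps expm1 from overflowing on large positive x.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x > 0, alpha * exp(x) otherwise (never exactly zero)."""
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

activation_inputs = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = relu(activation_inputs)
elu_out = elu(activation_inputs)
elu_slopes = elu_grad(activation_inputs)
```

For the two negative inputs, `relu_out` is zero while `elu_out` is negative with a non-zero slope, so those units still receive gradient updates.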

2.2.2. Multi-scale highlighting foregrounds

We add the label images modified by the multi-scale HF approach to the intermediate layers of the decoder in our network (Fig. 1). The intermediate output convolutional layers convert the feature maps into multi-scale segmentation probability maps (black arrows in Fig. 1). The multi-scale HFs max-pool the label image (Fig. 2). Given the foreground pixels in a label image, the label image max-pooled by HF is defined as follows:

$$l_k^{HF} = f_{MP}\left(l_{k-1}^{HF}\right) \quad (k = 1, 2, \ldots, n), \tag{1}$$

where $f_{MP}$ is a 2 × 2 max-pooling operator with stride 2, $l_k^{HF}$ is the max-pooled label image generated by applying $f_{MP}$ $k$ times, and $l_0^{HF}$ is the original foreground image (i.e., the ground truth label). The background image, $l_k^{BG}$, is defined as follows:

$$l_k^{BG} = 1 - l_k^{HF} \quad (k = 0, 1, 2, \ldots, n), \tag{2}$$

where $l_0^{BG}$ is the original background image and $l_k^{BG}$ is the background obtained by inverting $l_k^{HF}$.

The foreground/background images generated by the multi-scale HFs are used to compute losses through comparison with the corresponding outputs. We used a soft Dice score as the loss function. Let $L^c = (l_0^c, l_1^c, \ldots, l_M^c)$ be the set of label images, where $l_0^c$ is the ground truth label image and $M$ is the number of scales generated by the multi-scale HF. The superscript $c$ denotes the type, either HF or BG. Then, let $S^c = (s_0^c, s_1^c, \ldots, s_M^c)$ be the segmentation produced by the network, where $s_0^c$ is the output segmentation at the size of the original label image $l_0^c$ and $(s_1^c, \ldots, s_M^c)$ are the output segmentations at the sizes of the multi-scale label images $(l_1^c, \ldots, l_M^c)$, respectively. The Multi-scale Loss Function (MLF) can be written as

$$\mathrm{MLF} = \sum_{c=1}^{C} \sum_{m=0}^{M} w_m \, \frac{2\left(l_m^c \circ s_m^c\right) + \epsilon}{l_m^c + s_m^c + \epsilon}, \tag{3}$$

where the sum over $c$ runs over the foreground and background label types, $\circ$ is the element-wise product, $w_m$ is the weight for the loss function at the $m$th scale, and $\epsilon$ is a smoothing constant that prevents division by zero in the MLF, which we set to 0.00001 for the current network training. The weights $w_m$ sum to one. We evaluated various sets of $w_m$ and found the best segmentation performance when $w_m$ was the same for all of the losses, as described in the following section.
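As a rough numerical illustration of Eq. (3), the following NumPy sketch evaluates the weighted multi-scale Dice term on toy data. It rests on assumptions not spelled out in the text: the products and maps are summed over all pixels before the ratio is taken, and the background probability is one minus the foreground probability (as for a two-class softmax). The quantity computed is a score that equals 2 for a perfect prediction, so a training loop would presumably minimize its negative or complement; that sign convention is also an assumption. All names are hypothetical.

```python
import numpy as np

def soft_dice(label, prob, eps=1e-5):
    """Soft Dice between a binary label map and a probability map.
    Assumption: the element-wise product and both maps are summed over pixels."""
    intersection = np.sum(label * prob)
    return (2.0 * intersection + eps) / (np.sum(label) + np.sum(prob) + eps)

def multi_scale_dice(hf_labels, hf_probs, weights, eps=1e-5):
    """Weighted sum over scales m and label types c (HF and BG), following Eq. (3)."""
    total = 0.0
    for w_m, l_hf, s_hf in zip(weights, hf_labels, hf_probs):
        # Background maps by inversion (Eq. (2)); assumes a two-class softmax output.
        l_bg, s_bg = 1.0 - l_hf, 1.0 - s_hf
        total += w_m * (soft_dice(l_hf, s_hf, eps) + soft_dice(l_bg, s_bg, eps))
    return total

# Toy two-scale example (hypothetical data): perfect and fully inverted predictions.
toy_labels = [np.eye(4), np.eye(2)]
equal_weights = [0.5, 0.5]
perfect_score = multi_scale_dice(toy_labels, toy_labels, equal_weights)
worst_score = multi_scale_dice(toy_labels, [1.0 - l for l in toy_labels], equal_weights)
```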

Fig. 1. The workflow of the proposed U-Net with multi-scale highlighting foregrounds.

Fig. 2. Illustration of the multi-scale highlighting foregrounds.

3. Experiments

3.1. Evaluation metrics and ranking system

We used the five evaluation metrics that the WMH challenge adopted to quantitatively compare the methods developed by participating teams (details in Table 3). Let ML be the WMHs manually labeled by experts and AL be the WMHs automatically labeled by the proposed approach. The five evaluation metrics are as follows: (1) the Dice similarity coefficient (DSC) as the overlap index between ML and AL, (2) a modified Hausdorff distance (95th percentile; H95) as the overall distance between ML and AL boundaries, (3) the absolute percentage volume difference (AVD) between the total WMHs of ML and AL, (4) recall as the sensitivity in detecting individual lesions, and (5) F1-score as the harmonic mean of precision and recall in detecting individual lesions. The challenge organizers defined the individual lesions in both recall and F1 as 3D connected components within an image. All five measurements are positive real-valued. The closer the values of DSC, recall, and F1 are to one, the higher the similarity between ML and AL. Conversely, the closer the values of H95 and AVD are to zero, the higher the similarity between ML and AL. Table 3 details the definitions of these five metrics. These metrics for our testing results were computed by the challenge organizers.
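Two of the voxel-wise metrics, DSC and AVD, are simple enough to sketch directly from their definitions. The code below is an illustration in NumPy and not the challenge's official evaluation script; it assumes binary masks and a non-empty manual label, and the function names and toy masks are hypothetical.

```python
import numpy as np

def dice_similarity(manual, automatic):
    """DSC = 2TP / (2TP + FP + FN) on binary masks."""
    manual, automatic = manual.astype(bool), automatic.astype(bool)
    tp = np.logical_and(manual, automatic).sum()
    fp = np.logical_and(~manual, automatic).sum()
    fn = np.logical_and(manual, ~automatic).sum()
    return 2.0 * tp / (2.0 * tp + fp + fn)

def absolute_volume_difference(manual, automatic):
    """AVD in percent: |V_ML - V_AL| / V_ML * 100 (assumes V_ML > 0)."""
    v_ml, v_al = float(manual.sum()), float(automatic.sum())
    return 100.0 * abs(v_ml - v_al) / v_ml

# Toy masks (hypothetical): the automatic label misses one of four lesion voxels.
toy_ml = np.zeros((4, 4), dtype=np.uint8)
toy_ml[1:3, 1:3] = 1
toy_al = toy_ml.copy()
toy_al[1, 1] = 0
toy_dsc = dice_similarity(toy_ml, toy_al)
toy_avd = absolute_volume_difference(toy_ml, toy_al)
```

Here TP = 3, FP = 0, and FN = 1, so the DSC is 6/7 and the AVD is 25%.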

The challenge organizers proposed a system for ranking the overall performance of participating teams. This system consisted of four steps. First, the mean of each metric was computed over all test data for each team’s method. Second, for each evaluation metric, the organizers sorted all of the teams from best to worst. Next, the best and worst teams received a rank score of zero and one, respectively, for that metric. Other teams were assigned a rank score between zero and one according to where their results fell within the range of that metric. Finally, the five rank scores were averaged into the overall rank score indicating the overall performance of that team. The overall rank score was used to determine the ranking of each team on the result board on the challenge homepage (https://wmh.isi.uu.nl/results/).
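The ranking procedure can be sketched as below. The text says only that intermediate teams receive a score "within the range of that metric"; the sketch assumes this means linear scaling between the best and worst team means, which is an interpretation rather than a confirmed detail. The team values are invented for illustration.

```python
import numpy as np

def rank_scores(team_means, higher_is_better):
    """Map per-team metric means to [0, 1]: best team -> 0, worst team -> 1.
    Assumption: linear scaling between best and worst (requires best != worst)."""
    values = np.asarray(team_means, dtype=float)
    best = values.max() if higher_is_better else values.min()
    worst = values.min() if higher_is_better else values.max()
    return (best - values) / (best - worst)

# Hypothetical means for three teams on two metrics.
dsc_scores = rank_scores([0.80, 0.70, 0.75], higher_is_better=True)
h95_scores = rank_scores([6.0, 10.0, 7.0], higher_is_better=False)

# Overall rank score: average of the per-metric rank scores (lower is better).
overall = np.mean([dsc_scores, h95_scores], axis=0)
```

With all five challenge metrics, the same averaging would run over five per-metric score vectors instead of two.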

3.2. Implementation details

The proposed network and experimental networks were implemented in Python using TensorFlow (Abadi et al., 2016). The networks were trained on four NVIDIA Titan-Xp GPUs with 12 GB RAM. The hyper-parameters of the networks were set as follows: mini-batch size = 30, optimizer = Adam (Kingma and Ba, 2014), learning rate = 0.0002, number of epochs = 1000, and He initialization (He et al., 2015). Early stopping based on a validation dataset was used to avoid overfitting to the training data. The performance of all networks converged within 1000 epochs. Data augmentation was applied during training to enhance robustness in the face of limited training data. To this end, flipping of the axial slices along each axis and various affine transforms, including translation, rotation, scaling, and shearing, were randomly applied. The parameters of data augmentation were as follows: probability of flipping along each axis = 0.5, range of translation ratio = (−0.1, 0.1), range of rotation = (−15°, 15°), range of scaling ratio = (0.9, 1.1), and range of shearing = (−18°, 18°). The networks were trained on a 1:3 ratio of original data to augmented data at each epoch.

To achieve robust segmentation results, we applied an ensemble method and flip averaging to the proposed method. In the training step, we performed 5-fold cross-validation in which the images of each fold (n = 12) were randomly and equally sampled from each site dataset in the whole training dataset (n = 60). Then, holding out each fold in turn for validation, we trained the proposed network using the other four folds (n = 48), resulting in 5 separately trained networks. In the testing step, we first flipped each individual image with respect to the x-axis, y-axis, and xy-axis, which generated 3 flipped images per individual. Then, each of the four images was used as input to each network generated from the 5-fold cross-validation. The flip averaging was the majority voting of the four outputs of a network from the original input and the three inputs flipped about the x-axis, y-axis, and xy-axis. The final output was then the majority voting of the five outputs generated from the flip averaging of each network.
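The test-time procedure (flip averaging inside each network, then voting across the five networks) can be sketched as follows. Several details here are assumptions: predictions on flipped inputs are flipped back before voting, the x- and y-axes correspond to array axes 0 and 1, and a 2-2 tie among the four flip outputs is resolved as background. The "models" are stand-in threshold functions, not trained networks, and all names are hypothetical.

```python
import numpy as np

def majority_vote(binary_maps):
    """Pixel-wise strict majority over binary maps (assumption: ties -> background)."""
    stacked = np.stack(binary_maps).astype(np.int32)
    return (2 * stacked.sum(axis=0) > len(binary_maps)).astype(np.uint8)

def flip_vote(predict, image):
    """Predict on the original and three flipped copies, undo the flips, then vote."""
    flip_axes = [(), (0,), (1,), (0, 1)]
    outputs = []
    for axes in flip_axes:
        flipped = np.flip(image, axis=axes) if axes else image
        pred = predict(flipped)
        # Map the prediction back to the original orientation before voting.
        outputs.append(np.flip(pred, axis=axes) if axes else pred)
    return majority_vote(outputs)

def ensemble_segment(models, image):
    """Majority vote over the flip-averaged outputs of all models."""
    return majority_vote([flip_vote(m, image) for m in models])

# Stand-in "networks": five thresholding functions (hypothetical, for illustration).
rng = np.random.default_rng(0)
toy_image = rng.random((6, 6))
toy_models = [lambda img, t=t: (img > t).astype(np.uint8) for t in (0.3, 0.4, 0.5, 0.6, 0.7)]
toy_segmentation = ensemble_segment(toy_models, toy_image)
```

Because three of the five stand-in models must agree, the toy ensemble behaves like the median model (threshold 0.5).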

3.3. The evaluation of the importance of multi-scale highlighting foregrounds

To investigate the importance of multi-scale HF in WMH segmentation, we evaluated our model with various parameters of multi-scale HF, which included 1) the type of pooling for HF, and 2) the weights of the losses at the output layers. We performed this evaluation through a cross-scanner validation using the training dataset (60 subjects) of the WMH segmentation challenge. The dataset for the evaluation was split into a training dataset (1st and 2nd sites: 40 subjects) and a validation dataset (3rd site: 20 subjects). Our evaluation results are based on the raw network output (i.e., without ensembling and flip averaging).

The type of pooling for HF: We compared the network using the proposed max-pooled label images with the corresponding average-pooling version. For the latter, we downsampled the ground truth label image to the lower resolutions of the intermediate output layers by repeatedly applying a 2 × 2 average pooling operation with stride 2. We then generated hard labels by thresholding the downsampled soft labels at 0.5. The network for this experiment was equipped with either four max-pooled or four average-pooled label images and the same loss weights at all of the output layers. To avoid cherry-picking results by random seed, we performed a cross-scanner validation (3-fold cross-validation) with 5 runs of random initialization (resulting in 15 trained networks for each tested method) for the original U-Net, the U-Net with HF, and the U-Net with average-pooled labels (AVG). Each fold corresponded to one scanner dataset, i.e., the UMC dataset, NUHS dataset, or VU dataset. We treated the results of each fold as an individual sample. We performed paired t-tests for the comparison among the tested methods and calculated the 95% confidence interval (CI) according to the paired sample test.
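The practical difference between the two pooling types shows up on small lesions, as the following NumPy sketch illustrates on a toy label. The threshold is applied as a strict `> 0.5`, which is an assumption (the text does not say whether the comparison is strict); the toy label and helper name are hypothetical.

```python
import numpy as np

def pool_2x2(label, reduce):
    """Apply a 2 x 2, stride-2 pooling with the given reduction (np.max or np.mean)."""
    h, w = label.shape
    return reduce(label.reshape(h // 2, 2, w // 2, 2), axis=(1, 3))

# Toy ground truth (hypothetical): one isolated pixel and one 4 x 4 lesion.
toy_gt = np.zeros((8, 8), dtype=float)
toy_gt[2, 2] = 1.0
toy_gt[4:8, 4:8] = 1.0

max_label = pool_2x2(toy_gt, np.max)
# Average pooling gives soft labels; threshold at 0.5 to obtain hard labels.
avg_label = (pool_2x2(toy_gt, np.mean) > 0.5).astype(float)
```

The isolated pixel averages to 0.25 within its block and is thresholded away in `avg_label`, whereas `max_label` retains it; the large lesion survives under both schemes.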

The weights of the losses at all of the output layers: In this experiment, we investigated the impact of the set of $w_m$ in Eq. (3). The network with four max-pooled label images was used for this experiment. Three sets of $w_m$ were tested: 1) $w_0 = 5/15$, $w_1 = 4/15$, $w_2 = 3/15$, $w_3 = 2/15$, and $w_4 = 1/15$; 2) $w_0 = w_1 = w_2 = w_3 = w_4 = 3/15$; and 3) $w_0 = 1/15$, $w_1 = 2/15$, $w_2 = 3/15$, $w_3 = 4/15$, and $w_4 = 5/15$. For this experiment, we trained three networks (only one run for each fold; a total of three runs) and averaged the results at the patient level to investigate statistical differences.
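As a quick sanity check, the three weight sets can be written out exactly and verified against the constraint from Eq. (3) that the weights sum to one. The dictionary keys are descriptive labels introduced here for illustration only.

```python
from fractions import Fraction

# Index m = 0 is the full-resolution output; m = 4 is the coarsest scale.
weight_sets = {
    "decreasing": [Fraction(k, 15) for k in (5, 4, 3, 2, 1)],
    "equal": [Fraction(3, 15)] * 5,
    "increasing": [Fraction(k, 15) for k in (1, 2, 3, 4, 5)],
}
weight_sums = {name: sum(ws) for name, ws in weight_sets.items()}
```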

3.4. Clinical utility evaluation

The aim here was to assess the clinical utility of the proposed method by evaluating whether the proposed HF-based automated volumetry is a biomarker of dementia severity (i.e., cognitive performance decline) or a diagnostic measurement of dementia. Accordingly, we analyzed the association of the WMH volumes with cognitive performance scores and with the diagnosis among cognitively normal (CN), mild cognitive impairment (MCI), and early Alzheimer’s disease (AD) subjects. Also, to analyze the effect of WMH volume on subject diagnosis (CN, MCI, and AD), we compared classification performances using logistic regression under three different conditions (WMH volume only, clinical variables only, and WMH volume and clinical variables combined).

3.4.1. ADNI dataset and preprocessing

We employed datasets from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). We randomly selected 243 ADNI subjects who completed cognitive evaluations and MRI scans at their baseline visits. The data include 73 CN, 115 MCI, and 55 AD subjects by baseline diagnosis. We used the Mini-Mental State Examination (MMSE), Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog), and Clinical Dementia Rating Scale sum (CDR sum) as cognitive performance evaluations that followed a standardized protocol (Petersen et al., 2010). The score ranges of MMSE, ADAS-Cog, and CDR sum were 0–30, 0–70, and 0–18, respectively. A higher ADAS-Cog or CDR sum score indicates lower cognitive performance. On the other hand, a lower MMSE score means lower cognitive performance. Each individual MRI scan consisted of a T1w image and a FLAIR image acquired in the axial plane. Table 2 details demographic and clinical information.

We pre-processed all images through the same steps that are described in Section 2.1 (Challenge dataset and pre-processing) for consistent data processing. We did not crop or pad the images to 200 × 200 axial planes because fully convolutional networks accept input images of arbitrary size at test time.

3.4.2. Statistical analysis

We used a general linear model (GLM) to assess whether WMH volumes computed using our method were associated with the cognitive performance scores or the diagnosis of subjects. We included each type of cognitive score or the diagnosis (i.e., CN, MCI, and AD) as a dependent variable and the WMH volume as an independent variable. In the GLM, we also included age, cardiovascular risk, education, gender, ApoE4 genotype, and race as covariates to correct for their confounding effects. The cardiovascular risk score ranged from 0 to 5 and was obtained by counting the following diseases or characteristics individually: hypertension, stroke, smoking, diabetes mellitus, and cardiovascular disease. To mitigate the possible issue of the skewed distributions of the cognitive scores and WMH volumes, we applied the square root transformation to MMSE and CDR sum, and the log-transformation to WMH volumes, as suggested in Carmichael et al. (2010). Then, all the transformed values and ADAS-Cog were normalized using the z-score transformation. All the statistical tests were implemented using Matlab 2019b.
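The variable transformations applied before fitting the GLM can be sketched as follows. This is a NumPy illustration, not the Matlab code used in the study; it assumes strictly positive WMH volumes (so that the logarithm is defined), a natural logarithm, and the sample standard deviation (`ddof=1`) in the z-score. The toy values are invented.

```python
import numpy as np

def zscore(x):
    """Standardize to zero mean and unit sample standard deviation (ddof=1 assumed)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def prepare_variables(mmse, cdr_sum, adas_cog, wmh_volume):
    """Square root for MMSE and CDR sum, log for WMH volume, then z-score everything."""
    return {
        "mmse": zscore(np.sqrt(mmse)),
        "cdr_sum": zscore(np.sqrt(cdr_sum)),
        "adas_cog": zscore(adas_cog),          # ADAS-Cog is only z-scored
        "wmh": zscore(np.log(wmh_volume)),     # assumes volumes > 0
    }

# Hypothetical scores for five subjects.
toy_vars = prepare_variables(
    mmse=[30, 29, 27, 24, 20],
    cdr_sum=[0.0, 0.5, 1.5, 4.0, 6.0],
    adas_cog=[5, 8, 11, 19, 25],
    wmh_volume=[2.1, 5.3, 9.9, 13.6, 30.2],
)
```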

Table 2
The overview of the characteristics of 243 ADNI subjects. Values are mean (SD) unless otherwise noted; the last three columns give the diagnosis at baseline.

| Characteristic | Total | Cognitively normal | MCI | AD |
|---|---|---|---|---|
| Sample, No. | 243 | 73 | 115 | 55 |
| Age, y | 73 (7.5) | 75 (6.5) | 71 (7.9) | 76 (7.1) |
| CV risk | 1.7 (1.0) | 1.7 (1.1) | 1.7 (1.0) | 1.6 (1.0) |
| Education, y | 16 (2.7) | 16 (2.5) | 17 (2.7) | 16 (2.7) |
| Male, No. (%) | 123 (51) | 36 (49.3) | 56 (48.7) | 31 (56.4) |
| ApoE4 genotype, 0/1/2 | 114/100/29 | 48/23/2 | 50/49/16 | 16/28/11 |
| Race: White | 218 | 63 | 105 | 50 |
| Race: Other | 25 | 10 | 10 | 5 |
| MMSE score | 27 (2.8) | 29 (1.3) | 28 (2.1) | 23 (2.1) |
| ADAS-Cog score | 11 (6.8) | 6 (3.1) | 10 (4.8) | 20 (5.8) |
| CDR sum | 1.75 (1.99) | 0.12 (0.53) | 1.4 (1.1) | 4.6 (1.7) |
| WMHs, cm³ | 9.94 (11.74) | 9.13 (14.33) | 8.72 (9.04) | 13.58 (12.41) |

Table 3
List of evaluation metrics and their definitions. The challenge organizers defined individual lesions for recall and F1 as 3D connected components. $N$ means the number of 3D connected components. $\hat{d}(X, Y)$ means $\max_{x \in X} \min_{y \in Y} d(x, y)$, where $d(x, y)$ is the distance between points $x$ and $y$. $V_{ML}$ and $V_{AL}$ mean the WMH volumes computed by manual segmentation and our method, respectively.

| Metric | Equation |
|---|---|
| DSC | $\dfrac{2TP}{FP + FN + 2TP}$ |
| H95 | $\max\{\hat{d}(X, Y), \hat{d}(Y, X)\}$ |
| AVD | $\dfrac{|V_{ML} - V_{AL}|}{V_{ML}}$ |
| Recall | $\dfrac{N_{TP}}{N_{TP} + N_{FN}}$ |
| F1 | $\dfrac{2N_{TP}}{2N_{TP} + N_{FN} + N_{FP}}$ |

∗ Abbreviations – DSC: the Dice Similarity Coefficient; H95: a modified Hausdorff distance (95th percentile); AVD: the absolute percentage volume difference; Recall: the sensitivity for detecting individual lesions; F1: F1-score for individual lesions; TP: true positive; FN: false negative; FP: false positive.

3.4.3. Classification analysis

We assessed whether alterations in WMH volume could be used to classify an unseen individual as CN, MCI, or AD. To this end, we used logistic regression as a classifier and performed the classification under three different conditions in which the input features varied: 1) WMH volume only; 2) clinical variables only (age, cardiovascular risk, education, gender, ApoE4 genotype, and race; these are described in Section 3.4.2); and 3) WMH volume and clinical variables combined. The classification was evaluated using a leave-one-out strategy. We calculated the receiver operating characteristic (ROC) curves from the classification results of each of the three conditions and compared their area under the curve (AUC) values. The classification analysis was performed using Matlab 2019b.
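The leave-one-out evaluation with an AUC summary can be sketched in NumPy as below. This is not the Matlab analysis used in the study: the logistic regression here is a plain gradient-descent fit without regularization, with a learning rate and iteration count chosen arbitrarily for the toy data, and the single feature is synthetic. The AUC is computed with the rank-based (Mann-Whitney) formulation.

```python
import numpy as np

def fit_logistic(x, y, lr=0.1, n_iter=2000):
    """Gradient-descent logistic regression with an intercept (no regularization)."""
    xb = np.hstack([np.ones((len(x), 1)), x])
    w = np.zeros(xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        w -= lr * xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, x):
    xb = np.hstack([np.ones((len(x), 1)), x])
    return 1.0 / (1.0 + np.exp(-xb @ w))

def leave_one_out_scores(x, y):
    """Score each subject with a model trained on all other subjects."""
    scores = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w = fit_logistic(x[keep], y[keep])
        scores[i] = predict_proba(w, x[i:i + 1])[0]
    return scores

def auc(y, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Synthetic one-feature example (hypothetical data): 10 controls, 10 patients.
toy_y = np.array([0] * 10 + [1] * 10)
toy_x = np.concatenate([np.linspace(-2.0, 0.5, 10), np.linspace(-0.5, 2.0, 10)])[:, None]
toy_scores = leave_one_out_scores(toy_x, toy_y)
toy_auc = auc(toy_y, toy_scores)
```

To compare the three feature conditions, the same loop would be run three times with different columns in `x`, and the resulting AUC values compared.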

4. Results

In this section, we compare our results with the results of the 2nd–5th ranked algorithms listed in the WMH Segmentation Challenge as of April 30, 2020. We also show the influence of multi-scale HF on the proposed network’s performance. Finally, the results of the clinical analysis are presented.

4.1. Results of the WMH segmentation challenge

As of March 1st, 2020, the following four teams, as well as our team, were listed as the top five on the challenge leaderboard: 1) sysu_media_2 – deep 2D multi-scale stacked U-Net and ensemble learning; 2) sysu_media – fully convolutional ensemble neural networks (Li et al., 2018); 3) anonymous_20200413 – brain atlas guided attention U-Net; and 4) coroflo – multi-dimensional convolutional gated recurrent units and ensemble learning. More description of each method is available on the challenge website.

Table 4 shows the results of the top five teams on the five metrics (i.e., DSC, H95, AVD, recall, and F1). Bold text indicates the best performance among all algorithms for the given metric. Our method, pgs, achieved the best overall performance.

The results of our method are detailed in Table 5. Part of the test dataset (n = 90) was from three sites (Utrecht, Singapore, and AMS GE3T) that matched the sites from which the training datasets were acquired, whereas the other 20 subjects were from AMS GE1.5T and AMS PETMR, which were completely unseen. Nevertheless, the proposed method demonstrated similar performance on this unseen dataset relative to the three-site dataset.

4.2. The influence of the multi-scale highlighting foregrounds and other parameters on segmentation accuracy

We evaluated the segmentation accuracy of the proposed network under different configurations of the multi-scale HF. Table 6 shows the segmentation results and 95% confidence intervals for the 15 runs of the original U-Net, the U-Net with HF, and the U-Net with AVG. The U-Net with HF achieved the best performance across all metrics compared to the other two methods. Table 7 shows the effect of the weights of the losses at all the output layers on segmentation accuracy. The network with equal weights among all five output layers outperformed the networks using other arrangements of the weights.

4.3. Clinical utility of the proposed segmentation approach

WMH volumes computed using our method were significantly associated with cognitive scores and group diagnosis of the ADNI subjects (Table 8; MMSE, CDR sum, and group diagnosis: Bonferroni-corrected p-value < 0.05; ADAS-Cog: FDR-corrected p-value < 0.05). In other words, the larger the WMH volume, the greater the cognitive decline and the more severe the diagnosis observed in a patient. Furthermore, our linear model showed that a 1-standard deviation (SD) increase in WMH volume corresponded to 0.133 and 0.165 SD increases in ADAS-Cog and CDR sum scores, respectively, and a 0.176 SD decrease in MMSE score.


Table 4

Comparison of our method with other methods in the MICCAI WMH segmentation challenge. Bold fonts indicate the best result among all teams.

| # | Team | Rank | DSC | H95 (mm) | AVD (%) | Recall | F1 |
|---|------|------|-----|----------|---------|--------|----|
| 1 | pgs (ours) | 0.0185 | 0.81 | 5.63 | 18.58 | 0.82 | 0.79 |
| 2 | sysu_media_2 | 0.0187 | 0.80 | 5.76 | 28.73 | 0.87 | 0.76 |
| 3 | sysu_media | 0.0288 | 0.80 | 6.30 | 21.88 | 0.84 | 0.76 |
| 4 | anonymous_20200413 | 0.0314 | 0.79 | 6.17 | 22.99 | 0.83 | 0.77 |
| 5 | coroflo | 0.0493 | 0.79 | 5.46 | 22.53 | 0.76 | 0.77 |

∗Abbreviations – DSC: the Dice Similarity Coefficient; H95: a modified Hausdorff distance (95th percentile); AVD: the absolute percentage volume difference; Recall: the sensitivity for detecting individual lesions; F1: F1-score for individual lesions.

Fig. 3. The receiver operating characteristic (ROC) curves for logistic regression for two types of classification (CN vs AD and MCI vs AD).

Table 5

Results for five metrics of the proposed method in the test dataset of the challenge.

| Site | DSC | H95 (mm) | AVD (%) | Recall | F1 |
|------|-----|----------|---------|--------|----|
| Utrecht (n = 30) | 0.81 | 6.76 | 18.62 | 0.81 | 0.75 |
| Singapore (n = 30) | 0.84 | 4.71 | 15.69 | 0.83 | 0.80 |
| AMS GE3T (n = 30) | 0.80 | 3.74 | 22.03 | 0.83 | 0.82 |
| AMS GE1.5T (n = 10) | 0.74 | 9.30 | 22.09 | 0.75 | 0.77 |
| AMS PETMR (n = 10) | 0.80 | 7.00 | 13.25 | 0.82 | 0.81 |
| Weighted average | 0.81 | 5.63 | 18.58 | 0.82 | 0.79 |
| rank [0…1] | 0.000 | 0.004 | 0.003 | 0.086 | 0.000 |
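
The weighted-average row of Table 5 can be checked from the per-site rows by weighting each site by its number of subjects, as in the short sketch below. The second part assumes that the overall rank in Table 4 is the mean of the five per-metric ranks; that reading is an assumption, and it only approximately reproduces the reported value because the per-metric ranks are rounded.

```python
# Per-site results copied from Table 5: (n, DSC, H95 mm, AVD %, recall, F1).
sites = {
    "Utrecht": (30, 0.81, 6.76, 18.62, 0.81, 0.75),
    "Singapore": (30, 0.84, 4.71, 15.69, 0.83, 0.80),
    "AMS GE3T": (30, 0.80, 3.74, 22.03, 0.83, 0.82),
    "AMS GE1.5T": (10, 0.74, 9.30, 22.09, 0.75, 0.77),
    "AMS PETMR": (10, 0.80, 7.00, 13.25, 0.82, 0.81),
}
metrics = ["DSC", "H95", "AVD", "Recall", "F1"]


def weighted_average(rows):
    """Average each metric column, weighting every site by its subject count."""
    total = sum(row[0] for row in rows)
    return [sum(row[0] * row[i] for row in rows) / total for i in range(1, 6)]


averages = dict(zip(metrics, weighted_average(list(sites.values()))))
for name, value in averages.items():
    print(f"{name}: {value:.2f}")

# Assumption: the overall rank is the mean of the per-metric ranks in Table 5.
per_metric_rank = [0.000, 0.004, 0.003, 0.086, 0.000]
mean_rank = sum(per_metric_rank) / len(per_metric_rank)
print(f"mean of per-metric ranks: {mean_rank:.4f}")
```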

Even though feeding WMH volume alone to the classifier did not achieve a better classification compared to using clinical variables only (Fig. 3), we observed that combining WMH volume with clinical variables resulted in the best classification performances between CN and AD (AUC = 0.75) and between MCI and AD (AUC = 0.67). In the classification of CN vs. MCI, the classification performance was not improved by combining WMH volume and clinical variables (AUC = 0.58) compared to using clinical variables only (AUC = 0.59).
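
For readers less familiar with the AUC values quoted above, the sketch below shows one way to compute an AUC from classifier scores: the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half. The labels and scores are synthetic, and `roc_auc` is a hypothetical helper; it is not the logistic regression pipeline used in this study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Synthetic example: 0 = control, 1 = patient; scores are predicted probabilities.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.35, 0.40, 0.80, 0.30, 0.60, 0.70, 0.90]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.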

5. Discussion

We proposed a new U-Net variant with multi-scale highlighting foregrounds (HF) in this paper. Our network framework was designed to improve the detection of the WMH voxels involving a degree of partial volume effects. We added the multi-scale label images that were max-pooled by HF to a U-Net for WMHs segmentation. The proposed method has been placed at the top rank for the overall score in the MICCAI WMH Segmentation Challenge. The WMH volume computed using our automated approach was significantly associated with cognitive performance scores (MMSE, ADAS-Cog, and CDR-sum) and the dementia diagnosis (CN, MCI, and AD). Furthermore, the automated WMH volumetry improved the classification of unseen subjects into CN, MCI, or AD.

Table 6

The segmentation results and 95% confidence intervals for the 15 multiple runs of original U-Net, U-Net with HF, and U-Net with average-pooled labels (AVG). Bold fonts indicate the best result.

| Method | Statistic | DSC | HD95 (mm) | AVD (%) | Recall | F1-score |
|--------|-----------|-----|-----------|---------|--------|----------|
| Original U-Net | Mean ± SD | 0.8033 ± 0.0204 ∗ | 6.2509 ± 1.6870 ∗ | 18.8337 ± 6.086 ∗ | 0.7261 ± 0.0538 ∗ | 0.7273 ± 0.0466 ∗ |
| Original U-Net | 95% CI | [0.0060, 0.0089] | [−0.7435, −0.3917] | [−2.8959, −1.2721] | [0.0212, 0.0440] | [0.0228, 0.0298] |
| Original + AVG | Mean ± SD | 0.8037 ± 0.0191 ∗ | 6.9537 ± 2.5907 ∗ | 17.6740 ± 6.4449 | 0.7373 ± 0.0552 ∗ | 0.7340 ± 0.0513 ∗ |
| Original + AVG | 95% CI | [0.0055, 0.0085] | [−1.7264, −0.8143] | [−2.0247, −0.1762] | [0.0086, 0.0341] | [0.0126, 0.0266] |
| Original + HF | Mean ± SD | 0.8107 ± 0.0203 | 5.6833 ± 1.8080 | 16.7497 ± 6.9472 | 0.7587 ± 0.0456 | 0.7536 ± 0.0436 |

∗p-value corrected by FDR < 0.05 between U-Net with HF and other (paired t-test).

Abbreviation – SD: standard deviation; CI: confidence interval.

Table 7

Influence of various weighting among all the output layer losses on the segmentation accuracy. Bold fonts indicate the best result.

| Configuration | DSC | H95 (mm) | AVD (%) | Recall | F1 |
|---------------|-----|----------|---------|--------|----|
| Original + HF (all 0.2) | 0.8139 ± 0.0878 | 5.0383 ± 4.3894 | 15.893 ± 24.845 | 0.7725 ± 0.1033 | 0.7575 ± 0.0903 |
| Original + HF (1/15 → 5/15) | 0.8095 ± 0.0952 | 5.4751 ± 4.9443 | 17.890 ± 29.337 | 0.7707 ± 0.1055 | 0.7523 ± 0.0956 |
| Original + HF (5/15 → 1/15) | 0.8093 ± 0.0912 | 6.1427 ± 5.7800 | 16.709 ± 18.847 | 0.7496 ± 0.1020 | 0.7595 ± 0.0746 |

Table 8

Summary of the results of the general linear model using WMH volumes and covariates as independent variables and cognitive scores and group diagnosis as dependent variables.

| Variable | MMSE β | MMSE p-value | ADAS-Cog β | ADAS-Cog p-value | CDR sum β | CDR sum p-value | Group (CN/MCI/AD) β | Group (CN/MCI/AD) p-value |
|----------|--------|--------------|------------|------------------|-----------|-----------------|---------------------|---------------------------|
| WMH volume | −0.176 | < 0.001 ∗∗ | 0.133 | 0.017 ∗ | 0.165 | 0.009 ∗∗ | 0.131 | 0.008 ∗∗ |
| Age | – | – | – | – | – | – | – | – |
| CV risk | – | – | – | – | – | – | – | – |
| Education | 0.062 | < 0.001 ∗∗ | −0.061 | 0.001 ∗∗ | −0.047 | 0.031 ∗ | – | – |
| Gender | – | – | – | – | – | – | – | – |
| ApoE4 genotype | −0.296 | < 0.001 ∗∗ | 0.184 | < 0.001 ∗∗ | 0.421 | < 0.001 ∗∗ | 0.299 | < 0.001 ∗∗ |
| Race | – | – | – | – | – | – | – | – |

MMSE, ADAS-Cog, and CDR sum are cognitive scores.

∗p-value corrected by FDR < 0.05.

∗∗p-value corrected by Bonferroni method < 0.05.

5.1. Comparison to other methods evaluated in the WMH segmentation challenge

Our method has achieved the best overall evaluation scores, the highest Dice similarity index, and the best F1-score in the MICCAI WMH Segmentation Challenge among all the listed 39 methods. Our method has also ranked in the top 5 for the other evaluation metrics (Hausdorff distance 95: 2nd, average volume difference: 3rd, and recall: 5th). Given that the Dice similarity index represents the overlap between the automated and manual segmentation and the Hausdorff distance represents their boundary gap, our achievement of high accuracy in these two indices demonstrates that the proposed method successfully detects the WMH voxels that are located either at the boundary of the WMH or in small WMH volumes and are consequently exposed to partial volume effects (PVE). Indeed, a visual inspection of individual segmentations shows the superior segmentation of such partial volume voxels in the proposed U-Net with multi-scale HF compared to the standard U-Net (Fig. 4). In Fig. 4, the proposed method more accurately segmented WMHs on the boundary of manual WMHs (Subject 1) as well as small clusters of WMHs (Subjects 2-4) compared to the standard U-Net. Despite a relatively low recall (5th rank, meaning relatively more false-negative voxels), we achieved the best F1-score, which is the harmonic mean of recall and precision. This suggests two things. First, our method yielded a higher true-positive rate and a lower false-positive rate than other methods. Second, our method may have difficulty in detecting some WMH voxels even though it detects the small cluster and the border of WMHs better than other approaches.
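
Since the F1-score is the harmonic mean of recall and precision, a precision value can be back-calculated from any reported recall and F1 pair. The sketch below does this for the rows of Table 4. The helper `implied_precision` is hypothetical, and because the table values are rounded to two decimals, the results are approximations rather than figures reported by the challenge.

```python
def implied_precision(recall, f1):
    """Solve F1 = 2 * P * R / (P + R) for the precision P."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision exists for this recall and F1")
    return f1 * recall / denominator


# (recall, F1) pairs copied from Table 4.
table4 = {
    "pgs (ours)": (0.82, 0.79),
    "sysu_media_2": (0.87, 0.76),
    "sysu_media": (0.84, 0.76),
    "anonymous_20200413": (0.83, 0.77),
    "coroflo": (0.76, 0.77),
}

for team, (recall, f1) in table4.items():
    print(f"{team}: implied precision ~ {implied_precision(recall, f1):.3f}")
```

Small rounding differences in the inputs can shift these estimates noticeably, so they should be used only to illustrate the trade-off between recall and precision.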

5.2. Configuration of the multi-scale highlighting foregrounds

The segmentation performance was the best when the weights of the different-scale HFs were all set equal, suggesting that the feature maps generated from all scales are equally important for deep learning of WMHs segmentation. Our method achieved statistically significant improvements in all metrics compared to the original U-Net, and significant improvements in all metrics except average volume difference compared to the U-Net with AVG (Table 6). Although U-Net with AVG tends to perform better than the original U-Net in all metrics except HD95, these differences did not reach statistical significance. These results indicate that the superior performance of U-Net with HF did not merely result from a lucky random seed. The significant improvements in Dice score and F1-score by our method suggest that the proposed method can more accurately detect WMH clusters that are hard to detect by other methods, such as small WMHs involving large partial volume effects.

Based on these results, therefore, we conclude that the proposed HF approach is highly advantageous to WMHs segmentation and likely also to the segmentation of brain lesions that have similar characteristics (size, shape, or intensity), such as multiple sclerosis (Weeda et al., 2019), microbleeds (Seghier et al., 2011), and perivascular space (Ballerini et al., 2018). Since our method uses multi-scale label images emphasizing foreground voxels (i.e., unbalanced data), it can be used in combination with various loss functions (boundary loss, Kervadec et al., 2019; focal loss, Lin et al., 2017; Tversky loss, Salehi et al., 2017; and focal Tversky loss, Abraham and Khan, 2019) to overcome the problem of unbalanced data. It may be worth trying to combine the above-mentioned loss functions and HF in various ways and compare the results to find a loss function that fits well with HF. Additionally, our method can also simply be applied to various networks based on the encoder-decoder structure, such as U-Net variants or other variants of deep supervision methods.
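
The contrast between max-pooled (HF) and average-pooled (AVG) label images discussed in this section can be illustrated on a toy mask. The sketch below is a minimal 2D example that assumes non-overlapping 2 x 2 blocks and a single downsampling step; the helper `pool2d` and the toy mask are hypothetical, and the pooling sizes and dimensionality used in the actual network are not taken from this section.

```python
def pool2d(mask, size, reduce_fn):
    """Downsample a 2D list of lists over non-overlapping size x size blocks."""
    rows, cols = len(mask), len(mask[0])
    pooled = []
    for r in range(0, rows - rows % size, size):
        pooled_row = []
        for c in range(0, cols - cols % size, size):
            block = [mask[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            pooled_row.append(reduce_fn(block))
        pooled.append(pooled_row)
    return pooled


def mean(block):
    """Arithmetic mean of the values in one block."""
    return sum(block) / len(block)


# Toy binary label: one isolated foreground voxel and one 2 x 2 lesion.
label = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

max_pooled = pool2d(label, 2, max)   # keeps the isolated voxel as full foreground
avg_pooled = pool2d(label, 2, mean)  # dilutes the isolated voxel to 0.25

print(max_pooled)
print(avg_pooled)
```

In this toy case the isolated voxel remains a full foreground label after max pooling but becomes a weak target after average pooling, which is one plausible reading of why highlighted foregrounds help with small lesions.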

5.3. Clinical evaluation

Previous studies showed that the increase in WMHs volume relates to cognitive decline (Barber et al., 1999; Dubois et al., 2014; Habes et al., 2016; Lee et al., 2016; Prins and Scheltens, 2015). On the basis of findings in these studies, we evaluated the clinical utility of the proposed approach by investigating whether automated WMH volumetry can predict cognitive performance declines or the diagnosis of a subject (CN, MCI, and AD). Our results demonstrate that the automatically computed WMH volume is significantly associated with cognitive performance in the direction we hypothesized (Table 8).

Furthermore, we hypothesized that feeding the automatically computed WMH volumes to a classifier individually can diagnose subjects. The results of this experiment showed that classification performance for CN vs. AD and MCI vs. AD can be improved using the combined feature-set of WMH volumes and clinical variables. These results are consistent with previous findings that WMHs provide an imaging marker for AD (Habes et al., 2016; Lee et al., 2016; Prins and Scheltens, 2015).

Fig. 4. Segmentation results in the training dataset. From top to bottom are axial slices of four different subjects. From left to right are FLAIR image, ground truth, the results from the standard U-Net, and the results from the proposed U-Net with multi-scale HF method. Yellow voxels indicate WMHs segmented using each method. Color boxes show our approach's improvement of false-positives (orange border) or false-negatives (green border) observed in the standard U-Net.

6. Conclusions

In the current study, we proposed a U-Net with multi-scale highlighting foregrounds (HF). Our various evaluations show that the proposed method improves the detection of WMH voxels with partial volume effects as intended. However, it still remains challenging for our model to retain both high precision and recall. Attention-based models (Chen et al., 2016b; Woo et al., 2018) that effectively learn important characteristics of a target structure for segmentation can potentially be a way to solve this issue. To improve WMHs segmentation, integrating an attention-based model into deep neural networks is thus suggested for future work. Our clinical evaluation demonstrates the clinical utility of our method. Yet, the individual diagnosis of unseen subjects using WMH volumes alone is below the clinical standard. In a further study, other information on WMHs, such as the location or distribution of WMH volumes and the longitudinal trajectory of WMH volume changes, could be incorporated for the improvement of individual diagnosis. The implementation of our proposed method is available at Dockerhub (Merkel (2014); https://hub.docker.com/r/wmhchallenge/pgs).

Data and code availability statements

The data that support the findings of this study are openly available in 2017 MICCAI WMH Segmentation Challenge homepage at https://wmh.isi.uu.nl/data/, reference number (Kuijf, 2019). Our code also is openly available at https://hub.docker.com/r/wmhchallenge/pgs.

Credit authorship contribution statement

Gilsoon Park: Methodology, Software, Formal analysis, Writing – original draft, Visualization. Jinwoo Hong: Methodology, Formal analysis. Ben A. Duffy: Methodology, Software. Jong-Min Lee: Conceptualization, Methodology, Supervision. Hosung Kim: Formal analysis, Methodology, Writing – original draft.

Acknowledgments

This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) & funded by the Korean government (MSIT) (No. 2020M3E5D9080788), a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI19C0745), National Institutes of Health grants (P41EB015922, U54EB020406, U19AG024904), and BrightFocus (A2019052S).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., 2016. Tensorflow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv preprint arXiv:1603.04467.

Abraham, N., Khan, N.M., 2019. A novel focal Tversky loss function with improved attention U-Net for lesion segmentation. IEEE, pp. 683–687.

Abraham, H.M., Wolfson, L., Moscufo, N., Guttmann, C.R., Kaplan, R.F., White, W.B., 2016. Cardiovascular risk factors and small vessel disease of the brain: Blood pressure, white matter lesions, and functional decline in older persons. J. Cereb. Blood Flow Metab. 36, 132–142.

Ashburner, J. , Friston, K.J. , 2000. Voxel-based morphometry —the methods. Neuroimage 11, 805–821 .