Follow spam detection based on cascaded social information

(1)

ContentslistsavailableatScienceDirect

Information Sciences

journalhomepage:www.elsevier.com/locate/ins

Sihyun Jeong

^a

, Giseop Noh

^a

, Hayoung Oh

^b

, Chong-kwon Kim

^a^,^∗

aDept. of Computer Science and Engineering, Seoul National University, Gwanak-gu, Seoul 151–744, Republic of Korea

bSchool of Electronic and Engineering, Soongsil University, Dongjak-gu, Seoul 156–743, Republic of Korea

a r t i c l e i n f o

Article history:

Received 23 November 2015 Revised 27 April 2016 Accepted 11 July 2016 Available online 15 July 2016 Keywords:

Online social network Security

Twitter Follow spam Triad Status theory

a b s t r a c t

Inthelastdecadewehavewitnessedtheexplosivegrowthofonlinesocialnetworkingser- vices(SNSs)suchasFacebook,Twitter,RenRenandLinkedIn.WhileSNSsprovidediverse beneﬁtsfor example,forstering inter-personal relationships, community formations and newspropagation,theyalsoattracteduninvitednuiance.SpammersabuseSNSsasvehicles tospreadspamsrapidlyandwidely.Spams,unsolicitedorinappropriatemessages,signiﬁ- cantlyimpairthecredibilityandreliabilityofservices.Therefore,detectingspammershas becomeanurgentandcriticalissueinSNSs.ThispaperdealswithFollowspaminTwitter.

Insteadofspreadingannoyingmessagestothepublic,aspammerfollows(subscribesto) legitimateusers,andfollowedalegitimateuser.Basedontheassumptionthattheonline relationshipsofspammersaredifferentfromthoseoflegitimateusers,weproposedclassi- ficationschemesthatdetectfollowspammers.Particularly,wefocusedoncascadedsocial relationsand devisedtwoschemes,TSP-Filteringand SS-Filtering,eachofwhichutilizes TriadSignificanceProfile (TSP)and Socialstatus(SS)inatwo-hopsubnetworkcentered ateachother.Wealsoproposeanemsembletechnique,Cascaded-Filtering,thatcombine bothTSPandSSproperties.OurexperimentsonrealTwitterdatasetsdemonstratedthat theproposedthreeapproachesareverypractical.Theproposedschemesarescalablebe- causeinsteadofanalyzingthewholenetwork,theyinspectuser-centeredtwohopsocial networks.Ourperformancestudyshowedthatproposedmethodsyieldsignificantlybetter performancethanpriorschemeintermsoftruepositivesandfalsepositives.

1. Introduction

Theuse ofsocial networking services (SNSs)continues togrow exponentially withthe widespreadadoption of smart devicessuchassmartphones, smartpads,smartwatches,andso on.SNSscanconnect peopleandcan beused toshare information in real time. SNSs such as Facebook, Twitter, and Ren-Ren are becoming the most inﬂuential mediums for buildingsocial relations, as well as forthe sharing andpropagation of information. According to recent announcement, Twitter,oneofthelargest andthe mostpopular SNSs,passed255mmonthlyactive usersandexpects80%ofadvertising revenuefrommobileusers.¹

Afterrepeatedexplosivegrowthinuserpopulation,maturedSNSssuchasFacebookandTwitterbecomeanecessingsin modernlife indevelopedcountries.Inaddition,relatively newSNSssuch asRenRenandSinaWeibo, targetedforspeciﬁc

∗ Corresponding author.

E-mail addresses: [email protected] (S. Jeong), [email protected] (G. Noh), [email protected] (H. Oh), [email protected] , [email protected] (C.-k.

Kim).

1http://thenextweb.com/twitter/2014/04/29/twitter- passes- 255m- monthly- active- users- 198m- mobile- users- sees- 80- advertising- revenue- mobile/ .

(2)

countryorlanguagespeakers,replicatetheeruptiveexpansionoftheearlierSNSs.Forexample,aninﬂuentialusercanbe exploitedbyapersonworkinginonlinemarketingtomaximizethemarketingeffect;malicioususers(attackers)disseminate falseinformationorfraudulentmessagesforthepurposeofphishing,scam,ormalwareintrusion.Thatis,theattackerspost multipleunrelatedmessageswithtrendingtopicstoattractlegitimateusersandencouragethemtoclickthemaliciouslinks inthemessages.

Spamreferstounwantedmessagesfromunknownsources(attackers).OneofthemajornegativeaspectsofSNSisspam.

IntheearlyInternetera,spamappearedinemailsorSMS(shortmessageservice).However,thedomainofspamexpanded intoSNSasthepopularityandusageoftheservicescontinuedtoincrease.FalseinformationfromSNScanspreadrapidlyin realtime.Followspamwasreportedrecentlyandisasystemthattriestoincreasethenumberofrelations(orfriendships) inusersnetworksforthepurposeofsendingspamviaSNS.Theattackpatternofthefollowspambeginswiththeattacker disseminatingspammeraccountsthatfollowalargenumberoflegitimateusersforthepurposeofreceivingafollow-back ordrawingattentiontothespamaccount[15].Duetotheconsequentexposureofthepublictospamcontent,thispractice deﬁnitelylowersthereliabilityofSNS.

In practice,Twitterhas experienced Followspam problems,reducing users trust inmessage distribution andincreas- ing computation overhead.In 2008, Twitterofficially announced that Follow spam accountshad followedso many peo- plethat they threatenedthe performanceofthe entiresystem.² Evenwiththeemerging threatfromFollowspam,ithas been barely investigated orresearched. A contents-based spamfiltering approach is employed in the Twitterspam field [5,12,28,42].However,sincespamcontentskeepchangingtoavoidcontent-baseddetectionbyinsertingURLsandimagesin spammessages,thecontents-basedapproachisvulnerableagainstevolvingmessagepatterns.Toovercomethelimitations ofthecontent-basedapproach,anewapproachusinginherentpropertiesofSNSwasintroduced.

[15]firstemphasizedthatFollowspamshouldbedetectedbyusingitslink-farmingproperty.TheyproposedaPageRank- based rankingalgorithm tolower theimpact ofspammers. However, thisapproachcan be burdensomesince itneeds to utilizesocialnetworkdatafortheentirenetwork (i.e.,all informationfornodesandedges).Therefore,ithasahighcom- putational cost and can barely detect Follow spammers in real time. As a result, a novel detection mechanismwith low computationalcost andrealtimespamfilteringisneededwhilemaintainingthedetectionperformance.Inthispaper,we suggesttwosocial network-baseddetectionschemesforcounteringTwitterspam.First,spammeraccountsarefilteredout withthe use ofa Triad SignificanceProfile (TSP)that measures the structuraldifferences betweenthe frequencies of13 isomorphic subgraphs.We discoveredthat TSPofa spammeraccount isdifferentfromthat ofalegitimate usersaccount withonly 2-hopsocial networks.According toour experiment, 92.1%ofspammers are classifiedcorrectlywhenwe used onlyTSP featuresforclassification. Thisresultsuggeststhatfrequencyanddistribution ofisomorphicsubgraphscould be informativefeaturesforidentifyingspammers.Secondly,weintroduceanewdetectionapproachusingthesocialstatus(SS) theory[27]todistinguishspammeraccounts.Legitimateuserstypicallyfollowaccountsofahigherstatusthanthemselves, whereas spammersare likelytofollowinarandom manner.Withtheseapproaches,we canconfirmthat cascadedsocial network-basedapproaches(TSPandSS)caneffectivelydetectFollowspammerswithalowcost.

Our experiments on real Twitter datasets clearly show that our three mechanisms, TSP-Filtering, SS-Filtering and Cascaded-Filtering, arevery practical forthe following reasons.First,our approachesrequireonly a smalluser-related2- hopneighborhoodsocialnetwork.Actually,thereareonlyfewexistingworksfocusedonsmallneighborhoodgraphinother areas[1,30],butnoneofthemdiscoveredthepowerofneighborhoodsocialnetworkclearly.Therefore,theycanbeapplied tospamdetectionsystemsinsocialnetworksasrealtimesolutions.Second,serviceproviders canmaintainthecredibility andreliabilityoftheir SNSsbyapplyingourapproaches.Legitimateusersarelesslikelytobeblockedbythesystemwith lowfalsepositives(5.7%).Also,ahighproportionoftruepositives(96.3%)providesasecureenvironmenttousers.

Ourmaincontributionsaresummarizedasfollows:

• Fortheﬁrsttime,wediscoveredthefeasibilityofcascadedsocialinformationsuchastriad(orisomorphicsubgraph)and socialstatusbasedpositivelinkprobabilityasgoodfeaturesforclassifyingspammersandlegitimateusersinTwitter;

• Ourapproachesinvolvemorelightweightcomputationforrealtimespamdetectionthanthepreviousscheme(i.e.,global information).Sincetocheckwhetheracertainuserisspammerornot,weonlyfocusedonthe2-hopsocialnetworksof eachuser(i.e.,localinformation).andextractcascadedsocialinformationfromthenetwork;

• Basedon Twitter’sspampolicy,noveltriad graphbasedfeatures andsocialstatustheory basedfeatures are proposed andcascadedtogethertofacilitatespamdetection;

• Tothe best ofour knowledge, ourapproaches are the ﬁrstexperiments withreal world data toprovide the credible andreliableTwittersystemwithtruepositiveresultsofup to96.3%.Webelievethatourﬁndingscanprovidevaluable insightstotheareaofspamdetectionanddefenseinvarioussocialnetworks;

• Tosumup,thispaperistheﬁrstworkwhichclarifythedifferencebetweenspammerandlegitimateuserintheviewof subgraphconsistingofneighborhood.Moreover,wesuggestthenovelthreeapproachessuchasTSP-Filtering,SS-Filtering andCascaded-Filteringforthefollowspamdetectionbasedoncascadedsocialinformation.

Therestofthepaperisorganizedasfollows:Firstly,weintroduceinteresting relatedworksontwitterspamandspam detectionmechanismsinSection2.Then,todescribeourapproach,weintroducethemotivationforourworkinSection3.

2https://blog.twitter.com/2008/making-progress-on-spam .

(3)

Then,weproposetheTSP-FilteringandSS-Filteringmethods withtherespectiveperformance evaluationinSections4and 5.In Section 6,we propose Cascaded–Filtering withhigher true positive andlower false positive than TSP-Filtering and SS-Filtering.Section7presentsanoverallevaluationanddiscussionofourthreeapproaches(TSP-Filtering,SS-Filteringand Cascaded-Filtering)andCollusionrank[15].Lastly, we closethispaperwithfuturework andthe conclusionin Sections8 and9.

2. Relatedwork 2.1. Twitter-spamﬁltering

2.1.1. Contentbasedtwitter-spamﬁltering

Twittercontents such asuserproﬁle, tweets,andtheactivity logprovidevariousoptions fordistinguishingspammers fromlegitimateusers.SpammersgenerallywritetweetsthatcontainahashtagandURLaccordingtothefollowingresearch studiesthatanalyzedcommonlyusedhashtagsandURL:[5,12,28,42].

COMPA[12]detected compromisedaccountsthatwrotespamtweetsbasedonthetweetinglanguage oftheuser’saccount,thetweetingtimewindow,theURL,andthe“mention” receiver.Thisisapersonalizeddetectionapproachthatlearns theprevious behavioralpatternof eachuser. Benevenutoetal.[5]and Martinez–Romoetal.[28] proposed classiﬁcation modelsthatlearnedthenumberofhashtagsandURLs[5]orspamURLs thatareused inspamgroundtruthtweets.Yardi etal.[42] studied spammers’ strategicbehavioralpatterns andalso concludedthat theuse ofhashtags relatedto trend- ingtopicsisa veryeffectivespammingstrategy. Gaoetal.[14]builtatemplatebased onthesentencestructureofspam groundtruthtweetsandusedtemplatematchingtoﬁlteroutspamtweets.

2.1.2. Social-networkbasedtwitter-spamﬁltering

As the attack strategies of Twitter spammers have become more effective,a variety ofsocial-network-based Twitter- spam-detectionapproacheshavebeenintroduced.

Twitterspammersattemptto convincethe publicthat their accountsare legitimateorfamous, andtheydisplay their spamtweetsto the public.It iseasier forspammers toattract user interestby exploitingasmuch social informationas possible.Previousworksidentiﬁedfollower-marketcustomerswhopurchasedfakeaccountsthattheyusedtofollowthem- selves[22,34].Jiang etal.[22]analyzed thebehavioralsynchronicityamongthesefakeaccountsto detecttheir presence.

Stringhinietal.[34]examinedrealfollowermarketsanddetectedfakeaccountsbyidentifyingthoseforwhichthenumber offollowees increasedsuddenlybutdidnot decreaseanyfurther.Viswanath etal.[36] usedPrincipalcomponentanalysis (PCA),ananomaly-detectionapproach,todetectintentional“follow” or“like” behaviorsfrommarketcustomers.

OnTwitter,Followspam(orlinkfarming)refers totheact offollowingamassnumberofpeopletogarnerattentionor follow-backs.³Ghoshetal.[15]pointedoutthatmostFollowspammersattainedahigherrankinexistingrankingalgorithms becausetheirreciprocal-followerrateis82%.Basedontheirﬁndings,theyproposedtheapplicationofCollusionrank.

2.1.3. Subnetworkbasedspamﬁltering

The authorsof [37]directly crawled Twitter’sdata andanalyzed them withboth contents andsocial graph modeling basedapproaches.Based onanalysis ofthecontents, categorizedintolegitimates andspams, they provedthat their pro- posedreputationfeaturehasthebestperformanceamongallsocialgraph-basedfeaturesfordetectingabnormalbehaviors.

However,they onlyconsidered therelationship betweenoutdegreesandindegreesina simpleTwittergraphforthepro- posedreputationfeature.Eventhoughthisschemealsoutilizesasmallgraph(subgraph),asophisticatedgraphdesignisas onlypartofthetriadapproach.Theauthorsof[30] usedneighborhoodsubnetwork(i.e.,egonetwork)todetectcomment spammersinYoutube.TheyalsoutilizedselecteddiscriminatingmotifsandanalyzedtheminYoutube video-userrelation network.It seemsvery similarwithourwork, butitusedspam campaignrelatedmotifs.Therefore, itcannot distinguish spammerswhen theyuseother sophisticatedstrategy. [1]extractedweightedsubgraphsfromtarget networkandutilizes themasdiscriminatingfeaturestodetectspammers.Italsoanalyzedsubgraphsbytypesofanomalies.Basedonpowerlaw characteristicofsocial network, it comparedspammer to legitimateusers neighborhood subnetworkinterms ofedge or weightdistribution.

2.2.Linkspamﬁltering

Linkspamhasbeenwidelystudiedintheweb spamdetectionﬁeld. Thistypeofspamispresentedasnumerouslinks froma largenumberofweb pagestoa fewtarget webpages. StudiesonLinkspam havebeenreceivingattentiondueto thelimitationsof PageRank[31]andHITS[24].Thanks tosigniﬁcant link characteristics,manyweb linkgraph structure- basedspam detectionapproacheshave beenintroduced [3,8,18,25,40,41].TrustRank [18]is one of the most popularLink spamdetectionalgorithms.Itpropagatesthe‘non-spam’labelthroughsocialnetworks. Likewise,BadRank [40]propagates the‘spam’labelthroughsocial networks.ComparedtoPageRank[31],thesetwo algorithmsutilize‘non-spam’and‘spam’

3https://blog.twitter.com/2008/making-progress-spam/ .

(4)

labelpropagationtolowertherankofspamwebpages.[4]proposedanadvancedLinkspamdetectionalgorithmusingboth

‘spam’and‘non-spam’labelpropagation.Theselabelpropagationalgorithmsrequireseedknowledgesuchasasetofspam nodesandasetofnon-spamnodes.Therefore,noiseintheinitialdatasetcanbeacriticalissueforthesealgorithms.

2.3. Sybildetection

MostSNSspamdetectionsystemsrelyonSybildetectionalgorithms.Peer-to-peersystemsconsistofmultiplenodeswith severalconnections(edges).Thesystemhastoensurethateachnodeisclearlyidentiﬁed;otherwise,amalicioususer(Sybil) canattempttocreatemultiplefakeidentitiesmasqueradingashonestnodes[11].Theycanthenmanipulatethesystem(by zombiemachines) orattack thesystemin orderto gain illegalproﬁtsuch aspositive feedbackin thereputation system, gettingmorevotesininternetpolls,ortargetingsitestoincreasetheirrankinGooglePageRank.

TherearetwomainapproachestotheSybilattack:centralizedanddecentralized.Centralizeddefenseobtainsadmission control throughacentral authority. Decentralizeddefense hasnotrustedcentralauthority andcontrolstheIPaddress by bindingan identity.Forthedecentralizedattack, SybilGuard[44]proposesthat wheneach nodereceives √

kindependent samplesfromasetofhonestnodesofsizek,arandomwalkcanbeperformedtotrytodiscovertheSybilidentitiesbyusing theintersectionprobabilitybetweenhonestandSybilgroups.ThenumberofavailableattackedgesinSybilaretheoretically boundO(^√^k^log^k)^.^SybilLimit^[43]îsânênhanced ^methodîntroduced^by^[44]^.^They^reduced^theâttackêdge^boundⁱⁿ^near optimalO(logk) by exploitingvarious random walk methods. GateKeeper[35] adapts theticket distributionalgorithm to obtaineachnode’sprobabilityofSybil/honestusers.

Secondly,the centralized method.SybilInfer [10] assumedthat the central authority knowsthe entiresocial network.

After random walks,eachnode isassigned aSybil/honest probability bymeasuring the Bayesianinference. SybilDefender [39] assumedthat when starting a random walk in Sybil nodes, it will pass the intersection between honestand Sybil nodes.TheseapproachesapplycommunitydetectionalgorithmstoﬁndSybilcommunities.SumUp[35]addressesthevote aggregationproblembyconsideringeachvoter’strustgraphandcalculatingasetofmax-ﬂowpathsfromallvoters.

Currently,therearemanySybildetectingmethodswithvarioussocialnetworkproperties.SybilRank[6]investigateseach nodeby assumingthathonestnodeswillhavehigherdegree-normalizedlandingprobability.Arandomwalkisperformed tomeasure therankingtodeterminewhethertheaccountis Sybilornot. SybilShield[33]utilizesa multi-communityso- cial networkstructure environment,considering sociologicalpropertiestocut theedge betweenhonestandSybil groups, performingmodifiedrandomwalksandfiguringoutthepropertiesofmulti-hopedges.SybilBelief[16]detectsSybilnodes basedona semi-supervisedlearningframework.ThismethodmodifiestheLoopyBelief propagationsystemandthepair- wiseMarkovrandomfieldtodefineeachnode’sclassification(Sybil/honest).

2.4. Dataminingapproachforspamdetection

In spam detection problem, most ofexisting studies related the problemto classification taskas follows. In general, spam classifier firstly learn features extracted from SNS using pattern of legitimate users such as the number of followees/followers, postuploading time and contents information of user profile and posts. Then, classifier determines if newlygiventestuserisspammerorlegitimateuserbycomparingtolearnedpattern.Therefore,ifthetestusersbehavioral patternfeatureisfarfromlegitimateuserspatternfeature(learnedfeature),theclassifiercouldclassifyanddetecttheuser asspammer.Insome cases,classifiersadoptclassification thresholdtohandlethetradeoff betweentruepositiveandfalse positive.SincereliabilityandcredibilityiscrucialinusingSNS,low falsepositiveistreatedparticularlyaccordingtospam detectionsystem.

In detail,[23] used linearregression forclassifyinganddetecting spammers and itstated that deviant usersfromle- gitimate users patterns could be classified asspammers. Similarly, [36] utilizedPCA (Principal Component Analysis) and it detected Facebook spammers who aredistant fromthe principal componentof legitimateusers. Also, Markovrandom fieldbasedspamclassificationapproachwasproposedin[13].Especially,contents-basedspamdetectionapproacheslargely usedNavebayesclassifierorSVMclassifierwithcontents-relatedfeatures.Intheearlystageofspamdetection,[17,21]and many similar studies analyzedtoken or wordin spamcontents andapplied extractedfeatures to the Nave bayes classifier.[32]proposedoptimizedversionofSVMspamclassifierandachievedefficiencythanprevious ones.[45]relievedfalse positiveproblemby adoptingboundaryregiontoclassificationresult.Sincemostofspamclassificationisbinaryclassifica- tionofspamandnon-spam,ternaryclassificationgivesthreeclassification labelsincludingboundaryregionwhichmeans reconsideringregion.

3. Motivation

3.1. Web,socialnetworkandTwitter

Like theWeb,wheretheimportanceofeach pageislargelydeterminedby whoreferenceswhom,the inﬂuenceofin- dividualsonmanySNSsisdeterminedbythenumberofindexestheyreceive.Forexample,thenumberoffollowersisthe mostimportantfactoronTwitteranddeterminessocialcapital,whilethenumberof“likes” onFacebookissimilar.Thisfea- ture,however,hasattractedaplethoraoffraudswhotrytoincreasetheimportanceorreputationofentitiesbygenerating

(5)

bogusindexes,leadingtothedeﬁnitionofthespamdexingclassofattacks.Twitter’ssizehasexpandedexponentiallyover thepastseveralyearsanditnowhasover255millionactiveusersafterasuccessionofrapidgrowthspurts thatresulted inan averageannualgrowthrateof25%.⁴ Notably,thesocial-interactionstructureofTwitterisveryinteresting.Users can followfamous persons–usually celebritiesor standoutopinion leaders–thatthey are unacquainted with, aswell asclose friends.Therefore,Twitterplaysaninformation-propagationroleinadditiontotheroleofanonlinesocialnetwork[26].

Moreimportantly,contrarytotheWeb,Facebook,andmanyothersocialnetworkswherespamindexesusuallyoriginate fromfakeaccountsandfromacircleofcolludinglinkfarms,amaliciouspersoncancollectfollowersorfansfrominnocent, social-capital-conscioususersonTwitter. Further,thehighrateoffollow-backs makes thedetectionofTwitterspammers moredifficultbecausetheyreceivemanylegitimatefollowersjustbyfollowingtargetusers[15].Existinglink-farm-detection methodswellfittedforwebspamdetectionfieldthereforelosemuchoftheireffectivenessinthedetectionofspamdexing onTwitter.

Inthispaper,wedemonstratethefeasibilityofa cascadedSNS-basedsecurityschemetodetect Followspam.Different fromthe unpractical andheuristic approachesof previous works,withthe characteristics offollow-backs we apply triad frequenciesandstatustheory fortheﬁrst time inourFollow spamdetectionscheme.Notethat the mainpurposeofthis studyisnottheattainmentofengineeringoptimizationfortheperformanceenhancementofpriorschemes,butrather,itis theexaminationofthefeasibilityofasocial-network-basedsecurityschemeinapopularonlinesocialnetworkingsite,i.e.

Twitter.

Beforeweformalizetheproblem,weaddressthecharacteristicsofTwitter.All13typesofdirectedsocialgraphmodels andsocial statuswithlocalinformationcanbe observedinTwitter.Additionally,Twitterhaswell-deﬁnedsocialrelations intheformofthe“follower” and“friend” relationships. Inadditionto thesecharacteristics,spamsshow up frequentlyin Twitter.WepracticallyexploitthepolicyofTwitteragainstspamstodesignourproposedscheme.Twitter’sspampolicyis summarizedasfollows:

• “If youhave asmall numberoffollowers comparedto the amountofpeople you are following”,theaccount maybe consideredaspamaccount.

• “Multipleduplicateupdatesononeaccount” isafactorusedtodetectspam.

• “Ifyourupdatesconsistmainlyoflinks,andnotpersonalupdates”,itisconsideredspam.

Firstpolicy is relatedwiththe social-interaction structureof Twitterwhilesecond andthird policies haveto dowith spamcontents.Mostpreviousworksfocusedoncontentsanalysisorfullinformationusageofsocialnetworkswithahigh amountofcomputationaloverhead.Differentfrompreviousapproachesconsideringsecondandthirdpolicies,weaccurately detectFollow spam usingonly localinformationofthe social-interaction structureofTwitter. Thatis,ourcascaded social networkschemeisapplicableregardlessofthecontentsuchasTweet,timeandlinks.

3.2.LinkspamandFollowspam

TheconceptoflinkfarmingoriginatedfromWebspam.TheintentoflinkandFollowspamistoincreasethepopulation ofaspeciﬁc (target)website orreputation.Sincenormalsearch engines (e.g.,Google)placepopular websitesonthe ﬁrst page,link-farmingwebsitescreatenumerouslinkstothetargetwebsite.

PageRank[31],themostpopularwebsiterankingalgorithm,rankswebsitesbasedontheindegree ofthesite. Actually, thepopularityofinlink nodesisalso important,butnumerousinlinks arelikely toincrease the targetwebsite’s ranking.

Therefore,linkfarmsgenerallycontainplurallinks,andthelinksarecreatedfrommanynodestoafewtargetnodes.

Followspam,aspecialattackstrategyonTwitter,hasbeenshowntobealinkfarmingtechnique.Fig.1showsanexample ofFollowspam.

Followspamconsistsofnumerouslinks,butsomedifferencesexist.First,linksarecreatedbyafewspammernodesand theytargetmanylegitimateusernodes.Morespeciﬁcally,originallinkspamdenotesmanyspammernodes-fewlegitimate nodesrelationship whileFollow spam denotesfew spammernodes-many legitimate nodes.Second,the purposeofFollow spam is not just linking, butreceiving a follow-back(reciprocal link). A user on Twittercan seetweets (contents) from anotheruserwhen he/shefollows(subscribe)the other’saccount.Consequently, spammersneed tobe followedby other userstoshowtheirspammingcontentssuchasURL,imageandadvertisement.

Therefore, to gain more followers and attention, spammers send a large number of follow linksto legitimate users.

Surprisingly,themajority offollowerswho Follow spam accountshavebeenpreviously targetedby spamaccounts. Tobe specific,82% oflegitimateuserssenda follow-backto spammers[15].Ifsisa spammeraccount andhis/heroutlinks are allattack edgesforfollowback,theattack strengthofs(AS(s))isdefinedin(1).Wedefinedtheratiobetweensuccessful followspamlinks(followbacklinks)ofspammers(N_fb(s))andtotalfollowspamlinksofs(N_f(s))asAS(s)asfollows:

AS

(

^s

)

=Nf b

(

^s

)

/Nf

(

^s

)

⁽¹⁾

In(1),N_f(s) hasthe samemeaning ofoutdegreeofs.Therefore, theattack strength (AS(s))offollowspamrelies ona successfulnumberoffollowbacks.

4http://thenextweb.com/twitter/2014/04/29/twitter- passes- 255m- monthly- active- users- 198m- mobile- users- sees- 80- advertising- revenue- mobile/ .

(6)

Fig. 1. Overview of Follow spam .

Table 1 Twitter data set.

The number of total users The number of spammers 54 ,981,152 41 ,352

Table 2

Performance estimation of Collusionrank [15] . True positive False positive Spammer 94 .0% 9 .9%

Legitimate user 90 .1% 6 .0%

3.3. TwitterdatasetandCollusionrank

Weconductedanexperimentwithalarge-scaleTwitter-followlinkdatasetthatwasprovidedbyMPI-SWS[9].Thisdata setwascollectedinSeptember2009,contains1,963,263,821directedsociallinks,andthenumberofcorrespondingusersis 54,981,152.WealsousedtheFollowspammerdatasetfrom[15]thatcontains41,352spammersasthegroundtruth.

Table1showsTwitterdatasetusedinourexperiment.

WecomparedtheperformanceoftheproposedmethodwiththatofCollusionrank[15]presentedinWWW2012.Collu- sionranklowerstheinﬂuencescoresofuserswho connecttospammersandﬁlteroutthoseuserswhogainhighrankings by linkfarming.It isauser-ranking algorithmbasedon PageRank.Sincewe usedthesame datasetasCollusionrank,we comparetheperformanceoftheproposedmethodwiththetruepositiveandfalsepositiveresultsofCollusionrank.Accord- ingto[15],Collusionrankdetected94%ofthe41,352spammersthatappearedinthelastlowrankingscores10%ofranking positions;consequently, wecouldextract thefalse positivesoflegitimateusers(9.9%)fromCollusionrankwithadetection thresholdof10%.We reiteratethatthedetailedperformance ofCollusionrankisnotdescribedin[15],exceptforthetrue positivesforspammerswithinthethresholdofthelast10%. Table2istheestimatedperformanceofCollusionRankfroma truepositivevalueof94%.

Collusionrankhasgoodperformanceintermsoftruepositiveandfalsepositive,butithassomelimitationsasfollows:

First, itneeds to analyze every node andedge ina social network. The PageRank-based algorithmtypically estimates every node’sreputation orrankingdependingon thereputation ofother nodesand edgeformation.However, to classify spammers,computingranksoneverynodeisnotpractical.InrealSNSs,spammersdisseminatespammingcontentssimul- taneously.Therefore, a realtime spam filteringapproach is moreeffective; fastspam filteringsignificantly decreases the numberofvictimsofspam.Assuch,analyzingallsocialnetworkinformationisnotverypragmatic.

(7)

Table 3

Average indegree and outdegree.

Average indegree Average outdegree

Spammer 303 .6 866 .5

Legitimate user 401 .5 462 .0

Table 4

Performance evaluation using only indegree and outdegree.

Classiﬁer Type True positive False positive

J48 Spammer 83 .9% 19 .3%

RandomForest Spammer 80 .8% 19 .6%

Second,it has ahighproportion offalse positivesin detecting legitimateusers. If9.9% oflegitimateuser accountsin Twitterwereblocked,mostpeoplewouldstopusingTwitter.Ahighnumberoftruepositivesindetectingspammersisalso crucial;butthecredibilityandreliabilityoftheservicearemaintainedbykeepingthenumberoffalsepositiveslow.

Inthefollowingsections,weproposecascadedsocialinformation-basedspamdetectionmechanismsthatovercomethe limitationsofCollusionrank.

3.4.Indegreeandoutdegreeofsampledataset

SinceFollowspamhasalinkfarmingpropertythatinvolvescreatingmanyoutlinks,weshouldinvestigatewhetherspam- mersinTwitterhaveahigheroutdegreethanlegitimateusers.Also,basedonTwitter’sspampolicy,wefocusontheratio oftheindegreetotheoutdegreeforbothlegitimateusersandspammers.

Inthispaper, we userandomly selected 1000legitimate usersand 1000 spammers asthe experimental dataset. We determinedalargeenoughsamplesizewitha95%conﬁdenceleveland5%conﬁdenceinterval.

Table3istheaverageindegreeandoutdegreeoflegitimateusersandspammers.

Inevitably,spammerstendtohaveapproximatelytwotimesasmanyoutdegreesaslegitimateusers.Themostinteresting observationisthattheratiobetweentheaverageindegreeandoutdegreeshowssigniﬁcantdifferencesbetweenlegitimate usersandspammers.Theaverageindegreeandoutdegreeoflegitimateusersaresimilarandtheratiobetweenthetwois 0.86.However,theratiobetweentheaverageindegreeandoutdegreeofspammersis0.35.Thisindicatesthattheindegree andoutdegreecouldberoughlyinformativeforclassifyingspammers.

Toclassifyspammers by onlyindegree andoutdegree,we used J48andRandomForest classifiersbuiltinWeka.⁵ Both algorithms are decisiontreebased classifiers.While J48generates only one decisiontree,RandomForest corrects overfit- tingproblemsbyconstructingmultipledecisiontreesduringthetrainingprocess.Table4istheclassificationperformance evaluationusingonlyindegreeandoudegree.

AsmentionedintheTwitterspampolicy,weprovedthat thenumberofoutdegreescanbea highlyusefulfeaturefor spamclassiﬁcation.However,comparisonusingonlythenumberofdegreetypesbetweenFollowspamsandlegitimateusers isnotenoughofaperformancemeasuretoinspectspammersasshowninTable4.Tomakeupforthespamdetectionissue, wetriedtoapplyTSPandSSasdescribedinSections4and5.

4. Twitter-spamdetectionwithtriadsigniﬁcanceproﬁle

4.1. Followspamdetectionwithtriadsigniﬁcanceproﬁle(TSP-Filtering)

Apriorstudyshowedthat,interestingly,severaltypesofnetworksfromdifferentﬁeldssuchasbiologyandthesocialsci- encessharecommonproperties.Inparticular,[29]showedthatsomeofthe13isomorphictriadtypesareover-represented whilesome areunder-represented.Tothebestofourknowledge,weﬁrstusedthisfacttodiscernTwitterFollowspam.In termsofasocialgraph,auserisanodeandafollowfromapersontoanotherpersonisadirectedlinkfromthefollower (theperson)tothefollowee(anotherperson).Fig.2showsthe13isomorphictriadclassesintroducedby[38].

NotethataFollowspammerinevitablygeneratesmanyfollows(directedlinks)toreceivefollow-backs(redirectedlinks).

Foreach spammer,wefoundall ofthecorrespondingtriadsandcountedthefrequencyofthe13 isomorphictriadclasses (fordetailedrepresentationofthetriadclasses,referto Fig.2).Weperformedthesameprocedures withlegitimateusers andcomparedthedifferencesbetweenthefrequencyofeachtriadclassforboththespammer-centrictriadsandthelegiti- mateuser-centrictriads.Wearguethatthetriadfrequenciesofrealsocialnetworksaredifferentfromthoseofspammers.

5http://www.cs.waikato.ac.nz/ ^∼ml/weka/ .

(8)

Fig. 2. 13 isomorphic triad classes for analyses.

Fig. 3. A user u ’s social network graph G u(red-colored edges and named nodes). (For interpretation of the references to colour in this ﬁgure legend, the reader is referred to the web version of this article.)

Thetriadfrequencies ofspammersaresimilartothoseofrandomnetworkswiththesamegraphpropertiesincludingthe averageindegreeandtheaverageoutdegree.

ForagivenlocalnetworkG_uofauseruasshowninFig.3,weestimatedthenumberofoccurrencesforeachtriadclass.

Gu consistsofsociallinksbetweenuand1-hopneighborhoodsofu.Supposethat uisfollowingr₁,r₂ andr₃ andis also followedbyrr,r₅ andr₆.Inthiscase,r₁,r₂andr₃ are“Followees” ofu.Inthesamemanner,r₄,r₅ andr₆are“Followers”

ofu.Also,therearedirectedsocial linksbetweenthem(representedasred-colored linksinFig.3).Todeterminewhether useruisaspammerornot,weanalyzeduseru’ssocialgraphGuconsistingof7nodesand10edges.Thisisasubgraphof aTwittersocialnetwork,andeveryusercanhavehis/herownsocialnetwork.

Todiscoverthephenomenon wherebyspammersocialnetworkscomprisesubgraphfeatures thataredifferentfromle- gitimateusersocial networks,wecomparedspammertriadfrequencieswiththoseoflegitimateusers.Foreachtriadclass i,thestatisticaltriadoccurrenceisdescribedbytheZ-scoreZ_i[29]inEq.(2).

Zi=

(

^Nspami−<Nlegiti>

)

/std

(

^Nlegiti

)

⁽²⁾

whereNspam_i istheoccurrencenumberofthetriad classi inaspammer’snetwork, and<N_legit

i> andstd(^Nlegit_i)^are^the mean and standard deviations of its appearancesin the legitimate user networks, respectively. The TSPis therefore the vectoroftheZ-scoresthatarenormalizedtolength1inEq.(3)

TSPi=Zi/

Zi2

¹₂

(3)

(9)

Fig. 4. Average TSP of spammers and legitimate users. Error bar means standard deviation of spammers TSPs.

Tovisualizethisinsightfromnetworkcomparison,wecomputedtheaveragevectorofTSPforrandom1000spammers fromtheoriginaldataset[15]andnormalizedit.Similarly,wealsocomputedN_legit

i basedonrandom1000legitimateusers.

Wedeterminedthatthesamplesize1000waslargeenoughwith95%conﬁdenceleveland5%conﬁdenceinterval.

Fig.4compares the TSPsofspammers andlegitimate users.Legitimate usersgenerallyhave moretriadscompared to spammers,meaningthattheneighborsoflegitimateusersaresociallywellconnectedwithisomorphictriadpatterns;therefore,thisphenomenonproducedmoretriadcountsoverall.Alternatively,spammershavelowertriadcountsthanlegitimate usersbecausetheir1-hopneighborsarenotlikelytoacquaintthemselveswiththeother1-hopneighbors.

Sincespammersusuallyselecttheirfolloweesrandomly,therearefewconnectionsbetweenspammers’neighbors.Triad 021D,however, indicates exceptional triad counts, whereby spammers have more021D triadsthan legitimate users.The 021Dtriadclassrepresentstheplural-followingactionsfromanode.Italsorepresentslink-farmingactivity.Sincetheactions ofFollowspammersinvolvetheproductionofnumerousout-links,theirhigh021Dtriadcountsmakesense.Thedistinction betweentheTSPsofspammersandlegitimateusersthereforeexplainswhyourTSP-detectionapproachisfeasible.Wegive astatisticalanalysisofsensitivityofeveryproposedstrategyindiscussionsection(Section7).Inthefollowingsection,we provideatrue-positiverateandfalse-negativeratetosupporttheexcellenceofthismethod.

Asmentionedearlier,werandomlysampledsetsof1000spammersand1000legitimateusersfromtheoriginaldataset [15]andconductedanexperimentwithTSP. Wedetermined that thesamplesizewaslarge enoughwith95% conﬁdence leveland5%conﬁdenceinterval.

4.2.TSP-ﬁltering

ThefollowingprocesswasusedfortheapplicablevalueoftheTSP-Filteringbasedontheexperiment.First,weobtained themeanandstandarddeviationsofeachfrequencyforthetriadclassacrossalloftheTwitteraccounts.Since1000legiti- mateusersaresufficientlyrepresentativetosupporteveryTwitteraccount(confidencelevel:95%,confidenceinterval:5%), we computedthe meanandstandard deviations ofthe 1000 randomly-sampledlegitimate users.The meanvalue of the triadclassiis<N_legit

i>andstandarddeviationofthetriadclassiwith<N_legit

i>isstd(^Nlegit_i),respectively.Figureshows thesampleduserslocalsocialnetworksandtriadfrequencynormalization.

Second, we countedthe spammer-triad frequencies and thelegitimate user-triad frequencies forevery social-network subgraphofeveryuseraccount; thetriadfrequencyrepresentsthetriadappearancesineach usernetwork.Then,we nor- malizedthefrequencieswith<N_legit

i>andstd(^Nlegit_i)⁽^Fig.⁵^).În^the^caseôf^the^spammer,^we^canûseÊq.⁽²⁾^;^however,ⁱⁿ thelegitimateuser’scase, wecanusethere-translatedEq.(4),whereNspam_i istheoccurrencenumberofthetriadclassi inalegitimateuser’snetwork:

Z_i=

(

^Nlegiti−<N_legit_i>

)

^/std

(

^Nlegiti

)

⁽⁴⁾

Lastly,we computedeachuser’s TSPusingEq.(3).Auser’sTSPcomprises 13Z-scoresof 13triad classes.These13Z- scorescouldbe informativefora machine-learningmechanism. Wealsoaddedtwofeatures formachine learning,namely theindegreesandoutdegreesofeachuserbasedonthemotivationexperiments.

(10)

Fig. 5. Triad frequency normalization.

Table 5

Performance evaluation using TSP-Filtering (w/o indegree and outdegree).

J48 Spammer 91 .0% 9 .4%

Table 6

Performance evaluation using TSP-Filtering (w/ indegree and outdegree).

J48 Spammer 91 .7% 9 .2%

4.3. PerformanceevaluationofTSP-ﬁltering

We conductedthe experimentusing J48andRandomForest implemented inWeka (10-fold validation).Table 5 shows theperformance evaluationresultsfortheTSPmethodwithoutindegrees andoutdegrees.Table6showstheperformance evaluationresultsfortheTSPmethodwithindegreesandoutdegrees.Ontheother hand,Table6showstheperformance evaluationresultsfortheTSPmethodwithindegreesandoutdegrees.

FromTable5,evenwithoutindegreesandoutdegrees,TSP-FilteringforRandomForesthasapowerfulspam-classiﬁcation performancewith92.1%.FromTable6,theproposedapproachwithindegreesandoutdegreeshas92.3%truepositivesanda lowerproportionoffalsepositives(7.6%)thanCollusionrank(9.9%).UnlikeCollusionrank,whichneedstoanalyzeeverylink torankevery node,ourTSPapproachisafastandlow-costdetectionmechanismthat usesonlythe1-hop-neighborhood network foreach user. Therefore, the TSP approach is a more lightweight and eﬃcient mechanism for detecting follow spammersinrealtime.

Todeﬁne a preferredsequence ofattributes, we measured the importance of feature attributesbased oninformation gainasshowninTable7.InTable7,featureattributeslistedindescendingorderofinformationgain.Informationgaincan becomputedasfollows:

In f ormationGain

(

C,A

)

=Entropy

(

^C

)

−Entropy

(

^C

|

^A

)

⁽⁵⁾

In Eq.(5),C represents thegiven class such asspammer andlegitimateuser. Ais the feature attribute.For example, InformationGain(spammer,021D) refers to the amount of entropydecrease in a spammer class when the feature attribute 021Disprovided.

(11)

Table 7

The importance of feature attributes based on information gain (TSP- Filtering).

Feature attributes Information gain

021D 0 .2867

021U 0 .2556

021C 0 .2366

111U 0 .2267

201 0 .1418

030T 0 .1408

111D 0 .1399

120D 0 .136

120U 0 .1075

120C 0 .0871

300 0 .0859

210 0 .0794

030C 0 .0465

Asweshowedintheexperimentresults,021Disthemostsignificantfactorinclassifyingfollowspammersbecauseofits propertyoftwo out-edges.Followspammers tendtohavemanyout-edges tolegitimateusers.Thistendency ispresented naturallyin 021D. The following attribute, 021U, is alsosignificant inclassifying legitimate users because ofits two in- edges.LegitimateusersarelikelytohavemorefollowersthanspammersatstablepointsoftheTwitterSNSsystem.Twitter isaveryspecialSNSduetoitssubscriptioncharacteristics.Themoreinformativetheusers’contentsare,themorefollowers subscribetotheuser.Sincemostspammersuploadadvertisementsorspammingcontentsontheiraccount,theyhavefewer followersthanlegitimateusers.Understandably, some legitimateTwitteruserstrytofollowmanyusers atthe initialand transitionpointsfor subscriptions or other reasons. However, legitimate usersat the steadypoint have a larger number of indegrees (i.e., followers) than outdegrees (i.e., friends) dueto effective influence or fruitful contents because of the psychologyofpopularity.Inaddition,theremaining featuresofTSParegradually reflected inthedistinctionbetweenthe followspammersandlegitimateusersbecauseofthesocial-interaction.

5. Twitter-spamdetectionwithstatustheory 5.1. Followspamdetectionwithsocialstatus(SS-ﬁltering)

Basedonsignsofdirectedlinksderivedfromsocialmediasitessuch asEpinions,Slashdot,andWikipedia,socialstatus theory is first applied to predict certain kindsof social relationships [27]. To findstrong consistency in how the model fitsthedataacross othersocial networksaswell asthepowerofinfluenceinTwitterusingourTSPfilteringscheme,we additionallyproposeSS-Filtering forTwitternetworkanalysis.Twitterhasaspecialcharacteristicwherebyusersfollow(or subscribeto)otherusersandthiscanbetranslatedintothesocial-statustheory.Generally,mostTwitterusersfollowusers whoaremoreinfluentialthanthemselves.EspeciallyonSNS,amore-influentialuserissimilartoauserwithahighsocial status.When a legitimateuser follows others witha higher social status, is a spammer’s following patternsimilar to a legitimateuser?Wefocusedonthefactthatspammersarelikelytofollowthepropertiesofarandomnetwork.

Oursocial-status-basedintuitionisderivedfromthefollowingquestion:“Isthefollowingpatternofaspammersimilarto thatofa legitimateuserwhen alegitimateuserfollows otherswith ahigherstatus?”We focusedonthe factthatspammers are likely to followthe properties ofa random network. In fact, spammers have little knowledge of the social relations betweenlegitimate users. They tend to select target users randomly. One may argue that a strong spammer is able to gatherinformationregardingthesocialrelationsoflegitimateusers.However,suchsocialengineeringhasonlypartialand instantaneousinﬂuencecomparedtotheaverageandapparentinﬂuenceoflegitimateusers.

5.2.SS-ﬁltering

Asocial-statusspam-ﬁlteringsystemcancomputethefollowingmetricbasedonuseru’s2-hopsocialnetwork.Toapply Twittertothestatustheory,we deﬁnedthestatusofa useruastheratiooftheindegree (indegree(u))totheoutdegree (outdegree(u))oftheuserasshowninEq.(6).

status

(

^u

)

=indegree

(

^u

)

/outdegree

(

^u

)

⁽⁶⁾

WethendeﬁnedapositivelinkinTwitterastheprobabilitythattheuserfollowsanotheruserofahigherstatus.Fig.6 showstheconceptofsocialstatus,positive linkandnegativelinkinstatustheory[27]andsocialbalancetheory[7,19,20]. Apositive link,‘+’linkmeansthat anodeXlinkstoeach nodeA,B,CandD,whohashighersocialstatusthanX.Onthe otherhand,anegativelink,-meansthatthenodeDlinkstoanodeX,whohaslowersocialstatusthanD.

(12)

Fig. 6. Social status, positive link (orange link) and negative link (blue link). (For interpretation of the references to colour in this ﬁgure legend, the reader is referred to the web version of this article.)

Table 8

Average values comparison of social status-related features between legitimate user and spammer.

Social status-related features Legitimate user Spammer Average status of a user 1 .82 0 .39 Average positive link probability ( PLP ) 0 .83 0 .91 Average status of followees ([0-1] scale) 0 .05 0 .0 0 04

Fig. 7. Comparison between legitimate users and spammers for average status of followees.

ThefollowingEq.(7)istheexpressionofthepositivelinkprobability(PLP)ofauseru:

PLP

(

^u

)

=Npos

(

^u

)

/

(

^N^pos

(

^u

)

+Nneg

(

^u

))

⁽⁷⁾

whereNpos(u)meansthenumberofpositivelinksandNneg(u)meansthenumberofnegativelinks.Weassumedthatnega- tivelinksincludelinksbetweensamestatus.Additionally,wealsoconsidertheaveragestatusoffollowees.Table8compares socialstatus-relatedfeaturesbetweenalegitimateuserandspammer.

Wecan deriveinteresting observationsfromTable8.First, theaveragestatusofa spammerissignificantlylower than that ofalegitimateuser,andthisis attributedto thespammer’slink-farmingproperty.Inthisobservationoftheaverage statusofauser,thetwoproposed schemes(TSPfilteringandSSfiltering) havesomethingincommon.Second,thePLPof aspammertoobtainagreaternumberofreciprocalfollowingsishigherthanthatofalegitimateuser.Actually,thisresult countersourassumptionthatspammersselectfolloweesrandomlyandthePLPofaspammerwouldbelowerthanthatof alegitimateuser.However,thisresultismainlyduetothesignificantlylowstatusofspammeraccounts(i.e.,agoodmany outdegreesofeachspammeraccount).Wearrivedatthisconclusionfromcomparisonoftheaveragestatusoffollowees.In Table8,wecomputedtheaveragestatusoffolloweesby[0–1]scalenormalizationformeasurablecomparison.Comparedto theaveragestatusoflegitimateusers’followees,spammers’followeeshavedefinitelylowerstatus.Therefore,thoughthePLP ofspammersishigherthanthatoflegitimateusers,mostspammers followuserswithlowstatus.Thisprovidessufficient evidenceforourassumption.Thisintuitivelyindicatesthat aspammerusuallytargetsuserswho arenothighlyinfluential duetolackingrealsocialnetworks.Alternatively,alegitimateuserfollows(orsubscribesto)influentialusersortheironline friendsasshowninFig.7.

(13)

Fig. 8. Green squares indicate legitimate user accounts and red triangles mean spammer accounts. (For interpretation of the references to colour in this ﬁgure legend, the reader is referred to the web version of this article.)

Table 9

Performance evaluation using SS-Filtering (w/o indegree and outdegree).

J48 Spammer 90 .4% 17 .1%

Table 10

Performance evaluation using SS-Filtering (w/ indegree and outdegree).

J48 Spammer 88 .0% 10 .0%

Fig.8 shows the relationship betweenan average followee status andthe PLP, demonstratingwide variationsamong theaveragefolloweestatuses oflegitimateusers.Thisillustrates thatlegitimateusersfollowthepropertiesofarealsocial network,whereasspammersdonot.

5.3.PerformanceevaluationofSS-ﬁltering

Forthisexperiment,weusedthesamedatasetcontainingsamplesofspammersandlegitimateusersfromtheprevious section. We used the user’sstatus, average status of followees, PLP, indegree, andoutdegree as the feature vectors. We conductedtheexperimentusingJ48andRandomForestinWeka(10-foldvalidation).Thefollowingtables(Tables9and 10) aretheresultoftheperformanceevaluationsofSS-Filteringconsideringindegreeandoutdegree(i.e.,withoutorwith).

Consequently,theproposedapproachalsoshowsagoodproportionoftrue-positives(91.9%)andaloweramountoffalse positives(9.7%)comparedtoCollusionrank.SimilartotheTSPmethodusingthe1-hop-networkofeachuser,theSSmethod usesonlythesmall2-hop-networkforeachuser.Theselightweightschemesmakethespam-ﬁlteringprocess muchfaster.

ConsideringthatCollusionrankusesevery linkandnodeforcomputation, ourdetectionmechanismusingthesocialstatus isaseﬃcientastheTSPapproachinrealtimespamﬁltering.Table11showstheimportanceoffeatureattributesbasedon informationgainwithEq.(5).

InSS-Filtering,theuser’sstatusisthemostsigniﬁcantfeaturesimilarwithTSP-Filtering.Thisisbecausespammershave fewer indegrees than outdegrees. Therefore, the status of a spammer is lower than that of a legitimate user. However, the followee’s average status was also important as well as the user’s status.Normally, a legitimate user subscribes to informativeusers’accounts,andthestatusoftheseusersislikelytobehighduetotheirmanyfollowers.Thismeansthat

(14)

Table 11

The importance of feature attributes based on information gain (SS-ﬁltering).

Feature attribute Information gain A user’s status 0 .386

Followee’s status 0 .327

PLP 0 .107

Table 12

Performance evaluation using Cascaded-ﬁltering.

J48 Spammer 94 .0% 7 .6%

Table 13

Overall performance comparison.

Collusionrank TSP-filtering SS-filtering Cascaded-filtering True positive(Spammer) 94% 92 .3% 91 .9% 96 .3%

False positive (Legitimate user) 9 .9% 7 .6% 9 .7% 5 .7%

legitimateusersfollowuserswithhigherstatusthanthemselves.Onthe otherhand,spammers tendtoselectandfollow legitimateusersrandomly.Asaresult,theirfollowees(targets)mayhavealowerstatusthanthemselves.

6. Twitter-spamdetectionwithcascadedapproach(Cascaded-ﬁltering)

Inprevioussections,weproposedTSP-FilteringandSS-Filteringusingpartialinformation(i.e.,uptothe2-hopsocialnet- workofauser)forlightweightandreal-timespammerdetection.BothalgorithmshavefewerfalsepositivesthanCollusion- rank,buttheirtruepositivesarenotsuperiortoCollusionrank.Therefore,wesuggestahybridapproach(Cascaded-ﬁltering) that utilizes every featureattribute used by both TSP-Filteringand SS-Filtering. Weconducted an experimentoneach of 1000 legitimate user and1000 spammer account withTSP features (TSP of 13 triad classes), social statusfeatures (the user’sstatus,averagestatus offollowee,PLP), indegreeandoutdegree. Table12 showsthe performanceevaluationresults usingCascaded-ﬁltering.

7. Discussion

7.1. Overallperformancecomparison

In thispaper, we compared three Follow spam filtering mechanisms(TSP-filtering, SS-filtering andCascaded-Filtering) withCollusionrank.CollusionrankisthefirstFollowspam-targetedfilteringalgorithmpublished.ItisaPageRank-basedal- gorithm, so it canbe appliedwhen the spam-filteringsystemcontains informationon every Twittersocial network. We propose theTSP-filteringandSS-filteringmethods,both ofwhichcanbe appliedwithonlythe2-hopsocial networkofa user;thisisthemostpowerfulfeatureofthesemethods.Table13showstheoverallcomparisonofthethreemethodswith Collusionrank.

BothTSP-filteringandSS-filteringhavelowertrue-positiveratesandmorefavorablefalse-positiverates.However,weare convinced that ourdetectionmechanismis moreeffectiveforthe followingreasons.In general, forSNSssuch asTwitter andFacebook,real-time spamdetectionisthemostimportantissue.Sincespammers simultaneouslydisseminate numer- ousspammingcontentstoSNSs,fastandimmediatefilteringwithminimalinformationisneededtopreventspamming.To detect spamcontents promptly,a lightweight computationalcost isessential. TSP-filteringandSS-filtering identifyspam- mersbyusingsmallsubgraphswithupto2-hopsocialnetworksforauser.Ontheotherhand,Collusionrankrequiresmore than24GB ofRAMtoperformcomputationonadatasetwith1,963,263,821edges.Therefore,forapplicationto theentire Twitter-userpopulation,itisnotefficienttouseeverynodeandedge.Wethereforesuggestadivide-and-conquerapproach likeourTSP-filteringandSS-filteringmethodsforimmenseSNSs.

Anotherissueisthefalse-positive rate.ForthereliabilityandconvenienceofSNSs,aspam-filteringsystemshould not filterlegitimateusersasspammers,sincethisblocksusers’utilizationabilitiesandleadstonotorietyandsystemicfailure.In comparisonwiththefalse-positiverate,compensationofthetrue-positiverateofTSP-filteringandSS-filteringisnaturaland easier.Spam-reportingservicesareincorporatedintothedesignofmostSNSsforthedetectionofcontentabusers,andthis

(15)

Fig. 9. ROC curve of (a) TSP-filtering and (b) SS-filtering (c) Cascaded-filtering ( X axis : False positive , Y axis : True positive).

Fig. 10. Sensitivity of the proposed detection strategies.

Table 14

True positive, False negative and recall of every suggested strategy.

True positive False negative Recall In/Out degree 0 .808 0 .804 0 .501 TSP-filtering 0 .923 0 .924 0 .500 SS-filtering 0 .919 0 .903 0 .504 Cascaded-filtering 0 .963 0 .943 0 .505

complementarytoolcouldbehelpfulforthedetectionofsubtlespammingactions.Accordingly,TSP-filteringandSS-filtering couldbepracticalspam-filteringmechanismsforuseunderSNSconditions.

The performance ofCascaded-filtering, which employs every feature attribute used inTSP-filtering and SS-filtering, is superiortotheotherthreeschemesincludingCollusionRank.Thisschemecanaccuratelydetectmorespammersandblock orsuspendlesslegitimateusers.Consequently,thisresultsupportstheideathatthereisadistinctionbetweenthelegitimate user’ssocial network andthe spammer’ssocial network. Especially,thisscheme doesnot requirethe full social network information,includingusersthatarenotdirectlyrelatedtotheaccount,todeterminewhetheraspecificuserisaspammer ornot.Moreover,fortheserviceprovider,halfthenumberoffalsepositivesofCollusionrankisprettyattractivetoensurea reliableandconvenientSNSsystem.

Fig.9showstheReceiverOperatingCharacteristic (ROC)curve andtheAreaUndertheROCCurve(AUC)foreverypro- posed approach. Due to its high true positive and low false positive values, Cascaded-ﬁltering has the highest AUC. To comparewithCollusionrank,we estimatedtheAUCofCollusionrankusingitstruepositive andfalse positivevalues.Con- sequently,compared toCollusionrank(AUC=0.92),ourapproachesarecompetitiveandrequiremuch lesssocial network information.

7.2.Sensitivityanalysis

Fig.10showsastatisticalanalysistoevaluatethesensitivityoftheTSP-detectionstrategy.

Wecomputedtherecall(sensitivity)asfollowingequation:

recall=TruePositi

v

es/

(

^T^rue^P^ositi

v

es+FalseNegati

v

es

)

⁽⁸⁾

Tosumup, fromTable 14, two proposed approaches, SS-Filtering andCascaded-Filtering, show better sensitivity than spamclassiﬁcationusingIn/Out degreeonly. Actually,TSP-ﬁltering hasslightlylower sensitivitythanthose approaches, it

(16)

Algorithm1 TriadCensus INPUT :G

fori:=1to16do N_i=0 endfor for

v

∈V do

ifu∈

v

then

S:=Neighbor(^u)∪Neighbor(

v

)

\

^{^u,

v

} ifLink(u,

v

)andLink(

v

,u)then

TriType:=3 else

TriType:=2 endif

N[TriType]:=N[TriType]+n-

|

^S

|

^-²

foreachw∈Sdo

ifu<w∨(

v

_<w∧w<u∧¬Link(^v,w) ^then TriType:=TriType[Tricode(

v

,u,w)]

N[TriType]:=N[TriType]+1 endif

endfor endif endfor sum:=0

fori:=2to16do sum:=sum+N[i] endfor

N[1]:=(1/6)n(n-1)-sum returnN

countersthislimitationwith bothhightrue positiveandfalse negative. Cascaded-Filtering,which isthe hybridapproach usingeveryfeasiblefeatures,showsthebestperformanceatsensitivitywithsuperiortruepositiveandfalsepositive.

7.3. Complexityanalysis

Followingpseudo-codeisTSP-Filteringanditconsistsoftwosteps:TriadcensusalgorithmandTriadSignificanceProfile computationalgorithm. At first,we modified input ofTriad census algorithm [2]to apply to two-hop input subnetwork.

G=(^V,E)îs^the^two-hop^directed^subnetworkôfâûser.

Algorithm1computesfrequenciesof16isomorphictriadtypes.Inourwork,weusedonly13isomorphictriadsbecause triad types withisolated nodesare counted frequently so that this interrupts observing signiﬁcanceof triads. From this algorithm, we can get frequencies of13 isomorphic triad types of given graph G. TSP-Filtering get 2-hop neighborhood social network ofa certain useras input.This algorithm hasthe complexityO(m) andm isthe numberof edges inthe graph G. In Algorithm 1,TriType means how many nodesare connected each other. This indicates that an isolate node existswhenTriTypeis2.[2]usedtheconceptofTricodetocounttriadeﬃciently,butwedontexplaindetailoftheconcept inthispaper.

Algorithm2computesZ-scoreandTSPofacertaininputuser.Forinputattributes,thisalgorithmhasinputvectorscon- sistingofmeanandstandard deviationoflegitimateuserstriadfrequency. Theseattributescouldbe gotatpreprocessing phase,whichcomputesmeanandstandarddeviationoflegitimateusersforthetrainingvalue.Sinceourworkistoclassify spammers fromlegitimateusers, weshould comparethe userstriad frequencyN (testingvalue) tolegitimateusers triad

Algorithm2 TriadSigniﬁcanceProﬁle.

INPUT :N,<N_legit

i>,std(N_legit) fori:1to13do

Z_i=(N[i]− <N_legit

i>)/std(N_legit

i) endfor

fori:1to13do TSP_i= Z_i/(

Z_i²)¹²⁾ endfor

returnTSP

(17)

Algorithm3 PositiveLinkProbability.

INPUT :G,Status Positi

v

eLink:=0

foreachu∈Neighbor(

v

₎do ifStatus[

v

]<Status[u]then

Positi

v

eLink:=Positi

v

eLink+1 endif

endforPLP:=Positi

v

eLink/

|

^Neighbor(

v

₎

|

returnPLP

Table 15

Complexity analysis.

Collusionrank TSP-filtering SS-filtering Cascaded-filtering Time complexity O( N + M) O( m ) O(n) O( m + n ) Space complexity O( N ) O( m ) O( n ) O( m + n )

frequencyN_legit (trainingvalue).ThisalgorithmhastimecomplexityO(1)becauseofasimplenormalizationprocess.There- fore,TSP-FilteringisacombinedalgorithmofTriadCensusalgorithmandTSPalgorithm.Thenoveltyofourworkisthatwe usedausers2-hopneighborhoodsocialnetworkasinitialinput.Also,we compareeach userscensustooveralllegitimate userstoobserveoverrepresentationandunderrepresentationineachisomorphictriad.Overrepresentationandunderrep- resentationoftriadscould bethe distinct featuresto classify spammersfromlegitimate users.So,TSP-Filtering hasO(m) timecomplexityforatarget user.Followingpseudo-codeisSS-Filtering.Actually,inpreprocessingphase,wecancompute thestatusofauserwithhis/herindegreeandoutdegree.Sincewedeﬁnedthestatusofauserastheratiooftheindegree totheoutdegreeoftheuser(Eq.(6)),wejustupdatethreevaluesindegree,outdegreeandstatusinrealtimecase.So,we onlyneed positive link probability (PLP) ofa user.The simple algorithm forPLPis asfollows.Status means atable with statusofeveryuserinG.

Algorithm3computestheratioofthenumberofneighborswithhigherstatusthanuser

v

tothenumberofneighbors ofuser

v

.SinceSS-Filteringhighlydependsonpreprocessingphase,thissimplealgorithmhasthecomplexityO(n)whenn meansthenumberofusersinG.

Additionally,Table 15 showsthe complexity analysis of the proposed schemes andCollusionrank. Actually, since our workisforeachusers2-hopneighborhood socialnetwork,complexitycomparisonwithCollusionrankseems tobeironic.

Tosumup,Collusionrankisforclassifyinglargenumberofspammeraccountsfromwholesocialnetwork,butTSP-filtering, SS-filteringandCascaded-filteringisfordeterminingifauserisspammerornot.nmeansthe numberofuser(nodes)in 2-hopneighborhoodsocialnetworkandmisthenumberofrelation(edges)in2-hopneighborhoodsocialnetwork.Also,N meansthenumberofeveryusers(nodes)inwholesocialnetworkandMmeansthenumberofevery relations(edges)in wholesocialnetwork.ThistableshowedthatourapproachesneedlesscomputationthanCollusionrank.

8. Futurework

8.1. Dynamicfollowspammerdetection

Firstofall,camouﬂageattack,whichusescompromiseduseraccounttodisseminatespamcontents,couldbeoneofthe interestingfuturework.Tosolvethecamouﬂageattackproblem,unsupervisedschemecanbeconsideredfordetectingnewly occurredspammerstotheproposedscheme(i.e.,unsupervisedschemebasedoncascadedsocialinformation).Forexample, wecanselectself-organizingmap(SOM)amongmachinelearningalgorithmsformappingbetweenspammerlabeledwith supervisedschemebasedoncascadedsocialinformationandnewspammerbasedonunsupervisedscheme.

Indetail,SOM isone ofclusteringalgorithms usingEuclideandistanceconcept betweensourceanddestination.Ifwe train spammer patterns with cascaded social information scheme as a supervised concept that can be a partial part of pre-deﬁnedmap ofSOMforprocessingandclusteringnewinputautomaticallyasa unsupervisedconcept.SOMcalculates EuclideandistancebetweenspammeraccountwithcascadedsocialinformationandnewTwitteraccounttocheckwhether the tendencyof newaccount is closerto spammer or legitimate.In each epoch, SOM can update the tendency ofeach accountbasedonEuclideandistancedynamicallyanddistinguishbetweentime-varyingspammerandlegitimatebasedon clusteringmap.Infuture,wewillapplySOMorotheroptimalsolutionfordetectingdynamicspammeraccordingtotime.

8.2.Generalizedapplicationsoftheproposedscheme

Intheview of socialbehavior, people havedifferentsocial networking stylebasedon their purposeofusing SNS.For example,spammers make numerousfollowing linkstounspeciﬁedindividuals withthe aimofspreading spamcontents.

On the other hand, ifsomeone wants to make friends in SNS, they would carefullyselect users asfriends based on a

(18)

personal preferenceandcloseness.And ﬁnally,thesesocial linksmadeby varioususers could beinterpreted incascaded socialinformation.

As showninourexperiment, cascadedsocial informationandits structuralanalysissuggest variousfutureapplication in social network security. In future work, we are to ﬁnda correlation betweenvarious behavior patternsof using SNS andcascadedsocialinformation.ThebehavioralpatternsofusingSNScanbemorespeciﬁedbyvarioustypesregardlessof spammerandlegitimate.Oneo