ContentslistsavailableatScienceDirect
Information Sciences
journalhomepage:www.elsevier.com/locate/ins
Follow spam detection based on cascaded social information
Sihyun Jeong
a, Giseop Noh
a, Hayoung Oh
b, Chong-kwon Kim
a,∗aDept. of Computer Science and Engineering, Seoul National University, Gwanak-gu, Seoul 151–744, Republic of Korea
bSchool of Electronic and Engineering, Soongsil University, Dongjak-gu, Seoul 156–743, Republic of Korea
a r t i c l e i n f o
Article history:
Received 23 November 2015 Revised 27 April 2016 Accepted 11 July 2016 Available online 15 July 2016 Keywords:
Online social network Security
Twitter Follow spam Triad Status theory
a b s t r a c t
Inthelastdecadewehavewitnessedtheexplosivegrowthofonlinesocialnetworkingser- vices(SNSs)suchasFacebook,Twitter,RenRenandLinkedIn.WhileSNSsprovidediverse benefitsfor example,forstering inter-personal relationships, community formations and newspropagation,theyalsoattracteduninvitednuiance.SpammersabuseSNSsasvehicles tospreadspamsrapidlyandwidely.Spams,unsolicitedorinappropriatemessages,signifi- cantlyimpairthecredibilityandreliabilityofservices.Therefore,detectingspammershas becomeanurgentandcriticalissueinSNSs.ThispaperdealswithFollowspaminTwitter.
Insteadofspreadingannoyingmessagestothepublic,aspammerfollows(subscribesto) legitimateusers,andfollowedalegitimateuser.Basedontheassumptionthattheonline relationshipsofspammersaredifferentfromthoseoflegitimateusers,weproposedclassi- ficationschemesthatdetectfollowspammers.Particularly,wefocusedoncascadedsocial relationsand devisedtwoschemes,TSP-Filteringand SS-Filtering,eachofwhichutilizes TriadSignificanceProfile (TSP)and Socialstatus(SS)inatwo-hopsubnetworkcentered ateachother.Wealsoproposeanemsembletechnique,Cascaded-Filtering,thatcombine bothTSPandSSproperties.OurexperimentsonrealTwitterdatasetsdemonstratedthat theproposedthreeapproachesareverypractical.Theproposedschemesarescalablebe- causeinsteadofanalyzingthewholenetwork,theyinspectuser-centeredtwohopsocial networks.Ourperformancestudyshowedthatproposedmethodsyieldsignificantlybetter performancethanpriorschemeintermsoftruepositivesandfalsepositives.
© 2016ElsevierInc.Allrightsreserved.
1. Introduction
Theuse ofsocial networking services (SNSs)continues togrow exponentially withthe widespreadadoption of smart devicessuchassmartphones, smartpads,smartwatches,andso on.SNSscanconnect peopleandcan beused toshare information in real time. SNSs such as Facebook, Twitter, and Ren-Ren are becoming the most influential mediums for buildingsocial relations, as well as forthe sharing andpropagation of information. According to recent announcement, Twitter,oneofthelargest andthe mostpopular SNSs,passed255mmonthlyactive usersandexpects80%ofadvertising revenuefrommobileusers.1
Afterrepeatedexplosivegrowthinuserpopulation,maturedSNSssuchasFacebookandTwitterbecomeanecessingsin modernlife indevelopedcountries.Inaddition,relatively newSNSssuch asRenRenandSinaWeibo, targetedforspecific
∗ Corresponding author.
E-mail addresses: [email protected] (S. Jeong), [email protected] (G. Noh), [email protected] (H. Oh), [email protected] , [email protected] (C.-k.
Kim).
1http://thenextweb.com/twitter/2014/04/29/twitter- passes- 255m- monthly- active- users- 198m- mobile- users- sees- 80- advertising- revenue- mobile/ .
http://dx.doi.org/10.1016/j.ins.2016.07.033 0020-0255/© 2016 Elsevier Inc. All rights reserved.
countryorlanguagespeakers,replicatetheeruptiveexpansionoftheearlierSNSs.Forexample,aninfluentialusercanbe exploitedbyapersonworkinginonlinemarketingtomaximizethemarketingeffect;malicioususers(attackers)disseminate falseinformationorfraudulentmessagesforthepurposeofphishing,scam,ormalwareintrusion.Thatis,theattackerspost multipleunrelatedmessageswithtrendingtopicstoattractlegitimateusersandencouragethemtoclickthemaliciouslinks inthemessages.
Spamreferstounwantedmessagesfromunknownsources(attackers).OneofthemajornegativeaspectsofSNSisspam.
IntheearlyInternetera,spamappearedinemailsorSMS(shortmessageservice).However,thedomainofspamexpanded intoSNSasthepopularityandusageoftheservicescontinuedtoincrease.FalseinformationfromSNScanspreadrapidlyin realtime.Followspamwasreportedrecentlyandisasystemthattriestoincreasethenumberofrelations(orfriendships) inusersnetworksforthepurposeofsendingspamviaSNS.Theattackpatternofthefollowspambeginswiththeattacker disseminatingspammeraccountsthatfollowalargenumberoflegitimateusersforthepurposeofreceivingafollow-back ordrawingattentiontothespamaccount[15].Duetotheconsequentexposureofthepublictospamcontent,thispractice definitelylowersthereliabilityofSNS.
In practice,Twitterhas experienced Followspam problems,reducing users trust inmessage distribution andincreas- ing computation overhead.In 2008, Twitterofficially announced that Follow spam accountshad followedso many peo- plethat they threatenedthe performanceofthe entiresystem.2 Evenwiththeemerging threatfromFollowspam,ithas been barely investigated orresearched. A contents-based spamfiltering approach is employed in the Twitterspam field [5,12,28,42].However,sincespamcontentskeepchangingtoavoidcontent-baseddetectionbyinsertingURLsandimagesin spammessages,thecontents-basedapproachisvulnerableagainstevolvingmessagepatterns.Toovercomethelimitations ofthecontent-basedapproach,anewapproachusinginherentpropertiesofSNSwasintroduced.
[15]firstemphasizedthatFollowspamshouldbedetectedbyusingitslink-farmingproperty.TheyproposedaPageRank- based rankingalgorithm tolower theimpact ofspammers. However, thisapproachcan be burdensomesince itneeds to utilizesocialnetworkdatafortheentirenetwork (i.e.,all informationfornodesandedges).Therefore,ithasahighcom- putational cost and can barely detect Follow spammers in real time. As a result, a novel detection mechanismwith low computationalcost andrealtimespamfilteringisneededwhilemaintainingthedetectionperformance.Inthispaper,we suggesttwosocial network-baseddetectionschemesforcounteringTwitterspam.First,spammeraccountsarefilteredout withthe use ofa Triad SignificanceProfile (TSP)that measures the structuraldifferences betweenthe frequencies of13 isomorphic subgraphs.We discoveredthat TSPofa spammeraccount isdifferentfromthat ofalegitimate usersaccount withonly 2-hopsocial networks.According toour experiment, 92.1%ofspammers are classifiedcorrectlywhenwe used onlyTSP featuresforclassification. Thisresultsuggeststhatfrequencyanddistribution ofisomorphicsubgraphscould be informativefeaturesforidentifyingspammers.Secondly,weintroduceanewdetectionapproachusingthesocialstatus(SS) theory[27]todistinguishspammeraccounts.Legitimateuserstypicallyfollowaccountsofahigherstatusthanthemselves, whereas spammersare likelytofollowinarandom manner.Withtheseapproaches,we canconfirmthat cascadedsocial network-basedapproaches(TSPandSS)caneffectivelydetectFollowspammerswithalowcost.
Our experiments on real Twitter datasets clearly show that our three mechanisms, TSP-Filtering, SS-Filtering and Cascaded-Filtering, arevery practical forthe following reasons.First,our approachesrequireonly a smalluser-related2- hopneighborhoodsocialnetwork.Actually,thereareonlyfewexistingworksfocusedonsmallneighborhoodgraphinother areas[1,30],butnoneofthemdiscoveredthepowerofneighborhoodsocialnetworkclearly.Therefore,theycanbeapplied tospamdetectionsystemsinsocialnetworksasrealtimesolutions.Second,serviceproviders canmaintainthecredibility andreliabilityoftheir SNSsbyapplyingourapproaches.Legitimateusersarelesslikelytobeblockedbythesystemwith lowfalsepositives(5.7%).Also,ahighproportionoftruepositives(96.3%)providesasecureenvironmenttousers.
Ourmaincontributionsaresummarizedasfollows:
• Forthefirsttime,wediscoveredthefeasibilityofcascadedsocialinformationsuchastriad(orisomorphicsubgraph)and socialstatusbasedpositivelinkprobabilityasgoodfeaturesforclassifyingspammersandlegitimateusersinTwitter;
• Ourapproachesinvolvemorelightweightcomputationforrealtimespamdetectionthanthepreviousscheme(i.e.,global information).Sincetocheckwhetheracertainuserisspammerornot,weonlyfocusedonthe2-hopsocialnetworksof eachuser(i.e.,localinformation).andextractcascadedsocialinformationfromthenetwork;
• Basedon Twitter’sspampolicy,noveltriad graphbasedfeatures andsocialstatustheory basedfeatures are proposed andcascadedtogethertofacilitatespamdetection;
• Tothe best ofour knowledge, ourapproaches are the firstexperiments withreal world data toprovide the credible andreliableTwittersystemwithtruepositiveresultsofup to96.3%.Webelievethatourfindingscanprovidevaluable insightstotheareaofspamdetectionanddefenseinvarioussocialnetworks;
• Tosumup,thispaperisthefirstworkwhichclarifythedifferencebetweenspammerandlegitimateuserintheviewof subgraphconsistingofneighborhood.Moreover,wesuggestthenovelthreeapproachessuchasTSP-Filtering,SS-Filtering andCascaded-Filteringforthefollowspamdetectionbasedoncascadedsocialinformation.
Therestofthepaperisorganizedasfollows:Firstly,weintroduceinteresting relatedworksontwitterspamandspam detectionmechanismsinSection2.Then,todescribeourapproach,weintroducethemotivationforourworkinSection3.
2https://blog.twitter.com/2008/making-progress-on-spam .
Then,weproposetheTSP-FilteringandSS-Filteringmethods withtherespectiveperformance evaluationinSections4and 5.In Section 6,we propose Cascaded–Filtering withhigher true positive andlower false positive than TSP-Filtering and SS-Filtering.Section7presentsanoverallevaluationanddiscussionofourthreeapproaches(TSP-Filtering,SS-Filteringand Cascaded-Filtering)andCollusionrank[15].Lastly, we closethispaperwithfuturework andthe conclusionin Sections8 and9.
2. Relatedwork 2.1. Twitter-spamfiltering
2.1.1. Contentbasedtwitter-spamfiltering
Twittercontents such asuserprofile, tweets,andtheactivity logprovidevariousoptions fordistinguishingspammers fromlegitimateusers.SpammersgenerallywritetweetsthatcontainahashtagandURLaccordingtothefollowingresearch studiesthatanalyzedcommonlyusedhashtagsandURL:[5,12,28,42].
COMPA[12]detected compromisedaccountsthatwrotespamtweetsbasedonthetweetinglanguage oftheuser’sac- count,thetweetingtimewindow,theURL,andthe“mention” receiver.Thisisapersonalizeddetectionapproachthatlearns theprevious behavioralpatternof eachuser. Benevenutoetal.[5]and Martinez–Romoetal.[28] proposed classification modelsthatlearnedthenumberofhashtagsandURLs[5]orspamURLs thatareused inspamgroundtruthtweets.Yardi etal.[42] studied spammers’ strategicbehavioralpatterns andalso concludedthat theuse ofhashtags relatedto trend- ingtopicsisa veryeffectivespammingstrategy. Gaoetal.[14]builtatemplatebased onthesentencestructureofspam groundtruthtweetsandusedtemplatematchingtofilteroutspamtweets.
2.1.2. Social-networkbasedtwitter-spamfiltering
As the attack strategies of Twitter spammers have become more effective,a variety ofsocial-network-based Twitter- spam-detectionapproacheshavebeenintroduced.
Twitterspammersattemptto convincethe publicthat their accountsare legitimateorfamous, andtheydisplay their spamtweetsto the public.It iseasier forspammers toattract user interestby exploitingasmuch social informationas possible.Previousworksidentifiedfollower-marketcustomerswhopurchasedfakeaccountsthattheyusedtofollowthem- selves[22,34].Jiang etal.[22]analyzed thebehavioralsynchronicityamongthesefakeaccountsto detecttheir presence.
Stringhinietal.[34]examinedrealfollowermarketsanddetectedfakeaccountsbyidentifyingthoseforwhichthenumber offollowees increasedsuddenlybutdidnot decreaseanyfurther.Viswanath etal.[36] usedPrincipalcomponentanalysis (PCA),ananomaly-detectionapproach,todetectintentional“follow” or“like” behaviorsfrommarketcustomers.
OnTwitter,Followspam(orlinkfarming)refers totheact offollowingamassnumberofpeopletogarnerattentionor follow-backs.3Ghoshetal.[15]pointedoutthatmostFollowspammersattainedahigherrankinexistingrankingalgorithms becausetheirreciprocal-followerrateis82%.Basedontheirfindings,theyproposedtheapplicationofCollusionrank.
2.1.3. Subnetworkbasedspamfiltering
The authorsof [37]directly crawled Twitter’sdata andanalyzed them withboth contents andsocial graph modeling basedapproaches.Based onanalysis ofthecontents, categorizedintolegitimates andspams, they provedthat their pro- posedreputationfeaturehasthebestperformanceamongallsocialgraph-basedfeaturesfordetectingabnormalbehaviors.
However,they onlyconsidered therelationship betweenoutdegreesandindegreesina simpleTwittergraphforthepro- posedreputationfeature.Eventhoughthisschemealsoutilizesasmallgraph(subgraph),asophisticatedgraphdesignisas onlypartofthetriadapproach.Theauthorsof[30] usedneighborhoodsubnetwork(i.e.,egonetwork)todetectcomment spammersinYoutube.TheyalsoutilizedselecteddiscriminatingmotifsandanalyzedtheminYoutube video-userrelation network.It seemsvery similarwithourwork, butitusedspam campaignrelatedmotifs.Therefore, itcannot distinguish spammerswhen theyuseother sophisticatedstrategy. [1]extractedweightedsubgraphsfromtarget networkandutilizes themasdiscriminatingfeaturestodetectspammers.Italsoanalyzedsubgraphsbytypesofanomalies.Basedonpowerlaw characteristicofsocial network, it comparedspammer to legitimateusers neighborhood subnetworkinterms ofedge or weightdistribution.
2.2.Linkspamfiltering
Linkspamhasbeenwidelystudiedintheweb spamdetectionfield. Thistypeofspamispresentedasnumerouslinks froma largenumberofweb pagestoa fewtarget webpages. StudiesonLinkspam havebeenreceivingattentiondueto thelimitationsof PageRank[31]andHITS[24].Thanks tosignificant link characteristics,manyweb linkgraph structure- basedspam detectionapproacheshave beenintroduced [3,8,18,25,40,41].TrustRank [18]is one of the most popularLink spamdetectionalgorithms.Itpropagatesthe‘non-spam’labelthroughsocialnetworks. Likewise,BadRank [40]propagates the‘spam’labelthroughsocial networks.ComparedtoPageRank[31],thesetwo algorithmsutilize‘non-spam’and‘spam’
3https://blog.twitter.com/2008/making-progress-spam/ .
labelpropagationtolowertherankofspamwebpages.[4]proposedanadvancedLinkspamdetectionalgorithmusingboth
‘spam’and‘non-spam’labelpropagation.Theselabelpropagationalgorithmsrequireseedknowledgesuchasasetofspam nodesandasetofnon-spamnodes.Therefore,noiseintheinitialdatasetcanbeacriticalissueforthesealgorithms.
2.3. Sybildetection
MostSNSspamdetectionsystemsrelyonSybildetectionalgorithms.Peer-to-peersystemsconsistofmultiplenodeswith severalconnections(edges).Thesystemhastoensurethateachnodeisclearlyidentified;otherwise,amalicioususer(Sybil) canattempttocreatemultiplefakeidentitiesmasqueradingashonestnodes[11].Theycanthenmanipulatethesystem(by zombiemachines) orattack thesystemin orderto gain illegalprofitsuch aspositive feedbackin thereputation system, gettingmorevotesininternetpolls,ortargetingsitestoincreasetheirrankinGooglePageRank.
TherearetwomainapproachestotheSybilattack:centralizedanddecentralized.Centralizeddefenseobtainsadmission control throughacentral authority. Decentralizeddefense hasnotrustedcentralauthority andcontrolstheIPaddress by bindingan identity.Forthedecentralizedattack, SybilGuard[44]proposesthat wheneach nodereceives √
kindependent samplesfromasetofhonestnodesofsizek,arandomwalkcanbeperformedtotrytodiscovertheSybilidentitiesbyusing theintersectionprobabilitybetweenhonestandSybilgroups.ThenumberofavailableattackedgesinSybilaretheoretically boundO(√klogk).SybilLimit[43]isanenhanced methodintroducedby[44].Theyreducedtheattackedgeboundinnear optimalO(logk) by exploitingvarious random walk methods. GateKeeper[35] adapts theticket distributionalgorithm to obtaineachnode’sprobabilityofSybil/honestusers.
Secondly,the centralized method.SybilInfer [10] assumedthat the central authority knowsthe entiresocial network.
After random walks,eachnode isassigned aSybil/honest probability bymeasuring the Bayesianinference. SybilDefender [39] assumedthat when starting a random walk in Sybil nodes, it will pass the intersection between honestand Sybil nodes.TheseapproachesapplycommunitydetectionalgorithmstofindSybilcommunities.SumUp[35]addressesthevote aggregationproblembyconsideringeachvoter’strustgraphandcalculatingasetofmax-flowpathsfromallvoters.
Currently,therearemanySybildetectingmethodswithvarioussocialnetworkproperties.SybilRank[6]investigateseach nodeby assumingthathonestnodeswillhavehigherdegree-normalizedlandingprobability.Arandomwalkisperformed tomeasure therankingtodeterminewhethertheaccountis Sybilornot. SybilShield[33]utilizesa multi-communityso- cial networkstructure environment,considering sociologicalpropertiestocut theedge betweenhonestandSybil groups, performingmodifiedrandomwalksandfiguringoutthepropertiesofmulti-hopedges.SybilBelief[16]detectsSybilnodes basedona semi-supervisedlearningframework.ThismethodmodifiestheLoopyBelief propagationsystemandthepair- wiseMarkovrandomfieldtodefineeachnode’sclassification(Sybil/honest).
2.4. Dataminingapproachforspamdetection
In spam detection problem, most ofexisting studies related the problemto classification taskas follows. In general, spam classifier firstly learn features extracted from SNS using pattern of legitimate users such as the number of fol- lowees/followers, postuploading time and contents information of user profile and posts. Then, classifier determines if newlygiventestuserisspammerorlegitimateuserbycomparingtolearnedpattern.Therefore,ifthetestusersbehavioral patternfeatureisfarfromlegitimateuserspatternfeature(learnedfeature),theclassifiercouldclassifyanddetecttheuser asspammer.Insome cases,classifiersadoptclassification thresholdtohandlethetradeoff betweentruepositiveandfalse positive.SincereliabilityandcredibilityiscrucialinusingSNS,low falsepositiveistreatedparticularlyaccordingtospam detectionsystem.
In detail,[23] used linearregression forclassifyinganddetecting spammers and itstated that deviant usersfromle- gitimate users patterns could be classified asspammers. Similarly, [36] utilizedPCA (Principal Component Analysis) and it detected Facebook spammers who aredistant fromthe principal componentof legitimateusers. Also, Markovrandom fieldbasedspamclassificationapproachwasproposedin[13].Especially,contents-basedspamdetectionapproacheslargely usedNavebayesclassifierorSVMclassifierwithcontents-relatedfeatures.Intheearlystageofspamdetection,[17,21]and many similar studies analyzedtoken or wordin spamcontents andapplied extractedfeatures to the Nave bayes classi- fier.[32]proposedoptimizedversionofSVMspamclassifierandachievedefficiencythanprevious ones.[45]relievedfalse positiveproblemby adoptingboundaryregiontoclassificationresult.Sincemostofspamclassificationisbinaryclassifica- tionofspamandnon-spam,ternaryclassificationgivesthreeclassification labelsincludingboundaryregionwhichmeans reconsideringregion.
3. Motivation
3.1. Web,socialnetworkandTwitter
Like theWeb,wheretheimportanceofeach pageislargelydeterminedby whoreferenceswhom,the influenceofin- dividualsonmanySNSsisdeterminedbythenumberofindexestheyreceive.Forexample,thenumberoffollowersisthe mostimportantfactoronTwitteranddeterminessocialcapital,whilethenumberof“likes” onFacebookissimilar.Thisfea- ture,however,hasattractedaplethoraoffraudswhotrytoincreasetheimportanceorreputationofentitiesbygenerating
bogusindexes,leadingtothedefinitionofthespamdexingclassofattacks.Twitter’ssizehasexpandedexponentiallyover thepastseveralyearsanditnowhasover255millionactiveusersafterasuccessionofrapidgrowthspurts thatresulted inan averageannualgrowthrateof25%.4 Notably,thesocial-interactionstructureofTwitterisveryinteresting.Users can followfamous persons–usually celebritiesor standoutopinion leaders–thatthey are unacquainted with, aswell asclose friends.Therefore,Twitterplaysaninformation-propagationroleinadditiontotheroleofanonlinesocialnetwork[26].
Moreimportantly,contrarytotheWeb,Facebook,andmanyothersocialnetworkswherespamindexesusuallyoriginate fromfakeaccountsandfromacircleofcolludinglinkfarms,amaliciouspersoncancollectfollowersorfansfrominnocent, social-capital-conscioususersonTwitter. Further,thehighrateoffollow-backs makes thedetectionofTwitterspammers moredifficultbecausetheyreceivemanylegitimatefollowersjustbyfollowingtargetusers[15].Existinglink-farm-detection methodswellfittedforwebspamdetectionfieldthereforelosemuchoftheireffectivenessinthedetectionofspamdexing onTwitter.
Inthispaper,wedemonstratethefeasibilityofa cascadedSNS-basedsecurityschemetodetect Followspam.Different fromthe unpractical andheuristic approachesof previous works,withthe characteristics offollow-backs we apply triad frequenciesandstatustheory forthefirst time inourFollow spamdetectionscheme.Notethat the mainpurposeofthis studyisnottheattainmentofengineeringoptimizationfortheperformanceenhancementofpriorschemes,butrather,itis theexaminationofthefeasibilityofasocial-network-basedsecurityschemeinapopularonlinesocialnetworkingsite,i.e.
Twitter.
Beforeweformalizetheproblem,weaddressthecharacteristicsofTwitter.All13typesofdirectedsocialgraphmodels andsocial statuswithlocalinformationcanbe observedinTwitter.Additionally,Twitterhaswell-definedsocialrelations intheformofthe“follower” and“friend” relationships. Inadditionto thesecharacteristics,spamsshow up frequentlyin Twitter.WepracticallyexploitthepolicyofTwitteragainstspamstodesignourproposedscheme.Twitter’sspampolicyis summarizedasfollows:
• “If youhave asmall numberoffollowers comparedto the amountofpeople you are following”,theaccount maybe consideredaspamaccount.
• “Multipleduplicateupdatesononeaccount” isafactorusedtodetectspam.
• “Ifyourupdatesconsistmainlyoflinks,andnotpersonalupdates”,itisconsideredspam.
Firstpolicy is relatedwiththe social-interaction structureof Twitterwhilesecond andthird policies haveto dowith spamcontents.Mostpreviousworksfocusedoncontentsanalysisorfullinformationusageofsocialnetworkswithahigh amountofcomputationaloverhead.Differentfrompreviousapproachesconsideringsecondandthirdpolicies,weaccurately detectFollow spam usingonly localinformationofthe social-interaction structureofTwitter. Thatis,ourcascaded social networkschemeisapplicableregardlessofthecontentsuchasTweet,timeandlinks.
3.2.LinkspamandFollowspam
TheconceptoflinkfarmingoriginatedfromWebspam.TheintentoflinkandFollowspamistoincreasethepopulation ofaspecific (target)website orreputation.Sincenormalsearch engines (e.g.,Google)placepopular websitesonthe first page,link-farmingwebsitescreatenumerouslinkstothetargetwebsite.
PageRank[31],themostpopularwebsiterankingalgorithm,rankswebsitesbasedontheindegree ofthesite. Actually, thepopularityofinlink nodesisalso important,butnumerousinlinks arelikely toincrease the targetwebsite’s ranking.
Therefore,linkfarmsgenerallycontainplurallinks,andthelinksarecreatedfrommanynodestoafewtargetnodes.
Followspam,aspecialattackstrategyonTwitter,hasbeenshowntobealinkfarmingtechnique.Fig.1showsanexample ofFollowspam.
Followspamconsistsofnumerouslinks,butsomedifferencesexist.First,linksarecreatedbyafewspammernodesand theytargetmanylegitimateusernodes.Morespecifically,originallinkspamdenotesmanyspammernodes-fewlegitimate nodesrelationship whileFollow spam denotesfew spammernodes-many legitimate nodes.Second,the purposeofFollow spam is not just linking, butreceiving a follow-back(reciprocal link). A user on Twittercan seetweets (contents) from anotheruserwhen he/shefollows(subscribe)the other’saccount.Consequently, spammersneed tobe followedby other userstoshowtheirspammingcontentssuchasURL,imageandadvertisement.
Therefore, to gain more followers and attention, spammers send a large number of follow linksto legitimate users.
Surprisingly,themajority offollowerswho Follow spam accountshavebeenpreviously targetedby spamaccounts. Tobe specific,82% oflegitimateuserssenda follow-backto spammers[15].Ifsisa spammeraccount andhis/heroutlinks are allattack edgesforfollowback,theattack strengthofs(AS(s))isdefinedin(1).Wedefinedtheratiobetweensuccessful followspamlinks(followbacklinks)ofspammers(Nfb(s))andtotalfollowspamlinksofs(Nf(s))asAS(s)asfollows:
AS
(
s)
=Nf b(
s)
/Nf(
s)
(1)In(1),Nf(s) hasthe samemeaning ofoutdegreeofs.Therefore, theattack strength (AS(s))offollowspamrelies ona successfulnumberoffollowbacks.
4http://thenextweb.com/twitter/2014/04/29/twitter- passes- 255m- monthly- active- users- 198m- mobile- users- sees- 80- advertising- revenue- mobile/ .
Fig. 1. Overview of Follow spam .
Table 1 Twitter data set.
The number of total users The number of spammers 54 ,981,152 41 ,352
Table 2
Performance estimation of Collusionrank [15] . True positive False positive Spammer 94 .0% 9 .9%
Legitimate user 90 .1% 6 .0%
3.3. TwitterdatasetandCollusionrank
Weconductedanexperimentwithalarge-scaleTwitter-followlinkdatasetthatwasprovidedbyMPI-SWS[9].Thisdata setwascollectedinSeptember2009,contains1,963,263,821directedsociallinks,andthenumberofcorrespondingusersis 54,981,152.WealsousedtheFollowspammerdatasetfrom[15]thatcontains41,352spammersasthegroundtruth.
Table1showsTwitterdatasetusedinourexperiment.
WecomparedtheperformanceoftheproposedmethodwiththatofCollusionrank[15]presentedinWWW2012.Collu- sionranklowerstheinfluencescoresofuserswho connecttospammersandfilteroutthoseuserswhogainhighrankings by linkfarming.It isauser-ranking algorithmbasedon PageRank.Sincewe usedthesame datasetasCollusionrank,we comparetheperformanceoftheproposedmethodwiththetruepositiveandfalsepositiveresultsofCollusionrank.Accord- ingto[15],Collusionrankdetected94%ofthe41,352spammersthatappearedinthelastlowrankingscores10%ofranking positions;consequently, wecouldextract thefalse positivesoflegitimateusers(9.9%)fromCollusionrankwithadetection thresholdof10%.We reiteratethatthedetailedperformance ofCollusionrankisnotdescribedin[15],exceptforthetrue positivesforspammerswithinthethresholdofthelast10%. Table2istheestimatedperformanceofCollusionRankfroma truepositivevalueof94%.
Collusionrankhasgoodperformanceintermsoftruepositiveandfalsepositive,butithassomelimitationsasfollows:
First, itneeds to analyze every node andedge ina social network. The PageRank-based algorithmtypically estimates every node’sreputation orrankingdependingon thereputation ofother nodesand edgeformation.However, to classify spammers,computingranksoneverynodeisnotpractical.InrealSNSs,spammersdisseminatespammingcontentssimul- taneously.Therefore, a realtime spam filteringapproach is moreeffective; fastspam filteringsignificantly decreases the numberofvictimsofspam.Assuch,analyzingallsocialnetworkinformationisnotverypragmatic.
Table 3
Average indegree and outdegree.
Average indegree Average outdegree
Spammer 303 .6 866 .5
Legitimate user 401 .5 462 .0
Table 4
Performance evaluation using only indegree and outdegree.
Classifier Type True positive False positive
J48 Spammer 83 .9% 19 .3%
Legitimate user 80 .7% 16 .1%
RandomForest Spammer 80 .8% 19 .6%
Legitimate user 80 .4% 19 .2%
Second,it has ahighproportion offalse positivesin detecting legitimateusers. If9.9% oflegitimateuser accountsin Twitterwereblocked,mostpeoplewouldstopusingTwitter.Ahighnumberoftruepositivesindetectingspammersisalso crucial;butthecredibilityandreliabilityoftheservicearemaintainedbykeepingthenumberoffalsepositiveslow.
Inthefollowingsections,weproposecascadedsocialinformation-basedspamdetectionmechanismsthatovercomethe limitationsofCollusionrank.
3.4.Indegreeandoutdegreeofsampledataset
SinceFollowspamhasalinkfarmingpropertythatinvolvescreatingmanyoutlinks,weshouldinvestigatewhetherspam- mersinTwitterhaveahigheroutdegreethanlegitimateusers.Also,basedonTwitter’sspampolicy,wefocusontheratio oftheindegreetotheoutdegreeforbothlegitimateusersandspammers.
Inthispaper, we userandomly selected 1000legitimate usersand 1000 spammers asthe experimental dataset. We determinedalargeenoughsamplesizewitha95%confidenceleveland5%confidenceinterval.
Table3istheaverageindegreeandoutdegreeoflegitimateusersandspammers.
Inevitably,spammerstendtohaveapproximatelytwotimesasmanyoutdegreesaslegitimateusers.Themostinteresting observationisthattheratiobetweentheaverageindegreeandoutdegreeshowssignificantdifferencesbetweenlegitimate usersandspammers.Theaverageindegreeandoutdegreeoflegitimateusersaresimilarandtheratiobetweenthetwois 0.86.However,theratiobetweentheaverageindegreeandoutdegreeofspammersis0.35.Thisindicatesthattheindegree andoutdegreecouldberoughlyinformativeforclassifyingspammers.
Toclassifyspammers by onlyindegree andoutdegree,we used J48andRandomForest classifiersbuiltinWeka.5 Both algorithms are decisiontreebased classifiers.While J48generates only one decisiontree,RandomForest corrects overfit- tingproblemsbyconstructingmultipledecisiontreesduringthetrainingprocess.Table4istheclassificationperformance evaluationusingonlyindegreeandoudegree.
AsmentionedintheTwitterspampolicy,weprovedthat thenumberofoutdegreescanbea highlyusefulfeaturefor spamclassification.However,comparisonusingonlythenumberofdegreetypesbetweenFollowspamsandlegitimateusers isnotenoughofaperformancemeasuretoinspectspammersasshowninTable4.Tomakeupforthespamdetectionissue, wetriedtoapplyTSPandSSasdescribedinSections4and5.
4. Twitter-spamdetectionwithtriadsignificanceprofile
4.1. Followspamdetectionwithtriadsignificanceprofile(TSP-Filtering)
Apriorstudyshowedthat,interestingly,severaltypesofnetworksfromdifferentfieldssuchasbiologyandthesocialsci- encessharecommonproperties.Inparticular,[29]showedthatsomeofthe13isomorphictriadtypesareover-represented whilesome areunder-represented.Tothebestofourknowledge,wefirstusedthisfacttodiscernTwitterFollowspam.In termsofasocialgraph,auserisanodeandafollowfromapersontoanotherpersonisadirectedlinkfromthefollower (theperson)tothefollowee(anotherperson).Fig.2showsthe13isomorphictriadclassesintroducedby[38].
NotethataFollowspammerinevitablygeneratesmanyfollows(directedlinks)toreceivefollow-backs(redirectedlinks).
Foreach spammer,wefoundall ofthecorrespondingtriadsandcountedthefrequencyofthe13 isomorphictriadclasses (fordetailedrepresentationofthetriadclasses,referto Fig.2).Weperformedthesameprocedures withlegitimateusers andcomparedthedifferencesbetweenthefrequencyofeachtriadclassforboththespammer-centrictriadsandthelegiti- mateuser-centrictriads.Wearguethatthetriadfrequenciesofrealsocialnetworksaredifferentfromthoseofspammers.
5http://www.cs.waikato.ac.nz/ ∼ml/weka/ .
Fig. 2. 13 isomorphic triad classes for analyses.
Fig. 3. A user u ’s social network graph G u(red-colored edges and named nodes). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Thetriadfrequencies ofspammersaresimilartothoseofrandomnetworkswiththesamegraphpropertiesincludingthe averageindegreeandtheaverageoutdegree.
ForagivenlocalnetworkGuofauseruasshowninFig.3,weestimatedthenumberofoccurrencesforeachtriadclass.
Gu consistsofsociallinksbetweenuand1-hopneighborhoodsofu.Supposethat uisfollowingr1,r2 andr3 andis also followedbyrr,r5 andr6.Inthiscase,r1,r2andr3 are“Followees” ofu.Inthesamemanner,r4,r5 andr6are“Followers”
ofu.Also,therearedirectedsocial linksbetweenthem(representedasred-colored linksinFig.3).Todeterminewhether useruisaspammerornot,weanalyzeduseru’ssocialgraphGuconsistingof7nodesand10edges.Thisisasubgraphof aTwittersocialnetwork,andeveryusercanhavehis/herownsocialnetwork.
Todiscoverthephenomenon wherebyspammersocialnetworkscomprisesubgraphfeatures thataredifferentfromle- gitimateusersocial networks,wecomparedspammertriadfrequencieswiththoseoflegitimateusers.Foreachtriadclass i,thestatisticaltriadoccurrenceisdescribedbytheZ-scoreZi[29]inEq.(2).
Zi=
(
Nspami−<Nlegiti>)
/std(
Nlegiti)
(2)whereNspami istheoccurrencenumberofthetriad classi inaspammer’snetwork, and<Nlegit
i> andstd(Nlegiti)arethe mean and standard deviations of its appearancesin the legitimate user networks, respectively. The TSPis therefore the vectoroftheZ-scoresthatarenormalizedtolength1inEq.(3)
TSPi=Zi/
Zi2
12(3)
Fig. 4. Average TSP of spammers and legitimate users. Error bar means standard deviation of spammers TSPs.
Tovisualizethisinsightfromnetworkcomparison,wecomputedtheaveragevectorofTSPforrandom1000spammers fromtheoriginaldataset[15]andnormalizedit.Similarly,wealsocomputedNlegit
i basedonrandom1000legitimateusers.
Wedeterminedthatthesamplesize1000waslargeenoughwith95%confidenceleveland5%confidenceinterval.
Fig.4compares the TSPsofspammers andlegitimate users.Legitimate usersgenerallyhave moretriadscompared to spammers,meaningthattheneighborsoflegitimateusersaresociallywellconnectedwithisomorphictriadpatterns;there- fore,thisphenomenonproducedmoretriadcountsoverall.Alternatively,spammershavelowertriadcountsthanlegitimate usersbecausetheir1-hopneighborsarenotlikelytoacquaintthemselveswiththeother1-hopneighbors.
Sincespammersusuallyselecttheirfolloweesrandomly,therearefewconnectionsbetweenspammers’neighbors.Triad 021D,however, indicates exceptional triad counts, whereby spammers have more021D triadsthan legitimate users.The 021Dtriadclassrepresentstheplural-followingactionsfromanode.Italsorepresentslink-farmingactivity.Sincetheactions ofFollowspammersinvolvetheproductionofnumerousout-links,theirhigh021Dtriadcountsmakesense.Thedistinction betweentheTSPsofspammersandlegitimateusersthereforeexplainswhyourTSP-detectionapproachisfeasible.Wegive astatisticalanalysisofsensitivityofeveryproposedstrategyindiscussionsection(Section7).Inthefollowingsection,we provideatrue-positiverateandfalse-negativeratetosupporttheexcellenceofthismethod.
Asmentionedearlier,werandomlysampledsetsof1000spammersand1000legitimateusersfromtheoriginaldataset [15]andconductedanexperimentwithTSP. Wedetermined that thesamplesizewaslarge enoughwith95% confidence leveland5%confidenceinterval.
4.2.TSP-filtering
ThefollowingprocesswasusedfortheapplicablevalueoftheTSP-Filteringbasedontheexperiment.First,weobtained themeanandstandarddeviationsofeachfrequencyforthetriadclassacrossalloftheTwitteraccounts.Since1000legiti- mateusersaresufficientlyrepresentativetosupporteveryTwitteraccount(confidencelevel:95%,confidenceinterval:5%), we computedthe meanandstandard deviations ofthe 1000 randomly-sampledlegitimate users.The meanvalue of the triadclassiis<Nlegit
i>andstandarddeviationofthetriadclassiwith<Nlegit
i>isstd(Nlegiti),respectively.Figureshows thesampleduserslocalsocialnetworksandtriadfrequencynormalization.
Second, we countedthe spammer-triad frequencies and thelegitimate user-triad frequencies forevery social-network subgraphofeveryuseraccount; thetriadfrequencyrepresentsthetriadappearancesineach usernetwork.Then,we nor- malizedthefrequencieswith<Nlegit
i>andstd(Nlegiti)(Fig.5).Inthecaseofthespammer,wecanuseEq.(2);however,in thelegitimateuser’scase, wecanusethere-translatedEq.(4),whereNspami istheoccurrencenumberofthetriadclassi inalegitimateuser’snetwork:
Zi=
(
Nlegiti−<Nlegiti>)
/std(
Nlegiti)
(4)Lastly,we computedeachuser’s TSPusingEq.(3).Auser’sTSPcomprises 13Z-scoresof 13triad classes.These13Z- scorescouldbe informativefora machine-learningmechanism. Wealsoaddedtwofeatures formachine learning,namely theindegreesandoutdegreesofeachuserbasedonthemotivationexperiments.
Fig. 5. Triad frequency normalization.
Table 5
Performance evaluation using TSP-Filtering (w/o indegree and outde- gree).
Classifier Type True positive False positive
J48 Spammer 91 .0% 9 .4%
Legitimate user 90 .6% 9 .0%
RandomForest Spammer 92 .1% 8 .4%
Legitimate user 91 .6% 7 .9%
Table 6
Performance evaluation using TSP-Filtering (w/ indegree and outde- gree).
Classifier Type True positive False positive
J48 Spammer 91 .7% 9 .2%
Legitimate user 90 .8% 8 .3%
RandomForest Spammer 92 .3% 7 .6%
Legitimate user 92 .4% 7 .7%
4.3. PerformanceevaluationofTSP-filtering
We conductedthe experimentusing J48andRandomForest implemented inWeka (10-fold validation).Table 5 shows theperformance evaluationresultsfortheTSPmethodwithoutindegrees andoutdegrees.Table6showstheperformance evaluationresultsfortheTSPmethodwithindegreesandoutdegrees.Ontheother hand,Table6showstheperformance evaluationresultsfortheTSPmethodwithindegreesandoutdegrees.
FromTable5,evenwithoutindegreesandoutdegrees,TSP-FilteringforRandomForesthasapowerfulspam-classification performancewith92.1%.FromTable6,theproposedapproachwithindegreesandoutdegreeshas92.3%truepositivesanda lowerproportionoffalsepositives(7.6%)thanCollusionrank(9.9%).UnlikeCollusionrank,whichneedstoanalyzeeverylink torankevery node,ourTSPapproachisafastandlow-costdetectionmechanismthat usesonlythe1-hop-neighborhood network foreach user. Therefore, the TSP approach is a more lightweight and efficient mechanism for detecting follow spammersinrealtime.
Todefine a preferredsequence ofattributes, we measured the importance of feature attributesbased oninformation gainasshowninTable7.InTable7,featureattributeslistedindescendingorderofinformationgain.Informationgaincan becomputedasfollows:
In f ormationGain
(
C,A)
=Entropy(
C)
−Entropy(
C|
A)
(5)In Eq.(5),C represents thegiven class such asspammer andlegitimateuser. Ais the feature attribute.For example, InformationGain(spammer,021D) refers to the amount of entropydecrease in a spammer class when the feature attribute 021Disprovided.
Table 7
The importance of feature attributes based on information gain (TSP- Filtering).
Feature attributes Information gain
021D 0 .2867
021U 0 .2556
021C 0 .2366
111U 0 .2267
201 0 .1418
030T 0 .1408
111D 0 .1399
120D 0 .136
120U 0 .1075
120C 0 .0871
300 0 .0859
210 0 .0794
030C 0 .0465
Asweshowedintheexperimentresults,021Disthemostsignificantfactorinclassifyingfollowspammersbecauseofits propertyoftwo out-edges.Followspammers tendtohavemanyout-edges tolegitimateusers.Thistendency ispresented naturallyin 021D. The following attribute, 021U, is alsosignificant inclassifying legitimate users because ofits two in- edges.LegitimateusersarelikelytohavemorefollowersthanspammersatstablepointsoftheTwitterSNSsystem.Twitter isaveryspecialSNSduetoitssubscriptioncharacteristics.Themoreinformativetheusers’contentsare,themorefollowers subscribetotheuser.Sincemostspammersuploadadvertisementsorspammingcontentsontheiraccount,theyhavefewer followersthanlegitimateusers.Understandably, some legitimateTwitteruserstrytofollowmanyusers atthe initialand transitionpointsfor subscriptions or other reasons. However, legitimate usersat the steadypoint have a larger number of indegrees (i.e., followers) than outdegrees (i.e., friends) dueto effective influence or fruitful contents because of the psychologyofpopularity.Inaddition,theremaining featuresofTSParegradually reflected inthedistinctionbetweenthe followspammersandlegitimateusersbecauseofthesocial-interaction.
5. Twitter-spamdetectionwithstatustheory 5.1. Followspamdetectionwithsocialstatus(SS-filtering)
Basedonsignsofdirectedlinksderivedfromsocialmediasitessuch asEpinions,Slashdot,andWikipedia,socialstatus theory is first applied to predict certain kindsof social relationships [27]. To findstrong consistency in how the model fitsthedataacross othersocial networksaswell asthepowerofinfluenceinTwitterusingourTSPfilteringscheme,we additionallyproposeSS-Filtering forTwitternetworkanalysis.Twitterhasaspecialcharacteristicwherebyusersfollow(or subscribeto)otherusersandthiscanbetranslatedintothesocial-statustheory.Generally,mostTwitterusersfollowusers whoaremoreinfluentialthanthemselves.EspeciallyonSNS,amore-influentialuserissimilartoauserwithahighsocial status.When a legitimateuser follows others witha higher social status, is a spammer’s following patternsimilar to a legitimateuser?Wefocusedonthefactthatspammersarelikelytofollowthepropertiesofarandomnetwork.
Oursocial-status-basedintuitionisderivedfromthefollowingquestion:“Isthefollowingpatternofaspammersimilarto thatofa legitimateuserwhen alegitimateuserfollows otherswith ahigherstatus?”We focusedonthe factthatspammers are likely to followthe properties ofa random network. In fact, spammers have little knowledge of the social relations betweenlegitimate users. They tend to select target users randomly. One may argue that a strong spammer is able to gatherinformationregardingthesocialrelationsoflegitimateusers.However,suchsocialengineeringhasonlypartialand instantaneousinfluencecomparedtotheaverageandapparentinfluenceoflegitimateusers.
5.2.SS-filtering
Asocial-statusspam-filteringsystemcancomputethefollowingmetricbasedonuseru’s2-hopsocialnetwork.Toapply Twittertothestatustheory,we definedthestatusofa useruastheratiooftheindegree (indegree(u))totheoutdegree (outdegree(u))oftheuserasshowninEq.(6).
status
(
u)
=indegree(
u)
/outdegree(
u)
(6)WethendefinedapositivelinkinTwitterastheprobabilitythattheuserfollowsanotheruserofahigherstatus.Fig.6 showstheconceptofsocialstatus,positive linkandnegativelinkinstatustheory[27]andsocialbalancetheory[7,19,20]. Apositive link,‘+’linkmeansthat anodeXlinkstoeach nodeA,B,CandD,whohashighersocialstatusthanX.Onthe otherhand,anegativelink,-meansthatthenodeDlinkstoanodeX,whohaslowersocialstatusthanD.
Fig. 6. Social status, positive link (orange link) and negative link (blue link). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 8
Average values comparison of social status-related features between legiti- mate user and spammer.
Social status-related features Legitimate user Spammer Average status of a user 1 .82 0 .39 Average positive link probability ( PLP ) 0 .83 0 .91 Average status of followees ([0-1] scale) 0 .05 0 .0 0 04
Fig. 7. Comparison between legitimate users and spammers for average status of followees.
ThefollowingEq.(7)istheexpressionofthepositivelinkprobability(PLP)ofauseru:
PLP
(
u)
=Npos(
u)
/(
Npos(
u)
+Nneg(
u))
(7)whereNpos(u)meansthenumberofpositivelinksandNneg(u)meansthenumberofnegativelinks.Weassumedthatnega- tivelinksincludelinksbetweensamestatus.Additionally,wealsoconsidertheaveragestatusoffollowees.Table8compares socialstatus-relatedfeaturesbetweenalegitimateuserandspammer.
Wecan deriveinteresting observationsfromTable8.First, theaveragestatusofa spammerissignificantlylower than that ofalegitimateuser,andthisis attributedto thespammer’slink-farmingproperty.Inthisobservationoftheaverage statusofauser,thetwoproposed schemes(TSPfilteringandSSfiltering) havesomethingincommon.Second,thePLPof aspammertoobtainagreaternumberofreciprocalfollowingsishigherthanthatofalegitimateuser.Actually,thisresult countersourassumptionthatspammersselectfolloweesrandomlyandthePLPofaspammerwouldbelowerthanthatof alegitimateuser.However,thisresultismainlyduetothesignificantlylowstatusofspammeraccounts(i.e.,agoodmany outdegreesofeachspammeraccount).Wearrivedatthisconclusionfromcomparisonoftheaveragestatusoffollowees.In Table8,wecomputedtheaveragestatusoffolloweesby[0–1]scalenormalizationformeasurablecomparison.Comparedto theaveragestatusoflegitimateusers’followees,spammers’followeeshavedefinitelylowerstatus.Therefore,thoughthePLP ofspammersishigherthanthatoflegitimateusers,mostspammers followuserswithlowstatus.Thisprovidessufficient evidenceforourassumption.Thisintuitivelyindicatesthat aspammerusuallytargetsuserswho arenothighlyinfluential duetolackingrealsocialnetworks.Alternatively,alegitimateuserfollows(orsubscribesto)influentialusersortheironline friendsasshowninFig.7.
Fig. 8. Green squares indicate legitimate user accounts and red triangles mean spammer accounts. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 9
Performance evaluation using SS-Filtering (w/o indegree and outde- gree).
Classifier Type True positive False positive
J48 Spammer 90 .4% 17 .1%
Legitimate user 82 .9% 9 .6%
RandomForest Spammer 88 .6% 14 .6%
Legitimate user 85 .4% 11 .4%
Table 10
Performance evaluation using SS-Filtering (w/ indegree and outdegree).
Classifier Type True positive False positive
J48 Spammer 88 .0% 10 .0%
Legitimate user 90 .0% 12 .0%
RandomForest Spammer 91 .9% 9 .7%
Legitimate user 90 .3% 8 .1%
Fig.8 shows the relationship betweenan average followee status andthe PLP, demonstratingwide variationsamong theaveragefolloweestatuses oflegitimateusers.Thisillustrates thatlegitimateusersfollowthepropertiesofarealsocial network,whereasspammersdonot.
5.3.PerformanceevaluationofSS-filtering
Forthisexperiment,weusedthesamedatasetcontainingsamplesofspammersandlegitimateusersfromtheprevious section. We used the user’sstatus, average status of followees, PLP, indegree, andoutdegree as the feature vectors. We conductedtheexperimentusingJ48andRandomForestinWeka(10-foldvalidation).Thefollowingtables(Tables9and 10) aretheresultoftheperformanceevaluationsofSS-Filteringconsideringindegreeandoutdegree(i.e.,withoutorwith).
Consequently,theproposedapproachalsoshowsagoodproportionoftrue-positives(91.9%)andaloweramountoffalse positives(9.7%)comparedtoCollusionrank.SimilartotheTSPmethodusingthe1-hop-networkofeachuser,theSSmethod usesonlythesmall2-hop-networkforeachuser.Theselightweightschemesmakethespam-filteringprocess muchfaster.
ConsideringthatCollusionrankusesevery linkandnodeforcomputation, ourdetectionmechanismusingthesocialstatus isasefficientastheTSPapproachinrealtimespamfiltering.Table11showstheimportanceoffeatureattributesbasedon informationgainwithEq.(5).
InSS-Filtering,theuser’sstatusisthemostsignificantfeaturesimilarwithTSP-Filtering.Thisisbecausespammershave fewer indegrees than outdegrees. Therefore, the status of a spammer is lower than that of a legitimate user. However, the followee’s average status was also important as well as the user’s status.Normally, a legitimate user subscribes to informativeusers’accounts,andthestatusoftheseusersislikelytobehighduetotheirmanyfollowers.Thismeansthat
Table 11
The importance of feature attributes based on information gain (SS-filtering).
Feature attribute Information gain A user’s status 0 .386
Followee’s status 0 .327
PLP 0 .107
Table 12
Performance evaluation using Cascaded-filtering.
Classifier Type True positive False positive
J48 Spammer 94 .0% 7 .6%
Legitimate user 92 .4% 6 .0%
RandomForest Spammer 96 .3% 5 .7%
Legitimate user 94 .3% 3 .7%
Table 13
Overall performance comparison.
Collusionrank TSP-filtering SS-filtering Cascaded-filtering True positive(Spammer) 94% 92 .3% 91 .9% 96 .3%
False positive (Legitimate user) 9 .9% 7 .6% 9 .7% 5 .7%
legitimateusersfollowuserswithhigherstatusthanthemselves.Onthe otherhand,spammers tendtoselectandfollow legitimateusersrandomly.Asaresult,theirfollowees(targets)mayhavealowerstatusthanthemselves.
6. Twitter-spamdetectionwithcascadedapproach(Cascaded-filtering)
Inprevioussections,weproposedTSP-FilteringandSS-Filteringusingpartialinformation(i.e.,uptothe2-hopsocialnet- workofauser)forlightweightandreal-timespammerdetection.BothalgorithmshavefewerfalsepositivesthanCollusion- rank,buttheirtruepositivesarenotsuperiortoCollusionrank.Therefore,wesuggestahybridapproach(Cascaded-filtering) that utilizes every featureattribute used by both TSP-Filteringand SS-Filtering. Weconducted an experimentoneach of 1000 legitimate user and1000 spammer account withTSP features (TSP of 13 triad classes), social statusfeatures (the user’sstatus,averagestatus offollowee,PLP), indegreeandoutdegree. Table12 showsthe performanceevaluationresults usingCascaded-filtering.
7. Discussion
7.1. Overallperformancecomparison
In thispaper, we compared three Follow spam filtering mechanisms(TSP-filtering, SS-filtering andCascaded-Filtering) withCollusionrank.CollusionrankisthefirstFollowspam-targetedfilteringalgorithmpublished.ItisaPageRank-basedal- gorithm, so it canbe appliedwhen the spam-filteringsystemcontains informationon every Twittersocial network. We propose theTSP-filteringandSS-filteringmethods,both ofwhichcanbe appliedwithonlythe2-hopsocial networkofa user;thisisthemostpowerfulfeatureofthesemethods.Table13showstheoverallcomparisonofthethreemethodswith Collusionrank.
BothTSP-filteringandSS-filteringhavelowertrue-positiveratesandmorefavorablefalse-positiverates.However,weare convinced that ourdetectionmechanismis moreeffectiveforthe followingreasons.In general, forSNSssuch asTwitter andFacebook,real-time spamdetectionisthemostimportantissue.Sincespammers simultaneouslydisseminate numer- ousspammingcontentstoSNSs,fastandimmediatefilteringwithminimalinformationisneededtopreventspamming.To detect spamcontents promptly,a lightweight computationalcost isessential. TSP-filteringandSS-filtering identifyspam- mersbyusingsmallsubgraphswithupto2-hopsocialnetworksforauser.Ontheotherhand,Collusionrankrequiresmore than24GB ofRAMtoperformcomputationonadatasetwith1,963,263,821edges.Therefore,forapplicationto theentire Twitter-userpopulation,itisnotefficienttouseeverynodeandedge.Wethereforesuggestadivide-and-conquerapproach likeourTSP-filteringandSS-filteringmethodsforimmenseSNSs.
Anotherissueisthefalse-positive rate.ForthereliabilityandconvenienceofSNSs,aspam-filteringsystemshould not filterlegitimateusersasspammers,sincethisblocksusers’utilizationabilitiesandleadstonotorietyandsystemicfailure.In comparisonwiththefalse-positiverate,compensationofthetrue-positiverateofTSP-filteringandSS-filteringisnaturaland easier.Spam-reportingservicesareincorporatedintothedesignofmostSNSsforthedetectionofcontentabusers,andthis
Fig. 9. ROC curve of (a) TSP-filtering and (b) SS-filtering (c) Cascaded-filtering ( X axis : False positive , Y axis : True positive).
Fig. 10. Sensitivity of the proposed detection strategies.
Table 14
True positive, False negative and recall of every suggested strategy.
True positive False negative Recall In/Out degree 0 .808 0 .804 0 .501 TSP-filtering 0 .923 0 .924 0 .500 SS-filtering 0 .919 0 .903 0 .504 Cascaded-filtering 0 .963 0 .943 0 .505
complementarytoolcouldbehelpfulforthedetectionofsubtlespammingactions.Accordingly,TSP-filteringandSS-filtering couldbepracticalspam-filteringmechanismsforuseunderSNSconditions.
The performance ofCascaded-filtering, which employs every feature attribute used inTSP-filtering and SS-filtering, is superiortotheotherthreeschemesincludingCollusionRank.Thisschemecanaccuratelydetectmorespammersandblock orsuspendlesslegitimateusers.Consequently,thisresultsupportstheideathatthereisadistinctionbetweenthelegitimate user’ssocial network andthe spammer’ssocial network. Especially,thisscheme doesnot requirethe full social network information,includingusersthatarenotdirectlyrelatedtotheaccount,todeterminewhetheraspecificuserisaspammer ornot.Moreover,fortheserviceprovider,halfthenumberoffalsepositivesofCollusionrankisprettyattractivetoensurea reliableandconvenientSNSsystem.
Fig.9showstheReceiverOperatingCharacteristic (ROC)curve andtheAreaUndertheROCCurve(AUC)foreverypro- posed approach. Due to its high true positive and low false positive values, Cascaded-filtering has the highest AUC. To comparewithCollusionrank,we estimatedtheAUCofCollusionrankusingitstruepositive andfalse positivevalues.Con- sequently,compared toCollusionrank(AUC=0.92),ourapproachesarecompetitiveandrequiremuch lesssocial network information.
7.2.Sensitivityanalysis
Fig.10showsastatisticalanalysistoevaluatethesensitivityoftheTSP-detectionstrategy.
Wecomputedtherecall(sensitivity)asfollowingequation:
recall=TruePositi
v
es/(
TruePositiv
es+FalseNegativ
es)
(8)Tosumup, fromTable 14, two proposed approaches, SS-Filtering andCascaded-Filtering, show better sensitivity than spamclassificationusingIn/Out degreeonly. Actually,TSP-filtering hasslightlylower sensitivitythanthose approaches, it
Algorithm1 TriadCensus INPUT :G
fori:=1to16do Ni=0 endfor for
v
∈V doifu∈
v
thenS:=Neighbor(u)∪Neighbor(
v
)\
{u,v
} ifLink(u,v
)andLink(v
,u)thenTriType:=3 else
TriType:=2 endif
N[TriType]:=N[TriType]+n-
|
S|
-2foreachw∈Sdo
ifu<w∨(
v
<w∧w<u∧¬Link(v,w) then TriType:=TriType[Tricode(v
,u,w)]N[TriType]:=N[TriType]+1 endif
endfor endif endfor sum:=0
fori:=2to16do sum:=sum+N[i] endfor
N[1]:=(1/6)n(n-1)-sum returnN
countersthislimitationwith bothhightrue positiveandfalse negative. Cascaded-Filtering,which isthe hybridapproach usingeveryfeasiblefeatures,showsthebestperformanceatsensitivitywithsuperiortruepositiveandfalsepositive.
7.3. Complexityanalysis
Followingpseudo-codeisTSP-Filteringanditconsistsoftwosteps:TriadcensusalgorithmandTriadSignificanceProfile computationalgorithm. At first,we modified input ofTriad census algorithm [2]to apply to two-hop input subnetwork.
G=(V,E)isthetwo-hopdirectedsubnetworkofauser.
Algorithm1computesfrequenciesof16isomorphictriadtypes.Inourwork,weusedonly13isomorphictriadsbecause triad types withisolated nodesare counted frequently so that this interrupts observing significanceof triads. From this algorithm, we can get frequencies of13 isomorphic triad types of given graph G. TSP-Filtering get 2-hop neighborhood social network ofa certain useras input.This algorithm hasthe complexityO(m) andm isthe numberof edges inthe graph G. In Algorithm 1,TriType means how many nodesare connected each other. This indicates that an isolate node existswhenTriTypeis2.[2]usedtheconceptofTricodetocounttriadefficiently,butwedontexplaindetailoftheconcept inthispaper.
Algorithm2computesZ-scoreandTSPofacertaininputuser.Forinputattributes,thisalgorithmhasinputvectorscon- sistingofmeanandstandard deviationoflegitimateuserstriadfrequency. Theseattributescouldbe gotatpreprocessing phase,whichcomputesmeanandstandarddeviationoflegitimateusersforthetrainingvalue.Sinceourworkistoclassify spammers fromlegitimateusers, weshould comparethe userstriad frequencyN (testingvalue) tolegitimateusers triad
Algorithm2 TriadSignificanceProfile.
INPUT :N,<Nlegit
i>,std(Nlegit) fori:1to13do
Zi=(N[i]− <Nlegit
i>)/std(Nlegit
i) endfor
fori:1to13do TSPi= Zi/(
Zi2)12) endfor
returnTSP
Algorithm3 PositiveLinkProbability.
INPUT :G,Status Positi
v
eLink:=0foreachu∈Neighbor(
v
)do ifStatus[v
]<Status[u]thenPositi
v
eLink:=Positiv
eLink+1 endifendforPLP:=Positi
v
eLink/|
Neighbor(v
)|
returnPLP
Table 15
Complexity analysis.
Collusionrank TSP-filtering SS-filtering Cascaded-filtering Time complexity O( N + M) O( m ) O(n) O( m + n ) Space complexity O( N ) O( m ) O( n ) O( m + n )
frequencyNlegit (trainingvalue).ThisalgorithmhastimecomplexityO(1)becauseofasimplenormalizationprocess.There- fore,TSP-FilteringisacombinedalgorithmofTriadCensusalgorithmandTSPalgorithm.Thenoveltyofourworkisthatwe usedausers2-hopneighborhoodsocialnetworkasinitialinput.Also,we compareeach userscensustooveralllegitimate userstoobserveoverrepresentationandunderrepresentationineachisomorphictriad.Overrepresentationandunderrep- resentationoftriadscould bethe distinct featuresto classify spammersfromlegitimate users.So,TSP-Filtering hasO(m) timecomplexityforatarget user.Followingpseudo-codeisSS-Filtering.Actually,inpreprocessingphase,wecancompute thestatusofauserwithhis/herindegreeandoutdegree.Sincewedefinedthestatusofauserastheratiooftheindegree totheoutdegreeoftheuser(Eq.(6)),wejustupdatethreevaluesindegree,outdegreeandstatusinrealtimecase.So,we onlyneed positive link probability (PLP) ofa user.The simple algorithm forPLPis asfollows.Status means atable with statusofeveryuserinG.
Algorithm3computestheratioofthenumberofneighborswithhigherstatusthanuser
v
tothenumberofneighbors ofuserv
.SinceSS-Filteringhighlydependsonpreprocessingphase,thissimplealgorithmhasthecomplexityO(n)whenn meansthenumberofusersinG.Additionally,Table 15 showsthe complexity analysis of the proposed schemes andCollusionrank. Actually, since our workisforeachusers2-hopneighborhood socialnetwork,complexitycomparisonwithCollusionrankseems tobeironic.
Tosumup,Collusionrankisforclassifyinglargenumberofspammeraccountsfromwholesocialnetwork,butTSP-filtering, SS-filteringandCascaded-filteringisfordeterminingifauserisspammerornot.nmeansthe numberofuser(nodes)in 2-hopneighborhoodsocialnetworkandmisthenumberofrelation(edges)in2-hopneighborhoodsocialnetwork.Also,N meansthenumberofeveryusers(nodes)inwholesocialnetworkandMmeansthenumberofevery relations(edges)in wholesocialnetwork.ThistableshowedthatourapproachesneedlesscomputationthanCollusionrank.
8. Futurework
8.1. Dynamicfollowspammerdetection
Firstofall,camouflageattack,whichusescompromiseduseraccounttodisseminatespamcontents,couldbeoneofthe interestingfuturework.Tosolvethecamouflageattackproblem,unsupervisedschemecanbeconsideredfordetectingnewly occurredspammerstotheproposedscheme(i.e.,unsupervisedschemebasedoncascadedsocialinformation).Forexample, wecanselectself-organizingmap(SOM)amongmachinelearningalgorithmsformappingbetweenspammerlabeledwith supervisedschemebasedoncascadedsocialinformationandnewspammerbasedonunsupervisedscheme.
Indetail,SOM isone ofclusteringalgorithms usingEuclideandistanceconcept betweensourceanddestination.Ifwe train spammer patterns with cascaded social information scheme as a supervised concept that can be a partial part of pre-definedmap ofSOMforprocessingandclusteringnewinputautomaticallyasa unsupervisedconcept.SOMcalculates EuclideandistancebetweenspammeraccountwithcascadedsocialinformationandnewTwitteraccounttocheckwhether the tendencyof newaccount is closerto spammer or legitimate.In each epoch, SOM can update the tendency ofeach accountbasedonEuclideandistancedynamicallyanddistinguishbetweentime-varyingspammerandlegitimatebasedon clusteringmap.Infuture,wewillapplySOMorotheroptimalsolutionfordetectingdynamicspammeraccordingtotime.
8.2.Generalizedapplicationsoftheproposedscheme
Intheview of socialbehavior, people havedifferentsocial networking stylebasedon their purposeofusing SNS.For example,spammers make numerousfollowing linkstounspecifiedindividuals withthe aimofspreading spamcontents.
On the other hand, ifsomeone wants to make friends in SNS, they would carefullyselect users asfriends based on a
personal preferenceandcloseness.And finally,thesesocial linksmadeby varioususers could beinterpreted incascaded socialinformation.
As showninourexperiment, cascadedsocial informationandits structuralanalysissuggest variousfutureapplication in social network security. In future work, we are to finda correlation betweenvarious behavior patternsof using SNS andcascadedsocialinformation.ThebehavioralpatternsofusingSNScanbemorespecifiedbyvarioustypesregardlessof spammerandlegitimate.Oneo