Movie Popularity Classification Based on Support Vector Machine Combined with Social Network Analysis


Tserendulam Dorjmaa*․Taeksoo Shin**

Submitted: January 30, 2016; 1st Revision: June 26, 2017; Accepted: July 7, 2017

* Ph.D. Candidate, Department of Management Information Systems, Yonsei University

** Corresponding Author, Professor of Management Information Systems, Division of Business Administration, College of Government and Business, Yonsei University Wonju Campus

Abstract

The rapid growth of information technology and mobile service platforms such as the Internet, Google, and Facebook has led to an abundance of data. In this environment, the world is now facing a revolution in the way data is searched, collected, stored, and shared. The abundance of data creates opportunities for knowledge discovery and data mining techniques. In recent years, data mining methods for discovering and extracting knowledge from databases have become more popular in e-commerce services, in particular movie recommendation. However, most classification approaches for predicting movie popularity have used only a few types of movie information such as actor, director, rating score, language, and country.

In this study, we propose a classification-based support vector machine (SVM) model for predicting movie popularity based on movies' genre data and social network data. Social network analysis (SNA) is used to improve the classification accuracy. This study builds a movie network (one-mode network) from initial data that form a two-mode user-to-movie network. For the proposed method, we computed degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality as centrality measures in the movie network. These four centrality values and the movies' genre data were used to classify movie popularity. Logistic regression, neural network, naïve Bayes classifier, and decision tree models were also used as benchmarks for comparison with the performance of our proposed model. To assess classifier performance, this study used the open MovieLens database. Our empirical results indicate that our proposed model with movie genre and centrality data has approximately 10% higher accuracy than other classification models with only movie genre data. These results imply that our proposed model can be used to improve movie popularity classification accuracy.

Keywords: Movie Popularity Classification, Social Network Analysis, Movie Popularity, SVM


1. Introduction

In recent years, data mining methods have been widely used in many fields such as stock markets, business intelligence, marketing, customer relationships, forecasting, and healthcare (Park et al., 2012; Kabinsingha et al., 2012). Specifically, data mining techniques have been used for knowledge discovery, pattern mining, and event extraction in tasks such as classification, clustering, and prediction (Lash et al., 2015; Asad et al., 2012; Bhave et al., 2015).

The movie popularity classification model is an important topic of interest to both academia and the movie industry (Asad et al., 2012). Despite the large investment in movie production, the success or profitability of a movie is largely uncertain (Lash et al., 2015). Therefore, a classification method helps to make clear plans for new movies and to identify new movie trends. Classification models encompass a variety of statistical and data mining techniques that analyze current and historical facts to make predictions about future or unknown events in the movie industry (Bhave et al., 2015). Wide interest in predicting movie success started when Netflix announced the Netflix Prize for movie rating prediction (http://www.netflixprize.com) in 2006 (Amatriain and Basilico, 2012). After that, some studies tried to predict movie success from statistical attributes of movies such as actor, director, country, and genre. Predicting a movie's popularity at an early stage is a very important issue for investors. Hence, this study is interested in movie popularity classification approaches, in particular those based on social network data. To accomplish this, we propose a classification-based support vector machine (SVM) integrated with social network analysis (SNA).

SVM is a useful technique for data classification. It achieved the highest prediction or classification accuracy in previous studies (Huang et al., 2005; Hsu et al., 2010). SVM seeks to minimize an upper bound of the generalization error rather than the training error (Huang et al., 2005).

In recent years, Leem and Chun (2014) studied the impact of the product’s network position on demand or revenue of items. They found that a book’s position within this recommendation network affects its overall demand. Based on their research, we use network measures of the item as important variables for its classification.

The purpose of this study is to discover more effective and accurate classification models with network measures of recommendation network of items.

Our research questions are as follows. First, how does an item's position in the network affect its popularity classification? Second, which centrality measure is the most useful variable in classification? Lastly, how much does classification performance improve when centrality variables are used as input variables in the classification model? To answer these questions, we take the following steps. First, we build a movie network (item network) based on co-purchasing data. Second, we compute centralities such as degree, betweenness, closeness, and eigenvector centrality. Finally, this study implements classification based on centralities and social network data.

This paper used the MovieLens open dataset (http://movielens.org), which has been used in many previous research projects (Ghazanfar and Prugel-Bennet, 2014; Bao and Xia, 2012; Kim and Im, 2014). Logistic regression, neural network, naïve Bayes classifier, and decision tree models were also used as benchmarks for comparison with the performance of our proposed models.

2. Related Work

Movie classification methods have been studied in e-commerce recommendation systems over the past several decades. Most of them focused on user review data and were used to recommend future purchases to users. In this section, we review some previous studies of prediction or classification methods in the movie industry.

Oh and Chae (2015) proposed a movie rating prediction method using sentiment analysis of movie review data. Their analysis consists of two stages: the collection of sentiment words and the creation of sentiment sentences for the estimation. Kabinsingha et al. (2012) proposed a movie rating classification model using various movie attributes such as actor, director, budget, and genre. They found that the most effective attributes are the genre and the words used in the movies.

Bhave et al. (2015) studied different success factors of movies based on predictive models. They suggested that the success of an unreleased movie can be accurately predicted by considering classical factors such as cast, producer, and director as well as social factors such as user anticipation or feedback through social media.

Bao and Xia (2012) proposed a movie rating prediction method for an efficient recommendation system. They applied the MovieLens database to build a user-based matrix and predicted rating values for the sparse recommendation matrix. They applied various rating prediction algorithms and found that the methods based on matrix factorization outperform the other ones, which only use the stochastic information of the training data.

Asad et al. (2012) proposed a classification scheme targeting the future popularity of a movie. This scheme was based on essential attributes of movies, i.e., actor, director, rating, language, and budget. They defined the correlation coefficient among essential attributes of post-movie data and then used it to discover the necessary factors in movie popularity classification. Finally, they selected suitable classifiers based on the IMDB movie dataset and the target classification. It was a novel approach to predicting pre-release movie popularity based on inherent attributes using the C4.5 and PART classifier algorithms.

In addition, some studies examined social network analysis in recommendation systems and other fields. For example, Leem and Chun (2014) examined the impact of an item's network position on its overall demand. They built a recommendation network of items based on Amazon's book co-purchasing dataset from Stanford University. Their results showed that degree centrality has a significantly positive effect on an item's demand. This means that best-selling items are positioned at the center of the recommendation network.

Ghazanfar and Prugel-Bennet (2014) first raised the gray-sheep user problem in recommendation systems. The presence of gray-sheep users in the network can affect the performance, accuracy, and coverage of the recommendation system. They used a clustering algorithm to solve the gray-sheep user problem. The research proposed various improved centroid selection approaches and distance measures for the K-means clustering algorithm, and used content-based profiles of gray-sheep users. The proposed approach reduced the recommendation error rate for gray-sheep users while maintaining reasonable computational performance.

Researcher | Purpose | Data type | Model
Asad et al. (2012) | Classification of the movie popularity | Movie attribute data in IMDB | C4.5, CART, Decision tree
Bao and Xia (2012) | Predicting movie rating value | MovieLens, IMDB data, Social media data | CART, Decision tree, Neural networks
Bhave et al. (2015) | Literature survey of movie classification | Classical factors of movies, Social media | K-means clustering, Neural network, Naive Bayes, Logistic regression, SVM
Chaovalit and Zhou (2005) | Movie classification | Movie review (text) data | N-gram classifier
Kabinsingha et al. (2012) | Movie classification using the feature selection method | Movie attribute data in IMDB | Decision tree (J48)
Lash et al. (2015) | To predict the profitability of movies | Movie attribute data, Social network data | Logistic regression
Manek et al. (2013) | Movie classification using the review data | Movie review, Sentimental data | SVM, KNN, Naïve Bayes

<Table 1> Previous Research on Movie Popularity Classification

Kim and Im (2014) proposed a new algorithm that utilizes SNA techniques to resolve the gray-sheep user problem. They utilized degree centrality in SNA to identify users with unique preferences (gray sheep). To find gray-sheep users, they built a user network from the user-item bipartite network and then computed degree centrality values for each user node.

Symeonidis et al. (2009) proposed a social-union method, which combines multiple similarity matrices derived from a unimodal and a bimodal network for rating prediction. They built a rating matrix using the user-item (bimodal) rating network and computed the cosine similarity between each pair of users.

Christakou et al. (2005) proposed a hybrid movie recommendation system based on neural networks. The content-based part of the system was based on a neural network, and for the collaborative filtering part they used the Pearson correlation formula to find the correlation between a user and other users. To evaluate their proposed hybrid recommendation system, they used the MovieLens dataset.

Lin et al. (2014) proposed a recommendation system for movie lovers using neural networks in collaborative filtering systems for consumer’s experiential decisions. Their experimental results revealed that it not only improves the accuracy of predicting movie ratings but also increases data transfer rates and provides richer user experiences.

Recently, many studies using data mining and social network analysis have been proposed for prediction, classification, and recommendation. However, previous studies have two limitations. First, most movie prediction models have focused on users' preferences in selecting future movies based on historical rating data or review data (Asad et al., 2012). Second, predictive models require various data such as cast, producer, director, and user responses to assess a movie.

This study proposes a new classification method integrating SVM and SNA. We suggest new, important additional variables for movies extracted through SNA. The MovieLens dataset used in this paper is not by itself sufficient to achieve high classification performance because it includes only a few movie attributes and does not include any review data.

In recent years, SNA has been used for calculating centrality measures of the movie network (one-mode network). <Table 1> summarizes previous studies of movie classification and prediction of movie ratings.

3. Research Model

This study proposes a classification-based support vector machine model for classifying movie popularity based on movies' genre data and the centrality measures of each movie. SNA is used to extract new variables from the MovieLens dataset and is useful for improving the classification accuracy. This paper focuses on the SVM classification model and uses SNA only to extract new useful variables such as centrality measures. In other words, we suggest new variables that measure a movie's network position using SNA. By applying this classification model to our empirical data, we investigate whether our model is able to achieve good performance by using a wide variety of features such as movie attributes or social network data.

First of all, we built a movie network (one-mode item network) from initial data that form a two-mode user-to-movie network. From this network, we computed degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality as each movie's centrality values in the movie network. The centrality measures indicate the importance or position of each movie in the whole network.

Our proposed model is an integrated model using SNA data and movies' genre data for classification of movie popularity. We used the SVM model for the classification. SVM is a useful technique for data classification and achieved the highest prediction or classification performance in previous studies (Huang et al., 2005; Hsu et al., 2010). SVM seeks to minimize an upper bound of the generalization error rather than the training error (Huang et al., 2005). Logistic regression, neural network, naïve Bayes classifier, and decision tree models were also used as benchmarks for comparison with the performance of our proposed models.

Our detailed research process is as follows. First, we build the two-mode (user-to-movie) network based on the MovieLens rating dataset (http://movielens.org). Second, this study converts the initial two-mode (user-to-movie) network into a one-mode (movie-to-movie) network. Third, we calculate centrality measures for each node of the movie network. Finally, this study classifies movie popularity using data mining algorithms. The Weka and UCINET software tools were used in our research. Our input variables consist of the movie genre data and the centrality measures of each movie: 19 genres and 4 centrality measures. That is, the input variables include 4 numerical variables and one nominal variable encoded as 19 dummy variables. The target variable is a categorical variable, i.e., the movie popularity measure. In this research, it is defined as a positive or negative rating derived from the original average user rating of each movie. The main research process is summarized in <Figure 1>.
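As a rough illustration of the input layout described above, the following sketch assembles one movie's 19 genre dummies and 4 centrality values into a 23-element feature vector and derives a binary class from the average rating. The function name `build_instance` and, in particular, the cut-off of 3.5 are assumptions made for this example only; the text does not state the threshold used to separate positive from negative ratings.

```python
# Genre names follow the 19 categories listed in the dataset description.
GENRES = [
    "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western", "Unknown",
]


def build_instance(genres, centralities, ratings, threshold=3.5):
    """Return (features, label) for one movie.

    genres: set of genre names attached to the movie
    centralities: (degree, betweenness, closeness, eigenvector)
    ratings: list of user ratings for the movie
    threshold: HYPOTHETICAL cut-off on the average rating (not from the paper)
    """
    dummies = [1.0 if g in genres else 0.0 for g in GENRES]
    features = dummies + list(centralities)
    mean_rating = sum(ratings) / len(ratings)
    label = 1 if mean_rating >= threshold else 0
    return features, label


# Toy movie: two genres, centrality values in the style of Table 4.
features, label = build_instance(
    {"Drama", "War"}, (0.948, 0.001, 0.950, 0.043), [4, 5, 3, 4]
)
```

Keeping the genre order fixed in one list means every movie maps to the same column positions, which is what a classifier expects.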

<Figure 1> Main Research Process

3.1 Social Network Analysis

Social network analysis (SNA) was initially developed and practiced by social scientists to investigate the interaction between network actors in order to understand complex phenomena and the structure of the social world (Xu et al., 2009). The nature of the interaction refers to relationships among network actors such as friendships, co-purchasing, co-working, or information exchange. From a network perspective, actors or nodes can be individuals or organizations (Freeman, 1979; Leem and Chun, 2014). Recently, SNA has attracted scientists from various research fields such as e-commerce, psychology, biomedical engineering, geography, and social science. In particular, many researchers have attempted to improve the quality of recommender systems by using social network analysis to build a recommendation network of items or users. SNA has been used to extract the relationships between users or items and to build a model capable of describing these adjacencies.

From a user perspective, people have a natural tendency to interact and socialize more frequently with individuals who are similar to themselves (Park et al., 2012). This is called the homophily phenomenon (Kossinets and Watts, 2010). Consumers who are connected to one another are therefore likely to have similar characteristics and product preferences (Park et al., 2012; Ma et al., 2015). SNA can thus produce an undirected network or graph of similar purchases, where each node corresponds to a customer and there is an edge between two nodes if the two corresponding customers have a close relationship (Ma et al., 2015). Several studies proposed SNA techniques as a new similarity measure (Kim and Im, 2014; Park et al., 2012; Lee and Kwon, 2014).

On the other hand, Leem and Chun (2014) proposed SNA as an item network analysis. They discovered that a book's position within an online recommendation network affects its overall demand, meaning that best-selling items are positioned at the center of the recommendation network. Based on this prior research, we use the network measures of an item as important variables for its classification.

Centrality is one of the most important and widely used measurements in social network analysis, and researchers usually use it to identify the most powerful, influential, or critical actors in a network. There are several important centrality measures, i.e., degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality (Xu et al., 2009). These measures are summarized as follows.

3.1.1 Degree Centrality

Degree centrality is defined as the degree of the given node and reflects the popularity and relational activity of actors (Frank, 2002; Santos et al., 2006). Nodes that have more links to others tend to be in advantaged positions (Xu et al., 2009). A node's degree can be used to measure its centrality and power in the whole network. It is calculated by counting all direct links to the node in the network. Degree centrality can be formulated as follows (Leem and Chun, 2014; Freeman, 1979).

$$C_D(i) = \sum_{j=1}^{n} a_{ij} \qquad (1)$$

where $n$ is the number of nodes, and element $a_{ij} = 1$ when a direct link exists between nodes i and j and $a_{ij} = 0$ otherwise.
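A minimal sketch of this definition, assuming the network is given as a 0/1 adjacency matrix stored as a list of lists: the degree of node i is the sum of row i. The second function divides by n - 1, a common normalization that keeps values between 0 and 1; whether the values reported in this study were normalized this way is not stated, so treat that part as an assumption. The toy matrix is made up for illustration.

```python
def degree_centrality(adjacency):
    """Raw degree: number of direct links of each node (row sums)."""
    return [sum(row) for row in adjacency]


def normalized_degree_centrality(adjacency):
    """Degree divided by the maximum possible degree, n - 1 (assumed normalization)."""
    n = len(adjacency)
    return [d / (n - 1) for d in degree_centrality(adjacency)]


# Toy undirected network with 4 nodes: node 0 is linked to every other node.
toy_adjacency = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

raw = degree_centrality(toy_adjacency)
scaled = normalized_degree_centrality(toy_adjacency)
```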

3.1.2 Closeness Centrality

Closeness centrality is based on the geodesic distance, that is, the minimum length of the path from node i to j (Freeman, 1979; Santos et al., 2006). It indicates the node's availability, safety, and security (Frank, 2002; Santos et al., 2006). For node i, closeness centrality can be represented as:

$$C_C(i) = \sum_{j \in N,\, j \neq i} d_{ij} \qquad (2)$$

where $N$ is the set of nodes and $d_{ij}$ is the distance from node i to another node j in the given network. Since closeness centrality is based on the distance between nodes, it is an inverse measure of centrality in that a larger value indicates a less central node while a smaller value indicates a more central node (Leem and Chun, 2014).
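The sketch below assumes, consistent with the remark that larger values indicate less central nodes, that the measure is the sum of shortest-path distances from node i to every other node. It uses a breadth-first search on an unweighted graph stored as an adjacency list and assumes the graph is connected (unreachable nodes would simply be left out of the sum). The toy graph and function names are illustrative only.

```python
from collections import deque


def shortest_path_lengths(adj_list, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_sum(adj_list, i):
    """Sum of geodesic distances from node i to all other reachable nodes."""
    dist = shortest_path_lengths(adj_list, i)
    return sum(d for node, d in dist.items() if node != i)


# Toy path graph a - b - c - d: the inner nodes are closer to everyone.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
sums = {node: distance_sum(toy_graph, node) for node in toy_graph}
```

Taking the reciprocal of each sum (optionally multiplied by n - 1) gives the variant in which larger values mean more central nodes.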

3.1.3 Betweenness Centrality

Betweenness centrality of a given node indicates the number of shortest paths between all other nodes that pass through that node. In other words, the betweenness centrality of a node i is defined as the number of shortest paths between pairs of other nodes that run through i (Girvan and Newman, 2002). The betweenness centrality of node i can be obtained through the following formula:

$$C_B(i) = \sum_{j \neq k \neq i \in N} \frac{g_{jk}(i)}{g_{jk}} \qquad (3)$$

where $g_{jk}$ is the number of all shortest paths between nodes j and k, and $g_{jk}(i)$ is the number of those shortest paths that pass through i. This measurement represents the actor's capability to influence or control the interaction between the actors it links (Santos et al., 2006; Leem and Chun, 2014).
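One way to compute this quantity on an unweighted, undirected graph is Brandes' accumulation scheme, sketched below. It is offered as an illustration rather than as the procedure used in the study, which relied on UCINET. The sketch counts each unordered pair of end nodes once, and the function name and toy graphs are assumptions of this example.

```python
from collections import deque


def betweenness_centrality(adj_list):
    """Share of shortest paths between other node pairs that pass through each node."""
    score = {v: 0.0 for v in adj_list}
    for s in adj_list:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj_list}
        dist = {v: -1 for v in adj_list}
        preds = {v: [] for v in adj_list}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj_list[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate path shares from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj_list}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both of its end nodes.
    return {v: score[v] / 2 for v in score}


path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
diamond = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
path_scores = betweenness_centrality(path_graph)
diamond_scores = betweenness_centrality(diamond)
```

In the diamond graph, the two shortest paths between opposite corners split the credit, so each node receives 0.5.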

3.1.4 Eigenvector Centrality

Eigenvector centrality means that links to actors who are themselves influential give an actor more influence than connections to less influential actors (Leem and Chun, 2014). Eigenvector centrality generalizes degree centrality by incorporating the importance of the neighbors (or incoming neighbors in directed graphs). It is defined for both directed and undirected graphs (Zafarani et al., 2014).

$$C_E(i) = \frac{1}{\lambda} \sum_{j=1}^{n} a_{ij}\, C_E(j) \qquad (4)$$

where $a_{ij} = 1$ if node i is linked to node j and $a_{ij} = 0$ otherwise, $C_E(i)$ is the centrality of node i, which is proportional to the sum of the centralities of i's network neighbors, and $\lambda$ is a constant (Leem and Chun, 2014).

3.2 Classification Method in Data Mining

3.2.1 Support Vector Machine

The support vector machine (SVM) is a popular classification technique in data mining. It is a supervised learning model with associated learning algorithms that analyze and recognize patterns in data for classification and regression analysis. SVM requires that each data instance be represented as a vector of real numbers. Therefore, before applying SVM, we first have to convert any categorical attributes into numeric data. Second, scaling is required before applying SVM. Scaling is very important in order to prevent attributes in greater numeric ranges from dominating those in smaller numeric ranges (Hsu et al., 2008, 2010).

Third, we have to select the kernel function type for SVM. The RBF (radial basis function) kernel is the most commonly used kernel function because it non-linearly maps samples into a higher-dimensional space, so, unlike the linear kernel, it can handle data in which the relation between class labels and attributes is nonlinear (Hsu et al., 2008, 2010). In this paper, we used the RBF kernel in SVM.
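The two preparation steps that matter most numerically, scaling and the kernel, can be sketched in a few lines. Min-max scaling to [0, 1] is one common choice and is assumed here; the text does not say which scaling was applied. The RBF kernel is written in its usual form exp(-gamma * ||x - z||^2), with `gamma` as a free parameter whose value in the study is not reported.

```python
import math


def min_max_scale(column):
    """Rescale one attribute to the range [0, 1]; constant columns map to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


scaled_column = min_max_scale([2, 4, 6])
same_point = rbf_kernel([0.0, 0.0], [0.0, 0.0])
unit_apart = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0)
```

Without scaling, an attribute with a large numeric range would dominate the squared distance inside the kernel, which is the concern raised above.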

3.2.2 Naïve Bayes Classifier

The naive Bayes classifier (Friedman et al., 1997) is widely used for discriminative prediction tasks such as classification in the data mining and machine learning fields because of its simplicity and its classification accuracy compared with other supervised learning methods. Its simplicity follows from the assumption that all features are independent of each other (Koc et al., 2012; Hsu et al., 2008; Fan et al., 2010). Naive Bayes classifiers have proven successful in many domains, despite the simplicity of the model and the restrictiveness of the independence assumption (Hsu et al., 2008; Fan et al., 2010).

However, the traditional naive Bayes classifier is limited to categorical or discrete data and has two major limitations, i.e., underflow and overfitting. Hsu et al. (2008) proposed an extended naive Bayes classification method capable of handling mixed data. They utilized the original naive Bayes algorithm to calculate the probabilities of categorical data and adopted statistical theory for numerical data. Chandra and Gupta (2011) proposed the robust naive Bayes classifier (RNBC), which uses a robust function for estimating probabilities to handle the underflow and overfitting problems of the traditional naive Bayes classifier. Koc et al. (2012) also showed that the hidden naive Bayes (HNB) model can exhibit superior overall performance in terms of accuracy, error rate, and misclassification costs compared with the traditional naive Bayes model because HNB relaxes the naive Bayes method's conditional independence assumption (Koc et al., 2012; Farid et al., 2014).
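To make the independence assumption and the underflow issue concrete, the sketch below implements a plain categorical naive Bayes classifier. It is not RNBC or HNB: it uses two standard devices instead, add-one (Laplace) smoothing so that unseen values do not produce zero probabilities, and sums of log-probabilities so that products of many small numbers do not underflow. All names and the toy data are illustrative.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies for each feature."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)            # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for k, v in enumerate(row):
            value_counts[(label, k)][v] += 1
            values[k].add(v)
    return class_counts, value_counts, values


def predict(model, row):
    """Pick the class with the largest log-posterior (up to a constant)."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for k, v in enumerate(row):
            # Add-one smoothing; features are treated as independent given the class.
            numerator = value_counts[(label, k)][v] + 1
            denominator = count + len(values[k])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: the first (binary) feature fully determines the class.
toy_rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
toy_labels = [1, 1, 0, 0]
nb_model = train_naive_bayes(toy_rows, toy_labels)
```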


3.2.3 Decision Tree

Decision trees are generated from training data in a top-down, general-to-specific direction. Decision tree classifiers are a very useful solution for classifying instances in a large dataset with a large number of variables. A decision tree is an unstable classifier that can produce different outputs in successive training runs under the same conditions (Parvin et al., 2015). A decision tree sorts instances down the tree from the root node to a leaf node. The nodes at the bottom are leaf nodes, and a target class attribute is assigned to each leaf node (Tiwari and Prakash, 2014). At each node of the tree, an if-then rule on an attribute of the instance is applied, and each branch descending from that node represents one of the possible values of this attribute.

To classify an instance, one starts at the root of the tree, tests the attribute specified by this node, and then moves down the branch that represents the value of the attribute for this instance. In general, a decision tree can be seen as a disjunction of conjunctions of attribute values of instances, where each path from the root to a leaf is a conjunction of attribute tests and the tree itself is a disjunction of these conjunctions.
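The traversal and the "disjunction of conjunctions" view can both be shown on a tiny hand-written tree. The tree below is hypothetical: the attributes `drama` and `high_degree` and the leaf labels are invented for illustration and are not a tree learned in this study.

```python
# A hypothetical two-level tree stored as nested dictionaries; leaves are strings.
toy_tree = {
    "attribute": "drama",
    "branches": {
        1: {"attribute": "high_degree",
            "branches": {1: "positive", 0: "negative"}},
        0: "negative",
    },
}


def classify(node, instance):
    """Walk from the root to a leaf, following the branch for each tested value."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


def paths_to(node, target, prefix=()):
    """List every root-to-leaf conjunction of attribute tests ending in `target`."""
    if not isinstance(node, dict):
        return [prefix] if node == target else []
    found = []
    for value, child in node["branches"].items():
        found += paths_to(child, target, prefix + ((node["attribute"], value),))
    return found


prediction = classify(toy_tree, {"drama": 1, "high_degree": 1})
positive_rules = paths_to(toy_tree, "positive")
negative_rules = paths_to(toy_tree, "negative")
```

Each entry of `negative_rules` is one conjunction, and the class "negative" is predicted when any one of them holds, which is the disjunction.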

3.2.4 Neural Network

The neural network is a very useful and powerful classifier in the data mining, machine learning, and pattern recognition fields. It was the strongest classifier before the development of SVM. The basic structure of a neural network consists of three layers: input, hidden, and output. The neural network is composed of processing units, which are linked into a nerve-like tree framework. The processing units have different processing weights (such as the linkage strength) that determine the output results (Lin et al., 2014). Neural networks are designed to solve a variety of problems in data mining such as classification, clustering, function approximation, prediction, optimization, and control. In this paper, we use a multi-layer perceptron model for classification of our dataset.
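The three-layer structure can be illustrated with a forward pass through one hidden layer. The sketch assumes sigmoid activations and uses hand-picked weights; training (for example by backpropagation) is omitted, and none of the numbers come from the study.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per unit, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Two inputs, two hidden units, one output unit (hand-picked toy weights).
output = forward(
    [0.3, 0.9],
    hidden_w=[[0.0, 0.0], [0.0, 0.0]], hidden_b=[0.0, 0.0],
    output_w=[[1.0, 1.0]], output_b=[-1.0],
)
```

With all hidden weights at zero, both hidden units output 0.5, so the output unit sees 0.5 + 0.5 - 1.0 = 0 and returns 0.5 regardless of the input.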

4. Experiments and Results

4.1 Dataset and Pre-Processing

In order to evaluate the performance of our proposed models, an empirical study was conducted using the publicly available MovieLens dataset from the GroupLens research team (http://movielens.org). This dataset consists of 10,000 ratings from 943 users on 1,682 movies. It has been used in many prior studies. A summary of our dataset is shown in <Table 2>.

Data | Content | Total
Movie | id, title, genre, release date | 1,682
User | id, gender, age, occupation, zip | 943
Rating | movieid, userid, rating, timestamp | 10,000

<Table 2> A Summary of MovieLens Dataset

To compute the centrality measures, we first build the user-movie matrix of the two-mode (user-movie) network using the rating data in the MovieLens dataset (Kim and Im, 2014). Examples of MovieLens rating data are shown in <Table 3>. Second, we convert the initial user-movie matrix of the two-mode (user-movie) network into the movie-movie matrix of a one-mode (movie-movie) network. Third, this study calculates the centrality of each node (each movie) of the one-mode (movie-movie) network.


Movie | Degree | Betweenness | Closeness | Eigenvector
M1 | 0.948 | 0.001 | 0.950 | 0.043
M2 | 0.924 | 0.000 | 0.929 | 0.042
M3 | 0.836 | 0.000 | 0.859 | 0.039
M4 | 0.945 | 0.000 | 0.947 | 0.043
M5 | 0.907 | 0.000 | 0.915 | 0.042
M6 | 0.750 | 0.000 | 0.800 | 0.035
M7 | 0.944 | 0.001 | 0.947 | 0.042

<Table 4> Example of Centralities of Each Movie

Dataset | Contents | Data Type
Centrality | Degree, Betweenness, Closeness, and Eigenvector | Numeric
Movie Genre | Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, Unknown | Nominal
Class | 1 or 0 (positive or negative) | Nominal

<Table 5> Research Dataset Description

UserID | MovieID | Rating | Timestamp
196 | 242 | 3 | 881250949
186 | 302 | 3 | 891717742
22 | 377 | 1 | 878887116
244 | 51 | 2 | 880606923
166 | 346 | 1 | 886397596
298 | 474 | 4 | 884182806
115 | 265 | 2 | 881171488
253 | 465 | 5 | 891628467
305 | 451 | 3 | 886324817
6 | 86 | 3 | 883603013
62 | 257 | 2 | 879372434
286 | 1014 | 5 | 879781125

<Table 3> Rating Data (Example) of MovieLens Dataset

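User | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 | M9 | M10
U1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
U2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
U3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
U4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
U5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
U6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0
U7 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1
U8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
U9 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0
U10 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0
U11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0
U12 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0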

<Figure 2> User-Movie Matrix

<Figure 3> Example of Two-Mode (User-Movie) Network

<Figure 4> Example of One-Mode (Movie-Movie) Network

<Figure 2>, <Figure 3>, and <Figure 4> show the process of projecting the initial data onto a one-mode network. The movies' genre and centrality data are used for movie popularity classification in our research. Examples of the four centrality measures are shown in <Table 4>.

Our research used the Weka and UCINET software tools for analyzing two kinds of input variables (genre and centrality data) and one dependent variable (class data).

<Table 5> summarizes our research dataset. This dataset consists of 1,682 instances, input variables with 19 genres and four centrality measures for each movie, and an output variable with the class category (positive or negative rating) for each movie.

Logistic regression, neural network, naïve Bayes classifier, and decision tree models were also used as benchmarks for comparison with the performance of our proposed classification-based SVM model for predicting movie popularity (Bhave et al., 2015).


4.2 Feature Selection Method

Our study uses feature selection methods to improve classification performance. Feature selection methods remove redundant or irrelevant variables without incurring a loss of information. Guyon et al. (2002) compared the effectiveness of various feature selection methods. According to their results, SVM achieved better performance than the other models. SVM is also very effective for discovering important features or attributes (Guyon et al., 2002; Wu et al., 2008).

Therefore, we used the wrapper subset method based on the SVM with a radial basis function (RBF) kernel as the variable selection method. First, when we build classification models with only genre data, the feature selection method selects nine of the 19 genre categories, which are defined as dummy input variables. These genres are animation, drama, fantasy, film-noir, horror, musical, mystery, romance, and war.

Second, when we develop classification models with both genre data and centrality data, the feature selection method selects nine genre variables (action, comedy, documentary, drama, film-noir, horror, musical, mystery, and romance) and three centrality variables (closeness, betweenness, and degree centrality).
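A wrapper method scores candidate feature subsets by actually running the classifier on them. The sketch below shows one simple search strategy, greedy forward selection, which keeps adding the feature that most improves the score until no addition helps. The search strategy used by the wrapper in this study is not stated, so this is only one possibility. The scorer `toy_score` is a made-up stand-in; in practice it would be something like the cross-validated accuracy of an RBF-kernel SVM trained on the subset.

```python
def forward_selection(features, evaluate):
    """Greedy wrapper search: add the best feature while the score improves."""
    selected, best_score = [], evaluate([])
    remaining = list(features)
    while remaining:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:
            break  # no remaining feature improves the score
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


def toy_score(subset):
    """HYPOTHETICAL scorer: fixed gains for three features, a small penalty otherwise."""
    gains = {"drama": 0.05, "closeness": 0.08, "degree": 0.03}
    return round(0.60 + sum(gains.get(f, -0.01) for f in subset), 4)


chosen, chosen_score = forward_selection(
    ["drama", "western", "closeness", "degree"], toy_score
)
```

Because every candidate subset requires retraining the classifier, wrapper methods are more expensive than filter methods, but they select features that suit the specific classifier.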

In order to compare the performance of movie popularity classification models, we used four kinds of classification models according to dataset type. Model (1) is a classification model with all movie genre data without the feature selection method. Model (2) is a classification model with all movie genre data and centrality measures without the feature selection method. Model (3) is a classification model with the best genre data selected by the feature selection method. Model (4) is a classification model with the best genre and centrality data selected by the feature selection method.

4.3 Empirical Results

This study employed SVMs to perform movie popularity classification. The kernel function of the SVMs is the radial basis function (RBF), and 10-fold cross-validation (K = 10) was conducted in our experiment. First of all, we evaluated the performance of our proposed classification models in terms of classification accuracy for SVMs with only the genre dataset and SVMs with both the centrality and genre datasets.
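As a rough illustration of the 10-fold protocol, the sketch below splits 1,682 instance indices into ten test folds whose sizes differ by at most one. The function name `k_fold_indices` and the fixed seed are assumptions made for this example. The source does not say whether the folds were stratified by class, and this sketch is not stratified.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test


folds = list(k_fold_indices(1682, k=10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Each instance appears in exactly one test fold, so every movie is classified once by a model that did not see it during training.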

The logistic regression, neural network, naïve Bayes classifier, and decision tree models were also used as benchmarks for movie popularity classification, for comparison with the performance of our proposed models.

We used classification accuracy, precision, sensitivity, and specificity, estimated with 10-fold cross-validation, to measure the performance of our proposed classification models.
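For reference, these measures follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions. The counts passed in the example are invented for illustration and are not results from this study, and the function assumes every denominator is non-zero.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # F1 is the harmonic mean of precision and sensitivity.
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Invented counts, for illustration only.
example = classification_metrics(tp=60, fp=20, tn=80, fn=40)
print({name: round(value, 3) for name, value in example.items()})
```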

First, the performance results of the movie popularity classification models without feature selection are shown in <Table 6> and <Table 7>. In both tables, SVM has the highest accuracy among our classification models. Second, we found that the classification model with genre and centrality data (Model (2)) performs better than the classification model with only genre data (Model (1)). For example, the classification accuracy of SVM with all movie genre and centrality data is 0.727. The results indicate that SVM with both the original genre and centrality data achieves approximately 10% higher accuracy than SVM with only genre data for movie classification.


Classifiers          Accuracy  Recall  Precision  F1
Naïve Bayes          0.612     0.738   0.462      0.568
Decision tree        0.613     0.724   0.480      0.578
Logistic regression  0.620     0.748   0.467      0.575
Neural network       0.595     0.679   0.493      0.572
SVM                  0.628     0.739   0.496      0.594

<Table 6> Classification Performance of Model with All the Movie Genre Data (Model (1))

Classifiers          Accuracy  Recall  Precision  F1
Naïve Bayes          0.715     0.640   0.805      0.713
Decision tree        0.722     0.752   0.687      0.718
Logistic regression  0.719     0.734   0.701      0.717
Neural network       0.709     0.684   0.738      0.710
SVM                  0.727     0.726   0.727      0.727

<Table 7> Classification Performance of Model with All the Movie Genre and Centrality Variables (Model (2))

Classifiers          Accuracy  Recall  Precision  F1
Naïve Bayes          0.620     0.721   0.499      0.589
Decision tree        0.613     0.724   0.480      0.578
Logistic regression  0.618     0.718   0.499      0.589
Neural network       0.605     0.621   0.586      0.603
SVM                  0.622     0.713   0.513      0.597

<Table 8> Classification Performance of Model with Best Movie Genre Variables Selected by Variable Selection Method (Model (3))

Classifiers          Accuracy  Recall  Precision  F1
Naïve Bayes          0.722     0.648   0.811      0.721
Decision tree        0.633     0.574   0.704      0.632
Logistic regression  0.718     0.723   0.713      0.732
Neural network       0.731     0.706   0.760      0.732
SVM                  0.734     0.735   0.734      0.734

<Table 9> Classification Performance of Model with Best Movie Genre and Centrality Variables Selected by Variable Selection Method (Model (4))

Classifiers          Model (1)*  Model (2)**  Model (3)***  Model (4)****
Naïve Bayes          0.612       0.715        0.620         0.722
Decision tree        0.613       0.722        0.613         0.633
Logistic regression  0.620       0.719        0.618         0.718
Neural network       0.595       0.709        0.605         0.731
SVM                  0.628       0.727        0.622         0.734

<Table 10> Classification Accuracy of 10-fold Cross-Validation

* Model (1) : a classification model with all movie genre data without feature selection method.

** Model (2) : a classification model with all movie genre data and centrality measures without feature selection method.

*** Model (3) : a classification model with best genre data selected by feature selection method.

**** Model (4) : a classification model with best genre and centrality data selected by feature selection method.

Third, we compared the classification performance of models with the best MovieLens genre data selected by the feature selection method (Model (3)) and models with the best MovieLens genre and movie centrality data selected by the feature selection method (Model (4)). The results are shown in <Table 8> and <Table 9>. We found that our proposed SVM with the best genre and centrality data selected by feature selection has the highest accuracy (0.734). This model performs better than any other model.

Finally, the comparison results of model performance are summarized in <Table 10> and <Figure 5>. The results indicate that our proposed SVM-based classification model outperforms the other models, i.e., naïve Bayes, neural network, logistic regression, and decision tree.


<Figure 5> Comparison Results of Model Performance (Classification Accuracy)
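The accuracy gains attributed to the centrality variables can be recomputed from the values reported in <Table 10>. The short sketch below does only that arithmetic. The helper name `centrality_gain` is introduced here for illustration, and the differences are absolute differences in accuracy, not relative improvements.

```python
# Accuracy values copied from <Table 10>: (Model 1, Model 2, Model 3, Model 4).
TABLE_10 = {
    "Naive Bayes": (0.612, 0.715, 0.620, 0.722),
    "Decision tree": (0.613, 0.722, 0.613, 0.633),
    "Logistic regression": (0.620, 0.719, 0.618, 0.718),
    "Neural network": (0.595, 0.709, 0.605, 0.731),
    "SVM": (0.628, 0.727, 0.622, 0.734),
}


def centrality_gain(row):
    """Accuracy difference from adding centrality data, without and with feature selection."""
    m1, m2, m3, m4 = row
    return round(m2 - m1, 3), round(m4 - m3, 3)


gains = {name: centrality_gain(row) for name, row in TABLE_10.items()}
best_model4 = max(TABLE_10, key=lambda name: TABLE_10[name][3])

for name, (without_fs, with_fs) in gains.items():
    print(f"{name}: +{without_fs:.3f} (Model 2 vs 1), +{with_fs:.3f} (Model 4 vs 3)")
print("Highest Model (4) accuracy:", best_model4)
```

For SVM, the difference between Model (2) and Model (1) is 0.727 - 0.628 = 0.099, which corresponds to the roughly 10% gain stated in the text.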

5. Conclusion

Movie popularity classification is one of the important topics in the movie industry and in academic research. In previous research, several prediction models have been proposed, but most of these models required various kinds of movie data. In this study, we proposed SVMs combined with SNA to extract significant input variables for movie popularity classification. The MovieLens dataset alone is not enough to achieve high classification performance because the original movie data include only limited movie attributes and do not contain any review attributes. Therefore, we suggest new network-based movie variables, extracted by measuring each movie's position in the movie network using SNA.

Our empirical results with MovieLens data showed that our proposed SVM model is able to achieve higher performance by using a wide variety of features such as movie attributes and social network data. In particular, SNA was used to measure the important movie nodes of the movie network in order to extract informative features that improve classification accuracy. This study built the movies' network (a one-mode network) from the initial user-movie matrix of a two-mode (user-to-movie) network. Then we computed each movie's degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality in the movies' network. Finally, those four centrality values and the movies' genre data were used to classify movie popularity in this study.
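A minimal sketch of the network construction step follows, under stated assumptions: two movies are linked if at least one user rated both (the source does not state the linking rule or any threshold that was applied), and degree centrality is normalized by n - 1, following the usual Freeman definition. The function names and the toy ratings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def project_to_movie_network(ratings: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Project (user, movie) pairs onto a one-mode movie network (assumed co-rating links)."""
    movies_by_user: Dict[str, Set[str]] = defaultdict(set)
    for user, movie in ratings:
        movies_by_user[user].add(movie)
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for movies in movies_by_user.values():
        for movie in movies:
            neighbours[movie]  # touch the key so isolated movies still appear as nodes
        for a, b in combinations(sorted(movies), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(neighbours)


def degree_centrality(neighbours: Dict[str, Set[str]]) -> Dict[str, float]:
    """Normalized degree centrality: number of neighbours divided by (n - 1)."""
    n = len(neighbours)
    if n < 2:
        return {movie: 0.0 for movie in neighbours}
    return {movie: len(adjacent) / (n - 1) for movie, adjacent in neighbours.items()}


# Toy two-mode data, invented for illustration only.
toy_ratings = [("u1", "A"), ("u1", "B"), ("u2", "B"), ("u2", "C"), ("u3", "D")]
network = project_to_movie_network(toy_ratings)
centrality = degree_centrality(network)
print(sorted(centrality.items()))
```

Betweenness, closeness, and eigenvector centrality would be computed on the same projected network. They are omitted here to keep the sketch short.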

For comparison with the performance of our proposed models, we used logistic regression, neural network, naïve Bayes classifier, and decision tree models as benchmarks for movie popularity classification. Our empirical results indicate that our proposed model with movie genre and centrality data has approximately 10% higher accuracy than the classification models with only movie genre data. These results imply that our proposed model, i.e., SVM combined with SNA, can be used to improve movie popularity classification accuracy.

References

Amatriain, X. and J. Basilico, “Netflix Recommendations : Beyond the 5 stars (Part 1)”, The Netflix Tech Blog, 2012, Available at https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429 (Accessed April 20, 2017).

Asad, K.I., T. Ahmed, and M.S. Rahman, “Movie Popularity Classification Based on Inherent Movie Attributes Using C4.5, PART and Correlation Coefficient”, International Conference on Informatics, Electronics and Vision (ICIEV2012), 2012, 747-752.

Bao, Z. and H. Xia, “Movie Rating Estimation and Recommendation”, CS229 2012 Project, Stanford University.

Bhave, A., H. Kulkarni, V. Biramane, and P. Kosamkar, “Role of Different Factors in Predicting Movie Success”, International Conference On Pervasive Computing, India, 2015, 911-915.

Chandra, B. and M. Gupta, “Robust Approach For Estimating Probabilities In Naive Bayesian Classifier for Gene Expression Data”, Expert Systems with Applications, Vol.38, No.3, 2011, 1293-1298.

Chaovalit, P. and L. Zhou, “Movie Review Mining : A Comparison between Supervised and Unsupervised Classification Approaches”, Proceeding of the 38th Hawaii International Conference on System Sciences (HICSS 2005), 2005, 1-9.

Christakou, C. and A. Stafylopatis, “A Hybrid Movie Recommender System Based on Neural Networks”, In Proceedings of the 2005, 5th International Conference on Intelligent Systems Design and Applications, 2005, 500-505.

Fan, L., K.L. Poh, and P. Zhou, “Partition-Conditional ICA for Bayesian Classification of Micro Array Data”, Expert Systems with Applications, Vol.37, 2010, 8188-8192.

Farid, D.M., L. Zhang, M.C. Rahman, M.A. Hossain, and R. Strachan, “Hybrid Decision Tree and Naïve Bayes Classifier for Multi-Class Classification Tasks”, Expert Systems with Applications, Vol.41, No.4, 2014, 1937-1946.

Frank, O., “Using Centrality Modeling in Network Surveys”, Social Networks, Vol.24, No.4, 2002, 385-394.

Freeman, L.C., “Centrality in Social Networks : Conceptual Clarification”, Social Networks, Vol.1, No.3, 1979, 215-239.

Friedman, N., D. Geiger, and M. Goldszmidt, “Bayesian Network Classifiers”, Machine Learning, Vol.29, No.2-3, 1997, 131-163.

Ghazanfar, M.A. and A. Prügel-Bennett, “Leveraging Clustering Approaches to Solve the Grey-Sheep Users Problem in Recommender Systems”, Expert Systems with Applications, Vol.41, No.7, 2014, 3261-3275.

Girvan, M. and M.E. Newman, “Community Structure in Social and Biological Networks”, Proceedings of the National Academy of Sciences of the United States of America, Vol.99, No.12, 2002, 7821-7826.

Guyon, I., J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification Using Support Vector Machines”, Machine Learning, Vol.46, No.1-3, 2002, 389-422.

Huang, W., Y. Nakamori, and S.Y. Wang, “Forecasting Stock Market Movement Direction with Support Vector Machine”, Computer and Operations Research, Vol.32, No.10, 2005, 2513-2522.

Hsu, C.C., Y.P. Huang, and K.W. Chang, “Extended Naïve Bayes Classifier for Mixed Data”, Expert Systems with Applications, Vol.35, No.3, 2008, 1080-1083.

Hsu, C.W., C.C. Chang, and C.J. Lin, “Practical Guide to Support Vector Classification”, Department of Computer Science National University, Taipei 106, Taiwan, 2010.

Kabinsingha, S., S. Chindasorn, and C. Chantrapornchai, “A Movie Rating Approach and Application Based on Data Mining”, International Journal of Engineering and Innovative Technology (IJEIT), Vol.2, No.1, 2012, 77-83.

Kim, M. and I. Im, “Resolving the Gray Sheep Problem Using Social Network Analysis (SNA) in Collaborative Filtering (CF) Recommender Systems”, Journal of Intelligent Information Systems, Vol.20, No.2, 2014, 137-148.

Koc, L., T.A. Mazzuchi, and S. Sarkani, “A Network Intrusion Detection System Based on Hidden Naïve Bayes Multiclass Classifier”, Expert Systems with Applications, Vol.39, No.18, 2012, 13492-13500.

Kossinets, G. and D.J. Watts, “Origins of Homophily in an Evolving Social Network”, American Journal of Sociology, Vol.115, No.2, 2009, 405-500.

Lash, M., S. Fu, S. Wang, and K. Zhao, “Early Prediction of Movie Success-What, Who and When”, LNCS 9021, Springer International Publishing Switzerland, 2015, 345-349.

Lee, H.S. and J.H. Kwon, “Similar User Clustering Based on Movielens Data Set”, Advanced Science and Technology Letters, Vol.51, No.8, 2014, 32-35.

Leem, B. and H. Chun, “An Impact of Online Recommendation Network on Demand”, Expert Systems with Applications, Vol.41, No.4, 2014, 1723-1729.

Lin, A.J., C.L. Hsu, and B.Y. Li, “Improving the Effectiveness of Experiential Decisions by Recommendation Systems”, Expert Systems with Applications, Vol.41, No.10, 2014, 4904-4914.

Ma, L., R. Krishnan, and A.L. Montgomery, “Latent Homophily or Social Influence? An Empirical Analysis of Purchase within a Social Network”, Management Science, Vol.61, No.2, 2015, 454-473.

Manek, A.S., R.P. Pallavi, H.V. Bhat, P.D. Shenoy, M.C. Mohan, K.R. Venugopal, and L.M. Patnaik, “SentReP : Sentiment Classification of Movie Reviews Using Efficient Repetitive Pre-Processing”, TENCON 2013, IEEE Region 10 Conference, 2013, 348-353.

Oh, Y.J. and S.H. Chae, “Movie Rating Inference by Construction of Movie Sentiment Sentence using Movie Comments and Rating”, Journal of Internet Computing and Services, Vol.16, No.2, 2015, 41-48.

Park, S.H., S.Y. Huh, W. Oh, and S.P. Han, “A Social Network-Based Inference Model For Validating Customer Profile Data”, MIS Quarterly, Vol.36, No.4, 2012, 1217-1237.

Parvin, H., M. MirnabiBaboli, and H.A. Rokny, “Proposing a Classifier Ensemble Framework based on Classifier Selection”, Engineering Application of Artificial Intelligence, Vol.37, 2015, 34-42.

Santos, E.E., L. Pan, D. Arendt, and M. Pittkin, “An Effective Anytime Anywhere Parallel Approach for Centrality Measurement in Social Network Analysis”, IEEE International Conference on Systems, Man, and Cybernetics, Taipei, Taiwan, Vol.6, 2006, 4693-4698.

Symeonidis, P., A. Nanopoulos, and Y. Manopoulos, “Moviexplain : A Recommender System with Explanations”, In Proceedings of the 2009 ACM Conference on Recommender Systems, 2009, 317-320.

Tiwari, A. and A. Prakash, “Improving Classification of J48 Algorithm Using Bagging, Boosting and Blending Ensemble Methods on SONAR Dataset Using Weka”, International Journal of Engineering and Technical Research, Vol.2, 2014, 207-209.

Wu, T., P. Karsmakers, H. Vanhamme, and D.V. Compernolle, “Comparison of Variable Selection Methods and Classifiers for Native Accent Identification”, 9th Annual Conference of the International Speech Communication Association, 2008, 305-308.

Xu, Y., J. Ma, Y. Sun, J. Hao, Y. Sun, and D. Zhao, “Using Social Network Analysis as a Strategy for E-Commerce Recommendation”, Pacific Asia Conference on Information Systems (PACIS), India, 2009.

Zafarani, R., M.A. Abbasi, and H. Liu, Social Media Mining : An Introduction, Cambridge University Press, 2014.


About the Authors

Dorjmaa Tserendulam

Dorjmaa Tserendulam is an integrated M.A./Ph.D. degree candidate in Management Information Systems at the Department of Management Information Systems, Graduate School, Yonsei University. She received a B.B.A. degree in Business Administration from Huree University of Information and Communication Technology of Mongolia in 2010. Her research interests include social network analysis, data mining, and recommendation systems.

Taeksoo Shin

Taeksoo Shin is currently a professor of Management Information Systems at the Division of Business Administration, Yonsei University, Wonju. He holds a B.B.A. and an M.S. degree in business administration from Yonsei University and a Ph.D. degree in management information systems from Korea Advanced Institute of Science and Technology (KAIST). His current research interests include social network analysis, data mining, knowledge management, and customer recommendation systems. He has published numerous papers, which have appeared in Expert Systems with Applications, Expert Systems, Vaccine, The Journal of MIS Research, Korean Journal of Intelligent Information Systems, Korea Management Review, Knowledge Management Research, Journal of the Korean Operations Research and Management Science Society, The Journal of Productivity, etc.
