Design and Implementation of Dynamic Recommendation Service in Big Data Environment
Kim Ryong*․Kyung-Hye Park**
Abstract
Recommendation Systems are information technologies that E-commerce merchants have adopted so that online shoppers can receive suggestions on items that might be interesting or complementary to their purchased items. These systems provide valuable assistance to the user’s purchasing decisions and deliver a quality push service. Traditionally, Recommendation Systems have been designed as centralized systems, but information services are growing rapidly and demand strong scalability. Next-generation information technologies such as Cloud Computing and the Big Data Environment handle massive data and can supply enormous processing power. Nevertheless, analytic technologies still lack the capabilities needed when processing big data. Accordingly, we design a conceptual service model with a newly proposed algorithm and user adaptation for a dynamic recommendation service in a big data environment.
Keywords:Big Data Environment, Dynamic Recommendation Service, E-commerce Merchants, Information Technology
Received:2019. 10. 01. Revised : 2019. 10. 28. Final Acceptance:2019. 10. 31.
※ This study was supported by research fund of ChungNam National University.
* First Author, Doctoral Candidate, MIS Major, Dept. of Management, ChungNam National University, e-mail:[email protected]
** Corresponding Author, Professor, School of Business, College of Economics and Management, ChungNam National University, 99 Daehak-ro, Yuseong-gu, Daejeon, Tel:042-821-5578, e-mail:[email protected]
for useful information [Ahmad, 2017; Greg, 2003]. Traditional database management systems, such as relational databases, have proven to be good for structured data but fragile in cases of semi-structured and unstructured data [Nandigam, 2005]. However, in reality data come from different sources in various formats, and the vast majority of these data are unstructured or semi-structured in nature [Balci, 2008]. Moreover, database systems are also being pushed to the limits of their storage capacity. As a result, organizations are struggling to extract useful information from the unpredictable explosion of data captured from inside and outside their organization. This explosion of data is referred to as “Big Data” [Ibrar, 2016]. Big data is a collection of large volumes of complex data that exceeds the processing capacity of conventional database architectures. Traditional databases and data warehousing technologies do not scale to handle billions of lines of data and cannot effectively store unstructured and semi-structured data [Bourret, 2000; Broisin, 2005]. Enterprises have adopted platforms for massive data processing; however, many of them have reported that the analytic techniques lack the capabilities needed for processing big data [Sanjay, 2013].
In this research paper, the focus is on designing a Decentralized Recommendation System (RS) to fully implement and benefit from
Rating Prediction and finding the Top-N items in the context of RS [Tewari, 2018; Evangelia, 2016]. Applications of RS inspired by the concept of artificial life and evolutionary algorithms have been used in a practical scenario, illustrated by recommending relevant advertisements to online shopping users [Kim, 2016].
2. Related Works
2.1 Hadoop
Hadoop is the open-source implementation of the Map-Reduce framework. Hadoop allows splitting an amount of work into many pieces and enables these pieces of work to be sent to the worker units [Mohd, 2015]. Those worker units can be very primitive computation engines based on cheap commodity hardware. In an ideal situation, they utilize some sort of direct-attached storage to minimize the network bottleneck. Usually, one job tracker and one name node exist. The job tracker receives the job as a large amount of work and splits it into many tasks. On the same node, a name node should also be installed. The name node holds a list of all available data nodes and of where files are located, including the replicas of files [Neyland, 2016]. Computation resources are measured in map and reduce capacity. With that information, a task can be sent to the node that holds the data locally, which avoids unnecessary network traffic and saves valuable task setup time. <Figure 1> illustrates the architecture of Hadoop and its ecosystem.
<Figure 1> Hadoop Ecosystem for Recommender Service
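To make the split-and-distribute idea concrete, the following is a minimal single-process sketch of the Map-Reduce pattern in plain Python. It does not use Hadoop itself; the (user, service) log format and the names `split_work`, `map_chunk`, and `reduce_counts` are assumptions made only for illustration.

```python
from collections import Counter
from functools import reduce


def split_work(records, n_pieces):
    """Split a list of records into n_pieces pieces, one per worker unit."""
    return [records[i::n_pieces] for i in range(n_pieces)]


def map_chunk(chunk):
    """Map step: count service usage inside one piece of the input."""
    return Counter(service for _user, service in chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical usage log of (user, service) pairs.
usage_log = [("u1", "s1"), ("u2", "s1"), ("u1", "s2"), ("u3", "s1"), ("u2", "s3")]

pieces = split_work(usage_log, 2)
partial = [map_chunk(piece) for piece in pieces]   # would run on worker units
total = reduce(reduce_counts, partial, Counter())  # would run as reduce tasks
```

In a real cluster the map calls would run in parallel on the nodes that hold the corresponding data; here they run sequentially so that the data flow is easy to follow.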
2.2 Hadoop Distributed File System
Hadoop Distributed File System (HDFS) plays an important role in the entire Hadoop cluster because it holds all the data of the whole cluster. It scales well, provides fault tolerance through inter-node replication, and handles file system errors transparently [Robert, 2016]. HDFS saves files in blocks of a predefined size. It is common to set this size to around 64 or 128 megabytes.
For a cluster environment with many terabytes, this size can result in a poor configuration, because it generates a remarkable number of blocks, all of which have to be managed by the name node. In addition, for optimal data locality on the worker nodes, the job tracker generates one map task per block. A map task is a costly unit, because it requires an expensive setup of the task on the worker node. With that in mind, setting an optimal block size is an important configuration decision. The block size also matters for non-splittable compression algorithms [Brent, 2016]. The coordinator of the HDFS is the name node, which lacks fault tolerance. If the node with the name node image (the meta-information of the HDFS) fails, the complete HDFS becomes unavailable. Because of this, there are other file systems. HDFS works best with read access, which makes it well suited to data warehouses.
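The effect of the block size on the number of blocks, and hence on the number of map tasks at one map task per block, can be checked with simple arithmetic. The sketch below assumes a hypothetical 10-terabyte data set treated as a single file and ignores replication; the helper name `block_count` is not part of Hadoop.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed for one file (integer ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


data_size = 10 * TB  # hypothetical data set size

blocks_64 = block_count(data_size, 64 * MB)    # blocks at 64 MB per block
blocks_128 = block_count(data_size, 128 * MB)  # blocks at 128 MB per block

print(blocks_64, blocks_128)
```

Under these assumptions, doubling the block size halves the number of blocks the name node has to track and the number of map tasks that have to be set up.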
2.3 Hive and HiveQL
Hive is mostly designed for offline OLAP queries, as used in common data warehouses. The data are saved as files and folders in HDFS. Hive gives users the possibility to write their own serialization and deserialization methods, which can be used, for example, to enable compression. HiveQL is a high-level query language for Map-Reduce provided by Hive to define and manipulate data stored in Hive. The ANSI SQL-92 standard is not fully implemented, but the language is similar to the well-known SQL language. The language supports most primitive data types, collections such as arrays and maps, and user-defined types. Table creation statements are provided by the Data Definition Language (DDL). Furthermore, indexes can be defined, and data can be loaded into tables by using LOAD and INSERT statements.
<Figure 2> Dynamic Recommendation Service Architecture
the services that are relevant to their selected positions in a designed process. <Figure 2> below shows the recommended design.
From the data flow presented, each element is linked through the recommender engine. To achieve the first objective, we use usage data and adapt well-known Collaborative Filtering (CF) techniques. We aim at discovering the user’s interests that are hidden in the usage data. We also intend to use CF techniques that have been developed for item recommendation and prediction. However, we do not ask users to provide additional information such as a profile, ratings, or comments when they select a service. Our Dynamic Recommendation Service dynamically recommends the services relevant to those interests.
To do so, we first identify user interests based on past usage data; then we integrate these interests into CF algorithms to calculate similarities between users and services. Based
and other services using the vector space model (VSM). Sort the services in descending order of similarity and then display the Top-l recommended services to the user. The pseudo code of service recommendation based on item-based Top-N CF is described in <Algorithm 1>.
input : Sx: currently used service
output: a recommended list of l services
S = set of services;
for each service Si in S do
    Compute the similarity between Si and Sx;
end
Sort Si ∈ S in descending order of similarity;
Select Top-l services for recommendation;
<Algorithm 1> Service-Driven Recommendation
input : Ux: active user
output: a recommended list of l services
U = set of users;
for each user Uj in U do
    Compute the similarity between Uj and Ux;
end
Sort Uj ∈ U in descending order of similarity values;
Select Top-k users from the sorted list of Uj ∈ U;
for each of k selected users do
    Select the t-most-frequently-used services to make a recommended list of l = k×t services;
end
<Algorithm 2> User-Oriented Recommendation
<Figure 3> UML Classification of Recommendation Service
The key step of the algorithm is finding the similarity between a service Sx and each of the other services. To compute this similarity, we apply VSM, which was developed to compute the similarity between two individual documents. It represents documents in a k-dimensional space, where k is the number of different terms. Each document is represented as a vector with k elements, and each element of a document vector corresponds to a term appearing in the document. The value of a vector element is the weight of the corresponding term. This weight is computed from the term frequency (TF) and the inverse document frequency (IDF). The similarity between two documents is computed as the cosine of the angle formed by the two corresponding vectors. In our approach, we analogously consider each row (service) in the usage matrix as a document and each column (user) as a term. The value of each element in the usage matrix is considered as the number of times that the corresponding term appears in the corresponding document. The similarity between two services is inferred from the similarity between two row vectors. We also apply the term frequency (TF) and inverse document frequency (IDF) to the usage matrix to compute the weight of each user (term).
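The service-driven procedure of <Algorithm 1>, with TF-IDF weighting and cosine similarity over the rows of the usage matrix, might be sketched as follows. The paper does not state which TF-IDF variant is used, so the raw count as TF and log(number of services / number of services used by the user) as IDF are assumptions, as are the function names and the small usage matrix. The sketch also leaves Sx itself out of the ranking, which the pseudo code does not state explicitly.

```python
import math


def tfidf_rows(usage):
    """Weight each row (service = document) of the usage matrix with TF-IDF.

    Assumption: tf is the raw count and idf = log(n_docs / df).
    """
    n_docs = len(usage)
    n_terms = len(usage[0])
    df = [sum(1 for row in usage if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in usage]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_l_services(usage, x, l):
    """Rank all services except x by similarity to service x; keep the top l."""
    weighted = tfidf_rows(usage)
    scores = [(cosine(weighted[i], weighted[x]), i)
              for i in range(len(weighted)) if i != x]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index
    return [i for _score, i in scores[:l]]


# Hypothetical usage matrix: rows are services S0..S3, columns are users U0..U3.
usage = [
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 0],
]
recommended = top_l_services(usage, x=0, l=2)
```

In this toy matrix only S1 shares users with S0, so it is ranked first; the remaining services have zero similarity and are ordered by index.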
3.2 User-Oriented Algorithm (UOA)
Inspired by the fact that users who have similar interests tend to select similar items, we aim to use the algorithm to find users with similar interests, i.e. users that used similar services.
The pseudo code of User-Oriented Recommendation is described in <Algorithm 2>. Recommendations are made by selecting the most frequently used services that were used by the most relevant users and were not used by the active user. In this algorithm we consider each user as a document and each service as a term, which is the reverse of the service-driven approach based on Sx.
The system generates recommendations in three steps. First, compute the similarity between the active user and the other users based on their usage data. Second, sort the other users in descending order of similarity and select the top-k users in the list. Finally, for each selected user, select the t-most-frequently-used services that were not used by the active user to make recommendations, as illustrated in <Figure 3>.
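The three steps above can be sketched in Python as follows. For brevity this sketch compares users with plain cosine similarity on raw usage counts rather than TF-IDF weights, which is a simplification; the function names and the usage data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def user_oriented(usage_by_user, x, k, t):
    """usage_by_user[u][s] is the number of times user u used service s."""
    active = usage_by_user[x]
    # Step 1: similarity between the active user and every other user.
    sims = [(cosine(row, active), u)
            for u, row in enumerate(usage_by_user) if u != x]
    # Step 2: sort in descending order of similarity and keep the top-k users.
    sims.sort(key=lambda pair: (-pair[0], pair[1]))
    neighbours = [u for _sim, u in sims[:k]]
    # Step 3: per neighbour, take the t most used services the active user lacks.
    recommended = []
    for u in neighbours:
        unused = [(count, s) for s, count in enumerate(usage_by_user[u])
                  if count > 0 and active[s] == 0]
        unused.sort(key=lambda pair: (-pair[0], pair[1]))
        for _count, s in unused[:t]:
            if s not in recommended:
                recommended.append(s)
    return recommended


# Hypothetical data: rows are users U0..U3, columns are services S0..S4.
usage_by_user = [
    [3, 1, 0, 0, 0],  # U0, the active user
    [2, 1, 4, 0, 1],
    [0, 0, 1, 5, 0],
    [1, 0, 0, 2, 0],
]
recommended = user_oriented(usage_by_user, x=0, k=2, t=1)
```

Because two neighbours may contribute the same service, the returned list can be shorter than k×t; the sketch removes such duplicates rather than padding the list.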
<Figure 4> UML-based Activity Model of Recommender Service
input : current user Ux, current used service Sy
output: a recommended list of l services
U = set of users;
for each user Uj in U do
    Compute the similarity between Uj and Ux;
end
Sort Uj ∈ U in descending order of similarity values;
Select Top-k users from the sorted list of Uj ∈ U;
A′[m×k] = usage data of the selected users;
S = set of services;
for each service Si in S do
    Compute the similarity between Si and Sy based on the new usage matrix A′[m×k];
end
Sort Si ∈ S in descending order of similarity values;
Select Top-l services for recommendation;
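The combined procedure above, which first narrows the usage matrix to the top-k similar users and then ranks services against Sy on the reduced matrix A′, might look like the following sketch. It uses plain cosine similarity on raw counts as a simplification, and the function names and the usage matrix are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_recommendation(usage, x, y, k, l):
    """usage[s][u]: m services by n users; x = current user, y = current service."""
    m, n = len(usage), len(usage[0])

    def column(u):
        return [usage[s][u] for s in range(m)]

    # Rank the other users by similarity to the current user; keep the top k.
    active = column(x)
    user_sims = [(cosine(column(u), active), u) for u in range(n) if u != x]
    user_sims.sort(key=lambda pair: (-pair[0], pair[1]))
    top_users = [u for _sim, u in user_sims[:k]]

    # A'[m x k]: usage data restricted to the selected users.
    reduced = [[usage[s][u] for u in top_users] for s in range(m)]

    # Rank the other services by similarity to Sy on the reduced matrix.
    service_sims = [(cosine(reduced[s], reduced[y]), s)
                    for s in range(m) if s != y]
    service_sims.sort(key=lambda pair: (-pair[0], pair[1]))
    return [s for _sim, s in service_sims[:l]]


# Hypothetical usage matrix: rows are services S0..S3, columns are users U0..U3.
usage = [
    [3, 2, 0, 1],
    [1, 1, 0, 1],
    [0, 4, 1, 0],
    [0, 0, 5, 2],
]
recommended = combined_recommendation(usage, x=0, y=0, k=2, l=2)
```

Restricting the service similarity to the columns of similar users is what ties the ranking to the current user; with k equal to the number of other users, the second stage reduces to a service-driven ranking that simply leaves out the current user's own column.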
to search for relevant services based on contexts but also allows process analysts to add constraints to the requested context to filter the search results. The query language in our approach consists of three parameters: the associated service, the connection constraint, and the radius. The associated service is the service whose neighborhood context is taken into account to match with other contexts.
Connection constraints are services or connection flows to be included or excluded to filter the query’s results. The radius is the number of connection layers taken into account for the neighborhood context matching. It specifies the extent of the considered contexts.
In our query execution, the query filters the services returned by the context matching. In general, the procedure of the query execution is as follows:
1) Capture the context of the associated service. This neighborhood context is identified by the associated service and the connection flows to its neighbors. The extent of the context is specified by the radius parameter.
2) Match the context of the associated service to others in other business processes.
3) Refine the matching result by selecting only services whose contexts satisfy the query’s constraint.
4) Sort the selected services based on the matching values and pick the Top-N services, where N is flexible and can be tuned by the process analyst for the query’s response.
5. Conclusions
Recommendation Systems are information technologies that E-commerce merchants have adopted so that online shoppers can receive suggestions on items that might be interesting or complementary to their purchased items. These systems provide valuable assistance to the user’s purchasing decisions and deliver a quality push service.
In this paper, we set out to design a conceptual service model with a newly proposed algorithm and user adaptation for a ‘Dynamic Recommendation Service’ in a big data environment, and we proposed an algorithm for a recommendation service on a big data platform. To achieve a long-term, real-time model that evolves together with the users’ interests, it is necessary to apply feedback techniques that provide recommendation information on this evolution. However, given the scope of our research, we omit the relevance of the users’ behavior, which was deemed successful by the shown feedback. Nevertheless, our research suggested a new model named Dynamic Recommendation Service, which dynamically recommends the services relevant to the users’ interests. In addition, we are open to serving the recommendation service data from the big data system in order to resolve any specific <Query Model>.
Query ::= ServiceID, ‘:’ ,[Constraint], ‘:’ ,Radius;
ServiceID ::= Character, {Character|Digit};
Constraint ::= (‘+’|‘-’) Term | Constraint, ‘|’ ,Term;
Term ::= Item | Term,‘+’,Item | Term, ‘-’ ,Item;
Item ::= ServiceID| ConFlow | ‘(’, Constraint, ‘)’;
ConFlow ::= ‘<’, [ServiceID], ‘[’, FlowString, ‘]’, [ServiceID], ‘>’;
FlowString ::= ConElement, {ConElement};
ConElement ::= ‘sequence’ | ‘AND-split’ | ‘AND-join’
| ‘OR-split’ | ‘OR-join’ | ‘XOR-split’ | ‘XOR-join’;
Radius ::= DigitNotZero, {Digit};
<Query Model>
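As a rough illustration of the top-level rule of the grammar above (Query ::= ServiceID ':' [Constraint] ':' Radius), the sketch below splits a query string into its three parts and validates the ServiceID and Radius. It assumes that Character means an ASCII letter, ignores surrounding whitespace, and leaves the Constraint as an unparsed string; the function name `parse_query` and the example queries are hypothetical.

```python
import re

# ServiceID ::= Character, {Character | Digit}   (Character assumed to be a letter)
SERVICE_ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")
# Radius ::= DigitNotZero, {Digit}
RADIUS = re.compile(r"[1-9][0-9]*")


def parse_query(text):
    """Split 'ServiceID:[Constraint]:Radius' into a dictionary of its parts."""
    parts = [part.strip() for part in text.split(":")]
    if len(parts) != 3:
        raise ValueError("expected ServiceID:[Constraint]:Radius")
    service_id, constraint, radius = parts
    if not SERVICE_ID.fullmatch(service_id):
        raise ValueError("invalid ServiceID: " + repr(service_id))
    if not RADIUS.fullmatch(radius):
        raise ValueError("invalid Radius: " + repr(radius))
    return {
        "service": service_id,
        "constraint": constraint or None,  # the constraint is optional
        "radius": int(radius),
    }


# Hypothetical queries: require s2 within radius 2; no constraint, radius 1.
with_constraint = parse_query("s1:+s2:2")
without_constraint = parse_query("s1::1")
```

Splitting on ':' is sufficient here because none of the Constraint terminals in the grammar contains a colon; a full implementation would additionally parse the Constraint, Term, Item, and ConFlow rules.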