Methodology for Issue-related R&D Keywords Packaging Using Text Mining

텍스트 마이닝 기반의 이슈 관련 R&D 키워드 패키징 방법론

Methodology for Issue-related R&D Keywords Packaging Using Text Mining

빅데이터 기술에 대한 관심이 급증함에 따라

소셜 미디어를 통해 유통되는 방대한 양의 비정형 데이터를 분석하고자 하는 시도 가 활발히 이루어지고 있다

이에 따라서 텍스트 형태의 비정형 데이터 분석을 통해 의미 있는 정보를 찾고자 하는 시도가 비즈니스 영역뿐 아니라

문화 등 다양한 영역에서 이루어지고 있다

특히 최근에는 여러 현안 및 이슈들을 발굴하여 이를 의사결 정에 활용하고자 하는 시도가 활발히 이루어지고 있다

이처럼 빅데이터 분석을 통해 국가현안이나 이슈를 발굴하고자 하는 시도가 꾸준히 이루어져왔음에도 불구하고

국가현안 및 이슈로부터 이와 관련된

문서를 효율적으로 제공하는 방안은 마련되지 않고 있다

이는 사용자들이 인식하는 현안 키워드와 실제 사용되는

키워드 사이의 이질성이 존재하기 때문이다

키워드간의 이질성을 극복하기 위한 중간 장치가 필요하며

이 중간 장치를 통해 각 현안 키워드와

키워드간에 적절한 대응이 이루어져야 한다

이를 위해 본 연구에서는

현안 키워드 추출을 위한 하이브리드 방법론

관점에서의 연관 현안 네트워크 구축 방법론의 총 세 가지 방법론을 제안한다

제안하는 방법론은 텍스트 마이닝

그리고 연관 규칙 마이닝 등의 데이터 분석 기법들을 활용하여 수행하였으며

에 의한 키워드 보강률 은

키워드간 다수의 연관 규칙이 나타났다

의 경우는 현재 진행 중에 있으며

향후 가시적 성과를 낼 수 있을 것으로 예상된다

1Graduate School of Business IT, Kookmin University, Seoul, 136-702, Korea.

2 Associate Professor, School of MIS, Kookmin University, Seoul, 136-702, Korea.

[Received 7 April 2014, Reviewed 11 April 2014, Accepted 3 June, 2014]

☆ A preliminary version of this paper was presented at ICONI 2013 and was selected as an outstanding paper.

ISSN 2287-1136 (Online) http://www.jksii.or.kr

(Figure 1) Research Overview

(Figure 2) Heterogeneity between Issue Keywords and R&D Keywords

(Figure 3) Two Mode Network Comprising Issue Keywords and R&D Keywords

(Figure 4) Discovering Related Issues Using Structural Equivalence

(Figure 5) Example of Network Merging

(Figure 6) Example of Matrix (Network Merging)

(Figure 7) Keyword Enhancement Rate of the Proposed Integration Methodology

(Figure 8) A Part of Relevant Keyword Network using 1,000 Association Rules (Support Value of

2013 B.A. in Business IT, Kookmin University, Korea.

2013 ~ Present M.S. in Business IT, Kookmin University, Seoul, Korea.

Research Interests: Text Mining, Data Mining, Opinion Mining E-mail : [email protected]

2011 B.S. (Hons) in Computer Science, Universiti Sains Malaysia, Pulau Pinang, Malaysia.

2012 ~ Present M.S. in Business IT, Kookmin University, Seoul, Korea.

Research Interests: Text Mining, Data Mining, Opinion Mining E-mail : [email protected]

1998 B.S. in Computer Engineering, Seoul National University, Korea.

2000 M.S. in Management Engineering, KAIST, Korea.

2007 Ph.D. in Management Engineering, KAIST, Korea.

2007 ~ Present Professor, School of MIS, Kookmin University, Korea.

Research Interests: Text Mining, Data Mining, Data Modeling E-mail : [email protected]

Methodology for Issue-related R&D Keywords Packaging Using Text Mining

Journal of Internet Computing and Services(JICS) 2015. Apr.: 16(2): 57-66 57

☆

Methodology for Issue-related R&D Keywords Packaging Using Text Mining

현 윤 진

윌 리 엄

김 남 규

Yoonjin Hyun William Wong Xiu Shun Namgyu Kim

요 약

,

.

,

,

,

.

.

,

R&D

.

R&D

.

R&D

,

R&D

.

(1)

, (2)

R&D

,

(3) R&D

.

,

,

,

, (1)

42.8%

, (2)

,

R&D

. (3)

,

.

:

,

,

,

,

ABSTRACT

keyword : Association Rules Mining; Keyword Matching; Social Network Analysis; Text Mining; Topic Analysis

1. INTRODUCTION

In traditional approaches, the selection of national pending issues is made by a few policy makers in a top-down manner.

neglect some associations even though they have many R&D keywords in common. Thus, the third goal of this study is to establish a methodology for constructing an issue network based on common R&D keywords.

2. Related Works

2.1 Text Mining and Topic Analysis

Text is the most widely used means for exchanging information and expressing intentions in the real world [4].

Thus, several researchers have attempted to discover meaningful knowledge through text analysis. Topic analysis, in particular, is one of the most representative methods for discovering core keywords from a large amount of documents.

The key concepts of topic analysis are vector space model [1, 5, 6] and TF-IDF. Once the keywords are refined based on specific conditions, the co-occurrence patterns for each keyword are calculated based on TF-IDF weights. There is also a study on improving the accuracy of text mining [7, 8].

2.2 Association Analysis and Social Network Analysis

Social network analysis is a quantitative analysis technique that identifies the characteristics of the connection structure and connection state of an object in a group through visual representation [17, 18, 19, 20, 21].

2.3 Keywords Similarity Analysis

The F-score is typically used for measuring the similarity

between documents or terms. It is obtained by calculating the

harmonic mean of the recall and precision. In document

searching applications, recall can defined as the proportion of

documents that have been detected to all the relevant

documents. On the contrary, precision means the proportion of appropriate documents to all retrieved documents. In this manner, F-Score is widely used for performance evaluation of information retrieval and also for similarity analysis among multiple documents. [22].

2.4 National R&D Information Service

3. Methodology for Packaging R&D Information on National Pending Issues

3.1 Research Overview

Our study has three specific components (see Figure 1).

To make it clear, we used issue as national pending issue for the following explanation. The first component generates a list of issue keywords by integrating standard policy documents and additional issue keywords obtained through data analysis.

The second component maps the R&D keywords to relevant issue keywords. The final component constructs an issue network based on the common R&D keywords. In Figure 1, the cylindrical shapes indicate the source data, whereas the rectangular shapes represent the analytical process. The boxes

outlined with dotted lines represent the outputs.

Next, we provide detailed explanations of each part of the process.

3.2 Hybrid Approach for Extracting and Integrating National Pending Issue Keywords

However, these methods cannot distinguish between keywords pertaining to issue keywords and general gossip keywords.

Therefore, in this study, we propose a hybrid approach for selecting and integrating issue keywords in order to overcome the limitations of the top-down and data-driven approaches.

The proposed approach is designed such that it can be applied to one of the core modules of the methodology for packaging R&D information services regarding issue.

To integrate IH and ID, we developed the WF-Score as a novel similarity measure by modifying the traditional F-Score.

In the expression below, #key (A) indicates the number of keywords in a set A, while C is the number of common keywords between IH and ID. For a set of common keywords between IH and ID, mi is the topic weight of the i-th keyword in ID.

^☆