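Building an Infrastructure for Linked Data and Applying It to Entity Identification

Pyung Kim, In-Su Kang, Seungwoo Lee, Hanmin Jung, Mikyoung Lee, Dongmin Seo, Won-Kyung Sung

Korea Institute of Science and Technology Information (KISTI)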

Contents

1. Introduction
2. Semantic Web
2.1. Ontology
2.2. OWL Lite, DL, Full
2.3. OWL 2
3. Linked Data
3.1. Linked Data Set
3.2. Linked Data Publishing Principles
3.3. SPARQL Endpoint Service
3.4. Open Challenges for Linked Data
4. Entity Identification
4.1. Related Work

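From the standpoint of service integration, adoption of the Semantic Web has been slow. Tim Berners-Lee, who had long stressed that the Semantic Web is meant to be a web of data and is realized in its true sense only when data is linked and used collaboratively, spoke about the importance of Linked Data at the TED conference in 2009 and emphasized that a web of data requires linking and collaboration through open data. The Semantic Web can be realized only when more raw data, rather than processed data, is opened and linked, and the Linked Data Project was started to share and link raw data. By standardizing the rules for publishing and linking data and the way services provide access to data, the Linked Data Project focuses on linking and using data, and public institutions and companies are actively taking part in it.

This study covers ontology representation languages, the expressiveness of OWL and

identified by meaning and assigned a URI, as a way of building Semantic Web services on more accurate information. To that end, a data infrastructure is built that links the various semantic resources already available (Linked Data Sets) with in-house resources, and is applied to entity identification

technology uses a variety of techniques, such as URIs, inference, and validation, to link and use the diverse knowledge expressed through ontologies.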

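[Figure 1] Semantic Web technology stack (Semantic Web Layer Cake)

The left-hand diagram in Figure 1 is the original Semantic Web technology stack proposed by Tim Berners-Lee, and the right-hand diagram is the stack that has been in common use since 2007. The underlying technologies have developed without major change since the original stack was proposed. Compared with the early days, RDF (Resource Description Framework) is emphasized more, SPARQL has been added as the query language,

RIF has been subdivided and made more concrete.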

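URI (Uniform Resource Identifier): names, locations, and similar expressions used to identify resources on the Web
XML (eXtensible Markup Language): a family of standards that includes XML itself, a language for expressing meta-information; Namespaces, the names used to distinguish identical elements or attributes in XML; and XML Schema, which defines how an XML document is marked up
RDF (Resource Description Framework): a language for describing information resources and their structure
RDFS: schema information for RDF, used to express lightweight ontologies
SPARQL: a query language for RDF
RIF (Rule Interchange Format): the layer for defining and exchanging rules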
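As a rough illustration of how the RDF and SPARQL layers relate, the sketch below stores a few RDF-style statements as Python tuples and answers a single triple pattern, which is the basic building block of a SPARQL query. The `ex:` resources and the `match` helper are hypothetical names chosen for this example; a real system would rely on an RDF store and a SPARQL engine instead.

```python
# Hypothetical example data: each statement is a (subject, predicate, object) triple.
TRIPLES = [
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Berlin", "ex:country", "ex:Germany"),
    ("ex:Seoul", "rdf:type", "ex:City"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None plays the role of a SPARQL variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Roughly equivalent to: SELECT ?s WHERE { ?s rdf:type ex:City }
cities = [s for s, _, _ in match(TRIPLES, p="rdf:type", o="ex:City")]
print(cities)
```

The point of the sketch is only that RDF reduces data to uniform three-part statements, so a query can be expressed as a pattern with some positions left open.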

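As shown in Figure 2, ontologies can be expressed in a variety of languages. Most of them, including KIF and OWL, are based on first-order logic, and languages based on Description Logic (DL), which guarantees decidability of reasoning, are the mainstream. F-Logic (Frame Logic) in particular is an ontology language based on Horn logic; with respect to DL, at the level of DLP (Description Logic Programming)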

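It can be seen that reasoning ability also grows as the level of representation rises: an index can only tell whether a piece of knowledge exists, whereas at the logic level, where logic is available, smart behavior becomes possible.

[Figure 3] Knowledge representation level and reasoning ability

Figure 4 shows the complexity of ontology languages in terms of Rule, LP, and DL.

[Figure 4] Complexity of ontology languages

2.2. OWL Lite, DL, Full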

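Suited to users who want to make full use of the flexible syntax
A class is a set of individuals and can at the same time be an individual itself
Software that can support this is difficult to develop in practice

[Figure 5] Relationship among OWL Lite, DL, and Full

OWL Lite has the lowest expressiveness and supports only a restricted subset of the DL constructs used for class expressions (DL, Description Logic, is the branch of logic that forms the formal basis of OWL). OWL Lite is at the level of structuring hierarchical concepts and defining specific relations. Therefore

it is suitable only for quickly and easily converting classification schemes and thesauri into OWL. OWL DL can use all of the DL constructs included in OWL, so it can define concepts formally with much greater precision. OWL DL includes all of the vocabulary defined in OWL, but the vocabulary must be used under predefined restrictions. OWL DL retains computational completeness and decidability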

[Figure 6] Structure of OWL 2

To address the problems of OWL DL and to accommodate requirements from industry, the OWL 2 standard was established in October 2009. The new features of OWL 2 are as follows.

syntactic sugar to make some common statements easier to say
new constructs that increase expressivity
extended support for datatypes
simple metamodeling capabilities
extended annotation capabilities
other innovations and minor features

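is suitable for building them, and has the advantage that it can be used together with an existing RDB. Every OWL 2 QL query can be translated into SQL, but its expressiveness is limited.

OWL 2 RL: OWL 2 RL is recommended when a relatively small loss of expressiveness and large-scale processing are both required. OWL 2 RL is designed to keep the loss of expressiveness as small as possible while still producing answers in polynomial time. In addition, OWL 2 RL supports ontology consistency checking and subsumption reasoning while also allowing rule-based reasoning to be applied

[Table 1] Complexity of OWL 2

As with OWL DL, using each OWL 2 profile appropriately requires a good understanding of class constructors, restrictions, and axioms. For more detail, see the OWL 2 Web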

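3. Linked Data

The Linked Data Project, which is being actively pursued under the lead of the W3C, is an effort to produce and distribute linkable data in order to share and link data, the core of the Semantic Web. It is a way of building a smarter Web on top of linked data.

must be able to follow along to find and explore related data.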

The LOD Project has the following characteristics.

Community project with W3C support
Take existing open data sets
Make them available on the Web in RDF
Interlink them with other data sets
Began early 2007

The reasons for publishing data through LOD are as follows.

Ease of discovery
Ease of consumption
Standards-based data sharing
Reduced redundancy
Added value

3.1. Linked Data Set

The data sets participating in the LOD project comply with Semantic Web standards in accordance with the LOD publishing rules of Tim Berners-Lee (TBL), and their providers carry out various activities to link the data and use it jointly. The main participants in the LOD project are listed below.

US: Massachusetts Institute of Technology, Thomson Reuters, Zitgist, Cyc Foundation, University of Pennsylvania
UK: University of Southampton, KMi, Open University, University of London, BBC, Talis, Garlik, OpenLink
DE: Freie Universität Berlin, Universität Hannover, Universität
IE: DERI
AT: Joanneum

shows how the data sets that make it up are linked to one another; they can be divided into three broad groups.

bio/life-sciences data
the central cloud is “all the rest”, connecting the other two, with DBpedia as its hub

provides about 2.9 million information entities and 479 million units of information converted into RDF.

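[Figure 10] DBpedia's Linked Data distribution and management architecture

There are three main ways to inspect DBpedia data.

method, downloading the data files, and viewing the data as HTML through a browser are all possible. For the last of these, open a web browser and enter “http://dbpedia.org/page/Korea” in the address bar to see the data about Korea at a glance.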

3.2. Linked Data Publishing Principles

Tim Berners-Lee has proposed a five-step procedure for publishing Linked Data.

1. Understand the Principles
2. Understand your Data
3. Choose URIs for Things in your Data
4. Setup Your Infrastructure
5. Link to other Data Sets

Each step of the procedure for publishing Linked Data is examined in detail below.

1) Understand the Principles
Use URIs as names for things
  anything, not just documents
  you are not your homepage
  information resources and non-information resources
Use HTTP URIs
  globally unique names, distributed ownership
  allows people to look up those names
Provide useful information in RDF

to enable discovery of related information

2) Understand your Data
What vocabularies can be used to describe these?
Principles: Reuse, don't reinvent, Mix liberally
Potential Ontologies/Vocabularies
  Geo, GoodRelations, FOAF, Review, SIOC, Whisky

3) Choose URIs for Things in your Data
Keep out of other peoples' namespaces
  http://www.imdb.com/title/tt0441773/
  http://www.imdb.com/title/tt0441773/thing
  http://myfilms.com/tt0441773
  http://myfilms.com/tt0441773/html
Abstract away from implementation details
  http://dbpedia.org/resource/Berlin
  http://www4.wiwiss.fu-berlin.de:2020/demos/dbpedia/cgibin/resources.php?id=Berlin
Hash or Slash
  http://mydomain.com/foaf.rdf#me
  http://mydomain.com/id/me

4) Setup Your Infrastructure

[Figure 11] Infrastructure for handling LOD service requests and responses

5) Link to other Data Sets
Linking Algorithms
  String Matching: e.g. Lexical Distance between labels
  Common Key Matching: e.g. ISBN, Musicbrainz IDs
  Property-based Matching: Do these two things have the same label, type and coordinates
Aim for reciprocal links

site owners to exchange links.
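The three linking approaches listed under step 5 can be sketched roughly as follows. This is a minimal illustration, not a description of any particular linking tool: the function names, the sample records, and the placeholder identifier `ISBN-EXAMPLE-1` are all assumptions made for the example, and `difflib.SequenceMatcher` merely stands in for whichever lexical distance a real linker would use.

```python
from difflib import SequenceMatcher


def label_similarity(a: str, b: str) -> float:
    """String matching: case-insensitive similarity of two labels, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_key(a: dict, b: dict, key: str) -> bool:
    """Common key matching: both records carry the same non-empty identifier."""
    return a.get(key) is not None and a.get(key) == b.get(key)


def same_properties(a: dict, b: dict, props=("label", "type", "coordinates")) -> bool:
    """Property-based matching: every listed property is present in both and equal."""
    return all(p in a and p in b and a[p] == b[p] for p in props)


# Hypothetical records describing the same things in two data sets.
book_a = {"label": "Linked Data", "isbn": "ISBN-EXAMPLE-1"}
book_b = {"label": "linked data", "isbn": "ISBN-EXAMPLE-1"}
city_a = {"label": "Berlin", "type": "City", "coordinates": (52.52, 13.40)}
city_b = {"label": "Berlin", "type": "City", "coordinates": (52.52, 13.40)}

print(label_similarity(book_a["label"], book_b["label"]))
print(same_key(book_a, book_b, "isbn"))
print(same_properties(city_a, city_b))
```

In practice the string-matching score would be compared against a threshold, whereas key-based and property-based matches are exact; which combination is appropriate depends on how reliable the shared identifiers are.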

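3.3. SPARQL Endpoint Service

The data sets of the organizations participating in the LOD project all follow Semantic Web standards in accordance with Tim Berners-Lee's four principles, and most of them provide a SPARQL endpoint. A SPARQL endpoint is a RESTful web service for query processing that supports the SPARQL protocol. By default, a query is sent to a remote SPARQL endpoint through an HTTP GET request.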

GET /sparql?query=PREFIX+rd ….. HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
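The request above carries the query text in a URL-encoded `query` parameter. The following sketch shows one way such a URL could be assembled with the Python standard library. The helper name `build_sparql_get_url` and the sample query are assumptions for illustration; the endpoint address is derived from the host and path in the request above, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query into the `query` parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical query, used only to show the encoding.
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"
url = build_sparql_get_url("http://dbpedia.org/sparql", query)
print(url)

# Decoding the parameter recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == query)
```

A client would then issue an HTTP GET for this URL and, typically, use the `Accept` header to ask for a particular result format.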

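Query results, in formats such as XML, JSON, RDF, N-Triples, Turtle, and HTML,

4.1. Related Work

As related work for building a new author disambiguation test set, this section describes the characteristics of the test sets used in previous author disambiguation studies.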

[Table 2] Effect of the refinement technique

Test set          # of Records   # of Persons   Ambiguity   Performance
psu-citeseer-9    3,028          447            49.7        93.6%
psu-citeseer-10   3,355          490            49.0        90.6%
psu-citeseer-14   8,442          480            34.3        84%
psu-pike-24       724            49             2.0         83.6%
umass-dblp-17     841            97             5.7         92.2%
umass-rexa-8      1,302          219            27.4        90.6%
umass-penn-7      1,588          93             13.3        35.5%
southampton-8     4,799          n/a            n/a         89.9%

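The table summarizes the test sets used in previous author disambiguation experiments in terms of number of records (# of Records), number of authors (# of Persons), author ambiguity (Ambiguity), and best performance (Performance). The number at the end of each test set name is the number of distinct author names in the test set, which equals the number of same-name author entity sets. For example, the test set psu-citeseer-14 contains 8,442 author name entity records for 14 distinct author names, and manual verification of the real-world author behind each name found 480 real-world authors in total. In this case the author ambiguity of 34.3 is the number of real-world authors, 480, divided by the number of distinct author names, 14; strictly speaking, the ambiguity should be computed for each of the 14 author names and then averaged. In the table, psu-citeseer-9/10/14 are test sets built from DBLP data by the CiteSeer research group at Pennsylvania State University in the United States, with different papers using different numbers of author names (9, 10, 14) in their experiments. The p shown in the table

are taken from the papers. psu-pike-24 is a test set built mainly from DBLP data by the PIKE research group at Pennsylvania State University. The three test sets built at the University of Massachusetts, umass-dblp-17, umass-rexa-8, and umass-penn-7, DBLP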
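The ambiguity figures discussed above can be reproduced with a short calculation: the number of real-world authors divided by the number of distinct author names. The sketch below uses three rows of the table; the function name `ambiguity` is an assumption for this example, and the ratio is the simple overall figure, not the per-name average that the text notes would be more exact.

```python
# (test set, number of distinct author names, number of real-world authors),
# taken from the table of existing test sets.
TEST_SETS = [
    ("psu-citeseer-9", 9, 447),
    ("psu-citeseer-14", 14, 480),
    ("psu-pike-24", 24, 49),
]


def ambiguity(num_persons: int, num_names: int) -> float:
    """Average number of real-world authors sharing one author name."""
    return num_persons / num_names


for name, num_names, num_persons in TEST_SETS:
    print(name, round(ambiguity(num_persons, num_names), 1))
```

For psu-citeseer-14 this gives 480 / 14, which rounds to the 34.3 reported in the table.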

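However, author ambiguity cannot be known until the mapping from author names to real-world authors has been confirmed, so it is difficult to take the two factors mentioned above into account in [Step 2]. To deal with this problem, based on the assumption that "the number of occurrences of an author name is likely to be proportional to its ambiguity", the 1,000 most frequent author names in DBLP-Bib were extracted and chosen as the set of author names to disambiguate (DBLP-NameSet). For example, if the bibliography set consists of the three papers below, extracting the top two author names by occurrence frequency selects J. Mitchell (3 occurrences) and P. Lincoln (2 occurrences) as the author name set.

[Example bibliography set]

J. Mitchell. 1983. File Servers. AC, 221-259.

P. Lincoln, J. Mitchell. 1991. Algorithmic Aspects of Type Inference with Subtypes. POPL, 293-304.

P. Lincoln, J. Mitchell, A. Scedrov. 1996. Linear logic proof games and optimization. BSL, 322-338.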
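The frequency-based selection of author names can be sketched with a counter over the author lists of the three example papers. This is a minimal illustration of the selection rule only; the names `PAPERS` and `top_names` are assumptions for the example, and the real procedure ran over DBLP-Bib with a cutoff of 1,000 names.

```python
from collections import Counter

# Author lists of the three papers in the example bibliography set.
PAPERS = [
    ["J. Mitchell"],
    ["P. Lincoln", "J. Mitchell"],
    ["P. Lincoln", "J. Mitchell", "A. Scedrov"],
]


def top_names(papers, n):
    """Return the n most frequent author names with their occurrence counts."""
    counts = Counter(name for authors in papers for name in authors)
    return counts.most_common(n)


print(top_names(PAPERS, 2))  # [('J. Mitchell', 3), ('P. Lincoln', 2)]
```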

[Step 3] collects from DBLP-Bib the occurrences of each of the 1,000 author names in the DBLP-NameSet determined in the previous step, and generates the author name entity set (DBLP-NameEntitySet). For example, if the [Example bibliography set] above is taken as the bibliography set and the author name set obtained from it is {J. Mitchell, P. Lincoln, A. Scedrov}, the corresponding author name entity set is as follows.

[Example author name entity set]

<J. Mitchell> J. Mitchell. 1983. File Servers. AC, 221-259.

<J. Mitchell> P. Lincoln, J. Mitchell. 1991. Algorithmic Aspects of Type Inference with Subtypes. POPL, 293-304.

<J. Mitchell> P. Lincoln, J. Mitchell, A. Scedrov. 1996. Linear logic proof games and optimization. BSL, 322-338.

<P. Lincoln> P. Lincoln, J. Mitchell. 1991. Algorithmic Aspects of Type Inference with Subtypes. POPL, 293-304.

<P. Lincoln> P. Lincoln, J. Mitchell, A. Scedrov. 1996. Linear logic proof games and optimization. BSL, 322-338.

<A. Scedrov> P. Lincoln, J. Mitchell, A. Scedrov. 1996. Linear logic proof games and optimization. BSL, 322-338.

In practice, an author name entity set is a collection of same-name author entity sets. The author name entity set in the example above is the collection of the same-name author entity sets for each of the three author names in the author name set.
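The grouping described above, one same-name author entity set per author name, can be sketched as follows. The function `build_name_entity_set` and the tuple layout of `PAPERS` are assumptions made for this illustration rather than the data format actually used to build DBLP-NameEntitySet.

```python
from collections import defaultdict

# (author list, citation string) for the three papers of the example bibliography set.
PAPERS = [
    (["J. Mitchell"],
     "J. Mitchell. 1983. File Servers. AC, 221-259."),
    (["P. Lincoln", "J. Mitchell"],
     "P. Lincoln, J. Mitchell. 1991. Algorithmic Aspects of Type Inference "
     "with Subtypes. POPL, 293-304."),
    (["P. Lincoln", "J. Mitchell", "A. Scedrov"],
     "P. Lincoln, J. Mitchell, A. Scedrov. 1996. Linear logic proof games "
     "and optimization. BSL, 322-338."),
]


def build_name_entity_set(papers, name_set):
    """Map each selected author name to the records in which that name occurs."""
    groups = defaultdict(list)
    for authors, citation in papers:
        for name in authors:
            if name in name_set:
                groups[name].append(citation)
    return dict(groups)


groups = build_name_entity_set(PAPERS, {"J. Mitchell", "P. Lincoln", "A. Scedrov"})
for name, entities in groups.items():
    print(name, len(entities))
```

Each value in `groups` corresponds to one same-name author entity set; disambiguation then has to split every such set by real-world author.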

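same-name author entity set is the collection of the two papers in which P. Lincoln appears. In other words, [Step 3] generates DBLP-NameEntitySet, which consists of the 1,000 same-name author entity sets corresponding to the 1,000 author names in DBLP-NameSet.

[Step 4] collects the information needed to map each author name entity in DBLP-NameEntitySet to a real-world author. When earlier test sets were built, the identity of the real-world author was confirmed by consulting the personal publication list page (PPLpage) on each author's homepage, or by sending a confirmation e-mail to the address given in the full text of the paper in which the author name entity appears. For the DBLP data used in this study, however, it is not easy to obtain the full texts needed to find e-mail addresses,

searching also inevitably becomes a time- and labor-intensive task.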

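To deal with this problem, we attempted to obtain web pages listing an author's publications automatically through Google web search.

First, the cue terms used in an existing homepage finding technique [11] (curriculum vitae, cv, resume, homepage, publication) and the title of the paper in which a particular author name entity appears, together with the author name and various Google search options (intitle:, allintitle:, site:, etc.)

The following shows an example of a particular author name entity for J. Mitchell and the best-performing Google query for finding the PPLpage of that entity on the web.

Author name entity: <J. Mitchell> P. Lincoln, J. Mitchell. 1991. Algorithmic Aspects of Type Inference with Subtypes. POPL, 293-304.

Google query: intitle:Mitchell Algorithmic Aspects of Type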

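Examples include cases where two different PPLpage URLs are located in parent and child directories on a web server, or where similar publication lists are maintained on the websites of a researcher's previous and current institutions because the researcher changed affiliation.

4.3. Characteristics of the Test Set

The author disambiguation test set built in this study is the first (01) test set (TestSet) built by the Korea Institute of Science and Technology Information (KISTI) for author disambiguation (Author Disambiguation) of English (English) author names, and is accordingly named KISTI-AD-E-01-TestSet

the region has been enlarged. For example, if in a same-name author entity group G={a, b, c} the author name entity sets {a, b} and {c} correspond to real-world authors P1 and P2 respectively, G is plotted as a single point at coordinates (3,2) in Figure 4. The solid line in the figure is a reference line (Y=X) connecting the points where the number of author name entities equals the number of authors. The closer a point is to the reference line, the closer the number of author clusters in that entity group is to its number of author name entities. In the extreme case, clusters lying on the reference line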

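4.4. Author Disambiguation Study

This section presents author disambiguation performance on the KISTI-AD-E-01-TestSet test set. Of the 881 entity groups in the test set, the 14 groups of size 1 were excluded, leaving 867 groups with 41,659 author name entities as the experimental target, so that only entity groups containing at least two author name entities are disambiguated. Single-linkage agglomerative hierarchical clustering was used as the clustering method, with the cosine function and a binary distance function as the entity distance functions it requires. As evaluation measures [13] for author disambiguation, recall, precision, F1, OE (over-clustering error), and UE (under-clustering error)

[Table 4] Author disambiguation performance

Feature   Rec.    Pre.    F1      OE      UE
F         .9797   .9352   .9569   .0273   .0082
C         .4205   .9645   .5856   .0062   .2330
T         .4989   .5010   .4999   .1998   .2015
P         .1285   .9456   .2263   .0025   .3748
Y         .0815   .3520   .1323   .0603   .3693
F+T       .9832   .9340   .9579   .0279   .0068
F+C       .9870   .9104   .9472   .0391   .0052
C+T       .7063   .8942   .7892   .0336   .1181
CTPY      .6135   .7818   .6875   .0915   .1868
F+CTPY    .9797   .9545   .9669   .0273   .0082

The table shows author disambiguation performance on the new test set. In the table, F, C, T, P, and Y are features of an author name entity record: full author name (Fullname), coauthor names (Coauthor names), paper title (Title), publication title (Publication title), and publication year (Year), respectively. The feature CTPY, coauthor names, paper title, publication title, and publication year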
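To make the clustering step more concrete, the sketch below implements single-linkage agglomerative clustering over bag-of-token features with a cosine distance. It is a simplified illustration under stated assumptions: the stopping `threshold`, the sample coauthor bags, and the function names are hypothetical, and the feature weighting and stopping criterion actually used in the experiments are not given in the text.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity of two bag-of-token feature vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def single_linkage(items, distance, threshold):
    """Repeatedly merge the two closest clusters until the closest pair exceeds threshold.

    The distance between clusters is the minimum distance between any two members.
    Returns clusters as lists of item indices.
    """
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(distance(items[i], items[j])
                        for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break
        _, x, y = best
        clusters[x] += clusters.pop(y)  # y > x, so index x stays valid
    return clusters


# Hypothetical coauthor-name bags for three records that share one author name.
records = [
    Counter(["P. Lincoln"]),
    Counter(["P. Lincoln", "A. Scedrov"]),
    Counter(["K. Smith"]),
]
clusters = single_linkage(records, cosine_distance, threshold=0.5)
print(clusters)
```

With these toy records the first two share a coauthor and end up in one cluster, while the third stays separate. Each resulting cluster is interpreted as one real-world author.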

[Figure 16] Example of building country authority data

[Figure 18] Example of building journal authority data

[References]

[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web", Scientific American Magazine, May 2001.

[2] Y. Song, J. Huang, I. Councill, J. Li and C. L. Giles, "Efficient topic-based unsupervised name disambiguation", In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2007(6).

[3] H. Han, H. Zha and C. L. Giles, "Name disambiguation in author citations using a k-way spectral clustering method", In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp.334-343, 2005(6).

[4] D. W. Lee, B. W. On, J. W. Kang and S. H. Park, "Effective and

Information Quality in Information Systems (IQIS), pp.69-76, 2005(6).

[5] P. Kanani and A. McCallum, "Efficient strategies for improving partitioning-based author coreference by incorporating Web pages as graph nodes", In Proceedings of the 6th International Workshop on Information Integration on the Web (IIWeb-07), 2007(7).

[6] D. M. McRae-Spencer and N. R. Shadbolt, "Also by the same author: AKTiveAuthor, a citation graph approach to name disambiguation", In Proceedings of ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp.53-54, 2006(6).

[7] D. A. Pereira, B. Ribeiro-Neto, N. Ziviani, A. H. F. Laender, M. A. Goncalves and A. A. Ferreira, "Using web information for author

[8] J. Huang, S. Ertekin and C. L. Giles, "Efficient name disambiguation for large scale databases", In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp.536-544, 2006(9).

[9] Y. F. Tan, M. Y. Kan and D. W. Lee, "Search engine driven author disambiguation", In Proceedings of ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp.314-315, 2006(6).

[10] M. Ley, "DBLP - some lessons learned", In Proceedings of International Conference on Very Large Data Bases (VLDB), 2009(8).

[11] V. Petricek, I. J. Cox, H. Han, I. G. Councill and C. L. Giles, "A comparison of on-line computer science citation databases", In

[12] O. Fatemieh, K. Manzoor, A. Jain and A. Ramani, "Home Page Finder", University of Illinois at Urbana-Champaign, 2005.

[13] I. S. Kang, "A comparison of hierarchical techniques for clustering Korean author names", Journal of Information Management, Vol. 40, No. 2, pp.95-115, 2009.

[14] I. S. Kang, S. H. Na, S. W. Lee, H. M. Jung, P. Kim, W. K. Sung, J. H. Lee, "On co-authorship for author disambiguation", Information

[Appendix]

LOD Data Sets

Addgene

Addgene catalog (tab delimited file) Sciences Research. Plasmids Bank. Allen Brain Atlas Science Commons extract from ABA Web site, on or shortly before 26 Feb 2007

Medical Research. Brain.

Airport Data SPARQL Airport Data

BAMS BAMS

Sciences Research. Brain Architecture Management System. BBC John Peel

sessions from DBtune.org

holding data released

during Hackday, 2007 Music Related

BBOP All OBO ontologies Open Biomedical

Ontologies BBOP

selected OBO ontologies, downloaded ~21 April 2007, augmented with inferred relations

selected OBO ontologies Billion Triples

Challenge various dumps


Dataset online end-user applications Bio2RDF various bio- and gene-

related datasets

bioinformatics integrated data by applying the

Semantic Web Bitzi collaborative file

describing service digital media encyclopedia Chef Moz 290344 restaurants - 104856 reviews - 59243 links to reviews - 2402 editors Professionally-built restaurant guides Data-gov Wiki Datasets containing RDF data converted from datasets published at http://data.gov (and other sources). The datasets are clustered by dc:subject, e.g. government budget, environmental statistics, housing and population statistics, medical cost, energy consumption, public library statistics, and labor statistics.

government datasets using semantic web

technologies

DBpedia

Data set containing extracted data from Wikipedia. About 2.6 million concepts described by 247 million triples, including


abstracts in 14 different languages

DMOZ RDF

Dump DMOZ Open Directory Project

DOAP Store

provides daily

generated dumps with all its DOAP project descriptions

directories are corrupted

DOAPspace

All 55,000+ DOAP profiles available as RDF/XML DOAP. This includes all DOAP created by doapspace and all DOAP spidered.

Entrez Gene Select fields from Entrez

Gene records database of genes

Entrez Gene Extract from <ftp://ftp.ncbi. nlm.nih.gov/g ene/DATA/gen e_info.gz>

Entrez Gene Extract from <ftp://ftp.ncbi.nlm.nih.g ov/gene/DATA/gene_inf o.gz> database of genes Freebase RDF Store Freebase Views of Freebase Topics following the principles of Linked Data. The dataset extractions contain aggregated data from: Wikipedia, MusicBrainz, IMDB, TVDB, Flickr and

Freebase's data with information about approximately 12 million


more... (tab delimited file)

FlyAtlas

FlyAtlas and Affy D2 probe-to-gene

the Drosophila adult gene expression atlas

Fly-TED

derived from data published by www.fly-ted.org and provides metadata on images depicting in situ hybridisation in D. melanogaster testes.

the Drosophila Testis

Gene Expression Database

Galen from

co-ode.org Galen from co-ode.org

Gathers interest from more and more communities, using the technology develop along a user-oriented path. The project officially finished in

August 2009 GeoSpecies Knowledge Base Information on Biological Orders, Families, Species as well as species occurrence records and related data

data about species

GO annotations from National Center for Biotechnology Information (NCBI) and GO annotations from National Center for Biotechnology

Information (NCBI) and European Bioinformatics Institute (EBI)


European Bioinformatics Institute (EBI)

GovTrack.us about the U.S. congress US Congress' activities HCLSIG LODD

group various dumps Linking Open Drug Data

Homologene

HomoloGene is a system for automated detection of homologs among the annotated

genes of several completely sequenced eukaryotic genomes. Homologene Jamendo from DBtune.org

data from the Jamendo

website music

Lexvo Linguistic Data

languages, words, characters, and other

human language-related entities

LinkedCT Linked Clinical Trials clinical

LinkedMDB Linked Data about

Movies movies

Magnatune from DBtune.org

data from the Magnatune

label music label

MeSH headings

List of all associations of MeSH headings to


Medline extracted from 2007 Medline baseline distribution MeSH titles Extracted from 2007 Medline baseline distribution Medicine MeSH pairs NLM 2007 MeSH

descriptor/qualifier pairs Medicine

MusicBrainz —

music metadatabase (seems to be just RAW

data, no rdf file)

Neurocommo ns text mining pilot

NeuroCommons text mining pilot - extracted

from Temis software applied to 7% of Medline

records (SC)

neuroscience-related PubMed

NLM 2007

MeSH NLM 2007 MeSH Medicine

OpenCyc

OpenCyc is the open source version of the Cyc

technology, the world's largest and most complete

general knowledge base and commonsense

reasoning engine.

Open Cycle Ontology

Open Directory

The Open Directory Project is the largest,

most comprehensive human-edited directory


of the Web. It is constructed and maintained by a vast, global community of volunteer editors Ordnance Survey administrative

geography data Geography

RAMEAU subject headings

SKOS representation of the RAMEAU book indexing vocabulary, maintained by the French National Library (BnF) contains 157,280 concepts, of which 96,825 correspond to common nouns, 51,646 to geographic names, 2976 to persons, 3419 to collective bodies, 2296 to titles and 123 to chronological subdivisions. common Quotations Book at least 42,000 famous quotations with author and subject

previous blog entries due to a hard drive crash

RKB Explorer Data

25 different domains, each with a separate data set. The data sets are focused on

scientific research; these include DBLP, Citeseer, CORDIS, NSF, EPSRC,


RAE2001, KISTI, UNLOCODE, Wordnet, voiD, OS.

Rpm Find data exposed?

generates Web pages describing a set of RPM

packages

Science Commons

A bridging ontology, from Science Commons, importing other ontologies used in the prototype, defining classes and relations used to represent gene records and their contents, as well as few items referred to by imported data sources, but not available in a

published ontology.

supporting research

Semantic Bible

(for New Testament Names) is a semantic knowledge base describing each named thing in the New Testament

computational linguistic technology, scripture

Semantic Web Dog Food


events.

SIMILE Data Collection

various data sets including CIA's World Factbook, Library of Congress' Thesaurus of Graphic Materials, National Cancer Institute's cancer thesaurus, Web Consortium's Technical Reports

building open source tool

STW

Thesaurus for Economics

Thesaurus for economics and business economics, including a classification of subject categories.

Maintained by the German National Library of Economics (ZBW) economics SwetoDblp ontology focused on bibliography data of publications from DBLP with additions that include affiliations, universities, and publishers

database management (including integration, mining, and visualization),

AI and knowledge representation, and bioinformatics. Telegraphis Linked Open Data Countries, Continents, Capitals, and Currencies collected from

GeoNames and

Wikipedia data


Texai Lexicon

machine readable dictionary derived from WordNet 2.1,

Wiktionary, the CMU Pronouncing Dictionary and the OpenCyc lexicon. Each lexicon word sense entry contains links back to the source dictionary entry, and also to

OpenCyc if the entry is

has been mapped to the Cyc ontology.

Texai is an knowledge-based, open source project to create artificial

intelligence

TCMGeneDIT Dataset

Traditional Chinese medicine, gene and disease association dataset and a linkset mapping TCM gene symbols to Extrez Gene IDs created by

Neurocommons

Chinese medicine, gene and disease

t4gm.info Thesaurus for Graphic Materials

Library of Congress' Thesaurus for Graphic

Materials.

UniProt a large life sciences

data set protein sequence

U.S. Census data


a whole, down through states, counties, sub-counties (roughly, cities and incorporated towns)

U.S. SEC data corporate ownership

stock ownership of corporations by officers,

board members, executives, and large shareholders (10%) from the SEC EDGAR database,

January 2004-June 2008 van Assem et al's ontology (used by output of MeSH to SKOS conversion) convert thesauri to SKOS

Wikipedia³ metadata extracted

from Wikipedia conversion of the English Wikipedia into RDF

Yale Senselab Yale Senselab data from NeuronDB,

ModelDB BrainPharm

YAGO The complete YAGO

ontology

Semantic Knowledge Base. Entities (persons,

organizations, cities…) YAGO

The subClassOf hierarchy of the YAGO ontology

Semantic Knowledge Base. Subclass

Airport Data SPARQL Airport Data

Billion Triples Challenge Dataset


applications Bio2RDF various bio- and gene-

related datasets

bioinformatics integrated data by applying the

Semantic Web Bitzi collaborative file

describing service digital media encyclopedia DOAP Store

provides daily

generated dumps with all its DOAP project descriptions

DOAPspace

All 55,000+ DOAP profiles available as RDF/XML DOAP. This includes all DOAP created by doapspace and all DOAP spidered.

Fly-TED

derived from data published by www.fly-ted.org and provides metadata on images depicting in situ hybridisation in D. melanogaster testes.

the Drosophila Testis

Gene Expression Database GO annotations from National Center for Biotechnology Information (NCBI) and GO annotations from National Center for Biotechnology

Information (NCBI) and European Bioinformatics Institute (EBI)


European Bioinformatics Institute (EBI) HCLSIG LODD

group various dumps Linking Open Drug Data

Homologene

HomoloGene is a system for automated detection of homologs among the annotated

genes of several completely sequenced eukaryotic genomes. Homologene MeSH headings

List of all associations of MeSH headings to papers indexed by Medline extracted from 2007 Medline baseline distribution Medicine MeSH titles Extracted from 2007 Medline baseline distribution Medicine MeSH pairs NLM 2007 MeSH

descriptor/qualifier pairs Medicine

MusicBrainz —

music metadatabase (seems to be just RAW


Open Directory

The Open Directory Project is the largest,

most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors human-edited directory of the Web Quotations Book at least 42,000 famous quotations with author and subject

previous blog entries due to a hard drive crash

RKB Explorer Data

25 different domains, each with a separate data set. The data sets are focused on

scientific research; these include DBLP, Citeseer, CORDIS, NSF, EPSRC, RAE2001, KISTI, UNLOCODE, Wordnet, voiD, OS. how to download??

Rpm Find data exposed?
