• 검색 결과가 없습니다.

Algorithm: ExAlg – under revision

N/A
N/A
Protected

Academic year: 2022

Share "Algorithm: ExAlg – under revision"

Copied!
26
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Extracting Structured Data from Web Pages

Summarized by Jaehyok Chong

(2)

Algorithm: ExAlg – under revision

Input: Set of HTML pages I, Output: Template, Schema, Values Function ECGM (Equivalence Class Generation Module

DiffFormat(I, T); // differentiate roles using HTML format while (T is not empty) {

FindEquiv(T, L, Q); // find equivalence classes from T & put LFEQs in L HandInv(L, Q); // handle invalid equivalence classes in L

DiffEq(Q, T); // differentiate roles using equivalence classes }

End ECGM

Function AM (Analysis Module) ConstTemp(L, S);

ExVal(I, S, V);

End AM

Main procedure ExAlg(I) ECGM(I);

AM(L);

End ExAlg

(3)

Example

<html>1<body>2

<b>3Book4Name5</b>6Databases

<b>7Reviews8</b>9

<ol>10

<li>11

<b>12Reviewer13Name14</b>15John

<b>16Rating17</b>18 7

<b>19Text20</b>21

</li>22

</ol>23

</body>24</html>25

<html>1<body>2

<b>3Book4Name5</b>6Data Mining

<b>7Reviews8</b>9

<ol>10

<li>11

<b>12Reviewer13Name14</b>15Jeff

<b>16Rating17</b>18 2

<b>19Text20</b>21

</li>22

<li>11

<b>12Reviewer13Name14</b>15Jane

<b>16Rating17</b>18 6

<b>19Text20</b>21

</li>22

</ol>23

</body>24</html>25

<html>1<body>2

<b>3Book4Name5</b>6Query Opt.

<b>7Reviews8</b>9

<ol>10

<li>11

<b>12Reviewer13Name14</b>15John

<b>16Rating17</b>18 8

<b>19Text20</b>21

</li>22

</ol>23

</body>24</html>25

<html>1<body>2

<b>3Book4Name5</b>6Transactions

<b>7Reviews8</b>9

<ol>10

</ol>23

</body>24</html>25

(4)

DiffFormat

 Extract tokens from HTML pages

 Differentiate roles of tokens using parse tree (structural context)

 <b>(</b>)

 <b>1(</b>1): <html><body><b>(</b>)

 <b>2(</b>2): <html><body><ol><li><b>(</b>)

 Name

 Name1: <html><body><b>Name

 Name2: <html><body><ol><li><b>Name

(5)

DiffFormat

</html>

<b>2(<html><body><ol><li><b>)

</body>

<li>

</ol>

<ol>

</li>

Reviews

Text Transactions

8 Opt.

6 Query

2 Mining

7 Data

Rating Databases

Jane

</b>1(<html><body></b>)

Jeff Name1(<html><body><b>Name)

John Book

</b>2(<html><body><ol><li></b>)

<b>1(<html><body><b>)

Name2(<html><body><ol><li><b>Name)

<body>

Reviewer

<html>

Tokens Tokens

(6)

FindEquiv – 1 st iteration

 Make occurrence vectors for each tokens generated by DiffFormat()

 Find Equivalence classes using occurrence vectors

 Find LFEQs using size & support thresholds

 SizeThre = 3, SupThres = 3

(7)

FindEquiv – 1 st iteration

1, 1, 1, 1 (E1)

</html>

3, 6, 3, 0 (E8)

<b>2(<html><body><ol><li><b>)

1, 1, 1, 1 (E1)

</body>

1, 2, 1, 0 (E7)

<li>

1, 1, 1, 1 (E1)

</ol>

1, 1, 1, 1 (E1)

<ol>

1, 2, 1, 0 (E7)

</li>

1, 1, 1, 1 (E1) Reviews

1, 2, 1, 0 (E7) Text

0, 0, 0, 1 (E6) Transactions

0, 0, 1, 0 (E5) 8

0, 0, 1, 0 (E5) Opt.

0, 1, 0, 0 (E4) 6

0, 0, 1, 0 (E5) Query

0, 1, 0, 0 (E4) 2

0, 1, 0, 0 (E4) Mining

1, 0, 0, 0 (E3) 7

0, 1, 0, 0 (E4) Data

1, 2, 1, 0 (E7) Rating

1, 0, 0, 0 (E3) Databases

0, 1, 0, 0 (E4) Jane

2, 2, 2, 2 (E2)

</b>1(<html><body></b>

0, 1, 0, 0 (E4) Jeff

1, 1, 1, 1 (E1) Name1(<html><body><b>Name)

1, 0, 1, 0 (E9) John

1, 1, 1, 1 (E1) Book

3, 6, 3, 0 (E8)

</b>2(<html><body><ol><li></b>) 2, 2, 2, 2 (E2)

<b>1(<html><body><b>)

1, 2, 1, 0 (E7) Name2(<html><body><ol><li><b>Name)

1, 1, 1, 1 (E1)

<body>

1, 2, 1, 0 (E7) Reviewer

1, 1, 1, 1 (E1)

<html>

Occurrence Tokens

Occurrence Tokens

(8)

FindEquiv – 1 st iteration

 Find LFEQs using SizeThre = 3, SupThres = 3

 E1 : <html> <body> Book Name1 Reviews <ol> </ol> </body>

</html>

 E7 : <li> Reviewer Name2 Rating Text </li>

2 3 3 1 SupThres

1 1 1 4 4 SupThres

3 E5

1 E9

6 E4

2 E8

2 E3

6 2 E7

E2

1 E6

9 E1

SizeThre Equivalence Classes

SizeThre Equivalence Classes

(9)

HandInv – 1 st iteration

 Handle invalid equivalence classes

 [Observation 4.3] A valid equivalence class is ordered and a pair of two valid equivalence classes is nested

 Find invalid LFEQs determined by FindEquiv()

 In this iteration, there is no invalid LFEQs.

(10)

Ordered Equivalence classes

An equivalence class is ordered, if its tokens can be

ordered <t

1

,…,t

m

>, such that, for every page p

i

(1 ≤ i

≤ n), and every pair of tokens t

j

, t

k

(1 ≤ j < k ≤ m), I. If t

j

occurs at least l times in p

i

, the l

th

occurrence of

t

j

in p

i

occurs before the l

th

occurrence of t

k

in p

i

, and II. If t

j

occurs at least (l+1) times in p

i

, the (l+1)

st

occurrence of t

j

in p

i

is after the l

th

occurrence of t

k

in

p

i

.

(11)

Nesting of Equivalence classes

A pair of equivalence classes, E

i

and E

j

is nested if,

I. The span of any occurrence of E

i

does not overlap with the span of any occurrence of E

j

, or

II. The span of all occurrences of E

j

is with in Pos(p) of

some occurrence of E

i

for some fixed p; or vice versa.

(12)

Remove either one arbitrarily

 ADEFBDEFC

 ABC

 DEF

(13)

Causes of invalid equivalence classes

I. Correlation of instantiation of two or more type constructors

II. Frequently Occurring tuples

III. Overlap in the tokens associated with different type constructors

 Type A (I + II) is considered to be processed by

HandInv and Type B (III) is just removed by

HandInv

(14)

Almost-Ordered

An equivalence class E is almost-ordered if the tokens in the equivalence class can be partitioned into

ordered equivalence classes that do not overlap

 E = {A, B, C, D}

ABABCDCD, ABCD, ABABABCDCDCD

 divided into two equivalence classes of {A, B} and {C, D} which are ordered and do not overlap with each other

 Try All possible partitions

 Binary Partition only?

 ABABCDCDEFEF, ABCDEF, ABABABCDCDCDEFEFEF

 {A,B} and {C,D,E,F}?

 {A,B} and {C,D} and {E,F}?

(15)

Nearly-Nested

Two equivalence classes, Ei and Ej are said to be nearly-nested if for every token t ∈ Ei (and vice versa) there exists a position p, such that each occurrence of t is either outside the span of any

occurrence of Ej, or within Pos(p) of some occurrence of Ej.

 E1 = {A, B, C, D}, E2 = {E, F, G, H}

ABCDA(EF)BC(GH)D, A(EF)BC(GH)D, EFGH

 divide E1 into three equivalence classes of {A}, {B, C}, {D} and E2 into two equivalence classes of {E, F} and {G, H}

 E1 = {A, B, C, D}, E2 = {E, F, G, H}

A(EF)BC(GH)D, A(EF)BC(GH)D, EFGH

 divide E1 into three equivalence classes of {A}, {B, C}, {D} and E2 into two equivalence classes of {E, F, G, H}

(16)

DiffEq – 1 st iteration

 Differentiate roles of tokens which are not in the set of LFEQs using LFEQs

 For the tokens satisfies SupThres, differentiate the roles

 <b>1, </b>1, <b>2, </b>2

 Pos(i, E) represents i-th position(Pos(i)) in E = <t1, Pos(1), t2, Pos(2), t3, Pos(3), …, Pos(n – 1), tn>

</b>21, </b>22, </b>23

<b>21, <b>22, <b>23

</b>11, </b>12

<b>11, <b>12 Dtokens

Pos(3, E7), Pos(4, E7), Pos(5, E7)

</b>2

Pos(1, E7), Pos(3, E7), Pos(4, E7)

<b>2

Pos(4, E1), Pos(5, E1)

</b>1

Pos(2, E1), Pos(4, E1)

<b>1

Differentiated roles Tokens

(17)

An Example for DiffEq

 ACDCBDF

 Assume ABF is in LFEQ, but C and D are not in LFEQ

 ACDCBDF

(18)

FindEquiv – 2 nd iteration

 For the newly added dtokens by DiffEq(), make

occurrence vectors and equivalence classes

(19)

FindEquiv – 2 nd iteration

1, 1, 1, 1 (E1)

</html>

1, 2, 1, 0 (E7)

<b>21

1, 1, 1, 1 (E1)

</body>

1, 2, 1, 0 (E7)

<li>

1, 1, 1, 1 (E1)

</ol>

1, 1, 1, 1 (E1)

<ol>

1, 2, 1, 0 (E7)

</li>

1, 1, 1, 1 (E1)

</b>12

1, 2, 1, 0 (E7)

</b>23 1, 1, 1, 1 (E1)

Reviews

1, 2, 1, 0 (E7) Text

1, 2, 1, 0 (E7)

<b>23

1, 2, 1, 0 (E7)

</b>22

1, 2, 1, 0 (E7) Rating

1, 1, 1, 1 (E1)

<b>12

1, 1, 1, 1 (E1)

</b>11

1, 1, 1, 1 (E1) Name1(<html><body><b>Name)

1, 2, 1, 0 (E7)

<b>22 1, 1, 1, 1 (E1)

Book

1, 2, 1, 0 (E7)

</b>21 1, 1, 1, 1 (E1)

<b>11

1, 2, 1, 0 (E7) Name2(<html><body><ol><li><b>Name)

1, 1, 1, 1 (E1)

<body>

1, 2, 1, 0 (E7) Reviewer

1, 1, 1, 1 (E1)

<html>

Occurrence Tokens

Occurrence Tokens

(20)

FindEquiv – 2 nd iteration

 Find LFEQs using SizeThre = 3, SupThres = 3

 E1 : <html> <body> <b>11 Book Name1 </b>11 <b>12 Reviews

</b>12 <ol> </ol> </body> </html>

 E7 : <li> <b>21 Reviewer Name2 </b>21 <b>22 Rating </b>22 <b>23 Text </b>23 </li>

3 SupThres 4

SupThres

12 E7

13 E1

SizeThre Equivalence Classes

SizeThre Equivalence Classes

(21)

HandInv – 2 nd iteration

 In this iteration, there is no invalid LFEQs.

(22)

DiffEq – 2 nd iteration

 This time, there is no tokens to be differentiated by the contexts

 Returns 2 LFEQs

 E1 : <html> <body> <b>11 Book Name1 </b>11 <b>12 Reviews

</b>12 <ol> </ol> </body> </html>

 E7 : <li> <b>21 Reviewer Name2 </b>21 <b>22 Rating </b>22

<b>23 Text </b>23 </li>

(23)

ConstTemp

 Using the set of LFEQs found by ECGM (Equivalence Class Generation Module), construct the template for the input Web Pages

 From the root equivalence class (with occurrence vector<1, 1,

…>), for each non-empty positions of each equivalence classes, check the PosStrings for the patterns like below table

TB Unknown

6

TB String of dtokens and empty equivalence classes

5

(TEj)?

e or Ej 4

TEj| TEk Ejor Ek

3

{TEj}S Ej S Ej S Ej S Ej S … Ej

2

{TEj}e Ej Ej Ej Ej Ej ...

1

Template Pattern

(24)

ConstTemp

 Find root equivalence class: E1

 Partition E1 with non-empty positions

 E1 : (<html><body><b>Book Name</b>)C1 PosString(E1, 6) (<b>Reviews</b><ol>)C2 PosString(E1, 10)

(</ol></body></html>)C3

 E7 : (<li><b>Reviewer

Name</b>)C1 PosString(E7, 1) (<b>Rating</b>)C2

PosString(E7, 2)

(<b>Text</b>)C3 PosString(E7, 3) (</li>)C4

 Check the PosStings to find the

patterns T

String of dtokens B

PosString(E7, 11)

TB String of dtokens

PosString(E7, 8)

TB String of dtokens

PosString(E7, 5)

{TE7}e E7 E7 …

PosString(E1, 10)

TB String of dtokens

PosString(E1, 6)

Template Pattern

PosString

(25)

ConstTemp

 Returns the template

 <(<html><body><b>Book Name</b>)C1 *(A)

(<b>Reviews</b><ol>)C2 {(<li><b>Reviewer Name</b>)C4

*(B) (<b>Rating</b>)C5 *(C) (<b>Text</b>)C6 *(D) (</li>)C7}e (</ol></body></html>)C3>

(26)

ExVal

 Using the template constructed by ConstTemp, extract values of each web pages

e e

e Transactions 4

8

John Query Opt.

3

6

Jane

2

Jeff Data Mining

2

7

John Databases

1

D C

B A

Page

참조

관련 문서

다행히도 이제 동지가 지나가고 낮의 길이가 길어지기 때문에 햇빛을 받을 수 있는 시간이 점차 더 길어진다.. 봄이 오면 날씨가

2 0 0 3 년도에 추진이 시작된 교육행정정보시스템인 나이스( NEI S,Na t i ona lEducat i on I nf or mat i onSyst e m) 에 포함된 I T환경 하에서의 학교재정운용과

치주질환에 있어서 주요한 역할을 한다고 잘 알려진 P.gi ngi val i s ,P.i nt er medi a, A.ac t i nomyc et emc omi t ans ,T.f or s yt hi a및 Tr eponemadent i c ol

I EC를 사용하여 정제된 항세균 물질은 t r i c i neSDS-PAGE에서 하나의 band임을 확인 한 후 N-말단 아미노산 서열분석을 하였으나 실패하였다.FPLC s ys t e m을

대퇴사두근과 햄스트링의 공동작용( c oac t i vat i on) 은 슬관절의 과도한 전방 밀림과 외 전( abduc t i on)동작을 막아줌으로써 슬관절을

: Development of Microstructure and Alteration of Mechanical Properties.. 4.6 The homogeneous nucleation rate as a function of undercooling ∆T. ∆T N is the critical

19.. 4.6 The homogeneous nucleation rate as a function of undercooling ∆T. ∆T N is the critical undercooling for homogeneous nucleation.. → critical value

‰ a digital circuit is combinational if its output values only depend on its (current)