Corpus Analysis Tools for Data Processing

Chapter 3 Data and Methodology

3.3 Corpus Analysis Tools for Data Processing

Several tools are needed for the statistics and analysis. In this study, we use AntConc to extract the keywords, the Stanford Parser to POS-tag the corpus, and the R language to carry out the analysis and statistics. This section briefly introduces these tools, together with SPSS and the USAS semantic tagging system.

3.3.1 AntConc

AntConc is a freeware concordance program for Windows, Macintosh OS X, and Linux.

The software includes seven tools:

1. Concordance Tool: shows search results in a 'KWIC' (KeyWord In Context) format.

2. Concordance Plot Tool: shows search results plotted in a 'barcode' format. This allows you to see the position where search results appear in target texts.

3. File View Tool: shows the text of individual files. This allows you to investigate in more detail the results generated in other tools of AntConc.

4. Clusters/N-Grams: the Clusters Tool shows clusters based on the search condition. In effect it summarizes the results generated in the Concordance Tool or Concordance Plot Tool. The N-Grams Tool, on the other hand, scans the entire corpus for 'N' (e.g. 1 word, 2 words, …) length clusters. This allows you to find common expressions in a corpus.

5. Collocates: shows the collocates of a search term. This allows you to investigate non-sequential patterns in language.

6. Word List: counts all the words in the corpus and presents them in an ordered list. This allows you to quickly find which words are the most frequent in a corpus.

7. Keyword List: shows which words are unusually frequent (or infrequent) in the corpus in comparison with the words in a reference corpus. This allows you to identify characteristic words in the corpus, for example, as part of a genre or ESP study.
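The comparison behind a keyword list can be illustrated with a short sketch. The Python code below scores a single word with the log-likelihood statistic, a measure commonly used for keyness. It is only an illustration under stated assumptions and is not AntConc's own code: the function name `log_likelihood_keyness` and the frequencies in the usage lines are hypothetical.

```python
import math


def log_likelihood_keyness(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood (G2) keyness of one word (hypothetical helper).

    freq_target / freq_ref: occurrences of the word in each corpus.
    size_target / size_ref: total number of tokens in each corpus.
    The result is positive when the word is relatively more frequent in
    the target corpus and negative when it is relatively less frequent.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the word were equally likely in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    g2 *= 2

    overused = freq_target / size_target >= freq_ref / size_ref
    return g2 if overused else -g2


# Hypothetical counts: 120 hits in a 100,000-token target corpus versus
# 300 hits in a 1,000,000-token reference corpus.
score = log_likelihood_keyness(120, 100_000, 300, 1_000_000)
print(round(score, 2))
```

An absolute score above 3.84 corresponds to p < 0.05 for a chi-square distribution with one degree of freedom, so the sign and size of the score together indicate whether a word is unusually frequent or infrequent.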

3.3.2 Stanford Parser

The Stanford Parser is one of the most important of a large number of Penn Treebank parsers. The development of statistical parsers of this kind was one of the biggest breakthroughs in natural language processing in the 1990s, and the Stanford dependencies scheme has gained popularity across various natural language processing tasks. As a statistical parser, it still makes some mistakes, but commonly works rather well. It attained the highest confidence-weighted score of all entrants in the 2005 competition by a significant margin.

The parser handles several languages: apart from English, it parses Chinese according to the Chinese Treebank, German based on the Negra corpus, and Arabic according to the Penn Arabic Treebank. It provides phrase structure trees as well as dependency output. In this study, the Stanford Parser was used to tag the two corpora.

3.3.3 R Language

R is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R is a GNU package. The source code for the R software environment is written primarily in C, Fortran, and R. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. While R has a command line interface, there are several graphical front-ends available.

R and its libraries implement a wide variety of statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and others. R is highly extensible through functions and user-submitted packages for specific functions or specific areas of study, and the R community is noted for its active contributions in terms of packages. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made. R has stronger object-oriented programming facilities than most statistical computing languages, and extending it is also eased by its lexical scoping rules. Another strength of R is static graphics: it can produce publication-quality graphs, including mathematical symbols. Dynamic and interactive graphics are available through additional packages.

In this study, 2 main domains and 5 functions were compiled into one R program, which is used to lemmatize the data, rearrange the POS tagger output, and calculate collostruction strengths.
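To make the notion of collostruction strength concrete, a minimal sketch is given below. It assumes that the strength is measured as the negative base-10 logarithm of a one-tailed Fisher exact p-value, which is a common choice in collostructional analysis; the measure actually implemented in the R program of this study is not specified here. The sketch is written in Python rather than R, and the function names and the frequencies in the usage lines are hypothetical.

```python
import math


def fisher_tail_counts(a, b, c, d):
    """Numerator and denominator of the one-tailed Fisher exact p-value
    P(X >= a) for the 2x2 table [[a, b], [c, d]], kept as exact integers.

    a: word in the construction       b: word in other constructions
    c: other words in the construction d: other words elsewhere
    """
    word_total = a + b
    construction_total = a + c
    n = a + b + c + d
    denominator = math.comb(n, construction_total)
    numerator = 0
    for k in range(a, min(word_total, construction_total) + 1):
        # Number of tables with k co-occurrences under fixed margins.
        numerator += (math.comb(word_total, k)
                      * math.comb(n - word_total, construction_total - k))
    return numerator, denominator


def collostruction_strength(a, b, c, d):
    """Assumed measure: -log10 of the one-tailed Fisher exact p-value.

    Only attraction (more co-occurrences than expected) is tested here.
    Logs are taken on the integers to avoid underflow for tiny p-values.
    """
    numerator, denominator = fisher_tail_counts(a, b, c, d)
    return math.log10(denominator) - math.log10(numerator)


# Hypothetical frequencies: the word occurs 40 times in the construction
# and 160 times elsewhere; the construction has 1,000 tokens in a corpus
# of 100,000 constructions.
print(round(collostruction_strength(40, 160, 960, 98_840), 2))
```

A larger value indicates stronger attraction between the word and the construction; a value of 1.3 corresponds roughly to p = 0.05.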

3.3.4 SPSS

The name SPSS, which stands for Statistical Package for the Social Sciences, reflects the software's original market, although the software is now popular in other fields as well. The base software includes the following statistics:

1. Descriptive statistics such as descriptive ratio statistics;

2. Bivariate statistics including means, t-test, ANOVA, correlation, nonparametric tests;

3. Prediction for numerical outcomes, i.e. linear regression;

4. Prediction for identifying groups, including factor analysis, cluster analysis;

5. Geospatial analysis, simulation;

6. R extension (GUI), Python.

It is also used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations, data miners, and others. SPSS 20.0 was used in the present study to calculate the relationships, to test the mathematical models, and to produce line graphs and scatter plots.
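The two quantities that underlie a correlation analysis and a simple linear model can be sketched in a few lines. The Python code below is only an illustration of the Pearson correlation coefficient and a least-squares line, not a reproduction of the SPSS procedures used in this study; the function name `pearson_and_line` and the data points are hypothetical.

```python
import math


def pearson_and_line(xs, ys):
    """Return (r, slope, intercept) for paired observations.

    r is the Pearson correlation coefficient; slope and intercept
    describe the least-squares line y = intercept + slope * x.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Hypothetical data points lying exactly on the line y = 1 + 2x.
r, slope, intercept = pearson_and_line([1, 2, 3, 4], [3, 5, 7, 9])
print(r, slope, intercept)
```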

3.3.5 Semantic Tagging System

To define sea-related words, the UCREL Semantic Analysis System (USAS), a framework for undertaking the automatic semantic analysis of text, is a helpful tool. The semantic tagset used by USAS was originally loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981). It has a multi-tier structure with 21 major discourse fields (listed in Table 4), subdivided, and with the possibility of further fine-grained subdivision in certain cases. The semantic tags are composed of:

1. an upper case letter indicating general discourse field.

2. a digit indicating a first subdivision of the field.

3. (optionally) a decimal point followed by a further digit to indicate a finer subdivision.

4. (optionally) one or more ‘pluses’ or ‘minuses’ to indicate a positive or negative position on a semantic scale.

5. (optionally) a slash followed by a second tag to indicate clear double membership of categories.

6. (optionally) a left square bracket followed by ‘i’ to indicate a semantic template (multi-word unit).

Other symbols utilised in USAS are as follows:

% = rarity marker (1)

@ = rarity marker (2)

f = female

m = male

c = potential antecedents of conceptual anaphors (neutral for number)

n = neuter

i = indicates a semantic idiom

Antonymity of conceptual classifications is indicated by +/- markers on tags.

Comparatives and superlatives receive double and triple +/- markers respectively. Certain words and collocational units show a clear double (and in some instances, triple) membership of categories. Such cases are dealt with using slash tags, that is, all tags are indicated and separated by a slash (e.g. anti-royal = E2-/S7.1+, accountant = I2.1/S2mf, bunker = G3/H1 K5.1/W3, Admiral = G3/M4/S2mf S7.1+/S2mf, dowry = S4/I1/A9-).
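The tag structure described above can be made explicit with a small parser. The sketch below is a simplified Python illustration, not part of USAS itself: the pattern `TAG_PATTERN` and the function `parse_usas_tag` are hypothetical names, the pattern assumes that the +/- markers precede any lower-case symbols, and it does not handle the '[i' template marker or several alternative tags separated by spaces.

```python
import re

# Simplified pattern for a single USAS tag (an assumption of this sketch).
TAG_PATTERN = re.compile(
    r"^(?P<field>[A-Z])"        # upper-case letter: general discourse field
    r"(?P<division>\d+)"        # first subdivision of the field
    r"(?P<finer>(?:\.\d+)*)"    # optional finer subdivisions, e.g. ".1"
    r"(?P<scale>\++|-+)?"       # optional run of pluses or minuses
    r"(?P<markers>[a-z%@]*)$"   # optional symbols such as m, f, n, c, %, @
)


def parse_usas_tag(tag):
    """Split one (possibly slash-joined) tag into its components."""
    parts = []
    for single in tag.split("/"):
        match = TAG_PATTERN.match(single)
        if match is None:
            raise ValueError(f"not a recognised tag: {single!r}")
        parts.append({
            "field": match["field"],
            "category": match["field"] + match["division"] + match["finer"],
            "scale": match["scale"] or "",
            "markers": match["markers"],
        })
    return parts


# Slash tags taken from the examples in the text.
for example in ("E2-/S7.1+", "I2.1/S2mf", "S4/I1/A9-"):
    print(example, parse_usas_tag(example))
```

Splitting on the slash first means that each member of a double or triple membership is analysed separately, so a word such as accountant (I2.1/S2mf) yields one entry for I2.1 and one for S2 with the markers m and f.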

The initial tagset was loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981), as this appeared to offer the most appropriate thesaurus-type classification of word senses for this kind of analysis. The developers of USAS have since considerably revised the tagset in the light of practical tagging problems met in the course of their research. The revised tagset is arranged in a hierarchy with 21 major discourse fields expanding into 232 category labels. Table 4 shows the 21 labels at the top level of the hierarchy.

Table 4. The 21 Labels at the Top Level of the Hierarchy in USAS

A    general and abstract terms
B    the body and the individual
C    arts and crafts
E    emotion
F    food and farming
G    government and public
H    architecture, housing and the home
I    money and commerce in industry
K    entertainment, sports and games
L    life and living things
M    movement, location, travel and transport
N    numbers and measurement
O    substances, materials, objects and equipment
P    education
Q    language and communication
S    social actions, states and processes
T    time
W    world and environment
X    psychological actions, states and processes
Y    science and technology
Z    names and grammar
