
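Attribution-NonCommercial-NoDerivs 2.0 Korea

Users are free, provided they follow the conditions below, to:

copy, distribute, transmit, display, perform and broadcast this work.

Under the following conditions:

Attribution. You must attribute the original author.

Noncommercial. You may not use this work for commercial purposes.

No Derivative Works. You may not adapt, transform or modify this work.

For any reuse or distribution, you must make clear the license terms applied to this work.

These conditions do not apply if you obtain separate permission from the copyright holder.

Your rights as a user under copyright law are not affected by the above. This is an easy-to-understand summary of the Legal Code.

Disclaimer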


Thesis for the Degree of Doctor of Philosophy

A Collostructional Analysis on the Sea-related Words in Maritime English

by

Yang Yu

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in the Department of

English Language and Literature

Korea Maritime and Ocean University August 2018

[UCI]I804:21028-200000105244


A Collostructional Analysis on the Sea-related Words in Maritime English

Yang Yu

Department of English Language and Literature Graduate School of Korea Maritime and Ocean University

Abstract

Maritime English belongs to the domain of English for specific purposes (ESP); it is the lingua franca for people engaged in international maritime transportation, whose throughput accounts for more than 80% of the goods in world trade. Hundreds of thousands of seafarers from different countries, speaking different tongues, work in this industry and communicate in one language, English: among themselves, between labor and management, from ship to ship and between sea and shore.

Quite often the captain, other senior officers and crews of one ocean-going ship are from several countries and English is the only language spoken both as a working language and for everyday conversation.

Researchers have studied English words from the perspectives of word-learning strategies, word-teaching methods, and the misuse of words. However, most of these studies are based on the word level instead of the more up-to-date and varied collostruction level.

Based on corpus linguistics, the thesis aims to investigate the collexeme features of maritime English. Using data obtained with computer programs and analysis tools, this study is expected to answer the following five questions. First, what are the high-frequency sea-related words of a Maritime English Corpus (MEC)? Second, what are the statistical results when applying three different approaches, FYE, DELTA P and LOG, to collostructional analysis? Third, what are the relationships among the three different approaches to collostructional analysis? Fourth, is there any directionality of the DELTA P for collostructional analysis? Fifth, how do we explain the similarities and differences of near-synonyms in the MEC by using collostructional analysis?

To answer these research questions, I first set customary selection criteria for high-frequency sea-related words and, referring to BNC Baby, extracted 12 representative words from the compiled corpus, the MEC, which is a 1,446,650-word corpus covering safety at sea, shipping news, navigational and marine engineering technology, laws, rules and regulations, and documents on all the related areas of maritime transportation.

Keyword analysis is a good way to analyze representative words in a specialized corpus. After cutting off by frequency and sorting by keyness, 861 keywords are extracted for the MEC. The selected keywords are subdivided into three categories according to the UCREL semantic analysis system (USAS), namely: Means of Water Transport (M4) words, Geographical (W3) words and Overlapped Sea-related Semantic Domain (W3/M4) words. The chosen 12 words are Ship (M4), Vessel (M4), Port (M4), Harbour (M4), Sea (W3/M4), Global (W3), Worldwide (W3), Ocean (W3/M4), Coast (W3/M4), Shore (W3/M4), Marine (W3/M4) and Maritime (W3/M4). Three different association measures are adopted and directionality is considered to further analyze these words in collostructional analysis.

Since FYE was the first association measure used for collostructional analysis, I use it as the independent variable and DPW2C and LOG as the dependent variables, respectively. When collostructional analysis is conducted on the 12 representative words, the different measures return different values. The relationships between the FYE and DPW2C values and between the FYE and LOG values can be described with the Menzerath-Altmann model, which takes the following form:

$$y = a x^{b} e^{-cx}$$



where y is the (mean) size of the immediate constituents, x is the size of the construct, and a, b and c are parameters which seem to depend mainly on the level of the units under investigation (Köhler 2012). Here y is the DPW2C or LOG value, x is the FYE value, and a, b and c are the parameters to be fitted. The fit is excellent for the collostructional results of all 12 representative words.

In addition, nearly all association measures that have been used are bidirectional, or symmetric. However, Ellis (2007) and Ellis & Ferreira-Junior (2009) pointed out that associations are not necessarily reciprocal in strength. More technically, bidirectional (symmetric) association measures conflate two probabilities that are in fact very different: P(W|C) is not the same as P(C|W). To measure how different they are in the MEC, we define P(W|C) as Delta P word to construction (DPW2C) and P(C|W) as Delta P construction to word (DPC2W), and investigate their relationship and the distribution of collexemes.

Last but not least, the collexemes shared by two near-synonyms are examined to further understand the collostructional features of sea-related near-synonyms in the MEC. Identifying the detailed differences is vital to understanding the subtle differences in usage between two near-synonyms. For example, Ship and Vessel share 174 collexemes in the collostruction “A/N + N”, but their DPW2C ranks vary significantly: the shared collexeme Industry ranks 1 with Ship but 320 with Vessel, which indicates that Ship is a stronger choice than Vessel before Industry. The detailed ranking differences are listed in the appendix.

Collostructional analysis is a novel method for analyzing nominal structures, especially the “A/N + N” structure, which is a significant grammatical structure in maritime English compared with general English. Because the results take directionality and statistical association into account, the method proves to have advantages over traditional frequency-based collocation analysis. For example, the top 10 collostruction and collocation results of Ship are quite different: Web ranks 4, Safety ranks 5, Owner ranks 7 and Construction ranks 8 in the collostruction results, while Web ranks 5, Safety ranks 4, Owner ranks 8 and Construction ranks 9 in the collocation results. Operator and Design are in the top 10 collostruction results, while Port and Cargo are in the top 10 collocation results. The detailed ranking differences are listed in the appendix.

This study thus demonstrates how collostructional analysis can be used to investigate high-frequency sea-related words. Traditional methods show only the relationships between words and words, or between structures and structures. Collostructional analysis shows more than words and structures: it also reveals their inner relationships, which contributes to the study of maritime English.


Acknowledgment

After two years of studying maritime English and corpus linguistics and another half year of writing, this thesis was finally completed. Without the help of many people, it could never have been completed, and I would like to express my sincere gratitude to them here.

First and foremost, I owe my heartfelt thanks to my distinguished and cordial supervisor, Professor Se-Eun Jhang, who influenced me with his insightful ideas and meaningful inspirations, guided me with practical academic advice and feasible instructions, and enlightened me when I was confused during the writing process. His thought-provoking comments and patient encouragement were indispensable to the completion of this thesis. My selection of Stefan Gries’ collostructional analysis as the theoretical framework was deeply motivated by his profound knowledge. Without his dedicated assistance and insightful supervision, this thesis would have gone nowhere.

Then, my sincere appreciation also goes to the prestigious and distinguished Professor Sung-Jun Kim, Professor Doo-Shik Kim, Professor Jeong-Ryeol Kim, and Dr. Sung-Kuk Kim, whose splendid and varied suggestions have enabled me to broaden my research perspectives and to develop solid research capability and an upright academic attitude.

Furthermore, my first thanks go to the leaders and colleagues at Dalian Maritime University; without their trust and help, I could not have had the opportunity to study at Korea Maritime and Ocean University.

My second thanks go to my parents, who have been my mentors and guardians from the very beginning of primary school. Without their education and care, I could never have grown up in such a joyous and cozy environment, nor had the courage to confront the obstacles on my way to success. Last but not least, my special thanks go to my wife, who always encourages and supports me without any complaint. She is my motivation to go further. I dedicate all my success to her and to all my beloved people.


List of Tables

Table 1. Contingency Table Cross-tabulating Frequency Scores of L and C
Table 2. Composition of the MEC
Table 3. Composition of the BNC Baby
Table 4. The 21 Labels at the Top Level of the Hierarchy in USAS
Table 5. Semantic Tagged Maritime English Keywordlist
Table 6. All High-frequency Sea-related Words Sorted in Semantic Tag
Table 7. The Means of Water Transport Words
Table 8. The Geographical Words
Table 9. The Overlapped Sea-related Semantic Domain
Table 10. Top 1,000 Sampling Bigrams in MEC and BNC Baby
Table 11. The Collostructional Analysis of Ship in MEC
Table 12. The Collostructional Analysis of Vessel in MEC
Table 13. The Collostructional Analysis of Port in MEC
Table 14. The Collostructional Analysis of Harbour in MEC
Table 15. The Collostructional Analysis of Global in MEC
Table 16. The Collostructional Analysis of Worldwide in MEC
Table 17. The Collostructional Analysis of Sea in MEC
Table 18. The Collostructional Analysis of Ocean in MEC
Table 19. The Collostructional Analysis of Coast in MEC
Table 20. The Collostructional Analysis of Shore in MEC
Table 21. The Collostructional Analysis of Marine in MEC
Table 22. The Collostructional Analysis of Maritime in MEC
Table 23. The Registers of Most Significant Collexemes in MEC
Table 24. The Directionality Ranks of Ship
Table 25. The Directionality Ranks of Vessel
Table 26. The Directionality Ranks of Port
Table 27. The Directionality Ranks of Harbour
Table 28. The Directionality Ranks of Global
Table 29. The Directionality Ranks of Worldwide
Table 30. The Directionality Ranks of Sea
Table 31. The Directionality Ranks of Ocean
Table 32. The Directionality Ranks of Coast
Table 33. The Directionality Ranks of Shore
Table 34. The Directionality Ranks of Marine
Table 35. The Directionality Ranks of Maritime
Table 36. The Most Significant Collexemes of DPW2C and DPC2W
Table 37. The Shared Collexemes of Ship and Vessel
Table 38. The Shared Collexemes of Maritime and Marine
Table 39. The Shared Collexemes of Port and Harbour
Table 40. The Shared Collexemes of Sea and Ocean
Table 41. The Shared Collexemes of Coast and Shore
Table 42. The Shared Collexemes of Global and Worldwide
Table 43. The Top 10 Collostruction and Collocation Results of Ship
Table 44. The Top 10 Collostruction and Collocation Results of Vessel
Table 45. The Top 10 Collostruction and Collocation Results of Port
Table 46. The Top 10 Collostruction and Collocation Results of Harbour
Table 47. The Top 10 Collostruction and Collocation Results of Global
Table 48. The Top 10 Collostruction and Collocation Results of Worldwide
Table 49. The Top 10 Collostruction and Collocation Results of Sea
Table 50. The Top 10 Collostruction and Collocation Results of Ocean
Table 51. The Top 10 Collostruction and Collocation Results of Coast
Table 52. The Top 10 Collostruction and Collocation Results of Shore
Table 53. The Top 10 Collostruction and Collocation Results of Marine
Table 54. The Top 10 Collostruction and Collocation Results of Maritime


List of Figures

Figure 1. The Explanation of Collostruction, Collostruct and Collexeme
Figure 2. The Sample of POS Tagging MEC with Stanford Parser
Figure 3. The relationships among FYE, DPW2C and LOG of Ship
Figure 4. Model fits among FYE, DPW2C and LOG of Ship
Figure 5. Collexemes distribution among FYE, DPW2C and LOG of Ship
Figure 6. The relationships among FYE, DPW2C and LOG of Vessel
Figure 7. Model fits among FYE, DPW2C and LOG of Vessel
Figure 8. Collexemes distributions among FYE, DPW2C and LOG of Vessel
Figure 9. The relationships among FYE, DPW2C and LOG of Port
Figure 10. Model fits among FYE, DPW2C and LOG of Port
Figure 11. Collexemes distributions among FYE, DPW2C and LOG of Port
Figure 12. The relationships among FYE, DPW2C and LOG of Harbour
Figure 13. Model fits among FYE, DPW2C and LOG of Harbour
Figure 14. Collexemes distributions among FYE, DPW2C and LOG of Harbour
Figure 15. The relationships among FYE, DPW2C and LOG of Global
Figure 16. Model fits among FYE, DPW2C and LOG of Global
Figure 17. Collexemes distributions among FYE, DPW2C and LOG of Global
Figure 19. Model fits among FYE, DPW2C and LOG of Worldwide
Figure 20. The relationships among FYE, DPW2C and LOG of Sea
Figure 21. Model fits among FYE, DPW2C and LOG of Sea
Figure 22. Collexemes distributions among FYE, DPW2C and LOG of Sea
Figure 23. The relationships among FYE, DPW2C and LOG of Ocean
Figure 24. Model fits among FYE, DPW2C and LOG of Ocean
Figure 25. Collexemes distributions among FYE, DPW2C and LOG of Ocean
Figure 26. The relationships among FYE, DPW2C and LOG of Coast
Figure 27. Model fits among FYE, DPW2C and LOG of Coast
Figure 28. Collexemes distributions among FYE, DPW2C and LOG of Coast
Figure 29. The relationships among FYE, DPW2C and LOG of Shore
Figure 30. Model fits among FYE, DPW2C and LOG of Shore
Figure 31. Collexemes distributions among FYE, DPW2C and LOG of Shore
Figure 32. The relationships among FYE, DPW2C and LOG of Marine
Figure 33. Model fits among FYE, DPW2C and LOG of Marine
Figure 34. Collexemes distributions among FYE, DPW2C and LOG of Marine
Figure 35. The relationships among FYE, DPW2C and LOG of Maritime
Figure 36. Model fits among FYE, DPW2C and LOG of Maritime
Figure 37. Collexemes distributions among FYE, DPW2C and LOG of Maritime
Figure 38. The relationship of DPW2C and DPC2W of Ship
Figure 39. The Dispersion of Collexemes of Ship
Figure 40. The relationship of DPW2C and DPC2W of Vessel
Figure 41. The Dispersion of Collexemes of Vessel
Figure 42. The Dispersion of Collexemes of Port
Figure 43. The relationship of DPW2C and DPC2W of Harbour
Figure 44. The Dispersion of Collexemes of Harbour
Figure 45. The relationship of DPW2C and DPC2W of Global
Figure 46. The Dispersion of Collexemes of Global
Figure 47. The relationship of DPW2C and DPC2W of Worldwide
Figure 48. The Dispersion of Collexemes of Worldwide
Figure 49. The relationship of DPW2C and DPC2W of Sea
Figure 50. The Dispersion of Collexemes of Sea
Figure 51. The relationship of DPW2C and DPC2W of Ocean
Figure 52. The Dispersion of Collexemes of Ocean
Figure 53. The relationship of DPW2C and DPC2W of Coast
Figure 54. The Dispersion of Collexemes of Coast
Figure 55. The relationship of DPW2C and DPC2W of Shore
Figure 56. The Dispersion of Collexemes of Shore
Figure 57. The relationship of DPW2C and DPC2W of Marine
Figure 58. The Dispersion of Collexemes of Marine
Figure 59. The relationship of DPW2C and DPC2W of Maritime
Figure 60. The Dispersion of Collexemes of Maritime


Contents



Abstract
Acknowledgment
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Background
1.2 Significance of the Study
1.3 Research Questions
1.4 Layout of the Dissertation
Chapter 2 Literature Review
2.1 Previous Studies on Maritime English
2.1.1 Frequency-based Analysis
2.1.2 Keyword-based Analysis
2.2 Previous Studies on Collostructional Analysis
2.2.1 Introduction to Collostruction
2.2.2 Approaches to Collostructional Analysis
2.2.2.1 Fisher-Yates Exact Test
2.2.2.2 Loglikelihood
2.2.2.3 Delta P (ΔP)
2.2.3 Critical Issues of Directionality
Chapter 3 Data and Methodology
3.1 Maritime English Corpus as a Study Corpus
3.2 British National Corpus (BNC) Baby as a Reference Corpus
3.3 Corpus Analysis Tools for Data Processing
3.3.1 AntConc
3.3.2 Stanford Parser
3.3.3 R Language
3.3.4 SPSS
3.3.5 Semantic Tagging System
3.4 Procedures for Data Processing
3.4.1 Extracting the Important High-frequency Sea-related Words
3.4.2 POS Tagging the MEC
3.4.3 Semantic Annotation
3.4.3.1 The Means of Water Transport Words (M4)
3.4.3.2 The Geographical Words (W3)
3.4.3.3 Overlapped Semantic Domain Words (W3/M4)
3.4.4 A Collostruction of “A/N + N”
Chapter 4 Discussion and Results
4.1 Collostructional Analysis
4.1.1 M4 Words
4.1.2 W3 Words
4.1.3 W3/M4 Words
4.2 Directionality Analysis
4.2.1 M4 Words
4.2.2 W3 Words
4.2.3 W3/M4 Words
4.3 Shared Collexemes of Near-synonyms
4.3.1 Ship and Vessel
4.3.2 Maritime and Marine
4.3.3 Port and Harbour
4.3.4 Sea and Ocean
4.3.5 Coast and Shore
4.3.6 Global and Worldwide
4.4 Difference between Collostruction and Collocation Results
Chapter 5 Conclusion
5.1 Major Findings
5.2 Implication of the Study
5.3 Limitations and Suggestions
References
Appendix 1: The Collostructional Analysis Results
Appendix 2: The Registers of the Most Significant Collexemes
Appendix 3: The Directionality Ranks of 12 Important Sea-related Words
Appendix 4: The Shared Collexemes of Near-synonyms


Chapter 1 Introduction

This chapter briefly introduces the basic issues of the study, including the background of previous research, the research questions, the significance of the study and the layout of the dissertation.

1.1 Research Background

Maritime English belongs to the domain of English for specific purposes (hereafter referred to as ESP); it is the lingua franca for people engaged in international maritime transportation, whose throughput accounts for more than 80% of the goods in world trade. Hundreds of thousands of seafarers from different countries, speaking different tongues, work in this industry and communicate in one language, English: among themselves, between labor and management, from ship to ship and between sea and shore. Quite often the captain, other senior officers and crew of one ocean-going ship come from several countries, and English is the only language spoken both as a working language and for everyday conversation. Shipping accidents often occur, some being acts of God and some the result of human error. Both types of distress at sea have been thoroughly analyzed, and several accidents of the latter type have been determined to be linguistic in nature, caused by miscommunication in English. Because of the importance of English, the International Maritime Organization (IMO) under the UN has set an English threshold for international shipping circles and commissioned scholars to research maritime English and compile course books for it.

As teachers and researchers of maritime English, we are interested in the syntactic characteristics of maritime English, focusing on its collocations and phrase structures. As the most basic meaning-bearing unit of the English language, the lexicon is one of the most attractive elements for language researchers. The central role that the lexicon plays in second language acquisition and teaching has received increasing interest in recent years. Lewis (1993: 27), who coined the term lexical approach, suggests that the lexicon is the basis of language. In the process of language acquisition and application, no language skill can do without vocabulary (Chomsky, 1999). Vocabulary learning is the dominant task in language acquisition, and it has also become a major part of the second language classroom. For maritime English learners, mastering maritime words is the key to understanding maritime news, documents and laws and to probing into the maritime domain.

Researchers have studied English words from the perspectives of word-learning strategies, word-teaching methods, and the misuse of words. Nattinger and DeCarrico (1992: 1) claimed that "Lexical phrases are multi-word lexical phenomena that exist somewhere between the traditional poles of lexicon and syntax, conventionalized form-function composites that occur more frequently and have more idiomatically determined meaning than language that is put together each time." The term "lexical phrases" was adopted by Nattinger in many of his studies (1980, 1989, 1992).

Weinert (1995: 182) took formulaic expressions as "multi-word (how do you do?) or multi-form strings (rain-ed; can-'t), which are produced or recalled as a whole chunk, much like an individual lexical item, rather than being generated from individual lexical items/forms with linguistic rules".

Moon (1998: 79) defined phrasal lexemes as "the sorts of item that for reasons of semantics, lexico-grammar, or pragmatics are regarded as holistic units rather than compositional strings. Such items include pure idioms, proverbs, similes, institutionalized metaphors, formulae, sayings, and various other kinds of institutionalized collocation".

Wray (1999: 214) also gave a definition of formulaic sequence, which is intended to be as inclusive as possible. The definition reads: "A sequence, continuous or discontinuous, of words or other meaning elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar".

Biber, Conrad, and Cortes (2003: 900) termed the multi-word sequence lexical bundles: "Lexical bundles are recurrent expressions, regardless of their idiomaticity and regardless of their structural status. That is, lexical bundles are simply sequences of word forms that commonly go together in natural discourse". That study adopted a frequency-driven approach, and the lexical bundles were identified empirically as the combinations of words that in fact recurred most commonly in a given register. For four-word lexical bundles, they set a minimum cut-off of ten occurrences per million words for a sequence to be considered a recurrent lexical bundle. Moreover, these occurrences had to be spread across at least five different texts in a register. And a ... six-word bundles.

However, most of these studies are based on the word level instead of the more up-to-date and varied collostruction level.
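The frequency-driven identification of four-word lexical bundles described above (at least ten occurrences per million words, spread across at least five texts) can be illustrated with a minimal Python sketch. The function name `lexical_bundles`, the whitespace tokenization and the toy texts are assumptions made for illustration only; they are not taken from Biber, Conrad, and Cortes (2003) or from this dissertation.

```python
from collections import Counter, defaultdict


def lexical_bundles(texts, n=4, min_per_million=10.0, min_texts=5):
    """Return n-word sequences that pass a frequency and a dispersion threshold.

    texts: list of strings, one per text in the register (hypothetical input).
    Tokenization is a plain lowercase whitespace split, which is a simplification.
    """
    counts = Counter()             # raw frequency of each n-gram
    text_ids = defaultdict(set)    # indices of the texts each n-gram occurs in
    total_words = 0

    for i, text in enumerate(texts):
        tokens = text.lower().split()
        total_words += len(tokens)
        for j in range(len(tokens) - n + 1):
            gram = tuple(tokens[j:j + n])
            counts[gram] += 1
            text_ids[gram].add(i)

    bundles = {}
    for gram, freq in counts.items():
        rate = freq * 1_000_000 / total_words   # normalized to per million words
        if rate >= min_per_million and len(text_ids[gram]) >= min_texts:
            bundles[" ".join(gram)] = (freq, len(text_ids[gram]))
    return bundles


# Toy usage: five short texts share a sequence, a sixth text does not.
toy_texts = ["in the event of fire"] * 5 + ["abandon ship now please"]
result = lexical_bundles(toy_texts)
```

In a corpus this small almost every sequence passes the normalized frequency threshold, so the dispersion threshold does most of the filtering; with a corpus of realistic size both thresholds matter.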

Stefanowitsch and Gries (2003: 5) proposed a type of collocational analysis which is sensitive not only to various levels of linguistic structure, but also to the specific constructions found at these levels. They referred to this method as collostructional analysis. Collostructional analysis starts with a particular construction and investigates which lexemes are strongly attracted to or repelled by a particular slot in that construction. It can be applied to various types of collostruction, including adverbial phrases (of course), verbal phrases (Jack gave Mary a book), adjectival phrases (event waiting to happen) and nominal phrases (waiting people / apple trees). This is a new way to investigate traditional word sequences. In addition, as language research and computing power have developed together, increasing importance is attached to corpus-based study, which language researchers favor because combining language study with a relevant corpus and quantifying the results is more objective and persuasive. With the rapid development of computer science, corpus linguistics has become a rapidly growing discipline.
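To make the notion of attraction between a lexeme and a construction slot concrete, the following Python sketch computes two association scores from a 2×2 contingency table: a one-tailed Fisher-Yates exact p-value and the log-likelihood statistic G². It is a minimal illustration under stated assumptions, not the procedure used in this dissertation: the cell labels a, b, c and d, the function names and the example counts are all hypothetical.

```python
from math import comb, log


def fisher_exact_attraction(a, b, c, d):
    """One-tailed Fisher-Yates exact p-value for attraction.

    a: word inside the construction      b: word outside the construction
    c: other words inside it             d: other words outside it
    Returns P(X >= a) under the hypergeometric distribution with fixed margins.
    """
    word_total = a + b
    cxn_total = a + c
    n = a + b + c + d
    upper = min(word_total, cxn_total)
    numerator = sum(
        comb(word_total, k) * comb(n - word_total, cxn_total - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n, cxn_total)


def log_likelihood(a, b, c, d):
    """Log-likelihood statistic G2 = 2 * sum(O * ln(O / E)) over the four cells."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)


# Hypothetical counts for one word in one slot of a construction.
p_value = fisher_exact_attraction(30, 70, 170, 9730)
g2 = log_likelihood(30, 70, 170, 9730)
```

A smaller p-value or a larger G² indicates a stronger association between the word and the slot; exact integer arithmetic keeps the Fisher computation accurate, although it becomes slow for very large corpus counts.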

Therefore, this dissertation adopts the corpus-based research method and builds a Maritime English Corpus (hereinafter referred to as MEC) to analyze one important nominal collostruction, “A/N + N”. Because this pattern covers both “Adjunct + N”, e.g. Maritime English, and “Complement + N”, e.g. Port Facility, the label “A/N + N” represents the nominal collostruction better.

The MEC contains texts on safety at sea, shipping news, navigational and marine engineering technology, laws, rules and regulations, and documents on all the related areas of maritime transportation. Besides, as an important representative of general English, the British National Corpus Baby (hereinafter referred to as BNC Baby) is a computer-readable general corpus of texts used in the field of corpus linguistics. Therefore, BNC Baby is chosen as the reference corpus to represent general English.

1.2 Significance of the Study

First of all, unlike traditional lexical analysis, this study applies corpus-based analysis. As an interdisciplinary approach, corpus analysis draws on language analysis, research techniques, statistics and information technology. Hence, the findings of the paper can highlight the characteristics of interdisciplinary corpus-based research and provide a sample study for other branches of linguistics.

Maritime English covers all aspects of the work and life of seafarers and related maritime organizations, marine events and accidents, including disasters at sea, piracy and maritime pollution, as well as the writing of checklists, logs and reports. There is no doubt that studies on maritime English collostructions can help interested readers grasp the language features of maritime English. The results are also beneficial for the design of maritime English dictionaries and teaching materials, as well as for the promotion of maritime English teaching.

In addition, collostructional analysis plays an important role in the actual language learning with collexeme being a key concept in corpus linguistics. The study thus takes a view on the maritime English collostruction, aiming at investigating its collexemes, collostruction strength and directionality features, while distinguishing several pairs of near-synonyms in maritime English.

1.3 Research Questions

Based on corpus linguistics, the dissertation aims to investigate the collexeme features of maritime English. Using data obtained with computer programs and analysis tools, this study is expected to answer the following five questions:

1. What are the high-frequency sea-related words of the MEC?

2. What are the statistical results when applying three different approaches, FYE, DELTA P and LOG, to collostructional analysis?

3. What are the relationships among the three different approaches to collostructional analysis?

4. Is there any directionality of the DELTA P for collostructional analysis?

5. What are the similarities and differences of near-synonyms when applying collostructional analysis?
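The directionality raised in question 4 can be illustrated with a short Python sketch of Delta P, which is computed separately for each direction from the same 2×2 contingency table. This is a minimal sketch under stated assumptions: the cell labels, the function name `delta_p` and the counts are hypothetical, and the two return values are named after the conditional probability they are built on rather than after any label used in this dissertation.

```python
def delta_p(a, b, c, d):
    """Return Delta P in both directions for one word and one construction.

    a: word inside the construction      b: word outside the construction
    c: other words inside it             d: other words outside it
    """
    # Word given construction: P(word | construction) - P(word | other contexts)
    word_given_cxn = a / (a + c) - b / (b + d)
    # Construction given word: P(construction | word) - P(construction | other words)
    cxn_given_word = a / (a + b) - c / (c + d)
    return word_given_cxn, cxn_given_word


# Hypothetical counts: the word occurs 100 times, 80 of them in the construction,
# which itself occurs 1,000 times in a sample of 10,000 tokens.
word_given_cxn, cxn_given_word = delta_p(80, 20, 920, 8980)
```

With these counts the construction is strongly predicted by the word (about 0.71), while the word is only weakly predicted by the construction (about 0.08), which is the kind of asymmetry that a single bidirectional measure cannot show.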


1.4 Layout of the Dissertation

The dissertation is divided into five chapters. Chapter one is the Introduction. It first gives a general description of the research background of this dissertation, and then introduces the significance and research questions of the dissertation.

Chapter two is the Literature Review. This chapter firstly reviews the definition of keywords and how to define a high-frequency and sea-related word. Then terminologies including collostructional analysis, collexeme and collostruction strength are introduced. Furthermore, definitions and related studies of statistical tests including Fisher-Yates Exact test, Loglikelihood and Delta P are displayed. Critical issues of directionality are explored in the final part of this chapter.

The first part of Chapter three describes the corpus-based methodology adopted in this paper. Then the corpora utilized in the research are introduced, i.e. MEC and BNC Baby. The second part introduces the statistical measures and corpus analysis tools used for data collection. The last part includes the procedure of data processing such as POS-tagging and programming.

Chapter four is the core of this paper. The first part analyzes the different results from the three statistical tests in the MEC. The second part deals with the directionality of these significant collostructions in the MEC to find the relationship between word-to-construction and construction-to-word association. The third part summarizes the shared collexemes of near-synonyms to reveal their similarities and differences.

Chapter five, the conclusion, summarizes the significant findings on collostructions of maritime English. Implications and limitations of the study as well as suggestions for further studies are also discussed.


Chapter 2 Literature Review

This chapter introduces the definition of keywords, the classifications of high-frequency sea-related words in maritime English, the studies on collostructional analysis and critical issues of directionality.

2.1 Previous Studies on Maritime English

The shipping industry plays an important role in the global economy and transportation. English is the only language used by seafarers in the international shipping industry. Therefore, maritime English needs to be taught at a high level by English teachers to meet rapidly growing demand. Some previous studies related to maritime English are reviewed in this part.

Trenkner (2000) defines maritime English as follows: maritime English includes all the means of the English language which are used in the international maritime field and contribute to the safety of navigation and the facilitation of seaborne business.

Pritchard (1998) says that maritime English is one of the subsets of English as a working language for the international seafarers. Maritime English is one of the categories of ESP and it contains a wide range of filed. All English articles and works related to maritime can be included in maritime English. Considering that, many linguists and scholars are paying great attention to maritime English. From the teaching aspect, Luo (2008) explore ESP teaching methods by taking maritime English as an example. They use corpus-based research to make both quantitative and qualitative analyses and they propose that the MEC is also an important method for maritime English teaching. For example, the MEC can be used to study the features of maritime English grammar and collocation in maritime English discourse. Lv and Gu (2009) conducts a research on the conjuncts in maritime English discourse which mainly contains four aspects: lexical density, lexical frequency, positional distribution and semantic distribution. They find that conjuncts and semantic distribution in maritime English are be less than general English and conjuncts in maritime English are be less important than general English. Their findings tell the learners that different discourses have different features of the usage of conjuncts, as a result teachers and students have to pay attention to these differences.

The difference between maritime English and general English in the use of conjuncts is just one perspective; many scholars and linguists have explored the differences between maritime English and general English from other perspectives.

Jhang (2011) conducts a corpus-based lexical analysis of maritime English by compiling both a word list of high-frequency items in his maritime English corpus and a keyword list referenced against the LOB corpus using WordSmith Tools 5.0. He also uses a Perl program to extract noun compounds, to set up vocabulary lists, and to suggest how many words should be learned by beginning students (80% cumulative frequency rate) and by advanced students (90% cumulative frequency rate). He explores the distribution of parts of speech as content versus function words and points out significant differences between maritime English and general English. Great contributions are made in similar types of research (Jhang 2010, 2011, 2013, 2014, 2015; Lee 2016; Lv 2017).
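The idea of a cumulative frequency cut-off can be illustrated with a short sketch. It is not Jhang's Perl program: the function name `words_needed` and the toy token list are assumptions introduced here for illustration only.

```python
from collections import Counter


def words_needed(tokens, coverage):
    """Return how many of the most frequent word types are needed
    to reach the given share (between 0 and 1) of all running words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    running = 0
    # Walk down the frequency list and stop once the target share is reached.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= coverage:
            return rank
    return len(counts)


# Invented toy data: 10 running words, 4 word types.
tokens = "the ship the port the ship sea the port ship".split()
print(words_needed(tokens, 0.80))
```

With a real word list, the same loop would give the number of word types a learner needs for a chosen coverage level such as 80% or 90%.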

2.1.1 Frequency-based Analysis

High-frequency words are quite simple words which occur most frequently in written material. According to Hillerich (1927: 10), just three words, ‘I’, ‘and’ and ‘the’, account for ten percent of all words in printed English. They are words that have little meaning on their own, but they do contribute a great deal to the meaning of a sentence. Some of the high-frequency words can be sounded out using basic phonic rules, e.g. ‘it’ is an easy word to read, but many are not phonically regular and are therefore hard to read in the early stages. These words are sometimes called tricky words, sight words or camera words. In addition to being difficult to sound out, most of the high-frequency words have rather abstract meanings which are hard to explain.

The definition of a high-frequency word can never be absolute for all languages, in all contexts and for all users. However, common sense and observation suggest that the frequency of use of a given word is inversely related to its specificity of meaning and use.

For example, the one hundred most commonly used words in English, words such as be, is, and, so, but, now, do, have, go, then, the, a/an, with, for, it and me and their various equivalents, occur much more frequently than words with more specific meanings.

Yu (2012) discusses the translation method of high-frequency adjectives in maritime English. Kim (2013) examines the cruise brochures of World Wide Cruise by analyzing their contents through frequency studies. Su (2014) researches inter-textual vocabulary growth, tests the goodness of fit of existing mathematical models, and identifies the most suitable model for the vocabulary growth pattern of the maritime convention corpus based on the distribution of frequency. Qi (2016) discusses the general distribution and classification of adverbs and investigates the collocations under different colligations of four high-frequency adverbs in his maritime English corpus.

2.1.2 Keyword-based Analysis

A ‘keyword’ refers to a word ‘whose frequency is statistically significant when compared to the standards set by a reference corpus’ (Scott, 1997; Baker, 2004; Anthony, 2005; Scott and Tribble, 2006). It has been argued that keywords have great potential to be indicative of changes in writing styles, which can ultimately be linked back to social change (O’Keeffe et al., 2007; Scott, 2008; Baker, 2010). A review of the literature shows that while a large number of studies have used keywords as a way to characterize a genre, relatively few have applied keywords in diachronic analyses of language change within a particular text type. This leaves vast room for research.

Keyword analysis uncovers trends using a quantitative method.

Keyword analysis is a corpus-based study which explores the theme through keywords whose frequency is unusually high in comparison with a reference corpus of some kind (Scott, 1996, 1997). Therefore, keywords are defined in relation to frequency differences, and their quality of being ‘key’, known as keyness, is determined by statistical values. The existing literature shows that the keyword-based approach has been adopted in trend analysis, although it has received less attention than similar types of research using genre analysis (Tribble, 2000; Kemppanen, 2004; Scott and Tribble, 2006; Seale et al., 2006; Goh and Lee, 2008; Rayson, 2008; Baker, 2009; Mahlberg, 2009; McEnery, 2009; Jhang, 2011; Lin, 2015; Wilkinson, 2015).

Zhang (2000) studies keywords denoting legal documents in maritime treaty English focused on the etymology, general definition, legal definition, collocation and translation of these legal terms. Guo (2014) investigates the maritime news features in aspects of vocabulary size, word length, word frequency, lexical density as well as keyword studies.

Jhang (2017) analyses the trend of SOLAS conventions through keyword analysis and analyses the degree of diachronic changes in each safety standard, comparing the lexical distribution and density of each keyword list. Great contributions are made in similar types of research (Jhang 2011, 2014, 2016).

This brief review of previous studies indicates that these important issues have not been sufficiently addressed, and that collostructional analysis of ESP genres is still rare and merits further exploration.

2.2 Previous Studies on Collostructional Analysis

This part discusses the concept of collostructional analysis, a family of quantitative corpus-linguistic methods that allow researchers to express the strength of the relationship between words and the grammatical structures they occur in. Although the adoption of collostructional analysis is a comparatively recent development in Construction Grammar, it has already been applied to a fairly wide range of constructions in the context of research questions ranging from systematic description through language variation and change to language acquisition and processing. This part also addresses important methodological issues of collostructional analysis, such as the use of inferential statistics, the cognitive mechanisms assumed, and the choice of statistical tests.

2.2.1 Introduction to Collostruction

In the past, the lexicon and the grammar of a language were considered completely different parts of language, with the lexicon standing for specific lexical items and the grammar standing for abstract syntactic rules. Many collocations, such as lexical bundles and fixed expressions, were ignored by most syntactic theories (Gries, 2003).

Recently, however, corpus linguists have shifted their focus to a more comprehensive understanding of language, for example in Hunston and Francis’ Pattern Grammar (2000) and Lewis’ theory of Lexical Chunks (1993). Sinclair (1991) and Barlow and Kemmer (1994), among others, have argued that grammar and lexicon are not totally different, and that the long-ignored multi-word expressions serve as a vital link between them. In this respect, Pattern Grammar and Lexical-Chunk Theory share a view of both lexicon and grammar as consisting of linguistic signs. The most influential theories holding this view are Construction Grammar (Fillmore 1985, 1988; Lakoff 1987; Goldberg 1995, 1999; Kay and Fillmore 1999), Emergent Grammar (Hopper 1987; Bybee 1998), Cognitive Grammar (Langacker 1987, 1991), and some versions of LFG (Pinker 1989) and HPSG (Pollard and Sag 1994). The meaningful grammatical structures are variously referred to as constructions, signs, patterns, lexical/idiom chunks, and by a variety of other terms.


Stefanowitsch and Gries (2003) refer to lexemes that are attracted to a particular construction as collexemes of the construction; conversely, a construction associated with a particular lexeme is referred to as a collostruct, and the combination of a collexeme and a collostruct is referred to as a collostruction. They propose a type of collocational analysis which is sensitive not only to various levels of linguistic structure, but also to the specific constructions found at these levels, and they refer to this method as collostructional analysis. Collostructional analysis always starts with a particular construction and investigates which lexemes are strongly attracted to or repelled by a particular slot in the construction.

Collostructional analysis investigates the lexicogrammatical associations between constructions and lexical elements. It is based on the theoretical framework of Construction Grammar, which claims that grammatical constructions are pairings of forms and meanings.

Goldberg (1995) notes that constructions range in size from individual morphemes to large-scale grammatical structures, including clause-level argument-structure constructions.

Fillmore et al. (1988) investigated grammatically exotic and lexically specific constructions such as the let alone construction, and Kay and Fillmore (1999) investigated the What's X doing Y construction. The constructions investigated in collostructional analysis are typically of the syntactic type and open grammatically defined slots for lexemes to occur in. Examples studied by Stefanowitsch and Gries include the ditransitive construction, with a focus on the slot for the verb, and, illustrating the more specific type, the N waiting to happen construction, with a focus on the nominal slot.

Collostructional studies have been applied in numerous contexts: structural/syntactic priming, the study of morphosyntactic alternations, first language acquisition, diachronic constructional change, and so on. There are now also studies in second language acquisition (SLA) using these methods. For example, Gries and Wulff (2005) showed that advanced German learners exhibit verb-specific priming effects that are highly correlated with the distinctive collexeme strengths of verbs participating in the English ‘dative alternation.’ In a similar vein, Gries and Wulff (2009) illustrated that the to- vs. ing-complementation alternation exhibits similar effects: advanced German learners' sentence-completion priming was more influenced by the preference of the verb in the sentence fragment than by any other variable included in the experimental design. Ellis and Ferreira-Junior (2009) showed that the verbs learners learn first in several argument structure constructions are highly associated


compared learners' tense-aspect marking patterns in the British National Corpus and the Michigan Corpus of Academic Spoken English and shown how verb-specific constructional preferences of German and Dutch learners of English correspond to native speakers' preferences and how this approach allows learners to identify their behavioural outliers for subsequent analysis. These applications show that this method has a lot to offer to SLA research for language learning, representation, and processing.

Collocation and collostruction are often confused. According to Firth (1957: 181), "collocations of a given word are statements of the habitual or customary places of that word". This notion of collocation is essentially quantitative (cf. Krishnamurthy 2000) and is widely accepted by many linguists. Greenbaum (1974: 79) defines collocation as "a frequent co-occurrence of two lexical items in the language." The word "frequent" is contested, because it does not specify how frequent a co-occurrence must be to count as a collocation. Hoey (1991: 6) holds that collocation occurs if a lexical item appears with other items "with greater than random probability in its (textual) context." This is also a statistically vague description of collocation.

Three different kinds of collocations are shown below:

(1) Colligation: the relation between a word and grammatical categories which co-occur frequently with it.

e.g. depend on (not depend at)

(2) Semantic Preference: the relation between a word and semantically related words in a lexical field.

e.g. strong coffee (not powerful coffee)

(3) Semantic Prosody: the discourse function of the word to describe the speaker’s communicative purpose.

e.g. He was killed in the street in broad daylight.

(not He was met in the street in broad daylight.)

To clarify the difference, the figure below shows the components of a collostruction.

Figure 1. The Explanation of Collostruction, Collostruct and Collexeme

2.2.2 Approaches to Collostructional Analysis

In 2003, Anatol Stefanowitsch and Stefan Th. Gries introduced a set of pioneering methods subsumed under the term collostructional analysis (Stefanowitsch and Gries 2003; Gries and Stefanowitsch 2004). The major goal of these corpus-based methods is to develop improved tools for investigating interactions between lexemes and grammatical patterns.

More precisely, collostructional analysis gauges the associational strength between constructions and the lexical elements filling certain slots in these constructions (Gries and Stefanowitsch 2003), and unravels the semantic differences between apparently synonymous constructions by comparing the collostruction strength of manifestations and lexical variants in actual use as documented in corpora (Gries and Stefanowitsch 2004; Gries et al. 2005, 2010; Gries and Stefanowitsch 2010).

Collostructional analysis ultimately relies on frequency counts of tokens of different types of phenomena in large corpora. For successful applications of the method, four different scores for frequencies of occurrence of a target lexeme (L) and a target construction (C) must be retrieved from the corpus investigated (Stefanowitsch and Gries 2003):

- the frequency of L in C,

- the frequency of L in all other constructions,

- the frequency of C with lexemes other than L, and

- the frequency of all other constructions with lexemes other than L.

These scores are arranged in a so-called contingency or two-by-two table familiar from the χ²-test. The setup of such tables is conventionally as rendered in Table 1:

Table 1. Contingency Table Cross-tabulating Frequency Scores of L and C

| | +Target Lexeme | -Target Lexeme | |
|---|---|---|---|
| +Target Construction | 1. Frequency of L in C | 3. Frequency of C with lexemes other than L | Row total (= frequency of C in the corpus) |
| -Target Construction | 2. Frequency of L in all other constructions in the corpus | 4. Frequency of all other constructions with lexemes other than L | Row total |
| | Column total (= frequency of L in the corpus) | Column total | Grand total |
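In practice a corpus query usually yields the marginal totals rather than all four cells, so the remaining cells have to be derived by subtraction. The following is a minimal sketch of that step; the names `Table` and `build_table` and the example frequencies are hypothetical and are not taken from the study.

```python
from typing import NamedTuple


class Table(NamedTuple):
    l_in_c: int   # cell 1: frequency of L in C
    l_not_c: int  # cell 2: frequency of L in all other constructions
    c_not_l: int  # cell 3: frequency of C with lexemes other than L
    rest: int     # cell 4: all other constructions with other lexemes


def build_table(l_in_c, freq_l, freq_c, n_constructions):
    """Derive the four cells of Table 1 from one joint count and the totals.

    freq_l          column total: frequency of L in the corpus
    freq_c          row total: frequency of C in the corpus
    n_constructions grand total: number of constructions in the corpus
    """
    l_not_c = freq_l - l_in_c
    c_not_l = freq_c - l_in_c
    rest = n_constructions - freq_l - freq_c + l_in_c
    if min(l_in_c, l_not_c, c_not_l, rest) < 0:
        raise ValueError("inconsistent frequencies")
    return Table(l_in_c, l_not_c, c_not_l, rest)


# Invented example: L occurs 100 times, C 35 times, 14 times together.
print(build_table(14, 100, 35, 10000))
```

The grand total is the hardest figure to obtain in real work, because it depends on how "a construction" is counted in the corpus; the sketch simply takes it as given.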

It has been a long-standing aim of corpus linguistics to measure the degree of mutual attraction between lexical elements in text. The mutual associations between lexemes and grammatical constructions came into focus later (Hunston and Francis 2000; Schmid 2000), mainly because of the growing insight that grammar and lexicon are not strictly separate parts.

Stefanowitsch & Gries (2003) used the Fisher-Yates Exact Test to measure the collostruction strength between a word and a construction, while other established statistical procedures include the t-score, the mutual information index, Delta P and the log-likelihood ratio (Church and Hanks 1990; Stubbs 1995; Manning and Schütze 2001; Evert and Krenn 2001; Evert 2004; Ellis and Ferreira-Junior 2009). In this study, the Fisher-Yates Exact Test, log-likelihood and Delta P are chosen for analysis.

2.2.2.1 Fisher-Yates Exact Test

The Fisher-Yates Exact Test is a statistical significance test used in the analysis of contingency tables. Although in practice it is employed when sample sizes are small, it is valid for all sample sizes. It is named after its inventor, Ronald Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis (e.g., the p-value) can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests.

The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association (contingency) between the two kinds of classification. Most uses of the Fisher test involve a 2 × 2 contingency table. The p-value from the test is computed as if the margins of the table are fixed. The principle of the test can be extended to the general case of an m × n table, and some statistical packages provide a calculation for the more general case.

Fisher showed that the probability of obtaining any such set of values was given by the hypergeometric distribution:
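In the usual textbook notation, and assuming that a and b denote the two cells of the first row of a 2 × 2 table, c and d the two cells of the second row, and n = a + b + c + d (these symbols are introduced here for illustration), the hypergeometric probability of one particular table with fixed margins can be written as:

$$
p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}
$$

The p-value of the test is then the sum of such probabilities over the observed table and all tables with the same margins that are at least as extreme.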

The contingency table serves as input to statistical tests that aim to measure the association between constructions and lexemes. Most studies that have applied collostructional analysis use the Fisher-Yates Exact Test. The actual measure chosen to gauge the degree of attraction, referred to as collostruction strength, is the p-value of this test.

The null hypothesis of the Fisher Exact Test is the independence of the occurrence of a target lexeme (L) and a target construction (C). Basically, the distribution of observed frequencies is compared with the frequencies expected under the null hypothesis, calculated on the basis of the row and column totals, which are known as marginals. Given a certain distribution of observed frequencies in the corpus, the p-value indicates the probability of obtaining this distribution or a more extreme one, assuming the null hypothesis that the distribution was the result of chance.

This is interpreted by Stefanowitsch & Gries as meaning that the smaller the p-value, the higher the strength of the association between lexeme and construction. More often than not, p-values are so small that their significance resides only in the number of decimal places.

These scores are conventionally expressed in numbers of the type "1.12E-10".

Note that the larger the number of decimal places, and thus the higher the score for the logarithmic transformation, the lower the p-value, and thus the stronger the hypothetical attraction between lexeme and construction. P-values are computed individually for each of the lexemes investigated in a given construction on the basis of their observed frequencies.

Once p-values have been computed for all targeted lexemes, a ranked list of collexemes is produced, which is taken to be an indicator of the relative differences in collostruction strength.
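As a rough illustration of how such a ranking could be produced, the sketch below implements a one-tailed Fisher exact test in plain Python. The function names, the choice of a one-tailed test for attraction, the negative base-10 logarithm as the transformation, and all frequencies are assumptions made for illustration; they are not taken from the R scripts used in this study.

```python
from math import comb, log10


def fisher_attraction_p(a, b, c, d):
    """One-tailed Fisher exact p-value for attraction between L and C.

    a: L in C                   b: C with lexemes other than L
    c: L in other constructions d: everything else
    Sums the hypergeometric probabilities of the observed table and of
    every table with the same margins in which L occurs in C more often.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    k_max = min(row1, col1)
    numer = sum(comb(row1, k) * comb(n - row1, col1 - k)
                for k in range(a, k_max + 1))
    return numer / comb(n, col1)


def strength(a, b, c, d):
    """Assumed log transformation: higher value = stronger attraction.
    Fails if the p-value underflows to 0.0 in floating point."""
    return -log10(fisher_attraction_p(a, b, c, d))


# Invented frequencies (a, b, c, d) for three candidate collexemes.
counts = {
    "vessel": (30, 70, 170, 9730),
    "cargo": (12, 88, 288, 9612),
    "time": (2, 98, 398, 9502),
}
ranked = sorted(counts, key=lambda w: strength(*counts[w]), reverse=True)
print(ranked)
```

Exact integer arithmetic keeps the sketch simple but becomes slow for corpus-sized tables, which is one reason dedicated statistical software is normally used for this step.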

2.2.2.2 Loglikelihood

In frequentist inference, a likelihood function (often simply the likelihood) is a function of the parameters of a statistical model, given specific observed data. Likelihood functions play a key role in frequentist inference, especially methods of estimating a parameter from a set of statistics. In informal contexts, "likelihood" is often used as a synonym for "probability".

In mathematical statistics, the two terms have different meanings. Probability in this technical context describes the plausibility of a future outcome, given a model parameter value, without reference to any observed data. Likelihood describes the plausibility of a model parameter value, given specific observed data.

In Bayesian inference (1763), one can speak about the likelihood of any proposition or random variable given another random variable: for example the likelihood of a parameter value or of a statistical model (see marginal likelihood), given specified data or other evidence.

The likelihood gives an indication of how much the data contribute to the probability of the parameter value or of the model. More rigorously, the likelihood of a parameter value, given specified data, is the probability of the data given the parameter value. An important consequence of this difference in Bayesian inference is that a parameter value or statistical model can sometimes have a large likelihood given specified data, and yet have a low probability, or vice versa. This is often the case in medical contexts. Per Bayes' rule, the likelihood can be multiplied by a prior probability and then normalized, to give a posterior probability.

Manning and Schütze (2001) show how to compute the likelihood ratio. The following data are needed:

c1: the frequency of the key word

c2: the frequency of the first right collocate of the key word

c12: the frequency of the key word occurring with its first right collocate

n: size of text or corpus

p: c2/n

p1: c12/c1

p2: (c2-c12)/(n-c1)

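The likelihood ratio log λ is obtained with the following:

$$
\log\lambda = \log\frac{b(c_{12}, c_1, p)\; b(c_2 - c_{12}, n - c_1, p)}{b(c_{12}, c_1, p_1)\; b(c_2 - c_{12}, n - c_1, p_2)}
$$

$$
= \log b(c_{12}, c_1, p) + \log b(c_2 - c_{12}, n - c_1, p) - \log b(c_{12}, c_1, p_1) - \log b(c_2 - c_{12}, n - c_1, p_2)
$$

where b stands for the binomial distribution:

$$
b(k, m, x) = \binom{m}{k}\, x^{k} (1 - x)^{m - k}
$$

The binomial coefficients are the same in the numerator and the denominator, so they cancel in the ratio.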

Therefore:

$$
\log\lambda = \log\left(p^{c_{12}}(1-p)^{c_1 - c_{12}}\right) + \log\left(p^{c_2 - c_{12}}(1-p)^{n - c_1 - c_2 + c_{12}}\right) - \log\left(p_1^{c_{12}}(1-p_1)^{c_1 - c_{12}}\right) - \log\left(p_2^{c_2 - c_{12}}(1-p_2)^{n - c_1 - c_2 + c_{12}}\right)
$$

which can be rewritten as:

$$
\begin{aligned}
\log\lambda ={} & c_{12}\log p + (c_1 - c_{12})\log(1-p) + (c_2 - c_{12})\log p + (n - c_1 - c_2 + c_{12})\log(1-p) \\
& - c_{12}\log p_1 - (c_1 - c_{12})\log(1-p_1) - (c_2 - c_{12})\log p_2 - (n - c_1 - c_2 + c_{12})\log(1-p_2)
\end{aligned}
$$

log λ is then multiplied by -2, since -2 log λ is asymptotically χ²-distributed. In the χ² distribution table, the critical value at the significance level α = 0.05 is 3.84 for one degree of freedom, so for a collocational association to be significant, -2 log λ must exceed 3.84. Compared with the χ² statistic, the likelihood ratio λ is more interpretable and can be more appropriate for sparse data.
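The computation can be sketched in a few lines of Python using the quantities c1, c2, c12 and n defined above. The function names and the example frequencies are assumptions for illustration, and the helper treats 0 · log 0 as 0 so that empty cells do not raise an error.

```python
from math import log


def _log_l(k, m, x):
    """log of x**k * (1 - x)**(m - k), taking 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * log(x)
    if m - k > 0:
        total += (m - k) * log(1 - x)
    return total


def minus_two_log_lambda(c1, c2, c12, n):
    """-2 log lambda for a key word (c1), its collocate (c2),
    their joint frequency (c12) and the corpus size (n)."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_lambda = (_log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
                  - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda


# Invented frequencies: the collocate follows the key word 80 times out of 100.
score = minus_two_log_lambda(c1=100, c2=200, c12=80, n=1000)
print(score > 3.84)  # significant at the 0.05 level if True
```

When the collocate is exactly as frequent after the key word as elsewhere (p1 = p2 = p), the statistic is zero, which is a convenient sanity check.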

2.2.2.3 Delta P (ΔP)

The Fisher-Yates Exact Test is a bidirectional association measure. Put differently, it summarily states how much a word and a construction ‘like’ each other, but it cannot differentiate between the degree to which the word ‘likes’ the construction and the degree to which the construction ‘likes’ the word. This is precisely why researchers need to report directional ΔP values in addition to whichever association measure they choose.


Ellis and Ferreira-Junior (2009) introduced the measure of Delta P (ΔP) and emphasized that two reciprocal measures, rather than one unifying measure, may be required to assess the association between constructions and lexemes, on the one hand, and lexemes and constructions, on the other; they therefore recommend the use of two scores. Technically, ΔP measures the contingent probability of a given construction attracting a given lexeme (ΔP construction to word; henceforth "DPC2W") and of a given lexeme relying on a given construction (ΔP word to construction; henceforth "DPW2C").

The two ΔP values can be used to measure the directionality of a collostruction.
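ΔP is commonly defined as the probability of an outcome given a cue minus the probability of the outcome when the cue is absent. A minimal sketch of the two directional scores is given below; mapping DPC2W to P(L | C) − P(L | not C) and DPW2C to P(C | L) − P(C | not L) is an assumption based on the description above, and the function name and frequencies are invented for illustration.

```python
def delta_p(a, b, c, d):
    """Return (DPC2W, DPW2C) for one lexeme L and one construction C.

    a: L in C (cell 1)                    b: C with other lexemes (cell 3)
    c: L in other constructions (cell 2)  d: everything else (cell 4)
    """
    # Cue = construction, outcome = word: P(L | C) - P(L | not C)
    dp_c2w = a / (a + b) - c / (c + d)
    # Cue = word, outcome = construction: P(C | L) - P(C | not L)
    dp_w2c = a / (a + c) - b / (b + d)
    return dp_c2w, dp_w2c


# Invented frequencies: L fills 8 of the 10 slots of C,
# but only 8 of the 40 occurrences of L are in C.
dp_c2w, dp_w2c = delta_p(8, 2, 32, 958)
print(round(dp_c2w, 3), round(dp_w2c, 3))
```

In this invented case the construction predicts the word far better than the word predicts the construction, which is exactly the kind of asymmetry a single bidirectional measure cannot show.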

2.2.3 Critical Issues of Directionality

Investigating associations from the point of view of directionality differs from applying the usual statistical measures, because almost all traditional association measures are bidirectional. Sinclair (2004) presents the notion of statistical collocation through exhaustive empirical testing and suggests window spans, or positions of co-occurrence, of between -4 and +4. Baker (2006) studies collocates of bachelor/bachelors using different statistical measures such as MI, MI3, Z-score, log-likelihood, log-log and observed/expected. Scott (2014) mentions that WordSmith Tools 6.0 includes various bidirectional association measures such as MI, MI3, Z-score, the Dice coefficient and log-likelihood.

For directional approaches to association measures, there are few but important studies which should be mentioned. Wiechmann (2008) compares 47 different association measures with regard to how well they match up with psycholinguistic reading time data using the MS measure. Ellis and Ferreira-Junior (2009) point out that associations are not necessarily reciprocal in strength and that bidirectional/symmetric association measures conflate two probabilities. Michelbacher et al. (2011) compare rank measures based on raw co-occurrence frequencies, G2, t and chi-square with human associations and examine whether they indicate the correct direction of association. Gries (2013) argues that corpus linguistics would benefit from studies dealing with the fact that collocations are not symmetric. He introduces an association measure from the associative learning literature that can identify asymmetric collocations, explores some aspects which the common collocates of near-synonyms such as maritime-marine may have, and tries to explain them through a language network framework.

Directional approaches increase our chances of improving both our results and the match of our methods to current cognitive and psycholinguistic theories, which will be helpful in exploring the common collostructions of near-synonyms such as maritime/marine, port/harbour and sea/ocean.


Chapter 3 Data and Methodology

3.1 Maritime English Corpus as a Study Corpus

Like other learning instruments and resources, corpora can be classified in different ways. Only by understanding the types of corpora and their functions can we choose the right corpus for our needs in work and learning.

Corpora can be classified into different types according to different criteria. For example, by whether the texts are processed, there are raw (untagged) corpora and tagged (annotated) corpora; by source, there are original corpora and translational corpora, the latter collecting texts translated from foreign languages; by content, there are general corpora and specialized corpora.

For keyword analysis, there are study corpora and reference corpora. Keywords are computed using two word lists, one from the text or study corpus that one wants to investigate and the other from a normally larger reference corpus that acts as a benchmark or provides background data for keyword calculation. The British National Corpus (BNC) Sampler is an important general corpus, which can be used as a reference corpus in this study, while the Maritime English Corpus (MEC) is a specialized corpus, which can be used as a study corpus.
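The comparison of two word lists can be sketched as follows, using the log-likelihood keyness statistic that is widely available in keyword tools. Whether this is the exact statistic and setting used in this study is an assumption; the function name and the word frequencies are invented, while the two corpus sizes are the totals reported for the MEC and for BNC Baby.

```python
from math import log


def keyness_ll(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word, comparing its frequency
    in a study corpus with its frequency in a reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    exp_study = size_study * total_freq / total_size
    exp_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * log(freq_study / exp_study)
    if freq_ref > 0:
        ll += freq_ref * log(freq_ref / exp_ref)
    return 2 * ll


MEC_SIZE = 1_446_650       # running words in the study corpus
BNC_BABY_SIZE = 4_000_000  # running words in the reference corpus

# Invented counts: a word occurring 500 times in the MEC and 100 times in BNC Baby.
score = keyness_ll(500, MEC_SIZE, 100, BNC_BABY_SIZE)
print(score > 3.84)  # True means key at the 0.05 level
```

The statistic alone does not say in which corpus the word is over-represented, so in practice the relative frequencies are compared as well.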

The Maritime English Corpus (MEC) was used as the data source in the research. It is a 1,446,650-word corpus including safety at sea, shipping news, navigational and marine engineering technology, laws, rules and regulations and documents on all the related areas of maritime transportation. The following is the composition of MEC:

Table 2. Composition of the MEC

| | MEC Registers | Number of texts | Number of Running Words | Percentage |
|---|---|---|---|---|
| Spoken | SMCP | 1 | 3,476 | 0.26% |
| | Interview | 2 | 15,749 | 1.09% |
| | Presentation | 4 | 80,274 | 5.55% |
| | Others | 7 | 15,240 | 1.03% |
| | Subtotal | 14 | 114,991 | 7.93% |
| Written | Academy | 55 | 505,985 | 34.98% |
| | Laws | 7 | 213,492 | 14.76% |
| | News | 197 | 574,439 | 39.71% |
| | Textbooks | 9 | 37,995 | 2.62% |
| | Subtotal | 268 | 1,331,911 | 92.07% |
| Total | | 282 | 1,446,650 | 100% |

The corpus was then syntactically parsed with Stanford Parser v.3.7.0 (https://nlp.stanford.edu/software/lex-parser.shtml), adding part-of-speech tags for the subsequent collostructional analysis.

3.2 British National Corpus (BNC) Baby as a Reference Corpus

BNC Baby is a four million word sampling of the 100 million word British National Corpus. It contains a brief description of the design of this sample and information about the way in which it is encoded. The words in each sample set correspond to a specific genre label.

One sample set contains spoken conversation and the other three sample sets contain written text: academic writing, fiction and newspapers respectively. The latest (third) edition has been released and comes in XML format.

Table 3. Composition of the BNC Baby

| | BNC Baby Registers | Number of texts | Number of Running Words | Percentage |
|---|---|---|---|---|
| Spoken | Conversations | 30 | 696,258 | 17.41% |
| Written | Academic Texts | 30 | 1,300,467 | 32.51% |
| | Fiction | 25 | 1,001,454 | 25.04% |
| | Newspapers | 97 | 1,001,821 | 25.04% |
| | Subtotal | 152 | 3,303,742 | 82.59% |
| Total | | 182 | 4,000,000 | 100% |

3.3 Corpus Analysis Tools for Data Processing

Several tools are needed for the statistical analysis. In this study, we use AntConc to extract the keywords, the Stanford Parser to POS-tag the corpus, and the R language to carry out the statistical analysis. This part briefly introduces them.

3.3.1 AntConc

AntConc is a freeware concordance program for Windows, Macintosh OS X, and Linux.

The software includes seven tools:

1. Concordance Tool: shows search results in a 'KWIC' (KeyWord In Context) format.

2. Concordance Plot Tool: shows search results plotted in a 'barcode' format. This allows you to see the position where search results appear in target texts.

3. File View Tool: shows the text of individual files. This allows you to investigate in more detail the results generated in other tools of AntConc.

4. Clusters/N-Grams: the Clusters Tool shows clusters based on the search condition. In effect it summarizes the results generated in the Concordance Tool or Concordance Plot Tool. The N-Grams Tool, on the other hand, scans the entire corpus for 'N' (e.g. 1 word, 2 words, …) length clusters. This allows you to find common expressions in a corpus.

5. Collocates: shows the collocates of a search term. This allows you to investigate non-sequential patterns in language.

6. Word List: counts all the words in the corpus and presents them in an ordered list. This allows you to quickly find which words are the most frequent in a corpus.

7. Keyword List: shows which words are unusually frequent (or infrequent) in the corpus in comparison with the words in a reference corpus. This allows you to identify characteristic words in the corpus, for example, as part of a genre or ESP study.

3.3.2 Stanford Parser

The Stanford Parser is an important one among a large number of Penn Treebank parsers. Its development was one of the biggest breakthroughs in natural language processing in the 1990s. The Stanford dependencies scheme has gained popularity throughout various natural language processing tasks. As a statistical parser, it still makes some mistakes, but commonly works rather well. It attained the highest confidence-weighted score of all entrants in the 2005 competition by a significant margin.

The parser deals with various languages; apart from English, the Stanford Parser also parses Chinese according to the Chinese Treebank, German based on the Negra corpus and Arabic according to the Penn Arabic Treebank. It provides phrase structure trees as well as dependencies output. In this study, the Stanford Parser was used to tag the two corpora.

3.3.3 R Language

R is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R is a GNU package. The source code for the R software environment is written primarily in C, Fortran, and R. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. While R has a
