An Evaluation of Translation Quality by Homograph Disambiguation in Korean-X Neural Machine Translation Systems

Academic year: 2021


1. Introduction

Neural machine translation (NMT) was recently proposed as an end-to-end approach in which translation is performed by a single neural network [1-2]. With the aid of powerful deep learning methods, NMT has become the dominant paradigm in machine translation (MT), with remarkable improvements over rule-based and statistics-based MT [3-5]. NMT systems are often based on a sequence-to-sequence model that consists of an encoder and a decoder recurrent neural network (RNN).

The initial step of NMT is to compute word embeddings for the source and target languages individually by converting each word into a continuous vector. The encoder RNN then encodes a source sentence (i.e., a sequence of word embeddings) into a single context vector [6] or a sequence of context vectors [7]. The decoder RNN decodes the context vector into a target sentence through the target language's word embeddings.
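As a toy illustration of this first step, the sketch below converts a tokenized sentence into a sequence of embeddings. The six-word vocabulary and random vectors are invented for illustration; a real system learns the embedding matrix during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding matrix E: one continuous vector per word.
vocab = {"<pad>": 0, "he": 1, "turned": 2, "on": 3, "the": 4, "light": 5}
dim = 4
E = rng.normal(size=(len(vocab), dim))

def embed(sentence):
    """Convert a tokenized sentence into a sequence of word embeddings."""
    return np.stack([E[vocab[w]] for w in sentence])

x = embed(["he", "turned", "on", "the", "light"])
print(x.shape)  # (5, 4): one 4-dimensional vector per source word
```

Note that "light" receives a single vector here no matter which sense is intended, which is exactly the ambiguity problem discussed next.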

The potential issue with word embeddings is that multiple senses of a word are encoded into one continuous vector. The encoder and decoder RNNs must learn to choose the correct target word from several translation candidates that represent different senses of the source word. Even after spending a substantial amount of their capacity, the encoder and decoder still fail to disambiguate word senses, and consequently NMT cannot translate ambiguous words correctly [8-9].

In most languages, many words have the same lexical form but different senses. For example, in English, the sense of “light” is “not heavy” in the sentence “The sack of potatoes is 5 kilos light,” but “illumination” in “He turned on the light.” The sense of a word in a specific usage can only be determined from its neighboring context. This is a trivial task that occurs subconsciously in the human brain; a computer, however, requires a tremendous amount of knowledge to disambiguate word senses.

In order to address the issue of ambiguous words in NMT, we introduce a tool, UTagger, which identifies the correct senses of homographic words and tags the corresponding sense-codes onto these words. These processes are applied only to the Korean text in the training parallel corpus before it is used to train NMT systems. Each sense-code, which represents a specific sense of a word, is defined as a numeral in the Standard Korean Language Dictionary (표준국어대사전, SKLD). For instance, the sense-codes of the Korean word “사과” are defined from 01 to 08 to represent its different senses, as shown in Table 1. Because computers delimit words by the blank spaces between them, tagging a distinct sense-code onto a word creates a new word (e.g., “사과_05” is the form of “사과” tagged with “05”). Thus, NMT systems can handle the word ambiguity problem for Korean texts.
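The tagging scheme can be illustrated in a few lines of Python. The `tag_sense` helper is hypothetical, but the output format matches the “사과_05” example above:

```python
# Hypothetical helper; real sense-codes come from the Standard Korean
# Language Dictionary and are chosen in context by UTagger.
def tag_sense(word, sense_code):
    """Append a sense-code so that homographs become distinct tokens."""
    return f"{word}_{sense_code:02d}"

apple = tag_sense("사과", 5)    # sense 05: apple
apology = tag_sense("사과", 6)  # sense 06: forgiveness
print(apple, apology)           # 사과_05 사과_06
```

Because the two tagged forms differ, a whitespace-delimited MT system now sees two distinct vocabulary items instead of one ambiguous word.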

한-X 신경기계번역시스템에서 동형이의어 분별에 따른 번역질 평가

원광복, 신준철, 옥철영
울산대학교

nqphuoc@gmail.com, ducksjc@nate.com, okcy@ulsan.ac.kr

An Evaluation of Translation Quality by Homograph Disambiguation in Korean-X Neural Machine Translation Systems

Quang-Phuoc Nguyen, Joon-Choul Shin, Cheol-Young Ock
University of Ulsan

Abstract

Neural machine translation (NMT) has recently achieved state-of-the-art performance. However, it is reported to fail at word sense disambiguation (WSD) for several popular language pairs. In this paper, we explore the extent to which NMT systems are able to disambiguate Korean homographs. Homographs, words with different meanings but the same written form, cause word-choice problems for NMT systems. Consistent with findings for other popular language pairs, we discover that NMT systems fail to translate Korean homographs correctly. We provide a Korean word sense disambiguation tool, UTagger, to improve NMT translation quality. We conducted translation experiments with the Korean-English and Korean-Vietnamese language pairs. The experimental results show that UTagger can significantly improve the translation quality of NMT in terms of the BLEU, TER, and DLRATIO evaluation metrics.

UTagger [10] was developed based on our Korean lexical semantic network (LSN), UWordMap [11], which comprises hierarchical networks for nouns, adjectives, verbs, and adverbs based on hyponymy relations. The connections between the networks in UWordMap were established through subcategorization information that we manually compiled from the sentence structures of example and definition statements in SKLD. Each node corresponds to a specific sense of a word, so it contains a word's original form and a sense-code. UWordMap currently contains approximately 500 thousand words covering all parts of speech (POS) and is the biggest and most comprehensive LSN for the Korean language. Using UWordMap as a knowledge base, UTagger achieves an accuracy of 96.52% and a speed of approximately 30,000 words per second (CPU: Core i7 860, 2.8 GHz) when tested on the Sejong corpus [12].

We extensively evaluated the effectiveness of UTagger on NMT with the Korean-English and Korean-Vietnamese language pairs. The experimental results reveal that UTagger can significantly improve the quality of NMT systems in terms of the BLEU, TER, and DLRATIO evaluation metrics.

2. Neural Machine Translation

Most NMT models follow the attention-based encoder-decoder architecture [7]. The encoder is a bi-directional recurrent neural network (RNN): the forward RNN reads the source sentence from left to right and computes forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x})$, and the backward RNN reads the source sentence in reverse order and produces backward hidden states $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x})$, where $T_x$ is the length of the source sentence.

The forward hidden state at time $i$ is calculated by

$\overrightarrow{h}_i = (1 - \overrightarrow{z}_i) \circ \overrightarrow{h}_{i-1} + \overrightarrow{z}_i \circ \tilde{\overrightarrow{h}}_i$ for $i > 0$, with $\overrightarrow{h}_0 = 0$  (1)

where

$\tilde{\overrightarrow{h}}_i = \tanh(\overrightarrow{W} E x_i + \overrightarrow{U} [\overrightarrow{r}_i \circ \overrightarrow{h}_{i-1}])$  (2)

$\overrightarrow{z}_i = \sigma(\overrightarrow{W}_z E x_i + \overrightarrow{U}_z \overrightarrow{h}_{i-1})$  (3)

$\overrightarrow{r}_i = \sigma(\overrightarrow{W}_r E x_i + \overrightarrow{U}_r \overrightarrow{h}_{i-1})$  (4)

$E$ is the word-embedding matrix of the source language, shared between the forward and backward RNNs; the $W_*$ and $U_*$ are weight matrices; and $\sigma$ denotes the logistic sigmoid function. The backward hidden states are calculated analogously. The forward and backward hidden states are concatenated to obtain the source annotations $(h_1, \dots, h_{T_x})$ with $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
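Equations (1)-(4) can be sketched numerically as follows. The dimensions and randomly initialized weights are toy values chosen for illustration; a real encoder learns them jointly with the rest of the network.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_forward_states(x_ids, E, W, U, Wz, Uz, Wr, Ur):
    """Compute forward hidden states per Eqs. (1)-(4) for one sentence."""
    n = U.shape[0]
    h = np.zeros(n)                            # h_0 = 0 (Eq. 1, i = 0 case)
    states = []
    for i in x_ids:
        e = E[i]                               # embedding of source word i
        z = sigmoid(Wz @ e + Uz @ h)           # update gate (Eq. 3)
        r = sigmoid(Wr @ e + Ur @ h)           # reset gate (Eq. 4)
        h_tilde = np.tanh(W @ e + U @ (r * h)) # candidate state (Eq. 2)
        h = (1 - z) * h + z * h_tilde          # interpolation (Eq. 1)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(1)
d, n, V = 3, 4, 10   # embedding dim, hidden dim, vocabulary size (toy)
params = [rng.normal(scale=0.1, size=s) for s in
          [(V, d), (n, d), (n, n), (n, d), (n, n), (n, d), (n, n)]]
H = gru_forward_states([2, 7, 5], *params)
print(H.shape)  # (3, 4): one hidden state per source position
```

Running the same loop over the reversed word indices would give the backward states, which are then concatenated position-wise to form the annotations $h_i$.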

The decoder is a forward RNN that generates the target sentence $y = (y_1, \dots, y_{T_y})$, $y_i \in \mathbb{R}^{K_y}$, where $T_y$ is the length of the target sentence and $K_y$ is the vocabulary size of the target language. Word $y_i$ is generated from the conditional probability

$p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$  (5)

The hidden state is first initialized with $s_0 = \tanh(W_s \overleftarrow{h}_1)$ and then calculated at each time step by

$s_i = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i$  (6)

where

$\tilde{s}_i = \tanh(W E y_{i-1} + U [r_i \circ s_{i-1}] + C c_i)$  (7)

$z_i = \sigma(W_z E y_{i-1} + U_z s_{i-1} + C_z c_i)$  (8)

$r_i = \sigma(W_r E y_{i-1} + U_r s_{i-1} + C_r c_i)$  (9)

Here $E$ is the word-embedding matrix of the target language, and the $W_*$, $U_*$, and $C_*$ are weight matrices.

The context vector $c_i$ is calculated from the source annotations by

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$, with $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$  (10)

$e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$  (11)

where $e_{ij}$ is the score of an attention mechanism that measures how well $s_{i-1}$ and $h_j$ match, and $v_a$, $W_a$, and $U_a$ are weight matrices.

3. Korean Word Sense Disambiguation: UTagger

Unlike English, Korean is a morphologically complex language in which a token unit (eojeol, 어절) delimited by whitespace consists of a content word and one or more function words, such as postpositions, endings, and auxiliaries. Before homograph disambiguation, Korean input texts therefore need to be morphologically analyzed.

The problem of Korean morphological analysis is that several different sets of morphemes and POS may be encoded into the same eojeol. For instance, the four different sets of morphemes and POS shown in Table 2 can all make up the eojeol “가시는”. Phonemes in morphemes may change according to many regular and irregular patterns, and the same morphemes tagged with different POS have different meanings. Morphological analysis has to discover the correct set in a given context.
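The ambiguity is a one-to-many mapping from surface form to candidate analyses; the entries below simply transcribe Table 2, and choosing among them is the analyzer's job:

```python
# Candidate analyses of the eojeol "가시는" (transcribed from Table 2).
# A morphological analyzer must pick the correct one from context.
CANDIDATES = {
    "가시는": [
        ("가/VV + 시/EP + 는/ETM", "to go (honorific form)"),
        ("갈/VV + 시/EP + 는/ETM", "to sharpen (honorific form)"),
        ("가시/VV + 는/ETM", "to disappear, vanish"),
        ("가시/NNG + 는/JX", "a prickle, thorn, or needle"),
    ]
}
print(len(CANDIDATES["가시는"]))  # 4 competing analyses for one surface form
```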

Most of the conventional methods [13-15] for Korean morphological analysis have had to perform the following tasks:

- Segmenting the input eojeol into morphemes
- Recovering changed phonemes to their original forms
- Assigning (tagging) a POS to each morpheme

Because these methods must perform various interim processes and transform character codes to recover original forms, they increase the frequency of dictionary accesses and cause an over-analysis problem. The longest-match strategy [16] was proposed to reduce the frequency of dictionary accesses, and a syllable-based prediction model [17] was introduced to handle the over-analysis problem. Recently, statistical [18-20] and deep-learning [21-22] approaches have been investigated to address these problems. However, the high computational complexity of these approaches leads to low-performance systems, and they raise maintenance problems whenever a neologism enters the language.

Table 1: The sense-codes of the Korean word “사과”.

Sense-code   POS    Sense
01           Noun   a kind of cantaloupe
02           Noun   a secretary of Joseon’s military
03           Noun   4 enlightenments of Buddhism
04           Noun   4 departments of Confucianism
05           Noun   apple
06           Noun   forgiveness
07           Noun   loofah

The method of using a pre-analyzed eojeol dictionary (PED) [23-24] can overcome these problems. In this method, a dictionary of analyzed eojeols is built in advance, and the problem turns into looking up morphologically analyzed eojeols in the PED. The method is fast because it needs neither to identify changed phonemes nor to recover original forms, and it is easily maintained by editing or inserting entries in the PED. However, building a PED that contains all eojeols of Korean is an impossible task.
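A minimal sketch of the PED idea, with invented entries, shows why lookup is fast and why coverage is the limiting factor:

```python
# Illustrative entries only; the real PED holds pre-computed analyses for
# a large set of eojeols.
PED = {
    "가시는": "가/VV + 시/EP + 는/ETM",
    "눈을": "눈/NNG + 을/JKO",
}

def analyze(eojeol):
    """Look up a pre-analyzed eojeol. None means the dictionary has no
    entry, which is why a *partial* dictionary (PPED) is combined with
    sub-word conditional probabilities in the authors' method."""
    return PED.get(eojeol)

print(analyze("눈을"))      # 눈/NNG + 을/JKO: constant-time lookup
print(analyze("래드불을"))  # None: an unseen eojeol falls through
```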

Instead of a full PED, we constructed a pre-analyzed partial eojeol dictionary (PPED) and proposed a morphological analysis method that combines the PPED [25] with sub-word conditional probabilities [26]. This method not only retains the speed and easy maintenance of the PED approach but also achieves high accuracy. Table 3 gives morphological analysis accuracies on the Sejong corpus; the comparison shows that UTagger achieves state-of-the-art accuracy for Korean morphological analysis.

After the analysis into morphemes, UTagger disambiguates homographs using UWordMap in a knowledge-based approach. There are several other approaches to Korean WSD, such as statistical [26], embedded word space [27], and bidirectional recurrent neural network [28] methods. However, these approaches suffer from the neologism problem. For instance, in the sentence “래드불을 따르다…” (I pour a Redbull...), assume that “래드불” is a new beverage product that does not exist in the training corpus. These approaches then cannot determine the sense of “따르다”, which has multiple meanings.

The knowledge-based approach can solve the neologism problem by adding the neologism to the LSN under its hypernym (e.g., the hypernym of “래드불” is beverage). Instead of “래드불” itself, the WSD system examines its hypernym (beverage) to determine the correct senses of the words in the sentence. This approach is easy to maintain but requires an LSN with abundant data.
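A toy sketch of this hypernym back-off follows. The network and sense table here are invented for illustration; the real system consults UWordMap and SKLD sense-codes.

```python
# Hypothetical fragment of a lexical semantic network: word -> hypernym.
HYPERNYM = {"래드불": "beverage", "콜라": "beverage"}

# Hypothetical selectional preferences: which object class selects which
# sense of the verb 따르다 ("to pour" vs. "to follow").
VERB_SENSE = {
    ("따르다", "beverage"): "따르다_01 (to pour)",
    ("따르다", "person"): "따르다_02 (to follow)",
}

def disambiguate(verb, obj):
    """Back off from an unseen object word to its hypernym class."""
    cls = HYPERNYM.get(obj, obj)
    return VERB_SENSE.get((verb, cls))

print(disambiguate("따르다", "래드불"))  # 따르다_01 (to pour)
```

Even though “래드불” never occurred in any training data, its hypernym entry in the network is enough to select the pouring sense.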

Therefore, we constructed our Korean LSN, UWordMap, as a large-scale lexical knowledge base. Table 4 shows statistics of UWordMap and a comparison with existing Korean LSNs. KorLex [29] was constructed based on the English WordNet. CoreNet [30] was developed by mapping the Goidaikei Japanese hierarchical lexicon [31] to Korean word senses. LCN is the ETRI lexical concept network [32], which was designed for question-answering systems. Currently, UWordMap is the biggest and most comprehensive Korean LSN.

We then developed UTagger to work according to the processes shown in Figure 1; the algorithms in UTagger are described in detail in [10] and [25-26]. We evaluated UTagger on a test set of 1,108,204 eojeols extracted from the Sejong corpus by selecting the sentences whose order is divisible by 10. The accuracy of UTagger reached 96.52%, and it processed approximately 30K Korean words per second on a CPU (Core i7 860, 2.8 GHz).
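The stated selection rule can be sketched as follows; the exact indexing convention used in the paper is an assumption here.

```python
# Select every sentence whose (1-based) order is divisible by 10 as the
# held-out test set; the remainder would form the training material.
sentences = [f"sent-{i}" for i in range(1, 101)]
test_set = [s for i, s in enumerate(sentences, start=1) if i % 10 == 0]
print(len(test_set))  # 10 of 100 sentences selected
```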

We also compared UTagger with recent machine-learning methods that used the Sejong corpus to train and evaluate their systems. As shown in Table 5, UTagger outperformed these methods. UWordMap and UTagger are available for online use and download at http://nlplab.ulsan.ac.kr.

Table 2. The eojeol “가시는” and its analyzable morphemes

Eojeol    Morphemes                   Meaning
가시는    가/VV + 시/EP + 는/ETM      to go (honorific form)
          갈/VV + 시/EP + 는/ETM      to sharpen (honorific form)
          가시/VV + 는/ETM            to disappear, vanish
          가시/NNG + 는/JX            a prickle, thorn, or needle

Table 3. Morphological analysis accuracies (on the Sejong corpus)

Method                                            Accuracy
CRF (Na, 2015) [19]                               95.22%
Phrase-based statistical (Na et al., 2018) [20]   96.35%
Neural architecture (Jung et al., 2018) [21]      97.08%
Bi-LSTM (Matteson et al., 2018) [22]              96.20%
UTagger                                           98.20%

Table 4. Comparison of UWordMap and existing Korean wordnets

            KorLex     CoreNet    LCN       UWordMap
Noun        104,417    51,607     49,000    293,547
Verb        20,151     5,290      30,000    78,563
Adjective   20,897     2,801                18,539
Adverb      3,123                           105,450
Total       150,199    58,985     79,000    496,099

Figure 1. The UTagger pipeline: input text, tokenization, morphological analysis, sub-word conditional probability processing, word sense disambiguation, and output text, drawing on the Sejong corpus, the pre-analyzed dictionary, the lexical semantic network UWordMap, and the Standard Korean Language Dictionary.

4. Experiments and Results

To evaluate the effectiveness of our Korean WSD method in improving NMT results, we conducted experiments on bi-directional translation for the Korean-English and Korean-Vietnamese language pairs.

4.1. Datasets

We built the Korean-English and Korean-Vietnamese parallel corpora by collecting bilingual aligned texts from diverse resources. We extracted definition statements and examples from the National Institute of Korean Language’s Learner Dictionary (https://krdict.korean.go.kr). We downloaded and aligned texts from articles in multilingual magazines and books such as “Watchtower and Awake!” (https://www.jw.org/en/publications/magazines/), “Books & Brochures” (https://www.jw.org/en/publications/books/), and “Rainbow” (https://www.liveinkorea.kr), which span many categories (economy, entertainment, health, science, society, politics, and technology). These resources are well aligned and well translated. We also crawled texts from online journals and websites that contain Korean-English and Korean-Vietnamese language pairs. Because these texts contain many mismatches, we carefully selected and filtered the texts with good alignment.

We then removed noise from the collected Korean-English and Korean-Vietnamese bilingual aligned texts: messy character codes, HTML tags, and special symbols and characters used for display on websites. We removed long sentences, which can crash MT systems; in these corpora, we define a long sentence as one with over 50 words. We also removed duplicated sentences, which sometimes occur because texts were collected from many resources. We re-corrected sentence splitting, and each sentence was stored on one line of a disk file. Finally, we obtained 1,251,075 and 410,131 sentence pairs for Korean-English and Korean-Vietnamese, respectively. We extracted 2,000 sentence pairs from each corpus as a test set; the rest was used as the training set. Details of the corpora are shown in Table 6. These parallel corpora are available for download at https://github.com/nqphuoc/UKren.
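The cleaning steps above can be sketched as follows; the regex and thresholds mirror the description, not the authors' actual code.

```python
import re

def clean_corpus(pairs, max_len=50):
    """Sketch of the described cleaning: strip HTML-tag noise, drop pairs
    with a sentence of more than max_len words, and remove duplicates."""
    seen, out = set(), []
    for src, tgt in pairs:
        src = re.sub(r"<[^>]+>", " ", src).strip()  # remove HTML tags
        tgt = re.sub(r"<[^>]+>", " ", tgt).strip()
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue                                # drop long sentences
        if (src, tgt) in seen:
            continue                                # drop duplicates
        seen.add((src, tgt))
        out.append((src, tgt))
    return out

pairs = [("<b>눈에 미끄러졌다</b>", "I slipped on the snow"),
         ("눈에 미끄러졌다", "I slipped on the snow")]
print(len(clean_corpus(pairs)))  # 1: tag-stripped pair deduplicates
```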

4.2. Integrating UTagger into the Corpora

Korean words in both the training and testing sets were tagged with sense-codes before they were input into the NMT systems; UTagger thus works as a preprocessor for the MT systems. Table 7 gives an example of a Korean sentence tagged with sense-codes. Because MT systems delimit words by the white spaces between them, sense-code tagging transforms homographic words into distinct words, eliminating ambiguous words from the Korean dataset.

UTagger changed the sizes of the tokens and vocabulary (i.e., the types of tokens) in the Korean dataset, as shown in Table 6. As explained above, the Korean WSD includes two steps. The first step analyzes morphology: each Korean word (eojeol) is segmented into morphemes, which are recovered to their original forms. The second step tags homographic words with the appropriate sense-codes. Morpheme segmentation increased the token count, original-form recovery reduced the vocabulary size, and tagging different sense-codes onto the same homographic word increased the vocabulary size.

4.3. Implementation

We implemented our NMT systems on the open framework OpenNMT [33], which provides the sequence-to-sequence model described in Section 2. The systems were set with the following parameters: word-embedding dimension of 500, two hidden layers of 500 RNN units each, input feeding, and 13 training epochs.
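Assuming the legacy OpenNMT-py command-line interface (flag names changed across OpenNMT releases, so the exact flags here are an assumption, and the file names are placeholders), the stated configuration would look roughly like:

```shell
# Hypothetical invocation of legacy OpenNMT-py matching the stated setup.
python preprocess.py -train_src train.ko -train_tgt train.en \
    -valid_src valid.ko -valid_tgt valid.en -save_data data/koen
python train.py -data data/koen -save_model koen_model \
    -word_vec_size 500 -layers 2 -rnn_size 500 \
    -input_feed 1 -epochs 13
```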

We used these NMT systems for bi-directional translation of the Korean-English and Korean-Vietnamese language pairs. To separately evaluate the effectiveness of our morphological analysis and sense-code tagging, we used three systems (Baseline, Morphology, and WSD) for each direction. The Baseline systems were trained with the originally collected corpora given in Table 6. The Morphology systems were trained with the Korean corpus that had been morphologically analyzed. In the WSD systems, the Korean training corpus was both morphologically analyzed and tagged with sense-codes.

Table 5. Korean WSD results comparison

Approach                          Accuracy
Statistical-based [26]            96.42%
Embedded word space [27]          85.50%
Recurrent neural network [28]     96.20%
UTagger                           96.52%

Table 6. Statistics of the parallel corpora

                        #Sentences   Avg. len   #Tokens      #Types
English                 1,251,075    11.5       14,387,731   353,153
Korean (original)                    8.5        10,693,99    827,315
Korean (morph. ann.)                 28.8       22,864,606   127,109
Korean (WSD)                                                 135,657
Vietnamese              410,131      20.5       8,408,437    39,263
Korean (original)                    12.7       5,194,098    372,473
Korean (morph. ann.)                 22.7       9,307,781    59,025
Korean (WSD)                                                 63,667

Table 7. An example of a sense-code tagged sentence

Original form:          눈에 미끄러져서 눈을 다쳤다.
Sense-code tagged form: 눈__04/NNG+에/JKB 미끄러지/VV+어서/EC 눈__01/NNG+을/JKO 다치__01/VV+었/EP+다/EF+./SF

4.4. Evaluation

We used the BLEU, TER, and DLRATIO evaluation metrics to measure the translation quality. BLEU (Bi-Lingual Evaluation Understudy) [34] measures the precision of an MT system by comparing the n-grams of a candidate translation with those in the corresponding reference and counting the number of matches. In this research, we use the BLEU metric with 4-grams. TER (Translation Error Rate) [35] is an error metric for MT that measures the number of edits required to change a system output into one of the references. DLRATIO [36] (Damerau-Levenshtein edit distance) measures the edit distance between two sequences.
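A minimal single-reference sentence-level BLEU-4 with a brevity penalty can be sketched as follows. Real evaluations use corpus-level BLEU, and this unsmoothed version returns 0 whenever any n-gram order has no match.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Minimal single-reference sentence BLEU with up to 4-grams and a
    brevity penalty (a sketch of the metric, not a full implementation)."""
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(c[i:i + n]) for i in range(len(c) - n + 1))
        r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped matches
        total = max(sum(c_ngrams.values()), 1)
        if overlap == 0:
            return 0.0          # unsmoothed: any empty order gives BLEU 0
        log_prec += math.log(overlap / total) / max_n
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(log_prec)

print(bleu("I slipped on the snow and injured my eyes",
           "I slipped on the snow and injured my eyes"))  # 1.0
```

A perfect match scores 1.0; partially overlapping candidates score between 0 and 1, discounted further when the candidate is shorter than the reference.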

Table 8 shows the results of the 12 systems in terms of their BLEU, TER, and DLRATIO scores. All three metrics demonstrate that both the Morphology analysis and WSD systems improved the translation quality for both language pairs and both translation directions.

The Morphology systems improved on the Baseline systems for all language pairs, by an average of 2.83 and 6.56 BLEU points for translation from and to Korean, respectively. Morphological complexity causes a critical data-sparsity problem when translating into or from Korean: sparsity increases the number of out-of-vocabulary words and reduces the occurrence probability of each word in the training corpus. For instance, NMT systems treat the inflected forms of the Korean verb “to go” (“가다,” “간다,” “가요,” and “갑니다”) as completely different words. Hence, Korean morphological analysis can improve the translation results. The improvement differs between translation directions because we applied the morphological analysis only to the Korean side; therefore, the improvement for translation from Korean is more significant than that in the reverse direction.

The Korean sense-code tagging helped the NMT systems correctly align words in the parallel corpus as well as choose correct words for an input sentence. Therefore, the performance of the WSD systems further improved by an average of 4.14 and 2.80 BLEU points for all the language pairs when translating from and to Korean, respectively. In comparison with the Baseline systems, the WSD systems improved the translated results for all language pairs by an average of 6.97 and 9.37 BLEU points for translations from and to Korean, respectively. In summary, the proposed Korean WSD can remarkably improve the translation quality of NMT systems.

The TER and DLRATIO metrics provide further evidence that the proposed Korean WSD system can improve the translation quality of NMT. The results in Table 8 show that it improved NMT performance by an average of 7.40 TER and 5.66 DLRATIO error points when translating from Korean into the other languages. In the reverse direction, it improved performance by an average of 10.13 TER and 5.34 DLRATIO error points across all NMT systems. In particular, the Korean sense-code tagging reduced translation errors by 8.76 TER points and 5.50 DLRATIO points across all language pairs. In short, the proposed Korean WSD can considerably reduce NMT errors.

Furthermore, we examined how some well-known MT systems handle the Korean WSD problem. We input the sentence from Table 7, “눈에 미끄러져서 눈을 다쳤다.”, into Google Translate, Microsoft Bing Translator, and Naver Papago. In this sentence, the word “눈” occurs twice with two different meanings: snow and eye. The translated results are shown in Table 9. Naver Papago and the proposed system translated the sentence correctly, whereas Google Translate and Microsoft Bing Translator could not distinguish the different meanings of “눈” and translated the sentence incorrectly.

5. Conclusion

In this research, we discovered that NMT systems fail to translate Korean homographs correctly. We therefore proposed a fast and accurate Korean WSD system, UTagger. Experimental results for bi-directional translation of the Korean-English and Korean-Vietnamese language pairs demonstrate that UTagger significantly improves NMT results.

In the future, we plan to collect more data related to Korean. Additionally, we intend to study the application of syntactic and parsing attentional models to NMT systems.

Table 9. Korean-to-English translation examples

Source             눈에 미끄러져서 눈을 다쳤다.
Reference          I slipped over the snow and my eyes are injured.
Naver Papago       I slipped in the snow and hurt my eyes.
Google Translate   I slipped on my eyes and hurt my eyes.
Microsoft Bing     I slipped my eyes and injured my eyes.
Proposed system    I slipped on the snow and injured my eyes.

Google Translate, Microsoft Bing, and Naver Papago were accessed on Sep. 11, 2018. In the original table, incorrectly translated text is highlighted in red.

Table 8. Translation results

System                             BLEU    TER     DLRATIO
Korean-to-English Baseline         20.39   64.27   53.31
Korean-to-English Morphology       25.49   62.18   52.07
Korean-to-English WSD              30.35   57.63   48.10
English-to-Korean Baseline         23.49   71.03   58.39
English-to-Korean Morphology       24.05   68.58   56.18
English-to-Korean WSD              27.48   62.86   52.28
Korean-to-Vietnamese Baseline      24.66   71.75   70.17
Korean-to-Vietnamese Morphology    25.53   67.43   65.68
Korean-to-Vietnamese WSD           27.79   65.82   64.53
Vietnamese-to-Korean Baseline      9.83    84.71   70.10
Vietnamese-to-Korean Morphology    22.09   73.42   65.42
Vietnamese-to-Korean WSD           25.44   70.38   65.05


Acknowledgment

This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2017M3C4A7068187).

References

[1] N. Kalchbrenner and P. Blunsom, “Recurrent continuous translation models,” In Proc. of the 2013 EMNLP, San Diego, CA, USA, 2013, pp. 1700-1709.

[2] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” In Proc. the 27th NIPS, Montreal, Canada, 2014, pp. 3104-3112.

[3] L. Bentivogli, A. Bisazza, M. Cettolo, and M. Federico, “Neural versus phrase-based machine translation quality: a case study,” in Proc. of the 2016 EMNLP, Austin, TX, USA, 2016, pp. 257-267.

[4] J. Crego et al., “SYSTRAN’s Pure Neural Machine Translation Systems,” arXiv preprint arXiv:1610.05540, Oct. 2016.

[5] M. Junczys-Dowmunt, T. Dwojak, and H. Hoang, “Is neural machine translation ready for deployment? A case study on 30 translation directions”, In Proc. of the 13th IWSLT, Seattle, DC, USA, 2016.

[6] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” In Proc. the 2014 EMNLP, Doha, Qatar, pp. 1724–1734.

[7] D. Bahdanau, K. H. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate”, in Proc. of ICLR, 2015.

[8] R. Marvin and P. Koehn, “Exploring Word Sense Disambiguation Abilities of Neural Machine Translation Systems,” In Proc. the 13th AMTA, Boston, MA, USA, 2018, pp. 125-131.

[9] H. Choi, K. Cho, and Y. Bengio, “Context-dependent word representation for neural machine translation,” Computer Speech & Language, Vol. 45, pp. 149-160, 2017.

[10] J. C. Shin and C. Y. Ock, “Improvement of Korean Homograph Disambiguation using Korean Lexical Semantic Network (UWordMap)”, KIISE: Software and Applications, vol. 43, pp. 71-79, 2016.

[11] Y. J. Bae, C. Y. Ock, "Introduction to the Korean Word Map (UWordMap) and API," in Proc. of 26th Annual Conf. on Human and Language Technology, 2014, pp. 27-31.

[12] H. Kim, “Korean national corpus in the 21st century Sejong project,” In Proc. the 13th NIJL International Symposium, Tokyo, Japan, 2006, pp 49-54.

[13] S. S. Kang and Y. T. Kim, “Syllable-based model for the Korean morphology,” in Proc. 15th COLING, Kyoto, Japan, 1994, pp. 221-226.

[14] D. B. Kim, S. J. Lee, K. S. Choi, and G. C. Kim, “A two-level morphological analysis of Korean,” in Proc. 15th COLING, Kyoto, Japan, 1994, pp. 535-539.

[15] O. W. Kwon et al., “Korean morphological analyzer and part-of-speech tagger based on CYK algorithm using syllable information,” in Proc. MATEC99, 1999, pp. 76-88.

[16] J. H. Choi and S. J. Lee, “A method for reducing dictionary access with bidirectional longest match strategy in Korean morphological analyzer,” J. Kor. Inf. Sci. Soc. Soft. Appl., vol. 20, no. 10, pp. 1497-1507, Oct 1993.

[17] G.G. Lee, J. H. Lee, and J. W. Cha, “Syllable-pattern-based unknown morpheme segmentation and estimation for hybrid part-of-speech tagging of Korean,” Comput. Ling., vol. 28, no. 1, pp. 53-70, Mar. 2002.

[18] J. S. Lee, “Three-step probabilistic model for Korean morphological analysis,” J. Kor. Inf. Sci. Soc. Soft. Appl., vol. 38, no. 5, pp. 257-268, May. 2011.

[19] S. H. Na, “Conditional random fields for Korean morpheme segmentation and POS tagging,” ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 14, no. 3, pp. 10:1-16, Jun. 2015.

[20] S. H. Na and Y. K. Kim, “Phrase-based statistical model for Korean morpheme segmentation and POS tagging,” IEICE Trans. Inf. & Syst., vol. E101–D, no. 2, pp 512–522, Feb. 2018.

[21] S. K. Jung, C. K. Lee, and H. S. Hwang, “End-to-end Korean part-of-speech tagging using copying mechanism,” ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 17, no. 3, pp. 19:1-8, Jan. 2018.

[22] A. Matteson, C. H. Lee, Y. B. Kim, and H. S. Lim, “Rich character-level information for Korean morphological analysis and part-of-speech tagging,” in Proc. 27th COLING, Santa Fe, NM, USA, 2018.

[23] K. S. Shim, “Cloning of Korean morphological analyzers using pre-analyzed eojeol dictionary and syllable-based probabilistic model,” KIISE: Computing Practices, vol. 22, no. 3, pp. 119-126, Mar. 2016.

[24] C. H. Lee, J. H. Lim, S. Lim, and H. K. Kim, “Syllable-based Korean POS tagging based on combining a pre-analyzed dictionary with machine learning,” J. of KIISE, vol. 43, no. 3, pp. 362-369, Mar. 2016.

[25] J. C. Shin and C. Y. Ock, “A Korean morphological analyzer using a pre-analyzed partial word-phrase dictionary,” KIISE: Software and Applications, vol. 39, no. 5, pp. 415-424, May 2012.

[26] J. C. Shin and C. Y. Ock, “Korean homograph tagging model based on sub-word conditional probability,” KIPS Transactions on Software and Data Engineering, vol. 3, no. 10, pp. 407-420, 2014.

[27] M. Y. Kang, B. Kim, and J. S. Lee, “Word sense disambiguation using embedded word space,” J. Comp. Sci. and Eng., vol. 11, no. 1, pp. 32-38, Mar. 2017.

[28] J. H. Min et al., “A study on word sense disambiguation using bidirectional recurrent neural network for Korean language,” J. of the Korea Society of Computer and Information, vol. 22, no. 4, pp. 41-49, 2017.

[29] A. S. Yoon et al., “Construction of Korean WordNet KorLex 1.5,” J. of KIISE: Software and Applications, vol. 36, no. 1, pp. 95-126, 2009.

[30] K. S. Choi, “CoreNet: Chinese-Japanese-Korean wordnet with shared semantic hierarchy,” in Proc. NLP-KE, Beijing, China, 2003, pp. 767-770.

[31] S. Ikehara et al., "The Semantic System, volume 1 of Goidaikei – A Japanese Lexicon," Iwanami Shoten, Tokyo, 1997.

[32] M. Choi, J. Hur, and M. G. Jang, "Constructing Korean lexical concept network for encyclopedia question-answering system", in Proc. ECON, Busan, Rep. Korea, 2004, pp. 3115–3119.

[33] G. Klein et al., “OpenNMT: open-source toolkit for neural machine translation,” arXiv preprint arXiv:1701.02810, 2017.

[34] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proc. ACL, 2002, pp. 311-318.

[35] M. Snover et al., "A study of translation edit rate with targeted human annotation", in Proc. Assoc. MT Americas, MA, USA, 2006, pp. 223-231.

[36] G. V. Bard, "Spelling-error tolerant, order-independent pass-phrases via the Damerau-Levenshtein string-edit distance metric", in Proc. ACSW, Ballarat, Australia, 2007, pp. 117-124.
