Abstract:
Anaphora resolution is the process of identifying an entity introduced earlier in the discourse that a current entity refers back to. The earlier entity is called the antecedent, and the referring entity is called the anaphor. Although several types of anaphora occur in text, pronominal anaphora is the most prevalent. Anaphora resolution is an important subtask in many Natural Language Processing applications. This study aims to develop an anaphora resolution model for the Afaan Oromoo language based on a Machine Learning approach. The language is morphologically complex in that pronouns can exist hidden inside verbs. Input data cleaning, tokenization, part-of-speech tagging, noun phrase extraction, and hidden pronoun extraction are necessary preprocessing steps for anaphora resolution. Feature vectors are generated from every valid antecedent-anaphor pair in the training and testing sets, and a Machine Learning classifier is trained on positive and negative instances generated from the training set. The scikit-learn (sklearn) Python package, a collection of Machine Learning algorithm implementations, was used for training (via its fit function) and prediction (via its predict function). Three datasets, gathered from Afaan Oromoo news, Bible verses, and Oromo fiction, were used for training and testing. The top five of the 14 features extracted from the text were used for training and testing. Using 10-fold cross-validation, each dataset was divided into 90% training and 10% testing data at each run. Each test set was evaluated at antecedent-anaphor sentence distances ranging from 1 to 10 using three Machine Learning algorithms: Decision Tree (DT), Support Vector Machine (SVM), and Naïve Bayes (NB). Model performance on the three datasets is reported as the mean over the 10 folds across the ten sentence-distance values. For combined independent and hidden anaphors, the average precision achieved was 52.3% (DT), 51.25% (SVM), and 53.57% (NB) on the Bible-Fiction dataset; 57.67% (DT), 47.77% (SVM), and 57.5% (NB) on the News dataset; and 47.62% (DT), 46.82% (SVM), and 50.15% (NB) on the compiled dataset. These results could be improved primarily by finding better ways of computing feature values for antecedent-anaphor pairs.
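The training and evaluation pipeline described in the abstract can be sketched with scikit-learn as follows. This is a minimal illustration, not the study's actual code: the feature vectors and labels below are synthetic stand-ins, whereas in the study each vector would encode the five selected features of a candidate antecedent-anaphor pair, labeled positive for true antecedents.

```python
# Sketch of the described pipeline: train DT, SVM, and NB classifiers on
# antecedent-anaphor feature vectors and score them with 10-fold
# cross-validation (90% train / 10% test per run), reporting precision.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 5))                    # 200 candidate pairs, 5 features each (synthetic)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic labels: 1 = true antecedent

models = {
    "DT": DecisionTreeClassifier(),
    "SVM": SVC(),
    "NB": GaussianNB(),
}
for name, model in models.items():
    # cv=10 gives the 90/10 splits per run; precision matches the reported metric.
    scores = cross_val_score(model, X, y, cv=10, scoring="precision")
    print(f"{name}: mean precision {scores.mean():.3f}")

# The fit/predict calls mentioned in the abstract, on a single split:
clf = GaussianNB().fit(X[:180], y[:180])
preds = clf.predict(X[180:])
```

The classifier choice, features, and data here are placeholders; only the overall fit/predict and 10-fold cross-validation structure mirrors the abstract.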