Abstract:
Anaphora resolution is the process of identifying an entity introduced earlier in the discourse that a current entity refers back to. The earlier entity is called the antecedent, and the referring entity is called the anaphor. Although several types of anaphora occur in text, pronominal anaphora is the most prevalent. Anaphora resolution is an important subtask in many Natural Language Processing applications. This study aims to develop an anaphora resolution model for the Afaan Oromoo language based on a Machine Learning approach. The language is morphologically complex in that pronouns can exist hidden inside verbs. Input data cleaning, tokenization, part-of-speech tagging, noun phrase extraction, and hidden pronoun extraction are necessary preprocessing steps for anaphora resolution. Feature vectors are generated from every valid antecedent-anaphor pair in the training and testing sets, and a Machine Learning classifier is trained on positive and negative instances generated from the training set. The scikit-learn (sklearn) Python package, a collection of Machine Learning algorithm implementations, was used for training (via its fit function) and prediction (via its predict function). Three datasets, gathered from Afaan Oromoo news, Bible verses, and Oromo fiction, were used for training and testing. The top five of the 14 features extracted from the text were used for training and testing. Using 10-fold cross-validation, each dataset was divided into 90% training and 10% testing data at each run. Each test set was evaluated at antecedent-anaphor sentence distances ranging from 1 to 10 using three Machine Learning algorithms: Decision Tree (DT), Support Vector Machine (SVM), and Naïve Bayes (NB). Model performance on the three datasets is reported as the mean over the 10 folds across the ten sentence-distance values. For combined independent and hidden anaphors, the average precision achieved was 52.3% (DT), 51.25% (SVM), and 53.57% (NB) on the Bible-Fiction dataset; 57.67% (DT), 47.77% (SVM), and 57.5% (NB) on the News dataset; and 47.62% (DT), 46.82% (SVM), and 50.15% (NB) on the compiled dataset. These results could be improved primarily by finding better ways of computing feature values for antecedent-anaphor pairs.
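The training and evaluation pipeline described in the abstract can be sketched with scikit-learn as follows. This is a minimal illustration, not the study's actual code: the feature vectors and labels below are synthetic stand-ins, whereas in the study each vector would encode the five selected features of a candidate antecedent-anaphor pair, labeled positive for true antecedents.

```python
# Sketch of the described pipeline: train DT, SVM, and NB classifiers on
# antecedent-anaphor feature vectors and score them with 10-fold
# cross-validation (90% train / 10% test per run), reporting precision.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 5))                    # 200 candidate pairs, 5 features each (synthetic)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic labels: 1 = true antecedent

models = {
    "DT": DecisionTreeClassifier(),
    "SVM": SVC(),
    "NB": GaussianNB(),
}
for name, model in models.items():
    # cv=10 gives the 90/10 splits per run; precision matches the reported metric.
    scores = cross_val_score(model, X, y, cv=10, scoring="precision")
    print(f"{name}: mean precision {scores.mean():.3f}")

# The fit/predict calls mentioned in the abstract, on a single split:
clf = GaussianNB().fit(X[:180], y[:180])
preds = clf.predict(X[180:])
```

The classifier choice, features, and data here are placeholders; only the overall fit/predict and 10-fold cross-validation structure mirrors the abstract.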