Abstract:
Anaphora resolution is a fundamental task in natural language processing that involves
identifying the antecedent of an anaphoric expression in a text. The task is critical for several
NLP applications, such as machine translation, text summarization, question answering,
information extraction, natural language generation, discourse analysis and sentiment analysis.
This research focuses on developing an effective anaphora resolution model for Amharic using a
machine learning approach.
Our model for resolving anaphora in Amharic has two main phases: training and testing. The
model is made up of five components, including pre-processing (tokenization, normalization,
part-of-speech tagging and chunking); identification of independent anaphora; identification of
hidden anaphora (extracting pronouns from Amharic verbs using a morphological analyzer);
identification of candidate antecedents (noun phrases); feature selection (applying the
resolution factors, namely the constraint rules and the preference rules); and the actual
anaphora resolver model. We used HornMorpho for morphological analysis. We prepared the corpus
with the INCEpTION text annotation tool, customizing it for our annotation scheme. We developed
a supervised machine learning model of Amharic anaphora resolution. This study focuses only on
pronominal and reflexive pronouns. We used three machine learning algorithms for
classification: Support Vector Machine (SVM), Naïve Bayes (NB), and Random Forest (RF).
To evaluate the performance of the model, we collected 575 sentences containing 211
independent pronouns and 537 hidden pronouns from Amharic news portals, the Quran and the
Bible. The model was evaluated in three scenarios: first by dataset type, second by the
location of the anaphor and antecedent (inter-sentential or intra-sentential), and finally for
the detection of hidden and independent anaphora. In the first case, on the compiled dataset
the model scored an accuracy of 59.38% for SVM, 57.47% for NB and 57.91% for RF. In the second
case, for inter-sentential anaphora the model scored 43.87% for SVM, 43.20% for NB and 44.52%
for RF; for intra-sentential anaphora it scored 62.50% for SVM, 59.44% for NB and 57.39% for
RF. In the third case, SVM scored an accuracy of 66.82% for independent anaphora detection and
66.11% for hidden anaphora detection.
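
The candidate-identification and feature-selection steps summarized above can be illustrated
with a minimal sketch. It is not the implementation used in this study: the Mention class, the
tag values, the agreement checks (gender, number, person) used as constraint rules, and recency
used as a preference rule are all assumptions chosen for illustration, and the example uses
transliterated names and an English gloss instead of Amharic text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """A pronoun or noun phrase with hypothetical agreement tags."""
    text: str
    gender: str    # "m" or "f" (assumed tag set)
    number: str    # "sg" or "pl" (assumed tag set)
    person: int    # 1, 2, or 3
    sentence: int  # index of the sentence containing the mention
    position: int  # token offset within the whole text


def pair_features(anaphor, candidate):
    """Build one feature dictionary for an (anaphor, candidate) pair.

    A classifier such as SVM, NB, or RF could be trained on vectors
    derived from dictionaries like this one.
    """
    return {
        "gender_match": anaphor.gender == candidate.gender,
        "number_match": anaphor.number == candidate.number,
        "person_match": anaphor.person == candidate.person,
        "same_sentence": anaphor.sentence == candidate.sentence,
        "token_distance": anaphor.position - candidate.position,
    }


def rank_candidates(anaphor, candidates):
    """Apply assumed constraint rules, then an assumed preference rule.

    Constraint: the candidate precedes the anaphor and agrees with it in
    gender, number, and person. Preference: closer candidates come first.
    """
    viable = []
    for cand in candidates:
        feats = pair_features(anaphor, cand)
        agrees = (feats["gender_match"] and feats["number_match"]
                  and feats["person_match"])
        if cand.position < anaphor.position and agrees:
            viable.append(cand)
    return sorted(viable, key=lambda c: anaphor.position - c.position)


# Illustrative mentions only; these are not drawn from the study's corpus.
almaz = Mention("Almaz", "f", "sg", 3, sentence=0, position=0)
kebede = Mention("Kebede", "m", "sg", 3, sentence=0, position=2)
she = Mention("she", "f", "sg", 3, sentence=1, position=6)

ranked = rank_candidates(she, [almaz, kebede])
print([m.text for m in ranked])
```

In this sketch the constraint rules act as a hard filter before ranking; the same feature
dictionaries could instead be passed unfiltered to a classifier, leaving it to learn how much
weight each agreement feature deserves.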