Abstract:
Text classification is the assignment of text documents to one or more predefined
classes based on their content and topic. Classification is usually performed on the basis
of significant phrases or features extracted from the text document using an N-gram
phrase extraction method. In this study, a design science approach is adopted to explore
the possibility of using varying N-gram models for Afaan Oromo news text
classification. Three supervised machine learning methods, Naïve Bayes (NB), Support
Vector Machine (SVM) and Decision Tree (DTree), were used to explore the effectiveness
of the classification task. For experimentation, 1007 news documents were collected from Afaan Oromo news
portals and annotated. Based on the literature, nine predefined classes were identified and used, including adaa fi
turiziimii/culture and tourism, barnoota/education, dinagde/business, fayyaa/health, ispoortii/sport, qonna/agriculture, raayyaa ittisa fi nagenya/defence force and peace,
and kanbiraa/others. We applied the following preprocessing
steps: tokenization (with number, symbol and punctuation removal), normalization, stopword removal and stemming. After cleaning the noisy data, we
extracted N-gram and hybrid N-gram features using an N-gram feature extraction
method. The classification algorithms were then trained on 75% and tested on
25% of the dataset. The selected classification algorithms are used to predict the category of a new news
document, assigning it to one of the predefined news classes. Classifier performance is evaluated using
precision, recall, F-measure and accuracy metrics. The performance evaluation
shows that SVM with the hybrid uni-bi-gram (1,2) model achieved the best accuracy, 92%. Based on
this result, we selected the SVM learning model with hybrid uni-bi-gram (1,2) features
for future work. The result is
encouraging, and we expect that a larger and better-balanced dataset would improve it further. Future work includes classifying news texts using semantic relationships between phrases by
observing synonymous phrases.
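
As an illustration of the hybrid uni-bi-gram (1,2) feature setting named in the abstract, the following is a minimal sketch in plain Python. It is not the study's implementation: the function names `ngrams` and `hybrid_ngram_features` are hypothetical, the tokens are placeholders rather than Afaan Oromo data, and the input is assumed to be already tokenized, normalized, stopword-filtered and stemmed.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hybrid_ngram_features(tokens, ngram_range=(1, 2)):
    """Count n-grams for every n in the inclusive range.

    With ngram_range=(1, 2), unigrams and bigrams are counted together,
    which is the assumed meaning of "hybrid uni-bi-gram (1,2)".
    """
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Placeholder tokens standing in for one preprocessed news document.
tokens = ["w1", "w2", "w3", "w2", "w3"]
features = hybrid_ngram_features(tokens, (1, 2))
```

The resulting counts could serve as the feature vector passed to a classifier such as NB, SVM or a decision tree. The (1,2) notation resembles the `ngram_range` parameter of scikit-learn's text vectorizers, though the abstract does not state which library was used.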