Abstract:
With the rapid growth of web technologies, individuals and organizations are
increasingly using blogs, forums, review sites, social networks, and similar platforms
to express their views and opinions. These reviews are very useful to service
providers, manufacturers, and organizations for making informed decisions and
improving their services. However, the volume of reviews on social media is
growing so rapidly that it is becoming increasingly difficult for users to analyze them
and extract relevant information. Therefore, automated sentiment analysis is needed.
In this research, we present a multiscale sentence-level sentiment analysis model for
Tigrigna online posts using a supervised machine learning approach. The multiscale
Tigrigna sentiment analysis model classifies a given sentence into one of five predefined
classes: very positive (2), positive (1), neutral (0), negative (-1), and very negative (-2).
We use three supervised machine learning algorithms: Naïve Bayes (NB),
Maximum Entropy (MaxEnt), and Support Vector Machine (SVM), with unigram,
bigram, trigram, and hybrid unigram-plus-bigram variants of N-grams as features. The
proposed model consists of several components: preprocessing (tokenization,
normalization, stop word removal), morphological analysis (lemmatization), feature
extraction, training of the machine learning algorithms, classification, and evaluation
of the results using standard evaluation metrics.
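
The preprocessing and N-gram feature extraction steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in this research: the function names, the `STOP_WORDS` set, and the English placeholder sentence are all hypothetical, and normalization and lemmatization are omitted because they depend on Tigrigna-specific resources.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; a real system
# would use a Tigrigna stop-word list.
STOP_WORDS = {"the", "a", "is"}


def tokenize(sentence):
    """Split a sentence into lowercase word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(sentence, orders=(1,)):
    """Count n-gram features for each n in `orders`.

    orders=(1,) gives unigrams, (2,) bigrams, (3,) trigrams, and
    (1, 2) the hybrid unigram-plus-bigram variant.
    """
    tokens = remove_stop_words(tokenize(sentence))
    counts = Counter()
    for n in orders:
        counts.update(ngrams(tokens, n))
    return counts


# Placeholder English sentence standing in for a Tigrigna post.
features = extract_features("The service is very good", orders=(1, 2))
print(sorted(features))
```

The resulting counts would then be converted into feature vectors and passed to a classifier such as NB, MaxEnt, or SVM, together with one of the five class labels (-2 to 2).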
To conduct the experiments, 1500 Tigrigna sentences were collected from different
sources. Due to the morphological complexity of the language, preprocessing
techniques were applied to clean noisy data and reduce the sparseness and
dimensionality of the dataset. After preprocessing, the dataset was lemmatized before
being passed to the training phase of the experiment. The experimental results show that
the SVM algorithm with the unigram model outperforms all other algorithms, with 71% accuracy.
In conclusion, despite the morphological complexity of the language and the lack of
effective morphological analysis tools, the experimental results achieved are promising.
However, we are convinced that the results could improve further with a larger, pre-annotated, and cleaned corpus.