Abstract:
Automatic text summarization is essential in this information age, in which societies generate massive quantities of information. This holds true for Ethiopia, where the amount of electronic data available for consumption has grown rapidly. However, text summarization research for Ethiopian languages is still in its early stages, so additional effort is needed to meet current and future needs. This work proposes a probabilistic topic-modeling approach, Latent Dirichlet Allocation (LDA), for extractive Amharic text summarization. Using LDA with Gibbs sampling as the inference algorithm, it is possible to estimate the probability of a word belonging to each of a set of topics. After generating topics through LDA, the algorithm applies a topic coherence measure to estimate the quality of the generated topics. The research then explores keyword-based and topic-based approaches for sentence selection. The keyword-based approach identifies the most important words of the document in order to find the most salient sentences that represent it. The topic-based approach, in contrast, focuses on finding the significant topic that best represents the main sentences of the document. Resources used in system development, such as lists of abbreviations, stop words, and affixes, were collected from previous works. The Python programming language was chosen to implement the system. To test how the proposed summarizers perform, a prototype Amharic text summarizer was developed based on the proposed methods. Accordingly, 60 different newspaper articles were collected, 30 of which were used for training and the remaining 30 for testing. Furthermore, three respondents took part in preparing the manual summaries. The system was evaluated with an objective metric, the F-measure, in three experimental scenarios.
First, the proposed approaches were tested with different term-weighting functions and with varying numbers of keywords. Then, the algorithms were compared with each other and with previous works. In the final analysis, the F-measure results obtained are 0.797, 0.694, and 0.688 at 20%, 25%, and 30% extraction rates, respectively. To conclude, because of the integrated features and diverse techniques, the proposed approaches outperform the previous works.
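As a rough illustration of the keyword-based selection and the F-measure evaluation described above, the following is a minimal Python sketch and not the system's actual implementation. It makes several assumptions: the function names `keyword_summary` and `f_measure` are hypothetical, keywords are simply the most frequent non-stop words (raw term frequency stands in for the term-weighting functions tested in the work), tokenization is plain whitespace splitting without the affix stripping an Amharic system would need, and the F-measure is computed over selected sentence indices against a single reference summary.

```python
from collections import Counter
import math


def keyword_summary(sentences, stop_words, rate=0.2, num_keywords=5):
    """Return indices of the sentences kept at the given extraction rate.

    Assumption: keywords are the `num_keywords` most frequent non-stop
    words, and a sentence's score is the number of keyword occurrences
    it contains.
    """
    tokenized = [
        [w for w in s.lower().split() if w not in stop_words]
        for s in sentences
    ]
    counts = Counter(w for tokens in tokenized for w in tokens)
    keywords = {w for w, _ in counts.most_common(num_keywords)}
    scores = [sum(1 for w in tokens if w in keywords) for tokens in tokenized]

    # The extraction rate fixes how many sentences the summary keeps.
    k = max(1, math.ceil(rate * len(sentences)))
    # Rank by score (descending); ties keep the earlier sentence first.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    # Return the selected sentences in document order.
    return sorted(ranked[:k])


def f_measure(system, reference):
    """Harmonic mean of precision and recall over selected sentence indices."""
    system, reference = set(system), set(reference)
    overlap = len(system & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(system)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, a summary at a 20% extraction rate would be obtained with `keyword_summary(sentences, stop_words, rate=0.2)` and scored against a manually prepared summary with `f_measure(selected, reference_indices)`. With several human summarizers, one plausible choice is to average the score over their summaries.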