EXTRACTIVE AMHARIC TEXT SUMMARIZATION USING LATENT DIRICHLET ALLOCATION

dc.contributor.author Redi, Abdurehman
dc.contributor.author Assabie, (PhD) Yaregal
dc.contributor.author Hussen, (MSc) Muluken
dc.date.accessioned 2021-08-02T04:32:53Z
dc.date.available 2021-08-02T04:32:53Z
dc.date.issued 2020-09
dc.identifier.uri http://localhost:8080/xmlui/handle/123456789/4108
dc.description 124p. en_US
dc.description.abstract Automatic text summarization is essential in this information age, in which societies generate massive quantities of information. This holds for Ethiopia, where the volume of electronic data prepared for consumption has expanded rapidly. However, text summarization work for Ethiopian languages is still at an early stage, and further effort is needed to meet current and future needs. This work proposes a probabilistic topic-modeling approach, Latent Dirichlet Allocation (LDA), for extractive Amharic text summarization. Using LDA with Gibbs sampling as the inference algorithm, it is possible to estimate the probability of a word belonging to each of a set of topics. After generating topics through LDA, the algorithm applies a topic coherence measure to estimate the quality of the generated topics. The research then explores keyword-based and topic-based approaches to sentence selection. The keyword-based approach identifies the most important words of the document in order to recognize the strongest sentence representing the document, whereas the topic-based approach focuses on finding a significant topic that best represents the main sentence of the document. The resources used in system development, such as abbreviations, stop words, and affixes, were gathered from previous works. The Python programming language was chosen to implement the system. To test how the proposed summarizers perform, a prototype Amharic text summarizer was developed based on the proposed methods. Accordingly, 60 different newspapers were collected, 30 of them used for training and the remaining 30 for testing. Furthermore, three respondents took part in manual summary preparation. The system was evaluated with an objective evaluation metric, F-measure, in three experimental scenarios.
First, the proposed approaches were tested with different term-weighting functions and with several numbers of keywords. Then, the algorithms were compared to each other and to previous works. In the final analysis, the results obtained are 0.797, 0.694, and 0.688 at 20%, 25%, and 30% extraction rates. In conclusion, the proposed approaches outperform the previous works, which the thesis attributes to the integrated features and diverse techniques. en_US
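As an illustration of the F-measure evaluation mentioned in the abstract, the following is a minimal sketch, not the thesis's actual code. It assumes that a system summary and a manual reference summary are each represented as a set of selected sentence indices; the function name f_measure and that representation are assumptions made here for illustration only.

```python
def f_measure(system: set, reference: set) -> float:
    """Balanced F-measure between two sets of selected sentence indices.

    `system` holds the sentences picked by the summarizer, `reference`
    those picked by a human summarizer (both are hypothetical inputs).
    """
    overlap = len(system & reference)
    if overlap == 0:
        # No shared sentences (or an empty summary): score is zero.
        return 0.0
    precision = overlap / len(system)    # share of extracted sentences that are correct
    recall = overlap / len(reference)    # share of reference sentences that were found
    return 2 * precision * recall / (precision + recall)


# Example: a 20-sentence document summarized at a 20% extraction rate,
# so both summaries contain 4 sentences and share 3 of them.
score = f_measure({0, 2, 5, 7}, {0, 2, 3, 7})
```

With several human summarizers, as in the three-respondent setup described above, one plausible option is to average the score over the reference summaries; the abstract does not state how the thesis combines them.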
dc.description.sponsorship Haramaya University en_US
dc.language.iso en en_US
dc.publisher Haramaya University en_US
dc.subject Text Summarization, Latent Dirichlet Allocation, Keyword-based, Topic-based en_US
dc.title EXTRACTIVE AMHARIC TEXT SUMMARIZATION USING LATENT DIRICHLET ALLOCATION en_US
dc.type Thesis en_US