Abstract:
This research proposes Latent Semantic Indexing (LSI) and document clustering based searching for Afaan Oromo documents. It applies LSI and K-means clustering to capture the semantic structure of words in documents. The system consists of three components: indexing, clustering, and searching. The LSI model is a concept-based retrieval method that builds on the vector space model (VSM) and singular value decomposition (SVD). Document clustering, which measures the similarity between documents and groups similar documents together, was investigated as a means of improving the performance of the information retrieval (IR) system. K-means clustering was used to cluster the documents using the SVD matrix. The retrieval process was then further refined by measuring the similarity between the query vector and the cluster centroid vectors. IR pre-processing steps, namely tokenization, normalization, stop word removal, and stemming, were used to select index and query terms. Finally, the SVD model with K-means clustering was compared with the VSM and with the SVD model alone. The system was evaluated using a selected set of documents and queries. The experimental results showed that the proposed prototype registered, on average, 70% recall, 80% precision, and 72% F-measure. This indicates that the proposed method (SVD with K-means) achieved a significant improvement over the VSM and SVD models. Nevertheless, the performance of the system is greatly affected by the statistical extraction of synonymy and polysemy, mis-clustering, the lack of a standard corpus, and stemming, all of which need further research.
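
The pipeline summarized above (indexing with SVD, K-means clustering of the reduced document vectors, and query matching against cluster centroids) can be illustrated with a minimal sketch. This is not the implementation evaluated in this work. It assumes NumPy, raw term counts instead of a tuned weighting scheme, whitespace tokenization in place of the Afaan Oromo pre-processing steps, and a toy English corpus as a placeholder. The function names (`build_term_doc_matrix`, `lsi_project`, `kmeans`, `search`) and the deterministic farthest-point initialization are illustrative choices, not part of the original system.

```python
import numpy as np

def build_term_doc_matrix(docs):
    """Build a term-by-document count matrix from whitespace-tokenized text."""
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d.split():
            A[index[t], j] += 1.0
    return A, index

def lsi_project(A, k):
    """Truncated SVD: keep the top-k left singular vectors and project documents."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    doc_vecs = A.T @ U_k  # one k-dimensional row per document (equals V_k * S_k)
    return U_k, doc_vecs

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def kmeans(X, n_clusters, n_iter=20):
    """Plain K-means with a deterministic farthest-point initialization (assumption)."""
    centroids = [X[0]]
    for _ in range(1, n_clusters):
        # Squared distance from every point to its nearest chosen centroid.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[int(d.argmax())])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

def search(query, index, U_k, doc_vecs, labels, centroids):
    """Pick the cluster whose centroid is most similar to the query, then rank its documents."""
    q = np.zeros(U_k.shape[0])
    for t in query.split():
        if t in index:  # out-of-vocabulary query terms are ignored
            q[index[t]] += 1.0
    q_vec = q @ U_k  # project the query into the same k-dimensional space
    best = max(range(len(centroids)), key=lambda c: cosine(q_vec, centroids[c]))
    members = [j for j in range(len(labels)) if labels[j] == best]
    return sorted(members, key=lambda j: cosine(q_vec, doc_vecs[j]), reverse=True)

# Toy placeholder corpus (not Afaan Oromo data).
docs = [
    "coffee farm harvest coffee",
    "football match goal team",
    "coffee export market farm",
    "team player football goal",
]
A, index = build_term_doc_matrix(docs)
U_k, doc_vecs = lsi_project(A, k=2)
labels, centroids = kmeans(doc_vecs, n_clusters=2)
results = search("coffee market", index, U_k, doc_vecs, labels, centroids)
print(results)  # indices of the documents in the cluster nearest to the query
```

Comparing the query with a few centroids first, rather than with every document, narrows the candidate set before ranking, which is the refinement step described above. In this sketch, documents and the query are both projected with `U_k`, so their cosine similarities are computed in the same reduced space.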