Annual Report Summarizer: RAG-Based Summarizer
Abstract
Introduction: Annual reports are long, complex documents that summarize a company's financial performance, governance structure and strategic outlook. Their size and technical density make manual analysis time-consuming and overwhelming. Advances in Large Language Models (LLMs) and Generative AI enable automated interpretation of such documents. Retrieval-Augmented Generation (RAG) strengthens contextual accuracy by grounding generated text in retrieved evidence. The Annual Report Summarizer (ARS) addresses these challenges using semantic retrieval, section-level summarization and accessibility features such as multilingual translation and text-to-speech.
Objectives: Demonstrate the ARS architecture, including PDF preprocessing, vector storage and RAG summarization. Evaluate the contextual accuracy and comprehensibility of summaries using the LLM-based evaluator G-EVAL and the n-gram overlap metric ROUGE. Generate section-wise, human-like summaries for real annual reports. Highlight the impact of ARS on financial literacy, research and analytical efficiency.
Methods: The ARS system integrates PDF text extraction, cleaning, segmentation, embedding generation, semantic retrieval and LLM-based summarization. Segmented text chunks are converted into vector embeddings and ranked against the query by cosine similarity. The RAG module retrieves the most relevant chunks and generates summaries through context-aware prompting. The system includes modules for user input, preprocessing, embeddings, summarization and output delivery with multilingual translation and text-to-speech.
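The retrieval step described in the Methods can be illustrated with a minimal sketch. It is not the ARS implementation. The function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`), the chunk size and the prompt wording are all hypothetical, and `embed` is a toy bag-of-words stand-in for whatever embedding model the system actually uses. Only the ranking by cosine similarity is taken from the abstract.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split cleaned text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model (assumption): sparse word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(section, retrieved):
    """Assemble a context-aware prompt (hypothetical wording) for the LLM."""
    context = "\n\n".join(retrieved)
    return (
        f"Summarize the {section} section using only the context below.\n\n"
        f"Context:\n{context}\n\nSummary:"
    )


if __name__ == "__main__":
    chunks = [
        "Revenue grew in fiscal 2023 driven by product sales.",
        "The board of directors has five independent members.",
        "Risk factors include currency fluctuation.",
    ]
    top = retrieve("board of directors independent members", chunks, top_k=1)
    print(build_prompt("governance", top))
```

In a real pipeline, `embed` would call a dense embedding model and the chunk vectors would be precomputed and kept in a vector store rather than recomputed for every query.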
Results: ARS accelerates interpretation of lengthy financial documents by combining semantic retrieval and LLM-based summarization. It improves clarity, factual consistency and coherence of generated summaries. Evaluation using G-EVAL and ROUGE supports the system’s contextual accuracy. ARS reduces time and cognitive effort while enhancing accessibility for analysts, students and researchers.
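For readers unfamiliar with ROUGE, the sketch below computes ROUGE-1, the unigram-overlap variant, between a reference summary and a generated one. It is a simplified illustration, not the evaluation code used for ARS. The function name `rouge_1` and the tokenizer are assumptions, and established ROUGE packages may add steps such as stemming that are omitted here.

```python
import re
from collections import Counter


def rouge_1(reference, candidate):
    """Unigram-overlap precision, recall and F1 between two texts."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(rouge_1("the cat sat on the mat", "the cat sat"))
```

Recall measures how much of the reference the generated summary covers, while precision penalizes extra content. Neither checks factual correctness, which is one reason to pair ROUGE with an LLM-based evaluator such as G-EVAL.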
Conclusions: ARS automates interpretation of complex annual reports using Retrieval-Augmented Generation. It integrates semantic retrieval, contextual summarization and accessibility features. While limitations exist in scalability, numerical accuracy and domain-specific adaptation, ARS demonstrates strong potential as a reliable framework for financial analysis. Future improvements may include domain-specific fine-tuning and multimodal data processing.