Evaluation

Evaluating every part of the pipeline (querying the graph database, enriching context through the vector database, and summarizing a codebase with the CAG pipeline) proved to be a complex task. As a result, our evaluation focused primarily on the performance of the RAG pipeline and on overall system efficiency.

RAG Evaluation

To assess the performance of our Retrieval-Augmented Generation (RAG) pipeline, we combined the RAGAS framework with an LLM-as-a-judge evaluation, so that aspects missed by one method could be captured by the other. The two evaluation methods are defined as follows:

Synthetic Query Evaluation with RAGAS

Using the RAGAS framework, we generated synthetic queries tailored to the codebase and evaluated the system’s ability to retrieve and synthesize relevant information. This approach allowed us to systematically measure the precision and recall of the RAG pipeline and to verify that the retrieved context aligns with user intent.
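As a rough illustration, the snippet below sketches how such a RAGAS evaluation run can be wired up. The sample question, answer, and contexts are invented for illustration, and the imports follow the 0.1-style `ragas.evaluate` API rather than our exact evaluation script.

```python
# Minimal sketch of a RAGAS evaluation run; sample data is illustrative only.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, faithfulness

samples = {
    "question": ["Which module creates database connections?"],
    "answer": ["Connections are created in db/session.py via SessionFactory."],
    "contexts": [[
        "db/session.py defines SessionFactory, which opens Neo4j connections."
    ]],
    "ground_truth": ["db/session.py creates database connections through SessionFactory."],
}

# Each metric is scored per sample and averaged; context_recall is the figure
# reported in the Results section below.
scores = evaluate(Dataset.from_dict(samples), metrics=[context_recall, faithfulness])
print(scores)
```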

LLM as a Judge

We leveraged a large language model (LLM) as an impartial evaluator of response quality. The LLM assessed the outputs against criteria such as factual correctness, contextual relevance, and linguistic clarity. This method provided an additional layer of validation, ensuring that the system’s responses met high standards of quality and usability.
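A minimal sketch of this judging step is shown below. The rubric wording, model name, and 0-to-1 scoring scale are illustrative choices, not the exact prompt used in our evaluation.

```python
# Illustrative LLM-as-judge sketch; rubric, model name, and scale are assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer about a codebase.
Reference answer:
{reference}

Candidate answer:
{candidate}

Rate the candidate from 0.0 to 1.0 on factual correctness, contextual relevance,
and linguistic clarity. Respond with only the numeric score."""

def judge(reference: str, candidate: str, model: str = "gpt-4o-mini") -> float:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```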

Results

Context recall, as measured by RAGAS, achieved an average score of 0.8, indicating that the system consistently retrieved relevant information. Factual correctness, however, ranged between 0.3 and 0.4, suggesting that while the correct context was retrieved, the generated responses often contained inaccuracies. Using RAGAS’s default F1-based factual correctness metric, we observed that retrieving 10 related documents introduced additional information, which diluted precision and lowered the overall F1 score relative to the synthetic reference answers generated by RAGAS.

To address this, we switched to recall as the primary factual correctness mode, which is less sensitive to the extra retrieved detail and raised the score to approximately 0.6. While this was an improvement, it still fell short of expectations. To investigate further, we analyzed a specific example with a factual correctness score of 0.3. Under human evaluation, the response appeared accurate. We then used an LLM as a judge to assess the similarity between the synthetic and pipeline-generated responses, which yielded a high similarity score of 0.9.

This analysis demonstrated that our RAG pipeline is robust and capable of producing accurate answers, even if the default evaluation metrics do not fully reflect its performance.
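For reference, the switch from F1 to recall-oriented factual correctness is a small configuration change, assuming a recent ragas release in which `FactualCorrectness` exposes a `mode` parameter; a rough sketch:

```python
# Rough sketch of the metric change, assuming a ragas release where
# FactualCorrectness accepts mode="f1", "precision", or "recall".
from ragas.metrics import FactualCorrectness

# Default behaviour: F1 over decomposed claims, which penalizes the extra
# (but often correct) detail introduced by retrieving 10 documents.
factual_f1 = FactualCorrectness(mode="f1")

# Recall-oriented scoring only checks whether the reference claims are covered,
# which matched human judgement more closely (roughly 0.6 in our runs).
factual_recall = FactualCorrectness(mode="recall")
```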

Efficiency

Efficiency is critical for making the Code Analysis Agent usable on real-world codebases. In this section, we evaluate two major aspects: the time required for data ingestion (preprocessing and database population) and the latency experienced during interactive query responses.

Data Preprocessing/Ingestion

The data preprocessing pipeline consists of several sequential stages: transforming raw Python files into natural language descriptions, generating graph structures, and uploading the results into the Neo4j and vector databases. The following table summarizes the time taken for each of these stages during a typical run on a moderately sized codebase:

| Stage | Time (Graph DB) | Time (Vector DB) |
|---|---|---|
| Code → NL | 71s | 57s |
| Graph Generation | 381s | N/A |
| Database Input | 62s | 0s |
| Total Time | 8m 34s | 57s |

These timings highlight that the initial overhead for ingestion is non-trivial, especially due to the time-intensive nature of code parsing and language model processing. However, once ingestion is complete, the system is able to serve user queries efficiently without needing to reprocess the underlying codebase.
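For reference, the sketch below shows roughly what the Database Input stage looks like for the graph database. The `Function` node label, the `CALLS` relationship, and the `upload_edges` helper are illustrative assumptions rather than our exact schema.

```python
# Hedged sketch of pushing pre-computed call relationships into Neo4j.
from neo4j import GraphDatabase

def upload_edges(uri: str, auth: tuple[str, str], edges: list[tuple[str, str]]) -> None:
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        for caller, callee in edges:
            # MERGE keeps the upload idempotent across repeated ingestion runs.
            session.run(
                "MERGE (a:Function {name: $caller}) "
                "MERGE (b:Function {name: $callee}) "
                "MERGE (a)-[:CALLS]->(b)",
                caller=caller,
                callee=callee,
            )
    driver.close()

# Example: two call edges extracted during the graph-generation stage.
upload_edges("bolt://localhost:7687", ("neo4j", "password"),
             [("main", "load_config"), ("load_config", "read_yaml")])
```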

Note: Future improvements such as incremental graph updates (see Future Extensions) could reduce ingestion times significantly.

Response Latency

Response time is crucial to the user experience when interacting with the system. We measure latency separately for the micro-agent and macro-agent workflows, distinguishing between responses that generate natural language answers and those that produce visualizations.

Micro Agent

The micro agent handles fine-grained codebase queries, such as retrieving detailed relationships between modules or classes. Latency is broken down into the following stages:

| Stage (Text Response) | Time | Stage (Visualization Response) | Time |
|---|---|---|---|
| Generate/Execute Query | 140s | Generate/Execute Query | 140s |
| RAG Retrieval | 10s | Create Visualization | 42s |
| Total Time | 150s | Total Time | 182s |

Micro-agent responses typically involve a multi-step process: building a Cypher query, executing it against the Neo4j database, optionally performing retrieval-augmented generation (RAG) for additional explanation, and formatting the final output. Visualization workflows replace RAG retrieval with the generation of relationship graphs.
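The sketch below outlines the text-response path under these steps. The prompt wording, model name, and `answer_codebase_query` helper are hypothetical, and the graph schema matches the simplified one used in the ingestion sketch above.

```python
# Minimal sketch of the micro-agent text path: an LLM drafts a Cypher query,
# which is then executed against Neo4j. Prompt and model name are assumptions.
from neo4j import GraphDatabase
from openai import OpenAI

client = OpenAI()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def answer_codebase_query(question: str) -> list[dict]:
    # Step 1: generate a Cypher query from the user's question.
    cypher = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": ("Write a single read-only Cypher query over Function nodes "
                        f"and CALLS relationships to answer: {question}. "
                        "Return only the query."),
        }],
        temperature=0,
    ).choices[0].message.content.strip()

    # Step 2: execute the query; the records then feed RAG-based explanation
    # or visualization in the later stages.
    with driver.session() as session:
        return [record.data() for record in session.run(cypher)]
```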

Macro Agent

The macro agent generates high-level overviews of the codebase structure and creates mermaid diagrams for architectural summaries. Response timing for macro workflows is shown below:

| Stage | Time |
|---|---|
| LLM Summarization | 16s |
| Mermaid Diagram Generation | 33s |
| Total Time | 49s |

In contrast to the micro agent, the macro agent operates primarily through direct text generation followed by conversion into visual diagrams. As a result, its overall latency tends to be more predictable but may still vary depending on the size and complexity of the input codebase.
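A simplified sketch of this two-step flow is shown below; the prompts, model name, and helper functions are illustrative assumptions rather than the agent's exact implementation.

```python
# Sketch of the macro-agent flow: summarize first, then convert the summary
# into a mermaid diagram. Prompts and model name are assumptions.
from openai import OpenAI

client = OpenAI()

def summarize_codebase(descriptions: str, model: str = "gpt-4o-mini") -> str:
    # Step 1: high-level architectural summary from the natural-language
    # descriptions produced during ingestion.
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Summarize the architecture of this codebase:\n{descriptions}"}],
    ).choices[0].message.content

def summary_to_mermaid(summary: str, model: str = "gpt-4o-mini") -> str:
    # Step 2: convert the summary into mermaid flowchart code for rendering.
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": ("Convert this architecture summary into a mermaid "
                               f"flowchart. Return only the diagram code.\n{summary}")}],
    ).choices[0].message.content
```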

Overall, while response times are acceptable for typical use cases, further optimizations could enhance responsiveness, particularly for large-scale repositories or complex relational queries.