Unpacking Codebases with Generative AI

Introduction

Project Overview and Purpose

Every developer has faced the challenge of diving into an unfamiliar codebase: untangling dependencies, understanding architectural decisions, and figuring out “how does this all work together?” For our class project, we set out to address this pain point by building the Code Analysis Agent, a tool that helps developers make sense of complex Python codebases. By combining language models with a graph database, the system identifies relationships between components and lets developers ask natural-language questions about structure and functionality.

The Code Analysis Agent functions as a code exploration assistant that transforms source code into queryable knowledge graphs. By applying natural language processing to code elements, it creates semantic representations that capture the architecture, dependencies, and functional relationships within a codebase. Developers can then gain insights through conversational queries rather than manual code traversal, which reduces the effort of understanding unfamiliar or complex projects.

Key Features

The Code Analysis Agent offers several core capabilities:

It automatically analyzes Python codebases by parsing files and generating graphs that capture semantic relationships between modules, classes, and functions.
It employs a multi-agent reasoning architecture built with LangGraph, enabling dynamic selection between micro-level (fine-grained) and macro-level (high-level) analysis workflows.
It provides interactive visualizations, including relationship graphs and Mermaid diagrams, to present architectural information in a digestible format.
It allows users to query the codebase using natural language, returning LLM-generated explanations, semantic paths, or visual summaries depending on the request.

These features combine to offer both depth and flexibility, making code exploration faster, more structured, and more accessible.
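
To make the parsing step concrete, here is a minimal sketch of how relationships between modules, classes, and functions could be pulled out of a Python file using only the standard-library `ast` module. It is an illustration under stated assumptions, not the project's actual extractor: the function name `extract_relationships`, the relation labels `IMPORTS`, `DEFINES`, and `CALLS`, and the dotted naming scheme are all hypothetical.

```python
import ast


def extract_relationships(source: str, module_name: str) -> list[tuple[str, str, str]]:
    """Return (source, relation, target) triples for one Python module.

    The relation labels are illustrative assumptions, not a fixed schema.
    """
    tree = ast.parse(source)
    triples: list[tuple[str, str, str]] = []

    def visit(node: ast.AST, owner: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                # A class or function becomes a new owner for everything inside it.
                qualified = f"{owner}.{child.name}"
                triples.append((owner, "DEFINES", qualified))
                visit(child, qualified)
            elif isinstance(child, ast.Import):
                for alias in child.names:
                    triples.append((owner, "IMPORTS", alias.name))
            elif isinstance(child, ast.ImportFrom):
                # child.module is None for "from . import x"; record "." then.
                triples.append((owner, "IMPORTS", child.module or "."))
            else:
                # Only direct calls by bare name are recorded; method calls
                # such as obj.method() are skipped in this sketch.
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    triples.append((owner, "CALLS", child.func.id))
                visit(child, owner)

    visit(tree, module_name)
    return triples


SAMPLE = '''
import os

class Loader:
    def load(self, path):
        return read_file(path)

def read_file(path):
    return open(path).read()
'''

if __name__ == "__main__":
    for triple in extract_relationships(SAMPLE, "app"):
        print(triple)
```

Each triple maps naturally onto a graph edge, with the first and last elements as nodes. A real extractor would also need to resolve call targets to their defining modules, which this sketch does not attempt.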

Technologies Used

The system integrates several modern AI and graph-based technologies:

LangGraph is used to build the agentic workflow, allowing the system to dynamically route user queries to the appropriate analysis tools.
Neo4j serves as the backend graph database, storing semantic relationships extracted from the codebase.
Streamlit provides a lightweight, interactive web interface that guides users through repository ingestion, analysis, and querying.
OpenAI GPT-4.1-mini acts as the primary language model for natural language understanding, Cypher query generation, and retrieval-augmented generation (RAG).

Together, these technologies enable the Code Analysis Agent to operate at both a structural and semantic level, offering users a unified and intelligent code exploration experience.
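
As an illustration of how an extracted relationship might be written to Neo4j, the sketch below turns a `(source, relation, target)` triple into a parameterized Cypher `MERGE` statement. It rests on assumptions: the helper name `to_cypher`, the node label `CodeEntity`, and the `name` property are hypothetical and not taken from the project's schema. Because Cypher does not accept a relationship type as an ordinary `$parameter`, the relation is checked against a strict pattern before being interpolated into the query text.

```python
import re

# Relationship types are interpolated into the query text, so only allow
# upper-case letters and underscores (e.g. CALLS, DEFINES, IMPORTS).
_REL_TYPE = re.compile(r"[A-Z_]+")


def to_cypher(triple: tuple[str, str, str]) -> tuple[str, dict[str, str]]:
    """Build a Cypher MERGE query and its parameters for one triple.

    The CodeEntity label and name property are illustrative assumptions.
    """
    src, rel, dst = triple
    if not _REL_TYPE.fullmatch(rel):
        raise ValueError(f"unsafe relationship type: {rel!r}")
    query = (
        "MERGE (a:CodeEntity {name: $src}) "
        "MERGE (b:CodeEntity {name: $dst}) "
        f"MERGE (a)-[:{rel}]->(b)"
    )
    # Node names travel as query parameters, never as query text.
    return query, {"src": src, "dst": dst}


if __name__ == "__main__":
    query, params = to_cypher(("app.Loader.load", "CALLS", "read_file"))
    print(query)
    print(params)
```

With the official Neo4j Python driver, the returned pair could be passed to `session.run(query, params)`. Using `MERGE` rather than `CREATE` keeps ingestion idempotent, so re-analyzing a repository would not duplicate nodes or edges.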