
How I Automated My Codebase Management with n8n and AI

A complete automation that transforms your GitHub repository into an intelligent assistant for creating issues and answering questions about your code.

Tags: n8n · MongoDB Atlas · Vertex AI · Telegram · RAG

1. System Overview

This n8n automation consists of three main workflows:

  • Initial indexing: Downloads the entire codebase and loads it into a vector store
  • Automatic sync: Every push to GitHub automatically updates modified files
  • AI Assistant: A Telegram bot that uses the codebase to create GitHub issues and answer questions
[Diagram: system architecture]

Tech Stack:

  • n8n for orchestration
  • MongoDB Atlas as vector store
  • Google Vertex AI (Gemini 2.5 Pro) as LLM
  • Telegram as user interface

2. Codebase Indexing

The Basic Flow

The first step is to download all .dart files from the GitHub repository and prepare them for embedding.

Main nodes:

  1. HTTP Request → Downloads the file tree from GitHub API
  2. get file list → Transforms the tree into a file list
  3. Filter .dart Files → Filters only Dart files
  4. Download Raw Content → Downloads the raw content of each file

You can get the complete file tree from the GitHub API tree endpoint:

https://api.github.com/repos/OWNER/YOUR-REPO/git/trees/COMMIT_SHA?recursive=1

Replace OWNER/YOUR-REPO with your repository and COMMIT_SHA with the commit you want to explore; the response lists every file present in that commit.
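
Outside n8n, the first four nodes boil down to something like this sketch (plain fetch; OWNER/YOUR-REPO and COMMIT_SHA are placeholders, and unauthenticated requests are rate-limited):

// Sketch: download the tree, keep only .dart blobs, build raw-content URLs
const treeUrl =
  "https://api.github.com/repos/OWNER/YOUR-REPO/git/trees/COMMIT_SHA?recursive=1";
const { tree } = await (await fetch(treeUrl)).json();

// "get file list" + "Filter .dart Files"
const dartFiles = tree
  .filter((entry) => entry.type === "blob" && entry.path.endsWith(".dart"))
  .map((entry) => entry.path);

// "Download Raw Content": raw.githubusercontent.com serves file contents
const sources = await Promise.all(
  dartFiles.map((path) =>
    fetch(
      `https://raw.githubusercontent.com/OWNER/YOUR-REPO/COMMIT_SHA/${path}`
    ).then((res) => res.text())
  )
);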

[Diagram: indexing workflow]

Intelligent Chunking

Here's where it gets interesting. We can't just put entire files in the vector store - they're too large. We need to do "chunking", but intelligently.

// The chunker looks for semantic separators in Dart code
const separators = [
  /\nclass /,
  /\nimport /,
  /\nvoid /,
  /\nFuture/,
];

Instead of randomly cutting the code every X characters, the chunker node tries to split on semantic boundaries - when a new class starts, an import, a function, etc. This keeps the context intact.

Maximum chunk length: 1200 characters
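
As a rough sketch of that logic (the separator list and the 1200-character cap come from the workflow; the marker trick and the hard-split fallback are my assumptions):

// Split Dart source on semantic boundaries, hard-cutting only as a fallback
const MAX_CHUNK = 1200;
const separators = [/\nclass /, /\nimport /, /\nvoid /, /\nFuture/];

function chunkDartFile(source) {
  // Mark every semantic boundary, then split on the marker
  let marked = source;
  for (const sep of separators) {
    marked = marked.replace(new RegExp(sep.source, "g"), "\x00$&");
  }

  const chunks = [];
  for (const piece of marked.split("\x00")) {
    if (!piece.trim()) continue;
    if (piece.length <= MAX_CHUNK) {
      chunks.push(piece);
    } else {
      // Fallback: hard split pieces that are still too large
      for (let i = 0; i < piece.length; i += MAX_CHUNK) {
        chunks.push(piece.slice(i, i + MAX_CHUNK));
      }
    }
  }
  return chunks;
}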

Data Format

Each chunk is structured like this:

{
  file: "path/to/file.dart",
  chunkIndex: 0,
  content: "class MyWidget extends StatelessWidget..."
}

It is then transformed into the format the vector store expects:

{
  pageContent: "class MyWidget...",
  metadata: {
    filename: "path/to/file.dart",
    chunkIndex: 0
  }
}
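
In an n8n Code node, that mapping could look something like this (assuming one chunk per item):

// Map raw chunks into the pageContent/metadata shape used by the vector store
return $input.all().map((item) => ({
  json: {
    pageContent: item.json.content,
    metadata: {
      filename: item.json.file,
      chunkIndex: item.json.chunkIndex,
    },
  },
}));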

Embeddings and Storage

  • Embeddings are generated and saved in MongoDB Atlas with vector index
  • Each chunk is semantically searchable
[Image: MongoDB vector index]
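
For reference, an Atlas Vector Search index definition for such a collection could look roughly like this (the embedding field name and the 768 dimensions are assumptions; the dimension count must match your embedding model's output size):

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 768,
      "similarity": "cosine"
    }
  ]
}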

3. Automatic Updates on Push

The Problem

Indexing the entire codebase is fine the first time, but what happens when you commit? I don't want to re-index everything every time.

The Solution: Incremental Updates

The Github Trigger node listens to push events and the get new files node processes commits to understand what changed:

// "commits" comes from the GitHub push webhook payload
const filesToIngest = new Set();  // New or modified files
const filesToDelete = new Set();  // Removed or modified files

for (const commit of commits) {
  commit.added.forEach(path => filesToIngest.add(path));
  commit.modified.forEach(path => {
    filesToIngest.add(path);   // Re-index new version
    filesToDelete.add(path);   // Remove old version
  });
  commit.removed.forEach(path => filesToDelete.add(path));
}
[Diagram: sync workflow]

Two Separate Branches

The Switch node splits the flow:

  • Branch "delete": Removes old documents from MongoDB
  • Branch "ingest": Downloads and indexes new files

Example delete operation:

{
  "operation": "delete",
  "collection": "codebase",
  "query": "{ \"filename\": \"lib/screens/home.dart\" }"
}

Logging to Google Sheets

Every change is logged to a Google Sheet with:

  • File path
  • Operation (delete/ingest)
  • Timestamp
  • Commit ID
  • Author

This allows for a complete history of changes to the vector store.
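
A Code node feeding the Google Sheets node could assemble those rows from the same push payload; a sketch (field names follow GitHub's webhook schema):

// One Google Sheets row per changed file, from the push webhook payload
const { commits } = $input.first().json.body;
const rows = [];

for (const commit of commits) {
  const changes = [
    ...commit.added.map((file) => [file, "ingest"]),
    ...commit.modified.map((file) => [file, "ingest"]),
    ...commit.modified.map((file) => [file, "delete"]),
    ...commit.removed.map((file) => [file, "delete"]),
  ];
  for (const [file, operation] of changes) {
    rows.push({
      json: {
        file,
        operation,
        timestamp: commit.timestamp,
        commitId: commit.id,
        author: commit.author.name,
      },
    });
  }
}
return rows;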

4. AI Agent with Telegram

The User Interface

I chose Telegram as the interface because:

  • It's fast
  • Supports voice messages
  • Always available on my phone
  • Has a good API

The Telegram Flow

  1. Telegram Trigger → Receives messages
  2. is admin → Verifies it's me or an authorized user
  3. Switch1 → Distinguishes between text, photo, and audio
  4. Code in JavaScript1 → Extracts text from message
  5. AI Agent → Processes the request
[Diagram: Telegram interaction workflow]
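
Conceptually, the "is admin" step just compares the sender's chat ID against an allowlist; a Code-node sketch (the ID is a placeholder):

// "is admin": keep only messages from whitelisted Telegram chat IDs
const ALLOWED_CHAT_IDS = [123456789]; // placeholder

return $input.all().filter((item) =>
  ALLOWED_CHAT_IDS.includes(item.json.message.chat.id)
);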

Multimodal Support

Voice Messages:

  • Downloaded via Telegram API
  • Transcribed with Gemini using Vertex AI APIs
  • Processed as normal text
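
The transcription call is essentially a generateContent request to the model with the audio inlined; a sketch of the request body (the prompt wording is an assumption; Telegram voice notes are OGG/Opus, hence the MIME type):

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "inlineData": {
            "mimeType": "audio/ogg",
            "data": "<base64-encoded voice note>"
          }
        },
        { "text": "Transcribe this voice message verbatim." }
      ]
    }
  ]
}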

Images:

  • Analyzed with Gemini
  • The intended use is analyzing UI and bug screenshots

The AI Agent

The heart of the system is the AI Agent node configured with:

LLM: Google Vertex AI - Gemini 2.5 Pro

I use Google because I have free Vertex AI credits.

Available tools:

  1. MongoDB Atlas Vector Store → Searches the codebase
  2. Create an issue in GitHub → Creates issues automatically

Memory: Buffer window memory

  • Maintains conversational context
  • One session per Telegram chat

The System Prompt

The prompt is long and detailed, but the key points are:

Core Capabilities:

  1. GitHub Issue Creation - Convert user messages into structured issues
  2. Coding Assistant - Provide development guidance using codebase knowledge
  3. File Recommendations - Suggest specific files to edit

Reasoning Budget

An interesting thing in the prompt is the "reasoning budget":

Maximum 2-3 reasoning steps per response
NO loops or iterations - Decide → Draft → Done

This forces the agent to be efficient and not waste tokens on useless reasoning.

5. How RAG Works

Retrieval-Augmented Generation

RAG = Retrieve relevant information before generating a response.

The process:

  1. User query → "the button to select the exam in the timer isn't working"
  2. Query embedding → Converts the question into a numeric vector
  3. Similarity search → Retrieves the most similar chunks from the vector store:

Query vector: [0.23, -0.45, 0.12, ...]

Top results:
- pages/timer/button.dart (similarity: 0.89)
- pages/timer/exam_selector.dart (similarity: 0.82)
- pages/timer/exam_state.dart (similarity: 0.78)

  4. Context injection → Relevant chunks are passed to the LLM
  5. Response generation → The LLM generates the response using the context
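
Under the hood, the similarity search in step 3 maps to an Atlas $vectorSearch aggregation stage; a mongosh sketch (index and field names are assumptions):

// Retrieve the 5 chunks most similar to the query embedding
db.codebase.aggregate([
  {
    $vectorSearch: {
      index: "vector_index",        // assumed index name
      path: "embedding",            // assumed vector field
      queryVector: queryEmbedding,  // the embedded user question
      numCandidates: 100,
      limit: 5,
    },
  },
  {
    $project: {
      "metadata.filename": 1,
      score: { $meta: "vectorSearchScore" },
    },
  },
]);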

Why It Works

  • Semantic search: Doesn't search for keywords, but meaning
  • Precise context: Only the relevant parts of the code
  • Always up to date: The vector store syncs with every push

6. Practical Results

What I Can Do Now

1. Create issues quickly

Telegram: "I want to add dark mode to the settings page, assign it to copilot"
Bot: [Complete issue draft with files to modify]
Me: "yes"
Bot: ✅ Issue #123 created

2. Ask questions about the code

Telegram: "where is the user state managed?"
Bot: "User state is managed in src/state/userState.dart using Bloc.
     It's initialized in src/app.dart at app launch..."

3. Assisted debugging

Telegram: "I have an error in API parsing, where should I look?"
Bot: "Check src/services/apiService.dart, lines 45-60.
     JSON parsing might fail if a field is missing..."

Metrics

  • Time to create an issue: From 5-10 minutes → 30 seconds
  • Chunks in vector store: ~3788
  • Cost per query: effectively zero (covered by free Vertex AI credits)

7. Next Steps

Missed commits:

If the automation is down when a push arrives, those changes are silently lost. A more robust approach would not rely only on the files changed in the last push, but would diff the last successfully processed commit against the latest one.
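
GitHub's compare endpoint is one way to do this, since it returns every file changed between two commits:

https://api.github.com/repos/OWNER/YOUR-REPO/compare/LAST_PROCESSED_SHA...LATEST_SHA

Persisting the last processed SHA (even in the existing Google Sheet) would make the sync self-healing after downtime.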

Open source:

Use open source models when possible.