It’s been a while since I posted here. Life got busy, projects piled up — you know how it goes. But I’ve been spending the last several months deep in something that genuinely changed how I work as a developer, and I figured it was time to share.

I’m talking about building developer tools powered by Large Language Models. Not the “let me ask ChatGPT to write my code” kind of thing — I mean actually integrating LLM APIs into your own tools, scripts, and workflows to solve real problems.

In this post, I’ll walk you through the current LLM API landscape, build a few practical tools from scratch, and share some hard-won lessons about what works (and what burns through your API budget faster than you’d expect).

The API Landscape in 2026 — A Quick Lay of the Land

Before we write any code, let’s talk about what’s actually available. The three major players right now are OpenAI, Anthropic, and Google, and the landscape has shifted a lot even in the last year.

OpenAI recently rolled out the GPT-5 family. GPT-5 sits at $1.25/$10 per million tokens (input/output), with GPT-5 Mini at $0.25/$2 and GPT-5 Nano at just $0.05/$0.40. The older GPT-4o is now considered legacy. They’ve also introduced the Responses API, which is basically Chat Completions with built-in tool use — and the Assistants API is sunsetting this year.

Anthropic launched Claude 4.6 (Opus and Sonnet) in February 2026. Sonnet 4.6 is the sweet spot for most developer tool use cases — $3/$15 per million tokens, 200K context window, and it’s scarily good at understanding code. Opus 4.6 has a 1M token context window in beta, which opens up some interesting possibilities for codebase-wide analysis.

Google’s Gemini 3.1 Pro landed with impressive reasoning benchmarks. At $1.25/$10 per million tokens, it’s price-competitive with GPT-5. The free tier through Google AI Studio is generous enough for prototyping. And the 1M token context window is available across their lineup.
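
To make those prices concrete, here's the back-of-envelope math for a single review-sized call, say 20K input tokens and 1K of output, using the numbers above:

# Back-of-envelope cost for one call: 20K tokens in, 1K tokens out
PRICES = {  # $ per million tokens (input, output), from the models above
    "gpt-5": (1.25, 10.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gemini-3.1-pro": (1.25, 10.00),
}

for model, (price_in, price_out) in PRICES.items():
    cost = 20_000 * price_in / 1e6 + 1_000 * price_out / 1e6
    print(f"{model}: ${cost:.4f}")
# gpt-5: $0.0350, claude-sonnet-4.6: $0.0750, gemini-3.1-pro: $0.0350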

Here’s the thing though — pricing alone doesn’t tell you which one to pick. I’ve found that each model has a personality when it comes to code-related tasks. More on that later.

Setting Up — The Boring But Important Part

Let’s get our environment ready. I’ll use Python since that’s where most of the LLM tooling ecosystem lives, but these concepts translate to any language.

# Create a project directory
mkdir llm-dev-tools && cd llm-dev-tools

# Set up a virtual environment
python -m venv venv
source venv/bin/activate

# Install the SDKs we'll need
pip install openai anthropic google-genai python-dotenv

Now, set up your API keys. Please don’t hardcode these — I’ve seen too many keys leaked in Git repos.

# .env file (add to .gitignore!)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AI...

Then load them in a small config module:

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_KEY = os.getenv("ANTHROPIC_API_KEY")
GOOGLE_KEY = os.getenv("GOOGLE_API_KEY")

Tool #1: An Intelligent Code Reviewer

This is the first thing I built, and honestly it’s still the one I use most. The idea is simple — pipe a git diff into an LLM and get back a code review that actually catches things.

# code_reviewer.py
import subprocess
import sys
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()  # pull API keys from .env (do this in any script here that calls an API)
client = Anthropic()

def get_git_diff(base_branch="main"):
    """Grab the diff of current changes against a base branch."""
    result = subprocess.run(
        ["git", "diff", base_branch, "--", ".", ":(exclude)*.lock"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print(f"Git error: {result.stderr}")
        sys.exit(1)
    return result.stdout

def review_code(diff: str) -> str:
    """Send the diff to Claude for review."""
    message = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""You are a senior developer reviewing a pull request.
Review this diff and provide feedback on:
1. Bugs or logic errors
2. Security concerns
3. Performance issues
4. Code style and readability

Be specific. Reference line numbers where possible. Skip obvious stuff
— focus on things a human reviewer might actually miss.

If the code looks solid, say so briefly. Don't manufacture issues.

```diff
{diff}
```"""
            }
        ]
    )
    return message.content[0].text

if __name__ == "__main__":
    branch = sys.argv[1] if len(sys.argv) > 1 else "main"
    diff = get_git_diff(branch)

    if not diff.strip():
        print("No changes found.")
        sys.exit(0)

    print(f"Reviewing {len(diff.splitlines())} lines of changes...\n")
    review = review_code(diff)
    print(review)

Run it like this:

python code_reviewer.py main

A couple of things I learned building this:

Token management matters. A large diff can easily be 50K+ tokens. I’m excluding lock files in the git diff command for a reason — those diffs are massive and useless for review. You’ll want to add similar exclusions for generated files, minified assets, etc.
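
These days I also put a crude size guard in front of the API call. A minimal sketch, assuming the usual rule of thumb of roughly four characters per token for code; the budget number is arbitrary, so tune it to your model's context window:

# Crude size guard for code_reviewer.py (~4 chars per token heuristic)
import sys

MAX_DIFF_TOKENS = 100_000  # arbitrary budget; tune to your context window

def check_diff_size(diff: str) -> str:
    """Refuse to review diffs that would blow past the token budget."""
    estimated_tokens = len(diff) // 4
    if estimated_tokens > MAX_DIFF_TOKENS:
        sys.exit(f"Diff is ~{estimated_tokens} tokens. Split it into smaller reviews.")
    return diff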

The prompt makes or breaks it. My first version had a generic “review this code” prompt and the output was mostly fluff: “consider adding error handling” type stuff. The prompt above specifically asks the model to skip obvious issues and focus on things a human might miss (note it all lives in the user message; no separate system prompt is needed here). That one change made the tool actually useful.

Tool #2: A CLI Documentation Generator

This one came out of pure frustration. I had a codebase with about 40 utility functions and zero documentation. Writing docstrings by hand for all of them felt like punishment. So I automated it.

# doc_generator.py
import ast
import sys
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def extract_functions(filepath: str) -> list[dict]:
    """Parse a Python file and extract function signatures and bodies."""
    source = Path(filepath).read_text()
    tree = ast.parse(source)

    functions = []
    lines = source.splitlines()

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Get the full function source
            start = node.lineno - 1
            end = node.end_lineno
            func_source = "\n".join(lines[start:end])

            # Check if it already has a docstring
            has_docstring = (
                node.body and
                isinstance(node.body[0], ast.Expr) and
                isinstance(node.body[0].value, ast.Constant) and
                isinstance(node.body[0].value.value, str)
            )

            functions.append({
                "name": node.name,
                "source": func_source,
                "lineno": node.lineno,
                "has_docstring": has_docstring
            })

    return functions

def generate_docstring(func: dict) -> str:
    """Generate a docstring for a single function using GPT."""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {
                "role": "system",
                "content": """Generate a Google-style Python docstring for the given function.
Include: a one-line summary, Args section (with types), Returns section,
and Raises section if applicable. Keep it concise and accurate.
Return ONLY the docstring content, no triple quotes, no code fences."""
            },
            {
                "role": "user",
                "content": func["source"]
            }
        ],
        temperature=0.2  # Low temperature for factual output
    )
    return response.choices[0].message.content.strip()

def process_file(filepath: str):
    """Process a file and print suggested docstrings."""
    functions = extract_functions(filepath)
    undocumented = [f for f in functions if not f["has_docstring"]]

    if not undocumented:
        print(f"All functions in {filepath} already have docstrings.")
        return

    print(f"Found {len(undocumented)} undocumented functions in {filepath}\n")

    for func in undocumented:
        print(f"--- {func['name']} (line {func['lineno']}) ---")
        docstring = generate_docstring(func)
        print(f'    """{docstring}"""\n')

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python doc_generator.py <file.py>")
        sys.exit(1)

    process_file(sys.argv[1])

Why did I use GPT-4.1-mini here instead of Claude? Two reasons: docstring generation is a relatively simple task that doesn’t need heavy reasoning, and GPT-4.1-mini is dirt cheap for this kind of structured output. I use Claude for the code reviewer because it needs deeper understanding of logic flow, but for templated output like docstrings, the smaller models do fine.

This is a pattern worth internalizing: match the model to the task complexity. Using Opus for everything is like driving a truck to get groceries.

Tool #3: A Multi-Model Test Generator

This is where it gets fun. I built a tool that takes a function and generates unit tests — but it queries multiple models and combines the best parts. Each model tends to catch different edge cases.

# test_generator.py
from anthropic import Anthropic
from openai import OpenAI
import google.genai as genai

anthropic_client = Anthropic()
openai_client = OpenAI()

GENERATION_PROMPT = """Given this Python function, generate pytest unit tests.

Requirements:
- Cover happy path, edge cases, and error cases
- Use descriptive test names that explain what's being tested
- Include setup/teardown if needed
- Don't over-mock — only mock external dependencies

Function:
```python
{function_code}
```

Return only the test code. No explanations."""

def generate_with_claude(function_code: str) -> str:
    message = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=4096,
        messages=[{"role": "user", "content": GENERATION_PROMPT.format(function_code=function_code)}]
    )
    return message.content[0].text

def generate_with_gpt(function_code: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": GENERATION_PROMPT.format(function_code=function_code)}]
    )
    return response.choices[0].message.content

def generate_with_gemini(function_code: str) -> str:
    client = genai.Client()
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=GENERATION_PROMPT.format(function_code=function_code)
    )
    return response.text

def combine_tests(function_code: str) -> str:
    """Query all three models, then use Claude to merge the best tests."""
    print("Generating tests with Claude...")
    claude_tests = generate_with_claude(function_code)

    print("Generating tests with GPT...")
    gpt_tests = generate_with_gpt(function_code)

    print("Generating tests with Gemini...")
    gemini_tests = generate_with_gemini(function_code)

    # Now use Claude to merge and deduplicate
    merge_prompt = f"""I generated unit tests for a function using three different AI models.
Merge them into a single, comprehensive test file. Rules:
- Remove duplicate test cases (keep the better-written version)
- Keep ALL unique edge cases from any model
- Ensure consistent style (pytest, descriptive names)
- Add any obvious edge cases that all three missed

Original function:
```python
{function_code}
```

Claude's tests:
```python
{claude_tests}
```

GPT's tests:
```python
{gpt_tests}
```

Gemini's tests:
```python
{gemini_tests}
```

Return only the merged test file."""

    message = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=8192,
        messages=[{"role": "user", "content": merge_prompt}]
    )
    return message.content[0].text
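
A minimal way to drive it, assuming the function you want tested sits alone in a file you pass on the command line:

# Hypothetical driver: point it at a file containing one function
if __name__ == "__main__":
    import sys
    from pathlib import Path

    function_code = Path(sys.argv[1]).read_text()
    merged = combine_tests(function_code)
    Path("test_generated.py").write_text(merged)
    print("Wrote test_generated.py")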

I won’t pretend this is cheap to run — you’re hitting three APIs plus a merge step. But for critical utility functions, the combined output catches way more edge cases than any single model. I’ve found that Claude tends to catch logic edge cases, GPT is good at boundary value analysis, and Gemini often thinks about type-related issues that the others skip.

Keeping Costs Under Control

Let’s be real — this stuff can get expensive if you’re not careful. Here are the strategies that actually worked for me:

1. Model cascading. Start with the cheapest model. If the task fails or the output quality is poor, escalate.

def smart_complete(prompt: str, complexity: str = "low") -> str:
    """Route to the right model based on task complexity."""

    if complexity == "low":
        # Simple tasks: GPT-5 Nano ($0.05/$0.40 per MTok)
        response = openai_client.chat.completions.create(
            model="gpt-5-nano", messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    elif complexity == "medium":
        # Moderate tasks: Claude Sonnet ($3/$15 per MTok)
        message = anthropic_client.messages.create(
            model="claude-sonnet-4-5-20250929", max_tokens=4096,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

    else:
        # Complex reasoning: Claude Opus ($5/$25 per MTok)
        message = anthropic_client.messages.create(
            model="claude-opus-4-5-20251101", max_tokens=8192,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

2. Prompt caching. Both Anthropic and OpenAI now offer prompt caching. If you’re sending the same system prompt or large context repeatedly (like a codebase summary), cached tokens can cut costs by 50-90%. This is huge for tools that run on every commit.
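
With Anthropic you opt in explicitly by marking the stable block with cache_control; OpenAI's caching kicks in automatically on long prompts, so there's nothing to change on that side. A minimal sketch, where big_stable_context stands in for whatever large prompt you reuse:

# Mark the large, stable part of the prompt as cacheable (Anthropic)
message = anthropic_client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": big_stable_context,  # placeholder: e.g. a codebase summary
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "First question about the codebase..."}]
)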

3. Batch API for non-urgent work. Both OpenAI and Anthropic offer batch processing at half the cost. Perfect for things like generating docs for an entire codebase overnight.

# Example: Anthropic batch API for bulk docstring generation
from anthropic import Anthropic

client = Anthropic()

# Create a batch of requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"func_{i}",
            "params": {
                "model": "claude-sonnet-4-5-20250929",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Generate docstring for:\n{func}"}]
            }
        }
        for i, func in enumerate(undocumented_functions)
    ]
)

print(f"Batch created: {batch.id}")
# Results are available within 24 hours at 50% cost

4. Local models for sensitive code. If you’re working on proprietary code and don’t want it leaving your machine, Ollama with models like Llama 3 or Mistral is a solid option for simpler tasks. I use this for quick formatting and basic completions.
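
Ollama exposes an OpenAI-compatible endpoint, so the same SDK works locally with the base URL swapped out. A sketch; the model name depends on what you've pulled:

# Point the OpenAI SDK at Ollama's local OpenAI-compatible server
from openai import OpenAI

local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = local_client.chat.completions.create(
    model="llama3",  # whichever model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Reformat this snippet: ..."}]
)
print(response.choices[0].message.content)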

Building Something More Ambitious: A Codebase Q&A Tool

Let me show you a slightly more involved example. This tool indexes your codebase and lets you ask questions about it in natural language. I use it constantly when jumping into unfamiliar projects.

# codebase_qa.py
import os
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()

# File extensions worth indexing
CODE_EXTENSIONS = {".py", ".js", ".ts", ".jsx", ".tsx", ".java", ".go", ".rs"}

def index_codebase(root_dir: str, max_files: int = 50) -> str:
    """Build a text representation of the codebase structure and key files."""
    root = Path(root_dir)
    indexed = []
    file_count = 0

    # Skip common non-essential directories
    skip_dirs = {"node_modules", ".git", "venv", "__pycache__", "dist", "build", ".next"}

    for filepath in sorted(root.rglob("*")):
        if any(skip in filepath.parts for skip in skip_dirs):
            continue
        if filepath.suffix not in CODE_EXTENSIONS:
            continue
        if file_count >= max_files:
            break

        try:
            content = filepath.read_text(errors="ignore")
            # Truncate very large files
            if len(content) > 5000:
                content = content[:5000] + "\n... (truncated)"

            relative = filepath.relative_to(root)
            indexed.append(f"=== {relative} ===\n{content}")
            file_count += 1
        except Exception:
            continue

    return "\n\n".join(indexed)

def ask_codebase(codebase_context: str, question: str) -> str:
    """Ask a question about the codebase."""
    message = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=4096,
        system=f"""You are a senior developer who has thoroughly read this codebase.
Answer questions accurately based on the actual code. If you're not sure, say so.
Reference specific files and line patterns when relevant.

Codebase:
{codebase_context}""",
        messages=[{"role": "user", "content": question}]
    )
    return message.content[0].text

def main():
    import sys

    project_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"Indexing {project_dir}...")
    context = index_codebase(project_dir)
    print(f"Indexed codebase ({len(context)} chars). Ask me anything (type 'quit' to exit):\n")

    while True:
        question = input("You: ").strip()
        if question.lower() in ("quit", "exit", "q"):
            break
        if not question:
            continue

        answer = ask_codebase(context, question)
        print(f"\n{answer}\n")

if __name__ == "__main__":
    main()

This is where prompt caching really shines. The codebase context in the system prompt stays the same across questions, so mark it with a cache_control block (as in the caching sketch earlier) and every question after the first reads that context from cache at a fraction of the normal input price. The full context only gets billed at the standard rate once.

Practical Tips From Months of Building These Things

I’ll wrap up with a few things I wish someone had told me when I started:

Temperature settings matter more than you think. For code generation and factual tasks, use 0.0-0.2. For creative tasks like naming variables or writing commit messages, 0.7-0.8 works better. I wasted weeks debugging “inconsistent” outputs before realizing I had temperature at the default 1.0 for a code generation task.

Always set max_tokens explicitly. Don’t rely on defaults. If you expect a 200-token response, set max_tokens to 500 or so. This prevents runaway responses and keeps costs predictable.

Structured output is your friend. OpenAI supports schema-enforced structured outputs, and Claude will reliably return JSON if you ask for it explicitly (or define a tool with a JSON schema). When building tools that need to parse the response programmatically, always request structured output instead of trying to regex your way through prose.

# Anthropic structured output example
message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": """Analyze this function for issues. Respond in JSON format:
{"issues": [{"severity": "high|medium|low", "line": int, "description": "..."}], "summary": "..."}

Function:
```python
def divide(a, b):
    return a / b
```"""
    }]
)
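
Even when you ask for JSON, models occasionally wrap it in a code fence, so parse defensively. A minimal sketch:

# Parse the reply defensively: the model may wrap the JSON in a fence
import json
import re

raw = message.content[0].text
match = re.search(r"\{.*\}", raw, re.DOTALL)  # grab the outermost JSON object
data = json.loads(match.group(0) if match else raw)

for issue in data["issues"]:
    print(f"[{issue['severity']}] line {issue['line']}: {issue['description']}")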

Rate limiting and retries are not optional. All three APIs have rate limits, and they will hit you in production. Use exponential backoff.

import time
from anthropic import Anthropic, RateLimitError

client = Anthropic()

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the real error
            wait = 2 ** attempt
            print(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)

Version pin your models. Models get updated and deprecated. Use specific model version strings (like claude-sonnet-4-5-20250929) instead of aliases that might change behavior under you.

What’s Next

This post covered the basics — individual tools for specific developer workflows. But the real power comes when you start chaining these together. Imagine a pre-commit hook that runs the code reviewer, auto-generates missing docstrings, creates tests for new functions, and updates the changelog — all before your code hits the remote.

That’s what I’m building right now, and I’ll cover it in a follow-up post.

If you want to experiment, start with the code reviewer — it’s the quickest win and you’ll see the value immediately. And if you’re worried about cost, GPT-5 Nano and Gemini’s free tier are perfectly good for getting started.

The code from this post is available on my GitHub. Feel free to fork it and adapt it to your workflow.

Happy coding!