Six weeks ago I gave an AI agent access to my calendar, email, GitHub, and production servers. It runs 24/7, builds apps while I sleep, and sends me a morning brief. Here's what actually happened.
Why I did it
The standard way to use AI coding tools is reactive — you open the chat, describe a problem, get a solution. That's useful, but it leaves a lot of potential on the table. You're still the bottleneck. The AI waits for you.
I wanted to flip that. What if the agent was running continuously, had access to the same context a colleague would have, and could do useful work without being explicitly asked every time?
This isn't a new idea — agentic AI has been discussed for years. But the infrastructure to actually run it well is now accessible. Cheap VPS hosting, good AI APIs, and tools like Claude that can hold context and reason about multi-step tasks made this practical in a way it wouldn't have been two years ago.
The setup
The agent runs on a Hetzner VPS — a €7/month machine in Portland. It's connected to my infrastructure via Tailscale, which means it can reach my development environment securely without any public-facing ports.
The interface is Telegram. I message the agent the same way I'd message a colleague. It responds, asks clarifying questions when it needs to, and gets on with things. I can check in from my phone at any time.
The agent has access to:
- The filesystem of the VPS and all my project repos (via SSH)
- GitHub (read/write — it can commit and push)
- Vercel (it can deploy)
- A Notion task board (it reads and updates tasks)
- A cron scheduler (it can set its own reminders and recurring jobs)
- The browser (via Playwright-style automation for web tasks)
What it doesn't have: my email password, financial accounts, or anything that could cause irreversible damage. The principle I used was: give access to things where the downside of a mistake is recoverable. A bad commit can be reverted. A sent email is harder to take back.
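That recoverability principle can be made explicit rather than implicit. As a sketch (the tool names and table are hypothetical, not my actual config), the agent checks an allow-list before every tool call, and anything not explicitly allowed is refused:

```python
# Hypothetical capability table: tools are allowed only where a mistake
# is recoverable (a bad commit can be reverted; a sent email cannot).
CAPABILITIES = {
    "fs.read": True,
    "fs.write": True,
    "git.commit": True,
    "git.push": True,       # recoverable: revert the commit
    "vercel.deploy": True,  # recoverable: roll back to the previous deploy
    "email.send": False,    # irreversible: hard to take back
    "payments.charge": False,
}

def check_capability(tool: str) -> None:
    """Refuse any tool call not explicitly allowed."""
    if not CAPABILITIES.get(tool, False):
        raise PermissionError(f"tool '{tool}' is not allowed for this agent")
```

Defaulting unknown tools to "denied" matters more than the specific entries: the safe failure mode is the agent asking for access, not quietly using it.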
What I thought would happen vs. what did
I expected the agent to be useful for discrete, well-defined tasks — "build a currency converter app", "fix this bug". What I didn't expect was how useful it would be for ambient tasks that are hard to carve out time for.
While I was sleeping, it built the transactions page for my finance app, added subscription detection, and committed everything with proper messages. I woke up to a Telegram summary of what it had shipped.
It runs a weekly ship log every Sunday — pulling git logs from all my repos, formatting them into a digest, and sending it to me. That's a task I would have done manually (badly, infrequently) or not at all.
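The ship log itself is simple plumbing. Something like the following captures the shape of it (the repo path is hypothetical; the real version sends the digest over Telegram rather than printing it):

```python
import subprocess
from pathlib import Path

def commits_since(repo: Path, days: int = 7) -> list[str]:
    """Return one-line commit subjects from the last `days` days."""
    out = subprocess.run(
        ["git", "-C", str(repo), "log", f"--since={days} days ago",
         "--pretty=format:%s"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]

def format_digest(per_repo: dict[str, list[str]]) -> str:
    """Format {repo_name: [commit subjects]} into a readable digest."""
    sections = []
    for name, subjects in sorted(per_repo.items()):
        if not subjects:
            continue
        bullets = "\n".join(f"  - {s}" for s in subjects)
        sections.append(f"{name} ({len(subjects)} commits):\n{bullets}")
    return "\n\n".join(sections) or "No commits shipped this week."

if __name__ == "__main__":
    candidates = [Path("~/projects/finance-app").expanduser()]  # hypothetical path
    repos = [p for p in candidates if p.is_dir()]
    print(format_digest({r.name: commits_since(r) for r in repos}))
```

Scheduled from cron on Sunday, this is the whole job: collect, format, send.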
The overnight sessions have become the most productive part of my development workflow. While my conscious attention is elsewhere, the agent is working through a prioritised task list, making decisions within defined boundaries, and flagging anything uncertain for me in the morning.
The memory problem
The hardest engineering challenge wasn't integrating the APIs. It was giving the agent continuity across sessions.
LLMs don't remember anything between conversations — each session starts fresh. For a coding agent, this means repeatedly re-explaining context that should be obvious: which projects exist, what their tech stacks are, what decisions were made and why.
The solution was a file-based memory system. The agent writes daily notes to a structured file tree — raw logs of what happened, decisions made, things to follow up on. A separate long-term memory file gets updated with distilled lessons and context.
At the start of each session, the agent reads its memory files before doing anything else. It wakes up with context. The daily notes are like a journal; the long-term memory is like a mental model of the project.
It's not perfect — file-based memory is brittle compared to a real knowledge graph — but it works well enough that the agent rarely asks me something it should already know.
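A minimal sketch of the idea, assuming a layout of one markdown file per day plus a single long-term file (the paths and layout here are illustrative, not my exact tree):

```python
from datetime import date
from pathlib import Path

MEMORY_ROOT = Path("memory")  # hypothetical: memory/daily/*.md + memory/long_term.md

def write_daily_note(text: str, root: Path = MEMORY_ROOT) -> Path:
    """Append a raw log entry to today's daily note."""
    daily = root / "daily"
    daily.mkdir(parents=True, exist_ok=True)
    note = daily / f"{date.today().isoformat()}.md"
    with note.open("a") as f:
        f.write(text.rstrip() + "\n")
    return note

def load_context(root: Path = MEMORY_ROOT, recent_days: int = 3) -> str:
    """Build the context the agent reads at session start:
    distilled long-term memory first, then the most recent daily notes."""
    parts = []
    long_term = root / "long_term.md"
    if long_term.exists():
        parts.append(long_term.read_text())
    daily = root / "daily"
    if daily.is_dir():
        for note in sorted(daily.glob("*.md"))[-recent_days:]:
            parts.append(f"## {note.stem}\n{note.read_text()}")
    return "\n\n".join(parts)
```

The ordering is deliberate: long-term memory frames the recent notes, the same way a mental model frames a journal entry.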
What breaks
Long sessions hit context limits. The agent will be mid-task on something complex and run out of context window — the equivalent of a colleague forgetting what they were doing mid-sentence. The mitigation is structured checkpointing: write state to files before stopping so the next session can pick up cleanly.
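The checkpoint can be as boring as a JSON file, as long as it captures the task, what's done, and what's next. A sketch, with a hypothetical file location:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical location

def save_checkpoint(task: str, done: list[str], todo: list[str],
                    path: Path = CHECKPOINT) -> None:
    """Persist enough state that the next session can resume cleanly."""
    path.write_text(json.dumps(
        {"task": task, "done": done, "todo": todo}, indent=2))

def resume(path: Path = CHECKPOINT) -> dict:
    """Load the previous session's state, or start fresh if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"task": None, "done": [], "todo": []}
```

The discipline is in when it runs: the agent writes a checkpoint before any risky long step, not just when it notices it's running out of room.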
The agent sometimes makes decisions I wouldn't have made. Not wrong decisions, but different ones: an unexpectedly opinionated architecture choice, or a component structure I'd have approached differently. The builds are sound, but they reflect the model's preferences as much as mine.
Tool calls fail silently in ways humans wouldn't. If a git push fails due to a network hiccup, a human developer notices and retries. An agent will sometimes proceed without noticing the failure. Building good error handling and verification steps into the workflow took longer than I expected.
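The fix that worked for me is a pattern, not a one-off: run the action, then verify the outcome independently of the action's exit status, and retry if the verification fails. A sketch of that wrapper applied to the git push case (assumes the branch has an upstream; a stricter check could query the remote directly with `git ls-remote`):

```python
import subprocess
import time
from typing import Callable

def run_with_retry(action: Callable[[], None],
                   verify: Callable[[], bool],
                   attempts: int = 3, delay: float = 2.0) -> bool:
    """Run an action, then independently verify it succeeded.
    Retries on failure instead of silently proceeding."""
    for attempt in range(attempts):
        try:
            action()
        except Exception:
            pass  # the verify step decides success, not the action's exit status
        if verify():
            return True
        time.sleep(delay * (attempt + 1))  # simple linear backoff
    return False

def push_and_verify(repo: str) -> bool:
    """Push, then confirm the upstream ref now matches the local HEAD."""
    def push() -> None:
        subprocess.run(["git", "-C", repo, "push"], check=True)
    def upstream_matches() -> bool:
        local = subprocess.run(["git", "-C", repo, "rev-parse", "HEAD"],
                               capture_output=True, text=True).stdout.strip()
        remote = subprocess.run(["git", "-C", repo, "rev-parse", "@{u}"],
                                capture_output=True, text=True).stdout.strip()
        return bool(local) and local == remote
    return run_with_retry(push, upstream_matches)
```

The key design choice is that `verify` never trusts `action`: a push that exits zero but didn't land still gets caught.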
What surprised me
The agent is better at some tasks than I expected and worse at others in ways that aren't obvious upfront.
Better: generating boilerplate, maintaining code style consistency across files, writing documentation, and finding the right library for a given problem. These are tasks where humans get bored and start cutting corners. The agent doesn't cut corners on the boring parts.
Worse: tasks that require genuine judgment about user experience. Not technical judgment — the agent is fine at technical tradeoffs — but aesthetic and interaction judgment. Knowing that a feature is technically correct but feels wrong to use. That's still mine.
The combination is more productive than either alone. The agent handles implementation depth; I handle product direction. When I can articulate clearly what I want, it builds it well. When I'm uncertain what I want, the back-and-forth of iteration is as slow as it would be with a human.
Is it worth it?
The honest answer: yes, but the value isn't where I thought it would be.
I expected the main value to be coding speed — build features faster. That's real, but it's not the main thing. The main thing is staying in motion. Projects have momentum and they have standstills. The agent eliminates most of the standstills — the "I'll get to that tomorrow" tasks that never get done, the background work that compounds into technical debt, the documentation no one writes.
There's also something clarifying about having to write down what you want precisely enough for an agent to execute it. The discipline of articulation — being clear enough in your task description that the agent doesn't need to ask clarifying questions — turns out to improve the quality of my own thinking about what I'm building.
The best way to find out if you know what you want is to tell an AI agent to build it.
What I'd do differently
Start with tighter access boundaries and expand. I gave broad access from the start, which was fine because I'd thought through the risk model, but it's not the right advice for someone starting out. Give the agent read-only access first, then expand write access incrementally as you build trust in how it behaves.
Build better status interfaces earlier. I eventually built a basic mission control dashboard showing active sessions, recent activity, and cron jobs, but I should have built it first. Without visibility into what the agent is doing, you end up messaging it to ask for status updates, which defeats part of the purpose.
Invest in the memory system upfront. The file-based approach works, but I cobbled it together reactively. Designing the memory architecture intentionally from the start would have saved debugging sessions where the agent was re-learning context it should have retained.
Where this goes
The current setup is a hand-built version of what will eventually be commodity infrastructure. In two to three years, giving an AI agent access to your development environment will be as normal as giving a new team member a GitHub invite.
The people who will do that well are the ones who've already thought through the trust model, built the context systems, and developed intuitions about what agents are good at vs. what requires human judgment. That knowledge is being built right now, by people willing to run the experiments.
I'm treating this project as extended fieldwork in a new kind of software collaboration. The products I'm shipping are real and the learnings compound. That seems like a reasonable bet on where things are going.