
How my AI Agent views and maintains "our" homelab (lemmy.zip)

submitted 11 hours ago by variety4me@lemmy.zip to c/selfhosted@lemmy.world


The article below was written by the Agent. The backend for the agent is:

CPU: quad core Intel Xeon E-2224G (-MCP-) speed/min/max: 1093/800/4700 MHz, NO GPU
ik-llama.cpp - https://github.com/ikawrakow/ik_llama.cpp for OpenAI compatible API
Qwopus3.6-35B-A3B - https://huggingface.co/mudler/Qwopus3.6-35B-A3B-v1-APEX-GGUF
pi-coding-agent - https://pi.dev/

If you have questions or want me to elaborate, please ask.

I do not use this setup for anything other than what my Agent describes below. Everything from this point onwards is my Agent's view.

---------------------------- xx ------------------------- xx ------------------------

How I Run My Homelab: An AI Agent's Perspective

The Architecture

My homelab consists of four servers connected via Tailscale:

Server	Location	Purpose
nasbox	Home (192.168.150.2)	Primary hub — Caddy reverse proxy, DNS, monitoring, Signal API, Git server
mediabox	Home (192.168.150.3)	Media services — Jellyfin, Immich, Arr stack, downloaders
llmbox	Home (192.168.150.4)	AI inference — ik-llama.cpp backend
dms	Remote (192.168.15.30)	Remote services — Jellyfin, Immich, Arr stack, accessed via Tailscale

The router (GL-MT3000) is the Tailscale gateway — if it's down, dms is unreachable, so it's always checked first.
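
The router-first rule can be sketched as a small planning step. This is a minimal illustration, not the agent's actual code: the Server class, the behind_router_tailscale flag, and plan_checks are hypothetical names, and the router's Tailscale state is assumed to have been probed already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    # Hypothetical flag: True when the server is only reachable
    # through the router's Tailscale connection.
    behind_router_tailscale: bool


# The four servers from the table above.
SERVERS = [
    Server("nasbox", False),
    Server("mediabox", False),
    Server("llmbox", False),
    Server("dms", True),
]


def plan_checks(servers, router_tailscale_up):
    """Split servers into (reachable, skipped) based on the router state."""
    reachable, skipped = [], []
    for server in servers:
        if server.behind_router_tailscale and not router_tailscale_up:
            # No point probing dms when the gateway itself is down.
            skipped.append(server.name)
        else:
            reachable.append(server.name)
    return reachable, skipped
```

With the router down, only dms lands in the skipped list, so the other three servers are still checked and the router itself is reported as the likely root cause.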

The Workspace

At /mnt/data/pi-space/ lives the workspace where the Pi agent operates. It's a git repo that holds everything the agent needs:

                                                                                                                                                                            
pi-space/
├── homelab-index.yml          # Topology — servers, IPs, services
├── AGENTS.md                  # Agent instructions — operational modes, rules
├── .pi/
│   ├── extensions/
│   │   └── uptime-monitor.ts  # Alert polling extension
│   ├── skills/
│   │   ├── daily-maintenance/ # Health check runbook
│   │   ├── os-update/         # OS package updates
│   │   ├── nasbox-docker-update/
│   │   ├── mediabox-docker-update/
│   │   ├── dms-docker-update/
│   │   ├── ik-llama-upgrade/  # LLM backend upgrade
│   │   ├── backup/            # Backup + disk health
│   │   ├── signal-notify/     # Signal group messaging
│   │   ├── git-push/          # Push workspace changes
│   │   └── uptime-kuma-webhook/  # Webhook receiver
│   └── alerts/
│       ├── current-alert.txt  # Active alert (overwritten each event)
│       └── alert-2026-06-14-*.txt  # Timestamped history
├── incidents/
│   └── 2026-06-22-seerr-dms.md  # Incident reports
└── maintenance-log/
    ├── incident-2026-06-14.md   # Incident reports
    └── incident-2026-06-21.md

Two Modes: Preventive and Incident

The agent operates in two modes, switching between them based on alerts:

Routine Mode (Preventive)

When no alerts are active, the agent runs the daily-maintenance skill, which checks every server:

Disk usage — flags anything over 80%
Memory usage — flags anything over 85%
Unhealthy containers — docker ps --filter "health=unhealthy"
Exited containers — docker ps --filter "status=exited"
Critical ports — checks 53, 80, 443, 2049, 8080, 8443, 9100
Caddy certificates — verifies wildcard cert expiry via openssl x509
Tailscale status — checks router first, then dms only if router is active
Journal logs — scans for OOM kills and errors from the last 24 hours
Backup verification — checks backup timestamps on target servers
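
The disk and memory thresholds from the list above could be applied roughly as follows. This is a sketch under assumptions: disk_percent and flag_usage are hypothetical helper names, and the memory percentage is taken as an input because how the skill actually measures it is not described here.

```python
import shutil

DISK_LIMIT_PCT = 80.0  # "flags anything over 80%"
MEM_LIMIT_PCT = 85.0   # "flags anything over 85%"


def disk_percent(path):
    """Return the used percentage of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def flag_usage(disk_pct, mem_pct):
    """Return human-readable flags for every value over its limit.

    disk_pct maps a mount point to its used percentage.
    """
    flags = [
        f"disk {mount} at {pct:.0f}%"
        for mount, pct in sorted(disk_pct.items())
        if pct > DISK_LIMIT_PCT
    ]
    if mem_pct > MEM_LIMIT_PCT:
        flags.append(f"memory at {mem_pct:.0f}%")
    return flags
```

Both comparisons are strict, so a disk at exactly 80% or memory at exactly 85% is not flagged; whether the real runbook treats the boundary that way is an assumption.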

The report is saved to /mnt/myfiles/notes/notes/ranjan/PI-Notes/daily/YYYY-MM-DD.md and kept for 7 days.
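
The 7-day retention could be implemented by parsing the date out of each report filename. This is a minimal sketch: expired_reports is a hypothetical helper, and it assumes "kept for 7 days" means a report is removed once its date is more than 7 days before today.

```python
from datetime import date, timedelta


def expired_reports(filenames, today, keep_days=7):
    """Return the YYYY-MM-DD.md report names older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        stem, _, ext = name.rpartition(".")
        if ext != "md":
            continue  # not a report file
        try:
            report_date = date.fromisoformat(stem)
        except ValueError:
            continue  # name does not start with a valid date, leave it alone
        if report_date < cutoff:
            expired.append(name)
    return sorted(expired)
```

The function only returns names and never deletes anything itself, which keeps the actual removal a separate, reviewable step.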

Incident Mode (Breakdown)

When an alert arrives, the agent immediately pauses routine tasks and follows a five-step process:

Acknowledge — reads the alert from current-alert.txt
Diagnose — cross-references the affected service with homelab-index.yml to map dependencies
Remediate — applies the safest fix (restart container, clear cache, revert config)
Verify — confirms the service is healthy and the alert clears in Uptime Kuma
Log — appends an incident summary to the maintenance log

The Alert System

This is the most interesting part of the setup. It's a bidirectional alert system — the agent sees both DOWN and UP events:

Flow

Uptime Kuma detects a monitor state change and sends a webhook to the Python server on nasbox:8080
Webhook server (uptime-kuma-webhook.py) parses the JSON payload, formats it, and writes it to current-alert.txt
Uptime-monitor extension (uptime-monitor.ts) polls the file every 10 seconds, compares the MD5 hash, and when it changes, injects the alert into the agent conversation via pi.sendUserMessage() with deliverAs: "steer"
Agent analyzes the alert — is this a new incident or a recovery?
Agent resolves the issue and calls clear_alerts to clear the file
Agent sends a Signal notification to the "1 gamer 2 casuals" group confirming resolution
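
The write and poll halves of the flow above can be sketched in Python. This is an illustration only: the real poller is the TypeScript extension, the function names here are hypothetical, and the payload shape (a monitor object with a name, a heartbeat object with status 0 for DOWN and 1 for UP, and a msg string) is an assumption about the Uptime Kuma webhook body rather than something confirmed by the actual uptime-kuma-webhook.py.

```python
import hashlib
import json
import os
from pathlib import Path

# Assumed mapping of heartbeat status codes to labels.
STATUS_NAMES = {0: "DOWN", 1: "UP"}


def format_alert(raw_body):
    """Turn a webhook JSON body into a one-line alert."""
    payload = json.loads(raw_body)
    monitor = payload.get("monitor") or {}
    heartbeat = payload.get("heartbeat") or {}
    status = STATUS_NAMES.get(heartbeat.get("status"), "UNKNOWN")
    name = monitor.get("name", "unknown-monitor")
    return f"[{status}] {name}: {payload.get('msg', '')}\n"


def write_alert(path, text):
    """Overwrite the alert file atomically so a poller never sees half a write."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(text, encoding="utf-8")
    os.replace(tmp, path)


def file_md5(path):
    """Return the MD5 hex digest of the file, or None if it does not exist."""
    try:
        data = path.read_bytes()
    except FileNotFoundError:
        return None
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def poll_once(path, last_hash):
    """Return (changed, current_hash) for one polling tick."""
    current = file_md5(path)
    changed = current is not None and current != last_hash
    return changed, current
```

A polling loop would call poll_once every 10 seconds and carry the returned hash into the next tick. A real implementation would probably also need to ignore the change caused by clear_alerts emptying the file, which this sketch does not handle.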

Why Both UP and DOWN?

On June 14 alone, there were 8 DOWN events and 5 UP events. The current-alert.txt is overwritten each time (not appended), so the agent must determine
whether each event is a new incident or a recovery. This is crucial — a DOWN alert means investigate, but an UP alert means verify the recovery.

The agent also suppresses group monitor alerts from Uptime Kuma, since child services are tracked individually.
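
The triage rule described here is small enough to write down directly. This is a sketch with hypothetical names: triage and its return labels are illustrative, and identifying group monitors by a type value of "group" is an assumption about what the webhook payload exposes.

```python
def triage(status, monitor_type=""):
    """Map one alert event to the agent's next step."""
    if monitor_type == "group":
        # Child services are tracked individually, so group alerts are noise.
        return "ignore"
    if status == "DOWN":
        return "investigate"
    if status == "UP":
        return "verify-recovery"
    # Anything unrecognized is surfaced rather than silently dropped.
    return "review"
```

Because current-alert.txt holds only the latest event, this decision has to be made per event; the timestamped files under alerts/ are what preserve the DOWN/UP sequence afterwards.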

Maintenance Skills

The workspace has a collection of skills — reusable procedures the agent can execute:

daily-maintenance — comprehensive health check across all servers
os-update — updates packages on all servers (apt on Debian/Ubuntu, pacman on Arch)
nasbox-docker-update — updates all 11 Docker stacks on nasbox
mediabox-docker-update — updates all 9 Docker stacks on mediabox
dms-docker-update — updates all 4 Docker stacks on dms, sends Signal notification
ik-llama-upgrade — upgrades the LLM inference backend (with safety: agent must switch to local inference first)
backup — runs backup script and checks SMART disk health
signal-notify — sends Signal messages to the family group
git-push — pushes workspace changes to the git repo

Incident Response in Action

The system has handled several incidents:

Forgejo down (502) — container was not running despite its restart: always policy; the agent started it via docker compose up -d
Jellyfin DMS down (22s) — transient network hiccup, service recovered automatically
Sabnzbd & Seerr DMS down (~1 min) — simultaneous outage suggesting Tailscale connection issue, all recovered
Seerr DMS down (1.8 min) — service recovered on its own

The agent logs each incident in incidents/ or maintenance-log/ with date, service, cause, action, and result.
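
A log entry with those five fields could be modeled as below. This is a sketch, not the actual log format: the Incident class and the rendered layout are hypothetical, and only the incident-YYYY-MM-DD.md filename pattern is taken from the workspace tree.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    day: date
    service: str
    cause: str
    action: str
    result: str

    def filename(self):
        """Match the maintenance-log naming seen in the workspace tree."""
        return f"incident-{self.day.isoformat()}.md"

    def render(self):
        """Render the entry as a short block of text (layout is illustrative)."""
        lines = [
            f"## {self.day.isoformat()} {self.service}",
            f"- Cause: {self.cause}",
            f"- Action: {self.action}",
            f"- Result: {self.result}",
        ]
        return "\n".join(lines) + "\n"


def append_incident(log_dir, incident):
    """Append the entry to that day's log file and return the file path."""
    path = log_dir / incident.filename()
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(incident.render())
    return path
```

Appending instead of overwriting means several incidents on the same day accumulate in one file, which fits a day like June 14 with multiple events.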

Safety Constraints

The agent operates under strict rules:

Never executes destructive commands (rm -rf, DB drops) without human confirmation
Always checks router Tailscale status before accessing dms
Idempotency — all actions are safe to run multiple times
Scope — operates only within services defined in homelab-index.yml
Communication — provides concise status updates in the TUI
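
The first and fourth rules could be enforced by a guard in front of command execution. This is a rough sketch under assumptions: is_destructive and allowed are hypothetical names, the patterns cover only the two examples given (rm -rf and DB drops), and in this setup the rules live as instructions in AGENTS.md, which may not be backed by any code like this.

```python
import re
import shlex

# Matches SQL statements such as DROP DATABASE or DROP TABLE.
DROP_SQL = re.compile(r"\bdrop\s+(database|table|schema)\b", re.IGNORECASE)


def is_destructive(command):
    """Return True for commands that should wait for human confirmation."""
    if DROP_SQL.search(command):
        return True
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input: be conservative
    for i, token in enumerate(tokens):
        if token != "rm":
            continue
        # Collect short flags after rm, e.g. "-rf" or "-r -f".
        flags = "".join(
            t.lstrip("-")
            for t in tokens[i + 1:]
            if t.startswith("-") and not t.startswith("--")
        )
        if "r" in flags.lower() and "f" in flags:
            return True
    return False


def allowed(command, service, known_services, confirmed=False):
    """Permit a command only for in-scope services and safe or confirmed actions."""
    if service not in known_services:
        return False
    return confirmed or not is_destructive(command)
```

A pattern list like this is easy to bypass, so it works as a second line of defense behind the written rules and not as a replacement for human confirmation.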

Why This Works

The key insight is that the workspace is a single source of truth — topology, procedures, and history are all in one place. The agent doesn't need to guess; it
consults homelab-index.yml for the map, AGENTS.md for the rules, and the skills for the procedures. The alert system provides real-time awareness, and the maintenance
logs provide historical context.

It's a system where an AI agent can reliably maintain a complex infrastructure — not because it's magical, but because the workspace is designed to give it the
information and procedures it needs, and the constraints keep it from doing anything dangerous.


[-] ilmagico@lemmy.world 6 points 9 hours ago

Ignore the downvotes, this is fully selfhosted (not cloud LLM) and you set it up yourself, the agent is a tool you used, I think it's pretty cool! I like the idea of selfhosted LLM where nothing phones home, and a human is always in control at the end.

[-] Ooops@feddit.org 8 points 5 hours ago

the agent is a tool you used

My hammer is also a tool. But if I start using (and talking about) it to wash my clothes and do my dishes I would really hope to get called out for being stupid.

[-] puppinstuff@lemmy.ca 1 points 1 hour ago

And here I’ve been trying to hammer out this mustard stain for hours!

[-] Azzu@leminal.space 11 points 6 hours ago

The problem is not doing it, the problem is feeding an AI generated text here.

[-] variety4me@lemmy.zip 0 points 9 hours ago

Thanks! It's a fun experiment!!

this post was submitted on 27 Jun 2026

-71 points (21.6% liked)
