Multi-Instance Watchtower — Research Meeting Minutes
Date: 2026-02-17
Meeting start: 00:25 EST | Meeting end: —
Duration: —
Attendees: Phil (product owner), CoS (facilitator), SDK Researcher, Discord Researcher, Infra/Resource Analyst, Brain DB Analyst
1. SDK Researcher — Multi-Instance Feasibility
Verdict: FULLY SUPPORTED
- No singletons or global state in the SDK. Each ClaudeSDKClient is independent — own transport, own subprocess, own message stream.
- Process model: Each client spawns its own claude Node.js subprocess via anyio.open_process(). Communication is JSON over stdin/stdout pipes.
- Working directory isolation: Each subprocess gets its own cwd — fully isolated at the OS level.
- Settings per instance: The --settings CLI flag is per-subprocess. Each worker can use a different settings file.
- Concurrent asyncio: Multiple clients can run query() concurrently in the same event loop. Each has its own TaskGroup and message stream (see the sketch after this list).
- One minor wart: A cosmetic env var (CLAUDE_CODE_ENTRYPOINT) gets mutated globally, but it's harmless — the same value is written by all clients.
- No nesting issue: Workers are spawned by the Python coordinator, not by CC itself, so the CLAUDECODE env var detection doesn't fire.
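A minimal concurrency sketch of the above, assuming the Python SDK exposes ClaudeSDKClient and ClaudeAgentOptions with a cwd option (package and option names vary by SDK version, so treat the imports as an assumption):

```python
# Two independent workers in one event loop; each client owns its own
# `claude` subprocess, isolated by its working directory.
import anyio

from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient  # assumed package name


async def run_worker(worker_id: str, cwd: str, prompt: str) -> None:
    options = ClaudeAgentOptions(cwd=cwd)
    async with ClaudeSDKClient(options=options) as client:
        await client.query(prompt)
        async for message in client.receive_response():
            print(f"[{worker_id}] {message}")


async def main() -> None:
    async with anyio.create_task_group() as tg:
        tg.start_soon(run_worker, "wt-001",
                      "/home/plangeberg/watchtower-workers/wt-001",
                      "Summarize the open TODOs")
        tg.start_soon(run_worker, "wt-002",
                      "/home/plangeberg/watchtower-workers/wt-002",
                      "Run the test suite and report failures")


if __name__ == "__main__":
    anyio.run(main)
```

Cancelling one worker's task closes its client context and should end only that subprocess, which is what makes per-worker kill operations cheap.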
Resource per instance
| Component | Consumption |
| --- | --- |
| Node.js subprocess | 150-500 MB RAM (idle to active) |
| File descriptors | 3+ (stdin/stdout/stderr pipes) |
| Python bridge overhead | ~50 MB per worker |
| API connection | Independent HTTP per subprocess |
2. Discord Researcher — Channel Management
Verdict: STRAIGHTFORWARD
- Bot can create channels: guild.create_text_channel(name, category, overwrites). Needs manage_channels permission.
- Categories supported: Create a "Watchtower Workers" category, nest worker channels inside it. Clean sidebar UX.
- Bot can delete channels: channel.delete() — same permission. Cleanup on session end.
- Per-channel permissions: Overwrites at creation — lock to Phil-only and the bot. Everyone else denied.
- Multi-channel routing: Single on_message handler + a dict[channel_id → worker] registry. No separate listeners needed (see the sketch after this list).
- Channel naming: 1-100 chars, lowercase + hyphens. Discord does NOT enforce uniqueness (use IDs as keys).
- Rate limits: ~2 channel creates per 10 seconds. Irrelevant for spawning a handful of workers.
- Hard limit: 500 channels per guild. Ephemeral pattern (create on start, delete on end) keeps this clean.
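A sketch of the create/route/delete pattern with discord.py; the ClaudeBridge.send call and the shape of the registry are assumptions about Watchtower internals, not confirmed interfaces:

```python
import discord

# channel_id -> worker bridge; populated when a worker is spawned
workers: dict[int, "ClaudeBridge"] = {}


async def create_worker_channel(guild: discord.Guild, worker_id: str,
                                phil: discord.Member) -> discord.TextChannel:
    category = discord.utils.get(guild.categories, name="Watchtower Workers")
    if category is None:
        category = await guild.create_category("Watchtower Workers")
    overwrites = {
        guild.default_role: discord.PermissionOverwrite(view_channel=False),
        phil: discord.PermissionOverwrite(view_channel=True, send_messages=True),
        guild.me: discord.PermissionOverwrite(view_channel=True, send_messages=True),
    }
    return await guild.create_text_channel(
        f"worker-{worker_id}", category=category, overwrites=overwrites
    )


async def on_message(message: discord.Message) -> None:
    # One handler for every channel; register it via @bot.event or a Client
    # subclass. The registry decides which worker (if any) receives the text.
    if message.author.bot:
        return
    bridge = workers.get(message.channel.id)
    if bridge is not None:
        await bridge.send(message.content)  # hypothetical ClaudeBridge method


async def teardown_worker_channel(channel: discord.TextChannel) -> None:
    workers.pop(channel.id, None)
    await channel.delete()  # ephemeral pattern: channel goes away with the worker
```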
Recommended pattern
- Channels are ephemeral — created when worker starts, deleted when worker ends.
- Named worker-<short-id> or worker-czechwriter for readability.
- If Phil wants history, keep a persistent log channel for session summaries.
3. Infra/Resource Analyst — Deadpool Capacity & Clone Model
Verdict: FEASIBLE — 8-10 concurrent workers on Deadpool
Memory budget (32GB)
| Component | Cost |
| --- | --- |
| Windows 10 + other processes | ~6 GB |
| WSL2 kernel overhead | ~0.5 GB |
| Main Watchtower process | ~0.15 GB |
| Available for workers | ~25 GB |
| Per worker (Python bridge + Node.js) | ~0.5-1 GB |
| Safe concurrent workers | 8-10 |
Real bottleneck: Anthropic API rate limits (typically 5-10 concurrent requests per tier), not local resources.
Clone-based workspaces
- Clone location: Linux filesystem at /home/plangeberg/watchtower-workers/<worker-id>/ — substantially faster than /mnt/d/ (5-15x for git ops).
- Clone scope: Just the needed subrepo (e.g., czechsuma-labs/czechwriter). Use the SDK's --add-dir flag for read access to brain/ context if needed.
- Shallow clones: git clone --depth 1 keeps disk usage minimal (~10-50 MB per worker).
- Best option: Bare clone mirror on the Linux FS + git worktree add for workers. Faster than network clones, shared object store (see the sketch after this list).
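A sketch of the bare-mirror-plus-worktree option using plain git via subprocess; the remote URL and paths are illustrative assumptions:

```python
# One shared bare mirror on the Linux FS, one worktree per worker.
import subprocess
from pathlib import Path

WORKERS_ROOT = Path("/home/plangeberg/watchtower-workers")
MIRROR = WORKERS_ROOT / "mirrors" / "czechwriter.git"
REPO_URL = "git@github.com:czechsuma-labs/czechwriter.git"  # assumed remote


def git(*args: str, cwd: Path | None = None) -> None:
    subprocess.run(["git", *args], cwd=cwd, check=True)


def prepare_workspace(worker_id: str) -> Path:
    if not MIRROR.exists():
        MIRROR.parent.mkdir(parents=True, exist_ok=True)
        git("clone", "--mirror", REPO_URL, str(MIRROR))
    else:
        git("fetch", "--prune", cwd=MIRROR)  # refresh the shared object store
    workspace = WORKERS_ROOT / worker_id
    # New worktree on a dedicated worker branch, sharing the mirror's objects.
    git("worktree", "add", "-b", f"worker/{worker_id}", str(workspace), "main",
        cwd=MIRROR)
    return workspace
```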
Cleanup
- Normal: commit → push → rm -rf the worker dir.
- Crash: On startup, scan worker dirs for orphans, checking PIDs against the worker.json metadata kept in each worker dir (see the sketch below).
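A sketch of the startup orphan scan, assuming each worker dir carries a worker.json with at least a pid field:

```python
import json
import os
import shutil
from pathlib import Path

WORKERS_ROOT = Path("/home/plangeberg/watchtower-workers")


def pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0: existence check, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def cleanup_orphans() -> None:
    for worker_dir in WORKERS_ROOT.glob("wt-*"):
        meta_path = worker_dir / "worker.json"
        if not meta_path.exists():
            continue  # not one of ours; leave it alone
        pid = json.loads(meta_path.read_text()).get("pid")
        if pid is None or not pid_alive(pid):
            # Also remove the matching worktree entry and Discord channel here.
            shutil.rmtree(worker_dir)
```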
Conflict handling — recommended approach
- Worker branches: Each worker commits to worker/<id>. No push collisions possible.
- CoS (or Phil) merges worker branches into main after review (see the sketch after this list).
- Alternative: Lock one worker per repo (simpler, less parallel).
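A sketch of the CoS-side merge step run from the primary working copy; the repo path is an assumed placeholder:

```python
import subprocess

MAIN_REPO = "/path/to/primary-working-copy"  # assumed; wherever CoS keeps main


def merge_worker_branch(worker_id: str) -> None:
    branch = f"worker/{worker_id}"

    def git(*args: str) -> None:
        subprocess.run(["git", "-C", MAIN_REPO, *args], check=True)

    git("fetch", "origin", branch)
    git("checkout", "main")
    git("merge", "--no-ff", f"origin/{branch}")  # any conflicts surface here, once
    git("push", "origin", "main")
    git("push", "origin", "--delete", branch)    # retire the worker branch
```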
WSL2 prep needed
- Raise fs.inotify.max_user_watches to 524288 (the default of 8192 will be exhausted by multiple Node.js processes); verification sketch below.
- Set .wslconfig: memory=20GB, swap=4GB — leaves 12GB for Windows.
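A sketch of the pre-flight checks a WT-010 script might run; the file-descriptor floor is an assumption, and .wslconfig lives on the Windows side, so it is only reported here rather than edited:

```python
import resource
from pathlib import Path

WATCH_LIMIT = 524_288
FD_LIMIT = 4_096  # assumed minimum for `ulimit -n`; tune as needed


def check_inotify() -> bool:
    current = int(Path("/proc/sys/fs/inotify/max_user_watches").read_text())
    if current < WATCH_LIMIT:
        print(f"inotify watches = {current}; run: "
              f"sudo sysctl -w fs.inotify.max_user_watches={WATCH_LIMIT}")
        return False
    return True


def check_fd_limit() -> bool:
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < FD_LIMIT:
        print(f"ulimit -n = {soft}; raise it before spawning workers")
        return False
    return True


if __name__ == "__main__":
    ok = check_inotify() and check_fd_limit()
    print("WSL2 prep:", "OK" if ok else "needs attention")
```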
4. Brain DB Analyst — Shared State & Concurrent Writes
Verdict: BUILD WHEN NEEDED — not now, but architect for it
The real conflict risk
- 95% of the risk is one file: THREADS.md — specifically the todo list and thread status sections.
- Everything else (PHIL.md, contexts, sessions, handoffs, runbooks) is either read-only, append-only, or session-scoped. No concurrency risk.
Recommended phased approach
- Phase 1 (ship with multi-instance): "CoS owns THREADS.md" rule. Worker instances run in a restricted mode — no writes to brain/. This is basically extending !secret mode to all workers. Zero new infrastructure needed.
- Phase 2 (when pain is real): SQLite DB in WAL mode at brain/brain.db (see the sketch after this list).
  - Tables: threads, todos, parking_lot, review_items, delegated
  - WAL mode allows concurrent reads + serialized writes with auto-retry
  - Python CLI wrapper (brain-db) — CC calls it via Bash instead of editing THREADS.md
  - Auto-renders THREADS.md as a read-only artifact after every write (Phil's view doesn't change)
  - Migration script parses existing THREADS.md → DB rows (~150 lines)
  - CLI tool ~200 lines. Minimal command set: brain-db todos, brain-db todo-done "text", etc.
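A Phase 2 sketch of the WAL-mode store behind a brain-db style command; the schema details beyond the table names listed above are assumptions:

```python
import sqlite3
import sys

DB_PATH = "brain/brain.db"


def connect() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH, timeout=5.0)  # waits out concurrent writers
    conn.execute("PRAGMA journal_mode=WAL")       # concurrent readers, one writer
    conn.execute(
        "CREATE TABLE IF NOT EXISTS todos ("
        " id INTEGER PRIMARY KEY, text TEXT NOT NULL, done INTEGER DEFAULT 0)"
    )
    return conn


def main() -> None:
    cmd, *args = sys.argv[1:] or ["todos"]
    with connect() as conn:  # `with` commits the write on success
        if cmd == "todos":
            for row in conn.execute("SELECT id, text FROM todos WHERE done = 0"):
                print(f"{row[0]}: {row[1]}")
        elif cmd == "todo-done" and args:
            conn.execute("UPDATE todos SET done = 1 WHERE text = ?", (args[0],))
        # After every write, re-render THREADS.md from the DB (omitted here).


if __name__ == "__main__":
    main()
```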
What stays as files forever
- PHIL.md, CHIEF-OF-STAFF.md, contexts/* — read-only for CC, rare human edits
- Sessions/* — append-only, one file per session
- Handoffs — create-once, consume-once lifecycle
- Runbooks, parking-lot detail files — reference docs, not concurrent data
Key insight: Don't build the DB until multi-instance is live and you feel the pain. "CoS owns THREADS.md" is sufficient for launch.
5. Architecture Summary
Watchtower starts → CoS session in main channel (primary working copy)
│
Phil: "spin up a worker for CzechWriter"
│
▼
CoS checks: resources OK? repo not already locked? → YES
│
├── Creates Discord channel: #worker-czechwriter
├── Clones repo to /home/plangeberg/watchtower-workers/wt-001/
├── Spawns new ClaudeSDKClient(cwd=clone_path)
├── Registers channel_id → worker in routing dict
└── Tells Phil: "Channel ready, go talk to it"
Phil switches to #worker-czechwriter → works directly with that CC instance
│
Done → Phil says "end session" or tells CoS to kill it
│
▼
Cleanup:
├── Worker commits to branch worker/wt-001
├── Worker pushes branch
├── Clone directory deleted
├── Discord channel deleted (or archived)
└── CoS merges branch to main (or Phil reviews first)
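A sketch of the registry and spawn guard implied by the flow above (roughly what the worker-registry and resource-guard tickets below would build); the field names and the psutil-based memory check are assumptions:

```python
from dataclasses import dataclass, field

import psutil

MAX_WORKERS = 4                # configurable cap (default proposed below)
MIN_FREE_BYTES = 2 * 1024**3   # assumed ~2 GB headroom; tune to Deadpool reality


@dataclass
class WorkerRecord:
    worker_id: str
    channel_id: int
    repo: str
    branch: str
    clone_path: str
    pid: int


@dataclass
class WorkerRegistry:
    workers: dict[str, WorkerRecord] = field(default_factory=dict)

    def can_spawn(self) -> tuple[bool, str]:
        if len(self.workers) >= MAX_WORKERS:
            return False, f"worker cap ({MAX_WORKERS}) reached"
        if psutil.virtual_memory().available < MIN_FREE_BYTES:
            return False, "not enough free memory"
        return True, "ok"

    def register(self, record: WorkerRecord) -> None:
        self.workers[record.worker_id] = record

    def by_channel(self, channel_id: int) -> WorkerRecord | None:
        return next((w for w in self.workers.values()
                     if w.channel_id == channel_id), None)
```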
6. Proposed Tickets
EPIC 1: Multi-Instance Core (MVP)
Estimated: 3-4 sessions
WT-001: Worker registry and lifecycle manager
Track active workers, their channels, repos, branches, PIDs. Spawn/kill operations. Orphan cleanup on startup.
WT-002: Multi-channel Discord routing
Refactor on_message from single-channel to registry-based routing. Each worker channel routes to its own ClaudeBridge. CoS channel keeps existing commands.
WT-003: Dynamic channel creation/deletion
Bot creates channels under "Watchtower Workers" category on spawn. Deletes on cleanup. Phil-only permissions.
WT-004: Clone-based workspace management
Git clone to Linux FS, shallow clone, worker branch naming, commit+push on completion, directory cleanup. Consider bare-mirror optimization.
WT-005: CoS commands !spawn, !workers, !kill
!spawn czechwriter — creates worker. !workers — lists active. !kill wt-001 — terminates worker + cleanup.
WT-006: Resource guard
Check available memory + active worker count before spawning. Configurable max workers (default 4). Deny spawn with reason if limit hit.
WT-007: Default to CoS on boot
Watchtower auto-starts a CoS session in the main channel on startup. Currently requires !cos.
EPIC 2: Worker Isolation & Safety
Estimated: 1-2 sessions
WT-008: Worker restricted mode (no brain writes)
Workers get a preamble similar to !secret — no writes to brain/, memory/, THREADS.md. Only CoS touches shared state.
WT-009: Per-worker settings file
Generate a worker-specific watchtower-settings-wt-001.json scoped to the clone directory. Prevents accidental access to other repos.
WT-010: WSL2 environment prep script
Script to set inotify limits, .wslconfig memory cap, verify ulimit -n. Run once before first multi-instance use.
EPIC 3: Brain DB (Deferred — Build When Needed)
Estimated: 2 sessions
WT-011: THREADS.md → SQLite migration script
Parse THREADS.md, seed DB tables (threads, todos, parking_lot, review_items). Validate by round-trip diff.
WT-012: brain-db CLI tool
Python CLI: query/update threads, todos, parking lot. Auto-renders THREADS.md after writes. CC calls via Bash.
WT-013: Update Watchtower todo.py to use brain-db
Swap file I/O for subprocess calls to brain-db. Same Discord interface for Phil.
WT-014: Update CHIEF-OF-STAFF.md for DB workflow
Tell CC to use the brain-db CLI instead of editing THREADS.md directly.
7. Open Questions for Phil
- Max concurrent workers: Default to 4? Or let it float based on memory?
- Worker branch merging: CoS auto-merges to main? Or Phil reviews the branch first?
- Channel persistence: Delete channels when worker ends? Or keep for history (collapse into archive category)?
- Repo mapping: Should CoS know which repos map to which project names (e.g., "CzechWriter" → czechsuma-labs/czechwriter)? Or does Phil specify the path?
- Naming: The app is still called Watchtower (pending rename). Do these tickets go into the existing Watchtower backlog, or does multi-instance warrant its own project name?
8. Risks & Mitigations
| Risk | Severity | Mitigation |
| --- | --- | --- |
| Anthropic API rate limits throttle multiple workers | High | Resource guard caps workers; stagger API-heavy operations |
| Worker crashes leave orphaned clones/channels | Medium | Startup cleanup + worker.json metadata + periodic health check |
| Two workers edit the same file in their separate clones | Medium | Worker branches prevent push conflicts; CoS resolves at merge |
| WSL2 inotify exhaustion | Low | WT-010 prep script raises limits |
| Write-after-Allow bug (existing) blocks worker permissions | High | Must fix existing bug (WT backlog) before multi-instance ships |