The Stack We Actually Ship On: Infrastructure
Field notes for founders on AmpUp's infrastructure stack — from PostgreSQL with Row-Level Security and Elasticsearch hybrid search, to FastAPI, Inngest durable tasks, and Tilt-based developer experience. Part 1 of 2.
We’re AmpUp, a sales intelligence, execution, and coaching layer built on conversational AI agents. We spent months choosing, debating, and second-guessing every layer of our infrastructure. Here’s the blueprint we landed on, shared as a gift to founders who are right where we were.
My background is in applied AI research: building data analysis agents at ThoughtSpot and world-building agents at Roblox. AmpUp is where those lessons collided with the realities of shipping a production AI product to enterprise sales teams. Over the past year, we made multiple infrastructure decisions under pressure, with incomplete information, and with a team that was always a little smaller than the problem demanded. This post is what we learned.
This is Part 1 of a two-part series covering the infrastructure foundation: storage, backend and frontend architecture, background jobs, authentication, testing, and developer experience. In Part 2, we cover the product core: conversational agents, voice AI, transcription, chat interface, analytics, and observability.
Take what’s useful. Make it your own.
00 — First: Who We Are and Why This Matters
AmpUp is a sales intelligence, execution, and coaching layer. We build AI agents that act as a blend of sales strategist, sales coach, and chief of staff, and they mostly talk to you. No clicking through dashboards. No wrestling with CRM interfaces that create friction instead of removing it. You talk, the agent listens, coaches, prepares you, and handles the busywork.
Under the hood, the product is a closed-loop system. Our Meeting Intelligence engine analyzes past calls to find the causal drivers that actually move revenue (not correlations, causes). A Pre-Meeting Agent briefs reps with real CRM context before every conversation. A Post-Meeting Agent debriefs them immediately after through a voice conversation, not a form. And a Practice System lets reps drill on the exact scenarios where they’re weakest, generated from their own deals, not generic roleplay scripts.
When a top rep handles a pricing objection brilliantly on Monday morning, every other rep on the team can practice that exact technique by the afternoon. That’s what the product delivers.
Why the Stack Matters for What We Build
Our architecture isn’t academic; every choice traces back to the product. We’re a voice-first platform, which means streaming audio, real-time transcription, and conversational AI agents that need to feel natural, not laggy. We run causal analysis across hundreds of meeting patterns, so our search and data layers handle complex retrieval at scale. We process call recordings, CRM signals, and transcripts through durable ingestion pipelines, because if that pipeline hiccups, a rep walks into a critical meeting without their brief. And because we sell to enterprise sales orgs who are handing us their most sensitive competitive conversations, our auth and tenant isolation aren’t afterthoughts; they’re the reason we get the contract.
Every stack decision you’ll read below was shaped by these constraints. Your constraints will be different, but the thought process transfers.
We’re early-stage, venture-backed, and already deployed in production with enterprise customers. Our team is small, which means every architectural decision is also a staffing decision. We can’t afford to operate a bespoke message broker and a custom auth system and a hand-rolled deployment pipeline. We need tools that are powerful enough for enterprise demands but simple enough that a small team can actually run them.
That tension (enterprise-grade product, startup-sized team) is what shaped every decision below. If you’re in a similar position, this is written for you.
01 — Storage: Postgres, Elasticsearch, and the Search Evaluation That Took a Month
Storage is the section that deserves the most attention because it touches everything else. Your database choice determines your multi-tenancy model, your search experience, your vector retrieval strategy, and your operational overhead. We spent more time evaluating options here than on any other layer, and the story of how we landed on our current setup is worth telling in full.
Locked In
- Source of Truth: PostgreSQL (with Row-Level Security)
- Multi-Tenancy: Shared schema + RLS
- ORM: SQLAlchemy (async)
- Migrations: Alembic
- Search Engine: Elasticsearch
- Vector Search: Dense vectors with HNSW (cosine similarity)
- Retrieval Strategy: Hybrid: BM25 + kNN with Reciprocal Rank Fusion
- Simple Vector Needs: pgvector (Postgres extension)
PostgreSQL as Source of Truth
When we started building AmpUp, we made a deliberate bet: PostgreSQL would be our single source of truth, and we would resist the temptation to reach for specialized stores until the data proved we needed them. Two years in, handling millions of meeting records, transcript chunks, AI-generated signals, and coaching insights across hundreds of enterprise tenants, Postgres has not blinked.
Postgres is not a “starter database you migrate away from.” It is a production-grade, extensible, battle-tested system that can carry you from zero to significant scale. The extension ecosystem (pgvector for embeddings, pg_trgm for fuzzy search, timescaledb for time-series) means you can add capabilities without bolting on entirely new infrastructure.
Multi-Tenant Isolation with Row-Level Security
The hardest problem in B2B SaaS is making sure one tenant can never see another tenant’s data, not through a bug, not through a missing WHERE clause, not through an ORM quirk. The naive approach is to litter WHERE org_id = ? filters throughout every query. This works until someone forgets one.
We use PostgreSQL Row-Level Security to enforce tenant isolation at the storage layer. Every connection runs SET LOCAL app.current_tenant = '<org_id>' at the start of each transaction. RLS policies on every tenant-scoped table automatically filter every SELECT, INSERT, UPDATE, and DELETE:
```sql
CREATE POLICY tenant_isolation ON meetings
    USING (org_id = current_setting('app.current_tenant')::uuid);
```
With FORCE ROW LEVEL SECURITY applied even to table owners, there is no escape hatch. An ORM query that forgets a filter does not return another tenant’s data; reads return zero rows, and writes that violate the policy are rejected by Postgres outright. This is the difference between defense in depth and optimism.
Schema Design
All tenants live in the same tables. Migrations run once. Connection pools are shared. Monitoring dashboards show one database. The org_id foreign key on every tenant-scoped table is explicit and indexed even though RLS enforces isolation independently; the index ensures query plans stay fast. Our core data model flows as a chain: organizations → meetings → transcript_chunks → signals → coaching_insights, all linked with foreign keys and cascading deletes.
Connection Pool Hygiene
We run SQLAlchemy’s async sessions backed by asyncpg. The SET LOCAL app.current_tenant call happens inside a session event listener on transaction start; application code never has to remember to set it. Pool sizing matters more than you’d expect on Kubernetes: with multiple pods each holding a pool, it’s easy to exhaust max_connections. We size conservatively and monitor active connections as a first-class metric.
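The shape of that "set tenant on transaction start" hook can be sketched without a database. This is illustrative only, not our production code: in production this is a SQLAlchemy async session event listener, and `RecordingConnection` / `on_transaction_begin` are names invented for the example.

```python
import contextvars
import uuid

# Tenant resolved by middleware earlier in the request lifecycle.
current_org_id = contextvars.ContextVar("current_org_id")

class RecordingConnection:
    """Stand-in for a DB connection that records executed statements."""
    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append((sql, params))

def on_transaction_begin(conn):
    """Runs once at the start of every transaction; handlers never call it."""
    org_id = current_org_id.get()
    # SET LOCAL scopes the setting to the current transaction, so a pooled
    # connection can never leak one tenant's context into the next request.
    conn.execute("SET LOCAL app.current_tenant = %(org_id)s",
                 {"org_id": str(org_id)})

current_org_id.set(uuid.UUID("00000000-0000-0000-0000-000000000042"))
conn = RecordingConnection()
on_transaction_begin(conn)
```

The key property: the tenant flows from request context into the transaction without any route handler touching it.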
Migration Workflow
Rather than requiring a running cluster connection, we built a local migration script that spins up a temporary Postgres instance, applies every existing migration to build the current schema, then runs Alembic’s autogenerate. Any engineer can generate a correct migration from their laptop with zero external dependencies. We always review generated files; autogenerate handles column additions and index creation but misses data migrations, custom SQL, and column renames.
Performance Patterns
Strategic indexing is where Postgres performance lives. We index every `org_id` column (RLS can use an index scan), every foreign key used in a `JOIN`, and composite indexes on `(org_id, created_at)` for paginated lists. Every slow query gets `EXPLAIN (ANALYZE, BUFFERS)` before any optimization. N+1 queries are the most common ORM trap; we enforce eager loading for known relationship traversals and lint for N+1 patterns in code review.
The Search and Vector DB Evaluation: A Month-Long Odyssey
We needed hybrid search: fulltext keyword matching combined with vector similarity retrieval. Our AI agents need both. When a coaching agent is answering “what objections did this prospect raise about pricing in the last three calls?”, a pure vector query surfaces semantically related content but misses exact terminology. Pure BM25 inverts that problem: keywords match but paraphrases miss. We needed both, in one system, running fast enough for conversational latency targets.
Here is what we needed from a search and vector solution:
- Fulltext search with lemmatization, stop words, BM25 ranking, and typo tolerance
- Vector search (cosine similarity) on fewer than 5M vectors at 1536 dimensions
- Horizontally scalable for future growth
- Reasonable memory behavior (HNSW indexes are in-memory by default; we needed graceful disk paging as the corpus grows)
- Self-hostable preferred (for local dev, airgapped installs, cloud-agnostic deployments)
- Not excessively heavy to run on a small team’s infrastructure budget
I personally benchmarked most of these options. We evaluated Meilisearch, Typesense, Algolia, Turbopuffer, OpenSearch, pgvector, ParadeDB, Pinecone, ChromaDB, Weaviate, Milvus, Qdrant, and Elasticsearch. Each had tradeoffs: some excelled at pure vector search but lacked fulltext capabilities, some had excellent developer experience but hit scaling walls at our target volume, and several were not self-hostable or required explicit vendor whitelisting in our infrastructure.
We centered on Elasticsearch for several reasons:
- Self-deployable: we can run it in our own cluster for local development, airgapped environments, or any scenario where depending on a managed service isn’t an option
- Cloud-native support: every major cloud provider supports it natively (Elastic Cloud, AWS OpenSearch, GCP Marketplace), so we’re never locked into a single hosting path
- No vendor whitelisting: alternatives that performed well on benchmarks either required vendor-hosted infrastructure with no self-deploy option, or needed explicit whitelisting and approval processes that added friction to our deployment story
- Operational flexibility: for a small team, having a search engine we can run anywhere was the deciding factor
On the vector side, pgvector impressed us enormously. We benchmarked it at 1M vectors (1536 dimensions) and saw p50 = 41ms, p95 = 71ms, with RAM under 200MB during queries. For comparison, Elasticsearch on the same benchmark showed p50 = 426ms with ~4GB heap. pgvector handles our simpler vector needs directly in Postgres, while Elasticsearch handles the heavy hybrid search workload where we need both BM25 keyword scoring and cosine similarity in a single query.
Where We Landed: Elasticsearch + Postgres
After a month of evaluation, we settled on a two-system approach:
- Postgres (with RLS) as the canonical source of truth, with pgvector for simple vector needs
- Elasticsearch for the heavy hybrid search workload: BM25 fulltext + kNN vector with Reciprocal Rank Fusion
The reasoning: Elasticsearch is heavy, but we already had runbooks, snapshot policies, and alerting wired up for it. Adding a dedicated vector database would have meant a third persistence system to monitor, back up, and keep in sync. Operational consolidation matters when your team is small. One search system, one set of client libraries, one backup strategy. The tradeoff is that ES is not the fastest at pure vector search (pgvector beats it handily there), but for hybrid retrieval where we need both BM25 keyword scoring and cosine similarity, ES gives us a single query interface that combines both.
Postgres handles the structured data, tenant isolation, and transactional guarantees. Elasticsearch handles the read-optimized search projections. If they ever disagree, Postgres wins.
Open Questions We’re Still Exploring
pgvector + Postgres native FTS could potentially replace Elasticsearch entirely. The pgvector performance numbers are compelling, and Postgres fulltext search with `tsvector`/`tsquery` is more capable than most engineers realize. We’re watching ParadeDB for enhanced BM25 scoring in Postgres. For now, our two-system approach works, but the Postgres-only path is the most appealing simplification on the horizon.
The Embedding Pipeline
When a call recording is processed, we chunk the transcript into overlapping windows of roughly 400 tokens (with a 50-token overlap to avoid splitting context across chunk boundaries), then pass each chunk through an embedding model to produce a dense vector. Those vectors land in Elasticsearch alongside the raw text, speaker metadata, timestamps, and pre-computed fields like deal stage and account tier, with an HNSW graph built at index time for fast approximate nearest-neighbor retrieval.
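The chunking step above is a sliding window over tokens. A minimal sketch, using the 400/50 sizes from the text and operating on an already-tokenized list (the tokenizer and embedding call are elided):

```python
def chunk_tokens(tokens, window=400, overlap=50):
    """Split a token sequence into overlapping windows.

    Consecutive chunks share `overlap` tokens so a sentence falling on a
    boundary still appears intact in at least one chunk.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the final window already covers the tail
    return chunks

# A 1,000-token transcript yields windows starting at 0, 350, and 700.
chunks = chunk_tokens(list(range(1000)))
```

Each resulting chunk would then be embedded and indexed alongside its metadata.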
Hybrid Search: BM25 + Vector
When a coaching agent needs to answer “what objections did this prospect raise about pricing in the last three calls?”, we combine both approaches using Elasticsearch’s knn clause alongside a standard query block, then use Reciprocal Rank Fusion to merge the ranked lists. Agent retrieval queries lean heavier on vector similarity (roughly 70/30 vector-to-BM25), while explicit keyword lookups from the UI flip that ratio.
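Reciprocal Rank Fusion itself is a small algorithm, and Elasticsearch runs it server-side; this sketch only illustrates the math. The `k=60` constant is the conventional default from the RRF literature, and the 0.3/0.7 weights mirror the rough 70/30 vector-to-BM25 ratio mentioned above.

```python
def rrf_merge(ranked_lists, weights=None, k=60):
    """Fuse ranked result lists: score(d) = sum_i  w_i / (k + rank_i(d))."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for w, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["meeting_1", "meeting_2", "meeting_3"]  # keyword ranking
knn_hits = ["meeting_3", "meeting_4", "meeting_1"]   # vector ranking
# Agent retrieval leans ~70/30 toward vector similarity.
fused = rrf_merge([bm25_hits, knn_hits], weights=[0.3, 0.7])
```

A document ranked well by both lists (like `meeting_3` here) rises to the top even though neither ranking alone put it first across the board.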
Pick Postgres first. Add search infrastructure only when the queries prove you need it.
02 — Backend Architecture: FastAPI
Locked In
- Language: Python
- Framework: FastAPI
- Validation: Pydantic
- API Contract: OpenAPI (auto-generated)
- ORM: SQLAlchemy + SQLModel
- Migrations: Alembic
We chose FastAPI over Django and Flask for several reasons:
- Async-first: native async support designed in from day one, not bolted on
- Pydantic integration: validation, serialization, and deserialization across the entire codebase
- SQLModel + SQLAlchemy: clean ORM integration that plays well with Pydantic models
- Dependency injection: every endpoint declaratively injects database sessions, auth context, OpenFGA authorization checks, and tenant isolation, all composable and testable without mocks
- Middleware: straightforward system for cross-cutting concerns like logging, tenant context, and request tracing
- Streaming + WebSockets: native support for streaming responses and persistent WebSocket connections, essential for an AI product where responses are streams, not documents
The Async Migration: A Pain Point Worth Documenting
We started building AmpUp without asyncio. The initial codebase was synchronous Python: synchronous database sessions, synchronous HTTP calls, synchronous everything. It worked fine until it didn’t. The turning point came when we integrated third-party libraries that were fundamentally async: Langfuse for LLM observability and Inngest for durable workflows. Both libraries expected to run inside an async event loop. Mixing synchronous blocking calls with an async framework creates a specific class of problems: blocking the event loop starves other coroutines, leading to cascading timeouts across unrelated requests.
The migration was not a weekend project. We had to move every database session from Session to AsyncSession, replace requests calls with httpx.AsyncClient, update every test to use async fixtures, and audit every shared utility function to ensure nothing was blocking. The hardest part was finding the hidden synchronous calls: a .all() on a SQLAlchemy query that triggers a lazy load, a logging handler that writes synchronously to a file, a third-party SDK that calls time.sleep() internally. Each of these could silently block the event loop for seconds.
If you’re starting a new Python project today with FastAPI, start async from day one. The migration cost grows with every synchronous function you write. We estimate the async migration consumed roughly three weeks of engineering time spread across two months, and that was for a codebase that was still relatively young. For a mature codebase, the cost would be significantly higher.
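The "hidden synchronous call" failure mode can be demonstrated with nothing but the stdlib: three handlers that `await` their sleep overlap and finish in roughly one interval, while three that block the loop with `time.sleep()` serialize.

```python
import asyncio
import time

async def good_handler():
    await asyncio.sleep(0.2)   # yields the event loop to other coroutines

async def blocking_handler():
    time.sleep(0.2)            # blocks the whole event loop

async def main():
    t0 = time.perf_counter()
    await asyncio.gather(*[good_handler() for _ in range(3)])
    concurrent = time.perf_counter() - t0    # ~0.2s: the sleeps overlap

    t0 = time.perf_counter()
    await asyncio.gather(*[blocking_handler() for _ in range(3)])
    serialized = time.perf_counter() - t0    # ~0.6s: each sleep stalls the loop
    return concurrent, serialized

concurrent, serialized = asyncio.run(main())
```

Replace the 0.2s sleep with a synchronous database call under load and you get the cascading cross-request timeouts described above.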
The OpenAPI Codegen Pipeline
The part of our FastAPI setup that probably saves us the most developer time is the OpenAPI codegen pipeline. Every Pydantic model we define on the backend automatically becomes a TypeScript type on the frontend, with zero manual effort. FastAPI introspects our Pydantic models and route signatures to generate an openapi.json spec, then we run a codegen step that feeds that spec through a TypeScript code generator and drops fully-typed client SDK files into the frontend project. The frontend team never writes an API client by hand.
When a backend engineer adds a field to a response model, runs codegen, and commits, the frontend TypeScript immediately knows about it and will fail to compile if a component uses the old shape. This tight feedback loop has caught dozens of integration bugs that would have otherwise only surfaced at runtime. One non-obvious benefit: we use Python (str, Enum) classes for any fixed set of values (sidebar options, dashboard tabs, integration provider names) rather than plain list[str]. Enums flow through the OpenAPI spec as string literals, which means the generated TypeScript gets a proper union type rather than just string. The frontend gets autocomplete and exhaustiveness checking on values that live entirely in backend config.
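A minimal sketch of that enum pattern (the provider names here are hypothetical, not our actual config):

```python
import json
from enum import Enum

class IntegrationProvider(str, Enum):
    """Because this subclasses str, FastAPI emits it into the OpenAPI spec as
    an enum of string literals, which codegen turns into a TypeScript union."""
    SALESFORCE = "salesforce"
    HUBSPOT = "hubspot"
    GONG = "gong"

# Members behave as plain strings in JSON payloads, no custom encoder needed.
payload = json.dumps({"provider": IntegrationProvider.SALESFORCE})
```

On the frontend, the generated type becomes `"salesforce" | "hubspot" | "gong"` rather than `string`, so a typo in a component is a compile error.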
Dependency Injection for Multi-Tenant Isolation
FastAPI’s dependency injection system is where the framework really earns its keep in a multi-tenant SaaS. Every endpoint that touches the database gets a session dependency injected, which handles connection checkout and guaranteed cleanup. Auth context is injected similarly: validate the JWT, resolve org membership, and attach the full user object, so route handlers just receive a typed user and never think about tokens. Tenant isolation goes one level deeper: a tenant-aware database dependency runs SET LOCAL app.current_org_id on the Postgres session before handing it to the handler, activating row-level security at the database level. The beauty is that tests can swap in a real test-database session with a pre-seeded org without any mocking; you just override the dependency in the test client, and the entire auth and tenant stack runs for real.
Middleware and Tenant Context
Middleware fills in what dependencies can’t. Before a request ever reaches a route handler, our tenant context middleware reads the X-Org-ID header (or extracts it from the JWT claims), initializes a context singleton with the resolved org_id and user_id, and attaches it to the request state. This matters for code deep in the call stack, where shared library utilities need to know “which tenant are we running for right now” without threading the org ID through every function parameter. The middleware also handles the logging envelope: every log line emitted during a request automatically gets org_id and request_id attached via structlog context variables, which makes production debugging across our microservices dramatically less painful than you’d expect.
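The mechanism underneath structlog's context variables is Python's `contextvars`, which isolates state per asyncio task so concurrent requests never see each other's tenant. A stripped-down sketch (function names invented for illustration):

```python
import contextvars

org_id_var = contextvars.ContextVar("org_id", default=None)

def tenant_middleware(org_id):
    """Stand-in for the middleware step: bind the tenant once per request."""
    org_id_var.set(org_id)

def deep_library_helper():
    # Code far down the call stack reads the tenant without it being
    # threaded through every function signature.
    return org_id_var.get()

tenant_middleware("org-123")
```

Any log line or query built inside `deep_library_helper` can read the tenant directly from the context.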
BackgroundTasks: The First Step Before You Reach for a Queue
FastAPI’s built-in BackgroundTasks should be the first tool you reach for when you need async work, before introducing a message broker or a framework like Celery, Inngest, or Temporal. This is one of the reasons we chose FastAPI: it gives you a lightweight, zero-infrastructure way to run work after the response is sent. No broker, no worker process, no queue to monitor. For things like sending a webhook notification, writing a summary record, or firing an analytics event, a BackgroundTask runs in the same process with access to all in-flight request context, and it just works.
The limitation is scale. Each background task occupies a thread (or coroutine slot), and at volume, thread exhaustion means your API endpoints start competing for resources with background work. When that happens, response latencies spike and user-facing requests start failing, which is exactly the kind of degradation that’s hardest to debug and most damaging to user experience. That’s the signal to graduate to a dedicated task execution layer. For us, the rule became: if the job takes less than a few seconds and doesn’t need retries, it’s a BackgroundTask. Anything long-running, retry-sensitive, or that must survive a pod restart goes to Inngest.
On the monolith-vs-microservices question: we went microservices, but with a crucial caveat. All services share a centralized Postgres and Redis. This gives us independent deploy cycles without the distributed data challenges that sink most early-stage microservice architectures. If you’re pre-product-market-fit, start with a monolith. We earned the right to split by feeling the pain first.
03 — Frontend Architecture: React SPA
For a product whose core experience is conversational, the frontend carries more weight than a typical B2B dashboard. Users are chatting with AI agents, watching streaming responses render in real time, interacting with tool-call visualizations, and navigating between voice conversations, coaching artifacts, and analytics. The frontend isn’t a thin wrapper around API calls; it’s where the product feels fast or slow, smart or broken.
Locked In
- Framework: React 18 + TypeScript (SPA)
- Build Tool: Vite
- State (Client): Zustand
- State (Server): TanStack React Query
- Styling: Tailwind CSS + shadcn/ui
- Real-Time: Inngest Realtime (WebSocket)
- API Client: openapi-fetch (type-safe, zero-runtime)
Why a React SPA, Not Next.js
We went with a plain React SPA over Next.js deliberately. Our product is an authenticated B2B application, not a content site. There’s no SEO benefit to server-side rendering when every page lives behind a login. What we needed was fast client-side navigation between deeply interactive views (chat panels, voice sessions, analytics dashboards), and a pure SPA gives us full control over the rendering lifecycle without the complexity of hydration mismatches, server components, or edge runtime constraints. Vite gives us sub-second hot reload in development and aggressive chunk splitting in production.
State Management: Two Layers, Clear Boundaries
State management in a chat-heavy product is where most frontend rewrites originate. We split state into two layers with strict boundaries:
- Zustand for client state: chat messages, streaming status, UI flags, agent configuration. Zustand stores are lightweight, hook-based, and don’t require providers or context wrappers. Each domain gets its own store (chat, user, workflow, canvas), which keeps concerns isolated and avoids the “one giant store” problem.
- TanStack React Query for server state: conversations, workflows, connector configs, anything that lives in the database. React Query handles caching, background refetching, and cache invalidation, so components just declare what data they need and the library handles freshness.
This separation means we never have to answer “is this stale?” for server data (React Query handles it) or fight with cache synchronization for transient UI state (Zustand handles it). The two layers don’t talk to each other, which is the point.
Real-Time Streaming with Inngest Realtime
When a user sends a message to an AI agent, the response streams back token by token. We use Inngest Realtime over WebSocket for this, which gives us a persistent connection that handles agent responses, tool-call execution status, thinking indicators, and error states all through the same channel. The chat store tracks each message’s streaming state, tool calls in flight, and thinking content, so the UI can show “the agent is searching your CRM” or “analyzing 3 past calls” as it happens.
We started with Server-Sent Events (SSE) for streaming and migrated to Inngest Realtime for better reliability on reconnection and richer event semantics. SSE worked for simple text streaming but became limiting when we needed bidirectional status updates (e.g., the frontend telling the backend that the user cancelled a tool execution mid-flight).
Component Architecture: Compound Components for Chat
The chat interface is the most complex UI surface in the product, and we built it using a compound component pattern. The ChatPanel component exposes composable sub-components (ChatPanel.Messages, ChatPanel.Input, ChatPanel.EmptyState, ChatPanel.OutputArtifact) that share context internally but can be arranged independently. This lets us use the same chat infrastructure for the Post-Meeting Agent, Practice System, and general coaching conversations, with different layouts and artifact panels for each.
Messages render with markdown streaming via react-markdown, tool-call visualizations that show what the agent is doing in real time, and thinking indicators when the model is reasoning through a complex query. Rich text input uses Tiptap with entity mentions (reps, deals, meetings) that resolve to structured references the backend can use for retrieval.
The OpenAPI Type Pipeline (Frontend Perspective)
This is the frontend side of the codegen story from the backend section. Every Pydantic model on the backend becomes a TypeScript type via openapi-typescript, and our API client (openapi-fetch) uses those types to make every API call fully type-checked at compile time. When a backend engineer adds a field to a response model, the frontend gets autocomplete and compile errors immediately, with no manual type definitions, no runtime validation, no “the API changed and nobody told us” bugs.
One pattern worth calling out: we define Python (str, Enum) classes for fixed value sets (sidebar options, dashboard tabs, integration providers) rather than plain string lists. These flow through OpenAPI as string literal types, which means the frontend TypeScript gets proper union types with exhaustiveness checking instead of just string. A single source of truth from backend config all the way to frontend autocomplete.
Styling and Component Library
Tailwind CSS with shadcn/ui (built on Radix UI primitives) gives us accessible, composable components without the bundle weight of a full component framework. We extended the default Tailwind palette with brand-specific tokens and use CSS variables for theming. For data-heavy views (pipeline analytics, call grids), AG Grid handles the heavy lifting. The key insight: pick a component library that gives you unstyled primitives, not opinionated designs. You’ll restyle everything anyway.
04 — Task Execution: From Celery to Inngest
This is the layer that quietly makes or breaks your product. For us, it’s the backbone: a sales call ends, a transcript lands in GCS, and a cascade of processing kicks off (transcription, signal extraction, coaching generation, CRM sync). If any step is fragile, the Post-Meeting Agent doesn’t have context when it calls the rep back for a debrief.
We started with Celery and migrated to Inngest for durable task execution. The key wins: step-level retries with exponential backoff (no re-running a fifteen-minute pipeline because the last ten seconds failed), a dashboard with full audit trails for every execution, fan-out patterns where a single event triggers independent functions in parallel, and fine-grained multi-tenant throttling out of the box. The Inngest Dev Server runs inside our pytest fixtures, so integration tests exercise the full event-driven pipeline locally.
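This is not the Inngest API, but the step-level retry semantics can be illustrated in a few lines: only the failing step re-runs, with exponential backoff between attempts (`base_delay` is zeroed here so the example runs instantly; all names are invented for the sketch).

```python
import time

def run_step(fn, attempts=4, base_delay=0.0):
    """Retry a single pipeline step with exponential backoff. Steps that
    already completed elsewhere in the pipeline are never re-executed."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # exhausted retries; surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}

def flaky_signal_extraction():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient upstream error")
    return "signals extracted"

result = run_step(flaky_signal_extraction)
```

In a fifteen-minute pipeline, this is the difference between retrying ten seconds of work and retrying the whole run.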
Locked In
- Task Execution: Inngest (durable functions; migrated from Celery)
Fewer things to operate always wins. Your ops burden compounds faster than your headcount.
05 — Authentication and Authorization
Auth is the one area where “build it yourself” is almost always the wrong answer. We use Auth0 for authentication and OpenFGA for fine-grained, relationship-based authorization. This combination lets us handle everything from simple role checks to complex permission inheritance without maintaining our own identity system. That said, using Auth0 has not been trivial; we’ve accumulated enough learnings and workarounds to fill a separate post, which we plan to share soon. We’re also watching WorkOS, which is gaining traction as a modern alternative and may be worth evaluating as our needs evolve.
This matters especially for us because our agents are having deeply candid conversations with salespeople. Reps tell our Post-Meeting Agent things they wouldn’t type into a CRM: political dynamics, deal anxieties, competitive intel overheard in a meeting. Getting auth wrong in a product that handles this level of conversational intelligence would be a company-ending mistake.
Locked In
- Authentication: Auth0
- Authorization: OpenFGA (relationship-based)
Our token strategy has three channels, each purpose-built:
| Token Type | Algorithm | Use Case |
|---|---|---|
| SPA tokens | RS256 JWT | Frontend via Auth0 React SDK with silent refresh |
| Service tokens | HS256 JWT | Internal service-to-service calls |
| API keys | Prefixed, hashed | Programmatic access, stored in DB with usage tracking |
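The “prefixed, hashed” API-key row can be sketched concretely. The `amp_` prefix is invented for illustration; real prefixes typically encode key type and environment so keys are greppable in logs and recognizable to secret scanners.

```python
import hashlib
import secrets

PREFIX = "amp_"  # hypothetical prefix, not our actual scheme

def mint_api_key():
    """Return (plaintext_key, stored_hash). The plaintext is shown to the
    user exactly once; only the hash ever touches the database."""
    key = PREFIX + secrets.token_urlsafe(32)
    return key, hashlib.sha256(key.encode()).hexdigest()

def verify_api_key(presented, stored_hash):
    digest = hashlib.sha256(presented.encode()).hexdigest()
    # Constant-time comparison avoids leaking hash prefixes via timing.
    return secrets.compare_digest(digest, stored_hash)

key, stored = mint_api_key()
```

A database leak then exposes only hashes, never usable credentials.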
Our four-layer authorization model, defense-in-depth as architecture:
| Layer | Where | What It Does |
|---|---|---|
| Middleware | API Gateway | JWT/API key validation, context setup, auto-provisioning users |
| Database | PostgreSQL RLS | SET LOCAL app.current_tenant on every connection, automatic tenant isolation |
| Per-route | Endpoint code | Privilege checks via OpenFGA |
| Service boundary | OpenFGA | Fine-grained checks: can this user edit this specific meeting? |
Why OpenFGA Over Casbin
We evaluated two primary authorization frameworks: Casbin and OpenFGA. Casbin uses a policy-based model (RBAC, ABAC, or custom expressions defined in configuration files), while OpenFGA uses a relationship-based model inspired by Google’s Zanzibar paper. The difference becomes concrete when you ask the question our product asks constantly: “can this user see this specific meeting?”
In Casbin, you’d express this with policy rules that check roles and attributes. This works for broad access patterns (“admins can see all meetings”, “managers can see their team’s meetings”) but becomes increasingly complex when permissions are relationship-driven. A meeting might be visible to the rep who attended, their manager, anyone explicitly shared with, and any admin in the org. In Casbin, each of these is a separate policy rule, and the policy file grows linearly with the number of access patterns.
In OpenFGA, you model the relationships directly: a meeting has an attendee, an owner, a shared_with set, and an org. The authorization model defines that viewer is the union of attendee, owner, shared_with, and org#admin. Adding a new access pattern is adding a relationship tuple, not writing a new policy rule. For a product where the permission model is inherently about relationships between users and specific resources, OpenFGA was a natural fit. Casbin is excellent for simpler RBAC scenarios; for our use case, OpenFGA’s relationship model matched how our users actually think about access.
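A toy model of that union (real checks go through the OpenFGA API against stored relationship tuples; this only illustrates the semantics):

```python
# Relationship tuples for one meeting, as plain sets.
meeting = {
    "attendee": {"rep_1"},
    "owner": {"rep_1"},
    "shared_with": {"rep_2"},
    "org_admin": {"admin_1"},
}

def can_view(user, m):
    # viewer = attendee ∪ owner ∪ shared_with ∪ org#admin
    return user in (m["attendee"] | m["owner"]
                    | m["shared_with"] | m["org_admin"])
```

Granting access is adding a member to one of these sets (writing a tuple), not authoring a new policy rule.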
Why Row-Level Security Matters
PostgreSQL Row-Level Security (RLS) is our silent guardian. Every database connection sets the current tenant, and Postgres enforces isolation at the query level. Even if a bug in application code forgets to filter by org, RLS catches it. This is the kind of defense that lets you sleep at night in a multi-tenant B2B application. If you’re building multi-tenant anything, look into RLS early; retrofitting it is painful.
06 — Testing Strategy
The unsexy section. The one that separates teams that ship with confidence from teams that deploy on Fridays and pray.
Locked In
- Backend Tests: pytest
- E2E Tests: Playwright
- Philosophy: API-level integration tests, no mocks at boundaries
Test Harness: Real Services, Not Mocks
Every external dependency in our test suite is backed by a real (or near-real) instance, spun up per test session:
- PostgreSQL — testing.postgresql spins up an ephemeral Postgres instance per session. Tests run against real SQL with RLS policies enforced.
- Elasticsearch — Testcontainers launches a real ES node in Docker. Hybrid search queries hit real indices with real analyzers.
- Redis — fakeredis provides an in-memory Redis implementation for unit isolation without requiring a running Redis server.
- Inngest — The Inngest Dev Server is launched via npx inngest-cli dev inside a module-scoped pytest fixture, giving us a fully functional workflow engine within the test process.
SDK Clients: API-Level Testing as the North Star
Each service exposes a typed SDK client that wraps every API endpoint. Tests interact exclusively through these clients, never touching the database directly. This means our test suite validates the same contract that production consumers use. If an endpoint’s request shape, response format, or error behavior changes, the tests catch it immediately — because they’re calling the real API, not asserting against internal ORM objects.
The result: most tests don’t import a single database model. They create data through API calls, query through API calls, and assert against API responses. The database is an implementation detail that tests rarely need to know about.
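A minimal sketch of the shape, with an in-memory stand-in for the HTTP layer. The Meeting and MeetingsClient names are hypothetical; a real SDK client wraps API endpoints rather than a dict, but the test reads identically either way.

```python
from dataclasses import dataclass


@dataclass
class Meeting:
    id: int
    title: str


class MeetingsClient:
    """Illustrative stand-in for a typed SDK client.

    In a real suite this would issue HTTP requests against the running
    API; an in-memory store stands in here so the test shape is visible.
    """

    def __init__(self) -> None:
        self._store: dict[int, Meeting] = {}
        self._next_id = 1

    def create_meeting(self, title: str) -> Meeting:
        meeting = Meeting(id=self._next_id, title=title)
        self._store[meeting.id] = meeting
        self._next_id += 1
        return meeting

    def get_meeting(self, meeting_id: int) -> Meeting:
        return self._store[meeting_id]


def test_meeting_roundtrip() -> None:
    # Create and read back through the client: the test exercises the
    # API contract, never an ORM model or a raw database row.
    client = MeetingsClient()
    created = client.create_meeting("Discovery call")
    assert client.get_meeting(created.id).title == "Discovery call"
```

Swapping the in-memory store for a real HTTP client changes nothing about the test body, which is the point of testing at the contract level.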
Why Integration Tests Matter More When LLMs Write Your Code
This is the insight we wish someone had told us earlier. When your product is built on AI agents, and when you’re increasingly using LLMs to help write and modify code, the value of integration testing goes up dramatically while the value of unit tests with mocks goes down.
Here is the problem: an LLM generating code will produce something that looks correct, passes type checking, and satisfies unit tests that mock the boundaries. But the subtle bugs that LLM-generated code introduces are almost always at the integration points. A query that uses the wrong join condition. A serialization format that differs slightly from what the downstream consumer expects. An async function that works in isolation but deadlocks when called within a transaction context. Mocks hide every one of these bugs.
When you test through real database queries, real API calls, and real validation pipelines, you catch the exact class of errors that LLM outputs introduce. We’ve had cases where a generated function passed every unit test (because the mocks matched the expected interface) but failed immediately in integration because it was passing a raw SQL string where the ORM expected a wrapped statement object. That’s the kind of bug a mock will never catch, because the mock doesn’t know about the ORM’s execution model.
Our rule is simple: if a test uses unittest.mock.patch or MagicMock, it needs a very strong justification for why an integration test cannot cover the same behavior. In practice, we find that roughly 90% of proposed mocks can be replaced with a real test against a test database or test API client. The tests are slower (a few seconds versus milliseconds), but they actually verify the behavior that matters.
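A toy illustration of the difference, using sqlite3 in place of Postgres (table and function names are made up): a mocked connection accepts whatever SQL string the model emits, while a real engine validates the schema and the filter logic.

```python
import sqlite3


def fetch_meeting_titles(conn: sqlite3.Connection, org_id: int) -> list[str]:
    # The class of bug mocks hide: a mocked `conn` would happily return
    # canned data for any SQL; a real engine checks the table, columns,
    # and parameter binding actually line up.
    cur = conn.execute(
        "SELECT title FROM meetings WHERE org_id = ?", (org_id,)
    )
    return [row[0] for row in cur.fetchall()]


def make_test_conn() -> sqlite3.Connection:
    # A real (if tiny) database instead of a mock
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE meetings (id INTEGER, org_id INTEGER, title TEXT)")
    conn.execute("INSERT INTO meetings VALUES (1, 1, 'Discovery call')")
    conn.execute("INSERT INTO meetings VALUES (2, 2, 'Other org')")
    return conn


assert fetch_meeting_titles(make_test_conn(), 1) == ["Discovery call"]
```

If the generated query joined the wrong table or dropped the org filter, this test fails; the equivalent mock-based test would still pass.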
Test Data and Environment Tiers
Test data is created via factory patterns using SDK clients that make real API calls, not fixtures stuffed into a database. Auth bypass is used for test isolation, but E2E tests hit real Auth0. Database isolation is handled by transaction rollback with RLS bypass via a root session.
Our environment tiers: CI (mocked external services, 12 parallel workers) → E2E Daily (Tilt, 6 workers, full service mesh) → Staging (real cloud infrastructure). Each tier catches a different class of bug, and the pipeline promotes confidence incrementally.
07 — Developer Experience
A startup’s dev experience is either a multiplier or a tax on every engineer’s output. We invest aggressively here because the compound returns are enormous.
Locked In
From Minikube to Tilt: The 12GB Problem
We started where most Kubernetes teams start: minikube on every developer’s laptop. It worked until it didn’t. As we added services (API, Celery worker, sales agents, Inngest workers, UI, OpenFGA, plus infrastructure like Postgres, Redis, MinIO, Inngest dev server), minikube’s memory footprint climbed past 12GB with 4 dedicated CPU cores. Engineers on 16GB MacBooks were swapping constantly. Hot reload meant rebuilding Docker images inside minikube’s VM, which added 30-60 seconds per code change. The feedback loop was brutal: change a line of Python, wait a minute, check if it worked.
Tilt changed the equation entirely. Instead of running a local Kubernetes cluster on your laptop, Tilt orchestrates against a remote GKE dev cluster, where each developer gets their own isolated namespace. The 1,200-line Tiltfile handles everything: building images (with smart caching that skips rebuilds when Dockerfiles haven’t changed), syncing code via live_update for sub-second hot reload on Python services, managing per-namespace infrastructure (Postgres, Redis, MinIO, Inngest dev server, nginx ingress), and dynamic configuration patching so each namespace gets its own database, storage bucket, and DNS entry.
The developer experience is a single command. It connects to the dev cluster via a bastion tunnel, provisions your namespace, deploys all services, and opens the Tilt UI where you can see logs, build status, and service health for every component. Change a Python file, and live_update syncs it into the running pod and restarts the process, with no image rebuild, no push, no Argo CD sync. The feedback loop dropped from 60 seconds to under 3.
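To make this concrete, here is an abbreviated Tiltfile excerpt in that spirit. Treat it as a sketch: the paths, image name, and the api service are illustrative, and the real file also handles registry configuration, namespace patching, and many more services.

```python
# Tiltfile (Starlark). Illustrative excerpt only.
docker_build(
    'api',
    context='services/api',
    live_update=[
        # Sync changed source straight into the running pod (no rebuild)
        sync('services/api/src', '/app/src'),
        # Re-run dependency install only when the requirements file changes
        run('pip install -r requirements.txt',
            trigger='services/api/requirements.txt'),
    ],
)

# Same Kustomize overlays as production, pointed at the dev overlay
k8s_yaml(kustomize('deploy/overlays/dev'))
k8s_resource('api', port_forwards=8000)
```

The live_update steps are what collapse the feedback loop: file syncs replace image rebuilds for the common case of editing Python.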
Tilt Still Supports Minikube
We kept minikube as a fallback mode for offline development or situations where the dev cluster is unreachable. In minikube mode, Tilt uses local Docker images instead of a remote registry, and skips External-DNS configuration. But the overwhelming majority of development happens against the shared GKE cluster; the resource savings alone (freeing 12GB of RAM) made it worth the migration.
Per-Developer Isolation on a Shared Cluster
Every developer gets their own isolated namespace on a shared GKE dev cluster. Inside that namespace, everything runs in containers: Postgres, Redis, MinIO (S3-compatible storage), the Inngest dev server, and a namespace-specific nginx ingress controller. Wildcard DNS routes to your services automatically via External-DNS. Your development environment looks exactly like production, with the same Kustomize overlays, same service mesh, same ingress patterns, minus the scale.
Because the dev cluster is always running in the cloud, developers can share their latest changes with teammates, PMs, or designers via publicly shareable links — no setup required on the reviewer’s end. This was one of the hardest things to do with minikube, where the entire cluster lived on your laptop and went down the moment the lid closed. A shared remote cluster means your work is always accessible, even when your machine is asleep.
Ephemeral CI Environments
Every CI run spins up an isolated namespace using the same Tilt infrastructure. GitHub Actions runs Tilt in CI mode with a configurable readiness timeout. Tests run against a full service mesh: real Postgres, real Redis, real Inngest dev server. When the run finishes, the cleanup flag tears everything down. No orphaned resources, no namespace collisions, no “works on my cluster” problems. If you’re on Kubernetes, ephemeral namespaces per PR are one of the highest-ROI investments you can make.
Your dev environment should be the production environment’s mirror, not its distant cousin.
Continue Reading
This post covered the infrastructure foundation: storage architecture, the search and vector evaluation that took a month, backend architecture and the async migration, durable task execution with Inngest, authentication and authorization, testing strategy, and developer experience. But infrastructure is only half the story. The product core, how we actually build and ship conversational AI agents that talk to salespeople, is where it comes alive.
In Part 2: The Product Core, we cover:
- CRM Integration (Ampersand): why we pull data instead of querying it at runtime
- Voice AI (ElevenLabs): real-time voice conversations, client-actions, latency budgets, and the build-vs-buy math on Pipecat
Deep Dives
- From Celery to Inngest: the full migration story, fan-out patterns, retry strategy, and honest tradeoffs
- Ampersand, MCP, and the CRM Enrichment Trap: why runtime queries don’t work at scale and how we build our integration layer
The best architecture is the one your team can actually build on.
Part 1 of 2. Written by Rahul Balakavi. For founders, by founders who’ve been there.