The Stack We Actually Ship On: The Product Core

Part 2 of 2. How AmpUp pulls and enriches CRM data through Ampersand, builds voice agents with ElevenLabs client-actions, and closes the loop from data ingestion through intelligent retrieval. Field notes for founders.

Rahul Balakavi, Co-Founder, AmpUp

Part 2 of 2. In Part 1, we covered the infrastructure: cloud, deployment pipelines, backend architecture, background jobs, data layer, auth, testing, developer experience, and security. This half covers the intelligence layer that powers AmpUp’s product: how we pull and enrich CRM data, and how voice agents guide users through their own UI. Each major section has an accompanying deep-dive post for the full story.


This is Part 2 of our complete engineering stack breakdown. If you haven’t read the first half, start there. ← Read Part 1: The Infrastructure

Part 1 was about the foundation: the things you’d find in any well-built SaaS backend. This half is about what makes AmpUp’s product actually work. Not just the AI models we call, but the closed loop that makes them useful.

The Closed Loop

Most AI products work like a vending machine. You put a question in, you get an answer out, and the quality of that answer depends entirely on what you fed it in the moment. AmpUp works differently. We built a closed loop: your CRM and meeting recorder data flows into AmpUp continuously, our pipeline enriches every meeting with structured intelligence (deal signals, coaching moments, competitive mentions, objection patterns), and when you ask a question, you’re not querying raw data. You’re querying a curated knowledge base that already understands your deals.

A rep asks “what objections has this account raised about pricing across the last three quarters?” In a traditional setup, an AI agent would need to fetch the deal from your CRM, pull every associated meeting, load every transcript into context, and then reason over all of it. That’s a context window problem, a latency problem, and a cost problem all at once. In AmpUp, the answer is already partially computed. Every meeting has been analyzed. The objections have been extracted and tagged. The agent queries structured intelligence, not raw transcripts. The question that would blow through a million-token context window and a $200/month API budget gets answered in seconds for pennies.
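To make “the answer is already partially computed” concrete, here is a toy sketch: the query becomes a filter over small structured rows extracted at ingestion time, not a pass over raw transcripts. The schema and field names below are illustrative, not AmpUp’s actual data model.

```typescript
// Illustrative schema: one row per objection, extracted during ingestion.
type Objection = {
  account: string;
  topic: string;   // e.g. "pricing", "security"
  quarter: string; // e.g. "2025-Q3"
  quote: string;
};

// "What pricing objections has this account raised across these quarters?"
// becomes a cheap filter instead of a million-token context load.
function pricingObjections(
  rows: Objection[],
  account: string,
  quarters: string[]
): Objection[] {
  return rows.filter(
    (r) =>
      r.account === account &&
      r.topic === "pricing" &&
      quarters.includes(r.quarter)
  );
}
```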

This closed loop, data ingestion, continuous enrichment, curated retrieval, is the architectural decision that every other choice in this post builds on. It starts with how we get the data in.


01 — Ampersand: Why We Pull Data Instead of Querying It

AmpUp is only as good as the data it can access. We integrate with Salesforce, HubSpot, Gong, Chorus, Fireflies, and a growing list of CRM systems and meeting notetakers. The question was never whether to integrate, but how.

We evaluated two tempting approaches and rejected both. MCP (runtime queries) works for point lookups but falls apart when you need to reason across hundreds of meetings and deals: the questions that actually matter in sales always span more data than any context window can hold. Enriching your CRM with custom fields and objects sounds good until you realize you’ve turned your CRM into a database you need to operate, with six categories of tooling, each with its own vendor contract and failure modes.

Instead, we pull data continuously through Ampersand and enrich it on arrival. Ampersand handles the OAuth dance, token refresh, rate limiting, and data syncing for 100+ SaaS providers. Webhooks land as Inngest events with durable execution guarantees. By the time a user asks a question, the expensive analysis has already been done during ingestion. Your CRM stays clean, your switching costs stay low, and your sales-ops team isn’t debugging a pipeline they didn’t build.
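As a rough sketch of that enrich-on-arrival step: the keyword heuristics below are toy stand-ins for the real LLM-based extraction, and the signal shapes are assumptions. In production this logic would run inside a durable job (e.g. an Inngest function triggered by an Ampersand sync event), so retries never re-pay for the analysis.

```typescript
// Toy stand-in for ingestion-time enrichment: extract structured signals
// once, when the meeting arrives, so later queries never touch raw text.
type Signal = { kind: "objection" | "competitive_mention"; quote: string };

function extractDealSignals(transcript: string): Signal[] {
  const signals: Signal[] = [];
  for (const line of transcript.split("\n")) {
    // Keyword heuristics standing in for LLM-based analysis.
    if (/too expensive|pricing concern|over budget/i.test(line)) {
      signals.push({ kind: "objection", quote: line.trim() });
    }
    if (/competitor|we're also evaluating/i.test(line)) {
      signals.push({ kind: "competitive_mention", quote: line.trim() });
    }
  }
  return signals;
}
```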

Locked In

The full analysis behind choosing Ampersand over MCP and CRM enrichment (why runtime queries blow through context windows and API budgets, the toolchain sprawl of enriching the CRM directly, our evaluation of Composio and Meltano, and the webhook-to-Inngest architecture) is in our deep-dive: read the full Ampersand post here.


02 — Voice AI: ElevenLabs and Client-Actions

Salespeople don’t type. That’s not a knock; it’s just the reality of the job. A good rep spends their day on calls, in meetings, driving between appointments, running back-to-back demos. By the time they finally sit down to log something in Salesforce, the energy of the conversation has evaporated and they’re reconstructing a memory rather than capturing a moment. Voice isn’t a nice-to-have feature for AmpUp. It’s the only interface that doesn’t create more friction than it removes.

Locked In

  • Voice AI Platform: ElevenLabs Conversational AI
  • Primary LLM: Claude Haiku
  • Fallback LLM: OpenAI GPT-5.2 (via ElevenLabs cascading)
  • Disaster Recovery: Pipecat (self-hosted, standby)
  • Connection: WebSocket (full-duplex audio streaming)
  • Session Management: Zustand stores (frontend)
  • Browser Audio: AudioContext + AudioWorklet

We evaluated the obvious alternatives. We could stitch together Whisper for STT, run our own LLM with function calling, and pipe output through a self-hosted TTS model. That path exists. It also burns six months of engineering time before you have anything usable, and you end up owning a voice infrastructure problem when your actual product is sales intelligence. ElevenLabs’ Conversational AI API changed the calculus entirely. Sub-400ms time-to-first-audio on a fresh turn. Voices that don’t have the uncanny valley problem. Reps stopped commenting on the voice after the first week of testing, which is exactly what you want.

Client-Actions: The Feature That Changed Everything

This is the feature that turned voice from “a different input method” into “an entirely different interaction model.” ElevenLabs’ Conversational AI supports client-actions: client-side tool execution where the voice agent can trigger actions directly in the user’s browser while speaking.

The voice agent says “let me show you how your win rate has changed this quarter” and the browser navigates to the analytics page, scrolls to the relevant chart, and highlights the data series being discussed. The agent says “here’s the call where Sarah handled the pricing objection well” and the transcript viewer opens to that exact moment. The agent says “let me pull up the Acme deal” and the deal view slides into focus with the relevant pipeline stage highlighted.

This bridges the gap between “voice conversation” and “interactive UI experience.” The user isn’t just listening to an agent talk. They’re watching their screen respond to the conversation in real time. For sales managers reviewing team performance or reps reviewing their own calls, this is the difference between a phone call and a guided tour.

Client-Actions: Voice-Driven Browser Interactions

How Client-Actions Work

Under the hood, client-actions are tool calls that execute on the client side rather than the server. When the voice agent decides it needs to show the user something, it emits a tool call through the WebSocket connection. Our frontend listens for these tool calls and routes them to the appropriate handler: scroll to a DOM element, navigate to a page, highlight a data point, open a panel. The key insight is that the agent controls the UI the same way it controls any other tool. “Show the user this chart” is the same pattern as “look up this deal in the CRM.” The difference is just where the execution happens.

We built a registry of client-action handlers that map tool names to UI operations. Adding a new client-action (say, opening a coaching scorecard for a specific call) is a matter of registering a new handler function. The voice agent’s system prompt describes the available actions, and the LLM decides when to use them based on conversational context. This is where ElevenLabs’ platform saved us significant engineering: building the bidirectional tool-call bridge between a streaming voice session and a browser DOM would have taken weeks. With client-actions, it was a configuration layer on top of our existing tool infrastructure.
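A minimal sketch of such a registry, assuming tool calls arrive as `{ name, params }` messages over the session WebSocket. The tool names and parameter shapes here are hypothetical, not the actual AmpUp or ElevenLabs API:

```typescript
// Client-action registry: maps tool names emitted by the voice agent to
// UI operations. Dispatch is deliberately forgiving: an unknown tool is
// ignored rather than crashing the live voice session.
type ActionHandler = (params: Record<string, unknown>) => void;

const handlers = new Map<string, ActionHandler>();
const uiLog: string[] = []; // stand-in for real DOM/router side effects

function registerAction(name: string, handler: ActionHandler): void {
  handlers.set(name, handler);
}

function dispatchAction(name: string, params: Record<string, unknown>): boolean {
  const handler = handlers.get(name);
  if (!handler) return false;
  handler(params);
  return true;
}

// Adding a new client-action is one registration; the agent's system prompt
// describes it, and the LLM decides when to call it.
registerAction("show_chart", ({ chartId }) => {
  // In the browser this would scroll the chart into view and highlight it;
  // here we just record the intent.
  uiLog.push(`navigate:chart:${chartId}`);
});
```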

@ampup/chat-widget: Embeddable Agent for Any Website

One of our more interesting frontend investments is @ampup/chat-widget, an independently published npm package that lets any React website embed an AmpUp voice or chat agent. You install it with npm install @ampup/chat-widget, pass an agent_id, and get a fully functional conversational interface. The widget handles its own WebSocket connections, audio management, and streaming; the host application doesn’t need to know anything about voice infrastructure.

What makes it more than a simple embed is the DOM interaction layer: the widget can scroll to specific elements, highlight relevant content on the host page, and navigate between sections, all driven by agent tool calls via ElevenLabs’ client-actions. A voice agent says “let me show you the pricing section” and the page actually scrolls there. This is the same client-actions pattern from our core product, packaged for external use.

The Latency Budget

In a text interface, a 2-second response feels fast. In a voice conversation, it feels like the other person has checked out. We obsess over the latency budget:

| Stage | Target | What Happens |
| --- | --- | --- |
| VAD | ~80ms | Detect end-of-turn |
| STT | ~100ms | Transcribe speech to text |
| LLM First Token | ~120ms | Start generating (streaming, not full completion) |
| TTS First Audio | ~150ms | ElevenLabs produces first audio chunk from partial text |
| Total | <500ms | Conversational-speed response |
When tool calls happen mid-response, including client-actions that move the UI, we insert a brief filler phrase (“Let me pull that up for you”) generated immediately so the audio stream doesn’t go silent while the action executes. Silence in a voice conversation reads as a dropped call, not as thinking.
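One way to sketch that filler pattern. Here `speak` and `runTool` are assumed interfaces to the TTS queue and tool executor, not real ElevenLabs SDK calls:

```typescript
// Mask tool-call latency: emit a short filler utterance immediately, then
// run the tool, so the audio stream never goes silent while it executes.
const FILLERS = [
  "Let me pull that up for you.",
  "One moment.",
  "Pulling that up now.",
];

async function handleToolCall(
  name: string,
  params: Record<string, unknown>,
  speak: (text: string) => void, // assumed TTS queue interface
  runTool: (n: string, p: Record<string, unknown>) => Promise<string>
): Promise<string> {
  // Filler goes out synchronously, before the (possibly slow) tool runs.
  speak(FILLERS[Math.floor(Math.random() * FILLERS.length)]);
  return runTool(name, params);
}
```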

Practice Mode: Roleplay With an AI Buyer

The use case we’re most proud of is the practice system. Before a rep walks into a high-stakes renewal, they can run a full roleplay session selling against an AI buyer persona built from their actual account data. Not a generic “skeptical enterprise buyer,” but a persona constructed from the specific objections that have come up in previous calls with that account, the competitor the prospect is currently evaluating, the economic pressures their industry is facing right now. This is where the closed loop pays off: because we’ve already enriched every historical meeting with that account, the AI buyer persona draws on real objection patterns and competitive dynamics, not generic templates.

ElevenLabs provides the voice, and we tune the persona’s speaking style (pacing, level of warmth, how quickly they interrupt) to match the archetype of the real stakeholder. When the rep successfully handles an objection, the AI buyer escalates appropriately rather than folding, because a practice session where you always win teaches you nothing. The results speak for themselves: in one engagement, we helped a $2B revenue company increase sales productivity by 30% in 8 weeks by layering intelligence, execution, and coaching into their sales motion.

Vendor Risk: Why We Stayed With ElevenLabs

Let’s address the elephant in the room: our entire real-time voice experience depends on a single API provider. We know. We actually built a parallel voice stack using Pipecat, the open-source framework for building voice agents, as a hedge. Pipecat gives you full control: you compose your own STT, LLM, and TTS pipeline, run it on your own infrastructure, and own the entire latency budget.

We chose to continue investing in ElevenLabs over maintaining the Pipecat stack for three specific reasons. First, client-actions: triggering browser interactions from voice tool calls without building the entire bridge ourselves. Second, fallback LLMs: ElevenLabs supports configuring fallback model providers, so if your primary LLM has a latency spike or outage, the conversation automatically routes to a backup without the user noticing. Third, the platform UI for configuring agents, testing conversations, and monitoring sessions has matured significantly.

The Pipecat stack remains our disaster recovery plan. If ElevenLabs has a sustained outage or makes a pricing change that breaks our unit economics, we can route voice sessions through our self-hosted pipeline within hours, not weeks.

Cost Reality: Voice AI

Voice AI is not cheap. ElevenLabs charges per-character for TTS and per-minute for conversational AI sessions. At meaningful volume, it becomes a significant line item. We manage costs through session-level controls (hard time limits on practice sessions, caching pre-meeting briefs that don’t need real-time generation) and by routing non-voice interactions through our text-based chat agents, which are orders of magnitude cheaper per interaction. The rule of thumb: voice for high-value, time-sensitive interactions (pre-meeting briefs, post-call debriefs, live practice). Text for everything else.


Parting Advice

If you’re a founder reading this and feeling overwhelmed, don’t be. You don’t need all of this on day one. You need Postgres, a web framework, an auth provider, and a way to deploy. Everything else can wait until the pain is real.

We built AmpUp because we believe salespeople deserve better tools than the ones they’ve been given: tools that talk to them like a strategist and coach would, not tools that make them click through ten screens to log a note. The closed loop, from data ingestion through enrichment to intelligent retrieval, is what makes that possible. Without it, you’re just another chatbot wrapper. With it, you’re delivering answers that would take a human analyst hours to produce.

Between Part 1 and Part 2, we’ve covered the complete system: from the cloud infrastructure and deployment pipelines that keep it running, to the intelligence layer that makes it useful. The foundation (GKE, Terraform, Argo CD, Inngest, Postgres) gives us reliability and velocity. The product core (Ampersand for data ingestion, ElevenLabs for voice with client-actions, Claude Agent SDK for intelligent analysis, Daytona for safe code execution) gives us the capabilities that users actually experience. Neither layer works without the other.

For the full story on two of our deepest technical decisions, see the companion deep-dives linked throughout this post.

The value of sharing this isn’t that you should copy it verbatim. It’s that you can see how one team resolved a hundred small decisions that would otherwise eat your week. Steal the decisions that fit. Ignore the ones that don’t. And when you find yourself debating the same question for the third sprint in a row, just pick something and ship.

The best architecture is the one your team can actually build on.

The Complete AmpUp Stack

← Back to Part 1: The Infrastructure


Part 2 of 2. Written by Rahul Balakavi, for founders who’ve been there. Share it forward.