Jonas Helming, Maximilian Koegel and Philip Langer co-lead EclipseSource, specializing in consulting and engineering innovative, customized tools and IDEs, with a strong …
AI Voice and Interaction Agents in Production: 6 Lessons from the Field
February 26, 2026 | 16 min Read

If you know EclipseSource, you probably know us for developer tools, IDEs, and technical AI topics. What you might not know is that we’ve been building AI voice agents for production use since before it was cool.
We started early. While most of the industry was still experimenting with chatbots, we were already deep into building agents that handle real customer interactions—by phone, chat, e-mail and messaging. The most visible result of this work is MediVoice, an AI-powered phone system now actively used in medical practices across Germany. Patients call, the agent answers, appointments get booked—millions of times, reliably, in production.
In early 2025, we began transferring what we learned in the medical domain to other industries. Today, we partner with software vendors across very different fields, from furniture retail and steel trading to large-goods commerce, helping them extend their existing products with AI-powered interaction capabilities. Our partners bring the domain expertise, software-defined processes, and customer relationships; we bring the hard-won lessons about what actually works when an AI interacts with a customer.
Live demo: A customer calls to file a claim for a furniture order. The AI agent accesses live data and completes the task by capturing the claim in the underlying ERP system (call in German).

This article is different from our usual technical deep-dives on AI coding and tooling. Here, we want to share six lessons we’ve learned from building and deploying voice and interaction agents in production.
Most of our examples come from voice agents—because phone is the hardest channel: real-time, no visual fallback, fully unpredictable callers. But every lesson applies equally to chat, messaging, or any other channel where an AI agent interacts autonomously with end users. If you’re considering adding AI interaction capabilities to your product, or building something similar from scratch, we hope these insights save you some of the detours we took.
What We Mean by “AI Voice and Interaction Agents”
Before diving into the lessons, a clarification: when we talk about AI voice and interaction agents, we do not mean speech recognition systems that transcribe what callers say. We also do not mean smart answering machines or chatbots that record messages and create callback lists for your team.
What we build are fully autonomous agents. When a patient calls to book an appointment, the agent checks availability in the practice management system, applies the practice’s scheduling rules, collects insurance data, conducts an initial anamnesis, and books the slot—all without human involvement. When a customer calls a furniture store to ask about a specific sofa, the agent checks real-time inventory, confirms availability at their nearest location, and reserves it for pickup. Whether the interaction happens by phone, chat, or e-mail—the conversation ends with the task completed, not with a to-do item for staff.

This is a fundamentally different ambition. The user is a human who expects to accomplish something. They don’t want to leave a message and wait for a callback. They want the appointment booked, the information provided, the reservation confirmed. Our agents aim to meet that expectation—to fully and autonomously complete workflows triggered by end customers who behave like humans, because they are humans.
This distinction matters because the lessons that follow all stem from this goal: not just answering calls, but actually finishing what users started.
Lesson 1: Task Completion Matters – And It Matters the Most
When end users interact with AI agents, especially voice agents, you cannot expect them to adapt their behavior. They won’t speak more clearly, avoid interrupting themselves, or structure their requests in ways that are easy to understand. They will mumble, change their mind mid-sentence, have dogs barking in the background, provide information in no particular order, or simply hang up. What users expect is a human-like interaction, one that just works.
Being natural, friendly, or sounding human is only one part of the equation. The more important part is task completion.
In practice, task completion means that the agent is actually capable of completing the expected workflow (booking an appointment, retrieving specific information, making a reservation) reliably, thousands of times an hour. That requires understanding what the caller wants, knowing which information is needed to address the request, and executing the right actions to finish the task. Any shortcoming in any of these steps compounds throughout the call, dragging overall task completion rates down significantly.
When a user calls, they expect to finish what they started. The agent only provides value if it can reliably complete these workflows. Reliability here means very high task completion rates, typically close to 90%.
Such high completion rates are usually only possible because the agent knows when not to try. When a caller’s request exceeds what the agent can handle—a complex complaint, an unusual edge case, an emotionally charged situation—the right response is a smooth handoff to a human colleague with full context preserved. A well-executed escalation is not a failure; it is a successfully completed task.
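To make that framing concrete, here is a minimal Python sketch of counting a clean escalation as a success. Names like `Outcome`, `HandoffContext`, and `counts_as_success` are illustrative, not our actual implementation:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Outcome(Enum):
    COMPLETED = auto()   # workflow finished autonomously
    ESCALATED = auto()   # clean handoff to a human, context preserved
    ABANDONED = auto()   # caller gave up before any resolution

@dataclass
class HandoffContext:
    """What a human colleague needs to continue without the caller starting over."""
    caller_id: str
    intent: str
    collected_fields: dict = field(default_factory=dict)
    summary: str = ""

def counts_as_success(outcome: Outcome) -> bool:
    # A well-executed escalation is a successfully completed task,
    # not a failure, so it counts toward the completion metric.
    return outcome in (Outcome.COMPLETED, Outcome.ESCALATED)
```

The point of the `HandoffContext` shape is that an escalation only counts as success if everything the agent already collected travels with the call.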

It is important to note that “task completion” is also not the same as “user success.” For example, a patient may call to book an appointment, but no suitable slot is available. The patient didn’t get what they wanted—but the agent still completed its task correctly: it checked availability, communicated the situation, and offered alternatives. From the agent’s perspective, this is a completed task. What we optimize for is whether the agent reliably finishes the workflow it was designed to handle—not whether the outcome is always what the caller hoped for.
Even humans do not reach 100% task completion in such interactions. But callers are more forgiving of humans, because they are used to interacting with them. If the task completion rates of AI agents are significantly lower, the consequences are severe: users start complaining, frustration increases, additional manual work is created around the AI agent, and the AI agent no longer fulfills its original purpose.
Many vendors in this space advertise thousands of available voices, ultra-natural sound, and low latency. These features make for impressive demos. But we’ve seen it repeatedly: companies choose a solution based on how good it sounds, roll it out, and then revisit that decision a few months later when they realize the agent—despite sounding great—simply doesn’t complete enough calls successfully. The shiny features don’t matter if callers hang up frustrated.
Key insight: If you have to choose between a beautiful voice with a very natural conversation flow and a few percentage points of additional task completion, task completion always wins in practice.
Lesson 2: Building Truly Reliable Agents Is Still Much Harder Than It Looks
If you’ve used ChatGPT, Claude, or any modern LLM in voice mode, you know they can hold remarkably fluent conversations. This creates a perception that building AI voice agents is essentially a solved problem—just connect an LLM to a phone line and you’re done.
The data tells a different story.
According to Menlo Ventures’ “2025 State of Generative AI in the Enterprise” report, only 16% of enterprise AI deployments qualify as true agents—most are fixed-sequence workflows. NLW’s AI ROI Benchmarking Study found similar numbers: just 14% of use cases fall into the “agentic” category of autonomous work execution, while 57% remain in “assisted” mode.
This is surprising, because agentic workflows obviously provide the biggest benefit. If AI can fully complete a task, the ROI is clear. So why do so few deployments reach this level?
Because autonomous agents are what Germans call the Königsdisziplin, the supreme discipline. The Pareto principle doesn’t apply here. The first 80% of perceived capability comes at almost no cost: your agent can hold a conversation, understand intent, even sound natural. But the last 20% is where most projects stall or fail.

Consider what happens in a real call: A patient asks about appointment availability for next Tuesday. Mid-sentence, they remember they switched insurance providers and ask whether the new one is accepted. Then they want to go back to booking, but now for Wednesday instead—and by the way, can they also get a prescription refill during the same visit? The agent has to track all of those intents without losing the relevant context for each of them, and successfully complete each of those separate tasks, while consistently applying the practice’s scheduling rules for each concern. Now imagine these challenges across tens of thousands of calls, each with its own unexpected turns.
Handling these combined and shifting intents, maintaining coherence across longer conversations, enforcing business rules even when the caller’s requests conflict with them, connecting securely to legacy systems, recovering gracefully when something goes wrong—this is where the real engineering lives.
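One way to keep such shifting concerns from clobbering each other is to track every open intent with its own slot values, so returning to an earlier topic reuses what was already collected. A simplified sketch in Python; the names are hypothetical, not our production code:

```python
from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str
    slots: dict = field(default_factory=dict)

class ConversationState:
    """Keeps every open intent with its own slot values, so a topic
    switch mid-sentence never discards what was already collected."""

    def __init__(self):
        self._intents: dict[str, Intent] = {}

    def update(self, name: str, **slots) -> Intent:
        # Returning to an earlier concern reuses its collected slots;
        # a new concern opens a fresh intent alongside the others.
        intent = self._intents.setdefault(name, Intent(name))
        intent.slots.update(slots)
        return intent

    def open_intents(self) -> list[str]:
        return list(self._intents)
```

In the patient example above, switching to the insurance question and back to booking (now for Wednesday) leaves the booking intent intact, with only the day updated.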
Key insight: Don’t let the fluency of modern LLMs fool you. The gap between “can have a conversation” and “can reliably complete a workflow” is vastly bigger than you would expect, and it’s where most projects stall.
Lesson 3: Own the Full Stack
Many off-the-shelf voice agent platforms suggest that end users can simply tweak a prompt and everything will work. In our experience, this doesn’t hold up in real, domain-specific use cases.
Reaching reliable autonomy and high task completion requires far more control. Any tiny variation in a prompt can destroy carefully tuned behavior. Guarding an agent and making it highly reliable is not something you achieve by editing a text field in a web interface.
Real-world agents don’t just have to understand simple words. They have to parse complex inputs: prescription names with unusual spellings, booking constraints with multiple dependencies, product codes from legacy catalogs. And they have to connect to arbitrary domain-specific infrastructure—sometimes systems that are decades old.
Here’s the core challenge: LLMs are non-deterministic. The caller on the other end of the line is also unpredictable. To build something reliable out of two sources of uncertainty, you need to control the harness around the LLM as tightly as possible—the speech recognition, the conversation flow, the validation logic, the backend integrations, the fallback behavior.
This is also why building on top of blackbox solutions carries hidden risk. When your agent sits on someone else’s stack, the provider can change parameters or switch LLM versions without your consent—or worse, without notice. You wake up to a massively dropping task completion rate and no idea why. And in regulated environments—healthcare, finance, any context where GDPR (German “DSGVO”) applies—you may not even be able to answer basic compliance questions: Where is the data processed? Who has access? How long is it retained? Control isn’t just about customization or stability—it’s about accountability.

Ownership means you can chase a high task completion rate relentlessly while staying compliant. When callers mispronounce a product name, you tune speech recognition. When a critical piece of information keeps getting missed, you adjust the conversation flow. When success rates dip, you trace the problem end-to-end and fix it today. And when the auditor asks where patient data goes, you have an answer.
We use off-the-shelf components where they fit. But we build on an open, flexible architecture that combines the best available parts, bound together in ways we fully control.
Key insight: Owning the stack isn’t about reinventing wheels—it’s about having the power to do the job right when the defaults don’t cut it. Remember: reliable task completion wins. You can’t optimize what you don’t control.
Lesson 4: Non-Functional and Administrative Capabilities Are Critical
Non-functional requirements and administrative use cases around AI voice agents are not optional—they are essential.
This includes the ability to monitor the system, see what kinds of calls were made, measure task completion rates, and detect systematic errors. You need to understand how users interact with the system, how they feel about it, where workflows break, and where optimizations are needed.
In conversational agent systems, the key metrics to track go well beyond a single number. They include: task completion rate, business-rule compliance rate, containment rate (interactions resolved without a human), handoff rate (interactions that require a human), first-call resolution rate, recovery rate after misunderstandings, and direct hangup rate. Together, these metrics paint a complete picture of agent performance and guide targeted optimization.
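Computed from structured call records, these rates are straightforward; the hard part is capturing the records reliably at scale. A minimal illustration in Python, with a deliberately simplified `CallRecord` covering only a few of the metrics listed above:

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    completed: bool       # the agent finished its workflow
    handed_off: bool      # a human colleague took over
    hung_up_early: bool   # the caller dropped before any resolution

def call_metrics(calls: list[CallRecord]) -> dict[str, float]:
    """Aggregate a batch of call records into the headline rates."""
    n = len(calls)
    return {
        "task_completion_rate": sum(c.completed for c in calls) / n,
        "containment_rate": sum(not c.handed_off and not c.hung_up_early
                                for c in calls) / n,
        "handoff_rate": sum(c.handed_off for c in calls) / n,
        "direct_hangup_rate": sum(c.hung_up_early for c in calls) / n,
    }
```

A real pipeline would derive these records from transcripts and tool-call logs rather than hand-labeled booleans, but the aggregation layer looks much like this.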
This requires proper infrastructure. It is not sufficient to just look at chat or call logs. If the system works correctly and is used in production, the sheer volume of conversations makes manual inspection and analysis completely impractical.
A typical example: a new product launches, and callers start asking for it. But the agent struggles—either because the term is new to the LLM, or because callers mispronounce it in unexpected ways. Without proper monitoring, you won’t even know this is happening. Calls fail, callers get frustrated, and you only find out when someone complains. With the right analytics (e.g. containment rates, tool-call failure rates, hang-up rates, ASR confidence scores) you detect the pattern within days, add the term to your agent’s vocabulary, tune the recognition, and deploy—before it becomes a real problem.
A solid test harness is also essential—you can’t improve what you can’t reliably measure. But with non-deterministic LLMs, building such a harness is harder than it sounds. The same input can produce different outputs, so traditional test approaches don’t apply cleanly. You need ways to evaluate behavior at scale, across many variations, without expecting byte-for-byte repeatability.
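A common pattern for this is to run the same scenario many times and assert a behavioral property with a pass-rate threshold, instead of comparing exact output. A toy sketch in Python, with a deterministic stub standing in for the real, non-deterministic agent:

```python
import itertools

def run_eval(scenario, agent, check, runs=20, min_pass_rate=0.9):
    """Run the same scenario many times and require a behavioral
    property to hold often enough, instead of expecting identical output."""
    passes = sum(check(agent(scenario)) for _ in range(runs))
    return passes / runs >= min_pass_rate

# Stand-in for a real voice agent: the wording varies between
# runs, the behavior should not.
_phrasings = itertools.cycle([
    "Your appointment is booked for Tuesday at 10:00.",
    "All set, the appointment is confirmed for Tuesday, 10:00.",
    "Booked: Tuesday, 10:00.",
])

def stub_agent(scenario: str) -> str:
    return next(_phrasings)

def confirms_booking(reply: str) -> bool:
    # Behavioral property: day and time are confirmed, wording is free.
    return "Tuesday" in reply and "10:00" in reply
```

In practice the `check` function is often itself an LLM-based judge or a structured comparison against the backend state, but the threshold-over-many-runs shape stays the same.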
Beyond monitoring and testing, you also need to be able to react quickly to changes, deploy new agents, and update existing agents.
Additionally, AI agents typically do not replace humans completely. They operate in parallel with human staff. This introduces orchestration challenges: deciding which calls go to agents and which to humans, defining when humans take over, and controlling how many interactions are handled by agents.
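One simple way to control how many interactions the agent handles is deterministic percentage routing on the caller ID, sketched below. This is a simplification; real routing also weighs intent, opening hours, and current staff load:

```python
import hashlib

def route_call(caller_id: str, agent_share: float) -> str:
    """Deterministically assign a stable share of calls to the AI agent.

    Hashing the caller ID keeps routing consistent per caller while
    the rollout percentage (agent_share, 0.0 to 1.0) is ramped up."""
    bucket = int(hashlib.sha256(caller_id.encode()).hexdigest(), 16) % 100
    return "agent" if bucket < agent_share * 100 else "human"
```

Because the bucket is derived from the ID rather than drawn at random, a given caller always lands on the same side of the split at a given rollout percentage.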
Key insight: The agent itself is only half the product. You need infrastructure that lets you detect problems before your users complain, a test harness that lets you deploy fixes without breaking what already works, and the operational tooling to react within hours, not months. In production, speed of iteration is a competitive advantage.
Lesson 5: Deep Integration with Existing Systems Unlocks Real Value
There are many companies on the market offering generic voice agents, sometimes even domain-specific agents. These vendors try to sell the same solution to as many companies as possible.
A common limitation of these solutions is that they do not integrate deeply with existing systems. The reason is that the landscape of existing systems is extremely diverse.
As a result, many of these solutions bring their own data storage, can capture intents, can extract structured data—but they often cannot fully close a case.
Example: A user wants to make a reservation. The agent captures the reservation details in a structured way. But it is not connected to the actual reservation system. In many companies, these systems are 20 or even 40 years old, written in technologies like RPG, running on AS/400 systems.

Without integration, humans still need to do manual work. Someone has to connect the extracted data to the legacy system. If something is unavailable, someone needs to call the customer back.
Integration is also more than just generating an API wrapper or spinning up an MCP server. It requires deeply understanding the existing system and processes—often complex, underdocumented, built over decades—and designing an interface that is well-suited for an autonomous agent. What data does the agent need? In what form? What operations should it be allowed to perform? Getting this right is design work, not just plumbing.
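As an illustration of that design work, an agent-facing interface is typically a narrow, validated facade over the legacy system: a handful of permitted operations, with every input checked before it touches the backend. A Python sketch; `ReservationTool` and the backend methods are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Reservation:
    product_code: str
    quantity: int
    pickup_date: date

class ReservationTool:
    """A deliberately narrow interface the agent is allowed to use:
    read availability, create a reservation, nothing else."""

    def __init__(self, backend):
        # backend wraps the (hypothetical) legacy system, e.g. an
        # RPG application on AS/400, behind two methods:
        # stock(code) -> int and create_reservation(r) -> str.
        self._backend = backend

    def check_availability(self, product_code: str) -> int:
        if not product_code.isalnum():
            raise ValueError("malformed product code")
        return self._backend.stock(product_code)

    def reserve(self, r: Reservation) -> str:
        # Validate before anything touches the legacy backend.
        if r.quantity <= 0:
            raise ValueError("quantity must be positive")
        if r.pickup_date < date.today():
            raise ValueError("pickup date is in the past")
        if self._backend.stock(r.product_code) < r.quantity:
            raise ValueError("not enough stock")
        return self._backend.create_reservation(r)
```

Deciding what belongs in that facade (which operations, which invariants, which data shapes) is exactly the design work the paragraph above describes; the code itself is the easy part.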
Deep integration also enables seamless collaboration between agents and humans. Whether a case requires escalation or is simply designed to involve a human at a certain point, the human colleague picks up in the same system, sees the same data, and continues where the agent left off—no friction, no context lost. Without integration, every handoff forces the caller to start over, turning a smooth transition into a frustrating experience.
Key insight: The moment agents are deeply integrated with real systems and workflows, the true benefit starts. Deep integration enables end-to-end coverage of use cases, reading authoritative data directly from existing systems, and writing data back where allowed. This allows agents to fully close cases instead of just creating to-do lists for humans.
Lesson 6: Chase Real Use Cases, Not the Obvious Ones
It is essential to focus on real, high-value use cases. Many companies instinctively start with support agents and question-answering systems. This feels natural—and these use cases can add value when they genuinely reduce load on human staff.
But there’s a trap. If the agent can capture intent but not complete the workflow, humans still have to process the request afterwards. Worse, they often have to call back anyway because the agent didn’t collect all the relevant information. The promised benefit evaporates, and you’ve added a new step instead of removing one.
People usually call when they want to do something they cannot easily do online: no online account exists, logging in would be too cumbersome, or live data is required—or simply because they prefer live interaction. Examples include booking complex appointments, checking real-time availability, or reserving goods or services. These use cases are more difficult to implement and more interaction-heavy. But they also generate many calls, keep human staff busy, and provide significantly more value when fully automated.
Key insight: Instead of chasing obvious use cases, chase the ones that the agent can fully complete. With clients, we start by evaluating use cases along two dimensions: achieved benefit and feasibility. Feasibility is often higher than people expect—especially once you’ve invested in the hard parts and built up real experience with domain-specific agents and interaction-heavy workflows. And once your stack is in place for the hard use cases, the simple ones—often with high volume—come almost for free.
Where to Go from Here
Building AI voice and interaction agents that actually work in production is hard. It requires reliable task completion over polish, control over convenience, deep integration over quick demos, and a focus on use cases that matter. These lessons took us years to learn—through countless calls, failed experiments, and gradual refinement across multiple domains. We share them because we wish someone had told us earlier—and because they shaped how we work with partners today.
If you’re considering adding AI interaction capabilities to your product, we’d be happy to talk.

Every domain—healthcare, retail, trading, logistics—has its own systems, rules, workflows, and customer expectations that only people in that industry truly understand. Building reliable AI interaction agents, on the other hand, requires deep specialized expertise that takes years to accumulate. Our partner model brings both sides together.
You bring the domain expertise, the customer relationships, and the existing software-defined processes your customers already rely on. We bring the production experience, the architectural knowledge, and the engineering required to make agents that reliably complete real tasks. Depending on your situation, that can mean building on our AI Interaction Platform, or helping you select and assemble the right components for your domain-specific context. Either way, agents run inside your product, under your brand, connected to your infrastructure—delivered to your customers as a natural extension of what they already use.
We’re already doing this across industries, from medical practices to furniture retail to steel trading. The domains differ. The pattern works. A typical first step is a short remote workshop to identify high-value use cases, followed by a working demo you can show to pilot customers.
If you’d like to explore what’s possible for your product, get in touch.
→ AI Customer Interaction Solutions — Full overview of our interaction services, partnership model, and getting started options
→ AI Interaction Platform (AIIP) — The battle-tested technology foundation behind many of our interaction agents
Stay Updated with Our Latest Articles
Want to ensure you get notifications for all our new blog posts? Follow us on LinkedIn and turn on notifications:
- Go to the EclipseSource LinkedIn page and click "Follow"
- Click the bell icon in the top right corner of our page
- Select "All posts" instead of the default setting