Six weeks, four platforms, 50 Australian accent test calls. Here's the honest benchmark behind why we run Retell and Twilio, and where it still struggles.
We evaluated Bland AI, Vapi, Retell, and Air AI over six weeks across latency, Australian accent recognition, and escalation reliability. Retell with Twilio won: sub-600ms response, best regional accent performance, and the webhook reliability our SLAs demand.
Picture this. A bloke rings up from a farm outside Mildura at 6:47am on a Tuesday. Broad accent, spotty reception, background noise of a diesel pump. He needs an emergency plumber. He's already tried three numbers. This is our internal demo call. This is the one we kept running every voice AI platform through, because if the stack can handle Kev-from-the-farm, it can handle a specialist clinic in South Yarra at lunchtime.
The first platform we tried politely told Kev "I didn't catch that" six times before hanging up. That was the day I stopped trusting glossy demo videos and started benchmarking seriously. If you're about to build anything serious on top of AI voice agents, here's the work we did so you don't have to. Honest numbers. Real outcomes. The bits where each platform fell over.
Why does voice AI stack choice matter more than most people think?
Most businesses treat voice AI like choosing another SaaS tool. Tick the boxes, pick the cheapest, move on. That's a mistake. The stack underneath determines whether your AI receptionist sounds like a helpful human or a GPS robot that got lost in 2014.
Three things go wrong when you pick the wrong stack. First, latency creeps up. Anything over 900ms between a caller finishing their sentence and your agent replying, and humans start talking over each other. Second, accent handling fails silently — the agent thinks it heard something sensible and barrels on, booking the wrong job. Third, webhooks drop. Your AI agent politely tells a customer "I've booked that in" while your CRM never gets the call. You find out three days later when they ring back angry.
Most AI voice tools are rubbish for Australian accents straight out of the box. The platform choice is what lets you actually fix that instead of living with it.
What were the key criteria we tested across Bland AI, Vapi, and Retell?
We ran each platform through seven tests over six weeks, with $149.95/month credit budgets so we weren't constrained by tier limits:
- Latency — Time from the caller finishing a sentence to the AI's first audible token. Measured with a stopwatch and call recordings, not vendor claims.
- Australian accent recognition — 50 test calls with our internal Aussie panel (Queensland, Victorian, Western Australian, Territorian voices). We measured word error rate.
- Escalation reliability — When the AI needs to hand off to a human, does the warm transfer land every time?
- Webhook delivery — 500 calls triggering a CRM write. How many landed without manual intervention?
- Voice naturalness — Blind A/B tests with 20 listeners. Does it actually sound like a person or a robot reading a script?
- Interruption handling — When a caller talks over the agent, does it recover gracefully or panic?
- Cost per minute at scale — Real pricing under real call volume, not the landing page sticker.
We didn't care about marketing features like "emotional intelligence" or "enterprise-grade". We cared about whether the phone actually worked when a paying customer called at 7am on a Monday.
How did each platform handle broad Australian regional accents in our testing?
Honest results. Word error rate on our 50-call Australian accent panel:
- Retell + Deepgram Nova 2 streaming ASR: 3.1% WER
- Vapi with Deepgram: 4.8% WER
- Bland AI: 11.2% WER (dealbreaker)
- Air AI: 7.4% WER
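A note on scoring, for anyone who wants to reproduce these numbers: WER is the standard word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. Here's a minimal sketch of the scorer in Python; the function and the example transcript are ours for illustration, not from any vendor SDK.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (subs + inserts + deletes) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One hypothetical scored call from the accent panel:
print(word_error_rate(
    "need an emergency plumber out past Mildura",
    "need an emergency number out past mill dura",
))  # 3 errors over 7 reference words -> ~0.43
```

Each of the 50 panel calls gets scored this way against a hand-checked reference transcript; the per-platform figures above are aggregates across those calls.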
Bland AI struggled badly with Queensland and regional WA voices. One tester from Townsville got misheard so often the call devolved into a Monty Python sketch. For a Melbourne-CBD-only clinic you might get away with it. For a national tradie network covering regional callers, you absolutely can't.
Retell won this round because they let us plug in our own ASR (Deepgram Nova 2) and tune the keyword boost for Australian-specific terms — suburb names, trade-specific vocabulary like "RCD", "splitter", "bulk-billed", "after-hours callout". That level of control is the difference between 95% accuracy and 97% accuracy, and that two-point gap is where most production calls live or die.
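To make "keyword boost" concrete: Deepgram's streaming endpoint accepts repeated keywords query parameters with optional intensifiers (term:boost). Here's a sketch of the URL construction, assuming Deepgram's documented parameter names; the term list and boost values are illustrative picks, not tuned recommendations.

```python
from urllib.parse import urlencode

# Australian-specific terms a stock model tends to mishear. Boost values
# are illustrative guesses, not Deepgram's recommendations.
KEYWORD_BOOSTS = [
    "RCD:2", "splitter:2", "bulk-billed:2", "callout:2",
    "Mildura:2", "Toowoomba:2", "Fremantle:2", "Pakenham:2",
]

params = urlencode(
    {
        "model": "nova-2",
        "language": "en-AU",
        "punctuate": "true",
        "interim_results": "true",
        "keywords": KEYWORD_BOOSTS,  # doseq=True emits one keywords= param per term
    },
    doseq=True,
)
DEEPGRAM_STREAM_URL = f"wss://api.deepgram.com/v1/listen?{params}"
```

The win isn't the boosting itself; it's that Retell let us own this layer at all, so the term list can be iterated per client vertical.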
What latency thresholds separate a natural conversation from an awkward one?
Our rule of thumb, measured across hundreds of real calls:
- Under 600ms: Feels like talking to a person. Callers don't even notice it's AI for the first 30 seconds.
- 600-900ms: Noticeable but tolerable. Like a slightly sleepy receptionist.
- 900-1,500ms: Awkward. Callers start repeating themselves.
- Over 1,500ms: Fails. Callers assume the line dropped and either hang up or start talking over the agent.
Measured first-token latency on our test harness (median / p95):
- Retell + Twilio Media Streams: 540ms / 720ms
- Vapi: 710ms / 1,100ms
- Bland AI: 680ms / 980ms
- Air AI: 890ms / 1,400ms
Vapi and Retell are close on the median, but the p95 tail is where Vapi lost us. One in twenty calls hitting 1.1 seconds means five calls out of every hundred feel awkward. That's real money when you're handling hundreds of calls a day.
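If you want to rebuild the harness: we take latency off the recordings themselves, not vendor dashboards. A rough sketch of the estimator, assuming a stereo 16-bit PCM recording with the caller on the left channel and the agent on the right; the frame size, RMS floor, and helper name are all illustrative.

```python
import numpy as np
from scipy.io import wavfile

def first_token_latency_ms(path: str, frame_ms: int = 20,
                           rms_floor: float = 0.02) -> float:
    """Estimate caller-stops-talking to agent-first-audible latency from a
    stereo call recording (caller = left channel, agent = right channel).
    Assumes the caller is audible at some point before the agent replies."""
    rate, data = wavfile.read(path)
    audio = data.astype(np.float32) / 32768.0   # 16-bit PCM -> [-1, 1]
    frame = int(rate * frame_ms / 1000)
    n = len(audio) // frame

    def frame_rms(channel):
        return np.sqrt((channel[: n * frame].reshape(n, frame) ** 2).mean(axis=1))

    caller, agent = frame_rms(audio[:, 0]), frame_rms(audio[:, 1])
    agent_start = int(np.argmax(agent > rms_floor))  # first audible agent frame
    caller_end = int(np.nonzero(caller[:agent_start] > rms_floor)[0][-1])
    return (agent_start - caller_end) * frame_ms

# The table above is then just percentiles over the per-call estimates:
# latencies = [first_token_latency_ms(p) for p in recording_paths]
# median, p95 = np.percentile(latencies, [50, 95])
```

The stopwatch sanity-checks the odd call; the recordings give you the distribution.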
The data and proof layer
Since we picked Retell with Twilio in November 2024, we've handled roughly 14,200 calls across nine live clients. The numbers we track every week:
- Answer rate: 99.1% (2 recorded outages, both on the Twilio side, 11 minutes total)
- Webhook delivery: 99.97% (we retry twice on failure before paging a human)
- Escalation success: 98.4% of warm transfers reach the right human first time
- Booking accuracy: 96.8% of bookings land on the correct service and time slot
For context, the missed call problem in Australian small businesses sits around 30-45% for after-hours calls without automation. Our live clients are now running at 4-6% missed. That's the gap an AI voice stack closes when the plumbing underneath actually works. It's the same 2026 market shift in AI receptionists that's pushing more Australian operators to stop treating voice as an afterthought.
Honest limitations
Here's where the stack still struggles, because I'd rather tell you upfront than have you find out on a live call with a paying customer.
Very noisy environments — cafes at peak, construction sites, motorbikes at idle, strong wind. Deepgram's noise rejection is good but not perfect. We wire in a "couldn't hear you clearly, is there a better number to try?" fallback for this (sketched in code below).
Rapid code-switching — callers switching between English and Mandarin or Vietnamese mid-sentence throw the transcription model off. Handling this properly needs a multilingual ASR and a routing layer we haven't shipped yet.
Legacy PBX integrations — if the client runs an on-prem Avaya system with no SIP trunk, we can't plug in without a physical box on site. About one in five enquiries hit this wall in the first conversation.
The first 48 hours of a new deployment — booking flow edge cases we didn't predict always show up in real traffic. We don't pretend go-live is the end. We monitor every call for the first two days and hand-tune prompts in near real time.
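For the noisy-line fallback mentioned in the first limitation above, the logic is deliberately simple turn-level routing rather than anything exotic. A sketch of the shape, with handle_turn standing in for the normal dialogue path; the names, thresholds, and wording are illustrative, not Retell's API.

```python
FALLBACK = "Sorry, I couldn't hear you clearly. Is there a better number to try you on?"

def handle_turn(transcript: str) -> str:
    """Stand-in for the real dialogue logic (booking flow, CRM lookups, etc.)."""
    return f"Got it: {transcript}"

def next_action(transcript: str, confidence: float, streak: int,
                threshold: float = 0.6, max_streak: int = 2) -> tuple[str, int]:
    """Route one caller turn; returns (reply, updated low-confidence streak)."""
    if confidence < threshold:
        if streak + 1 >= max_streak:
            return FALLBACK, 0         # bail out rather than guess at a booking
        return "Sorry, could you say that again?", streak + 1
    return handle_turn(transcript), 0  # confident transcript: proceed normally
```

The point is to fail loudly and early: two muddy transcripts in a row and the agent offers a callback instead of booking the wrong job.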
If a vendor tells you none of these issues exist on their platform, they're either lying to you or they haven't run 10,000 real production calls yet. Neither is who you want holding your phone line.
FAQ
Can you share the raw benchmark data from your stack comparison? The short version is in the numbers above. If you're an existing or prospective client and want the full 400-row spreadsheet — all 50 accent test calls scored individually, latency histograms per vendor, failure logs with call IDs — email me and I'll send it. No NDA, no gatekeeping.
Why not build your own voice stack from scratch? Because streaming TTS that sounds natural is a full-time engineering team for a year, and we'd rather focus on the layer that actually matters for our clients: the prompt logic, the CRM wiring, the business rules. Retell handles the hard telephony and streaming audio. We handle the bit where the AI knows what a Tuesday afternoon appointment means for a specific plumber in Geelong.
How often do you re-evaluate the stack as the market evolves? Every six months we re-run the benchmark against new entrants. The market is moving fast — Retell raised $22M in late 2024, Vapi is shipping weekly, and we expect at least one new serious contender in 2026. We're not married to any vendor. The day another platform beats Retell on our seven criteria, we'll migrate.
What is the biggest technical risk in running AI voice agents at scale? Silent failures. Not outages — those you see on a dashboard. The quiet ones are when the AI thinks it booked the appointment, the caller thinks the appointment is booked, and the CRM never got the webhook. We mitigate with synthetic test calls every 15 minutes, webhook retries with exponential backoff, and daily reconciliation between Retell call logs and the CRM.
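The backoff piece, as a minimal sketch; the endpoint, payload, and the three-attempt budget (one try plus the two retries mentioned earlier) are placeholders for whatever your CRM and paging setup look like.

```python
import time
import requests

def deliver_webhook(url: str, payload: dict,
                    attempts: int = 3, base_delay: float = 2.0) -> bool:
    """POST a CRM write with exponential backoff. Returns False instead of
    raising, so the caller can page a human and queue reconciliation
    rather than letting the failure stay silent."""
    for attempt in range(attempts):
        try:
            resp = requests.post(url, json=payload, timeout=10)
            if resp.ok:
                return True
        except requests.RequestException:
            pass                                 # network errors count as failed attempts
        time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s between tries
    return False  # surfaced to the caller: this is the silent-failure guard
```

Daily reconciliation then diffs Retell call logs against CRM records, so anything that slips past both the retries and the pager still shows up within a day.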
Does the platform choice affect the quality of Australian accent recognition? Yes, massively. The ASR model matters more than the voice AI orchestration layer. Retell let us swap in Deepgram Nova 2 with custom keyword boosting, which dropped our WER from about 5% to about 3%. On a platform that locks you into their bundled ASR, you're stuck with whatever accuracy they ship. For Australian regional accents, that's usually not good enough.
Book 30 minutes with me
Book 30 minutes with me and I'll tell you honestly if this makes sense for your business. theautomate.io
Written by Syed Bilgrami
Founder of TheAutomate.io — building AI voice agents for Australian businesses