Ermir Beqiraj
Backend architect. Systems, agents, infrastructure — from inside the work.

A few times a week, a big company calls me with a robot. It starts talking and just keeps reading its menu, “Press 1 for this, press 2 for that,” in a way that makes me want to commit violence. And these are companies with budgets, with teams, with people whose actual job is to make this not feel exactly like it does.

So a couple of months back, Mistral shipped Voxtral, their speech-to-text (STT) and text-to-speech (TTS) models, all in one ecosystem, from a company I already trusted. I wanted to try it out. It took me a while to get to it (some interesting stuff at work has been keeping me busy), but I finally caught a break, and there I was: a man with a plan. I was going to build a voice agent that picks up the phone for me when I don’t want to do it myself.

The premise is simple: a bot that picks up the phone, lets the other side know it’s working on my behalf, takes a message, and texts it to me.

How hard could it be, right?

Fred Brooks had my number fifty years ago: all programmers are optimists.

I should also say up front: I’m not a voice expert. No DSP background, no speech-recognition pedigree. What follows is common sense and curiosity bouncing off a problem that turned out to be much deeper than it looks, and interesting enough to write about.

First, the boring walls

The idea was this: a call comes in, and I decide whether I want to answer. If I don’t, it goes to voicemail, except the voicemail is a handmade, AI-powered bot tailored exactly how I like it.

For the call transfer to happen, you need a number on the other side. I was a bit disappointed to find Twilio numbers priced at absurd levels, but I found a nice $2/mo number at Telnyx. And because the iPhone treats calls as sacred and won’t let you touch them, I ended up using plain GSM forwarding codes, which worked like a charm.

With the wiring out of the way, it was time for the cool stuff.

The initial idea

Here’s how I drew it on the first day.

There’s a caller, and there’s the agent (running on my homeserver). The agent is a little state machine with three states: it’s listening, thinking, or speaking. The caller talks (agent listens), it figures out a reply (thinks), it says the reply (speaks), and goes back to listening. A clean little loop.
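As a rough sketch of that first-day diagram (the names `State`, `NEXT_STATE`, and `run_turns` are made up for illustration and are not taken from the project), the loop looks something like this:

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()


# The only transitions the first-day diagram allows: one state at a time.
NEXT_STATE = {
    State.LISTENING: State.THINKING,   # caller finished, work out a reply
    State.THINKING: State.SPEAKING,    # reply ready, say it
    State.SPEAKING: State.LISTENING,   # done talking, listen again
}


def run_turns(turns: int) -> list[State]:
    """Walk the loop for a number of full turns and return the states visited."""
    state = State.LISTENING
    visited = [state]
    for _ in range(turns * len(NEXT_STATE)):
        state = NEXT_STATE[state]
        visited.append(state)
    return visited
```

Note that nothing in the transition table lets audio arrive while the state is `SPEAKING`.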

It was also the root mistake. I just didn’t know it yet. It would take me most of the project to see why.

The model that spoke Spanish, and a little Portuguese

The first crack showed up the instant I tested in Italian.

I started with the realtime streaming transcription: audio goes in over a live socket, text comes out word by word, very fast. Except sometimes it came out in the wrong language. Under a bit of background noise it would throw in words in Spanish, or even Portuguese, depending on how noisy things got. I’d happily have sacrificed multilingual support for correctness, but this model had no way to lock the language. It worked well on clean voice, and a phone call is anything but, so it became clear this wasn’t the one.

So I switched to the second transcription model, the upload one. It works like this: you upload an audio file, it transcribes it, and returns the result. This one does take a language parameter, so I locked it to Italian.

Night and day. The drift was gone, the transcripts got sharp. The original problem was solved, but I’d created a new one, and soon I realized this was going to be a pattern.

Now I had to wait: collect the audio, send it, wait for the round trip, think, then speak.

Sound is not speech

With the upload model, I was the one who had to decide when the caller had finished, so I’d know when to send the clip. My first instinct was the obvious one: listen for sound. When the line goes quiet, they’re done.

This was easy. For about a day. The library would happily tell me when there was sound and when there wasn’t, so the implementation was trivial. Except it created a deeper problem.

Because “there is sound” and “a person is talking” are not the same thing, and the gap between them is enormous. A running engine is loud. A café is loud. So the agent kept waiting for a silence that, in real conditions, would never come, even when there was nothing to transcribe.
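To make the failure concrete, here is a minimal sketch of loudness-based end-of-speech detection. The function names and the 0.05 threshold are assumptions chosen for the example, and samples are assumed to be floats between -1.0 and 1.0; this is not the library the project used.

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square level of one audio frame (samples in -1.0..1.0)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_loud(frame: list[float], threshold: float = 0.05) -> bool:
    """Naive 'someone is talking' check: it only measures energy."""
    return rms(frame) >= threshold


def caller_finished(frames: list[list[float]], quiet_frames_needed: int) -> bool:
    """True once the most recent frames have all been below the threshold."""
    if len(frames) < quiet_frames_needed:
        return False
    return not any(is_loud(f) for f in frames[-quiet_frames_needed:])
```

A steady engine hum keeps every frame above the threshold, so `caller_finished` never returns True even though nobody is speaking.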

The fix was to stop measuring loudness and start recognizing voice. There’s a tiny neural network called Silero: just a few megabytes, runs locally, trained on millions of speech and not-speech samples until it learned the shape of a human voice. You feed it audio, it gives you back the probability that a person is talking in there.

I dropped it in, and with engine noise at twice the loudness of my own speech, I got zero false triggers.
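A sketch of what gating on voice probability instead of volume could look like. `speech_prob` is a hypothetical callable standing in for the detector; it does not reproduce Silero's actual API. The two thresholds and the hysteresis between them are assumptions added for the example, not something the post describes.

```python
from typing import Callable, Iterable

Frame = list[float]


def speech_flags(
    frames: Iterable[Frame],
    speech_prob: Callable[[Frame], float],  # hypothetical: wraps the voice detector
    start_at: float = 0.6,  # assumed threshold to enter "speech"
    stop_at: float = 0.4,   # assumed lower threshold to leave it (hysteresis)
) -> list[bool]:
    """Label each frame as speech or not, based on voice probability, not volume."""
    flags = []
    speaking = False
    for frame in frames:
        p = speech_prob(frame)
        if speaking:
            speaking = p >= stop_at
        else:
            speaking = p >= start_at
        flags.append(speaking)
    return flags
```

With this shape, a loud frame that the detector scores at a low probability stays flagged as non-speech, however high its volume.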

This mattered more than I expected, because it quietly unlocked the realization that the whole architecture was wrong.

I was optimizing the wrong thing

The little loop (listening, thinking, speaking) was the actual problem.

Real conversations don’t work that way. When humans talk, all three happen at once, constantly. You’re talking while I’m reacting; I start to answer before you’ve finished; you cut back in to correct me. And my little state machine, by being in exactly one state at a time, was built to throw away anything that didn’t fit the current moment.

Concretely: while the bot was speaking, it wasn’t listening, so anything you said to interrupt it landed nowhere.

The fix was to never mute the mic. Not while thinking, not while speaking, not ever. Audio flows in continuously, the whole call. The bot is always listening, and the only thing allowed to stop is its mouth.

And that flips the priority, which, I realized, is the thing I actually care about. The caller dictates who’s talking. The instant the caller speaks, the bot stops, because its job is to take the message, not hold the floor. That’s the exact instinct the menu-tree robots have backwards: they’re built around what the system wants to say. This one is built around the caller saying their piece and being heard.
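A toy model of that rule, assuming a hypothetical `CallSession` class: incoming audio is handled on every chunk no matter what the bot is doing, and caller speech is the one thing that stops playback.

```python
class CallSession:
    """Toy model of one call: the microphone is never muted, only playback stops."""

    def __init__(self) -> None:
        self.bot_speaking = False
        self.heard: list[str] = []   # everything the caller said, in order
        self.barge_ins = 0

    def start_speaking(self) -> None:
        self.bot_speaking = True

    def on_caller_audio(self, is_speech: bool, text: str = "") -> None:
        """Called for every incoming chunk, regardless of what the bot is doing."""
        if not is_speech:
            return
        if self.bot_speaking:
            # The caller has the floor: stop talking, keep listening.
            self.bot_speaking = False
            self.barge_ins += 1
        self.heard.append(text)
```

The design choice is that there is no code path where caller speech is dropped; the only state that can be switched off is `bot_speaking`.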

Some tidy-up

Sending a clip to be transcribed meant first deciding the clip was over, and my only definition of “over” was silence, a couple of seconds to be safe. So up to this point the flow was: caller stops, bot waits to be sure, then uploads, transcribes, thinks, and finally speaks. It felt off, and it was wrong, because two seconds of quiet is something people do in the middle of a thought, and the bot kept reading those as endings and talking over them to “clarify”.

So I stopped waiting for one big silence to mark the end of everything. I cut the audio at every little pause instead, and sent each fragment off to be transcribed the instant it landed. Now the words kept pace with the talking. By the time the caller actually finished, the text was already there, and the strange lag was gone.
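One way to sketch the cutting step, assuming each audio frame already carries a speech or not-speech flag. The function name, the 20 ms frame size, and the 500 ms default pause are illustrative assumptions, not the project's code.

```python
def split_on_pauses(
    flags: list[bool], frame_ms: int = 20, pause_ms: int = 500
) -> list[tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, one per fragment.

    A fragment closes once `pause_ms` of consecutive non-speech follows it.
    Speech still open when the input ends is not returned, because no pause
    has confirmed it yet.
    """
    pause_frames = max(1, pause_ms // frame_ms)
    fragments = []
    start = None        # index where the current fragment began
    last_speech = None  # index of the most recent speech frame
    for i, is_speech in enumerate(flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= pause_frames:
            fragments.append((start, last_speech + 1))
            start = None
    return fragments
```

Each returned range can be sent for transcription as soon as it closes, so the text accumulates while the caller is still talking.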

And, as had become the law of this project, the fix didn’t kill the problem so much as move it. Before, I had one decision: has the caller gone quiet? Now I had that same decision after every fragment: was that the end of what they came to say, or just a breath before the next clause? Still, chunking made transcription fast, so I’ll take the win.

The thing that cracked it was deciding to look at the words for turn-taking, not the audio. Hand the running transcript to a model and let it judge. Working out which pause means “go” turned out to be the whole game.

The discovery of silence

This is the part that took the longest and that I’m proudest of, and as you’ll see, silence turns out to be much more than “the quiet part”.

Here’s the trap: a one-second pause after “with whom am I speaking?” and a one-second pause in the middle of “I need three things…” are acoustically identical. One means “your turn now.” The other means “pay attention, more is coming.”

So I stopped trying to decide it from one signal, and started stacking cheap ones that cover for each other.

The voice detector cuts the stream into pieces on every half-second of quiet: phrase-sized fragments. Each piece gets transcribed and appended to a running “story” of what the caller has said so far. Then the story goes to a small, fast language model whose entire job is to answer one question in one word: is this a complete thought that wants a reply, or should I wait for more?

Watch it work across the cases that used to break everything:

  • “So… I wanted to leave a message for Ermir…” [three seconds of silence] “…tomorrow’s meeting got moved to three.” The naive bot answers into the pause. This one says WAIT, holds through the gap, and only replies after the actual end.
  • “I need three things…” [pause] “the quote…” [pause] “the invoice…” [pause] “and the delivery date. Can you pass that along?” WAIT, WAIT, WAIT, COMPLETE. This is the one volume detection fails on hardest: three confident silences that all mean “not done.”
  • “Oh… um… so, who am I speaking with?” COMPLETE, despite opening with pure filler, because it reads the question at the end.

A couple of cheap shortcuts ride on top. If a piece ends in a clear question mark, it’s obviously the bot’s turn to answer. And if the model takes too long to decide, the bot either asks for clarification or steps in to acknowledge.

None of these pieces is clever on its own, but combined, they produced surprisingly good results.
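The stack described in this section can be sketched as a single decision function. `judge` is a hypothetical callable wrapping the small language model, assumed to return "COMPLETE", "WAIT", or None when it did not answer in time. The "ACKNOWLEDGE" outcome is a label invented here for the "step in" case.

```python
from typing import Callable, Optional


def decide_turn(
    story: str,
    judge: Callable[[str], Optional[str]],  # hypothetical: wraps the small model
) -> str:
    """Return "COMPLETE", "WAIT", or "ACKNOWLEDGE" for the running transcript."""
    text = story.strip()
    if not text:
        return "WAIT"
    if text.endswith("?"):
        # Shortcut: a question at the end is the bot's turn, no model call needed.
        return "COMPLETE"
    verdict = judge(text)
    if verdict is None:
        # The model was too slow: step in rather than leave dead air.
        return "ACKNOWLEDGE"
    return "COMPLETE" if verdict.strip().upper() == "COMPLETE" else "WAIT"
```

For example, `decide_turn("Oh… um… so, who am I speaking with?", judge)` returns "COMPLETE" without ever calling `judge`, because the cheap question-mark check fires first.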

People don’t stop mid-word

When you interrupt someone saying “goodbye,” they don’t stop at “go.” They finish the word. Maybe even run to the nearest comma. Humans don’t get cut off mid-syllable, and when I made the assistant hard-stop the instant the caller spoke, it felt jarring in a way you can’t quite name.

My plan was to do it properly: track the words as they go out, and when the caller barges in, stop cleanly at the next word or comma boundary. It turned out harder than I thought, because the model gives you no way to know which words actually went out and which didn’t. So I did a little trick instead: when the caller barges in, the bot’s audio doesn’t stop dead. It fades out. That’s it. It changed the feeling completely, and the conversation suddenly felt far more natural.
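A minimal sketch of such a fade, assuming the queued audio is a list of float samples and using a linear ramp. The function name and the ramp shape are assumptions; the post does not say which curve or duration the project uses.

```python
def fade_out(samples: list[float], fade_len: int) -> list[float]:
    """Linearly ramp the first `fade_len` samples to silence and drop the rest.

    `samples` is the audio that was still queued when the caller barged in.
    """
    n = min(fade_len, len(samples))
    if n <= 0:
        return []
    if n == 1:
        return [0.0]
    # Gain falls from 1.0 at the first sample to 0.0 at the last kept sample.
    return [samples[i] * (1.0 - i / (n - 1)) for i in range(n)]
```

At an 8 kHz telephone sample rate, a `fade_len` of 800 would correspond to a 100 ms fade; both numbers are only there to give a sense of scale.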

What did the caller actually hear?

Here’s another scenario I hadn’t accounted for.

  • The caller talks
  • The bot responds
  • The caller interrupts

All of this went to the model as one full “story” to build a natural conversation from. But when the caller interrupts, they don’t actually hear the rest of what the bot was saying. So the caller’s version of the story diverges from the bot’s.

Here’s how it goes wrong. The model is mid-sentence, recapping what it understood, and the caller cuts in:

Bot: “Perfect, so I’ll tell Ermir the delivery’s confirmed for Tuesday, at…” [interrupted]

Caller: (cutting in) “No wait, Wednesday.”

The mouth stops on time, but look at what the two of them are holding. The caller heard exactly this much (the delivery’s confirmed for Tuesday, at some time he never got to hear) and is already correcting the day to Wednesday:

“…the delivery’s confirmed for Tuesday, at…”

The model, though, wrote the whole intended line into its side of the story, including the part that never left the speaker:

“…the delivery’s confirmed for Tuesday, at nine. Anything else you’d like to add?”

To stop the model from reasoning off a phantom, the fix is simple: only write to the story what was actually streamed before the caller barged in. When the audio is interrupted, I estimate how far into the sentence it got (roughly, from how long it had been streaming) and store just that much, explicitly marked as cut off:

“…the delivery’s confirmed for Tuesday, at… [cut off]”

Now the model has the better picture, and can judge the story from the caller’s side too.
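The estimate can be sketched as a proportional cut by elapsed time. This assumes the total duration of the sentence's audio is known and that words are spread evenly across it, which is only roughly true. `spoken_so_far` and its parameters are hypothetical names.

```python
def spoken_so_far(sentence: str, played_s: float, total_s: float) -> str:
    """Estimate the part of `sentence` the caller heard before interrupting.

    Assumes speech is spread evenly over the audio, which is only roughly true.
    """
    if total_s <= 0 or played_s <= 0:
        return "[cut off]"
    if played_s >= total_s:
        return sentence
    words = sentence.split()
    # Round down: never claim the caller heard more words than they did.
    heard = int(len(words) * played_s / total_s)
    if heard == 0:
        return "[cut off]"
    return " ".join(words[:heard]) + "… [cut off]"
```

Rounding down is a deliberate bias: recording slightly too little of the sentence is safer than crediting the caller with words that never played.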

What I’d still want

A clock on the words. Twice now I’ve had to guess: fading out the audio at X milliseconds, and estimating how much of a sentence we streamed. Both guesses exist for the same reason: the TTS hands me sound with no idea which word is playing when. If it just told me word seven left the speaker at 1.31s, both hacks evaporate. Maybe someone will ship that.

Knowing who’s calling. Every caller gets the same bot right now: same greeting, same manner, same patience. The obvious fix is to sync my contacts and let the prompt change with the number. I’ve already added one override that does nothing but listen for a sales pitch and hang up.

A little more “mhm”. When a person listens, they don’t go dead silent. They drop in tiny signals, an “mhm,” a breath, so you know the line’s alive and someone’s actually there. Maybe I’ll add them. We’ll see how fast I get bored.

The whole thing, on a box in my house. Right now the app runs on my homeserver in Italy, Telnyx lives somewhere in Germany, and the Mistral models run in France. If I ever find a bag of unmarked money on my afternoon walks, I’ll have GPUs hosting the Voxtral models at home too. It wouldn’t beat physics, but it would make the agent side faster and the conversation feel more natural.

Where it landed

It turned out to be a fun project, and I learned a lot along the way.

The biggest lesson: the “one afternoon” demo really was one afternoon, just three APIs, a phone number, and a thing that picks up and talks. That’s the easy 10%, and it’s exactly why everyone ships it and stops there. The other 90% is the whole job: the noise that looks like speech, the silences that mean opposite things, the caller who interrupts and corrects and trails off. None of it needed a speech-recognition PhD. It just needed paying attention to how a conversation actually goes, and being willing to throw out the tidy diagram every time the world disagreed with it.

That’s the part worth building. It was also, apparently, worth a few Saturdays.


The whole thing is open source: github.com/ermirbeqiraj/phone-assistant.

Ermir Beqiraj is a backend architect building AI-integrated infrastructure. This is his personal writing.