Voice AI agents generate natural language. E-commerce storefront search APIs are built for typed keywords. Put them together and you get a retrieval failure that isn’t obvious until you’re looking at empty result sets in production.
This post is about that gap: what causes it, why the obvious fix doesn’t fully work, and what actually does.
The failure mode
A customer calls a voice agent for an electronics retailer. They say: “I need something to charge my laptop faster, mine takes forever and I’m always running out of battery.”
The voice agent transcribes this, extracts the intent, and fires a search query to the platform’s storefront API. The query it sends: "something to charge my laptop faster mine takes forever running out of battery".
The catalog has exactly what they want: 65W USB-C GaN charger PD 3.0. BM25 returns nothing. Zero results.
BM25, the retrieval model behind most keyword search implementations, scores documents by how many query terms appear in them, weighted by term frequency and inverse document frequency. There is no term overlap between "something to charge my laptop faster" and "65W USB-C GaN PD 3.0". The engine can’t find what it has.
This isn’t a bug. It’s what BM25 is supposed to do. The product catalog was indexed for terms that shoppers type into a search bar: short, specific, often technical. usb-c charger 65w, gan charger laptop, pd 3.0 fast charge. Voice generates something else entirely.
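The zero-overlap claim is easy to verify mechanically. A minimal sketch using plain whitespace tokenization (a real analyzer also lowercases, stems, and drops stopwords, but the technical head terms — 65W, GaN, PD — have no counterpart in the utterance either way):

```python
# The voice utterance and the matching product title share no surface terms,
# so any term-overlap scorer (BM25 included) gives this pair a score of zero.
def tokens(text: str) -> set[str]:
    return set(text.lower().split())

voice_query   = "something to charge my laptop faster mine takes forever running out of battery"
product_title = "65W USB-C GaN charger PD 3.0"

print(tokens(voice_query) & tokens(product_title))  # set()
```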
Why pure vector search doesn’t fix it
The first instinct is to replace BM25 with semantic vector search. Embed the query, embed the catalog, find nearest neighbors by cosine distance. This solves the vocabulary mismatch: the embedding for “something to charge my laptop faster” sits near the embedding for “USB-C GaN charger” in the vector space.
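Mechanically, this is just nearest-neighbor search by cosine distance. A toy sketch — the 3-d vectors here are made up for illustration; pgvector's `<=>` operator computes the same distance over real, high-dimensional embeddings:

```python
from math import sqrt

# Cosine distance in miniature. Real embedding models emit hundreds of
# dimensions, but the nearest-neighbor logic is identical.
catalog = {
    "65W USB-C GaN charger": [0.9, 0.1, 0.0],
    "HDMI 2.1 cable":        [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.1]  # stand-in embedding of the charging utterance

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm(a) * norm(b))

nearest = min(catalog, key=lambda title: cosine_distance(query, catalog[title]))
print(nearest)  # the charger wins despite zero token overlap with the query
```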
But vector search alone has a different failure mode. It’s too fuzzy.
A customer says “Anker 737 140W charger” — an exact model specification. With pure vector search, you might surface three semantically similar chargers, none of which is exactly what was asked for. The semantic similarity score doesn’t know the difference between the 737 and the 736. The customer does.
Voice agents handle both query types in the same conversation. Vague natural-language requests and precise specifications arrive in the same input stream. A retrieval layer built for one fails the other.
Hybrid search with RRF
Run both retrieval methods in parallel and fuse their ranked outputs.
For each query:
- Vector search handles semantic intent — the vague, exploratory queries
- Full-text search handles exact terminology — product names, model numbers, part codes
Then combine the ranked lists using Reciprocal Rank Fusion.
RRF is elegant. For each document, you sum its reciprocal ranks across all retrieval methods:
RRF_score(d) = Σ 1 / (k + rank_i(d))
Where k is a constant (typically 60) that dampens the influence of top ranks. A document ranked 1st in vector search and 5th in FTS scores higher than one ranked 1st by only one method. The fusion is rank-based, so you’re not trying to normalize scores across incompatible similarity metrics. You’re just asking: did both methods think this was relevant?
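The formula is only a few lines of code. A sketch — list names and document ids are illustrative; ranks are 1-based, best first:

```python
# Reciprocal Rank Fusion: sum 1 / (k + rank) for each document across
# all ranked lists, then sort by the fused score.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["gan-65w", "slim-45w", "gan-100w"]
fts_hits    = ["gan-65w", "gan-100w", "hdmi-cable"]

print(rrf_fuse([vector_hits, fts_hits]))
# ['gan-65w', 'gan-100w', 'slim-45w', 'hdmi-cable']
```

Note how `gan-100w`, ranked third by one method and second by the other, outscores `slim-45w`, which only one method surfaced at all — the agreement property doing its job.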
Keeping this in Postgres
The appeal of doing this in Postgres is that your product data is probably already there. You don’t need a separate search cluster.
WITH vector_ranked AS (
    SELECT
        id,
        ROW_NUMBER() OVER (ORDER BY embedding <=> $query_embedding) AS rank
    FROM products
    ORDER BY embedding <=> $query_embedding
    LIMIT 60
),
fts_ranked AS (
    SELECT
        id,
        ROW_NUMBER() OVER (ORDER BY ts_rank(search_vector, query) DESC) AS rank
    FROM products,
         -- 'polish' is a custom configuration; Postgres ships no built-in Polish one
         websearch_to_tsquery('polish', $query_text) AS query
    WHERE search_vector @@ query
    ORDER BY ts_rank(search_vector, query) DESC  -- without this, LIMIT keeps 60 arbitrary matches
    LIMIT 60
),
rrf AS (
    SELECT
        COALESCE(v.id, f.id) AS id,
        COALESCE(1.0 / (60 + v.rank), 0) + COALESCE(1.0 / (60 + f.rank), 0) AS score
    FROM vector_ranked v
    FULL OUTER JOIN fts_ranked f ON v.id = f.id
)
SELECT p.*, rrf.score
FROM rrf
JOIN products p ON p.id = rrf.id
ORDER BY rrf.score DESC
LIMIT 10;
pgvector handles the vector side with an HNSW index. Postgres full-text search (tsvector, websearch_to_tsquery) handles the FTS side. The FULL OUTER JOIN in the RRF step means a document that only showed up in one method still scores — it just scores lower than something both methods agreed on.
The query runs fast. An HNSW index on the embedding column and a GIN index on the search vector column keep both branches sub-20ms at typical catalog sizes.
The synonym layer
There’s a third problem that shows up in any market with brand aliases, colloquial names, or localized terminology.
A customer says “zasilacz do laptopa” (laptop power supply). The catalog has AC adapter, charger, power brick. FTS doesn’t connect them because they share no tokens. Vector search might bridge the gap, depending on how multilingual your embedding model is. Or it might not.
The fix is a synonym dictionary applied at index time. Postgres supports this through custom FTS configurations:
CREATE TEXT SEARCH DICTIONARY product_synonyms (
    TEMPLATE = synonym,
    SYNONYMS = product_aliases  -- reads $SHAREDIR/tsearch_data/product_aliases.syn
);
-- wire the dictionary into the configuration the queries use;
-- unmatched tokens fall through to the next dictionary in the list
ALTER TEXT SEARCH CONFIGURATION polish
    ALTER MAPPING FOR asciiword, word
    WITH product_synonyms, simple;
The synonym file maps each alias to one canonical token, one pair per line — zasilacz → charger, adapter → charger (the synonym template emits a single replacement per word; multi-word phrases need the thesaurus template). Both the document and the query pass through the same dictionary, so zasilacz and AC adapter 65W hit the same token in the index.
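A few illustrative lines of what the synonym file behind `SYNONYMS = product_aliases` might contain — the entries are examples, not a real catalog's mapping; the format is one `word synonym` pair per line:

```
zasilacz    charger
ładowarka   charger
adapter     charger
powerbank   battery
```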
This needs maintenance: someone has to build and update the synonym file. But the dictionary is domain-specific and relatively stable. For most catalogs it’s a one-time build, not an ongoing operational burden.
What changes in the AI layer
The voice agent doesn’t need to know any of this. From the agent’s perspective, it sends a natural-language query and gets ranked product results back. The retrieval complexity lives entirely in the search layer.
What does change is how you think about query preprocessing. Voice transcripts often include filler words, hedges, and relative terms that should be stripped before hitting the search index:
“I need something to charge my laptop faster, mine takes forever and I’m always running out of battery”
becomes something like:
“fast laptop charger”
This normalization step — extracting the search-relevant tokens from a conversational utterance — is where the LLM is actually useful. Not in retrieval itself, but in cleaning up what goes into retrieval.
The pattern that works: pass the transcribed utterance through a lightweight extraction prompt that returns a short search string plus constraint parameters (price range, size, category) to apply as filters before the search runs. Those constraints go into a WHERE clause, not into the search query itself.
The architecture this leads to
Voice Input
│
▼
ASR (transcription)
│
▼
LLM (intent extraction → search_query + filters)
│
├── filters → SQL WHERE clause
│
└── search_query → Hybrid Retrieval
│
├── pgvector (semantic)
├── Postgres FTS (lexical)
└── RRF fusion
│
▼
Ranked results → LLM response
The voice agent handles conversation and intent. The retrieval layer handles finding products. Neither compensates for the other’s limitations.
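That separation can be made concrete with a runnable sketch. The ASR, LLM, and database stages are stubbed out here — the function names and toy catalog are invented for illustration, and the stubs just replay the post's running example — but the control flow matches the diagram:

```python
# Toy stand-ins for the real components; only the wiring is the point.
TOY_CATALOG = [
    {"id": 1, "title": "65W USB-C GaN charger PD 3.0", "category": "chargers", "price": 149.0},
    {"id": 2, "title": "HDMI 2.1 cable 2m", "category": "cables", "price": 39.0},
]

def extract_intent(transcript: str) -> dict:
    # Stand-in for the LLM extraction step: a real implementation prompts
    # a model; this stub returns the post's running example.
    return {"search_query": "fast laptop charger", "filters": {"category": "chargers"}}

def hybrid_search(search_query: str, filters: dict) -> list[dict]:
    # Stand-in for the Postgres hybrid query: filters play the role of the
    # WHERE clause; retrieval is reduced to a naive keyword match.
    in_scope = [p for p in TOY_CATALOG
                if all(p.get(k) == v for k, v in filters.items())]
    terms = set(search_query.lower().split())
    return [p for p in in_scope if terms & set(p["title"].lower().split())]

def handle_utterance(transcript: str) -> list[dict]:
    intent = extract_intent(transcript)
    return hybrid_search(intent["search_query"], intent["filters"])

print(handle_utterance("I need something to charge my laptop faster")[0]["title"])
```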
Why this matters more for voice than text
When a typed search returns zero results, the user sees an empty page and types something different. The feedback loop is immediate and visible.
When a voice agent returns zero results, it either says “I couldn’t find anything” (which sounds like the agent doesn’t know what it’s doing) or it hallucinates a product that doesn’t exist. Neither outcome is recoverable in a phone call the way an empty search page is recoverable in a browser.
Voice is a higher-stakes retrieval environment. Failures are audible, synchronous, and hard to walk back. The margin for zero-result queries is close to zero.
This is why building voice AI on top of a storefront’s default search API is a category error. The API was built for a typed, asynchronous, screen-based interaction. Voice is none of those things. The retrieval layer needs to be built for the input it’s actually going to receive.
Building a voice agent for an e-commerce platform? Get in touch — this is a problem we’ve worked through.