RAG-Powered Search & Discovery
Built a production-grade RAG system that improved search-to-purchase conversion by 4% and delivered sub-500ms p99 latency at 3M queries/month
Key Achievements
- Led design of a hybrid BM25 + dense-vector retrieval pipeline with re-ranking, defining latency/error SLOs and cost targets upfront
- Built an evaluation framework (offline metrics, CI/CD regression tests, load envelopes), including a scalable LLM-driven approach to creating a golden dataset for ranking evaluation
- Architected for scale (3M queries/month) and delivered p99 latency under 500 ms at EUR 0.001 per query, improving search-to-purchase conversion by ~4%
Role: Senior ML Engineer (Contract)
Context: Mid-sized European e-commerce marketplace (75K SKUs, 50K DAU, 2.5K orders/day)
The Problem
The marketplace’s existing search stack was built around keyword matching and rule-based ranking. While functional at small scale, it struggled as the catalog and user base grew:
- Natural-language and multi-constraint queries failed frequently
- Long-tail products were under-discovered
- Synonyms, attributes, and multilingual queries were poorly handled
- A significant fraction of searches returned “no results”, even when relevant products existed
From a business perspective, search was a high-intent surface, but relevance failures directly translated into lost conversions and user frustration.
The challenge was not just to “add an LLM,” but to design a production-grade, low-latency, cost-efficient system that could improve relevance without destabilizing the platform.
Business Definition of Success
We defined success in operationally measurable terms, aligned with product and finance stakeholders:
- Primary metric:
  - Search-to-purchase conversion (relative improvement vs control)
- Secondary metrics:
  - Reduction in “no results found” queries
  - Latency SLOs (p95 / p99)
  - Error rate under peak load
- Explicit non-goals:
  - We did not require AOV uplift for success
  - We did not optimize for long conversational responses
The system had to be:
- Fast enough for real-time use
- Cheap enough to run continuously
- Observable and governable in production
Feasibility and Risk Assessment
- Owned feasibility and risk assessment for a high-impact RAG search project, evaluating data quality, modeling complexity, latency and cost constraints, and ethical considerations. Worked with Product and Engineering to define a phased rollout (text-first → multimodal), establish evaluation and guardrails for relevance and groundedness, and design performance- and cost-aware architecture from day one.
Comprehensive Evaluation Strategy
- Led the design of a multi-layered evaluation framework to ensure the RAG system met production standards for relevance, trust, performance, and business impact. Working closely with Product, Engineering, and Data stakeholders, we defined a “quality gauntlet” that gated every change—from data and retrieval logic to model upgrades—through rigorous offline validation (data quality checks, retrieval ranking metrics, groundedness testing, integration and load tests) and online evaluation (A/B testing, user engagement signals, and shadow deployments). This approach embedded evaluation across the entire lifecycle, enabling safe rollouts, preventing regressions, and providing a shared, data-driven basis for promotion decisions and continuous improvement.
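To make the gating concrete, below is a minimal sketch of one offline regression gate from that quality gauntlet, written pytest-style. The `evaluate_retrieval` helper (imported from a hypothetical internal module) and the baseline/threshold numbers are illustrative assumptions, not the exact production values.

```python
# Hypothetical CI regression gate: fail the build if a candidate retrieval
# configuration regresses against the last promoted baseline.
# evaluate_retrieval() is an assumed helper that replays the golden dataset
# and returns offline metrics; the numbers below are illustrative only.
from eval_harness import evaluate_retrieval  # hypothetical internal module

BASELINE = {"mrr": 0.62, "recall_at_10": 0.87}  # stored from the last promoted build
MAX_RELATIVE_DROP = 0.02                        # tolerate <= 2% relative regression

def test_no_ranking_regression():
    candidate = evaluate_retrieval(index_version="candidate")
    for metric, baseline_value in BASELINE.items():
        floor = baseline_value * (1 - MAX_RELATIVE_DROP)
        assert candidate[metric] >= floor, (
            f"{metric} regressed: {candidate[metric]:.3f} < {floor:.3f}"
        )
```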
Data & Indexing Foundations
Data sources
- Product catalog (titles, descriptions, attributes, specs)
- Category and brand metadata
- Historical search queries and clickstream events
- Product images
- User interaction signals (views, clicks, add-to-cart)
Indexing strategy
- Semantic chunking of product text and metadata
- Separate indexes for:
  - Textual content
  - Attribute-heavy fields
- Vector + keyword hybrid indexing to balance recall and precision
- Strict versioning of embeddings and index schemas to avoid silent drift
This foundation ensured retrieval quality was traceable, reproducible, and debuggable.
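As an illustration of this foundation, here is a minimal indexing sketch assuming rank_bm25, sentence-transformers, and FAISS as stand-ins for the actual stack. The embedding model, the toy catalog rows, and the manifest format are illustrative; the key point is the explicit version tag persisted alongside the artifacts so retrieval stays reproducible.

```python
import json
import faiss                                  # dense vector index (assumed stand-in)
import numpy as np
from rank_bm25 import BM25Okapi               # keyword index (assumed stand-in)
from sentence_transformers import SentenceTransformer

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative choice
INDEX_VERSION = "catalog-index-v3"                          # versioned to avoid silent drift

products = [
    {"id": "sku-1", "text": "Wireless noise-cancelling headphones, 30h battery"},
    {"id": "sku-2", "text": "Trail running shoes, waterproof, sizes 36-46"},
]

# Keyword index over tokenized product text.
bm25 = BM25Okapi([p["text"].lower().split() for p in products])

# Dense index over the same chunks; embeddings L2-normalised so inner product = cosine.
encoder = SentenceTransformer(EMBEDDING_MODEL)
embeddings = encoder.encode([p["text"] for p in products], normalize_embeddings=True)
dense = faiss.IndexFlatIP(embeddings.shape[1])
dense.add(np.asarray(embeddings, dtype="float32"))

# Persist schema metadata alongside the artifacts so retrieval is reproducible.
manifest = {"index_version": INDEX_VERSION, "embedding_model": EMBEDDING_MODEL,
            "num_products": len(products)}
with open("index_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```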
Building a “Golden Dataset” for Scalable RAG Evaluation
- To enable rigorous, statistically meaningful evaluation of the RAG system, I led the design of a synthetic “golden dataset” generation pipeline that replaced manual labeling with a scalable, LLM-driven approach. Working closely with Product, Search, and Data stakeholders, we defined what “ground truth” should mean for e-commerce search and translated that into an automated evaluation asset grounded in real user behavior. The pipeline combined curated production search logs, catalog-aligned document chunking, and LLM-based query generation with automated validation, producing tens of thousands of high-quality (query, relevant product) pairs. This dataset became a foundational artifact: it enabled reliable measurement of retrieval quality (MRR), powered automated regression testing in CI/CD, and significantly accelerated experimentation by allowing teams to quantify the impact of new retrieval and ranking strategies with high confidence before production rollout.
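A simplified sketch of that generation pipeline is shown below. The `complete` callable stands in for whichever LLM client was actually used, and the prompt and validation filters are illustrative rather than the exact production rules.

```python
# Sketch of the golden-dataset pipeline: generate plausible search queries per
# product chunk with an LLM, then validate them automatically before emitting
# (query, relevant_product_id) pairs. `complete()` is a hypothetical wrapper
# around the LLM API in use; prompt wording and filters are illustrative.
import json

def generate_golden_pairs(chunks, complete, queries_per_chunk=3):
    pairs = []
    for chunk in chunks:                      # chunk = {"product_id": ..., "text": ...}
        prompt = (
            "You are an e-commerce shopper. Write "
            f"{queries_per_chunk} short search queries that this product answers:\n"
            f"{chunk['text']}\nReturn a JSON list of strings."
        )
        try:
            queries = json.loads(complete(prompt))
        except (json.JSONDecodeError, TypeError):
            continue                          # drop malformed generations
        for q in queries:
            q = q.strip()
            # Automated validation: length bounds and no verbatim copying of the chunk.
            if 2 <= len(q.split()) <= 12 and q.lower() not in chunk["text"].lower():
                pairs.append({"query": q, "relevant_product_id": chunk["product_id"]})
    return pairs
```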
Retrieval & Generation Architecture
Rather than a “chatbot-first” design, we treated the LLM as one component in a retrieval pipeline:
- Query understanding
  - Light query normalization and expansion
  - Intent classification (navigational vs exploratory)
- Hybrid retrieval
  - Vector search for semantic recall
  - Keyword/BM25 search for precision and exact matches
  - Adaptive top-K selection based on query complexity
- Reranking
  - Contextual reranker incorporating relevance and business constraints
  - Guardrails to prevent over-fetching and latency blowups
- Generation (minimal by design)
  - Short, structured responses
  - No verbose descriptions unless explicitly required
This architecture was deliberately chosen to minimize token usage, control cost, and keep latency predictable.
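The sketch below illustrates the query-time side under the same assumptions as the indexing sketch above. Reciprocal rank fusion is used here as one reasonable way to merge keyword and vector candidates (the production fusion logic may differ), and the adaptive top-K heuristic is deliberately crude.

```python
# Sketch of query-time hybrid retrieval with reciprocal rank fusion (RRF) and
# a simple adaptive top-K; assumes the bm25, dense, encoder, and products
# objects from the indexing sketch above. Fusion constant and K values are
# illustrative, not the production settings.
import numpy as np

def hybrid_search(query, bm25, dense, encoder, products, rrf_k=60):
    # Adaptive top-K: fetch more candidates for longer, multi-constraint queries.
    top_k = 20 if len(query.split()) <= 3 else 50

    # Keyword candidates ranked by BM25 score.
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_ranked = np.argsort(bm25_scores)[::-1][:top_k]

    # Semantic candidates from the dense index (cosine via inner product).
    q_vec = encoder.encode([query], normalize_embeddings=True).astype("float32")
    _, dense_ranked = dense.search(q_vec, top_k)

    # Reciprocal rank fusion: score = sum of 1 / (rrf_k + rank) over both lists.
    fused = {}
    for rank, idx in enumerate(bm25_ranked):
        fused[int(idx)] = fused.get(int(idx), 0.0) + 1.0 / (rrf_k + rank + 1)
    for rank, idx in enumerate(dense_ranked[0]):
        if idx < 0:                       # FAISS pads with -1 when the index is small
            continue
        fused[int(idx)] = fused.get(int(idx), 0.0) + 1.0 / (rrf_k + rank + 1)

    ranked = sorted(fused, key=fused.get, reverse=True)
    return [products[i]["id"] for i in ranked[:top_k]]
```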
Fine-Tuning a Domain-Specific Re-Ranker for High-Precision Retrieval
- Led the design and fine-tuning of a domain-specific re-ranking model to improve precision in a two-stage RAG retrieval pipeline. Working closely with Product, Search, and Data teams, we aligned on relevance definitions rooted in real user behavior and translated them into a scalable training strategy. Using production interaction logs, we constructed high-quality training data with purchased items as positives and retrieval-time hard negatives, ensuring the model learned to resolve the most ambiguous, business-critical ranking decisions. The re-ranker was implemented as a cross-encoder, trained via managed cloud training jobs, and integrated behind a high-recall retriever to balance latency and accuracy. This approach materially improved ranking quality (MRR) while keeping inference costs predictable, and became a core mechanism for safely iterating on retrieval strategies without compromising user trust or system performance.
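A condensed sketch of that re-ranking stage, assuming sentence-transformers' CrossEncoder as the implementation; the base model, the toy training pairs, and the hyperparameters are illustrative, with positives and hard negatives constructed along the lines described above.

```python
# Sketch of the cross-encoder re-ranker; sentence-transformers is assumed as
# the implementation (the actual model and training setup may differ).
# Training pairs follow the scheme above: purchased items as positives,
# retrieval-time hard negatives as negatives.
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# (query, product_text) pairs with relevance labels, built from interaction logs.
train_examples = [
    InputExample(texts=["running shoes waterproof", "Trail running shoes, waterproof"], label=1.0),
    InputExample(texts=["running shoes waterproof", "Leather office loafers"], label=0.0),  # hard negative
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
model.fit(train_dataloader=DataLoader(train_examples, shuffle=True, batch_size=16), epochs=1)

def rerank(query, candidates, top_n=10):
    """Score (query, product_text) pairs from the high-recall retriever and keep the best."""
    scores = model.predict([(query, c["text"]) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]
```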
Production Architecture & MLOps
Real-time inference
- Async API running on containerized compute with autoscaling
- Sub-500ms p99 latency under peak load
- Circuit breakers and graceful degradation paths
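One graceful-degradation path is sketched below under assumed helper functions: if the semantic stage blows its latency budget, the request falls back to keyword-only results rather than failing.

```python
# Sketch of a degradation path inside the async API. semantic_search() and
# keyword_search() are assumed async helpers; the budget value is illustrative.
import asyncio

SEMANTIC_BUDGET_S = 0.35  # leave headroom inside the 500 ms p99 target

async def search(query: str) -> list[str]:
    try:
        # semantic_search() covers embedding, vector search, and re-ranking.
        return await asyncio.wait_for(semantic_search(query), timeout=SEMANTIC_BUDGET_S)
    except asyncio.TimeoutError:
        # Degraded but useful: keyword-only results, no re-ranking or generation.
        return await keyword_search(query)
```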
Evaluation & monitoring
- Offline metrics: recall@k, MRR
- Online metrics: conversion, “no results” rate
- LLM-as-judge for sampled quality checks
- Drift monitoring on:
  - Query distributions
  - Retrieval outputs
  - Embedding versions
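For reference, the two offline metrics reduce to a few lines over the golden dataset; the data layout assumed here (one known relevant product per query) is illustrative.

```python
# Minimal sketch of the offline metrics: each example is a query, the product
# IDs returned by the system, and the known relevant product ID.
def recall_at_k(examples, k=10):
    hits = sum(1 for ex in examples if ex["relevant_id"] in ex["retrieved_ids"][:k])
    return hits / len(examples)

def mean_reciprocal_rank(examples):
    total = 0.0
    for ex in examples:
        if ex["relevant_id"] in ex["retrieved_ids"]:
            total += 1.0 / (ex["retrieved_ids"].index(ex["relevant_id"]) + 1)
    return total / len(examples)
```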
CI/CD & governance
- Model and index versioning
- Canary deployments and rollback
- Load-tested capacity envelopes
- Explicit cost budgets per query
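A minimal sketch of how a per-query cost budget can be enforced as a monitoring check; only the EUR 0.001 budget comes from the figures in this write-up, and the blended token price and alerting hook are illustrative assumptions.

```python
# Hypothetical per-query cost budget guard. The budget matches the EUR 0.001
# figure quoted above; the token price is an illustrative placeholder.
COST_BUDGET_EUR = 0.001           # explicit budget per query
PRICE_PER_1K_TOKENS_EUR = 0.0005  # illustrative blended LLM price

def query_cost_eur(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K_TOKENS_EUR

def within_cost_budget(prompt_tokens: int, completion_tokens: int) -> bool:
    cost = query_cost_eur(prompt_tokens, completion_tokens)
    if cost > COST_BUDGET_EUR:
        # In production this would emit a metric/alert rather than print.
        print(f"cost budget exceeded: {cost:.5f} EUR > {COST_BUDGET_EUR} EUR")
        return False
    return True
```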
Business & Operational Impact
After staged rollout and controlled experiments:
- ~4% relative improvement in search-to-purchase conversion
- Stable sub-500ms p99 latency, including during traffic spikes
- Inference cost reduced to ~0.1 cents/query
Importantly, these gains were achieved without introducing heavy operational complexity or runaway costs.
Cost & Scalability
The system was designed to operate efficiently at mid-market production scale, serving ~3M search queries per month (~100K/day) across a catalog of tens of thousands of products with 1–2 GB of vector index data. By deliberately optimizing for search-centric RAG (short prompts, minimal generation, aggressive caching, adaptive retrieval, and model tiering), steady-state inference cost was kept to ~0.1 cents per query, resulting in a low four-figure monthly operating cost dominated by LLM usage rather than idle infrastructure. The real-time stack scales horizontally via autoscaling compute (typically 2–4 inference tasks) and a small, highly available search cluster, maintaining sub-500ms p99 latency even during traffic spikes. Importantly, cost scales linearly with query volume, not catalog size, and all heavy compute (ingestion, re-indexing, retraining) runs on-demand rather than continuously, making the system predictable, cost-bounded, and sustainable to operate long-term while still delivering measurable business impact.
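The back-of-envelope arithmetic behind the "low four-figure" claim, using only the figures quoted above:

```python
# ~3M queries/month at ~EUR 0.001 per query.
queries_per_month = 3_000_000
cost_per_query_eur = 0.001
monthly_llm_cost_eur = queries_per_month * cost_per_query_eur
print(monthly_llm_cost_eur)  # 3000.0 -> a low four-figure monthly cost, as stated
```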
What I Learned
This project reinforced several production-grade lessons:
- Search relevance improvements compound slowly but reliably — even single-digit conversion gains matter at scale.
- Most e-commerce search does not need “chatty” LLMs; retrieval quality matters more than generation verbosity.
- Cost is a first-class ML metric in real-time systems.
- Observability beats intuition — silent failures in retrieval pipelines are more dangerous than obvious model crashes.
- Conservative metrics build trust with product, finance, and leadership teams.
Overall, this project demonstrated how LLMs can be integrated into core commerce systems responsibly, with measurable impact, strong ROI, and production-grade reliability.