LLM Generated Review Summaries
Implemented automated review summarization using LLMs to surface key insights and sentiment for shoppers
Key Achievements
- Fine-tuned Mistral-7B on 100K+ reviews; cut LLM cost 60% vs GPT-4 baseline while maintaining quality
- Partnered with product and legal to define governance standards (model cards, versioning, bias-aware retrieval, PII redaction)
- Productionized batch RAG with automated quality checks (hallucination/toxicity/relevance)
Role: Senior ML Engineer (Contract)
Context: Mid-sized European e-commerce marketplace (75K SKUs, 50K DAU, 2.5K orders/day)
The Problem
Many products had hundreds or thousands of customer reviews — but most users don’t read them. Others had only sparse or mixed reviews. This created:
- Cognitive overload for high-traffic SKUs
- Lack of reliable insight for niche products
- Inconsistent messaging between product categories
- Heavy manual work for merchandising teams
The hypothesis was simple:
If we could condense review sentiment into a single, trustworthy summary, users could make decisions faster.
Solution Overview
We designed a batch Retrieval-Augmented Generation (RAG) system instead of real-time inference.
Pipeline steps:
- Generate embeddings for ~4M historical reviews
- Store embeddings in Pinecone serverless (pay-per-use, scalable)
- Retrieve key review snippets (recent + helpful + sentiment-balanced)
- Summarize with a fine-tuned Mistral-7B model using structured prompts
- Run evaluation (hallucination, relevance, toxicity)
- Store final outputs in the product catalog
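A minimal sketch of this batch flow is shown below, assuming a Pinecone serverless index named "review-embeddings" and placeholder helpers for embedding, summarization, and the quality gates; names are illustrative and exact client calls may differ by Pinecone SDK version.

```python
# Hedged sketch of the nightly batch job; embed(), summarize(), and
# passes_quality_checks() stand in for the embedding model, the fine-tuned
# Mistral-7B call, and the evaluation gates.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("review-embeddings")            # assumed index name

def embed(text: str) -> list[float]:
    """Placeholder: embed a review or probe query with the chosen embedding model."""
    raise NotImplementedError

def summarize(snippets: list[str]) -> str:
    """Placeholder: one structured-prompt call to the fine-tuned Mistral-7B."""
    raise NotImplementedError

def passes_quality_checks(summary: str, snippets: list[str]) -> bool:
    """Placeholder: hallucination, relevance, and toxicity gates."""
    raise NotImplementedError

def refresh_summary(product_id: str) -> str | None:
    # Retrieve a curated, high-signal context for this product.
    result = index.query(
        vector=embed("overall customer experience"),   # generic probe query
        top_k=50,
        filter={"product_id": {"$eq": product_id}},
        include_metadata=True,
    )
    snippets = [match.metadata["text"] for match in result.matches]
    summary = summarize(snippets)
    # Only publish to the product catalog if the evaluation gates pass.
    return summary if passes_quality_checks(summary, snippets) else None
```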
Why RAG for Review Summarization
- Architected a Retrieval-Augmented Generation (RAG) summarization pipeline to give product, UX, and business stakeholders explicit control over summary quality, balance, and bias at scale. Instead of naively summarizing all reviews, designed a metadata-aware retrieval layer (helpfulness, sentiment, recency, verified purchase) that selects a curated, high-signal context per product and feeds it into a single, deterministic LLM call, improving trustworthiness while reducing inference cost and complexity.
- Partnered with product and UX teams to define retrieval rules (e.g., balanced pros/cons, mandatory inclusion of critical negative feedback), ensuring summaries reflected real customer sentiment, avoided “lost-in-the-middle” failures, and remained cost-efficient and future-proof (reusable for semantic search and discovery use cases), as illustrated in the sketch below.
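The sketch below illustrates the retrieval rules in code: reviews are scored by helpfulness, recency, and verified-purchase status, and selected under a balanced pros/cons quota that guarantees critical negatives are included. Field names, weights, and quotas are assumptions, not the production configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Review:
    text: str
    rating: int              # 1-5 stars
    helpful_votes: int
    verified_purchase: bool
    created_at: datetime     # timezone-aware UTC timestamp

def signal_score(review: Review, now: datetime) -> float:
    # Weighted mix of helpfulness, recency, and verified-purchase status.
    helpfulness = min(review.helpful_votes, 50) / 50                 # cap vote influence
    recency = max(0.0, 1.0 - (now - review.created_at).days / 365)   # decay over ~1 year
    verified = 0.2 if review.verified_purchase else 0.0
    return 0.5 * helpfulness + 0.3 * recency + verified

def select_context(reviews: list[Review], k: int = 20) -> list[Review]:
    now = datetime.now(timezone.utc)

    def ranked(subset: list[Review]) -> list[Review]:
        return sorted(subset, key=lambda r: signal_score(r, now), reverse=True)

    positives = ranked([r for r in reviews if r.rating >= 4])
    negatives = ranked([r for r in reviews if r.rating <= 2])
    # Balanced pros/cons quota with mandatory inclusion of critical negatives.
    cons = negatives[: k // 2]
    pros = positives[: k - len(cons)]
    return cons + pros
```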
Multi-adapter (LoRA)
- Designed a modular continual-learning strategy to prevent catastrophic forgetting in production LLMs by introducing a domain-specific multi-adapter (LoRA) architecture, enabling isolated retraining per product category (e.g., Electronics, Fashion) while preserving quality across the rest of the catalog. Partnered with product and content stakeholders to define domain boundaries and rollout criteria, ensuring safe, incremental model improvements without cross-category regressions.
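A hedged sketch of how per-category adapters can sit on one shared Mistral-7B base using Hugging Face PEFT; the model ID, adapter paths, and category names are placeholders, and each adapter can be retrained and swapped in isolation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "mistralai/Mistral-7B-v0.1"                 # assumed base checkpoint
ADAPTERS = {
    "electronics": "adapters/electronics-lora",    # assumed local adapter paths
    "fashion": "adapters/fashion-lora",
}

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# Attach one adapter per product category; retraining "fashion" never
# touches the "electronics" weights, avoiding cross-category regressions.
model = PeftModel.from_pretrained(base_model, ADAPTERS["electronics"], adapter_name="electronics")
model.load_adapter(ADAPTERS["fashion"], adapter_name="fashion")

def summarize(prompt: str, category: str) -> str:
    model.set_adapter(category)                    # route to the category-specific adapter
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```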
Evaluation & Experiment Results
We A/B-tested the feature on a subset of product pages.
What we observed:
- Users interacted more with the review section
- Scroll depth and dwell time increased
- The conversion curve showed an uplift of ~2%, but it was not statistically significant
We treated this as directional validation and focused on:
- Quality improvements
- Operational time savings
- Future UX iteration opportunities
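For context on why the uplift was not claimed as a win, a minimal significance check on conversion rates looks like the sketch below; the counts are illustrative, not the real experiment numbers.

```python
# Two-proportion z-test on conversion counts from control (A) and treatment (B).
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Return the two-sided p-value for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative example: a ~2% relative uplift at this sample size is not significant.
p = two_proportion_z_test(conv_a=1000, n_a=25_000, conv_b=1020, n_b=25_000)
print(f"p-value: {p:.3f}")   # well above 0.05
```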
Cost & Optimization
Initial prototypes using the GPT-4 API cost ~$40–50/month.
By switching to:
- fine-tuned Mistral-7B
- quantization
- inference batching via vLLM
we reduced inference cost to ~$10–20/month, with no quality degradation.
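A hedged sketch of the optimized inference path: a quantized Mistral-7B checkpoint served through vLLM's offline batched generation. The exact checkpoint, quantization method, and sampling settings shown here are assumptions.

```python
from vllm import LLM, SamplingParams

# Assumed AWQ-quantized checkpoint; any quantized Mistral-7B variant works similarly.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
params = SamplingParams(temperature=0.2, max_tokens=256)

# One prompt per product; vLLM batches the whole list internally,
# keeping GPU utilization high and cost per summary low.
prompts = [
    "Summarize the key pros and cons from these customer reviews:\n<retrieved snippets>",
]
outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)
```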
Model Governance and Responsible AI (RAI)
- Led end-to-end model governance and Responsible AI practices for a production LLM system, integrating reproducibility, auditability, access control, and bias mitigation across the full MLOps lifecycle. Partnered with product, legal, and platform teams to define governance standards (model cards, versioning, access policies) and implemented bias-aware RAG retrieval, explainability hooks, and proactive PII redaction to ensure trustworthy, compliant AI outputs at scale.
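As one concrete example of the PII controls, a minimal redaction pass over review text might look like the sketch below; the patterns are deliberately simple and illustrative, not the full rule set.

```python
import re

# Illustrative PII patterns; a production rule set would cover far more cases.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with a category tag before indexing or prompting."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Great blender! Contact me at jane.doe@example.com or +49 170 1234567."))
# -> "Great blender! Contact me at [EMAIL] or [PHONE]."
```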
Monitoring and Observability
Led the design of a comprehensive monitoring and observability framework for an LLM-based summarization platform, spanning system reliability (batch DAGs, GPU utilization, API errors), data integrity (semantic drift, PII leakage), and model quality (hallucinations, toxicity, RAG relevance). Collaborated with data engineering, product, and legal teams to define thresholds, alerting policies, and on-call workflows—ensuring continuous quality, fast incident response, and long-term trust in production GenAI outputs.
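A simplified sketch of how the batch-level quality gates can be expressed as thresholds and alerts; the metric names and threshold values are assumptions for illustration, not the agreed production policies.

```python
from dataclasses import dataclass

@dataclass
class BatchQualityReport:
    hallucination_rate: float   # share of summaries failing groundedness checks
    toxicity_rate: float        # share flagged by the toxicity classifier
    mean_rag_relevance: float   # average retrieval relevance score
    pii_leak_count: int         # redaction misses detected post-hoc

THRESHOLDS = {
    "hallucination_rate": 0.02,
    "toxicity_rate": 0.001,
    "mean_rag_relevance": 0.70,   # alert if average relevance drops below this
    "pii_leak_count": 0,
}

def evaluate_batch(report: BatchQualityReport) -> list[str]:
    """Return alert messages for any metric that breaches its threshold."""
    alerts = []
    if report.hallucination_rate > THRESHOLDS["hallucination_rate"]:
        alerts.append(f"hallucination_rate={report.hallucination_rate:.3f} above threshold")
    if report.toxicity_rate > THRESHOLDS["toxicity_rate"]:
        alerts.append(f"toxicity_rate={report.toxicity_rate:.4f} above threshold")
    if report.mean_rag_relevance < THRESHOLDS["mean_rag_relevance"]:
        alerts.append(f"mean_rag_relevance={report.mean_rag_relevance:.2f} below threshold")
    if report.pii_leak_count > THRESHOLDS["pii_leak_count"]:
        alerts.append(f"pii_leak_count={report.pii_leak_count} detected")
    # In production these alerts would page on-call or post to the alert channel.
    return alerts
```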
Lessons Learned
- Summaries are only as good as retrieval signal quality
- Model evaluation needs automation, not subjective review
- Cost modeling is critical for production viability
- Experimentation must include both UX and ML metrics