Powerful multi-agent systems are no longer hard to prototype — but they’re still hard to trust. Why?
Building a first draft or demo of sophisticated systems that pull proprietary information from your databases or take actions on your behalf has become faster and easier thanks to frameworks like LangChain. But making these systems reliable and ready for real-world use is still a serious challenge. LLMs remain stochastic and difficult to fully control. So:
- How can we ensure agentic systems are doing what we intend?
- How can we monitor and test LLM apps effectively?
- How can we build trust — first with developers, and ultimately with end users?
- And how do we do it efficiently and at scale?
With LLMs as the new foundation for much software development, we are faced with a paradigm shift: What was once test-driven software engineering is now evaluation-driven AI engineering.
To tackle this new paradigm, we partnered with LastMile AI on a particularly challenging real-world use case: an agentic content search platform for Bertelsmann’s creatives. The project provided the perfect testing ground for LastMile’s infrastructure for agentic systems and evaluation tooling.
Together, we focused on enabling real-time, cost-effective, and context-aware evaluation — not just to measure performance, but to accelerate it. With the help of compact, self-hosted models tailored to our use case, we were able to automate key evaluation metrics such as relevance, faithfulness, and overall quality. This, in turn, helped us surface hallucinations, improve agent selection logic, and unlock new capabilities like real-time evaluations and targeted active learning.
Ultimately, we made meaningful progress toward answering some of the most fundamental questions in building agentic systems grounded in real-life data:
- How can we confidently measure the quality of answers across diverse domains?
- How can we reduce the cost and latency of evaluation without sacrificing accuracy?
- How do we integrate real-world feedback and continuously improve?
What We Collaborated On: The Bertelsmann Content Search
With a portfolio spanning books, music, and TV, Bertelsmann has one of the richest data sources for creatives. However, with that scale comes fragmentation. Data lives in silos, with divisions running their own systems. So, what happens when a creative or researcher at Bertelsmann wants to answer a seemingly simple question like:
“What kind of content do we have on Barack Obama?”
The answer might live in dozens of places: biographies from Penguin, documentaries via news channels, podcasts on streaming platforms, and even third-party commentary from the open web. Finding this content used to mean knowing where to look — and having or getting access to each of the relevant systems.
The Bertelsmann Content Search changes that. Built as a multi-agent system by the Bertelsmann AI Hub team, the platform allows Bertelsmann’s creatives to ask natural language questions and receive unified, trustworthy answers — without needing to know which system holds what.
Here’s how it works: Behind the scenes, a router directs queries to specialized agents, each responsible for searching a specific domain. One agent might dig into the RTL archives, another into PRH’s book catalog, while a third checks external sources for real-time web trends. Each agent returns its own answer, which is then distilled into a single, coherent response.
The user sees just one clean answer — but behind it lies a distributed orchestration of knowledge retrieval across the entire Bertelsmann ecosystem.
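To make the orchestration pattern concrete, here is a minimal Python sketch of the fan-out-and-distill flow described above. The agent functions, the toy router, and the concatenation step are illustrative assumptions; the platform's actual router model and distillation logic are not shown here.

```python
import asyncio

# Hypothetical domain agents: each one searches a single source. In the real
# platform these would query the RTL archives, the PRH catalog, the open web, etc.
async def search_books(query: str) -> str:
    return f"[books] results for {query!r}"

async def search_tv_archive(query: str) -> str:
    return f"[tv] results for {query!r}"

async def search_open_web(query: str) -> str:
    return f"[web] results for {query!r}"

AGENTS = {"books": search_books, "tv": search_tv_archive, "web": search_open_web}

def route(query: str) -> list[str]:
    """Toy router: in practice a trained router model picks the relevant agents."""
    return list(AGENTS)  # here we simply fan out to every agent

async def answer(query: str) -> str:
    selected = route(query)
    partials = await asyncio.gather(*(AGENTS[name](query) for name in selected))
    # Distillation step: the real system merges the partial answers into one
    # coherent response; here we just concatenate them.
    return "\n".join(partials)

print(asyncio.run(answer("What kind of content do we have on Barack Obama?")))
```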
This multi-agent design makes it possible to surface the right information — across diverse formats, brands, and platforms — without centralizing all data. It empowers creatives, marketers, and producers to work faster, stay informed, and make better content decisions.
LastMile x Bertelsmann: Solving Key Challenges in Agentic Systems
Before launching the Bertelsmann Content Search, it was clear that building a multi-agent system wasn’t just a technical challenge — it was a trust challenge. With so many components working together across different data domains, we faced hallucinations, agent coordination issues, and a lack of reliable feedback signals.
Together with LastMile AI, we built an evaluation process that could monitor and improve performance across the entire pipeline. We defined key evaluation metrics, developed methods to compute them at a trace level, and implemented fast, cost-effective models that made real-time evaluation and system improvement possible.
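As an illustration of what trace-level evaluation can look like, here is a small sketch assuming a simplified trace schema (query, retrieved contexts, final answer) and a naive overlap-based scorer as a stand-in for the trained evaluation models; it does not reflect the production data model or metric implementation.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """Simplified trace: the user query, the contexts retrieved across agents,
    and the final distilled answer. Field names are assumptions."""
    query: str
    contexts: list[str]
    answer: str

def faithfulness(trace: Trace) -> float:
    """Naive stand-in scorer: fraction of answer tokens that appear somewhere
    in the retrieved contexts. A real evaluator would use a trained model."""
    answer_tokens = set(trace.answer.lower().split())
    context_tokens = set(" ".join(trace.contexts).lower().split())
    return len(answer_tokens & context_tokens) / max(len(answer_tokens), 1)

def evaluate(traces: list[Trace]) -> dict[str, float]:
    """Compute a metric per trace, then aggregate across the batch."""
    scores = [faithfulness(t) for t in traces]
    return {"faithfulness_mean": sum(scores) / len(scores)}

traces = [
    Trace(
        query="What kind of content do we have on Barack Obama?",
        contexts=["A biography of Barack Obama published by Penguin ..."],
        answer="We have a biography of Barack Obama from Penguin.",
    )
]
print(evaluate(traces))
```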
The highlights of our collaboration included:
- Improved Tool Call Accuracy Through Better Agent Routing
  - Implemented an enhanced agent selection router model that intelligently routes queries to the most relevant agents.
- Cost-Effective Evaluation at Scale with Self-Hosted Models
  - Trained and deployed compact 400M-parameter alBERTa models optimized for CPU-based inference.
  - Enabled cost-effective, real-time evaluation of system outputs in high-throughput production scenarios.
- Specialized Models for Key Eval Metrics, Including Long-Document Faithfulness
  - Developed custom evaluation metrics, including novel techniques for computing scores over long-form 128k-token contexts (sketched after this list).
  - Lifted the faithfulness AUC score from 0.71 to 0.84+, enabling more reliable detection of hallucinations and inconsistencies.
- Accelerated Model Improvement via Targeted Sampling and Active Learning
  - Established a data-driven sampling process to surface the most impactful areas for manual inspection (see the sampling sketch after this list).
  - Generated 5,000+ high-quality labeled datapoints in days through LLM-based weak labeling combined with human-in-the-loop validation.
  - Enabled continuous learning loops and refinement of key evaluation models.
- Unlocked New Capabilities Through Real-Time Evaluation
  - Deployed evaluation models for online inference.
  - While still in progress, this opens up powerful new opportunities:
    - Automated guardrails to catch low-quality or sensitive outputs at run-time, before they reach end users (a guardrail sketch follows this list).
    - Improved agent selection via dynamic routing based on predicted faithfulness.
    - Enhanced response generation through output filtering or reweighting.
    - Evaluation-driven fine-tuning of the underlying retrieval and generation models.
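To illustrate the long-document faithfulness idea referenced above, here is one possible chunk-and-aggregate sketch: split the context into overlapping windows that fit a small encoder, score the answer against each window, and take the maximum as the support score. The chunking parameters and the placeholder scorer are assumptions; the actual alBERTa scoring technique is not shown here.

```python
def chunk(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split a long context into overlapping word windows that fit a small
    encoder's input limit."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def score_chunk(context_chunk: str, answer: str) -> float:
    """Placeholder support score: in practice a compact, CPU-served evaluation
    model scores how well this chunk supports the answer."""
    answer_tokens = set(answer.lower().split())
    chunk_tokens = set(context_chunk.lower().split())
    return len(answer_tokens & chunk_tokens) / max(len(answer_tokens), 1)

def long_context_faithfulness(context: str, answer: str) -> float:
    """Aggregate with max: the answer counts as supported if at least one
    chunk of the (possibly 128k-token) context supports it."""
    return max(score_chunk(c, answer) for c in chunk(context))

print(long_context_faithfulness("Barack Obama wrote several books. " * 500,
                                "Barack Obama wrote several books."))
```

The targeted sampling step can be sketched in a similar spirit: rank unlabeled traces by how uncertain the current evaluation model is about them and send only the most ambiguous ones for weak labeling and human review. The predictor, threshold, and budget below are illustrative assumptions.

```python
import random

def uncertainty(score: float, threshold: float = 0.5) -> float:
    """Distance from the decision boundary; smaller means the evaluator is less sure."""
    return abs(score - threshold)

def select_for_labeling(traces, predict, budget: int = 100):
    """Pick the `budget` traces the current evaluation model is least certain
    about, so labeling effort goes where it improves the model the most."""
    ranked = sorted(traces, key=lambda t: uncertainty(predict(t)))
    return ranked[:budget]

# Toy usage: 1,000 logged traces scored by a dummy predictor. In practice the
# selected batch would go to LLM-based weak labeling plus human review, and the
# new labels would feed the next training round of the evaluation model.
traces = [f"trace-{i}" for i in range(1000)]
batch = select_for_labeling(traces, predict=lambda t: random.random(), budget=100)
print(len(batch), "traces queued for labeling")
```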
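Finally, a run-time guardrail of the kind listed above can be as simple as scoring each answer with an online evaluation model before it is returned and substituting a fallback message when the score falls below a threshold. The evaluator hook, threshold, and fallback text here are hypothetical.

```python
FAITHFULNESS_THRESHOLD = 0.7  # assumed cut-off, for illustration only

def guarded_response(query: str, answer: str, contexts: list[str], evaluate_answer) -> str:
    """Score the generated answer before it reaches the user; fall back to a
    cautious message if the online evaluator flags it as low quality."""
    score = evaluate_answer(query, answer, contexts)
    if score < FAITHFULNESS_THRESHOLD:
        return ("I couldn't verify this answer against our sources. "
                "Please refine your question or consult the source systems directly.")
    return answer

# Toy usage with a dummy evaluator standing in for the online model.
print(guarded_response(
    "What kind of content do we have on Barack Obama?",
    "We have a biography of Barack Obama from Penguin.",
    ["A biography of Barack Obama published by Penguin ..."],
    evaluate_answer=lambda q, a, c: 0.9,
))
```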
Read the Full Case Study at LastMile
We’re proud of the progress made in this collaboration. A more in-depth version of this case study, including further technical detail, can be found on LastMile’s blog.