
### Databao

Agentic platform with modular AI tools and a governed semantic layer for any data stack

#1 on Spider 2.0–DBT Benchmark – How Databao Agent Did It

Dmitrii Mikhailovskii
Dmitrii Zolotarev

As of February 2026, Databao Agent ranks #1 in the Spider 2.0–DBT benchmark. This ranking measures how well agents can operate in a real dbt project, including reading the repository, understanding what’s broken, implementing the missing models, and validating everything by actually running code.

Our team ended up achieving the highest score in the benchmark, but we didn’t do it just because “we used a better model.” We got the biggest gains by treating the agent the same way you would mentor a junior colleague – providing better context, restricting chaos, and enforcing a reliable workflow.

This post is a practical account of what we changed and why it mattered. Read on to learn about the engineering decisions that made the difference, including how we reduced uncertainty, upgraded context, tightened up tool discipline, and rewrote a messy pile of prompts into a clear policy the agent could follow. The lessons we learned the hard way are that reliability beats cleverness, and prompts alone don’t buy you reliability – you have to design for it.

What is a dbt project?

dbt (data build tool) treats analytics like software. Instead of ad-hoc SQL embedded in dashboards and notebooks, data transformations live in a version-controlled repository, are reviewed like code, and can reliably rebuild the same analytics layer.

The main unit of work in dbt is a model: an .sql file that defines a dataset (usually a table or a view) built from other datasets. Models depend on other models, and dbt builds them in dependency order, turning the project into a directed graph rather than a pile of disconnected queries.
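The "directed graph, not a pile of queries" point can be sketched in a few lines. This is not dbt's internals, just an illustration of the idea: given which models each model references, a topological sort yields the build order dbt would respect. The model names are hypothetical.

```python
from graphlib import TopologicalSorter

# Toy dependency graph: model -> the models it builds on (hypothetical names).
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "int_order_items": {"stg_orders"},
    "fct_orders": {"int_order_items", "stg_customers"},
}

# dbt builds models in dependency order: upstream models first,
# then everything that depends on them.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

Any valid ordering places `stg_orders` before `int_order_items`, and both staging models before `fct_orders`.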

A typical dbt repository contains the following parts:

  • The models/ directory with SQL models (often organized into layers, such as staging → intermediate → marts).
  • YAML files that document the project and add tests and constraints (sources, descriptions, uniqueness tests, freshness, etc.).
  • A workflow built around commands like dbt run or dbt build. These commands materialize models, run tests, and tell you what failed, where, and why.

Working with dbt means navigating a codebase, respecting conventions and dependencies, iterating, and not declaring victory until the build is green. The Spider 2.0–DBT benchmark asks agents to do exactly that.

What Spider 2.0–DBT evaluates

The Spider 2.0–DBT benchmark turns a day-to-day dbt workflow into an evaluation. In the version we ran, the benchmark had 68 tasks. Each of them was a folder containing:

  • An incomplete dbt project (models were missing or incorrect).
  • A DuckDB database file with the available data.

The agent’s job was to behave like a careful data engineer:

  • Read the repository to understand what the repo is trying to produce.
  • Identify what’s missing or wrong.
  • Implement the missing SQL models or fixes.
  • Run dbt.
  • Keep iterating until the project builds.

The evaluation compares the produced database with a “golden database” and checks whether the agent produced the right tables and columns.
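The grading step can be pictured as a schema diff. The sketch below uses plain Python dicts with hypothetical table and column names; the actual benchmark compares DuckDB database files and also checks the data itself, not just the schema.

```python
# Hypothetical schemas: table name -> set of column names.
produced = {
    "fct_orders": {"order_id", "customer_id", "amount"},
    "dim_customers": {"customer_id", "name"},
}
golden = {
    "fct_orders": {"order_id", "customer_id", "amount"},
    "dim_customers": {"customer_id", "name", "country"},
}

def schema_diff(produced, golden):
    """Report tables and columns the agent's output is missing."""
    issues = []
    for table, want_cols in golden.items():
        if table not in produced:
            issues.append(f"missing table: {table}")
            continue
        missing = want_cols - produced[table]
        if missing:
            issues.append(f"{table}: missing columns {sorted(missing)}")
    return issues

print(schema_diff(produced, golden))
```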

Even though this may sound like simple SQL generation, which many LLMs do well, the hard part is operating in a repository environment. Some tasks are genuinely "data warehouse" sized: tables with 2,500+ columns, dozens of models in a single task, and thousands of lines of SQL across the project.

This scale forces the agent to behave like a real contributor. You can't paste the entire repository and schema into a single prompt and expect consistent reasoning. The agent has to navigate the repository, read selectively, build a mental map of the project, and stay oriented after each run.

Where we started: Baselines and the real enemy

We didn’t start from scratch – our first agent was based on a popular LLM and could inspect a data project, run commands, and make edits using standard data tools. Surprisingly enough, its performance right out of the gate wasn’t too shabby – it could solve about a quarter of the tasks in our benchmark.

Encouraged, we built a more flexible version of the agent by giving it additional tools not available in other agents' default setups. This gave us more control and room to experiment. On paper, these were all improvements. But in practice, consistency was sorely lacking. The agent behaved a little differently each time we ran it. It would nail one task, then completely whiff on the next.

This inconsistency turned out to be the real enemy. When we looked closer, the issue wasn’t that the agent couldn’t write SQL or “do data stuff.” The problem was that it struggled to behave consistently and to understand what the actual task was – something a careful data or analytics engineer wouldn’t have any issues with.

Important kinds of uncertainty

As we dug deeper, we realized there were two main culprits behind the agent’s randomness.

The first was missing or unclear context. The agent often didn’t have enough visibility into how the project was structured, what tables existed, or what conventions were being followed. This uncertainty is fixable. If you provide better, targeted context, the agent stops guessing.

The second was natural ambiguity. Human language is fuzzy by nature. Even with good instructions, there can be multiple reasonable ways to solve a task, but only one of them matches the benchmark’s expected answer. You can’t fully eliminate this kind of uncertainty.

Understanding this distinction changed what we worked on: we re-allocated our efforts, focusing less on fixing the model and more on fixing the environment around it.

Our strategy shift: From model tuning to workflow engineering

Early on, we gave the agent lots of freedom and lots of tools. That felt powerful, but failed in predictable ways: the agent wandered around, tried random actions, undid its own work, and generally got lost.

So, we changed our mindset. Instead of asking, “What can this agent do?” we asked, “What would a human engineer actually do here?”

We focused on two things:

  • Better context: Make the right information easy to access and hard to miss.
  • A clear, disciplined workflow: Reduce chaos by forcing a specific order of operations.
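"Forcing a specific order of operations" can be made concrete with a gate that only permits tool calls valid in the current phase. This is a hypothetical sketch — the phase names and tool names below are invented for illustration, not the actual Databao Agent implementation.

```python
# Hypothetical phases, in the order the agent must follow,
# and the tools permitted in each phase.
WORKFLOW = ["inspect", "plan", "edit", "build", "validate"]
ALLOWED = {
    "inspect": {"read_file", "list_models", "describe_db"},
    "plan": {"read_file"},
    "edit": {"read_file", "write_model"},
    "build": {"run_dbt"},
    "validate": {"read_file", "run_dbt"},
}

class WorkflowGate:
    """Rejects tool calls that don't belong to the current phase."""

    def __init__(self):
        self.phase_idx = 0

    @property
    def phase(self):
        return WORKFLOW[self.phase_idx]

    def check(self, tool):
        return tool in ALLOWED[self.phase]

    def advance(self):
        if self.phase_idx < len(WORKFLOW) - 1:
            self.phase_idx += 1

gate = WorkflowGate()
print(gate.check("run_dbt"))  # building before inspecting is blocked
for _ in range(3):
    gate.advance()
print(gate.phase, gate.check("run_dbt"))  # now in "build", running is allowed
```

The point of a structure like this is that the agent cannot "wander": editing before reading, or building before editing, is simply not an available action.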

Better context

We made sure the agent didn’t have to hunt for information.

We showed the important project files upfront, so the agent wouldn't waste time opening the wrong things, and added a quick database overview at the beginning, so the agent knew which tables already existed. These two changes fixed a surprising number of failures, especially on tasks where the correct action was to do nothing at all.

We also helped the agent connect the dots between requirements and data sources instead of guessing names. When it ran data builds, we summarized the results instead of dumping long, noisy logs. This kept the agent focused on what mattered next.
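Summarizing a build instead of dumping raw logs might look like the sketch below. The log format here is made up for illustration (real dbt output is longer and formatted differently), and the regex is tuned to this toy format only.

```python
import re

# Hypothetical raw build output (real dbt logs are longer and noisier).
raw_log = """\
1 of 3 OK created sql table model main.stg_orders ........ [OK in 0.4s]
2 of 3 ERROR creating sql table model main.int_order_items [ERROR in 0.2s]
3 of 3 SKIP relation main.fct_orders ..................... [SKIP]
Database Error: column "order_total" does not exist
"""

def summarize(log):
    """Compress a build log into status counts plus the failing models."""
    failed = re.findall(r"ERROR creating sql table model (\S+)", log)
    counts = {
        "ok": log.count("[OK"),
        "error": log.count("[ERROR"),
        "skip": log.count("[SKIP"),
    }
    return counts, failed

counts, failed = summarize(raw_log)
print(counts, failed)
```

Handing the agent `{"ok": 1, "error": 1, "skip": 1}` plus the one failing model keeps its attention on the next action instead of on pages of log noise.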

The result? Fewer blind mistakes and fewer “I didn’t find the right thing” failures.

A clear, disciplined workflow

Context helped, but it didn’t solve the failures entirely, so we tightened up the rules.

In the first version, we gave the agent access to many tools. It could read, write, edit, and add any file in the dbt project, and it had unrestricted access to the terminal. In theory, this was supposed to make the agent powerful, but unfortunately, the agent used its power to break things.
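One way to rein in unrestricted terminal access is a command allowlist. The sketch below is a hypothetical illustration, not the actual restriction Databao Agent uses: only commands starting with an approved prefix get through.

```python
import shlex

# Hypothetical allowlist: dbt commands and a few read-only shell tools.
ALLOWED_PREFIXES = [["dbt"], ["ls"], ["cat"], ["head"]]

def is_allowed(command):
    """Permit a shell command only if it starts with an allowed prefix."""
    tokens = shlex.split(command)
    return any(tokens[: len(prefix)] == prefix for prefix in ALLOWED_PREFIXES)

print(is_allowed("dbt build --select fct_orders"))  # allowed
print(is_allowed("rm -rf models/"))                 # blocked
```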

[...]

