Text-to-SQL: Evolution and Limitations | Agentic Lakehouse Knowledge Base

Text-to-SQL is a subfield of Natural Language Processing (NLP) focused on translating human-readable text into structured Structured Query Language (SQL) statements. While the concept predates the current generative AI boom, the advent of Large Language Models (LLMs) has transformed Text-to-SQL from an academic novelty into a core component of modern enterprise data architectures.

However, generating mathematically accurate SQL for production data environments is historically difficult. The evolution of Text-to-SQL reveals a clear path from brittle, zero-shot generation to the sophisticated, context-aware pipelines found in today's Agentic Lakehouse.

Phase 1: Zero-Shot Prompting

The earliest (and most naive) implementations of LLM-based Text-to-SQL relied on zero-shot prompting. An engineer would simply pass a business question (e.g., "Show me the top 5 customers in Europe") directly to an LLM like GPT-3.

Because the LLM had no knowledge of the specific database schema, it would hallucinate table names and columns based on its training data. It might generate a query selecting from a fictional european_customers table. This approach proved useless for real-world enterprise applications.

Phase 2: DDL Injection

To fix the hallucination problem, engineers began injecting the Data Definition Language (DDL) of their tables directly into the LLM's system prompt. The prompt would include the exact `CREATE TABLE` statements for the entire database schema.

This approach improved syntax accuracy but introduced severe scaling limitations. Large data warehouses contain thousands of tables and tens of thousands of columns. Passing the entire DDL into every prompt quickly exceeded the LLM's context window limits and incurred massive token costs. Additionally, it failed to solve the "tribal knowledge" problem. The LLM knew the column names, but it didn't know the business logic (e.g., that "Europe" internally meant region_code IN (1, 4, 7)).

Phase 3: Schema RAG (Retrieval-Augmented Generation)

To bypass context window limits, teams implemented RAG. Instead of passing the entire schema to the LLM, they vectorized their table definitions and stored them in a vector database. When a user asked a question, a similarity search retrieved only the top 3 most relevant table schemas and injected them into the prompt.

While this solved the context window problem, it still suffered from a lack of business context. Schema RAG often retrieved tables that sounded similar to the prompt but were analytically incorrect (e.g., retrieving a staging table instead of a curated dimensional table).

Phase 4: Semantic Text-to-SQL and Agentic Loops

The current state-of-the-art for Text-to-SQL abandons raw DDL entirely. Instead, the architecture routes the natural language prompt through an AI Semantic Layer.

In this paradigm, the LLM is not prompted with physical table structures. It is prompted with business definitions. The semantic layer defines what a "Customer" is, what "Revenue" is, and how they relate. When the LLM generates a SQL query, it generates it against these virtual semantic models.

Additionally, modern Text-to-SQL is no longer a one-shot process. It is executed within an Agentic Loop. If the generated query contains a syntax error, the execution engine returns the error to the agent. The agent reads the error, modifies its SQL, and tries again. This self-correcting capability drastically improves the reliability of natural language analytics, allowing non-technical users to query massive Apache Iceberg tables safely and accurately.

Text-to-SQL

Phase 1: Zero-Shot Prompting

Phase 2: DDL Injection

Phase 3: Schema RAG (Retrieval-Augmented Generation)

Phase 4: Semantic Text-to-SQL and Agentic Loops

Master the Agentic Lakehouse

Start Your Free Dremio Trial

Architecting an Apache Iceberg Lakehouse

The AI Lakehouse

Apache Iceberg and Agentic AI

Lakehouse Built for Everyone