Saturday, December 13, 2025

A Rigorous Conversion of Natural Language to SQL

Some time back, I was thinking about the problem of converting natural language to a relational query language, e.g. SQL. Here's PostgreSQL's parser.

Today's typical language models are not robust estimators, and they're even worse at generating formal languages. They aren't fit for quantifying their own uncertainty either. This has to do in part with the "largeness" of their parametric forms, the parametric forms themselves, and the optimization programs solved to train them. But I won't explain that in full just now.

How do you convert a natural language query to a relational query using only the noisy outputs of a language model?

First, remember that statistical estimation can be posed as a statistical inverse problem. You can think of the "natural language to relational query language" problem as estimating some unknown partial function (partial, since not all natural language sentences are queries). That framing matters because it lets us treat the task as an inverse problem. What's the corresponding forward problem? It's simply the "relational query language to natural language" problem.

Considering the limitations of present language models, I think this forward problem is special because it's much more well-posed than the inverse if you do it in a bottom-up manner. SQL is inductively defined by a formal grammar. Imagine that you have two relations and a natural language analogue to both of them. Now, what's the natural language analogue of the equijoin of those relations? That's way easier for today's language models to solve.
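To make the bottom-up idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Table` and `EquiJoin` classes, the `describe` and `to_sql` functions, and the example schema are all hypothetical. A fixed sentence template stands in for the language model that would produce the natural language analogue at each step. The point is only that the description of a join is built from the descriptions of its two inputs.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Table:
    name: str    # relation name in the (hypothetical) schema
    phrase: str  # natural language analogue, assumed to be given


@dataclass(frozen=True)
class EquiJoin:
    left: "Rel"
    right: "Rel"
    left_col: str   # qualified column from the left input
    right_col: str  # qualified column from the right input


Rel = Union[Table, EquiJoin]


def from_item(rel: Rel) -> str:
    """Render a relation as a FROM-clause item."""
    if isinstance(rel, Table):
        return rel.name
    left = from_item(rel.left)
    right = from_item(rel.right)
    # A join nested on the right needs parentheses to keep its grouping.
    if isinstance(rel.right, EquiJoin):
        right = f"({right})"
    return f"{left} JOIN {right} ON {rel.left_col} = {rel.right_col}"


def to_sql(rel: Rel) -> str:
    return f"SELECT * FROM {from_item(rel)}"


def describe(rel: Rel) -> str:
    """Forward problem, solved bottom-up: the description of a join
    is composed from the descriptions of its inputs. A template
    stands in here for a language model call."""
    if isinstance(rel, Table):
        return rel.phrase
    return (f"{describe(rel.left)} paired with {describe(rel.right)} "
            f"where {rel.left_col} equals {rel.right_col}")


employees = Table("employees", "the employees")
departments = Table("departments", "the departments")
joined = EquiJoin(employees, departments,
                  "employees.dept_id", "departments.id")

print(to_sql(joined))
print(describe(joined))
```

In a real setup, the template in `describe` would be replaced by a model that only has to phrase one operator at a time, given the phrases already produced for its operands.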

But what's the use of solving the forward problem if we actually want to solve the inverse problem? Well, given some noisy results of the inverse (language model NL-to-SQL conversions), you can run each of those outputs through the forward direction to get a relatively less noisy natural language rendering of it, and then measure pairwise consistency/similarity.

I can imagine a number of parametric and non-parametric statistics that you could form out of those measurements. But, whatever. This basically reduces the difficulty of NL-to-SQL to that of semantic similarity over natural language, which is very easy for language models nowadays.
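One very simple statistic of this kind can be sketched as follows. This is just one reading of the idea, not a definitive method: each candidate query is translated back to natural language and scored by its similarity to the original question. The function names (`tokens`, `jaccard`, `rank_candidates`), the candidate queries, and their canned back-translations are all hypothetical. Token-level Jaccard overlap is a crude stand-in for a real semantic similarity model.

```python
import re
from typing import Callable, List, Sequence, Tuple


def tokens(text: str) -> set:
    """Lowercased word tokens of a sentence."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token overlap; a placeholder for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def rank_candidates(
    question: str,
    candidates: Sequence[str],
    describe: Callable[[str], str],
    similarity: Callable[[str, str], float] = jaccard,
) -> List[Tuple[float, str]]:
    """Score each candidate SQL string by how closely its forward
    (SQL-to-NL) description matches the question, best first."""
    scored = [(similarity(question, describe(sql)), sql)
              for sql in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical noisy NL-to-SQL outputs with canned back-translations.
back_translations = {
    "SELECT * FROM employees": "all employees",
    "SELECT * FROM employees JOIN departments "
    "ON employees.dept_id = departments.id":
        "employees with their departments",
}

question = "list employees with their departments"
ranked = rank_candidates(question,
                         list(back_translations),
                         back_translations.__getitem__)

for score, sql in ranked:
    print(f"{score:.2f}  {sql}")
```

Scoring the back-translations against each other, rather than against the question, would give a pairwise agreement statistic instead. Either way, the only hard judgment left is similarity between natural language sentences.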

You could make this process more rigorous with intermediate DSLs tailored to your relational schema. That's not a coincidence: this problem is essentially that of a compiler, since it's a matter of mapping one language to another. Just as an LLVM-based compiler such as Clang converts C programs to machine code, a solution to the inverse problem converts natural language to SQL. An intermediate language does not really change things. In fact, LLVM does have an intermediate language/representation: LLVM IR.