Package org.elasticsearch.xpack.esql


The ES|QL query language.

Overview

ES|QL is a typed query language that consists of many small languages separated by the | character, like this:

   FROM foo
 | WHERE a > 1
 | STATS m=MAX(j)
 | SORT m ASC
 | LIMIT 10
 

Here the FROM, WHERE, STATS, SORT, and LIMIT keywords enable the mini-languages for selecting indices, filtering documents, calculating aggregates, sorting results, and limiting the number of results, respectively.

Language Design Goals

In designing ES|QL we follow some principles and rules of thumb:
  • Don't waste people's time
  • Progress over perfection
  • Design for Elasticsearch
  • Be inspired by the best

Don't waste people's time

  • Queries should not fail at runtime. Instead we should return a warning and null.
  • It is ok to fail a query up front at analysis time. Just not after it's started.
  • It is better if things can be made to work.
  • But genuinely confusing requests require the query writer to make a choice.

As you can see, this is a real tightrope, but we try to follow the rules above in order. Examples:

  • If TO_DATETIME receives an invalid date at runtime, it emits a WARNING.
  • If DATE_EXTRACT receives an invalid extract configuration at query parsing time, the query fails to start.
  • 1 + 3.2 promotes both sides to a double.
  • 1 + "32" fails at query compile time and the query writer must decide to either write CONCAT(TO_STRING(1), "32") or 1 + TO_INT("32").
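The "warning and null" rule and the numeric promotion rule above can be illustrated with a minimal Java sketch. Everything here is hypothetical: `WarnAndNullSketch`, `Warnings`, `toIntOrNull`, and `add` are names made up for illustration and are not ES|QL classes. The sketch only assumes that a runtime conversion failure should record a warning and produce null instead of throwing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in ES|QL. */
final class WarnAndNullSketch {

    /** Collects warnings so that a running query never has to throw. */
    static final class Warnings {
        private final List<String> messages = new ArrayList<>();

        void add(String message) {
            messages.add(message);
        }

        List<String> messages() {
            return messages;
        }
    }

    /** Runtime conversion: invalid input yields null plus a warning, not an exception. */
    static Integer toIntOrNull(String value, Warnings warnings) {
        if (value == null) {
            return null;
        }
        try {
            return Integer.valueOf(value.trim());
        } catch (NumberFormatException e) {
            warnings.add("cannot convert [" + value + "] to an integer");
            return null;
        }
    }

    /** Mixed integer/double addition: the integer side is widened to a double. */
    static double add(int left, double right) {
        return left + right;
    }

    public static void main(String[] args) {
        Warnings warnings = new Warnings();
        System.out.println(toIntOrNull("32", warnings));         // a valid value converts normally
        System.out.println(toIntOrNull("thirty-two", warnings)); // an invalid value becomes null
        System.out.println(warnings.messages());                 // and the problem is reported as a warning
        System.out.println(add(1, 3.2));                         // the result is a double
    }
}
```

The ambiguous case, 1 + "32", has no counterpart in this sketch on purpose: under the rules above it is rejected before the query starts, so no runtime code path exists for it.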

Progress over perfection

  • Stability is super important for released features.
  • But we need to experiment and get feedback. So mark features experimental when there's any question about how they should work.
  • Experimental features shouldn't live forever because folks will get tired of waiting and use them in production anyway. We don't officially support them in production but we will feel bad if they break.

Design for Elasticsearch

We must design the language for Elasticsearch, celebrating its advantages and smoothing out its quirks.
  • doc_values sometimes sorts field values and sometimes sorts and removes duplicates. We couldn't hide this even if we wanted to, and most folks are ok with it. ES|QL has to be useful in those cases.
  • Multivalued fields are very easy to index in Elasticsearch so they should be easy to read in ES|QL. They should be easy to work with in ES|QL too, but we haven't gotten that far yet.
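To make the doc_values quirk concrete, here is a small Java sketch of the two behaviours described above: values that come back sorted, and values that come back sorted with duplicates removed. The class and method names are hypothetical, and the sketch does not claim which field types get which behaviour; it only models what a reader of a multivalued field may observe.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical illustration of the two orderings a multivalued field may come back in. */
final class DocValuesOrderSketch {

    /** Models storage that sorts the values of a field but keeps duplicates. */
    static long[] sortedKeepingDuplicates(long[] indexed) {
        long[] copy = indexed.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Models storage that sorts the values of a field and removes duplicates. */
    static String[] sortedWithoutDuplicates(String[] indexed) {
        return new TreeSet<String>(Arrays.asList(indexed)).toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Neither result preserves the order in which the values were indexed.
        System.out.println(Arrays.toString(sortedKeepingDuplicates(new long[] { 3, 1, 3 })));
        System.out.println(Arrays.toString(sortedWithoutDuplicates(new String[] { "b", "a", "b" })));
    }
}
```

In both cases the original order of the values is lost, which is why a query language built on this storage has to treat a multivalued field as an unordered collection of values.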

Be inspired by the best

We'll frequently have lots of different choices on how to implement a feature. We should talk and figure out the best way for us, especially considering Elasticsearch's advantages and quirks. But we should also look to our data-access-forebears:
  • PostgreSQL is the GOAT SQL implementation. It's a joy to use for everything but dates. Use DB Fiddle to link to syntax examples.
  • Oracle is pretty good about dates. It's fine about a lot of things but PostgreSQL is better.
  • MS SQL Server has a silly name but its documentation is wonderful.
  • SPL is super familiar to our users, and is a piped query language.

Major Components

Compute Engine

org.elasticsearch.compute - The compute engine drives query execution.
  • Block - The fundamental unit of data. Operations vectorize over blocks.
  • Page - Data is broken up into pages (collections of blocks) to manage its size in memory.
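The relationship between blocks and pages can be sketched in a few lines of Java. This is a deliberately simplified model and not the real API: `LongColumn` stands in loosely for a Block, `RowBatch` stands in loosely for a Page, and both names, along with `add`, are hypothetical. It assumes only what is stated above, namely that a page is a collection of blocks and that operations loop over a whole block at a time.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical stand-ins; the real Block and Page classes are richer than this. */
final class BlockPageSketch {

    /** A column of values stored in one primitive array, standing in for a Block. */
    static final class LongColumn {
        private final long[] values;

        LongColumn(long... values) {
            this.values = values;
        }

        int positionCount() {
            return values.length;
        }

        long get(int position) {
            return values[position];
        }
    }

    /** A bounded batch of rows made of equally sized columns, standing in for a Page. */
    static final class RowBatch {
        private final List<LongColumn> columns;

        RowBatch(LongColumn... columns) {
            for (LongColumn column : columns) {
                if (column.positionCount() != columns[0].positionCount()) {
                    throw new IllegalArgumentException("all columns must have the same position count");
                }
            }
            this.columns = Arrays.asList(columns);
        }

        LongColumn column(int index) {
            return columns.get(index);
        }

        int positionCount() {
            return columns.isEmpty() ? 0 : columns.get(0).positionCount();
        }
    }

    /** Adds two columns position by position in a single loop over the whole column. */
    static LongColumn add(LongColumn left, LongColumn right) {
        long[] out = new long[left.positionCount()];
        for (int p = 0; p < out.length; p++) {
            out[p] = left.get(p) + right.get(p);
        }
        return new LongColumn(out);
    }

    public static void main(String[] args) {
        RowBatch batch = new RowBatch(new LongColumn(1, 2, 3), new LongColumn(10, 20, 30));
        LongColumn sum = add(batch.column(0), batch.column(1));
        for (int p = 0; p < sum.positionCount(); p++) {
            System.out.println(sum.get(p));
        }
    }
}
```

Working on one column of a bounded batch at a time keeps each loop simple and keeps the amount of data held in memory proportional to the batch size, not to the size of the index.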

Core Classes

org.elasticsearch.xpack.esql.core - Core Classes
  • EsqlSession - Connects all major components and contains the high-level code for query execution
  • DataType - ES|QL is a typed language, and all the supported data types are listed in this collection.
  • Expression - Expression is the basis for all functions in ES|QL, but see also EvaluatorMapper
  • EsqlFunctionRegistry - Resolves function names to function implementations.
  • Sync and async HTTP API entry points

Query Planner

The query planner encompasses the logic of how to serve a query. Essentially, this covers everything from the output of the ANTLR parser through to the actual computations and Lucene operations.

Two key concepts in the planner layer:

  • Logical vs Physical optimization - Logical optimizations can be done strictly based on the structure of the query, while physical optimizations take into account information about the index or indices the query will execute against.
  • Local vs non-local operations - "Local" refers to operations happening on the data nodes, while non-local operations generally happen on the coordinating node and can apply to all nodes participating in the query.

Query Planner Steps

Guides

Code generation

ES|QL uses two kinds of code generation, which it uses mostly to monomorphize tight loops. Doing that by hand would require a lot of copy-and-paste with small tweaks, and some of us have copy-and-paste blindness, so instead we use code generation.
  1. When possible we use StringTemplate to build Java files. These files typically look like X-Blah.java.st and are typically used for things like the different Block types and their subclasses and aggregation state. The templates themselves are easy to read and edit. This process is appropriate for cases where you just have to copy and paste something and change a few lines here and there. See build.gradle for the code generators.
  2. When that doesn't work, we use annotation processing and JavaPoet to build the Java files. These files are typically the inner loops for EvalOperator.ExpressionEvaluator or AggregatorFunction. This kind of code generation is much more difficult to write and debug but much, much, much, much more flexible. The degree of control we have during this code generation is amazing. See files in org.elasticsearch.compute.gen for the code generators.
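For readers unfamiliar with monomorphization, the following Java sketch shows the idea behind both approaches. It is not taken from the ES|QL generators: the class and method names are hypothetical, and `expand` uses plain string replacement with assumed `$Type$` and `$type$` placeholders to stand in for a template engine. It does not use the StringTemplate or JavaPoet APIs.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of monomorphizing a loop; not the ES|QL generators. */
final class MonomorphizeSketch {

    /** Polymorphic loop: a single copy of the code, but every value is boxed and dispatched virtually. */
    static double sumGeneric(List<? extends Number> values) {
        double sum = 0;
        for (Number value : values) {
            sum += value.doubleValue();
        }
        return sum;
    }

    /** Monomorphic copy for long: the kind of near-duplicate a generator would stamp out. */
    static long sumLongs(long[] values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    /** Monomorphic copy for double: identical in shape, differing only in the types. */
    static double sumDoubles(double[] values) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        return sum;
    }

    /** Toy template expansion; the placeholder syntax is an assumption made for this sketch. */
    static String expand(String template, String typeName, String primitiveName) {
        return template.replace("$Type$", typeName).replace("$type$", primitiveName);
    }

    public static void main(String[] args) {
        String template = "static $type$ sum$Type$s($type$[] values) { /* same loop body */ }";
        System.out.println(expand(template, "Long", "long"));
        System.out.println(expand(template, "Double", "double"));
        System.out.println(sumGeneric(Arrays.asList(1, 2, 3)));
        System.out.println(sumLongs(new long[] { 1, 2, 3 }));
        System.out.println(sumDoubles(new double[] { 1.5, 2.5 }));
    }
}
```

The two specialized methods differ only in their types, which is exactly the kind of duplication that is tedious and error-prone to maintain by hand and straightforward to generate.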