Interface BlockLoader

All Known Implementing Classes:
AbstractBooleansBlockLoader, AbstractBytesRefsFromOrdsBlockLoader, AbstractDoublesFromDocValuesBlockLoader, AbstractIntsFromDocValuesBlockLoader, AbstractLongsFromDocValuesBlockLoader, AbstractShapeGeometryFieldMapper.AbstractShapeGeometryFieldType.BoundsBlockLoader, BlockDocValuesReader.DocValuesBlockLoader, BlockLoader.Delegating, BlockSourceReader.BooleansBlockLoader, BlockSourceReader.BytesRefsBlockLoader, BlockSourceReader.DenseVectorBlockLoader, BlockSourceReader.DoublesBlockLoader, BlockSourceReader.GeometriesBlockLoader, BlockSourceReader.IntsBlockLoader, BlockSourceReader.IpsBlockLoader, BlockSourceReader.LongsBlockLoader, BlockStoredFieldsReader.BytesFromBytesRefsBlockLoader, BlockStoredFieldsReader.BytesFromStringsBlockLoader, BlockStoredFieldsReader.IdBlockLoader, BlockStoredFieldsReader.StoredFieldsBlockLoader, BooleansBlockLoader, BytesRefsFromBinaryBlockLoader, BytesRefsFromCustomBinaryBlockLoader, BytesRefsFromOrdsBlockLoader, DenseVectorBlockLoader, DenseVectorFromBinaryBlockLoader, DoublesBlockLoader, FallbackSyntheticSourceBlockLoader, IntsBlockLoader, LongsBlockLoader, MvMaxBooleansBlockLoader, MvMaxBytesRefsFromOrdsBlockLoader, MvMaxDoublesFromDocValuesBlockLoader, MvMaxIntsFromDocValuesBlockLoader, MvMaxLongsFromDocValuesBlockLoader, MvMinBooleansBlockLoader, MvMinBytesRefsFromOrdsBlockLoader, MvMinDoublesFromDocValuesBlockLoader, MvMinIntsFromDocValuesBlockLoader, MvMinLongsFromDocValuesBlockLoader, SourceFieldBlockLoader, Utf8CodePointsFromOrdsBlockLoader

public interface BlockLoader
Loads values from a chunk of lucene documents into a "Block" for the compute engine.

Think of a Block as an array of values for a sequence of lucene documents. That's almost true! For the purposes of implementing BlockLoader, it's close enough. The compute engine operates on arrays because the good folks that build CPUs have spent the past 40 years making them really really good at running tight loops over arrays of data. So we play along with the CPU and make arrays.

How to implement

There are a lot of interesting choices hiding in here to make getting those arrays out of lucene work well:

  • doc_values are already on disk in array-like structures so we prefer to just copy them into an array in one loop inside BlockLoader.ColumnAtATimeReader. Well, not entirely array-like. doc_values are designed to be read in non-descending order (think 0, 1, 1, 4, 9) and will fail if they are read truly randomly. This lets the doc values implementations have some chunking/compression/magic on top of the array-like on disk structure. The caller manages this, always putting BlockLoader.Docs in non-descending order. Extend BlockDocValuesReader to implement all this.
  • All stored fields for each document are stored on disk together, compressed with a general purpose compression algorithm like Zstd. Blocks of documents are compressed together to get a better compression ratio. Just like doc values, we read them in non-descending order. Unlike doc values, we read all fields for a document at once, because reading one requires decompressing them all. We signal this by returning null from columnAtATimeReader(org.apache.lucene.index.LeafReaderContext), meaning we can't load the whole column at once. Instead, we implement a BlockLoader.RowStrideReader which the caller will call once for each doc. Extend BlockStoredFieldsReader to implement all this.
  • Fields loaded from _source are an extra special case of stored fields. _source itself is just another stored field, compressed in chunks with all the other stored fields. It's the original bytes sent when indexing the document. Think json or yaml. When we need fields from _source we get it from the stored fields reader infrastructure and then explode it into a Map representing the original json and the BlockLoader.RowStrideReader implementation grabs the parts of the json it needs. Extend BlockSourceReader to implement all this.
  • Synthetic _source complicates this further by storing fields in somewhat unexpected places, but is otherwise like a stored field reader. Use FallbackSyntheticSourceBlockLoader to implement all this.
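The doc_values case in the first bullet boils down to one tight copy loop over docs supplied in non-descending order. Here is a minimal sketch of that shape, using tiny stand-in interfaces (`Docs`, `NumericDocValues`, and a `List` in place of `BlockLoader.Builder`) rather than the real Elasticsearch and Lucene types:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the real API, just to show the copy-loop shape.
class DocValuesSketch {
    /** Stand-in for BlockLoader.Docs: doc ids, always in non-descending order. */
    interface Docs {
        int count();
        int get(int i);
    }

    /** Stand-in for a numeric doc-values iterator; advance only moves forward. */
    interface NumericDocValues {
        boolean advanceExact(int docId);
        long longValue();
    }

    /**
     * The core of a column-at-a-time read: one loop copying doc_values into an
     * array-like builder. Docs with no value for the field become nulls.
     */
    static List<Long> readColumn(NumericDocValues values, Docs docs) {
        List<Long> block = new ArrayList<>(docs.count());
        for (int i = 0; i < docs.count(); i++) {
            if (values.advanceExact(docs.get(i))) {
                block.add(values.longValue());
            } else {
                block.add(null); // field missing for this doc
            }
        }
        return block;
    }
}
```

The real implementations in BlockDocValuesReader follow this pattern but build typed, compute-engine-friendly blocks instead of a boxed list.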

How many to implement

Generally reads are faster from doc_values, slower from stored fields, and even slower from _source. If we get to choose, we pick doc_values. But we work with what's on disk, and that's a product of the field type and what the user's configured. Picking the optimal choice given what's on disk is the responsibility of each field's MappedFieldType.blockLoader(org.elasticsearch.index.mapper.MappedFieldType.BlockLoaderContext) method. The more configurable the field's storage strategies, the more BlockLoaders you have to implement to integrate it with ESQL. It can get to be a lot. Sorry.

For a field to be fully supported by ESQL it has to be loadable however it was configured to be stored. It's possible to turn off storage entirely by turning off doc_values and _source and stored fields. In that case, it's acceptable to return BlockLoader.ConstantNullsReader. The user turned the field off; the best we can do is null.
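The preference order above is the decision each field's blockLoader(...) method encodes. A minimal sketch of that decision, with a hypothetical enum standing in for the actual loader implementations:

```java
// Stand-in sketch of the storage-preference decision a MappedFieldType makes.
// The enum values are illustrative, not real Elasticsearch classes.
class LoaderChoiceSketch {
    enum Loader { DOC_VALUES, STORED_FIELDS, SOURCE, CONSTANT_NULLS }

    /**
     * Prefer doc_values, then stored fields, then _source. If the user has
     * disabled every storage mechanism, all we can load is null.
     */
    static Loader choose(boolean hasDocValues, boolean isStored, boolean hasSource) {
        if (hasDocValues) return Loader.DOC_VALUES;
        if (isStored) return Loader.STORED_FIELDS;
        if (hasSource) return Loader.SOURCE;
        return Loader.CONSTANT_NULLS;
    }
}
```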

We also sometimes want to "push" executing some ESQL functions into the block loader itself. Usually we do this when it's a ton faster. See the docs for BlockLoaderExpression for why and how we do this.

For example, long fields implement these block loaders:

NOTE: We can't read longs from stored fields, which is a bug, but maybe not a terrible one because it's very uncommon to configure long to be stored but to disable _source and doc_values. Nothing's perfect. Especially code.

Why is there a BlockLoader.AllReader?

When we described how to read from doc_values we said we prefer to use BlockLoader.ColumnAtATimeReader. But some callers don't support reading column-at-a-time and need to read row-by-row. So we also need an implementation of BlockLoader.RowStrideReader that reads from doc_values. Usually it's most convenient to implement both of those in the same class. BlockLoader.AllReader is an interface for those sorts of classes, and you'll see it in the doc_values code frequently.
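The "both in one class" idea can be sketched like this, again with simplified stand-in interfaces rather than the real API:

```java
// Stand-in sketch of the BlockLoader.AllReader idea: one class that serves
// both column-at-a-time and row-by-row callers. Not the real Elasticsearch API.
class AllReaderSketch {
    interface ColumnAtATimeReader { long[] readColumn(int[] docs); }
    interface RowStrideReader { long readRow(int doc); }

    /** Stand-in for BlockLoader.AllReader: both strategies in one type. */
    interface AllReader extends ColumnAtATimeReader, RowStrideReader {}

    /** A doc_values-style reader that supports both calling conventions. */
    static class LongsReader implements AllReader {
        private final long[] values; // pretend these are on-disk doc_values
        LongsReader(long[] values) { this.values = values; }

        public long readRow(int doc) { return values[doc]; }

        public long[] readColumn(int[] docs) {
            long[] out = new long[docs.length];
            for (int i = 0; i < docs.length; i++) out[i] = readRow(docs[i]);
            return out;
        }
    }
}
```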

Why is there a rowStrideStoredFieldSpec()?

When decompressing stored fields lucene can skip stored fields that aren't used. They still have to be decompressed, but they aren't turned into java objects, which saves a fair bit of work. If you don't need any stored fields return StoredFieldsSpec.NO_REQUIREMENTS. Otherwise, return what you need.
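A sketch of that contract, with a simplified stand-in for the real StoredFieldsSpec class and a hypothetical field name "title" for illustration:

```java
import java.util.Set;

// Simplified stand-in for org.elasticsearch.search.fetch.StoredFieldsSpec,
// only to show the two answers a loader typically gives.
class StoredFieldsSpecSketch {
    record StoredFieldsSpec(boolean requiresSource, Set<String> requiredStoredFields) {
        static final StoredFieldsSpec NO_REQUIREMENTS = new StoredFieldsSpec(false, Set.of());
    }

    /** A doc_values-backed loader needs no stored fields at all. */
    static StoredFieldsSpec docValuesSpec() {
        return StoredFieldsSpec.NO_REQUIREMENTS;
    }

    /** A stored-fields-backed loader declares exactly which fields it reads. */
    static StoredFieldsSpec storedFieldSpec() {
        return new StoredFieldsSpec(false, Set.of("title")); // "title" is hypothetical
    }
}
```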

Thread safety

Instances of this class must be immutable and thread safe. Instances of BlockLoader.ColumnAtATimeReader and BlockLoader.RowStrideReader are all mutable and can only be accessed by one thread at a time, but may be passed between threads. See implementations of BlockLoader.Reader.canReuse(int) for how that's handled. "Normal" java objects don't need to do anything special to be passed from thread to thread - the transfer itself establishes a happens-before relationship that makes everything you need visible. But Lucene's readers aren't "normal" java objects and sometimes need to be rebuilt if we shift threads.
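One plausible shape of the canReuse(int) check, sketched with stand-in types. The exact conditions here (same creating thread, non-descending next doc) are assumptions for illustration, not the real implementation:

```java
// Hypothetical sketch of the canReuse(...) idea: a reader reports whether it
// can keep going, and the caller rebuilds it when it can't. Not the real API.
class CanReuseSketch {
    interface Reader {
        /** True if this reader can continue from startingDocId on this thread. */
        boolean canReuse(int startingDocId);
    }

    static class ForwardOnlyReader implements Reader {
        private final Thread creationThread = Thread.currentThread();
        private int lastDoc = -1;

        void read(int doc) { lastDoc = doc; }

        public boolean canReuse(int startingDocId) {
            // Assumed rules: reuse only on the creating thread, and only when
            // the next read keeps the non-descending doc id order.
            return creationThread == Thread.currentThread() && startingDocId >= lastDoc;
        }
    }
}
```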

  • Field Details

    • CONSTANT_NULLS

      static final BlockLoader CONSTANT_NULLS
      Load blocks with only null.
  • Method Details

    • builder

      BlockLoader.Builder builder(BlockLoader.BlockFactory factory, int expectedCount)
      The BlockLoader.Builder for data of this type. Called when loading from a multi-segment or unsorted block.
    • columnAtATimeReader

      @Nullable BlockLoader.ColumnAtATimeReader columnAtATimeReader(org.apache.lucene.index.LeafReaderContext context) throws IOException
      Build a column-at-a-time reader. May return null if the underlying storage needs to be loaded row-by-row. Callers should try this first, only falling back to rowStrideReader(org.apache.lucene.index.LeafReaderContext) if this returns null or if they can't load column-at-a-time themselves.
      Throws:
      IOException
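The caller-side protocol this method implies can be sketched as follows, with stand-in reader types instead of the real ones: try the column reader first, and fall back to the row-stride reader only when it returns null.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch of the caller's fallback protocol. Not the real API.
class FallbackSketch {
    interface ColumnReader { List<Long> read(int[] docs); }
    interface RowReader { Long read(int doc); }

    interface Loader {
        ColumnReader columnAtATimeReader(); // may return null
        RowReader rowStrideReader();        // must never return null
    }

    static List<Long> load(Loader loader, int[] docs) {
        ColumnReader column = loader.columnAtATimeReader();
        if (column != null) {
            return column.read(docs); // fast path: one tight loop
        }
        RowReader row = loader.rowStrideReader(); // slow path: one call per doc
        List<Long> out = new ArrayList<>(docs.length);
        for (int doc : docs) {
            out.add(row.read(doc));
        }
        return out;
    }
}
```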
    • rowStrideReader

      BlockLoader.RowStrideReader rowStrideReader(org.apache.lucene.index.LeafReaderContext context) throws IOException
      Build a row-by-row reader. Must never return null, even if the underlying storage prefers to be loaded column-at-a-time. Some callers simply can't load column-at-a-time, so all implementations must support this method.
      Throws:
      IOException
    • rowStrideStoredFieldSpec

      StoredFieldsSpec rowStrideStoredFieldSpec()
      What stored fields are needed by this reader.
    • supportsOrdinals

      boolean supportsOrdinals()
      Does this loader support loading bytes via calling ordinals(org.apache.lucene.index.LeafReaderContext)?
    • ordinals

      org.apache.lucene.index.SortedSetDocValues ordinals(org.apache.lucene.index.LeafReaderContext context) throws IOException
      Load ordinals for the provided context.
      Throws:
      IOException
    • convert

      default BlockLoader.Block convert(BlockLoader.Block block)
      In support of 'Union Types', we sometimes want Blocks loaded from source to be converted immediately. Typically, this would be a type conversion or an encoding conversion.
      Parameters:
      block - original block loaded from source
      Returns:
      converted block (or original if no conversion required)
    • constantBytes

      static BlockLoader constantBytes(org.apache.lucene.util.BytesRef value)
      Load blocks with only value.