java.lang.Object
org.elasticsearch.compute.aggregation.blockhash.BlockHash
All Implemented Interfaces:
Closeable, AutoCloseable, SeenGroupIds, org.elasticsearch.core.Releasable
Direct Known Subclasses:
CategorizeBlockHash, CategorizePackedValuesBlockHash, TimeSeriesBlockHash

public abstract class BlockHash extends Object implements org.elasticsearch.core.Releasable, SeenGroupIds
Specialized hash table implementations that map rows to the set of bucket IDs to which they belong, in order to implement GROUP BY expressions.

A row is always in at least one bucket, so the results are never null. null-valued key columns will map to some integer bucket id. If none of the key columns are multivalued then the output is always an IntVector. If any of the keys are multivalued then a row is in a bucket for each value. If more than one key is multivalued then the row is in the combinatorial explosion of all value combinations. Luckily, no matter the number of values, a row can only be in each bucket once. Unluckily, it's the responsibility of BlockHash to remove those duplicates.
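
To illustrate the bucketing rules above, here is a standalone sketch (not the BlockHash API; the row values are invented) that expands two multivalued keys into their cross product and deduplicates the combinations, which is exactly the semantic a BlockHash implementation must provide:

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class CombinationExample {
        public static void main(String[] args) {
            // One row with two multivalued keys: a = [1, 2], b = ["x", "x", "y"].
            List<Integer> a = List.of(1, 2);
            List<String> b = List.of("x", "x", "y");

            // The row lands in one bucket per unique (a, b) combination;
            // the duplicate "x" must not produce a duplicate bucket entry.
            Set<String> buckets = new LinkedHashSet<>();
            for (Integer av : a) {
                for (String bv : b) {
                    buckets.add(av + "|" + bv);
                }
            }
            System.out.println(buckets); // [1|x, 1|y, 2|x, 2|y]
        }
    }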

These classes typically delegate to some combination of BytesRefHash, LongHash, LongLongHash, Int3Hash. They don't technically have to be hash tables, so long as they implement the deduplication semantics above and vend integer ids.

The integer ids are assigned to offsets into arrays of aggregation states, so it's permissible to have gaps in the ints. But large gaps are a bad idea because they'll waste space in the aggregations that use these positions. For example, BooleanBlockHash assigns 0 to null, 1 to false, and 2 to true, and that's fine and simple and good because it'll never leave a big gap, even if we never see null.
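
To see why gaps matter, consider that aggregation state is typically addressed by group id, so the largest id handed out drives the allocation. A made-up sketch, not the actual aggregation code:

    // State for a "sum" aggregation, indexed by group id (invented example).
    // With BooleanBlockHash ids null=0, false=1, true=2 the array stays tiny;
    // a hash that handed out id 1_000_000 would force a huge, mostly-empty array.
    long[] sums = new long[3];
    sums[2] += 42; // accumulate into the state for the "true" group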

  • Method Details

    • add

      public abstract void add(Page page, GroupingAggregatorFunction.AddInput addInput)
      Add all values for the "group by" columns in the page to the hash and pass the ordinals to the provided GroupingAggregatorFunction.AddInput.

      This call will not Releasable.close() addInput.
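
      A minimal wiring sketch, assuming the add(int, IntBlock) and add(int, IntVector) overloads present on recent versions of GroupingAggregatorFunction.AddInput (check the interface in your version; blockHash and page are assumed to exist):

        // A hypothetical consumer that receives the group ids for each position.
        GroupingAggregatorFunction.AddInput addInput = new GroupingAggregatorFunction.AddInput() {
            @Override
            public void add(int positionOffset, IntBlock groupIds) {
                // General path: positions may map to zero or many group ids.
            }

            @Override
            public void add(int positionOffset, IntVector groupIds) {
                // Fast path: exactly one group id per position.
            }

            @Override
            public void close() {
                // BlockHash.add will not call this; the caller owns the lifecycle.
            }
        };
        blockHash.add(page, addInput);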

    • lookup

      public abstract org.elasticsearch.core.ReleasableIterator<IntBlock> lookup(Page page, ByteSizeValue targetBlockSize)
      Look up all values for the "group by" columns in the page in the hash and return an iterator over blocks of the group ids found. The sum of Block.getPositionCount() across all blocks returned by the iterator will equal Page.getPositionCount(), but each block will "target" a size of targetBlockSize.

      The returned ReleasableIterator may retain a reference to Blocks inside the Page. Close it to release those references.
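
      A minimal usage sketch; that each returned IntBlock must be released by the caller is an assumption based on the usual block ref-counting rules (blockHash and page are assumed to exist):

        try (ReleasableIterator<IntBlock> groupIds = blockHash.lookup(page, ByteSizeValue.ofKb(64))) {
            while (groupIds.hasNext()) {
                try (IntBlock block = groupIds.next()) {
                    // Consume roughly 64kb of group ids at a time; positions
                    // across all blocks line up with the positions in the page.
                }
            }
        } // closing the iterator releases any references it kept to Blocks in the Page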

    • getKeys

      public abstract Block[] getKeys()
      Returns the Blocks that contain all the keys inserted by add(org.elasticsearch.compute.data.Page, org.elasticsearch.compute.aggregation.GroupingAggregatorFunction.AddInput).

      Keys must be in the same order as the IDs returned by nonEmpty().

    • nonEmpty

      public abstract IntVector nonEmpty()
      The grouping ids that are not empty. We use this because some block hashes reserve space for grouping ids and then don't end up using them. For example, BooleanBlockHash does this by always reserving ids for false and true (1 and 2, with 0 reserved for null). It's only after collection that we know whether any true or false values were actually received.

      IDs must be in the same order as the keys returned by getKeys().
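
      A minimal sketch of walking the two results in lockstep, relying on the ordering contract above (position i of every keys block holds the key for the group id at position i of nonEmpty(); variable names are illustrative):

        Block[] keys = blockHash.getKeys();
        IntVector ids = blockHash.nonEmpty();
        for (int i = 0; i < ids.getPositionCount(); i++) {
            int groupId = ids.getInt(i);
            // keys[0] ... keys[keys.length - 1] at position i hold the key
            // values for the aggregation state stored at offset groupId.
        }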

    • seenGroupIds

      public abstract BitArray seenGroupIds(BigArrays bigArrays)
      Description copied from interface: SeenGroupIds
      The grouping ids that have been seen already. This BitArray is kept and mutated by the caller, so make a copy if you need your own copy of it, as sketched below.
      Specified by:
      seenGroupIds in interface SeenGroupIds
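
      Since the array may keep changing under you, a stable snapshot can be made by copying the bits you care about. A minimal sketch, assuming the caller tracks maxGroupId and using the get(long)/set(long) methods of org.elasticsearch.common.util.BitArray:

        BitArray seen = blockHash.seenGroupIds(bigArrays);
        BitArray copy = new BitArray(maxGroupId + 1, bigArrays); // Releasable: close it when done
        for (long g = 0; g <= maxGroupId; g++) {
            if (seen.get(g)) {
                copy.set(g);
            }
        }
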
    • build

      public static BlockHash build(List<BlockHash.GroupSpec> groups, BlockFactory blockFactory, int emitBatchSize, boolean allowBrokenOptimizations)
      Creates a specialized hash table that maps one or more Blocks to ids.
      Parameters:
      emitBatchSize - maximum batch size to be emitted when handling combinatorial explosion of groups caused by multivalued fields
      allowBrokenOptimizations - true to allow optimizations with bad null handling. We will fix their null handling and remove this flag, but until then we need these optimizations disabled in production; the flag lets us continue to compile and test them.
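
      A minimal construction sketch. The GroupSpec constructor arguments shown (a page channel and an ElementType) are an assumption and vary between versions; blockFactory is assumed to exist:

        // Group by the values in channel 0 of incoming pages, treated as longs.
        List<BlockHash.GroupSpec> groups = List.of(new BlockHash.GroupSpec(0, ElementType.LONG));
        BlockHash blockHash = BlockHash.build(groups, blockFactory, 16 * 1024, false);
        // ... add(page, addInput) for each page, then read getKeys()/nonEmpty() ...
        blockHash.close(); // BlockHash is Releasable
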
    • buildPackedValuesBlockHash

      public static BlockHash buildPackedValuesBlockHash(List<BlockHash.GroupSpec> groups, BlockFactory blockFactory, int emitBatchSize)
      Temporary method to build a PackedValuesBlockHash.
    • buildCategorizeBlockHash

      public static BlockHash buildCategorizeBlockHash(List<BlockHash.GroupSpec> groups, AggregatorMode aggregatorMode, BlockFactory blockFactory, AnalysisRegistry analysisRegistry, int emitBatchSize)
      Builds a BlockHash for the Categorize grouping function.
    • hashOrdToGroup

      public static long hashOrdToGroup(long ord)
      Convert the result of calling LongHash or LongLongHash or BytesRefHash or similar to a group ordinal. These hashes return negative numbers when the value being added has already been seen. We don't need that distinction, so this converts such results back to the positive ord.
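
      A sketch of the conversion this describes, assuming the convention that these hashes return -1 - existingOrd for a key they have already seen:

        static long hashOrdToGroup(long ord) {
            // Negative means "already seen"; recover the original ordinal.
            return ord < 0 ? -1 - ord : ord;
        }
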
    • hashOrdToGroupNullReserved

      public static long hashOrdToGroupNullReserved(long ord)
      Convert the result of calling LongHash or LongLongHash or BytesRefHash or similar to a group ordinal, reserving 0 for null.
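
      Under the same assumption, the likely shape of this variant is the same normalization shifted up by one so that id 0 stays free for null keys:

        static long hashOrdToGroupNullReserved(long ord) {
            return (ord < 0 ? -1 - ord : ord) + 1; // 0 is reserved for null
        }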