Class IndexedDISI
- java.lang.Object
-
- org.apache.lucene.search.DocIdSetIterator
-
- org.apache.lucene.codecs.lucene90.IndexedDISI
-
public final class IndexedDISI extends DocIdSetIterator
Disk-based implementation of aDocIdSetIterator
which can return the index of the current document, i.e. the ordinal of the current document among the list of documents that this iterator can return. This is useful to implement sparse doc values by only having to encode values for documents that actually have a value.Implementation-wise, this
DocIdSetIterator
is inspired ofroaring bitmaps
and encodes ranges of65536
documents independently and picks between 3 encodings depending on the density of the range:ALL
if the range contains 65536 documents exactly,DENSE
if the range contains 4096 documents or more; in that case documents are stored in a bit set,SPARSE
otherwise, and the lower 16 bits of the doc IDs are stored in ashort
.
Only ranges that contain at least one value are encoded.
This implementation uses 6 bytes per document in the worst-case, which happens in the case that all ranges contain exactly one document.
To avoid O(n) lookup time complexity, with n being the number of documents, two lookup tables are used: A lookup table for block offset and index, and a rank structure for DENSE block index lookups.
The lookup table is an array of
int
-pairs, with a pair for each block. It allows for direct jumping to the block, as opposed to iteration from the current position and forward one block at a time.Each int-pair entry consists of 2 logical parts:
The first 32 bit int holds the index (number of set bits in the blocks) up to just before the wanted block. The maximum number of set bits is the maximum number of documents, which is less than 2^31.
The next int holds the offset in bytes into the underlying slice. As there is a maximum of 2^16 blocks, it follows that the maximum size of any block must not exceed 2^15 bytes to avoid overflow (2^16 bytes if the int is treated as unsigned). This is currently the case, with the largest block being DENSE and using 2^13 + 36 bytes.
The cache overhead is numDocs/1024 bytes.
Note: There are 4 types of blocks: ALL, DENSE, SPARSE and non-existing (0 set bits). In the case of non-existing blocks, the entry in the lookup table has index equal to the previous entry and offset equal to the next non-empty block.
The block lookup table is stored at the end of the total block structure.
The rank structure for DENSE blocks is an array of byte-pairs with an entry for each sub-block (default 512 bits) out of the 65536 bits in the outer DENSE block.
Each rank-entry states the number of set bits within the block up to the bit before the bit positioned at the start of the sub-block. Note that that the rank entry of the first sub-block is always 0 and that the last entry can at most be 65536-2 = 65634 and thus will always fit into an byte-pair of 16 bits.
The rank structure for a given DENSE block is stored at the beginning of the DENSE block. This ensures locality and keeps logistics simple.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description (package private) static class
IndexedDISI.Method
-
Field Summary
Fields Modifier and Type Field Description (package private) int
block
private static int
BLOCK_SIZE
(package private) long
blockEnd
(package private) long
cost
static byte
DEFAULT_DENSE_RANK_POWER
private static int
DENSE_BLOCK_LONGS
(package private) long
denseBitmapOffset
(package private) int
denseOrigoIndex
(package private) byte
denseRankPower
(package private) byte[]
denseRankTable
(package private) int
doc
(package private) boolean
exists
(package private) int
gap
(package private) int
index
(package private) RandomAccessInput
jumpTable
(package private) int
jumpTableEntryCount
(package private) static int
MAX_ARRAY_LENGTH
(package private) IndexedDISI.Method
method
(package private) int
nextBlockIndex
(package private) int
numberOfOnes
(package private) IndexInput
slice
The slice that stores theDocIdSetIterator
.(package private) long
word
(package private) int
wordIndex
-
Fields inherited from class org.apache.lucene.search.DocIdSetIterator
NO_MORE_DOCS
-
-
Constructor Summary
Constructors Constructor Description IndexedDISI(IndexInput in, long offset, long length, int jumpTableEntryCount, byte denseRankPower, long cost)
This constructor always creates a new blockSlice and a new jumpTable from in, to ensure that operations are independent from the caller.IndexedDISI(IndexInput blockSlice, RandomAccessInput jumpTable, int jumpTableEntryCount, byte denseRankPower, long cost)
This constructor allows to pass the slice and jumpTable directly in case it helps reuse.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description private static int[]
addJumps(int[] jumps, long offset, int index, int startBlock, int endBlock)
int
advance(int target)
Advances to the first beyond the current whose document number is greater than or equal to target, and returns the document number itself.private void
advanceBlock(int targetBlock)
boolean
advanceExact(int target)
long
cost()
Returns the estimated cost of thisDocIdSetIterator
.static IndexInput
createBlockSlice(IndexInput slice, java.lang.String sliceDescription, long offset, long length, int jumpTableEntryCount)
Helper method for usingIndexedDISI(IndexInput, RandomAccessInput, int, byte, long)
.static RandomAccessInput
createJumpTable(IndexInput slice, long offset, long length, int jumpTableEntryCount)
Helper method for usingIndexedDISI(IndexInput, RandomAccessInput, int, byte, long)
.private static byte[]
createRank(FixedBitSet buffer, byte denseRankPower)
int
docID()
Returns the following:-1
ifDocIdSetIterator.nextDoc()
orDocIdSetIterator.advance(int)
were not called yet.private static void
flush(int block, FixedBitSet buffer, int cardinality, byte denseRankPower, IndexOutput out)
private static short
flushBlockJumps(int[] jumps, int blockCount, IndexOutput out)
int
index()
int
nextDoc()
Advances to the next document in the set and returns the doc it is currently on, orDocIdSetIterator.NO_MORE_DOCS
if there are no more docs in the set.
NOTE: after the iterator has exhausted you should not call this method, as it may result in unpredicted behavior.private static void
rankSkip(IndexedDISI disi, int targetInBlock)
If the distance between the current position and the target is > 8 words, the rank cache will be used to guarantee a worst-case of 1 rank-lookup and 7 word-read-and-count-bits operations.private void
readBlockHeader()
(package private) static short
writeBitSet(DocIdSetIterator it, IndexOutput out)
Writes the docIDs from it to out, in logical blocks, one for each 65536 docIDs in monotonically increasing gap-less order.static short
writeBitSet(DocIdSetIterator it, IndexOutput out, byte denseRankPower)
Writes the docIDs from it to out, in logical blocks, one for each 65536 docIDs in monotonically increasing gap-less order.-
Methods inherited from class org.apache.lucene.search.DocIdSetIterator
all, empty, range, slowAdvance
-
-
-
-
Field Detail
-
BLOCK_SIZE
private static final int BLOCK_SIZE
- See Also:
- Constant Field Values
-
DENSE_BLOCK_LONGS
private static final int DENSE_BLOCK_LONGS
- See Also:
- Constant Field Values
-
DEFAULT_DENSE_RANK_POWER
public static final byte DEFAULT_DENSE_RANK_POWER
- See Also:
- Constant Field Values
-
MAX_ARRAY_LENGTH
static final int MAX_ARRAY_LENGTH
- See Also:
- Constant Field Values
-
slice
final IndexInput slice
The slice that stores theDocIdSetIterator
.
-
jumpTableEntryCount
final int jumpTableEntryCount
-
denseRankPower
final byte denseRankPower
-
jumpTable
final RandomAccessInput jumpTable
-
denseRankTable
final byte[] denseRankTable
-
cost
final long cost
-
block
int block
-
blockEnd
long blockEnd
-
denseBitmapOffset
long denseBitmapOffset
-
nextBlockIndex
int nextBlockIndex
-
method
IndexedDISI.Method method
-
doc
int doc
-
index
int index
-
exists
boolean exists
-
word
long word
-
wordIndex
int wordIndex
-
numberOfOnes
int numberOfOnes
-
denseOrigoIndex
int denseOrigoIndex
-
gap
int gap
-
-
Constructor Detail
-
IndexedDISI
public IndexedDISI(IndexInput in, long offset, long length, int jumpTableEntryCount, byte denseRankPower, long cost) throws java.io.IOException
This constructor always creates a new blockSlice and a new jumpTable from in, to ensure that operations are independent from the caller. SeeIndexedDISI(IndexInput, RandomAccessInput, int, byte, long)
for re-use of blockSlice and jumpTable.- Parameters:
in
- backing data.offset
- starting offset for blocks in the backing data.length
- the number of bytes holding blocks and jump-table in the backing data.jumpTableEntryCount
- the number of blocks covered by the jump-table. This must match the number returned bywriteBitSet(DocIdSetIterator, IndexOutput, byte)
.denseRankPower
- the number of docIDs covered by each rank entry in DENSE blocks, expressed as2^denseRankPower
. This must match the power given inwriteBitSet(DocIdSetIterator, IndexOutput, byte)
cost
- normally the number of logical docIDs.- Throws:
java.io.IOException
-
IndexedDISI
IndexedDISI(IndexInput blockSlice, RandomAccessInput jumpTable, int jumpTableEntryCount, byte denseRankPower, long cost) throws java.io.IOException
This constructor allows to pass the slice and jumpTable directly in case it helps reuse. see eg. Lucene80 norms producer's merge instance.- Parameters:
blockSlice
- data blocks, normally created bycreateBlockSlice(org.apache.lucene.store.IndexInput, java.lang.String, long, long, int)
.jumpTable
- table holding jump-data for block-skips, normally created bycreateJumpTable(org.apache.lucene.store.IndexInput, long, long, int)
.jumpTableEntryCount
- the number of blocks covered by the jump-table. This must match the number returned bywriteBitSet(DocIdSetIterator, IndexOutput, byte)
.denseRankPower
- the number of docIDs covered by each rank entry in DENSE blocks, expressed as2^denseRankPower
. This must match the power given inwriteBitSet(DocIdSetIterator, IndexOutput, byte)
cost
- normally the number of logical docIDs.- Throws:
java.io.IOException
-
-
Method Detail
-
flush
private static void flush(int block, FixedBitSet buffer, int cardinality, byte denseRankPower, IndexOutput out) throws java.io.IOException
- Throws:
java.io.IOException
-
createRank
private static byte[] createRank(FixedBitSet buffer, byte denseRankPower)
-
writeBitSet
static short writeBitSet(DocIdSetIterator it, IndexOutput out) throws java.io.IOException
Writes the docIDs from it to out, in logical blocks, one for each 65536 docIDs in monotonically increasing gap-less order. DENSE blocks usesDEFAULT_DENSE_RANK_POWER
of 9 (every 512 docIDs / 8 longs). The caller must keep track of the number of jump-table entries (returned by this method) as well as the denseRankPower (9 for this method) and provide them when constructing an IndexedDISI for reading.- Parameters:
it
- the document IDs.out
- destination for the blocks.- Returns:
- the number of jump-table entries following the blocks, -1 for no entries. This should be stored in meta and used when creating an instance of IndexedDISI.
- Throws:
java.io.IOException
- if there was an error writing to out.
-
writeBitSet
public static short writeBitSet(DocIdSetIterator it, IndexOutput out, byte denseRankPower) throws java.io.IOException
Writes the docIDs from it to out, in logical blocks, one for each 65536 docIDs in monotonically increasing gap-less order. The caller must keep track of the number of jump-table entries (returned by this method) as well as the denseRankPower and provide them when constructing an IndexedDISI for reading.- Parameters:
it
- the document IDs.out
- destination for the blocks.denseRankPower
- forIndexedDISI.Method.DENSE
blocks, a rank will be written every2^denseRankPower
docIDs. Values < 7 (every 128 docIDs) or > 15 (every 32768 docIDs) disables DENSE rank. Recommended values are 8-12: Every 256-4096 docIDs or 4-64 longs.DEFAULT_DENSE_RANK_POWER
is 9: Every 512 docIDs. This should be stored in meta and used when creating an instance of IndexedDISI.- Returns:
- the number of jump-table entries following the blocks, -1 for no entries. This should be stored in meta and used when creating an instance of IndexedDISI.
- Throws:
java.io.IOException
- if there was an error writing to out.
-
addJumps
private static int[] addJumps(int[] jumps, long offset, int index, int startBlock, int endBlock)
-
flushBlockJumps
private static short flushBlockJumps(int[] jumps, int blockCount, IndexOutput out) throws java.io.IOException
- Throws:
java.io.IOException
-
createBlockSlice
public static IndexInput createBlockSlice(IndexInput slice, java.lang.String sliceDescription, long offset, long length, int jumpTableEntryCount) throws java.io.IOException
Helper method for usingIndexedDISI(IndexInput, RandomAccessInput, int, byte, long)
. Creates a disiSlice for the IndexedDISI data blocks, without the jump-table.- Parameters:
slice
- backing data, holding both blocks and jump-table.sliceDescription
- human readable slice designation.offset
- relative to the backing data.length
- full length of the IndexedDISI, including blocks and jump-table data.jumpTableEntryCount
- the number of blocks covered by the jump-table.- Returns:
- a jumpTable containing the block jump-data or null if no such table exists.
- Throws:
java.io.IOException
- if a RandomAccessInput could not be created from slice.
-
createJumpTable
public static RandomAccessInput createJumpTable(IndexInput slice, long offset, long length, int jumpTableEntryCount) throws java.io.IOException
Helper method for usingIndexedDISI(IndexInput, RandomAccessInput, int, byte, long)
. Creates a RandomAccessInput covering only the jump-table data or null.- Parameters:
slice
- backing data, holding both blocks and jump-table.offset
- relative to the backing data.length
- full length of the IndexedDISI, including blocks and jump-table data.jumpTableEntryCount
- the number of blocks covered by the jump-table.- Returns:
- a jumpTable containing the block jump-data or null if no such table exists.
- Throws:
java.io.IOException
- if a RandomAccessInput could not be created from slice.
-
docID
public int docID()
Description copied from class:DocIdSetIterator
Returns the following:-1
ifDocIdSetIterator.nextDoc()
orDocIdSetIterator.advance(int)
were not called yet.DocIdSetIterator.NO_MORE_DOCS
if the iterator has exhausted.- Otherwise it should return the doc ID it is currently on.
- Specified by:
docID
in classDocIdSetIterator
-
advance
public int advance(int target) throws java.io.IOException
Description copied from class:DocIdSetIterator
Advances to the first beyond the current whose document number is greater than or equal to target, and returns the document number itself. Exhausts the iterator and returnsDocIdSetIterator.NO_MORE_DOCS
if target is greater than the highest document number in the set.The behavior of this method is undefined when called with
target ≤ current
, or after the iterator has exhausted. Both cases may result in unpredicted behavior.When
target > current
it behaves as if written:int advance(int target) { int doc; while ((doc = nextDoc()) < target) { } return doc; }
Some implementations are considerably more efficient than that.NOTE: this method may be called with
DocIdSetIterator.NO_MORE_DOCS
for efficiency by some Scorers. If your implementation cannot efficiently determine that it should exhaust, it is recommended that you check for that value in each call to this method.- Specified by:
advance
in classDocIdSetIterator
- Throws:
java.io.IOException
-
advanceExact
public boolean advanceExact(int target) throws java.io.IOException
- Throws:
java.io.IOException
-
advanceBlock
private void advanceBlock(int targetBlock) throws java.io.IOException
- Throws:
java.io.IOException
-
readBlockHeader
private void readBlockHeader() throws java.io.IOException
- Throws:
java.io.IOException
-
nextDoc
public int nextDoc() throws java.io.IOException
Description copied from class:DocIdSetIterator
Advances to the next document in the set and returns the doc it is currently on, orDocIdSetIterator.NO_MORE_DOCS
if there are no more docs in the set.
NOTE: after the iterator has exhausted you should not call this method, as it may result in unpredicted behavior.- Specified by:
nextDoc
in classDocIdSetIterator
- Throws:
java.io.IOException
-
index
public int index()
-
cost
public long cost()
Description copied from class:DocIdSetIterator
Returns the estimated cost of thisDocIdSetIterator
.This is generally an upper bound of the number of documents this iterator might match, but may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.
- Specified by:
cost
in classDocIdSetIterator
-
rankSkip
private static void rankSkip(IndexedDISI disi, int targetInBlock) throws java.io.IOException
If the distance between the current position and the target is > 8 words, the rank cache will be used to guarantee a worst-case of 1 rank-lookup and 7 word-read-and-count-bits operations. Note: This does not guarantee a skip up to target, only up to nearest rank boundary. It is the responsibility of the caller to iterate further to reach target.- Parameters:
disi
- standard DISI.targetInBlock
- lower 16 bits of the target- Throws:
java.io.IOException
- if a DISI seek failed.
-
-