Class WordStorage
- java.lang.Object
-
- org.apache.lucene.analysis.hunspell.WordStorage
-
class WordStorage extends java.lang.Object
A data structure for memory-efficient word storage and fast lookup/enumeration. Each dictionary entry is stored as:- the last character
- pointer to a similar entry for the prefix (all characters except the last one)
- value data: a list of ints representing word flags and morphological data, and a pointer to hash collisions, if any
FST
, but a reversed one: each node points to a single previous node instead of several following ones. For example, "abc" and "abd" point to the same prefix entry "ab" which points to "a" which points to 0.
The entries are stored in a contiguous byte array, identified by their offsets, usingDataOutput.writeVInt(int)
()} VINT} format for compression.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description (package private) static class
WordStorage.Builder
-
Field Summary
Fields Modifier and Type Field Description private static int
COLLISION_MASK
private int[]
hashTable
A map from word's hash (modulo array's length) into an int containing: lowerOFFSET_BITS
: the offset inwordData
of the last entry with this hash the remaining highest bits: COLLISION+LENGTH info for that entry, i.e.private static int
MAX_STORED_LENGTH
private static int
OFFSET_BITS
private static int
OFFSET_MASK
private byte[]
wordData
An array of word entries: VINT: the word's last character VINT: a delta pointer to the entry for the same word without the last character.
-
Constructor Summary
Constructors Modifier Constructor Description private
WordStorage(int[] hashTable, byte[] wordData)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description private static boolean
hasCollision(int mask)
private boolean
hasLength(int mask, int length)
private static boolean
hasLengthInRange(int mask, int minLength, int maxLength)
private boolean
isSameString(char[] word, int offset, int length, int dataPos, ByteArrayDataInput in)
(package private) IntsRef
lookupWord(char[] word, int offset, int length)
(package private) void
processAllWords(int minLength, int maxLength, java.util.function.BiConsumer<CharsRef,IntsRef> processor)
Calls the processor for every dictionary entry with length between minLength and maxLength, both ends inclusive.private void
readForms(IntsRef forms, ByteArrayDataInput in, int length)
-
-
-
Field Detail
-
OFFSET_BITS
private static final int OFFSET_BITS
- See Also:
- Constant Field Values
-
OFFSET_MASK
private static final int OFFSET_MASK
- See Also:
- Constant Field Values
-
COLLISION_MASK
private static final int COLLISION_MASK
- See Also:
- Constant Field Values
-
MAX_STORED_LENGTH
private static final int MAX_STORED_LENGTH
- See Also:
- Constant Field Values
-
hashTable
private final int[] hashTable
A map from word's hash (modulo array's length) into an int containing:- lower
OFFSET_BITS
: the offset inwordData
of the last entry with this hash - the remaining highest bits: COLLISION+LENGTH info for that entry, i.e. one bit indicating
whether there are other entries with the same hash, and the length of the entry in chars,
or
MAX_STORED_LENGTH
if the length exceeds that limit (next highest bits)
- lower
-
wordData
private final byte[] wordData
An array of word entries:- VINT: the word's last character
- VINT: a delta pointer to the entry for the same word without the last character. Precisely, it's the difference of this entry's start and the prefix's entry start. 0 for single-character entries
- (Optional, for hash-colliding entries only)
- BYTE: COLLISION+LENGTH info (see
hashTable
) for the previous entry with the same hash - VINT: (delta) pointer to the previous entry
- BYTE: COLLISION+LENGTH info (see
- (Optional, for non-leaf entries only) VINT+: word form data, returned from
lookupWord(char[], int, int)
, preceded by its length
-
-
Method Detail
-
lookupWord
IntsRef lookupWord(char[] word, int offset, int length)
-
hasCollision
private static boolean hasCollision(int mask)
-
processAllWords
void processAllWords(int minLength, int maxLength, java.util.function.BiConsumer<CharsRef,IntsRef> processor)
Calls the processor for every dictionary entry with length between minLength and maxLength, both ends inclusive. Note that the callback arguments (word and forms) are reused, so they can be modified in any way, but may not be saved for later by the processor
-
hasLength
private boolean hasLength(int mask, int length)
-
hasLengthInRange
private static boolean hasLengthInRange(int mask, int minLength, int maxLength)
-
isSameString
private boolean isSameString(char[] word, int offset, int length, int dataPos, ByteArrayDataInput in)
-
readForms
private void readForms(IntsRef forms, ByteArrayDataInput in, int length)
-
-