Class FSTTermsWriter

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    public class FSTTermsWriter
    extends FieldsConsumer
    FST-based term dict, using metadata as FST output.

    The FST directly holds the mapping between <term, metadata>.

    Term metadata consists of three parts: 1. term statistics: docFreq, totalTermFreq; 2. monotonic long[], e.g. the pointer to the postings list for that term; 3. generic byte[], e.g. other information needed by the postings reader.

    File:

    Term Dictionary

    The .tst file contains a list of FSTs, one for each field. Each FST maps a term to its corresponding statistics (e.g. docFreq) and metadata (e.g. information for the postings list reader, such as the file pointer to the postings list).

    Typically the metadata is separated into two parts:

    • Monotonic long array: Some metadata always ascends in the same order as the corresponding terms. The FST uses this part to share outputs between arcs.
    • Generic byte array: Used to store non-monotonic metadata.
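
The way ascending values let arcs share outputs can be illustrated with a minimal sketch. It assumes an element-wise minimum as the shared part, which is a simplification and not necessarily the rule this writer applies; the class MonotonicOutputSketch and its methods are hypothetical names, not part of Lucene.

```java
import java.util.Arrays;

// Hypothetical illustration only: shows how two long[] outputs can be split
// into a shared part (stored once on a common arc) and per-term remainders.
final class MonotonicOutputSketch {

    // Shared part: element-wise minimum of the two outputs (an assumption).
    static long[] common(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
        return out;
    }

    // Remainder left on the deeper arc once the shared part is pushed up.
    static long[] subtract(long[] full, long[] shared) {
        long[] out = new long[full.length];
        for (int i = 0; i < full.length; i++) {
            out[i] = full[i] - shared[i];
        }
        return out;
    }

    // Reading a term sums the outputs along its path.
    static long[] add(long[] shared, long[] rest) {
        long[] out = new long[shared.length];
        for (int i = 0; i < shared.length; i++) {
            out[i] = shared[i] + rest[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] first = {100, 7};   // e.g. pointers for an earlier term
        long[] second = {140, 9};  // pointers for a later term, all larger
        long[] shared = common(first, second);
        long[] rest = subtract(second, shared);
        System.out.println(Arrays.toString(shared) + " + " + Arrays.toString(rest));
    }
}
```

Because the values ascend with the terms, the earlier term's array is the shared part and the remainder for the later term stays non-negative in every slot.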
    File format:
    • TermsDict(.tst) --> Header, PostingsHeader, FieldSummary, DirOffset
    • FieldSummary --> NumFields, <FieldNumber, NumTerms, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, TermFST>^NumFields
    • TermFST --> FST<TermData>
    • TermData --> Flag, BytesSize?, LongDelta^LongsSize?, Byte^BytesSize?, <DocFreq[Same?], (TotalTermFreq-DocFreq)>?
    • Header --> IndexHeader
    • DirOffset --> Uint64
    • DocFreq, LongsSize, BytesSize, NumFields, FieldNumber, DocCount --> VInt
    • TotalTermFreq, NumTerms, SumTotalTermFreq, SumDocFreq, LongDelta --> VLong
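
The VInt and VLong types above are variable-length integers. The sketch below shows the usual layout (seven payload bits per byte, low-order group first, high bit set when more bytes follow) using only the JDK. VLongSketch is a hypothetical class for illustration; in Lucene the corresponding calls are DataOutput.writeVInt and DataOutput.writeVLong.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-alone illustration of variable-length long encoding.
final class VLongSketch {

    // Encodes a non-negative value, 7 bits per byte, least significant group first.
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not supported: " + value);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // final byte, continuation bit clear
        return out.toByteArray();
    }

    // Decodes one value starting at the first byte of the array.
    static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(300L); // two bytes: 0xAC 0x02
        System.out.println(encoded.length + " bytes, decoded " + decode(encoded));
    }
}
```

Small values such as a typical DocFreq fit in a single byte, which is why the per-term statistics are stored in this form.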

    Notes:

    • The formats of PostingsHeader and the generic meta bytes are customized by the specific postings implementation: they contain arbitrary per-file data (such as parameters or versioning information) and per-term data (non-monotonic data such as pulsed postings data).
    • The format of TermData is determined by the FST. Typically, monotonic metadata is dense around shallow arcs, while deeper arcs hold only generic bytes and term statistics.
    • The Flag byte indicates which parts of the metadata exist on the current arc. In particular, the monotonic part is omitted when it is an array of 0s.
    • Since LongsSize is per-field fixed, it is only written once in field summary.
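
As a rough sketch of how a Flag byte could record which parts are present on an arc: the bit positions, the use of -1 for "no statistics", and the class TermDataFlagSketch are assumptions made for this example, not the documented on-disk layout.

```java
// Hypothetical illustration: derive a presence flag for one arc's metadata.
final class TermDataFlagSketch {
    static final int HAS_LONGS = 1;      // assumed bit: monotonic part is not all zeros
    static final int HAS_BYTES = 1 << 1; // assumed bit: generic bytes present
    static final int HAS_STATS = 1 << 2; // assumed bit: docFreq/totalTermFreq present

    static boolean allZero(long[] longs) {
        for (long v : longs) {
            if (v != 0) {
                return false;
            }
        }
        return true;
    }

    // docFreq == -1 is used here as a marker for "no statistics on this arc".
    static int flagFor(long[] longs, byte[] bytes, int docFreq) {
        int flag = 0;
        if (!allZero(longs)) {
            flag |= HAS_LONGS; // an all-zero monotonic part is omitted
        }
        if (bytes != null && bytes.length > 0) {
            flag |= HAS_BYTES;
        }
        if (docFreq != -1) {
            flag |= HAS_STATS;
        }
        return flag;
    }

    public static void main(String[] args) {
        // A shallow arc that carries only a non-zero monotonic part.
        System.out.println(flagFor(new long[] {64, 0}, null, -1));
    }
}
```

A reader would test the same bits before deciding whether to read the LongDelta values, the bytes, or the statistics for that arc.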
    • Method Detail

      • writeTrailer

        private void writeTrailer​(IndexOutput out,
                                  long dirStart)
                           throws java.io.IOException
        Throws:
        java.io.IOException
      • write

        public void write​(Fields fields,
                          NormsProducer norms)
                   throws java.io.IOException
        Description copied from class: FieldsConsumer
        Write all fields, terms and postings. This is the "pull" API, allowing you to iterate more than once over the postings, somewhat analogous to using a DOM API to traverse an XML tree.

        Notes:

        • You must compute index statistics, including each term's docFreq and totalTermFreq, as well as the summary sumTotalTermFreq, sumDocFreq and docCount.
        • You must skip terms that have no docs and fields that have no terms, even though the provided Fields API will expose them; this typically requires lazily writing the field or term until you've actually seen the first term or document.
        • The provided Fields instance is limited: you cannot call any methods that return statistics/counts, and you cannot pass non-null live docs when pulling docs/positions enums.
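
The first two notes can be sketched without the Lucene API. The example below models one field's postings as plain maps (term to docID to frequency), which is an assumption made for illustration; FieldStatsSketch is a hypothetical class, and a real implementation would iterate the Fields, Terms and postings enums instead.

```java
import java.util.BitSet;
import java.util.Map;

// Hypothetical illustration: compute per-field summary statistics while
// skipping terms that have no docs and fields that have no terms.
final class FieldStatsSketch {
    long numTerms;
    long sumTotalTermFreq;
    long sumDocFreq;
    int docCount;

    // postings: term -> (docID -> frequency of the term in that document).
    // Returns null when no term has any document, so the field is skipped.
    static FieldStatsSketch compute(Map<String, Map<Integer, Integer>> postings) {
        FieldStatsSketch stats = null; // created lazily, on the first real term
        BitSet docsSeen = new BitSet();
        for (Map.Entry<String, Map<Integer, Integer>> term : postings.entrySet()) {
            int docFreq = 0;
            long totalTermFreq = 0;
            for (Map.Entry<Integer, Integer> doc : term.getValue().entrySet()) {
                docFreq++;
                totalTermFreq += doc.getValue();
                docsSeen.set(doc.getKey());
            }
            if (docFreq == 0) {
                continue; // term has no docs: do not write it
            }
            if (stats == null) {
                stats = new FieldStatsSketch();
            }
            stats.numTerms++;
            stats.sumDocFreq += docFreq;
            stats.sumTotalTermFreq += totalTermFreq;
        }
        if (stats != null) {
            stats.docCount = docsSeen.cardinality(); // distinct docs with this field
        }
        return stats;
    }
}
```

Deferring the creation of the statistics object until the first term with a document is one simple way to realize the "lazy" writing the notes describe.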
        Specified by:
        write in class FieldsConsumer
        Throws:
        java.io.IOException
      • close

        public void close()
                   throws java.io.IOException
        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
        Specified by:
        close in class FieldsConsumer
        Throws:
        java.io.IOException