Class SloppyPhraseMatcher


  • public final class SloppyPhraseMatcher
    extends PhraseMatcher
    Find all slop-valid position-combinations (matches) encountered while traversing/hopping the PhrasePositions.
    The sloppy frequency contribution of a match depends on the distance:
    - highest freq for distance=0 (exact match).
    - freq gets lower as distance gets higher.
    Example: for query "a b"~2, a document "x a b a y" can be matched twice: once for "a b" (distance=0), and once for "b a" (distance=2).
    Possibly not all valid combinations are encountered, because for efficiency we always propagate the least PhrasePosition. This allows to base on PriorityQueue and move forward faster. As result, for example, document "a b c b a" would score differently for queries "a b c"~4 and "c b a"~4, although they really are equivalent. Similarly, for doc "a b c b a f g", query "c b"~2 would get same score as "g f"~2, although "c b"~2 could be matched twice. We may want to fix this in the future (currently not, for performance reasons).
    • Field Detail

      • slop

        private final int slop
      • numPostings

        private final int numPostings
      • captureLeadMatch

        private final boolean captureLeadMatch
      • impactsApproximation

        private final ImpactsDISI impactsApproximation
      • end

        private int end
      • leadPosition

        private int leadPosition
      • leadOffset

        private int leadOffset
      • leadEndOffset

        private int leadEndOffset
      • leadOrd

        private int leadOrd
      • hasRpts

        private boolean hasRpts
      • checkedRpts

        private boolean checkedRpts
      • hasMultiTermRpts

        private boolean hasMultiTermRpts
      • positioned

        private boolean positioned
      • matchLength

        private int matchLength
    • Method Detail

      • maxFreq

        float maxFreq()
               throws java.io.IOException
        Description copied from class: PhraseMatcher
        An upper bound on the number of possible matches on this document
        Specified by:
        maxFreq in class PhraseMatcher
        Throws:
        java.io.IOException
      • sloppyWeight

        float sloppyWeight()
        Description copied from class: PhraseMatcher
        The slop-adjusted weight of the current match

        The sum of the slop-adjusted weights is used as the freq for scoring

        Specified by:
        sloppyWeight in class PhraseMatcher
      • nextMatch

        public boolean nextMatch()
                          throws java.io.IOException
        Description copied from class: PhraseMatcher
        Find the next match on the current document, returning false if there are none.
        Specified by:
        nextMatch in class PhraseMatcher
        Throws:
        java.io.IOException
      • captureLead

        private void captureLead​(PhrasePositions pp)
                          throws java.io.IOException
        Throws:
        java.io.IOException
      • startOffset

        public int startOffset()
                        throws java.io.IOException
        Description copied from class: PhraseMatcher
        The start offset of the current match
        Specified by:
        startOffset in class PhraseMatcher
        Throws:
        java.io.IOException
      • endOffset

        public int endOffset()
                      throws java.io.IOException
        Description copied from class: PhraseMatcher
        The end offset of the current match
        Specified by:
        endOffset in class PhraseMatcher
        Throws:
        java.io.IOException
      • advancePP

        private boolean advancePP​(PhrasePositions pp)
                           throws java.io.IOException
        advance a PhrasePosition and update 'end', return false if exhausted
        Throws:
        java.io.IOException
      • advanceRpts

        private boolean advanceRpts​(PhrasePositions pp)
                             throws java.io.IOException
        pp was just advanced. If that caused a repeater collision, resolve by advancing the lesser of the two colliding pps. Note that there can only be one collision, as by the initialization there were no collisions before pp was advanced.
        Throws:
        java.io.IOException
      • collide

        private int collide​(PhrasePositions pp)
        index of a pp2 colliding with pp, or -1 if none
      • initPhrasePositions

        private boolean initPhrasePositions()
                                     throws java.io.IOException
        Initialize PhrasePositions in place. A one time initialization for this scorer (on first doc matching all terms):
        • Check if there are repetitions
        • If there are, find groups of repetitions.
        Examples:
        1. no repetitions: "ho my"~2
        2. repetitions: "ho my my"~2
        3. repetitions: "my ho my"~2
        Returns:
        false if PPs are exhausted (and so current doc will not be a match)
        Throws:
        java.io.IOException
      • initSimple

        private void initSimple()
                         throws java.io.IOException
        no repeats: simplest case, and most common. It is important to keep this piece of the code simple and efficient
        Throws:
        java.io.IOException
      • initComplex

        private boolean initComplex()
                             throws java.io.IOException
        with repeats: not so simple.
        Throws:
        java.io.IOException
      • placeFirstPositions

        private void placeFirstPositions()
                                  throws java.io.IOException
        move all PPs to their first position
        Throws:
        java.io.IOException
      • fillQueue

        private void fillQueue()
        Fill the queue (all pps are already placed
      • advanceRepeatGroups

        private boolean advanceRepeatGroups()
                                     throws java.io.IOException
        At initialization (each doc), each repetition group is sorted by (query) offset. This provides the start condition: no collisions.

        Case 1: no multi-term repeats
        It is sufficient to advance each pp in the group by one less than its group index. So lesser pp is not advanced, 2nd one advance once, 3rd one advanced twice, etc.

        Case 2: multi-term repeats

        Returns:
        false if PPs are exhausted.
        Throws:
        java.io.IOException
      • initFirstTime

        private boolean initFirstTime()
                               throws java.io.IOException
        initialize with checking for repeats. Heavy work, but done only for the first candidate doc.

        If there are repetitions, check if multi-term postings (MTP) are involved.

        Without MTP, once PPs are placed in the first candidate doc, repeats (and groups) are visible.
        With MTP, a more complex check is needed, up-front, as there may be "hidden collisions".
        For example P1 has {A,B}, P1 has {B,C}, and the first doc is: "A C B". At start, P1 would point to "A", p2 to "C", and it will not be identified that P1 and P2 are repetitions of each other.

        The more complex initialization has two parts:
        (1) identification of repetition groups.
        (2) advancing repeat groups at the start of the doc.
        For (1), a possible solution is to just create a single repetition group, made of all repeating pps. But this would slow down the check for collisions, as all pps would need to be checked. Instead, we compute "connected regions" on the bipartite graph of postings and terms.

        Throws:
        java.io.IOException
      • sortRptGroups

        private void sortRptGroups​(java.util.ArrayList<java.util.ArrayList<PhrasePositions>> rgs)
        sort each repetition group by (query) offset. Done only once (at first doc) and allows to initialize faster for each doc.
      • gatherRptGroups

        private java.util.ArrayList<java.util.ArrayList<PhrasePositions>> gatherRptGroups​(java.util.LinkedHashMap<Term,​java.lang.Integer> rptTerms)
                                                                                   throws java.io.IOException
        Detect repetition groups. Done once - for first doc
        Throws:
        java.io.IOException
      • tpPos

        private final int tpPos​(PhrasePositions pp)
        Actual position in doc of a PhrasePosition, relies on that position = tpPos - offset)
      • repeatingTerms

        private java.util.LinkedHashMap<Term,​java.lang.Integer> repeatingTerms()
        find repeating terms and assign them ordinal values
      • repeatingPPs

        private PhrasePositions[] repeatingPPs​(java.util.HashMap<Term,​java.lang.Integer> rptTerms)
        find repeating pps, and for each, if has multi-terms, update this.hasMultiTermRpts
      • ppTermsBitSets

        private java.util.ArrayList<FixedBitSet> ppTermsBitSets​(PhrasePositions[] rpp,
                                                                java.util.HashMap<Term,​java.lang.Integer> tord)
        bit-sets - for each repeating pp, for each of its repeating terms, the term ordinal values is set
      • unionTermGroups

        private void unionTermGroups​(java.util.ArrayList<FixedBitSet> bb)
        union (term group) bit-sets until they are disjoint (O(n^^2)), and each group have different terms
      • termGroups

        private java.util.HashMap<Term,​java.lang.Integer> termGroups​(java.util.LinkedHashMap<Term,​java.lang.Integer> tord,
                                                                           java.util.ArrayList<FixedBitSet> bb)
                                                                    throws java.io.IOException
        map each term to the single group that contains it
        Throws:
        java.io.IOException