Class Stemmer


  • final class Stemmer
    extends java.lang.Object
    Stemmer uses the affix rules declared in the Dictionary to generate one or more stems for a word. It follows the original Hunspell algorithm, including recursive suffix stripping.
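    To illustrate what recursive suffix stripping means, here is a minimal toy sketch. It is not the Lucene implementation: the class SuffixStrippingSketch, its ROOTS and SUFFIXES tables, the flag letters, and the depth limit of 2 are all assumptions made up for this example. The idea it shows is that a suffix is removed, the remainder is looked up in the dictionary, and a second suffix may be removed only if its rule declares the flag of the suffix stripped before it as a permitted continuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of recursive suffix stripping; not the Lucene implementation. */
final class SuffixStrippingSketch {

    /** Hypothetical suffix rule: its flag, the text it appends, and the flags allowed on top of it. */
    static final class Suffix {
        final char flag;
        final String text;
        final String continuationFlags;

        Suffix(char flag, String text, String continuationFlags) {
            this.flag = flag;
            this.text = text;
            this.continuationFlags = continuationFlags;
        }
    }

    // Hypothetical dictionary: root -> flags of the suffix rules that may attach directly to it.
    static final Map<String, String> ROOTS = Map.of("walk", "A");

    // Hypothetical affix rules.
    static final List<Suffix> SUFFIXES = List.of(
            new Suffix('A', "ing", "B"), // "ing" may itself be followed by rule B
            new Suffix('B', "s", ""));   // "s" allows nothing after it

    // Assumed recursion limit for this sketch.
    static final int MAX_DEPTH = 2;

    static List<String> stems(String word) {
        List<String> out = new ArrayList<>();
        strip(word, '\0', 0, out);
        return out;
    }

    // requiredFlag is the flag of the suffix removed in the previous step ('\0' if none yet).
    private static void strip(String word, char requiredFlag, int depth, List<String> out) {
        String rootFlags = ROOTS.get(word);
        if (rootFlags != null && (requiredFlag == '\0' || rootFlags.indexOf(requiredFlag) >= 0)) {
            out.add(word); // the remainder is a root that accepts the removed suffix
        }
        if (depth >= MAX_DEPTH) {
            return;
        }
        for (Suffix s : SUFFIXES) {
            boolean fits = word.length() > s.text.length() && word.endsWith(s.text);
            // An inner suffix must list the previously removed (outer) suffix as a continuation.
            boolean allowed = requiredFlag == '\0' || s.continuationFlags.indexOf(requiredFlag) >= 0;
            if (fits && allowed) {
                strip(word.substring(0, word.length() - s.text.length()), s.flag, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(stems("walkings")); // prints [walk]: "s" then "ing" are removed
        System.out.println(stems("walks"));    // prints []: "walk" does not carry flag B
    }
}
```

    In this toy model "walks" yields no stem because the root carries only flag A; the second stripping step is what lets "walkings" resolve to "walk".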
    • Constructor Summary

      Constructors 
      Constructor Description
      Stemmer​(Dictionary dictionary)
      Constructs a new Stemmer which will use the provided Dictionary to create its stems.
    • Method Summary

      Modifier and Type Method Description
      private boolean applyAffix​(char[] strippedWord, int offset, int length, WordContext context, int affix, int previousAffix, int prefixId, int recursionDepth, boolean prefix, Stemmer.RootProcessor processor)
      Applies the affix rule to the given word, producing a list of stems if any are found
      private boolean callProcessor​(char[] word, int offset, int length, Stemmer.RootProcessor processor, IntsRef forms, int i)  
      private static char[] capitalizeAfterApostrophe​(char[] word, int length)  
      private char[] caseFoldLower​(char[] word, int length)
      Folds the lowercase variant of the (title-cased) word into lowerBuffer
      private char[] caseFoldTitle​(char[] word, int length)
      Folds the title-case variant of the word into titleBuffer
      (package private) WordCase caseOf​(char[] word, int length)
      Returns the EXACT_CASE, TITLE_CASE, or UPPER_CASE type for the word
      (package private) boolean doStem​(char[] word, int offset, int length, WordContext context, Stemmer.RootProcessor processor)  
      private boolean isAffixCompatible​(int affix, char prevFlag, int recursionDepth, boolean isPrefix, boolean previousWasPrefix, WordContext context)  
      private boolean isFlagAppendedByAffix​(int affixId, char flag)  
      private boolean isRootCompatibleWithContext​(WordContext context, int lastAffix, int entryId)  
      private boolean needsAnotherAffix​(int affix, int previousAffix, boolean isSuffix, int prefixId)  
      private CharsRef newStem​(CharsRef stem, int morphDataId)  
      java.util.List<CharsRef> stem​(char[] word, int length)
      Find the stem(s) of the provided word
      private boolean stem​(char[] word, int offset, int length, WordContext context, int previous, char prevFlag, int prefixId, int recursionDepth, boolean doPrefix, boolean previousWasPrefix, Stemmer.RootProcessor processor)
      Generates a list of stems for the provided word
      java.util.List<CharsRef> stem​(java.lang.String word)
      Find the stem(s) of the provided word.
      private java.lang.String stemException​(int morphDataId)  
      private char[] stripAffix​(char[] word, int offset, int length, int affixLen, int affix, boolean isPrefix)  
      java.util.List<CharsRef> uniqueStems​(char[] word, int length)
      Find the unique stem(s) of the provided word
      (package private) boolean varyCase​(char[] word, int length, WordCase wordCase, Stemmer.CaseVariationProcessor processor)  
      private boolean varySharpS​(char[] word, int length, Stemmer.CaseVariationProcessor processor)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • formStep

        private final int formStep
    • Constructor Detail

      • Stemmer

        public Stemmer​(Dictionary dictionary)
        Constructs a new Stemmer which will use the provided Dictionary to create its stems.
        Parameters:
        dictionary - Dictionary that will be used to create the stems
    • Method Detail

      • stem

        public java.util.List<CharsRef> stem​(java.lang.String word)
        Find the stem(s) of the provided word.
        Parameters:
        word - Word to find the stems for
        Returns:
        List of stems for the word
      • stem

        public java.util.List<CharsRef> stem​(char[] word,
                                             int length)
        Find the stem(s) of the provided word
        Parameters:
        word - Word to find the stems for
        length - number of characters of word to use
        Returns:
        List of stems for the word
      • caseOf

        WordCase caseOf​(char[] word,
                        int length)
        Returns the EXACT_CASE, TITLE_CASE, or UPPER_CASE type for the word
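        One plausible way to arrive at these three categories is sketched below. The classification rules are an assumption for illustration (the exact rules used by caseOf are not documented here), and CaseKind is a hypothetical stand-in for WordCase.

```java
/** Toy classifier for word case; the rules are assumed, not taken from Lucene. */
final class WordCaseSketch {

    /** Hypothetical stand-in for WordCase. */
    enum CaseKind { EXACT_CASE, TITLE_CASE, UPPER_CASE }

    static CaseKind caseOf(char[] word, int length) {
        // Words that do not start with an uppercase letter are taken as-is.
        if (length == 0 || !Character.isUpperCase(word[0])) {
            return CaseKind.EXACT_CASE;
        }
        boolean seenUpper = false;
        boolean seenLower = false;
        for (int i = 1; i < length; i++) {
            seenUpper |= Character.isUpperCase(word[i]);
            seenLower |= Character.isLowerCase(word[i]);
        }
        if (!seenLower) {
            return CaseKind.UPPER_CASE; // no lowercase after the first letter, e.g. "HELLO" or "A"
        }
        if (!seenUpper) {
            return CaseKind.TITLE_CASE; // only the first letter is uppercase, e.g. "Hello"
        }
        return CaseKind.EXACT_CASE;     // mixed case, e.g. "McDonald"
    }

    public static void main(String[] args) {
        for (String w : new String[] {"HELLO", "Hello", "hello", "McDonald"}) {
            System.out.println(w + " -> " + caseOf(w.toCharArray(), w.length()));
        }
    }
}
```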
      • caseFoldTitle

        private char[] caseFoldTitle​(char[] word,
                                     int length)
        Folds the title-case variant of the word into titleBuffer
      • caseFoldLower

        private char[] caseFoldLower​(char[] word,
                                     int length)
        Folds the lowercase variant of the (title-cased) word into lowerBuffer
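        The two folding steps can be pictured roughly as follows. This is a sketch under assumptions: it allocates fresh arrays instead of reusing titleBuffer and lowerBuffer, and it uses Character.toLowerCase, whereas the real class may apply dictionary-specific case folding. The names foldTitle and foldLower are hypothetical.

```java
import java.util.Arrays;

/** Toy version of the title-case and lowercase folding steps. */
final class CaseFoldSketch {

    // Keeps the first character and lowercases the rest: "HELLO" becomes "Hello".
    static char[] foldTitle(char[] word, int length) {
        char[] title = Arrays.copyOf(word, length);
        for (int i = 1; i < length; i++) {
            title[i] = Character.toLowerCase(title[i]);
        }
        return title;
    }

    // Starts from the title-cased form and lowercases its first character: "Hello" becomes "hello".
    static char[] foldLower(char[] titleCased, int length) {
        char[] lower = Arrays.copyOf(titleCased, length);
        if (length > 0) {
            lower[0] = Character.toLowerCase(lower[0]);
        }
        return lower;
    }

    public static void main(String[] args) {
        char[] title = foldTitle("HELLO".toCharArray(), 5);
        char[] lower = foldLower(title, title.length);
        System.out.println(new String(title) + " " + new String(lower)); // prints Hello hello
    }
}
```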
      • capitalizeAfterApostrophe

        private static char[] capitalizeAfterApostrophe​(char[] word,
                                                        int length)
      • uniqueStems

        public java.util.List<CharsRef> uniqueStems​(char[] word,
                                                    int length)
        Find the unique stem(s) of the provided word
        Parameters:
        word - Word to find the stems for
        length - number of characters of word to use
        Returns:
        List of stems for the word
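        Conceptually, the difference from stem is that duplicate stems are dropped. A minimal sketch of that idea, assuming plain string equality and first-occurrence order (how the real method compares stems, for instance with respect to case, is not stated here; the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Toy de-duplication of a stem list, keeping the first occurrence of each stem. */
final class UniqueStemsSketch {

    static List<String> unique(List<String> stems) {
        // LinkedHashSet drops repeats while preserving insertion order.
        return new ArrayList<>(new LinkedHashSet<>(stems));
    }

    public static void main(String[] args) {
        System.out.println(unique(List.of("walk", "walk", "wake"))); // prints [walk, wake]
    }
}
```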
      • stemException

        private java.lang.String stemException​(int morphDataId)
      • stem

        private boolean stem​(char[] word,
                             int offset,
                             int length,
                             WordContext context,
                             int previous,
                             char prevFlag,
                             int prefixId,
                             int recursionDepth,
                             boolean doPrefix,
                             boolean previousWasPrefix,
                             Stemmer.RootProcessor processor)
        Generates a list of stems for the provided word
        Parameters:
        word - Word to generate the stems for
        previous - previous affix that was removed (so we don't remove the same one twice)
        prevFlag - Flag from a previous stemming step that needs to be cross-checked with any affixes in this recursive step
        prefixId - ID of the innermost removed prefix, so that when removing a suffix, it's also checked against the word
        recursionDepth - current recursion depth
        doPrefix - true if we should remove prefixes
        previousWasPrefix - true if the previous removal was a prefix. If we are removing a suffix and it has no continuation requirements, that is fine, but two prefixes (COMPLEXPREFIXES) or two suffixes must have continuation requirements to recurse.
        Returns:
        whether the processing should be continued
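        The boolean return value follows a common callback pattern: each candidate root is handed to a processor, and a false result stops the search early. The sketch below illustrates only that pattern. RootVisitor is a hypothetical stand-in for Stemmer.RootProcessor, whose actual signature is not shown on this page, and visitAll and firstN are made-up helpers.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of a processor callback that can stop the traversal. */
final class ProcessorSketch {

    /** Hypothetical stand-in for Stemmer.RootProcessor. */
    interface RootVisitor {
        /** Returns true to keep going, false to stop. */
        boolean visit(String root);
    }

    // Returns false as soon as the visitor asks to stop, true if every candidate was visited.
    static boolean visitAll(List<String> candidates, RootVisitor visitor) {
        for (String candidate : candidates) {
            if (!visitor.visit(candidate)) {
                return false;
            }
        }
        return true;
    }

    // Collects at most n roots (n >= 1) by returning false once enough have been seen.
    static List<String> firstN(List<String> candidates, int n) {
        List<String> out = new ArrayList<>();
        visitAll(candidates, root -> {
            out.add(root);
            return out.size() < n;
        });
        return out;
    }

    public static void main(String[] args) {
        System.out.println(firstN(List.of("a", "b", "c"), 2)); // prints [a, b]
    }
}
```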
      • stripAffix

        private char[] stripAffix​(char[] word,
                                  int offset,
                                  int length,
                                  int affixLen,
                                  int affix,
                                  boolean isPrefix)
        Returns:
        null if the affix conditions aren't met; a reference to the same char[] if the affix has no strip data and can thus be simply removed; or a new char[] containing the word after affix removal
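        The three-way return contract can be sketched as follows. This is a simplified illustration that handles suffixes only; the stripSuffix signature, the String strip argument, and the Predicate condition are assumptions for this example and do not mirror the private method's real parameters. Here, strip stands for the characters the affix rule removed from the root, which must be restored after the suffix is cut off.

```java
import java.util.function.Predicate;

/** Toy version of the null / same array / new array return contract, for suffixes only. */
final class StripAffixSketch {

    static char[] stripSuffix(char[] word, int length, int affixLen,
                              String strip, Predicate<String> condition) {
        int stemLen = length - affixLen;
        if (stemLen < 0) {
            return null; // the suffix is longer than the word
        }
        // Built eagerly for simplicity; a real implementation would likely avoid this allocation.
        String candidate = new String(word, 0, stemLen) + strip;
        if (!condition.test(candidate)) {
            return null; // affix condition not met
        }
        if (strip.isEmpty()) {
            return word; // nothing to restore: the caller reads the first stemLen chars
        }
        return candidate.toCharArray(); // strip characters restored in a new array
    }

    public static void main(String[] args) {
        char[] tried = "tried".toCharArray();
        // Remove "ied", restore "y", and require the result to end in "y".
        char[] stem = stripSuffix(tried, tried.length, 3, "y", s -> s.endsWith("y"));
        System.out.println(new String(stem)); // prints try
    }
}
```

        Returning the original array when there is no strip data lets the caller keep working on a sub-range of the input without copying.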
      • isAffixCompatible

        private boolean isAffixCompatible​(int affix,
                                          char prevFlag,
                                          int recursionDepth,
                                          boolean isPrefix,
                                          boolean previousWasPrefix,
                                          WordContext context)
      • applyAffix

        private boolean applyAffix​(char[] strippedWord,
                                   int offset,
                                   int length,
                                   WordContext context,
                                   int affix,
                                   int previousAffix,
                                   int prefixId,
                                   int recursionDepth,
                                   boolean prefix,
                                   Stemmer.RootProcessor processor)
        Applies the affix rule to the given word, producing a list of stems if any are found
        Parameters:
        strippedWord - Char array containing the word with the affix removed and the strip added
        offset - where the word actually starts in the array
        length - the length of the stripped word
        affix - HunspellAffix representing the affix rule itself
        prefixId - when we have already stripped a prefix, we can't simply recurse and check the suffix unless both are compatible, so we must check the dictionary form against both before adding it as a stem
        recursionDepth - current recursion depth
        prefix - true if we are removing a prefix (false if it's a suffix)
        Returns:
        whether the processing should be continued
      • isRootCompatibleWithContext

        private boolean isRootCompatibleWithContext​(WordContext context,
                                                    int lastAffix,
                                                    int entryId)
      • callProcessor

        private boolean callProcessor​(char[] word,
                                      int offset,
                                      int length,
                                      Stemmer.RootProcessor processor,
                                      IntsRef forms,
                                      int i)
      • needsAnotherAffix

        private boolean needsAnotherAffix​(int affix,
                                          int previousAffix,
                                          boolean isSuffix,
                                          int prefixId)
      • isFlagAppendedByAffix

        private boolean isFlagAppendedByAffix​(int affixId,
                                              char flag)