Class Dictionary


  • public class Dictionary
    extends java.lang.Object
    In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
    • Field Detail

      • NOFLAGS

        static final char[] NOFLAGS
      • DEFAULT_CHARSET

        static final java.nio.charset.Charset DEFAULT_CHARSET
      • decoder

        java.nio.charset.CharsetDecoder decoder
      • patterns

        java.util.ArrayList<AffixCondition> patterns
        All condition checks used by prefixes and suffixes. These are typically reused across many affix stripping rules, so they are deduplicated to save RAM.
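The deduplication described above can be sketched as a small interning pool: each distinct condition string is stored once and later rules refer to it by index. This is only an illustration. `ConditionPool` and `intern` are hypothetical names, and plain strings stand in for AffixCondition objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pool that stores each distinct condition once and hands out its index. */
final class ConditionPool {
  private final List<String> patterns = new ArrayList<>();
  private final Map<String, Integer> seenPatterns = new HashMap<>();

  /** Returns the index of the condition, adding it to the pool only on first sight. */
  int intern(String condition) {
    Integer existing = seenPatterns.get(condition);
    if (existing != null) {
      return existing;
    }
    patterns.add(condition);
    int index = patterns.size() - 1;
    seenPatterns.put(condition, index);
    return index;
  }

  int size() {
    return patterns.size();
  }

  String get(int index) {
    return patterns.get(index);
  }

  public static void main(String[] args) {
    ConditionPool pool = new ConditionPool();
    int first = pool.intern("[^y]");
    int second = pool.intern(".");
    int repeated = pool.intern("[^y]"); // same condition, same index as the first call
    System.out.println(first + " " + second + " " + repeated + " size=" + pool.size());
    // prints "0 1 0 size=2"
  }
}
```

Rules then only need to carry a small integer rather than their own copy of the condition.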
      • words

        WordStorage words
        The entries in the .dic file, each mapped to its set of flags.
      • flagLookup

        final FlagEnumerator.Lookup flagLookup
        The list of unique flag sets (word forms). Theoretically huge, but small in practice (756 for Polish); otherwise humans would not be able to deal with it either.
      • stripData

        char[] stripData
      • stripOffsets

        int[] stripOffsets
      • wordChars

        java.lang.String wordChars
      • affixData

        char[] affixData
      • currentAffix

        private int currentAffix
      • aliases

        private java.lang.String[] aliases
      • aliasCount

        private int aliasCount
      • morphAliases

        private java.lang.String[] morphAliases
      • morphAliasCount

        private int morphAliasCount
      • morphData

        final java.util.List<java.lang.String> morphData
      • hasCustomMorphData

        boolean hasCustomMorphData
        Set during sorting, so that we know to add an extra int (an index into morphData) to the FST output.
      • ignoreCase

        boolean ignoreCase
      • checkSharpS

        boolean checkSharpS
      • complexPrefixes

        boolean complexPrefixes
      • secondStagePrefixFlags

        private char[] secondStagePrefixFlags
        All flags used in affix continuation classes. If an outer affix's flag isn't here, there's no need to do 2-level affix stripping with it.
      • secondStageSuffixFlags

        private char[] secondStageSuffixFlags
        All flags used in affix continuation classes. If an outer affix's flag isn't here, there's no need to do 2-level affix stripping with it.
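One plausible way to keep such a flag set compact and quick to query is a sorted char[] searched with binary search. The sketch below assumes that representation. `FlagSets` and `contains` are hypothetical names, and `toSortedCharArray` only mirrors the signature listed further down this page, not its actual body.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: a flag set stored as a sorted char[] and queried by binary search. */
final class FlagSets {
  /** Copies the set into a char[] and sorts it so that binary search can be used later. */
  static char[] toSortedCharArray(Set<Character> set) {
    char[] chars = new char[set.size()];
    int i = 0;
    for (Character c : set) {
      chars[i++] = c;
    }
    Arrays.sort(chars);
    return chars;
  }

  /** True if the flag occurs in the sorted array. */
  static boolean contains(char[] sortedFlags, char flag) {
    return Arrays.binarySearch(sortedFlags, flag) >= 0;
  }

  public static void main(String[] args) {
    Set<Character> continuationFlags = new HashSet<>(Arrays.asList('Z', 'A', 'M'));
    char[] secondStage = toSortedCharArray(continuationFlags);
    System.out.println(new String(secondStage)); // prints "AMZ"
    System.out.println(contains(secondStage, 'M')); // prints "true"
    System.out.println(contains(secondStage, 'Q')); // prints "false"
  }
}
```

If a flag is absent from the array, the more expensive two-level stripping can be skipped for that affix.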
      • circumfix

        char circumfix
      • keepcase

        char keepcase
      • forceUCase

        char forceUCase
      • needaffix

        char needaffix
      • forbiddenword

        char forbiddenword
      • onlyincompound

        char onlyincompound
      • compoundBegin

        char compoundBegin
      • compoundMiddle

        char compoundMiddle
      • compoundEnd

        char compoundEnd
      • compoundFlag

        char compoundFlag
      • compoundPermit

        char compoundPermit
      • compoundForbid

        char compoundForbid
      • checkCompoundCase

        boolean checkCompoundCase
      • checkCompoundDup

        boolean checkCompoundDup
      • checkCompoundRep

        boolean checkCompoundRep
      • checkCompoundTriple

        boolean checkCompoundTriple
      • simplifiedTriple

        boolean simplifiedTriple
      • compoundMin

        int compoundMin
      • compoundMax

        int compoundMax
      • ignore

        private char[] ignore
      • tryChars

        java.lang.String tryChars
      • neighborKeyGroups

        java.lang.String[] neighborKeyGroups
      • enableSplitSuggestions

        boolean enableSplitSuggestions
      • repTable

        java.util.List<RepEntry> repTable
      • mapTable

        java.util.List<java.util.List<java.lang.String>> mapTable
      • maxDiff

        int maxDiff
      • maxNGramSuggestions

        int maxNGramSuggestions
      • onlyMaxDiff

        boolean onlyMaxDiff
      • noSuggest

        char noSuggest
      • subStandard

        char subStandard
      • fullStrip

        boolean fullStrip
      • language

        java.lang.String language
      • alternateCasing

        private boolean alternateCasing
      • BOM_UTF8

        private static final byte[] BOM_UTF8
      • CHARSET_ALIASES

        static final java.util.Map<java.lang.String,​java.lang.String> CHARSET_ALIASES
    • Constructor Detail

      • Dictionary

        public Dictionary​(Directory tempDir,
                          java.lang.String tempFileNamePrefix,
                          java.io.InputStream affix,
                          java.io.InputStream dictionary)
                   throws java.io.IOException,
                          java.text.ParseException
        Creates a new Dictionary containing the information read from the provided InputStreams for the hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
        Parameters:
        tempDir - Directory to use for offline sorting
        tempFileNamePrefix - prefix to use to generate temp file names
        affix - InputStream for reading the hunspell affix file (won't be closed).
        dictionary - InputStream for reading the hunspell dictionary file (won't be closed).
        Throws:
        java.io.IOException - Can be thrown while reading from the InputStreams
        java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
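Because the constructor leaves the streams open, the caller owns their lifetime. The sketch below shows that contract using only the JDK. `readBoth` is a hypothetical stand-in for a consumer that reads both streams without closing them, and `TrackedStream` exists only to make the close visible; neither is part of this class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamOwnership {
  /** Hypothetical stand-in for a consumer that reads both streams but never closes them. */
  static int readBoth(InputStream affix, InputStream dictionary) throws IOException {
    return affix.readAllBytes().length + dictionary.readAllBytes().length;
  }

  /** In-memory stream that remembers whether close() was called. */
  static final class TrackedStream extends ByteArrayInputStream {
    boolean closed;

    TrackedStream(String content) {
      super(content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
      closed = true;
      super.close();
    }
  }

  public static void main(String[] args) throws IOException {
    TrackedStream affix = new TrackedStream("SET UTF-8\n");
    TrackedStream dictionary = new TrackedStream("1\nhello\n");
    // try-with-resources closes both streams once the consumer is done with them.
    try (InputStream a = affix; InputStream d = dictionary) {
      System.out.println("bytes read: " + readBoth(a, d));
      System.out.println("closed by consumer: " + (affix.closed || dictionary.closed));
    }
    System.out.println("closed by caller: " + (affix.closed && dictionary.closed));
  }
}
```

With the real constructor, the same try-with-resources shape would wrap the `new Dictionary(...)` call.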
      • Dictionary

        public Dictionary​(Directory tempDir,
                          java.lang.String tempFileNamePrefix,
                          java.io.InputStream affix,
                          java.util.List<java.io.InputStream> dictionaries,
                          boolean ignoreCase)
                   throws java.io.IOException,
                          java.text.ParseException
        Creates a new Dictionary containing the information read from the provided InputStreams for the hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
        Parameters:
        tempDir - Directory to use for offline sorting
        tempFileNamePrefix - prefix to use to generate temp file names
        affix - InputStream for reading the hunspell affix file (won't be closed).
        dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
        Throws:
        java.io.IOException - Can be thrown while reading from the InputStreams
        java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
    • Method Detail

      • formStep

        int formStep()
      • lookupWord

        IntsRef lookupWord​(char[] word,
                           int offset,
                           int length)
        Looks up Hunspell word forms from the dictionary
      • lookupPrefix

        IntsRef lookupPrefix​(char[] word)
      • lookupSuffix

        IntsRef lookupSuffix​(char[] word)
      • readAffixFile

        private void readAffixFile​(java.io.InputStream affixStream,
                                   java.nio.charset.CharsetDecoder decoder,
                                   FlagEnumerator flags)
                            throws java.io.IOException,
                                   java.text.ParseException
        Reads the affix file through the provided InputStream, building up the prefix and suffix maps
        Parameters:
        affixStream - InputStream to read the content of the affix file from
        decoder - CharsetDecoder to decode the content of the file
        Throws:
        java.io.IOException - Can be thrown while reading from the InputStream
        java.text.ParseException
      • checkCriticalDirectiveSame

        private void checkCriticalDirectiveSame​(java.lang.String directive,
                                                java.io.LineNumberReader reader,
                                                java.lang.Object expected,
                                                java.lang.Object actual)
                                         throws java.text.ParseException
        Throws:
        java.text.ParseException
      • parseMapEntry

        private java.util.List<java.lang.String> parseMapEntry​(java.io.LineNumberReader reader,
                                                               java.lang.String line)
                                                        throws java.text.ParseException
        Throws:
        java.text.ParseException
      • hasLanguage

        boolean hasLanguage​(java.lang.String... langCodes)
      • lookupEntries

        public DictEntries lookupEntries​(java.lang.String root)
        Parameters:
        root - a string to look up in the dictionary. No case conversion or affix removal is performed. To get the possible roots of any word, you may call Hunspell.getRoots(String)
        Returns:
        the dictionary entries for the given root, or null if there's none
      • extractLanguageCode

        static java.lang.String extractLanguageCode​(java.lang.String isoCode)
      • parseNum

        private int parseNum​(java.io.LineNumberReader reader,
                             java.lang.String line)
                      throws java.text.ParseException
        Throws:
        java.text.ParseException
      • singleArgument

        private java.lang.String singleArgument​(java.io.LineNumberReader reader,
                                                java.lang.String line)
                                         throws java.text.ParseException
        Throws:
        java.text.ParseException
      • firstArgument

        private java.lang.String firstArgument​(java.io.LineNumberReader reader,
                                               java.lang.String line)
                                        throws java.text.ParseException
        Throws:
        java.text.ParseException
      • splitBySpace

        private java.lang.String[] splitBySpace​(java.io.LineNumberReader reader,
                                                java.lang.String line,
                                                int expectedParts)
                                         throws java.text.ParseException
        Throws:
        java.text.ParseException
      • splitBySpace

        private java.lang.String[] splitBySpace​(java.io.LineNumberReader reader,
                                                java.lang.String line,
                                                int minParts,
                                                int maxParts)
                                         throws java.text.ParseException
        Throws:
        java.text.ParseException
      • parseCompoundRules

        private java.util.List<CompoundRule> parseCompoundRules​(java.io.LineNumberReader reader,
                                                                int num)
                                                         throws java.io.IOException,
                                                                java.text.ParseException
        Throws:
        java.io.IOException
        java.text.ParseException
      • parseBreaks

        private Dictionary.Breaks parseBreaks​(java.io.LineNumberReader reader,
                                              java.lang.String line)
                                       throws java.io.IOException,
                                              java.text.ParseException
        Throws:
        java.io.IOException
        java.text.ParseException
      • affixFST

        private FST<IntsRef> affixFST​(java.util.TreeMap<java.lang.String,​java.util.List<java.lang.Integer>> affixes)
                               throws java.io.IOException
        Throws:
        java.io.IOException
      • parseAffix

        private void parseAffix​(java.util.TreeMap<java.lang.String,​java.util.List<java.lang.Integer>> affixes,
                                java.util.Set<java.lang.Character> secondStageFlags,
                                java.lang.String header,
                                java.io.LineNumberReader reader,
                                AffixKind kind,
                                java.util.Map<java.lang.String,​java.lang.Integer> seenPatterns,
                                java.util.Map<java.lang.String,​java.lang.Integer> seenStrips,
                                FlagEnumerator flags)
                         throws java.io.IOException,
                                java.text.ParseException
        Parses a specific affix rule putting the result into the provided affix map
        Parameters:
        affixes - Map where the result of the parsing will be put
        header - Header line of the affix rule
        reader - LineNumberReader to read the content of the rule from
        seenPatterns - map from condition -> index of patterns, for deduplication.
        Throws:
        java.io.IOException - Can be thrown while reading the rule
        java.text.ParseException
      • affixData

        char affixData​(int affixIndex,
                       int offset)
      • isCrossProduct

        boolean isCrossProduct​(int affix)
      • getAffixCondition

        int getAffixCondition​(int affix)
      • parseConversions

        private ConvTable parseConversions​(java.io.LineNumberReader reader,
                                           int num)
                                    throws java.io.IOException,
                                           java.text.ParseException
        Throws:
        java.io.IOException
        java.text.ParseException
      • readConfig

        private void readConfig​(java.io.InputStream stream,
                                java.nio.charset.Charset streamCharset)
                         throws java.io.IOException,
                                java.text.ParseException
        Parses the encoding and flag format specified in the provided InputStream
        Throws:
        java.io.IOException
        java.text.ParseException
      • maybeConsume

        private static boolean maybeConsume​(java.io.BufferedInputStream stream,
                                            byte[] bytes)
                                     throws java.io.IOException
        Consume the provided byte sequence in full, if present. Otherwise leave the input stream intact.
        Returns:
        true if the sequence matched and has been consumed.
        Throws:
        java.io.IOException
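The "consume if present, otherwise leave intact" behaviour can be obtained with the mark/reset support of BufferedInputStream. The sketch below is one way to do it, written under that assumption and not taken from this class; `PrefixConsumer` is a hypothetical name.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class PrefixConsumer {
  /** Consumes the expected bytes if the stream starts with them; otherwise rewinds to the start. */
  static boolean maybeConsume(BufferedInputStream stream, byte[] expected) throws IOException {
    stream.mark(expected.length); // remember the position so a mismatch can be undone
    for (byte b : expected) {
      int next = stream.read(); // -1 at end of stream, which never equals a byte value
      if (next != (b & 0xff)) {
        stream.reset();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws IOException {
    byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'S', 'E', 'T'};
    BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
    System.out.println(maybeConsume(in, bom)); // prints "true"
    System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII)); // prints "SET"
  }
}
```

A UTF-8 byte order mark is the obvious use: it is skipped when present, and a file without one is read from its first byte as usual.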
      • getDecoder

        private java.nio.charset.CharsetDecoder getDecoder​(java.lang.String encoding)
        Retrieves the CharsetDecoder for the given encoding. Note: this isn't perfect, as encoding names such as ISCII-DEVANAGARI and MICROSOFT-CP1251 are probably allowed as well.
        Parameters:
        encoding - Encoding to retrieve the CharsetDecoder for
        Returns:
        CharsetDecoder for the given encoding
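A minimal sketch of such a lookup, using only java.nio.charset, is shown below. It assumes the decoder should replace undecodable input instead of failing, which the name replacingDecoder suggests but this page does not confirm. The alias table is hypothetical; the real contents of CHARSET_ALIASES are not listed here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.Locale;
import java.util.Map;

final class DecoderLookup {
  // Hypothetical alias table mapping a hunspell-style name to a name the JDK recognises.
  static final Map<String, String> ALIASES = Map.of("microsoft-cp1251", "windows-1251");

  /** Maps a known alias to its JDK name and returns any other name unchanged. */
  static String canonicalName(String encoding) {
    return ALIASES.getOrDefault(encoding.toLowerCase(Locale.ROOT), encoding);
  }

  /** Builds a decoder that substitutes U+FFFD for bytes it cannot decode. */
  static CharsetDecoder decoderFor(String encoding) {
    Charset charset = Charset.forName(canonicalName(encoding)); // throws for unknown names
    return charset
        .newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
  }

  static String decode(String encoding, byte[] bytes) {
    try {
      return decoderFor(encoding).decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
      throw new IllegalStateException(e); // not expected when both actions are REPLACE
    }
  }

  public static void main(String[] args) {
    byte[] bytes = {'a', (byte) 0xFF, 'b'}; // 0xFF is never valid in UTF-8
    System.out.println(decode("UTF-8", bytes).equals("a\uFFFDb")); // prints "true"
  }
}
```

Names that are neither aliased nor known to the JDK still fail in Charset.forName, which is the gap the note above points at.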
      • replacingDecoder

        private static java.nio.charset.CharsetDecoder replacingDecoder​(java.nio.charset.Charset charset)
      • getFlagParsingStrategy

        static Dictionary.FlagParsingStrategy getFlagParsingStrategy​(java.lang.String flagLine,
                                                                     java.nio.charset.Charset charset)
        Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file
        Parameters:
        flagLine - Line containing the flag information
        Returns:
        FlagParsingStrategy that handles parsing flags in the way specified in the FLAG definition
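For orientation, the Hunspell affix format defines three non-default FLAG types: `long` (two characters per flag), `num` (comma-separated decimal numbers) and `UTF-8` (one Unicode character per flag). The sketch below shows how a FLAG line could select a parsing mode. `FlagModes`, `Mode` and `parseFlags` are hypothetical names, and packing a two-character flag into a single char is an assumption made for this illustration.

```java
final class FlagModes {
  enum Mode { SIMPLE, LONG, NUM }

  /** Picks a mode from a full FLAG line such as "FLAG long". */
  static Mode modeFor(String flagLine) {
    String[] parts = flagLine.trim().split("\\s+");
    if (parts.length != 2 || !parts[0].equals("FLAG")) {
      throw new IllegalArgumentException("Illegal FLAG specification: " + flagLine);
    }
    switch (parts[1]) {
      case "long":
        return Mode.LONG;
      case "num":
        return Mode.NUM;
      case "UTF-8":
        return Mode.SIMPLE;
      default:
        throw new IllegalArgumentException("Unknown flag type: " + parts[1]);
    }
  }

  /** Splits a raw flag field into individual flags according to the mode. */
  static char[] parseFlags(Mode mode, String raw) {
    switch (mode) {
      case LONG: {
        if (raw.length() % 2 != 0) {
          throw new IllegalArgumentException("Odd number of characters in long flags: " + raw);
        }
        char[] flags = new char[raw.length() / 2];
        for (int i = 0; i < flags.length; i++) {
          // Assumption: pack the two 8-bit characters of one flag into a single char.
          flags[i] = (char) ((raw.charAt(2 * i) << 8) | (raw.charAt(2 * i + 1) & 0xff));
        }
        return flags;
      }
      case NUM: {
        String[] numbers = raw.split(",");
        char[] flags = new char[numbers.length];
        for (int i = 0; i < numbers.length; i++) {
          flags[i] = (char) Integer.parseInt(numbers[i].trim());
        }
        return flags;
      }
      default:
        return raw.toCharArray(); // one character per flag
    }
  }

  public static void main(String[] args) {
    Mode mode = modeFor("FLAG num");
    char[] flags = parseFlags(mode, "14,201");
    System.out.println(mode + " " + (int) flags[0] + " " + (int) flags[1]); // prints "NUM 14 201"
  }
}
```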
      • unescapeEntry

        private java.lang.String unescapeEntry​(java.lang.String entry)
      • shouldSkipEscapedChar

        private static boolean shouldSkipEscapedChar​(char ch)
      • morphBoundary

        private static int morphBoundary​(java.lang.String line)
      • indexOfSpaceOrTab

        static int indexOfSpaceOrTab​(java.lang.String text,
                                     int start)
      • mergeDictionaries

        private int mergeDictionaries​(java.util.List<java.io.InputStream> dictionaries,
                                      java.nio.charset.CharsetDecoder decoder,
                                      IndexOutput output)
                               throws java.io.IOException
        Throws:
        java.io.IOException
      • writeNormalizedWordEntry

        private int writeNormalizedWordEntry​(java.lang.StringBuilder reuse,
                                             OfflineSorter.ByteSequencesWriter writer,
                                             java.lang.String line)
                                      throws java.io.IOException
        Returns:
        the number of word entries written
        Throws:
        java.io.IOException
      • addHiddenCapitalizedWord

        private void addHiddenCapitalizedWord​(java.lang.StringBuilder reuse,
                                              OfflineSorter.ByteSequencesWriter writer,
                                              java.lang.String word,
                                              java.lang.String afterSep)
                                       throws java.io.IOException
        Throws:
        java.io.IOException
      • toLowerCase

        java.lang.String toLowerCase​(java.lang.String word)
      • toTitleCase

        java.lang.String toTitleCase​(java.lang.String word)
      • sortWordsOffline

        private java.lang.String sortWordsOffline​(Directory tempDir,
                                                  java.lang.String tempFileNamePrefix,
                                                  IndexOutput unsorted)
                                           throws java.io.IOException
        Throws:
        java.io.IOException
      • readSortedDictionaries

        private WordStorage readSortedDictionaries​(Directory tempDir,
                                                   java.lang.String sorted,
                                                   FlagEnumerator flags,
                                                   int wordCount)
                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • readMorphFields

        private java.util.List<java.lang.String> readMorphFields​(java.lang.String word,
                                                                 java.lang.String unparsed)
      • addMorphFields

        private int addMorphFields​(java.util.Map<java.lang.String,​java.lang.Integer> indices,
                                   java.lang.String morphFields)
      • addPhoneticRepEntries

        private void addPhoneticRepEntries​(java.lang.String word,
                                           java.lang.String ph)
      • isDotICaseChangeDisallowed

        boolean isDotICaseChangeDisallowed​(char[] word)
      • parseAlias

        private void parseAlias​(java.lang.String line)
      • getAliasValue

        private java.lang.String getAliasValue​(int id)
      • parseMorphAlias

        private void parseMorphAlias​(java.lang.String line)
      • splitMorphData

        private java.util.List<java.lang.String> splitMorphData​(java.lang.String morphData)
      • hasFlag

        boolean hasFlag​(IntsRef forms,
                        char flag)
      • hasFlag

        boolean hasFlag​(int entryId,
                        char flag)
      • mayNeedInputCleaning

        boolean mayNeedInputCleaning()
      • needsInputCleaning

        boolean needsInputCleaning​(java.lang.CharSequence input)
      • cleanInput

        java.lang.CharSequence cleanInput​(java.lang.CharSequence input,
                                          java.lang.StringBuilder reuse)
      • toSortedCharArray

        static char[] toSortedCharArray​(java.util.Set<java.lang.Character> set)
      • isSecondStagePrefix

        boolean isSecondStagePrefix​(char flag)
      • isSecondStageSuffix

        boolean isSecondStageSuffix​(char flag)
      • caseFold

        char caseFold​(char c)
        Folds a single character (according to LANG, if present).
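Language-dependent folding matters mostly for the Turkic dotted and dotless I. The sketch below assumes that this is what "according to LANG" refers to, which this page does not spell out. `CaseFolding` and `usesTurkicCasing` are hypothetical names, and the choice of `tr` and `az` language codes is likewise an assumption.

```java
final class CaseFolding {
  /** Assumption for this sketch: Turkish and Azerbaijani use dotted/dotless I casing. */
  static boolean usesTurkicCasing(String lang) {
    if (lang == null) {
      return false;
    }
    int separator = lang.indexOf('_');
    String code = separator < 0 ? lang : lang.substring(0, separator);
    return code.equals("tr") || code.equals("az");
  }

  /** Lower-cases one character, mapping I and dotted capital I the Turkic way when requested. */
  static char caseFold(char c, boolean turkicCasing) {
    if (turkicCasing) {
      if (c == 'I') {
        return '\u0131'; // LATIN SMALL LETTER DOTLESS I
      }
      if (c == '\u0130') { // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 'i';
      }
    }
    return Character.toLowerCase(c);
  }

  public static void main(String[] args) {
    System.out.println(caseFold('I', usesTurkicCasing("en_US")) == 'i'); // prints "true"
    System.out.println(caseFold('I', usesTurkicCasing("tr_TR")) == '\u0131'); // prints "true"
  }
}
```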
      • getIgnoreCase

        public boolean getIgnoreCase()
        Returns true if this dictionary was constructed with the ignoreCase option
      • getDefaultTempDir

        static java.nio.file.Path getDefaultTempDir()
                                             throws java.io.IOException
        Returns the default temporary directory pointed to by java.io.tmpdir. If not accessible or not available, an IOException is thrown.
        Throws:
        java.io.IOException
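The contract above can be sketched with java.nio.file alone. The version below reads the property, then treats a missing, non-directory or non-writable location as "not accessible". Those checks are an assumption about what accessibility means here, and `TempDirs` and `resolve` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TempDirs {
  /** Turns a java.io.tmpdir value into a usable directory or fails with an IOException. */
  static Path resolve(String tmpDirProperty) throws IOException {
    if (tmpDirProperty == null) {
      throw new IOException("java.io.tmpdir is not set");
    }
    Path dir;
    try {
      dir = Paths.get(tmpDirProperty);
    } catch (InvalidPathException e) {
      throw new IOException("java.io.tmpdir is not a valid path: " + tmpDirProperty, e);
    }
    if (!Files.isDirectory(dir) || !Files.isWritable(dir)) {
      throw new IOException("java.io.tmpdir is not an accessible directory: " + dir);
    }
    return dir;
  }

  static Path defaultTempDir() throws IOException {
    return resolve(System.getProperty("java.io.tmpdir"));
  }

  public static void main(String[] args) throws IOException {
    System.out.println(defaultTempDir());
  }
}
```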