Class DefaultICUTokenizerConfig


  • public class DefaultICUTokenizerConfig
    extends ICUTokenizerConfig
    Default ICUTokenizerConfig that is generally applicable to many languages.

    Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailoring:

    • Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
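The word-boundary behavior described above can be approximated with the JDK alone. The following is a minimal sketch that swaps in java.text.BreakIterator for ICU's com.ibm.icu.text.BreakIterator, so it only illustrates the general UAX#29-style splitting and not the tailoring or dictionaries this class uses. The class name WordBreakSketch and the helper words are hypothetical, and the letter-or-digit filter is an assumption made for the example rather than the rule Lucene applies.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration: JDK BreakIterator standing in for ICU's.
class WordBreakSketch {

    /** Splits text at word boundaries and keeps spans that contain a letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String span = text.substring(start, end);
            // Assumption for this sketch: drop spans that are only spaces or punctuation.
            if (span.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(span);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, wide world 42"));
    }
}
```

The JDK and ICU iterators can disagree on boundaries, particularly for scripts that need dictionaries, so results from this sketch should not be read as what this class would produce.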
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private boolean cjkAsWords  
      private static com.ibm.icu.text.BreakIterator cjkBreakIterator  
      private static com.ibm.icu.text.RuleBasedBreakIterator defaultBreakIterator  
      private boolean myanmarAsWords  
      private static com.ibm.icu.text.RuleBasedBreakIterator myanmarSyllableIterator  
      static java.lang.String WORD_EMOJI
      Token type for words that appear to be emoji sequences
      static java.lang.String WORD_HANGUL
      Token type for words containing Korean hangul
      static java.lang.String WORD_HIRAGANA
      Token type for words containing Japanese hiragana
      static java.lang.String WORD_IDEO
      Token type for words containing ideographic characters
      static java.lang.String WORD_KATAKANA
      Token type for words containing Japanese katakana
      static java.lang.String WORD_LETTER
      Token type for words that contain letters
      static java.lang.String WORD_NUMBER
      Token type for words that appear to be numbers
    • Constructor Summary

      Constructors 
      Constructor Description
      DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
      Creates a new config.
    • Method Summary

      Modifier and Type Method Description
      boolean combineCJ()
      true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
      com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
      Return a break iterator capable of processing a given script.
      java.lang.String getType(int script, int ruleStatus)
      Return a token type value for a given script and BreakIterator rule status.
      private static com.ibm.icu.text.RuleBasedBreakIterator readBreakIterator(java.lang.String filename)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • WORD_IDEO

        public static final java.lang.String WORD_IDEO
        Token type for words containing ideographic characters
      • WORD_HIRAGANA

        public static final java.lang.String WORD_HIRAGANA
        Token type for words containing Japanese hiragana
      • WORD_KATAKANA

        public static final java.lang.String WORD_KATAKANA
        Token type for words containing Japanese katakana
      • WORD_HANGUL

        public static final java.lang.String WORD_HANGUL
        Token type for words containing Korean hangul
      • WORD_LETTER

        public static final java.lang.String WORD_LETTER
        Token type for words that contain letters
      • WORD_NUMBER

        public static final java.lang.String WORD_NUMBER
        Token type for words that appear to be numbers
      • WORD_EMOJI

        public static final java.lang.String WORD_EMOJI
        Token type for words that appear to be emoji sequences
      • cjkBreakIterator

        private static final com.ibm.icu.text.BreakIterator cjkBreakIterator
      • defaultBreakIterator

        private static final com.ibm.icu.text.RuleBasedBreakIterator defaultBreakIterator
      • myanmarSyllableIterator

        private static final com.ibm.icu.text.RuleBasedBreakIterator myanmarSyllableIterator
      • cjkAsWords

        private final boolean cjkAsWords
      • myanmarAsWords

        private final boolean myanmarAsWords
    • Constructor Detail

      • DefaultICUTokenizerConfig

        public DefaultICUTokenizerConfig(boolean cjkAsWords,
                                         boolean myanmarAsWords)
        Creates a new config. The object itself is lightweight, but the break iterators are initialized the first time the class is referenced.
        Parameters:
        cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise, text is segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words are tagged as IDEOGRAPHIC.
        myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise, it is tokenized as syllables.
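The effect of the two flags can be summarized as a small decision table. The sketch below restates only what the parameter descriptions and the class description say. It is not the library's code, and the names SegmentationChoiceSketch, Strategy, ScriptGroup, and choose are hypothetical.

```java
// Hypothetical model of the documented flag semantics; not Lucene's implementation.
class SegmentationChoiceSketch {

    enum Strategy { DICTIONARY_WORDS, UAX29_DEFAULT, SYLLABLES }

    enum ScriptGroup { CJK, MYANMAR, THAI_LAO_KHMER, OTHER }

    /** Picks the segmentation strategy the documentation describes for a script group. */
    static Strategy choose(ScriptGroup group, boolean cjkAsWords, boolean myanmarAsWords) {
        switch (group) {
            case CJK:
                // cjkAsWords: dictionary segmentation, otherwise UAX#29 defaults.
                return cjkAsWords ? Strategy.DICTIONARY_WORDS : Strategy.UAX29_DEFAULT;
            case MYANMAR:
                // myanmarAsWords: dictionary segmentation, otherwise syllables.
                return myanmarAsWords ? Strategy.DICTIONARY_WORDS : Strategy.SYLLABLES;
            case THAI_LAO_KHMER:
                // The class description lists these as dictionary-segmented.
                return Strategy.DICTIONARY_WORDS;
            default:
                return Strategy.UAX29_DEFAULT;
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(ScriptGroup.CJK, false, true));
        System.out.println(choose(ScriptGroup.MYANMAR, true, false));
    }
}
```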
    • Method Detail

      • getBreakIterator

        public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
        Description copied from class: ICUTokenizerConfig
        Return a break iterator capable of processing a given script.
        Specified by:
        getBreakIterator in class ICUTokenizerConfig
      • getType

        public java.lang.String getType(int script,
                                        int ruleStatus)
        Description copied from class: ICUTokenizerConfig
        Return a token type value for a given script and BreakIterator rule status.
        Specified by:
        getType in class ICUTokenizerConfig
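To make the idea of deriving a token type from a script plus a rule status concrete, here is a simplified model. It is an assumption about how such a mapping could look, not the body of getType: the RuleStatus enum stands in for the integer rule-status values an ICU break iterator reports, java.lang.Character.UnicodeScript stands in for ICU script codes, and the returned strings are just the names of the WORD_* constants, since their actual values are not given on this page.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical model of a script + rule-status lookup; not Lucene's getType.
class TokenTypeSketch {

    /** Stand-in for the rule-status categories reported by a word break iterator. */
    enum RuleStatus { IDEO, KANA, LETTER, NUMBER, EMOJI, NONE }

    /** Returns a placeholder token type name for the given script and rule status. */
    static String typeFor(UnicodeScript script, RuleStatus status) {
        switch (status) {
            case IDEO:
                return "WORD_IDEO";
            case KANA:
                // Assumption: the script separates hiragana from katakana.
                return script == UnicodeScript.HIRAGANA ? "WORD_HIRAGANA" : "WORD_KATAKANA";
            case LETTER:
                // Assumption: hangul is reported separately from other letters.
                return script == UnicodeScript.HANGUL ? "WORD_HANGUL" : "WORD_LETTER";
            case NUMBER:
                return "WORD_NUMBER";
            case EMOJI:
                return "WORD_EMOJI";
            default:
                return "OTHER"; // placeholder for anything not covered above
        }
    }

    public static void main(String[] args) {
        System.out.println(typeFor(UnicodeScript.HANGUL, RuleStatus.LETTER));
        System.out.println(typeFor(UnicodeScript.LATIN, RuleStatus.NUMBER));
    }
}
```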
      • readBreakIterator

        private static com.ibm.icu.text.RuleBasedBreakIterator readBreakIterator(java.lang.String filename)