Class ICUTokenizerFactory

  • All Implemented Interfaces:
    ResourceLoaderAware

    public class ICUTokenizerFactory
    extends TokenizerFactory
    implements ResourceLoaderAware
    Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.

    To use the default set of per-script rules:

     <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.ICUTokenizerFactory"/>
       </analyzer>
     </fieldType>

    You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.

    To add per-script rules, supply a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair is a four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):

     <fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
                    rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
       </analyzer>
     </fieldType>
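
    The "rulefiles" format described above can be illustrated with a small standalone sketch. RuleFilesArg and its parse method are hypothetical names introduced here for illustration; they are not part of Lucene or Solr, and this is not the factory's actual parsing code. The sketch assumes each pair is exactly a four-letter script code, a colon, and a non-empty resource path, and it does not validate the code against ISO 15924 or load any rule file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Lucene/Solr): splits a "rulefiles" value
 * into script-code -> resource-path pairs, following the documented format.
 */
final class RuleFilesArg {
    private RuleFilesArg() {}

    /** Parses e.g. "Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi". */
    static Map<String, String> parse(String ruleFiles) {
        // LinkedHashMap keeps the pairs in the order they were written.
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : ruleFiles.split(",")) {
            String trimmed = pair.trim();
            int colon = trimmed.indexOf(':');
            // Assumption of this sketch: the script code is exactly four characters.
            if (colon != 4 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                    "Expected <4-letter script code>:<resource path>, got: " + trimmed);
            }
            result.put(trimmed.substring(0, colon), trimmed.substring(colon + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> rules =
            parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi");
        rules.forEach((script, path) -> System.out.println(script + " -> " + path));
    }
}
```

    Rejecting malformed pairs up front is a design choice of this sketch; it keeps a typo in a script code from silently falling back to the default rules for that script.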
    Since:
    3.1
    • Field Detail

      • tailored

        private final java.util.Map<java.lang.Integer,java.lang.String> tailored
      • cjkAsWords

        private final boolean cjkAsWords
      • myanmarAsWords

        private final boolean myanmarAsWords
    • Constructor Detail

      • ICUTokenizerFactory

        public ICUTokenizerFactory(java.util.Map<java.lang.String,java.lang.String> args)
        Creates a new ICUTokenizerFactory.
      • ICUTokenizerFactory

        public ICUTokenizerFactory()
        Default constructor, provided for compatibility with SPI.
    • Method Detail

      • inform

        public void inform(ResourceLoader loader)
                    throws java.io.IOException
        Description copied from interface: ResourceLoaderAware
        Initializes this component with the provided ResourceLoader (used for loading classes, files, and other resources).
        Specified by:
        inform in interface ResourceLoaderAware
        Throws:
        java.io.IOException
      • parseRules

        private com.ibm.icu.text.BreakIterator parseRules(java.lang.String filename,
                                                          ResourceLoader loader)
                                                   throws java.io.IOException
        Throws:
        java.io.IOException