Class ScriptIterator
- java.lang.Object
  - org.apache.lucene.analysis.icu.segmentation.ScriptIterator
-
final class ScriptIterator extends java.lang.Object
An iterator that locates ISO 15924 script boundaries in text. This is not the same as simply looking at the Unicode block, or even the Script property. Some characters are 'common' across multiple scripts, and some 'inherit' the script value of the text surrounding them.
This is similar to ICU's (internal-only) UScriptRun, with the following differences:
- Doesn't attempt to match paired punctuation. For tokenization purposes, this is not necessary. It's also quite expensive.
- Non-spacing marks inherit the script of their base character, following recommendations from UTR #24.
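The run detection described above can be sketched with JDK classes only. This is a minimal illustration, not the Lucene implementation: the class name ScriptRunSketch, the helper describe, and the use of java.lang.Character.UnicodeScript in place of ICU's integer UScript codes are all assumptions made for the example. The method names mirror the package-private methods documented on this page, except getScript, which returns a JDK enum rather than a UScript code.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical stand-in for illustration; not the Lucene class. */
final class ScriptRunSketch {
    private char[] text;
    private int index, limit, scriptStart, scriptLimit;
    private UnicodeScript script;

    /** Assumed semantics: examine text[start, start + length). */
    void setText(char[] text, int start, int length) {
        this.text = text;
        this.index = start;
        this.limit = start + length;
        this.scriptStart = start;
        this.scriptLimit = start;
        this.script = null;
    }

    /** Advances to the next run; returns false once the region is exhausted. */
    boolean next() {
        if (scriptLimit >= limit) return false;
        script = UnicodeScript.COMMON;   // unresolved until a real script is seen
        scriptStart = scriptLimit;
        while (index < limit) {
            int cp = Character.codePointAt(text, index, limit);
            UnicodeScript sc = UnicodeScript.of(cp);
            // Non-spacing marks stay with their base character's run.
            boolean mark = Character.getType(cp) == Character.NON_SPACING_MARK;
            if (!mark && !isSameScript(script, sc)) break;
            // The first non-neutral script decides the script of the whole run.
            if (isNeutral(script) && !isNeutral(sc)) script = sc;
            index += Character.charCount(cp);
        }
        scriptLimit = index;
        return true;
    }

    int getScriptStart() { return scriptStart; }
    int getScriptLimit() { return scriptLimit; }
    UnicodeScript getScript() { return script; }

    private static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    /** Compatible if either side is common/inherited, or both are equal. */
    private static boolean isSameScript(UnicodeScript a, UnicodeScript b) {
        return isNeutral(a) || isNeutral(b) || a == b;
    }

    /** Formats every run of s as SCRIPT[start,limit), separated by spaces. */
    static String describe(String s) {
        ScriptRunSketch it = new ScriptRunSketch();
        it.setText(s.toCharArray(), 0, s.length());
        StringBuilder sb = new StringBuilder();
        while (it.next()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(it.getScript()).append('[').append(it.getScriptStart())
              .append(',').append(it.getScriptLimit()).append(')');
        }
        return sb.toString();
    }
}
```

In this sketch, describe("Hello, \u043c\u0438\u0440!") yields two runs: the comma and space join the preceding Latin run, and the trailing exclamation mark joins the Cyrillic run. No attempt is made to match paired punctuation, in line with the first difference listed above.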
-
Field Summary
Fields (Modifier and Type, Field, Description):
- private static int[] basicLatin: linear fast-path for the Basic Latin case
- private boolean combineCJ
- private int index
- private int limit
- private int scriptCode
- private int scriptLimit
- private int scriptStart
- private int start
- private char[] text
-
Constructor Summary
Constructors (Constructor, Description):
- ScriptIterator(boolean combineCJ)
-
Method Summary
Methods (Modifier and Type, Method, Description):
- private int getScript(int codepoint): Fast version of UScript.getScript().
- (package private) int getScriptCode(): Get the UScript script code for this script run.
- (package private) int getScriptLimit(): Get the index of the first character after the end of this script run.
- (package private) int getScriptStart(): Get the start of this script run.
- private static boolean isSameScript(int scriptOne, int scriptTwo): Determine if two scripts are compatible.
- (package private) boolean next(): Iterates to the next script run, returning true if one exists.
- (package private) void setText(char[] text, int start, int length): Set a new region of text to be examined by this iterator.
-
Field Detail
-
text
private char[] text
-
start
private int start
-
limit
private int limit
-
index
private int index
-
scriptStart
private int scriptStart
-
scriptLimit
private int scriptLimit
-
scriptCode
private int scriptCode
-
combineCJ
private final boolean combineCJ
-
basicLatin
private static final int[] basicLatin
Linear fast-path for the Basic Latin case.
-
Method Detail
-
getScriptStart
int getScriptStart()
Get the start of this script run.
Returns:
- start position of script run
-
getScriptLimit
int getScriptLimit()
Get the index of the first character after the end of this script run.
Returns:
- position of the first character after this script run
-
getScriptCode
int getScriptCode()
Get the UScript script code for this script run.
Returns:
- code for the script of the current run
-
next
boolean next()
Iterates to the next script run, returning true if one exists.
Returns:
- true if there is another script run, false otherwise.
-
isSameScript
private static boolean isSameScript(int scriptOne, int scriptTwo)
Determine if two scripts are compatible.
-
setText
void setText(char[] text, int start, int length)
Set a new region of text to be examined by this iterator.
Parameters:
- text - text buffer to examine
- start - offset into buffer
- length - maximum length to examine
-
getScript
private int getScript(int codepoint)
Fast version of UScript.getScript(). Basic Latin code points are resolved with an array lookup.
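The array-lookup fast path can be illustrated as follows. This is a sketch under stated assumptions rather than the actual field or method: the class name FastScriptLookup is hypothetical, and the table holds JDK Character.UnicodeScript values instead of the int UScript codes that basicLatin stores.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical illustration of a precomputed Basic Latin script table. */
final class FastScriptLookup {
    // One entry per Basic Latin code point (U+0000..U+007F), filled once.
    private static final UnicodeScript[] BASIC_LATIN = new UnicodeScript[128];

    static {
        for (int i = 0; i < BASIC_LATIN.length; i++) {
            BASIC_LATIN[i] = UnicodeScript.of(i);
        }
    }

    /** Array lookup for Basic Latin, general property lookup otherwise. */
    static UnicodeScript getScript(int codepoint) {
        if (0 <= codepoint && codepoint < BASIC_LATIN.length) {
            return BASIC_LATIN[codepoint];
        }
        return UnicodeScript.of(codepoint);
    }
}
```

The table trades 128 references of memory for skipping the general property lookup on the code points that dominate ASCII-heavy text.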