Class JapaneseIterationMarkCharFilter
- java.lang.Object
-
- java.io.Reader
-
- org.apache.lucene.analysis.CharFilter
-
- org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter
-
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, java.lang.Readable
public class JapaneseIterationMarkCharFilter extends CharFilter
Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form. Sequences of iteration marks are supported. If an illegal sequence of iteration marks is encountered, the implementation emits the illegal source character as-is without considering its script. For example, with input "?ゝ", we get "??" even though the question mark isn't hiragana.
Note that a full stop punctuation character "。" (U+3002) cannot be iterated (see below). Iteration marks themselves can be emitted if they are illegal, i.e. if they go back past the beginning of the character stream.
The implementation buffers input until a full stop punctuation character (U+3002) or EOF is reached in order to not keep a copy of the character stream in memory. Vertical iteration marks, which are even rarer than horizontal iteration marks in contemporary Japanese, are unsupported.
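The expansion rules described above can be illustrated with a small standalone sketch. It is not the Lucene implementation: it works on a whole String rather than a buffered Reader, it ignores voicing (dakuten) for the voiced marks, and the class and method names (IterationMarkSketch, expand, isMark) are hypothetical. It only assumes the behavior stated in the description: a run of n marks copies the n characters before it, marks that would reach back too far are emitted as-is, and a full stop cannot be iterated.

```java
public final class IterationMarkSketch {
    static final char KANJI_MARK = '\u3005';           // 々
    static final char HIRAGANA_MARK = '\u309D';        // ゝ
    static final char HIRAGANA_VOICED_MARK = '\u309E'; // ゞ
    static final char KATAKANA_MARK = '\u30FD';        // ヽ
    static final char KATAKANA_VOICED_MARK = '\u30FE'; // ヾ
    static final char FULL_STOP = '\u3002';            // 。

    static boolean isMark(char c) {
        return c == KANJI_MARK || c == HIRAGANA_MARK || c == HIRAGANA_VOICED_MARK
            || c == KATAKANA_MARK || c == KATAKANA_VOICED_MARK;
    }

    /** Expands iteration marks; voiced marks are treated like unvoiced ones (simplification). */
    static String expand(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int spanSize = 0; // number of marks in the current span
        int spanEnd = 0;  // index just past the current span; a new span may not reach back before it
        for (int pos = 0; pos < text.length(); pos++) {
            char c = text.charAt(pos);
            if (c == FULL_STOP) {
                // A full stop cannot be iterated: block any span from reaching it.
                spanEnd = pos + 1;
                out.append(c);
            } else if (!isMark(c)) {
                out.append(c);
            } else if (pos < spanEnd) {
                // Inside a span: copy the character spanSize positions back.
                out.append(text.charAt(pos - spanSize));
            } else if (pos == spanEnd) {
                // A mark with nothing legal to refer to: emit it unchanged.
                spanEnd++;
                out.append(c);
            } else {
                // Start of a new span: count consecutive marks, then clamp the
                // span so that it does not reach back past the previous limit.
                int n = 0;
                while (pos + n < text.length() && isMark(text.charAt(pos + n))) {
                    n++;
                }
                n = Math.min(n, pos - spanEnd);
                spanSize = n;
                spanEnd = pos + n;
                out.append(text.charAt(pos - spanSize));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("\u99AC\u9E7F\u3005\u3005\u3057\u3044")); // input: 馬鹿々々しい
    }
}
```

In this sketch a sequence such as 馬鹿々々 expands pairwise (the first mark copies 馬, the second copies 鹿), and a mark at the very start of the text is left unchanged because there is nothing before it to copy.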
-
-
Field Summary
Fields (Modifier and Type, Field, Description)
private RollingCharBuffer buffer
private int bufferPosition
private static char FULL_STOP_PUNCTUATION
private static char[] h2d
private static char HIRAGANA_ITERATION_MARK
private static char HIRAGANA_VOICED_ITERATION_MARK
private int iterationMarkSpanEndPosition
private int iterationMarksSpanSize
private static char[] k2d
private static char KANJI_ITERATION_MARK
private static char KATAKANA_ITERATION_MARK
private static char KATAKANA_VOICED_ITERATION_MARK
static boolean NORMALIZE_KANA_DEFAULT
    Normalize kana iteration marks by default
static boolean NORMALIZE_KANJI_DEFAULT
    Normalize kanji iteration marks by default
private boolean normalizeKana
private boolean normalizeKanji
-
Fields inherited from class org.apache.lucene.analysis.CharFilter
input
-
-
Constructor Summary
Constructors (Constructor, Description)
JapaneseIterationMarkCharFilter(java.io.Reader input)
    Constructor.
JapaneseIterationMarkCharFilter(java.io.Reader input, boolean normalizeKanji, boolean normalizeKana)
    Constructor
-
Method Summary
Methods (Modifier and Type, Method, Description)
protected int correct(int currentOff)
    Subclasses override to correct the current offset.
private boolean inside(char c, char[] map, char offset)
    Predicate indicating if the lookup character is within dakuten map range
private boolean isHiraganaDakuten(char c)
    Hiragana dakuten predicate
private boolean isHiraganaIterationMark(char c)
    Hiragana iteration mark character predicate
private boolean isIterationMark(char c)
    Iteration mark character predicate
private boolean isKanjiIterationMark(char c)
    Kanji iteration mark character predicate
private boolean isKatakanaDakuten(char c)
    Katakana dakuten predicate
private boolean isKatakanaIterationMark(char c)
    Katakana iteration mark character predicate
private char lookup(char c, char[] map, char offset)
    Looks up a character in dakuten map and returns the dakuten variant if it exists.
private char lookupHiraganaDakuten(char c)
    Look up hiragana dakuten
private char lookupKatakanaDakuten(char c)
    Look up katakana dakuten.
private int nextIterationMarkSpanSize()
    Finds the number of subsequent next iteration marks
private char normalize(char c, char m)
    Normalize a character
private char normalizedHiragana(char c, char m)
    Normalize hiragana character
private char normalizedKatakana(char c, char m)
    Normalize katakana character
private char normalizeIterationMark(char c)
    Normalizes the iteration mark character c
int read()
int read(char[] buffer, int offset, int length)
private char sourceCharacter(int position, int spanSize)
    Returns the source character for a given position and iteration mark span size
Methods inherited from class org.apache.lucene.analysis.CharFilter
close, correctOffset
-
-
-
-
Field Detail
-
NORMALIZE_KANJI_DEFAULT
public static final boolean NORMALIZE_KANJI_DEFAULT
Normalize kanji iteration marks by default
- See Also:
- Constant Field Values
-
NORMALIZE_KANA_DEFAULT
public static final boolean NORMALIZE_KANA_DEFAULT
Normalize kana iteration marks by default
- See Also:
- Constant Field Values
-
KANJI_ITERATION_MARK
private static final char KANJI_ITERATION_MARK
- See Also:
- Constant Field Values
-
HIRAGANA_ITERATION_MARK
private static final char HIRAGANA_ITERATION_MARK
- See Also:
- Constant Field Values
-
HIRAGANA_VOICED_ITERATION_MARK
private static final char HIRAGANA_VOICED_ITERATION_MARK
- See Also:
- Constant Field Values
-
KATAKANA_ITERATION_MARK
private static final char KATAKANA_ITERATION_MARK
- See Also:
- Constant Field Values
-
KATAKANA_VOICED_ITERATION_MARK
private static final char KATAKANA_VOICED_ITERATION_MARK
- See Also:
- Constant Field Values
-
FULL_STOP_PUNCTUATION
private static final char FULL_STOP_PUNCTUATION
- See Also:
- Constant Field Values
-
h2d
private static char[] h2d
-
k2d
private static char[] k2d
-
buffer
private final RollingCharBuffer buffer
-
bufferPosition
private int bufferPosition
-
iterationMarksSpanSize
private int iterationMarksSpanSize
-
iterationMarkSpanEndPosition
private int iterationMarkSpanEndPosition
-
normalizeKanji
private boolean normalizeKanji
-
normalizeKana
private boolean normalizeKana
-
-
Constructor Detail
-
JapaneseIterationMarkCharFilter
public JapaneseIterationMarkCharFilter(java.io.Reader input)
Constructor. Normalizes both kanji and kana iteration marks by default.
- Parameters:
input - char stream
-
JapaneseIterationMarkCharFilter
public JapaneseIterationMarkCharFilter(java.io.Reader input, boolean normalizeKanji, boolean normalizeKana)
Constructor
- Parameters:
input - char stream
normalizeKanji - indicates whether kanji iteration marks should be normalized
normalizeKana - indicates whether kana iteration marks should be normalized
-
-
Method Detail
-
read
public int read(char[] buffer, int offset, int length) throws java.io.IOException
- Specified by:
read in class java.io.Reader
- Throws:
java.io.IOException
-
read
public int read() throws java.io.IOException
- Overrides:
read in class java.io.Reader
- Throws:
java.io.IOException
-
normalizeIterationMark
private char normalizeIterationMark(char c) throws java.io.IOException
Normalizes the iteration mark character c
- Parameters:
c - iteration mark character to normalize
- Returns:
normalized iteration mark
- Throws:
java.io.IOException - If there is a low-level I/O error.
-
nextIterationMarkSpanSize
private int nextIterationMarkSpanSize() throws java.io.IOException
Finds the number of subsequent next iteration marks
- Returns:
number of iteration marks starting at the current buffer position
- Throws:
java.io.IOException - If there is a low-level I/O error.
-
sourceCharacter
private char sourceCharacter(int position, int spanSize) throws java.io.IOException
Returns the source character for a given position and iteration mark span size
- Parameters:
position - buffer position (should not exceed bufferPosition)
spanSize - iteration mark span size
- Returns:
source character
- Throws:
java.io.IOException - If there is a low-level I/O error.
-
normalize
private char normalize(char c, char m)
Normalize a character
- Parameters:
c - character to normalize
m - repetition mark referring to c
- Returns:
normalized character - return c on illegal iteration marks
-
normalizedHiragana
private char normalizedHiragana(char c, char m)
Normalize hiragana character
- Parameters:
c - hiragana character
m - repetition mark referring to c
- Returns:
normalized character - return c on illegal iteration marks
-
normalizedKatakana
private char normalizedKatakana(char c, char m)
Normalize katakana character
- Parameters:
c - katakana character
m - repetition mark referring to c
- Returns:
normalized character - return c on illegal iteration marks
-
isIterationMark
private boolean isIterationMark(char c)
Iteration mark character predicate
- Parameters:
c - character to test
- Returns:
true if c is an iteration mark character. Otherwise false.
-
isHiraganaIterationMark
private boolean isHiraganaIterationMark(char c)
Hiragana iteration mark character predicate
- Parameters:
c - character to test
- Returns:
true if c is a hiragana iteration mark character. Otherwise false.
-
isKatakanaIterationMark
private boolean isKatakanaIterationMark(char c)
Katakana iteration mark character predicate
- Parameters:
c - character to test
- Returns:
true if c is a katakana iteration mark character. Otherwise false.
-
isKanjiIterationMark
private boolean isKanjiIterationMark(char c)
Kanji iteration mark character predicate
- Parameters:
c - character to test
- Returns:
true if c is a kanji iteration mark character. Otherwise false.
-
lookupHiraganaDakuten
private char lookupHiraganaDakuten(char c)
Look up hiragana dakuten
- Parameters:
c - character to look up
- Returns:
hiragana dakuten variant of c or c itself if no dakuten variant exists
-
lookupKatakanaDakuten
private char lookupKatakanaDakuten(char c)
Look up katakana dakuten. Only full-width katakana are supported.
- Parameters:
c - character to look up
- Returns:
katakana dakuten variant of c or c itself if no dakuten variant exists
-
isHiraganaDakuten
private boolean isHiraganaDakuten(char c)
Hiragana dakuten predicate
- Parameters:
c - character to check
- Returns:
true if c is a hiragana dakuten and otherwise false
-
isKatakanaDakuten
private boolean isKatakanaDakuten(char c)
Katakana dakuten predicate
- Parameters:
c - character to check
- Returns:
true if c is a katakana dakuten and otherwise false
-
lookup
private char lookup(char c, char[] map, char offset)
Looks up a character in dakuten map and returns the dakuten variant if it exists. Otherwise return the character being looked up itself
- Parameters:
c - character to look up
map - dakuten map
offset - code point offset from c
- Returns:
mapped character or c if no mapping exists
-
inside
private boolean inside(char c, char[] map, char offset)
Predicate indicating if the lookup character is within dakuten map range
- Parameters:
c - character to look up
map - dakuten map
offset - code point offset from c
- Returns:
true if c is mapped by map and otherwise false
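As an illustration of how an offset-based dakuten map of this kind can work, here is a minimal standalone sketch. The contents of h2d and k2d are not shown on this page, so the table below is a hypothetical ten-entry map covering only か through ご, with the map index computed as c minus the code point of the first covered character. The class name DakutenLookupSketch and the constants MAP and OFFSET are assumptions made for the example.

```java
public final class DakutenLookupSketch {
    // Code point of the first character covered by the map (か, U+304B).
    static final char OFFSET = '\u304B';

    // Hypothetical map: entry i holds the voiced form of the character OFFSET + i.
    // Characters that are already voiced map to themselves.
    static final char[] MAP = {
        '\u304C', // か -> が
        '\u304C', // が -> が
        '\u304E', // き -> ぎ
        '\u304E', // ぎ -> ぎ
        '\u3050', // く -> ぐ
        '\u3050', // ぐ -> ぐ
        '\u3052', // け -> げ
        '\u3052', // げ -> げ
        '\u3054', // こ -> ご
        '\u3054', // ご -> ご
    };

    /** True if c falls within the range of characters covered by the map. */
    static boolean inside(char c, char[] map, char offset) {
        return c >= offset && c < offset + map.length;
    }

    /** Returns the mapped (voiced) character, or c itself if c is outside the map. */
    static char lookup(char c, char[] map, char offset) {
        return inside(c, map, offset) ? map[c - offset] : c;
    }

    public static void main(String[] args) {
        System.out.println(lookup('\u304B', MAP, OFFSET)); // looks up か
        System.out.println(lookup('\u3042', MAP, OFFSET)); // looks up あ, which is outside the map
    }
}
```

With a table like this, a character is already voiced exactly when it lies inside the map and maps to itself, which is one way a dakuten predicate can be expressed on top of the lookup.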
-
correct
protected int correct(int currentOff)
Description copied from class: CharFilter
Subclasses override to correct the current offset.
- Specified by:
correct in class CharFilter
- Parameters:
currentOff - current offset
- Returns:
corrected offset
-
-