DefaultIcuTokenizerConfig (elasticsearch-analysis-icu 5.1.1.0 API)

java.lang.Object
- org.xbib.elasticsearch.index.analysis.icu.segmentation.DefaultIcuTokenizerConfig

All Implemented Interfaces:

IcuTokenizerConfig
```
public class DefaultIcuTokenizerConfig
extends java.lang.Object
implements IcuTokenizerConfig
```
Default IcuTokenizerConfig that is applicable to many languages. It tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), with the following tailorings:
- Thai, Lao, and CJK text is broken into words with a dictionary.
- Myanmar and Khmer text is broken into syllables based on custom BreakIterator rules.
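
As a rough illustration of the UAX#29-style word segmentation described above, the following sketch walks word boundaries and keeps only segments that contain a letter or digit. It is an assumption-laden stand-in: it uses the JDK's `java.text.BreakIterator` with `Locale.ROOT` rather than ICU's `com.ibm.icu.text.BreakIterator` with `ULocale.ROOT`, so it runs without the ICU dependency, and it applies none of the dictionary or syllable tailorings listed above. The class name `WordBreakSketch` and the method `words` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBreakSketch {

    // Splits text at word boundaries and keeps only the segments that contain
    // at least one letter or digit, dropping whitespace and punctuation runs.
    static List<String> words(String text) {
        // JDK stand-in for ICU's BreakIterator.getWordInstance(ULocale.ROOT).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

The ICU iterator exposes the same `first()`/`next()` boundary-walking pattern, so the loop shape should carry over; the segmentation results for Thai, Lao, CJK, Myanmar, and Khmer text will differ from this JDK-based sketch.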

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`WORD_HANGUL` Token type for words containing Korean hangul.
`static java.lang.String`	`WORD_HIRAGANA` Token type for words containing Japanese hiragana.
`static java.lang.String`	`WORD_IDEO` Token type for words containing ideographic characters.
`static java.lang.String`	`WORD_KATAKANA` Token type for words containing Japanese katakana.
`static java.lang.String`	`WORD_LETTER` Token type for words that contain letters.
`static java.lang.String`	`WORD_NUMBER` Token type for words that appear to be numbers.

Constructor Summary

Constructors
Constructor and Description
`DefaultIcuTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)` Creates a new config.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`boolean`	`combineCJ()` Returns true if Han, Hiragana, and Katakana scripts should all be returned as Japanese.
`com.ibm.icu.text.BreakIterator`	`getBreakIterator(int script)` Returns a BreakIterator capable of processing the given script.
`java.lang.String`	`getType(int script, int ruleStatus)` Return a token type value for a given script and BreakIterator rule status.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - WORD_IDEO
```
public static final java.lang.String WORD_IDEO
```
    Token type for words containing ideographic characters.
  - WORD_HIRAGANA
```
public static final java.lang.String WORD_HIRAGANA
```
    Token type for words containing Japanese hiragana.
  - WORD_KATAKANA
```
public static final java.lang.String WORD_KATAKANA
```
    Token type for words containing Japanese katakana.
  - WORD_HANGUL
```
public static final java.lang.String WORD_HANGUL
```
    Token type for words containing Korean hangul.
  - WORD_LETTER
```
public static final java.lang.String WORD_LETTER
```
    Token type for words that contain letters.
  - WORD_NUMBER
```
public static final java.lang.String WORD_NUMBER
```
    Token type for words that appear to be numbers.
- Constructor Detail
  - DefaultIcuTokenizerConfig
```
public DefaultIcuTokenizerConfig(boolean cjkAsWords,
                                 boolean myanmarAsWords)
```
    Creates a new config. The BreakIterators are initialized the first time the class is referenced.
    
    Parameters:
    
    cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
    
    myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it is tokenized as syllables.
- Method Detail
  - combineCJ
```
public boolean combineCJ()
```
    Specified by:
    
    combineCJ in interface IcuTokenizerConfig
    
    Returns:
    
    true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
  - getBreakIterator
```
public com.ibm.icu.text.BreakIterator getBreakIterator(int script)
```
    Description copied from interface: IcuTokenizerConfig
    
    Returns a BreakIterator capable of processing the given script.
    
    Specified by:
    
    getBreakIterator in interface IcuTokenizerConfig
    
    Parameters:
    
    script - script
    
    Returns:
    
    iterator
  - getType
```
public java.lang.String getType(int script,
                                int ruleStatus)
```
    Description copied from interface: IcuTokenizerConfig
    
    Return a token type value for a given script and BreakIterator rule status.
    
    Specified by:
    
    getType in interface IcuTokenizerConfig
    
    Parameters:
    
    script - script
    
    ruleStatus - rule status
    
    Returns:
    
    type
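
    The mapping from a script and a rule status to a token type can be sketched as below. This is one plausible mapping inferred from the field descriptions on this page, not the confirmed implementation. The class `TokenTypeSketch`, the enum `TokenType`, and the `SCRIPT_*` codes are hypothetical placeholders; the `WORD_*` integers are local stand-ins modeled on the rule-status tags of ICU's `RuleBasedBreakIterator`, and the real script codes would come from ICU's `UScript`.

```java
final class TokenTypeSketch {

    // Local stand-ins for ICU rule-status tags (assumed, see RuleBasedBreakIterator.WORD_*).
    static final int WORD_NUMBER = 100;
    static final int WORD_LETTER = 200;
    static final int WORD_KANA = 300;
    static final int WORD_IDEO = 400;

    // Hypothetical script codes, used only inside this sketch.
    static final int SCRIPT_LATIN = 1;
    static final int SCRIPT_HIRAGANA = 2;
    static final int SCRIPT_KATAKANA = 3;
    static final int SCRIPT_HANGUL = 4;

    // Placeholder for the String token types (WORD_IDEO, WORD_HIRAGANA, ...).
    enum TokenType { IDEO, HIRAGANA, KATAKANA, HANGUL, LETTER, NUMBER, OTHER }

    // The rule status picks the broad category; the script refines kana
    // into hiragana or katakana and letters into hangul or generic letters.
    static TokenType getType(int script, int ruleStatus) {
        switch (ruleStatus) {
            case WORD_IDEO:
                return TokenType.IDEO;
            case WORD_KANA:
                return script == SCRIPT_HIRAGANA ? TokenType.HIRAGANA : TokenType.KATAKANA;
            case WORD_LETTER:
                return script == SCRIPT_HANGUL ? TokenType.HANGUL : TokenType.LETTER;
            case WORD_NUMBER:
                return TokenType.NUMBER;
            default:
                return TokenType.OTHER;
        }
    }
}
```

    Under these assumptions, a Hangul word with a letter rule status would be tagged as hangul, while a Latin word with the same rule status would fall back to the generic letter type.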
