EncodingUtilImpl (JavaDocs for mmm-all dev-SNAPSHOT)

java.lang.Object
- net.sf.mmm.util.component.base.AbstractComponent
  - net.sf.mmm.util.component.base.AbstractLoggableComponent
    - net.sf.mmm.util.io.base.EncodingUtilImpl

All Implemented Interfaces:

EncodingUtil
```
public class EncodingUtilImpl
extends AbstractLoggableComponent
implements EncodingUtil
```
This is the implementation of the EncodingUtil interface.

Since:

1.0.1

Author:

Joerg Hohwiller (hohwille at users.sourceforge.net)

See Also:

getInstance()

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`protected static class`	`EncodingUtilImpl.AsciiProcessor` This inner class is used to process the bytes from the underlying `InputStream` in ASCII mode.
`protected static class`	`EncodingUtilImpl.Surrogate` This enum represents the type of a surrogate in a UTF-16 sequence.
`protected static class`	`EncodingUtilImpl.UtfDetectionProcessor` This inner class is used to perform the actual UTF detection.
`protected class`	`EncodingUtilImpl.UtfDetectionReader`

Field Summary

Fields
Modifier and Type	Field and Description
`private static EncodingUtil`	`instance`
`private static int`	`RANK_BOM` The rank gain if a proper `ByteOrderMark` was detected.
`private static int`	`RANK_UTF16_SURROGATE` The rank gain if a UTF-16 surrogate pair was detected.
`private static int`	`RANK_UTF8_SEQUNCE` The rank gain if a proper UTF-8 multi-byte sequence was detected.
`static byte`	`UTF_16_FIRST_SURROGATE_MAX` A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
`static byte`	`UTF_16_FIRST_SURROGATE_MIN` A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
`static byte`	`UTF_16_SECOND_SURROGATE_MAX` A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
`static byte`	`UTF_16_SECOND_SURROGATE_MIN` A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
`static byte`	`UTF_8_CONTINUATION_BYTE_MAX` In a UTF-8 multi-byte-sequence all bytes except the first one have the form `10xxxxxx`.
`static byte`	`UTF_8_CONTINUATION_BYTE_MIN` In a UTF-8 multi-byte-sequence all bytes except the first one have the form `10xxxxxx`.
`static byte`	`UTF_8_FOUR_BYTE_MAX` A UTF-8 four-byte-sequence has the form `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`.
`static byte`	`UTF_8_FOUR_BYTE_MIN` A UTF-8 four-byte-sequence has the form `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`.
`static byte`	`UTF_8_THREE_BYTE_MAX` A UTF-8 three-byte-sequence has the form `1110xxxx 10xxxxxx 10xxxxxx`.
`static byte`	`UTF_8_THREE_BYTE_MIN` A UTF-8 three-byte-sequence has the form `1110xxxx 10xxxxxx 10xxxxxx`.
`static byte`	`UTF_8_TWO_BYTE_MAX` A UTF-8 two-byte-sequence has the form `110xxxxx 10xxxxxx`.
`static byte`	`UTF_8_TWO_BYTE_MIN` A UTF-8 two-byte-sequence has the form `110xxxxx 10xxxxxx`.

Fields inherited from interface net.sf.mmm.util.io.api.EncodingUtil
ENCODING_CP_437, ENCODING_CP_737, ENCODING_CP_850, ENCODING_CP_852, ENCODING_CP_855, ENCODING_CP_857, ENCODING_CP_858, ENCODING_CP_860, ENCODING_CP_861, ENCODING_CP_863, ENCODING_CP_865, ENCODING_CP_866, ENCODING_CP_869, ENCODING_ISO_8859_1, ENCODING_ISO_8859_10, ENCODING_ISO_8859_11, ENCODING_ISO_8859_12, ENCODING_ISO_8859_13, ENCODING_ISO_8859_14, ENCODING_ISO_8859_15, ENCODING_ISO_8859_16, ENCODING_ISO_8859_2, ENCODING_ISO_8859_3, ENCODING_ISO_8859_4, ENCODING_ISO_8859_5, ENCODING_ISO_8859_6, ENCODING_ISO_8859_7, ENCODING_ISO_8859_8, ENCODING_ISO_8859_9, ENCODING_KOI8_R, ENCODING_KOI8_U, ENCODING_US_ASCII, ENCODING_UTF_16, ENCODING_UTF_16_BE, ENCODING_UTF_16_LE, ENCODING_UTF_32, ENCODING_UTF_32_BE, ENCODING_UTF_32_LE, ENCODING_UTF_8, ENCODING_WINDOWS_1250, ENCODING_WINDOWS_1251, ENCODING_WINDOWS_1252, ENCODING_WINDOWS_1253, ENCODING_WINDOWS_1254, ENCODING_WINDOWS_1255, ENCODING_WINDOWS_1256, ENCODING_WINDOWS_1257, ENCODING_WINDOWS_1258, SYSTEM_DEFAULT_ENCODING

Constructor Summary

Constructors
Constructor and Description
`EncodingUtilImpl()` The constructor.

Method Summary

Methods
Modifier and Type	Method and Description
`EncodingDetectionReader`	`createUtfDetectionReader(InputStream inputStream, String nonUtfEncoding)` This method creates a new `Reader` for the given `inputStream`.
`static EncodingUtil`	`getInstance()` This method gets the singleton instance of this `EncodingUtilImpl`.

Methods inherited from class net.sf.mmm.util.component.base.AbstractLoggableComponent
createLogger, doInitialize, getLogger

Methods inherited from class net.sf.mmm.util.component.base.AbstractComponent
doInitialized, getInitializationState, initialize

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - UTF_8_CONTINUATION_BYTE_MIN
```
public static final byte UTF_8_CONTINUATION_BYTE_MIN
```
    In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. This is the lower bound used to detect such a byte.
    
    See Also:
    
    Constant Field Values
  - UTF_8_CONTINUATION_BYTE_MAX
```
public static final byte UTF_8_CONTINUATION_BYTE_MAX
```
    In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. This is the upper bound used to detect such a byte.
    
    See Also:
    
    Constant Field Values
  - UTF_8_TWO_BYTE_MIN
```
public static final byte UTF_8_TWO_BYTE_MIN
```
    A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. This is the lower bound used to detect the first byte of such a sequence.
    ATTENTION:
    The bytes 0xC0 and 0xC1 would indicate a two-byte-sequence with a code point <= 127, which makes no sense.
    
    See Also:
    
    Constant Field Values
  - UTF_8_TWO_BYTE_MAX
```
public static final byte UTF_8_TWO_BYTE_MAX
```
    A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. This is the upper bound used to detect the first byte of such a sequence.
    
    See Also:
    
    Constant Field Values
  - UTF_8_THREE_BYTE_MIN
```
public static final byte UTF_8_THREE_BYTE_MIN
```
    A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the lower bound used to detect the first byte of such a sequence.
    
    See Also:
    
    Constant Field Values
  - UTF_8_THREE_BYTE_MAX
```
public static final byte UTF_8_THREE_BYTE_MAX
```
    A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the upper bound used to detect the first byte of such a sequence.
    
    See Also:
    
    Constant Field Values
  - UTF_8_FOUR_BYTE_MIN
```
public static final byte UTF_8_FOUR_BYTE_MIN
```
    A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the lower bound used to detect the first byte of such a sequence.
    
    See Also:
    
    Constant Field Values
  - UTF_8_FOUR_BYTE_MAX
```
public static final byte UTF_8_FOUR_BYTE_MAX
```
    A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the upper bound used to detect the first byte of such a sequence.
    ATTENTION:
    The bytes 0xF5, 0xF6, or 0xF7 would lead to a four-byte-sequence with a code point greater than U+10FFFF, which is prohibited by RFC 3629.
    
    See Also:
    
    Constant Field Values
  - UTF_16_FIRST_SURROGATE_MIN
```
public static final byte UTF_16_FIRST_SURROGATE_MIN
```
    A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The first has the form 110110xx xxxxxxxx. This is the lower bound used to detect the first byte of such a sequence.
    
    See Also:
    
    Constant Field Values
  - UTF_16_FIRST_SURROGATE_MAX
```
public static final byte UTF_16_FIRST_SURROGATE_MAX
```
    A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The first has the form 110110xx xxxxxxxx. This is the upper bound used to detect the first byte of such a sequence.
    
    See Also:
    
    Constant Field Values
  - UTF_16_SECOND_SURROGATE_MIN
```
public static final byte UTF_16_SECOND_SURROGATE_MIN
```
    A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The second has the form 110111xx xxxxxxxx. This is the lower bound used to detect the first byte of the second surrogate.
    
    See Also:
    
    Constant Field Values
  - UTF_16_SECOND_SURROGATE_MAX
```
public static final byte UTF_16_SECOND_SURROGATE_MAX
```
    A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The second has the form 110111xx xxxxxxxx. This is the upper bound used to detect the first byte of the second surrogate.
    
    See Also:
    
    Constant Field Values
  - RANK_BOM
```
private static final int RANK_BOM
```
    The rank gain if a proper ByteOrderMark was detected.
    
    See Also:
    
    Constant Field Values
  - RANK_UTF8_SEQUNCE
```
private static final int RANK_UTF8_SEQUNCE
```
    The rank gain if a proper UTF-8 multi-byte sequence was detected.
    
    See Also:
    
    Constant Field Values
  - RANK_UTF16_SURROGATE
```
private static final int RANK_UTF16_SURROGATE
```
    The rank gain if a UTF-16 surrogate pair was detected.
    
    See Also:
    
    Constant Field Values
  - instance
```
private static EncodingUtil instance
```
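The bit patterns documented for the UTF-8 and UTF-16 bound constants can be illustrated with a small, self-contained sketch. The class `Utf8LeadBytes` and its methods are hypothetical and not part of this library. The numeric bounds are assumptions derived from the bit patterns and ATTENTION notes above (0xC0/0xC1 and 0xF5 and higher excluded), since the actual constant values are not listed on this page.

```java
/** Hypothetical helper; all numeric bounds are assumptions, not the library's constants. */
final class Utf8LeadBytes {

  private Utf8LeadBytes() {}

  /**
   * Returns the expected sequence length (1 to 4) for a lead byte,
   * 0 for a continuation byte (10xxxxxx), or -1 for a byte that
   * cannot occur in well-formed UTF-8.
   */
  static int sequenceLength(byte b) {
    int u = b & 0xFF;          // Java bytes are signed, so compare as an unsigned int
    if (u < 0x80) return 1;    // 0xxxxxxx: ASCII
    if (u <= 0xBF) return 0;   // 10xxxxxx: continuation byte
    if (u < 0xC2) return -1;   // 0xC0, 0xC1: would encode a code point <= 127
    if (u <= 0xDF) return 2;   // 110xxxxx
    if (u <= 0xEF) return 3;   // 1110xxxx
    if (u <= 0xF4) return 4;   // 11110xxx, capped so the code point stays <= U+10FFFF
    return -1;                 // 0xF5..0xFF
  }

  /** True if the high byte of a UTF-16 code unit has the form 110110xx. */
  static boolean isFirstSurrogateHighByte(byte b) {
    return (b & 0xFC) == 0xD8;
  }

  /** True if the high byte of a UTF-16 code unit has the form 110111xx. */
  static boolean isSecondSurrogateHighByte(byte b) {
    return (b & 0xFC) == 0xDC;
  }
}
```

Masking with `0xFF` or `0xFC` before comparing avoids the sign problem that arises when range-checking `byte` values of 0x80 and above directly.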
- Constructor Detail
  - EncodingUtilImpl
```
public EncodingUtilImpl()
```
    The constructor.
- Method Detail
  - getInstance
```
public static EncodingUtil getInstance()
```
    This method gets the singleton instance of this EncodingUtilImpl.
    ATTENTION:
    Please read Cdi.GET_INSTANCE before using.
    
    Returns:
    
    the singleton instance.
  - createUtfDetectionReader
```
public EncodingDetectionReader createUtfDetectionReader(InputStream inputStream,
                                                        String nonUtfEncoding)
```
    Description copied from interface: EncodingUtil
    
    This method creates a new Reader for the given inputStream. The EncodingDetectionReader automatically detects UTF (Unicode Transformation Format) encodings. If the data provided by inputStream is NOT in such an encoding, it will use the given nonUtfEncoding as fallback.
    The EncodingDetectionReader will behave like InputStreamReader but with an encoding that is automatically detected whilst reading. It will use a lookahead buffer to detect the encoding. As long as no UTF characteristic has been detected and only ASCII characters (<128) are hit, the encoding remains EncodingUtil.ENCODING_US_ASCII. As soon as a UTF sequence is detected (e.g. EncodingUtil.ENCODING_UTF_8 or EncodingUtil.ENCODING_UTF_16_BE), the reader switches to that encoding. If a non-ASCII character is hit and no UTF encoding is detected, the EncodingDetectionReader switches to the given nonUtfEncoding.
    
    Specified by:
    
    createUtfDetectionReader in interface EncodingUtil
    
    Parameters:
    
    inputStream - is the InputStream to decode and read.
    
    nonUtfEncoding - is the encoding to use in case the data is NOT encoded in UTF (e.g. EncodingUtil.ENCODING_ISO_8859_15). It is pointless to use a UTF-based encoding or EncodingUtil.ENCODING_US_ASCII here.
    
    Returns:
    
    a new EncodingDetectionReader that can be used to read the inputStream.
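
The fallback rule described for createUtfDetectionReader can be approximated with JDK classes alone. The following is a simplified sketch, not this library's implementation: the class `EncodingGuess` is hypothetical, it inspects a whole byte array rather than a lookahead buffer, it covers only UTF-8 (no BOM ranking or UTF-16 detection), and it returns JDK charset names that are assumed, not confirmed, to match the EncodingUtil constants.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the ASCII / UTF-8 / fallback decision. */
final class EncodingGuess {

  private EncodingGuess() {}

  /**
   * Returns "US-ASCII" if all bytes are below 128, "UTF-8" if the data
   * decodes as strict UTF-8, and nonUtfEncoding otherwise.
   */
  static String guess(byte[] data, String nonUtfEncoding) {
    boolean asciiOnly = true;
    for (byte b : data) {
      if (b < 0) {             // bytes >= 0x80 are negative in Java
        asciiOnly = false;
        break;
      }
    }
    if (asciiOnly) {
      return "US-ASCII";
    }
    try {
      // A strict decoder rejects malformed, overlong, and truncated sequences.
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data));
      return "UTF-8";
    } catch (CharacterCodingException e) {
      return nonUtfEncoding;   // non-ASCII bytes that are not valid UTF-8
    }
  }
}
```

For instance, the bytes 0xC3 0xA4 form a valid two-byte UTF-8 sequence, whereas a lone 0xE4 does not and would be handed to the fallback encoding.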
