public class EncodingUtilImpl extends AbstractLoggableComponent implements EncodingUtil
This is the implementation of the EncodingUtil interface.
See Also: getInstance()
Modifier and Type | Class and Description |
---|---|
protected static class | EncodingUtilImpl.AsciiProcessor: This inner class is used to process the bytes from the underlying InputStream in ASCII mode. |
protected static class | EncodingUtilImpl.Surrogate: This enum represents the type of a surrogate in a UTF-16 sequence. |
protected static class | EncodingUtilImpl.UtfDetectionProcessor: This inner class is used to perform the actual UTF detection. |
protected class | EncodingUtilImpl.UtfDetectionReader |
Modifier and Type | Field and Description |
---|---|
private static EncodingUtil | instance |
private static int | RANK_BOM: The rank gain if a proper ByteOrderMark was detected. |
private static int | RANK_UTF16_SURROGATE: The rank gain if a UTF-16 surrogate pair was detected. |
private static int | RANK_UTF8_SEQUNCE: The rank gain if a proper UTF-8 multi-byte sequence was detected. |
static byte | UTF_16_FIRST_SURROGATE_MAX: A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. |
static byte | UTF_16_FIRST_SURROGATE_MIN: A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. |
static byte | UTF_16_SECOND_SURROGATE_MAX: A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. |
static byte | UTF_16_SECOND_SURROGATE_MIN: A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. |
static byte | UTF_8_CONTINUATION_BYTE_MAX: In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. |
static byte | UTF_8_CONTINUATION_BYTE_MIN: In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. |
static byte | UTF_8_FOUR_BYTE_MAX: A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. |
static byte | UTF_8_FOUR_BYTE_MIN: A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. |
static byte | UTF_8_THREE_BYTE_MAX: A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. |
static byte | UTF_8_THREE_BYTE_MIN: A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. |
static byte | UTF_8_TWO_BYTE_MAX: A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. |
static byte | UTF_8_TWO_BYTE_MIN: A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. |
ENCODING_CP_437, ENCODING_CP_737, ENCODING_CP_850, ENCODING_CP_852, ENCODING_CP_855, ENCODING_CP_857, ENCODING_CP_858, ENCODING_CP_860, ENCODING_CP_861, ENCODING_CP_863, ENCODING_CP_865, ENCODING_CP_866, ENCODING_CP_869, ENCODING_ISO_8859_1, ENCODING_ISO_8859_10, ENCODING_ISO_8859_11, ENCODING_ISO_8859_12, ENCODING_ISO_8859_13, ENCODING_ISO_8859_14, ENCODING_ISO_8859_15, ENCODING_ISO_8859_16, ENCODING_ISO_8859_2, ENCODING_ISO_8859_3, ENCODING_ISO_8859_4, ENCODING_ISO_8859_5, ENCODING_ISO_8859_6, ENCODING_ISO_8859_7, ENCODING_ISO_8859_8, ENCODING_ISO_8859_9, ENCODING_KOI8_R, ENCODING_KOI8_U, ENCODING_US_ASCII, ENCODING_UTF_16, ENCODING_UTF_16_BE, ENCODING_UTF_16_LE, ENCODING_UTF_32, ENCODING_UTF_32_BE, ENCODING_UTF_32_LE, ENCODING_UTF_8, ENCODING_WINDOWS_1250, ENCODING_WINDOWS_1251, ENCODING_WINDOWS_1252, ENCODING_WINDOWS_1253, ENCODING_WINDOWS_1254, ENCODING_WINDOWS_1255, ENCODING_WINDOWS_1256, ENCODING_WINDOWS_1257, ENCODING_WINDOWS_1258, SYSTEM_DEFAULT_ENCODING
Constructor and Description |
---|
EncodingUtilImpl(): The constructor. |
Modifier and Type | Method and Description |
---|---|
EncodingDetectionReader | createUtfDetectionReader(InputStream inputStream, String nonUtfEncoding): This method creates a new Reader for the given inputStream. |
static EncodingUtil | getInstance(): This method gets the singleton instance of this EncodingUtilImpl. |
createLogger, doInitialize, getLogger
doInitialized, getInitializationState, initialize
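public static final byte UTF_8_CONTINUATION_BYTE_MIN
In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. This is the lower bound to detect such a byte.

public static final byte UTF_8_CONTINUATION_BYTE_MAX
In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. This is the upper bound to detect such a byte.

public static final byte UTF_8_TWO_BYTE_MIN
A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence. 0xC0 or 0xC1 would indicate a two-byte-sequence with a code point <= 127, which makes no sense.

public static final byte UTF_8_TWO_BYTE_MAX
A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence.

public static final byte UTF_8_THREE_BYTE_MIN
A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence.

public static final byte UTF_8_THREE_BYTE_MAX
A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence.

public static final byte UTF_8_FOUR_BYTE_MIN
A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence.

public static final byte UTF_8_FOUR_BYTE_MAX
A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence. 0xF5, 0xF6, or 0xF7 would lead to a four-byte-sequence with a code point greater than 10FFFF, which is restricted by RFC 3629.

public static final byte UTF_16_FIRST_SURROGATE_MIN
A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The first surrogate has the form 110110xx xxxxxxxx. This is the lower bound to detect the first byte of such a surrogate.

public static final byte UTF_16_FIRST_SURROGATE_MAX
A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The first surrogate has the form 110110xx xxxxxxxx. This is the upper bound to detect the first byte of such a surrogate.

public static final byte UTF_16_SECOND_SURROGATE_MIN
A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The second surrogate has the form 110111xx xxxxxxxx. This is the lower bound to detect the first byte of such a surrogate.

public static final byte UTF_16_SECOND_SURROGATE_MAX
A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The second surrogate has the form 110111xx xxxxxxxx. This is the upper bound to detect the first byte of such a surrogate.

private static final int RANK_BOM
The rank gain if a proper ByteOrderMark was detected.

private static final int RANK_UTF8_SEQUNCE
The rank gain if a proper UTF-8 multi-byte sequence was detected.

private static final int RANK_UTF16_SURROGATE
The rank gain if a UTF-16 surrogate pair was detected.

private static EncodingUtil instance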
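
The byte bounds documented above can be illustrated with a small classifier. This is a minimal sketch, not the code of EncodingUtilImpl: the class UtfByteClassifier and its methods are hypothetical. The numeric values are assumptions derived from the documented bit patterns and the stated exclusions (0xC0, 0xC1, 0xF5 to 0xF7), not copied from the library. The sketch compares unsigned int values for readability, whereas the documented constants are declared as byte.

```java
/** Hypothetical helper for illustration only; not part of EncodingUtilImpl. */
final class UtfByteClassifier {

  // Assumed values, inferred from the documented bit patterns.
  static final int UTF_8_CONTINUATION_BYTE_MIN = 0x80; // 10000000
  static final int UTF_8_CONTINUATION_BYTE_MAX = 0xBF; // 10111111
  static final int UTF_8_TWO_BYTE_MIN = 0xC2;          // 0xC0 and 0xC1 are excluded
  static final int UTF_8_TWO_BYTE_MAX = 0xDF;          // 11011111
  static final int UTF_8_THREE_BYTE_MIN = 0xE0;        // 11100000
  static final int UTF_8_THREE_BYTE_MAX = 0xEF;        // 11101111
  static final int UTF_8_FOUR_BYTE_MIN = 0xF0;         // 11110000
  static final int UTF_8_FOUR_BYTE_MAX = 0xF4;         // 0xF5 to 0xF7 are excluded
  static final int UTF_16_FIRST_SURROGATE_MIN = 0xD8;  // 11011000
  static final int UTF_16_FIRST_SURROGATE_MAX = 0xDB;  // 11011011
  static final int UTF_16_SECOND_SURROGATE_MIN = 0xDC; // 11011100
  static final int UTF_16_SECOND_SURROGATE_MAX = 0xDF; // 11011111

  private UtfByteClassifier() {
  }

  private static boolean inRange(int value, int min, int max) {
    return (value >= min) && (value <= max);
  }

  /**
   * Returns the sequence length announced by the given byte: 1 for ASCII,
   * 2 to 4 for a UTF-8 lead byte, 0 for a continuation byte and -1 for a
   * byte that cannot occur in well-formed UTF-8.
   */
  static int utf8SequenceLength(byte b) {
    int u = b & 0xFF; // treat the byte as unsigned
    if (u < 0x80) {
      return 1;
    }
    if (inRange(u, UTF_8_CONTINUATION_BYTE_MIN, UTF_8_CONTINUATION_BYTE_MAX)) {
      return 0;
    }
    if (inRange(u, UTF_8_TWO_BYTE_MIN, UTF_8_TWO_BYTE_MAX)) {
      return 2;
    }
    if (inRange(u, UTF_8_THREE_BYTE_MIN, UTF_8_THREE_BYTE_MAX)) {
      return 3;
    }
    if (inRange(u, UTF_8_FOUR_BYTE_MIN, UTF_8_FOUR_BYTE_MAX)) {
      return 4;
    }
    return -1;
  }

  /** Checks the high byte of a UTF-16 code unit against the first-surrogate bounds. */
  static boolean isFirstSurrogate(byte highByte) {
    return inRange(highByte & 0xFF, UTF_16_FIRST_SURROGATE_MIN, UTF_16_FIRST_SURROGATE_MAX);
  }

  /** Checks the high byte of a UTF-16 code unit against the second-surrogate bounds. */
  static boolean isSecondSurrogate(byte highByte) {
    return inRange(highByte & 0xFF, UTF_16_SECOND_SURROGATE_MIN, UTF_16_SECOND_SURROGATE_MAX);
  }

  public static void main(String[] args) {
    System.out.println(utf8SequenceLength((byte) 0xC3));
    System.out.println(isFirstSurrogate((byte) 0xD8));
  }
}
```

Because every bound above has its high bit set, a signed byte comparison would accept the same ranges. The unsigned form is used here only because it reads more directly against the bit patterns.
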

public static EncodingUtil getInstance()
This method gets the singleton instance of this EncodingUtilImpl. Please read Cdi.GET_INSTANCE before using.

public EncodingDetectionReader createUtfDetectionReader(InputStream inputStream, String nonUtfEncoding)
Description copied from interface: EncodingUtil
This method creates a new Reader for the given inputStream. The EncodingDetectionReader automatically detects UTF (Unicode Transformation Format) encodings. If the data provided by inputStream is NOT in such an encoding, it will use the given nonUtfEncoding as fallback.
The EncodingDetectionReader will behave like InputStreamReader but with an encoding that is automatically detected whilst reading. It uses a lookahead buffer to detect the encoding. As long as no UTF characteristic has been detected and only ASCII characters (<128) are hit, the encoding remains EncodingUtil.ENCODING_US_ASCII. As soon as a UTF sequence is detected (e.g. EncodingUtil.ENCODING_UTF_8 or EncodingUtil.ENCODING_UTF_16_BE), the reader switches to that encoding. If a non-ASCII character is hit and no UTF encoding is detected, the EncodingDetectionReader switches to the given nonUtfEncoding.
Specified by: createUtfDetectionReader in interface EncodingUtil
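
The switching behavior described above can be approximated on a complete byte array. The following is a simplified sketch under stated assumptions, not the implementation of EncodingDetectionReader. The class UtfDetectionSketch and the method detectEncoding are hypothetical. It inspects the whole buffer at once instead of using a lookahead buffer while reading, it handles only UTF-8 (no byte order marks, no UTF-16 or UTF-32), and it returns plain charset names as strings.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified illustration; not the library's detection code. */
final class UtfDetectionSketch {

  private UtfDetectionSketch() {
  }

  /**
   * Returns "US-ASCII" if only bytes below 128 occur, "UTF-8" if every
   * non-ASCII byte belongs to a well-formed UTF-8 multi-byte sequence,
   * and the given fallback otherwise.
   */
  static String detectEncoding(byte[] data, String nonUtfEncoding) {
    boolean sawMultiByteSequence = false;
    int i = 0;
    while (i < data.length) {
      int lead = data[i] & 0xFF;
      if (lead < 0x80) {
        i++; // plain ASCII, nothing to decide yet
        continue;
      }
      int length;
      if (lead >= 0xC2 && lead <= 0xDF) {
        length = 2;
      } else if (lead >= 0xE0 && lead <= 0xEF) {
        length = 3;
      } else if (lead >= 0xF0 && lead <= 0xF4) {
        length = 4;
      } else {
        return nonUtfEncoding; // not a valid UTF-8 lead byte
      }
      if (i + length > data.length) {
        return nonUtfEncoding; // sequence is cut off
      }
      for (int k = 1; k < length; k++) {
        int continuation = data[i + k] & 0xFF;
        if (continuation < 0x80 || continuation > 0xBF) {
          return nonUtfEncoding; // expected a byte of the form 10xxxxxx
        }
      }
      sawMultiByteSequence = true;
      i += length;
    }
    return sawMultiByteSequence ? "UTF-8" : "US-ASCII";
  }

  public static void main(String[] args) {
    byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
    byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(detectEncoding(utf8, "ISO-8859-15"));
    System.out.println(detectEncoding(latin1, "ISO-8859-15"));
  }
}
```

The fallback is needed because a lone non-ASCII byte such as 0xE9 is a valid character in a single-byte encoding, but here it only announces a three-byte UTF-8 sequence that never follows.
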
Parameters:
inputStream - is the InputStream to decode and read.
nonUtfEncoding - is the encoding to use in case the data is NOT encoded in UTF (e.g. EncodingUtil.ENCODING_ISO_8859_15). It is pointless to use a UTF-based encoding or EncodingUtil.ENCODING_US_ASCII here.
Returns:
the EncodingDetectionReader that can be used to read the inputStream.

Copyright © 2001–2016 mmm-Team. All rights reserved.