protected static class EncodingUtilImpl.UtfDetectionProcessor extends Object implements ByteProcessor
Detects the encoding of an InputStream from a lookahead buffer. It respects a ByteOrderMark, UTF-8 multi-byte sequences, UTF-16 surrogates, zero bytes for the ASCII overhead of UTF-16 and UTF-32, etc.

Modifier and Type | Field and Description |
---|---|
private ByteOrderMark | bom: The ByteOrderMark or null if NOT present (or detection NOT started). |
private long | bytePosition: The byte-position in the stream relative to the head. |
private RankMap<String> | encodingRankMap: The RankMap for encoding detection. |
private long | firstNonAsciiPosition: The bytePosition where the first non-ASCII byte was detected. |
private boolean | maybeAscii: false if the data can NOT be ASCII, true otherwise. |
private boolean | maybeUtf16: false if the data can NOT be UTF-16, true otherwise. |
private boolean | maybeUtf8: false if the data can NOT be UTF-8, true otherwise. |
private String | nonUtfEncoding: The encoding to use if encoding is neither UTF nor ASCII. |
private EncodingUtilImpl.Surrogate[] | surrogates: The last EncodingUtilImpl.Surrogates for each of the positions modulo 4. |
private int | utf8ContinuationByteCount: The expected number of UTF-8 continuation bytes to come or 0 if no UTF-8 multi-byte-sequence is currently processed. |
private int[] | zeroByteCounts: The number of bytes that have been 0 for each of the positions modulo 4. |
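
The bookkeeping described by utf8ContinuationByteCount, maybeUtf8, zeroByteCounts and bytePosition can be illustrated with a minimal sketch. This is not the mmm implementation: the class LookaheadHeuristicsSketch and its methods are hypothetical, only the field names are taken from the table, and the UTF-8 check is simplified (it does not reject overlong forms or code points beyond U+10FFFF).

```java
/**
 * Hypothetical sketch, NOT the mmm implementation. Field names mirror the
 * field table; the class and method names are invented for illustration.
 */
final class LookaheadHeuristicsSketch {

  /** false once a byte sequence was seen that cannot be UTF-8. */
  private boolean maybeUtf8 = true;

  /** Continuation bytes still expected, 0 outside a multi-byte sequence. */
  private int utf8ContinuationByteCount = 0;

  /** Number of zero bytes seen at each stream position modulo 4. */
  private final int[] zeroByteCounts = new int[4];

  /** Position of the next byte relative to the head of the stream. */
  private long bytePosition = 0;

  /** Feeds a single byte into both heuristics. */
  void accept(byte b) {
    int value = b & 0xFF; // treat the signed Java byte as unsigned
    if (value == 0) {
      zeroByteCounts[(int) (bytePosition % 4)]++;
    }
    if (utf8ContinuationByteCount > 0) {
      if ((value & 0xC0) == 0x80) {
        utf8ContinuationByteCount--; // 10xxxxxx: expected continuation byte
      } else {
        maybeUtf8 = false; // sequence was cut short
        utf8ContinuationByteCount = 0;
      }
    } else if ((value & 0xE0) == 0xC0) {
      utf8ContinuationByteCount = 1; // 110xxxxx: lead byte of 2-byte sequence
    } else if ((value & 0xF0) == 0xE0) {
      utf8ContinuationByteCount = 2; // 1110xxxx: lead byte of 3-byte sequence
    } else if ((value & 0xF8) == 0xF0) {
      utf8ContinuationByteCount = 3; // 11110xxx: lead byte of 4-byte sequence
    } else if (value >= 0x80) {
      maybeUtf8 = false; // stray continuation byte or invalid lead byte
    }
    bytePosition++;
  }

  /** Convenience method to feed several bytes at once. */
  void acceptAll(byte... bytes) {
    for (byte b : bytes) {
      accept(b);
    }
  }

  boolean maybeUtf8() {
    return maybeUtf8;
  }

  int zeroByteCount(int positionModulo4) {
    return zeroByteCounts[positionModulo4];
  }
}
```

In such a sketch, ASCII text encoded as UTF-16LE produces zero bytes only at odd positions (1 and 3 modulo 4), which is the kind of pattern a per-position counter makes visible.
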
Constructor and Description |
---|
UtfDetectionProcessor(String nonUtfEncoding): The constructor. |
Modifier and Type | Method and Description |
---|---|
String | getEncoding(): This method gets the detected encoding from the currently processed data. |
String | getLowByteEncoding(): This method gets the encoding without taking high-bytes (non-ASCII) into account. |
int | process(byte[] buffer, int offset, int length): This method is called to process the number of length bytes from the given buffer starting from the given offset. |
private int | processBom(byte[] buffer, int offset, int i): Detects if a ByteOrderMark (BOM) is available as hint for the encoding. |
private void | processUtf16Detection(byte b): Heuristic analysis to detect UTF-16 indications. |
private void | processUtf8Detection(byte b): Heuristic analysis to detect UTF-8 indications. |
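
As an illustration of the kind of check processBom performs, the following minimal sketch matches the standard Unicode byte order marks at a given buffer offset. It is an assumption about the general technique, not the mmm code: the enum BomSketch and its detect and length methods are hypothetical and are unrelated to the library's ByteOrderMark type.

```java
/**
 * Hypothetical sketch, NOT the mmm ByteOrderMark type. Matches the standard
 * Unicode byte order marks at a given position of a buffer.
 */
enum BomSketch {

  UTF_8("UTF-8", 0xEF, 0xBB, 0xBF),
  UTF_32_BE("UTF-32BE", 0x00, 0x00, 0xFE, 0xFF),
  // Declared before UTF_16_LE because its signature starts with FF FE as well.
  UTF_32_LE("UTF-32LE", 0xFF, 0xFE, 0x00, 0x00),
  UTF_16_BE("UTF-16BE", 0xFE, 0xFF),
  UTF_16_LE("UTF-16LE", 0xFF, 0xFE);

  final String encoding;
  private final int[] signature;

  BomSketch(String encoding, int... signature) {
    this.encoding = encoding;
    this.signature = signature;
  }

  /** Number of bytes the mark occupies in the stream. */
  int length() {
    return signature.length;
  }

  /**
   * Returns the mark found at buffer[offset], or null if none matches.
   * Only the length bytes starting at offset are inspected.
   */
  static BomSketch detect(byte[] buffer, int offset, int length) {
    for (BomSketch bom : values()) {
      if (bom.signature.length > length) {
        continue; // not enough bytes available for this mark
      }
      boolean match = true;
      for (int i = 0; i < bom.signature.length; i++) {
        if ((buffer[offset + i] & 0xFF) != bom.signature[i]) {
          match = false;
          break;
        }
      }
      if (match) {
        return bom;
      }
    }
    return null;
  }
}
```

A caller following the contract described for processBom could advance its index by length() when a mark is found and leave it unchanged otherwise. Note that FF FE 00 00 is ambiguous in principle (a UTF-16LE mark followed by U+0000); this sketch simply prefers the longer match.
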

private ByteOrderMark bom
The ByteOrderMark or null if NOT present (or detection NOT started).

private final String nonUtfEncoding
The encoding to use if encoding is neither UTF nor ASCII.

private boolean maybeAscii
false if the data can NOT be ASCII, true otherwise.

private boolean maybeUtf8
false if the data can NOT be UTF-8, true otherwise.

private boolean maybeUtf16
false if the data can NOT be UTF-16, true otherwise.

private long bytePosition
The byte-position in the stream relative to the head.

private long firstNonAsciiPosition
The bytePosition where the first non-ASCII byte was detected.

private int[] zeroByteCounts
The number of bytes that have been 0 for each of the positions modulo 4.

private EncodingUtilImpl.Surrogate[] surrogates
The last EncodingUtilImpl.Surrogates for each of the positions modulo 4.

private int utf8ContinuationByteCount
The expected number of UTF-8 continuation bytes to come or 0 if no UTF-8 multi-byte-sequence is currently processed.

public UtfDetectionProcessor(String nonUtfEncoding)
The constructor.
Parameters:
nonUtfEncoding - is the encoding to use if encoding is neither UTF nor ASCII.

public int process(byte[] buffer, int offset, int length)
Description copied from interface: ByteProcessor
This method is called to process the number of length bytes from the given buffer starting from the given offset. It is NOT permitted to modify the given buffer unless this is explicitly specified by the calling object (typically an implementation of ByteProcessable).
Specified by:
process in interface ByteProcessor
Parameters:
buffer - contains the bytes to process.
offset - is the index where to start in the buffer.
length - is the number of bytes to process.
Returns:
the number of bytes processed, typically length. However, you can also return a value less than length and greater than or equal to zero in order to stop processing at a specific position.

private void processUtf16Detection(byte b)
Heuristic analysis to detect UTF-16 indications.
Parameters:
b - is the single byte to process.

private void processUtf8Detection(byte b)
Heuristic analysis to detect UTF-8 indications.
Parameters:
b - is the single byte to process.

private int processBom(byte[] buffer, int offset, int i)
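Detects if a ByteOrderMark (BOM) is available as hint for the encoding.
Parameters:
buffer - is the buffer of the raw data.
offset - is the current offset.
i - is the current index.
Returns:
i, or a greater value if bytes (for a detected BOM) have been consumed.

public String getLowByteEncoding()
This method gets the encoding without taking high-bytes (non-ASCII) into account.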
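Returns:
the encoding, or null if it looks like ASCII so far.

public String getEncoding()
This method gets the detected encoding from the currently processed data.
Returns:
the detected encoding, or null if the encoding has NOT yet been detected and it looks like ASCII so far.

Copyright © 2001–2016 mmm-Team. All rights reserved.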