EncodingUtilImpl.UtfDetectionProcessor (JavaDocs for mmm-all dev-SNAPSHOT)

java.lang.Object
- net.sf.mmm.util.io.base.EncodingUtilImpl.UtfDetectionProcessor

All Implemented Interfaces:

ByteProcessor

Enclosing class:

EncodingUtilImpl
```
protected static class EncodingUtilImpl.UtfDetectionProcessor
extends Object
implements ByteProcessor
```
This inner class performs the actual UTF detection. It processes the bytes of the underlying InputStream from a lookahead buffer. It takes into account indications such as a ByteOrderMark, UTF-8 multi-byte sequences, UTF-16 surrogates, and the zero bytes that UTF-16 and UTF-32 add as overhead to ASCII characters.

Field Summary

Fields
Modifier and Type	Field and Description
`private ByteOrderMark`	`bom` The `ByteOrderMark` or `null` if NOT present (or detection NOT started).
`private long`	`bytePosition` The byte position in the stream, relative to its start.
`private RankMap<String>`	`encodingRankMap` The `RankMap` for encoding detection.
`private long`	`firstNonAsciiPosition` The `bytePosition` where the first non-ASCII byte was detected.
`private boolean`	`maybeAscii` `false` if the data can NOT be ASCII, `true` otherwise.
`private boolean`	`maybeUtf16` `false` if the data can NOT be UTF-16, `true` otherwise.
`private boolean`	`maybeUtf8` `false` if the data can NOT be UTF-8, `true` otherwise.
`private String`	`nonUtfEncoding` The encoding to use if encoding is neither UTF nor ASCII.
`private EncodingUtilImpl.Surrogate[]`	`surrogates` The last `EncodingUtilImpl.Surrogate` for each of the byte positions modulo 4.
`private int`	`utf8ContinuationByteCount` The expected number of UTF-8 continuation bytes to come or `0` if no UTF-8 multi-byte-sequence is currently processed.
`private int[]`	`zeroByteCounts` The number of bytes that have been `0` for each of the byte positions modulo 4.

Constructor Summary

Constructors
Constructor and Description
`UtfDetectionProcessor(String nonUtfEncoding)` The constructor.

Method Summary

Modifier and Type	Method and Description
`String`	`getEncoding()` This method gets the detected encoding from the currently processed data.
`String`	`getLowByteEncoding()` This method gets the encoding without taking high-bytes (non-ASCII) into account.
`int`	`process(byte[] buffer, int offset, int length)` This method is called to process `length` bytes from the given `buffer`, starting at the given `offset`.
`private int`	`processBom(byte[] buffer, int offset, int i)` Detects if a `ByteOrderMark` (BOM) is available as hint for the encoding.
`private void`	`processUtf16Detection(byte b)` Heuristic analysis to detect UTF-16 indications.
`private void`	`processUtf8Detection(byte b)` Heuristic analysis to detect UTF-8 indications.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - encodingRankMap
```
private RankMap<String> encodingRankMap
```
    The RankMap for encoding detection.
  - bom
```
private ByteOrderMark bom
```
    The ByteOrderMark or null if NOT present (or detection NOT started).
  - nonUtfEncoding
```
private final String nonUtfEncoding
```
    The encoding to use if encoding is neither UTF nor ASCII.
  - maybeAscii
```
private boolean maybeAscii
```
    false if the data can NOT be ASCII, true otherwise.
  - maybeUtf8
```
private boolean maybeUtf8
```
    false if the data can NOT be UTF-8, true otherwise.
  - maybeUtf16
```
private boolean maybeUtf16
```
    false if the data can NOT be UTF-16, true otherwise.
  - bytePosition
```
private long bytePosition
```
    The byte position in the stream, relative to its start.
  - firstNonAsciiPosition
```
private long firstNonAsciiPosition
```
    The bytePosition where the first non-ASCII byte was detected.
  - zeroByteCounts
```
private int[] zeroByteCounts
```
    The number of bytes that have been 0 for each of the byte positions modulo 4.
  - surrogates
```
private EncodingUtilImpl.Surrogate[] surrogates
```
    The last EncodingUtilImpl.Surrogate for each of the byte positions modulo 4.
  - utf8ContinuationByteCount
```
private int utf8ContinuationByteCount
```
    The expected number of UTF-8 continuation bytes to come or 0 if no UTF-8 multi-byte-sequence is currently processed.
- Constructor Detail
  - UtfDetectionProcessor
```
public UtfDetectionProcessor(String nonUtfEncoding)
```
    The constructor.
    
    Parameters:
    
    nonUtfEncoding - is the encoding to use if encoding is neither UTF nor ASCII.
- Method Detail
  - process
```
public int process(byte[] buffer,
                   int offset,
                   int length)
```
    Description copied from interface: ByteProcessor
    
    This method is called to process length bytes from the given buffer, starting at the given offset.
    ATTENTION:
    An implementation of this interface should only read bytes from the given buffer. It is NOT permitted to modify the given buffer unless this is explicitly specified by the calling object (typically an implementation of ByteProcessable).
    
    Specified by:
    
    process in interface ByteProcessor
    
    Parameters:
    
    buffer - contains the bytes to process.
    
    offset - is the index where to start in the buffer.
    
    length - is the number of bytes to process.
    
    Returns:
    
    the number of bytes that should be consumed. Typically you will simply return length. However, you can also return a value less than length and greater than or equal to zero in order to stop processing at a specific position.
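    
    The consumption contract described above can be sketched as follows. This is only an illustration: `SketchByteProcessor` is a hypothetical stand-in that copies the `process` signature documented here, and `AsciiPrefixProcessor` is an invented example that is not part of the library. It only reads from the buffer and stops at the first non-ASCII byte by returning a value less than `length`.
```java
/** Hypothetical stand-in for ByteProcessor; only the method signature is taken from this page. */
interface SketchByteProcessor {
  int process(byte[] buffer, int offset, int length);
}

/** Invented example: consumes bytes while they are 7-bit ASCII and stops at the first one that is not. */
class AsciiPrefixProcessor implements SketchByteProcessor {

  /** Total number of bytes consumed over all calls. */
  long consumed;

  @Override
  public int process(byte[] buffer, int offset, int length) {
    int count = 0;
    // A negative Java byte has its high bit set, so it is not ASCII.
    while (count < length && buffer[offset + count] >= 0) {
      count++; // the buffer is only read, never modified
    }
    this.consumed += count;
    return count; // between 0 and length, both inclusive
  }
}
```
    A caller could then treat the bytes from `offset` plus the returned value onward as not yet consumed.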
  - processUtf16Detection
```
private void processUtf16Detection(byte b)
```
    Heuristic analysis to detect UTF-16 indications.
    
    Parameters:
    
    b - is the single byte to process.
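    
    One of the UTF-16 indications named in the class description is the zero bytes that UTF-16 and UTF-32 add to ASCII characters. The following minimal sketch shows how counting zero bytes per position modulo 4 can hint at an encoding. The class `ZeroByteHeuristic`, its method names, and the strict all-or-none rule are assumptions made for illustration and are not the library's actual logic. The rule only fits text that consists of ASCII-range characters, and surrogate handling is left out.
```java
/** Illustrative zero-byte heuristic; all names and rules here are hypothetical. */
class ZeroByteHeuristic {

  /** Number of zero bytes seen at each byte position modulo 4. */
  private final int[] zeroByteCounts = new int[4];

  /** Number of bytes processed so far. */
  private long bytePosition;

  /** Feeds a single byte into the heuristic. */
  void accept(byte b) {
    if (b == 0) {
      this.zeroByteCounts[(int) (this.bytePosition % 4)]++;
    }
    this.bytePosition++;
  }

  /** Number of bytes seen so far at the given position modulo 4. */
  private long seenAt(int position) {
    return this.bytePosition / 4 + (position < this.bytePosition % 4 ? 1 : 0);
  }

  private boolean allZero(int position) {
    long seen = seenAt(position);
    return seen > 0 && this.zeroByteCounts[position] == seen;
  }

  private boolean noneZero(int position) {
    return this.zeroByteCounts[position] == 0;
  }

  /** Returns a guessed encoding name, or null if the zero-byte pattern is inconclusive. */
  String guess() {
    if (noneZero(0) && allZero(1) && allZero(2) && allZero(3)) return "UTF-32LE";
    if (allZero(0) && allZero(1) && allZero(2) && noneZero(3)) return "UTF-32BE";
    if (noneZero(0) && allZero(1) && noneZero(2) && allZero(3)) return "UTF-16LE";
    if (allZero(0) && noneZero(1) && allZero(2) && noneZero(3)) return "UTF-16BE";
    return null;
  }
}
```
    Feeding the UTF-16LE bytes of an ASCII-only string into `accept` leaves zero bytes only at positions 1 and 3, so `guess()` reports `UTF-16LE`. Plain ASCII input contains no zero bytes and stays inconclusive.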
  - processUtf8Detection
```
private void processUtf8Detection(byte b)
```
    Heuristic analysis to detect UTF-8 indications.
    
    Parameters:
    
    b - is the single byte to process.
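    
    A byte-wise UTF-8 check of this kind can be sketched with two pieces of state that correspond to the `maybeUtf8` and `utf8ContinuationByteCount` fields. The class `Utf8Heuristic` and its methods are hypothetical, and the sketch is simplified: it validates only the lead-byte and continuation-byte bit patterns and does not reject overlong forms or code points beyond U+10FFFF. The library's actual checks are not documented on this page.
```java
/** Illustrative UTF-8 plausibility check; names are hypothetical, not the library's code. */
class Utf8Heuristic {

  /** false as soon as a byte sequence was seen that cannot be UTF-8. */
  private boolean maybeUtf8 = true;

  /** Continuation bytes still expected for the current multi-byte sequence, or 0. */
  private int continuationByteCount;

  /** Feeds a single byte into the check. */
  void accept(byte b) {
    if (!this.maybeUtf8) {
      return; // already ruled out
    }
    int value = b & 0xFF; // treat the byte as unsigned
    if (this.continuationByteCount > 0) {
      if ((value & 0xC0) == 0x80) { // 10xxxxxx: continuation byte
        this.continuationByteCount--;
      } else {
        this.maybeUtf8 = false; // sequence was cut short
      }
    } else if (value < 0x80) {
      // 0xxxxxxx: plain ASCII, nothing to do
    } else if ((value & 0xE0) == 0xC0) { // 110xxxxx: lead byte of a 2-byte sequence
      this.continuationByteCount = 1;
    } else if ((value & 0xF0) == 0xE0) { // 1110xxxx: lead byte of a 3-byte sequence
      this.continuationByteCount = 2;
    } else if ((value & 0xF8) == 0xF0) { // 11110xxx: lead byte of a 4-byte sequence
      this.continuationByteCount = 3;
    } else {
      this.maybeUtf8 = false; // stray continuation byte or invalid lead byte
    }
  }

  boolean isMaybeUtf8() {
    return this.maybeUtf8;
  }

  int getPendingContinuationBytes() {
    return this.continuationByteCount;
  }
}
```
    Text encoded in ISO-8859-1 usually trips this check at the first accented character, because a single high byte is rarely followed by a valid continuation byte.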
  - processBom
```
private int processBom(byte[] buffer,
                       int offset,
                       int i)
```
    Detects if a ByteOrderMark (BOM) is available as hint for the encoding.
    
    Parameters:
    
    buffer - is the buffer of the raw data.
    
    offset - is the current offset.
    
    i - is the current index.
    
    Returns:
    
    the new index. Will be the same as i or greater if bytes (for detected BOM) have been consumed.
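    
    BOM detection at the head of a buffer can be sketched as below. The enum `SketchBom` is a hypothetical stand-in for `ByteOrderMark`, whose real API is not shown on this page. The byte sequences are the standard Unicode byte order marks. The 4-byte UTF-32 marks are checked before the 2-byte UTF-16 marks because the UTF-32LE mark starts with the UTF-16LE mark.
```java
/** Hypothetical stand-in for ByteOrderMark, listing the standard Unicode BOMs. */
enum SketchBom {

  // Declaration order matters: longer marks come before their shorter prefixes.
  UTF_32_BE("UTF-32BE", 0x00, 0x00, 0xFE, 0xFF),
  UTF_32_LE("UTF-32LE", 0xFF, 0xFE, 0x00, 0x00),
  UTF_8("UTF-8", 0xEF, 0xBB, 0xBF),
  UTF_16_BE("UTF-16BE", 0xFE, 0xFF),
  UTF_16_LE("UTF-16LE", 0xFF, 0xFE);

  final String encoding;

  final int[] bytes;

  SketchBom(String encoding, int... bytes) {
    this.encoding = encoding;
    this.bytes = bytes;
  }

  /** Checks whether this mark is present in buffer at offset, given length available bytes. */
  boolean isPresent(byte[] buffer, int offset, int length) {
    if (length < this.bytes.length) {
      return false;
    }
    for (int i = 0; i < this.bytes.length; i++) {
      if ((buffer[offset + i] & 0xFF) != this.bytes[i]) {
        return false;
      }
    }
    return true;
  }

  /** Returns the first matching mark in declaration order, or null if none matches. */
  static SketchBom detect(byte[] buffer, int offset, int length) {
    for (SketchBom bom : values()) {
      if (bom.isPresent(buffer, offset, length)) {
        return bom;
      }
    }
    return null;
  }

  /** Returns the index of the first byte after a mark at offset, or offset if there is none. */
  static int skip(byte[] buffer, int offset, int length) {
    SketchBom bom = detect(buffer, offset, length);
    return (bom == null) ? offset : offset + bom.bytes.length;
  }
}
```
    Note that the bytes `FF FE 00 00` are ambiguous: they may also be a UTF-16LE mark followed by a NUL character. This sketch resolves the ambiguity in favor of UTF-32LE, which is an assumption rather than documented behavior of `processBom`.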
  - getLowByteEncoding
```
public String getLowByteEncoding()
```
    This method gets the encoding without taking high-bytes (non-ASCII) into account.
    
    Returns:
    
    the low-byte encoding or null if it looks like ASCII so far.
  - getEncoding
```
public String getEncoding()
```
    This method gets the detected encoding from the currently processed data.
    
    Returns:
    
    the detected encoding or null if the encoding has NOT yet been detected and it looks like ASCII so far.
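    
    How the collected hints might be combined into a result can be sketched as follows. This is an assumption-heavy illustration: the class `EncodingDecisionSketch`, its parameters, and the order of precedence are invented here. The actual class ranks candidates with its `encodingRankMap`, and that logic is not documented on this page. The sketch only mirrors the documented outcomes: `null` while the data still looks like ASCII, and `nonUtfEncoding` as the fallback when the data is neither UTF nor ASCII.
```java
/** Hypothetical decision logic; the real class uses a RankMap whose behavior is not shown here. */
class EncodingDecisionSketch {

  /**
   * @param bomEncoding encoding indicated by a BOM, or null if no BOM was found.
   * @param utf16Guess encoding suggested by UTF-16/UTF-32 heuristics, or null.
   * @param maybeAscii false if the data can NOT be ASCII.
   * @param maybeUtf8 false if the data can NOT be UTF-8.
   * @param nonUtfEncoding fallback if the data is neither UTF nor ASCII.
   * @return the chosen encoding, or null if the data looks like ASCII so far.
   */
  static String decide(String bomEncoding, String utf16Guess, boolean maybeAscii,
      boolean maybeUtf8, String nonUtfEncoding) {
    if (bomEncoding != null) {
      return bomEncoding; // assumed: an explicit BOM wins over heuristics
    }
    if (utf16Guess != null) {
      return utf16Guess;
    }
    if (maybeAscii) {
      return null; // nothing but ASCII seen so far, no decision yet
    }
    if (maybeUtf8) {
      return "UTF-8";
    }
    return nonUtfEncoding;
  }
}
```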
