ContentParserHtml (JavaDocs for mmm-all dev-SNAPSHOT)

java.lang.Object
- net.sf.mmm.util.component.base.AbstractComponent
  - net.sf.mmm.util.component.base.AbstractLoggableComponent
    - net.sf.mmm.content.parser.base.AbstractContentParser
      - net.sf.mmm.content.parser.impl.text.AbstractContentParserText
        - net.sf.mmm.content.parser.impl.text.AbstractContentParserTextMarkupAware
          - net.sf.mmm.content.parser.impl.html.ContentParserHtml

All Implemented Interfaces:

ContentParser
```
@Singleton
@Named
public class ContentParserHtml
extends AbstractContentParserTextMarkupAware
```
This is the implementation of the ContentParser interface for HTML documents (content with the mimetype "text/html").
It uses JTidy for HTML-parsing but falls back to raw parsing for files that are large or have unpredictable size to avoid memory problems.
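The size-based fallback described above could be sketched roughly as follows. This is only an illustration: the class name `ParseStrategySketch`, the `Strategy` enum and the one-megabyte limit are assumptions and are not taken from the actual implementation, whose real threshold is not documented on this page.

```java
/** Hypothetical sketch of choosing between DOM parsing (JTidy) and raw line parsing. */
final class ParseStrategySketch {

  /** Assumed limit in bytes; the real threshold used by ContentParserHtml is not documented here. */
  static final long ASSUMED_MAX_DOM_SIZE = 1024L * 1024L;

  /** The two ways of parsing mentioned in the class description. */
  enum Strategy { JTIDY_DOM, RAW_LINES }

  /** Picks DOM parsing only if the size is known and small enough to hold in memory. */
  static Strategy choose(long filesize) {
    // A filesize of 0 means "unknown" according to the parameter documentation.
    boolean sizeKnown = filesize > 0;
    if (sizeKnown && filesize <= ASSUMED_MAX_DOM_SIZE) {
      return Strategy.JTIDY_DOM;
    }
    return Strategy.RAW_LINES;
  }

  public static void main(String[] args) {
    System.out.println(choose(4096L));
    System.out.println(choose(0L));
    System.out.println(choose(50L * 1024L * 1024L));
  }
}
```

Treating an unknown size like a large one keeps the memory usage bounded, because the line-based path never has to build a complete document tree.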

Author:

Joerg Hohwiller (hohwille at users.sourceforge.net)

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`private static class`	`ContentParserHtml.NullWriter` This is a writer that does nothing.

Field Summary

Fields
Modifier and Type	Field and Description
`private static String`	`ATR_META_CONTENT` the content attribute of the meta tag
`private static String`	`ATR_META_NAME` the name attribute of the meta tag
`private static Pattern`	`AUTHOR_PATTERN` pattern to extract the author
`static String`	`KEY_EXTENSION` The default extension.
`static String`	`KEY_MIMETYPE` The mimetype.
`private static Pattern`	`KEYWORDS_PATTERN` pattern to extract the keywords
`private static String`	`TAG_BODY` the body tag
`private static String`	`TAG_HEAD` the head tag
`private static String`	`TAG_META` the meta tag
`private static String`	`TAG_TITLE` the title tag
`private static Pattern`	`TITLE_PATTERN` pattern to extract the title

Fields inherited from interface net.sf.mmm.content.parser.api.ContentParser
VARIABLE_NAME_CREATOR, VARIABLE_NAME_KEYWORDS, VARIABLE_NAME_LANGUAGE, VARIABLE_NAME_TEXT, VARIABLE_NAME_TITLE

Constructor Summary

Constructors
Constructor and Description
`ContentParserHtml()` The constructor.

Method Summary

Methods
Modifier and Type	Method and Description
`private void`	`collectTextContent(Element element, StringBuffer buffer)` This method recursively collects the text from the given `element` and appends it to the given `buffer`.
`String[]`	`getAlternativeKeyArray()` This method gets the alternative `primary keys` in addition to `extension` and `mimetype`.
`String`	`getExtension()` This method gets the default filename extension excluding the dot (e.g. "txt", "xml", "html", "pdf", etc.) for the content managed by this ContentParser.
`private Element`	`getFirstChildElement(Element element, String tagname)` This method gets the first `child` element of `element` with the given `tagname`.
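`String`	`getMimetype()` This method gets the default mimetype (e.g. "text/plain", "text/xml", "text/html", "application/pdf", etc.) for the content managed by this ContentParser.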
`String[]`	`getSecondaryKeyArray()` This method gets the `secondary keys` as array.
`private String`	`getTextContent(Element element)` This method gets all text content from the given `element` recursively including the text of all children.
`void`	`parse(InputStream inputStream, long filesize, ContentParserOptions options, MutableGenericContext context)`
`protected void`	`parseJtidy(InputStream inputStream, long filesize, MutableGenericContext context)`
`protected void`	`parseLine(MutableGenericContext context, String line)` This method may be overridden to parse additional metadata from the content.

Methods inherited from class net.sf.mmm.content.parser.impl.text.AbstractContentParserTextMarkupAware
doInitialize, parse

Methods inherited from class net.sf.mmm.content.parser.impl.text.AbstractContentParserText
getEncodingUtil, getXmlUtil, parseProperty, parseProperty, setEncodingUtil, setXmlUtil

Methods inherited from class net.sf.mmm.content.parser.base.AbstractContentParser
getPrimaryKeys, getSecondaryKeys, parse, parse, setGenericContextFactory

Methods inherited from class net.sf.mmm.util.component.base.AbstractLoggableComponent
createLogger, getLogger

Methods inherited from class net.sf.mmm.util.component.base.AbstractComponent
doInitialized, getInitializationState, initialize

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - KEY_MIMETYPE
```
public static final String KEY_MIMETYPE
```
    The mimetype.
    
    See Also:
    
    Constant Field Values
  - KEY_EXTENSION
```
public static final String KEY_EXTENSION
```
    The default extension.
    
    See Also:
    
    Constant Field Values
  - TAG_HEAD
```
private static final String TAG_HEAD
```
    the head tag
    
    See Also:
    
    Constant Field Values
  - TAG_TITLE
```
private static final String TAG_TITLE
```
    the title tag
    
    See Also:
    
    Constant Field Values
  - TAG_META
```
private static final String TAG_META
```
    the meta tag
    
    See Also:
    
    Constant Field Values
  - ATR_META_NAME
```
private static final String ATR_META_NAME
```
    the name attribute of the meta tag
    
    See Also:
    
    Constant Field Values
  - ATR_META_CONTENT
```
private static final String ATR_META_CONTENT
```
    the content attribute of the meta tag
    
    See Also:
    
    Constant Field Values
  - TAG_BODY
```
private static final String TAG_BODY
```
    the body tag
    
    See Also:
    
    Constant Field Values
  - TITLE_PATTERN
```
private static final Pattern TITLE_PATTERN
```
    pattern to extract the title
  - AUTHOR_PATTERN
```
private static final Pattern AUTHOR_PATTERN
```
    pattern to extract the author
  - KEYWORDS_PATTERN
```
private static final Pattern KEYWORDS_PATTERN
```
    pattern to extract the keywords
- Constructor Detail
  - ContentParserHtml
```
public ContentParserHtml()
```
    The constructor.
- Method Detail
  - getExtension
```
public String getExtension()
```
    This method gets the default filename extension excluding the dot (e.g. "txt", "xml", "html", "pdf", etc.) for the content managed by this ContentParser.
    
    Specified by:
    
    getExtension in interface ContentParser
    
    Overrides:
    
    getExtension in class AbstractContentParserText
    
    Returns:
    
    the default filename extension or null if this is the generic parser.
  - getMimetype
```
public String getMimetype()
```
    This method gets the default mimetype (e.g. "text/plain", "text/xml", "text/html", "application/pdf", etc.) for the content managed by this ContentParser.
    
    Specified by:
    
    getMimetype in interface ContentParser
    
    Overrides:
    
    getMimetype in class AbstractContentParserText
    
    Returns:
    
    the default mimetype or null if this is the generic parser.
  - getAlternativeKeyArray
```
public String[] getAlternativeKeyArray()
```
    This method gets the alternative primary keys in addition to extension and mimetype.
    
    Overrides:
    
    getAlternativeKeyArray in class AbstractContentParser
    
    Returns:
    
    an array with the alternative keys.
    
    See Also:
    
    AbstractContentParser.getPrimaryKeys()
  - getSecondaryKeyArray
```
public String[] getSecondaryKeyArray()
```
    This method gets the secondary keys as an array. This is just a convenience so that implementors of individual parsers do not have to create a Set and make it unmodifiable themselves.
    
    Overrides:
    
    getSecondaryKeyArray in class AbstractContentParser
    
    Returns:
    
    an array with the secondary keys.
    
    See Also:
    
    AbstractContentParser.getPrimaryKeys()
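    
    The convenience described above can be pictured as a base class that wraps the array in an unmodifiable Set. The following is a minimal sketch under that assumption; the class `KeyArraySketch` and the key values are hypothetical and do not reproduce the code of AbstractContentParser.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch: a parser only supplies an array, the Set handling is done once. */
final class KeyArraySketch {

  /** What an individual parser would provide (the key values here are made up). */
  static String[] getSecondaryKeyArray() {
    return new String[] {"htm", "xhtml"};
  }

  /** What a base class could derive from the array: an unmodifiable Set view of the keys. */
  static Set<String> getSecondaryKeys() {
    Set<String> keys = new LinkedHashSet<>(Arrays.asList(getSecondaryKeyArray()));
    return Collections.unmodifiableSet(keys);
  }

  public static void main(String[] args) {
    System.out.println(getSecondaryKeys());
  }
}
```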
  - parseJtidy
```
protected void parseJtidy(InputStream inputStream,
                          long filesize,
                          MutableGenericContext context)
                   throws Exception
```
    Parameters:
    
    inputStream - is the fresh input stream of the content to parse. It will be closed by this method (on success and in exceptional state).
    
    filesize - is the size (content-length) of the content to parse in bytes or 0 if NOT available (unknown). If available, the parser may use this value for optimized allocations.
    
    context - is the MutableGenericContext where metadata can be added.
    
    Throws:
    
    Exception - on error.
    
    See Also:
    
    AbstractContentParser.parse(InputStream, long)
  - parse
```
public void parse(InputStream inputStream,
                  long filesize,
                  ContentParserOptions options,
                  MutableGenericContext context)
           throws Exception
```
    Overrides:
    
    parse in class AbstractContentParserText
    
    Parameters:
    
    inputStream - is the fresh input stream of the content to parse.
    
    filesize - is the size (content-length) of the content to parse in bytes or 0 if NOT available (unknown). If available, the parser may use this value for optimized allocations.
    
    options - are the ContentParserOptions.
    
    context - is the MutableGenericContext to which the metadata extracted from the parsed inputStream will be added.
    
    Throws:
    
    Exception - if the operation fails for arbitrary reasons.
    
    See Also:
    
    ContentParser.parse(InputStream, long)
  - parseLine
```
protected void parseLine(MutableGenericContext context,
                         String line)
```
    This method may be overridden to parse additional metadata from the content.
    
    Overrides:
    
    parseLine in class AbstractContentParserText
    
    Parameters:
    
    context - are the properties with the collected metadata.
    
    line - is a single line read from the text.
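    
    A line-based override of this kind might look like the sketch below. It assumes a simple regular expression for the title and uses a plain Map in place of MutableGenericContext; the pattern, the key name "title" and the class `ParseLineSketch` are illustrative assumptions, not the values of TITLE_PATTERN or VARIABLE_NAME_TITLE.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of extracting metadata from a single line of raw HTML. */
final class ParseLineSketch {

  /** Assumed pattern; the actual TITLE_PATTERN is not shown in this documentation. */
  static final Pattern ASSUMED_TITLE_PATTERN =
      Pattern.compile("<title>\\s*(.*?)\\s*</title>", Pattern.CASE_INSENSITIVE);

  /** Stores the title in the context the first time a matching line is seen. */
  static void parseLine(Map<String, String> context, String line) {
    if (context.containsKey("title")) {
      return; // keep the first title that was found
    }
    Matcher matcher = ASSUMED_TITLE_PATTERN.matcher(line);
    if (matcher.find()) {
      context.put("title", matcher.group(1));
    }
  }

  public static void main(String[] args) {
    Map<String, String> context = new HashMap<>();
    parseLine(context, "<head><TITLE> Hello World </TITLE></head>");
    System.out.println(context.get("title"));
  }
}
```

    Note that a pattern like this only matches when the opening and closing tag are on the same line, which is a limitation of any purely line-based approach.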
  - getFirstChildElement
```
private Element getFirstChildElement(Element element,
                                     String tagname)
```
    This method gets the first child element of element with the given tagname.
    
    Parameters:
    
    element - is the element from which the child is requested.
    
    tagname - is the tagname of the requested child element.
    
    Returns:
    
    the first child-element with the given tagname or null if no such child exists.
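    
    With the standard org.w3c.dom API, such a lookup can be written as a loop over the direct children. The sketch below is an assumption about how this might be done, not the private method itself; the class `FirstChildSketch` and the helper `parseRoot` are hypothetical, and the JDK XML parser stands in for JTidy.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical sketch of finding the first direct child element with a given tag name. */
final class FirstChildSketch {

  /** Returns the first direct child element named tagname or null if there is none. */
  static Element getFirstChildElement(Element element, String tagname) {
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      if (child.getNodeType() == Node.ELEMENT_NODE && tagname.equals(child.getNodeName())) {
        return (Element) child;
      }
    }
    return null;
  }

  /** Helper for the example only: parses well-formed markup with the JDK XML parser. */
  static Element parseRoot(String xml) {
    try {
      Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return document.getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element html = parseRoot("<html><head><title>T</title></head><body><p>x</p></body></html>");
    System.out.println(getFirstChildElement(html, "body").getTagName());
  }
}
```

    Only direct children are inspected, so a nested element such as the title inside the head is not found when searching from the root.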
  - getTextContent
```
private String getTextContent(Element element)
```
    This method gets all text content from the given element recursively including the text of all children.
    
    Parameters:
    
    element - is the element for which the text is requested.
    
    Returns:
    
    the requested text.
  - collectTextContent
```
private void collectTextContent(Element element,
                                StringBuffer buffer)
```
    This method recursively collects the text from the given element and appends it to the given buffer.
    
    Parameters:
    
    element - is the element for which the text is requested.
    
    buffer - is the buffer to which the text will be appended.
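    
    The recursive collection described for getTextContent and collectTextContent can be illustrated with the standard org.w3c.dom API. This is a sketch under assumptions rather than the private implementation: the class `TextContentSketch` and the helper `parseRoot` are hypothetical, and the JDK XML parser is used instead of JTidy so the example has no external dependencies.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical sketch of collecting all text below an element by recursion. */
final class TextContentSketch {

  /** Returns the text of the element including the text of all descendants. */
  static String getTextContent(Element element) {
    StringBuffer buffer = new StringBuffer();
    collectTextContent(element, buffer);
    return buffer.toString();
  }

  /** Appends text nodes directly and descends into child elements. */
  static void collectTextContent(Element element, StringBuffer buffer) {
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      short type = child.getNodeType();
      if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
        buffer.append(child.getNodeValue());
      } else if (type == Node.ELEMENT_NODE) {
        collectTextContent((Element) child, buffer);
      }
    }
  }

  /** Helper for the example only: parses well-formed markup with the JDK XML parser. */
  static Element parseRoot(String xml) {
    try {
      Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return document.getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element body = parseRoot("<body><p>Hello <b>big</b> world</p></body>");
    System.out.println(getTextContent(body));
  }
}
```

    Splitting the work into a public-style entry point that creates the buffer and a recursive helper that fills it avoids allocating a new string at every level of the tree.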
