@Singleton @Named public class ContentParserHtml extends AbstractContentParserTextMarkupAware
This is the implementation of the ContentParser interface for HTML documents (content with the mimetype "text/html").
... large or have unpredictable size to avoid memory problems.
Modifier and Type | Class and Description |
---|---|
private static class | ContentParserHtml.NullWriter: This is a writer that does nothing. |
Modifier and Type | Field and Description |
---|---|
private static String | ATR_META_CONTENT: the content attribute of the meta tag |
private static String | ATR_META_NAME: the name attribute of the meta tag |
private static Pattern | AUTHOR_PATTERN: pattern to extract the author |
static String | KEY_EXTENSION: The default extension. |
static String | KEY_MIMETYPE: The mimetype. |
private static Pattern | KEYWORDS_PATTERN: pattern to extract the keywords |
private static String | TAG_BODY: the body tag |
private static String | TAG_HEAD: the head tag |
private static String | TAG_META: the meta tag |
private static String | TAG_TITLE: the title tag |
private static Pattern | TITLE_PATTERN: pattern to extract the title |
VARIABLE_NAME_CREATOR, VARIABLE_NAME_KEYWORDS, VARIABLE_NAME_LANGUAGE, VARIABLE_NAME_TEXT, VARIABLE_NAME_TITLE
Constructor and Description |
---|
ContentParserHtml(): The constructor. |
Modifier and Type | Method and Description |
---|---|
private void | collectTextContent(Element element, StringBuffer buffer): This method recursively collects the text from the given element and appends it to the given buffer. |
String[] | getAlternativeKeyArray() |
String | getExtension(): This method gets the default filename extension excluding the dot (e.g. |
private Element | getFirstChildElement(Element element, String tagname) |
String | getMimetype(): This method gets the default mimetype (e.g. |
String[] | getSecondaryKeyArray(): This method gets the secondary keys as array. |
private String | getTextContent(Element element): This method gets all text content from the given element recursively including the text of all children. |
void | parse(InputStream inputStream, long filesize, ContentParserOptions options, MutableGenericContext context) |
protected void | parseJtidy(InputStream inputStream, long filesize, MutableGenericContext context) |
protected void | parseLine(MutableGenericContext context, String line): This method may be overridden to parse additional metadata from the content. |
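The child lookup and recursive text collection described for getFirstChildElement, getTextContent and collectTextContent can be sketched with the JDK's org.w3c.dom API. This is a minimal illustration, not the class's actual code: the class name DomTextSketch and the helper parseRoot are hypothetical, the real methods are private, and the input here is parsed with the JDK XML parser rather than JTidy, so it must be well-formed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomTextSketch {

    /** Returns the first direct child element named tagname, or null if no such child exists. */
    static Element getFirstChildElement(Element element, String tagname) {
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && tagname.equalsIgnoreCase(child.getNodeName())) {
                return (Element) child;
            }
        }
        return null;
    }

    /** Recursively appends the text of element and all of its descendants to buffer. */
    static void collectTextContent(Element element, StringBuffer buffer) {
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                buffer.append(child.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                collectTextContent((Element) child, buffer);
            }
        }
    }

    /** Convenience wrapper that allocates the buffer and returns the collected text. */
    static String getTextContent(Element element) {
        StringBuffer buffer = new StringBuffer();
        collectTextContent(element, buffer);
        return buffer.toString();
    }

    /** Hypothetical helper: parses well-formed markup with the JDK XML parser (assumption, not JTidy). */
    static Element parseRoot(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element html = parseRoot(
                "<html><head><title>Hi</title></head><body><p>a<b>b</b>c</p></body></html>");
        Element body = getFirstChildElement(html, "body");
        System.out.println(getTextContent(body)); // prints "abc"
    }
}
```

Passing the buffer down the recursion avoids creating an intermediate string per element, which is presumably why the collecting method is separate from the getter.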
doInitialize, parse
getEncodingUtil, getXmlUtil, parseProperty, parseProperty, setEncodingUtil, setXmlUtil
getPrimaryKeys, getSecondaryKeys, parse, parse, setGenericContextFactory
createLogger, getLogger
doInitialized, getInitializationState, initialize
public static final String KEY_MIMETYPE
public static final String KEY_EXTENSION
private static final String TAG_HEAD
private static final String TAG_TITLE
private static final String TAG_META
private static final String ATR_META_NAME
private static final String ATR_META_CONTENT
private static final String TAG_BODY
private static final Pattern TITLE_PATTERN
private static final Pattern AUTHOR_PATTERN
private static final Pattern KEYWORDS_PATTERN
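The values of TITLE_PATTERN, AUTHOR_PATTERN and KEYWORDS_PATTERN are not shown on this page. As a rough sketch of how such patterns might extract metadata from a single line of HTML, the following uses hypothetical regular expressions; the expressions, the class name HtmlLinePatternSketch and the helpers metaPattern and extract are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlLinePatternSketch {

    // Hypothetical expressions: the actual patterns of ContentParserHtml are not documented here.
    static final Pattern TITLE_PATTERN =
            Pattern.compile("<title[^>]*>\\s*(.*?)\\s*</title>", Pattern.CASE_INSENSITIVE);
    static final Pattern AUTHOR_PATTERN = metaPattern("author");
    static final Pattern KEYWORDS_PATTERN = metaPattern("keywords");

    /** Builds a pattern for a meta tag whose name attribute precedes its content attribute. */
    static Pattern metaPattern(String name) {
        return Pattern.compile(
                "<meta\\s+name\\s*=\\s*[\"']" + Pattern.quote(name)
                        + "[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
                Pattern.CASE_INSENSITIVE);
    }

    /** Returns the first captured group if pattern is found in line, otherwise null. */
    static String extract(Pattern pattern, String line) {
        Matcher matcher = pattern.matcher(line);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(TITLE_PATTERN, "<TITLE> My Page </TITLE>"));
        System.out.println(extract(AUTHOR_PATTERN, "<meta name=\"author\" content=\"Jane Doe\">"));
        System.out.println(extract(KEYWORDS_PATTERN, "<p>no metadata here</p>"));
    }
}
```

Line-based matching like this only sees tags that open and close on one line, which is a limitation of the sketch and not a statement about the real parser.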
public String getExtension()
See: ContentParser.getExtension
Declared in: interface ContentParser, class AbstractContentParserText
Returns: the default filename extension, or null if this is the generic parser.
public String getMimetype()
See: ContentParser.getMimetype
Declared in: interface ContentParser, class AbstractContentParserText
Returns: the default mimetype, or null if this is the generic parser.
public String[] getAlternativeKeyArray()
Declared in: class AbstractContentParser
See also: AbstractContentParser.getPrimaryKeys()
public String[] getSecondaryKeyArray()
This method gets the secondary keys as array. It is a convenience so that implementors of individual parsers do not have to create a Set and make it unmodifiable themselves.
Declared in: class AbstractContentParser
See also: AbstractContentParser.getPrimaryKeys()
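The convenience described for getSecondaryKeyArray can be pictured as follows: a subclass only returns an array, and the base class turns it into an unmodifiable Set once. This is a sketch under assumptions, not the mmm implementation; BaseParser, HtmlParser, the lazy caching and the key values "html" and "htm" are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class KeyArraySketch {

    /** Hypothetical base class: subclasses return an array, callers get a read-only Set. */
    abstract static class BaseParser {

        private Set<String> secondaryKeys;

        /** Subclasses only have to list their keys. */
        public abstract String[] getSecondaryKeyArray();

        /** Builds the unmodifiable Set on first use and reuses it afterwards. */
        public final Set<String> getSecondaryKeys() {
            if (secondaryKeys == null) {
                Set<String> keys = new LinkedHashSet<>(Arrays.asList(getSecondaryKeyArray()));
                secondaryKeys = Collections.unmodifiableSet(keys);
            }
            return secondaryKeys;
        }
    }

    /** Hypothetical parser whose key values are illustrative only. */
    static class HtmlParser extends BaseParser {
        @Override
        public String[] getSecondaryKeyArray() {
            return new String[] {"html", "htm"};
        }
    }

    public static void main(String[] args) {
        Set<String> keys = new HtmlParser().getSecondaryKeys();
        System.out.println(keys.contains("htm"));
        try {
            keys.add("xhtml");
        } catch (UnsupportedOperationException e) {
            System.out.println("keys are read-only");
        }
    }
}
```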
protected void parseJtidy(InputStream inputStream, long filesize, MutableGenericContext context) throws Exception
Parameters:
inputStream - is the fresh input stream of the content to parse. It will be closed by this method (on success and in exceptional state).
filesize - is the size (content-length) of the content to parse in bytes or 0 if NOT available (unknown). If available, the parser may use this value for optimized allocations.
context - is the MutableGenericContext where metadata can be added.
Throws:
Exception - on error.
See also: AbstractContentParser.parse(InputStream, long)
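Two parts of the parseJtidy contract lend themselves to a short illustration: the stream is closed on success and on failure, and a known filesize may be used to size allocations. The sketch below shows one way to honor both with plain JDK classes. The class StreamContractSketch, the method readAndClose and the capacity limits are assumptions for illustration and say nothing about how ContentParserHtml actually reads its input.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class StreamContractSketch {

    private static final int DEFAULT_CAPACITY = 8 * 1024;
    private static final int MAX_INITIAL_CAPACITY = 1024 * 1024;

    /**
     * Reads the stream to the end. The initial buffer is sized from filesize when it is
     * known (greater than 0), capped so that a huge or wrong value cannot exhaust memory.
     * The stream is closed both on success and when reading fails.
     */
    static byte[] readAndClose(InputStream inputStream, long filesize) {
        int capacity = DEFAULT_CAPACITY;
        if (filesize > 0) {
            capacity = (int) Math.min(filesize, MAX_INITIAL_CAPACITY);
        }
        // try-with-resources closes the stream on every exit path
        try (InputStream in = inputStream) {
            ByteArrayOutputStream out = new ByteArrayOutputStream(capacity);
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller that passes 0 for an unknown size simply gets the default capacity, which matches the documented meaning of that value.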
public void parse(InputStream inputStream, long filesize, ContentParserOptions options, MutableGenericContext context) throws Exception
Declared in: class AbstractContentParserText
Parameters:
inputStream - is the fresh input stream of the content to parse.
filesize - is the size (content-length) of the content to parse in bytes or 0 if NOT available (unknown). If available, the parser may use this value for optimized allocations.
options - are the ContentParserOptions.
context - is the MutableGenericContext to which the metadata extracted from the parsed inputStream will be added.
Throws:
Exception - if the operation fails for arbitrary reasons.
See also: ContentParser.parse(InputStream, long)
protected void parseLine(MutableGenericContext context, String line)
This method may be overridden to parse additional metadata from the content.
Declared in: class AbstractContentParserText
Parameters:
context - are the properties with the collected metadata.
line - is a single line read from the text.
private Element getFirstChildElement(Element element, String tagname)
Parameters:
element - is the element where the child is requested from.
tagname - is the tagname of the requested child element.
Returns: the first child element with the given tagname or null if no such child exists.
private String getTextContent(Element element)
This method gets all text content from the given element recursively including the text of all children.
Parameters:
element - is the element for which the text is requested.
private void collectTextContent(Element element, StringBuffer buffer)
This method recursively collects the text from the given element and appends it to the given buffer.
Parameters:
element - is the element for which the text is requested.
buffer - is the buffer where the text will be appended to.
Copyright © 2001–2016 mmm-Team. All rights reserved.