@Singleton @Named
public class ContentParserHtml extends AbstractContentParserTextMarkupAware

Implementation of the ContentParser interface for HTML documents (content with the mimetype "text/html"). [...] large or have unpredictable size to avoid memory problems.

| Modifier and Type | Class and Description |
|---|---|
| private static class | ContentParserHtml.NullWriter: This is a writer that does nothing. |
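
The nested NullWriter is described only as "a writer that does nothing". As a rough illustration of that idea, the sketch below shows one way such a writer could look. The class name NullWriterSketch is hypothetical and this is not the library's actual source; what the parser uses its NullWriter for is not stated on this page.

```java
import java.io.IOException;
import java.io.Writer;

/** Hypothetical stand-in for a do-nothing writer; not the actual ContentParserHtml.NullWriter. */
final class NullWriterSketch extends Writer {

  @Override
  public void write(char[] cbuf, int off, int len) {
    // Intentionally discard all characters.
  }

  @Override
  public void flush() {
    // Nothing is buffered, so there is nothing to flush.
  }

  @Override
  public void close() {
    // There is no underlying resource to release.
  }

  public static void main(String[] args) throws IOException {
    try (Writer writer = new NullWriterSketch()) {
      // write(String) is inherited from Writer and delegates to write(char[], int, int).
      writer.write("this text is discarded");
      writer.flush();
    }
  }
}
```

Only the three abstract methods of java.io.Writer need to be overridden; every other write variant delegates to write(char[], int, int).
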
| Modifier and Type | Field and Description |
|---|---|
| private static String | ATR_META_CONTENT: the content attribute of the meta tag |
| private static String | ATR_META_NAME: the name attribute of the meta tag |
| private static Pattern | AUTHOR_PATTERN: pattern to extract the author |
| static String | KEY_EXTENSION: The default extension. |
| static String | KEY_MIMETYPE: The mimetype. |
| private static Pattern | KEYWORDS_PATTERN: pattern to extract the keywords |
| private static String | TAG_BODY: the body tag |
| private static String | TAG_HEAD: the head tag |
| private static String | TAG_META: the meta tag |
| private static String | TAG_TITLE: the title tag |
| private static Pattern | TITLE_PATTERN: pattern to extract the title |
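
The table lists three patterns (TITLE_PATTERN, AUTHOR_PATTERN, KEYWORDS_PATTERN) without showing their regular expressions. The sketch below illustrates how line-based extraction of this kind might work. The expressions, the class name HtmlMetadataPatternSketch, and the helpers metaPattern and extract are all assumptions made for illustration, not the values or methods used by ContentParserHtml.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical patterns for illustration; the real expressions are not documented here. */
final class HtmlMetadataPatternSketch {

  // Assumed: the title sits between <title> and </title> on a single line.
  static final Pattern TITLE =
      Pattern.compile("<title[^>]*>([^<]*)</title>", Pattern.CASE_INSENSITIVE);

  static final Pattern AUTHOR = metaPattern("author");

  static final Pattern KEYWORDS = metaPattern("keywords");

  /** Builds a pattern for a meta tag with the given name; assumes name comes before content. */
  static Pattern metaPattern(String name) {
    return Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']" + Pattern.quote(name)
            + "[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);
  }

  /** Returns the trimmed first capture group, or null if the line does not match. */
  static String extract(Pattern pattern, String line) {
    Matcher matcher = pattern.matcher(line);
    return matcher.find() ? matcher.group(1).trim() : null;
  }

  public static void main(String[] args) {
    System.out.println(extract(TITLE, "<head><title> Hello </title>"));
    System.out.println(extract(AUTHOR, "<meta name=\"author\" content=\"Jane Doe\">"));
    System.out.println(extract(KEYWORDS, "<META NAME='keywords' CONTENT='a, b'>"));
  }
}
```

A regex approach like this only sees one line at a time and breaks on reordered attributes or tags split across lines, which is a known limitation of pattern-based HTML scanning.
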

Inherited fields: VARIABLE_NAME_CREATOR, VARIABLE_NAME_KEYWORDS, VARIABLE_NAME_LANGUAGE, VARIABLE_NAME_TEXT, VARIABLE_NAME_TITLE

| Constructor and Description |
|---|
| ContentParserHtml(): The constructor. |

| Modifier and Type | Method and Description |
|---|---|
| private void | collectTextContent(Element element, StringBuffer buffer): This method recursively collects the text from the given element and appends it to the given buffer. |
| String[] | getAlternativeKeyArray() |
| String | getExtension(): This method gets the default filename extension excluding the dot. |
| private Element | getFirstChildElement(Element element, String tagname) |
| String | getMimetype(): This method gets the default mimetype. |
| String[] | getSecondaryKeyArray(): This method gets the secondary keys as array. |
| private String | getTextContent(Element element): This method gets all text content from the given element recursively including the text of all children. |
| void | parse(InputStream inputStream, long filesize, ContentParserOptions options, MutableGenericContext context) |
| protected void | parseJtidy(InputStream inputStream, long filesize, MutableGenericContext context) |
| protected void | parseLine(MutableGenericContext context, String line): This method may be overridden to parse additional metadata from the content. |
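
The private helpers getFirstChildElement, getTextContent and collectTextContent are documented only by their signatures and one-line descriptions. The sketch below shows how such helpers could be written against the standard org.w3c.dom API. It assumes that Element refers to org.w3c.dom.Element, which this page does not confirm, and the class DomTextSketch and its parseRoot helper are hypothetical additions for the demo.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical re-creation of the documented helpers; not the library's source. */
final class DomTextSketch {

  /** Returns the first direct child element with the given tag name, or null if none exists. */
  static Element getFirstChildElement(Element element, String tagname) {
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      if (child.getNodeType() == Node.ELEMENT_NODE
          && tagname.equalsIgnoreCase(child.getNodeName())) {
        return (Element) child;
      }
    }
    return null;
  }

  /** Returns all text below the given element, including the text of nested children. */
  static String getTextContent(Element element) {
    StringBuffer buffer = new StringBuffer();
    collectTextContent(element, buffer);
    return buffer.toString();
  }

  /** Appends text and CDATA nodes to the buffer and recurses into child elements. */
  static void collectTextContent(Element element, StringBuffer buffer) {
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      short type = child.getNodeType();
      if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
        buffer.append(child.getNodeValue());
      } else if (type == Node.ELEMENT_NODE) {
        collectTextContent((Element) child, buffer);
      }
    }
  }

  /** Demo helper (assumption): parses well-formed markup with the JDK's XML parser. */
  static Element parseRoot(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse demo markup", e);
    }
  }

  public static void main(String[] args) {
    Element html = parseRoot(
        "<html><head><title>Hi</title></head><body><p>Hello <b>world</b>!</p></body></html>");
    Element head = getFirstChildElement(html, "head");
    System.out.println(getTextContent(getFirstChildElement(head, "title")));
    System.out.println(getTextContent(getFirstChildElement(html, "body")));
  }
}
```

The demo parses well-formed markup with the JDK's XML parser only to obtain a DOM tree. Real-world HTML is often not well-formed, which is presumably why the class offers parseJtidy.
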

Inherited methods:
doInitialize, parse
getEncodingUtil, getXmlUtil, parseProperty, parseProperty, setEncodingUtil, setXmlUtil
getPrimaryKeys, getSecondaryKeys, parse, parse, setGenericContextFactory
createLogger, getLogger
doInitialized, getInitializationState, initialize

public static final String KEY_MIMETYPE
public static final String KEY_EXTENSION
private static final String TAG_HEAD
private static final String TAG_TITLE
private static final String TAG_META
private static final String ATR_META_NAME
private static final String ATR_META_CONTENT
private static final String TAG_BODY
private static final Pattern TITLE_PATTERN
private static final Pattern AUTHOR_PATTERN
private static final Pattern KEYWORDS_PATTERN

public String getExtension()
getExtension in interface ContentParser (see ContentParser.getExtension)
getExtension in class AbstractContentParserText
Returns null if this is the generic parser.

public String getMimetype()
getMimetype in interface ContentParser (see ContentParser.getMimetype)
getMimetype in class AbstractContentParserText
Returns null if this is the generic parser.

public String[] getAlternativeKeyArray()
getAlternativeKeyArray in class AbstractContentParser
See also: AbstractContentParser.getPrimaryKeys()

public String[] getSecondaryKeyArray()
This method gets the secondary keys as array. This is just a convenience so that the implementors of individual parsers do not have to create a Set and make it unmodifiable.
getSecondaryKeyArray in class AbstractContentParser
See also: AbstractContentParser.getPrimaryKeys()

protected void parseJtidy(InputStream inputStream, long filesize, MutableGenericContext context) throws Exception
Parameters:
inputStream - is the fresh input stream of the content to parse. It will be closed by this method (on success and in exceptional state).
filesize - is the size (content-length) of the content to parse in bytes or 0 if NOT available (unknown). If available, the parser may use this value for optimized allocations.
context - is the MutableGenericContext where metadata can be added.
Throws:
Exception - on error.
See also: AbstractContentParser.parse(InputStream, long)

public void parse(InputStream inputStream, long filesize, ContentParserOptions options, MutableGenericContext context) throws Exception
parse in class AbstractContentParserText
Parameters:
inputStream - is the fresh input stream of the content to parse.
filesize - is the size (content-length) of the content to parse in bytes or 0 if NOT available (unknown). If available, the parser may use this value for optimized allocations.
options - are the ContentParserOptions.
context - is the MutableGenericContext to which the metadata extracted from the parsed inputStream will be added.
Throws:
Exception - if the operation fails for arbitrary reasons.
See also: ContentParser.parse(InputStream, long)

protected void parseLine(MutableGenericContext context, String line)
parseLine in class AbstractContentParserText
Parameters:
context - are the properties with the collected metadata.
line - is a single line read from the text.

private Element getFirstChildElement(Element element, String tagname)
Parameters:
element - is the element where the child is requested from.
tagname - is the tagname of the requested child element.
Returns: the first child element with the given tagname, or null if no such child exists.

private String getTextContent(Element element)
This method gets all text content from the given element recursively, including the text of all children.
Parameters:
element - is the element for which the text is requested.

private void collectTextContent(Element element, StringBuffer buffer)
This method recursively collects the text from the given element and appends it to the given buffer.
Parameters:
element - is the element for which the text is requested.
buffer - is the buffer where the text will be appended to.

Copyright © 2001–2016 mmm-Team. All rights reserved.