@ComponentSpecification(plugin=true) public interface ContentParser
A ContentParser extracts (meta-)data from the content of an InputStream. See also ContentParserOptions.getMaximumBufferSize().
Modifier and Type | Field and Description |
---|---|
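static String | VARIABLE_NAME_CREATOR: This is the name of the variable with the creator (also called author, artist, composer, etc.) of the content from the parsed GenericContext. |
static String | VARIABLE_NAME_KEYWORDS: This is the name of the variable with the keywords (also called tags) of the content from the parsed GenericContext. |
static String | VARIABLE_NAME_LANGUAGE: This is the name of the variable with the language of the content from the parsed GenericContext. |
static String | VARIABLE_NAME_TEXT: This is the name of the variable with the plain text of the content from the parsed GenericContext. |
static String | VARIABLE_NAME_TITLE: This is the name of the variable with the title of the content from the parsed GenericContext. |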
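The field details further down state that the text variable should always be set while title, keywords, creator and language may be null. A minimal sketch of reading a result under that rule follows. It uses a plain Map as a stand-in for GenericContext, and the key values "text" and "title" are assumptions, since the actual values of the constants are not given on this page.

```java
import java.util.HashMap;
import java.util.Map;

class VariableAccessSketch {

  // Assumed key values: the actual values of the constants are not given here.
  static final String VARIABLE_NAME_TEXT = "text";
  static final String VARIABLE_NAME_TITLE = "title";

  /** Builds a one-line summary from a context; a Map stands in for GenericContext. */
  static String summarize(Map<String, Object> context) {
    // The text variable should always be set, so a missing value is treated as an error.
    String text = (String) context.get(VARIABLE_NAME_TEXT);
    if (text == null) {
      throw new IllegalStateException("Variable '" + VARIABLE_NAME_TEXT + "' is not set");
    }
    // The title is optional and may be null, so a fallback is used.
    String title = (String) context.get(VARIABLE_NAME_TITLE);
    return (title == null ? "(untitled)" : title) + " [" + text.length() + " characters]";
  }

  public static void main(String[] args) {
    Map<String, Object> context = new HashMap<>();
    context.put(VARIABLE_NAME_TEXT, "Hello world");
    System.out.println(summarize(context));
  }
}
```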
Modifier and Type | Method and Description |
---|---|
String | getExtension(): This method gets the default filename extension excluding the dot. |
String | getMimetype(): This method gets the default mimetype. |
Set<String> | getPrimaryKeys(): This method gets the primary keys used to register this ContentParser. |
Set<String> | getSecondaryKeys(): This method gets the secondary keys used to register this ContentParser. |
GenericContext | parse(InputStream inputStream, long filesize): This method parses the document given as inputStream and extracts text and metadata returned as GenericContext. |
GenericContext | parse(InputStream inputStream, long filesize, ContentParserOptions options): This method parses the document given as inputStream and extracts text and metadata returned as GenericContext. |
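The parse contract documented on this page has three practical consequences: the parser closes the stream, a filesize of 0 means unknown, and the caller has to catch Exception. The following is a minimal sketch of both sides of that contract. SketchParser, PlainTextParser and extractText are hypothetical names introduced only for illustration, a Map stands in for GenericContext, and the key value "text" is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class ParseUsageSketch {

  /** Hypothetical stand-in for ContentParser; a Map replaces GenericContext. */
  interface SketchParser {
    Map<String, Object> parse(InputStream inputStream, long filesize) throws Exception;
  }

  /** Assumed key value; the actual value of VARIABLE_NAME_TEXT is not given here. */
  static final String VARIABLE_NAME_TEXT = "text";

  /** Toy parser that treats the whole content as UTF-8 plain text. */
  static class PlainTextParser implements SketchParser {
    @Override
    public Map<String, Object> parse(InputStream inputStream, long filesize) throws Exception {
      // try-with-resources closes the stream on success and on failure.
      try (InputStream in = inputStream) {
        // Use the file size as an allocation hint if it is known (0 means unknown).
        int capacity = (filesize > 0) ? (int) Math.min(filesize, 1 << 20) : 1024;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(capacity);
        byte[] chunk = new byte[4096];
        int count;
        while ((count = in.read(chunk)) != -1) {
          buffer.write(chunk, 0, count);
        }
        Map<String, Object> context = new HashMap<>();
        context.put(VARIABLE_NAME_TEXT, new String(buffer.toByteArray(), StandardCharsets.UTF_8));
        return context;
      }
    }
  }

  /** Calls the parser and enhances any failure with context only the caller knows. */
  static String extractText(SketchParser parser, byte[] content, String sourceName) {
    try {
      Map<String, Object> context = parser.parse(new ByteArrayInputStream(content), content.length);
      return (String) context.get(VARIABLE_NAME_TEXT);
    } catch (Exception e) {
      // Catches checked exceptions and RuntimeExceptions, but not Errors.
      throw new IllegalStateException("Failed to parse " + sourceName, e);
    }
  }
}
```

The caller does not close the stream itself because the parser is responsible for that. Wrapping the failure in the caller is where information such as the source name can be added to the message.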
static final String VARIABLE_NAME_TEXT
This is the name of the variable with the plain text of the content from the parsed GenericContext. The value is a String and should always be set (not null).
static final String VARIABLE_NAME_TITLE
This is the name of the variable with the title of the content from the parsed GenericContext. The value is a String and is optional (may be null).
static final String VARIABLE_NAME_KEYWORDS
This is the name of the variable with the keywords (also called tags) of the content from the parsed GenericContext. The value is a String and is optional (may be null).
static final String VARIABLE_NAME_CREATOR
This is the name of the variable with the creator (also called author, artist, composer, etc.) of the content from the parsed GenericContext. The value is a String and is optional (may be null).
static final String VARIABLE_NAME_LANGUAGE
This is the name of the variable with the language of the content from the parsed GenericContext. The value is a String and is optional (may be null).
GenericContext parse(InputStream inputStream, long filesize) throws Exception
This method parses the document given as inputStream and extracts text and metadata returned as GenericContext.
Parameters:
inputStream - is the fresh input stream of the content to parse. It will be closed by this method (on success and in exceptional state).
filesize - is the size (content-length) of the content to parse in bytes or 0 if NOT available (unknown). If available, the parser may use this value for optimized allocations.
Returns:
the GenericContext containing the extracted metadata from the parsed inputStream. See the VARIABLE_NAME_* constants (e.g. VARIABLE_NAME_TEXT) for the default keys. Please note that an implementation may use individual names for additional variables.
Throws:
Exception - if the parsing failed for a technical reason. There can be arbitrary implementations of this interface that can throw any Exception from this method. Declaring a specific ParseException here would cause the overhead of additional encapsulation of exceptions without any advantage. The user of this interface has to catch Exception, which includes RuntimeExceptions and excludes Errors. The caller has to handle the problem anyway (also for RuntimeExceptions) and has all contextual information required to enhance the exception message. This is NOT a matter of bad design.
GenericContext parse(InputStream inputStream, long filesize, ContentParserOptions options) throws Exception
This method parses the document given as inputStream and extracts text and metadata returned as GenericContext.
Parameters:
inputStream - is the fresh input stream of the content to parse. It will be closed by this method (on success and in exceptional state).
filesize - is the size (content-length) of the content to parse in bytes or 0 if NOT available (unknown). If available, the parser may use this value for optimized allocations.
options - are the ContentParserOptions.
Returns:
the GenericContext containing the extracted metadata from the parsed inputStream. See the VARIABLE_NAME_* constants (e.g. VARIABLE_NAME_TEXT) for the default keys. Please note that an implementation may use individual names for additional variables.
Throws:
Exception - if the parsing failed for a technical reason. There can be arbitrary implementations of this interface that can throw any Exception from this method. Declaring a specific ParseException here would cause the overhead of additional encapsulation of exceptions without any advantage. The user of this interface has to catch Exception, which includes RuntimeExceptions and excludes Errors. The caller has to handle the problem anyway (also for RuntimeExceptions) and has all contextual information required to enhance the exception message. This is NOT a matter of bad design.
String getExtension()
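This method gets the default filename extension, excluding the dot, for this ContentParser.
Returns:
the default extension, or null if this is the generic parser.
String getMimetype()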
This method gets the default mimetype for this ContentParser.
Returns:
the default mimetype, or null if this is the generic parser.
Set<String> getPrimaryKeys()
This method gets the primary keys used to register this ContentParser. This set contains the extension and the mimetype and maybe other alternatives. For example, a parser returns its default extension from getExtension() but can also include "htm" as a primary key. A parser for the extension "tar.gz" may also include "tgz" as a primary key. A parser for the mimetype "text/xml" may also include "application/xml" as a primary key.
Returns:
the primary keys of this ContentParser.
See Also:
ContentParserService.getParser(String), getSecondaryKeys(), Collections.emptySet()
Set<String> getSecondaryKeys()
This method gets the secondary keys used to register this ContentParser. If another (more specific) ContentParser defines such a key as a primary key, that ContentParser is chosen first. Otherwise this implementation will be used. For example, a parser returns its default extension from getExtension() but can define "xhtml" and "application/xhtml+xml" as secondary keys. A parser for the extension "txt" may return "java", "php", "c", "cpp", etc. as secondary keys.
Returns:
the secondary keys of this ContentParser.
See Also:
ContentParserService.getParser(String), getPrimaryKeys(), Collections.emptySet()
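The precedence of primary keys over secondary keys can be illustrated with a small lookup sketch. How ContentParserService.getParser(String) actually resolves keys is not shown on this page, so the following is only an assumption about one way to honor the documented rule. KeyRegistrySketch, ParserInfo, register and resolve are hypothetical names, and the key "text/plain" is an illustrative choice.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class KeyRegistrySketch {

  /** Hypothetical, reduced view of a parser: only a name and its keys. */
  static final class ParserInfo {
    final String name;
    final Set<String> primaryKeys;
    final Set<String> secondaryKeys;

    ParserInfo(String name, Set<String> primaryKeys, Set<String> secondaryKeys) {
      this.name = name;
      this.primaryKeys = primaryKeys;
      this.secondaryKeys = secondaryKeys;
    }
  }

  private final Map<String, ParserInfo> byPrimaryKey = new HashMap<>();
  private final Map<String, ParserInfo> bySecondaryKey = new HashMap<>();

  /** Registers the parser under all of its primary and secondary keys. */
  void register(ParserInfo parser) {
    for (String key : parser.primaryKeys) {
      byPrimaryKey.put(key, parser);
    }
    for (String key : parser.secondaryKeys) {
      // Keep the first parser that claimed a secondary key.
      bySecondaryKey.putIfAbsent(key, parser);
    }
  }

  /** Returns the name of the parser for the key, or null if none is registered. */
  String resolve(String key) {
    // A primary key always wins over a secondary key, regardless of registration order.
    ParserInfo parser = byPrimaryKey.get(key);
    if (parser == null) {
      parser = bySecondaryKey.get(key);
    }
    return (parser == null) ? null : parser.name;
  }

  /** Convenience helper to build a key set. */
  static Set<String> keys(String... values) {
    return new LinkedHashSet<>(Arrays.asList(values));
  }
}
```

With a plain-text parser registered under the primary key "txt" and the secondary keys "java" and "php", and a more specific parser registered under the primary key "java", a lookup for "java" yields the specific parser while "php" falls back to the plain-text parser.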
Copyright © 2001–2016 mmm-Team. All rights reserved.