WebSearch Parsers Configuration
This is a proposal for configuring:
- engines that get scraped by the websearch plugin
- parsers that need a declarative configuration
- configuration of each parser
The configuration uses the XML format. The advantage of this syntax is that a schema can be used to validate the configuration at Seeks startup. Translators from a curly-bracket syntax to XML may be provided at a later stage.
The goal of a parser is to identify a set of snippets in the scraped data, each snippet being a structure with the following facets:
- title
- summary
- URL
- URL to cached version of the page, if available
- type, if available (forum, ...)
- date, if available
- language, if available
For image snippets:
- image URL
- image URL in engine's cache
There may be other data to scrape that applies to a whole set of snippets, such as:
- related queries
- related documents
- categories
More facets may occur in certain use-cases (news, code search, ...).
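The facets listed above can be pictured as a small data structure. The following is a minimal sketch, not part of the proposal: the class names `Snippet` and `ResultPage` and all field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snippet:
    """One search result; facets marked "if available" default to None."""
    title: str
    summary: str
    url: str
    cached_url: Optional[str] = None   # URL to the cached version of the page
    type: Optional[str] = None         # e.g. "forum"
    date: Optional[str] = None
    language: Optional[str] = None
    # Facets that only apply to image snippets.
    image_url: Optional[str] = None
    image_cached_url: Optional[str] = None


@dataclass
class ResultPage:
    """The snippets plus the data that applies to the whole set."""
    snippets: List[Snippet] = field(default_factory=list)
    related_queries: List[str] = field(default_factory=list)
    related_documents: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)


# Placeholder values, for illustration only.
page = ResultPage()
page.snippets.append(
    Snippet(title="Example title", summary="Example summary", url="http://example.org/")
)
```

Use cases that need more facets (news, code search) could extend `Snippet` with additional optional fields.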
Search Engine Configuration
A search engine is defined by:
- A URL template (should conform to OpenSearch URL patterns)
- If the HTTP protocol is used, the HTTP method used to query (POST | GET)
- A type of websearch: page, image, video, tweet
- A parser, as defined further in this document
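As an illustration of the four items above, here is a sketch of what an engine declaration might look like and how it could be loaded. The element and attribute names (`engine`, `url`, `method`, `parser`), the example host and the helper functions are all assumptions, since the proposal does not fix a schema. The template expansion only handles the simple `{name}` and optional `{name?}` placeholders of OpenSearch URL templates.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical engine declaration; names and URL are illustrative only.
ENGINE_XML = """
<engine name="example" type="page" method="GET" parser="example-xpath">
  <url>http://search.example.org/?q={searchTerms}&amp;p={startPage?}</url>
</engine>
"""

ALLOWED_METHODS = {"GET", "POST"}
ALLOWED_TYPES = {"page", "image", "video", "tweet"}


def load_engine(xml_text):
    """Read one engine declaration and check the enumerated values."""
    node = ET.fromstring(xml_text)
    engine = {
        "name": node.get("name"),
        "type": node.get("type"),
        "method": node.get("method"),
        "parser": node.get("parser"),
        "url_template": (node.findtext("url") or "").strip(),
    }
    if engine["method"] not in ALLOWED_METHODS:
        raise ValueError("unsupported HTTP method: %r" % engine["method"])
    if engine["type"] not in ALLOWED_TYPES:
        raise ValueError("unsupported websearch type: %r" % engine["type"])
    return engine


def expand(template, params):
    """Fill {name} placeholders; a missing optional {name?} becomes empty."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote(str(params[name]), safe="")
        if optional:
            return ""
        raise KeyError(name)
    return re.sub(r"\{(\w+)(\?)?\}", substitute, template)


engine = load_engine(ENGINE_XML)
query_url = expand(engine["url_template"], {"searchTerms": "seeks project"})
```

In a real deployment the checks in `load_engine` would more likely be expressed in the XML schema, which is the validation route the proposal favours.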
Parsers configuration
- Parser name, should be unique
- Parser type: SAX, XPath, CSS selector, JSON, ...
- Parameters, which depend on the parser type
sax parsers
- class, the class implementing the SAX handlers
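To show how a `class` parameter could select a set of SAX handlers, here is a sketch using Python's standard `xml.sax` module. The handler name `LinkSnippetHandler`, the `HANDLERS` registry and the sample markup are hypothetical. `xml.sax` requires well-formed XML, so real scraped HTML would need a more tolerant parser.

```python
import xml.sax


class LinkSnippetHandler(xml.sax.ContentHandler):
    """Hypothetical handler: each <a class="result"> becomes a snippet."""

    def __init__(self):
        super().__init__()
        self.snippets = []
        self._current = None

    def startElement(self, name, attrs):
        if name == "a" and attrs.get("class") == "result":
            self._current = {"url": attrs.get("href"), "title": ""}

    def characters(self, content):
        # Text inside a result link is accumulated as its title.
        if self._current is not None:
            self._current["title"] += content

    def endElement(self, name):
        if name == "a" and self._current is not None:
            self.snippets.append(self._current)
            self._current = None


# The configured class name is looked up in a registry (an assumption).
HANDLERS = {"LinkSnippetHandler": LinkSnippetHandler}


def run_sax_parser(class_name, markup):
    handler = HANDLERS[class_name]()
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return handler.snippets


SAMPLE = (
    '<div><a class="result" href="http://example.org/">Example</a>'
    '<a href="/about">About</a></div>'
)
sax_snippets = run_sax_parser("LinkSnippetHandler", SAMPLE)
```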
xpath parsers
- namespace declarations
- snippet: XPath returning the set of snippets contained in the page; further XPath expressions shall be expressed relative to the nodes returned by this expression
- url: XPath returning a string containing the URL facet of the snippet
- cache: XPath returning the cached version of the resource
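The relationship between the `snippet` expression and the relative `url` and `cache` expressions can be sketched as follows. This uses `xml.etree.ElementTree`, which supports only a subset of XPath, so the trailing `/@attribute` step is handled by a small helper. The configuration keys, the sample page and the helper names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical parser parameters, mirroring the list above.
PARSER_CONFIG = {
    "namespaces": {"x": "http://www.w3.org/1999/xhtml"},
    "snippet": ".//x:li[@class='result']",
    "url": "x:a/@href",
    "cache": "x:span/x:a/@href",
}

PAGE = """
<html xmlns="http://www.w3.org/1999/xhtml"><body><ul>
  <li class="result"><a href="http://example.org/a">A</a>
    <span><a href="http://cache.example.org/a">cached</a></span></li>
  <li class="result"><a href="http://example.org/b">B</a></li>
</ul></body></html>
"""


def string_value(node, expression, namespaces):
    """Evaluate 'path' or 'path/@attr' relative to node; None if no match."""
    path, _, attribute = expression.partition("/@")
    target = node.find(path, namespaces)
    if target is None:
        return None
    return target.get(attribute) if attribute else (target.text or "")


def parse(page, config):
    ns = config["namespaces"]
    root = ET.fromstring(page)
    # Each node matched by "snippet" is the context for the other expressions.
    return [
        {
            "url": string_value(item, config["url"], ns),
            "cache": string_value(item, config["cache"], ns),
        }
        for item in root.findall(config["snippet"], ns)
    ]


xpath_snippets = parse(PAGE, PARSER_CONFIG)
```

Because `url` and `cache` are evaluated per snippet node, a result without a cached copy simply yields `None` for that facet rather than failing the whole page.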
For web
