How to build and test a search engine parser
From Seeks
The Seeks websearch plugin comes with support for a few existing search engines, and it also provides generic code for building new search engine parsers.
If you would like to construct a parser for an additional search engine, this is quite easy.
Preparing the field
In your local Seeks directory, go to src/plugins/websearch/. Here, copy an existing parser (e.g. se_parser_bing.cpp and se_parser_bing.h) to create the files for your parser. Then edit Makefile.am and add your parser to it, in the same way as the other parsers. Edit websearch_configuration.h and add your parser to the list of defines (the number on each new line should be double that of the previous line).
You might need to hack se_handler.h to add your parser there too.
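The doubling rule for the define list suggests that each engine is given its own bit, so that several engines can be combined in a single value. Here is a minimal sketch of that idea. The names SE_EXAMPLE_A, SE_EXAMPLE_B, SE_EXAMPLE_C, SE_YOURWEBSITE and engine_enabled are hypothetical and are not taken from websearch_configuration.h; check the real file for the actual names and the last value in use.

```cpp
// Hypothetical flag names, for illustration only. The real list lives in
// websearch_configuration.h and may use different names and values.
#define SE_EXAMPLE_A   1U
#define SE_EXAMPLE_B   2U
#define SE_EXAMPLE_C   4U
#define SE_YOURWEBSITE 8U  // new entry: double the value of the previous line

// Returns true if the bit for 'flag' is set in the 'enabled' mask.
inline bool engine_enabled(unsigned int enabled, unsigned int flag)
{
  return (enabled & flag) != 0;
}
```

With these values, SE_EXAMPLE_A | SE_YOURWEBSITE enables two engines at once, and engine_enabled() can test each one independently. A value that is not a power of two would overlap with the bits of existing engines, which is presumably why each line doubles the previous one.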
Preparing the test
Go to src/proxy/tests/. Here, use test_curl_mget to download a sample of the page you are going to parse.
Example: ./test_curl_mget 1 http://search.blah.net/search?q=blabla&... The number given before the URL is the number of times you want to download the page.
(Note that it won't work for Wikipedia since it doesn't accept requests without a specified User-Agent.)
Save the output to a file, then move this file to src/plugins/websearch/tests/.
Here, copy test-bing-parser.cpp to a new file for your parser. Edit this file and replace bing with your parser name.
Edit Makefile.am and add the test for your parser to it.
Preparing the parser
Edit the files (.cpp and .h) of your parser in src/plugins/websearch/. Replace the bing keyword (or the equivalent for the parser you have copied) with the keyword of your parser.
Comment out the code inside the methods.
Launch make. Pray. If this does not compile, find out why and edit this page (or come and talk with us on IRC). If everything goes fine, breathe: you are now ready to hack.
Hack
Okay, now the game begins. Open up se_parser_yourwebsite.cpp and the page you have downloaded with test_curl_mget. The goal is to find the pattern of the result snippet in this HTML file and to hack se_parser_yourwebsite to find it.
This is an event-based parser: each time it encounters an opening tag, a closing tag or the content of a tag, it calls the corresponding method. Use boolean attributes to keep track of where you are in the snippet (declare them in se_parser_yourwebsite.h and initialize them in the constructor). The important part is the creation of your snippet; you should already have the code for this from the copied parser (a big if with a lot of ||).
To test your parser, run the test created in src/plugins/websearch/tests/ on the file downloaded with test_curl_mget. You should have the corresponding number of parsers created and no errors.
Happy hacking.
Here is a very simple example for Twitter: se_parser_twitter.cpp (not included in Seeks as of writing this page).
It looks for an <entry> tag, then a <title> tag, and grabs what is inside <title></title>. It then looks for <link />, grabs the href="" value, and pushes the snippet when it encounters </entry>.
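The flow described above can be sketched as a small standalone class. This is not the code of se_parser_twitter.cpp: Snippet, Attributes and FeedParserSketch are hypothetical stand-ins, and the real Seeks parser methods have their own signatures. It only illustrates how boolean flags track the position in the document while the events arrive.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins, not the actual Seeks types.
struct Snippet
{
  std::string title;
  std::string url;
};

typedef std::map<std::string, std::string> Attributes;

class FeedParserSketch
{
public:
  // The state flags are initialized in the constructor.
  FeedParserSketch() : in_entry_(false), in_title_(false) {}

  // Called for every opening tag (including self-closing ones).
  void start_element(const std::string &name, const Attributes &attrs)
  {
    if (name == "entry")
    {
      in_entry_ = true;
      current_ = Snippet();  // start a fresh snippet
    }
    else if (in_entry_ && name == "title")
    {
      in_title_ = true;
    }
    else if (in_entry_ && name == "link")
    {
      Attributes::const_iterator it = attrs.find("href");
      if (it != attrs.end())
        current_.url = it->second;
    }
  }

  // Called for text content; may arrive in several pieces for one tag.
  void characters(const std::string &text)
  {
    if (in_title_)
      current_.title += text;
  }

  // Called for every closing tag.
  void end_element(const std::string &name)
  {
    if (name == "title")
    {
      in_title_ = false;
    }
    else if (name == "entry" && in_entry_)
    {
      in_entry_ = false;
      // Push the snippet only if it is complete.
      if (!current_.title.empty() && !current_.url.empty())
        snippets_.push_back(current_);
    }
  }

  const std::vector<Snippet> &snippets() const { return snippets_; }

private:
  bool in_entry_;  // true between <entry> and </entry>
  bool in_title_;  // true between <title> and </title> inside an entry
  Snippet current_;
  std::vector<Snippet> snippets_;
};
```

Because in_title_ is only set inside an entry, a <title> that belongs to the feed itself is ignored. Text is appended rather than assigned in characters(), since an event-based parser may deliver the content of one tag in several calls.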
