Technical
From Seeks
This page gives technical details of the Seeks architecture. It does not details the APIs of the libraries, nor the full protocols used in the network transactions. Rather, the main principles are described, pointers to the underlying maths and algorithms are given and a software architecture is depicted.
Contents |
Overview
(See main Overview page).
The basic functionality offered by Seeks is to enable the sharing by users of their websearch queries. For every query a user decides to share, Seeks looks up similar queries performed by others, and regroups the users so they can work together. More technically, every user is a peer on the Seeks peer-to-peer network. Whenever users have been regrouped by Seeks, this means a peer-to-peer connection has been established among users, so they can share information, automatically (i.e. automated collaborative filtering), or not (i.e. chat and other exchanges). Mostly, information sharing includes queries, URL ratings, comments and network routing data. Privacy is strongly enforced since all personal data remain local to the user peer, and exchanged information are hashed and/or encrypted.
Overall Seeks offers three key services:
- automatic grouping of users
- tools for collaborative websearch control and filtering
- self-publishing directly to the user groups, with no need to be indexed by a search engine.
Principles
The core basis of Seeks relies on three technologies:
- A pattern matching solution to regroup similar queries, here Locality Sensitive Hashing (LSH).
- A Distributed HashTable (DHT), also referred to as an overlay network.
- A distributed collaborative filtering algorithm for rating / predicting the fit between a query and the search engine results. To the difference of existing websearch engines, the filter here is fed mostly by users (and bots).
grouping of users
The grouping of users is enabled through the sharing of websearch queries. Queries are regrouped by using a pattern matching technique known as Locality Sensitive Hashing. In a nutshell, similar queries fall into the same bucket of a hashtable. Since the hashtable is distributed, buckets are peers, and users who perform similar queries found themselves routed to the same peers. A search group is then simply a set of interconnected users.
websearch control and filtering
Once users have been regrouped, much information can flow among them. Of interest here is information that allows a better filtering and ranking of search engine results. By better, it is meant:
- that provides a more accurate fit between the query and the result,
- that removes or rates down spam and doubtful contents.
- that personalizes the search results.
This is achieved by an automated distributed collaborative filter. The Documents section hosts a mathematical description of the distributed filter.
self-publishing
Self-publishing is achieved by registering a query to the DHT, using the LSH similarity scheme. But instead of searching the Web, the publisher uses its connection to the search group to publish content directly to the group users. Very nicely, enabling this system does not require anything more than the LSH and the DHT.
Software Blocks
Here are the main software blocks along with their functionalities1 as required for running a basic version of Seeks:
Network P2P library
- asynchronous RPCs based on protobuffers.
- automated NAT traversal (ICE).
- connection management (i.e. caches, ...) and keep-alive system for sockets.
LSH library (7)
- Locality sensitive hashing over continuous and discrete values.
- continuous and discrete values storage with support for dynamic local radius (through Patricia trees and/or Bigrams and Markov random fields, see 9).
DHT library on top of the network and the LSH libraries
- standard p2p structure: Chord-like ring over 160bit keys.
- p2p bootstrap mechanism (8b);
- p2p main operations, including lookups (8c), routing (8b) and registring (8d), over the ring, through RPCs;
- wrappers to the RPC library (connection is at 5c).
CF (Collaborative Filtering) library (11)
- Similarity measures over urls and queries.
- Local functions for distributed collaborative filtering.
Proxy library on top of the DHT (10a) and the CF libraries
- port 80 query interceptions (10c).
- HTML parsers and dynamic code processing (10b).
- local webpages and configuration interface (10f).
- query context and context switching (10e).
- p2p interaction with peer users for information retrieval for CF (10g).
- output to the user browser (10d).
- cache manager for information that should persist between runs (10h).
Note: The Proxy library is based on Privoxy. Required functionalities have been hacked in accordingly.
UI (User Interface) library on top of the Proxy output (HTML or other)
- support for asynchronous updates of portions of the screen only.
- support for the proxy and general Seeks configuration.
