Technical

From Seeks

Jump to: navigation, search

This page gives technical details of the Seeks architecture. It does not details the APIs of the libraries, nor the full protocols used in the network transactions. Rather, the main principles are described, pointers to the underlying maths and algorithms are given and a software architecture is depicted.

1: User U sends a query Q to a websearch engine, through a Web browser. Its local LSH instance computes a set of K of queries similar to Q. 2: A proxy intercepts the original query Q, creates a search context for it, before it is transferred normally to the search engine (step 3). Queries K are transferred to the SEEKS DHT (step 4). 3: The search engine returns URLs in a web page that is intercepted by the proxy, and reconditioned in order to be enriched with information from the SEEKS network, before it is given back to the user U. 4: The proxy registers queries K into the SEEKS DHT. This step is in general slower than step 3, the two steps taking place in parallel. The SEEKS DHT routes the K queries to the K peers whose identification key in the P2P network is the closest to each K query. Each query is then stored onto the distant peer. This step gives rise to the search groups, connecting users whose queries are routed to the same peer. 5: The SEEKS DHT returns a set of users connected to every of the identified peers from step 4. These users have sent queries similar to Q and therefore form a search group for user U. 6 & 7: Through its DHT client, U is connected to other users from the search group. From there, much information can be exchanged, automatically such as URL ratings or useful related queries, or actively through discussion. Automated exchanges are mostly controlled by a distributed collaborative filtering algorithm that computes personalized rankings of the search results from the search engine. 8 & 9: The web content from the URLs is downloaded from the Internet, normally.

Contents

Overview

(See main Overview page).

The basic functionality offered by Seeks is to enable the sharing by users of their websearch queries. For every query a user decides to share, Seeks looks up similar queries performed by others, and regroups the users so they can work together. More technically, every user is a peer on the Seeks peer-to-peer network. Whenever users have been regrouped by Seeks, this means a peer-to-peer connection has been established among users, so they can share information, automatically (i.e. automated collaborative filtering), or not (i.e. chat and other exchanges). Mostly, information sharing includes queries, URL ratings, comments and network routing data. Privacy is strongly enforced since all personal data remain local to the user peer, and exchanged information are hashed and/or encrypted.

Overall Seeks offers three key services:

  • automatic grouping of users
  • tools for collaborative websearch control and filtering
  • self-publishing directly to the user groups, with no need to be indexed by a search engine.

Principles

The core basis of Seeks relies on three technologies:

  • A pattern matching solution to regroup similar queries, here Locality Sensitive Hashing (LSH).
  • A Distributed HashTable (DHT), also referred to as an overlay network.
  • A distributed collaborative filtering algorithm for rating / predicting the fit between a query and the search engine results. To the difference of existing websearch engines, the filter here is fed mostly by users (and bots).

grouping of users

The grouping of users is enabled through the sharing of websearch queries. Queries are regrouped by using a pattern matching technique known as Locality Sensitive Hashing. In a nutshell, similar queries fall into the same bucket of a hashtable. Since the hashtable is distributed, buckets are peers, and users who perform similar queries found themselves routed to the same peers. A search group is then simply a set of interconnected users.

websearch control and filtering

Once users have been regrouped, much information can flow among them. Of interest here is information that allows a better filtering and ranking of search engine results. By better, it is meant:

  • that provides a more accurate fit between the query and the result,
  • that removes or rates down spam and doubtful contents.
  • that personalizes the search results.

This is achieved by an automated distributed collaborative filter. The Documents section hosts a mathematical description of the distributed filter.

self-publishing

Self-publishing is achieved by registering a query to the DHT, using the LSH similarity scheme. But instead of searching the Web, the publisher uses its connection to the search group to publish content directly to the group users. Very nicely, enabling this system does not require anything more than the LSH and the DHT.

Software Blocks

Seeks software blocks.

Here are the main software blocks along with their functionalities1 as required for running a basic version of Seeks:

Network P2P library

  • asynchronous RPCs based on protobuffers.
  • automated NAT traversal (ICE).
  • connection management (i.e. caches, ...) and keep-alive system for sockets.

LSH library (7)

  • Locality sensitive hashing over continuous and discrete values.
  • continuous and discrete values storage with support for dynamic local radius (through Patricia trees and/or Bigrams and Markov random fields, see 9).

DHT library on top of the network and the LSH libraries

  • standard p2p structure: Chord-like ring over 160bit keys.
  • p2p bootstrap mechanism (8b);
  • p2p main operations, including lookups (8c), routing (8b) and registring (8d), over the ring, through RPCs;
  • wrappers to the RPC library (connection is at 5c).

CF (Collaborative Filtering) library (11)

  • Similarity measures over urls and queries.
  • Local functions for distributed collaborative filtering.

Proxy library on top of the DHT (10a) and the CF libraries

  • port 80 query interceptions (10c).
  • HTML parsers and dynamic code processing (10b).
  • local webpages and configuration interface (10f).
  • query context and context switching (10e).
  • p2p interaction with peer users for information retrieval for CF (10g).
  • output to the user browser (10d).
  • cache manager for information that should persist between runs (10h).

Note: The Proxy library is based on Privoxy. Required functionalities have been hacked in accordingly.

UI (User Interface) library on top of the Proxy output (HTML or other)

  • support for asynchronous updates of portions of the screen only.
  • support for the proxy and general Seeks configuration.
Personal tools