data structures - Designing a web crawler


I have come across the interview question "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.

How does it all begin from the beginning? Say Google started with some hub pages, hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from a page and so on, does it keep making a hash table to make sure it doesn't follow earlier visited pages?

What if the same page has two names (URLs), which is common these days when we have URL shorteners etc.?

I have taken Google as an example. Although Google doesn't leak how its web crawler algorithms, page ranking, etc. work, any guesses?
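To make the question concrete, here is a minimal sketch (my own, not Google's actual approach) of the hash-table idea: keep a set keyed on a canonicalized form of each URL, so that trivially different spellings of the same address collapse to one entry. The `canonicalize` rules below are illustrative assumptions; shortened URLs can only be collapsed by following their redirects first and canonicalizing the final URL.

```python
from urllib.parse import urlsplit

def canonicalize(url: str) -> str:
    """Reduce common aliases of the same page to one key (illustrative rules only)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().rstrip(".")
    if host.startswith("www."):          # treat www and bare host as the same site
        host = host[4:]
    path = parts.path or "/"
    key = host + path
    if parts.query:
        key += "?" + parts.query
    return key                           # scheme and #fragment deliberately ignored

visited: set[str] = set()

def should_crawl(url: str) -> bool:
    """Return True the first time a canonical URL is seen, False afterwards."""
    key = canonicalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True

print(should_crawl("http://example.com/a"))           # True
print(should_crawl("https://www.example.com/a#top"))  # False -- same canonical page
```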

If you want a detailed answer, take a look at section 3.8 of the Mercator paper, which describes the URL-seen test of a modern scraper:

In the course of extracting links, any web crawler will encounter multiple links to the same document. To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link before adding it to the URL frontier. (An alternative design would be to instead perform the URL-seen test when the URL is removed from the frontier, but this approach would result in a much larger frontier.)

To perform the URL-seen test, we store all of the URLs seen by Mercator in canonical form in a large table called the URL set. Again, there are too many entries for them all to fit in memory, so like the document fingerprint set, the URL set is stored on disk.

To save space, we do not store the textual representation of each URL in the URL set, but rather a fixed-sized checksum. Unlike the fingerprints presented to the content-seen test's document fingerprint set, the stream of URLs tested against the URL set has a non-trivial amount of locality. To reduce the number of operations on the backing disk file, we therefore keep an in-memory cache of popular URLs. The intuition for this cache is that links to some URLs are quite common, so caching the popular ones in memory will lead to a high in-memory hit rate.

In fact, using an in-memory cache of 2^18 entries and the LRU-like clock replacement policy, we achieve an overall hit rate on the in-memory cache of 66.2%, and a hit rate of 9.5% on the table of recently-added URLs, for a net hit rate of 75.7%. Moreover, of the 24.3% of requests that miss in both the cache of popular URLs and the table of recently-added URLs, about 1/3 produce hits on the buffer in our random access file implementation, which also resides in user-space. The net result of all this buffering is that each membership test we perform on the URL set results in an average of 0.16 seek and 0.17 read kernel calls (some fraction of which are served out of the kernel's file system buffers). So, each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set. These savings are purely due to the amount of URL locality (i.e., repetition of popular URLs) inherent in the stream of URLs encountered during a crawl.

Basically they hash all of the URLs with a hashing function that guarantees unique hashes for each URL, and due to the locality of URLs, it becomes very easy to find URLs. Google even open-sourced their hashing function: CityHash.
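Here is a rough sketch, under the assumptions above, of what a Mercator-style URL-seen test can look like: a fixed-size checksum per URL, a small in-memory cache of popular URLs, and a larger backing set that would live on disk. The class and function names are mine; an ordinary LRU dictionary stands in for Mercator's clock replacement policy, a plain Python set stands in for the on-disk URL set, and the 64-bit checksum uses hashlib rather than CityHash.

```python
import hashlib
from collections import OrderedDict

CACHE_SIZE = 2 ** 18  # same order of magnitude as the paper's in-memory cache

def checksum(url: str) -> int:
    """Fixed-size (64-bit) checksum of the canonical URL text."""
    return int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")

class UrlSeenTest:
    def __init__(self) -> None:
        self.cache: OrderedDict[int, None] = OrderedDict()  # popular URLs (LRU stand-in)
        self.url_set: set[int] = set()                      # stand-in for the on-disk URL set

    def seen(self, url: str) -> bool:
        """Return True if the URL was seen before; record it either way."""
        h = checksum(url)
        if h in self.cache:                 # cache hit: cheap, no "disk" access
            self.cache.move_to_end(h)
            return True
        hit = h in self.url_set             # cache miss: consult the backing set
        self.url_set.add(h)
        self.cache[h] = None                # promote into the cache
        if len(self.cache) > CACHE_SIZE:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return hit

frontier_filter = UrlSeenTest()
print(frontier_filter.seen("example.com/a"))  # False -- first time
print(frontier_filter.seen("example.com/a"))  # True  -- duplicate link
```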

Warning!
You might also be talking about bot traps!!! A bot trap is a section of a page that keeps generating new links with unique URLs, and you will essentially get trapped in an "infinite loop" by following the links that are being served by that page. This is not exactly a loop, because a loop would be the result of visiting the same URL, but it's an infinite chain of URLs which you should avoid crawling.

Update 12/13/2012 - the day after the world was supposed to end :)

Per Fr0zenFyr's comment: if one uses the AOPIC algorithm for selecting pages, then it's fairly easy to avoid bot traps of the infinite-loop kind. Here is a summary of how AOPIC works (a code sketch follows the list):

  1. Get a set of N seed pages.
  2. Allocate X amount of credit to each page, such that each page has X/N credit (i.e. an equal amount of credit) before crawling has started.
  3. Select a page P, where P has the highest amount of credit (or if all pages have the same amount of credit, then crawl a random page).
  4. Crawl page P (let's say P had 100 credits when it was crawled).
  5. Extract all the links from page P (let's say there are 10 of them).
  6. Set the credits of P to 0.
  7. Take a 10% "tax" and allocate it to a Lambda page.
  8. Allocate an equal amount of credits to each link found on page P out of P's original credit minus the tax: (100 (P credits) - 10 (10% tax)) / 10 (links) = 9 credits per link.
  9. Repeat from step 3.
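A minimal sketch of that credit accounting, assuming a toy in-memory link graph instead of real HTTP fetching; the function name, the graph representation, and the dead-end handling (a page with no out-links sends its whole budget to the Lambda page) are my own choices:

```python
import random

TAX_RATE = 0.10
LAMBDA = "__lambda__"  # virtual page that collects the tax; never actually fetched

def aopic_crawl(graph: dict[str, list[str]], seeds: list[str], steps: int,
                total_credit: float = 1000.0) -> dict[str, float]:
    """Toy AOPIC loop: `graph` maps each URL to the links found on its page."""
    credits = {url: total_credit / len(seeds) for url in seeds}  # steps 1-2: X/N each
    credits[LAMBDA] = 0.0
    for _ in range(steps):
        # Step 3: pick the page with the most credit (ties broken at random).
        page = max(credits, key=lambda u: (credits[u], random.random()))
        if page == LAMBDA:
            # "Crawl" the Lambda page: no HTTP request, just redistribute its
            # credit evenly over every page in the database.
            share = credits[LAMBDA] / (len(credits) - 1)
            for url in credits:
                if url != LAMBDA:
                    credits[url] += share
            credits[LAMBDA] = 0.0
            continue
        budget = credits[page]
        credits[page] = 0.0                                      # step 6
        links = graph.get(page, [])                              # steps 4-5: fetch + extract
        if links:
            credits[LAMBDA] += budget * TAX_RATE                 # step 7: 10% tax
            share = budget * (1 - TAX_RATE) / len(links)         # step 8: equal split
            for link in links:
                credits[link] = credits.get(link, 0.0) + share
        else:
            credits[LAMBDA] += budget  # dead end (my assumption): everything goes to Lambda
    return credits
```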

Since the Lambda page continuously collects the tax, it will eventually be the page with the largest amount of credit and we'll have to "crawl" it. I put "crawl" in quotes, because we don't actually make an HTTP request for the Lambda page; we just take its credits and distribute them equally to all of the pages in our database.

Since bot traps only give their internal links credits and rarely get credit from the outside, they will continually leak credits (from taxation) to the Lambda page. The Lambda page will distribute those credits out to all of the pages in the database evenly, and upon each cycle the bot trap pages will lose more and more credits, until they have so little credit that they almost never get crawled again. This will not happen with good pages, because they often get credits from back-links found on other pages. This also results in a dynamic page rank, and what you will notice is that any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank.
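To see that decay with the sketch above, here is an invented toy graph where three ordinary pages link to one another and a trap chain only links deeper into itself; after a few hundred steps the trap pages tend to sit near the bottom of the credit ranking:

```python
# Toy run of the aopic_crawl sketch above (the graph and numbers are invented).
toy_graph = {
    "home": ["about", "blog", "trap/1"],
    "about": ["home", "blog"],
    "blog": ["home", "about"],
    "trap/1": ["trap/2"], "trap/2": ["trap/3"], "trap/3": ["trap/4"],
}
final = aopic_crawl(toy_graph, seeds=["home"], steps=300)
for url, credit in sorted(final.items(), key=lambda kv: -kv[1]):
    print(f"{url:12s} {credit:10.2f}")  # trap pages tend to drift toward the bottom
```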

This only avoids bot traps of the infinite-loop kind, but there are many other bot traps which you should watch out for, and there are ways to get around them too.

