c# - How to best parallelize parsing of webpages?
I am using the HTML Agility Pack to parse individual pages of a forum website. The parsing method returns all topic/thread links on the page whose URL is passed as an argument. I gather these topic links from all parsed pages into a single collection.
After that, I check whether each link is in a Dictionary of already-viewed URLs, and if not, I add it to a new list that the UI displays: the new topics/threads created since the last run.
Since these operations seem independent, what would be the best way to parallelize this? Should I use .NET 4.0's Parallel.For/ForEach?
Either way, how can I gather the results of each page into a single collection? Or is that not necessary?
Can I read from a centralized Dictionary whenever a parse method finishes, to see if a link is already there, even if several methods finish simultaneously?
If I run this program on 4000 pages, it takes 90 minutes; it would be great if I could use all 8 cores and finish the same task in ~10 minutes.
> After that, I check whether each link is in a Dictionary of already-viewed URLs, and if not, I add it to a new list that the UI displays: the new topics/threads created since the last run. Since these operations seem independent, what would be the best way to parallelize this?
You can use Parallel.For/ForEach for that, but you should think a bit about the design of your crawler first. Most crawlers dedicate several threads to crawling, and each thread is associated with a page-fetching client that is responsible for fetching the pages (in your case, using WebRequest/WebResponse). I recommend reading these papers (sketches of both approaches follow the list):
- Mercator: A Scalable, Extensible Web Crawler (an 11-page paper that should be a pretty light read).
- IRLbot: Scaling to 6 Billion Pages and Beyond (a 10-page paper describing a crawler that crawls at about 600 pages per second on a 150 Mbit connection).
- IRLbot: Scaling to 6 Billion Pages and Beyond: the full version of the paper.
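If you do want to start with the simple data-parallel route, a minimal Parallel.ForEach sketch might look like this. The example.com URLs, the XPath expression, and the degree of parallelism are my assumptions for illustration, not code from the question:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ForumCrawler
{
    static void Main()
    {
        // Hypothetical list of forum pages to parse.
        var pageUrls = Enumerable.Range(1, 4000)
            .Select(i => "http://example.com/forum?page=" + i);

        // Thread-safe bag that gathers topic links from all pages.
        var topicLinks = new ConcurrentBag<string>();

        Parallel.ForEach(
            pageUrls,
            new ParallelOptions { MaxDegreeOfParallelism = 8 }, // assumption: 8 cores
            pageUrl =>
            {
                // HtmlWeb is not documented as thread-safe, so use one per iteration.
                var doc = new HtmlWeb().Load(pageUrl);

                // A real crawler would use a more specific XPath for topic links.
                var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) return; // the page contained no links

                foreach (var anchor in anchors)
                    topicLinks.Add(anchor.GetAttributeValue("href", ""));
            });

        Console.WriteLine("Collected {0} links.", topicLinks.Count);
    }
}
```

Note that capping MaxDegreeOfParallelism at the core count is conservative for I/O-bound work like page fetching, since the threads spend most of their time waiting on the network.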
If you implement the Mercator design, you should easily be able to download 50 pages per second, so your 4000 pages would be downloaded in 80 seconds.
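A rough sketch of that dedicated-thread design, with a shared frontier of URLs drained by several crawling threads, might look like the following. The thread count, queue, and URLs are assumptions for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading;

class MercatorStyleCrawler
{
    // Shared frontier of URLs waiting to be fetched.
    static readonly BlockingCollection<string> Frontier = new BlockingCollection<string>();

    static void Main()
    {
        const int threadCount = 8; // assumption: one crawling thread per core

        // By default .NET allows only 2 concurrent connections per host.
        ServicePointManager.DefaultConnectionLimit = threadCount;

        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(CrawlLoop);
            threads[i].Start();
        }

        // Seed the frontier with hypothetical page URLs.
        for (int page = 1; page <= 4000; page++)
            Frontier.Add("http://example.com/forum?page=" + page);
        Frontier.CompleteAdding();

        foreach (var t in threads) t.Join();
    }

    static void CrawlLoop()
    {
        // Each thread fetches pages on its own, as in the Mercator design.
        foreach (var url in Frontier.GetConsumingEnumerable())
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                // ... parse html and record/enqueue the extracted links here ...
            }
        }
    }
}
```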
> Either way, how can I gather the results of each page into a single collection?
You can store your results in a ConcurrentDictionary<TKey, TValue>, as Darin mentioned. You don't need to store anything in the value, since your key would be the link/URL; however, if you're performing a URL-seen test, then you can hash each link/URL to an integer and store the hash as the key and the link/URL as the value.
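A minimal sketch of that URL-seen test follows. Using String.GetHashCode as the integer hash is my assumption; it is only stable within a single run, and distinct URLs can collide on the same hash:

```csharp
using System.Collections.Concurrent;

class UrlSeenTest
{
    // Key: integer hash of the URL; value: the URL itself.
    private readonly ConcurrentDictionary<int, string> seen =
        new ConcurrentDictionary<int, string>();

    // Returns true if the URL had not been seen before, and records it.
    // Note: two different URLs with the same hash would be treated as one.
    public bool TryMarkAsNew(string url)
    {
        return seen.TryAdd(url.GetHashCode(), url);
    }
}
```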
> Or is that not necessary?

It's entirely up to you to decide what's necessary, but if you're performing a URL-seen test, then it is.
> Can I read from a centralized Dictionary whenever a parse method finishes, to see if a link is already there, even if several methods finish simultaneously?

Yes, ConcurrentDictionary allows multiple threads to read simultaneously, so it should be fine. It will work well if you just want to see whether a link has already been crawled.
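A small demonstration of those concurrent reads (the URLs are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentReadDemo
{
    static void Main()
    {
        var seen = new ConcurrentDictionary<int, string>();
        string crawled = "http://example.com/topic/1";
        seen.TryAdd(crawled.GetHashCode(), crawled);

        // Many parse tasks can query the dictionary at the same time;
        // reads on ConcurrentDictionary are lock-free and thread-safe.
        Parallel.For(1, 9, i =>
        {
            string link = "http://example.com/topic/" + i;
            Console.WriteLine("{0} already crawled: {1}",
                link, seen.ContainsKey(link.GetHashCode()));
        });
    }
}
```

One caveat: a separate ContainsKey check followed by an Add is not atomic; if you want to test and record in one step, TryAdd does both and returns false when the link was already present.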
> If I run this program on 4000 pages, it takes 90 minutes; it would be great if I could use all 8 cores and finish the same task in ~10 minutes.

If you design your crawler well enough, you should be able to download and parse (extract all the links from) all 4000 pages in about 57 seconds on an average desktop PC. I got roughly those results with the standard C# WebRequest on a 4 GB, i5 3.2 GHz PC with a 10 Mbps connection.