c# - How to best parallelize parsing of webpages?
I am using the HTML Agility Pack to parse individual pages of a forum website. The parsing method returns all topic/thread links on the page whose URL is passed as an argument. I gather these topic links from all parsed pages into a single collection.
After that, I check whether each link is in a Dictionary of already-viewed URLs, and if not, I add it to a new list that the UI displays: the new topics/threads created since the last run.
Since these operations seem independent, what would be the best way to parallelize this? Should I use .NET 4.0's Parallel.For/ForEach?
Either way, how can I gather the results of each page into a single collection? Or is that not necessary?
Can I read from a centralized Dictionary whenever a parse method finishes, to see if a link is already there, even if several methods finish simultaneously?
If I run this program on 4000 pages, it takes 90 minutes; it would be great if I could use all 8 cores and finish the same task in ~10 minutes.
> After that, I check whether each link is in a Dictionary of already-viewed URLs, and if not, I add it to a new list that the UI displays: the new topics/threads created since the last run. Since these operations seem independent, what would be the best way to parallelize this?
You can use Parallel.For/ForEach for that, but you should think a bit about the design of your crawler first. Most crawlers dedicate several threads to crawling, and each thread is associated with a page-fetching client that is responsible for fetching the pages (in your case, using WebRequest/WebResponse). I recommend reading these papers (sketches of both approaches follow the list):
- Mercator: A Scalable, Extensible Web Crawler (an 11-page paper that should be a pretty light read).
- IRLbot: Scaling to 6 Billion Pages and Beyond (a 10-page paper describing a crawler that crawls at about 600 pages per second on a 150 Mbit connection).
- IRLbot: Scaling to 6 Billion Pages and Beyond: the full version of the paper.
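If you do want to start with the simple data-parallel route, a minimal Parallel.ForEach sketch might look like this. The example.com URLs, the XPath expression, and the degree of parallelism are my assumptions for illustration, not code from the question:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ForumCrawler
{
    static void Main()
    {
        // Hypothetical list of forum pages to parse.
        var pageUrls = Enumerable.Range(1, 4000)
            .Select(i => "http://example.com/forum?page=" + i);

        // Thread-safe bag that gathers topic links from all pages.
        var topicLinks = new ConcurrentBag<string>();

        Parallel.ForEach(
            pageUrls,
            new ParallelOptions { MaxDegreeOfParallelism = 8 }, // assumption: 8 cores
            pageUrl =>
            {
                // HtmlWeb is not documented as thread-safe, so use one per iteration.
                var doc = new HtmlWeb().Load(pageUrl);

                // A real crawler would use a more specific XPath for topic links.
                var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) return; // the page contained no links

                foreach (var anchor in anchors)
                    topicLinks.Add(anchor.GetAttributeValue("href", ""));
            });

        Console.WriteLine("Collected {0} links.", topicLinks.Count);
    }
}
```

Note that capping MaxDegreeOfParallelism at the core count is conservative for I/O-bound work like page fetching, since the threads spend most of their time waiting on the network.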
If you implement the Mercator design, you should easily be able to download 50 pages per second, so your 4000 pages would be downloaded in 80 seconds.
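A rough sketch of that dedicated-thread design, with a shared frontier of URLs drained by several crawling threads, might look like the following. The thread count, queue, and URLs are assumptions for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading;

class MercatorStyleCrawler
{
    // Shared frontier of URLs waiting to be fetched.
    static readonly BlockingCollection<string> Frontier = new BlockingCollection<string>();

    static void Main()
    {
        const int threadCount = 8; // assumption: one crawling thread per core

        // By default .NET allows only 2 concurrent connections per host.
        ServicePointManager.DefaultConnectionLimit = threadCount;

        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(CrawlLoop);
            threads[i].Start();
        }

        // Seed the frontier with hypothetical page URLs.
        for (int page = 1; page <= 4000; page++)
            Frontier.Add("http://example.com/forum?page=" + page);
        Frontier.CompleteAdding();

        foreach (var t in threads) t.Join();
    }

    static void CrawlLoop()
    {
        // Each thread fetches pages on its own, as in the Mercator design.
        foreach (var url in Frontier.GetConsumingEnumerable())
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                // ... parse html and record/enqueue the extracted links here ...
            }
        }
    }
}
```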
> Either way, how can I gather the results of each page into a single collection?
You can store your results in a ConcurrentDictionary<TKey, TValue>, as Darin mentioned. You don't need to store anything in the value, since your key would be the link/URL; however, if you're performing a URL-seen test, then you can hash each link/URL to an integer and store the hash as the key and the link/URL as the value.
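A minimal sketch of that URL-seen test follows. Using String.GetHashCode as the integer hash is my assumption; it is only stable within a single run, and distinct URLs can collide on the same hash:

```csharp
using System.Collections.Concurrent;

class UrlSeenTest
{
    // Key: integer hash of the URL; value: the URL itself.
    private readonly ConcurrentDictionary<int, string> seen =
        new ConcurrentDictionary<int, string>();

    // Returns true if the URL had not been seen before, and records it.
    // Note: two different URLs with the same hash would be treated as one.
    public bool TryMarkAsNew(string url)
    {
        return seen.TryAdd(url.GetHashCode(), url);
    }
}
```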
> Or is that not necessary?

It's entirely up to you to decide what's necessary, but if you're performing a URL-seen test, then it is.
> Can I read from a centralized Dictionary whenever a parse method finishes, to see if a link is already there, even if several methods finish simultaneously?

Yes, ConcurrentDictionary allows multiple threads to read simultaneously, so it should be fine. It will work well if you just want to see whether a link has already been crawled.
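A small demonstration of those concurrent reads (the URLs are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentReadDemo
{
    static void Main()
    {
        var seen = new ConcurrentDictionary<int, string>();
        string crawled = "http://example.com/topic/1";
        seen.TryAdd(crawled.GetHashCode(), crawled);

        // Many parse tasks can query the dictionary at the same time;
        // reads on ConcurrentDictionary are lock-free and thread-safe.
        Parallel.For(1, 9, i =>
        {
            string link = "http://example.com/topic/" + i;
            Console.WriteLine("{0} already crawled: {1}",
                link, seen.ContainsKey(link.GetHashCode()));
        });
    }
}
```

One caveat: a separate ContainsKey check followed by an Add is not atomic; if you want to test and record in one step, TryAdd does both and returns false when the link was already present.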
> If I run this program on 4000 pages, it takes 90 minutes; it would be great if I could use all 8 cores and finish the same task in ~10 minutes.

If you design your crawler well enough, you should be able to download and parse (extract all the links from) all 4000 pages in about 57 seconds on an average desktop PC. I got roughly those results with the standard C# WebRequest on a 4 GB, i5 3.2 GHz PC with a 10 Mbps connection.