web scraping - Python Post to form with anti-scrape protection


I'm trying to scrape content off a site with Python. It has simple form authentication with a username and password, but it also has a hidden field called "foil" that contains what looks like a randomly generated string each time the page is loaded. In order to log in, this value must be included in the content of the POST. I've tried scraping out the random string after the login page loads, but it still redirects me to the login page. I have a valid username and password for the site, and they work. The site updates sporadically, and I'd like to send myself an email when it changes. Here is the code I've been working with so far...

    import urllib, urllib2, cookielib, subprocess

    url = 'https://example.com/login.asp'
    username = 'blah'
    password = 'blah'

    request = urllib2.Request(url)
    opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
    predata = opener.open(request).readlines()
    for line in predata:
        if("foil" in line):
            foils = line.split('"')
            notfoiled = foils[3]

    query_args = {'location':'','qstring':'','absr_id':notfoiled,'id':username,'pin':password,'submit':'sign in'}
    requestwheader = urllib2.Request('https://example.com/login.asp')
    requestwheader.add_data(urllib.urlencode(query_args))
    print 'request method after data :', requestwheader.get_method()

    print
    print 'outgoing data:'
    print requestwheader.get_data()

    print
    print 'server response:'
    print urllib2.urlopen(requestwheader).read()
    rawres = urllib2.urlopen(requestwheader).read()

The form looks like this...

    <form name="loginform" method="post" action="https://example.com/login.asp?x=x&amp;&amp;pswd=">
    <input type=hidden name="location" value="">
    <input type=hidden name="qstring" value="">
    <input type=hidden name="absr_id" value="">
    <input type=hidden name="foil" value="91fcmo">
    <input type="text" name="id" maxlength="80" size="21" value="" mask="" desc="id" required="true">
    <input type="submit" name="submit" value="sign in" onclick="return checkform(loginform)">
    <input type="password" name="pin" size="6" maxlength="6" desc="pin" required="true">

You import cookielib, but you don't seem to be using any cookie jars:

    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

Then use the same opener for both the initial form fetch and the login form submission. I assume it's a cookie-based protection, where the value that comes in the foil field has to match a cookie that comes in the headers.
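As a rough sketch of what "the same opener for both requests" could look like, here is one way to structure it. It is written with the Python 3 module names (urllib.request and http.cookiejar instead of urllib2 and cookielib). The helper names make_session, fetch_login_page and submit_login are made up for illustration, and the URL is just the placeholder from the question. It also assumes the site really does pair the token with a cookie, which is only a guess.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "https://example.com/login.asp"  # placeholder URL from the question


def make_session():
    """Return (opener, jar). The opener stores received cookies in jar
    and sends them back on every later request made through it."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_login_page(opener, url=LOGIN_URL):
    """GET the login page; any cookies the server sets end up in the jar."""
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")


def submit_login(opener, fields, url=LOGIN_URL):
    """POST the form fields through the SAME opener so the cookies go along."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    with opener.open(url, data=body) as response:
        return response.read().decode("utf-8", errors="replace")
```

You would call make_session() once, pass the returned opener to fetch_login_page(), pull the token out of the returned HTML, and then pass that same opener to submit_login(). Calling urlopen() directly for the POST, as in the question, bypasses the jar.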

Another thing I noticed in your code is that you assign notfoiled to absr_id instead of foil. Is that intentional?
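If it wasn't intentional, the POST data would presumably mirror the form's inputs one to one, with the scraped token under its own name. A minimal sketch in Python 3 syntax follows. The function name build_login_fields is hypothetical, the field names are copied from the form in the question, and leaving absr_id empty is an assumption based on its empty value in that form.

```python
import urllib.parse


def build_login_fields(foil, username, password):
    """Mirror the inputs of the login form; the scraped token goes under 'foil'."""
    return {
        "location": "",
        "qstring": "",
        "absr_id": "",      # empty in the form, so left empty here (assumption)
        "foil": foil,       # the per-page token scraped from the hidden field
        "id": username,
        "pin": password,
        "submit": "sign in",
    }


# Example values taken from the question's form and placeholder credentials.
encoded = urllib.parse.urlencode(build_login_fields("91fcmo", "blah", "blah"))
print(encoded)
```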

Also, please do yourself a favor and use html5lib or BeautifulSoup instead of parsing the HTML manually.
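To show the idea without any third-party installs, here is a small sketch that uses the standard library's html.parser as a stand-in for html5lib or BeautifulSoup. It is not either of those libraries, only the same approach of letting a parser find the attributes instead of splitting lines on quote characters. The class and function names are made up for illustration, and the sample markup is trimmed from the form in the question.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.hidden


sample = (
    '<form name="loginform" method="post">'
    '<input type=hidden name="absr_id" value="">'
    '<input type=hidden name="foil" value="91fcmo">'
    '<input type="text" name="id" value="">'
    "</form>"
)
fields = hidden_fields(sample)
print(fields["foil"])
```

Unlike the line.split('"') approach, this does not depend on the attribute order or on the field sitting on a line of its own.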

