web scraping - Python: POST to a form with anti-scrape protection
I'm trying to scrape content off a site with Python. It has simple form authentication with a username and password, plus a hidden field called "foil" that contains what looks like a randomly generated string each time the page is loaded. In order to log in, this value must be included in the content header of the POST. I've tried scraping out the random string after the login page loads, but it still redirects me to the login page. I have a valid username and password for the site, and they work; I just want to check for updates sporadically and send myself an email when something changes. Here is the code I've been working with so far:
import urllib, urllib2, cookielib, subprocess

url = 'https://example.com/login.asp'
username = 'blah'
password = 'blah'

request = urllib2.Request(url)
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
predata = opener.open(request).readlines()
for line in predata:
    if "foil" in line:
        foils = line.split('"')
        notfoiled = foils[3]

query_args = {'location': '', 'qstring': '', 'absr_id': notfoiled, 'id': username, 'pin': password, 'submit': 'sign in'}
requestwheader = urllib2.Request('https://example.com/login.asp')
requestwheader.add_data(urllib.urlencode(query_args))
print 'request method after data :', requestwheader.get_method()
print
print 'outgoing data:'
print requestwheader.get_data()
print
print 'server response:'
print urllib2.urlopen(requestwheader).read()
rawres = urllib2.urlopen(requestwheader).read()
The form looks like this:
<form name="loginform" method="post" action="https://example.com/login.asp?x=x&&pswd=">
<input type=hidden name="location" value="">
<input type=hidden name="qstring" value="">
<input type=hidden name="absr_id" value="">
<input type=hidden name="foil" value="91fcmo">
<input type="text" name="id" maxlength="80" size="21" value="" mask="" desc="id" required="true">
<input type="submit" name="submit" value="sign in" onclick="return checkform(loginform)">
<input type="password" name="pin" size="6" maxlength="6" desc="pin" required="true">
You import cookielib, but you do not seem to be using it. Try using a CookieJar:
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
Then use the same opener for both the initial form fetch and the login form submission. I assume it's a cookie-based protection, where the value that comes in the foil field has to match a cookie that comes in the headers.
Another thing I noticed in your code is that you assign notfoiled to absr_id instead of foil. Is that intentional?
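Putting those two points together, something along these lines might work. This is only a sketch under several assumptions: it uses the Python 3 module names (http.cookiejar and urllib.request) rather than cookielib and urllib2, the helper names make_session, build_login_body and login are made up for illustration, extract_token is a caller-supplied function that pulls the foil value out of the page, and whether the server really pairs foil with a cookie is a guess.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); the opener stores and resends cookies via the jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_body(token, username, password):
    """URL-encode the form fields, sending the token under its own name, foil."""
    # Field names mirror the form shown in the question.
    fields = {
        "location": "",
        "qstring": "",
        "absr_id": "",
        "foil": token,
        "id": username,
        "pin": password,
        "submit": "sign in",
    }
    return urllib.parse.urlencode(fields).encode("ascii")


def login(form_url, action_url, username, password, extract_token):
    """Fetch the form and submit it through one shared opener.

    extract_token is a hypothetical callable: it takes the page HTML and
    returns the current foil value.
    """
    opener, _jar = make_session()
    # First request: the server can set its cookie and issue a foil value.
    page = opener.open(form_url).read().decode("utf-8", errors="replace")
    body = build_login_body(extract_token(page), username, password)
    # Passing data makes this a POST; cookies from the first request are resent.
    return opener.open(action_url, data=body).read()
```

The important part is that login never calls urllib.request.urlopen directly: both requests go through the one opener, so any cookie handed out with the form travels back with the submission.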
Also, please do yourself a favor and use html5lib or BeautifulSoup instead of parsing the HTML manually.
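To give an idea of what parser-based extraction looks like, here is a small sketch. It does not use html5lib or BeautifulSoup; it swaps in the standard library's html.parser so that it runs without installing anything, and it assumes Python 3. The names HiddenFieldParser and hidden_fields are invented for this example, and the sample markup is a shortened copy of the form from the question.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of every <input type=hidden> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html_text):
    """Return a dict of hidden form fields found in html_text."""
    parser = HiddenFieldParser()
    parser.feed(html_text)
    return parser.fields


sample = """<form name="loginform" method="post" action="https://example.com/login.asp?x=x&&pswd=">
<input type=hidden name="location" value="">
<input type=hidden name="foil" value="91fcmo">
<input type="text" name="id" value="">"""

fields = hidden_fields(sample)
print(fields["foil"])  # prints 91fcmo
```

Unlike line.split('"')[3], this keeps working if the attributes are reordered or the tag is wrapped across lines, and it picks up the other hidden fields at the same time.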