Here is a scraper I created using Python on ScraperWiki:
    import lxml.html
    import re
    import scraperwiki
    pattern = re.compile(r'\s')
    html = scraperwiki.scrape("http://www.shanghairanking.com/arwu2012.html")
    root = lxml.html.fromstring(html)
    for tr in root.cssselect("#universityranking tr:not(:first-child)"):
        if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
            data = {
                'arwu_rank'  : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
                'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
            }
        # debug begin
        if not type(data["arwu_rank"]) is str:
            print type(data["arwu_rank"])
            print data["arwu_rank"]
            print data["university"]
        # debug end
        if "-" in data["arwu_rank"]:
            arwu_rank_bounds = data["arwu_rank"].split("-")
            data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
        if not type(data["arwu_rank"]) is int:
            data["arwu_rank"] = int(data["arwu_rank"])
        scraperwiki.sqlite.save(unique_keys=['university'], data=data)

It works except when scraping the final data row of the table (the "York University" line). At that point, instead of lines 9 through 11 of the code causing the string "401-500" to be retrieved from the table and assigned to data["arwu_rank"], those lines somehow seem to cause the int 450 to be assigned to data["arwu_rank"]. You can see that I've added a few lines of "debugging" code to get a better understanding of what's going on, but that debugging code doesn't go very deep.
I have two questions:
- What are my options for debugging scrapers that run on the ScraperWiki infrastructure, e.g. for troubleshooting issues like this? Is there a way to step through the code?
- Can you tell me why the int 450, instead of the string "401-500", is being assigned to data["arwu_rank"] for the "York University" line?
Edit 6 May 2013, 20:07h UTC
The following scraper completes without issue, but I'm still unsure why the first one failed on the "York University" line:
    import lxml.html
    import re
    import scraperwiki
    pattern = re.compile(r'\s')
    html = scraperwiki.scrape("http://www.shanghairanking.com/arwu2012.html")
    root = lxml.html.fromstring(html)
    for tr in root.cssselect("#universityranking tr:not(:first-child)"):
        if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
            data = {
                'arwu_rank'  : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
                'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
            }
            # debug begin
            if not type(data["arwu_rank"]) is str:
                print type(data["arwu_rank"])
                print data["arwu_rank"]
                print data["university"]
            # debug end
            if "-" in data["arwu_rank"]:
                arwu_rank_bounds = data["arwu_rank"].split("-")
                data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
            if not type(data["arwu_rank"]) is int:
                data["arwu_rank"] = int(data["arwu_rank"])
            scraperwiki.sqlite.save(unique_keys=['university'], data=data)
There's no easy way to debug your scripts on ScraperWiki, unfortunately. It just sends your code in its entirety and gets the results back, so there's no way to execute the code interactively.
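Since only the output comes back, printing is about the only tool left. As a rough sketch (the `describe` helper is a hypothetical name of mine, not part of the scraperwiki library), a small function that reports the type alongside the value makes this kind of problem much easier to spot. It is written so that it runs on both Python 2 and Python 3:

```python
def describe(label, value):
    """Return one line showing a label, the value's type name and its repr."""
    return "%s: %s = %r" % (label, type(value).__name__, value)


# Hypothetical stand-in for the dict the scraper builds for each table row.
data = {"arwu_rank": 450, "university": "York University"}

for key in sorted(data):
    print(describe(key, data[key]))
```

Calling it at the top of each loop iteration, before any conversion happens, shows whether a value was freshly scraped (a str) or left over from earlier (an int).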
I added a couple more prints to a copy of your code, and it looks as if the check before the bit that assigns data

    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:

doesn't trigger for "York University", so data is keeping the int value (which you set later on) from the previous time around the loop.
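To illustrate the effect without the network or lxml, here is a minimal sketch that assumes a made-up list of rows, where `None` stands in for a table row that fails the check. The names `rows`, `midpoint_rank`, `buggy` and `fixed` are mine, not from your scraper:

```python
def midpoint_rank(rank_text):
    """Turn '401-500' into 450 and a plain rank such as '7' into 7."""
    if "-" in rank_text:
        low, high = rank_text.split("-")
        return int((float(low) + float(high)) * 0.5)
    return int(rank_text)


# Made-up rows; None plays the part of a row without the expected cells.
rows = [("301-400", "University A"), ("401-500", "York University"), None]


def buggy(rows):
    """Only the assignment is guarded, so the rest runs for every row."""
    seen = []
    for row in rows:
        if row is not None:
            data = {"arwu_rank": row[0], "university": row[1]}
        # For the None row, data is still the dict from the previous pass.
        seen.append(type(data["arwu_rank"]).__name__)
        if not isinstance(data["arwu_rank"], int):
            data["arwu_rank"] = midpoint_rank(data["arwu_rank"])
    return seen


def fixed(rows):
    """Skip rows that fail the check, so stale data is never reused."""
    seen = []
    for row in rows:
        if row is None:
            continue
        data = {"arwu_rank": row[0], "university": row[1]}
        seen.append(type(data["arwu_rank"]).__name__)
        data["arwu_rank"] = midpoint_rank(data["arwu_rank"])
    return seen


print(buggy(rows))  # the last entry is 'int': the leftover 450
print(fixed(rows))  # only 'str' entries
```

In your first scraper, the leftover int then reaches the `"-" in data["arwu_rank"]` test, which raises a TypeError because an int cannot be searched with `in`. Moving everything under the `if`, as in your edited version, or skipping the row with `continue`, avoids that.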