python - Debugging ScraperWiki scraper (producing spurious integer)


Here is a scraper I created using Python on ScraperWiki:

import lxml.html
import re
import scraperwiki

pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/arwu2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#universityranking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank'  : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
    # debug begin
    if not type(data["arwu_rank"]) is str:
        print type(data["arwu_rank"])
        print data["arwu_rank"]
        print data["university"]
    # debug end
    if "-" in data["arwu_rank"]:
        arwu_rank_bounds  = data["arwu_rank"].split("-")
        data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
    if not type(data["arwu_rank"]) is int:
        data["arwu_rank"] = int(data["arwu_rank"])
    scraperwiki.sqlite.save(unique_keys=['university'], data=data)

It works except when scraping the final data row of the table (the "York University" line). At that point, instead of lines 9 through 11 of the code causing the string "401-500" to be retrieved from the table and assigned to data["arwu_rank"], those lines somehow seem to be causing the int 450 to be assigned to data["arwu_rank"]. You can see that I've added a few lines of "debugging" code to get a better understanding of what's going on, but that debugging code doesn't go very deep.

I have two questions:

  1. What are my options for debugging scrapers that run on the ScraperWiki infrastructure, e.g. for troubleshooting issues like this? For example, is there a way to step through the code?
  2. Can anyone tell me why the int 450, instead of the string "401-500", is being assigned to data["arwu_rank"] for the "York University" line?

Edit 6 May 2013, 20:07h UTC

The following scraper completes without issue, but I'm still unsure why the first one failed on the "York University" line:

import lxml.html
import re
import scraperwiki

pattern = re.compile(r'\s')
html = scraperwiki.scrape("http://www.shanghairanking.com/arwu2012.html")
root = lxml.html.fromstring(html)
for tr in root.cssselect("#universityranking tr:not(:first-child)"):
    if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:
        data = {
            'arwu_rank'  : str(re.sub(pattern, r'', tr.cssselect("td.ranking")[0].text_content())),
            'university' : tr.cssselect("td.rankingname")[0].text_content().strip()
        }
        # debug begin
        if not type(data["arwu_rank"]) is str:
            print type(data["arwu_rank"])
            print data["arwu_rank"]
            print data["university"]
        # debug end
        if "-" in data["arwu_rank"]:
            arwu_rank_bounds  = data["arwu_rank"].split("-")
            data["arwu_rank"] = int( ( float(arwu_rank_bounds[0]) + float(arwu_rank_bounds[1]) ) * 0.5 )
        if not type(data["arwu_rank"]) is int:
            data["arwu_rank"] = int(data["arwu_rank"])
        scraperwiki.sqlite.save(unique_keys=['university'], data=data)

There's no easy way to debug your scripts on ScraperWiki, unfortunately. It just sends your code in its entirety and gets the results back, so there's no way to execute the code interactively.
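Since printed output is the only thing that comes back, print-based tracing is about what remains. A minimal sketch of a helper that makes each print more informative by showing the type next to the value; `trace` is a hypothetical name and not part of the ScraperWiki library:

```python
def trace(label, **values):
    """Print one debug line showing the repr and type of each named value."""
    parts = [
        "%s=%r (%s)" % (name, value, type(value).__name__)
        for name, value in sorted(values.items())
    ]
    line = "[%s] %s" % (label, ", ".join(parts))
    print(line)
    return line  # returned as well, so the line can be collected or checked


# Example call with made-up values, as it might appear inside the row loop.
trace("row", arwu_rank=450, university="York University")
```

Calling it at the top of every loop pass, not only when something already looks wrong, makes it easier to see on which pass a value changes type.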

I added a couple more prints to a copy of your code, and it looks as if the check before the bit that assigns data

if len(tr.cssselect("td.ranking")) > 0 and len(tr.cssselect("td.rankingname")) > 0:

doesn't trigger for "York University", so data will be keeping the int value (which you set later on) from the previous time around the loop.
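The stale-value effect can be reproduced without the network. This is a sketch under stated assumptions: the rows are hypothetical stand-ins for the table, `(None, None)` represents a row where the check fails, and the function names are invented for illustration. Which row actually fails the check on the live page is not verified here.

```python
def to_rank(rank_text):
    """Turn "401-500" into 450 and "7" into 7 (same arithmetic as the scraper)."""
    if "-" in rank_text:
        low, high = rank_text.split("-")
        return int((float(low) + float(high)) * 0.5)
    return int(rank_text)


def scrape_buggy(rows):
    """Only the assignment sits inside the check, as in the first scraper."""
    seen = []  # what the debug prints would observe on each pass
    for rank, name in rows:
        if rank is not None and name is not None:
            data = {"arwu_rank": rank, "university": name}
        # Runs even when the check failed, so `data` can be the dict
        # left over from the previous pass, already converted to int.
        seen.append((data["university"], type(data["arwu_rank"]).__name__))
        # Guard added only so the demo keeps going; the original line
        # `"-" in data["arwu_rank"]` raises TypeError when the value is an int.
        if isinstance(data["arwu_rank"], str):
            data["arwu_rank"] = to_rank(data["arwu_rank"])
    return seen


def scrape_fixed(rows):
    """Skip rows that fail the check, so stale data is never reused."""
    saved = []
    for rank, name in rows:
        if rank is None or name is None:
            continue
        saved.append({"arwu_rank": to_rank(rank), "university": name})
    return saved


# Hypothetical rows; (None, None) stands for a <tr> without the expected cells.
rows = [("401-500", "York University"), (None, None)]
print(scrape_buggy(rows))
print(scrape_fixed(rows))
```

In the buggy version the second pass reports "York University" again, this time with an int rank, which matches what the debug prints showed. Moving everything under the check (as in the edited scraper) or skipping the row with `continue` avoids reusing the previous pass's dict.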

