I'd like to scrape this HTML page:
library(XML)
doc <- htmlParse("http://eusoils.jrc.ec.europa.eu/esdb_archive/esdbv3/legend/sg_attr.htm")
but I have issues with special characters (i.e. the > and < signs) and with node sets of different lengths, see here:
legs <- getNodeSet(doc, "//a")
leg_names <- sapply(legs, xmlGetAttr, "name")
leg_descr <- xpathSApply(doc, "//strong", xmlValue)  # not the same length??
cbind(leg_names, leg_descr)                          # different lengths??
getNodeSet(doc, '//text()[following-sibling::a]')
and
# why is this not working?
getNodeSet(doc, '//a[@name="aglim1"]/text()[following-sibling::strong')
In the end I'd like to have every legend (the text after the tags that carry a name) in a table with 2 columns: the 1st for the value/symbol and the 2nd for its label.
Like this one for wrb-full:
value  label
ab     albeluvisol
abal   alic albeluvisol
abap   abruptic albeluvisol
abar   arenic albeluvisol
abau   alumic albeluvisol
abeun  endoeutric albeluvisol
...    ...
The formatting of the document is not consistent: there are <a> elements without a following <strong> element, so there are more of the former than of the latter.
cbind( head(leg_names,8), head(leg_descr,8) )
#      [,1]            [,2]
# [1,] "aglim1"        "aglim1: code of important limitation agricultural use of stu"
# [2,] "aglim2"        "aglim2: code of secondary limitation agricultural use of stu"
# [3,] "border_soil1m" "fao85-full: full soil code 1974 fao"
# [4,] "soil1m"        "fao85-lev1: soil major group code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [5,] "cfl"           "fao85-lev2: second level soil code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [6,] "cl"            "fao85-lev3: third level soil code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [7,] "country"       "fao90-full:full soil code of stu 1990 fao-unesco soil legend"
# [8,] "fao85fu"       "fao90-lev1: soil major group code of stu 1990 fao-unesco soil legend"
The following-sibling approach seems more promising, but since there are <a> elements that are not followed by a <strong> element, you can end up with the description of a different element.
getNodeSet(doc, '//a[@name="aglim1"]/following-sibling::strong/text()')[[1]]
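One way to avoid picking up a neighbour's description is to look only at the element immediately after each <a> and keep it only when it is a <strong>. The sketch below is not tested against the real page: the inline HTML is a made-up miniature of it, `describe` is a hypothetical helper, and it assumes the <strong> is a direct sibling of the <a>, as the XPath above implies.

```r
library(XML)

# Hypothetical miniature of the page: the second anchor has no <strong>.
html <- '<html><body><p>
<a name="aglim1"></a><strong>aglim1: first description</strong>
<a name="border_soil1m"></a>
<a name="fao85fu"></a><strong>fao85-full: third description</strong>
</p></body></html>'

doc <- htmlParse(html, asText = TRUE)
anchors <- getNodeSet(doc, "//a[@name]")

# For each anchor, take the first following sibling element and keep its
# text only if that element is a <strong>; otherwise record NA.
describe <- function(node) {
  hit <- xpathSApply(node, "following-sibling::*[1][self::strong]", xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

result <- data.frame(
  name  = sapply(anchors, xmlGetAttr, "name"),
  descr = vapply(anchors, describe, character(1)),
  stringsAsFactors = FALSE
)
print(result)
```

Anchors without a description stay in the table with NA, so the two columns always have the same length and each description stays next to its own name.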
An alternative is to forget about the formatting and treat the file as a plain text file.
raw_data <- readLines("http://eusoils.jrc.ec.europa.eu/esdb_archive/esdbv3/legend/sg_attr.htm")
library(stringr)
# keep only the lines that contain an <a> tag followed by a <strong> tag
matches <- str_extract(raw_data, '<a .*<strong>.*')
matches <- matches[ ! is.na(matches) ]
# capture the anchor name and the text inside <strong>; drop the full-match column
result <- str_match(matches, '<a name="(.*?)".*<strong>(.*)</strong>')[,-1]
head(result)
#      [,1]       [,2]
# [1,] "aglim1"   "aglim1: code of important limitation agricultural use of stu"
# [2,] "aglim2"   "aglim2: code of secondary limitation agricultural use of stu"
# [3,] "fao85fu"  "fao85-full: full soil code 1974 fao"
# [4,] "fao85lv1" "fao85-lev1: soil major group code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [5,] "fao85lv2" "fao85-lev2: second level soil code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [6,] "fao85lv3" "fao85-lev3: third level soil code of stu 1974 (modified cec 1985) fao-unesco soil legend"
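If you would rather not depend on stringr, the same idea can be sketched in base R with `regexec` and `regmatches`. The sample lines below are invented stand-ins for the output of `readLines()`, and the pattern assumes each <a name="..."> and its <strong> sit on the same line; only the first pair on each line is captured.

```r
# Hypothetical sample lines standing in for readLines() on the page.
raw_data <- c(
  '<p><a name="aglim1"></a><strong>aglim1: first description</strong></p>',
  '<p><a name="border_soil1m"></a>no description here</p>',
  '<p><a name="fao85fu"></a><strong>fao85-full: third description</strong></p>'
)

# Group 1: the anchor name; group 2: the text inside the first <strong>.
pattern <- '<a name="([^"]*)"[^>]*>.*?<strong>(.*?)</strong>'
m <- regmatches(raw_data, regexec(pattern, raw_data, perl = TRUE))

# Lines without a match yield character(0); keep only complete matches.
m <- m[lengths(m) == 3]

# Drop the full-match element and stack the two capture groups row by row.
result <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(result) <- c("name", "descr")
print(result)
```

The non-greedy quantifiers keep each capture from running past the first closing tag on the line.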