R XPath text after sibling


I'd like to parse this HTML page:

library(XML)
doc <- htmlParse("http://eusoils.jrc.ec.europa.eu/esdb_archive/esdbv3/legend/sg_attr.htm")

but I have issues with special characters (i.e. > and < signs) and with node sets of different lengths; see here:

legs <- getNodeSet(doc, "//a")
leg_names <- sapply(legs, xmlGetAttr, "name")
leg_descr <- xpathSApply(doc, "//strong", xmlValue)  # not the same length??
cbind(leg_names, leg_descr)  # different lengths??
getNodeSet(doc, '//text()[following-sibling::a]')

and

# why is this not working?
getNodeSet(doc, '//a[@name="aglim1"]/text()[following-sibling::strong')

In the end I'd like to have every legend (the text after the <a name> tags) in a table with 2 columns: the 1st with the value/symbol, the 2nd with its label.

Like this one for wrb-full:

value    label
ab       albeluvisol
abal     alic albeluvisol
abap     abruptic albeluvisol
abar     arenic albeluvisol
abau     alumic albeluvisol
abeun    endoeutric albeluvisol
...      ...
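The last step of that goal, turning legend lines into a two-column table, might look like the sketch below. It assumes each line holds a code, then whitespace, then the label; the sample lines are copied from the table above, and `legend_lines` is a made-up name rather than something extracted from the page.

```r
# Hypothetical input: legend lines as plain strings (code, whitespace, label)
legend_lines <- c("ab       albeluvisol",
                  "abal     alic albeluvisol",
                  "abap     abruptic albeluvisol")
legend_lines <- trimws(legend_lines)

# Split each line at the first run of whitespace:
# group 1 is the code, group 2 is the rest of the line
parts <- regmatches(
  legend_lines,
  regexec("^([^[:space:]]+)[[:space:]]+(.*)$", legend_lines)
)

# Each element of 'parts' is c(full match, code, label)
legend <- data.frame(
  value = sapply(parts, `[`, 2),
  label = sapply(parts, `[`, 3),
  stringsAsFactors = FALSE
)
legend
```

Splitting only at the first whitespace run keeps multi-word labels such as "alic albeluvisol" intact.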

The formatting of the document is not consistent: there are <a> elements without a following <strong> element, so there are more of the former.

cbind( head(leg_names,8), head(leg_descr,8) )
#      [,1]            [,2]
# [1,] "aglim1"        "aglim1: code of important limitation agricultural use of stu"
# [2,] "aglim2"        "aglim2: code of secondary limitation agricultural use of stu"
# [3,] "border_soil1m" "fao85-full: full soil code 1974 fao"
# [4,] "soil1m"        "fao85-lev1: soil major group code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [5,] "cfl"           "fao85-lev2: second level soil code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [6,] "cl"            "fao85-lev3: third level soil code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [7,] "country"       "fao90-full:full soil code of stu 1990 fao-unesco soil legend"
# [8,] "fao85fu"       "fao90-lev1: soil major group code of stu 1990 fao-unesco soil legend"

The following-sibling approach seems more promising, but since there are <a> elements that are not followed by a <strong> element, you can end up with the description of another element.

getNodeSet(doc, '//a[@name="aglim1"]/following-sibling::strong/text()')[[1]]
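One way to avoid picking up the description of another element is to accept a <strong> only when it is the very next element sibling of the <a>. The sketch below tries this on a small made-up HTML snippet (the markup, including the "orphan" anchor, is an assumption and not copied from the page); anchors without a directly following <strong> get NA.

```r
library(XML)

# Made-up sample: the second anchor has no <strong> right after it
html <- '<html><body><p>
<a name="aglim1"></a><strong>aglim1: first description</strong>
<a name="orphan"></a>
<a name="aglim2"></a><strong>aglim2: second description</strong>
</p></body></html>'
doc <- htmlParse(html, asText = TRUE)

anchors <- getNodeSet(doc, "//a[@name]")
leg_names <- sapply(anchors, xmlGetAttr, "name")

# For each anchor, take the first following element sibling,
# but only if that element is a <strong>
leg_descr <- sapply(anchors, function(a) {
  s <- getNodeSet(a, "following-sibling::*[1][self::strong]")
  if (length(s) == 0) NA_character_ else xmlValue(s[[1]])
})

result <- data.frame(name = leg_names, label = leg_descr,
                     stringsAsFactors = FALSE)
result
```

Because every anchor yields exactly one value, the two columns always have the same length, which sidesteps the mismatch seen with cbind() above.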

An alternative is to forget about the formatting and to consider the file as a plain text file.

raw_data <- readLines("http://eusoils.jrc.ec.europa.eu/esdb_archive/esdbv3/legend/sg_attr.htm")
library(stringr)
# keep only the lines that contain both an <a ...> and a <strong> tag
matches <- str_extract(raw_data, '<a .*<strong>.*')
matches <- matches[ ! is.na(matches) ]
# capture the anchor name and the description, dropping the full-match column
result <- str_match(matches, '<a name="(.*?)".*<strong>(.*)</strong>')[,-1]
head(result)
#      [,1]       [,2]
# [1,] "aglim1"   "aglim1: code of important limitation agricultural use of stu"
# [2,] "aglim2"   "aglim2: code of secondary limitation agricultural use of stu"
# [3,] "fao85fu"  "fao85-full: full soil code 1974 fao"
# [4,] "fao85lv1" "fao85-lev1: soil major group code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [5,] "fao85lv2" "fao85-lev2: second level soil code of stu 1974 (modified cec 1985) fao-unesco soil legend"
# [6,] "fao85lv3" "fao85-lev3: third level soil code of stu 1974 (modified cec 1985) fao-unesco soil legend"
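The same line-based idea can be tried offline and without extra packages. This is only a sketch in base R on a few made-up lines (the sample markup is an assumption, not taken from the page); lines without both tags are simply dropped.

```r
# Made-up sample lines standing in for the downloaded file
raw_data <- c(
  '<a name="aglim1"></a><strong>aglim1: first description</strong>',
  '<a name="orphan"></a>',
  '<p>unrelated line</p>',
  '<a name="aglim2"></a><strong>aglim2: second description</strong>'
)

# Group 1: anchor name, group 2: text inside <strong>
pattern <- '<a name="([^"]*)".*<strong>(.*)</strong>'
hits <- regmatches(raw_data, regexec(pattern, raw_data))

# Non-matching lines give character(0); drop them
hits <- hits[lengths(hits) > 0]

# Drop the full match (first element) and stack the groups into a matrix
result <- do.call(rbind, lapply(hits, `[`, -1))
result
```

Because name and description come from the same line, a description cannot be paired with the wrong anchor, as long as both tags sit on one line.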
