I have started reading the documentation and examples for DOM in order to crawl and parse a document.
For example, I have a part of a document shown below:
    <div id="showcontent">
    <table>
    <tr> <td> crap </td> </tr>
    <tr>
    <td width="172" valign="top"><a href="link"><img height="91" border="0" width="172" class="" src="img"></a></td>
    <td width="10"> </td>
    <td valign="top"><table cellspacing="0" cellpadding="0" border="0">
    <tbody><tr> <td height="30"><a class="px11" href="link">title</a><a><br> <span class="px10"></span> </a></td> </tr>
    <tr> <td><img height="1" width="580" src="crap"></td> </tr>
    <tr> <td align="right"> <a href="link"><img height="16" border="0" width="65" src="/buy"></a> </td> </tr>
    <tr> <td valign="top" class="px10"> <p style="width: 500px;">description.</p> </td> </tr>
    </tbody></table></td>
    </tr>
    <tr> <td> crap </td> </tr>
    <tr> <td> crap </td> </tr>
    </table>
    </div>

I'm trying to use the following code to get the tr tags and analyze whether there is crap or information inside them:
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $tags = $xpath->query('.//div[@id="showcontent"]');
    foreach ($tags as $tag) {
        $string = "";
        $string = trim($tag->nodeValue);
        if (strlen($string) > 3) {
            echo $string;
            echo '<br>';
        }
    }

However, I'm getting a stripped string without tags, for example:
    crap crap title description

But I want to get:

    <tr> <td>crap</td> </tr>
    <tr> <a href="link">title</a> </tr>

How do I keep the HTML nodes (tags)?
If you want to work with DOM, you have to understand the concept. Everything in a DOM document, including the DOMDocument itself, is a node.
The DOMDocument is a hierarchical tree structure of nodes. It starts with a root node. That root node can have child nodes, and these child nodes can have child nodes of their own. Everything in a DOMDocument is a node type of some sort, be it elements, attributes or text content.
              HTML                       Legend:
             /    \                      UPPERCASE = DOMElement
          HEAD    BODY                   lowercase = DOMAttr
           /         \                   "Quoted"  = DOMText
        TITLE        DIV - class - "header"
          |            \
    "The Title"         H1
                        |
              "Welcome to Nodeville"

The diagram above shows a DOMDocument with some nodes. There is a root element (HTML) with two children (HEAD and BODY). The connecting lines are called axes. If you follow down the axis to the TITLE element, you will see that it has one DOMText leaf. This is important because it illustrates an often overlooked thing:
    <title>The Title</title>

is not one, but two nodes: a DOMElement with a DOMText child. Likewise, this

    <div class="header">

is three nodes: the DOMElement with a DOMAttr holding a DOMText. Because all of these inherit their properties and methods from DOMNode, it is essential to familiarize yourself with the DOMNode class.
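To see the node types for yourself, here is a minimal sketch. The markup is a made-up sample mirroring the diagram, and the variable names are only illustrative; it prints the class of each node that the short snippets above break down into.

```php
<?php
// Hypothetical sample markup mirroring the diagram above.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><head><title>The Title</title></head>'
    . '<body><div class="header"><h1>Welcome to Nodeville</h1></div></body></html>'
);

// <title>The Title</title>: an element node with a text node as its child.
$title = $dom->getElementsByTagName('title')->item(0);
$titleText = $title->firstChild;
printf("%s > %s (%s)\n", get_class($title), get_class($titleText), $titleText->nodeValue);

// <div class="header">: an element node and its attribute node, which holds the value.
$div = $dom->getElementsByTagName('div')->item(0);
$class = $div->getAttributeNode('class');
printf("%s > %s (%s)\n", get_class($div), get_class($class), $class->value);

// All of them share the DOMNode base class.
var_dump($title instanceof DOMNode, $titleText instanceof DOMNode, $class instanceof DOMNode);
```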
In practice, this means the DIV you fetched is linked to all the other nodes in the document. You could go all the way up to the root element or down to the leaves at any time. It's all there. You just have to query or traverse the document for the wanted information.
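A rough sketch of what "it's all there" means, using a trimmed-down, hypothetical version of the markup from the question: starting from the fetched DIV, you can walk up via parentNode and down to the text leaves with a relative XPath query.

```php
<?php
// Trimmed-down sample markup (an assumption, not the full document from the question).
$html = '<html><body><div id="showcontent"><table>'
      . '<tr><td><a href="link">title</a></td></tr>'
      . '</table></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="showcontent"]')->item(0);

// Walk up: follow parentNode until we leave the element tree
// (the parent of the root element is the DOMDocument itself).
$path = [];
for ($node = $div; $node instanceof DOMElement; $node = $node->parentNode) {
    array_unshift($path, $node->nodeName);
}
echo implode('/', $path), "\n";

// Walk down: collect every text leaf below the DIV together with its parent element.
$leaves = [];
foreach ($xpath->query('.//text()', $div) as $text) {
    $leaves[] = $text->parentNode->nodeName . ': ' . $text->nodeValue;
}
print_r($leaves);
```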
Whether you do that by iterating the childNodes of the DIV, or by using getElementsByTagName() or XPath, is up to you. You just have to understand that you are not working with raw HTML, but with nodes representing the entire HTML document.
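For comparison, here is a small sketch of the three approaches side by side, fetching the TR elements from a simplified, hypothetical version of the question's markup. The variable names are illustrative only.

```php
<?php
// Simplified sample markup (an assumption): one table, two rows, no nesting.
$html = '<div id="showcontent"><table>'
      . '<tr><td>crap</td></tr>'
      . '<tr><td><a class="px11" href="link">title</a></td></tr>'
      . '</table></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="showcontent"]')->item(0);

// 1. Iterate childNodes: only the direct children of the TABLE element.
$byChildren = [];
foreach ($div->firstChild->childNodes as $child) {
    if ($child instanceof DOMElement && $child->nodeName === 'tr') {
        $byChildren[] = $child;
    }
}

// 2. getElementsByTagName(): every TR anywhere below the DIV.
$byTag = $div->getElementsByTagName('tr');

// 3. XPath, evaluated relative to the DIV.
$byXPath = $xpath->query('.//tr', $div);

printf("%d / %d / %d rows\n", count($byChildren), $byTag->length, $byXPath->length);
```

With nested tables, as in the full document from the question, the results differ: getElementsByTagName() and `.//tr` also return the rows of the inner table, while iterating childNodes stays on one level.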
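If you need help with extracting specific information from the document, you need to clarify what information you want to fetch from it. For instance, you could ask how to fetch all the links from the table, and then we could answer something like: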
    $div = $dom->getElementById('showcontent');
    foreach ($div->getElementsByTagName('a') as $link) {
        echo $dom->saveXML($link);
    }

But unless you are more specific, we can only guess which nodes might be relevant.
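As one such guess at what the question is after, the sketch below keeps the markup of each outer TR and separates rows that contain a link from the rest. It assumes simplified sample markup, treats "contains an A element" as the test for "information", and passes the node to saveHTML() so that only that node is serialized. The names `$info` and `$crap` are made up for illustration.

```php
<?php
// Simplified sample markup (an assumption, not the full document from the question).
$html = '<div id="showcontent"><table>'
      . '<tr><td>crap</td></tr>'
      . '<tr><td><a class="px11" href="link">title</a></td></tr>'
      . '</table></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$info = [];
$crap = [];

// Only the rows directly inside the outer table, not rows of nested tables.
foreach ($xpath->query('//div[@id="showcontent"]/table/tr') as $tr) {
    // saveHTML($node) serializes just this node, tags included.
    $markup = trim($dom->saveHTML($tr));

    // Assumed criterion: a row with at least one link is "information".
    if ($xpath->query('.//a', $tr)->length > 0) {
        $info[] = $markup;
    } else {
        $crap[] = $markup;
    }
}

foreach ($info as $row) {
    echo $row, "\n";
}
```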
If you need more examples and code snippets on how to work with DOM, browse through previous answers to related questions:
By now, there should be a snippet for every basic to medium use case you might have with DOM.