
Bulletproofing SimpleXMLElement

Everyone knows that we should always use DOM techniques instead of regexes to extract content from HTML, but I get the feeling that I can never trust the SimpleXML extension or sim

Solution 1:

You can load the HTML with DOMDocument's loadHTML() and then import the result into SimpleXML.

IIRC, it will still choke on some stuff but it will accept pretty much anything that exists in the real world of broken websites.

$html = '<html><head><body><div>stuff & stuff</body></html>';

// collect libxml errors internally instead of emitting PHP warnings
$old = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// discard the collected errors and restore the old behaviour
libxml_clear_errors();
libxml_use_internal_errors($old);

$sxe = simplexml_import_dom($dom);
die($sxe->asXML());
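
If you also want to see what libxml complained about, the same approach can be wrapped in a small helper. This is only a sketch: the function name html_to_simplexml() and its [element, errors] return shape are made up for illustration, and the xpath query at the end is just one way to use the result.

```php
<?php
// Hypothetical helper (not part of the answer above): load possibly broken
// HTML through DOMDocument and hand back a SimpleXMLElement together with
// the libxml errors that were collected while parsing.
function html_to_simplexml(string $html): array
{
    // loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return [null, []];
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $old = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $loaded = $dom->loadHTML($html);

    // Copy the collected errors, clear the buffer, restore the old setting.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($old);

    if (!$loaded || $dom->documentElement === null) {
        return [null, $errors];
    }

    $sxe = simplexml_import_dom($dom);

    return [$sxe instanceof SimpleXMLElement ? $sxe : null, $errors];
}

[$sxe, $errors] = html_to_simplexml('<html><head><body><div>stuff & stuff</body></html>');

if ($sxe !== null) {
    // Query the repaired tree as usual.
    foreach ($sxe->xpath('//div') as $div) {
        echo trim((string) $div), "\n";
    }
}

// Each entry is a LibXMLError object with message, line, column, etc.
foreach ($errors as $error) {
    echo trim($error->message), " (line {$error->line})\n";
}
```

Returning the errors rather than throwing them away lets the caller decide how broken is too broken.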

Solution 2:

You could always try a SAX parser, which is a bit more robust to errors.

It may not be as efficient on large XML, though.
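
For reference, a minimal sketch of the SAX approach using PHP's XML Parser extension (xml_parser_create() and friends). The helper name sax_collect_items() and the <item> element are assumptions made up for the example. Note that this is an XML parser, not an HTML one: it still stops at the first well-formedness error, so the sketch reports that error alongside whatever the handlers had collected by then.

```php
<?php
// Hypothetical helper: collect the text of every <item> element with the
// event-based XML Parser extension.
function sax_collect_items(string $xml): array
{
    $items = [];
    $buffer = null; // non-null only while we are inside an <item>

    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Start tag: begin buffering when an <item> opens.
        function ($parser, string $name, array $attrs) use (&$buffer) {
            if ($name === 'item') {
                $buffer = '';
            }
        },
        // End tag: store the buffered text when the <item> closes.
        function ($parser, string $name) use (&$items, &$buffer) {
            if ($name === 'item' && $buffer !== null) {
                $items[] = trim($buffer);
                $buffer = null;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        // Text may arrive in several pieces, so append rather than assign.
        function ($parser, string $data) use (&$buffer) {
            if ($buffer !== null) {
                $buffer .= $data;
            }
        }
    );

    $error = null;
    if (xml_parse($parser, $xml, true) !== 1) {
        $error = sprintf(
            '%s at line %d',
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)
        );
    }

    return [$items, $error];
}

// The unescaped "&" makes this input ill-formed partway through.
[$items, $error] = sax_collect_items(
    '<list><item>one</item><item>two</item><item>three & four</item></list>'
);
print_r($items);
echo $error ?? 'no error', "\n";
```

Because the handlers run as the input is read, nothing forces you to hold the whole document in memory as a tree; the cost is that you have to track state such as $buffer yourself.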

