PHP Regex to Find Data Not in HTML Tags but Identify HTML Using &lt; and &gt;
Solution 1:
That looks like an HTML fragment inside XML, more specifically inside the description of an RSS feed. If that is the case, you should parse the RSS using DOM, which will decode the entities along the way:
$dom = new DOMDocument();
$dom->loadXml($rss);
$xpath = new DOMXpath($dom);
Iterate the items:
foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
The title of an item is only a text value, so it can be used directly:
echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";
The description in your example contains the HTML fragment in a text node with escaped entities; I have seen other examples that use a CDATA section instead. It doesn't really matter for the outer XML document. It is text, and if you read it as text the entities are transformed back into their respective characters.
$description = $xpath->evaluate('string(description)', $rssItem);
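To illustrate the point about escaped entities versus CDATA, here is a minimal sketch. The `<items>` wrapper and the sample text are made up for the demonstration and are not taken from your feed; the idea is only that both variants should yield the same string once read as text.

```php
<?php
// Two hypothetical <description> elements carrying the same HTML fragment,
// once with escaped entities and once inside a CDATA section.
$xml = <<<'XML'
<items>
  <description>Some &lt;b&gt;bold&lt;/b&gt; text</description>
  <description><![CDATA[Some <b>bold</b> text]]></description>
</items>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$values = [];
foreach ($xpath->evaluate('/items/description') as $node) {
    // string(.) returns the text content of the node, entities already decoded
    $values[] = $xpath->evaluate('string(.)', $node);
}

var_dump($values[0] === $values[1]); // bool(true)
echo $values[0], "\n";               // Some <b>bold</b> text
```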
At this point $description contains literal < and > characters again. It can be loaded into a DOM with loadHTML() or simply cleaned up with strip_tags():
echo 'Description: ', strip_tags($description), "\n\n";
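If you need more than the plain text, the loadHTML() route might look roughly like the sketch below. The sample string, the wrapping `<div>`, and the variable names are assumptions for illustration; `$description` stands in for the decoded value read from the feed.

```php
<?php
// Stand-in for the decoded description string read from the RSS item.
$description = 'Here is some <b>text</b> with <span class="important">html</span>.';

// Collect libxml warnings internally; real-world fragments are often sloppy.
libxml_use_internal_errors(true);

$html = new DOMDocument();
// Wrap the fragment in a <div> so it has a single, known container element.
$html->loadHTML('<div>' . $description . '</div>');
$htmlXpath = new DOMXPath($html);

// Text of the whole fragment, tags removed by the parser
$text = $htmlXpath->evaluate('string(//div)');
// Text of one specific element only
$important = $htmlXpath->evaluate('string(//span[@class="important"])');

echo $text, "\n";      // Here is some text with html.
echo $important, "\n"; // html
```

Compared with strip_tags(), this lets you pick out individual elements or attributes, at the cost of a second parse.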
A full example (RSS adapted from Wikipedia):
$rss = <<<'RSS'
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
<item>
<title>Example entry</title>
<description>Here is some &lt;b&gt;text&lt;/b&gt; containing an interesting &lt;i&gt;description&lt;/i&gt; with &lt;span class="important"&gt;html&lt;/span&gt;.</description>
</item>
</channel>
</rss>
RSS;
$dom = new DOMDocument();
$dom->loadXml($rss);
$xpath = new DOMXpath($dom);
foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
    echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";
    $description = $xpath->evaluate('string(description)', $rssItem);
    echo 'Description: ', strip_tags($description), "\n\n";
}
Output:
Title: Example entry
Description: Here is some text containing an interesting description with html.
Solution 2:
To decode the entities, you can use htmlspecialchars_decode().
For more detail, see http://php.net/manual/en/function.htmlspecialchars-decode.php
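A short sketch of that approach, assuming the escaped fragment is already available as a plain string (the sample text here is made up):

```php
<?php
// Hypothetical escaped fragment, as it might appear in the raw feed source.
$escaped = 'Here is some &lt;b&gt;text&lt;/b&gt; with &lt;i&gt;markup&lt;/i&gt;.';

// Turn &lt; and &gt; (and the other special-character entities) back into characters.
$decoded = htmlspecialchars_decode($escaped);
// Optionally drop the tags afterwards to keep only the text.
$plain = strip_tags($decoded);

echo $decoded, "\n"; // Here is some <b>text</b> with <i>markup</i>.
echo $plain, "\n";   // Here is some text with markup.
```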
Solution 3:
To quickly obtain the raw text (without tags), you can use this replacement:
$result = preg_replace('~<.*?>~s', ' ', $source);
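As a rough illustration on a made-up input (assuming `$source` already holds the decoded markup with real < and > characters), note that every tag becomes a space, so you may want to collapse the whitespace afterwards. The second replacement is my addition, not part of the pattern above.

```php
<?php
// Hypothetical decoded input.
$source = 'Here is some <b>text</b> with <i>markup</i>.';

// Replace each tag with a single space (the s flag lets tags span newlines).
$raw = preg_replace('~<.*?>~s', ' ', $source);

// Optional cleanup: collapse runs of whitespace and trim the ends.
$clean = trim(preg_replace('~\s+~', ' ', $raw));

echo $clean, "\n"; // Here is some text with markup .
```

The stray space before the final period comes from the closing tag being replaced by a space; replacing with an empty string avoids it but can glue neighbouring words together.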
Solution 4:
This gives you all the texts you're seeking as an array:
preg_match_all("/(?<=>)(?!<).*?(?=<)/", $source, $result);
See a live demo of this regex working with your sample input.
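For reference, here is how the pattern behaves on a small made-up string (not your sample input), assuming `$source` holds decoded markup. Because of the lookbehind and lookahead, it only captures text that sits between a > and a following <, so text before the first tag or after the last one is not returned.

```php
<?php
// Hypothetical decoded input.
$source = 'Here is some <b>text</b> with <i>markup</i>.';

// Match runs of text that follow a ">" and precede a "<".
preg_match_all("/(?<=>)(?!<).*?(?=<)/", $source, $result);

// $result[0] holds the full matches: ['text', ' with ', 'markup']
print_r($result[0]);
```

If the leading and trailing text matters too, wrapping the input in a dummy element before matching would be one way around it.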