How To Get Web Page Title Using Html Parser
Solution 1:
According to your (redefined) question, the problem is that you only check the first node, Node node = list.elementAt(0);, while you should iterate over the list to find the title (which is not the first node). You could also pass a NodeFilter to parse() so that it returns only the TitleTag; the title would then be the first element and you wouldn't have to iterate.
Solution 2:
Assuming you're using Java (most languages have an equivalent), you can use a SAX parser such as TagSoup, which transforms any HTML into XHTML, and do the following in your handler:
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class MyHandler extends org.xml.sax.helpers.DefaultHandler {
    // true while the parser is inside the <title> element
    boolean readTitle = false;
    StringBuilder title = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String name,
            Attributes attributes) throws SAXException {
        if (localName.equals("title")) {
            readTitle = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String name) throws SAXException {
        if (localName.equals("title")) {
            readTitle = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // characters() may be called several times for one element, so append
        if (readTitle) title.append(ch, start, length);
    }
}
Then you use it with your parser (example with TagSoup):
org.ccil.cowan.tagsoup.Parser parser = new org.ccil.cowan.tagsoup.Parser();
MyHandler handler = new MyHandler();
parser.setContentHandler(handler);
parser.parse(new org.xml.sax.InputSource(inputStream)); // inputStream: an input stream to your HTML file
return handler.title.toString();
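If you want to try the handler logic before adding TagSoup to your classpath, here is a minimal sketch that feeds the same kind of handler to the JDK's built-in SAX parser. Note the swap: the JDK parser only accepts well-formed XHTML, so it is not a replacement for TagSoup on real-world pages. The names SaxTitleDemo, TitleHandler and extractTitle, and the sample page, are made up for illustration. Namespace awareness is switched on because the handler compares localName, which is otherwise not guaranteed to be filled in.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of TagSoup or HTMLParser.
final class SaxTitleDemo {

    // Same idea as MyHandler: collect text only while inside <title>.
    static final class TitleHandler extends DefaultHandler {
        private boolean readTitle = false;
        final StringBuilder title = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String name, Attributes attributes) {
            if ("title".equals(localName)) {
                readTitle = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String name) {
            if ("title".equals(localName)) {
                readTitle = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (readTitle) {
                title.append(ch, start, length);
            }
        }
    }

    // Returns the trimmed title text, or an empty string if there is no <title>.
    // Assumes the input is well-formed XHTML.
    static String extractTitle(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            SAXParser parser = factory.newSAXParser();
            TitleHandler handler = new TitleHandler();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.title.toString().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello World</title></head>"
                + "<body><p>text</p></body></html>";
        System.out.println(extractTitle(page));
    }
}
```

With TagSoup you would keep the handler as it is and only replace the parser, as in the snippet above.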
Solution 3:
BTW, HTMLParser already ships with a very simple title extractor that you can use: http://htmlparser.sourceforge.net/samples.html
To run it from within the HTMLParser code base, run:
bin/parser http://website_url TITLE
or run
java -jar <path to htmlparser.jar> http://website_url TITLE
or from your code call the method
org.htmlparser.Parser.main(String[] args)
with the parameters new String[] {"<website url>", "TITLE"}
Solution 4:
See "RegEx match open tags except XHTML self-contained tags".
You're smart not to want to use regex for this.
To use an HTML parser, we need to know which language you're using. Since you say you're "on eclipse", I'm going to assume Java.
Take a look at http://www.ibm.com/developerworks/xml/library/x-domjava/ for a description, overview, and various viewpoints.
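To give an idea of what the DOM approach looks like in Java, here is a minimal sketch using the DOM parser that ships with the JDK. It assumes the page is already well-formed XHTML (real-world HTML usually has to be cleaned up first, for example with TagSoup as mentioned in Solution 2). DomTitleDemo, extractTitle and the sample page are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class for illustration only.
final class DomTitleDemo {

    // Returns the trimmed text of the first <title> element, or null if there is none.
    // Assumes the input is well-formed XHTML.
    static String extractTitle(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // Search the whole tree, so the title does not have to be the first node.
            NodeList titles = doc.getElementsByTagName("title");
            if (titles.getLength() == 0) {
                return null;
            }
            return titles.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello World</title></head><body/></html>";
        System.out.println(extractTitle(page));
    }
}
```

Unlike the SAX handler, this builds the whole document in memory, which is simpler to query but heavier for large pages.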
Solution 5:
This is very easy with HtmlAgilityPack; you only need the response of the HTTP request as a string.
String response = httpRequest.getResponseString(); // this line may need a few changes in your code, or none
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//title"); // fetches the title tag from the whole HTML document and returns the first match
String title = node.InnerText; // gives you the title of the page
For example, for <title>helloWorld</title>, node.InnerText contains helloWorld.
OR
String response = httpRequest.getResponseString(); // this line may need a few changes in your code, or none
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);
HtmlNode head = doc.DocumentNode.SelectSingleNode("//head"); // this additional step gets head, which is a single node in the HTML
HtmlNode node = head.SelectSingleNode("title"); // then gets the title from head's children
String title = node.InnerText; // gives you the title of the page