Html Agility Pack Get Content Of
I'm trying to get the content of using Html Agility Pack. Here's a sample of the HTML I'm trying to parse:
Hundreds of thousands of Ukr
Solution 1:
It seems that the New York Times detects that you're not accepting cookies from them, so it presents you with a cookie warning and a login box instead of the article. If you provide a CookieContainer, .NET handles the cookies under the hood and the NYT serves the actual content:
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTestProject3
{
    using System.Net;
    using HtmlAgilityPack;

    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void WhenProvidingCookiesYouSeeContent()
        {
            HtmlDocument doc = new HtmlDocument();
            // WebClientEx attaches the CookieContainer to every request it makes.
            WebClient wc = new WebClientEx(new CookieContainer());
            string contents = wc.DownloadString(
                "http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=1&");
            doc.LoadHtml(contents);
            // SelectNodes returns null when nothing matches, hence the null check first.
            var nodes = doc.DocumentNode.SelectNodes(@"//p[@itemprop='articleBody']");
            Assert.IsNotNull(nodes);
            Assert.IsTrue(nodes.Count > 0);
        }
    }

    // A WebClient that keeps cookies between requests.
    public class WebClientEx : WebClient
    {
        public WebClientEx(CookieContainer container)
        {
            this.container = container;
        }

        private readonly CookieContainer container = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                // Send any stored cookies with the outgoing request.
                request.CookieContainer = container;
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        // Copy cookies set by the server into the shared container.
        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}
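The test above only asserts that matching nodes were found. To actually read the content, one option is to walk the selected nodes and take their InnerText. The following is a minimal sketch, assuming the same XPath as above; the class name ArticleBodyExtractor, the method ExtractParagraphs and the sample markup are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ArticleBodyExtractor
{
    // Returns the decoded text of every <p itemprop="articleBody"> element in the HTML.
    public static List<string> ExtractParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var paragraphs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");
        if (nodes == null)
        {
            return paragraphs;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops the markup; DeEntitize turns entities such as &amp; into characters.
            paragraphs.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return paragraphs;
    }

    public static void Main()
    {
        // Hypothetical sample markup, not taken from the real page.
        string html =
            "<html><body>" +
            "<p itemprop=\"articleBody\">First <b>paragraph</b> &amp; more.</p>" +
            "<p>Unrelated.</p>" +
            "<p itemprop=\"articleBody\">Second paragraph.</p>" +
            "</body></html>";

        foreach (string text in ExtractParagraphs(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Returning an empty list when nothing matches keeps callers from having to deal with the null that SelectNodes produces.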
With thanks to this answer for the extended WebClient class.
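What the CookieContainer does for WebClientEx can be seen in isolation, without any network access. This is only a sketch: the host names, the cookie name and value, and the helper names BuildContainer and HeaderFor are all assumptions chosen for the example.

```csharp
using System;
using System.Net;

public static class CookieContainerDemo
{
    // Builds a container holding one hypothetical cookie for www.example.com.
    public static CookieContainer BuildContainer()
    {
        var container = new CookieContainer();
        container.Add(new Uri("http://www.example.com/"), new Cookie("session", "abc123"));
        return container;
    }

    // Returns the Cookie header value the container would send to the given URL.
    public static string HeaderFor(CookieContainer container, string url)
    {
        return container.GetCookieHeader(new Uri(url));
    }

    public static void Main()
    {
        CookieContainer container = BuildContainer();

        // Same host: the stored cookie is offered for the request.
        Console.WriteLine(HeaderFor(container, "http://www.example.com/articles/1"));

        // Different host: nothing is sent, so the header is empty.
        Console.WriteLine(HeaderFor(container, "http://other.example.org/").Length);
    }
}
```

Assigning such a container to HttpWebRequest.CookieContainer, as WebClientEx does, is what lets cookies set by one response be sent back on later requests to the same site.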
Note
It might be against the NYT terms of use to blatantly scrape news stories off their website.