
Html Agility Pack Get Content Of

I'm trying to get the content of an element using the HTML Agility Pack. Here's a sample of the HTML I'm trying to parse:

Hundreds of thousands of Ukr

Solution 1:

It seems that the New York Times detects that you're not accepting cookies from them and responds with a cookie warning and a logon box instead of the article. If you provide a CookieContainer, .NET handles the cookie exchange under the hood and NYT serves the actual content:

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTestProject3
{
    using System.Net;
    using System.Runtime;

    using HtmlAgilityPack;

    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void WhenProvidingCookiesYouSeeContent()
        {
            HtmlDocument doc = new HtmlDocument();

            WebClient wc = new WebClientEx(new CookieContainer());

            string contents = wc.DownloadString(
                "http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=1&");
            doc.LoadHtml(contents);

            var nodes = doc.DocumentNode.SelectNodes(@"//p[@itemprop='articleBody']");

            Assert.IsNotNull(nodes);
            Assert.IsTrue(nodes.Count > 0);
        }
    }

    public class WebClientEx : WebClient
    {
        public WebClientEx(CookieContainer container)
        {
            this.container = container;
        }

        private readonly CookieContainer container = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                request.CookieContainer = container;
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}
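
The test above only asserts that matching nodes exist. To read the paragraph text itself, something along these lines may work. This is a minimal sketch, not part of the original answer: the class name ArticleText and the method ExtractParagraphs are hypothetical, and it assumes the article paragraphs are marked up as p elements with itemprop='articleBody', as in the XPath used above.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, introduced only for illustration.
public static class ArticleText
{
    // Returns the decoded, trimmed text of every <p itemprop="articleBody"> in the HTML.
    public static List<string> ExtractParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText leaves HTML entities such as &amp; encoded, so decode them.
            result.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }

        return result;
    }
}
```

Returning an empty list instead of null keeps callers from having to repeat the null check that SelectNodes requires.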

With thanks to this answer for the extended WebClient class.
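
Subclassing WebClient is one way to attach a CookieContainer. On .NET versions that ship HttpClient, HttpClientHandler exposes a CookieContainer property directly, so a similar effect can probably be had without a subclass. The following is a sketch under that assumption, not something the original answer tested: the class name CookieAwareFetcher and its methods are hypothetical, and whether a given site then serves its content depends on the site.

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, introduced only for illustration.
public static class CookieAwareFetcher
{
    // Builds a handler that stores cookies from responses in the given
    // container and sends them back on subsequent requests (including redirects).
    public static HttpClientHandler CreateHandler(CookieContainer cookies)
    {
        return new HttpClientHandler
        {
            CookieContainer = cookies,
            UseCookies = true,
            AllowAutoRedirect = true
        };
    }

    // Downloads the page body as a string using a fresh cookie container.
    public static async Task<string> DownloadAsync(string url)
    {
        var cookies = new CookieContainer();

        // Disposing the client also disposes the handler it owns.
        using (var client = new HttpClient(CreateHandler(cookies)))
        {
            return await client.GetStringAsync(url);
        }
    }
}
```

The returned string can be passed to HtmlDocument.LoadHtml exactly as the contents variable is in the test above.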

Note

It might be against the NYT terms of use to scrape news stories off their website.
