Html Agility Pack Get Content Of

June 27, 2023

I'm trying to get the content of the <p itemprop="articleBody"> elements using HTML Agility Pack. Here's a sample of the HTML I'm trying to parse:

Hundreds of thousands of Ukr

Solution 1:

It seems that the New York Times detects that you're not accepting cookies from them. When that happens, they serve a cookie warning and a login box instead of the article. If you provide a CookieContainer, .NET handles the cookie exchange under the hood and NYT serves the actual article content:

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTestProject3
{
    using System.Net;

    using HtmlAgilityPack;

    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void WhenProvidingCookiesYouSeeContent()
        {
            HtmlDocument doc = new HtmlDocument();

            // Download the page with a cookie-aware client so the site
            // serves the article instead of the cookie warning.
            string contents;
            using (WebClient wc = new WebClientEx(new CookieContainer()))
            {
                contents = wc.DownloadString(
                    "http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=1&");
            }

            doc.LoadHtml(contents);

            // SelectNodes returns null (not an empty collection) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes(@"//p[@itemprop='articleBody']");

            Assert.IsNotNull(nodes);
            Assert.IsTrue(nodes.Count > 0);
        }
    }

    // WebClient that attaches a CookieContainer to every HTTP request.
    public class WebClientEx : WebClient
    {
        private readonly CookieContainer container;

        public WebClientEx(CookieContainer container)
        {
            this.container = container;
        }

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                // Send stored cookies and collect new ones on this request.
                request.CookieContainer = container;
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        // Copies the cookies from an HTTP response into the shared container.
        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}

With thanks to this answer for the extended WebClient class.
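
To see in isolation what the CookieContainer does for you, here is a minimal sketch that needs no network access. It stores one cookie for a site and asks the container which Cookie header it would send for a later request. The class name CookieContainerDemo, the helper CookieHeaderFor, the example.com URLs and the cookie name and value are all made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Net;

public static class CookieContainerDemo
{
    // Stores one cookie as if siteUri had set it, then returns the Cookie
    // header the container would attach to a request for pageUri.
    // (Hypothetical helper, for illustration only.)
    public static string CookieHeaderFor(Uri siteUri, Uri pageUri, string name, string value)
    {
        var container = new CookieContainer();

        // Path "/" makes the cookie valid for every page on the site;
        // the domain is taken from siteUri.
        container.Add(siteUri, new Cookie(name, value, "/"));

        // Returns an empty string when no stored cookie matches pageUri.
        return container.GetCookieHeader(pageUri);
    }

    public static void Main()
    {
        var site = new Uri("http://www.example.com/");
        var page = new Uri("http://www.example.com/articles/1.html");
        var otherHost = new Uri("http://other.example.org/");

        // Same host: the stored cookie is replayed.
        Console.WriteLine(CookieHeaderFor(site, page, "session", "abc123"));

        // Different host: nothing is sent.
        Console.WriteLine(CookieHeaderFor(site, otherHost, "session", "abc123").Length);
    }
}
```

This is the bookkeeping WebClientEx delegates to the container: cookies set by one response are matched by domain and path and sent back on the next request to the same site.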

Note

It might be against the NYT terms of use to blatantly scrape news stories off their website.
