Trouble Scraping A Table Into R
I am trying and failing to scrape the table of average IQs by country from this web page into R. I'm trying to follow the process described in this blog post, but I can't seem to f
Solution 1:
Thanks to @hrbrmstr's pointer about JavaScript being the issue, I was able to get this done using phantomjs and following this tutorial. Specifically, I:
1. Downloaded phantomjs for Windows from this site to my working directory;
2. In Windows Explorer, manually extracted the file phantomjs.exe to my working directory (I guess I could have done 1 and 2 in R with download.file and unzip, but...);
3. Copied the 'scrape_techstars.js' file shown in the tutorial mentioned above, pasted it into Notepad, edited it to fit my case, and saved it to my working directory as "scrape_iq.js";
4. Back in my R console, ran system("./phantomjs scrape_iq.js");
5. Back in Windows Explorer, looked in my working directory to find the html file created in step 4 ("iq.html"), right-clicked on that html file, and selected Open with > Google Chrome;
6. In the Chrome tab that launched, right-clicked on the page and selected Inspect;
7. Moused over the table I wanted to scrape and looked in the Elements window to confirm that it is a node of type "table"; and, finally,
8. Ran the R code below to scrape that table from "iq.html".
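As an aside, steps 1, 2 and 4 could probably be done without leaving R. The following is only a rough sketch of that idea, not what I actually ran: the function name fetch_and_run_phantomjs and its zip_url argument are made up for illustration, and you would need to supply the real download link for the phantomjs Windows zip yourself.

```r
# Sketch of steps 1, 2 and 4 done from R instead of Windows Explorer.
# `zip_url` is a placeholder argument: pass the real download link for the
# phantomjs Windows zip. Nothing is downloaded until the function is called.
fetch_and_run_phantomjs <- function(zip_url, script = "scrape_iq.js",
                                    workdir = getwd()) {
  zip_path <- file.path(workdir, "phantomjs.zip")

  # Step 1: download the archive (mode = "wb" keeps the binary intact on Windows)
  download.file(zip_url, destfile = zip_path, mode = "wb")

  # Step 2: locate phantomjs.exe inside the archive and extract only that file,
  # dropping the archive's internal folder structure
  listing <- unzip(zip_path, list = TRUE)
  exe_entry <- grep("phantomjs\\.exe$", listing$Name, value = TRUE)
  if (length(exe_entry) != 1) {
    stop("expected exactly one phantomjs.exe in the archive")
  }
  unzip(zip_path, files = exe_entry, exdir = workdir, junkpaths = TRUE)

  # Step 4: run the scraping script; system2() returns the exit status
  exe <- file.path(workdir, "phantomjs.exe")
  status <- system2(exe, args = shQuote(script))
  invisible(status)
}
```

Calling it with the download link as the only argument should leave "iq.html" in the directory R is running from, assuming "scrape_iq.js" is already saved there.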
Here's the .js file I created in step 3 and used in step 4:
// scrape_iq.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'iq.html';  // output file, written to the working directory

page.open('https://iq-research.info/en/page/average-iq-by-country', function (status) {
  // Save the page as rendered by phantomjs, then quit
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});
Here's the R code I used in step 8 to scrape the table from the local html file phantomjs had created and get a data frame called IQ in my workspace.
library(dplyr)  # supplies the %>% pipe
library(rvest)
IQ <- read_html("iq.html") %>%
  html_nodes('table') %>%
  html_table() %>%
  .[[1]]  # keep the first table on the page
And here's the head of the data frame that produced:
> head(IQ)
  Rank     Country  IQ
1    1   Hong Kong 108
2    1   Singapore 108
3    2 South Korea 106
4    3       Japan 105
5    3       China 105
6    4      Taiwan 104
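If you want to try the rvest part on its own, without phantomjs or the saved file, something like the following should work. It is a minimal sketch: the HTML string is an invented stand-in for "iq.html" (three rows copied from the output above), and demo is just an illustrative name.

```r
library(rvest)

# Invented stand-in for the saved page: one small <table> with a header row
html <- read_html('
  <html><body>
    <table>
      <tr><th>Rank</th><th>Country</th><th>IQ</th></tr>
      <tr><td>1</td><td>Hong Kong</td><td>108</td></tr>
      <tr><td>1</td><td>Singapore</td><td>108</td></tr>
      <tr><td>2</td><td>South Korea</td><td>106</td></tr>
    </table>
  </body></html>')

# html_nodes() selects every <table> node; html_table() returns a list
# with one data frame per selected table
tables <- html_table(html_nodes(html, "table"))

# Same idea as .[[1]] in the pipeline: keep the first (here, only) table
demo <- tables[[1]]
```

The [[1]] matters because html_table() hands back a list even when the page has only one table.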