
Extracting Html Table From A Website In R

Hi, I am trying to extract the table from the Premier League website. I am using the rvest package, and the code I am using in the initial phase is as follows: library(rv

Solution 1:

Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want, but if you use PhantomJS as a headless browser within RSelenium, it's not all that complicated (by RSelenium standards):

library(RSelenium)
library(rvest)

# initialize browser and driver with RSelenium
# (phantom() is defunct in recent RSelenium releases; see ?rsDriver for the current approach)
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()

# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]

# clean up
rd$close()
ptm$stop()

# parse with rvest
df <- html %>% read_html() %>% 
    html_node('#ismr-event-history table.ism-table') %>% 
    html_table() %>% 
    setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
    setNames(gsub('\\s', '_', names(.)))

str(df)
## 'data.frame':    20 obs. of  10 variables:
##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...

As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows are images, which is why the last column is all NA, but you can calculate those values anyway.
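
As a minimal sketch of that cleaning step, the snippet below retypes the first four rows of a few columns from the str() output by hand, so it runs without a browser, and applies the same mutate_if() call. It assumes dplyr and readr are installed; the names df_sample and df_clean are made up for illustration.

```r
library(dplyr)
library(readr)

# Hand-typed sample: first four rows of four columns from the str() output.
# "\u00a3" is the pound sign, escaped to avoid source-encoding problems.
df_sample <- data.frame(
  Gameweek       = c("GW1", "GW2", "GW3", "GW4"),
  Gameweek_Rank  = c("2,406,373", "2,659,789", "541,258", "905,524"),
  Overall_Points = c("34", "81", "134", "185"),
  Value          = c("\u00a3100.0", "\u00a3100.0", "\u00a399.9", "\u00a3100.0"),
  stringsAsFactors = FALSE
)

# parse_number() drops non-numeric text around the number
# and removes the thousands separators
df_clean <- df_sample %>% mutate_if(is.character, parse_number)

str(df_clean)
```

Note that this also turns Gameweek into a plain number (1, 2, 3, ...), which may or may not be what you want; excluding that column from the conversion is one option.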

Solution 2:

This solution uses RSelenium along with the XML package. It assumes that you have a working installation of RSelenium that can drive Firefox, so make sure the directory containing the Firefox starter script is on your PATH.

If you are using OS X, you will need to add /Applications/Firefox.app/Contents/MacOS/ to your PATH. Or, if you're on an Ubuntu machine, it's likely /usr/lib/firefox/. Once you're sure this is working, you can move on to R with the following:

# Install RSelenium and XML for R
#install.packages("RSelenium")
#install.packages("XML")

# Import packages
library(RSelenium)
library(XML)

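# Check and start servers for Selenium
# (checkForServer() and startServer() are defunct in recent RSelenium releases; see ?rsDriver for the current approach)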
checkForServer()
startServer()

# Use firefox as a browser and a port that's not used
remote_driver <- remoteDriver(browserName="firefox", port=4444)
remote_driver$open(silent = TRUE)

# Use RSelenium to browse the site
epl_link <- "https://fantasy.premierleague.com/a/entry/767830/history"
remote_driver$navigate(epl_link)
elem <- remote_driver$findElement(using="class", value="ism-table")

# Get the HTML source (getElementAttribute() returns a list, so take its first element)
elemtxt <- elem$getElementAttribute("outerHTML")[[1]]

# Use the XML package to work with the HTML source
elem_html <- htmlTreeParse(elemtxt, useInternalNodes = TRUE, asText = TRUE)

# Convert the table into a dataframe
games_table <- readHTMLTable(elem_html, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Change the column names into something legible
names(games_table) <- unlist(lapply(strsplit(names(games_table), split = "\\n\\s+"), function(x) x[2]))
names(games_table) <- gsub("£", "Value", gsub("#", "CPW", gsub("Â","",names(games_table))))

# Convert the fields into numeric values
games_table <- transform(games_table, GR = as.numeric(gsub(",","",GR)),
                    OP = as.numeric(gsub(",","",OP)),
                    OR = as.numeric(gsub(",","",OR)),
                    Value = as.numeric(gsub("£","",Value)))

This should yield:

GW    GP  PB  GR       TM  TC  OP    OR       Value  CPW
GW1   34   1  2406373   0   0    34  2406373  100.0
GW2   47   6  2659789   0   0    81  2448674  100.0
GW3   53   9   541258   2   0   134  1914025   99.9
GW4   51   7   905524   0   0   185  1461665  100.0
GW5   66  14   379438   3   4   247   958889  100.1
GW6   66   2   303704   2   4   309   510376   99.9
GW7   65   9   138792   2   4   370   232474   99.8
GW8   63   3   108363   0   0   433    87967  100.4
GW9   48   8  1114609   2   0   481    75385  100.9
GW10  90   2    71210   0   0   571    27716  101.1
GW11  71   2   421706   3   4   638    16083  100.9
GW12  35   9  2798661   2   4   669    31820  101.2
GW13  41   8  2738535   1   0   710    53487  101.1
GW14  82  15   308725   0   0   792    29436  100.2
GW15  55   9  1048808   2   4   843    29399  100.6
GW16  49   8  1801549   0   0   892    35142  100.7
GW17  48   4  2116706   2   0   940    40857  100.7
GW18  42   2  3315031   0   0   982    78136  100.8
GW19  41   9  2600618   0   0  1023    99048  100.6
GW20  53   0  1644385   0   0  1076   113148  100.8

Please note that the column CPW (change from previous week) is a vector of empty strings.
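
Because the arrow images do not survive scraping, the change column can be recomputed from the ranks instead. The sketch below assumes the arrows reflect movement in overall rank (OR) relative to the previous gameweek, which is a guess rather than something confirmed here. The vector or_sample and the labels "up", "down" and "same" are made up for illustration, with rank values copied from the first four gameweeks.

```r
# Overall rank for GW1 to GW4 (assumed to be what the arrows track)
or_sample <- c(2406373, 2448674, 1914025, 1461665)

# Difference from the previous gameweek; the first week has nothing to compare to
rank_diff <- c(NA, diff(or_sample))

# A smaller rank number is better, so a negative difference means moving up
cpw <- ifelse(rank_diff < 0, "up", ifelse(rank_diff > 0, "down", "same"))

print(data.frame(GW = paste0("GW", 1:4), OR = or_sample, CPW = cpw))
```

On the full table, the same idea would apply to games_table$OR once that column has been converted to numeric.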

I hope this helps.
