May 12, 2021 R language tutorial
Many websites provide data for their users to consume. For example, the World Health Organization (WHO) provides reports of health and medical information in the form of CSV, TXT and XML files. Using R, we can programmatically extract specific data from such websites. Some of the packages used in R to extract data from the web are "RCurl", "XML" and "stringr". They are used to connect to URLs, identify the links to the required files, and download them to the local environment.
The following packages are required to process URLs and links to files. If they are not available in your R environment, you can install them using the following commands.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
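If some of these packages are already installed, reinstalling them is unnecessary. The following is a minimal sketch that reports only the packages that cannot be loaded. The helper name missing_packages is hypothetical and is not part of any package; it relies only on base R's requireNamespace().

```r
# Hypothetical helper: return the names in `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!available]
}

# The four packages used in this tutorial.
required <- c("RCurl", "XML", "stringr", "plyr")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that the check can be run without changing the library.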
We will access weather data published at a URL and use R to download the CSV files for the year 2015.
We will use the function getHTMLLinks() to collect the URLs of the files. Then we'll use the function download.file() to save each file to the local system. Since we're going to apply the same code to multiple files, we're going to create a function that is called once per file. The file names are passed to this function as an R list object.
# Load the packages that provide getHTMLLinks(), str_detect()/str_c() and l_ply().
library(XML)
library(stringr)
library(plyr)

# Read the URL.
url <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"

# Gather the html links present in the webpage.
links <- getHTMLLinks(url)

# Identify only the links which point to the JCMB 2015 files.
filenames <- links[str_detect(links, "JCMB_2015")]

# Store the file names as a list.
filenames_list <- as.list(filenames)

# Create a function to download one file, given the base URL and a file name.
downloadcsv <- function(mainurl, filename) {
  filedetails <- str_c(mainurl, filename)
  download.file(filedetails, filename)
}

# Now apply the l_ply function and save the files into the current R working directory.
l_ply(filenames_list, downloadcsv, mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
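The filtering and URL-building steps can be tried without a network connection. The sketch below assumes a small hand-written links vector standing in for the result of getHTMLLinks(); those link names are made up for illustration. It also swaps in the base R functions grepl() and paste0() for str_detect() and str_c(), so it runs without any extra packages.

```r
# Hypothetical links, standing in for what getHTMLLinks(url) might return.
links <- c("JCMB_2014.csv", "JCMB_2015.csv", "JCMB_2015_Jan.csv", "index.html")
mainurl <- "http://www.geos.ed.ac.uk/~weather/jcmb_ws/"

# Keep only the links whose name contains "JCMB_2015" (literal match).
filenames <- links[grepl("JCMB_2015", links, fixed = TRUE)]

# Build the full URL of each file by prepending the base URL.
fileurls <- paste0(mainurl, filenames)
print(fileurls)
```

Each element of fileurls is what the download function would pass to download.file() as its first argument.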
After you run the above code, you can find the following files in the current R working directory.
"JCMB_2015.csv" "JCMB_2015_Apr.csv" "JCMB_2015_Feb.csv" "JCMB_2015_Jan.csv" "JCMB_2015_Mar.csv"
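One way to confirm that the downloads succeeded is to check for the files by name. This is a minimal sketch; the helper name check_downloads is hypothetical, and the expected vector simply repeats the file names listed above.

```r
# Hypothetical helper: report which of `files` exist in directory `dir`.
check_downloads <- function(files, dir = getwd()) {
  present <- file.exists(file.path(dir, files))
  setNames(present, files)
}

# File names taken from the listing above.
expected <- c("JCMB_2015.csv", "JCMB_2015_Apr.csv", "JCMB_2015_Feb.csv",
              "JCMB_2015_Jan.csv", "JCMB_2015_Mar.csv")

status <- check_downloads(expected)

# Print the names of any files that are still missing.
print(names(status)[!status])
```

If nothing is missing, the printed vector is empty.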