Access the ETER database via API request using R

Accessing data via API requests has one big advantage: the data are up to date every time you run a request. Implementing an API request requires some initial effort, especially compared to downloading a .csv file, but it pays off whenever you need to repeat a process several times. This blog post shows how the ETER data can be loaded into R using the ETER API.

First, we need to load the necessary packages for making API calls:

#---------------------------#
#        load packages      #
#---------------------------#
library(httr)
library(jsonlite)
library(tidyverse)
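
If any of these packages are missing on your machine, they can be installed once beforehand (standard R, shown here for completeness):

# one-time installation, if needed
install.packages(c("httr", "jsonlite", "tidyverse"))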

Then we log in, since we want to retrieve the full dataset, not only the public one:

#---------------------------#
#            Login          #
#---------------------------#
cred <- '{
"email": "YOURUSERNAME",
"password": "YOURPASSWORD"
}'
url_login <- "https://eter-project.com/api/4.0/users/login"
result_login <- POST(url_login, body = cred, content_type_json())
 
# convert the raw response bytes into text
login_raw <- rawToChar(result_login$content)
 
# transform json into workable format for R
mylogin <- fromJSON(login_raw, flatten = TRUE)
mytoken <- mylogin$token
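
Before continuing, it is worth verifying that the login actually succeeded. A minimal sanity check with httr's status helper could look like this (a sketch, not part of the original workflow):

# stop early if the login failed
stop_for_status(result_login)
# the token should be a non-empty string
stopifnot(is.character(mytoken), nchar(mytoken) > 0)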

Now we can make some requests. We have the following options:

  • get all data at once
  • filter the data by country and/or year (in the examples below we extract data for Austria in 2015, and for several countries and years at once).

We provide examples for both options below. Be aware that you need to add the token, prefixed with the string “Bearer”, to the request header in order to receive the unrestricted dataset.

#---------------------------#
#         API requests      #
#---------------------------#
##  get all data at once  ##
url <- "https://eter-project.com/api/4.0/HEIs/full"
# request for all data
raw_result <- GET(url, add_headers(Authorization = paste("Bearer", mytoken)))
##  filter data within API request  ##
#  build a query
myquery <- '{
    "filter": {"BAS.REFYEAR.v": 2015, "BAS.COUNTRY.v": "AT"},
    "searchTerms": []
    }'
#  a query for several years and countries at once
myquery2 <- '{
    "filter": {"BAS.REFYEAR.v": { "$in": [2013, 2014, 2015, 2016]},
    "BAS.COUNTRY.v": { "$in": ["AL", "AT", "BG", "LV"]}},
    "fieldIds": {},
    "searchTerms": []
    }'
# now make the request for a filtered query
url <- "https://www.eter-project.com/api/4.0/HEIs/query"
raw_result <- POST(url, body = myquery, content_type_json(),
    add_headers(Authorization = paste("Bearer", mytoken)))
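
The second query can be sent in exactly the same way; only the request body changes (a sketch reusing the objects defined above):

# request for several years and countries at once
raw_result2 <- POST(url, body = myquery2, content_type_json(),
    add_headers(Authorization = paste("Bearer", mytoken)))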

After the API request, we need to do some data processing to get the data from a JSON format into a workable format for R:

#---------------------------#
#       data processing     #
#---------------------------#
# check the status code (expect "200" for the GET request for all HEIs, "201" for the POST request)
raw_result$status_code
# convert the raw response bytes into text
this.raw.content <- rawToChar(raw_result$content)
# encoding to UTF-8
Encoding(this.raw.content) <- "UTF-8"
# transform JSON into workable format for R
mydata <- fromJSON(this.raw.content, flatten = TRUE)

Now you can inspect the data to make sure it has the right structure:

# check if mydata is a dataframe
class(mydata)
# check column names
colnames(mydata)
# check the number of rows and columns
dim(mydata)
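
If you prefer a compact overview, glimpse() from dplyr (loaded as part of the tidyverse) prints the column names, types, and first values in one call; here restricted to the first ten columns to keep the output short:

# compact overview of the first 10 columns
glimpse(mydata[, 1:10])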

Converting the data from JSON can leave escaped quotation marks (\") inside string fields such as the institution names. We remove them with str_replace_all() (the plain str_replace() only replaces the first occurrence, which is why it would have to be run twice per column):

# replace \" by nothing (run each row 2x to replace everything)
mydata$BAS.INSTNAME.v <- str_replace(mydata$BAS.INSTNAME.v, '[\"]', '')
mydata$BAS.INSTNAME.v <- str_replace(mydata$BAS.INSTNAME.v, '[\"]', '')
mydata$BAS.INSTNAMEENGL.v <- str_replace(mydata$BAS.INSTNAMEENGL.v, '[\"]', '')
mydata$BAS.INSTNAMEENGL.v <- str_replace(mydata$BAS.INSTNAMEENGL.v, '[\"]', '')
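
To confirm that the cleaning worked, you can count the remaining quotation marks; both calls should return 0 (a quick check, not part of the original workflow):

# both should return 0 after cleaning
sum(str_detect(mydata$BAS.INSTNAME.v, fixed('"')), na.rm = TRUE)
sum(str_detect(mydata$BAS.INSTNAMEENGL.v, fixed('"')), na.rm = TRUE)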

You may have noticed that there are more than 1,000 columns. In reality, ETER has over 600 unique variables, but because special codes and variable codes have to be stored alongside the values (depending on the user's access rights), several pieces of data are kept per cell. Therefore you will see columns with the suffixes .code, .r and .v. For data analysis, only the columns with the suffixes .v and .code are relevant, so we can get rid of the columns with the suffix .r. We will also do some additional data manipulation, depending on your needs:

  • As already stated, we remove all columns with the suffix .r.
  • We drop the first 3 columns, since they contain metadata about each data row that is not relevant for data analysis.
  • We drop the .v and .code suffixes from the column names.
  • We create two datasets (see the toy example after this list):
    • data, which contains the data as delivered, showing either numbers or value codes (“m”, “a”, “x”, etc.). Mixed columns are of class “character” in R and cannot be used for calculations.
    • data_numeric, where value codes are treated as NA (not available). This has the advantage that the affected columns are of type “numeric” and can be used for calculations.
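
To see why this split matters, here is a toy column that mixes numbers and value codes (hypothetical values, for illustration only):

# a mixed column is of class "character" and cannot be used in calculations
students <- c("1200", "m", "850", "a")
class(students)                          # "character"
# treating the codes as NA makes the column numeric and usable
sum(as.numeric(students), na.rm = TRUE)  # 2050 (codes become NA, with a warning)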

Here are all steps for data manipulation:

#---------------------------#
#      data manipulation    #
#---------------------------#
# drop all columns whose name ends with ".r" (the "$" anchors the pattern at the
# end of the name, so other columns containing "r" are not dropped by accident)
mydata <- mydata[, !grepl("\\.r$", colnames(mydata))]
# drop the first 3 columns of the dataframe (row metadata)
mydata <- mydata[, -c(1:3)]
# separate the data into ".code" columns and ".v" columns (== the numeric dataset)
data <- mydata[, !grepl("\\.v$", colnames(mydata))]
data_numeric <- mydata[, !grepl("\\.code$", colnames(mydata))]
# delete ".code" and ".v" from the variable names
names(data) <- gsub(pattern = "\\.code$", replacement = "", x = names(data))
names(data_numeric) <- gsub(pattern = "\\.v$", replacement = "", x = names(data_numeric))
# create a copy of "data_numeric" for merging, since all columns must be
# of class character before we can coalesce them with "data"
data_numeric2 <- data_numeric
# before merging, the column classes must be the same
data_numeric2 <- data_numeric2 %>%
    mutate(across(everything(), as.character))
 
# coalesce "data" and "data_numeric2" column by column, so NAs in one
# dataset are filled from the other
data <- list(data, data_numeric2) %>%
    transpose(union(names(data), names(data_numeric2))) %>%
    map_dfc(. %>% compact %>% invoke(coalesce, .))
 
# also add "data" back into "data_numeric", in order to include the variables
# that only contain codes
data2 <- data
 
# convert "m", "a", "x", "xc", "xr" etc. into NA
data2[data2 == "m" | data2 == "a" | data2 == "x" | data2 == "" | data2 == "xr" |
    data2 == "xc" | data2 == "x " | data2 == "m "] <- NA
data_numeric <- merge(data_numeric, data2, all.x = TRUE)
# columns that contain only NA are of type character and thus cannot be used
# for calculations -> find them and convert them to numeric
colAllNA <- sapply(data_numeric, function(x) all(is.na(x)))
colnamesNA <- names(data_numeric)[colAllNA == TRUE]
data_numeric[colnamesNA] <- lapply(data_numeric[colnamesNA], as.numeric)
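
As a quick usage example (a sketch; BAS.COUNTRY and BAS.REFYEAR are the suffix-stripped names of the variables used in the queries above), you can now summarise the numeric dataset directly:

# example: number of institutions per country and year
data_numeric %>%
    count(BAS.COUNTRY, BAS.REFYEAR)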

Congratulations, you now have everything in place for your analysis. And the best part: when new data are released, just rerun your script and your analysis is up to date!