---
title: "The present and future status of open government data"
author: "STAT 448, Assignment 1-2, Fernando Cagua, March 2017"
output:
pdf_document: default
html_document: default
header-includes:
- \usepackage{setspace}
bibliography: references.bib
urlcolor: magenta
---
\onehalfspacing
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(ggplot2)
library(magrittr)
library(ggrepel)
library(stringdist)
library(survival)
library(rgdal)
library(ggmap)
library(rgeos)
library(maptools)
library(broom)
fer_theme <- theme_bw() +
theme(text = element_text(family = "Helvetica"),
title = element_text(size = 7, hjust = 0),
legend.title = element_text(size = 8),
legend.text = element_text(size = 7),
axis.text = element_text(size = 7),
axis.title = element_text(size = 8, hjust = 0.5),
strip.text = element_text(size = 8, hjust = 0),
strip.background = element_blank(),
plot.margin = grid::unit(c(5, 0.5, 2, 0), "mm"),
panel.grid = element_blank()
)
```
In order to fulfill their functions, governments tend to collect large amounts of data that are high in volume, variety, veracity, and arguably velocity.
Hence, government data as a whole seems to be a good candidate to receive the "big data" label.
Nevertheless, government data also has the potential to be big data by another, perhaps more important, criterion: value.
Few organizations have the same potential as governments to use their data to transform society.
It has been shown that the value of government datasets can be even larger when they are open, that is, when they can be freely used and shared by anyone.
This is because the benefits of open data accrue not only to civil society but also to the government itself, ranging from improved transparency and citizen empowerment to better public services and indirect effects on the wider economy [@Ubaldi2013].
For example, the potential global value of open data has been estimated at $3 trillion [@Chui2013], and open data is projected to create 100,000 direct jobs in Europe alone by 2020 [@Carrara2015].
Despite the clear advantages, a large proportion of countries still have a limited or nonexistent public open data infrastructure.
While many civic initiatives attempt to ameliorate this issue by sharing independently collected data, they are usually unable to substitute for the value of government datasets.
Large initiatives like the [Global Partnership for Sustainable Development Data](http://www.data4sdgs.org/), which unites governments and NGOs with the aim of fostering the application of open government data to achieve the 2030 UN Sustainable Development Goals, are trying to remedy the situation.
In a prime example, the [Open Government Partnership](https://www.opengovpartnership.org), a multi-country alliance in which governments commit to be "more open, accountable, and responsive to citizens", has grown from eight countries in 2011 to 75 in 2016 [@OGP2016].
More and more governments are committing to opening public data, which is certainly important progress.
However, monitoring the *actual* extent to which government data is being made open is more challenging.
Currently, three indices attempt to quantify this progress.
The [Open Data Inventory](http://odin.opendatawatch.com/) includes 173 countries and focuses on gaps in the official statistics provided by national statistics offices [@ODW2016].
The [Open Data Barometer](http://opendatabarometer.org/) includes 92 countries and relies on expert assessments of the readiness, implementation, and impact of open data [@OBD2015].
Finally, the [Open Data Index](http://index.okfn.org/) includes 122 countries and relies on peer-reviewed, crowdsourced data from individuals and citizen initiatives. It goes beyond national statistics and evaluates the status of open data in dimensions such as legislation, GIS, pollutants, and land ownership [@Foundation2017].
Although these indices provide important insight into the current status of open government data, they do not provide all the information required to analyze its evolution, because many governments embraced open data policies before the indices were developed.
Understanding the macro factors that drive the implementation of open government data, and predicting future trends, would be a valuable asset for designing strategies to speed up the adoption of the "data revolution" across countries.
```{r}
country_names <- readr::read_csv("./data/official_country_names.csv")
data_history <- readr::read_csv("./data/open_data_history.csv")
# manually fix weird names
alternative_country_names <- "./data/alternative_country_names.csv" %>%
readr::read_csv()
data_history %<>%
dplyr::left_join(alternative_country_names) %>%
dplyr::mutate(country_name = dplyr::if_else(
is.na(country_name),
Country,
country_name
))
# fuzzy-match country names to the official UN names; insertions are penalised
# less (0.1) than deletions, substitutions, and transpositions
dm <- stringdistmatrix(tolower(country_names$name),
tolower(data_history$country_name), method = "osa",
weight = c(d = 1, i = 0.1, s = 1, t = 1))
data_history$official_name <- NA
# each column of dm corresponds to a row of data_history; pick the closest
# official name for each country
for (i in 1:ncol(dm)) {
data_history$official_name[i] <- country_names$name[which.min(dm[, i])]
}
data_history %<>%
dplyr::left_join(country_names, by = c("official_name" = "name"))
```
```{r}
odi <- readr::read_csv("./data/open_data_index.csv") %>%
dplyr::mutate(id = toupper(id)) %>%
dplyr::rename(odi_country_name = name,
odi_region = region,
odi_rank = rank)
odb <- readr::read_csv("./data/odb_index.csv") %>%
dplyr::select(-Region, -Country)
data_history %<>%
dplyr::mutate(i_start_date = dplyr::if_else(is.na(start_date),
Sys.Date(),
start_date),
date_rank = rank(i_start_date, ties.method = "average"))
rank_comparison <- data_history %>%
dplyr::left_join(odi, by = c("alpha2" = "id")) %>%
dplyr::left_join(odb, by = c("alpha3" = "ISO3"))
odi_cor <- rank_comparison %$%
cor(as.numeric(i_start_date), as.numeric(as.character(score_2014)),
method = "spearman", use = "complete.obs")
obd_cor <- rank_comparison %$%
cor(as.numeric(i_start_date), ODB.Score.Scaled,
method = "spearman", use = "complete.obs")
```
Here, I do precisely that.
Specifically, I use the date on which a country opened an open government data portal (such as [data.govt.nz](https://data.govt.nz/)) as a proxy for that country's support for open data.
Although the opening date is not able to accurately measure the quantity and quality of public data, I found it to be highly correlated with both the Open Data Index and the Open Data Barometer (Spearman correlation coefficients of `r -round(odi_cor, 2)` and `r -round(obd_cor, 2)`, respectively).
Open data portals are a good indication of the progress of open data because, by making datasets discoverable and managing metadata, they have the potential to accelerate the creation of value [@Attard2015].
To obtain the web addresses of the open data portals, I curated an automated search that returned the first 10 results of a Google search, in an English locale, for the string "`Open Data + [country]`" for each of the 193 United Nations member states.
I then obtained an approximate opening date for each portal by automatically retrieving the date on which the site was first archived by the Wayback Machine, which keeps historical snapshots of billions of URLs over time.
These data could easily be improved by performing the searches in local languages. The code can be found in [github:efcaguab/open_data_history](https://github.com/efcaguab/open_data_history).
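As an illustration, the chunk below is a minimal, non-evaluated sketch of how the first-capture date of a single portal could be retrieved from the Wayback Machine CDX API; the `httr` and `jsonlite` calls, the `first_snapshot_date()` helper, and the example URL are illustrative assumptions rather than the actual pipeline in the repository.
```{r, eval=FALSE}
# Minimal sketch (not the pipeline used for this analysis): query the Wayback
# Machine CDX API for the earliest snapshot of a portal URL.
library(httr)
library(jsonlite)

first_snapshot_date <- function(portal_url) {
  # The CDX API returns captures in ascending order of timestamp, so limiting
  # the response to one row yields the earliest recorded snapshot.
  res <- GET("http://web.archive.org/cdx/search/cdx",
             query = list(url = portal_url, output = "json",
                          fl = "timestamp", limit = 1))
  parsed <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
  # the first row is the header; no second row means no snapshot exists
  if (!is.matrix(parsed) || nrow(parsed) < 2) return(as.Date(NA))
  as.Date(parsed[2, 1], format = "%Y%m%d")
}

first_snapshot_date("data.govt.nz")
```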
Using this methodology, I found that the adoption of open government data is not homogeneous across regions (Figure 1).
Europe seems to be at the vanguard in terms of support for open government data portals.
In particular, there was a period of rapid growth between 2012 and 2014, when most Western European countries launched their portals.
In Asia and the Americas the largest growth occurred after 2014, and these regions are currently on track to catch up with the European countries.
Growth in Africa and Oceania has been rather moderate and, with a few exceptions, governments are yet to embrace open data.
Using the historical records also allows us to identify the pioneers of open data.
The USA, UK, Norway, Australia, and New Zealand all launched their open government data portals at the earliest dates, setting an example for other countries to follow.
```{r}
read_wb <- function(file, varname) {
file %>%
readr::read_csv(skip = 4) %>%
dplyr::select(dplyr::contains("Country Code"),
dplyr::matches("[0-9]")) %>%
reshape2::melt("Country Code", variable.name = "year", value.name = "var") %>%
dplyr::filter(!is.na(var)) %>%
dplyr::group_by(`Country Code`) %>%
dplyr::mutate(year = as.numeric(as.character(year)),
last_year = max(year),
var = as.numeric(as.character(var))) %>%
dplyr::filter(year == last_year) %>%
dplyr::select(-last_year) %>%
dplyr::rename_(.dots = setNames("var", varname)) %>%
dplyr::rename_(.dots = setNames("year", paste("year", varname, sep = ".")))
}
netp <- read_wb("./data/wb_net_penetration.csv", "netp")
gdp <- read_wb("./data/wb_gdp_per_capita.csv", "gdppc")
pop <- read_wb("./data/wb_population_size.csv", "pop")
study_start <- data_history$start_date %>% min(na.rm = T) - 2
study_end <- as.Date("2017-03-28")
surv_history <- data_history %>%
dplyr::left_join(netp, by = c("alpha3" = "Country Code")) %>%
dplyr::left_join(gdp, by = c("alpha3" = "Country Code")) %>%
dplyr::left_join(pop, by = c("alpha3" = "Country Code")) %>%
dplyr::mutate(time = 1,
time2 = difftime(start_date, study_start, units = "day"),
time2 = ifelse(is.na(time2),
difftime(study_end, study_start, units = "day"),
time2),
event = ifelse(is.na(start_date),0,1),
tnetp = netp + 2) %>%
dplyr::filter(!is.na(netp), !is.na(gdppc))
aft3 <- survreg(Surv(time2, event) ~ gdppc + tnetp, data = surv_history)
aft2 <- survreg(Surv(time2, event) ~ gdppc, data = surv_history)
aft1 <- survreg(Surv(time2, event) ~ tnetp, data = surv_history)
surv_history$prediction <- predict(aft3)
# standard errors are taken from the internet-penetration-only model (aft1)
surv_history$pred_se <- predict(aft1, se.fit = T)$se.fit
surv_history$residuals <- residuals(aft3)
surv_history %<>%
dplyr::mutate(optimistic = study_start + prediction - pred_se*2,
pesimistic = study_start + prediction + pred_se*2,
prediction_n = study_start + prediction)
```
```{r}
cum_history <- data_history %>%
dplyr::arrange(start_date) %>%
plyr::ddply("region", function(x){
d <- dplyr::data_frame(dens = ecdf(x$start_date)(unique(x$start_date)),
start_date = unique(x$start_date)) %>%
dplyr::mutate(n = dens * (sum(!is.na(x$start_date))),
prop = n/nrow(x)) %>%
dplyr::filter(!is.na(start_date))
d %>%
dplyr::bind_rows(dplyr::data_frame(dens = 1,
start_date = Sys.Date(),
n = max(d$n),
prop = max(d$prop))) %>%
dplyr::bind_rows(dplyr::data_frame(dens = 0,
start_date = min(data_history$start_date-90, na.rm = T),
n = 0,
prop = 0))
})
```
```{r, fig.height= 2.5, fig.width=3.5, fig.cap= "Proportion of countries for which national open government data portals exist."}
lege <- cum_history %>%
dplyr::group_by(region) %>%
dplyr::mutate(max_prop = max(prop)) %>%
dplyr::filter(prop == max_prop,
start_date == Sys.Date()) %>%
dplyr::mutate(n_tot = n/prop,
message = paste0(region, " (", n, "/", n_tot, ")"))
cum_history %>%
ggplot(aes(x = start_date)) +
geom_step(aes(y = prop,
colour = region), position = "identity") +
ylab("proportion of countries") +
scale_x_date(name = "", limits = c(min(data_history$start_date-90, na.rm = T),
Sys.Date() + 700)) +
scale_y_continuous(labels = scales::percent) +
# geom_point(data = data_history,
# aes(x = start_date, y = -0.05), size = 0.5) +
geom_label_repel(data = lege,
aes(x = start_date, y = prop, label = message, colour = region),
angle =0, size = 2,
nudge_x = 350,
# segment.color = NA,
segment.size = 0.5,
segment.alpha = 0.5,
point.padding = unit(0.5, "lines")) +
scale_color_brewer(palette = "Set1") +
# ggtitle("National level open government initiatives") +
fer_theme +
theme(legend.position = "none")
```
Although informative, the regional bins do not allow us to understand the relationships between the adoption of open government data policies and potential covariates.
I therefore constructed a model to determine how the gross domestic product per capita and the number of internet users (per 100 people) are related to the launch date of the data portals.
Although many other factors might contribute to the progress of open data, I chose these two variables because they are likely to serve as a proxy for the capacity that countries have to implement and utilize open data.
Specifically, I used a parametric survival regression model under the assumption that the launch date follows a Weibull distribution [@Therneau2000].
Although somewhat simplistic, this model already yields some useful insights.
First, although the two explanatory variables are moderately correlated, the proportion of internet users is a much stronger predictor of the portal launch date than per-capita gross domestic product.
Second, examining the model residuals alongside the launch dates makes it possible to identify open data champions: countries that are "ahead of their time" and launched open government data portals well before what would be expected given their levels of internet penetration and socioeconomic status.
Specifically, the top five is composed of Burkina Faso, Ethiopia, Pakistan, Ghana, and Bangladesh.
Third, the model allows us to forecast the expected date by which open government data portals could be implemented in the countries that have not yet done so (Figure 2).
Under the current trajectory, the most likely outcome is that ~70% of nations will have embraced open government data portals by 2030.
This includes all European countries and most countries in Asia, Oceania, and the Americas, but excludes a large proportion of Central and West African countries as well as poorer, conflict-affected countries in Asia.
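The kind of checks behind these statements can be sketched as follows; this chunk is not evaluated and only illustrates how the fitted objects above (`aft1`, `aft2`, `aft3`) and `surv_history` could be interrogated. The 2030 cut-off and the gap-based ranking are added assumptions, not the exact computations used for the figures.
```{r, eval=FALSE}
# Relative support for the two predictors: a lower AIC for the model with
# internet penetration alone (aft1) than for the GDP-only model (aft2)
# suggests internet penetration is the stronger predictor.
AIC(aft1, aft2, aft3)

# A rough ranking of "ahead of time" countries: the gap (in days) between the
# launch date predicted by the full model and the observed launch date.
surv_history %>%
  dplyr::filter(event == 1) %>%
  dplyr::mutate(days_ahead = as.numeric(prediction_n - start_date)) %>%
  dplyr::arrange(dplyr::desc(days_ahead)) %>%
  dplyr::select(official_name, start_date, netp, gdppc, days_ahead) %>%
  head(5)

# Share of countries (among those with covariate data) whose observed or
# predicted launch date falls before 2030.
launch_or_pred <- dplyr::if_else(is.na(surv_history$start_date),
                                 surv_history$prediction_n,
                                 surv_history$start_date)
mean(launch_or_pred < as.Date("2030-01-01"))
```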
```{r, results="hide"}
categ_lab <- c("< 2010", "2010-15", "2015-20", "2020-25", "2025-30", "> 2030")
categ_breaks <- as.Date(c("2000-01-01",
"2010-06-01",
"2015-06-01",
"2020-06-01",
"2025-06-01",
"2030-06-01",
"2070-06-01"))
surv_history %<>%
dplyr::mutate(optimistic_fix = dplyr::if_else(is.na(start_date),
optimistic,
start_date),
pesimistic_fix = dplyr::if_else(is.na(start_date),
pesimistic,
start_date),
prediction_fix = dplyr::if_else(is.na(start_date),
prediction_n,
start_date)) %>%
dplyr::mutate(
opt_cut = cut(optimistic_fix, breaks = categ_breaks, labels = categ_lab),
pes_cut = cut(pesimistic_fix, breaks = categ_breaks, labels = categ_lab),
pre_cut = cut(prediction_fix, breaks = categ_breaks, labels = categ_lab))
countries <- readOGR("./data/countries.geojson", "OGRGeoJSON")
countries.df <- tidy(countries, region = "iso_a3") %>%
dplyr::filter(id != "ATA")
```
```{r, fig.width= 6.5, fig.height=2.2, fig.cap = "Predicted year of implementation of open government data portals under optimistic and pessimistic scenarios."}
p1 <- countries.df %>%
dplyr::inner_join(surv_history, by = c("id" = "alpha3")) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = opt_cut)) +
scale_fill_brewer(palette = "Reds", na.value = "grey50", name = "year of\nimplementation") +
scale_colour_brewer(palette = "Set1", na.value = "grey50", guide = F) +
fer_theme +
coord_quickmap() +
theme(axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
legend.title = element_text(size = 8),
legend.text = element_text(size = 7),
legend.direction = "horizontal",
legend.key.size = grid::unit(0.7, "lines")) +
ggtitle("optimistic scenario")
p2 <- countries.df %>%
dplyr::inner_join(surv_history, by = c("id" = "alpha3")) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = pes_cut)) +
scale_fill_brewer(palette = "Reds", na.value = "grey50", name = "year") +
scale_colour_brewer(palette = "Set1", na.value = "grey50", guide = F) +
fer_theme +
coord_quickmap() +
theme(axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank()) +
ggtitle("pessimistic scenario")
leg <- cowplot::get_legend(p1)
cowplot::plot_grid(p1 + theme(legend.position = "none"),
p2 + theme(legend.position = "none"),
ncol = 2) %>%
cowplot::plot_grid(leg, ncol = 1, rel_heights = c(2, 0.2), hjust = -10)
```
A central goal of the major international organizations that promote open government data is to harness the big-data revolution to achieve the 17 goals of the 2030 Agenda for Sustainable Development.
If open government data can indeed be used to "help end extreme poverty, combat inequality and injustice, and combat climate change", its value would be so large that it could, without a doubt, be called big data.
This analysis suggests that the adoption of open data could be accelerated by enabling all people to use the internet, which can then translate into the creation of value when citizens demand and use public data.
Lamentably, this analysis also suggests that, unless this acceleration takes place, the countries that need it the most are also the ones most likely to miss out on it.
\singlespacing
## References
\footnotesize