---
title: 'BDA Project: Malicious and Benign Website URL detections'
author: "Nguyen Xuan Binh"
date: "January 2023"
output:
  word_document:
    toc: yes
    toc_depth: '3'
  pdf_document:
    toc: yes
    toc_depth: 3
    fig_caption: yes
bibliography: bibliography.bib
---

# Introduction

## Central problem
Detecting malicious URLs among benign ones is a central goal of modern cybersecurity, as it helps prevent individuals and organizations from falling victim to phishing, data breaches, malware infections, and other cyber threats. The most common type is phishing, where URLs are disguised as legitimate sites to trick users into revealing their credentials. Other types install harmful software or redirect users to further malicious sites. With the rapid growth of the internet and the increasing dependence on technology, black-hat hackers and thieves have found innovative ways to spread malicious content through fake URLs. A 2017 report from Cybersecurity Ventures predicted that ransomware damages would cost the world \$5 billion in 2017, up from \$325 million in 2015, a 15-fold increase in just two years. The damages were predicted to reach \$8 billion in 2018 and \$11.5 billion in 2019 [@cybersec]. Automating the detection and blocking of malicious URLs is therefore an urgent task.

## Motivation

To protect against these threats, malicious URLs must be detected and users must be prevented from accessing them. This can be done with techniques such as URL reputation analysis, machine learning algorithms, and network security solutions. In this report, I aim to distinguish malicious URLs from benign (safe) ones based on features of the URLs and of the websites associated with them. The method is based on Bayesian inference, which allows prior knowledge to be combined with the observed data.

## Main modeling idea


# Dataset 

## Data Description
The dataset used in this report is intended for machine learning based analysis of malicious and benign webpages. The data was collected from the Internet using a specialized focused web crawler named MalCrawler [1]. The dataset comprises various extracted attributes as well as the raw webpage content, including JavaScript code. It supports both supervised and unsupervised learning. For supervised learning, class labels for malicious and benign webpages were added using the Google Safe Browsing API. The most relevant attributes have already been extracted and included in the dataset, but the raw web content, including the JavaScript code, supports further attribute extraction if desired and can also serve as unstructured input for text-based analytics. The dataset covers approximately 1.5 million webpages, which makes it suitable for deep learning algorithms. The dataset article also provides code snippets used for data extraction and analysis.


The extracted attributes can be used to classify webpages as malicious or benign. They were selected based on their relevance [3], and the labels were verified using the Google Safe Browsing API [2]. The dataset attributes are:

- `url`: the URL of the webpage.
- `ip_add`: the IP address of the webpage.
- `geo_loc`: the geographic location where the webpage is hosted.
- `url_len`: the length of the URL.
- `js_len`: the length of the JavaScript code on the webpage.
- `js_obf_len`: the length of the obfuscated JavaScript code.
- `tld`: the top-level domain of the webpage.
- `who_is`: whether the WHOIS domain information is complete or not.
- `https`: whether the site uses HTTPS or HTTP.
- `content`: the raw webpage content, including JavaScript code.
- `label`: the class label, benign or malicious.
 
Python code for extracting the attributes listed above is provided with the dataset, together with a visualisation of the dataset and its Python code. The visualisation can be viewed online on Kaggle.

## Data source and analysis difference

- Kaggle: <https://www.kaggle.com/datasets/aksingh2411/dataset-of-malicious-and-benign-webpages>
- Data source (Mendeley Data): <https://data.mendeley.com/datasets/gdx3pkwp47/2>
- Publication (ResearchGate): <https://www.researchgate.net/publication/347936136_Malicious_and_Benign_Webpages_Dataset>


# Data cleaning

# Feature selection and transformation


```{r, include=FALSE}
library(rstan)
library(cmdstanr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(grid)
library(gridExtra)
library(scales)
library(loo)
library(sentimentr)
library(stringr)
library(MASS) # attached after dplyr, so it masks dplyr::select(); call dplyr::select() explicitly if needed
library(Metrics)
library(caret)
library(cvms)
library(tibble)
library(posterior)
library(purrr)
options(dplyr.summarise.inform = FALSE)
```

```{r setup, include = FALSE}
#knitr::opts_chunk$set(eval = TRUE)
#knitr::opts_chunk$set(eval = FALSE)
```

```{r}
train_websites <- read.csv("websites/train_websites.csv")
test_websites <- read.csv("websites/test_websites.csv")
train_websites_top_3 <- read.csv("websites/train_websites_top_3.csv")
test_websites_top_3 <- read.csv("websites/test_websites_top_3.csv")
```


```{r}
cat("Number of training data:",nrow(train_websites_top_3))
cat("\nNumber of testing data:",nrow(test_websites_top_3))
head(train_websites_top_3)
```

```{r,echo=FALSE}
# Count the benign URLs and break them down by HTTPS usage

numberOfBenignURLs <- train_websites %>% 
  filter(label == "good") %>% 
  nrow()
  
train_websites_count <- train_websites %>% 
  filter(label == "good") %>% 
  group_by(https) %>%
  summarize(count = n())

# Create a pie chart with a legend
p1 <- ggplot(train_websites_count, aes(x = "", y = count, fill = interaction(https))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0, direction = -1) +
  scale_fill_manual(values = c("red", "green"), labels=c("no","yes")) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10), legend.text = element_text(size = 8)) +
  ggtitle(paste("HTTPS in",numberOfBenignURLs,"benign URLs\n(yes/no)")) +
  
  guides(fill=guide_legend(title=""))

# Count the benign URLs and break them down by WHOIS completeness

train_websites_count <- train_websites %>% 
  filter(label == "good") 
numberOfBenignURLs <- nrow(train_websites_count)
  
train_websites_count <- train_websites %>% 
  filter(label == "good") %>% 
  group_by(who_is) %>%
  summarize(count = n()) %>% 
  arrange(desc(count))

# Create a pie chart with a legend
p2 <- ggplot(train_websites_count, aes(x = "", y = count, fill = interaction(who_is))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0, direction = 1) +
  scale_fill_manual(values = c("green", "red"), labels=c("complete","incomplete")) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10), legend.text = element_text(size = 8)) +
  ggtitle(paste("WHOIS in",numberOfBenignURLs,"benign URLs\n(complete/incomplete)")) +
  
  guides(fill=guide_legend(title=""))

train_websites_count <- train_websites %>% 
  filter(label == "good") 
numberOfBenignURLs <- nrow(train_websites_count)
  
train_websites_count <- train_websites %>% 
  filter(label == "good") %>% 
  group_by(https, who_is) %>%
  summarize(count = n())

# Create a pie chart with a legend
p3 <- ggplot(train_websites_count, aes(x = "", y = count, fill = interaction(https, who_is))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0, direction = 1) +
  scale_fill_manual(values = c("blue", "green", "red", "orange")) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10), legend.text = element_text(size = 8)) +
  ggtitle(paste("(HTTPS.WHOIS) pair combination\nin",numberOfBenignURLs,"benign URLs")) +
  guides(fill=guide_legend(title="", nrow=2))

grid.arrange(p1, p2, p3, ncol = 3)
```

```{r, echo=FALSE}
# Count the malicious URLs and break them down by HTTPS usage

train_websites_count <- train_websites %>% 
  filter(label == "bad") 
numberOfMaliciousURLs <- nrow(train_websites_count)
  
train_websites_count <- train_websites %>% 
  filter(label == "bad") %>% 
  group_by(https) %>%
  summarize(count = n())

# Create a pie chart with a legend
p1 <- ggplot(train_websites_count, aes(x = "", y = count, fill = interaction(https))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0, direction = 1) +
  scale_fill_manual(values = c("red", "green"), labels=c("no","yes")) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10), legend.text = element_text(size = 8)) +
  ggtitle(paste("HTTPS in",numberOfMaliciousURLs,"malicious URLs\n(yes/no)")) +
  
  guides(fill=guide_legend(title=""))

# Count the malicious URLs and break them down by WHOIS completeness

train_websites_count <- train_websites %>% 
  filter(label == "bad") 
numberOfMaliciousURLs <- nrow(train_websites_count)
  
train_websites_count <- train_websites %>% 
  filter(label == "bad") %>% 
  group_by(who_is) %>%
  summarize(count = n()) %>% 
  arrange(desc(count))

# Create a pie chart with a legend
p2 <- ggplot(train_websites_count, aes(x = "", y = count, fill = interaction(who_is))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0, direction = -1) +
  scale_fill_manual(values = c("green", "red"), labels=c("complete","incomplete")) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10), legend.text = element_text(size = 8)) +
  ggtitle(paste("WHOIS in",numberOfMaliciousURLs,"malicious URLs\n(complete/incomplete)")) +
  
  guides(fill=guide_legend(title=""))

# Count the number of rows for each combination of https and who_is

train_websites_count <- train_websites %>% 
  filter(label == "bad") 
numberOfMaliciousURLs <- nrow(train_websites_count)
  
train_websites_count <- train_websites %>% 
  filter(label == "bad") %>% 
  group_by(https, who_is) %>%
  summarize(count = n())

# Create a pie chart with a legend
p3 <- ggplot(train_websites_count, aes(x = "", y = count, fill = interaction(https, who_is))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0, direction = 1) +
  scale_fill_manual(values = c("blue", "green", "red", "orange")) +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10), legend.text = element_text(size = 8)) +
  ggtitle(paste("(HTTPS.WHOIS) pair combination\nin",numberOfMaliciousURLs,"malicious URLs")) +
  guides(fill=guide_legend(title="", nrow=2))

grid.arrange(p1, p2, p3, ncol = 3)
```

```{r, echo=FALSE}
ggplot(data = train_websites, aes(x = js_len, y = js_obf_len, color = label)) +
  geom_point() +
  scale_color_manual(values = c("red", "blue"), 
                     labels = c("malicious", "benign"),
                     guide = guide_legend(title = "Label")) +
  ggtitle("js_len vs js_obf_len") +
  xlab("js_len") +
  ylab("js_obf_len") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_vline(xintercept = 250, linetype = "dashed", color = "black") + 
  geom_hline(yintercept = 100, linetype = "dashed", color = "black") +
  annotate("text", x = 260, y = Inf, label = "js_len = 250", hjust = 0, vjust = 1) +
  annotate("text", x = Inf, y = 60, label = "js_obf_len = 100", hjust = 1, vjust = 0) 
  #guides(color = guide_legend(title = "Label"))
```
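
The dashed lines in the plot above mark `js_len = 250` and `js_obf_len = 100`. The binary columns used later in this report (`js_len_bin`, `js_obf_len_bin`, `https_bin`, `whois_bin`, `label_bin`) are read from pre-processed CSV files, and the preprocessing itself is not shown. The following is only a sketch of how such indicators could be derived. It assumes that the two dashed lines serve as cut-offs and that the raw columns take the values `yes`/`no`, `complete`/`incomplete` and `good`/`bad`. The helper name `binarize_websites` and the toy rows are hypothetical.

```r
# Hypothetical helper: turn raw attributes into 0/1 indicators.
# The cut-offs 250 and 100 are assumptions taken from the dashed lines in the
# scatter plot, not a confirmed part of the original preprocessing.
binarize_websites <- function(df, js_len_cut = 250, js_obf_len_cut = 100) {
  data.frame(
    geo_loc        = df$geo_loc,
    js_len_bin     = as.integer(df$js_len > js_len_cut),
    js_obf_len_bin = as.integer(df$js_obf_len > js_obf_len_cut),
    https_bin      = as.integer(df$https == "yes"),
    whois_bin      = as.integer(df$who_is == "complete"),
    label_bin      = as.integer(df$label == "bad"),  # malicious = 1, benign = 0
    stringsAsFactors = FALSE
  )
}

# Toy rows, invented for illustration only
toy <- data.frame(
  geo_loc    = c("A", "B", "A"),
  js_len     = c(120.5, 610.0, 90.0),
  js_obf_len = c(0, 340.2, 0),
  https      = c("yes", "no", "yes"),
  who_is     = c("complete", "incomplete", "complete"),
  label      = c("good", "bad", "good"),
  stringsAsFactors = FALSE
)
toy_bin <- binarize_websites(toy)
print(toy_bin)
```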


```{r, include=FALSE}
# Set up the plotting grid: two histograms stacked vertically
par(mfrow = c(2,1))

hist(train_websites$js_len, main = "JS length histogram of the recorded URLs", xlab = "Javascript length", breaks = 100)
hist(train_websites$js_obf_len, main ="Obfuscated JS length histogram of the recorded URLs", xlab = "Obfuscated Javascript length",breaks = 100)
```

```{r, echo=FALSE}
# Group the dataframe by geo_loc and count the number of rows for each country
train_websites_count <- train_websites %>% 
  group_by(geo_loc) %>%
  summarize(count = n()) %>%
  # Keep exactly the three countries with the most URLs
  slice_max(count, n = 3, with_ties = FALSE)

# Count the number of benign and malicious URLs for each country
train_websites_count_label <- train_websites %>% 
  filter(geo_loc %in% train_websites_count$geo_loc) %>%
  group_by(geo_loc, label) %>%
  summarize(count = n())

# Plot the bar chart
ggplot(train_websites_count_label, aes(x = geo_loc, y = count, fill = label)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("red", "blue"), 
                    labels = c("malicious","benign")) +
  xlab("Country") +
  ylab("Number of URLs") +
  ggtitle("Distribution of benign and malicious URLs\n of top 3 recorded countries") +
  guides(fill = guide_legend(title = "Label")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, vjust = 0.5, size = 10, 
                                  margin = margin(r = -20, unit = "pt"),
                                  #family = "serif", 
                                  lineheight = 0.9, color = "black"))
```

```{r, echo=FALSE}

# Get unique country names
countries <- unique(train_websites_top_3$geo_loc)

# Get number of countries
K <- length(countries)

# Number of training URLs per country, in the order of `countries`
N_list <- as.list(vapply(
  countries,
  function(country) sum(train_websites_top_3$geo_loc == country, na.rm = TRUE),
  integer(1), USE.NAMES = FALSE
))

# Largest number of training URLs in any single country
Nmax <- max(unlist(N_list))

# Number of testing URLs per country, in the order of `countries`
M_list <- as.list(vapply(
  countries,
  function(country) sum(test_websites_top_3$geo_loc == country, na.rm = TRUE),
  integer(1), USE.NAMES = FALSE
))

# Largest number of testing URLs in any single country
Mmax <- max(unlist(M_list))

# Stan needs rectangular arrays, so each per-country vector is padded with
# zeros up to the largest country size. The padding is never used by the
# model, which only loops over the first N_list[k] (or M_list[k]) entries.
pad_by_country <- function(data, column, max_len) {
  map(countries, function(country) {
    x <- data[[column]][which(data$geo_loc == country)]
    c(x, rep(0, max_len - length(x)))
  })
}

# Training features (JS length, obfuscated JS length, HTTPS, WHOIS), padded to Nmax
js_len_list     <- pad_by_country(train_websites_top_3, "js_len_bin", Nmax)
js_obf_len_list <- pad_by_country(train_websites_top_3, "js_obf_len_bin", Nmax)
https_list      <- pad_by_country(train_websites_top_3, "https_bin", Nmax)
whois_list      <- pad_by_country(train_websites_top_3, "whois_bin", Nmax)

# Testing features, padded to Mmax
js_len_test_list     <- pad_by_country(test_websites_top_3, "js_len_bin", Mmax)
js_obf_len_test_list <- pad_by_country(test_websites_top_3, "js_obf_len_bin", Mmax)
https_test_list      <- pad_by_country(test_websites_top_3, "https_bin", Mmax)
whois_test_list      <- pad_by_country(test_websites_top_3, "whois_bin", Mmax)

# Labels of the URLs (benign = 0, malicious = 1), training and testing
label_list      <- pad_by_country(train_websites_top_3, "label_bin", Nmax)
label_test_list <- pad_by_country(test_websites_top_3, "label_bin", Mmax)

```
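
The padding step above can be illustrated on a small toy example. This is a sketch in base R with invented data: three groups of different sizes are padded with zeros to a common length and stacked into a matrix, which is the rectangular shape that a Stan `array[K, Nmax]` declaration expects. The names `ragged`, `n_obs`, `n_max` and `padded` are hypothetical.

```r
# Toy ragged data: three groups with 2, 4 and 3 binary observations
ragged <- list(c(1, 0), c(0, 1, 1, 0), c(1, 1, 0))
n_obs  <- lengths(ragged)   # true number of observations per group
n_max  <- max(n_obs)        # common padded length

# Pad every vector with zeros up to n_max and stack the rows into a matrix
padded <- do.call(rbind, lapply(ragged, function(x) c(x, rep(0, n_max - length(x)))))
print(padded)

# Only the first n_obs[k] entries of row k are real data; the rest is padding
for (k in seq_along(ragged)) {
  cat("group", k, ":", padded[k, seq_len(n_obs[k])], "\n")
}
```

Because the padding value 0 is also a valid observation, the true group sizes have to be passed alongside the matrix, as `N_list` and `M_list` are in this report.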

```{r, echo=FALSE}
stan_data <- list(
  Nmax = Nmax,
  Mmax = Mmax,
  K = K,
  N_list = N_list,
  M_list = M_list,
  js_len_list = as.matrix(do.call(rbind, js_len_list)),
  js_obf_len_list = as.matrix(do.call(rbind, js_obf_len_list)),
  https_list = as.matrix(do.call(rbind, https_list)),
  whois_list = as.matrix(do.call(rbind, whois_list)),
  js_len_pred_list = as.matrix(do.call(rbind, js_len_test_list)),
  js_obf_len_pred_list = as.matrix(do.call(rbind, js_obf_len_test_list)),
  https_pred_list = as.matrix(do.call(rbind, https_test_list)),
  whois_pred_list = as.matrix(do.call(rbind, whois_test_list)),
  label_list = as.matrix(do.call(rbind, label_list))
)
```

# Separate model

## Model description

## Prior choice and justifications
According to W3Techs, HTTPS is the default protocol for 81.5% of all websites (<https://w3techs.com/technologies/details/ce-httpsdefault>).

About 1.24 billion websites have complete WHOIS registration, out of roughly 1.7 billion websites in total, so the proportion of websites with complete WHOIS information is about $1.24 / 1.7 \approx 0.73$.
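
As a worked check on how such proportions relate to a Beta prior (a sketch, not necessarily the derivation used for this report), recall that a $\mathrm{Beta}(\alpha, \beta)$ prior has mean

$$
\mathbb{E}[\theta] = \frac{\alpha}{\alpha + \beta}.
$$

With a total pseudo-count of $\alpha + \beta = 10$, a value assumed here purely for illustration, a prior centred on the HTTPS share $0.815$ would be $\mathrm{Beta}(8.15, 1.85)$, and one centred on the WHOIS share $0.73$ would be $\mathrm{Beta}(7.3, 2.7)$. By the same formula, the $\mathrm{Beta}(8, 10)$ and $\mathrm{Beta}(7, 10)$ priors that appear in the Stan code have means

$$
\frac{8}{8 + 10} \approx 0.44, \qquad \frac{7}{7 + 10} \approx 0.41,
$$

so they are centred below the proportions quoted above.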

## Stan code and running options

The Stan model code:
```{r, echo = TRUE, results = 'hide'}
"
data {
  int<lower=1> Nmax; // Number of maximum URLs among all countries (training)
  int<lower=1> Mmax; // Number of maximum URLs among all countries (testing)
  int<lower=1> K; // Number of countries
  array[K] int<lower=1> N_list; // Number of URLs of each country (training)
  array[K] int<lower=1> M_list; // Number of URLs of each country (testing)
  // The training features
  array[K, Nmax] int<lower=0,upper=1> js_len_list;
  array[K, Nmax] int<lower=0,upper=1> js_obf_len_list;
  array[K, Nmax] int<lower=0,upper=1> https_list;
  array[K, Nmax] int<lower=0,upper=1> whois_list;
  // The testing predicting features
  array[K, Mmax] int<lower=0,upper=1> js_len_pred_list;
  array[K, Mmax] int<lower=0,upper=1> js_obf_len_pred_list;
  array[K, Mmax] int<lower=0,upper=1> https_pred_list;
  array[K, Mmax] int<lower=0,upper=1> whois_pred_list;
  // label for each URL: benign(0) or malicious(1)
  array[K, Nmax] int<lower=0,upper=1> label_list; 
}

parameters {
  array[K] real<lower=0, upper=1> theta_js_len; // probability for js_len
  array[K] real<lower=0, upper=1> theta_js_obf_len; // probability for js_obf_len
  array[K] real<lower=0, upper=1> theta_https; // probability for https
  array[K] real<lower=0, upper=1> theta_whois; // probability for whois
  array[K] real js_len_coeff; // Slope coefficient for js_len
  array[K] real js_obf_len_coeff; // Slope coefficient for js_obf_len
  array[K] real https_coeff; // Slope coefficient for https
  array[K] real whois_coeff; // Slope coefficient for whois
  array[K] real intercept; // Intercept coefficient
}

model {
    // Prior probabilities of the features
    for (k in 1:K){
        theta_js_len[k] ~ beta(1,10);
        theta_js_obf_len[k] ~ beta(1,10);
        theta_https[k] ~ beta(8,10);
        theta_whois[k] ~ beta(7,10);
    }
    // likelihood for the features
    for (k in 1:K){
        js_len_list[k, 1:N_list[k]] ~ bernoulli(theta_js_len[k]);
        js_obf_len_list[k, 1:N_list[k]] ~ bernoulli(theta_js_obf_len[k]);
        https_list[k, 1:N_list[k]] ~ bernoulli(theta_https[k]);
        whois_list[k, 1:N_list[k]] ~ bernoulli(theta_whois[k]);     
    }
    // priors of the coefficients
    for (k in 1:K){
       js_len_coeff[k] ~ cauchy(1,1);
       js_obf_len_coeff[k] ~ cauchy(1,1);
       https_coeff[k]  ~ cauchy(-1,1);
       whois_coeff[k] ~ cauchy(-1,1);
       intercept[k] ~ normal(0,20);        
    }
    // Logistic regression for the label: Bernoulli likelihood whose
    // log-odds are a linear function of the four binary features
    for (k in 1:K){
      for (i in 1:N_list[k]){
        label_list[k, i] ~ bernoulli_logit(intercept[k] 
            + https_coeff[k] * https_list[k, i]
            + whois_coeff[k] * whois_list[k, i] 
            + js_len_coeff[k] * js_len_list[k, i] 
            + js_obf_len_coeff[k] * js_obf_len_list[k, i]);
      }
    }
}

generated quantities {
    array[K, Nmax] real label_train_pred;
    array[K, Mmax] real label_test_pred;
    array[Nmax] real log_likelihood; 
    // Predictions for the training data
    for (k in 1:K){
      for (i in 1:N_list[k]){
        label_train_pred[k, i] = bernoulli_rng(inv_logit(intercept[k] 
            + https_coeff[k] * https_list[k, i] 
            + whois_coeff[k] * whois_list[k, i] 
            + js_len_coeff[k] * js_len_list[k, i] 
            + js_obf_len_coeff[k] * js_obf_len_list[k, i]));
      }
    }
    // Predictions for the testing data
    for (k in 1:K){
      for (i in 1:M_list[k]){
        label_test_pred[k, i] = bernoulli_rng(inv_logit(intercept[k] 
            + https_coeff[k] * https_pred_list[k, i] 
            + whois_coeff[k] * whois_pred_list[k, i] 
            + js_len_coeff[k] * js_len_pred_list[k, i] 
            + js_obf_len_coeff[k] * js_obf_len_pred_list[k, i]));
      }
    }
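    // Pointwise log-likelihood of the training labels, stored only for the
    // country with the most training URLs (N_list[k] == Nmax)
    for (k in 1:K) {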
      if (N_list[k] == Nmax){
        for (i in 1:Nmax){
          log_likelihood[i] = bernoulli_lpmf(label_list[k, i] | inv_logit(intercept[k] 
            + https_coeff[k] * https_list[k, i] 
            + whois_coeff[k] * whois_list[k, i] 
            + js_len_coeff[k] * js_len_list[k, i] 
            + js_obf_len_coeff[k] * js_obf_len_list[k, i]));
        }
      }
    }
}

"
```
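
In equation form, the regression part of the Stan program above can be summarised as follows. The symbols are notation introduced here for readability: $y_{k,i}$ is the label of URL $i$ in country $k$, the $x_{k,i}$ are its four binary features, $a_k$ is the intercept and the $b_k$ are the slope coefficients.

$$
\begin{aligned}
y_{k,i} &\sim \mathrm{Bernoulli}\left(\mathrm{logit}^{-1}(\eta_{k,i})\right), \\
\eta_{k,i} &= a_k
  + b^{\mathrm{https}}_k \, x^{\mathrm{https}}_{k,i}
  + b^{\mathrm{whois}}_k \, x^{\mathrm{whois}}_{k,i}
  + b^{\mathrm{js}}_k \, x^{\mathrm{js}}_{k,i}
  + b^{\mathrm{obf}}_k \, x^{\mathrm{obf}}_{k,i}, \\
b^{\mathrm{js}}_k,\; b^{\mathrm{obf}}_k &\sim \mathrm{Cauchy}(1, 1), \qquad
b^{\mathrm{https}}_k,\; b^{\mathrm{whois}}_k \sim \mathrm{Cauchy}(-1, 1), \qquad
a_k \sim \mathrm{Normal}(0, 20).
\end{aligned}
$$

Here $\mathrm{Normal}(0, 20)$ follows the Stan convention of a standard deviation of 20. Every country $k = 1, \dots, K$ has its own intercept and coefficients with no parameters shared between countries, which is what makes this a separate model.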

```{r, echo=FALSE}
# Compiling the separate Stan model
file_separate <- file.path("models/model_separate.stan")
model_separate <- cmdstan_model(file_separate)
model_separate$compile(quiet = FALSE)
```

The sampling options:

```{r}
  separate_sampling <- model_separate$sample(data = stan_data, chains=3, iter_warmup = 1000, iter_sampling = 500, refresh=0)
```

```{r, echo=FALSE}
# Set up the plotting grid: one row per country, one column per coefficient
par(mfrow = c(K, 5))

row_labels <- c("intercept", "HTTPS coefficient", "WHOIS coefficient","JS length coefficient","JS obf. length coefficient")
row_names <- c("intercept","https_coeff","whois_coeff","js_len_coeff","js_obf_len_coeff")

#separate_sampling$summary()
# Loop through the countries (rows) and the coefficients (columns)
for (j in 1:K){
  for(i in 1:5){
      # Histogram of the posterior draws, titled with the country name
      hist(separate_sampling$draws(paste(row_names[i],"[",j,"]", sep="")),main = countries[j], xlab=row_labels[i])
    }
}
```

## Convergence diagnostics

MCMC convergence chains visualization
```{r, echo=FALSE}
plotConvergence <- function (draws, paramName){
  # `draws` is a one-column matrix with the chains stacked in order; with
  # chains = 3 and iter_sampling = 500, each chain occupies 500 rows
  chain1 <- as.vector(draws[1:500, 1])
  chain2 <- as.vector(draws[501:1000, 1])
  chain3 <- as.vector(draws[1001:1500, 1])
  iters = length(chain1)
  indices <- 1:iters
  data <- data.frame(indices, chain1, chain2, 
                              chain3)
  
  p <- ggplot(data, aes(x=indices)) +
    ggtitle(paste("Separate model - Convergence of",paramName,"\n",iters,"sampling iterations, no warm-up")) +
    xlab("iteration") + 
    ylab(paramName) +
    theme(plot.title = element_text(hjust = 0.5, size=10), legend.position = "right") +
    geom_line(aes(y = chain1, color = "chain1")) + 
    geom_line(aes(y = chain2, color = "chain2")) +
    geom_line(aes(y = chain3, color = "chain3")) + 
    #geom_line(aes(y = chain4, color = "chain4")) +
    scale_color_manual(name = "MCMC", values = c("chain1" = "red", "chain2" = "blue", "chain3" = "black"))
  return(p)
}

intercept_draws <- separate_sampling$draws("intercept[1]", format = "matrix")
p1 <- plotConvergence(intercept_draws, "intercept")

https_draws <- separate_sampling$draws("https_coeff[1]", format = "matrix")
p2 <- plotConvergence(https_draws, "https coefficient")

grid.arrange(p1, p2, ncol = 2)
# whois_draws <- separate_sampling$draws("whois_coeff[1]", format = "matrix")
# plotConvergence(whois_draws, "whois coefficient")

#js_len_draws <- separate_sampling$draws("js_len_coeff[1]", format = "matrix")
#plotConvergence(js_len_draws, "JS length coefficient")

#js_obf_len_draws <- separate_sampling$draws("js_obf_len_coeff[1]", format = "matrix")
#plotConvergence(js_obf_len_draws, "JS obfuscated length coefficient")
```

HMC specific convergence diagnostics
```{r, eval=TRUE, include=TRUE}
separate_sampling$diagnostic_summary()
```

$\hat{R}$-values and effective sample size
```{r, echo=FALSE}
summaryDiagnostics = data.frame()
for (i in 1:K){
  summaryDiagnosticsCountry <- separate_sampling$summary(c(paste("intercept[",i,"]", sep=""), paste("https_coeff[",i,"]", sep=""),paste("whois_coeff[",i,"]", sep=""),paste("js_len_coeff[",i,"]", sep=""),paste("js_obf_len_coeff[",i,"]", sep="")))[, c("variable", "rhat", "ess_bulk", "ess_tail")]
 summaryDiagnostics <- rbind(summaryDiagnostics, summaryDiagnosticsCountry)
 
}
summaryDiagnostics$variable <- 1:15 
summaryDiagnostics <- summaryDiagnostics %>% rename(index = variable)

p1 <- ggplot(summaryDiagnostics, aes(x = index, y = rhat)) +
  geom_point() +
  geom_hline(yintercept = 1.05, color = "red") +
  annotate("text", x = 15, y = 1.055, label = "critical Rhat value = 1.05", 
           hjust = 1, color = "red") + 
  xlab("Index") +
  ylab("Rhat values") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10)) + 
  ggtitle("Rhat values\n of all coefficients")

p2 <- ggplot(summaryDiagnostics, aes(x = index, y = ess_bulk)) +
  geom_point() +
  geom_hline(yintercept = 1500, color = "red") +
  annotate("text", x = 17, y = 1400, label = "Samples = 1500", 
           hjust = 1, color = "red") + 
  xlab("Index") +
  ylab("Bulk ESS values") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10)) + 
  ggtitle("Bulk effective sample size\n of all coefficients")

#p3 <- ggplot(summaryDiagnostics, aes(x = index, y = ess_tail)) +
#  geom_point() +
#  geom_hline(yintercept = 1500, color = "red") +
 # annotate("text", x = 17, y = 1400, label = "Samples = 1500", 
#           hjust = 1, color = "red") + 
#  xlab("Index") +
#  ylab("Tail ESS values") +
#  theme(plot.title = element_text(hjust = 0.5)) +
#  theme(plot.title = element_text(size = 10)) + 
 # ggtitle("Tail effective sample size\n of all coefficients")

grid.arrange(p1, p2, ncol = 2)
```
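
To make the $\hat{R}$ values above more concrete, the following base R sketch computes a split-$\hat{R}$ by comparing between-chain and within-chain variance on simulated draws. It is an illustration only: the function name `split_rhat` and the simulated chains are hypothetical, and the values reported by the `posterior` package use a rank-normalised variant, so they can differ slightly from this simplified version.

```r
# Split-R-hat for one parameter, based on the classic comparison of
# between-chain and within-chain variance.
# `draws` is an iterations x chains matrix.
split_rhat <- function(draws) {
  n_half <- nrow(draws) %/% 2
  # Split every chain into a first and a second half
  halves <- cbind(draws[seq_len(n_half), , drop = FALSE],
                  draws[(nrow(draws) - n_half + 1):nrow(draws), , drop = FALSE])
  W <- mean(apply(halves, 2, var))      # within-chain variance
  B <- n_half * var(colMeans(halves))   # between-chain variance
  var_plus <- (n_half - 1) / n_half * W + B / n_half
  sqrt(var_plus / W)
}

set.seed(1)
# Simulated draws, not output from the model: 500 iterations x 3 chains
mixed <- matrix(rnorm(1500), nrow = 500, ncol = 3)
# Same draws, but each chain is shifted to a different location
stuck <- mixed + matrix(c(0, 3, 6), nrow = 500, ncol = 3, byrow = TRUE)

rhat_mixed <- split_rhat(mixed)  # chains agree
rhat_stuck <- split_rhat(stuck)  # chains disagree
print(c(mixed = rhat_mixed, stuck = rhat_stuck))
```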
## Posterior predictive checks

```{r, echo=FALSE}
inv_logit <- function(vec){
  return(1/(1+exp(-vec)))
}

bernoulli_logit <- function (intercept, js_len_coeff, js_obf_len_coeff, https_coeff, whois_coeff, js_len_truncated, js_obf_len_truncated, https_truncated, whois_truncated){
  probability <-  inv_logit(intercept + js_len_coeff * js_len_truncated + js_obf_len_coeff * js_obf_len_truncated + https_coeff * https_truncated +  whois_coeff * whois_truncated)
  classification <- ifelse(probability >= 0.5, 1, 0)
  return(classification)
}
```
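
The helper above turns a linear predictor into a probability and then into a 0/1 class with a 0.5 threshold. A minimal worked example of that calculation is sketched below. The coefficient values and the two toy URLs are hypothetical and chosen only so that the arithmetic is easy to follow; they are not estimates from the model. The feature coding (1 for HTTPS, complete WHOIS, and JS lengths above the cut-off) is also an assumption.

```r
inv_logit <- function(x) 1 / (1 + exp(-x))

# Hypothetical coefficient values, chosen only to illustrate the calculation
intercept        <- -2.0
https_coeff      <- -1.0
whois_coeff      <- -1.0
js_len_coeff     <-  3.0
js_obf_len_coeff <-  4.0

# Two toy URLs: (1) HTTPS, complete WHOIS, short JS;
#               (2) no HTTPS, incomplete WHOIS, long and obfuscated JS
https  <- c(1, 0)
whois  <- c(1, 0)
js     <- c(0, 1)
js_obf <- c(0, 1)

# Linear predictor (log-odds of being malicious)
eta <- intercept + https_coeff * https + whois_coeff * whois +
  js_len_coeff * js + js_obf_len_coeff * js_obf
prob <- inv_logit(eta)             # probability of being malicious
pred <- ifelse(prob >= 0.5, 1, 0)  # 0.5 threshold, as in the helper above
print(data.frame(eta, prob, pred))
```

The first toy URL has log-odds of -4 and is classified as benign, while the second has log-odds of 5 and is classified as malicious.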

```{r, echo=FALSE}
metricsName <- c("Accuracy", "Precision", "Recall", "F1")
# One row per metric; a column per country is added inside the loop
metricsSummary <- data.frame(Metrics=metricsName)
sumTP <- 0
sumTN <- 0
sumFP <- 0
sumFN <- 0
listTrue <- c()
listPred <- c()
for (k in 1:K){
  #predicted_train = c()
  #for (i in 1:N_list[[k]]){
  #  draws <- separate_sampling$draws(paste("label_train_pred[",k,",",i,"]", sep=""), format = "matrix")
  #  predicted_train <- c(predicted_train, as.vector(draws[1, ]))
  #}

  #true_train = label_list[[k]][1:N_list[[k]]]
  #confusion_matrix <- table(predicted_train, true_train)

  true_train = label_list[[k]][1:N_list[[k]]]
  
  js_len_truncated <- js_len_list[[k]][1:N_list[[k]]]
  js_obf_len_truncated <- js_obf_len_list[[k]][1:N_list[[k]]]
  https_truncated <- https_list[[k]][1:N_list[[k]]]
  whois_truncated <- whois_list[[k]][1:N_list[[k]]]
  
  intercept <- mean(as.vector(separate_sampling$draws(paste("intercept[",k,"]",sep=""), format = "matrix")[, 1]))
  js_len_coeff <- mean(as.vector(separate_sampling$draws(paste("js_len_coeff[",k,"]",sep=""), format = "matrix")[, 1]))
  js_obf_len_coeff <- mean(as.vector(separate_sampling$draws(paste("js_obf_len_coeff[",k,"]",sep=""), format = "matrix")[, 1]))
  https_coeff <- mean(as.vector(separate_sampling$draws(paste("https_coeff[",k,"]",sep=""), format = "matrix")[, 1]))
  whois_coeff <- mean(as.vector(separate_sampling$draws(paste("whois_coeff[",k,"]",sep=""), format = "matrix")[, 1]))
 
  predicted_train <- bernoulli_logit(intercept, js_len_coeff, js_obf_len_coeff, https_coeff, whois_coeff, js_len_truncated, js_obf_len_truncated, https_truncated, whois_truncated)
   
  listTrue <- c(listTrue, true_train)
  listPred <- c(listPred, predicted_train)
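  # Rows are predictions, columns are true labels; fixing the factor levels
  # keeps the table 2x2 even if one class is never predicted
  confusion_matrix <- table(factor(predicted_train, levels = c(0, 1)),
                            factor(true_train, levels = c(0, 1)))
  TP <- confusion_matrix[2,2]
  TN <- confusion_matrix[1,1]
  FP <- confusion_matrix[2,1]  # predicted malicious, actually benign
  FN <- confusion_matrix[1,2]  # predicted benign, actually malicious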
  #cat(TP,TN,FP,FN)
  sumTP <- sumTP + TP
  sumTN <- sumTN + TN
  sumFP <- sumFP + FP
  sumFN <- sumFN + FN
  accuracy <- (TP+TN)/(TP+FP+FN+TN) 
  precision <- TP/(TP+FP)
  recall <- TP/(TP+FN)
  f1 <-  2*(precision*recall)/(precision+recall)
  metricsSummary[,countries[k]] <- c(accuracy, precision,recall,f1)
}
accuracy <- (sumTP+sumTN)/(sumTP+sumFP+sumFN+sumTN) 
prevalence <- (sumTP+sumFN)/(sumTP+sumFP+sumFN+sumTN) 
sensitivity <- sumTP/(sumTP+sumFN)
# Specificity is the share of benign URLs that are correctly classified
specificity <- sumTN/(sumTN+sumFP)
precision <- sumTP/(sumTP+sumFP)
recall <- sumTP/(sumTP+sumFN)
f1 <-  2*(precision*recall)/(precision+recall)
metricsSummary[,"All countries"] <- c(accuracy, precision,recall,f1)
metricsSummary
```
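
The metrics in the table above all follow from the four cells of a confusion matrix. The sketch below shows the calculation on a small invented example, with class 1 (malicious) treated as the positive class. The function name `classification_metrics` and the toy vectors are hypothetical.

```r
# Classification metrics from 0/1 vectors; class 1 (malicious) is "positive"
classification_metrics <- function(predicted, actual) {
  # Fixed factor levels keep the table 2x2 even if a class is missing
  cm <- table(predicted = factor(predicted, levels = c(0, 1)),
              actual    = factor(actual,    levels = c(0, 1)))
  TP <- cm["1", "1"]  # predicted malicious, actually malicious
  TN <- cm["0", "0"]  # predicted benign, actually benign
  FP <- cm["1", "0"]  # predicted malicious, actually benign
  FN <- cm["0", "1"]  # predicted benign, actually malicious
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  c(accuracy    = (TP + TN) / sum(cm),
    precision   = precision,
    recall      = recall,
    specificity = TN / (TN + FP),
    f1          = 2 * precision * recall / (precision + recall))
}

# Toy predictions and labels, invented for illustration:
# TP = 2, FP = 1, TN = 1, FN = 1
toy_pred   <- c(1, 1, 0, 0, 1)
toy_actual <- c(1, 0, 0, 1, 1)
toy_metrics <- classification_metrics(toy_pred, toy_actual)
print(round(toy_metrics, 3))
```

Note that with `table(predicted, actual)` the rows index the prediction and the columns index the true label, so false positives sit in row 2, column 1.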

```{r, warning = FALSE}
confusion_matrix <- tibble("actual" = listTrue,
                     "prediction" = listPred)


basic_table <- table(confusion_matrix)
cfm <- as_tibble(basic_table)
plot_confusion_matrix(cfm, 
                      target_col = "actual", 
                      prediction_col = "prediction",
                      counts_col = "n", palette = "Oranges")
```
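
The per-country metric calculations above could be wrapped in a small helper. The following is a minimal sketch, not part of the original analysis: `classification_metrics` is a hypothetical name, and the labels in the example are made up for illustration. It fixes the factor levels to 0 and 1 so that the table stays 2x2 even when one class is never predicted, and it indexes cells by name so that the row (prediction) and column (actual) orientation is explicit.

```r
# Hypothetical helper: compute classification metrics from 0/1 vectors.
classification_metrics <- function(pred, true) {
  # Fix the levels so the table is always 2x2, even if one class is absent
  pred <- factor(pred, levels = c(0, 1))
  true <- factor(true, levels = c(0, 1))
  cm <- table(prediction = pred, actual = true)

  # Rows are predictions, columns are true labels
  TP <- cm["1", "1"]
  TN <- cm["0", "0"]
  FP <- cm["1", "0"]  # predicted 1, actually 0
  FN <- cm["0", "1"]  # predicted 0, actually 1

  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  c(accuracy = (TP + TN) / sum(cm),
    precision = precision,
    recall = recall,
    specificity = TN / (TN + FP),
    f1 = 2 * precision * recall / (precision + recall))
}

# Made-up labels for illustration only
true_labels <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred_labels <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
metrics <- classification_metrics(pred_labels, true_labels)
print(round(metrics, 3))
```

Indexing by name rather than by position avoids mixing up false positives and false negatives when the order of the arguments to `table()` changes.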

## Predictive performance assessment

```{r, echo=FALSE}
metricsName <- c("Accuracy", "Precision", "Recall", "F1")
metricsSummary <- data.frame(Metrics=metricsName)
metricsSummary$Accuracy <- NULL
metricsSummary$Precision <- NULL
metricsSummary$Recall <- NULL
metricsSummary$F1 <- NULL
#print(metricsSummary)
sumTP <- 0
sumTN <- 0
sumFP <- 0
sumFN <- 0
listTrue <- c()
listPred <- c()
for (k in 1:K){
  # predicted_test = c()
  # for (i in 1:M_list[[k]]){
  #  draws <- separate_sampling$draws(paste("label_test_pred[",k,",",i,"]", sep=""), format = "matrix")
  #  predicted_test <- c(predicted_test, as.vector(draws[1, ]))
  # }
  true_test = label_test_list[[k]][1:M_list[[k]]]
  
  js_len_truncated <- js_len_test_list[[k]][1:M_list[[k]]]
  js_obf_len_truncated <- js_obf_len_test_list[[k]][1:M_list[[k]]]
  https_truncated <- https_test_list[[k]][1:M_list[[k]]]
  whois_truncated <- whois_test_list[[k]][1:M_list[[k]]]
  
  intercept <- mean(as.vector(separate_sampling$draws(paste("intercept[",k,"]",sep=""), format = "matrix")[, 1]))
  js_len_coeff <- mean(as.vector(separate_sampling$draws(paste("js_len_coeff[",k,"]",sep=""), format = "matrix")[, 1]))
  js_obf_len_coeff <- mean(as.vector(separate_sampling$draws(paste("js_obf_len_coeff[",k,"]",sep=""), format = "matrix")[, 1]))
  https_coeff <- mean(as.vector(separate_sampling$draws(paste("https_coeff[",k,"]",sep=""), format = "matrix")[, 1]))
  whois_coeff <- mean(as.vector(separate_sampling$draws(paste("whois_coeff[",k,"]",sep=""), format = "matrix")[, 1]))
  
  predicted_test <- bernoulli_logit(intercept, js_len_coeff, js_obf_len_coeff, https_coeff, whois_coeff, js_len_truncated, js_obf_len_truncated, https_truncated, whois_truncated)
  
  confusion_matrix <- table(predicted_test, true_test)
  listTrue <- c(listTrue, true_test)
  listPred <- c(listPred, predicted_test)

  # table(predicted, true) puts predictions in rows and true labels in columns
  TP <- confusion_matrix[2,2]
  TN <- confusion_matrix[1,1]
  FP <- confusion_matrix[2,1]
  FN <- confusion_matrix[1,2]

  sumTP <- sumTP + TP
  sumTN <- sumTN + TN
  sumFP <- sumFP + FP
  sumFN <- sumFN + FN
  accuracy <- (TP+TN)/(TP+FP+FN+TN) 
  precision <- TP/(TP+FP)
  recall <- TP/(TP+FN)
  f1 <-  2*(precision*recall)/(precision+recall)
  metricsSummary[,countries[k]] <- c(accuracy, precision,recall,f1)
}
accuracy <- (sumTP+sumTN)/(sumTP+sumFP+sumFN+sumTN) 
prevalence <- (sumTP+sumFN)/(sumTP+sumFP+sumFN+sumTN) 
sensitivity <- sumTP/(sumTP+sumFN)
specificity <- sumTN/(sumTN+sumFP)
precision <- sumTP/(sumTP+sumFP)
recall <- sumTP/(sumTP+sumFN)
f1 <-  2*(precision*recall)/(precision+recall)
metricsSummary[,"All countries"] <- c(accuracy, precision,recall,f1)
metricsSummary
```

```{r, echo=FALSE}
confusion_matrix <- tibble("actual" = listTrue,
                     "prediction" = listPred)


basic_table <- table(confusion_matrix)
cfm <- as_tibble(basic_table)
plot_confusion_matrix(cfm, 
                      target_col = "actual", 
                      prediction_col = "prediction",
                      counts_col = "n", palette = "Oranges")
```


## Prior sensitivity analysis

```{r, echo=FALSE}
# Compiling the separate Stan model
file_separate_prior_sensitivity1 <- file.path("models/model_separate_prior_sensitivity_1.stan")
model_separate_prior_sensitivity1 <- cmdstan_model(file_separate_prior_sensitivity1)
model_separate_prior_sensitivity1$compile(quiet = FALSE)

file_separate_prior_sensitivity2 <- file.path("models/model_separate_prior_sensitivity_2.stan")
model_separate_prior_sensitivity2 <- cmdstan_model(file_separate_prior_sensitivity2)
model_separate_prior_sensitivity2$compile(quiet = FALSE)
```

Sampling options for the two prior sensitivity models:
```{r}
  separate_sampling_prior_sensitivity1 <- model_separate_prior_sensitivity1$sample(data = stan_data, chains=3, iter_warmup = 1000, iter_sampling = 500, refresh=0)
  separate_sampling_prior_sensitivity2 <- model_separate_prior_sensitivity2$sample(data = stan_data, chains=3, iter_warmup = 1000, iter_sampling = 500, refresh=0)
```

```{r, warning=FALSE, echo=FALSE}
loo_loglike_separate0 <- separate_sampling$loo(variables="log_likelihood",r_eff=TRUE)
loo_loglike_separate1 <- separate_sampling_prior_sensitivity1$loo(variables="log_likelihood",r_eff=TRUE)
loo_loglike_separate2 <- separate_sampling_prior_sensitivity2$loo(variables="log_likelihood",r_eff=TRUE)

cat("The elpd value of the separate model 0 is\n")
print(loo_loglike_separate0$estimates["elpd_loo", "Estimate"])

cat("The elpd value of the separate model 1 is\n")
print(loo_loglike_separate1$estimates["elpd_loo", "Estimate"])

cat("The elpd value of the separate model 2 is\n")
print(loo_loglike_separate2$estimates["elpd_loo", "Estimate"])

```

```{r}
loo_compare_separate <- loo_compare(x = list(loo_loglike_separate0=loo_loglike_separate0, loo_loglike_separate1=loo_loglike_separate1, loo_loglike_separate2=loo_loglike_separate2))
print(loo_compare_separate)
```

elpd_loo is the Bayesian LOO estimate of the expected log pointwise predictive density, computed as a sum of N individual pointwise log predictive densities. In the comparison table, elpd_diff is the difference in elpd_loo between two models. When more than two models are compared, as is the case here with three models, each difference is computed relative to the model with the highest elpd_loo.

se_diff is the standard error of the component-wise differences of elpd_loo between two models (Eq 24 in VGG2017). This SE is smaller than the SE of the individual models because the pointwise values are correlated: some observations are easier and some more difficult to predict for all models.

As a quick rule: if the elpd difference (elpd_diff in the loo package) is less than 4, the difference is small, the models have very similar predictive performance, and it does not matter if the normal approximation fails or the SE is underestimated. If elpd_diff is larger than 4, compare it to its standard error (se_diff). When elpd_diff is larger than 4, the number of observations is larger than 100, and the model is not badly misspecified, the normal approximation and the SE are a quite reliable description of the uncertainty in the difference @akiLoo.

The SE assumes that the normal approximation describes well the uncertainty in the expected difference. Because the cross-validation folds are not independent, the SE tends to be underestimated, especially if the number of observations is small or the models are badly misspecified. The normal approximation as a whole tends to fail if the models are very similar or badly misspecified.
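
To make the difference in elpd_loo and its standard error concrete, the sketch below computes both from pointwise elpd values. The pointwise values are simulated for illustration (they are not results from this report), and `pointwise_a` and `pointwise_b` are hypothetical names. With real fits, the pointwise values would presumably come from the `pointwise` matrix of each loo object.

```r
set.seed(1)
n_obs <- 200

# Simulated pointwise elpd values for two models (illustrative only)
pointwise_a <- rnorm(n_obs, mean = -0.50, sd = 0.30)
pointwise_b <- pointwise_a + rnorm(n_obs, mean = -0.02, sd = 0.05)

# elpd_loo of each model is the sum of its pointwise values
elpd_a <- sum(pointwise_a)
elpd_b <- sum(pointwise_b)

# elpd_diff is the sum of the pointwise differences
diffs <- pointwise_b - pointwise_a
elpd_diff <- sum(diffs)

# se_diff uses the spread of the pointwise differences (Eq 24 in VGG2017)
se_diff <- sqrt(n_obs) * sd(diffs)

# SE of each model on its own, for comparison
se_a <- sqrt(n_obs) * sd(pointwise_a)
se_b <- sqrt(n_obs) * sd(pointwise_b)

cat("elpd_diff:", round(elpd_diff, 2), " se_diff:", round(se_diff, 2), "\n")
cat("se_a:", round(se_a, 2), " se_b:", round(se_b, 2), "\n")
```

Because both simulated models find the same observations easy or hard, the pointwise differences vary much less than the pointwise values themselves, which is why the standard error of the difference comes out smaller than either individual SE.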


# Pooled model

## Model description

## Prior choice and justifications
HTTPS is the default protocol for 81.5% of all websites (source: <https://w3techs.com/technologies/details/ce-httpsdefault>).

About 1.24 billion websites have a complete WHOIS registration, out of roughly 1.7 billion websites in total, so the proportion of websites with complete WHOIS information is about 0.73.
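
The arithmetic behind these two figures, and their values on the log-odds scale, can be checked with a few lines of R. This is only a sketch: the counts are the ones quoted above, the variable names are hypothetical, and whether the priors are specified on the probability scale or the log-odds scale is an assumption, not something stated here.

```r
# Figures quoted in the text
p_https <- 0.815            # share of websites using HTTPS by default
n_whois_complete <- 1.24e9  # websites with complete WHOIS registration
n_websites <- 1.7e9         # total number of websites

# Proportion of websites with complete WHOIS registration
p_whois <- n_whois_complete / n_websites

# Log-odds of each proportion (qlogis is the logit function in base R)
logit_https <- qlogis(p_https)
logit_whois <- qlogis(p_whois)

cat("WHOIS proportion:", round(p_whois, 2), "\n")
cat("logit(HTTPS):", round(logit_https, 2),
    " logit(WHOIS):", round(logit_whois, 2), "\n")
```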

## Stan code and running options

The Stan model code:
```{r, echo = TRUE, results = 'hide'}
"

"
```

```{r, echo=FALSE}
# Compiling the pooled Stan model
file_pooled <- file.path("models/model_pooled.stan")
model_pooled <- cmdstan_model(file_pooled)
model_pooled$compile(quiet = FALSE)
```

Sampling options for the pooled model:

```{r}
pooled_sampling <- model_pooled$sample(data = stan_data, chains=3, 
  iter_warmup = 500, iter_sampling = 1000, refresh=0)
```

```{r, echo=FALSE}
# Set up the plotting grid
par(mfrow = c(3,3))

row_labels <- c(" θ(JS length)","θ(JS obf. length)","θ(https)","θ(whois)", "intercept", "HTTPS coefficient", "WHOIS coefficient","JS length coefficient","JS obf. length coefficient")
row_names <- c("theta_js_len","theta_js_obf_len","theta_https","theta_whois","intercept","https_coeff","whois_coeff","js_len_coeff","js_obf_len_coeff")

#pooled_sampling$summary()
# Loop through the parameters
for(i in 1:9){
    # Create the subplot
    hist(pooled_sampling$draws(row_names[i]), main=row_labels[i], xlab=row_labels[i])
  }

```

## Convergence diagnostics

MCMC convergence chains visualization
```{r, echo=FALSE}
plotConvergence <- function (draws, paramName){
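  # Matrix-format draws stack the 3 chains one after another, so derive the
  # per-chain length from the data instead of hard-coding 500
  iters <- nrow(draws) / 3
  chain1 <- as.vector(draws[1:iters, 1])
  chain2 <- as.vector(draws[(iters + 1):(2 * iters), 1])
  chain3 <- as.vector(draws[(2 * iters + 1):(3 * iters), 1])
  indices <- 1:iters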
  data <- data.frame(indices, chain1, chain2, 
                              chain3)
  
  p <- ggplot(data, aes(x=indices)) +
    ggtitle(paste("Pooled model - Convergence of",paramName,"\n",iters,"sampling iterations, no warm-up")) +
    xlab("iteration") + 
    ylab(paramName) +
    theme(plot.title = element_text(hjust = 0.5, size=10), legend.position = "right") +
    geom_line(aes(y = chain1, color = "chain1")) + 
    geom_line(aes(y = chain2, color = "chain2")) +
    geom_line(aes(y = chain3, color = "chain3")) + 
    #geom_line(aes(y = chain4, color = "chain4")) +
    scale_color_manual(name = "MCMC", values = c("chain1" = "red", "chain2" = "blue", "chain3" = "black"))
  return(p)
}

intercept_draws <- pooled_sampling$draws("intercept", format = "matrix")
p1 <- plotConvergence(intercept_draws, "intercept")

https_draws <- pooled_sampling$draws("https_coeff", format = "matrix")
p2 <- plotConvergence(https_draws, "https coefficient")

grid.arrange(p1, p2, ncol = 2)
# whois_draws <- pooled_sampling$draws("whois_coeff[1]", format = "matrix")
# plotConvergence(whois_draws, "whois coefficient")

#js_len_draws <- pooled_sampling$draws("js_len_coeff[1]", format = "matrix")
#plotConvergence(js_len_draws, "JS length coefficient")

#js_obf_len_draws <- pooled_sampling$draws("js_obf_len_coeff[1]", format = "matrix")
#plotConvergence(js_obf_len_draws, "JS obfuscated length coefficient")
```
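
The trace plots rely on separating the merged draws back into chains. A minimal sketch of that step, assuming (as the plotting function does) that the matrix-format draws are stacked chain after chain and that every chain has the same length; `split_chains` is a hypothetical helper and the draws are simulated:

```r
# Hypothetical helper: reshape stacked draws into one column per chain.
split_chains <- function(draws, n_chains) {
  draws <- as.vector(draws)
  stopifnot(length(draws) %% n_chains == 0)
  # matrix() fills column by column, so each column holds one chain
  chains <- matrix(draws, ncol = n_chains)
  colnames(chains) <- paste0("chain", seq_len(n_chains))
  chains
}

# Simulated example: 3 chains of 1000 draws each, stacked in one vector
set.seed(42)
stacked <- c(rnorm(1000, mean = 0), rnorm(1000, mean = 5), rnorm(1000, mean = 10))
chains <- split_chains(stacked, n_chains = 3)

print(dim(chains))
print(round(colMeans(chains), 1))
```

Deriving the chain length from the data keeps the plot correct when `iter_sampling` changes between runs.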

HMC-specific convergence diagnostics
```{r, eval=TRUE, include=TRUE}
pooled_sampling$diagnostic_summary()
```

$\hat{R}$-values and effective sample size
```{r, echo=FALSE}
# The pooled model has one set of coefficients shared by all countries,
# so a single summary call is enough (no loop over countries)
summaryDiagnostics <- pooled_sampling$summary(c("intercept","https_coeff",
    "whois_coeff","js_len_coeff","js_obf_len_coeff"))[, c("variable", "rhat", "ess_bulk", "ess_tail")]
summaryDiagnostics$variable <- seq_len(nrow(summaryDiagnostics))
summaryDiagnostics <- summaryDiagnostics %>% rename(index = variable)

p1 <- ggplot(summaryDiagnostics, aes(x = index, y = rhat)) +
  geom_point() +
  geom_hline(yintercept = 1.05, color = "red") +
  annotate("text", x = Inf, y = 1.055, label = "critical Rhat value = 1.05", 
           hjust = 1, color = "red") + 
  xlab("Index") +
  ylab("Rhat values") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10)) + 
  ggtitle("Rhat values\n of all coefficients")

p2 <- ggplot(summaryDiagnostics, aes(x = index, y = ess_bulk)) +
  geom_point() +
  # 3 chains x 1000 sampling iterations = 3000 posterior draws
  geom_hline(yintercept = 3000, color = "red") +
  annotate("text", x = Inf, y = 2900, label = "Samples = 3000", 
           hjust = 1, color = "red") + 
  xlab("Index") +
  ylab("Bulk ESS values") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10)) + 
  ggtitle("Bulk effective sample size\n of all coefficients")

p3 <- ggplot(summaryDiagnostics, aes(x = index, y = ess_tail)) +
  geom_point() +
  # 3 chains x 1000 sampling iterations = 3000 posterior draws
  geom_hline(yintercept = 3000, color = "red") +
  annotate("text", x = Inf, y = 2900, label = "Samples = 3000", 
           hjust = 1, color = "red") + 
  xlab("Index") +
  ylab("Tail ESS values") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.title = element_text(size = 10)) + 
  ggtitle("Tail effective sample size\n of all coefficients")

grid.arrange(p1, p2, p3, ncol = 3)
```
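
For intuition about what the $\hat{R}$ values measure, the sketch below computes the classic between-chain and within-chain variance ratio by hand on simulated chains. This is the basic Gelman-Rubin form, not the rank-normalized split-$\hat{R}$ that `summary()` reports, and `basic_rhat` is a hypothetical helper.

```r
# Hypothetical helper: basic R-hat from a matrix with one column per chain.
basic_rhat <- function(chains) {
  n <- nrow(chains)                     # draws per chain
  chain_means <- colMeans(chains)
  B <- n * var(chain_means)             # between-chain variance
  W <- mean(apply(chains, 2, var))      # within-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(123)
# Three chains sampling the same distribution: R-hat should be close to 1
mixed <- matrix(rnorm(3 * 1000), ncol = 3)
# One chain stuck in a different region: R-hat should clearly exceed 1.05
stuck <- mixed
stuck[, 3] <- stuck[, 3] + 3

rhat_mixed <- basic_rhat(mixed)
rhat_stuck <- basic_rhat(stuck)
cat("well-mixed:", round(rhat_mixed, 3),
    " one chain shifted:", round(rhat_stuck, 3), "\n")
```

When the chains agree, the between-chain term adds almost nothing and the ratio stays near 1. A chain that explores a different region inflates the pooled variance and pushes the ratio above the 1.05 threshold drawn in the plot.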

## Posterior predictive checks

```{r, echo=FALSE}
metricsName <- c("Accuracy", "Precision", "Recall", "F1")
metricsSummary <- data.frame(Metrics=metricsName)
metricsSummary$Accuracy <- NULL
metricsSummary$Precision <- NULL
metricsSummary$Recall <- NULL
metricsSummary$F1 <- NULL
#print(metricsSummary)
sumTP <- 0
sumTN <- 0
sumFP <- 0
sumFN <- 0
listTrue <- c()
listPred <- c()
for (k in 1:K){
  #predicted_train = c()
  #for (i in 1:N_list[[k]]){
  #  draws <- pooled_sampling$draws(paste("label_train_pred[",k,",",i,"]", sep=""), format = "matrix")
  #  predicted_train <- c(predicted_train, as.vector(draws[1, ]))
  #}

  #true_train = label_list[[k]][1:N_list[[k]]]
  #confusion_matrix <- table(predicted_train, true_train)

  true_train = label_list[[k]][1:N_list[[k]]]
  
  js_len_truncated <- js_len_list[[k]][1:N_list[[k]]]
  js_obf_len_truncated <- js_obf_len_list[[k]][1:N_list[[k]]]
  https_truncated <- https_list[[k]][1:N_list[[k]]]
  whois_truncated <- whois_list[[k]][1:N_list[[k]]]
  
  intercept <- mean(as.vector(pooled_sampling$draws(paste("intercept",sep=""), format = "matrix")[, 1]))
  js_len_coeff <- mean(as.vector(pooled_sampling$draws(paste("js_len_coeff",sep=""), format = "matrix")[, 1]))
  js_obf_len_coeff <- mean(as.vector(pooled_sampling$draws(paste("js_obf_len_coeff",sep=""), format = "matrix")[, 1]))
  https_coeff <- mean(as.vector(pooled_sampling$draws(paste("https_coeff",sep=""), format = "matrix")[, 1]))
  whois_coeff <- mean(as.vector(pooled_sampling$draws(paste("whois_coeff",sep=""), format = "matrix")[, 1]))
 
    predicted_train <- bernoulli_logit(intercept, js_len_coeff, js_obf_len_coeff, https_coeff, whois_coeff, js_len_truncated, js_obf_len_truncated, https_truncated, whois_truncated)
   
  listTrue <- c(listTrue, true_train)
  listPred <- c(listPred, predicted_train)
  confusion_matrix <- table(predicted_train, true_train)
  # table(predicted, true) puts predictions in rows and true labels in columns
  TP <- confusion_matrix[2,2]
  TN <- confusion_matrix[1,1]
  FP <- confusion_matrix[2,1]
  FN <- confusion_matrix[1,2]
  #cat(TP,TN,FP,FN)
  sumTP <- sumTP + TP
  sumTN <- sumTN + TN
  sumFP <- sumFP + FP
  sumFN <- sumFN + FN
  accuracy <- (TP+TN)/(TP+FP+FN+TN) 
  precision <- TP/(TP+FP)
  recall <- TP/(TP+FN)
  f1 <-  2*(precision*recall)/(precision+recall)
  metricsSummary[,countries[k]] <- c(accuracy, precision,recall,f1)
}
accuracy <- (sumTP+sumTN)/(sumTP+sumFP+sumFN+sumTN) 
prevalence <- (sumTP+sumFN)/(sumTP+sumFP+sumFN+sumTN) 
sensitivity <- sumTP/(sumTP+sumFN)
specificity <- sumTN/(sumTN+sumFP)
precision <- sumTP/(sumTP+sumFP)
recall <- sumTP/(sumTP+sumFN)
f1 <-  2*(precision*recall)/(precision+recall)
metricsSummary[,"All countries"] <- c(accuracy, precision,recall,f1)
metricsSummary
```

```{r, warning = FALSE}
confusion_matrix <- tibble("actual" = listTrue,
                     "prediction" = listPred)


basic_table <- table(confusion_matrix)
cfm <- as_tibble(basic_table)
plot_confusion_matrix(cfm, 
                      target_col = "actual", 
                      prediction_col = "prediction",
                      counts_col = "n", palette = "Oranges")
```

## Predictive performance assessment

```{r, echo=FALSE}
metricsName <- c("Accuracy", "Precision", "Recall", "F1")
metricsSummary <- data.frame(Metrics=metricsName)
metricsSummary$Accuracy <- NULL
metricsSummary$Precision <- NULL
metricsSummary$Recall <- NULL
metricsSummary$F1 <- NULL
#print(metricsSummary)
sumTP <- 0
sumTN <- 0
sumFP <- 0
sumFN <- 0
listTrue <- c()
listPred <- c()
for (k in 1:K){
  # predicted_test = c()
  # for (i in 1:M_list[[k]]){
  #  draws <- pooled_sampling$draws(paste("label_test_pred[",k,",",i,"]", sep=""), format = "matrix")
  #  predicted_test <- c(predicted_test, as.vector(draws[1, ]))
  # }
  true_test = label_test_list[[k]][1:M_list[[k]]]
  
  js_len_truncated <- js_len_test_list[[k]][1:M_list[[k]]]
  js_obf_len_truncated <- js_obf_len_test_list[[k]][1:M_list[[k]]]
  https_truncated <- https_test_list[[k]][1:M_list[[k]]]
  whois_truncated <- whois_test_list[[k]][1:M_list[[k]]]
  
  intercept <- mean(as.vector(pooled_sampling$draws(paste("intercept",sep=""), format = "matrix")[, 1]))
  js_len_coeff <- mean(as.vector(pooled_sampling$draws(paste("js_len_coeff",sep=""), format = "matrix")[, 1]))
  js_obf_len_coeff <- mean(as.vector(pooled_sampling$draws(paste("js_obf_len_coeff",sep=""), format = "matrix")[, 1]))
  https_coeff <- mean(as.vector(pooled_sampling$draws(paste("https_coeff",sep=""), format = "matrix")[, 1]))
  whois_coeff <- mean(as.vector(pooled_sampling$draws(paste("whois_coeff",sep=""), format = "matrix")[, 1]))
  
  predicted_test <- bernoulli_logit(intercept, js_len_coeff, js_obf_len_coeff, https_coeff, whois_coeff, js_len_truncated, js_obf_len_truncated, https_truncated, whois_truncated)
  
  confusion_matrix <- table(predicted_test, true_test)
  listTrue <- c(listTrue, true_test)
  listPred <- c(listPred, predicted_test)

  # table(predicted, true) puts predictions in rows and true labels in columns
  TP <- confusion_matrix[2,2]
  TN <- confusion_matrix[1,1]
  FP <- confusion_matrix[2,1]
  FN <- confusion_matrix[1,2]

  sumTP <- sumTP + TP
  sumTN <- sumTN + TN
  sumFP <- sumFP + FP
  sumFN <- sumFN + FN
  accuracy <- (TP+TN)/(TP+FP+FN+TN) 
  precision <- TP/(TP+FP)
  recall <- TP/(TP+FN)
  f1 <-  2*(precision*recall)/(precision+recall)
  metricsSummary[,countries[k]] <- c(accuracy, precision,recall,f1)
}
accuracy <- (sumTP+sumTN)/(sumTP+sumFP+sumFN+sumTN) 
prevalence <- (sumTP+sumFN)/(sumTP+sumFP+sumFN+sumTN) 
sensitivity <- sumTP/(sumTP+sumFN)
specificity <- sumTN/(sumTN+sumFP)
precision <- sumTP/(sumTP+sumFP)
recall <- sumTP/(sumTP+sumFN)
f1 <-  2*(precision*recall)/(precision+recall)
metricsSummary[,"All countries"] <- c(accuracy, precision,recall,f1)
metricsSummary
```

```{r, echo=FALSE}
confusion_matrix <- tibble("actual" = listTrue,
                     "prediction" = listPred)


basic_table <- table(confusion_matrix)
cfm <- as_tibble(basic_table)
plot_confusion_matrix(cfm, 
                      target_col = "actual", 
                      prediction_col = "prediction",
                      counts_col = "n", palette = "Oranges")
```


## Prior sensitivity analysis

```{r, echo=FALSE}
# Compiling the pooled Stan model
file_pooled_prior_sensitivity1 <- file.path("models/model_pooled_prior_sensitivity_1.stan")
model_pooled_prior_sensitivity1 <- cmdstan_model(file_pooled_prior_sensitivity1)
model_pooled_prior_sensitivity1$compile(quiet = FALSE)

file_pooled_prior_sensitivity2 <- file.path("models/model_pooled_prior_sensitivity_2.stan")
model_pooled_prior_sensitivity2 <- cmdstan_model(file_pooled_prior_sensitivity2)
model_pooled_prior_sensitivity2$compile(quiet = FALSE)
```

Sampling options for the two prior sensitivity models:
```{r}
  pooled_sampling_prior_sensitivity1 <- model_pooled_prior_sensitivity1$sample(data = stan_data, chains=3, iter_warmup = 1000, iter_sampling = 500, refresh=0)
  pooled_sampling_prior_sensitivity2 <- model_pooled_prior_sensitivity2$sample(data = stan_data, chains=3, iter_warmup = 1000, iter_sampling = 500, refresh=0)
```

```{r, warning=FALSE, echo=FALSE}
loo_loglike_pooled0 <- pooled_sampling$loo(variables="log_likelihood",r_eff=TRUE)
loo_loglike_pooled1 <- pooled_sampling_prior_sensitivity1$loo(variables="log_likelihood",r_eff=TRUE)
loo_loglike_pooled2 <- pooled_sampling_prior_sensitivity2$loo(variables="log_likelihood",r_eff=TRUE)

cat("The elpd value of the pooled model 0 is\n")
print(loo_loglike_pooled0$estimates["elpd_loo", "Estimate"])

cat("The elpd value of the pooled model 1 is\n")
print(loo_loglike_pooled1$estimates["elpd_loo", "Estimate"])

cat("The elpd value of the pooled model 2 is\n")
print(loo_loglike_pooled2$estimates["elpd_loo", "Estimate"])

```

```{r}
loo_compare_pooled <- loo_compare(x = list(loo_loglike_pooled0=loo_loglike_pooled0, loo_loglike_pooled1=loo_loglike_pooled1, loo_loglike_pooled2=loo_loglike_pooled2))
print(loo_compare_pooled)
```

elpd_loo is the Bayesian LOO estimate of the expected log pointwise predictive density, computed as a sum of N individual pointwise log predictive densities. In the comparison table, elpd_diff is the difference in elpd_loo between two models. When more than two models are compared, as is the case here with three models, each difference is computed relative to the model with the highest elpd_loo.

se_diff is the standard error of the component-wise differences of elpd_loo between two models (Eq 24 in VGG2017). This SE is smaller than the SE of the individual models because the pointwise values are correlated: some observations are easier and some more difficult to predict for all models.

As a quick rule: if the elpd difference (elpd_diff in the loo package) is less than 4, the difference is small, the models have very similar predictive performance, and it does not matter if the normal approximation fails or the SE is underestimated. If elpd_diff is larger than 4, compare it to its standard error (se_diff). When elpd_diff is larger than 4, the number of observations is larger than 100, and the model is not badly misspecified, the normal approximation and the SE are a quite reliable description of the uncertainty in the difference @akiLoo.

The SE assumes that the normal approximation describes well the uncertainty in the expected difference. Because the cross-validation folds are not independent, the SE tends to be underestimated, especially if the number of observations is small or the models are badly misspecified. The normal approximation as a whole tends to fail if the models are very similar or badly misspecified.


# Model comparison

```{r, echo=FALSE, warning=FALSE}
loo_loglik_separate <- separate_sampling$loo(variables="log_likelihood",r_eff=TRUE)
cat("The PSIS-LOO elpd value of the separate model is\n")
print(loo_loglik_separate$estimates["elpd_loo", "Estimate"])

pareto_k_separate <- loo_loglik_separate$diagnostics$pareto_k
cat("\nThe k-hat diagnostics of the separate model is\n")
print(loo_loglik_separate)


loo_loglik_pooled <- pooled_sampling$loo(variables="log_likelihood",r_eff=TRUE)
cat("The PSIS-LOO elpd value of the pooled model is\n")
print(loo_loglik_pooled$estimates["elpd_loo", "Estimate"])

pareto_k_pooled <- loo_loglik_pooled$diagnostics$pareto_k
cat("\nThe k-hat diagnostics of the pooled model is\n")
print(loo_loglik_pooled)

```
```{r, echo=FALSE, warning=FALSE}
df <- data.frame(pareto_k_separate = pareto_k_separate, 
                 pareto_k_pooled = pareto_k_pooled)

ggplot(df, aes(x = 1:length(pareto_k_separate), 
               y = pareto_k_separate, 
               fill = "pareto_k_separate")) + 
  geom_bar(stat = "identity", width = 0.5) + 
  geom_bar(aes(x = 1:length(pareto_k_pooled) + 0.5, 
               y = pareto_k_pooled, 
               fill = "pareto_k_pooled"), 
          stat = "identity", width = 0.5) + 
  scale_fill_manual(values = c("pareto_k_separate" = "purple", 
                                "pareto_k_pooled" = "orange")) + 
  labs(title = "PSIS k-values of the separate and pooled models", 
       x = "Index", 
       y = "Value", 
       fill = "") + 
  geom_hline(yintercept = 0.7, color = "darkred", size = 1) +
  annotate("text", x = 40, y = 0.75, label = "bad k-values = 0.7", color = "darkred", hjust = 1, vjust = 0) +
  geom_hline(yintercept = 1, color = "red", size = 1) +
  annotate("text", x = 40, y = 1.05, label = "very bad k-values = 1", color = "red", hjust = 1, vjust = 0) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, size=15)) +
  theme(legend.position = "bottom")
```

```{r}
loo_compare <- loo_compare(x = list(loo_loglik_separate=loo_loglik_separate, loo_loglik_pooled=loo_loglik_pooled))
print(loo_compare)
```


The ELPD is the theoretical expected log pointwise predictive density for a new dataset (Eq 1 in VGG2017), which can be estimated, e.g., using cross-validation. elpd_loo is the Bayesian LOO estimate of the expected log pointwise predictive density (Eq 4 in VGG2017) and is a sum of N individual pointwise log predictive densities.

As a quick rule: if the elpd difference (elpd_diff in the loo package) is less than 4, the difference is small. If it is larger than 4, compare the difference to the standard error of elpd_diff, which the loo package reports as se_diff (Sivula, Magnusson and Vehtari, 2020).
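
The quick rule can be written down as a small function. This is a sketch under stated assumptions: `interpret_elpd_diff` is a hypothetical helper, the inputs are illustrative numbers rather than results from this report, and the cut-off of two standard errors used in the second step is an assumption of this sketch, since the rule above only says to compare the difference with its standard error.

```r
# Hypothetical helper applying the quick rule to one row of a comparison table.
interpret_elpd_diff <- function(elpd_diff, se_diff, n_se = 2) {
  if (abs(elpd_diff) < 4) {
    return("small difference: similar predictive performance")
  }
  # n_se is an assumed cut-off, not part of the quoted rule
  if (abs(elpd_diff) > n_se * se_diff) {
    return("difference is large relative to its standard error")
  }
  "difference is within the uncertainty given by its standard error"
}

# Illustrative inputs only
res_small <- interpret_elpd_diff(elpd_diff = -1.2, se_diff = 0.8)
res_large <- interpret_elpd_diff(elpd_diff = -25.0, se_diff = 6.0)
res_unclear <- interpret_elpd_diff(elpd_diff = -6.0, se_diff = 5.0)
print(c(res_small, res_large, res_unclear))
```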

p_loo, the effective number of parameters, is the difference between elpd_loo and the non-cross-validated log posterior predictive density (Equations (4) and (3) in Vehtari, Gelman and Gabry (2017)). It is not needed for computing elpd_loo, but it has diagnostic value: it describes how much more difficult it is to predict future data than the observed data. Asymptotically, under certain regularity conditions, p_loo can be interpreted as the effective number of parameters. In well-behaved cases p_loo < N and p_loo < p, where p is the total number of parameters in the model. p_loo > N or p_loo > p indicates that the model has very weak predictive capability and may indicate a severe model misspecification. The interpretation of p_loo when there are warnings about high Pareto k values is discussed further below.


The Pareto k estimate is a diagnostic for Pareto smoothed importance sampling (PSIS), which is used to compute the components of elpd_loo. In importance-sampling LOO, the full posterior distribution is used as the proposal distribution. The Pareto k diagnostic estimates how far an individual leave-one-out distribution is from the full distribution. If leaving out an observation changes the posterior too much, importance sampling is not able to give a reliable estimate. If k < 0.5, the corresponding component of elpd_loo is estimated with high accuracy. If 0.5 < k < 0.7, the accuracy is lower but still acceptable. If k > 0.7, importance sampling is not able to provide a useful estimate for that component/observation. Pareto k is also useful as a measure of the influence of an observation: highly influential observations have high k values. Very high k values often indicate model misspecification, outliers, or mistakes in data processing.
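
The thresholds described above can be applied to a vector of Pareto k values with `cut()`. The sketch below uses made-up k values; with the fits in this report, the input would be `pareto_k_separate` or `pareto_k_pooled`. The category labels and the helper name `classify_pareto_k` are hypothetical, and the extra boundary at 1 follows the plot above, which marks k > 1 as very bad.

```r
# Hypothetical helper: bin Pareto k values using the thresholds in the text.
classify_pareto_k <- function(k) {
  cut(k,
      breaks = c(-Inf, 0.5, 0.7, 1, Inf),
      labels = c("good", "ok", "bad", "very bad"))
}

# Made-up k values for illustration
k_values <- c(0.12, 0.48, 0.55, 0.69, 0.83, 1.40, -0.05)
k_classes <- classify_pareto_k(k_values)

# Count of observations in each category
k_counts <- table(k_classes)
print(k_counts)

# Indices of observations whose elpd_loo component is unreliable (k > 0.7)
unreliable <- which(k_values > 0.7)
print(unreliable)
```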

When Pareto k is large (k > 0.7), the p_loo estimate gives some additional information about the problem:

If p_loo << p (the total number of parameters in the model), then the model is likely to be misspecified. Posterior predictive checks (PPCs) are then likely to also detect the problem. Try using an overdispersed model, or add more structural information (nonlinearity, mixture model, etc.).

If p_loo < p and the number of parameters p is relatively large compared to the number of observations (e.g., p>N/5), it is likely that the model is so flexible or the population prior so weak that it’s difficult to predict the left out observation (even for the true model). This happens, for example, in the simulated 8 schools (in VGG2017), random effect models with a few observations per random effect, and Gaussian processes and spatial models with short correlation lengths.

If p_loo > p, then the model is likely to be badly misspecified. If the number of parameters p<<N, then PPCs are also likely to detect the problem. See the case study at https://avehtari.github.io/modelselection/roaches.html for an example. If p is relatively large compared to the number of observations, say p>N/5 (more accurately, we should count the number of observations influencing each parameter, as in hierarchical models some groups may have few observations and other groups many), it is possible that PPCs won't detect the problem.
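
These three cases can be summarised as a rough decision helper. It is only a sketch: `interpret_p_loo` is a hypothetical function, the inputs are illustrative, and because the rules above do not say how much smaller than p counts as "p_loo << p", the helper simply treats any p_loo below p with few parameters relative to the data (p <= N/5) as the first case.

```r
# Hypothetical helper for reading p_loo when some Pareto k values exceed 0.7.
# p_loo: effective number of parameters, p: total parameters, n_obs: observations
interpret_p_loo <- function(p_loo, p, n_obs) {
  if (p_loo > p) {
    return("p_loo > p: model is likely badly misspecified")
  }
  if (p > n_obs / 5) {
    return("p_loo < p, many parameters relative to data: flexible model or weak prior")
  }
  "p_loo < p, few parameters relative to data: model is likely misspecified"
}

# Illustrative inputs only
case_few <- interpret_p_loo(p_loo = 3.1, p = 5, n_obs = 1000)
case_flexible <- interpret_p_loo(p_loo = 40, p = 60, n_obs = 100)
case_bad <- interpret_p_loo(p_loo = 12, p = 5, n_obs = 1000)
print(c(case_few, case_flexible, case_bad))
```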

Online documentation:

- FAQ: https://mc-stan.org/loo/articles/online-only/faq.html#elpd_interpretation
- Glossary: https://mc-stan.org/loo/reference/loo-glossary.html

# Discussion

## Existing issues

## Potential improvements

# Conclusion


# Reflection


# References