Chris1221 · Feb 10, 2017 · Jun 28, 2017
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -8,3 +8,6 @@ cran-comments.md
 project_ideas.md
 inst/data
 inst/paper
+site/
+docs/
+mkdocs.yml
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: goldi
 Type: Package
 Title: Gene Ontology Label Discernment and Identification 
-Version: 1.0.1
+Version: 1.1.0
 Date: 2017-06-27
 Authors@R: c(
     person("Christopher B.", "Cole", email = "[email protected]", role = c("aut", "cre", "cph")), 

diff --git a/README.md b/README.md
@@ -8,6 +8,8 @@
 
 ### Status 
 
+*This is the development branch; you probably don't want to install it unless you know what you're doing. If it breaks, you get to keep all the pieces!*
+
 The package is currently checked on `R-oldrel` (v`3.3.3`), `R-release` (v`3.4.0`), and `R-devel` (v`3.5.0`) on
 
 - [Ubuntu LTS 14.06 on Travis-CI](https://travis-ci.org/Chris1221/goldi)

diff --git a/docs/advanced.md b/docs/advanced.md
@@ -0,0 +1 @@
+# Advanced Usage
diff --git a/docs/contributing.md b/docs/contributing.md
@@ -0,0 +1 @@
+# How to contribute
diff --git a/docs/functions.md b/docs/functions.md
@@ -0,0 +1 @@
+# Functions
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,81 @@
+# Introduction
+
+Gene Ontology is a public database which, among other things, classifies gene functions according to the molecular functions involved, the cellular compartment where the product is active, and the relevant biological pathways in which they play a part. These classes, or "terms", are highly useful in molecular biology, and are often referred to in the literature. However, with the size and complexity of biomedical publications, this information is often difficult to study in aggregate.  
+
+`goldi` is a tool for identifying key terms in text. It was developed to identify ontological labels in free-form text, with a specific application to finding Gene Ontology terms in the biomedical literature under strict canonical NLP quality control.
+
+This package has a few main objectives:
+
+- Identifies terms in free text (we distribute the package with a set of Molecular Function terms from Gene Ontology for easy use)
+- Summarizes the quantity and quality of annotations across a corpus
+- Provides helpful functions for working with `goldi` class objects, including enrichment tests between two corpora. 
+
+`goldi` is freely distributed on CRAN and GitHub, and bug reports are always welcome.
+
+Please see the other pages on this website for descriptions of the main functions, as well as some examples of `goldi` in the real world.
+
+
+## Installation
+
+`goldi` can be installed from CRAN with
+
+```R
+install.packages("goldi")
+```
+
+Or, you may choose to install the latest stable development version with
+
+```R
+devtools::install_github("Chris1221/goldi")
+```
+
+## Status 
+
+The package is currently checked on `R-oldrel` (v`3.3.3`), `R-release` (v`3.4.0`), and `R-devel` (v`3.5.0`) on
+
+- [Ubuntu LTS 14.04 on Travis-CI](https://travis-ci.org/Chris1221/goldi)
+- [Xcode 8.3 on OSX 10.13 on Travis-CI](https://travis-ci.org/Chris1221/goldi)
+- Winbuilder 
+
+If you notice any issues, please raise them on the repository!
+
+## Minimal Example
+
+`goldi` attempts to identify terms in free text through semantic similarity. This means that if a term and a sentence share a high number of words, the sentence has a higher probability of talking about the term.
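+
+As a rough illustration of the word-overlap idea, the sketch below counts the words a term shares with each sentence using base R only. The `shared_words()` helper is hypothetical and is not part of `goldi`, whose actual matching and quality control may differ.
+
+```R
+# Hypothetical helper (not part of goldi): count the distinct words
+# that a term and a sentence have in common.
+shared_words <- function(term, sentence) {
+  # Lower-case the text and split it on anything that is not a letter or digit
+  tokenize <- function(x) unique(strsplit(tolower(x), "[^a-z0-9]+")[[1]])
+  length(intersect(tokenize(term), tokenize(sentence)))
+}
+
+term <- "ribosomal chaperone activity"
+
+shared_words(term, "In this sentence we will talk about ribosomal chaperone activity.")  # 3
+shared_words(term, "In this sentence we will talk about nothing.")                       # 0
+```
+
+The first sentence shares all three words with the term, so it is the more likely of the two to be talking about it.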
+
+Given the following input text and the included pre-computed term-document matrix for approximately 10,000 Gene Ontology molecular function terms, we can find which terms are discussed in our text.
+
+```R
+library(goldi)
+
+# Give the free-form text
+doc <- "In this sentence we will talk about ribosomal chaperone activity.
+	In this sentence we will talk about nothing. 
+	Here we discuss obsolete molecular terms."
+
+# Load in the included term document matrix for the terms
+data("TDM.go.df")
+
+# Pipe output and log to /dev/null
+output = "/dev/null"
+log = "/dev/null"
+
+# Run the function
+goldi(doc = doc, 
+  term_tdm = TDM.go.df,
+  output = output,
+  log = log,
+  object = TRUE)
+```
+
+Note that the above example sets a few other options. First, we don't want to see the output or the log for this example, so we send them to `/dev/null`. Second, we would like to return the output as an R object instead of writing it to a file, so we specify `object = TRUE`.
+
+This will output the following table:
+
+|          Term                |                               Context                            |
+| ---------------------------- | ---------------------------------------------------------------  |
+| ribosomal_chaperone_activity | In this sentence we will talk about ribosomal chaperone activity |
+
+This gives the term identified and the context in the free-form text where it was identified. This table forms the basis for all further analysis.
+
+## Getting help
+
+For help, please post an issue on the repository.
diff --git a/docs/license.md b/docs/license.md
@@ -0,0 +1 @@
+# License
diff --git a/docs/overexpression-analysis.md b/docs/overexpression-analysis.md
@@ -0,0 +1,127 @@
+# Examples
+
+Here you'll find some complete examples of start-to-finish analysis conducted in `goldi`. Please feel free to contribute! 
+
+## Overexpression of Terms in a Target Set of Abstracts
+
+In this analysis we seek to find Gene Ontology terms which may be overrepresented in a "target set" of abstracts, such as the results of a PubMed query.
+
+We first fetch all the results of a specific query from PubMed using the `RISmed` package and store their abstracts in a `data.frame`.
+
+
+```r
+library(RISmed)
+
+# Store the input string for reuse
+search_topic <- "anaphylaxis genetics"
+search_query <- EUtilsSummary(search_topic, mindate=2014, maxdate=2015)
+
+summary(search_query)
+
+pull <- EUtilsGet(search_query)
+
+data <- data.frame('Abstracts' = AbstractText(pull))
+
+# Drop the first entry, which seems to always be blank.
+# Subsetting the one-column data frame below returns a plain character vector.
+data[,1] <- as.character(data[,1])
+data <- data[-1,]
+
+head(data)
+```
+
+We want to compare the terms found here to something, so we grab all abstracts from 2014 to 2015 which match a similar field, i.e. immunology genetics.  Note that only 1000 records are taken by default.
+
+
+```r
+# Store the input string for reuse
+search_topic <- "immunology genetics"
+search_query <- EUtilsSummary(search_topic, # Find all articles matching the string
+                              mindate=2014, # From 2014
+                              maxdate=2015, # to 2015
+                              retmax = 1000)  # This is the default but explicit
+
+summary(search_query)
+
+pull_control <- EUtilsGet(search_query)
+
+control <- data.frame('Abstracts' = AbstractText(pull_control))
+
+# Drop the first entry, which seems to always be blank.
+# Subsetting the one-column data frame below returns a plain character vector.
+control[,1] <- as.character(control[,1])
+control <- control[-1,]
+
+head(control)
+```
+
+We now run `goldi` on each of the entries in both the target group and the control group.
+
+
+```r
+ data(package = "goldi", "TDM.go.df")
+  TDM.go.df <- TDM.go.df[, !duplicated(colnames(TDM.go.df))]
+
+  lims <- c(1,2,2,3,4,5,6,7,7,8,10)
+
+  results <- list()
+
+  for(i in seq_along(data)){
+
+    # Skip empty abstracts
+    if(data[i] != ""){
+      # NOTE: `terms` is not defined anywhere in this example
+      results[[i]] <- goldi(doc = data[i],
+            terms = terms,
+            lims = lims,
+            syn = FALSE,
+            object = TRUE,
+            log = "/dev/null",
+            reader = "local",
+            output = "/dev/null",
+            term_tdm = TDM.go.df)
+    }
+  }
+
+  results <- do.call("rbind", results)
+
+  control_results <- list()
+
+  for(i in seq_along(control)){
+    # Skip empty abstracts
+    if(control[i] != ""){
+      # NOTE: `terms` is not defined anywhere in this example
+      control_results[[i]] <- goldi(doc = control[i],
+            terms = terms,
+            lims = lims,
+            syn = FALSE,
+            object = TRUE,
+            log = "/dev/null",
+            reader = "local",
+            output = "/dev/null",
+            term_tdm = TDM.go.df)
+    }
+  }
+  control_results <- do.call("rbind", control_results)
+```
+
+This gives us two objects holding the results, `results` and `control_results`. We summarize the results and take all the terms in the result set with more than two occurrences to minimize spurious hits. We use the method employed by GOrilla to calculate the enrichment of terms in the target set, limited to terms that have been identified more than five times, and calculate $P$ values using the hypergeometric distribution.
+
+
+```r
+goldi::enrichment(target = results,
+                  control = control_results,
+                  threshold = 5)
+```
+
+
+| Term  | Enrichment | P | 
+| --- | --- | --- |
+| protein_C_(activated)_activity |    66.55 |     1.257e-20 |
+| CD27_receptor_activity | 66 | 2.614e-21 |
+| CD40_receptor_activity | 66 | 2.614e-21 |
+| receptor_activator_activity | 66 | 2.614e-21 |
+| receptor_activity | 66 | 2.614e-21 |
+| IgE_binding | 65.34 | 8.639e-20 |
+| kinase_activator_activity |       65.34 |     8.639e-20 |
+| kinase_activity |            65.34 |     8.639e-20 |
+| B_cell_receptor_activity |       62.23 |     5.11e-14 | 
+| T_cell_receptor_activity |       62.23 |     5.11e-14 | 
+
+
+This replicates the analysis presented in our prepublication. The search strings above can easily be changed, and any set of strings may be used for comparison.
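+
+For readers who want to see the kind of calculation involved, here is a minimal base R sketch of a GOrilla-style enrichment score with a hypergeometric $P$ value. It is not necessarily what `enrichment()` does internally, the counts are made up for illustration, and counting per abstract is an assumption.
+
+```r
+# Hypothetical counts, for illustration only
+N <- 1000  # abstracts in the target and control sets combined
+B <- 15    # abstracts (out of N) in which the term was identified
+n <- 60    # abstracts in the target set
+b <- 12    # target abstracts in which the term was identified
+
+# Enrichment as defined by GOrilla: (b / n) / (B / N)
+enrichment_score <- (b / n) / (B / N)
+
+# Upper-tail hypergeometric probability P(X >= b):
+# drawing n abstracts from N, of which B carry the term
+p_value <- phyper(b - 1, B, N - B, n, lower.tail = FALSE)
+
+enrichment_score
+p_value
+```
+
+An enrichment score above 1 means the term appears more often in the target set than would be expected from its overall frequency.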
+
diff --git a/docs/synonyms.md b/docs/synonyms.md
@@ -0,0 +1 @@
+# Working with Synonyms
diff --git a/docs/using_goldi.md b/docs/using_goldi.md
@@ -0,0 +1 @@
+# Using `goldi`
diff --git a/inst/paper b/inst/paper
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -0,0 +1,15 @@
+site_name: goldi
+pages:
+- Home: 'index.md'
+- User Guide:
+        - 'Quick Start': 'using_goldi.md'
+        - 'Synonyms': 'synonyms.md'
+        - 'Advanced Usage': 'advanced.md'
+        - 'Examples': 'overexpression-analysis.md'
+- About:
+        - 'Function Reference': 'functions.md' 
+        - 'License': 'license.md'
+        - 'Contributing': 'contributing.md'
+site_url: http://chrisbcole.me
+repo_url: https://github.com/Chris1221/goldi
+