v1.0.1 -> v1.1.0 #35

Open
wants to merge 16 commits into
base: master
3 changes: 3 additions & 0 deletions .Rbuildignore
@@ -8,3 +8,6 @@ cran-comments.md
project_ideas.md
inst/data
inst/paper
site/
docs/
mkdocs.yml
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: goldi
Type: Package
Title: Gene Ontology Label Discernment and Identification
Version: 1.0.1
Version: 1.1.0
Date: 2017-06-27
Authors@R: c(
person("Christopher B.", "Cole", email = "[email protected]", role = c("aut", "cre", "cph")),
2 changes: 2 additions & 0 deletions README.md
@@ -8,6 +8,8 @@

### Status

*This is the development branch; you probably don't want to install it unless you know what you're doing. If it breaks, you get to keep all the pieces!*

The package is currently checked on `R-oldrel` (v`3.3.3`), `R-release` (v`3.4.0`), and `R-devel` (v`3.5.0`) on

- [Ubuntu LTS 14.04 on Travis-CI](https://travis-ci.org/Chris1221/goldi)
1 change: 1 addition & 0 deletions docs/advanced.md
@@ -0,0 +1 @@
# Advanced Usage
1 change: 1 addition & 0 deletions docs/contributing.md
@@ -0,0 +1 @@
# How to contribute
1 change: 1 addition & 0 deletions docs/functions.md
@@ -0,0 +1 @@
# Functions
81 changes: 81 additions & 0 deletions docs/index.md
@@ -0,0 +1,81 @@
# Introduction

Gene Ontology is a public database which, among other things, classifies gene functions according to the molecular functions involved, the cellular compartment where the gene product is active, and the relevant biological pathways in which it plays a part. These classes, or "terms", are highly useful in molecular biology, and are often referred to in the literature. However, given the size and complexity of the biomedical literature, this information is often difficult to study in aggregate.

`goldi` is a tool for identifying key terms in text. It was developed to identify ontological labels in free-form text, with a specific application to finding Gene Ontology terms in the biomedical literature under strict, canonical NLP quality control.

This package has a few main objectives:

- Identifies terms in free text (we distribute the package with a set of Molecular Function terms from Gene Ontology for easy use)
- Summarizes the quantity and quality of annotations across a corpus
- Provides helpful functions for working with `goldi` class objects, including enrichment tests between two corpora

`goldi` is freely distributed on CRAN and GitHub, and bug reports are always welcome.

Please see the other pages on this website for descriptions of the main functions, as well as some examples of `goldi` in the real world.


## Installation

`goldi` can be installed from CRAN with

```R
install.packages("goldi")
```

Or, you may choose to install the latest stable development version with

```R
devtools::install_github("Chris1221/goldi")
```

## Status

The package is currently checked on `R-oldrel` (v`3.3.3`), `R-release` (v`3.4.0`), and `R-devel` (v`3.5.0`) on

- [Ubuntu LTS 14.04 on Travis-CI](https://travis-ci.org/Chris1221/goldi)
- [XCode 8.3 on OSX 10.13 on Travis-CI](https://travis-ci.org/Chris1221/goldi)
- Winbuilder

If you notice any issues, please raise them on the repository!

## Minimal Example

`goldi` attempts to identify terms in free text through semantic similarity. This means that if a term and a sentence share a high number of words, the sentence has a higher probability of talking about the term.
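
To make the idea of shared words concrete, here is a minimal sketch in base R. It is not the matching procedure that `goldi` itself uses; `shared_words` is a hypothetical helper, introduced here only to illustrate counting the words a term and a sentence have in common.

```r
# Hypothetical helper (not part of goldi): count the distinct words of a term
# that also appear in a sentence, ignoring case and punctuation.
shared_words <- function(term, sentence) {
  tokenize <- function(x) {
    words <- unique(strsplit(tolower(x), "[^a-z0-9]+")[[1]])
    words[nzchar(words)]  # drop empty strings left by leading separators
  }
  sum(tokenize(term) %in% tokenize(sentence))
}

shared_words("ribosomal chaperone activity",
             "In this sentence we will talk about ribosomal chaperone activity.")  # 3
shared_words("ribosomal chaperone activity",
             "In this sentence we will talk about nothing.")                       # 0
```

Under this simplified view, the first sentence shares all three words with the term, while the second shares none.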

Given the following input text and the included pre-computed term document matrix for approximately 10,000 Gene Ontology molecular function terms, we can find which terms are discussed in our text.

```R
library(goldi)

# Give the free form text
doc <- "In this sentence we will talk about ribosomal chaperone activity.
In this sentence we will talk about nothing.
Here we discuss obsolete molecular terms."

# Load in the included term document matrix for the terms
data("TDM.go.df")

# Send output and log to /dev/null
output <- "/dev/null"
log <- "/dev/null"

# Run the function
goldi(doc = doc,
      term_tdm = TDM.go.df,
      output = output,
      log = log,
      object = TRUE)
```

Note that in the above example we set a few other options. Firstly, we don't want to see the output or the log for this example, so we send them to `/dev/null`. Secondly, we would like to return the output as an R object instead of writing it to a file, so we specify `object = TRUE`.

This will output the following table:

| Term | Context |
| ---------------------------- | --------------------------------------------------------------- |
| ribosomal_chaperone_activity | In this sentence we will talk about ribosomal chaperone activity |

The table gives each term identified and the context in the free-form text where it was identified. This table forms the basis for all further analysis.
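
As an illustration of how such a table can feed further analysis, the sketch below tallies how often each term appears. It assumes the returned object behaves like a `data.frame` with `Term` and `Context` columns, as in the table above; the `hits` object and its rows are made up for illustration and are not real `goldi` output.

```r
# Hypothetical stand-in for an object returned by goldi(); the column names are
# assumed from the table above and the rows are invented for illustration.
hits <- data.frame(
  Term = c("ribosomal_chaperone_activity", "kinase_activity", "kinase_activity"),
  Context = c("In this sentence we will talk about ribosomal chaperone activity",
              "Here we mention kinase activity",
              "Kinase activity is mentioned again"),
  stringsAsFactors = FALSE
)

# Count how many times each term was identified, most frequent first
term_counts <- sort(table(hits$Term), decreasing = TRUE)
term_counts
```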

## Getting help

For help, please post an issue on the repository.
1 change: 1 addition & 0 deletions docs/license.md
@@ -0,0 +1 @@
# License
127 changes: 127 additions & 0 deletions docs/overexpression-analysis.md
@@ -0,0 +1,127 @@
# Examples

Here you'll find some complete examples of start-to-finish analysis conducted in `goldi`. Please feel free to contribute!

## Overexpression of Terms in a Target Set of Abstracts

In this analysis we seek to find Gene Ontology terms which may be overrepresented in a "target set" of abstracts, such as the results of a PubMed query.

We first fetch all the results of a specific query from PubMed using the `RISmed` package and store their abstracts in a `data.frame`.


```r
library(RISmed)

# Store the input string for reuse
search_topic <- "anaphylaxis genetics"
search_query <- EUtilsSummary(search_topic, mindate=2014, maxdate=2015)

summary(search_query)

pull <- EUtilsGet(search_query)

data <- data.frame('Abstracts' = AbstractText(pull))

# Drop the first entry, which seems to always be blank.
# Subsetting a one-column data.frame this way returns a plain character vector.
data[,1] <- as.character(data[,1])
data <- data[-1,]

head(data)
```

We need a baseline against which to compare the terms found here, so we fetch abstracts from 2014 to 2015 that match a related field, in this case immunology genetics. Note that only 1000 records are retrieved by default.


```r
# Store the input string for reuse
search_topic <- "immunology genetics"
search_query <- EUtilsSummary(search_topic,   # Find all articles matching the string
                              mindate = 2014, # From 2014
                              maxdate = 2015, # to 2015
                              retmax = 1000)  # This is the default but explicit

summary(search_query)

pull_control <- EUtilsGet(search_query)

control <- data.frame('Abstracts' = AbstractText(pull_control))

# Drop the first entry, which seems to always be blank.
# Subsetting a one-column data.frame this way returns a plain character vector.
control[,1] <- as.character(control[,1])
control <- control[-1,]

head(control)
```

We now run `goldi` on each of the entries in both the target group and the control group.


```r
library(goldi)

# Load the bundled term document matrix and drop duplicated term columns
data("TDM.go.df", package = "goldi")
TDM.go.df <- TDM.go.df[, !duplicated(colnames(TDM.go.df))]

lims <- c(1,2,2,3,4,5,6,7,7,8,10)

# NOTE: `terms` is not created anywhere in this example; it must be defined beforehand.

# Annotate every non-empty abstract in the target set
results <- list()

for (i in seq_along(data)) {
  if (data[i] != "") {
    results[[i]] <- goldi(doc = data[i],
                          terms = terms,
                          lims = lims,
                          syn = FALSE,
                          object = TRUE,
                          log = "/dev/null",
                          reader = "local",
                          output = "/dev/null",
                          term_tdm = TDM.go.df)
  }
}

results <- do.call("rbind", results)

# Annotate every non-empty abstract in the control set
control_results <- list()

for (i in seq_along(control)) {
  if (control[i] != "") {
    control_results[[i]] <- goldi(doc = control[i],
                                  terms = terms,
                                  lims = lims,
                                  syn = FALSE,
                                  object = TRUE,
                                  log = "/dev/null",
                                  reader = "local",
                                  output = "/dev/null",
                                  term_tdm = TDM.go.df)
  }
}

control_results <- do.call("rbind", control_results)
```

This gives us two objects holding the results, `results` and `control_results`. We summarize the results and take all the terms in the result set with more than two occurrences to minimize spurious hits. We use the method employed by GOrilla to calculate the enrichment of terms in the target set, and limit it to those which have been identified more than five times. We calculate $P$ values using the hypergeometric distribution.


```r
goldi::enrichment(target = results,
                  control = control_results,
                  threshold = 5)
```


| Term | Enrichment | P |
| --- | --- | --- |
| protein_C_(activated)_activity | 66.55 | 1.257e-20 |
| CD27_receptor_activity | 66 | 2.614e-21 |
| CD40_receptor_activity | 66 | 2.614e-21 |
| receptor_activator_activity | 66 | 2.614e-21 |
| receptor_activity | 66 | 2.614e-21 |
| IgE_binding | 65.34 | 8.639e-20 |
| kinase_activator_activity | 65.34 | 8.639e-20 |
| kinase_activity | 65.34 | 8.639e-20 |
| B_cell_receptor_activity | 62.23 | 5.11e-14 |
| T_cell_receptor_activity | 62.23 | 5.11e-14 |


This replicates the analysis presented in our prepublication. The search terms above can easily be changed, and any set of strings may be used for comparison.
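
For readers who want to see the arithmetic behind an enrichment score and a hypergeometric $P$ value, here is a minimal sketch in base R. It is not the source of `goldi::enrichment`; `term_enrichment` is a hypothetical helper, the counts are invented, and treating the pooled target and control abstracts as the population is an assumption made for illustration.

```r
# Hypothetical helper (not goldi's implementation): GOrilla-style enrichment
# with a hypergeometric tail probability.
#   N: abstracts in total (assumed here to be target + control pooled)
#   B: abstracts, out of N, in which the term was identified
#   n: abstracts in the target set
#   b: target-set abstracts in which the term was identified
term_enrichment <- function(N, B, n, b) {
  enrichment <- (b / n) / (B / N)
  # P(X >= b) when n abstracts are drawn without replacement from N,
  # of which B carry the term
  p <- phyper(b - 1, B, N - B, n, lower.tail = FALSE)
  c(enrichment = enrichment, p = p)
}

# Invented counts, for illustration only
res <- term_enrichment(N = 1000, B = 40, n = 100, b = 20)
res
```

With these invented counts the term occurs in 20% of the target abstracts against 4% overall, giving an enrichment of 5.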

1 change: 1 addition & 0 deletions docs/synonyms.md
@@ -0,0 +1 @@
# Working with Synonyms
1 change: 1 addition & 0 deletions docs/using_goldi.md
@@ -0,0 +1 @@
# Using `goldi`
2 changes: 1 addition & 1 deletion inst/paper
Submodule paper updated from c51ba0 to 4887f7
15 changes: 15 additions & 0 deletions mkdocs.yml
@@ -0,0 +1,15 @@
site_name: goldi
pages:
- Home: 'index.md'
- User Guide:
    - 'Quick Start': 'using_goldi.md'
    - 'Synonyms': 'synonyms.md'
    - 'Advanced Usage': 'advanced.md'
    - 'Examples': 'overexpression-analysis.md'
- About:
    - 'Function Reference': 'functions.md'
    - 'License': 'license.md'
    - 'Contributing': 'contributing.md'
site_url: http://chrisbcole.me
repo_url: https://github.com/Chris1221/goldi
