Improving Span, or Span 2 #10

Merged: 89 commits, Sep 27, 2019
Commits
76a475d Add files via upload (May 4, 2019)
29c2ecd Add mixture (May 10, 2019)
9a64492 Fixed comments on mixture (May 13, 2019)
cb83a63 Not working (May 16, 2019)
7e452ea Add override for sample (May 19, 2019)
8f3b7fc Problem with NaN in logLikelihood (May 19, 2019)
c20c63b Fixed logLikelihood for PoissonRegression (May 20, 2019)
0d3e754 Fixed fit for ZeroPoisson (May 22, 2019)
92019f8 Add update to regressionCoefficients in mix (May 22, 2019)
4cd8b1c Fixed override and some warnings (May 22, 2019)
463daf3 Split methods to different classes (May 23, 2019)
8f7c62a Split methods to different classes (May 23, 2019)
f6ee4ab Made fun getLogObservation for regression (May 23, 2019)
c077202 Made fun getLogObservation for regression (May 24, 2019)
a1d2cc3 Made fun getLogObservation for regression (May 24, 2019)
b9d4df1 Fixed name for getPredictor. (May 25, 2019)
c121278 Fixed excess copying in ArrayRealVector (Jun 25, 2019)
8e1868e Try to run on real data (Jun 27, 2019)
a82a2b7 Add GC-content covariate (Jul 1, 2019)
6eb9777 Fixed unnecessary ArrayRealVector usages. (karl-crl, Jul 2, 2019)
6e9f79e Fixed line separator. (karl-crl, Jul 2, 2019)
497d363 Add files via upload (May 4, 2019)
6070566 Add mixture (May 10, 2019)
3860771 Fixed comments on mixture (May 13, 2019)
91c6bdd Not working (May 16, 2019)
2229aac Add override for sample (May 19, 2019)
f0585cf Problem with NaN in logLikelihood (May 19, 2019)
2653f3c Fixed logLikelihood for PoissonRegression (May 20, 2019)
5821597 Fixed fit for ZeroPoisson (May 22, 2019)
abbabe7 Add update to regressionCoefficients in mix (May 22, 2019)
f629966 Fixed override and some warnings (May 22, 2019)
2f7856a Split methods to different classes (May 23, 2019)
f365738 Split methods to different classes (May 23, 2019)
a55a737 Made fun getLogObservation for regression (May 23, 2019)
f8bfe75 Made fun getLogObservation for regression (May 24, 2019)
1d8a313 Made fun getLogObservation for regression (May 24, 2019)
2607df4 Fixed name for getPredictor. (May 25, 2019)
44c2387 Fixed excess copying in ArrayRealVector (Jun 25, 2019)
b4bf284 Try to run on real data (Jun 27, 2019)
b92a7d4 Fixed unnecessary ArrayRealVector usages. (karl-crl, Jul 2, 2019)
d93a631 Fixed line separator. (karl-crl, Jul 2, 2019)
2fb3df4 (test ver.) Add 1-22 + x chromosome (karl-crl, Jul 2, 2019)
93ae576 (test ver.) Fixed unnecessary file readings. (karl-crl, Jul 3, 2019)
d667022 (test ver.) Fit on 40 bam models. (karl-crl, Jul 8, 2019)
3451e35 (test ver.) Add Design Matrix to save memory (heap now 3.5G instead o… (karl-crl, Jul 8, 2019)
68f06ae (test ver.) Add TransportedDesignMatrix (karl-crl, Jul 9, 2019)
7cd0812 (test ver.) Fixed creation a lot of tiny DoubleArray (karl-crl, Jul 10, 2019)
edb11ad (test ver.) Fixed unnecessary boxing (karl-crl, Jul 10, 2019)
cf5c967 (test ver.) Rewrite counting coverMe, coverInput, GCcontent (karl-crl, Jul 11, 2019)
3971549 (test ver.) Add mappability covariate (karl-crl, Jul 11, 2019)
14bc7e3 (test ver.) Add AIC, BIC and F-test(to test linear restrictions) (karl-crl, Jul 18, 2019)
48eb263 Merge branch 'master' into prototype (Jul 18, 2019)
1d75c44 Merge pull request #1 from karl-crl/prototype (Jul 18, 2019)
ebeef0e Changed link functions from properties to methods. (karl-crl, Jul 18, 2019)
100d6e5 Merge branch 'improving-span' into master (Jul 18, 2019)
e9bdbdf Merge pull request #6 from karl-crl/master (Jul 18, 2019)
39c9bc4 Add PeakCalling with GLM (karl-crl, Jul 29, 2019)
cff7023 (test) Add fitting GLM model and creation of a new GLM model (karl-crl, Jul 30, 2019)
c31ef25 (test) Added SpanPathToData and added some fixes. (karl-crl, Jul 31, 2019)
b13dc08 Autodetect 2bit in the same folder as chrom.sizes (dievsky, Aug 1, 2019)
5bdab33 Replaced 0 -> 1 in mixture internals (dievsky, Aug 1, 2019)
efe37db Change WLSMultipleLinearRegression.java to *.kt (karl-crl, Aug 2, 2019)
141b2d7 (test) Add possibility not to use mappability. (karl-crl, Aug 6, 2019)
72c9848 Merge remote-tracking branch 'origin/improving-span' into improving-span (karl-crl, Aug 6, 2019)
3fecd8a Change ArrayRealVector.map() in IntegerRegressionEmissionScheme; mode… (karl-crl, Aug 8, 2019)
84fde90 (test) Reduce IntegerRegressionEmissionScheme.update() time twice (karl-crl, Aug 9, 2019)
c9e683c (test) Reduce IntegerRegressionEmissionScheme.update time (karl-crl, Aug 9, 2019)
a967733 (test) Reduce IntegerRegressionEmissionScheme.update time (0.1) (karl-crl, Aug 9, 2019)
7016e62 (test) Optimize calculateBeta() (karl-crl, Aug 14, 2019)
9d6faf6 Reduced model fitting time (karl-crl, Aug 14, 2019)
2e48721 (test) Change RealVector and RealMatrix to F64Vector (karl-crl, Aug 16, 2019)
6bf86f9 Add semanticCheck for ZeroPoissonMixture (karl-crl, Aug 22, 2019)
31f2442 Merge remote-tracking branch 'origin/master' into improving-span (dievsky, Sep 4, 2019)
aded821 Trying to fix Span hierarchy (dievsky, Sep 5, 2019)
1803108 Moved ZeroPoissonMixture to span root (dievsky, Sep 6, 2019)
854e76c Added signal-to-noise report for Poisson regression mixture (dievsky, Sep 11, 2019)
6f0eb3e Simplified WLS regression, design matrix generation etc. (dievsky, Sep 11, 2019)
301621e Comments, checks and tests for WLSRegression (dievsky, Sep 12, 2019)
dd33d31 Comments and code simplification for regression emission schemes (dievsky, Sep 12, 2019)
49f4a1d Comments, renaming, return type tweaking in WLSRegression (dievsky, Sep 12, 2019)
e30f274 Extracted assertEquals for DoubleArray's into Tests (dievsky, Sep 12, 2019)
e5b19a8 Correct state flipping for Poisson regression mixture and tests (dievsky, Sep 12, 2019)
8cfd840 Moved binned mean GC to CpGContent (dievsky, Sep 18, 2019)
366b034 More generic deserialization error messages (dievsky, Sep 19, 2019)
f1ad396 Fixed ClassificationModelTest (dievsky, Sep 20, 2019)
27d56e9 Added a test for binnedMeanCG (dievsky, Sep 20, 2019)
0925df6 Fixed various tests (dievsky, Sep 20, 2019)
91aa978 Fixed ClassificationModelTest (dievsky, Sep 20, 2019)
ff4a26f Added mapability to test organism data generation (dievsky, Sep 23, 2019)
3 changes: 2 additions & 1 deletion src/main/kotlin/org/jetbrains/bio/genome/Genome.kt
@@ -243,6 +243,7 @@ class Genome private constructor(
genesGTFPath: Path? = null,
genesDescriptionsPath: Path? = null
) = getOrAdd(build, true) {
+ val chromSizesDir = chromSizesPath.parent
Genome(
build,
annotationsConfig = annotationsConfig,
@@ -253,7 +254,7 @@ class Genome private constructor(
cytobandsPath = cytobandsPath,
repeatsPath = repeatsPath,
gapsPath = gapsPath,
- twoBitPath = twoBitPath,
+ twoBitPath = twoBitPath ?: (chromSizesDir / "$build.2bit").let { if (it.exists) it else null },
genesGTFPath = genesGTFPath,
genesDescriptionsPath = genesDescriptionsPath
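
The fallback introduced in this hunk can be sketched in isolation. The helper below is hypothetical (`resolveTwoBitPath` is not part of the PR) and uses only `java.nio.file` rather than the project's own `/` and `exists` path extensions. It assumes the intent is: use the explicit path if one is given, otherwise look for `<build>.2bit` next to the chrom.sizes file and use it only if it exists.

```kotlin
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper, not part of the PR: returns the explicit path when given,
// otherwise "<build>.2bit" in the chrom.sizes folder if that file exists, else null.
fun resolveTwoBitPath(explicit: Path?, chromSizesPath: Path, build: String): Path? =
    explicit ?: chromSizesPath.parent
        ?.resolve("$build.2bit")
        ?.takeIf { Files.exists(it) }
```

An explicit `twoBitPath` always wins, so existing callers that pass one should see no change in behaviour.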

12 changes: 7 additions & 5 deletions src/main/kotlin/org/jetbrains/bio/genome/PeaksInfo.kt
@@ -26,11 +26,13 @@ object PeaksInfo {

private fun Long.formatLongNumber() = String.format("%,d", this).replace(',', ' ')

- fun compute(genomeQuery: GenomeQuery,
- peaksStream: Stream<Location>,
- src: URI?,
- paths: List<Path>,
- fragment: Fragment = AutoFragment): Map<String, String> {
+ fun compute(
+ genomeQuery: GenomeQuery,
+ peaksStream: Stream<Location>,
+ src: URI?,
+ paths: List<Path>,
+ fragment: Fragment = AutoFragment
+ ): Map<String, String> {
val peaks = peaksStream.collect(Collectors.toList())
val peaksLengths = peaks.map { it.length().toDouble() }.toDoubleArray()
val peaksCount = peaksLengths.count()
@@ -1,9 +1,12 @@
package org.jetbrains.bio.genome

import com.google.common.math.IntMath
+ import gnu.trove.list.array.TFloatArrayList
import org.apache.commons.csv.CSVFormat
import org.apache.log4j.Logger
import org.jetbrains.bio.Configuration
+ import org.jetbrains.bio.big.BigWigFile
+ import org.jetbrains.bio.big.FixedStepSection
import org.jetbrains.bio.genome.sequence.Nucleotide
import org.jetbrains.bio.genome.sequence.TwoBitWriter
import org.jetbrains.bio.io.FastaRecord
@@ -26,7 +29,8 @@ object TestOrganismDataGenerator {
"chr2" to IntMath.pow(10, 6),
"chr3" to IntMath.pow(10, 6),
"chrX" to IntMath.pow(10, 6),
- "chrM" to IntMath.pow(10, 6))
+ "chrM" to IntMath.pow(10, 6)
+ )

@JvmStatic
fun main(args: Array<String>) {
@@ -58,6 +62,7 @@ object TestOrganismDataGenerator {
generateCytobands(genome)
generateRepeats(genome)
generateCGI(genome)
+ generateMapability(genome)
LOG.info("Done")
}

@@ -278,4 +283,28 @@ object TestOrganismDataGenerator {
""".trimMargin().trim())
}
}

+ /**
+ * Mapability is a wiggle track with 0 for non-mapable nucleotides and 1 for mapable ones.
+ *
+ * It's generated in the same folder as "chrom.sizes" with a filename "mapability.bigWig".
+ *
+ * We only generate mapability for chrX.
+ * This is done to test that the genome mean substitution for no-data chromosome works correctly.
+ */
+ private fun generateMapability(genome: Genome) {
+ LOG.info("Generating mapability bigWig")
+ val path = genome.chromSizesPath.parent / "mapability.bigWig"
+ val gq = genome.toQuery()
+ val chrX = gq["chrX"]!!
+ val random = ThreadLocalRandom.current()
+ val section = FixedStepSection(
+ chrX.name,
+ start = 0,
+ values = TFloatArrayList(
+ (0 until chrX.length).map { if (random.nextInt(5) == 4) 0.0f else 1.0f }.toFloatArray()
+ )
+ )
+ BigWigFile.write(listOf(section), gq.get().map { it.name to it.length }, path)
+ }
}
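
A minimal sketch of the two ideas in the `generateMapability` doc comment, without the bigWig machinery. `randomMapability` mirrors the value generation above (each position is 0 with probability 1/5). `meanMapability` is a hypothetical illustration of "genome mean substitution": the PR does not show that logic, so the length-weighted mean used here is an assumption, as are both function names.

```kotlin
import java.util.Random

// Mirrors the generator in the diff: each position is 0.0f with probability 1/5, else 1.0f.
fun randomMapability(length: Int, random: Random): FloatArray =
    FloatArray(length) { if (random.nextInt(5) == 4) 0.0f else 1.0f }

// Hypothetical illustration (assumption, not the PR's code): chromosomes without
// a track (null) receive the length-weighted mean over all chromosomes that have data.
fun meanMapability(tracks: Map<String, FloatArray?>): Map<String, Double> {
    var sum = 0.0
    var count = 0L
    for (track in tracks.values) {
        if (track == null) continue
        for (v in track) sum += v
        count += track.size
    }
    require(count > 0) { "at least one chromosome must have mapability data" }
    val genomeMean = sum / count
    return tracks.mapValues { (_, track) -> track?.average() ?: genomeMean }
}
```

With data only for chrX, as in the test organism, every other chromosome would get the chrX mean under this assumption.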
13 changes: 13 additions & 0 deletions src/main/kotlin/org/jetbrains/bio/genome/sequence/CpGContent.kt
@@ -1,6 +1,7 @@
package org.jetbrains.bio.genome.sequence

import com.google.common.annotations.VisibleForTesting
+ import org.jetbrains.bio.genome.Chromosome
import org.jetbrains.bio.genome.Location

/**
@@ -130,6 +131,18 @@ enum class CpGContent {
}
return cg
}

+ /**
+ * Slice the given chromosome into bins and return an array containing the mean GC content for each bin.
+ */
+ fun binnedMeanCG(chromosome: Chromosome, binSize: Int): DoubleArray {
+ val sequence = chromosome.sequence
+ return chromosome.range.slice(binSize).mapToDouble { bin ->
+ (bin.startOffset until bin.endOffset).count { pos ->
+ sequence.charAt(pos).let { it == 'c' || it == 'g' }
+ }.toDouble() / bin.length()
+ }.toArray()
+ }
}
}
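
The binning in `binnedMeanCG` can be tried out on a plain string. The stand-in below is a sketch, not the PR's implementation: it takes a `String` instead of a `Chromosome`, assumes lowercase nucleotides (matching the `'c'`/`'g'` comparison above), and assumes the last bin may be shorter than `binSize`.

```kotlin
// Simplified stand-in for binnedMeanCG: operates on a lowercase String rather
// than a Chromosome. Assumption: the final bin may be shorter than binSize.
fun binnedMeanCG(sequence: String, binSize: Int): DoubleArray {
    require(binSize > 0) { "bin size $binSize must be >0" }
    val bins = (sequence.length + binSize - 1) / binSize
    return DoubleArray(bins) { i ->
        val start = i * binSize
        val end = minOf(start + binSize, sequence.length)
        // Fraction of 'c' or 'g' characters within [start, end).
        (start until end).count { sequence[it] == 'c' || sequence[it] == 'g' }.toDouble() / (end - start)
    }
}
```

For example, `binnedMeanCG("ccggaatt", 4)` yields two bins with values 1.0 and 0.0.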

41 changes: 26 additions & 15 deletions src/main/kotlin/org/jetbrains/bio/statistics/ClassificationModel.kt
@@ -162,22 +162,29 @@ interface Fitter<out Model : ClassificationModel> {
* @param maxIter an upper bound on fitting iterations (if applicable).
* @return guessed classification model.
*/
- fun guess(preprocessed: Preprocessed<DataFrame>,
- title: String, threshold: Double, maxIter: Int, attempt: Int): Model
-
- fun guess(preprocessed: List<Preprocessed<DataFrame>>,
- title: String, threshold: Double, maxIter: Int, attempt: Int): Model =
- guess(preprocessed.first(), title, threshold, maxIter, attempt)
-
- fun fit(preprocessed: Preprocessed<DataFrame>,
+ fun guess(
+ preprocessed: Preprocessed<DataFrame>,
+ title: String, threshold: Double, maxIter: Int, attempt: Int
+ ): Model
+
+ fun guess(
+ preprocessed: List<Preprocessed<DataFrame>>,
+ title: String, threshold: Double, maxIter: Int, attempt: Int
+ ): Model = guess(preprocessed.first(), title, threshold, maxIter, attempt)
+
+ fun fit(
+ preprocessed: Preprocessed<DataFrame>,
  title: String = TITLE, threshold: Double = THRESHOLD,
  maxIter: Int = MAX_ITERATIONS,
- attempt: Int = 0): Model = fit(listOf(preprocessed), title, threshold, maxIter, attempt)
+ attempt: Int = 0
+ ): Model = fit(listOf(preprocessed), title, threshold, maxIter, attempt)

- fun fit(preprocessed: List<Preprocessed<DataFrame>>,
+ fun fit(
+ preprocessed: List<Preprocessed<DataFrame>>,
  title: String = TITLE, threshold: Double = THRESHOLD,
  maxIter: Int = MAX_ITERATIONS,
- attempt: Int = 0): Model {
+ attempt: Int = 0
+ ): Model {
require(threshold > 0) { "threshold $threshold must be >0" }
require(maxIter > 0) { "maximum number of iterations $maxIter must be >0" }

@@ -199,8 +206,10 @@ interface Fitter<out Model : ClassificationModel> {
require(multiStarts > 1) { "number of starts $multiStarts must be >1" }
}

- override fun fit(preprocessed: Preprocessed<DataFrame>, title: String,
- threshold: Double, maxIter: Int, attempt: Int): Model {
+ override fun fit(
+ preprocessed: Preprocessed<DataFrame>, title: String,
+ threshold: Double, maxIter: Int, attempt: Int
+ ): Model {
require(attempt == 0) {
"cyclic multistart is not allowed"
}
@@ -218,8 +227,10 @@ interface Fitter<out Model : ClassificationModel> {
return msModel
}

- override fun fit(preprocessed: List<Preprocessed<DataFrame>>, title: String,
- threshold: Double, maxIter: Int, attempt: Int): Model {
+ override fun fit(
+ preprocessed: List<Preprocessed<DataFrame>>, title: String,
+ threshold: Double, maxIter: Int, attempt: Int
+ ): Model {
require(attempt == 0) {
"cyclic multistart is not allowed"
}