
Update for CEM1.5M #5

Open · wants to merge 17 commits into base: v2
25 changes: 23 additions & 2 deletions README.md
@@ -1,9 +1,30 @@
# CEM Dataset

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/cem500k-a-large-scale-heterogeneous-unlabeled/electron-microscopy-image-segmentation-on-1)](https://paperswithcode.com/sota/electron-microscopy-image-segmentation-on-1?p=cem500k-a-large-scale-heterogeneous-unlabeled)


Code for the paper: [CEM500K, a large-scale heterogeneous unlabeled cellular electron microscopy image dataset for deep learning](https://elifesciences.org/articles/65894)

## About the Dataset

<figure>
<img align="left" src="./images/cem500k.jpg" width="250" height="250"></img>
</figure>

Typical EM datasets are created and shared to further biological research. Often that means the sample size is n=1 (one instrument, one sample preparation protocol, one organism, one tissue, one cell line, etc.), and such datasets usually run from hundreds of gigabytes to terabytes in size. For deep learning, a neural network trained on 100 images from 100 different EM experiments will generalize far better than one trained on 100 images from a single experiment. CEM500K is an attempt to build a better dataset for deep learning by collecting and curating data from as many different EM experiments as possible. In total, we put together data from 102 unrelated EM experiments. Here's a breakdown of the biological details:

<figure>
<img src="./images/description.png"></img>
</figure>

## About Pre-trained Weights

Using CEM500K for unsupervised pre-training, we demonstrated a significant improvement in the performance of a 2D U-Net on a number of 2D and 3D EM segmentation tasks. Pre-trained models not only achieved better IoU scores than random initialization, but also outperformed state-of-the-art results on all benchmarks for which comparison was possible. Even better, CEM500K pre-training enabled models to converge much more quickly (some models took only 45 seconds to train!). See ```evaluation``` for a quick and easy way to use the pre-trained weights.

<figure>
<img src="./images/benchmarks.png"></img>
<figcaption>Right: Example benchmark datasets. Left: IoU score improvements over random init. using CEM500K pre-trained weights (bottom row). See paper for more details.</figcaption>
</figure>
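The pre-trained weights are ordinary PyTorch state dicts. When loading them into your own model, checkpoint keys sometimes carry a wrapper prefix (e.g. from `DataParallel` or a contrastive pre-training framework) that must be stripped first. A minimal sketch, assuming a `module.` prefix — the prefix name and file layout are assumptions, not taken from this repo:

```python
def strip_prefix(state_dict, prefix="module."):
    """Remove a wrapper prefix from checkpoint keys so the weights
    load into a plain model. The prefix is an assumption; inspect
    your checkpoint's keys before using this."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state_dict.items()}
```

After stripping, the cleaned dict can be passed to `model.load_state_dict(...)`; the ```evaluation``` directory remains the authoritative way to use the weights.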

## Getting Started

@@ -66,4 +87,4 @@ Please cite this work.
volume = {10},
year = {2021}
}
```
8 changes: 7 additions & 1 deletion dataset/3d/reconstruct3d.py
@@ -17,6 +17,12 @@
their large size. Sparsely sampled ROIs from such NGFF datasets can be downloaded and saved
in one of the supported formats using the ../scraping/ngff_download.py script.

NOTE: This script centers each 2D image in its reconstructed 3D image without
regard to any of the other 2D images being reconstructed. It is therefore possible
for the reconstructed volumes/flipbooks to overlap significantly with each other.

TODO: Prevent significantly overlapping reconstructions.
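One way to implement that TODO would be a pairwise intersection test between the axis-aligned bounding boxes of candidate reconstructions, discarding any box that overlaps an already-accepted one. A sketch — the box format is hypothetical, not the script's actual data structure:

```python
def boxes_overlap(a, b):
    """True if two axis-aligned 3D boxes intersect.
    Boxes are (zmin, zmax, ymin, ymax, xmin, xmax) with half-open intervals."""
    return all(a[2 * i] < b[2 * i + 1] and b[2 * i] < a[2 * i + 1]
               for i in range(3))
```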

Example usage:
--------------

@@ -204,4 +210,4 @@ def create_subvols(vp):
subvol, check_contrast=False)

with Pool(processes) as pool:
output = pool.map(create_subvols, volume_fpaths)
12 changes: 6 additions & 6 deletions dataset/README.md
@@ -24,7 +24,7 @@ python preprocess/cleanup2d.py {dir_of_2d_image_groups} --processes 4
Second, crop each image into fixed size patches (typically 224x224):

```bash
python patchify2d.py {dir_of_2d_image_groups} {dedupe_dir} -cs 224 --processes 4
```

The ```patchify2d.py``` script will save a ```.pkl``` file with the name of each 2D image subdirectory. Pickle files contain a dictionary of patches from all images in the subdirectory along with corresponding filenames. These files are ready for filtering (see below).
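The non-overlapping grid of `-cs`-sized patches can be sketched as follows. This is an illustration only — the function name and edge handling are assumptions, and the real script also deals with remainder strips at image borders:

```python
def patch_grid(height, width, cs=224):
    """Top-left (y, x) corners of non-overlapping cs x cs patches
    that fit entirely inside an image of the given size."""
    return [(y, x)
            for y in range(0, height - cs + 1, cs)
            for x in range(0, width - cs + 1, cs)]
```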
@@ -53,7 +53,7 @@ different from xy resolution, then cross-sections will only be cut from the xy p
the script (see usage example below).

```bash
python patchify3d.py {dir_of_3d_datasets} {dedupe_dir} -cs 224 --axes 0 1 2 --processes 4
```
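The isotropy check that decides which planes to section might look like the sketch below; the function and the 10% tolerance are assumptions (the script itself exposes the choice through `--axes`):

```python
def slicing_axes(z_res_nm, xy_res_nm, tol=1.1):
    """Axes to cut cross-sections from: all three planes when voxels are
    (near-)isotropic, only the xy plane (axis 0) otherwise."""
    return [0, 1, 2] if z_res_nm <= tol * xy_res_nm else [0]
```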

The ```patchify3d.py``` script will save a ```.pkl``` file with the name of each volume file. Pickle files contain a
@@ -69,7 +69,7 @@ trained, if needed, using the ```train_patch_classifier.py``` script.
Filtering will be fastest with a GPU installed, but it's not required.

```bash
python classify_patches.py {dedupe_dir} {filtered_patch_dir}
```
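Conceptually, filtering keeps the patches the classifier scores as informative and drops the rest. A minimal sketch, where the threshold and the score convention (higher = more informative) are assumptions:

```python
def filter_patches(patches, scores, threshold=0.5):
    """Keep patches whose classifier score marks them as informative."""
    return [p for p, s in zip(patches, scores) if s > threshold]
```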

After running filtering, the ```save_dir``` will have one subdirectory for each of the ```.pkl``` files that were
@@ -86,8 +86,8 @@ For example, to create short flipbooks of 5 consecutive images from a directory

```bash
python reconstruct3d.py {filtered_patch_dir} \
-vd {dir_of_3d_datasets1} {dir_of_3d_datasets2} {dir_of_3d_datasets3} \
-sd {savedir} -nz 5 -p 4
```
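Each flipbook gathers ```-nz``` consecutive slices centered on the slice the source patch came from. The selection might be sketched like this, with the clipping-at-volume-edges behavior being an assumption:

```python
def flipbook_indices(z, depth, nz=5):
    """Indices of nz consecutive slices centered on slice z, shifted as
    needed to stay inside a volume with the given depth."""
    half = nz // 2
    start = min(max(z - half, 0), depth - nz)
    return list(range(start, start + nz))
```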

See the script header for more details.
@@ -107,4 +107,4 @@ python ngff_download.py ngff_datasets.csv {save_path} -gb 5
```

Similarly, large datasets that are not stored in NGFF but are over some size threshold (we've used 5 GB in our work)
can be cropped into smaller ROIs with the ```crop_rois_from_volume.py``` script.
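The size check behind that threshold follows directly from the volume shape and voxel dtype; a sketch, assuming uint8 voxels (one byte each):

```python
def needs_cropping(shape, bytes_per_voxel=1, threshold_gb=5.0):
    """True when a volume of the given (z, y, x) shape exceeds the size
    threshold and should be cropped into smaller ROIs before patching."""
    z, y, x = shape
    return z * y * x * bytes_per_voxel > threshold_gb * 1024**3
```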