Data assets gathered by DRC #12

Open
seandavi opened this issue Jul 1, 2024 · 3 comments

@seandavi
Member

seandavi commented Jul 1, 2024

This issue just captures the information shared by the DRC with us via email to [email protected]. These links were a response to my asking for API access to the information on https://info.cfde.cloud and https://data.cfde.cloud.

Currently, each file has at most a couple hundred rows. I think these rows correspond to the links in this Data Matrix View.

current dcc assets format

link	lastmodified	current	creator	dcc_id	drcapproved	dccapproved	deleted	created
https://github.com/nih-cfde/LINCS-metadata/blob/main/scripts/process_mcf0a_c2m2.py	2024-02-20 20:16:23.088	False	[email protected]	f3f490cf-fd69-579c-8ea3-472c7cf3fb59	False	False	True	2024-02-20 20:16:23.088
https://github.com/nih-cfde/LINCS-metadata/blob/main/scripts/process_mf10a_c2m2.py	2024-02-20 20:19:52.739	False	[email protected]	f3f490cf-fd69-579c-8ea3-472c7cf3fb59	False	False	True	2024-02-20 20:19:52.739
https://cfde-drc.s3.amazonaws.com/SPARC/KG Assertions/2024-05-08/SPARC.zip	2024-05-08 15:47:03.429	True	[email protected]	2399794e-74c6-5735-a039-0782cdeeb1e2	True	True	False	2024-05-08 15:47:03.429
https://zinc15.docking.org/substances/search/?q={drug.label}	2024-03-08 00:15:43.332	True	[email protected]	a1289ebb-0306-59a1-b0fc-e4d03a4790d7	True	True	False	2024-03-08 00:15:43.332
https://github.com/nih-cfde/LINCS-metadata/blob/ain/scripts/process_mcf10a_c2m2.py	2024-02-21 22:08:41.275	False	[email protected]	f3f490cf-fd69-579c-8ea3-472c7cf3fb59	False	False	True	2024-02-21 22:08:41.275
https://www.gtexportal.org/home/gene/{gene.ensembl}	2024-03-08 00:18:20.759	True	[email protected]	b3028db2-209c-5862-8f4d-33c5b312332e	True	False	False	2024-03-08 00:18:20.759
https://cfde-drc.s3.amazonaws.com/Bridge2AI/KG Assertions/2024-03-01/tkg 2.0.zip	2024-03-01 05:45:05.879	True	[email protected]	75b3be39-a021-5d80-b7e2-2a7938a1e11a	False	True	False	2024-03-01 05:45:05.879
https://drugcentral.org/?q={drug.label}	2024-03-08 00:15:16.867	True	[email protected]	a1289ebb-0306-59a1-b0fc-e4d03a4790d7	True	True	False	2024-03-08 00:15:16.867
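For anyone ingesting these lists, here is a minimal sketch of reading such a tab-separated file and keeping only rows that are current and not deleted. The function names `parseTsv` and `liveAssets` are hypothetical, and the sample uses only three of the columns shown above for brevity; it assumes booleans arrive as the literal strings `True`/`False`, as they do in the listing.

```typescript
// Parse tab-separated text with a header row into plain string records.
// Hypothetical helper, not taken from the repository.
function parseTsv(text: string): Record<string, string>[] {
  const [headerLine, ...rows] = text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "");
  if (headerLine === undefined) return [];
  const headers = headerLine.split("\t");
  return rows.map((row) => {
    const cells = row.split("\t");
    const record: Record<string, string> = {};
    // Missing trailing cells (as in some "code assets" rows) become "".
    headers.forEach((header, i) => {
      record[header] = cells[i] ?? "";
    });
    return record;
  });
}

// Keep rows flagged as current and not deleted.
function liveAssets(
  records: Record<string, string>[],
): Record<string, string>[] {
  return records.filter(
    (r) => r["current"] === "True" && r["deleted"] === "False",
  );
}

// Sample built from two rows of the listing, reduced to three columns.
const sample = [
  ["link", "current", "deleted"].join("\t"),
  [
    "https://github.com/nih-cfde/LINCS-metadata/blob/main/scripts/process_mcf0a_c2m2.py",
    "False",
    "True",
  ].join("\t"),
  ["https://drugcentral.org/?q={drug.label}", "True", "False"].join("\t"),
].join("\n");

const live = liveAssets(parseTsv(sample));
console.log(live.map((r) => r["link"]));
```

Whether `drcapproved`/`dccapproved` should also be part of the filter is an open question that this sketch does not try to answer.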

current code assets format

type	name	link	description	openAPISpec	smartAPISpec	smartAPIURL	entityPageExample
Apps URL	SenNet Data Portal	https://data.sennetconsortium.org/		False	False
API	HuBMAP Entity API	https://entity.api.hubmapconsortium.org/	The HuBMAP Entity API is a standard RESTful web service with create, update and read operations for the standard HuBMAP provenance graph entities.	False	True	https://smart-api.info/ui/0065e419668f3336a40d1f5ab89c6ba3
API	SenNet Search API	https://search.api.sennetconsortium.org	The SenNet Search API is a thin wrapper of the Elasticsearch API. It handles data indexing and reindexing into the backend Elasticsearch. It also accepts the search query and passes through to the Elasticsearch with data access security check.	False	True	https://smart-api.info/ui/10ed9b5eb8ff960d4431befc591ed842
API	SenNet Entity API	https://entity.api.sennetconsortium.org	The SenNet Entity API is a standard RESTful web service with create, update and read operations for the standard SenNet provenance graph entities.	False	True	https://smart-api.info/ui/7d838c9dee0caa2f8fe57173282c5812
Apps URL	SenNet Exploration User Interface	https://data.sennetconsortium.org/ccf-eui		False	False
Apps URL	LINCS Tools Marketplace	https://lincsproject.org/LINCS/tools	The LINCS Tools Marketplace page serves a listing of applications produced  using LINCS datasets by the LINCS consortium. 	False	False
Apps URL	SPARC Tools and Resources	https://sparc.science/tools-and-resources/tools	SPARC Portal page listing SPARC associated tools and resources	False	False
API	exRNA Atlas JSON-LD	https://brl-bcm.stoplight.io/docs/exrna-atlas-json-api/ZG9jOjQ1Mg-overview		True	False
API	CFDE GeneReg Linked Data Hub	https://genboree.org/cfde-gene-dev/ui/api-docs		False	False

current file assets format

filetype	filename	link	size	sha256checksum
KG Assertions	dictionary_NAR_databases_deduplicate.csv	https://cfde-drc.s3.amazonaws.com/Bridge2AI/KG Assertions/2024-03-01/dictionary_NAR_databases_deduplicate.csv	332431	idxebqqDNkRDqgG4IDziJPyzOG83XfQLznIPMUGypq8=
KG Assertions	Papers.csv	https://cfde-drc.s3.amazonaws.com/Bridge2AI/KG Assertions/2024-03-01/Papers.csv	305136	TFRyCZSHx9Y3RHZd0SXOa7T4+hEYTPlXupVFUPDZxis=
KG Assertions	tkg 2.0.zip	https://cfde-drc.s3.amazonaws.com/Bridge2AI/KG Assertions/2024-03-01/tkg 2.0.zip	303893995	6ZpPhxpW95YWPREcYff5qteHdGaRitMkHeuQbGDerF8=
XMT	MoTrPAC_Endurance_Trained_Rats_2023.gmt	https://cfde-drc.s3.amazonaws.com/MoTrPAC/XMT/2024-03-05/MoTrPAC_Endurance_Trained_Rats_2023.gmt	187733	GHbcupKlcgz78vtTbdkSaY3RbiPNWzLNPMD104Yhugc=
XMT	LINCS_XMT_2022-12-13_LINCS_L1000_Chemical_Pertubation_Consensus_Signatures.gmt	https://cfde-drc.s3.amazonaws.com/LINCS/XMT/2024-04-11/LINCS_XMT_2022-12-13_LINCS_L1000_Chemical_Pertubation_Consensus_Signatures.gmt	16319270	BxZ89Ja3/S7lTaA6yCv/Xhgwoh8EIy25JnF1Sk7VphY=
XMT	testfile.gmt	https://cfde-drc.s3.amazonaws.com/LINCS/XMT/2024-04-23/testfile.gmt	80	MWcavCS4IMIBHGVvlq3AVOy9Qbgd6RgfDAh3CgjCdqY=
C2M2	2024-01-04T11_45_25.136063-a7bb912c-ab20-11ee-9ed4-02402ff490c1.zip	https://cfde-drc.s3.amazonaws.com/HuBMAP/C2M2/2024-04-26/2024-01-04T11_45_25.136063-a7bb912c-ab20-11ee-9ed4-02402ff490c1.zip	1076294222	pmm476mNJOP4SB/kI83WaWNhb5QpBjXv3w7DFvEkuhY=
C2M2	hubmap-test-submission.zip	https://cfde-drc.s3.amazonaws.com/HuBMAP/C2M2/2024-04-26/hubmap-test-submission.zip	1076294222	pmm476mNJOP4SB/kI83WaWNhb5QpBjXv3w7DFvEkuhY=
KG Assertions	GTEx.zip	https://cfde-drc.s3.amazonaws.com/GTEx/KG Assertions/2024-04-29/GTEx.zip	149456464	icIIQG/ikXF55yVGBU8fzfObHFBYbdRo3Vguy0CiDko=
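
The `sha256checksum` values above are 44 characters long and end in `=`, which is consistent with a base64-encoded SHA-256 digest. That encoding is an assumption, not something the DRC has confirmed here. Under that assumption, a downloaded file could be checked roughly like this (helper names are hypothetical):

```typescript
import { createHash } from "node:crypto";

// Compute a SHA-256 digest and encode it as base64, which is the
// assumed encoding of the `sha256checksum` column.
function sha256Base64(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("base64");
}

// Compare downloaded bytes against the checksum from the listing.
function matchesChecksum(data: Uint8Array | string, expected: string): boolean {
  return sha256Base64(data) === expected;
}

// Usage with in-memory bytes; a real ingest would read the downloaded file.
const bytes = new TextEncoder().encode("example contents");
const checksum = sha256Base64(bytes);
console.log(checksum.length, matchesChecksum(bytes, checksum));
```

Note that the two HuBMAP C2M2 rows share both size and checksum, so under this reading they would be the same file under two names.
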
@vincerubinetti
Collaborator

In #19 I have code to download and unzip all of the listed DRC "dcc" and "file" assets (the "code" assets don't really have a concrete thing to download). Unzipped, here is the file breakdown:

  (2) .1 files
  (2) .2 files
  (2) .3 files
  (721) . files
  (350) .csv files
  (19) .zip files
  (8) .download files
  (1,422) .txt files
  (264) .json files
  (11,206) .tsv files
  (47) .gmt files
  (19) .tsv~ files
  (4) .pdf files
  (2) .docx files
  (2) .swp files
  (8) .sh files
  (2) .obo files
  (2) .gz files
  (6) .DS_Store files
  (2) .3-cfde-submission_74d840d264792e9d7ab38fcb0deeba0eddbbebdb files
  (166) .numbers files
  (2) .json_orig files
  (2) .tsv_error files
  (2) .tsv_orig files
  (12) .orig files
  (2) .uri files
  (8) .new files
  14,284 files
  124.4 GB
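
A breakdown like the one above can be produced by grouping paths on the text after the last dot in the file name. This is a sketch of that idea with hypothetical helper names, not the code from #19; here files without a dot are grouped under an empty string.

```typescript
// Return the extension (including the dot) of the last path segment,
// or "" when the name contains no dot.
function extensionOf(path: string): string {
  const base = path.split("/").pop() ?? path;
  const dot = base.lastIndexOf(".");
  return dot === -1 ? "" : base.slice(dot);
}

// Count how many paths share each extension.
function countByExtension(paths: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const path of paths) {
    const ext = extensionOf(path);
    counts.set(ext, (counts.get(ext) ?? 0) + 1);
  }
  return counts;
}

// Example paths are made up for illustration.
const counts = countByExtension([
  "unzipped/file.tsv",
  "unzipped/biosample.tsv",
  "unzipped/signatures.gmt",
  "unzipped/README",
  "unzipped/.DS_Store",
]);
counts.forEach((count, ext) => console.log(`(${count}) ${ext} files`));
```

Splitting on the last dot explains entries such as `.tsv~`, `.DS_Store`, or the long `.3-cfde-submission_...` suffix, which are not real file types.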

Also, the "dcc" and "file" lists mostly overlap. There are about 7k entries in each, and only ~600 appear in one but not the other. The rest are exact matches.
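
The overlap can be computed by comparing the `link` values of the two lists as sets. A minimal sketch, assuming exact string equality is the right notion of "match" (the function name is hypothetical):

```typescript
interface LinkComparison {
  both: string[];
  onlyFirst: string[];
  onlySecond: string[];
}

// Compare two lists of links as sets, ignoring duplicates within a list.
function compareLinks(first: string[], second: string[]): LinkComparison {
  const a = new Set(first);
  const b = new Set(second);
  return {
    both: Array.from(a).filter((link) => b.has(link)),
    onlyFirst: Array.from(a).filter((link) => !b.has(link)),
    onlySecond: Array.from(b).filter((link) => !a.has(link)),
  };
}

// Tiny made-up example; the real inputs would be the "dcc" and "file" link columns.
const comparison = compareLinks(
  ["https://example.org/a.zip", "https://example.org/b.zip"],
  ["https://example.org/b.zip", "https://example.org/c.zip"],
);
console.log(comparison);
```
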

The file contents are things like node/edge lists, compounds, samples, genes, and all sorts of other data. Could you give me some guidance on what high-level information I should compile from this that would be useful for program evaluation? I'm not sure what would be useful here.

Also, the data is naturally too big to store in the repo here for provenance. I have the ingest process download all this data to /raw/temp, which is excluded with gitignore.

@seandavi
Member Author

seandavi commented Aug 6, 2024

Thanks, @vincerubinetti, for the details. I do not foresee a reason for downloading all the data. I need to investigate the contents of the DRC download link files to see what we can and should be tracking.

@vincerubinetti
Collaborator

I have it all downloaded on my laptop, which took a long time, so if you want to take a look at anything in particular or collect some high-level info, let me know.

Personally, looking through all of it, I don't see anything there that would necessarily be useful for the purposes of evaluating the effectiveness of certain projects. In fact, I didn't see anything to link a particular asset to a particular project. But maybe a few of them have it and I just missed it.

Perhaps if we could associate assets to projects, things like "last updated" and "number of genes/nodes/edges" could be useful? Though even that feels like measuring programmer effectiveness via lines of code written (not accurate).

Nonetheless, #20 adds the infrastructure for downloading many large data files at once, unzipping them, and so on, which we may want to do later anyway.

vincerubinetti added a commit that referenced this issue Aug 17, 2024
The diff here looks bigger than it really is due to large raw data
assets, moving files, and indentation changes.

See discussion in #12

- upgrade packages
- add `/temp` dir to gitignore for large files we need to process
- rename api type definitions from `.d.ts` to `.ts`, and move them to
their own `/types` folder
- add ingest function to download drc resource lists, download all of
the resources listed in them, and give a high level summary
- clean up github ingest. split up more appropriately for caching and
parallelization.
- allow all query/queryMultis to log progress. change nature of multi;
return single array of both successes and errors (for easier matching
with original input), and provide util func to filter out errors.
- add util func to download file directly (without playwright), with
caching and progress logging
- add parsing/saving of tsv, txt, and gmt files
- add util func to unzip archive and return details of contained files,
with caching
- improve logging util funcs for nicer and more informative progress
display
- limit concurrency in queryMulti (e.g. prevent trying to download 300
files at once)
- add some string util funcs to reduce repetition
- update `run.sh` to support running arbitrary ts file (for testing)
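
Two of the points above, limiting concurrency and returning successes and errors in one array aligned with the input, can be sketched as follows. This is an illustration under assumed names (`mapWithLimit`, `Settled`, `successes`), not the repository's `queryMulti` implementation.

```typescript
// One result per input item, in input order.
type Settled<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Run `task` over `items` with at most `limit` tasks in flight at once.
async function mapWithLimit<I, O>(
  items: I[],
  limit: number,
  task: (item: I) => Promise<O>,
): Promise<Settled<O>[]> {
  const results: Settled<O>[] = new Array(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await task(items[index] as I) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Utility to drop the errors and keep only successful values.
function successes<T>(results: Settled<T>[]): T[] {
  const values: T[] = [];
  for (const result of results) {
    if (result.ok) values.push(result.value);
  }
  return values;
}
```

A caller would pass something like a download function as `task`; because results keep the input order, a failed entry can be matched back to the link that caused it.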