Data assets gathered by DRC #12
In #19 I have code to download and unzip all of the listed DRC "dcc" and "file" assets (the "code" assets don't really have a concrete thing to download). Unzipped, here is the file breakdown:
Also, the "dcc" and "file" lists mostly overlap. Each has about 7k entries, and only ~600 appear in one list but not the other; the rest are exact matches. The file contents are things like node/edge lists, compounds, samples, genes, and all sorts of stuff. Could you give me some guidance on what high-level information I should compile from this that would be useful for program evaluation? I'm not sure what would be useful here. Also, the data is naturally too big to store in the repo here for provenance, so I have the ingest process download all this data to the gitignored `/temp` directory.
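For reference, the overlap check between the two lists can be sketched roughly like this. The `Asset` shape and the `link` field are assumptions for illustration, not the actual DRC schema:

```typescript
// Hypothetical sketch: compare the "dcc" and "file" asset lists to find
// entries shared between them vs. unique to one list. Matching is by link
// URL here; the real data may need a different key.
type Asset = { link: string };

function compareAssets(dcc: Asset[], file: Asset[]) {
  const dccLinks = new Set(dcc.map((a) => a.link));
  const fileLinks = new Set(file.map((a) => a.link));
  const shared = [...dccLinks].filter((l) => fileLinks.has(l));
  const dccOnly = [...dccLinks].filter((l) => !fileLinks.has(l));
  const fileOnly = [...fileLinks].filter((l) => !dccLinks.has(l));
  return { shared, dccOnly, fileOnly };
}
```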
Thanks, @vincerubinetti, for the details. I do not foresee a reason for downloading all the data. I need to investigate the contents of the DRC download link files to see what we can and should be tracking.
I have it all downloaded on my laptop, which took a long time, so if you want to take a look at anything in particular or collect some high-level info, let me know. Personally, looking through all of it, I don't see anything there that would necessarily be useful for the purposes of evaluating the effectiveness of certain projects. In fact, I didn't see anything to link a particular asset to a particular project. But maybe a few of them have it and I just missed it. Perhaps if we could associate assets to projects, things like "last updated" and "number of genes/nodes/edges" could be useful? Though even that feels like measuring programmer effectiveness via lines of code written (not accurate). Nonetheless, #20 adds the infrastructure to do things like downloading large/many data files at once, unzipping them, etc... which we may want to do later anyway.
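If we do go the "per-asset metrics" route, the kind of record we'd compile might look like the sketch below. Every field name here is a hypothetical placeholder, since the actual project linkage doesn't exist in the data yet:

```typescript
// Hypothetical sketch of a per-asset summary, assuming we could someday
// associate assets to projects. None of these fields exist in the DRC data
// as-is; they're illustrations of the metrics discussed above.
type AssetSummary = {
  project: string; // project the asset belongs to (currently unknown)
  lastUpdated: string; // ISO date string
  rowCounts: Record<string, number>; // e.g. { genes: 1200, edges: 5400 }
};

// count data rows in a simple TSV string, skipping the header line
function countRows(tsv: string): number {
  const lines = tsv.trim().split("\n");
  return Math.max(lines.length - 1, 0);
}
```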
The diff here looks bigger than it really is due to large raw data assets, moving files, and indentation changes. See discussion in #12.

- upgrade packages
- add `/temp` dir to gitignore for large files we need to process
- rename api type definitions from `.d.ts` to `.ts`, and move them to their own `/types` folder
- add ingest function to download drc resource lists, download all of the resources listed in them, and give a high-level summary
- clean up github ingest; split up more appropriately for caching and parallelization
- allow all query/queryMultis to log progress; change nature of multi: return single array of both successes and errors (for easier matching with original input), and provide util func to filter out errors
- add util func to download file directly (without playwright), with caching and progress logging
- add parsing/saving of tsv, txt, and gmt files
- add util func to unzip archive and return details of contained files, with caching
- improve logging util funcs for nicer and more informative progress display
- limit concurrency in queryMulti (e.g. prevent trying to download 300 files at once)
- add some string util funcs to reduce repetition
- update `run.sh` to support running arbitrary ts file (for testing)
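The concurrency limiting mentioned in the PR description (avoiding e.g. 300 simultaneous downloads) can be done with a small worker-pool helper. This is a generic sketch, not the repo's actual `queryMulti` implementation:

```typescript
// Hypothetical sketch: run an async function over many items while keeping
// at most `limit` calls in flight. A fixed set of workers pulls the next
// index from a shared counter; results keep their original order.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const worker = async () => {
    while (next < items.length) {
      const index = next++; // safe: JS is single-threaded between awaits
      results[index] = await fn(items[index]);
    }
  };
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker,
  );
  await Promise.all(workers);
  return results;
}
```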
This issue just captures the information shared by the DRC with us via email to [email protected]. These links were a response to my asking for API access to the information on https://info.cfde.cloud and https://data.cfde.cloud.
Currently, each file has at most a couple hundred rows. I think these correspond to links in this Data Matrix View.
current dcc assets format
current code assets format
current file assets format