
Extract geographic coverage via NLP #20

Open

Ly0n opened this issue Sep 1, 2022 · 5 comments
Comments

@Ly0n
Member

Ly0n commented Sep 1, 2022

Identifying the geographical coverage of the models and data behind the projects would make it possible to detect areas without coverage. It could also help people find projects for a specific geographical area they are interested in.
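
As a rough sketch of what this could look like (assuming spaCy and its small English model are used; `readme_text` is just a placeholder), one could pull place names out of a project README like this:

```python
# pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_places(readme_text: str) -> Counter:
    """Count geopolitical entities (countries, cities, states) mentioned in a README."""
    doc = nlp(readme_text)
    return Counter(ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC"))

# Placeholder README snippet, not a real project description.
readme_text = "This model estimates rooftop solar potential for Germany and parts of France."
print(extract_places(readme_text))  # e.g. Counter({'Germany': 1, 'France': 1})
```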

@KKulma

KKulma commented Sep 1, 2022

Agree, this would be valuable information. You mentioned NLP, do you have an idea where to get this info from? Any reliable/consistent source?

@Ly0n
Member Author

Ly0n commented Sep 2, 2022

I have never worked with NLP but did some investigation in the past.
This framework could be useful for many applications to extract more data from the projects:
https://github.com/RaRe-Technologies/gensim

In the website repo we also have an issue that discusses this problem:
protontypes/open-sustainable-technology#110

In my view, a first step to get started with NLP would be to create the missing topic labels for the projects. For this, one could use the READMEs of the already listed projects and their topics as training data. For about 50% of the projects the topics are missing, and they could be added to the database in this way.

This would be a clear improvement to the database, would enable much better searches, and would also be very interesting for the analysis.
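
A minimal sketch of that supervised step, assuming a hypothetical `projects.csv` export with `readme` and `topics` columns and using scikit-learn (not part of the existing pipeline):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv("projects.csv")          # hypothetical export of the project database
labeled = df[df["topics"].notna()]        # the ~50% of projects that already have topics

# Multi-label setup: each project can carry several topics.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labeled["topics"].str.split(","))

vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X = vectorizer.fit_transform(labeled["readme"].fillna(""))

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Suggest topics for the projects that are missing labels.
unlabeled = df[df["topics"].isna()]
pred = clf.predict(vectorizer.transform(unlabeled["readme"].fillna("")))
suggestions = mlb.inverse_transform(pred)
```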

@KKulma

KKulma commented Sep 2, 2022

I think there are several approaches we can consider here. {gensim} uses a pretty simple bag-of-words approach for topic modelling (unsupervised ML) and this method can be effective but very sensitive to corpora content and text-cleaning preprocessing steps, as well as our wild guess of how many topics there may be in the first place. Alternatively, we can see if there's a systematic way we could scrape this information directly from the project's GitHub repo's website and/or (big one!) train a simple supervised algorithm to classify the repo based on the content of README. LOTS OF FUN 💯
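
For reference, the {gensim} bag-of-words route would look roughly like the sketch below; `readmes` is a placeholder list of README strings, and `num_topics` is exactly the wild guess mentioned above:

```python
from gensim import corpora, models
from gensim.utils import simple_preprocess

readmes = ["...README text of project A...", "...README text of project B..."]

# Bag-of-words preprocessing: tokenise, lowercase, drop very short tokens.
texts = [simple_preprocess(doc) for doc in readmes]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# num_topics has to be guessed up front and heavily influences the result.
lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```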

@Ly0n
Member Author

Ly0n commented Sep 3, 2022

> I think there are several approaches we can consider here. {gensim} uses a pretty simple bag-of-words approach for topic modelling (unsupervised ML) and this method can be effective but very sensitive to corpora content and text-cleaning preprocessing steps, as well as our wild guess of how many topics there may be in the first place. Alternatively, we can see if there's a systematic way we could scrape this information directly from the project's GitHub repo's website and/or (big one!) train a simple supervised algorithm to classify the repo based on the content of README. LOTS OF FUN 💯

That should be feasible. I have never worked with such frameworks, only classical CNNs for image processing so far.
The simplest information we could extract from the READMEs is the linked DOI URLs. This data could be important for classification and labeling. After randomly selecting a few projects from the list, almost all of them have DOI URLs to related papers. Adding this to the existing data mining script should not be a problem, but it could increase the runtime by increasing the number of API calls needed per project.
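
A quick sketch of that DOI extraction (a commonly used DOI pattern; not the actual mining script):

```python
import re

# Matches modern "10.xxxx/suffix" style DOIs anywhere in free text.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+\b", re.IGNORECASE)

def extract_dois(readme_text: str) -> set[str]:
    """Return the set of DOIs found in a README, with trailing punctuation stripped."""
    return {match.rstrip(".,);") for match in DOI_PATTERN.findall(readme_text)}

# Made-up example DOI, for illustration only.
example = "See our paper: https://doi.org/10.1234/abcd.5678."
print(extract_dois(example))  # {'10.1234/abcd.5678'}
```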

@Ly0n
Member Author

Ly0n commented Sep 4, 2022

Had some success last night with the DOI extraction. More details in the separate issue protontypes/open-sustainable-technology#172.

The new list is being compiled into a CSV file at the moment. It looks like we are getting DOI links for about a quarter of the projects, but we are still missing some.

Let's see if there are open source tools that give us more contextual information based on the DOIs.
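
One candidate, sketched below, is the public Crossref REST API, which returns bibliographic metadata for a DOI; whether it gives enough geographic or topical context would still need to be checked:

```python
import requests

def crossref_metadata(doi: str) -> dict:
    """Fetch the Crossref record for a DOI and return a few fields useful for labelling."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    msg = resp.json()["message"]
    first = lambda xs: xs[0] if xs else None
    return {
        "title": first(msg.get("title", [])),
        "journal": first(msg.get("container-title", [])),
        "subjects": msg.get("subject", []),   # coarse subject tags, when present
        "abstract": msg.get("abstract"),      # JATS XML string, often missing
    }

# print(crossref_metadata("10.1234/abcd.5678"))  # replace with a real DOI from the list
```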
