loader: digest "all" possible date formats #169

Open
fschwenn opened this issue Sep 1, 2017 · 2 comments

@fschwenn
Contributor

fschwenn commented Sep 1, 2017

The loader should include a normalization routine that handles dates in different formats.

Expected Behavior

Such a normalization routine would be called for each date field in the record to ensure that the data fit the schema, for example: "2017 Sep 1" -> "2017-09-01", "2017-Sep-1" -> "2017-09-01", "2017 Sep-Oct" -> "2017", "01.09.2017" -> "2017-09-01".
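The four mappings above can be sketched with the standard library alone. This is only an illustration of the requested behavior, not existing hepcrawl code: the function name `normalize_loose_date`, the format list, and the choice to read dotted dates as day-first are all assumptions, and `%b` only recognizes English month abbreviations under an English or C locale.

```python
import re
from datetime import datetime

# Hypothetical format list; the first format that parses wins.
# "%d.%m.%Y" encodes the assumption that dotted dates are day-first.
_FULL_DATE_FORMATS = ("%Y %b %d", "%Y-%b-%d", "%d.%m.%Y", "%Y-%m-%d")

# A year followed by a month range such as "2017 Sep-Oct".
_MONTH_RANGE = re.compile(r"^(\d{4})[ -][A-Za-z]{3,9}\s*-\s*[A-Za-z]{3,9}$")


def normalize_loose_date(raw):
    """Return an ISO 8601 date (possibly year-only) for a loosely formatted date."""
    text = raw.strip()

    # A month range cannot be expressed as a single day, so keep only the year.
    range_match = _MONTH_RANGE.match(text)
    if range_match:
        return range_match.group(1)

    for fmt in _FULL_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format

    raise ValueError("Unrecognized date format: %r" % raw)


for example in ("2017 Sep 1", "2017-Sep-1", "2017 Sep-Oct", "01.09.2017"):
    print(example, "->", normalize_loose_date(example))
```

A fixed format list keeps the behavior predictable, at the cost of having to extend it for every new publisher convention.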

Current Behavior

I have to admit I do not know to what extent this is already implemented in hepcrawl. In harvesting-kit, each publisher program has its own normalization code. At DESY we have a hand-written function that tries to catch most of the cases.

Context

We will have to write a lot of spiders. It would save time if we could just map the date fields without thinking about the format.

@michamos
Contributor

michamos commented Oct 12, 2017

There is now a date util, in particular normalize_date, that can be used to normalize any (possibly incomplete) date:

In [1]: from inspire_utils.date import normalize_date

In [2]: normalize_date("2017 Sep 1")
Out[2]: '2017-09-01'

In [3]: normalize_date("2017-Sep-1")
Out[3]: '2017-09-01'

In [4]: normalize_date("2017 Sep-Oct")
[...]
ValueError: Unknown string format

In [5]: normalize_date("01.09.2017")
Out[5]: '2017-01-09'

Date ranges are not supported yet. Are they a common occurrence? If so, we need to extend the utils to understand them. Also, the last case is interpreted month-first ("2017-01-09") rather than day-first ("2017-09-01") as you expected, but the input is ambiguous, so we would need to make a choice here. Do you think your interpretation is more common?
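To make the ambiguity concrete, here is a small standard-library sketch that lists every valid reading of a dotted numeric date. The helper name `dotted_date_readings` is hypothetical and is not part of inspire_utils; it only illustrates why a convention has to be chosen.

```python
from datetime import datetime


def dotted_date_readings(raw):
    """Return every valid ISO reading of a dotted numeric date such as '01.09.2017'."""
    readings = {}
    for label, fmt in (("day_first", "%d.%m.%Y"), ("month_first", "%m.%d.%Y")):
        try:
            readings[label] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            pass  # this reading is impossible, e.g. it would need month 13
    return readings


print(dotted_date_readings("01.09.2017"))
print(dotted_date_readings("13.09.2017"))
```

The first input has two valid readings (1 September and 9 January), while the second has only the day-first one, so the input alone cannot settle the question whenever both numbers are 12 or less.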

@michamos
Contributor

@fschwenn did you see my question about date ranges?
