Allow user to specify non-14-digit datetimes in URI #301

Closed
9 of 10 tasks
machawk1 opened this issue Dec 1, 2017 · 16 comments

@machawk1
Member

machawk1 commented Dec 1, 2017

e.g., a GET on /2016/http://example.com would return only the mementos whose datetimes match the interpreted date (here, all mementos from 2016). Currently, a non-14-digit date simply returns an interface listing all mementos for the URI-R, along with an abbreviated TimeMap in the Link header.

  • Perform sanity check on 14-digit dates (e.g., prevent invalid 20172004101112)
  • Resolve partial 14-digit dates
    • Just year (4-digit), e.g., 2014
    • Year and month, e.g., 201409 (September 2014)
    • Year, month, and day, e.g., 20140910 (September 10, 2014)
    • Year, month, day, and hour, e.g., 2014091002 (September 10, 2014 2 am)
    • Year, month, day, hour, and minute, e.g., 201409100215 (September 10, 2014 2:15 am)
    • Year, month, day, hour, minute, and second, e.g., 20140910021539 (September 10, 2014 2:15 am and 39 seconds)
  • Allow datetimes beyond 14-digits (refer to WARC/1.1 spec, which ISO date?) (Support Datetime other than those specified as 14-digits #283)
  • Prevent non-14-digit dates (e.g., 2014) from resolving to 0-padded invalid dates (20140000000000)
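
For the sanity-check task, one option is to let the standard library reject impossible dates. A minimal sketch in Python, assuming a hypothetical helper named `is_valid_datetime14` (not an existing ipwb function):

```python
import re
from datetime import datetime


def is_valid_datetime14(dt14):
    """Return True if dt14 is a real calendar datetime in YYYYMMDDhhmmss form.

    Hypothetical helper for illustration; not part of ipwb.
    """
    # Check the shape first: exactly 14 ASCII digits.
    if re.fullmatch(r'[0-9]{14}', dt14) is None:
        return False
    try:
        # strptime raises ValueError for values such as month 20 or day 00.
        datetime.strptime(dt14, '%Y%m%d%H%M%S')
    except ValueError:
        return False
    return True
```

With this check, both 20172004101112 (month 20) and the 0-padded 20140000000000 (month 00) would be rejected, while 20140910021539 would pass.
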
@ibnesayeed
Member

Memento endpoint should return only the closest memento. What you are describing here should be implemented as datetime-based slicing on TimeMap. There are plans to support similar functionality in MemGator too.

@machawk1
Member Author

machawk1 commented Dec 1, 2017

Correct, this ticket appeals to the slicing feature of the /d{1,14}/{URI-R} endpoint and not the /memento/ endpoint, per above.

@ibnesayeed
Member

I think I am confused about the purposes of showMementoAtDatetime and showMemento.

In my opinion, sliced TimeMaps should be served at /timemap/<format>/<[start-datetime]:[end-datetime]>/<urir>.
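
To make the slicing idea concrete, here is a rough Python sketch of how a `<[start-datetime]:[end-datetime]>` path segment could filter mementos. The function name `slice_mementos`, the inclusive bounds, and the use of plain 14-digit strings are assumptions for illustration, not existing ipwb or MemGator behavior:

```python
def slice_mementos(datetimes, range_spec):
    """Filter 14-digit datetime strings by a '[start]:[end]' range.

    Either side of the colon may be empty, meaning unbounded.
    Hypothetical helper; bounds are treated as inclusive here.
    """
    start, _, end = range_spec.partition(':')
    # Pad partial bounds so plain string comparison works on 14 digits.
    # A start of '2016' becomes '20160000000000' and an end of '2016'
    # becomes '20169999999999'. These are comparison keys, not real dates.
    lower = start.ljust(14, '0') if start else '0' * 14
    upper = end.ljust(14, '9') if end else '9' * 14
    return [dt for dt in datetimes if lower <= dt <= upper]
```

Under these assumptions, a range of `2016:2016` would keep only mementos captured in 2016, and `:2015` would keep everything up to the end of 2015.
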

@machawk1
Member Author

machawk1 commented Dec 1, 2017

showMementoAtDatetime() handles requests for /20171201094013/http://example.com.

It's really a wrapper to the generic show_uri().

showMemento() does the "closest datetime" resolution, i.e., requests for /memento/*/http://example.com.

There is definitely some duplicate code and redundant functionality, which stems from progressively adding the functionality.

The sliced TimeMap functionality also needs to serve the web UI when a 404 for a URI-M is encountered (see screenshots in #286). I think the /timemap/ endpoint ought to be isolated from this.

We also have getLinkHeaderAbbreviatedTimeMap() for the Link response headers. Similar logic can be used for the sliced TMs.

@machawk1
Member Author

The current state of resolution is odd. With the 5mementos sample WARC, http://localhost:5002/memento/2020/memento.us/ resolves to http://localhost:5002/memento/20130202100000/memento.us/, even when http://localhost:5002/memento/20161231110001/memento.us/ is available. The facade of date-based resolution needs to be clarified relative to the code; then progress can be made on this.
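
The expected behavior of "closest" resolution can be sketched as follows. This is only an illustration: `closest_datetime` is a hypothetical name, not the actual ipwb function, and the middle datetime in the sample list is invented.

```python
from datetime import datetime

FMT = '%Y%m%d%H%M%S'


def closest_datetime(candidates, target):
    """Return the 14-digit datetime in candidates nearest to target."""
    t = datetime.strptime(target, FMT)
    return min(candidates,
               key=lambda c: abs(datetime.strptime(c, FMT) - t))


# Only the first and last values come from the example above;
# the middle one is invented for illustration.
mementos = ['20130202100000', '20140114100000', '20161231110001']
print(closest_datetime(mementos, '20200101000000'))  # prints 20161231110001
```
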

@machawk1
Member Author

machawk1 commented Oct 10, 2018

Resolving unspecified date fields to the earliest possibility seems like the most logical resolution, e.g., the aforementioned request for the URI-R at the date 2020 would resolve to 20200101000000. For a specified year and month (e.g., 192005), use similar fill-in logic (19200501000000).

@ibnesayeed
Member

This is done in MemGator as follows:

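import (
	"fmt"
	"regexp"
	"time"
)

// A 4-digit year followed by up to five optional 2-digit fields.
var regs = map[string]*regexp.Regexp{
	"dttmstr": regexp.MustCompile(`^(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?$`),
}

// paddedTime fills each missing field with its earliest valid value.
func paddedTime(dttmstr string) (dttm *time.Time, err error) {
	m := regs["dttmstr"].FindStringSubmatch(dttmstr)
	if m == nil {
		// Guard against indexing a nil match on malformed input.
		return nil, fmt.Errorf("invalid datetime: %q", dttmstr)
	}
	dts := m[1]
	dts += (m[2] + "01")[:2] // month defaults to 01
	dts += (m[3] + "01")[:2] // day defaults to 01
	dts += (m[4] + "00")[:2] // hour defaults to 00
	dts += (m[5] + "00")[:2] // minute defaults to 00
	dts += (m[6] + "00")[:2] // second defaults to 00
	var dtm time.Time
	dtm, err = time.Parse("20060102150405", dts)
	dttm = &dtm
	return
}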
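
A rough Python equivalent, in case it helps with the ipwb adaptation. This is only a sketch: `pad_datetime` is a hypothetical name, and raising `ValueError` on bad input is an assumption about how ipwb would want failures reported.

```python
import re
from datetime import datetime

# A 4-digit year followed by up to five optional 2-digit fields.
DTTM_RE = re.compile(
    r'(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?', re.ASCII)


def pad_datetime(dttmstr):
    """Pad a partial datetime (4 to 14 digits) to the earliest full datetime."""
    m = DTTM_RE.fullmatch(dttmstr)
    if m is None:
        raise ValueError('Invalid datetime: {!r}'.format(dttmstr))
    year, month, day, hour, minute, second = m.groups()
    # Missing month and day default to 01; missing time fields to 00.
    padded = (year + (month or '01') + (day or '01') +
              (hour or '00') + (minute or '00') + (second or '00'))
    # strptime raises ValueError for impossible dates such as month 20.
    return datetime.strptime(padded, '%Y%m%d%H%M%S')
```

Because the final step parses the padded string, out-of-range components are rejected without making the regex itself any stricter.
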

@machawk1
Member Author

@ibnesayeed Thanks for supplying this logic. Do you want to try to adapt this to ipwb? If not, I can use it as inspiration for an implementation.

@machawk1
Member Author

Also, for reference, this should be written/called in/from resolveMemento() prior to the call to getCDXJLineClosestTo().

@machawk1
Member Author

Though it's a little out of scope for this ticket, we may also want to tweak the regex to be stricter about the bounds of each date component, e.g., reject months > 12.

@ibnesayeed
Member

> Do you want to try to adapt this to ipwb?

I can look into it.

> Though it's a little out of scope for this ticket, we may also want to tweak the regex to be stricter about the bounds of each date component, e.g., reject months > 12.

This can be better done by parsing the datetime and raising an invalid-datetime exception.

@machawk1
Member Author

> This can be better done

Is liberal use of exceptions common practice in Python? This pattern seems to be your approach to a few of our problems and I am curious as to whether it is inspired by another language or influenced by other Python examples.

> I can look into it.

I will assign this ticket to you in this case. If you'd rather not, let me know and I can take care of it.

@machawk1
Member Author

https://github.com/webrecorder/warcio/blob/master/warcio/timeutils.py also has some useful test cases for this, e.g., inclusion of letters, bad month, padding, etc.

@ibnesayeed
Member

ibnesayeed commented Oct 10, 2018

> Is liberal use of exceptions common practice in Python? This pattern seems to be your approach to a few of our problems and I am curious as to whether it is inspired by another language or influenced by other Python examples.

It is actually very Pythonic to use exceptions in situations like this. Some other languages encourage using exceptions too, but I am not necessarily applying any cross-language influence here. In some ways, exceptions fall under a few mantras of Python programming: 1) explicit is better than implicit, 2) errors should never pass silently, unless explicitly silenced, and 3) it is easier to ask for forgiveness than permission.

Practically speaking, when something unexpected happens, the function would not know what to return (and the return value in case of unexpected failure could be meaningless other than for detecting a failure). So, one may choose to return None or an empty string ''. However, these return values could be legitimate non-failure values as well. Also, the caller needs to know what the function returns in case of failure. Moreover, if the function returns more than one value, any early termination on failure should maintain that return signature, or else the caller will have trouble handling non-uniform return values. In all these scenarios, it is much easier to raise a meaningful exception.
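
To illustrate the difference, compare a sentinel-returning function with one that raises. Both functions and the `InvalidDatetimeError` class are hypothetical names made up for this example:

```python
class InvalidDatetimeError(ValueError):
    """Raised when a datetime string cannot be interpreted."""


def split_year_sentinel(dttmstr):
    # Returns None on failure, so every caller must remember to check
    # before unpacking the (year, rest) pair.
    if len(dttmstr) < 4 or not dttmstr[:4].isdigit():
        return None
    return int(dttmstr[:4]), dttmstr[4:]


def split_year(dttmstr):
    # Always returns a (year, rest) pair, or raises with a clear message.
    if len(dttmstr) < 4 or not dttmstr[:4].isdigit():
        raise InvalidDatetimeError('Invalid datetime: {!r}'.format(dttmstr))
    return int(dttmstr[:4]), dttmstr[4:]


try:
    year, rest = split_year('20x4')
except InvalidDatetimeError as e:
    print('Rejected:', e)
```

With the sentinel version, `year, rest = split_year_sentinel('20x4')` would instead fail with a `TypeError` about unpacking `None`, which says nothing about the actual problem.
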

@ibnesayeed
Member

I have updated the task checkboxes as per #586. For the one remaining task, which concerns more precise datetimes (in milliseconds or nanoseconds), we can open a new ticket.

@machawk1
Member Author

@ibnesayeed #283 covers the more-precise date requirement. Your contribution to this ticket meets the criteria to close it.
