Fix ARCHeadersParser splits on space, cause errors with spaces in uri's (#62)

* Use rsplit() so that a URL containing spaces is counted as a single field. This should take care of errors in ARC files whose URLs contain spaces.

tests: added example-space-in-url.arc for testing an ARC file with spaces in the URL
Chase H.D authored and ikreymer committed Jan 26, 2019
1 parent 704297b commit 759ab07
Showing 3 changed files with 74 additions and 1 deletion.
69 changes: 69 additions & 0 deletions test/data/example-space-in-url.arc
@@ -0,0 +1,69 @@
filedesc://live-web-example.arc.gz 127.0.0.1 20140216050221 text/plain 75
1 0 LiveWeb Capture
URL IP-address Archive-date Content-type Archive-length

http://example.com/index.cfm?FuseAction=Email&EmailTitle=Examples From The Live Web&IsPopUp=False 93.184.216.119 20140216050221 text/html 1591
HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Sun, 16 Feb 2014 05:02:20 GMT
Etag: "359670651"
Expires: Sun, 23 Feb 2014 05:02:20 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (sjc/4FCE)
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 1270

<!doctype html>
<html>
<head>
<title>Example Domain</title>

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
width: 600px;
margin: 5em auto;
padding: 50px;
background-color: #fff;
border-radius: 1em;
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
body {
background-color: #fff;
}
div {
width: auto;
margin: 0 auto;
border-radius: 0;
padding: 1em;
}
}
</style>
</head>

<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents. You may use this
domain in examples without prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

4 changes: 4 additions & 0 deletions test/test_archiveiterator.py
@@ -112,6 +112,10 @@ def test_example_arc_gz(self):
         expected = ['arc_header', 'response']
         assert self._load_archive('example.arc.gz') == expected

+    def test_example_space_in_url_arc(self):
+        expected = ['arc_header', 'response']
+        assert self._load_archive('example-space-in-url.arc') == expected
+
     def test_example_arc(self):
         expected = ['arc_header', 'response']
         assert self._load_archive('example.arc') == expected
2 changes: 1 addition & 1 deletion warcio/recordloader.py
@@ -288,7 +288,7 @@ def parse(self, stream, headerline=None):
             total_read += len(version)
             total_read += len(spec)

-        parts = headerline.split(' ')
+        parts = headerline.rsplit(' ', len(headernames)-1)

         if len(parts) != len(headernames):
             msg = 'Wrong # of headers, expected arc headers {0}, Found {1}'
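
The effect of the one-line change in warcio/recordloader.py can be illustrated in isolation. The following is a minimal sketch, not the warcio code itself: `HEADER_NAMES`, `split_naive`, and `split_from_right` are hypothetical names chosen for illustration, and the five field names are assumed from the field-spec line in the test file (`URL IP-address Archive-date Content-type Archive-length`). It applies both splitting strategies to the record line from example-space-in-url.arc.

```python
# Hypothetical field names, assumed from the ARC field-spec line
# "URL IP-address Archive-date Content-type Archive-length".
HEADER_NAMES = ['uri', 'ip-address', 'archive-date', 'content-type', 'length']


def split_naive(headerline):
    """Old behaviour: split on every space, so spaces in the URL add fields."""
    return headerline.split(' ')


def split_from_right(headerline, headernames=HEADER_NAMES):
    """New behaviour: split from the right at most len(headernames) - 1 times,
    so everything left over on the left stays together as the first field."""
    return headerline.rsplit(' ', len(headernames) - 1)


# Record line taken from test/data/example-space-in-url.arc
line = ('http://example.com/index.cfm?FuseAction=Email'
        '&EmailTitle=Examples From The Live Web&IsPopUp=False'
        ' 93.184.216.119 20140216050221 text/html 1591')

naive = split_naive(line)
fixed = split_from_right(line)

print(len(naive), len(HEADER_NAMES))  # field count no longer matches
print(len(fixed), len(HEADER_NAMES))  # field count matches
print(fixed[0])                       # the full URL, spaces included
```

Because `rsplit` with a maximum split count works from the right, this approach assumes that only the first field (the URL) can contain spaces; a space in any of the later fields would still be misparsed.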
