
Hangs after downloading a few pages #1

Open
youssefavx opened this issue Nov 15, 2020 · 4 comments · May be fixed by #6

Comments


youssefavx commented Nov 15, 2020

Hey! Thanks for making this! Was looking for something like this and this one actually worked!

For some reason, it downloads a few pages and then hangs. If I close it, restart it, and run it again with the same settings, it continues, then hangs again. It gets through around 10 pages or fewer each time.

youssefavx (Author)

Sorry, I realized a traceback would be useful:

8% (57/703) done
8% (58/703) done
8% (59/703) done
8% (60/703) done
8% (61/703) done
8% (62/703) done
8% (63/703) done
9% (64/703) done
9% (65/703) done
9% (66/703) done
9% (67/703) done
9% (68/703) done
9% (69/703) done
9% (70/703) done
10% (71/703) done
10% (72/703) done
10% (73/703) done
10% (74/703) done
^CTraceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 265, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 265, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ripper.py", line 60, in <module>
    main()
  File "ripper.py", line 52, in main
    contents = client.download_page(i)
  File "/Volumes/Transcend2/archivedownloader/archiveripper/api.py", line 143, in download_page
    'referer': self.URL_FORMAT % ('details/' + self.book_id)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Exception ignored in: <module 'threading' from '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py'>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1273, in _shutdown
    t.join()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1032, in join
    self._wait_for_tstate_lock()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

notevenaperson (Contributor)

I can't replicate the issue without knowing which book you tried to borrow. Does this happen with every book you've tried, or only a specific one? Did you try running it again against that specific book? From the traceback, it just looks like your internet connection dropped.

notevenaperson (Contributor)

Hi again. Lately I've been having this issue myself. It seems archive.org drops the connection if it suspects automated requests. I fixed this in my fork by adding a random delay of a second or so between requests. It's only implemented in the dev branch of my fork for now, since I'm waiting for my pull request that adds command-line flags to be accepted.
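For reference, the approach described above (a random pause after every request) can be sketched roughly like this. The `polite_get` helper and the delay bounds are illustrative guesses, not the fork's actual code:

```python
import random
import time


def polite_get(session, url, min_delay=1.0, max_delay=2.0, **kwargs):
    """Fetch a URL, then sleep a random interval so traffic looks
    less like an automated scraper.

    `session` is expected to behave like a requests.Session; the
    delay bounds here are assumptions, not measured values.
    """
    response = session.get(url, **kwargs)
    time.sleep(random.uniform(min_delay, max_delay))
    return response
```

The jitter matters: a fixed delay produces a perfectly regular request pattern, which is itself easy for a server to flag.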

notevenaperson added a commit to notevenaperson/archiveripper that referenced this issue Sep 13, 2021
…een requests to prevent being blocked

1. adds the -R flag
2. Should fix scoliono#1 and
   adds the -nt flag
@notevenaperson notevenaperson linked a pull request Sep 14, 2021 that will close this issue
Aphexus commented Apr 28, 2024

Similar issue.

Don't just add a few seconds of delay (especially after downloading several pages); also check for empty files and then delay longer before retrying.

My quick hax:

import logging
import random
import time
from os.path import exists, getsize

downloaded_pages = 0
total = end - start
for i in range(start, end):
    fn = '%s/%d.jpg' % (dir, i + 1)

    # Skip pages that were already downloaded successfully (non-empty files)
    if exists(fn) and getsize(fn) != 0:
        continue

    sleeptime = random.uniform(2, 5)
    # Take an extra break every 25 pages (and a second one every 50)
    if i % 25 == 0:
        time.sleep(10)
    if i % 50 == 0:
        time.sleep(10)

    logging.debug('downloading page %d (index %d)' % (i + 1, i))
    contents = client.download_page(i, args.scale)
    # A very short response likely means the data is corrupt or we were
    # blocked, so wait and try once more.
    if len(contents) < 100:
        time.sleep(50)
        contents = client.download_page(i, args.scale)
        # Still too short: give up on this page
        if len(contents) < 100:
            print("Error downloading page " + str(i) + ". Ignoring...")
            continue

    with open(fn, 'wb') as file:
        file.write(contents)
    downloaded_pages += 1
    done_count = i + 1 - start
    print('%d%% (%d/%d) done. Total downloaded (%d/%d)' % (done_count / total * 100, done_count, total, downloaded_pages, total))

    time.sleep(sleeptime)
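A cleaner alternative to the single fixed 50-second retry might be exponential backoff with a retry cap. This is a hedged sketch, not part of the project; `download` stands in for any callable like `client.download_page`, and the thresholds are assumptions:

```python
import random
import time


def download_with_backoff(download, page, max_tries=4, base_delay=2.0, min_size=100):
    """Retry download(page) until it returns at least min_size bytes,
    or give up after max_tries attempts and return None.

    `download` is any callable returning bytes; min_size and the
    delays are illustrative guesses, not values from the project.
    """
    for attempt in range(max_tries):
        contents = download(page)
        if len(contents) >= min_size:
            return contents
        # Short/empty response: likely blocked, so back off
        # exponentially (2 s, 4 s, 8 s, ...) plus a little jitter.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return None  # the caller can skip the page, as the snippet above does
```

Backoff spaces retries out progressively, which tends to recover from temporary blocks faster than one long fixed sleep while still easing the load on the server.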

