[wip][feature] Add support for fsspec backends
 - [ ] Add tests using python3 -m http.server and asyncssh or paramiko server
   and maybe something for samba and s3 (botocore).
mxmlnkn committed Oct 4, 2024
1 parent f3aaeca commit fa75cfa
Showing 13 changed files with 503 additions and 16 deletions.
21 changes: 20 additions & 1 deletion .github/workflows/tests.yml
@@ -124,7 +124,18 @@ jobs:
# https://github.com/libarchive/libarchive/blob/ad5a0b542c027883d7069f6844045e6788c7d70c/libarchive/
# archive_read_support_filter_lrzip.c#L68
sudo apt-get -y install libfuse2 fuse3 bzip2 pbzip2 pixz zstd unar lrzip lzop gcc liblzo2-dev
set -x
- name: Install Dependencies For Unreleased Python Versions (Linux)
if: >
startsWith( matrix.os, 'ubuntu' ) && (
matrix.python-version == '3.13.0-rc.2' ||
matrix.python-version == '3.14.0-alpha.0')
run: |
#libgit2-dev is too old on Ubuntu 22.04. Leads to error about missing git2/sys/errors.h
#sudo apt-get -y install libgit2-dev
sudo apt-get -y install cmake
git clone --branch v1.8.1 --depth 1 https://github.com/libgit2/libgit2.git
( cd libgit2 && mkdir build && cd build && cmake .. && cmake --build . && sudo cmake --build . -- install )
- name: Install Dependencies (MacOS)
if: startsWith( matrix.os, 'macos' )
@@ -139,6 +150,14 @@ jobs:
# Add brew installation binary folder to PATH so that command line tools like zstd can be found
export PATH="$PATH:/usr/local/bin"
- name: Install Dependencies For Unreleased Python Versions (MacOS)
if: >
startsWith( matrix.os, 'macos' ) && (
matrix.python-version == '3.13.0-rc.2' ||
matrix.python-version == '3.14.0-alpha.0')
run: |
brew install libgit2
- name: Install pip Dependencies
run: |
python3 -m pip install --upgrade pip
12 changes: 12 additions & 0 deletions AppImage/build-ratarmount-appimage.sh
@@ -62,6 +62,18 @@ function installAppImagePythonPackages()
fi
"$APP_PYTHON_BIN" -I -m pip install --no-cache-dir ../core
"$APP_PYTHON_BIN" -I -m pip install --no-cache-dir ..[full]

# ratarmount-0.10.0-manylinux2014_x86_64.AppImage (the first one!) was 13.6 MB
# ratarmount-v0.11.3-manylinux2014_x86_64.AppImage was 13.6 MB
# ratarmount-0.12.0-manylinux2014_x86_64.AppImage was 26.3 MB thanks to an error with the trim-down script.
# ratarmount-0.15.0-x86_64.AppImage was 14.8 MB
# ratarmount-0.15.1-x86_64.AppImage was 13.3 MB (manylinux_2014)
# ratarmount-0.15.2-x86_64.AppImage was 11.7 MB (manylinux_2_28)
# At this point, with pyfatfs, the AppImage is/was 13.0 MB. Extracts to 45.1 MB
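# This bloats the AppImage to 23.7 MB, which is still ok, I guess. Extracts to 83.1 MB
# The version specifier must be quoted. Otherwise, the shell treats "<1.15" as an input redirection.
"$APP_PYTHON_BIN" -I -m pip install --no-cache-dir requests aiohttp sshfs smbprotocol 'pygit2<1.15' fsspec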
# This bloats the AppImage to 38.5 MB :/. Extracts to 121.0 MB
"$APP_PYTHON_BIN" -I -m pip install --no-cache-dir s3fs gcsfs adlfs dropboxdrivefs
}

function installAppImageSystemLibraries()
31 changes: 30 additions & 1 deletion README.md
@@ -34,6 +34,10 @@ And in contrast to [tarindexer](https://github.com/devsnd/tarindexer), which als
- **Union Mounting:** Multiple TARs, compressed files, and bind mounted folders can be mounted under the same mountpoint.
- **Write Overlay:** A folder can be specified as write overlay.
All changes below the mountpoint will be redirected to this folder and deletions are tracked so that all changes can be applied back to the archive.
- **Remote Files and Folders:** A remote archive or a whole folder structure can be mounted, similarly to tools like [sshfs](https://github.com/libfuse/sshfs), thanks to the [filesystem_spec](https://github.com/fsspec/filesystem_spec) project.
These can be specified with URIs as explained in the section ["Remote Files"](#remote-files).
Supported remote protocols include: FTP, SFTP, HTTP, HTTPS, SSH, Git, GitHub, S3, Samba, Azure Data Lake, Dropbox, Google Cloud Storage, ...


*TAR compressions supported for random access:*

@@ -102,7 +106,9 @@ And in contrast to [tarindexer](https://github.com/devsnd/tarindexer), which als
4. [File versions](#file-versions)
5. [Compressed non-TAR files](#compressed-non-tar-files)
6. [Xz and Zst Files](#xz-and-zst-files)
7. [As a Library](#as-a-library)
7. [Remote Files](#remote-files)
8. [Writable Mounting](#writable-mounting)
9. [As a Library](#as-a-library)


# Installation
@@ -506,6 +512,29 @@ lbzip2 -cd well-compressed-file.bz2 | createMultiFrameZstd $(( 4*1024*1024 )) >
</details>


# Remote Files

The [fsspec](https://github.com/fsspec/filesystem_spec) API backend adds support for mounting many remote archives or folders:

- `git://[path-to-repo:][ref@]path/to/file`
Uses the current path if no repository path is specified.
- `github://org:repo@[sha]/path-to/file-or-folder`
E.g., `github://mxmlnkn:ratarmount@v0.15.2/tests/single-file.tar`
- `http[s]://hostname[:port]/path-to/archive.rar`
- `s3://[endpoint-hostname[:port]]/bucket/single-file.tar`
Will default to AWS according to the Boto3 library defaults
when no endpoint is specified. Boto3 will check these environment
variables for credentials:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_SESSION_TOKEN`
- `[s]ftp://[user[:password]@]hostname[:port]/path-to/archive.rar`
- `ssh://[user[:password]@]hostname[:port]/path-to/archive.rar`
- `smb://[workgroup;][user:password@]server[:port]/share/folder/file.tar`

Many other fsspec-based projects may also work when installed.
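
For the host-based forms listed above (`http[s]://`, `[s]ftp://`, `ssh://`), the parts of such a URL can be illustrated with Python's standard `urllib.parse`. This is only a sketch of the URL anatomy, not a description of how ratarmount or fsspec parse these strings internally. The helper name `split_remote_url` and the example host are hypothetical, and the `git://` and `github://` forms use their own syntax, which this sketch does not cover.

```python
from urllib.parse import urlsplit


def split_remote_url(url: str) -> dict:
    """Split a host-based URL into the parts named in the list above (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "user": parts.username,      # None when no user is given
        "password": parts.password,  # None when no password is given
        "host": parts.hostname,
        "port": parts.port,          # None when the protocol default applies
        "path": parts.path,
    }


if __name__ == "__main__":
    # Hypothetical host and credentials, for illustration only.
    print(split_remote_url("sftp://user:secret@example.org:2222/path-to/archive.rar"))
```

Passing credentials inside the URL is convenient for testing, but they may end up in the shell history, so other mechanisms such as the environment variables listed for S3 are often preferable.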


# Writable Mounting

The `--write-overlay <folder>` option can be used to create a writable mount point.
42 changes: 42 additions & 0 deletions core/pyproject.toml
@@ -72,9 +72,51 @@ full = [
# With Python 3.14, when building the wheel, I get:
# /usr/bin/ld: cannot find /tmp/tmpcuw21d78/bin/isa-l.a: No such file or directory
'isal ~= 1.0; python_version < "3.14.0"',
# Pin to < 3.12 because of https://github.com/nathanhi/pyfatfs/issues/41
'pyfatfs ~= 1.0; python_version < "3.12.0"',
# fsspec:
"requests",
"aiohttp",
"sshfs",
# Need newer pyopenssl than comes with Ubuntu 22.04.
# https://github.com/ronf/asyncssh/issues/690
"pyopenssl>=23",
"smbprotocol",
# pygit2 1.15 introduced many breaking changes!
# https://github.com/libgit2/pygit2/issues/1316
# https://github.com/fsspec/filesystem_spec/pull/1703
"pygit2<1.15",
"fsspec",
"s3fs",
"gcsfs",
"adlfs",
"dropboxdrivefs",
]
bzip2 = ["rapidgzip >= 0.13.1"]
gzip = ["indexed_gzip >= 1.6.3, < 2.0"]
fsspec = [
# Copy-pasted from fsspec[full] list. Some were excluded because they are disproportionately large.
"requests",
"aiohttp",
"sshfs",
# Need newer pyopenssl than comes with Ubuntu 22.04.
# https://github.com/ronf/asyncssh/issues/690
"pyopenssl>=23",
"smbprotocol", # build error in Python 3.13
# pygit2 1.15 introduced many breaking changes!
# https://github.com/libgit2/pygit2/issues/1316
# https://github.com/fsspec/filesystem_spec/pull/1703
"pygit2<1.15", # build error in Python 3.13 because it requires libgit2 1.8.1
"fsspec",
"s3fs",
"gcsfs",
"adlfs", # build error in Python 3.13
"dropboxdrivefs",
# "dask", "distributed" : ~34 MB, ~10 MB gzip-compressed
# "pyarrow >= 1" : ~196 MB, ~60 MB gzip-compressed, build error in Python 3.13
# "ocifs" : ~350 MB
# "panel" : only for fsspec GUI
]
# Need >= 4.1 because of https://github.com/markokr/rarfile/issues/73
rar = ["rarfile ~= 4.1"]
# For now, only optional (and installed in the AppImage) because it is unstable and depends on many other packages
16 changes: 13 additions & 3 deletions core/ratarmountcore/SQLiteIndex.py
@@ -183,6 +183,7 @@ def __init__(
preferMemory: bool = False,
indexMinimumFileCount: int = 0,
backendName: str = '',
ignoreCurrentFolder: bool = False,
):
"""
indexFilePath
@@ -206,6 +207,9 @@ def __init__(
exceeded. It may also be written to a file if a gzip index is stored.
backendName
The backend name to be stored as metadata and to determine compatibility of found indexes.
ignoreCurrentFolder
If true, then do not store the index into the current path. This was introduced for URLs
opened as file objects but may be useful for any archive given via a file object.
"""

if not backendName:
@@ -217,7 +221,7 @@ def __init__(
self.indexFilePath: Optional[str] = None
self.encoding = encoding
self.possibleIndexFilePaths = SQLiteIndex.getPossibleIndexFilePaths(
indexFilePath, indexFolders, archiveFilePath
indexFilePath, indexFolders, archiveFilePath, ignoreCurrentFolder
)
# stores which parent folders were last tried to add to database and therefore do exist
self.parentFolderCache: List[Tuple[str, str]] = []
@@ -247,7 +251,10 @@ def __init__(

@staticmethod
def getPossibleIndexFilePaths(
indexFilePath: Optional[str], indexFolders: Optional[List[str]] = None, archiveFilePath: Optional[str] = None
indexFilePath: Optional[str],
indexFolders: Optional[List[str]] = None,
archiveFilePath: Optional[str] = None,
ignoreCurrentFolder: bool = False,
) -> List[str]:
if indexFilePath:
return [] if indexFilePath == ':memory:' else [os.path.abspath(os.path.expanduser(indexFilePath))]
@@ -265,7 +272,7 @@ def getPossibleIndexFilePaths(
if folder:
indexPath = os.path.join(folder, indexPathAsName)
possibleIndexFilePaths.append(os.path.abspath(os.path.expanduser(indexPath)))
else:
elif not ignoreCurrentFolder:
possibleIndexFilePaths.append(defaultIndexFilePath)
return possibleIndexFilePaths
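
The effect of `ignoreCurrentFolder` on the loop above can be sketched in isolation. The helper name `candidate_index_paths`, the `.index.sqlite` suffix, and the way the archive path is flattened into a file name are assumptions made for illustration, because the corresponding lines are not part of this hunk. A falsy folder entry selects the default location, which the docstring calls the current path.

```python
import os
from typing import List


def candidate_index_paths(
    index_folders: List[str],
    archive_path: str,
    ignore_current_folder: bool = False,
    suffix: str = ".index.sqlite",  # assumed suffix, not confirmed by the hunk
) -> List[str]:
    """Return candidate index locations, skipping the default one on request."""
    default_path = archive_path + suffix
    # Assumed flattening so that the archive path becomes a single file name.
    flattened_name = default_path.replace("/", "_")

    candidates = []
    for folder in index_folders:
        if folder:
            path = os.path.join(folder, flattened_name)
            candidates.append(os.path.abspath(os.path.expanduser(path)))
        elif not ignore_current_folder:
            # A falsy entry means "next to the archive", which makes
            # no sense for an archive that was opened from a URL.
            candidates.append(default_path)
    return candidates


if __name__ == "__main__":
    url = "https://example.org/a.tar"  # hypothetical URL
    print(candidate_index_paths(["", "/tmp/idx"], url))
    print(candidate_index_paths(["", "/tmp/idx"], url, ignore_current_folder=True))
```

With the flag set, only the explicitly named folder remains, so no attempt is made to write an index to a path that starts with a URL scheme.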

@@ -563,6 +570,9 @@ def reloadIndexReadOnly(self):
self.sqlConnection = SQLiteIndex._openSqlDb(f"file:{uriPath}?mode=ro", uri=True, check_same_thread=False)

def _reloadIndexOnDisk(self):
if self.printDebug >= 2:
print("[Info] Try to reopen SQLite database on disk at:", self.indexFilePath)
print("other index paths:", self.possibleIndexFilePaths)
if not self.indexFilePath or self.indexFilePath != ':memory:' or not self.sqlConnection:
return

16 changes: 12 additions & 4 deletions core/ratarmountcore/SQLiteIndexedTar.py
@@ -704,6 +704,7 @@ def __init__(
self.tarFileName = tarFileName
else:
raise RatarmountError("At least one of tarFileName and fileObject arguments should be set!")
self._fileNameIsURL = re.match('[A-Za-z0-9]*://', self.tarFileName) is not None

# If no fileObject given, then self.tarFileName is the path to the archive to open.
if not fileObject:
@@ -771,16 +772,19 @@ def __init__(
if indexFolders and isinstance(indexFolders, str):
indexFolders = [indexFolders]

archiveFilePath = self.tarFileName if not self.isFileObject or self._fileNameIsURL else None

super().__init__(
SQLiteIndex(
indexFilePath,
indexFolders=indexFolders,
archiveFilePath=None if self.isFileObject else self.tarFileName,
archiveFilePath=archiveFilePath,
encoding=self.encoding,
checkMetadata=self._checkMetadata,
printDebug=self.printDebug,
indexMinimumFileCount=indexMinimumFileCount,
backendName='SQLiteIndexedTar',
ignoreCurrentFolder=self.isFileObject and self._fileNameIsURL,
),
clearIndexCache=clearIndexCache,
)
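
How the URL check and the two arguments passed to `SQLiteIndex` interact can be sketched as a pair of small functions. The regular expression and the two expressions are copied from the hunks above. The function names `looks_like_url` and `index_location_inputs` are hypothetical and exist only for this illustration.

```python
import re
from typing import Optional, Tuple


def looks_like_url(name: str) -> bool:
    # Same pattern as in the hunk: an alphanumeric scheme followed by "://".
    return re.match('[A-Za-z0-9]*://', name) is not None


def index_location_inputs(tar_file_name: str, is_file_object: bool) -> Tuple[Optional[str], bool]:
    """Return (archiveFilePath, ignoreCurrentFolder) as computed in the hunk above."""
    is_url = looks_like_url(tar_file_name)
    # A URL is kept as the archive path even for file objects,
    # so that an index name can still be derived from it.
    archive_file_path = tar_file_name if not is_file_object or is_url else None
    # However, the index must not be written "next to" a URL.
    ignore_current_folder = is_file_object and is_url
    return archive_file_path, ignore_current_folder


if __name__ == "__main__":
    for name, is_file_object in [
        ("archive.tar", False),
        ("<file object>", True),
        ("https://example.org/a.tar", True),  # hypothetical URL
    ]:
        print(name, is_file_object, index_location_inputs(name, is_file_object))
```

Because the pattern uses `*` rather than `+`, a string that begins with `://` and has an empty scheme also counts as a URL in this sketch.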
@@ -829,9 +833,9 @@ def __init__(

# Open new database when we didn't find an existing one.
if not self.index.indexIsLoaded():
# Simply open in memory without an error even if writeIndex is True but when not indication
# for a index file location has been given.
if writeIndex and (indexFilePath or not self.isFileObject):
# Simply open in memory without an error even if writeIndex is True but when no indication
# for an index file location has been given.
if writeIndex and (indexFilePath or self._getArchivePath() or not self.isFileObject):
self.index.openWritable()
else:
self.index.openInMemory()
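
The decision between a writable on-disk index and an in-memory index can be written as a single predicate. This is a minimal sketch that combines the condition above with the `'<file object>'` placeholder check from `_getArchivePath` further down in this diff. The function name `should_open_writable` is hypothetical.

```python
from typing import Optional


def should_open_writable(
    write_index: bool,
    index_file_path: Optional[str],
    tar_file_name: str,
    is_file_object: bool,
) -> bool:
    """Return True if there is some indication of where an index file could be stored."""
    # Mirrors _getArchivePath: the placeholder name carries no location information.
    archive_path = None if tar_file_name == '<file object>' else tar_file_name
    return bool(write_index and (index_file_path or archive_path or not is_file_object))


if __name__ == "__main__":
    # Anonymous file object without any hint: fall back to an in-memory index.
    print(should_open_writable(True, None, '<file object>', True))
    # File object that was opened from a URL (hypothetical): an index may be written.
    print(should_open_writable(True, None, 'https://example.org/a.tar', True))
```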
@@ -890,6 +894,9 @@ def __exit__(self, exception_type, exception_value, exception_traceback):
if not self.isFileObject and self.rawFileObject:
self.rawFileObject.close()

def _getArchivePath(self) -> Optional[str]:
return None if self.tarFileName == '<file object>' else self.tarFileName

def _storeMetadata(self) -> None:
argumentsToSave = [
'mountRecursively',
@@ -902,6 +909,7 @@ def _storeMetadata(self) -> None:
]

argumentsMetadata = json.dumps({argument: getattr(self, argument) for argument in argumentsToSave})
# The second argument must be a path to a file to call os.stat with, not simply a file name.
self.index.storeMetadata(argumentsMetadata, None if self.isFileObject else self.tarFileName)
self.index.storeMetadataKeyValue('isGnuIncremental', '1' if self._isGnuIncremental else '0')

23 changes: 20 additions & 3 deletions core/ratarmountcore/compressions.py
@@ -99,7 +99,12 @@ def checkZlibHeader(file):
'bz2': CompressionInfo(
['bz2', 'bzip2'],
['tb2', 'tbz', 'tbz2', 'tz2'],
[CompressionModuleInfo('rapidgzip', lambda x, parallelization=0: rapidgzip.IndexedBzip2File(x, parallelization=parallelization))], # type: ignore
[
CompressionModuleInfo(
'rapidgzip',
(lambda x, parallelization=0: rapidgzip.IndexedBzip2File(x, parallelization=parallelization)),
)
],
lambda x: (x.read(4)[:3] == b'BZh' and x.read(6) == (0x314159265359).to_bytes(6, 'big')),
),
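
The bzip2 check in the last lambda above can be tried out with the standard library alone. The sketch below repeats the same comparison: the first three bytes must be `BZh`, the fourth byte is the block-size digit and is therefore ignored, and the next six bytes must be the block magic `0x314159265359`. The names `looks_like_bzip2` and `BZIP2_BLOCK_MAGIC` are hypothetical.

```python
import bz2
import io

# The 48-bit magic number at the start of the first compressed block.
BZIP2_BLOCK_MAGIC = (0x314159265359).to_bytes(6, 'big')


def looks_like_bzip2(fileobj) -> bool:
    """Check the stream header and the first block magic, as in the lambda above."""
    # Read 4 bytes but compare only 3, because the 4th is the block-size digit.
    return fileobj.read(4)[:3] == b'BZh' and fileobj.read(6) == BZIP2_BLOCK_MAGIC


if __name__ == "__main__":
    print(looks_like_bzip2(io.BytesIO(bz2.compress(b"hello"))))
    print(looks_like_bzip2(io.BytesIO(b"plain text data")))
```

Like the original lambda, the function advances the file position, so a caller would need to seek back before handing the file object to a decompressor.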
'gz': CompressionInfo(
@@ -532,9 +537,21 @@ def detectCompression(
) -> Optional[str]:
# isinstance(fileobj, io.IOBase) does not work for everything, e.g., for paramiko.sftp_file.SFTPFile
# because it does not inherit from io.IOBase. Therefore, do duck-typing and test for required methods.
if any(not hasattr(fileobj, method) for method in ['seekable', 'seek', 'read', 'tell']) or not fileobj.seekable():
expectedMethods = ['seekable', 'seek', 'read', 'tell']
isFileObject = all(hasattr(fileobj, method) for method in expectedMethods)
if not isFileObject or not fileobj.seekable():
if printDebug >= 2:
seekable = fileobj.seekable() if isFileObject else None
print(
f"[Warning] Cannot detect compression for given Python object {fileobj} "
f"because it does not look like a file object or is not seekable ({seekable})."
)
if printDebug >= 3:
print("[Warning] Cannot detect compression for give Python object that does not look like a file object.")
print(dir(fileobj))
for name in ['readable', 'seekable', 'writable', 'closed', 'tell']:
method = getattr(fileobj, name, None)
if method is not None:
print(f" fileobj.{name}:", method() if callable(method) else method)
traceback.print_exc()
return None
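
The duck-typing test described in the comment at the top of this hunk can be sketched on its own. The sketch checks that all required methods exist before calling `seekable()`, so that objects lacking the method never raise an `AttributeError`. The names `is_seekable_file_object` and `DuckFile` are hypothetical; `DuckFile` stands in for classes such as `paramiko.sftp_file.SFTPFile` that do not inherit from `io.IOBase`.

```python
import io

EXPECTED_METHODS = ('seekable', 'seek', 'read', 'tell')


def is_seekable_file_object(obj) -> bool:
    """Duck-typed check that does not rely on isinstance(obj, io.IOBase)."""
    has_methods = all(hasattr(obj, name) for name in EXPECTED_METHODS)
    # Only call seekable() after confirming that it exists.
    return has_methods and bool(obj.seekable())


class DuckFile:
    """Hypothetical file-like class that does not inherit from io.IOBase."""

    def __init__(self, data: bytes):
        self._buffer = io.BytesIO(data)

    def seekable(self) -> bool:
        return True

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        return self._buffer.seek(offset, whence)

    def read(self, size: int = -1) -> bytes:
        return self._buffer.read(size)

    def tell(self) -> int:
        return self._buffer.tell()


if __name__ == "__main__":
    print(isinstance(DuckFile(b"abc"), io.IOBase))
    print(is_seekable_file_object(DuckFile(b"abc")))
    print(is_seekable_file_object("not a file"))
```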
