Support hierarchical directory structure in TarFileSystem #10

Open
JDatta opened this issue Apr 22, 2017 · 0 comments
JDatta commented Apr 22, 2017

Summary

Currently, TarFileSystem does not support a hierarchical directory structure. It ignores all directory entries inside the TAR and presents all files in a flat hierarchy.

This issue tracks the effort required to support a hierarchical directory structure in TAR. This is the umbrella issue; sub-issues may be created later to track specific tasks.

Sub-tasks

To support a hierarchical directory structure, i.e. nested directories, we need to make the following changes to the TarFileSystem code.

Make the index data-structure hierarchical

Currently the index is maintained as a flat hashmap. To capture the nested directories we need some kind of tree-like structure. Currently I am thinking of a simple n-way tree (each IndexEntry of type directory would contain a list of child index entries) along with a map for fast lookup of a specific node in the tree (Map<String, IndexEntry>). A rough sketch is below.
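
A minimal sketch of what this could look like (class and field names here are illustrative, not the actual TarFileSystem code):

// Illustrative sketch only -- not the current TarFileSystem index code.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexEntry {
    final String name;         // single path component, e.g. "dir2" or "f1.txt"
    final boolean isDirectory;
    final long offset;         // offset of the tar header for this entry
    final long length;         // length of the archived file (0 for directories)
    final List<IndexEntry> children = new ArrayList<>();  // empty for files

    IndexEntry(String name, boolean isDirectory, long offset, long length) {
        this.name = name;
        this.isDirectory = isDirectory;
        this.offset = offset;
        this.length = length;
    }
}

class TarIndex {
    // Root of the n-way tree that mirrors the directory hierarchy.
    final IndexEntry root = new IndexEntry("/", true, -1, 0);
    // Flat map for fast lookup of a node by its full in-archive path.
    final Map<String, IndexEntry> byPath = new HashMap<>();

    void add(String inArchivePath, IndexEntry parent, IndexEntry child) {
        parent.children.add(child);
        byPath.put(inArchivePath, child);
    }
}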

Change serialization of index

The index is serialized to facilitate later reuse. The current format is space-separated values. In addition to its existing problems (it does not work if a file name contains spaces), this format would not easily extend to serializing a hierarchical index. Currently I am thinking of using JSON. Need to check whether JSON would still work if the TAR contains a huge number of entries; say, 1 billion.
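
For illustration only, a sketch of how the tree could be written and read as JSON, assuming a library such as Jackson were added as a dependency (it is not one today):

// Sketch only: possible JSON (de)serialization of the index using Jackson,
// which is an assumed dependency here, not something TarFileSystem uses today.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class JsonIndexSketch {
    // Simple mutable POJO so Jackson can bind it without extra annotations.
    public static class Entry {
        public String name;
        public boolean directory;
        public long offset;
        public long length;
        public List<Entry> children = new ArrayList<>();
    }

    public static void main(String[] args) throws IOException {
        Entry root = new Entry();
        root.name = "/";
        root.directory = true;

        Entry f1 = new Entry();
        f1.name = "f1.txt";
        f1.offset = 512;
        f1.length = 42;
        root.children.add(f1);

        ObjectMapper mapper = new ObjectMapper();
        File indexFile = new File("file.tar.index.json");
        mapper.writeValue(indexFile, root);                     // serialize the tree
        Entry back = mapper.readValue(indexFile, Entry.class);  // read it back
        System.out.println(back.children.get(0).name);          // prints f1.txt

        // For the 1-billion-entry case, building the whole tree in memory and
        // writing it in one shot would not fly; a streaming JsonGenerator/JsonParser
        // (or another format entirely) would need to be evaluated.
    }
}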

Re-evaluate IndexEntry data structure

Currently IndexEntry holds only the length and offset of the tar header. When the archived file is accessed, we seek to that offset and read the header again. Need to evaluate whether we can store more information in the index itself so that this disk access is not required. At one extreme we store just the length and offset; at the other extreme we keep the complete tar header entry in the index. We need to make sure this change does not cause memory pressure.
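
For illustration, a middle ground between those two extremes could cache just the header fields needed to answer a listing, along the lines of the sketch below (field names are hypothetical):

// Hypothetical "middle ground" index entry: cache enough tar-header metadata to
// serve listing-style queries without re-reading the header from disk, while
// still avoiding keeping the full 512-byte header block for every entry.
class CachedIndexEntry {
    long headerOffset;     // where the tar header starts; still needed to open the file
    long length;           // size of the archived file
    boolean isDirectory;
    long modificationTime; // cached from the tar header
    String owner;          // cached from the tar header
    String group;          // cached from the tar header
    short permissions;     // cached mode bits from the tar header
}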

Note: the actual requirement is to ensure sequential reads when we do an ls and an ls -R. If all the children of a directory entry are stored sequentially in the tar, i.e. if the entries in the tar are sorted by path string, then the performance would be no worse than it is currently, even if we store just length and offset. For a recursive walk, we need to ensure the calling code processes the directory hierarchy in the same order as the entries are stored inside the tar^. This restriction may turn out to be difficult to impose.

Issue in hadoop dfs -ls -R command

To recursively list all paths, -ls -R appends the name component of the child path after the parent path. Here the separator is hardcoded to / (ouch!).

Here is the corresponding code from the FsShell module:

// check getPath() so scheme slashes aren't considered part of the path
String basename = childPath.getName();
String separator = uri.getPath().endsWith(Path.SEPARATOR)
    ? "" : Path.SEPARATOR;
return uriToString(uri, inferredSchemeFromPath) + separator + basename;

Obviously this does not work with the current path format of TarFileSystem. For TarFileSystem, the separator for archived files is +. Also, the getName call returns the complete in-archive path of the file. This results in the following bug (reproduced by the snippet after the list):

  • Parent abs path: tar:///file.tar/+dir1+dir2+
  • Child abs path: tar:///file.tar/+dir1+dir2+f1.txt
  • FsShell would form path as:
    tar:///file.tar/+dir1+dir2+/+dir1+dir2+f1.txt
  • While looking up the file from inside the Tar, TarFileSystem would translate the inArchivePath to /dir1/dir2///dir1/dir2/f1.txt, resulting in a FileNotFoundException
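
A tiny snippet illustrating the mismatch, using Hadoop's Path API (the parent handling is simplified here compared to the real FsShell code):

// Illustration of how the FsShell concatenation goes wrong for TarFileSystem paths.
import org.apache.hadoop.fs.Path;

public class LsRPathBug {
    public static void main(String[] args) {
        String parent = "tar:///file.tar/+dir1+dir2+";
        Path child = new Path("tar:///file.tar/+dir1+dir2+f1.txt");

        // Path.getName() splits on '/', so the whole "+dir1+dir2+f1.txt"
        // comes back as a single name component.
        String basename = child.getName();
        // The parent ends with '+', not '/', so FsShell inserts a '/' separator.
        String separator = parent.endsWith(Path.SEPARATOR) ? "" : Path.SEPARATOR;

        // Prints: tar:///file.tar/+dir1+dir2+/+dir1+dir2+f1.txt
        System.out.println(parent + separator + basename);
    }
}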

Footnotes

^ For example, say the TAR order is as follows:

a/b/
a/b/f1.txt
a/b/c/
a/b/c/f2.txt
a/f3.txt
a/f4.txt

Then the calling code that does the recursive walk needs to do a DFS. As we can see, a BFS would cause non-sequential reads. A rough sketch of such a walk follows.
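
For illustration, the walk would have to look roughly like the depth-first traversal below (reusing the hypothetical IndexEntry sketch from above):

// Depth-first walk over the hypothetical index tree: each directory's children are
// visited immediately after the directory itself, which matches the on-disk order
// of a tar whose entries are sorted by path, keeping reads sequential.
static void walk(IndexEntry entry, java.util.function.Consumer<IndexEntry> visitor) {
    visitor.accept(entry);
    for (IndexEntry child : entry.children) {
        walk(child, visitor);  // descend before moving on to the next sibling
    }
}
// A breadth-first walk (queueing all children before descending) would instead jump
// between distant offsets in the tar, breaking the sequential access pattern.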
