Support hierarchical directory structure in TarFileSystem #10

Open
JDatta opened this issue Apr 22, 2017 · 0 comments
JDatta commented Apr 22, 2017

Summary

Currently, TarFileSystem does not support a hierarchical directory structure. It ignores all directory entries inside the TAR and presents all files in a flat hierarchy.

This issue tracks the effort required to support a hierarchical directory structure in TAR. This is the umbrella issue; sub-issues may be created later to track specific tasks.

Sub-tasks

To support a hierarchical directory structure, i.e. nested directories, we need to make the following changes to the TarFileSystem code.

Make the index data-structure hierarchical

Currently the index is maintained as a flat hashmap. To capture the nested directories we need some kind of tree-like structure. Currently I am thinking of a simple n-way tree (each IndexEntry of type directory would contain a list of child index entries) along with a map for fast lookup of a specific node in the tree (Map<String, IndexEntry>). A rough sketch is below.
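
A minimal sketch of what this could look like (class and field names here are illustrative, not the actual TarFileSystem code):

// Illustrative sketch only -- not the current TarFileSystem index code.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexEntry {
    final String name;         // single path component, e.g. "dir2" or "f1.txt"
    final boolean isDirectory;
    final long offset;         // offset of the tar header for this entry
    final long length;         // length of the archived file (0 for directories)
    final List<IndexEntry> children = new ArrayList<>();  // empty for files

    IndexEntry(String name, boolean isDirectory, long offset, long length) {
        this.name = name;
        this.isDirectory = isDirectory;
        this.offset = offset;
        this.length = length;
    }
}

class TarIndex {
    // Root of the n-way tree that mirrors the directory hierarchy.
    final IndexEntry root = new IndexEntry("/", true, -1, 0);
    // Flat map for fast lookup of a node by its full in-archive path.
    final Map<String, IndexEntry> byPath = new HashMap<>();

    void add(String inArchivePath, IndexEntry parent, IndexEntry child) {
        parent.children.add(child);
        byPath.put(inArchivePath, child);
    }
}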

Change serialization of index

The index is serialized to facilitate later reuse. The current format is space-separated values. In addition to its existing problems (it does not work if a file name contains spaces), this format would not easily extend to serializing a hierarchical index. Currently I am thinking of using JSON. Need to check whether JSON would still work if the TAR contains a huge number of entries; say, 1 billion.
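
For illustration only, a sketch of how the tree could be written and read as JSON, assuming a library such as Jackson were added as a dependency (it is not one today):

// Sketch only: possible JSON (de)serialization of the index using Jackson,
// which is an assumed dependency here, not something TarFileSystem uses today.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class JsonIndexSketch {
    // Simple mutable POJO so Jackson can bind it without extra annotations.
    public static class Entry {
        public String name;
        public boolean directory;
        public long offset;
        public long length;
        public List<Entry> children = new ArrayList<>();
    }

    public static void main(String[] args) throws IOException {
        Entry root = new Entry();
        root.name = "/";
        root.directory = true;

        Entry f1 = new Entry();
        f1.name = "f1.txt";
        f1.offset = 512;
        f1.length = 42;
        root.children.add(f1);

        ObjectMapper mapper = new ObjectMapper();
        File indexFile = new File("file.tar.index.json");
        mapper.writeValue(indexFile, root);                     // serialize the tree
        Entry back = mapper.readValue(indexFile, Entry.class);  // read it back
        System.out.println(back.children.get(0).name);          // prints f1.txt

        // For the 1-billion-entry case, building the whole tree in memory and
        // writing it in one shot would not fly; a streaming JsonGenerator/JsonParser
        // (or another format entirely) would need to be evaluated.
    }
}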

Re-evaluate IndexEntry data structure

Currently IndexEntry holds only the length and offset of the tar header. When the archived file is accessed, we seek to that offset and read the header again. Need to evaluate whether we can store more information in the index itself so that this disk access is not required. At one extreme we store just the length and offset; at the other extreme we keep the complete tar header entry in the index. We need to make sure this change does not cause memory pressure.
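
For illustration, a middle ground between those two extremes could cache just the header fields needed to answer a listing, along the lines of the sketch below (field names are hypothetical):

// Hypothetical "middle ground" index entry: cache enough tar-header metadata to
// serve listing-style queries without re-reading the header from disk, while
// still avoiding keeping the full 512-byte header block for every entry.
class CachedIndexEntry {
    long headerOffset;     // where the tar header starts; still needed to open the file
    long length;           // size of the archived file
    boolean isDirectory;
    long modificationTime; // cached from the tar header
    String owner;          // cached from the tar header
    String group;          // cached from the tar header
    short permissions;     // cached mode bits from the tar header
}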

Note: the actual requirement is to ensure sequential reads when we do an ls and an ls -R. If all the children of a directory entry are stored sequentially in the tar, i.e. if the entries in the tar are sorted by path string, then the performance would be no worse than it is currently, even if we store just length and offset. For a recursive walk, we need to ensure the calling code processes the directory hierarchy in the same order as the entries are stored inside the tar^. This restriction may turn out to be difficult to impose.

Issue in hadoop dfs -ls -R command

To recursively list all paths, -ls -R appends the name component of the child path after the parent path. Here the separator is hardcoded to / (ouch!).

Here is the corresponding code from the FsShell module:

// check getPath() so scheme slashes aren't considered part of the path
String basename = childPath.getName();
String separator = uri.getPath().endsWith(Path.SEPARATOR)
    ? "" : Path.SEPARATOR;
return uriToString(uri, inferredSchemeFromPath) + separator + basename;

Obviously this does not work with the current path format of TarFileSystem. For TarFileSystem, the separator for archived files is +. Also, the getName call returns the complete in-archive path of the file. This results in the following bug (reproduced by the snippet after the list):

  • Parent abs path: tar:///file.tar/+dir1+dir2+
  • Child abs path: tar:///file.tar/+dir1+dir2+f1.txt
  • FsShell would form path as:
    tar:///file.tar/+dir1+dir2+/+dir1+dir2+f1.txt
  • While looking up the file from inside the Tar, TarFileSystem would translate the inArchivePath to /dir1/dir2///dir1/dir2/f1.txt, resulting in a FileNotFoundException
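
A tiny snippet illustrating the mismatch, using Hadoop's Path API (the parent handling is simplified here compared to the real FsShell code):

// Illustration of how the FsShell concatenation goes wrong for TarFileSystem paths.
import org.apache.hadoop.fs.Path;

public class LsRPathBug {
    public static void main(String[] args) {
        String parent = "tar:///file.tar/+dir1+dir2+";
        Path child = new Path("tar:///file.tar/+dir1+dir2+f1.txt");

        // Path.getName() splits on '/', so the whole "+dir1+dir2+f1.txt"
        // comes back as a single name component.
        String basename = child.getName();
        // The parent ends with '+', not '/', so FsShell inserts a '/' separator.
        String separator = parent.endsWith(Path.SEPARATOR) ? "" : Path.SEPARATOR;

        // Prints: tar:///file.tar/+dir1+dir2+/+dir1+dir2+f1.txt
        System.out.println(parent + separator + basename);
    }
}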

Footnotes

^ For example, say the TAR order is as follows:

a/b/
a/b/f1.txt
a/b/c/
a/b/c/f2.txt
a/f3.txt
a/f4.txt

Then the calling code that does the recursive walk needs to do a DFS. As we can see, a BFS would cause non-sequential reads. A rough sketch of such a walk follows.
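
For illustration, the walk would have to look roughly like the depth-first traversal below (reusing the hypothetical IndexEntry sketch from above):

// Depth-first walk over the hypothetical index tree: each directory's children are
// visited immediately after the directory itself, which matches the on-disk order
// of a tar whose entries are sorted by path, keeping reads sequential.
static void walk(IndexEntry entry, java.util.function.Consumer<IndexEntry> visitor) {
    visitor.accept(entry);
    for (IndexEntry child : entry.children) {
        walk(child, visitor);  // descend before moving on to the next sibling
    }
}
// A breadth-first walk (queueing all children before descending) would instead jump
// between distant offsets in the tar, breaking the sequential access pattern.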
