Currently, TarFileSystem does not support a hierarchical directory structure: it ignores all the directory entries inside the TAR and presents all the files in a flat hierarchy.
This issue tracks the effort required to support a hierarchical directory structure in TAR. It is the umbrella issue; sub-issues may be created later to track specific tasks.
## Sub-tasks
To support a hierarchical directory structure (nested directories), we need to make the following changes in the TarFileSystem code.
### Make the index data structure hierarchical
Currently the index is maintained as a flat hashmap. To capture nested directories we need some kind of tree structure. I am currently thinking of a simple n-way tree (each `IndexEntry` of type directory would contain a list of child index entries), along with a map (`Map<String, IndexEntry>`) for fast lookup of a specific node in the tree.
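A minimal sketch of this idea (field and method names here are illustrative, not the actual TarFileSystem code): each directory node carries its children, and a secondary flat map gives O(1) lookup by full in-archive path.

```java
import java.util.HashMap;
import java.util.Map;

class TarIndex {
    static class IndexEntry {
        final String name;
        final boolean isDirectory;
        long offset;   // offset of the tar header in the archive
        long length;   // length of the archived file
        // Only directory entries carry children, keyed by name component.
        final Map<String, IndexEntry> children = new HashMap<>();

        IndexEntry(String name, boolean isDirectory) {
            this.name = name;
            this.isDirectory = isDirectory;
        }
    }

    final IndexEntry root = new IndexEntry("/", true);
    // Secondary map for fast lookup of a node by its full in-archive path.
    final Map<String, IndexEntry> byPath = new HashMap<>();

    // Insert an entry, creating intermediate directory nodes as needed.
    void add(String path, boolean isDirectory, long offset, long length) {
        IndexEntry cur = root;
        String[] parts = path.split("/");
        StringBuilder full = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (parts[i].isEmpty()) continue;
            full.append('/').append(parts[i]);
            boolean last = (i == parts.length - 1);
            IndexEntry child = cur.children.get(parts[i]);
            if (child == null) {
                child = new IndexEntry(parts[i], last ? isDirectory : true);
                cur.children.put(parts[i], child);
                byPath.put(full.toString(), child);
            }
            cur = child;
        }
        cur.offset = offset;
        cur.length = length;
    }
}
```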
### Change serialization of the index
The index is serialized to facilitate later reuse. The current format is space-separated values. In addition to its existing issues (it does not work if a file name contains spaces), this format would not easily accommodate a hierarchical index. I am currently thinking of using JSON. We need to check whether JSON would still work if the TAR contains a huge number of entries; say, 1 billion.
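For illustration, the serialized hierarchical index could look something like the following (the exact field names and layout are an assumption, not a settled format; a streaming-friendly layout would need evaluation for the billion-entry case):

```json
{
  "version": 2,
  "root": {
    "name": "/",
    "type": "dir",
    "children": [
      {
        "name": "dir1",
        "type": "dir",
        "children": [
          { "name": "f1.txt", "type": "file", "offset": 512, "length": 100 }
        ]
      }
    ]
  }
}
```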
### Re-evaluate the `IndexEntry` data structure
Currently `IndexEntry` holds only the length and offset of the tar header. When the archived file is accessed, we seek to the offset and the header is read again. We need to evaluate whether we can store more information in the index itself so that this disk access is not required. Note that at one extreme we store just the length and offset; at the other extreme we keep the complete tar header entry in the index. We need to make sure this change does not cause memory pressure.
Note: the actual requirement is to ensure sequential reads when we do an `ls` or `ls -R`. If all the children of a directory entry are stored sequentially in the tar, i.e. if the entries in the tar are sorted by path string, then the performance would be no worse than it currently is, even if we store just the length and offset. For a recursive walk, we need to ensure the calling code processes the directory hierarchy in the same order as the entries are stored inside the tar[^1]. This restriction may turn out to be difficult to impose.
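One possible middle ground between the two extremes could look like the sketch below: cache only the header fields needed to answer `ls`/stat-style queries (size, mtime, permissions), so common metadata operations avoid the seek while keeping entries far smaller than the full 512-byte tar header. All names and the field selection here are assumptions for illustration.

```java
class RichIndexEntry {
    final long headerOffset;   // where the tar header starts in the archive
    final long length;         // size of the archived file
    final long modTime;        // seconds since epoch, from the tar header
    final int  mode;           // POSIX permission bits, from the tar header

    RichIndexEntry(long headerOffset, long length, long modTime, int mode) {
        this.headerOffset = headerOffset;
        this.length = length;
        this.modTime = modTime;
        this.mode = mode;
    }

    // Rough per-entry field footprint: 3 longs + 1 int = 28 bytes (plus JVM
    // object overhead), versus the full 512-byte tar header -- a difference
    // that matters at ~1 billion entries.
    static long approxFieldBytes() {
        return 3 * Long.BYTES + Integer.BYTES;
    }
}
```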
## Issue in the hadoop `dfs -ls -R` command
To list all paths recursively, `-ls -R` appends the name component of the child path after the parent path. Here the separator is a hardcoded `/` (ouch!).
Here is the corresponding code from the FsShell module:

```java
// check getPath() so scheme slashes aren't considered part of the path
String basename = childPath.getName();
String separator = uri.getPath().endsWith(Path.SEPARATOR)
    ? "" : Path.SEPARATOR;
return uriToString(uri, inferredSchemeFromPath) + separator + basename;
```
Obviously this does not work for the current path format of TarFileSystem, where the separator for archived files is `+`. Also, the `getName` call returns the complete path of the archived file, resulting in the following bug:
Parent abs path: `tar:///file.tar/+dir1+dir2+`
Child abs path: `tar:///file.tar/+dir1+dir2+f1.txt`
FsShell would form the path as: `tar:///file.tar/+dir1+dir2+/+dir1+dir2+f1.txt`
While looking up the file inside the TAR, TarFileSystem would translate the inArchivePath as `/dir1/dir2///dir1/dir2/f1.txt`, resulting in a `FileNotFoundException`.
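The bug can be reproduced with plain string manipulation, no Hadoop classes needed. `toInArchivePath` below is a hypothetical stand-in for TarFileSystem's translation of `+`-separated archive paths:

```java
class SeparatorBug {
    // Translate the '+'-separated in-archive component to a '/'-separated path.
    static String toInArchivePath(String component) {
        return component.replace('+', '/');
    }

    public static void main(String[] args) {
        String parent = "+dir1+dir2+";             // in-archive component of the parent
        String childName = "+dir1+dir2+f1.txt";    // getName() returns the FULL component

        // FsShell joins parent and child name with a hardcoded "/":
        String joined = parent + "/" + childName;

        System.out.println(toInArchivePath(joined));
        // → /dir1/dir2///dir1/dir2/f1.txt  (not a valid entry → FileNotFoundException)
    }
}
```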
## Footnotes

[^1]: For example, say the TAR order is as follows:
Then the calling code that does the recursive walk needs to do a DFS; as we can see, a BFS would cause non-sequential reads.
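The DFS-vs-BFS point can be sketched with a hypothetical entry list (the paths and offsets below are made up for illustration): when entries are sorted by path string, each directory's subtree is contiguous in the archive, so a depth-first walk visits strictly increasing offsets while a breadth-first walk jumps back and forth.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WalkOrder {
    public static void main(String[] args) {
        // Hypothetical TAR order, sorted by path; offsets grow monotonically.
        Map<String, Long> tarOrder = new LinkedHashMap<>();
        tarOrder.put("/a/", 0L);
        tarOrder.put("/a/x.txt", 512L);
        tarOrder.put("/a/y.txt", 1536L);
        tarOrder.put("/b/", 2560L);
        tarOrder.put("/b/z.txt", 3072L);

        // DFS processes each directory's subtree before its siblings,
        // matching the on-disk order exactly.
        List<String> dfs = List.of("/a/", "/a/x.txt", "/a/y.txt", "/b/", "/b/z.txt");
        // BFS visits all directories at one depth before their files,
        // forcing backward seeks in the archive.
        List<String> bfs = List.of("/a/", "/b/", "/a/x.txt", "/a/y.txt", "/b/z.txt");

        System.out.println("DFS sequential: " + isSequential(dfs, tarOrder));  // true
        System.out.println("BFS sequential: " + isSequential(bfs, tarOrder));  // false
    }

    // True iff the visit order reads the archive at non-decreasing offsets.
    static boolean isSequential(List<String> visit, Map<String, Long> offsets) {
        long prev = -1;
        for (String p : visit) {
            long off = offsets.get(p);
            if (off < prev) return false;
            prev = off;
        }
        return true;
    }
}
```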