You can find the code for the read_table
function mentioned in this note in
subdirectory 20230318-python.
The really handy Pandas function read_fwf
creates a DataFrame
from a
fixed-width text table, for example:
import io
import pandas as pd
table = """
FirstName LastName FirstAppearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
# read_fwf expects a filename, either as a string, a path object, or as the
# next line shows, a stream/file-like object
df = pd.read_fwf(io.StringIO(table))
print(df)
I mostly use it in unit tests, when I need to set up DataFrame
instances.
They’re easier to create using read_fwf
and the resulting code is more
readable, more communicative.
However, read_fwf
doesn’t handle column names that contain spaces:
table = """
FirstName LastName First appearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
df = pd.read_fwf(io.StringIO(table))
print(df)
Column “First appearance” is interpreted as two separate columns, which is not
what you want. Enclosing the column name in single quotes, as a hint to
read_fwf
, doesn’t help.
This lead me to write function read_table
that wraps read_fwf
to allow you
to specify multi-word column names:
table = """
FirstName LastName First appearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
df = read_table(table, columns=["FirstName", "LastName", "First appearance"])
print(df)
read_table
only accepts a string whereas read_fwf
accepts a filename, path
object or stream/file-like object. As mentioned, my primary use case for
read_table
is to setup a DataFrame
from hard-coded data.
read_fwf
doesn’t like it when you have a column whose values have spaces:
table = """
Name FirstAppearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
stream = io.StringIO(table)
df = pd.read_fwf(stream)
print(df)
read_table
has your back here too if you specify the column names:
table = """
Name FirstAppearance
Donald Duck 1934
Mickey Mouse 1928
Goofy 1932
"""
df = read_table(table, columns=["Name", "FirstAppearance"])
print(df)
read_table
wraps read_fwf
and if you pass in a keyword argument that
read_table
doesn’t have in its signature, it passes it on to read_fwf
. This
allows read_table
to support the use of a column to specify an index:
table = """
FirstName LastName FirstAppearance
0 Donald Duck 1934
2 Mickey Mouse 1928
3 Goofy 1932
"""
df = read_table(table, index_col=0)
print(df)
print(df.index)
Here, read_table
passes index_col=0
to read_fwf
. Because of that,
read_fwf
lets the first column provide the index.
Closing thoughts
I realized that having to specify all columns names is not ideal. In one of the
examples above, only column name “First Appearance” has a space but read_table
requires you to specify the other column names too. This is something to address
in a newer version.
Furthermore, in the example with multi-word column values, you had to specify
the column names to import the table correctly. There’s no real need for that as
read_table
can determine the column names automatically.