DataFrames should have a name attribute. #447


Closed
wesm opened this issue Dec 5, 2011 · 39 comments
Labels
API Design Enhancement Ideas Long-Term Enhancement Discussions

Comments

@wesm
Member

wesm commented Dec 5, 2011

was: should DataFrames have a name attribute?

@lbeltrame
Contributor

IMO it would make sense only if one were exporting to Excel worksheets; in that case it would be nice to have.

@lodagro
Contributor

lodagro commented Dec 6, 2011

It can also be used to set a default path in DataFrame.save(), e.g. path = DataFrame.name.
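
Something along those lines might look like the following minimal sketch (not part of pandas; save_with_default_path is a hypothetical helper, and it assumes the caller has assigned a custom df.name):

import pandas as pd

def save_with_default_path(df, path=None):
    # Hypothetical helper: fall back to the user-assigned name when no path is given.
    if path is None:
        path = '{}.pkl'.format(df.name)
    df.to_pickle(path)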

@wesm
Member Author

wesm commented Dec 6, 2011

Could also be integrated into DataFrame.to_html and the like. I don't think it's too hard to add -- it will just be a bit of a slog to make sure the name is passed on in the right places (it was quite a bit of hacking to add name to Series). Shoot for January or February sometime

@hughesadam87

I'd upvote this one. I'm using it to auto-title plots and think it would certainly be a nice feature.

@pmacaodh

pmacaodh commented Dec 3, 2012

I found uses for it too; however, the name (as of v0.9.0) doesn't survive pickling, and it would be useful if it did (my workaround is a bit of a fudge). To see the problem, try the following:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones([6, 6]))
df.name = 'Ones'
df.save('ones.df')    # pandas 0.9-era pickling API
df2 = pd.load('ones.df')
print df2.name        # the custom attribute is gone after the round trip

I'd love to be able to dive in and contribute a fix, but I'm still not so well-versed in the library and many aspects of Python.
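
A workaround in that spirit (not pandas functionality; it simply pickles the name alongside the frame and re-attaches it by hand) might look like:

import pickle
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones([6, 6]))
df.name = 'Ones'

# Store the frame and its name together, since the attribute is dropped otherwise.
with open('ones.pkl', 'wb') as fh:
    pickle.dump({'frame': df, 'name': df.name}, fh)

with open('ones.pkl', 'rb') as fh:
    payload = pickle.load(fh)
df2 = payload['frame']
df2.name = payload['name']    # re-attach manually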

@wesm
Member Author

wesm commented Dec 3, 2012

It's not a simple addition (you have to worry about preserving metadata through computations), but it would be nice. We'll probably look into it in the somewhat near future

@hughesadam87

Hi Paul,

I have written a couple of functions that will let you transfer all the custom attributes from one dataframe to another. Check out the function transfer_attributes():

https://github.com/hugadams/pyuvvis/tree/master/pyuvvis/pandas_utils

In particular, if you save dataframeserial.py, it will save and load your dataframes while preserving any custom attributes. In the file df_attrhandler, you can use the function called "transfer_attr" to do something like:

df = DataFrame()
df.name = 'test'
df2 = DataFrame()

transfer_attr(df, df2)
print df2.name
# 'test'

I agree that persistent custom attributes would be a key development in the future, and there is already a github issue open for it. In fact, a package that I'll be posting to the list soon really does depend on these custom attributes.


@hughesadam87

Now that I have some time, I wanted to followup with this.

A DataFrame, IMO, should have a .name attribute, and df.columns and df.index should have .name attributes as well -- or at least I've found this useful in my work.

In any case, I think persistent attributes, and to a lesser extent, instance methods, would be an extremely important addition to pandas. Here's my reasoning:

Everybody that uses pandas for analysis outside of the scope of timeseries will eventually benefit from customizing/subclassing a DataFrame at some point. Usually, the dataframe is the ideal object for storing the numerical data, but there is also pertinent information that could go along with it to really customize the object. For example, a dataframe becomes the ideal choice for a spectroscopy experiment if one can store an extra array, the spectral baseline, outside of the dataframe. Additionally, experimental metadata ought to be stored. This is so easily done by adding attributes to the dataframe, that it almost begs to be the canonical way to handle spectral data.

The functions I wrote in the above link use a crude method to transfer arbitrary attributes between dataframes. In short, it first examines an empty DataFrame's attributes, and compares these with a list of attributes from the user's dataframe. Any differences are then transferred to the new dataframe. As a hack until a better solution presents itself, dataframe returns could call my transfer_attr() function before returning a new DataFrame. I wouldn't know how to integrate this fully into pandas otherwise.
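
That crude method might look roughly like this (a sketch for illustration only, not the actual pyuvvis code; transfer_attrs is a hypothetical name):

import pandas as pd

def transfer_attrs(source, target):
    # Anything in the instance dict beyond what a pristine DataFrame carries
    # is treated as a custom attribute and copied onto the target.
    baseline = set(vars(pd.DataFrame()))
    for attr in set(vars(source)) - baseline:
        setattr(target, attr, getattr(source, attr))
    return target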

I know this is low on the priority list, but I really do think that persistent custom attributes would be a big step forward, and not just an appeasement for corner case users.

@ghost

ghost commented Dec 8, 2012

Interesting. I'd like to add a couple of notes:

  • You're suggesting there needs to be a mechanism to attach arbitrary metadata to a DataFrame. Good idea. I don't see, though, why custom attributes must be implemented as "attributes" in the Python sense. A metadata dict with a simple API, serialized along with the dataframe, would take care of most use cases and shouldn't be hard to implement (a minimal sketch of that idea follows below). Is there some requirement you have for which this is not adequate?
  • The .name issue is separate. The .name(s) attributes are not "custom"; they are "baked in", relied on by internal code, and can affect other parts of the package in hard-to-predict ways (see the ongoing Excel save/read issue).
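
For illustration, such a metadata dict serialized alongside the frame could be as simple as the following sketch (it assumes JSON-serializable metadata; none of these helpers exist in pandas):

import json
import pandas as pd

def save_with_meta(df, path, meta):
    # Write the metadata to a sidecar file so it survives the round trip.
    df.to_pickle(path)
    with open(path + '.meta.json', 'w') as fh:
        json.dump(meta, fh)

def load_with_meta(path):
    df = pd.read_pickle(path)
    with open(path + '.meta.json') as fh:
        meta = json.load(fh)
    return df, meta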

@hughesadam87

y-p,

I would be fine with a metadata dict, or whatever is the most elegant solution to the problem. The reason I like adding attributes is access: something like df.name is easier for people to keep up with than df.metadata['name']; however, if you gave the metadata dict attribute access, then df.metadict.name is also pretty simple. Am I understanding you correctly? Whatever solution ends up being the simplest to implement would be useful.

I agree that the name issue is separate. If .name is too baked in, as you say, then sure, don't include it. But if the pandas Index object also had a way to persist attributes, or a persistent metadata dict, then one could just slap names or whatever attributes they want onto these objects as well.

@ghost

ghost commented Dec 8, 2012

I like the idea of providing attribute access under a predefined attribute rather than directly on the object (i.e. df.tags.measurement_date), as the latter pollutes the namespace and hurts backwards-compatibility when new methods or instance variables are added in the future.

@hughesadam87

Whatever is best for pandas would be fine with me. If the import gets too tedious, it is easy enough to make some properties or basic getters/setters for the convenience of the user. Something like get_baseline() may be easier to present to users than df.tags.baselinedics.baseline1. In any case, as long as the functionality is there, it will be very useful.


@ghost

ghost commented Dec 11, 2012

new custom metadata issue at #2485.
@hugadams - your thoughts (and PR, pending discussion) are welcome.

@sglyon
Contributor

sglyon commented Apr 4, 2013

Anyone working on adding the name attribute?

And where to look for more information on a possible tags property?

@hughesadam87

You can try using the metadataframe class if you want. Let me know and I'll update my repo. You can monkey-patch a name attr in, but it will return to a default value every time a new dataframe is created.

@hughesadam87

Alternatively you can create a composite class that stores the name attribute. Unfortunately you have to specify which dataframe methods can be called and still return this class instead of returning a dataframe. If you only need, say, two methods of df, then this is worth doing. Ultimately metadataframe is an effort to do this generically for all methods and dataframe operators, so it is probably easier to start with it. For just a single attribute addition this is a heavy-handed solution, but it's all that I know of.

@JamesPHoughton

A df name attribute would be useful when slicing panels down to dataframes, parallel to the case where a df column name becomes a series name when sliced. In theory, this should generalize to any number of dimensions.

@nehalecky
Contributor

+1 on this. I was wanting it just a few days ago.

@ghost

ghost commented Apr 30, 2013

updated title. we'll see when the rest can follow.

@jtratner
Contributor

jtratner commented Sep 4, 2013

@hugadams btw -- columns and index now get name attributes (if you have a hierarchical index, it's called names...). Not sure if that covers what you were looking for in terms of columns and index.
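
For reference, those Index name attributes look like this in current pandas:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.index.name = 'row_id'     # single name on a flat index
df.columns.name = 'field'

# On a hierarchical (Multi)Index the plural attribute is used instead.
mi = pd.MultiIndex.from_tuples([(1, 'x'), (1, 'y')], names=['outer', 'inner'])
print(mi.names)              # the plural attribute holds one name per level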

@jreback
Contributor

jreback commented Sep 5, 2013

@jtratner I added support for this by just defining _prop_attributes (see Series).

not complete though, as Series mostly uses the original old method

and we need a better method to resolve name conflicts and such (e.g. when you add frames with different names, what happens?), same issue in Series though

so a bit of a project, but all the support is there for this

@ghost

ghost commented Jan 24, 2014

@jreback, does doing this fit naturally into the NDFrame unification deal?

@jreback
Contributor

jreback commented Jan 24, 2014

technically this is easy (just add to _metadata), but it still needs proper propagation... I think it's worthwhile, but it will take some time to get right

@ghost

ghost commented Jan 24, 2014

series names aren't implemented as metadata, and series now derive from NDFrame.
metadata is fine (well.. you know how I feel), but names are a special case.

@jreback
Contributor

jreback commented Jan 24, 2014

@y-p ahh, but they are! (well, they are in the _metadata list). Actually, combined with __finalize__ this makes it possible for sub-classes to implement their own metadata (e.g. geopandas does this).
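
A minimal sketch of that subclassing pattern (the class and attribute names here are illustrative, not anything defined by pandas; propagation via __finalize__ covers many, though not all, operations):

import pandas as pd

class NamedFrame(pd.DataFrame):
    # Attributes listed in _metadata are copied onto results by __finalize__.
    _metadata = ['name']

    @property
    def _constructor(self):
        return NamedFrame

nf = NamedFrame({'a': [1, 2, 3]})
nf.name = 'measurements'
subset = nf[nf['a'] > 1]
print(subset.name)    # 'measurements'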

@ghost

ghost commented Jan 24, 2014

Well... then they didn't use to be. Never mind. Let it sit for another 18 months or so.

@jreback jreback added this to the 0.15.0 milestone Feb 15, 2014
@jreback jreback removed this from the 0.14.0 milestone Feb 15, 2014
@JamesRamm

Another use of a name attribute would be for GUIs dealing with dataframes. I have one such program which allows a user to load many CSV files and plot columns from them. The backend uses dataframes to load and store the CSV data.
The only useful way I can think of to let the user select which dataframe to plot from is a human-readable attribute (i.e. a name) describing the dataframe, displayed in the GUI, which can then be used to grab the correct dataframe.
I worked around this by simply inheriting from DataFrame and adding a name property. I then reimplemented class methods (like read_csv) to return an instance of 'NamedDataFrame' rather than DataFrame, as in the sketch below.
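
Such a wrapper might look roughly like this (a sketch only; NamedDataFrame and read_named_csv are hypothetical names, using the _metadata hook mentioned earlier so the name survives most operations):

import pandas as pd

class NamedDataFrame(pd.DataFrame):
    _metadata = ['name']

    @property
    def _constructor(self):
        return NamedDataFrame

def read_named_csv(path):
    # Parse with the regular reader, then re-wrap so the GUI label travels with the data.
    frame = NamedDataFrame(pd.read_csv(path))
    frame.name = path
    return frame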

@den-run-ai

In the IPython notebook and similar REPLs, it would make sense to display the dataframe name or, more generally, custom metadata, like in the Excel toolbar (count, min, max, average, sum, NaNs, numerical count).

@summerela

When saving the results of an analysis that produces several different outputs, it would be so nice to automatically name and save each output:

def save_results(df, df_name):
    if len(df) > 0:
        print "Saving {} {} variants to current working directory".format(len(df), df_name)
        df.to_csv('{}.csv'.format(df_name), header=True, encoding='utf-8', index=False)
    else:
        print "No {} variants to save.".format(df_name)

@summerela

Ohhh. I see now from StackExchange that I can do something like this:

import pandas as pd
df = pd.DataFrame([])
df.df_name = 'Binky'
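
Worth noting, as discussed earlier in this thread, that such an ad-hoc attribute does not survive pandas operations; a quick illustration:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.df_name = 'Binky'                  # ad-hoc attribute (newer pandas warns about this)

derived = df[df['a'] > 1]             # any operation returns a new object...
print(hasattr(derived, 'df_name'))    # False -- the attribute is not carried over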

@den-run-ai

@summerela, great find! Thanks for sharing.

Also computed properties are contained in .describe()


@leemengtw

@summerela wonderful finding!

@wesm
Member Author

wesm commented Jan 25, 2017

Adding a name attribute to DataFrame would add a lot of complexity. The benefits (compared with the benefits of named Series) to me are less clear. Closing for now

@wesm wesm closed this as completed Jan 25, 2017
@mortonjt

mortonjt commented Jul 18, 2017

I realize that this is a year out of date, but I'd like to pitch in a use case where having a name for a data frame can be really useful.

When performing multi-block analysis (i.e. multi-block partial least squares) in another package (like statsmodels), it would be awesome if we could specify R-style formulas via patsy and run this sort of analysis as something like the following:

result = pls_multiblock(formula="Z ~ X + Y + U + V", blocks=(Z, X, Y, U, V) )

where Z, X, Y, U, V are all matrices (represented as pandas data frames).
Representing these objects as pandas data frames is advantageous, since we can keep track of the ordering of the index/column names. But more importantly, having information about the naming gives us flexibility concerning what sort of models we want to construct on the fly. The implications of making DataFrame.name consistent with Series.name are not clear to me, but having consistent unique identifiers for the DataFrames themselves could easily enable these sorts of analyses.

@shoyer
Member

shoyer commented Jul 18, 2017

@mortonjt For this sort of multi-dimensional data analysis, I would consider using xarray, which already supports a name attribute on DataArray objects. Note that we are deprecating pandas.Panel.
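
For reference, the xarray equivalent looks like this:

import numpy as np
import xarray as xr

Z = xr.DataArray(np.random.rand(4, 3),
                 dims=('obs', 'feature'),
                 name='Z')     # name is a first-class attribute on DataArray
print(Z.name)                  # 'Z'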

@mortonjt

I totally forgot about xarray ... Thanks @shoyer!

@Jacob-Stevens-Haas
Contributor

Jacob-Stevens-Haas commented Jun 25, 2021

Hey, since the pandas API sometimes provides DataFrames with name attributes and sometimes without, I'm wondering whether there are issues beyond serialization. I have functions that take a dataframe and expect the name attribute, since df.name is assigned by DataFrameGroupBy.apply(). Useful in cases like:

import logging
import pandas as pd

def validate_individual_parts(df: pd.DataFrame) -> None:
    # groupby().apply() attaches the group key to each group as df.name
    if len(df['system'].unique()) > 1:
        logging.warning(f'Unable to determine system for part {df.name}')

df = pd.DataFrame({'part_no': [1, 1, 2, 2],
                   'system': ['fax machine', None, 'fax machine', 'truck']})
gb = df.groupby('part_no')
gb.apply(validate_individual_parts)

I'm not sure what the engineering principle at play here is, but it seems reasonable to expect all instances of a DataFrame produced by the pandas API to have the same set of attributes. In the above context, an AttributeError makes sense, but if validate_individual_parts() were called by a function that could be useful for generic DataFrames, a caller might be confused by a nested AttributeError. Maybe there's a hack with NamedDataFrame = typing.NewType('NamedDataFrame', pd.DataFrame) to try to guard against this in static analysis, or code that expects a name attribute could catch AttributeError and re-raise with a more descriptive message (or maybe that's excessive and I'm speculating about user problems that won't come to be).

@jreback
Contributor

jreback commented Jun 25, 2021

these are supported via the .attrs accessor: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html?highlight=attrs (not a whole lot of docs though :-<)
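
For example (pandas >= 1.0; .attrs is still marked experimental and is not preserved by every operation):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.attrs['name'] = 'Ones'      # arbitrary, dict-like metadata on the frame
print(df.attrs['name'])        # 'Ones'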
