support for .pipe, how to make this render in the notebook w/o using show(p) #3046


Closed
jreback opened this issue Oct 31, 2015 · 35 comments

@jreback

jreback commented Oct 31, 2015

Pandas has had support for a .pipe operator for a while, which allows convenient piping to external functions, see docs here

See my example here:
http://nbviewer.ipython.org/gist/jreback/cd0d8874495c33a91c79

e.g. want [29] to just work

With seaborn/matplotlib this works quite nicely. In fact it already works in bokeh as well, but
it doesn't render (e.g. you have to call show(...), which defeats the purpose of the piping).

Is there any way to have this render automatically (or at least a .show() method that would render at the end of the cell)?

e.g. quite convenient to do things like:

(df
   .set_index('time')
   .resample('D',how='count')
   .foobar
   .pipe(......)
)

and just have it work
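
For readers unfamiliar with .pipe: for a plain callable, obj.pipe(func, *args, **kwargs) calls func(obj, *args, **kwargs) and returns the result, which is what lets a chain end in a plotting call. A minimal pure-Python sketch of that contract follows; Frame, keep_positive and summarize are hypothetical stand-ins for illustration, not pandas code.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame; it only models .pipe()."""

    def __init__(self, rows):
        self.rows = list(rows)

    def pipe(self, func, *args, **kwargs):
        # Same contract as .pipe with a plain callable: pass self as the
        # first argument and return whatever func returns.
        return func(self, *args, **kwargs)


def keep_positive(frame):
    # An intermediate step returns a new Frame so the chain can continue.
    return Frame(r for r in frame.rows if r > 0)


def summarize(frame, title):
    # A final step may return any object (e.g. a plot); here just a string.
    return f"{title}: {len(frame.rows)} rows"


result = (
    Frame([3, -1, 4, -1, 5])
    .pipe(keep_positive)
    .pipe(summarize, title="positive values")
)
print(result)
```

Because the last step only returns an object, whether anything is drawn depends entirely on what the notebook does with that return value, which is the question raised here.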

@bryevdv
Member

bryevdv commented Oct 31, 2015

If y'all had gone with the __pipe__ idea, that would probably have been a hook we could have used. We don't "auto show" in notebooks though, for a variety of reasons and past experience. We could possibly add a show=True keyword argument, though I don't really like that: it mixes up parts of the API that should not be mixed, and even then it will only work when certain assumptions are met (output_notebook has been called, etc.).

Something like this could probably be made to work:

df['mpg'].pipe(autoshow, Histogram, title='MPG Distribution')

but that kind of sucks too.

ping @bokeh/dev in case anyone has better ideas than me.
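
A rough sketch of what an autoshow helper like the one in the snippet above might look like. This is an assumption about its shape, not existing Bokeh code: show_func is a hypothetical injection point (in a notebook it would be bokeh's show), and fake_histogram stands in for a chart constructor that takes the data as its first argument.

```python
def autoshow(data, chart_func, *args, show_func=print, **kwargs):
    """Build a chart from `data`, display it, and return it.

    `show_func` is a hypothetical hook; it defaults to print so that the
    sketch runs outside a notebook.
    """
    chart = chart_func(data, *args, **kwargs)
    show_func(chart)   # explicit display step, performed by the helper
    return chart       # still return the chart so it can be reused


def fake_histogram(data, title=""):
    # Stand-in for a chart constructor; returns a plain dict, not a plot.
    return {"kind": "histogram", "title": title, "n": len(data)}


shown = []
chart = autoshow([18, 21, 30], fake_histogram,
                 title="MPG Distribution", show_func=shown.append)
```

With a real data frame the call would read df['mpg'].pipe(autoshow, Histogram, title='MPG Distribution'), since .pipe passes the series as the first argument.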

@jreback
Author

jreback commented Oct 31, 2015

what does the .show() method do then? (that would suffice as well)

e.g.

df['msg'].pipe(Histogram, title=....).show()

@bryevdv
Member

bryevdv commented Oct 31, 2015

I thought that had been gotten rid of in the new charts work, but maybe not.

@jreback
Author

jreback commented Oct 31, 2015

so is there a method to call to show things in the notebook? (obviously show(p) is calling something under the hood).

It's quite convenient to do this type of thing ad hoc by using chained/pipe computations that end in a plot.

@bryevdv
Member

bryevdv commented Oct 31, 2015

It depends on what kind of output has been initialized (e.g., file, notebook, server), and then it eventually calls the Jupyter HTML publishing API at the lowest level to render some HTML templates that display the plots (or layouts or widgets). I guess the issue is this: the "pipe-able" functions that accept a data frame as the first argument are only a very tiny fraction of all bokeh functions. I don't want to add something that is too easy to use incorrectly or leads to more confusion than utility. This is why __pipe__ would have been nice: we could have marked appropriate functions explicitly, and kept any custom code required self-contained to where it was appropriate.

@jreback
Author

jreback commented Oct 31, 2015

@bryevdv that seems kind of odd for bokeh to have to 'depend on the type of output', when bokeh is already 'figuring it out' in the show(p) call anyhow.

I think providing syntactic compatibility with what other libraries (mpl/seaborn/pandas) do in ad-hoc notebook analysis would be first class. This makes bokeh feel like the old matplotlib way of doing things (which to be honest was not fun).

Maybe I don't understand the design / impl constraints. Generally these libraries support a .plot(), maybe you can explain why bokeh is different? Why would you even consider adding custom code for such a generic feature?

@bryevdv
Member

bryevdv commented Nov 1, 2015

show doesn't figure anything out, it delegates everything to whatever output modes happen to have been configured (whatever they might be). It won't do anything if no output modes have been configured. Just because someone is in the notebook doesn't imply anything. Maybe they want to save to a file, maybe they want to push to a server, maybe they want to display inline. Maybe some combination of all of these things.

If .show (the method) is still around, maybe that's the solution. Just be aware that df['msg'].pipe(Histogram, title=....).show() might do things... save to file (and open a new tab with that file displayed), or push to the server (and open a tab with the server document URL), in addition to displaying inline in the notebook. And of course it won't display anything in the notebook unless output_notebook has been called.

As to what is different, lots of things. Those other libraries don't have to load a separate javascript library to actually do the rendering, for instance.
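
To illustrate the delegation described above, here is a toy model, not Bokeh's implementation. The function names mirror bokeh's output_notebook, output_file and show, but everything here is a stand-in; in particular this show returns a list of actions so the behaviour can be inspected, whereas the real one publishes output as a side effect.

```python
_outputs = []  # hypothetical module-level state: the configured output modes


def output_notebook():
    _outputs.append("notebook")


def output_file(path):
    _outputs.append(f"file:{path}")


def show(plot):
    """Delegate to every configured output; do nothing if none is configured."""
    return [f"{target} <- {plot}" for target in _outputs]


nothing = show("hist")      # no output mode configured yet: nothing happens
output_notebook()
output_file("hist.html")
both = show("hist")         # now goes to the notebook and to a file
print(nothing, both)
```

In this model, being "in the notebook" plays no role at all; only the previously configured modes decide what a show call does.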

@datnamer

datnamer commented Nov 1, 2015

+1 to some kind of pipeable interactive analysis mode with economy of typing.

@havocp
Contributor

havocp commented Nov 1, 2015

Some naive questions, I don't know the history or all the code here, but have been hacking on output_notebook and show() a little.

  • is there any way to know we are being called by pipe()?
  • what was the __pipe__ idea?

Is the seaborn/matplotlib behavior based on whether they are running in a notebook, something like "if in_notebook: send plot to notebook" at the end of every plotting function, or is it more complicated?

Bokeh does let you assemble plots into layouts of plots, so if making a plot sent it to the notebook, one question I have is how you would be able to make four plots, put them in a layout, and only then send the whole layout to the notebook.
It seems like we need to somehow know we are at the end of the notebook cell and there's nothing else happening with the plot ...?

@jreback
Author

jreback commented Nov 1, 2015

not really sure how mpl does this, my point of all of this, is the upstream code already supports pipe chaining.

__pipe__ would just be a more formalized convention. But even now, w/o ANY downstream changes, lots of libraries can accept this.

bokeh should simply be able to act as a drop-in replacement, original impl is here

@havocp
Contributor

havocp commented Nov 1, 2015

I don't understand what you mean by "drop in replacement" - for pipe? for mpl you mean?

How would the 4-plots-in-a-layout case work? does mpl have an equivalent and what does it do there?

the rest of you have more background knowledge than I do.

@jreback
Author

jreback commented Nov 1, 2015

@havocp 4-plots-in-a-layout case is not part of this.

It is simply taking a Series/DataFrame and plotting it. (a single plot).

@havocp
Contributor

havocp commented Nov 1, 2015

but my question is how does the function you pass to pipe know whether it should push the plot over to the notebook, vs the plot is an intermediate result that is going into a layout and should not be pushed. Are those functions passed to pipe special functions only used for pipe ?

@havocp
Contributor

havocp commented Nov 1, 2015

i.e. how do we implement Histogram so it works with pipe and also without pipe? Or is the idea that I have a special pipe-only Histogram?

@pzwang

pzwang commented Nov 1, 2015

Those other tools are very "immediate mode", and I think pandas is using
them in that fashion. I think we may need to wrap Charts functions with a
high level wrapper that exposes an immediate-mode thing like this, which
bakes in all the stuff in show() beforehand, so that the actual plotting
function can then generate output.



Peter Wang
CTO, Co-founder

@jreback
Author

jreback commented Nov 1, 2015

@havocp that's my point. you shouldn't have to do anything special.

The key is that an object that is returned in a notebook cell will render (this is how mpl / seaborn / pandas work). I don't know how this happens.

totally not averse to doing something like:

df.pipe(Histogram, values='....').show()

e.g. I simply want a drop-in replacement for this:

df.plot.hist() or df.pipe(sns.distplot) or df.pipe(plt.hist)

@jreback
Author

jreback commented Nov 1, 2015

IMHO this would instantly create a fair amount of usage for bokeh, just like what has happened with seaborn. In fact, the new charts API looks quite fantastic and goes way beyond what (seaborn/mpl/pandas) do (not to mention all of the other stuff that bokeh is capable of).

In exploring data, one can build up these little pipelines in a notebook, for testing, viewing, just playing around. Sure, people may want to control various aspects and/or make more sophisticated plots, but making it dead simple to swap in bokeh would mean they would gravitate to these new functions easily.

@havocp
Contributor

havocp commented Nov 1, 2015

@jreback the reason I ask is that right now the notebook output mode sends some html over zeromq to the notebook. So if Histogram always sends the histogram over zeromq, in the layout case we would send 4 histograms, then send the layout with 4 histograms in it, since sending to the notebook is a side effect. Also, bokeh currently lets you create the histogram or layout and then further modify it, and that would also break if creating the histogram auto-sends.

It looks to the notebook user a bit like the notebook is rendering the return value of the cell, but for Bokeh as far as I know that isn't true. The rendered plot is a message that bokeh sends when you show(), so we need to have the "final" output somehow indicated (I think anyway).

So I'm trying to figure out if we somehow know we are the final step in a pipe and it sounds like no. That leaves us with either special versions of Histogram that are only used as the last step, or an explicit show method of some kind I guess ?

the .show() method at the end I expect could work.
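
The duplicate-output concern above can be sketched with a toy model. The `sent` list stands in for messages pushed to the notebook front end, and AutoShowHistogram and layout are hypothetical names, not Bokeh classes.

```python
sent = []  # stands in for the messages pushed to the notebook front end


class AutoShowHistogram:
    """Hypothetical chart that publishes itself as soon as it is created."""

    def __init__(self, name):
        self.name = name
        sent.append(name)  # side effect: cannot be taken back later


def layout(*plots):
    # Combining the plots publishes the combined result as well.
    combined = "layout(" + ", ".join(p.name for p in plots) + ")"
    sent.append(combined)
    return combined


plots = [AutoShowHistogram(f"h{i}") for i in range(4)]
layout(*plots)
print(sent)
```

Five messages are sent where only the final layout was wanted, which is why some explicit marker of the "final" output (such as a trailing .show()) is needed in this model.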

@datnamer

datnamer commented Nov 1, 2015

Check these out:
https://cran.r-project.org/web/packages/ggvis/vignettes/data-hierarchy.html
http://ggvis.rstudio.com/interactivity.html
http://ggvis.rstudio.com/0.1/data-hierarchy.html

Agree with @jreback - I think this ggvis style of easy piping/syntax but powerful and flexible interactivity(accessed as part of initial plotting), layering and api at the right level of abstraction is what we should shoot for. Bokeh could be amazing for quickly iterating EDA.

@jreback
Author

jreback commented Nov 1, 2015

@havocp

as I said, I am not really sure how the rendering actually occurs in the notebook. Though I suspect something like

plt.show() is called implicitly?

I don't really have a problem with explicitly calling .show()

e.g.

df.pipe(Histogram.....).show() to mean 'render' the output.

see [12] here; this is part of my PyDataNYC tutorial. I use a .pipe then call some additional methods on the result which are applicable to that object (in this case a seaborn object).

@jreback
Author

jreback commented Nov 1, 2015

Just to show something else we are doing here and the notebook here

we do rendering in a chain where it's easy to build things up; the return value ultimately is rendered as HTML (this of course is a table).....in theory we could use bokeh as a renderer as well (where you could then take the 'hints' that we have constructed and handle them directly rather than using our in-built templating soln)

@datnamer

datnamer commented Nov 1, 2015

Looks really cool...maybe it can interface with bokeh and/or phosphor datatable. I could see that having some great interactive filtering/conditional formatting functionality.

@havocp
Contributor

havocp commented Nov 1, 2015

to be clear, there's no need to convince people some solution is useful. for me I'm purely in the mode of figuring out how it could best work without breaking some other thing.

I guess next step for me would be to go see how mpl and seaborn etc do it (pointers welcome). If they unconditionally send html to the notebook then bokeh would have to somehow be different afaik since it has multiple kinds of output.

@almarklein
Contributor

Maybe a naive question, but can't we use _repr_html_ for this? It gets called whenever an object is considered a cell output. This is what I'm using as a "trigger" to show Flexx widgets in the notebook.
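
For context, a minimal sketch of the hook being referred to, assuming a plain IPython/Jupyter session: when the last expression in a cell evaluates to an object that defines _repr_html_, IPython calls that method and renders the string it returns. Widget is a hypothetical class, not Flexx or Bokeh code.

```python
class Widget:
    """Hypothetical object with a rich HTML representation."""

    def __init__(self, label):
        self.label = label

    def _repr_html_(self):
        # Single leading and trailing underscores, and the method must
        # *return* the markup; IPython displays the returned string.
        return f"<b>{self.label}</b>"


w = Widget("MPG Distribution")
html = w._repr_html_()  # what IPython would call for a trailing `w` in a cell
print(html)
```

Outside a notebook nothing calls the method automatically, so the sketch invokes it by hand.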

@bryevdv
Member

bryevdv commented Nov 1, 2015

We can, but we're not going to. We've already been down this path. More implicit "auto showing": we already had these discussions, made these decisions, and went through more than a little pain to rip them out. The potential for out-of-order execution in the notebook and cross-language execution mean that implicit state and actions make it hard to reason about what will happen in the notebook, and easy to have things happen that were not intended or wanted. TL;DR: it was a mess.


I don't understand what is still at issue? Unless I misread, @jreback said this would be ok:

df.pipe(Histogram, values='....').show()

I think that's ok, so what is left to decide?

To add more color, I think the above is in fact, much better. If df.pipe(Histogram, values='....') returns a plot, then lots of awesome things are simple:

income = inc_df.pipe(...)
revenue = rev_df.pipe(...)
taxes = tax_df.pipe(...)

# super easy to put multiple piped plots in a layout, that's good!
show(hplot(income, revenue, taxes))

Or this:

p = df.pipe(...)

# set up ipython interactors here. pipes and interactors? Yes, please!

show(p)
interact(...)

All of those things are basically precluded if we do some auto show thing. By not doing it, we get to play to the best strengths of both libraries at once.

I'm also completely unconvinced that

p = df.pipe(...)
show(p)

or even less:

df.pipe(...).show()

puts any kind of actual burden on anyone, or prevents anything (in fact it allows more things, more easily).

Finally, this only concerns like six functions out of the entire Bokeh library. My main point of contention with doing anything more than the minimal .show() method is that it would be invasive, broadly changing established behavior, and way out of proportion to the tiny fraction of things affected.

@jreback
Author

jreback commented Nov 1, 2015

@bryevdv all of your examples above are great.

However wouldn't this on the Chart object baseclass

def __repr_html__(self):
     show(self)

make this render if it's the last cell in a notebook, yes?

can I have cake and eat it too?

@bryevdv
Member

bryevdv commented Nov 2, 2015

Well, that won't work because it's _repr_html_ (not a dunder) and it also needs to return an HTML string to publish (show ultimately calls publish_display_data which returns None)

@mattpap can you expand on this comment:

https://github.com/bokeh/bokeh/blob/master/bokeh/models/component.py#L17

Maybe things are better now, and we could entertain enabling the html repr.

Still, I am really really trying to understand, what is wrong with

df.pipe(...).show()

In general I am very skeptical of burdening APIs with syntactic sugar for things this trivial.

Beyond that, adding that only to bokeh.charts.Chart is not something I'd entertain, it would need to be consistent across all Bokeh APIs otherwise it is a documentation burden. Also @jreback are you volunteering to update all the bokeh tutorials, example notebooks, and documentation? This is definitely not something anyone on the core team can do at the moment, everyone is completely focused on major 0.11 goals.

@mattpap
Contributor

mattpap commented Nov 2, 2015

> Maybe things are better now, and we could entertain enabling the html repr.

No. I didn't enable this by default, because it breaks the workflow. The solution is to change the workflow, but that wasn't critical at the time this code was added. If you enable html repr, then, with current notebook examples, you will get double output quite often. Either remove show(p) or put ; in appropriate locations to prevent output.

@jreback
Author

jreback commented Nov 2, 2015

@bryevdv nothing wrong with df.pipe(...).show()

except that when someone is using mpl/seaborn/pandas they are accustomed to:

%matplotlib inline

df.plot() or df.pipe(sns.......)

OTOH

from bokeh.io import output_notebook
output_notebook()
p = df.pipe(Histogram......)
show(p)

may look only slightly different, but now bokeh is just not a drop-in replacement, and requires more mental effort.

This was not meant to be a 'I need it now!' feature. On your timeframe (sooner obviously preferred), and clearly you have other usecases in mind so may want to take a different path. I am urging that a nice soln (from my perspective) is that bokeh is drop-in replaceable (maybe a %bokeh inline directive would help?)

@pzwang

pzwang commented Nov 2, 2015

> Still, I am really really trying to understand, what is wrong with df.pipe(....).show()

The high level goal here is to make Bokeh easy and convenient to use for Pandas users, and not give people any reason not to use it compared to e.g. MPL or Seaborn. Having to add a call to show() means that they have to change every cell in their notebook that generates a plot, just to see what it would be like with Bokeh. I think that's a very good motivation to figure out how to make this work (of course, without breaking all of the rest of Bokeh).

I could be misunderstanding the issue here, but this seems pretty straightforward to me. When %bokeh inline or something has been enabled, then can't we make it so that only the last item or plot in an Input cell generates output? Several possibilities, including one or more of:

  1. Hooking into the notebook infrastructure so we know when we're entering a new cell, or when a cell is just about to execute, and we set some bit on the Notebook output backend
  2. sys._getframe() nonsense
  3. Proxy subpackage which imports all of bokeh.charts but mixes in or wraps all the classes with something that uses (1) and (2) to avoid dupe plots
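
A hedged sketch of how option (1) might look, assuming IPython's event system, where ip.events.register("pre_run_cell", callback) runs a callback before each cell. CellTracker, record, last_plot and install are hypothetical names for illustration only; nothing like this exists in Bokeh.

```python
class CellTracker:
    """Hypothetical helper that notices when a new notebook cell starts."""

    def __init__(self):
        self.cell_number = 0
        self.plots_in_cell = []

    def pre_run_cell(self, *args):
        # IPython calls this before each cell. Newer versions pass an info
        # object and older ones pass nothing, hence *args.
        self.cell_number += 1
        self.plots_in_cell = []

    def record(self, plot):
        # Remember every plot created in the current cell; only the last
        # one would be a candidate for automatic display.
        self.plots_in_cell.append(plot)

    def last_plot(self):
        return self.plots_in_cell[-1] if self.plots_in_cell else None


def install(ip, tracker):
    # `ip` is the object returned by get_ipython() inside a notebook.
    ip.events.register("pre_run_cell", tracker.pre_run_cell)
```

Inside a notebook this would be wired up with install(get_ipython(), CellTracker()). Deciding when to display the last plot would still need a second hook at the end of the cell, which this sketch leaves out.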

> Also @jreback are you volunteering to update all the bokeh tutorials, example notebooks, and documentation?

I think Jeff's volunteering to help put Bokeh in front of all the Pandas users. :-)

Speaking of documentation, when are we getting our tech documentation team back from the marketing/web-site overhaul? They should be able to help with this.

@philippjfr
Contributor

> I could be misunderstanding the issue here, but this seems pretty straightforward to me. When %bokeh inline or something has been enabled, then can't we make it so that only the last item or plot in an Input cell generates output? Several possibilities, including one or more of:

Just wanted to chime in here since this is something I know a little bit about from working on HoloViews. What we do is define so called display formatters with IPython, which are basically equivalent to the _repr_* methods, only they can be defined dynamically. So what you could do is when something like %bokeh inline is called it defines a display formatter for Chart types (or whatever baseclass makes sense). When the registered object is in the last line of the input cell the display formatter will then automatically get called with the object as an argument, returning some HTML, which IPython will then display. Not sure if that's what you're looking for but I thought I might as well suggest it.

Edit:

Here's a simple self-contained example (note it only works in the notebook):

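from bokeh.charts import Bar, Chart
from bokeh.io import notebook_div, load_notebook
from bokeh.sampledata.autompg import autompg as df

# Load BokehJS into the notebook so the returned markup can render.
load_notebook()

def display(chart):
    # Return the HTML for the chart instead of publishing it as a side effect.
    return notebook_div(chart)

# Register the formatter so that a Chart ending a cell is rendered as HTML.
ip = get_ipython()
html_formatter = ip.display_formatter.formatters['text/html']
html_formatter.for_type(Chart, display)

Bar(df, 'cyl', values='mpg', title="Total MPG by CYL")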

@damianavila
Contributor

OK, @bryevdv we should probably take another stab at this as time permits... I think the display formatter idea described above is worth exploring (btw, I agree with you that, if we do it, we should do it consistently across all the API levels, not only charts).

@rothnic
Contributor

rothnic commented Nov 4, 2015

I just wanted to mention that I discussed with @fpliger last week that I left much of the Chart class as is, and am not married in any way to it. After all the legacy charts are retired, I do want to go back and see what we actually want to keep, since I think the .plot functionality being discussed isn't advertised or documented anymore, and is hanging around from previous implementations.

I'm definitely on board with providing the notebook user a less verbose experience if at all possible, which I do think is pretty important to broad adoption. Every time someone has to reach back to documentation due to an unexpected outcome is a chance they will give up and go back to what they are used to.

@bryevdv
Member

bryevdv commented Apr 5, 2017

Given that charts are being moved out to their own repo and HV is going to be promoted more heavily as a high level interface, I am closing this (one of those two places will be a better place to discuss this idea)

@bryevdv bryevdv closed this as completed Apr 5, 2017

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 29, 2024