Skip to content

Scaled densities/counts in 2d density/bins plots (#2679) #2680

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 11, 2018

Conversation

bjreisman
Copy link
Contributor

I added 'scaled' statistics to the stat_bin2d, stat_binhex, stat_density_2d, and stat_contour functions for plotting 2d distributions normalized to a common height. This would be very useful for faceted 2d plots, where the maximum density/count can vary greatly between panels.

(This is my attempt to add the feature I had requested in issue #2679.)


I tried my best to mirror the corresponding statistics from the 1d functions:

stat_bin stat_bin2d stat_binhex stat_bin2d (updated) stat_binhex (updated)
count count count count count
ncount ncount ncount
density density density density density
ndensity ndensity ndensity

It was a little bit harder to get things to match-up one-to-one for the density based function, so some adjustment to the syntax may be needed.

stat_density stat__contour stat_density2d stat__contour (updated) stat_density2d (updated)
count - - -
density - density - density
scaled - - - scaled
- pieces - pieces -
- level - level -
- - - nlevel -

Here is an example of the revised functions in action:

library(ggplot2)
library(dplyr)
library(viridis)

ggplot(diamonds, aes(x=x, y= depth)) +
  stat_density_2d(aes(fill = stat(level)),
                  geom = "polygon",
                  n = 100,
                  bins = 10,
    contour = T) +
  facet_wrap(clarity~.) +
  scale_fill_viridis(option = "A")

plot6

ggplot(diamonds, aes(x=x, y= depth)) +
  stat_density_2d(aes(fill = stat(nlevel)),
                  geom = "polygon",
                  n = 100,
                  bins = 10,
    contour = T) +
  facet_wrap(clarity~.) +
  scale_fill_viridis(option = "A")

plot7

@clauswilke
Copy link
Member

This seems like a good idea to me. Unfortunately it's too late for 2.3.0, we're in a soft freeze.

A couple of comments:

  • Please make sure your coding style matches the ggplot2 style (e.g., spaces around /).
  • The documentation should be updated. Already, the "computed variables" sections of geom_density_2d() and geom_contour() seem incomplete, and they will be even more incomplete after this change.

@@ -60,12 +60,13 @@ StatDensity2d <- ggproto("StatDensity2d", Stat,
lims = c(scales$x$dimension(), scales$y$dimension())
)
df <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
df$nz <- df$z/max(df$z)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why nz? This naming seems strange.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure what to call this one. I went with nz for "normalized z" as in ndensity and ncount. In my own code, I normally name variables which are derived from other variables using "x.difference" so z would become, z.scaled or similar, but I wasn't sure what the convention was for ggplot2.

Alternatively, we could just call it scaled, but then the renaming that happens a few lines down would be somewhat redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it strange to introduce a variable name and then change it 5 lines down. Also, what happens if the if (contour) { branch is executed? What does StatContour do to the nz column?

Copy link
Contributor Author

@bjreisman bjreisman Jun 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to agree; however, I tried to preserve as much of the approach of the original stat_density function as I could and that was the approach taken there:

#exerpt from stat_denisty, line 62
    df <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
    df$group <- data$group[1]

    if (contour) {
      StatContour$compute_panel(df, scales, bins, binwidth)
    } else {
      names(df) <- c("x", "y", "density", "group")
      df$level <- 1
      df$piece <- 1
      df
    }

StatContour doesn't do anything with the nz column, so to avoid having to rename things we could just name it when we create it.

    df <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
    df$scaled <- df$z / max(df$z) ## <- here
    df$group <- data$group[1]

    if (contour) {
      StatContour$compute_panel(df, scales, bins, binwidth)
    } else {
      names(df) <- c("x", "y", "density", "scaled", "group")
      df$level <- 1
      df$piece <- 1
      df
    }

Or we could just create it later on, which avoids passing a column that won't be used to StatContour

    df <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
    df$group <- data$group[1]

    if (contour) {
      StatContour$compute_panel(df, scales, bins, binwidth)
    } else {
      names(df) <- c("x", "y", "density", "scaled", "group")
      df$scaled <- df$density / max(df$density) ## <- here
      df$level <- 1
      df$piece <- 1
      df
    }

Or perhaps there's a third even cleaner method to achieve the same goal. Which approach do you think would be best?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the later creation is better, though you shouldn't name it before you add it to the data frame. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! ✨


This may be a separate branch/pull request, but it might be worthwhile to change the behavior of stat_density2d when aes(fill = stat(level)) is called when contour = FALSE and the behavior of aes(fill = stat(density)) when contour = TRUE to return an error or a warning. Neither of those statistics are meaningful when called in the wrong context and it can return some cryptic errors as seen below. 😬

library(ggplot2)
library(dplyr)
library(viridis)
#2d densiy plot colored by level
diamonds %>%
ggplot(aes(x=x, y= depth)) +
  stat_density_2d(aes(fill = stat(density)),
                  geom = "raster",
                  n = 100,
                  bins = 10,
    contour = T) +
  facet_wrap(clarity~.) +
  scale_fill_viridis(option = "A")

#>  Error in is.finite(x) : default method not implemented for type 'closure' 
diamonds %>%
ggplot(aes(x=x, y= depth)) +
  stat_density_2d(aes(fill = stat(level)),
                  geom = "polygon",
                  n = 100,
                  bins = 10,
    contour = F) +
  facet_wrap(clarity~.) +
  scale_fill_viridis(option = "A")

plot7

@bjreisman
Copy link
Contributor Author

Thanks for the tips! I've updated the styling and documentation as suggested. I wasn't sure how to best describe the pieces computed variable in stat_contour, so that may need more revision than the others.

@hadley
Copy link
Member

hadley commented Jul 8, 2018

@clauswilke can you please finish this off by approving the PR (if you think it's ready to be merged)

@clauswilke
Copy link
Member

@hadley Yes. One thing that I think needs to be fixed in the ggplot2 codebase first is a new section in NEWS.md for post 3.0.0 development.

@bjreisman At a minimum, I think roxygen needs to be rerun so all the documentation changes are processed correctly. I would also suggest adding a line to NEWS.md (not NEWS!). It would have to go above the current ggplot2 3.0.0 section. I'll read the code over carefully tomorrow and let you know if I see anything else.

@hadley
Copy link
Member

hadley commented Jul 8, 2018

Oops yes, done.

@@ -66,6 +71,7 @@ StatDensity2d <- ggproto("StatDensity2d", Stat,
StatContour$compute_panel(df, scales, bins, binwidth)
} else {
names(df) <- c("x", "y", "density", "group")
df$scaled <- df$density / max(df$density)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this is called scaled and not ndensity as in the other stats? Also, does the max function need an argument na.rm = TRUE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to consider this addition!

I wasn't entirely happy with the switch between ndensity to scaled either and would actually prefer to use ndensity here. However, in stat_density [the 1d version], the rescaled statistic is called scaled while in stat_bin it's called ndensity or ncount.

Since stat_density_2d is the 2d version of stat_density I tried to mirror the syntax by naming it scaled. I think it might make the most sense to use one syntax throughout all the stat functions (ndensity), but I will leave that to your discretion. To summarize, I think there are a three options.

  1. Mirror the syntax from the 1d versions of the stat layers such that stat_density_2d and stat_density use scaled while stat_bin2d, stat_hexbin, and stat_bin use ndensity and ncount. (implemented this PR)

  2. (2a) Switch over to a single syntax for the 2d stats with this version - ndensity replaces scaled in stat_density_2d.

  3. (2b) Switch over to a single syntax for all density based stats - ndensity replaces scaled in both stat_density and stat_denisty_2d. I suppose you would want to just duplicate scaled and name the new variable ndensity to avoid breaking existing code.

--
RE: na.rm = TRUE; you're correct, I've added that to my local version for my next PR. I can submit that change once scaled vs. ndensity is settled and I'll update the documentation and news as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think option 2b (uniform naming with duplicated scaled column for stat_density) is best. But I'd like to hear @hadley's input as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, 2b sounds good (assume it doesn't affect existing code, in which case scaled should be added, instead of replacing the existing variable name)

@bjreisman
Copy link
Contributor Author

bjreisman commented Jul 10, 2018

Okedoke, I've implemented option 2b with the latest commit, as updating NEWS.md and adding an example to geom_density2d() for the new functionality.

I tried running roxygen to update the documentation, but for some reason it didn't recognize the \section{Aesthetics}{ tag, so those were dropped from every geom on my local version. It may have something to do with the following warning: "Version of roxygen2 last used with this package is 6.0.1.9000. You only have version 6.0.1." When I tried to install the dev version from github it wouldn't run. I think I'm running into this error: Having problems with slash direction in Roxygen2 package. Any tips on how to overcome this?

NEWS.md Outdated

* `stat_density()` now includes the calculated statistic `nlevel`, an alias
for `scaled`, to better match the syntax of `stat_bin` (@bjreisman, #2680)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this whole part seems to have ended in the wrong section of the file. It should be at the very top, under ggplot2 3.0.0.9000. You may have to rebase. Also, the whole section should only be 1-2 sentences, such as "stat_contour(), stat_density2d(), stat_bin2d(), stat_binhex() all now also calculate normalized densities, levels, and counts (@bjreisman, #2680)".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, one more comment: The issue this is closing is #2679, not #2680. #2680 is the pull request. And one of the git commit messages should contain the phrase "Closes #2679" or "Fixes #2679" so that github automatically closes the issue once the pull request is integrated. You can add this to the next commit you make.

R/stat-density.r Outdated
@@ -21,6 +21,8 @@
#' \item{count}{density * number of points - useful for stacked density
#' plots}
#' \item{scaled}{density estimate, scaled to maximum of 1}
#' \item{ndensity}{alias for stat(scaled), to mirror the syntax of
#' geom_bin}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put stat(scaled) and geom_bin in backticks so they show up as code? And also please write geom_bin() (with parentheses).

@clauswilke
Copy link
Member

I suspect I can fix the roxygen issue if you give me access to your repo fork. Won't get to it until later in the week though.

@bjreisman
Copy link
Contributor Author

Done! Thanks for taking the time to review these changes.

bjreisman and others added 8 commits July 11, 2018 14:04
+ updated documentation
(revisions suggested by clauswilke)

Notes:
Wasn't sure how to best describe the 'pieces' computed variable generated by stat_contour;

'nz' variable remains for now in stat_density_2d, open to suggestions for a better name.
…to minimize renaming of existing variables. 🎨
- swapped `scaled` for `ndensity` in `stat_density2d()`;

- added `ndensity` to `stat_density` as an alias for `scaled`;

- added an example to `geom_densityd2()`;

- updated NEWS.md to reflect these changes.
@clauswilke
Copy link
Member

@bjreisman I have rebased your PR and made some documentation cleanups. Let's see if all the checks pass.

Most importantly, you were using scale_fill_viridis() in two examples, but that scale does not exist in ggplot2 out of the box. I changed those to use scale_fill_viridis_c(), which does exist. It's important that all examples can be run as written without any unspecified dependencies. Otherwise CRAN check will fail.

@clauswilke clauswilke merged commit 08e5f6c into tidyverse:master Jul 11, 2018
@clauswilke
Copy link
Member

I just merged this, thanks!

@lock
Copy link

lock bot commented Jan 7, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 7, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants