Skip to content

Histogram #8

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 72 commits into from
Closed

Conversation

LukeMathWalker
Copy link
Member

It took me a while, but I managed to pull together a good candidate for histogram-related functionalities (histogram, histogram2d and histogramdd in Numpy).

Bins are formalized in their own submodule, bins:

  • Bin1d, basic building block, reuses Range* structs from the standard library;
  • Bins1d, collection of Bin1d;
  • BinNd, n-dimensional bin. It supports any region in an n-dimensional space that can be expressed as a product of intervals;
  • BinsNd, collection of n-dimensional bins.

With BinsNd we can express collections of bins that are more general than the ones supported by NumPy.

Two top-level traits are provided to implement an histogram method, for ArrayBase and one-dimensional ArrayBase. Right now the computational complexity is the same, apart from a slightly different API (point being ArrayBase<Data<Elem=T>> or a &T, but I plan to enhance Bin1d::contains to enable an implementation of Bins1d::find that uses binary search. This will probably require to enforce non-overlapping bins at creation time for Bins1d.

I am now going to implement all the different string options for bins in np.histogram_bin_edges as builders, but I wanted to get the PR out to get your feedback on the direction @jturner314

@jturner314
Copy link
Member

Thanks for working on this. Unfortunately, I'm not really convinced with the approach.

With BinsNd we can express collections of bins that are more general than the ones supported by NumPy.

What's the use-case for this? Maybe I'm biased by how NumPy does things, but when I think of a histogram, I generally think of the bins as subdividing a space by splitting each axis at specific edges. So, in the 2-D case, I would expect to divide the space like this:

+---+-------+-+
|   |       | |
+---+-------+-+
|   |       | |
|   |       | |
|   |       | |
|   |       | |
+---+-------+-+

but generally not this:

+---+-------+-+
|   |       | |
|   +-------+-+
|   |         |
|   |         |
|   |         |
|   |         |
+---+-------+-+

or overlapping bins. I suppose I can think of some plausible uses for weird bins, but I think it makes more sense to make the common case (rectangular bins defined by splitting the axes at edges) fast and simple to use.

In other words, I like NumPy's edge-based representation. So, for example, for the 3-D case, we'd need edges for three axes, so we could represent this as edges = Vec<Array1<A>> where edges[0] is the array of edges for the first axis, edges[1] is the array of edges for the second axis, and edges[2] is the array of edges for the third axis. The total number of n-D bins represented this way is (edges[0].len() - 1) * (edges[1].len() - 1) * (edges[2].len() - 1).

I like this NumPy-style representation better than BinsNd for a few reasons:

  1. The histogram counts can be represented as an n-D array. The BinsNd approach works best with a 1-D representation like HashMap (the current PR), BTreeMap or Vec, but this loses the shape information. For example, iterating along a particular axis is difficult. Even iterating over the all bins in a sensible order is difficult with HashMap. (I think you could handle the iteration over all bins in a good order by switching to a BTreeMap and carefully defining the ordering of BinsNd or switching to a Vec the same length as the Vec of bins, but it still would not be as convenient or flexible as an n-D array.)

  2. It's a lot easier to plot a histogram given the edges along each axis instead of BinsNd. For example, consider the problem of determining the locations of the ticks along axis 1. With the NumPy-style representation, this is simply edges[1]. With the BinsNd representation it's not obvious. While we don't have any good plotting libraries in Rust at the moment, I think the tasks involved in plotting a histogram are fairly representative of the types of things you might want to do with a histogram.

  3. The NumPy-style representation only allows axes to be split. It does not allow for "merged" bins (the second picture I provided above) or overlapping bins. This is more restrictive, but it also makes working with the NumPy-style representation simpler because you don't have to worry about those special cases.

  4. The size in memory of BinsNd is O(n^ndim * ndim) where n is the number of bins along each axis (assuming the number of bins along each axis is the same). (There are n^ndim total bins, and each one has size ndim.) In contrast, the size in memory of the NumPy-style representation is O(n*ndim). (There are ndim axes, and each one has n+1 edges.) This is a substantial difference, even for a moderately-sized case like n = 30, ndim = 3. I suppose you could argue that the size in memory of the counts is O(n^ndim), so why worry about the bins having a similar big-O, but I think it still makes sense to keep the memory required for the bins small.

  5. Possible performance differences (keep in mind these are just educated guesses without benchmarks):

    • The number of Vecs in the BinsNd representation is 1 + n^ndim. (1 for the outer Vec and n^ndim Vecs for all of the bins.) The large number of Vecs is problematic if the memory allocator doesn't do a good job keeping them close together in memory. (If the Vecs are scattered across memory, caching doesn't work very well. Fortunately, IIUC jemalloc is pretty good at keeping allocations close together, so this may not actually be an issue in practice.) It would be possible to avoid this issue for fixed-dimension BinNd by using an alternative representation for BinNd (e.g. { lower_bounds: D, upper_bounds: D }). The NumPy-style representation doesn't have this issue at all, though; the number of Vecs is 1 + ndim. (1 for the outer Vec, and one for each axis.)

    • Interestingly enough, if the bins are sorted, both representations have the same big-O cost of finding the correct bin with binary search (O(log(n^ndim)) = O(ndim * log(n))), but I would guess that the NumPy-style representation would be faster because it's more cache-friendly.

So, what I'd suggest is something like this:

/// Wrapper around `Array1` that makes sure the elements are in ascending order.
struct Edges<A: Ord> {
    edges: Array1<A>,
}

impl<A: Ord> From<Array1<A>> for Edges<A> {
    fn from(mut edges: Array1<A>) -> Self {
        // sort the array in-place
        Edges { edges }
    }
}

impl<A: Ord> Edges<A> {
    fn view(&self) -> ArrayView1<A> {
        self.edges.view()
    }

    /// Returns the index of the bin containing the given value,
    /// or `None` if none of the bins contain the value.
    fn bin_index(&self, value: &A) -> Option<usize> {
        // binary search for the correct bin
    }

    /// Returns the range of the bin containing the given value.
    fn bin_range(&self, value: &A) -> Option<Range<A>>
    where
        A: Clone,
    {
        let i = self.bin_index(value);
        Range { start: self.edges[i].clone(), end: self.edges[i + 1].clone() }
    }
}

struct HistogramCounts<A: Ord> {
    counts: ArrayD<usize>,
    edges: Vec<Edges<A>>,
}

struct HistogramDensity<A: Ord> {
    density: ArrayD<A>,
    edges: Vec<Edges<A>>,
}

impl<A: Ord> HistogramCounts<A> {
    pub fn new(edges: Vec<Edges<A>>) -> Self {
        let counts = ArrayD::zeros(edges.iter().map(|e| e.len() - 1).collect::<Vec<_>>());
        HistogramCounts { counts, edges }
    }

    pub fn add_observation(observation: ArrayView1<A>) -> Result<(), NotFound> {
        let bin = observation
            .iter()
            .zip(&self.edges)
            .map(|(v, e)| e.bin_index(v).ok_or(NotFound))
            .collect::<Result<Vec<_>, _>>()?;
        self.counts[bin] += 1;
    }
}

A few possible modifications:

  • An EdgesNd wrapper type for Vec<Edges<A>> might be nice to provide methods for finding the n-D index of the bin corresponding to a value and other similar operations. Alternatively, we could provide this functionality on the Histogram* types.

  • We could combine HistogramCounts and HistogramDensity into a single type with a parameter for the element type of the counts/density array. I'm pretty ambivalent about this, because I think a Histogram trait implemented for HistogramCounts and HistogramDensity would work just as well as a common Histogram struct.

  • We could add the dimensionality to the type of Edges/Histogram*. I think this is probably worth doing.

  • Instead of Edges containing Array1<A>, it could contain Vec<A> so that we could take advantage of the standard library's sorting and searching methods. (ndarray doesn't have an equivalent to the standard library's .unstable_sort() or .binary_search().)

  • I've used a requirement of A: Ord for simplicity, which would require using N64 (noisy-float's "not NaN" wrapper type) instead of f64, for example. We might come up with some other way to allow PartialOrd. This is similar to the problem we had with the quantile methods.

What are your thoughts?

@LukeMathWalker
Copy link
Member Author

Well, I think is an understatement to say that I didn't think it through and that your comment makes total sense 😅

I wanted to allow arbitrary meshes, but we should definitely prioritize memory-consumption and performance with respect to the most common use cases. I'll start over from your skeleton!

@LukeMathWalker LukeMathWalker deleted the histogram branch November 18, 2018 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants