-
Notifications
You must be signed in to change notification settings - Fork 28
Histogram #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Histogram #8
Conversation
Thanks for working on this. Unfortunately, I'm not really convinced with the approach.
What's the use-case for this? Maybe I'm biased by how NumPy does things, but when I think of a histogram, I generally think of the bins as subdividing a space by splitting each axis at specific edges. So, in the 2-D case, I would expect to divide the space like this:
but generally not this:
or overlapping bins. I suppose I can think of some plausible uses for weird bins, but I think it makes more sense to make the common case (rectangular bins defined by splitting the axes at edges) fast and simple to use. In other words, I like NumPy's edge-based representation. So, for example, for the 3-D case, we'd need edges for three axes, so we could represent this as I like this NumPy-style representation better than
So, what I'd suggest is something like this: /// Wrapper around `Array1` that makes sure the elements are in ascending order.
struct Edges<A: Ord> {
edges: Array1<A>,
}
impl<A: Ord> From<Array1<A>> for Edges<A> {
fn from(mut edges: Array1<A>) -> Self {
// sort the array in-place
Edges { edges }
}
}
impl<A: Ord> Edges<A> {
fn view(&self) -> ArrayView1<A> {
self.edges.view()
}
/// Returns the index of the bin containing the given value,
/// or `None` if none of the bins contain the value.
fn bin_index(&self, value: &A) -> Option<usize> {
// binary search for the correct bin
}
/// Returns the range of the bin containing the given value.
fn bin_range(&self, value: &A) -> Option<Range<A>>
where
A: Clone,
{
let i = self.bin_index(value);
Range { start: self.edges[i].clone(), end: self.edges[i + 1].clone() }
}
}
struct HistogramCounts<A: Ord> {
counts: ArrayD<usize>,
edges: Vec<Edges<A>>,
}
struct HistogramDensity<A: Ord> {
density: ArrayD<A>,
edges: Vec<Edges<A>>,
}
impl<A: Ord> HistogramCounts<A> {
pub fn new(edges: Vec<Edges<A>>) -> Self {
let counts = ArrayD::zeros(edges.iter().map(|e| e.len() - 1).collect::<Vec<_>>());
HistogramCounts { counts, edges }
}
pub fn add_observation(observation: ArrayView1<A>) -> Result<(), NotFound> {
let bin = observation
.iter()
.zip(&self.edges)
.map(|(v, e)| e.bin_index(v).ok_or(NotFound))
.collect::<Result<Vec<_>, _>>()?;
self.counts[bin] += 1;
}
} A few possible modifications:
What are your thoughts? |
Well, I think is an understatement to say that I didn't think it through and that your comment makes total sense 😅 I wanted to allow arbitrary meshes, but we should definitely prioritize memory-consumption and performance with respect to the most common use cases. I'll start over from your skeleton! |
It took me a while, but I managed to pull together a good candidate for histogram-related functionalities (
histogram
,histogram2d
andhistogramdd
in Numpy).Bins are formalized in their own submodule,
bins
:Bin1d
, basic building block, reusesRange*
structs from the standard library;Bins1d
, collection ofBin1d
;BinNd
, n-dimensional bin. It supports any region in an n-dimensional space that can be expressed as a product of intervals;BinsNd
, collection of n-dimensional bins.With
BinsNd
we can express collections of bins that are more general than the ones supported by NumPy.Two top-level traits are provided to implement an
histogram
method, forArrayBase
and one-dimensionalArrayBase
. Right now the computational complexity is the same, apart from a slightly different API (point
beingArrayBase<Data<Elem=T>>
or a&T
, but I plan to enhanceBin1d::contains
to enable an implementation ofBins1d::find
that uses binary search. This will probably require to enforce non-overlapping bins at creation time forBins1d
.I am now going to implement all the different string options for
bins
in np.histogram_bin_edges as builders, but I wanted to get the PR out to get your feedback on the direction @jturner314