Skip to content

keys(g::Generator) = keys(g.iter) is not generically correct and should not be defined #58341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
adienes opened this issue May 7, 2025 · 14 comments
Labels
breaking This change will break code collections Data structures holding multiple items, e.g. sets correctness bug ⚠ Bugs that are likely to lead to incorrect results in user code without throwing design Design of APIs or of the language itself

Comments

@adienes
Copy link
Member

adienes commented May 7, 2025

from the keys docstring:

keys(iterator)
For an iterator or collection that has keys and values (e.g. arrays and dictionaries), return an iterator over the keys.

but a Generator does not have keys and values. it only has values. there is no get(::Generator, ...) nor getindex(::Generator, ...)

I believe a similar problem holds for axes(g::Generator) and ndims(g::Generator), but I'm focusing on keys since it seems the most flagrant.

this leads to awful behavior with things like

julia> d
Dict{Symbol, Int64} with 2 entries:
  :a => 1
  :b => -1

julia> findmax(identity, d)
(1, :a)

julia> findmax(Iterators.map(identity, d))
(:b => -1, :b)

where the second call should either error, or maybe a Generator could 1-index as if it were an enumerate so that would become

julia> findmax(Iterators.map(identity, d))
(:b => -1, 2)

note that

julia> iterate(Iterators.map(identity, d))
(:a => 1, 2)

so even the state in iterate is 1-indexed and not taking from keys(g.iter)

I think technically this is a duplicate of #48379, but I don't think the fundamental problem there has anything to do with skipmissing so I just made a new issue.

@adienes adienes added breaking This change will break code design Design of APIs or of the language itself collections Data structures holding multiple items, e.g. sets correctness bug ⚠ Bugs that are likely to lead to incorrect results in user code without throwing labels May 7, 2025
@nsajko
Copy link
Contributor

nsajko commented May 7, 2025

The method was added here, and approved by triage multiple times it seems: #34678

@adienes
Copy link
Member Author

adienes commented May 7, 2025

how unfortunate. well, it's not documented behavior so it's not too late to revert. note that the triage approval summary contained the language doesn't seem harmful.

but it is harmful, as it makes some basic functions in Base violate their docstring.

@jariji
Copy link
Contributor

jariji commented May 7, 2025

Since Generator doesn't implement getindex I don't see a need for it to implement keys. As Takafumi said in the original PR, we should expect

isequal(collect(values(A)), [A[k] for k in keys(A)])

and Generator isn't intended to support random access like that.

So I'd favor an error over the enumerate- or countfrom-style behavior.

@jariji
Copy link
Contributor

jariji commented May 21, 2025

I think it would be good to get this done because it's a correctness hazard. Is this ready for triage?

@adienes
Copy link
Member Author

adienes commented May 21, 2025

as @nsajko noted, this behavior was already considered and approved by triage in #34678. I don't have a productive solution that wouldn't be very breaking, so I'm not sure what there is for triage to discuss. I should probably just close this as won't fix

although, I will also note that the findmax(itr) docstring doesn't make much sense

Indices are of the same type as those returned by keys(itr) and pairs(itr).

... so which is it? keys or pairs ? those are different sometimes

@mbauman
Copy link
Member

mbauman commented May 21, 2025

I'm still trying to figure out what exactly makes this wrong. Because it's definitely right in other cases. Most notably, we need generators to expose their iteration space's axes in order to collect them to the right shape. That is:

julia> g = (i*j for i in 1:3, j in 1:4)
Base.Generator{Base.Iterators.ProductIterator{Tuple{UnitRange{Int64}, UnitRange{Int64}}}, var"#5#6"}(var"#5#6"(), Base.Iterators.ProductIterator{Tuple{UnitRange{Int64}, UnitRange{Int64}}}((1:3, 1:4)))

julia> Base.IteratorSize(g)
Base.HasShape{2}()

julia> axes(g)
(Base.OneTo(3), Base.OneTo(4))

That's not only correct, but it's precisely what's needed in order to collect(g) (that is: comprehensions!) into the correct shape. Now yes, I know that's not keys (and in fact keys(g) in this case errors because the ProductIterator is a non-indexable iterable). But it's a starting point for me talking out loud here.

I think this is the same fundamental problem as map (#57075) — generators support both iterables and indexables, and the mismatch happens when there are datastructures for whom indexing and iteration disagree. That's what's happening with findmax:

julia> d = Dict(:a=>1, :b=>-1);

julia> findmax(d) # indexing-based
(1, :a)

julia> findmax(kv for kv in d) # iteration based
(:b => -1, :b)

julia> findmax(v for (k,v) in d) # FTFY!
(1, :a)

Is the second one there actually wrong? The key-value pair (:b => -1) is indeed "greater" than (:a => -1)... and yeah, it occurs at the "key" :b. Yes, it's surprising and yes, it's a footgun, but why is it wrong?

@adienes
Copy link
Member Author

adienes commented May 21, 2025

I actually mostly agree with you that this is p r o b a b l y closer to "footgun caused by the intersection of vague interfaces" than it is to "correctness bug"

but consider

julia> d = Dict([:a => 1, :b => -1]);

julia> it = Iterators.map(identity, d);

julia> pairs(it)
Base.Generator{Base.Iterators.Zip{Tuple{Base.KeySet{Symbol, Dict{Symbol, Int64}}, Base.Generator{Dict{Symbol, Int64}, typeof(identity)}}}, Base.var"#Generator##2#Generator##3"{Pair}}(Base.var"#Generator##2#Generator##3"{Pair}(), zip([:a, :b], Base.Generator{Dict{Symbol, Int64}, typeof(identity)}(identity, Dict(:a => 1, :b => -1))))

why is this allowed? namely: that keys forwards to the dictionary but values does not, and we just get a Generator{Dict...}

pairs demands that

keys(a), values(a) and pairs(a) all iterate a and return the elements in the same order.

but this is not guaranteed generically here! imagine I make a dictionary type FunDict where

iterate(::FunDict) = # iterates in whatever order the hash key rng decides
keys(::FunDict) = # sort the keys alphabetically
values(::FunDict) = # sort the keys alphabetically, get each associated value

then pairs(Iterators.map(identity, ::FunDict)) will be a Zip of keys and values that are completely misaligned

(never mind the fact that pairs says it returns an iterator over key => value pairs for any collection that maps a set of keys to a set of values, and a Generator is not such a collection)

@mbauman
Copy link
Member

mbauman commented May 21, 2025

why is this allowed? namely: that keys forwards to the dictionary but values does not, and we just get a Generator{Dict...}

But isn't this all just doing the right thing? The generator is exposing the iteration space of a collection over a set of its keys — we can literally write for kv in d! The values of the generator will be the result of iterating the collection. Which, yeah, is exactly the collection itself!

pairs demands that

keys(a), values(a) and pairs(a) all iterate a and return the elements in the same order.

You lost the styling from the docstring there, but perhaps that's even better because it makes it clear where the ambiguity lies. Is it the English "iterate" or the Julia function iterate? I think it needs to be the latter — iterate itself is a part of that ordering requirement!

Make a data structure that breaks that requirement, and generators will happily forward along that brokenness.


Perhaps the hardest part here is knowing when some f in findmax(f(d)) is using iterate over the keys vs. when it's generating values over the keys vs. when it's discarding the original keys in favor of a new set. Digging into the Iterators module is a good signal, but I can see how skipmissing would be quite ambiguous.

@adienes
Copy link
Member Author

adienes commented May 21, 2025

I'd like to repeat my previous example with some code --- not because I don't think you understood it, but just so I can see very specifically where our disagreement begins.

using Random

struct FunDict{K, V}
    d::Dict{K, V}
end

import Base: keys, values, length, iterate
keys(fd::FunDict) = sort(keys(fd.d))
values(fd::FunDict) = [fd.d[k] for k in keys(fd)]

length(fd::FunDict) = length(fd.d)
iterate(fd::FunDict) = iterate(fd.d)
iterate(fd::FunDict, i) = iterate(fd.d, i)

Random.seed!(1)
d = Dict(zip(randstring.(5 .* ones(Int, 5)), rand(5)))
fd = FunDict(d)
it = Iterators.map(identity, fd)

this produces

julia> collect(pairs(it))
5-element Vector{Pair{String, Pair{String, Float64}}}:
 "4Lhcu" => ("XcGqU" => 0.7305508461555176)
 "BlmfA" => ("4Lhcu" => 0.9003405842788204)
 "XcGqU" => ("ZSI0Z" => 0.8686942572391998)
 "ZSI0Z" => ("BlmfA" => 0.37215957409032674)
 "cC0J7" => ("cC0J7" => 0.8667711192602672)

I am currently inferring from your response that you would describe this as a broken data structure, and not meeting the contract of one or both of pairs and iterate, is that correct? if so it's an interesting distinction as I had been reading the docstring to mean something more like enforcing pairs(x) == zip(keys(x), values(x)) without requiring anything of iterate

also note of course that it is not the case that values(x) yields the same elements as iterating over x

@jariji
Copy link
Contributor

jariji commented May 21, 2025

I might be way behind here, but I'd like to try articulating my own view.

Imho the gospel should be that keys returns indices that are valid under getindex.

Generator could define getindex as either Iterators.nth, or as g.iter[i]

To demonstrate, take the case of

julia> g = (x for x in Dict(:a => 10))
Base.Generator{Dict{Symbol, Int64}, typeof(identity)}(identity, Dict(:a => 10))

If g[i] is nth(g, i), then keys(g) needs to be [1] because 1 is the only value for which nth(g, i) returns a value. That's not currently what it does.

julia> keys(g)
KeySet for a Dict{Symbol, Int64} with 1 entry. Keys:
  :a

If g[i] is g.iter[i], then keys(g) needs to return keys(g.iter), whose elements are [:a]. Then consulting findmax we have

findmax(f, domain) -> (f(x), index)

Return a pair of a value in the codomain (outputs of f) and the index or key of the corresponding value in the domain (inputs to f) such that f(x) is maximised. If there are multiple maximal points, then the first one will be returned.

domain must be a non-empty iterable supporting keys. Indices are of the same type as those returned by keys(domain).

and

julia> findmax(identity, g)
(:a => 10, :a)

where :a is the key i where g[i] attains its maximum, which is right, and :a => 10 is purported to be the value g[i] at that i, but that is wrong: the value of g[i] is 10, not :a => 10.

I think therefore that defining keys(::Generator) necessitates violating my gospel assumption that keys(g) returns indices i for which getindex(g, i) returns values. Do we think that's right? Are we ok with it?

@mbauman
Copy link
Member

mbauman commented May 21, 2025

I am currently inferring from your response that you would describe this as a broken data structure, and not meeting the contract of one or both of pairs and iterate, is that correct? if so it's an interesting distinction as I had been reading the docstring to mean something more like enforcing pairs(x) == zip(keys(x), values(x)) without requiring anything of iterate

Yeah, that's precisely it. I also wasn't as certain — my response wasn't so much "you broke it, RTFM", but rather, realizing that the end effect of Generator defining its keys in this manner is implicitly depending upon (and thus requiring!) that these orderings match.

It's a rather interesting end effect. I don't think there are other places where we assume/require this — at least not so strongly.


I'm not nearly as concerned that generators don't immediately expose an indexable interface but still have keys... precisely because they can preserve those keys (ok, not the keys directly, but the axes) upon collect when the iteration space HasShape. It goes right back to what map means, in fact.

@mbauman
Copy link
Member

mbauman commented May 21, 2025

So there are two problems here. One is the fact that dicts generate different elements upon iteration than indexing. That's a problem, but that ain't going away til 2.0, surely (cf. #24019). So given that we live in a world where iterating and indexing can give you different things, can you know what a function will do? If a docstring says it does something to the "elements" of a collection, is that iteratively or indexingly?

Concretely, if I have a d = Dict(:a => 1, :b => missing, :c => 3), what are the answers here?

  • count(Returns(true), filter(!ismissing, d))
  • count(Returns(true), filter!(!ismissing, d))
  • count(Returns(true), Iterators.filter(!ismissing, d))
  • count(Returns(true), skipmissing(d))
  • count(Returns(true), keys(skipmissing(d))
  • skipmissing(d)[:b]

I cannot imagine anyone getting 6/6 correct there (but there's definitely a bug, so perhaps that's unfair).

@jariji
Copy link
Contributor

jariji commented May 22, 2025

I'm not nearly as concerned that generators don't immediately expose an indexable interface but still have keys... precisely because they can preserve those keys (ok, not the keys directly, but the axes) upon collect when the iteration space HasShape. It goes right back to what map means, in fact.

Obviously no obligation to answer these but I'm left wondering

  • What does keys mean if not "indexable with those keys"?
  • Is there a function that means "indexable with these keys"?
  • Would it be okay to add a function whose keys are different from its valid getindex arguments?
  • Does generic keytype refer to keys or to getindex arguments?
  • Are we really enough useful out of violating the straightforward keys-getindex association to make it worthwhile? Afaik this is the only violation; the rule even holds for String.

@adienes
Copy link
Member Author

adienes commented May 22, 2025

what are the answers here?

sounds fun. I'll try to answer these based on my understanding of what Coherent Julia As Intended™️ is currently (rather than whatever I might think it "should" be)

is that iteratively or indexingly?

Based only on the most common empirical design pattern I tend to see in Julia, "indexingly" seems to most commonly take precedence over "iteratively," where shape or structure or keys etc. etc. are used when able and treating something as a generic iterator occurs only as a fallback. To separate the terms a bit, I'll use elements to describe what you get when you iterate something as opposed to values(collection)

  • count(Returns(true), filter(!ismissing, d))
    • here is my reasoning:
    • filter(f, collection) is likely going to try to retain the structure of collection
    • filter(f, collection) is likely going to try to equal collect(Iterators.filter(f, collection)), which surely receives elements
    • since the elements of d are pairs and never ismissing(::Pair), therefore filter(!ismissing, d)) == d
    • now count will be 3 no matter whether it's iterating elements, values, keys, or pairs
  • count(Returns(true), filter!(!ismissing, d))
    • 3, same as above
  • count(Returns(true), Iterators.filter(!ismissing, d))
    • 3, same as above
  • count(Returns(true), skipmissing(d))
    • based only on my guess that skipmissing(collection) was probably designed to be "useful" rather than "mental gymnastics in the name of consistency" (/snark) it probably filters values?
    • returns 2
  • count(Returns(true), keys(skipmissing(d)))
    • based on the same guess above, 2 again
  • skipmissing(d)[:b]
    • KeyError

let's see the results!

.... oh come on this is definitely a bug 😭

julia> Dict(skipmissing(d))
Dict{Symbol, Union{Missing, Int64}} with 3 entries:
  :a => 1
  :b => missing
  :c => 3

julia> collect(keys(skipmissing(d)))
2-element Vector{Symbol}:
 :a
 :c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code collections Data structures holding multiple items, e.g. sets correctness bug ⚠ Bugs that are likely to lead to incorrect results in user code without throwing design Design of APIs or of the language itself
Projects
None yet
Development

No branches or pull requests

4 participants