Skip to content

Make metadata lazy for table row classes #1261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 24, 2021

Conversation

benjeffery
Copy link
Member

We were decoding metadata on every table row access previously, this makes that lazy. This same method will be applied to the container classes that get returned from e.g. ts.node(u) when they are dataclasses.

Note this also gives table row objects a .table attribute (@hyanwong might be interested) This attribute does not get put into the result of dataclasses.asdict or used in the auto dataclasses methods such as __eq__ or __repr__. These auto methods also only see decoded metadata. The dataclasses are still "frozen".

Needs a couple of extra tests and and one optimisation for the null schema case before merging. Also thinking the schema reference should be stored, as otherwise it could be modified before use by the row.

@codecov
Copy link

codecov bot commented Mar 20, 2021

Codecov Report

Merging #1261 (020f2d5) into main (1510d2a) will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1261      +/-   ##
==========================================
+ Coverage   93.78%   93.82%   +0.03%     
==========================================
  Files          26       26              
  Lines       21991    21995       +4     
  Branches      987      991       +4     
==========================================
+ Hits        20625    20637      +12     
+ Misses       1325     1324       -1     
+ Partials       41       34       -7     
Flag Coverage Δ
c-tests 92.49% <ø> (ø)
lwt-tests 92.97% <ø> (ø)
python-c-tests 95.10% <100.00%> (+0.07%) ⬆️
python-tests 98.83% <100.00%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/metadata.py 98.78% <100.00%> (+0.12%) ⬆️
python/tskit/tables.py 99.74% <100.00%> (+0.12%) ⬆️
python/tskit/trees.py 97.86% <100.00%> (+0.42%) ⬆️
python/tskit/vcf.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1510d2a...020f2d5. Read the comment docs.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, see some comments.

return cls


@lazy_metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not obvious here why we need both a decorator and to inherit from LazyMetadata - one or other is easy to follow for a reader, but both is confusing. What's the rationale here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should spell it out up here where these classes are defined anyway, need to encode this knowledge somewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out how to make an extra slot with a class decorator, annoying!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we bother with the decorator then? We should be able to implement this in a straighforward inheritancy type way, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manged to do it all in a decorator in the end, can't do inheritance due to intercepting the constructor.

@benjeffery
Copy link
Member Author

@jeromekelleher I think this could have a review now. Docs are not building due to a warning I haven't figured out yet, but look good ignoring the warning.

Note that I've removed the deprecated attrs on mutation - these could be added back if you think it is too early.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I'm seriously questioning whether making things frozen is worth it. What problem are we trying to solve here?

Having to resort to this __setattr__ stuff to do perfectly reasonable things seems much more likely to result in errors than anything the frozen is trying to prevent. Plus, I think we are likely to break code if we change the semantics to frozen now.

If we're trying to allow for things like:

tables.nodes[0].flags = 0

then we perhaps we could have table rows frozen for now (to allow this in future) but keep the stuff returned by the TreeSequence unfrozen?

@@ -218,7 +219,7 @@ def record_edge(self, left, right, parent, child):
else:
last = self.edge_buffer[child][-1]
if last.right == left:
last.right = right
builtins.object.__setattr__(last, "right", right)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a pattern I want to encourage people to use. This seems worse than not having Frozen instances.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed - I could rejig these instances to not use the ugly way. It only really needs to be used in the custom __init__ and the metadata decorator.

and np.array_equal(self.location, other.location)
and np.array_equal(self.parents, other.parents)
and self.metadata == other.metadata
)

def __neq__(self, other):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things here:

  1. I think we can get rid of __neq__ now, this is handled automatically in more recent versions of Python
  2. Can we use the dataclass __eq__? If not, let's add a comment saying why, as this will be a recurring question.

(Ah, answering my own question here: we have numpy arrays. Good to add a quick comment here to that effect)

"""
metadata: Any
"""
The decoded :ref:`metadata <sec_metadata_definition>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, only decoded if a schema is present, right?

and np.array_equal(self.mutations, other.mutations)
)

def __neq__(self, other):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need neq I think

@@ -560,71 +488,95 @@ class Variant(SimpleContainer):
Modifying the attributes in this class will have **no effect** on the
underlying tree sequence data.

:ivar site: The site object for this variant.
:ivar site:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can get rid of the ivars entirely I think (hope?)

@@ -39,6 +41,8 @@
import tskit
import tskit.exceptions as exceptions

__setattr__ = builtins.object.__setattr__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about the wisdom of this - don't we then export this into our namespace? If we're going to do something this ugly, the I'd prefer to feel the full brunt of the ugliness each time we use it, just to be reminded of how ugly it is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imported it to avoid the double lookup on each row access - I thinking importing it to something called __builtin__object__setattr__ would be ugly enough?

@jeromekelleher
Copy link
Member

Note that I've removed the deprecated attrs on mutation - these could be added back if you think it is too early.

When were they marked as deprecated? We should update the changelog with details.

@benjeffery
Copy link
Member Author

Looks good, but I'm seriously questioning whether making things frozen is worth it. What problem are we trying to solve here?

The idea was to prevent users thinking they could mutate the classes - as it warned about in the old docs.

Having to resort to this __setattr__ stuff to do perfectly reasonable things seems much more likely to result in errors than anything the frozen is trying to prevent. Plus, I think we are likely to break code if we change the semantics to frozen now.

If we're trying to allow for things like:

tables.nodes[0].flags = 0

then we perhaps we could have table rows frozen for now (to allow this in future) but keep the stuff returned by the TreeSequence unfrozen?

Do you mean the other way round? As in TableRow classes are unfrozen to allow for future mutation, and Node et al. are frozen? This seems to make sense to me.

@jeromekelleher
Copy link
Member

Do you mean the other way round? As in TableRow classes are unfrozen to allow for future mutation, and Node et al. are frozen? This seems to make sense to me.

No - to me the only point in freezing the classes is so that we don't break people's code later when we change the semantics to "push the changes back into the tables". So, let people mutate away on the Node, Edge etc classes - we document that these are copies of the original data, and they can do whatever they like with them. Freezing the classes just for the sake of it, or to somehow prevent people from writing bad code seems ... unpythonic to me.

@jeromekelleher
Copy link
Member

How about we make the minimal changes to dataclasses first as cleanly as we can with the same semantics as we currently have (ish), and then look at whether we should freeze or not as a second pass?

@benjeffery benjeffery force-pushed the table-row-metadata branch 3 times, most recently from 3a93746 to afc0ec7 Compare March 24, 2021 10:10
@benjeffery
Copy link
Member Author

@jeromekelleher OK, semantics are now exactly as they were before. Think I have addressed the comments too. Just the docs
to fix, for some reason it isn't finding the Site class.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, seems like a major improvement. One comment on the type annotations.

@@ -2991,17 +2971,18 @@ def test_metadata(self):
(inst,) = self.get_instances(1)
(inst2,) = self.get_instances(1)
assert inst == inst2
inst._metadata_decoder = lambda m: "different decoder"
inst.metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this statement do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was forcing decode of the metadata, but that is done by the equality operator above anyway, removed.

class IndividualTableRow:
__slots__ = ["flags", "location", "parents", "metadata"]
flags: int
location: np.ndarray
parents: np.ndarray
metadata: bytes
metadata: Any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could say metadata : Union[bytes, dict] couldn't we?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Optional[Union[bytes, dict]]

@benjeffery benjeffery added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Mar 24, 2021
@mergify mergify bot merged commit ecf9bcb into tskit-dev:main Mar 24, 2021
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Mar 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants