Skip to content

Conversation

petrelharp
Copy link
Contributor

@petrelharp petrelharp commented Sep 10, 2021

Here is a start at adding metadata_vector. I haven't figured out the right way to get keys that are a few nested levels down, though - do you know an elegant way to do this, @benjeffery?

Fixes #1676

@benjeffery
Copy link
Member

For getting the nested keys how about something like:
reduce(lambda d, k: d.get(k, default) if isinstance(d, Mapping) else default, key, row.metadata)

You'd need to check if key was a list first.

@petrelharp
Copy link
Contributor Author

Wow that is much more elegant than I would have done.

@petrelharp
Copy link
Contributor Author

Needs some more tests for weird metadata, but this seems to work.

@petrelharp
Copy link
Contributor Author

Ok, I think this is good to go?

@codecov
Copy link

codecov bot commented Sep 10, 2021

Codecov Report

Merging #1690 (3118b53) into main (936bf86) will increase coverage by 0.00%.
The diff coverage is 94.11%.

❗ Current head 3118b53 differs from pull request most recent head 35716ff. Consider uploading reports for the commit 35716ff to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1690   +/-   ##
=======================================
  Coverage   93.79%   93.79%           
=======================================
  Files          27       27           
  Lines       23578    23595   +17     
  Branches     1085     1089    +4     
=======================================
+ Hits        22114    22130   +16     
- Misses       1429     1430    +1     
  Partials       35       35           
Flag Coverage Δ
c-tests 92.04% <ø> (ø)
lwt-tests 93.49% <ø> (ø)
python-c-tests 95.46% <94.11%> (-0.01%) ⬇️
python-tests 98.75% <94.11%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/tables.py 98.82% <94.11%> (-0.10%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 936bf86...35716ff. Read the comment docs.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, I think we need to handle the default differently though.

def metadata_vector(self, key, *, dtype=None, default_value=None):
"""
Returns a numpy array of metadata values obtained by extracting ``key``
from each metadata entry, and inserting ``default_value`` if the key is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"inserting" suggests that we might be modifying the metadata

Suggested change
from each metadata entry, and inserting ``default_value`` if the key is
from each metadata entry, and using ``default_value`` if the key is

if isinstance(key, list):
out = np.array(
[
functools.reduce(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use from functools import reduce you save an attribute look up in the tight loop. Same with Mapping below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woah! ok!

out = np.array(
[
functools.reduce(
lambda d, k: d.get(k, default_value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only want to use get if a default value is supplied so that KeyError is raised for a missing entry. As None could be a valid desired default, you'll need a singleton class NO_DEFAULT to check for and use as the default kwarg.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh dear

self.ll_table.metadata_schema
)

def metadata_vector(self, key, *, dtype=None, default_value=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea but you could make key optional and return the whole metadata object if it is not specified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm - nice idea, but that might encourage bad habits? Currently, metadata has to be a dictionary (not eg a single number), so I'm not sure about this. We could always extend it later to cover this case without breaking this API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Peter here

@petrelharp
Copy link
Contributor Author

Thanks! I think I took care of those things. Too bad python doesn't have the missing() function.

@molpopgen
Copy link
Member

I haven't scanned the diffs, but curious if this works/is tested on both codec flavors? (JSON and struct)

@petrelharp
Copy link
Contributor Author

Whoops - I was too clever with the tests, and hadn't caught that it was doing totally the wrong thing. Since strings are a Sequence, when I asked for key="abc" it was treating that the same as key=["a", "b", "c"]. Thanks for catching that, codecov!

@petrelharp
Copy link
Contributor Author

I haven't scanned the diffs, but curious if this works/is tested on both codec flavors? (JSON and struct)

It certainly should work - this is only using things in the layer that the codec is invisible!

@benjeffery
Copy link
Member

benjeffery commented Sep 14, 2021

@petrelharp I've added a commit here with an extra test for deep KeyError and a tweak to the default value so that it looks better in the docs. Let me know what you think.

# we override the meta so that it looks good in the docs.
class NotSetMeta(type):
def __repr__(cls):
return "Not set"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah-ha! nice magic!

@petrelharp
Copy link
Contributor Author

Rebased; should be ready to merge.

@benjeffery
Copy link
Member

Added changelog - merging.

@mergify mergify bot merged commit 7fff7bc into tskit-dev:main Sep 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vectorised metadata extraction
4 participants