Skip to content

Fix Simulation.filter_indices bug #816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from

Conversation

chris-langfield
Copy link
Collaborator

@chris-langfield chris-langfield commented Dec 21, 2022

This PR has been rewritten to fix a bug in Simulation where filter_indices was initialized as type float, causing some of the confusion that led to the original attempt at patching the metadata.

As discussed with @garrettwrong, casting to object actually does serve to protect the original dtype of the column, mostly as a workaround to pandas-dev/pandas#4094

I actually think the line of attack that this PR originally introduced could make the problem worse, so am tabling it for now.

Original content of the PR:


If you try to update a metadata column at only some indices, the entire column's dtype is set to object, erasing whatever dtype the column previously had.

This patch checks the existing dtype of the column and updates the entire series accordingly. Due to a known issue with pandas.DataFrame.update() this will not work (and will be difficult to make work) for int or bool types. See discussion in this thread.

This PR therefore clarifies the following expected behavior for setting metadata dtypes:

  • Setting, getting, and updating float32 columns is totally consistent. If the column was added as a float32, it can be updated even with other numeric types, and will still return as float32 values.
  • Same for float64
  • Int columns will be upcast to float64 if you try to partially update them. The workaround is to not do a partial update, but reset the entire column with the values you want, including ones that won't change.
  • Same for bool

These behaviors are tested in test_simulation_metadata and it also tests and has examples of how to do the workaround.

@chris-langfield chris-langfield self-assigned this Dec 21, 2022
@chris-langfield chris-langfield added bug Something isn't working invalid This doesn't seem right cleanup labels Dec 21, 2022
@codecov
Copy link

codecov bot commented Dec 21, 2022

Codecov Report

Merging #816 (f4a2de1) into develop (1187c1b) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           develop     #816   +/-   ##
========================================
  Coverage    88.25%   88.25%           
========================================
  Files          113      113           
  Lines         8891     8891           
========================================
  Hits          7847     7847           
  Misses        1044     1044           
Impacted Files Coverage Δ
src/aspire/source/image.py 97.47% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@chris-langfield
Copy link
Collaborator Author

This may not be possible to implement for integer or bool types.

https://stackoverflow.com/questions/36743563/preserve-dataframe-column-data-type-after-outer-merge

Doing so would involve doing an internal conversion to a Pandas ExtensionDtype

Although this type of int plays nice with DataFrame.update(), this will cause problems down the line when we try to get the metadata by using DataFrame.to_numpy(), which does not know what to do with ExtensionDtypes.

The upshot is there are two options:

  • Merge this patch, which preserves dtypes for float32 and float64 columns (maybe important for storing CTF parameters etc based on src.dtype and not accidentally changing precision level). However, in this case Int64 values are upcast to Float64 when you try to do an update(). We would have to remember throughout the codebase to set integer-valued columns all at once (i.e. with values of length n), even if we are just updating some indices.
  • Keep the old behavior, where partial updates always cast to object.

@chris-langfield chris-langfield changed the title Metadata dtype patch Fix Simulation.filter_indices bug Jan 3, 2023
@chris-langfield chris-langfield deleted the metadata_dtype_patch branch January 3, 2023 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cleanup invalid This doesn't seem right
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant