-
-
Notifications
You must be signed in to change notification settings - Fork 954
Implement ExtensionArray interface #701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ExtensionArray interface #701
Conversation
9214cff
to
9af7f8d
Compare
9af7f8d
to
56e6019
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to see this PR. I took a brief look through and raised a few small comments. Feel free to ignore.
geopandas/array.py
Outdated
return self | ||
from pandas import factorize | ||
_, uniques = factorize(self) | ||
return uniques |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that two geometries can have the same value but different pointers. Do we have have a policy on what to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the discussion for factorize
geopandas/array.py
Outdated
"""Return my 'dense' representation | ||
@property | ||
def nbytes(self): | ||
return self.data.nbytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably an under-estimate. Geometries are generally far larger than the pointers. It's possible that GEOS has something for this, although I doubt it. If it's cheap we might consider converting to WKB and taking the length of that as an approximation.
pandas.factorize | ||
ExtensionArray.factorize | ||
""" | ||
return from_wkb(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation seems surprising given the method title.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, we represent geometries with the appropriate bytestring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for now I went for using WKB as an intermediate representation for unique / factorize (and possibly also for sorting). With the idea that WKB is the cheapest representation to generate that can be used for algorithms like unique and factorize, without implementing them ourselves (in principle we could also code them in cython using the geos interface for equality and populating a table)
10c58b7
to
05162c7
Compare
Remaining error for pandas master is a bug in extension arrays in pandas: pandas-dev/pandas#20743 |
Codecov Report
@@ Coverage Diff @@
## geopandas-cython #701 +/- ##
====================================================
- Coverage 91.32% 88.55% -2.77%
====================================================
Files 16 16
Lines 1694 1783 +89
====================================================
+ Hits 1547 1579 +32
- Misses 147 204 +57
Continue to review full report at Codecov.
|
The ExtensionArray interface how been implemented in the mean time. |
First pass to implement the extension array interface. All interface tests are passing except the ones that rely on sorting.
Still need to better integrate with block code to support released version of pandas as well.
Some other tests are still failing: pandas-dev/pandas#20576, pandas-dev/pandas#20578