Implement ExtensionArray interface #701

Member

@jorisvandenbossche jorisvandenbossche commented Apr 2, 2018

First pass to implement the extension array interface. All interface tests are passing except the ones that rely on sorting.
I still need to integrate better with the block code to support the released version of pandas as well.

Some other tests are still failing: pandas-dev/pandas#20576, pandas-dev/pandas#20578

Member

@mrocklin mrocklin left a comment

I'm happy to see this PR. I took a brief look through and raised a few small comments. Feel free to ignore.

return self
from pandas import factorize
_, uniques = factorize(self)
return uniques
Member

Note that two geometries can have the same value but different pointers. Do we have a policy on what to do here?

Member Author

See the discussion for factorize

"""Return my 'dense' representation
@property
def nbytes(self):
    return self.data.nbytes
Member

This is probably an under-estimate. Geometries are generally far larger than the pointers. It's possible that GEOS has something for this, although I doubt it. If it's cheap we might consider converting to WKB and taking the length of that as an approximation.
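
The WKB-length idea could look roughly like the sketch below. It uses only the standard library and hypothetical names (`linestring_to_wkb`, `pointer_nbytes`, `wkb_nbytes`), assumes 8-byte pointers, and takes geometries as plain coordinate lists; it is not what this PR implements.

```python
import struct

POINTER_SIZE = 8  # assumption: 64-bit pointers in the underlying array


def linestring_to_wkb(coords):
    # Little-endian WKB for a 2D LineString: byte-order flag,
    # geometry type (2 = LineString), point count, then x/y doubles.
    header = struct.pack("<BII", 1, 2, len(coords))
    body = b"".join(struct.pack("<dd", x, y) for x, y in coords)
    return header + body


def pointer_nbytes(geoms):
    # What counting only the pointer array would report.
    return POINTER_SIZE * len(geoms)


def wkb_nbytes(geoms):
    # Approximate the real footprint by the serialized size.
    return sum(len(linestring_to_wkb(g)) for g in geoms)


lines = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(float(i), float(i)) for i in range(100)],
]
print(pointer_nbytes(lines))  # 16
print(wkb_nbytes(lines))      # 1650
```

Serializing every geometry just to measure it is not free, which is why the approximation only makes sense if the conversion is cheap.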

pandas.factorize
ExtensionArray.factorize
"""
return from_wkb(values)
Member

This implementation seems surprising given the method title.

Member

Ah I see, we represent geometries with the appropriate bytestring

Member Author

Yes, for now I went for using WKB as an intermediate representation for unique / factorize (and possibly also for sorting). The idea is that WKB is the cheapest representation to generate that can be used for algorithms like unique and factorize without implementing them ourselves. (In principle we could also code them in Cython, using the GEOS interface for equality and populating a table.)
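
As a rough illustration of the approach described here, the following standard-library sketch factorizes by WKB bytes instead of by object identity. It is not the code in this PR: `Point`, `point_to_wkb`, and `factorize_by_wkb` are hypothetical names, only 2D points are handled, and a plain dict stands in for the hash table that pandas would provide. Two distinct objects with equal coordinates serialize to the same bytes, so they receive the same code.

```python
import struct


class Point:
    """Hypothetical stand-in for a geometry object held by pointer."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def point_to_wkb(point):
    # Little-endian WKB for a 2D point: byte-order flag (1),
    # geometry type (1 = Point), then x and y as doubles.
    return struct.pack("<BIdd", 1, 1, point.x, point.y)


def factorize_by_wkb(geoms):
    """Return (codes, uniques), keyed on WKB bytes rather than identity."""
    table = {}    # WKB bytes -> integer code
    uniques = []  # first geometry seen for each code
    codes = []
    for geom in geoms:
        key = point_to_wkb(geom)
        if key not in table:
            table[key] = len(uniques)
            uniques.append(geom)
        codes.append(table[key])
    return codes, uniques


a, b, c = Point(1.0, 2.0), Point(1.0, 2.0), Point(3.0, 4.0)
codes, uniques = factorize_by_wkb([a, b, c, a])
print(codes)         # [0, 0, 1, 0]
print(len(uniques))  # 2
```

Note that this treats geometries as equal only when their serialized bytes match exactly, which is stricter than a geometric equality test.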

@jorisvandenbossche
Member Author

The remaining error on pandas master is a bug in extension arrays in pandas itself: pandas-dev/pandas#20743

codecov bot commented Apr 19, 2018

Codecov Report

Merging #701 into geopandas-cython will decrease coverage by 2.76%.
The diff coverage is 67.76%.

@@                 Coverage Diff                  @@
##           geopandas-cython     #701      +/-   ##
====================================================
- Coverage             91.32%   88.55%   -2.77%     
====================================================
  Files                    16       16              
  Lines                  1694     1783      +89     
====================================================
+ Hits                   1547     1579      +32     
- Misses                  147      204      +57

Impacted Files               Coverage Δ
geopandas/array.py           86.3% <56.6%> (-9.08%) ⬇️
geopandas/base.py            88.97% <61.53%> (-1.51%) ⬇️
geopandas/geoseries.py       86.16% <67.92%> (-7.02%) ⬇️
geopandas/geodataframe.py    94.78% <87.87%> (-1.81%) ⬇️
geopandas/io/file.py         89.61% <0%> (-5.2%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9cb9394...d7b0af0.

@jorisvandenbossche
Member Author

The ExtensionArray interface has been implemented in the meantime.
