Skip to content

Refactor csv_importer.py with Pandas #1116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 11, 2023
Merged

Conversation

dshemetov
Copy link
Contributor

@dshemetov dshemetov commented Mar 31, 2023

Updates #1103

Prerequisites:

  • Unless it is a documentation hotfix it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

Summary

I meant to make only a few specific changes, but I ended up making a lot of changes.

  • pandas.DataFrame.apply is just a Python for-loop under the hood, which is slow, so I rewrote these methods to use the built-ins, which rely on fast numpy functions
  • we don't need to fall back to str dtypes in the Pandas dataframe: it complicates code downstream, makes the code slower (because of redundant dtype casting), and, based on my log-digging in this thread, the code path hasn't occurred in over a year (and only then occurred once due to a bug which was fixed with a data type change)
  • fixed/updated a few exception handlings:
    • ValueError in the first exception in load_csv made the second exception statement unreachable, so removed it
    • added GeoTypeSanityCheckException, ValueSanityCheckException to handle a couple other exceptions
  • removed the old staticmethods that are unused after these changes
  • changed tests to conform to new data type assumptions and overall fixed tests to pass

* rely on Pandas built-ins more for speed
* assume Pandas data types are not strings
* update tests
@dshemetov dshemetov requested a review from BrainIsDead March 31, 2023 02:05
@github-advanced-security
Copy link

You have successfully added a new SonarCloud configuration ``. As part of the setup process, we have scanned this repository and found no existing alerts. In the future, you will see all code scanning alerts on the repository Security tab.

@dshemetov dshemetov requested a review from melange396 March 31, 2023 02:08
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 6 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

@BrainIsDead
Copy link
Contributor

Wow!!!!!! Looks perfect you are definitely better in pandas than me
image

@dshemetov
Copy link
Contributor Author

dshemetov commented Mar 31, 2023

Haha! I just spent the past 6 months trying to make JIT fast and that involved figuring out what's fast and what's not in Pandas. Most of that involved speed profiling, but here are a few good articles:

@BrainIsDead
Copy link
Contributor

Haha! I just spent the past 6 months trying to make JIT fast and that involved figuring out what's fast and what's not in Pandas. Most of that involved speed profiling, but here are a few good articles:

Thanks

@BrainIsDead BrainIsDead merged commit 5bd8408 into issue_1078 Apr 11, 2023
@BrainIsDead BrainIsDead deleted the ds/csv_importer_pandas branch April 11, 2023 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants