Refactor csv_importer.py with Pandas #1116

dshemetov · 2023-03-31T02:05:45Z

Updates #1103

Prerequisites:

- Unless it is a documentation hotfix, it should be merged against the dev branch
- Branch is up-to-date with the branch to be merged with, i.e. dev
- Build is successful
- Code is cleaned up and formatted

Summary

I meant to make only a few specific changes, but I ended up making a lot of changes.

- pandas.DataFrame.apply is essentially a Python for-loop under the hood, which is slow, so I rewrote these methods to use the Pandas built-ins, which rely on fast NumPy functions
- we don't need to fall back to str dtypes in the Pandas dataframe: the fallback complicates code downstream, makes the code slower (because of redundant dtype casting), and, based on my log-digging in this thread, that code path hasn't been hit in over a year (and even then it was hit only once, due to a bug that was fixed with a data type change)
- fixed/updated some exception handling:
  - ValueError in the first exception in load_csv made the second exception statement unreachable, so I removed it
  - added GeoTypeSanityCheckException and ValueSanityCheckException to handle a couple of other exceptions
- removed the old staticmethods that are unused after these changes
- changed tests to conform to the new data type assumptions and fixed the remaining tests so they pass
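The unreachable-handler issue from the list above can be illustrated with a small, self-contained sketch. The function names and the exception pairing here are hypothetical and are not taken from csv_importer.py. The only point is that once an earlier `except` clause lists `ValueError`, a later `except ValueError` clause on the same `try` can never run.

```python
def parse_value_shadowed(raw):
    """Hypothetical example: the second handler is dead code."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        # ValueError is already caught here, so the clause below never runs.
        return "first handler"
    except ValueError:
        return "second handler (unreachable)"


def parse_value_fixed(raw):
    """One possible cleanup: keep a single handler for both error types."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


print(parse_value_shadowed("abc"))  # handled by the first clause
print(parse_value_fixed("1.5"))
```

Python accepts the shadowed version without a syntax error, which is why this kind of dead clause tends to be found only by review or static analysis.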

* rely on Pandas built-ins more for speed
* assume Pandas data types are not strings
* update tests
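As a rough illustration of the "data types are not strings" assumption, here is a minimal sketch that declares numeric dtypes when reading a CSV. The column names, the dtype map, and the threshold are hypothetical and do not reflect the importer's actual schema. With typed columns, validity checks can run directly on the numeric data without casting from strings first.

```python
import io

import pandas as pd

# Hypothetical schema; the real importer's columns and dtypes may differ.
DTYPES = {"value": "float64", "stderr": "float64", "sample_size": "float64"}

csv_text = "value,stderr,sample_size\n1.5,0.1,10\nNA,0.2,3\n"

# "NA" is among the markers read_csv treats as missing by default,
# so it becomes NaN in a float64 column.
df = pd.read_csv(io.StringIO(csv_text), dtype=DTYPES)

# Numeric checks work without any per-row string parsing.
missing_value = df["value"].isna()
small_sample = df["sample_size"] < 5  # hypothetical threshold

print(df.dtypes)
print(missing_value.tolist(), small_sample.tolist())
```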

github-advanced-security · 2023-03-31T02:06:53Z

You have successfully added a new SonarCloud configuration ``. As part of the setup process, we have scanned this repository and found no existing alerts. In the future, you will see all code scanning alerts on the repository Security tab.

sonarqubecloud · 2023-03-31T02:51:57Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
6 Code Smells

No Coverage information
0.0% Duplication

BrainIsDead · 2023-03-31T11:34:47Z

Wow! Looks perfect. You are definitely better at pandas than me.

dshemetov · 2023-03-31T20:53:04Z

Haha! I just spent the past 6 months trying to make JIT fast and that involved figuring out what's fast and what's not in Pandas. Most of that involved speed profiling, but here are a few good articles:

It helps to understand that Pandas stores its data as NumPy arrays, and the built-ins use fast vectorized NumPy methods implemented in C.
apply, on the other hand, uses Python for-loops and function calls.
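To make the difference concrete, here is a minimal sketch comparing an element-wise `apply` with the equivalent vectorized expression. The column name and the validity rule are made up for illustration and are not from csv_importer.py.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, -2.0, np.nan, 4.0]})

# The column is backed by a NumPy array.
print(type(df["value"].to_numpy()))


def is_invalid(v):
    """Hypothetical rule: missing or negative values are invalid."""
    return pd.isna(v) or v < 0


# Element-wise: calls the Python function once per value.
slow_mask = df["value"].apply(is_invalid)

# Vectorized: each operation runs over the whole array at once.
fast_mask = df["value"].isna() | (df["value"] < 0)

print(slow_mask.tolist() == fast_mask.tolist())
```

Both masks agree on this data; the gap in run time would only show up on larger frames, so profiling on realistic input is still the way to confirm any speedup.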

BrainIsDead · 2023-04-03T09:37:58Z

> Haha! I just spent the past 6 months trying to make JIT fast and that involved figuring out what's fast and what's not in Pandas. Most of that involved speed profiling, but here are a few good articles:
>
> It helps to understand that Pandas stores its data as Numpy arrays and the built-ins use fast vectorized Numpy methods that use C
>
> apply on the other hand, uses Python for-loops and function calls

Thanks

refactor: csv_importer

5de1dae

* rely on Pandas built-ins more for speed
* assume Pandas data types are not strings
* update tests

dshemetov requested a review from BrainIsDead March 31, 2023 02:05

dshemetov requested a review from melange396 March 31, 2023 02:08

refactor: improve a few error messages

175c900

BrainIsDead approved these changes Apr 11, 2023

BrainIsDead merged commit 5bd8408 into issue_1078 Apr 11, 2023

BrainIsDead deleted the ds/csv_importer_pandas branch April 11, 2023 14:27
