Skip to content

Fix(parse cs): Remove '>' char from path fields #104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 25, 2023

Conversation

arulths
Copy link
Contributor

@arulths arulths commented Apr 24, 2023

  • added a new list that includes any fields that may have paths in them
    • note that this list is not exhaustive, it only includes the path fields already defined in cryosparc2.py
  • strategy: remove the first occurrence of the ">" char in each path using str.replace()

fixes #102

@asarnow
Copy link
Owner

asarnow commented Apr 24, 2023

Cool, thanks! This is almost ready to be merged, can you instead add a PATH_FIELDS list to star.Relion containing the field constants (like e.g. COORDS3D or any of those lists)? The idea is to reference the field conversion dictionary as few times as possible for ease of maintenance. Also, instead of Series.apply() with a lambda it's probably faster to use Series.str.replace (I think df[field].str.replace(">", "", 1)).

Also - do you know if it's always exhaustively adding the > to the paths? Is there any way to end up with mixed > / no > in the same file? If not it would be nice to also test for the first path having a > so we can skip the code block if it's not needed.

@arulths
Copy link
Contributor Author

arulths commented Apr 25, 2023

Thanks @asarnow, I've made the modifications. I didn't end up removing the lambda as I tested this section with 1.2M rows and two path fields, here were my findings:

with lambda:

test 1

Replaced > with empty string in rlnMicrographName
Replaced > with empty string in ucsfImagePath
Time taken to replace > with empty string: 0.3381047248840332

test 2

Replaced > with empty string in rlnMicrographName
Replaced > with empty string in ucsfImagePath
Time taken to replace > with empty string: 0.3361241817474365

with Series.str.replace:

test 1

Replaced > with empty string in rlnMicrographName
Replaced > with empty string in ucsfImagePath
Time taken to replace > with empty string: 0.37374305725097656

test 2:

Replaced > with empty string in rlnMicrographName
Replaced > with empty string in ucsfImagePath
Time taken to replace > with empty string: 0.39196300506591797

Its not a big amount but it is slower, but let me know if you'd still like me to use Series.str.replace anyway

@asarnow
Copy link
Owner

asarnow commented Apr 25, 2023

Nice, thank you! I'm going to merge the PR now.

Keeping .apply() is fine, I guess Series.str.replace is doing some extra stuff that's a bit slower. There's a pandas issue about it here.

@asarnow asarnow merged commit 9c16200 into asarnow:master Apr 25, 2023
@arulths arulths deleted the issues/102/remove-cs-export-char branch April 25, 2023 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Converting exported .cs file includes> char in file paths
2 participants