python string wrapper? #218

Closed
ExpandingMan opened this issue Sep 7, 2022 · 5 comments

@ExpandingMan
Contributor

Converting strings from Python is of course really expensive because it involves a lot of copying, and not even of contiguous blocks of data. Once you get above a few MB of data, converting strings starts to look like a really bad option. Py objects can do a lot, but they are not AbstractString, so they don't really look like strings on the Julia end until you convert them.

Any interest in creating some kind of PyStr wrapper that provides an AbstractString interface for Py objects?

@cjdoris
Collaborator

cjdoris commented Sep 7, 2022

I've just done a quick benchmark with x = "a"^100_000_000 and y = Py(x):

  • x * "b" takes 17ms
  • Py(x) takes 33ms
  • pyconvert(String, y) takes 41ms

So conversion to/from Python strings appears to be within a small factor of optimal.

Is it that you don't want to pay this conversion cost at all, or have limited memory and therefore want a lazy wrapper?

I'm not against adding PyString, but AFAIU Python strings have no defined storage representation, so there is no good way to access their code points or do fast random access to their characters. I think the best you can do is explicitly encode the string as UTF-8 first, which is exactly what conversion to String does.

We could have a PyString wrapper type which simply uses ordinary Python indexing to access substrings and characters, but this will be sloooow.
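
To make the indexing problem concrete, here is a minimal Python-side sketch, not taken from the thread. It assumes only the standard `str.encode` method; the helper `byte_offset` is a hypothetical name introduced for illustration. It shows that Python indexes by code point, while the UTF-8 buffer a Julia String would hold is indexed by byte, so the two only line up for pure ASCII text.

```python
# Sketch: code point indexing versus the UTF-8 bytes produced by encoding.
# The sample text uses escapes so the code points are unambiguous.
s = "na\u00efve caf\u00e9 \u2615"

encoded = s.encode("utf-8")  # one explicit copy into a contiguous UTF-8 buffer

n_codepoints = len(s)    # len() on str counts code points
n_bytes = len(encoded)   # len() on bytes counts bytes


def byte_offset(text: str, i: int) -> int:
    """Byte offset in UTF-8 of code point i (hypothetical helper).

    This is O(i) because it encodes the prefix, which is the kind of cost
    a lazy wrapper would pay on each random access.
    """
    return len(text[:i].encode("utf-8"))


offset_of_v = byte_offset(s, 3)  # code point 3 sits after a 2-byte character
```

Because the offset has to be recomputed from the prefix, a wrapper built this way would likely not beat a single up-front conversion for read-heavy workloads.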

@ExpandingMan
Contributor Author

I think the problem is that it doesn't scale well. I don't have an MWE, but I have observed converting large sets of strings to be much more expensive. I think the reasons for this are the burden on the garbage collector and that the memory holding the strings is not in general contiguous, as it is in your example with one large string.

I'm aware that PyString would still have enormous disadvantages, but it might be "good enough" in some cases. For example, I recently had to get $\sim 10^7$ strings from Python but there were only $\sim 100$ distinct strings, and my task was much faster by encoding them with a hash or whatever rather than trying to copy them. In my case not having a PyString didn't matter much, but sometimes it may... on the other hand maybe in cases where you would really need a PyString it's just not worth it and you should just convert 🤷

I'm not sure it's the right approach, it was just a thought.
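
The "encode them with a hash" workaround described above can be sketched on the Python side roughly as follows. This is an illustrative guess at the approach, not the code actually used; `dictionary_encode` is a hypothetical name, and the choice of `array("q")` for the codes is an assumption made so that the codes live in one contiguous buffer.

```python
from array import array


def dictionary_encode(values):
    """Return (distinct, codes) such that values[i] == distinct[codes[i]].

    Hypothetical helper: each distinct string is stored once, and every
    occurrence is replaced by a small integer code.
    """
    index = {}           # string -> code
    distinct = []        # code -> string, in first-seen order
    codes = array("q")   # contiguous signed 64-bit integers
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(distinct)
            index[v] = code
            distinct.append(v)
        codes.append(code)
    return distinct, codes


values = ["red", "green", "red", "blue", "green", "red"]
distinct, codes = dictionary_encode(values)
```

With this layout only the short list of distinct strings would need string conversion, while the codes are plain integers; how those are then handed to Julia is left open here.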

@cjdoris
Collaborator

cjdoris commented Sep 7, 2022

The main use for a wrapper is to provide a zero-copy interface to a mutable object. A secondary use is to access only a small portion of a large container. I can't think of more uses than these two. If you don't have either of these uses (i.e. you are only reading, and will read most of the container) then usually you're better off eagerly converting the container instead.

Strings are immutable, which leaves only the second use, i.e. reading a small portion of a large string. But then you may as well just take the relevant substrings on the Python side before converting. Maybe that's not always possible (e.g. this is happening inside a function which is acting generically on strings).
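
As a rough illustration of taking the relevant substring on the Python side first, the sketch below slices a small window out of a large string so that only the window would need converting. The helper `window_around` and its `radius` parameter are hypothetical names, not part of any library.

```python
def window_around(text: str, needle: str, radius: int = 20) -> str:
    """Return the part of text within `radius` code points of the first needle.

    Hypothetical helper: returns an empty string when needle is absent.
    """
    pos = text.find(needle)
    if pos < 0:
        return ""
    start = max(0, pos - radius)
    return text[start : pos + len(needle) + radius]


# A large string of which only a small region is of interest.
big = "x" * 1_000_000 + "TARGET" + "y" * 1_000_000
small = window_around(big, "TARGET", radius=3)
```

Here `small` is a dozen characters long, so converting it costs next to nothing compared with converting `big`.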

@PallHaraldsson
Contributor

https://discuss.python.org/t/pep-686-make-utf-8-mode-default/14435/43
Python will default to UTF-8 mode in "Python 3.15 instead of 3.13"; some wanted it for Python 3.12. It seems it was postponed (as the default, I mean; the mode is already available as a non-default option in all currently supported Python versions).

I don't know when strings will internally be UTF-8 (as opposed to just for I/O), but I think they want to change that in a similar time frame.
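
The distinction between UTF-8 mode for I/O and the internal storage of strings can be checked with a short sketch. It relies on `sys.flags.utf8_mode` and `sys.getsizeof` from the standard library; the size comparison reflects a CPython implementation detail (the flexible string representation) and is an assumption about the interpreter in use, not something guaranteed by the language.

```python
import sys

# UTF-8 mode changes default encodings for I/O; it does not change how
# str objects are stored in memory.
utf8_mode_enabled = bool(sys.flags.utf8_mode)

# CPython detail: storage width per code point depends on the widest
# character in the string, so these three strings of equal length differ
# in size.
n = 1000
size_ascii = sys.getsizeof("a" * n)             # ASCII only
size_bmp = sys.getsizeof("\u20ac" * n)          # needs 2 bytes per code point
size_astral = sys.getsizeof("\U0001F600" * n)   # needs 4 bytes per code point
```

On CPython the three sizes grow in that order, which shows that the in-memory layout is not UTF-8 regardless of whether UTF-8 mode is on.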

@cjdoris
Collaborator

cjdoris commented Oct 10, 2022

That's interesting. If strings ever use UTF-8 internally then we could add a PyString wrapper. Until then I don't think a wrapper would gain much, so I'm going to close this issue for now. Feel free to reopen whenever.

@cjdoris cjdoris closed this as completed Oct 10, 2022