Skip to content

systematic, efficient approach to string construction #3

@StefanKarpinski

Description

@StefanKarpinski
Member

The current approach uses polymorphism to make RopeString objects. This is pretty inefficient for the typical small string use-case. To efficiently construct a C-style string in the current framework, one makes the current output stream a memio object and then prints to it. The general pattern I've used is to write a print_whatever function and then wrap it in a whatever function that returns a string using print_to_string. Should we stick with this pattern? It has the advantage of allowing the printing version to be very efficient, but it's kind of awkward to write. Should we figure out a different pattern? Something like C#'s StringBuilder pattern?

Perhaps it suffices to make strcat check the size and encodings of its arguments and use print_to_string approach to concatenate them into a copied string where appropriate — namely when the arguments are of compatible encodings (e.g. any mixture of ASCIIString and UTF8String), and if concatenated they would be below some size threshold. For larger strings, we should continue to use the RopeString approach. Also, string slices should copy their contents as well unless the resulting string is above the "large string" threshold, in which case, they can continue to use the current SubString with the known issue that this pins the superstring in memory.

Activity

JeffBezanson

JeffBezanson commented on May 3, 2011

@JeffBezanson
SponsorMember

Changing types based on string lengths makes it too hard to infer the types of these rather common operations. Instead, we should have the option to wrap a string as BigString(s) if s might be large, and BigString can use the memory-saving versions of these operations.
There's not much difference between print_to_string and StringBuilder. with_output_stream can be skipped in some cases by using write() with an explicit destination argument. It's also nice to be able to write output directly to an I/O endpoint without building a temporary string first.
Simple string building cases can also be handled by pushing characters into an array.

StefanKarpinski

StefanKarpinski commented on May 3, 2011

@StefanKarpinski
SponsorMemberAuthor

Makes sense. I can make the BigString change easily.

Is this an argument for continuing to implement core string building functionality by writing the printing version first and then defining the string creating version by applying print_to_string to the printing version?

JeffBezanson

JeffBezanson commented on May 3, 2011

@JeffBezanson
SponsorMember

Somewhat, but multiple approaches can be used. For example, if you're just
combining strings and characters you can use write() instead of going
through print_to_string. We might want to provide some nicer names for
memio, takebuf_string, and write, and make it look more like StringBuilder.
Or for something like strcat I would determine the size of the result,
allocate it once, and use memcpy.

The trouble is that if I do something like

write(io, strcat(a,b,c))

what you ideally want is to write each string without forming the temporary.
Even if strcat is written using an i/o buffer you don't get that
automatically here. I might have to say

strcat_to(io, a, b, c)

but that's not a very nice interface. If a, b, or c is a BigString though,
the strcat is done lazily and you get the desired behavior of writing all
the pieces with no copying. This seems to convince me that there's no
advantage to writing all the string functions in terms of printing. So do
whatever's simplest/fastest/convenient, and let BigString handle other
concerns. How's that sound?

print_escaped is a bit different since we know that a main use of it is
doing output. So strcat etc. doesn't necessarily need to imitate it.

On Tue, May 3, 2011 at 12:38 PM, StefanKarpinski <
reply@reply.github.com>wrote:

Makes sense. I can make the BigString change easily.

Is this an argument for continuing to implement core string building
functionality by writing the printing version first and then defining the
string creating version by applying print_to_string to the printing version?

Reply to this email directly or view it on GitHub:
#3 (comment)

ViralBShah

ViralBShah commented on Jul 9, 2011

@ViralBShah
Member

This seems like a 2.0 thing.

StefanKarpinski

StefanKarpinski commented on Jul 9, 2011

@StefanKarpinski
SponsorMemberAuthor

We're actually pretty good on this at this point. All strcat and string ref (substring) operations on ASCIIString and UTF8String objects use memcpy now, so they're fast and they don't create exotic string objects (RopeString, SubString, etc.). Repeating a string does create aRepString` object, but I think that's probably acceptable. I could make a copying implementation of that rather easily.

If someone wants to use a StringBuilder pattern, they can write the printing version and then use print_to_string on it. I feel like that's a reasonable approach if one is worried about strcat efficiency, with the added bonus of providing a version of the same functionality that can print without having to build a string at all.

I think this issue is not fully addressed, but well enough for v1.0 for now. Will reassign to v2.0.

JeffBezanson

JeffBezanson commented on Jul 9, 2011

@JeffBezanson
SponsorMember

Can I replace memcpy(a) with copy(a)?

StefanKarpinski

StefanKarpinski commented on Jul 9, 2011

@StefanKarpinski
SponsorMemberAuthor

Is copy(a::Array{Uint8,1}) as efficient as memcpy is?

JeffBezanson

JeffBezanson commented on Jul 9, 2011

@JeffBezanson
SponsorMember

It should be now that we changed copy_to to use memcpy for arrays where possible.

StefanKarpinski

StefanKarpinski commented on Jul 9, 2011

@StefanKarpinski
SponsorMemberAuthor

We can get rid of memcpy entirely then. I'll do it.

ViralBShah

ViralBShah commented on Jul 9, 2011

@ViralBShah
Member

We also need to experiment with some sizes at which memcpy is faster. It is actually slower for small arrays. Copy_to should have these smarts.

On 10-Jul-2011, at 12:43 AM, JeffBezansonreply@reply.github.com wrote:

It should be now that we changed copy_to to use memcpy for arrays where possible.

Reply to this email directly or view it on GitHub:
#3 (comment)

289 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

    Development

    No branches or pull requests

      Participants

      @StefanKarpinski@ViralBShah@JeffBezanson

      Issue actions

        systematic, efficient approach to string construction · Issue #3 · JuliaLang/julia