proposal: allow append to update slice capacity #16976

Closed

carl-mastrangelo opened this issue Sep 3, 2016 · 18 comments

@carl-mastrangelo
Contributor

carl-mastrangelo commented Sep 3, 2016

Background

When allocating byte slices, the len and cap values of the slice are set based on the arguments passed in to make:

foo = make([]byte, size, size) // len=size cap=size
foo = make([]byte, size) // len=size cap=size

From my understanding of the memory allocator, there is often a small amount of wasted space at the end of the underlying span when the requested size doesn't line up with a size class. Currently the only way to expand the capacity to match the available underlying storage is to call append, which allocates a new slice that does fill its span and sets the cap appropriately. This causes an unnecessary reallocation and copy.
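
For concreteness, the only way to reach those spare span bytes today looks like this (the capacities in the comments are what the current allocator's size classes suggest; they are illustrative, not guaranteed):

foo := make([]byte, 1500) // len=1500 cap=1500; the span is likely 1536 bytes
foo = append(foo, 0)      // len == cap, so this reallocates a larger span and copies all 1500 bytes
// len=1501; cap is now a bigger size class (e.g. 2048), and the 36 spare
// bytes of the original span were never usable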

Proposal

I would like to propose a minor, language-level change to the append built-in. If the passed-in slice has len == cap, and no further arguments are given to append, then the returned slice would have its capacity set to the maximum amount available. For example:

foo := make([]byte, 1500)
foo = append(foo)
// len=1500 cap=1536 

Note that this overload would be distinct from the variadic version called with an empty slice. The user's intended behavior is less clear in that scenario, and it might be surprising for the capacity to change when the appended argument happens to be empty. The intent is much clearer when there are no extra arguments to the append call at all.

As before, bar = append(foo) would not alter foo. I am not strongly tied to this exact solution, but there needs to be some good answer for optimizing Go program memory that doesn't involve knowing the exact underlying span sizes.

A secondary use case would be implementing a "prepend" operation on slices. When a slice needs to be reallocated for a prepend, the new slice should be as big as possible, to leave as much room as possible at the beginning. That reallocation has to be done manually, which means using make, which forces the capacity to something less than what the analogous append operation would have obtained.
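
A minimal sketch of that manual reallocation (prependBuf and its doubling policy are hypothetical, purely to illustrate the constraint):

// prependBuf keeps its data at the tail of buf so that future prepends
// can write into the free space at the front.
type prependBuf struct {
	buf   []byte
	start int // data lives in buf[start:]
}

func (p *prependBuf) prepend(x []byte) {
	if len(x) > p.start {
		size := len(p.buf) - p.start // bytes currently stored
		// make forces an exact capacity; there is no way to ask for
		// "as much as the underlying span really holds".
		n := make([]byte, 2*(len(x)+size))
		copy(n[len(n)-size:], p.buf[p.start:])
		p.start = len(n) - size
		p.buf = n
	}
	p.start -= len(x)
	copy(p.buf[p.start:], x)
}

func (p *prependBuf) bytes() []byte { return p.buf[p.start:] }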

A third place where this would be useful is bytes.Buffer. The function Buffer.grow implements slice resizing manually, which means it too must use make. It could expand its reallocated slices more effectively if their capacity were set to the true maximum.

@minux
Member

minux commented Sep 3, 2016 via email

@randall77
Contributor

This seems too minor an optimization to introduce a new language feature to support it.

You could have a big template slice s and allocate a new slice t using t = append([]int(nil), s[:n]...).

I think it would be nice if make([]int, n) were allowed to allocate cap > n if it wants (whereas make([]int, n, c) would need to allocate a cap of exactly c). That would have to wait for Go 2, though.
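
For example, adapted to bytes (whether and by how much cap exceeds n depends on the allocator's size classes, so the numbers are illustrative):

var template [4096]byte // zeroed source to copy from

// newBuf returns a slice with len == n whose cap is whatever append's
// growth path actually allocated, often the full size class.
func newBuf(n int) []byte {
	return append([]byte(nil), template[:n]...)
}

// newBuf(1500) would typically report len=1500 cap=1536 on amd64.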

@carl-mastrangelo
Contributor Author

@minux Perhaps it should be an implementation detail? I looked through the spec and didn't see anything saying that append must not return a slice with a larger capacity, only that it may. Other implementations could keep behaving as they do today and treat it as a no-op.

@randall77 I am not tied to the particular implementation; I would just like a way to use all the memory that has been allocated. The template slice is a nice idea, but won't that pay the penalty of also having to read from the template? I was aiming for a zero-copy implementation.

Definitely agree about the two-argument make being allowed to pick a smarter capacity, but as you said, that would have to wait for Go 2. My proposal seems like a backwards-compatible workaround in the meantime.

@randall77
Contributor

Yes, the template slice would mean an extra copy you wouldn't need otherwise. Should be easily cacheable though.

@carl-mastrangelo
Contributor Author

carl-mastrangelo commented Sep 3, 2016

@randall77 I believe the performance difference is nontrivial. I made a change that links runtime.roundupsize through to bytes.roundupsize. Correct me if I am wrong, but when growing a slice, this is the function that chooses the actual amount of space granted for a given memory request.

Using that, I changed bytes.makeSlice (which is used by Buffer.grow) to force the capacity to the maximum; I believe this is equivalent to the effect append would have under my proposal. I then wrote a benchmark that writes ~1KB slices to a bytes.Buffer a fixed number of times. Running it before and after shows fairly large gains once there are many writes:

Before:

//return make([]byte, n)
BenchmarkBufferMultipleWritesExact/1025x1-4              2000000           611 ns/op
BenchmarkBufferMultipleWritesExact/1025x2-4              1000000          1770 ns/op
BenchmarkBufferMultipleWritesExact/1025x4-4               300000          4506 ns/op
BenchmarkBufferMultipleWritesExact/1025x8-4               200000          9684 ns/op
BenchmarkBufferMultipleWritesExact/1025x16-4              100000         20517 ns/op
BenchmarkBufferMultipleWritesExact/1025x32-4               30000         41415 ns/op
BenchmarkBufferMultipleWritesExact/1025x64-4               20000         79752 ns/op
BenchmarkBufferMultipleWritesExact/1025x128-4              10000        158385 ns/op
BenchmarkBufferMultipleWritesExact/1025x256-4               5000        323424 ns/op
BenchmarkBufferMultipleWritesExact/1025x512-4               2000        753127 ns/op
BenchmarkBufferMultipleWritesExact/1025x1024-4              1000       1429573 ns/op
BenchmarkBufferMultipleWritesExact/1024x1-4              3000000           577 ns/op
BenchmarkBufferMultipleWritesExact/1024x2-4              1000000          2146 ns/op
BenchmarkBufferMultipleWritesExact/1024x4-4               300000          4701 ns/op
BenchmarkBufferMultipleWritesExact/1024x8-4               200000         10250 ns/op
BenchmarkBufferMultipleWritesExact/1024x16-4              100000         20055 ns/op
BenchmarkBufferMultipleWritesExact/1024x32-4               30000         40386 ns/op
BenchmarkBufferMultipleWritesExact/1024x64-4               20000         81937 ns/op
BenchmarkBufferMultipleWritesExact/1024x128-4              10000        151521 ns/op
BenchmarkBufferMultipleWritesExact/1024x256-4               5000        317768 ns/op
BenchmarkBufferMultipleWritesExact/1024x512-4               2000        724780 ns/op
BenchmarkBufferMultipleWritesExact/1024x1024-4              1000       1417779 ns/op
BenchmarkBufferMultipleWritesExact/1023x1-4              3000000           567 ns/op
BenchmarkBufferMultipleWritesExact/1023x2-4              1000000          1655 ns/op
BenchmarkBufferMultipleWritesExact/1023x4-4               300000          4384 ns/op
BenchmarkBufferMultipleWritesExact/1023x8-4               200000          9584 ns/op
BenchmarkBufferMultipleWritesExact/1023x16-4              100000         19736 ns/op
BenchmarkBufferMultipleWritesExact/1023x32-4               30000         40424 ns/op
BenchmarkBufferMultipleWritesExact/1023x64-4               20000         80950 ns/op
BenchmarkBufferMultipleWritesExact/1023x128-4              10000        155928 ns/op
BenchmarkBufferMultipleWritesExact/1023x256-4               5000        315266 ns/op
BenchmarkBufferMultipleWritesExact/1023x512-4               2000        733745 ns/op
BenchmarkBufferMultipleWritesExact/1023x1024-4              1000       1381548 ns/op

After:

//return make([]byte, n, int(roundupsize(uintptr(n))))
BenchmarkBufferMultipleWritesExact/1025x1-4              2000000           631 ns/op
BenchmarkBufferMultipleWritesExact/1025x2-4               500000          2091 ns/op
BenchmarkBufferMultipleWritesExact/1025x4-4               200000          5367 ns/op
BenchmarkBufferMultipleWritesExact/1025x8-4               200000          5578 ns/op
BenchmarkBufferMultipleWritesExact/1025x16-4              100000         12246 ns/op
BenchmarkBufferMultipleWritesExact/1025x32-4               50000         28473 ns/op
BenchmarkBufferMultipleWritesExact/1025x64-4               20000         64402 ns/op
BenchmarkBufferMultipleWritesExact/1025x128-4              10000        124943 ns/op
BenchmarkBufferMultipleWritesExact/1025x256-4               5000        299186 ns/op
BenchmarkBufferMultipleWritesExact/1025x512-4               2000        591824 ns/op
BenchmarkBufferMultipleWritesExact/1025x1024-4              1000       1202655 ns/op
BenchmarkBufferMultipleWritesExact/1024x1-4              2000000           586 ns/op
BenchmarkBufferMultipleWritesExact/1024x2-4              1000000          1748 ns/op
BenchmarkBufferMultipleWritesExact/1024x4-4               300000          4497 ns/op
BenchmarkBufferMultipleWritesExact/1024x8-4               300000          4678 ns/op
BenchmarkBufferMultipleWritesExact/1024x16-4              100000         10812 ns/op
BenchmarkBufferMultipleWritesExact/1024x32-4               50000         24608 ns/op
BenchmarkBufferMultipleWritesExact/1024x64-4               30000         52313 ns/op
BenchmarkBufferMultipleWritesExact/1024x128-4              10000        118654 ns/op
BenchmarkBufferMultipleWritesExact/1024x256-4              10000        231403 ns/op
BenchmarkBufferMultipleWritesExact/1024x512-4               2000        519558 ns/op
BenchmarkBufferMultipleWritesExact/1024x1024-4              1000       1051704 ns/op
BenchmarkBufferMultipleWritesExact/1023x1-4              2000000           588 ns/op
BenchmarkBufferMultipleWritesExact/1023x2-4              1000000          1742 ns/op
BenchmarkBufferMultipleWritesExact/1023x4-4               300000          4520 ns/op
BenchmarkBufferMultipleWritesExact/1023x8-4               300000          4870 ns/op
BenchmarkBufferMultipleWritesExact/1023x16-4              100000         20188 ns/op
BenchmarkBufferMultipleWritesExact/1023x32-4               30000         40593 ns/op
BenchmarkBufferMultipleWritesExact/1023x64-4               10000        104389 ns/op
BenchmarkBufferMultipleWritesExact/1023x128-4              10000        130177 ns/op
BenchmarkBufferMultipleWritesExact/1023x256-4               3000        359676 ns/op
BenchmarkBufferMultipleWritesExact/1023x512-4               2000        518353 ns/op
BenchmarkBufferMultipleWritesExact/1023x1024-4              1000       1054462 ns/op

The first number in the benchmark name is the size of the slice, chosen to try to trigger degenerate allocation patterns. The second is how many times that slice was written to the bytes.Buffer in the loop. I can upload the code and benchmark if necessary.

Once there are 8 or more writes, using the entire capacity of the underlying span becomes important. A common use case would be reading data off the network, which typically arrives in small chunks but may need to be compacted.
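
For reference, the benchmark has roughly this shape (a reconstruction from the names and description above, not the actual code; imports are bytes, fmt, and testing):

func BenchmarkBufferMultipleWritesExact(b *testing.B) {
	for _, size := range []int{1025, 1024, 1023} {
		for writes := 1; writes <= 1024; writes *= 2 {
			chunk := make([]byte, size)
			b.Run(fmt.Sprintf("%dx%d", size, writes), func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					var buf bytes.Buffer
					for j := 0; j < writes; j++ {
						buf.Write(chunk)
					}
				}
			})
		}
	}
}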

@robpike
Contributor

robpike commented Sep 4, 2016

This is too low-level and tricky. People have trouble understanding append as it is, and the proposal would make its behavior harder to understand.

I admit a bias, though: changes like this to a language to exploit a hidden detail of the implementation bother me.

@carl-mastrangelo
Contributor Author

carl-mastrangelo commented Sep 5, 2016

@robpike I admit it is tricky, but before I put this proposal together I actually tried this, expecting the described behavior. It has always been my expectation that append would return a slice with the appropriate capacity; that it did not do so with no extra arguments was the surprising part.

As for revealing implementation details: I think your argument would be stronger earlier in the history of Go, when the runtime was changing more often. There is still performance on the table, and the alternative, I think, would be overall worse: copy-pasting the current buffer allocation scheme. Once people do this, they will either be locked into it and miss out as the allocator evolves, or have to check the Go minor version to pick which copy of the allocator to use. By not doing this inside Go, you don't stop it from happening; you just force it into everyone's application, and badly. The application and the runtime need to align at least somewhat to avoid wasting CPU cycles. The implementation details eventually become the API, for better or worse.

@quentinmit added this to the Proposal milestone Sep 6, 2016
@carl-mastrangelo
Contributor Author

Friendly ping. Are there plans (or even a remote possibility) for the memory allocator to change substantially in ways that would make this proposal obsolete? Are there alternatives that would still accomplish "give me a slice of exactly this length, but I'm flexible on the capacity"?

@minux
Member

minux commented Sep 12, 2016 via email

@randall77
Contributor

@minux, that would be nice but I don't know if it is possible.
Let's say there is a 128-byte size class. How do you tell the difference between:

    b := make([]byte, 126)
    b = append(b, []byte{...128 bytes...})

and

    var s struct{data [126]byte; dontClobberMe byte }
    b := s.data[:]
    b = append(b, []byte{...128 bytes...})

In the first situation you can reuse the backing array, but in the second case you can't. And you can't tell them apart based simply on the type passed to append and the rounded-up size of the allocation.

@rsc
Contributor

rsc commented Sep 13, 2016

If len==cap, the maximum amount available is cap, not whatever is in the underlying array. Ignoring cap in this case during append would eliminate the semantics of the x[i:j:k] slice operation, which was added for a reason. I agree with Keith that it would be nice if make([]T, n) could return cap>n to begin with, but it cannot today.
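
Concretely (the variable names here are mine, for illustration):

pool := make([]byte, 64)
a := pool[0:16:16] // full slice expression: cap(a) deliberately pinned to 16
// len(a) == cap(a) == 16. Under the proposal, append(a) could return a
// slice with cap > 16, and later appends would overwrite pool[16:],
// exactly the sharing that x[i:j:k] exists to prevent.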

If exposing the underlying array size returned by make is important, then instead of overloading append, a better way would be to adjust the calls to make to ask for it explicitly, not try to reconstruct it later. For example, a make-with-maximal-cap syntax might be make([]T, n, _) or make([]T, n, ...).

@carl-mastrangelo
Contributor Author

@rsc If the make syntax were overloaded like that, what would be the appropriate thing to do for make(chan int, 100, ...) or make(map[int]string, 100, ...)?

Channels may not meaningfully benefit from having a flexible cap (they could; I just hadn't thought about it before now): they would fill up the available size, but channels don't resize. Maps would also be somewhat odd, since they are not as simple as flat slices. Is it possible to make better allocation decisions for either of these types given that flexibility?

@minux
Member

minux commented Sep 13, 2016 via email

@robpike
Contributor

robpike commented Sep 13, 2016

I remain opposed. We should try not to expose what are in essence implementation details through the language.

@carl-mastrangelo
Contributor Author

@minux & @robpike Would you prefer a library function instead? Perhaps something like runtime.ByteSlice(len int) []byte? It would only work for bytes, but I am not sure there are use cases for other types.

Also, would you mind clarifying your opposition? The change I originally proposed is an implementation change, not a language change. It would be possible to make append(foo) behave this way in this implementation of Go without harming or breaking other implementations; GCC Go would continue to work. The change as described is at best a minor performance perk for those who care, and at worst a no-op in other implementations.

@minux
Member

minux commented Sep 13, 2016

I think Russ already explained why the original proposal can't be accepted (reason 1):

  1. The cap of a slice is its maximum size, and any increase of it violates the language spec. Consider building your own memory allocator by allocating a large chunk of []byte and handing out small slices to other code via chunk[pos:pos+len:pos+len] (see the sketch below). In that case we definitely don't want the other code to extend the cap, as it would step on later allocations.
  2. If you're talking about only extending into free space not used by any other objects, my last reply explained why that's not a good idea, both implementation-wise (the GC bitmap would have to track objects at byte granularity) and efficiency-wise (if two goroutines try to expand the same slice, contention results).
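
A minimal sketch of the allocator pattern in point 1 (the arena type and method names are illustrative, not from any real package):

type arena struct {
	chunk []byte // one large allocation
	pos   int
}

// alloc hands out a sub-slice whose cap is pinned to its length with a
// full slice expression, so callers cannot append into the space
// reserved for later allocations.
func (a *arena) alloc(n int) []byte {
	s := a.chunk[a.pos : a.pos+n : a.pos+n]
	a.pos += n
	return s
}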

@carl-mastrangelo
Contributor Author

@minux Fair enough. What do you think about the library call?

If the Go team doesn't agree this is a problem, there probably isn't much point in continuing on this proposal.

@bradfitz
Contributor

I'm going to decline this proposal because it sounds like it's either not feasible for Go 1 or exposes too many internals.

This could be done in a third-party library, though. Just hard-code the slab sizes for each (Go version, arch) pair and provide your own make wrapper helpers that return nice sizes.

That way, Go isn't bound to that detail if the allocator changes later and, for example, stops using fixed slab sizes.
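
A sketch of such a helper (makeBytes is hypothetical, and the size-class values are an abridged, hand-copied subset of one Go release's allocator tables, which is exactly the per-version hard-coding this implies):

// Abridged size classes from the Go runtime allocator on 64-bit
// platforms circa Go 1.7; a real helper would carry the full table
// for each supported Go version and architecture.
var sizeClasses = []int{
	8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208,
	224, 240, 256, 512, 1024, 1152, 1280, 1408, 1536, 1792, 2048,
}

// makeBytes returns a slice of length n with cap rounded up to the
// next known size class, so the whole allocation is usable.
func makeBytes(n int) []byte {
	for _, c := range sizeClasses {
		if n <= c {
			return make([]byte, n, c)
		}
	}
	return make([]byte, n) // beyond the table: fall back to exact size
}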

@golang locked and limited conversation to collaborators Sep 13, 2017