proposal: x/tools/diff: a package for computing text differences #58893

Closed
@adonovan

Description

@adonovan

Recent work in gopls resulted in the creation of an internal package for computing text differences in the manner of the UNIX diff command, for applying those differences to a file in the manner of the patch command, and for presenting line-oriented diffs using +/- prefix notation aka GNU "unified" diff format (diff -u). Diff functionality is invaluable for developer tools that transform source files, and for tests that compare expected and actual outputs. We propose to publish our diff package with the public API shown below.

// Package diff computes differences between text files or strings.
package diff // import "golang.org/x/tools/diff"

// -- diff --

// Strings computes the differences between two strings.
// The resulting edits respect rune boundaries.
func Strings(before, after string) []Edit

// Bytes computes the differences between two byte slices.
// The resulting edits respect rune boundaries.
func Bytes(before, after []byte) []Edit

// An Edit describes the replacement of a portion of a text file.
type Edit struct {
	Start, End int    // byte offsets of the region to replace
	New        string // the replacement
}

func (e Edit) String() string

// -- apply --

// Apply applies a sequence of edits to the src buffer and returns the
// result. Edits are applied in order of start offset; edits with the
// same start offset are applied in the order they were provided.
//
// Apply returns an error if any edit is out of bounds,
// or if any pair of edits is overlapping.
func Apply(src string, edits []Edit) (string, error)

// ApplyBytes is like Apply, but it accepts a byte slice.
// The result is always a new array.
func ApplyBytes(src []byte, edits []Edit) ([]byte, error)

// SortEdits orders a slice of Edits by (start, end) offset.
// This ordering puts insertions (end = start) before deletions
// (end > start) at the same point, but uses a stable sort to preserve
// the order of multiple insertions at the same point.
// (Apply detects multiple deletions at the same point as an error.)
func SortEdits(edits []Edit)

// -- unified --

// Unified returns a unified diff of the old and new strings.
// If the strings are equal, it returns the empty string.
// The old and new labels are the names of the old and new files.
func Unified(oldLabel, newLabel, old, new string) string

// ToUnified applies the edits to content and returns a unified diff.
// It returns an error if the edits are inconsistent; see [Apply].
// The old and new labels are the names of the content and result files.
func ToUnified(oldLabel, newLabel, content string, edits []Edit) (string, error)
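
The documented contract of Apply can be illustrated with a small standalone sketch. This is not the gopls implementation: the `Edit` type below is a local copy of the proposed struct, and the body of `Apply` is an assumption derived only from the doc comments above (stable ordering by (start, end), a bounds check, and an overlap check).

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// Edit is a local copy of the proposed diff.Edit:
// replace src[Start:End] with New.
type Edit struct {
	Start, End int
	New        string
}

// Apply is an illustrative reimplementation written from the doc
// comments in the proposal; the real package may differ in detail.
func Apply(src string, edits []Edit) (string, error) {
	// Work on a copy so the caller's slice is not reordered.
	sorted := append([]Edit(nil), edits...)
	// Stable sort by (Start, End): insertions at the same offset
	// keep the order in which they were provided.
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Start != sorted[j].Start {
			return sorted[i].Start < sorted[j].Start
		}
		return sorted[i].End < sorted[j].End
	})

	var b strings.Builder
	last := 0 // end offset of the previously applied edit
	for _, e := range sorted {
		if e.Start < 0 || e.End < e.Start || e.End > len(src) {
			return "", errors.New("edit out of bounds")
		}
		if e.Start < last {
			return "", errors.New("overlapping edits")
		}
		b.WriteString(src[last:e.Start]) // unchanged text before the edit
		b.WriteString(e.New)             // replacement text
		last = e.End
	}
	b.WriteString(src[last:]) // unchanged tail
	return b.String(), nil
}

func main() {
	out, err := Apply("hello world", []Edit{
		{Start: 6, End: 11, New: "gophers"}, // replace "world"
		{Start: 0, End: 0, New: ">> "},      // insert at the start
	})
	fmt.Println(out, err) // prints ">> hello gophers <nil>"
}
```

With this ordering, two deletions that start at the same offset are rejected by the overlap check, which matches the note on SortEdits above.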

@pjweinb @findleyr

Activity

added this to the Proposal milestone on Mar 6, 2023
pjweinb

pjweinb commented on Mar 6, 2023

@pjweinb

The idea of doing this seems fine, but perhaps the documentation should include a few caveats to cover the following points. (There are many ways to write diff, and sadly no best one. This one is for inputs that are fairly similar, or fairly short.)

  1. When the edit distance between the inputs is fairly small, the algorithm might find a minimal edit sequence, but it might not. All that is claimed is that applying the Edits to the 'before' gets the 'after'.
  2. There is an internal (inaccessible) parameter that controls the maximum length of the returned []Edit. If this bound is hit, the returned []Edit will have sensible edits at the beginning and end of the input, but a big replace in the middle.
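
To make the two caveats concrete, here is a deliberately crude sketch of the degenerate case: keep the common prefix and suffix, and replace everything in between with a single edit. The function `coarseDiff` and the helper `applyOne` are hypothetical names invented for this illustration; this is not the algorithm used in gopls. The result is far from minimal, but it still satisfies the only property being claimed, namely that applying the edits to 'before' yields 'after'.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Edit is a local copy of the proposed diff.Edit.
type Edit struct {
	Start, End int
	New        string
}

// coarseDiff (hypothetical) returns at most one edit: it preserves the
// common prefix and suffix and replaces the whole middle region.
// It steps one rune at a time so that, for valid UTF-8 input, the edit
// falls on rune boundaries.
func coarseDiff(before, after string) []Edit {
	if before == after {
		return nil
	}
	// Longest common prefix, measured in whole runes.
	p := 0
	for p < len(before) && p < len(after) {
		_, n1 := utf8.DecodeRuneInString(before[p:])
		_, n2 := utf8.DecodeRuneInString(after[p:])
		if n1 != n2 || before[p:p+n1] != after[p:p+n2] {
			break
		}
		p += n1
	}
	// Longest common suffix of the remainders, again in whole runes.
	bEnd, aEnd := len(before), len(after)
	for bEnd > p && aEnd > p {
		_, n1 := utf8.DecodeLastRuneInString(before[p:bEnd])
		_, n2 := utf8.DecodeLastRuneInString(after[p:aEnd])
		if n1 != n2 || before[bEnd-n1:bEnd] != after[aEnd-n2:aEnd] {
			break
		}
		bEnd -= n1
		aEnd -= n2
	}
	return []Edit{{Start: p, End: bEnd, New: after[p:aEnd]}}
}

// applyOne applies the zero or one edit produced by coarseDiff.
func applyOne(src string, edits []Edit) string {
	if len(edits) == 0 {
		return src
	}
	e := edits[0]
	return src[:e.Start] + e.New + src[e.End:]
}

func main() {
	before, after := "the quick brown fox", "the slow brown fox"
	edits := coarseDiff(before, after)
	fmt.Printf("%+v\n", edits)
	fmt.Println(applyOne(before, edits) == after) // prints "true"
}
```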
bcmills

bcmills commented on Mar 6, 2023

@bcmills
Contributor

See previously #23113, #41980.

moved this to Incoming in Proposals on Mar 6, 2023
ianlancetaylor

ianlancetaylor commented on Mar 6, 2023

@ianlancetaylor
Contributor

See also #45200.

DeedleFake

DeedleFake commented on Mar 6, 2023

@DeedleFake

Instead of SortEdits(), maybe func (Edit) Less(Edit) bool would be more general. It's pretty straightforward to plug in via slices.SortFunc(edits, diff.Edit.Less).

adonovan

adonovan commented on Mar 6, 2023

@adonovan
MemberAuthor

I agree that we should guarantee only that the composition of diff.Strings and diff.Apply is the identity, and nothing about the specific edits that it returns. We should probably also define the Unified text form in more detail.

Thanks for the links to related proposals. There's a fair bit of interest in both the narrow concept of text diff as proposed here, and in richer kinds of structured value diff for use in testing. If some form of the latter is accepted into the standard library, then perhaps simple text diff, on which it would depend, would also need to be in the standard library, though not necessarily exposed. I'm going to resist the temptation to argue that this should be a standard package. We can always do that later.

adonovan

adonovan commented on Mar 6, 2023

@adonovan
MemberAuthor

> Instead of SortEdits(), maybe func (Edit) Less(Edit) bool would be more general. It's pretty straightforward to plug in via slices.SortFunc(edits, diff.Edit.Less).

I tried that initially, but it turns out to be incorrect: it's imperative that you use sort.Stable for edits since insertions at the same point must preserve their relative order.
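
The stability requirement can be shown with a short sketch. The `Edit` type is a local copy of the proposed struct, and `less` and `sortEdits` are hypothetical helpers written for this example; `sort.SliceStable` from the standard library is used here, though any stable sort with the same comparison should behave the same way.

```go
package main

import (
	"fmt"
	"sort"
)

// Edit is a local copy of the proposed diff.Edit.
type Edit struct {
	Start, End int
	New        string
}

// less orders edits by (Start, End). It is a hypothetical stand-in
// for the Edit.Less method suggested in the thread.
func less(a, b Edit) bool {
	if a.Start != b.Start {
		return a.Start < b.Start
	}
	return a.End < b.End
}

// sortEdits uses a stable sort, so edits that compare equal (for
// example, two insertions at the same offset) keep their input order.
func sortEdits(edits []Edit) {
	sort.SliceStable(edits, func(i, j int) bool {
		return less(edits[i], edits[j])
	})
}

func main() {
	edits := []Edit{
		{Start: 5, End: 5, New: "first"},  // insertion at 5
		{Start: 5, End: 8, New: ""},       // deletion at 5
		{Start: 5, End: 5, New: "second"}, // another insertion at 5
		{Start: 0, End: 0, New: "head"},   // insertion at 0
	}
	sortEdits(edits)
	for _, e := range edits {
		fmt.Printf("%d-%d %q\n", e.Start, e.End, e.New)
	}
	// The two insertions at offset 5 stay in the order "first", "second",
	// and both come before the deletion at offset 5.
}
```

An unstable sort with the same comparison would be free to emit "second" before "first", which would change the text that Apply produces.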

DeedleFake

DeedleFake commented on Mar 6, 2023

@DeedleFake

> I tried that initially, but it turns out to be incorrect: it's imperative that you use sort.Stable for edits since insertions at the same point must preserve their relative order.

There is also a slices.SortStableFunc().

Alternatively, you could add a mechanism to Edit, such as a numbered priority, that would let edits maintain their relative ordering without relying on external context. Otherwise, if you need a function to sort a []Edit, the implicit assumption is that an unsorted slice is likely to be obtained from somewhere, but if the relative ordering of the elements of that slice is important then there's also an assumption that that slice will always already be partially ordered correctly. That seems kind of error-prone to me.

adonovan

adonovan commented on Mar 6, 2023

@adonovan
MemberAuthor

> the implicit assumption is that an unsorted slice is likely to be obtained from somewhere, but if the relative ordering of the elements of that slice is important then there's also an assumption that that slice will always already be partially ordered correctly. That seems kind of error-prone to me.

The definition of Apply makes clear that the slice of edits is a list, not a set: the relative ordering of insertions is important. But Apply can call SortEdits internally. Within gopls, we use SortEdits after merging lists of edits to the same file, but simple concatenation should suffice. It's also used to ensure a deterministic order, which some clients have mistakenly assumed.

Perhaps we should remove SortEdits from the API and let gopls implement its own copy of that function.

earthboundkid

earthboundkid commented on Mar 6, 2023

@earthboundkid
Contributor

Can the Strings and Bytes functions be unified behind [byteseq string|[]byte]? Perhaps diff.Of?

adonovan

adonovan commented on Mar 6, 2023

@adonovan
MemberAuthor

> Can the Strings and Bytes functions be unified behind [byteseq string|[]byte]? Perhaps diff.Of?

They could, but it seems like a lot of trouble just to achieve name overloading.

earthboundkid

earthboundkid commented on Mar 6, 2023

@earthboundkid
Contributor

The other way around, I feel like it's a lot of work to have duplicate Strings and Bytes functions that work the same way instead of having callers cast their []byte to string or having a single generic function.
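
For reference, the generic alternative discussed here might look roughly like the sketch below. The name `Of` comes from the suggestion above, but its signature is an assumption, and `Strings` is a trivial placeholder (whole-string replacement) standing in for the proposed diff.Strings so that the example is self-contained.

```go
package main

import "fmt"

// Edit is a local copy of the proposed diff.Edit.
type Edit struct {
	Start, End int
	New        string
}

// Strings is a placeholder for the proposed diff.Strings: it replaces
// the whole input with a single edit. The real function would compute
// finer-grained edits.
func Strings(before, after string) []Edit {
	if before == after {
		return nil
	}
	return []Edit{{Start: 0, End: len(before), New: after}}
}

// Of is a hypothetical generic entry point accepting either strings or
// byte slices. The conversion to string copies a []byte argument, which
// is the cost of sharing a single implementation this way.
func Of[T ~string | ~[]byte](before, after T) []Edit {
	return Strings(string(before), string(after))
}

func main() {
	fmt.Println(Of("abc", "abd"))
	fmt.Println(Of([]byte("abc"), []byte("abd")))
}
```

Both calls go through the same code path, so they return identical edits. Whether this is worth the extra type parameter compared with two named functions is the judgment call debated in the comments above.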
