Skip to content

Better Importer interface #192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 28, 2018
Merged

Better Importer interface #192

merged 1 commit into from
Apr 28, 2018

Conversation

sbarzowski
Copy link
Collaborator

As discussed in #190

@sbarzowski
Copy link
Collaborator Author

@petr-k @sparkprime @anguslees

I haven't tested it beyond existing test suite, yet.

@coveralls
Copy link

coveralls commented Feb 18, 2018

Coverage Status

Coverage increased (+0.1%) to 73.418% when pulling bd91264 on sbarzowski:importer into fde815f on google:master.

@anguslees
Copy link
Contributor

anguslees commented Feb 18, 2018

I don't like the separation/race between canonicalise and import. Apart from regular race conditions, in my case I want to fetch from remote servers - and it is much more efficient to fetch content in the same operation as walking the search path (testing if the content actually exists).

Can we roll the two back into one Import operation, like before this PR? I think we can just document that it is important that FoundHere is in an absolute/canonicalised form.
Edit: Hrm, no I see that it isn't that simple (since we want to look for cache hits before even calling into the importer). .. Perhaps we move the cache into the importer? The only other alternative I can see is to expose more of the search logic so the caller can cache check at the right points (build srcfile-relative path; check abspath cache; attempt to read; check search-relative cache; foreach search path attempt to read; store positive/negative result in cache).

@sbarzowski
Copy link
Collaborator Author

sbarzowski commented Feb 19, 2018

@anguslees
If you're using urls to access remote servers, then Canonicalize can be a no-op, the paths are already canonical, i.e. it can be:

func (importer *RemoteImporter) Canonicalize(codePath, importedPath string) (string, error) {
	return importedPath, nil
}

The only requirement for canonical path is that importer knows how to fetch it, there is no need to check that it actually exists.

In case of file importer, I'm aware of the race, but it is completely benign. Jsonnet has every right to show an error if an imported file is deleted - and that's the only thing that can happen. And we need to check for existence in a bunch of locations anyway.

@sparkprime
Copy link
Collaborator

Don't we want the ability for a Jsonnet file hosted at http://foo/bar.jsonnet to do import "baz.jsonnet" and for that to be resolved to http://foo/baz.jsonnet ?

@sparkprime
Copy link
Collaborator

I wonder if we could let the importer do cacheing by passing it the cache singleton to use. I also think maybe we should cache the non-existence of files, to speed up future searches for the same thing.

@sbarzowski
Copy link
Collaborator Author

sbarzowski commented Feb 20, 2018

Don't we want the ability for a Jsonnet file hosted at http://foo/bar.jsonnet to do import "baz.jsonnet" and for that to be resolved to http://foo/baz.jsonnet ?

It doesn't change anything. The redirections can be completely transparent. In your example http://foo/bar.jsonnet doesn't depend on from where it is imported, so it is already a good canonical path.

Perhaps "canonicalize" gives a wrong idea that it needs to be the sole identifier for given content. Maybe getKey or something would be a better name. The only point is that it no longer depends on from where it was imported.

I wonder if we could let the importer do cacheing by passing it the cache singleton to use.

I think it wouldn't help. It would still need to first get some key to the cache, then check it, then fetch the data. It would make it easier to carry some additional data between getting the key and fetching, but I don't know what it could be.

I also think maybe we should cache the non-existence of files, to speed up future searches for the same thing.

It is something the importer could do internally. I see also other, potentially easier options - like for example caching JPATH lookups separately (still within the importer). Anyway it doesn't change the interface.

@anguslees
Copy link
Contributor

anguslees commented Feb 21, 2018

I think I'm missing something in this discussion:

The only requirement for canonical path is that importer knows how to fetch it, there is no need to check that it actually exists.

So let's walk through an example:

Let's say that http://foo/bar.jsonnet contains import "baz.libsonnet". I think this means:

  • Canonicalize("http://foo/bar.jsonnet", "baz.libsonnet") will presumably return "http://foo/baz.libsonnet"
  • check cache, missing
  • Import("http://foo/baz.libsonnet") is expected to return the content.

Now the above works fine for simple direct imports, but what if baz.libsonnet needs to be found along the search path? Now "http://foo/baz.libsonnet" is the wrong answer, and I need to go and probe for existence at Canonicalize-time, which is expensive (the local-file equivalent is in tryPath, also called by Canonicalize)

I can of course avoid the second fetch by pre-stuffing the result in a cache during Canonicalize, and making Import always hit this cache - but this seems like I'm working around (not with) this new proposed API. In particular, with this workaround Canonicalize now has side effects, while Import doesn't, and I have to maintain my own shadow-cache :/
It seems like it would have been better just to embed the cache in my Importer in the first place, and return Import to the dumb/simple Import(from, path) -> ImportedData API.

@sparkprime
Copy link
Collaborator

I don't see how

func (importer *RemoteImporter) Canonicalize(codePath, importedPath string) (string, error) {
	return importedPath, nil
}

could ever work if the importedPath is relative, since it would collide with the result of Canonicalize for different files (ones with different content).

@sbarzowski
Copy link
Collaborator Author

I don't see how [...] could ever work if the importedPath is relative...

Ok, I assumed that if you use URLs then you always use an absolute one, hence some misunderstandings. Of course if you want to have imports relative to the source and search path for remote stuff, then it gets more complicated.

@sbarzowski
Copy link
Collaborator Author

@anguslees
How would the embedded cache work? My guess is that it would keep contents (or nonexistence) of each URL it tries.

@sbarzowski
Copy link
Collaborator Author

sbarzowski commented Feb 21, 2018

Hmmm... I'm getting more and more convinced that we should just leave the caching of file contents to the importer, since we're talking about having some sort of cache in the importer anyway. It's also very flexible.

So now, my plan is to:

  1. Change the Importer interface to Import(from, path) -> (contents, foundAt, err). The first argument (from) would be a path to the file, not its directory like it is now (to satisfy Import path should be opaque #190).
  2. Decouple code cache and data (file contents) cache. The code cache would use foundAt as key, so that each file would be parsed and analyzed exactly once.
  3. The Importer wouldn't care about code cache or any Jsonnet data structures.
  4. It would be recommended that the Importer implements cache, but it will be an implementation detail, the rest of the code won't care about it.

Does it make sense?

@sparkprime
Copy link
Collaborator

sparkprime commented Feb 22, 2018

Yes except we should encourage people to cache their I/O, otherwise it might expose the laziness.
If you can keep a simple key/value store in core that makes it trivial for the importer to do it, that would be awesome.

@anguslees
Copy link
Contributor

Sgtm.

@anguslees
Copy link
Contributor

Oh, I encourage you to break the exisiting function signature in some way to force users to reevaluate their use of this API as part of the upgrade. (ie don't just change the semantics of the first string arg)

@sparkprime
Copy link
Collaborator

I think we're on first name terms with all of the users of this API so this should be OK :)

@Quentin-M
Copy link

I just briefly asked @sparkprime in Slack if there is reason MakeImportCache is exported if it doesn't implement the Importer interface?

It seems that the VM is very stateless at this point, but it would be very beneficial to have the imported libraries cached across multiple evaluate calls. When working with libraries such as ksonnet-lib, it takes ~1.9s on an Intel i7 (Kaby Lake) @ 3.1 GHz, just to evaluate the following snippet. Those quickly add up when generating several files and it would be silly to use create a snippet that aggregate those files to do an evaluate multi for the purpose of caching the libraries, making the operator loose 1/ ability to debug the snippets easily 2/ ability to pass different ExtVar/ExtCode/TLA to those snippets 3/ file-grained control over what gets executed and when 4/ etc.

local k = import "k.libsonnet";
k.core.v1.list.new([])

I feel it would be reasonable to cache those in the VM/Importer.
I do not thing that the libraries would usually change during the lifespan of a VM.
If they do, the operator is most-likely aware about it, and a new VM/Importer instance can be created.

@sbarzowski
Copy link
Collaborator Author

@Quentin-M
Thanks, these are good points.

What takes time when importing k.libsonnet is parsing and analyzing it, so code (ast) cache is the important part. Importer will only have bytes-level cache (doesn't know about the code), so further changes will be needed.

@sbarzowski
Copy link
Collaborator Author

@sparkprime I've updated it. Now Importer is responsible for data-level caching. The file cache now caches existence/contents of every tried absolute path.

imports.go Outdated
importer Importer
foundAtVerification map[string]*string
codeCache map[string]potentialValue
importer Importer
}

// MakeImportCache creates and ImportCache using an importer.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creates an ImportCache

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also capitalize importer

imports.go Outdated
func (cache *ImportCache) importData(key importCacheKey) *ImportCacheValue {
if cached, ok := cache.cache[key]; ok {
return cached
func (cache *ImportCache) importData(codePath, importedPath string) (contents *string, foundAt string, err error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be consistent with the parameter names, because in Importer they are importedFrom, path

imports.go Outdated
}

type importCacheMap map[importCacheKey]*ImportCacheValue

// ImportCache represents a cache of imported data.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add:

While the user-defined Importer implementations are required to cache file content, this cache is an additional layer of optimization that caches values (i.e. the result of executing the file content). It also verifies that the content pointer is the same for two foundAt values.

imports.go Outdated
// returned on subsequent calls.
// b) for given foundAt, the contents are always the same
//
// Note that by "the same contents" we mean the same pointer and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a little bit weird (if you haven't read the implementation) to have to return the same pointer. We could alternatively say that "If you return the same foundAt for multiple imports, we use the content for the first one." That would mean cacheing the LiteralString in the codeCache.

On the other hand, it's not a big deal for implementers to actually meet these requirements, so I'm happy either way.

imports.go Outdated
// Import imports a map entry.
func (importer *MemoryImporter) Import(dir, importedPath string) (*ImportedData, error) {
// Import fetches data from a map entry.
func (importer *MemoryImporter) Import(codePath, importedPath string) (contents *string, foundAt string, err error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about parameter naming

imports.go Outdated
func (importer *FileImporter) Import(dir, importedPath string) (*ImportedData, error) {
found, content, foundHere, err := tryPath(dir, importedPath)
// Import imports file from the filesystem.
func (importer *FileImporter) Import(codePath, importedPath string) (contents *string, foundAt string, err error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about parameter naming

imports.go Outdated
if os.IsNotExist(err) {
entry = &fsCacheEntry{
exists: false,
contents: "",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could omit this line

}

// Import imports a map entry.
func (importer *MemoryImporter) Import(dir, importedPath string) (*ImportedData, error) {
// Import fetches data from a map entry.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics of this are a bit weird now I think -- all paths are treated as absolute keys into the "Data" map. That probably should be documented, although I understand we only use this for tests so it does not matter?

@sparkprime
Copy link
Collaborator

LGTM, I'd like to hear @anguslees opinions.

@Quentin-M This does not address your request, but that is strictly orthogonal to this (a case of re-using the FileImporter between VM calls). We should do it in a separate PR.

@sbarzowski
Copy link
Collaborator Author

@sparkprime I have just applied the suggestions and I changed the interface slightly (Contents struct) to catch some caching problems. This way it's clear what it means that "the contents" are the same, the user has no way to change the underlying string. The same could be done with builtin string (just a string, not a pointer), but to have guaranteed O(1) comparison here, I would need to use reflection to get StringHeader, so I concluded that a wrapper is less ugly.

@sparkprime
Copy link
Collaborator

LGTM @anguslees anything to say here? I'll merge if not.

@sparkprime sparkprime merged commit f4428e6 into google:master Apr 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants