Consistent RDF permalinks with content negotiation #146

Closed
stain opened this issue Aug 14, 2017 · 5 comments

@stain
Member

stain commented Aug 14, 2017

We should use consistent permalinks in URIs across our RDF to identify a workflow or a workflow file.

Currently (v1.1) we have:

  • SPARQL uses Graph URIs like http://sparql:3030/cwlviewer/github.com/genome/cancer-genomics-workflow/blob/be7e682c6a2d0b24b949e022aeae7786bd8434ed/strelka/workflow.cwl that expose the origin of the git repository, its commit and file path
    • Statements within such graphs contain URIs like file:///data/git/1a2b5d62cde8555e5932907b28189585a2bf99d2/fp_filter/workflow.cwl that expose the working directory for the git clone.
  • The research object's .ro/annotations/workflow.ttl annotation contains URIs like https://github.com/raw/common-workflow-language/workflows/master/workflows/make-to-cwl/dna.cwl#main

I propose we replace all of those (possibly with search-replace on the cwltool --print-rdf output) with a single location-free URI like: https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl
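The search-replace idea could look roughly like the sketch below. It is only an illustration: the function name and its parameters are hypothetical, and it assumes the caller already knows both the working directory of the clone and the commit that was checked out there.

```python
PERMALINK_PREFIX = "https://w3id.org/cwl/view/git/"


def rewrite_file_uris(rdf_text: str, workdir: str, commit: str) -> str:
    """Replace file:// URIs under `workdir` with location-free permalinks.

    Hypothetical helper: `workdir` is the absolute path of the git clone and
    `commit` is the full sha1 checked out there; both are supplied by the caller.
    """
    base = "file://" + workdir.rstrip("/") + "/"
    return rdf_text.replace(base, PERMALINK_PREFIX + commit + "/")
```

Plain string replacement keeps any #anchor intact, since only the prefix up to the checkout root is swapped.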

Permalink URI scheme

The new URI scheme is composed like this:

https://w3id.org/cwl/view/{scm}/{commit}/{path}#{anchor}
  • https://w3id.org/cwl/view/ - fixed prefix at the permalink service https://w3id.org/ (/cwl is our namespace)
  • {scm} - source code management protocol, currently only git supported
  • {commit} - full git commit sha1 id (no branches or short commits allowed)
  • {path} - relative path to .cwl file within a checkout of that git commit
  • #{anchor} - an optional anchor, e.g. #main as-is from cwltool --print-rdf ; not passed on to server

Anyone can construct a URI according to the above scheme for a given git commit and file - even if the commit only exists on a local disk or in a private git repository that the CWL Viewer does not know about.
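As a rough sketch of how a client might build such a URI locally (the function name is made up for illustration, and the percent-encoding of the path is an assumption not stated in the scheme):

```python
import re
from typing import Optional
from urllib.parse import quote

PERMALINK_PREFIX = "https://w3id.org/cwl/view/git/"
FULL_SHA1 = re.compile(r"^[0-9a-f]{40}$")


def build_permalink(commit: str, path: str, anchor: Optional[str] = None) -> str:
    """Compose {prefix}{commit}/{path}#{anchor} for a file in a git commit."""
    if not FULL_SHA1.match(commit):
        # Branch names and short commit ids are not allowed by the scheme.
        raise ValueError("commit must be a full 40-character sha1: %r" % commit)
    if path.startswith("/") or ".." in path.split("/"):
        # The path must be relative and stay inside the checkout.
        raise ValueError("path must be relative to the checkout: %r" % path)
    uri = PERMALINK_PREFIX + commit + "/" + quote(path, safe="/")
    if anchor:
        uri += "#" + anchor.lstrip("#")
    return uri
```

For example, build_permalink("933bf2a1a1cce32d88f88f136275535da9df0954", "workflows/lobSTR/lobSTR-workflow.cwl", "main") gives the #main URI used in the Turtle example further down.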

These make good Linked Data identifiers for specific CWL workflow definitions because:

  • The cwl file and its neighbors can't change within the git commit
  • The URI is the same wherever the git repository is pushed or hosted

Anyone generating the URIs should be aware of some edge cases:

  • An uncommitted file change
  • CWL file is within a git submodule which could be a movable branch (without any commits appearing on master git repository)
  • CWL file is not tracked in git repository (e.g. ../../outside.cwl)

Resolving

Resolving any URI starting with https://w3id.org/cwl/view/git/{rest} returns an HTTP 302 redirect to the corresponding resource https://view.commonwl.org/git/{rest}, which represents that path in that commit:

GET https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1

HTTP/1.1 302 Found
Location: https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl
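The mapping itself is a plain prefix swap. A minimal sketch, with a hypothetical function name (the real redirect would be configured at w3id.org rather than written in Python):

```python
from typing import Optional
from urllib.parse import urldefrag

W3ID_PREFIX = "https://w3id.org/cwl/view/git/"
VIEWER_PREFIX = "https://view.commonwl.org/git/"


def redirect_target(uri: str) -> Optional[str]:
    """Return the Location for a permalink, or None if the prefix does not match."""
    # The #anchor is never sent to the server, so drop it before mapping.
    uri, _fragment = urldefrag(uri)
    if not uri.startswith(W3ID_PREFIX):
        return None
    return VIEWER_PREFIX + uri[len(W3ID_PREFIX):]
```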

Unknown commit?

If the public CWL Viewer has never heard about the commit 933bf2a1a1cce32d88f88f136275535da9df0954, there is not much more to say:

HEAD https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1

HTTP/1.1 404 Not Found

Unknown git commit `933bf2a1a1cce32d88f88f136275535da9df0954`

Content-negotiation

But if the commit is known and CWL Viewer finds a matching graph for that file in that commit, then the client can content-negotiate to get various RDF serializations like text/turtle or application/ld+json:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: text/turtle

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/turtle

@prefix cwl: <https://w3id.org/cwl/cwl#>.
<https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl#main> a cwl:Workflow .
#  ....

Notice how the returned RDF uses the location-independent w3id.org namespace, not view.commonwl.org.

YAML

If the client asks for the CWL file with type application/x-yaml or application/octet-stream, and the git repository has a public "raw" option, then the server can redirect to that:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/x-yaml

HTTP/1.1 302 Found
Vary: Accept
Location: https://cdn.rawgit.com/common-workflow-language/workflows/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl

GET https://cdn.rawgit.com/common-workflow-language/workflows/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/x-yaml

HTTP/1.1 200 OK
Content-Type: application/octet-stream

#!/usr/bin/env cwl-runner
cwlVersion: v1.0

class: Workflow
inputs:
    ...

HTML and JSON API

If the user asks for text/html, it is probably a browser. So CWL Viewer will redirect to the normal workflow rendering:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: text/html

HTTP/1.1 302 Found
Vary: Accept
Location: https://view.commonwl.org/workflows/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl

This also works for application/json, which then gives the JSON API output:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/json

HTTP/1.1 302 Found
Vary: Accept
Location: https://view.commonwl.org/workflows/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl

GET https://view.commonwl.org/workflows/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/json

HTTP/1.1 200 OK
Vary: Accept
Content-Type: application/json

{
    "retrievedFrom": {
        "owner": "common-workflow-language",
        "repoName": "workflows",
        "branch": "master",
        "path": "workflows/lobSTR/lobSTR-workflow.cwl",
        "url": "https://github.com/common-workflow-language/workflows/tree/master/workflows/lobSTR/lobSTR-workflow.cwl"
    },
    "retrievedOn": 1499175275743,
    "lastCommit": "920c6be45f08e979e715a0018f22c532b024074f",
    "label": "lobSTR-workflow.cwl",
   ...
}

Images

OK, let's be cool and do images as well.

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: image/svg+xml

HTTP/1.1 302 Found
Vary: Accept
Location: https://view.commonwl.org/graph/svg/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl

Research Object Bundle

And of course we return our Research Object Bundle if the client asks for application/ro+zip or application/zip:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/ro+zip

HTTP/1.1 302 Found
Vary: Accept
Location: https://view.commonwl.org/robundle/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl
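Taken together, the media types above amount to a small dispatch table. The sketch below is one way to express it; the handler labels and function names are invented for illustration, and the Accept parsing is deliberately simplified (it honours q values but ignores wildcards such as */*).

```python
from typing import List, Optional, Tuple

# Media type -> what the server does with it (labels are illustrative only).
HANDLERS = {
    "text/turtle": "rdf",                   # 200 with RDF body
    "application/ld+json": "rdf",
    "application/x-yaml": "raw",            # 302 to the repository's raw file
    "application/octet-stream": "raw",
    "text/html": "workflows",               # 302 to the workflow page
    "application/json": "workflows",        # 302 to the JSON API
    "image/svg+xml": "graph/svg",           # 302 to the rendered graph
    "application/ro+zip": "robundle",       # 302 to the Research Object Bundle
    "application/zip": "robundle",
}


def parse_accept(header: str) -> List[str]:
    """Return media types from an Accept header, highest q first."""
    entries = []
    for index, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        # Sort by descending q, then by position in the header.
        entries.append((-q, index, media_type))
    # Entries with q=0 are explicitly not acceptable and are dropped.
    return [m for neg_q, _, m in sorted(entries) if neg_q < 0]


def choose_handler(accept: str) -> Optional[Tuple[str, str]]:
    """Pick the first acceptable media type we know how to serve."""
    for media_type in parse_accept(accept):
        if media_type in HANDLERS:
            return media_type, HANDLERS[media_type]
    return None
```

Because every branch depends on the Accept header, each response carries Vary: Accept, as in the transcripts above.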

Packed workflows

If there's a packed CWL file with nested workflows, then a workflow is not matchable by its filename alone, as you also need to know the #{anchor}. This is not a problem for the RDF output, as it will contain all workflows found in the packed CWL file, and you just match by #anchor.

However, it can be a problem for the HTML and JSON rendering, which with #103 would have alternative URIs depending on the selected nested workflow. So it could be confusing to redirect to the top-level workflow (if that can even be determined), as the user won't find their #nested1/step/nestedstep2 in there; we don't expand nested workflows in the UI.

So if the user asks for text/html or application/json for a packed workflow (multiple workflows found), then we'll give an error, with links to the candidates using #103 escaped URIs.

GET https://view.commonwl.org/git/adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/packed.cwl HTTP/1.1
Accept: text/html

HTTP/1.1 300 Multiple Choices
Vary: Accept
Content-Type: text/uri-list

https://view.commonwl.org/workflows/example.com/blob/adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/packed.cwl%23main
https://view.commonwl.org/workflows/example.com/blob/adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/packed.cwl%23nested1
https://view.commonwl.org/workflows/example.com/blob/adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/packed.cwl%23nested2
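A sketch of how the body of that 300 response could be assembled. The helper names are hypothetical, and it assumes the #103 escaping is ordinary percent-encoding of the # and the anchor, as the %23 in the example suggests.

```python
from typing import Iterable, List
from urllib.parse import quote


def candidate_uris(workflow_url: str, anchors: Iterable[str]) -> List[str]:
    """Append each workflow anchor to the URL with '#' escaped as %23."""
    return [workflow_url + quote("#" + a.lstrip("#"), safe="") for a in anchors]


def uri_list_body(uris: Iterable[str]) -> str:
    """Serialize as text/uri-list: one URI per line, CRLF-terminated."""
    return "".join(u + "\r\n" for u in uris)
```

Escaping the # matters here because an unescaped fragment would be stripped by the client before the request is sent, so the server could not tell the candidates apart.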
@stain
Member Author

stain commented Aug 14, 2017

Q: Should text/html redirect to the normal CWL Viewer, or do we need an RDFa HTML rendering so that we have a less confusing "meta" RDF resource we can link to from the normal HTML? (There should be an inverse prov:specializationOf link to the HTML page from the RDF)

@MarkRobbo
Member

It is possible to add the same version of a workflow twice or more to CWLViewer, by having multiple branch names referencing the same commit ID or by adding a version directly via the commit ID.

In the case of the standard Accept: text/html header, which version of the workflow should be linked to?

  • It is slightly confusing to just return the first one in the database, as this would redirect to an arbitrary branch name, e.g. master - even though this should currently point to the same commit ID anyway

  • If it is always the full form /git/fullcommitID/path.cwl, this might not already exist in the database and so would need to be processed before viewing

@stain
Member Author

stain commented Aug 22, 2017

I think it could make sense to always redirect to the full commitId version - I intended the https://w3id.org/cwl/v/ to be a permalink, so I'm not sure what the user would gain from going to the master branch or similar instead (other than knowing that the commit is on such a branch today).

One thing that could happen later is that master today is at commit deadbeef, and so we make https://w3id.org/cwl/v/git/deadbeef/wf.cwl in the RO for wf.cwl - someone downloads it and/or runs a workflow with this and starts talking about https://w3id.org/cwl/v/git/deadbeef/wf.cwl#step3. Now someone edits master (say deleting step3) and the new commit is fab31337 - they visit the master branch in the CWL Viewer and the master db entry is updated for the new commit.

Now someone else comes back again at https://w3id.org/cwl/v/git/deadbeef/wf.cwl to find out about that #step3 - they should be redirected to the deadbeef commit, not the progressed master branch.

If we present the permalink in the UI as well as in the RDF - according to Identifiers 21st century lesson 8 - then people might have put that link into their papers and so on, so we should keep supporting it. As we have to do that anyway, why not always go to the commit?

@stain
Member Author

stain commented Aug 22, 2017

One challenge is that "Explore" will be more polluted by older commits if we "never forget". It might make sense to optimize this somehow to prune public listing of commits that are no longer on a branch (if it was important it should have been!). We would still only keep around commit ROs etc that have been previously visited (and hence more noteworthy than an in-between commit) so it shouldn't be too explosive.

@stain
Member Author

stain commented Aug 24, 2017

I changed the prefix from https://w3id.org/cwl/v/git to https://w3id.org/cwl/view/git to avoid conflicts with CWL spec versions like https://w3id.org/cwl/v1.0

See wiki page https://github.com/common-workflow-language/cwlviewer/wiki/Permalinks


2 participants