Consistent RDF permalinks with content negotiation #146

Closed
@stain

We should use consistent permalinks in URIs across our RDF to identify a workflow or a workflow file.

Currently (v1.1) we have:

  • SPARQL uses Graph URIs like http://sparql:3030/cwlviewer/github.com/genome/cancer-genomics-workflow/blob/be7e682c6a2d0b24b949e022aeae7786bd8434ed/strelka/workflow.cwl which expose the origin of the git repository, its commit and the file path
    • Statements within such graphs contain URIs like file:///data/git/1a2b5d62cde8555e5932907b28189585a2bf99d2/fp_filter/workflow.cwl which expose the working directory of the git clone.
  • The research object's .ro/annotations/workflow.ttl annotation contains URIs like https://github.com/raw/common-workflow-language/workflows/master/workflows/make-to-cwl/dna.cwl#main

I propose we replace all of those (possibly with search-replace on the cwltool --print-rdf output) with a single location-free URI like: https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl
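The search-replace idea could look roughly like the sketch below. It is only an illustration: the function name `to_permalinks` and its parameters are hypothetical, and it assumes the caller already knows both the file: URI of the checkout directory and the full commit id of that checkout.

```python
PERMALINK_PREFIX = "https://w3id.org/cwl/view/git/"


def to_permalinks(rdf_text, checkout_uri, commit):
    """Replace file: URIs below one checkout with location-free permalinks.

    rdf_text     -- RDF serialization as text (e.g. captured cwltool output)
    checkout_uri -- file: URI of the git working directory (assumed known)
    commit       -- full sha1 of the commit that was checked out
    """
    # Normalize to exactly one trailing slash so only whole path
    # segments below the checkout directory are rewritten.
    old_prefix = checkout_uri.rstrip("/") + "/"
    new_prefix = PERMALINK_PREFIX + commit + "/"
    return rdf_text.replace(old_prefix, new_prefix)
```

A plain string replacement keeps any #anchor on the URI untouched, which is what the scheme needs. A real implementation might prefer to rewrite URIs on the parsed RDF graph instead, so that literals which happen to contain the same text are not changed.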

Permalink URI scheme

The new URI scheme is composed like this:

https://w3id.org/cwl/view/{scm}/{commit}/{path}#{anchor}
  • https://w3id.org/cwl/view/ fixed prefix at permalink service https://w3id.org/ (/cwl is our namespace)
  • {scm} - source code management protocol, currently only git supported
  • {commit} - full git commit sha1 id (no branches or short commits allowed)
  • {path} - relative path to .cwl file within a checkout of that git commit
  • #{anchor} - an optional anchor, e.g. #main as-is from cwltool --print-rdf ; not passed on to server
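As a minimal sketch of the scheme above, a client could build such a URI as follows. The function name `make_permalink` is hypothetical. Accepting only lowercase hex for the sha1 and percent-encoding unusual characters in the path are assumptions of this sketch, not something the proposal specifies.

```python
import re
from urllib.parse import quote

PERMALINK_PREFIX = "https://w3id.org/cwl/view/"
# Full 40-character sha1; branches and short commits are not allowed.
FULL_SHA1 = re.compile(r"^[0-9a-f]{40}$")


def make_permalink(commit, path, anchor=None, scm="git"):
    """Build https://w3id.org/cwl/view/{scm}/{commit}/{path}#{anchor}."""
    if scm != "git":
        raise ValueError("only git is currently supported")
    if not FULL_SHA1.match(commit):
        raise ValueError("commit must be a full 40-character sha1 id")
    # The path must stay inside the checkout of that commit.
    if path.startswith("/") or ".." in path.split("/"):
        raise ValueError("path must be relative to the checkout")
    # Assumption: percent-encode unusual characters, keep '/' as-is.
    uri = PERMALINK_PREFIX + scm + "/" + commit + "/" + quote(path)
    if anchor:
        uri += "#" + anchor
    return uri
```

For example, `make_permalink("933bf2a1a1cce32d88f88f136275535da9df0954", "workflows/lobSTR/lobSTR-workflow.cwl", "main")` gives the lobSTR permalink with the #main anchor appended.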

Anyone can construct a URI according to the above scheme for a given git commit and file - even if the commit only exists on a local disk or in a private git repository that the CWL Viewer does not know about.

These make good Linked Data identifiers for specific CWL workflow definitions because:

  • The cwl file and its neighbors can't change within the git commit
  • The URI is the same wherever the git repository is pushed or hosted

Anyone generating the URIs should be aware of some edge cases:

  • An uncommitted file change
  • CWL file is within a git submodule that may follow a movable branch (without any commits appearing in the parent git repository)
  • CWL file is not tracked in git repository (e.g. ../../outside.cwl)

Resolving

Resolving any URI starting with https://w3id.org/cwl/view/git/{rest} results in an HTTP 302 redirect to the corresponding resource https://view.commonwl.org/git/{rest}, which represents that path in that commit:

GET https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1

HTTP/1.1 302 Found
Location: https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl
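The prefix mapping behind this redirect can be sketched in a few lines. This is an illustration, not the actual w3id.org configuration (which would normally be a rewrite rule on their server); the function name `redirect_target` is hypothetical.

```python
from urllib.parse import urldefrag

W3ID_PREFIX = "https://w3id.org/cwl/view/git/"
VIEWER_PREFIX = "https://view.commonwl.org/git/"


def redirect_target(permalink):
    """Return the Location for the 302 redirect, or None if not a permalink."""
    # Clients never send the #anchor to the server, so drop it first.
    url, _fragment = urldefrag(permalink)
    if not url.startswith(W3ID_PREFIX):
        return None
    # Everything after the fixed prefix ({rest}) is passed on unchanged.
    return VIEWER_PREFIX + url[len(W3ID_PREFIX):]
```

Because the fragment is dropped before the request is made, a URI ending in #main resolves to the same Location as the one without it.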

Unknown commit?

If the public CWL Viewer has never heard of the commit 933bf2a1a1cce32d88f88f136275535da9df0954, there is not much more to say:

HEAD https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1

HTTP/1.1 404 Not Found

Unknown git commit `933bf2a1a1cce32d88f88f136275535da9df0954`

Content-negotiation

But if the commit is known and CWL Viewer finds a matching graph for that file in that commit, then the client can content-negotiate to get various RDF serializations like text/turtle or application/ld+json:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: text/turtle

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/turtle

@prefix cwl: <https://w3id.org/cwl/cwl#>.
<https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl#main> a cwl:Workflow .
#  ....

Notice how the returned RDF uses the location-independent w3id.org namespace, not view.commonwl.org.

YAML

If the client asks for the CWL file with type application/x-yaml or application/octet-stream, and the git repository has a public "raw" option, then the server can redirect to that:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/x-yaml

HTTP/1.1 302 Found
Vary: Accept
Location: https://cdn.rawgit.com/common-workflow-language/workflows/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl

GET https://cdn.rawgit.com/common-workflow-language/workflows/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/x-yaml

HTTP/1.1 200 OK
Content-Type: application/octet-stream

#!/usr/bin/env cwl-runner
cwlVersion: v1.0

class: Workflow
inputs:
    ...

HTML and JSON API

If the client asks for text/html, it is probably a browser, so CWL Viewer will redirect to the normal workflow rendering:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: text/html

HTTP/1.1 302 Found
Vary: Accept
Location: https://view.commonwl.org/workflows/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl

This also works for application/json, which then gives the JSON API output:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/json

HTTP/1.1 302 Found
Vary: Accept
Location: https://view.commonwl.org/workflows/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl

GET https://view.commonwl.org/workflows/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/json

HTTP/1.1 200 OK
Vary: Accept
Content-Type: application/json

{
    "retrievedFrom": {
        "owner": "common-workflow-language",
        "repoName": "workflows",
        "branch": "master",
        "path": "workflows/lobSTR/lobSTR-workflow.cwl",
        "url": "https://github.com/common-workflow-language/workflows/tree/master/workflows/lobSTR/lobSTR-workflow.cwl"
    },
    "retrievedOn": 1499175275743,
    "lastCommit": "920c6be45f08e979e715a0018f22c532b024074f",
    "label": "lobSTR-workflow.cwl",
   ...
}

Images

OK, let's be cool and do images as well.

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: image/svg+xml

HTTP/1.1 302 Found
Vary: Accept
Location: https://view.commonwl.org/graph/svg/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl

Research Object Bundle

...and of course our Research Object Bundle, if the client asks for application/ro+zip or application/zip:

GET https://view.commonwl.org/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/lobSTR/lobSTR-workflow.cwl HTTP/1.1
Accept: application/ro+zip

HTTP/1.1 302 Found
Vary: Accept
Location: https://view.commonwl.org/robundle/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl
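Taken together, the negotiation rules in the sections above amount to a lookup from the requested media type to an action. The sketch below shows one way to express that. The function names, the action labels and the "raw" label are hypothetical, the Accept parsing is deliberately simplified (wildcards such as */* are not handled), and the "raw" redirect is only possible when the git host offers a public raw option.

```python
# Media types that are answered directly with an RDF serialization.
RDF_TYPES = ("text/turtle", "application/ld+json")

# Media types answered with a 302 redirect. Values are the path prefixes
# seen in the examples, except "raw" (a label for the git host's raw URL).
REDIRECTS = {
    "application/x-yaml": "raw",
    "application/octet-stream": "raw",
    "text/html": "workflows",
    "application/json": "workflows",
    "image/svg+xml": "graph/svg",
    "application/ro+zip": "robundle",
    "application/zip": "robundle",
}


def parse_accept(header):
    """Return acceptable media types, highest q-value first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "explicitly not acceptable"
            entries.append((-q, position, parts[0].lower()))
    # Sort by descending q, then by original position in the header.
    return [media for _, _, media in sorted(entries)]


def negotiate(accept_header):
    """Pick an action for the first supported media type."""
    for media in parse_accept(accept_header):
        if media in RDF_TYPES:
            return ("serve-rdf", media)
        if media in REDIRECTS:
            return ("redirect", REDIRECTS[media])
    return ("not-acceptable", None)
```

Every response chosen this way depends on the request's Accept header, which is why each example carries Vary: Accept.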

Packed workflows

If there's a packed CWL file with nested workflows, then a workflow is not matchable by its filename alone, as you also need to know the #{anchor}. This is not a problem for the RDF output, as it will contain all workflows found in the packed CWL file, and you just match by #anchor.

However, it can be a problem for the HTML and JSON rendering, which with #103 would have alternative URIs depending on the selected nested workflow. So it could be confusing to redirect to the top-level workflow (if that can even be determined), as the user won't find their #nested1/step/nestedstep2 in there; we don't expand nested workflows in the UI.

So if the user asks for text/html or application/json for a packed workflow (multiple workflows found), then we'll give an error, with links to the candidates using #103 escaped URIs.

GET https://view.commonwl.org/git/adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/packed.cwl HTTP/1.1
Accept: text/html

HTTP/1.1 300 Multiple Choices
Vary: Accept
Content-Type: text/uri-list

https://view.commonwl.org/workflows/example.com/blob/adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/packed.cwl%23main
https://view.commonwl.org/workflows/example.com/blob/adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/packed.cwl%23nested1
https://view.commonwl.org/workflows/example.com/blob/adc83b19e793491b1c6ea0fd8b46cd9f32e592fc/packed.cwl%23nested2
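The candidate list in such a 300 response could be produced along these lines. This is a sketch with hypothetical function names; it assumes the #103 escaping is ordinary percent-encoding of the whole fragment (so a "/" inside an anchor would also be escaped), which the proposal does not spell out.

```python
from urllib.parse import quote


def candidate_uris(base, anchors):
    """Append each workflow anchor to the base URI, escaped as %23anchor."""
    # safe="" so that '#' and any '/' inside the anchor are both escaped.
    return [base + quote("#" + anchor, safe="") for anchor in anchors]


def uri_list_body(base, anchors):
    """Render a text/uri-list body: one URI per line, CRLF-terminated."""
    return "".join(uri + "\r\n" for uri in candidate_uris(base, anchors))
```

Escaping the anchor keeps it in the path part of the URI, so it actually reaches the server instead of being stripped by the client like a normal fragment.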
