
fragment insertion #2396

@sjanssen2


I hope this is the right place to discuss how we want to integrate fragment-insertion AKA SEPP into Qiita.

I think we agree that we want to cache placements of sequences into a given reference. Currently, the only reference is Greengenes 13.8, as provided in Siavash's repo.
Jose's suggestion was to use one HDF5 file to store placements. A placement looks like this:

placement = {
      "p": [[6454, -96408.34, 0.8250844, 0.0000070187216, 0.04252309],
        [6564, -96410.87, 0.066086836, 0.000006061417, 0.04802516],
        [6448, -96411.98, 0.02176576, 5.000002E-7, 0.048468385],
        [6449, -96411.98, 0.02176576, 5.000002E-7, 0.048468385],
        [6450, -96411.98, 0.02176576, 5.000002E-7, 0.048468385],
        [6447, -96411.98, 0.02176576, 5.000002E-7, 0.0484684],
        [6440, -96411.98, 0.021765757, 5.000002E-7, 0.048468385]],
      "nm": [["CGACGTGTGTACGTGTAGTGTCGGGATCGTAGTCGTAGTCGTAGTCGTAGTCGTGTGTACGTGTAGTCGTAGTC",
          1]]
    }

where placement["nm"][0][0] is the fragment sequence, and placement["p"] holds the alternative placements. SEPP later chooses one of those placements to insert into the reference tree. The first component of each placement, e.g. 6454, points to a node of the reference tree, i.e. the Greengenes tree.
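To make the structure concrete, here is a minimal sketch of pulling the fragment sequence and the candidate reference-tree nodes out of one placement record. The helper name `fragment_and_nodes` is hypothetical, and the sketch interprets only the first column of each row, as described above. The example record is shortened to the first two rows of the placement shown earlier.

```python
def fragment_and_nodes(placement):
    """Return the fragment sequence and the reference-tree node of each
    candidate placement (first column of every row in placement["p"])."""
    fragment = placement["nm"][0][0]
    nodes = [int(row[0]) for row in placement["p"]]
    return fragment, nodes


# Shortened copy of the example placement (first two candidate rows only).
placement = {
    "p": [[6454, -96408.34, 0.8250844, 0.0000070187216, 0.04252309],
          [6564, -96410.87, 0.066086836, 0.000006061417, 0.04802516]],
    "nm": [["CGACGTGTGTACGTGTAGTGTCGGGATCGTAGTCGTAGTCGTAGTCGTAGTCGTGTGTACGTGTAGTCGTAGTC",
            1]],
}

fragment, nodes = fragment_and_nodes(placement)
print(fragment, nodes)
```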

Thus, I'd like to suggest the following data organization:

import h5py

# Store the candidate placements under <reference>/<fragment sequence>.
with h5py.File('test1.hdf5', 'w') as f:
    f.create_dataset('gg13.8/%s' % placement["nm"][0][0], data=placement["p"])

I am an HDF5 newbie, so please double-check whether this is a good design!
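To help with checking the design, here is a rough sketch of a write and read round trip using the layout suggested above (one group per reference, one dataset per fragment sequence). The function names `store_placement` and `load_placement` are hypothetical, and opening the file in append mode is an assumption about how a shared cache file would be used. This is not a tested design for Qiita.

```python
import os
import tempfile

import h5py
import numpy as np

REFERENCE = "gg13.8"  # group name taken from the suggestion above


def store_placement(path, placement, reference=REFERENCE):
    """Write placement["p"] as a dataset named after the fragment sequence."""
    sequence = placement["nm"][0][0]
    with h5py.File(path, "a") as f:  # "a": create the file or append to it
        group = f.require_group(reference)
        if sequence not in group:  # keep an existing cached placement
            group.create_dataset(
                sequence, data=np.asarray(placement["p"], dtype="f8"))


def load_placement(path, sequence, reference=REFERENCE):
    """Return the cached placement rows as an array, or None if not cached."""
    with h5py.File(path, "r") as f:
        group = f.get(reference)
        if group is None or sequence not in group:
            return None
        return group[sequence][()]  # read the whole dataset into memory


# Round trip with a shortened copy of the example placement.
placement = {
    "p": [[6454, -96408.34, 0.8250844, 0.0000070187216, 0.04252309],
          [6564, -96410.87, 0.066086836, 0.000006061417, 0.04802516]],
    "nm": [["CGACGTGTGTACGTGTAGTGTCGGGATCGTAGTCGTAGTCGTAGTCGTAGTCGTGTGTACGTGTAGTCGTAGTC",
            1]],
}
path = os.path.join(tempfile.mkdtemp(), "test1.hdf5")
store_placement(path, placement)
rows = load_placement(path, placement["nm"][0][0])
missing = load_placement(path, "ACGT")
```

One consequence of this layout is that every row is stored as floating point, so the node index in the first column has to be cast back to an integer when it is read.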

My suggestion would be to make SEPP part of the deblur-qiita plugin. For each biom table, we could first check which sequences are already placed (a lookup in the above storage) and then run SEPP only on the remaining sequences. Newly found placements must then be reported back to the central storage, of course.
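The lookup step could be sketched roughly as follows. The names `split_by_cache` and `update_cache` are hypothetical, and `cache` is assumed to be any mapping-like object keyed by fragment sequence. A plain dict stands in for the central storage here, and actually running SEPP on the uncached sequences is left out.

```python
def split_by_cache(sequences, cache):
    """Split sequences into those already placed and those still to place.

    Returns (cached, to_place): a dict of sequence -> cached placement rows,
    and a list of sequences that SEPP still has to process.
    """
    cached = {}
    to_place = []
    for seq in sequences:
        if seq in cache:
            cached[seq] = cache[seq]
        else:
            to_place.append(seq)
    return cached, to_place


def update_cache(cache, new_placements):
    """Report newly computed placements back, never overwriting old entries."""
    for seq, rows in new_placements.items():
        if seq not in cache:
            cache[seq] = rows


# Toy example: one sequence is already cached, one is new.
cache = {"ACGTACGT": [[6454, -96408.34, 0.8250844, 0.0000070187216, 0.04252309]]}
cached, to_place = split_by_cache(["ACGTACGT", "TTGGCCAA"], cache)

# Stand-in for the result of a SEPP run on `to_place` (values copied from the
# example placement, used here only as illustrative data).
new_placements = {"TTGGCCAA": [[6564, -96410.87, 0.066086836, 0.000006061417, 0.04802516]]}
update_cache(cache, new_placements)
```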

Let us defer, for now, the details of how to expose the actual insertion tree (i.e. the result of adding fragments to the reference according to their placements, via the program guppy tog). There are alternative routes; @tanaes might want to comment on that.
For now, I think it is sufficient to create one tree per deblur table and/or one tree per meta-analysis from the placements in central storage.
