fragment insertion #2396

I hope this is the right place to discuss how we want to integrate fragment-insertion AKA SEPP into Qiita.

I think we agree that we want to cache placements of sequences into a given reference. Currently the only reference is Greengenes 13.8, as provided in [Siavash's repo](https://github.com/smirarab/sepp-refs/tree/4.3.4b).
Jose's suggestion was to use one HDF5 file to store the placements. A placement looks like this:

```
placement = {
      "p": [[6454, -96408.34, 0.8250844, 0.0000070187216, 0.04252309],
        [6564, -96410.87, 0.066086836, 0.000006061417, 0.04802516],
        [6448, -96411.98, 0.02176576, 5.000002E-7, 0.048468385],
        [6449, -96411.98, 0.02176576, 5.000002E-7, 0.048468385],
        [6450, -96411.98, 0.02176576, 5.000002E-7, 0.048468385],
        [6447, -96411.98, 0.02176576, 5.000002E-7, 0.0484684],
        [6440, -96411.98, 0.021765757, 5.000002E-7, 0.048468385]],
      "nm": [["CGACGTGTGTACGTGTAGTGTCGGGATCGTAGTCGTAGTCGTAGTCGTAGTCGTGTGTACGTGTAGTCGTAGTC",
          1]]
    }
```

where `placement["nm"][0][0]` is the fragment sequence and `placement["p"]` holds the alternative placements. SEPP chooses one of those placements and later inserts it into the reference tree. The first component of each placement, e.g. `6454`, points to a node of the reference tree, i.e. the Greengenes tree.
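To make the structure concrete, here is a minimal sketch that labels the columns of `placement["p"]` and picks one candidate. The column names in `FIELDS` are an assumption (the common jplace order; it is not stated in this issue), and `best_placement` is a hypothetical helper. Picking the highest `like_weight_ratio` is only for illustration and is not necessarily the rule SEPP itself applies.

```python
# ASSUMPTION: columns follow the common jplace order; this issue does not state it.
FIELDS = ("edge_num", "likelihood", "like_weight_ratio",
          "distal_length", "pendant_length")

placement = {
    "p": [[6454, -96408.34, 0.8250844, 0.0000070187216, 0.04252309],
          [6564, -96410.87, 0.066086836, 0.000006061417, 0.04802516],
          [6448, -96411.98, 0.02176576, 5.000002E-7, 0.048468385]],
    "nm": [["CGACGTGTGTACGTGTAGTGTCGGGATCGTAGTCGTAGTCGTAGTCGTAGTCGTGTGTACGTGTAGTCGTAGTC",
            1]],
}


def best_placement(placement):
    """Return the candidate with the highest like_weight_ratio as a dict.

    Hypothetical helper for illustration only.
    """
    rows = [dict(zip(FIELDS, row)) for row in placement["p"]]
    return max(rows, key=lambda row: row["like_weight_ratio"])


best = best_placement(placement)
print(best["edge_num"], best["like_weight_ratio"])
```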

Thus, I'd like to suggest the following data organization:

```
import h5py
# One dataset per fragment sequence, grouped under the reference name.
with h5py.File('test1.hdf5', 'w') as f:
    f.create_dataset('gg13.8/%s' % placement["nm"][0][0], data=placement["p"])
```

I am an HDF5 newbie, so please double-check whether this is a good design!

My suggestion would be to make SEPP a part of the deblur-qiita plugin. We could first check which sequences are already placed (a lookup against the above storage) and then run SEPP only on the reduced set, per biom table. Newly found placements must be reported back to the central storage, of course.
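The lookup step could be sketched roughly as follows. `split_cached` is a hypothetical helper, not existing plugin code; it only assumes that the cache supports `in` and `[]` keyed by the fragment sequence, which holds for a plain dict and should also hold for an h5py group such as `f['gg13.8']`.

```python
def split_cached(sequences, cache):
    """Partition sequences into already-placed ones and ones still needing SEPP.

    `cache` is any mapping keyed by fragment sequence (ASSUMPTION: a dict
    here; an h5py group would be used the same way).
    """
    cached, missing = {}, []
    for seq in sequences:
        if seq in cache:
            cached[seq] = cache[seq]   # reuse the stored placements
        else:
            missing.append(seq)        # must be run through SEPP
    return cached, missing


# Toy example with made-up sequences and a dict standing in for the storage.
cache = {"ACGT": [[6454, -96408.34, 0.8250844, 0.0000070187216, 0.04252309]]}
cached, missing = split_cached(["ACGT", "TTGA"], cache)
print(sorted(cached), missing)
```

Only the sequences in `missing` would be passed to SEPP, and their placements written back to the central storage afterwards.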

Let us defer the details of how to expose the actual insertion tree (i.e. the result of adding fragments to the reference according to their placements, via the program `guppy tog`) for now. There are alternative routes; @tanaes might want to comment on that.
For now, I think it is enough to create one tree per deblur table and/or one tree per meta-analysis from the placements in central storage.
