VARGRAM submission #243

Open · 15 of 32 tasks

cjpalpallatoc opened this issue May 9, 2025 · 2 comments

cjpalpallatoc commented May 9, 2025

Submitting Author: (@cjpalpallatoc)
All current maintainers: (@cjpalpallatoc)
Package Name: VARGRAM
One-Line Description of Package: A Python visualization tool for genomic surveillance
Repository Link: https://github.com/pgcbioinfo/vargram
Version submitted: 0.3.0
EiC: @coatless
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:

During a viral outbreak, the diversity of sampled sequences often needs to be determined quickly to understand the evolution of a pathogen. VARGRAM (Visual ARrays for GRaphical Analysis of Mutations) empowers researchers to quickly generate a mutation profile to compare batches of sequences against each other and against a reference set of mutations. A publication-ready profile can be generated in a couple of lines of code by providing sequence files (FASTA, GFF3) or tabular data (CSV, TSV, Pandas DataFrame). When sequence files are provided, VARGRAM leverages Nextclade CLI to perform mutation calling. We have user-friendly installation instructions and tutorials on our documentation website.
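
To illustrate, a minimal sketch of the intended workflow (the file paths below are placeholders; see the quick start on our documentation site for a runnable version):

from vargram import vargram

# Point VARGRAM at a batch of sequences plus a reference genome and gene
# annotation; with sequence files, mutation calling is delegated to Nextclade CLI.
vg = vargram(seq='covid_samples/',        # directory of sample FASTA files (placeholder)
             ref='sc2_wuhan_2019.fasta',  # reference genome (placeholder)
             gene='sc2.gff')              # GFF3 gene annotation (placeholder)
vg.profile()  # build the mutation profile
vg.show()     # render the publication-ready figure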

Scope

  • Please indicate which category or categories.
    Check out our package scope page to learn more about our
    scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization¹
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific

  • Geospatial
  • Education

Community Partnerships

If your package is associated with an existing community, please check below:

  • For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):

    • Who is the target audience and what are scientific applications of this package?
      We hope that VARGRAM will be useful for researchers, analysts, and students in the field of molecular epidemiology/genomic surveillance. During the pandemic, we used an early mutation profile script to characterize emergent variants and potential recombinants.

    • Are there other Python packages that accomplish the same thing? If so, how does yours differ?
      The closest we're aware of is snipit. The main difference is that VARGRAM provides a visual comparison of mutation profiles between groups or within a population of samples. There are also additional features such as grouping mutations per gene, adding multiple sets of reference mutations, and other customizations. We also plan to expand the package to provide other types of visualization relevant to genomic surveillance.
      We're also aware of packages like Marsilea that can in principle be used to make a profile, but these are more general in scope and would require more work from the user than VARGRAM does. Outside Python, we've seen researchers create mutation profiles with custom scripts (in R), and there are also web tools available like Nextclade. VARGRAM differs by making figure generation and customization substantially more convenient.

    • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:
      VARGRAM #225

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • uses an OSI approved license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a tutorial with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser, text-based review. It will also allow you to demonstrate addressing the issues via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • I have read the author guide.
  • I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Footnotes

  1. Please fill out a pre-submission inquiry before submitting a data visualization package.

coatless commented May 14, 2025

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci
review. Below are the basic checks that your package needs to pass
to begin our review. If some of these are missing, we will ask you
to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements
below.

  • Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
    • The package imports properly into a standard Python environment (`import package`).
  • Fit The package meets criteria for fit and overlap.
  • Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
    • User-facing documentation that overviews how to install and start using the package.
    • Short tutorials that help a user understand how to use the package and what it can do for them.
    • API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
  • Core GitHub repository Files
    • README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
    • Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
    • Code of Conduct The package has a CODE_OF_CONDUCT.md file.
    • License The package has an OSI approved license.
      NOTE: We prefer that you have development instructions in your documentation too.
  • Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
  • Automated tests Package has a testing suite and is tested via a Continuous Integration service.
  • Repository The repository link resolves correctly.
  • Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

  • Initial onboarding survey was filled out
    We appreciate each maintainer of the package filling out this survey individually. 🙌
    Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌


Editor comments

Hi, thanks for submitting the package.

Regarding the examples and quick start, please include a quick command, e.g. `setup_test_data()`, that can be used to obtain the required data from the repository so that the package's example can be run.

The following addresses the first step discussed under Example.

Sample implementation for `setup_test_data()`:
import requests
import zipfile
import os
import tempfile
import shutil
from pathlib import Path
from typing import Optional, Union

def setup_test_data(
    repo: str = "pgcbioinfo/vargram",
    source_path: str = "tests/test_data",
    target_dir: Optional[Union[str, Path]] = None
) -> Path:
    """
    Download the latest release of a GitHub repository and extract a specific directory's
    contents to a target directory.
    
    This function is useful for setting up test data for a package. It fetches the latest
    release from GitHub, extracts the specified directory's contents, and places them
    directly in the target directory.
    
    Parameters
    ----------
    repo : str, default "pgcbioinfo/vargram"
        The GitHub repository in format "owner/repo".
    source_path : str, default "tests/test_data"
        Path within the repository to extract. This directory's contents will be
        placed in the target directory.
    target_dir : str or Path, optional
        Directory where the contents should be extracted. If None, the current
        working directory is used.
        
    Returns
    -------
    Path
        A Path object pointing to the target directory where files were extracted.
        
    Raises
    ------
    requests.HTTPError
        If the API request to GitHub fails.
    FileNotFoundError
        If the specified source_path doesn't exist in the repository.
    ValueError
        If the repository format is invalid.
        
    Examples
    --------
    >>> # Extract to current directory
    >>> data_path = setup_test_data()
    >>> 
    >>> # Extract to a specific directory
    >>> data_path = setup_test_data(target_dir="./my_test_data")
    >>> 
    >>> # Extract a different path from a different repo
    >>> data_path = setup_test_data(
    ...     repo="username/repo",
    ...     source_path="data/examples",
    ...     target_dir="./examples"
    ... )
    """
    # Validate repo format
    if not repo or "/" not in repo:
        raise ValueError(f"Invalid repository format: {repo}. Expected format: 'owner/repo'")
    
    # Convert target_dir to Path if specified, otherwise use current directory
    if target_dir is None:
        target_dir = Path.cwd()
    else:
        target_dir = Path(target_dir)
        target_dir.mkdir(parents=True, exist_ok=True)
    
    # Step 1: Get the latest release information
    print(f"Getting latest release for {repo}...")
    response = requests.get(f"https://api.github.com/repos/{repo}/releases/latest")
    response.raise_for_status()  # Raise an exception for HTTP errors
    release_data = response.json()
    zipball_url = release_data["zipball_url"]
    release_tag = release_data["tag_name"]
    print(f"Found release: {release_tag}")
    
    # Step 2: Download the release zip file
    print(f"Downloading release from {zipball_url}...")
    zip_response = requests.get(zipball_url, stream=True)
    zip_response.raise_for_status()
    
    # Create temporary files
    with tempfile.NamedTemporaryFile(delete=False, suffix='.zip') as temp_zip:
        # Write the downloaded zip to the temporary file
        for chunk in zip_response.iter_content(chunk_size=8192):
            temp_zip.write(chunk)
        temp_zip_path = temp_zip.name
    
    temp_dir = tempfile.mkdtemp()
    
    try:
        # Step 3: Extract the archive
        print("Extracting the release...")
        with zipfile.ZipFile(temp_zip_path, 'r') as zip_ref:
            zip_ref.extractall(temp_dir)
        
        # Step 4: Find the extracted directory
        extracted_dir = next(Path(temp_dir).iterdir())  # Get the first (and only) directory
        
        # Step 5: Check if the source path directory exists
        source_data_path = extracted_dir / source_path
        if not source_data_path.exists():
            raise FileNotFoundError(f"{source_path} directory not found in the release")
        
        # Step 6: Copy each item from source directory directly to the target directory
        print(f"Copying contents of {source_path} to {target_dir}...")
        for item in source_data_path.iterdir():
            dest_path = target_dir / item.name
            
            # If it's a directory, copy the entire directory tree
            if item.is_dir():
                if dest_path.exists():
                    shutil.rmtree(dest_path)
                shutil.copytree(item, dest_path)
            # If it's a file, just copy the file
            else:
                if dest_path.exists():
                    os.remove(dest_path)
                shutil.copy2(item, dest_path)
                
        print(f"Successfully extracted {source_path} contents to {target_dir}")
        
        return target_dir
    
    finally:
        # Step 7: Clean up temporary files
        os.unlink(temp_zip_path)
        shutil.rmtree(temp_dir)

setup_test_data()

An alternative approach would be to create a tar of all the data files and then provide a series of shell commands that download and extract the data into the working directory.
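
A hypothetical sketch of that alternative in Python (this assumes a release asset named test_data.tar.gz is published with each release; no such asset exists today):

import tarfile
import urllib.request

# Download the hypothetical data archive attached to the latest release...
url = "https://github.com/pgcbioinfo/vargram/releases/latest/download/test_data.tar.gz"
urllib.request.urlretrieve(url, "test_data.tar.gz")

# ...and extract its contents into the working directory.
with tarfile.open("test_data.tar.gz") as tar:
    tar.extractall(".")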

With this in mind, there are two different data locations with different data files. Some data is used, some is not, and some is missing:

  1. tests/test_data
  2. docs/assets/data

For the later example with mutation profiles, there is no discussion of where the nextclade_analysis.csv or covid_samples/* data can be obtained once Nextclade CLI is installed.
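
For reference, one way such a CSV could be produced once Nextclade CLI is installed (a sketch only; the dataset name and file paths are assumptions, not documented VARGRAM inputs):

import subprocess

# Fetch the official SARS-CoV-2 Nextclade dataset (reference, annotation, etc.).
subprocess.run(["nextclade", "dataset", "get",
                "--name", "sars-cov-2",
                "--output-dir", "data/sars-cov-2"], check=True)

# Call mutations on the sample sequences and write the analysis table as CSV.
subprocess.run(["nextclade", "run",
                "--input-dataset", "data/sars-cov-2",
                "--output-csv", "nextclade_analysis.csv",
                "covid_samples/sequences.fasta"], check=True)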

I'll kick-start the review process on the package, as reviewers can use the above script in the interim to explore the initial example.

lwasser moved this from pre-review-checks to seeking-editor in peer-review-status on May 14, 2025
cjpalpallatoc (Author) commented

Thanks for the initial comments @coatless . I just have an important deadline to meet this coming week, but I'll work on your suggestions as soon as I can. Looking forward to the review.
