Datasets
Taxprofiler Pipeline with Dataset Integration Using Cloudgene¶
In this tutorial, we will adapt the taxprofiler
pipeline for taxonomic classification and profiling of metagenomic data using Cloudgene. The pipeline will be configured to integrate with datasets, allowing users to select specific datasets for their analysis while reducing input complexity. The Cloudgene application will be tailored for this use case, simplifying the pipeline's input parameters and linking datasets dynamically.
Taxprofiler Application¶
We start with the taxprofiler
pipeline, which fetches data based on user-provided input IDs and performs taxonomic profiling. Here is the cloudgene.yaml
file for the taxprofiler
pipeline:
id: taxprofiler
name: Taxprofiler
description: Taxonomic classification and profiling of shotgun short- and long-read metagenomic data
version: 1.1.8
website: https://github.com/nf-core/taxprofiler
author: James A. Fellows Yates, Sofia Stamouli, Moritz E. Beber, and the nf-core/taxprofiler team
logo: https://raw.githubusercontent.com/nf-core/fetchngs/master/docs/images/nf-core-fetchngs_logo_light.png
workflow:
steps:
- name: Fetch Data
script: nf-core/fetchngs
revision: 1.12.0
stdout: true
params:
input: "${input_ids}"
- name: Run taxprofiler
script: nf-core/taxprofiler
revision: 1.1.8
stdout: true
params:
input: "${outdir}/samplesheet/samplesheet.csv"
databases: "https://raw.githubusercontent.com/nf-core/test-datasets/taxprofiler/database_full_v1.2.csv"
multiqc_title: "${CLOUDGENE_JOB_NAME}"
inputs:
- id: input_ids
description: IDs
type: textarea
value: "SRR12696236"
writeFile: "ids.csv"
serialize: false
outputs:
- id: outdir
description: Output
type: folder
This setup fetches sequence data and runs the taxprofiler pipeline using an external database. However, we want to make the database selection dynamic and allow users to choose a specific dataset for taxonomic profiling.
Datasets for Taxprofiler¶
Datasets allow us to link the pipeline to different database options dynamically, enabling users to select the most relevant database for their analysis. We will create individual Cloudgene applications for each database dataset and link them to the taxprofiler
pipeline.
Dataset Structure¶
We will create a Cloudgene application for each database with the following structure:
cloudgene.yaml
for Dataset 1¶
The cloudgene.yaml
file for each dataset application will define properties to share the dataset location with the main pipeline:
name: Database 1
version: 1.0
category: taxprofiler_database
properties:
database_url: "${CLOUDGENE_APP_LOCATION}/database_full_v1.2.csv"
Here, the database_url
property points to the dataset file, allowing the main pipeline to retrieve this information when the user selects the dataset. The category
field is used to group all dataset applications, making it easier to filter and display them within the taxprofiler
pipeline interface.
Updating the Taxprofiler Application to Use Datasets¶
Now, we update the cloudgene.yaml
file for the taxprofiler
pipeline to use datasets. Instead of hardcoding the database URL, we will allow users to select from available datasets.
Updated cloudgene.yaml
for Taxprofiler¶
id: taxprofiler
name: Taxprofiler
description: Taxonomic classification and profiling of shotgun short- and long-read metagenomic data
version: 1.1.8
workflow:
steps:
- name: Fetch Data
script: nf-core/fetchngs
revision: 1.12.0
stdout: true
params:
input: "${input_ids}"
- name: Run taxprofiler
script: nf-core/taxprofiler
revision: 1.1.8
stdout: true
params:
input: "${outdir}/samplesheet/samplesheet.csv"
databases: "${database.database_url}"
multiqc_title: "${CLOUDGENE_JOB_NAME}"
inputs:
- id: input_ids
description: IDs
type: textarea
value: "SRR12696236"
writeFile: "ids.csv"
serialize: false
- id: database
description: Database
type: dataset
category: taxprofiler_database
serialize: false
outputs:
- id: outdir
description: Output
type: folder
Explanation¶
dataset
input: The inputdatabase
uses thedataset
type, allowing users to select a dataset from all installed database applications that fall under thetaxprofiler_database
category.- Database link: The
database.database_url
references thedatabase_url
property defined in the dataset application’scloudgene.yaml
, dynamically linking the selected database to thetaxprofiler
pipeline.
With this setup, users can select a dataset from the available options, and the taxprofiler
pipeline will use the corresponding database for taxonomic profiling.
Advantages of This Approach¶
- Dynamic Dataset Selection: Users can easily select a dataset from a predefined list without modifying the pipeline configuration.
- Separation of Concerns: The logic of dataset management is separated from the pipeline itself, making it easier to add or update datasets without changing the pipeline code.
- Scalability: As new datasets are added, they are automatically available for selection without further changes to the pipeline.
- Permission Management: Admins can control access to specific datasets through user groups, providing flexibility in shared environments.
By using this approach, the taxprofiler
pipeline becomes more user-friendly and adaptable, enabling seamless integration of datasets for specialized use cases.