isoformant module

isoformant.balanced_chisq_test(adata, cond_list=[], cond_label='sample_id', cluster_label='leiden')[source]

Chi-squared goodness-of-fit test. Assumes class balance.

Parameters
  • adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)

  • cond_list (list, optional) – List of condition labels to be considered, defaults to []

  • cond_label (str, optional) – Condition label, defaults to ‘sample_id’

  • cluster_label (str, optional) – Cluster label, defaults to ‘leiden’

Returns

Test results

Return type

pandas dataframe
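
To illustrate what a goodness-of-fit test under class balance computes, the sketch below checks, per cluster, whether reads are spread evenly across conditions. It is not isoformant's implementation: the helper name `balanced_chisq` and the list-of-pairs input are hypothetical, and only the statistic and degrees of freedom are computed (no p-value).

```python
from collections import Counter, defaultdict


def balanced_chisq(labels):
    """Per-cluster chi-squared statistic against an even split across conditions.

    `labels` is a list of (condition, cluster) pairs, one per read.
    Hypothetical helper, not part of isoformant.
    """
    conditions = sorted({cond for cond, _ in labels})
    per_cluster = defaultdict(Counter)
    for cond, cluster in labels:
        per_cluster[cluster][cond] += 1

    results = {}
    for cluster, counts in per_cluster.items():
        total = sum(counts.values())
        # Class balance: every condition is expected to contribute equally.
        expected = total / len(conditions)
        stat = sum((counts[c] - expected) ** 2 / expected for c in conditions)
        results[cluster] = {"chisq": stat, "dof": len(conditions) - 1}
    return results


reads = [("A", "0")] * 30 + [("B", "0")] * 10 + [("A", "1")] * 20 + [("B", "1")] * 20
chisq_results = balanced_chisq(reads)
for cluster, res in sorted(chisq_results.items()):
    print(cluster, res)
```

The even-split expectation only makes sense when each condition contributes the same number of reads overall, which is presumably why the test assumes class balance.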

isoformant.cns(minimap2_path, seq_dict, dest_dir, ref_fa, cluster_label='leiden', ncore=1, sirv=False, messages=False)[source]

Consensus calling module.

Parameters
  • minimap2_path (str) – Path to minimap2 executable

  • seq_dict (dict) – Dictionary mapping cluster to read sequences

  • dest_dir (str) – Path to directory for output files

  • ref_fa (str) – Path to reference FASTA file

  • cluster_label (str, optional) – Cluster label, defaults to ‘leiden’

  • ncore (int, optional) – Number of cores, defaults to 1

  • sirv (bool, optional) – True to benchmark SIRVs, defaults to False

  • messages (bool, optional) – True to print STDOUT/STDERR messages, defaults to False

Returns

Path to consensus sequence set BAM file

Return type

str

Returns

Dictionary mapping cluster-specific consensus sequence to BAM path

Return type

dict

isoformant.create_viz_bams(input_adata, combined_bam, dest_dir, cluster_label='leiden', bam_n=10, ncore=1)[source]

Create BAM files for visualizations. Returns a dictionary mapping cluster ID to BAM file.

Parameters
  • input_adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)

  • combined_bam (str) – Path to BAM file associated with processed reads

  • dest_dir (str) – Path to directory for output files

  • cluster_label (str, optional) – Cluster label, defaults to ‘leiden’

  • bam_n (int, optional) – Maximum number of records per BAM file, defaults to 10

  • ncore (int, optional) – Number of cores, defaults to 1

Returns

Dictionary mapping cluster to BAM path

Return type

dict

Returns

Dictionary mapping cluster to read sequences

Return type

dict

isoformant.kmerize(combined_csv, ksize=5)[source]

Ingests CSV file and produces k-mer frequency anndata object.

Parameters
  • combined_csv (str) – Path to CSV file

  • ksize (int, optional) – k-mer size, defaults to 5

Returns

k-mer frequencies as anndata object

Return type

anndata object
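
As a rough illustration of k-mer frequency counting for a single read, the sketch below uses only the standard library. The helper `kmer_frequencies` is hypothetical, and whether isoformant stores raw counts or relative frequencies is not stated here; relative frequencies are an assumption.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, ksize=5):
    """Return relative frequencies of every possible DNA k-mer in `seq`.

    Hypothetical helper; windows containing non-ACGT characters are ignored.
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=ksize)]
    seq = seq.upper()
    # Slide a window of width ksize across the read.
    counts = Counter(seq[i:i + ksize] for i in range(len(seq) - ksize + 1))
    total = sum(counts[k] for k in vocab)
    return {k: (counts[k] / total if total else 0.0) for k in vocab}


freqs = kmer_frequencies("ACGTACGTAC", ksize=2)
print(len(freqs), freqs["AC"], freqs["TT"])
```

Using a fixed vocabulary of all 4^k k-mers gives every read a vector of the same length, which is what a read x k-mer matrix requires.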

isoformant.minimap2_launcher(minimap2_path, ref_fa, cnsfa, cnsbam, sirv=False, messages=False)[source]

Launch minimap2 alignment.

Parameters
  • minimap2_path (str) – Path to minimap2 executable

  • ref_fa (str) – Path to reference FASTA file

  • cnsfa (str) – Path to input query FASTA file

  • cnsbam (str) – Path to output query BAM file

  • sirv (bool, optional) – True to benchmark SIRVs, defaults to False

  • messages (bool, optional) – True to print STDOUT/STDERR messages, defaults to False
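
A minimal sketch of how such a launcher might assemble its command line. The helper `build_minimap2_cmd` is hypothetical, and the `-x splice` preset is an assumption for long RNA reads; the options isoformant actually passes are not documented here. The sketch only builds the argument list and does not run anything.

```python
import shlex


def build_minimap2_cmd(minimap2_path, ref_fa, cnsfa):
    """Build an argument list for a spliced minimap2 alignment that emits SAM.

    Hypothetical helper; the chosen options are assumptions.
    """
    # -a: output SAM; -x splice: preset for spliced long-read alignment.
    return [minimap2_path, "-a", "-x", "splice", ref_fa, cnsfa]


cmd = build_minimap2_cmd("minimap2", "ref.fa", "cns.fa")
print(shlex.join(cmd))
# Running it and capturing STDOUT/STDERR could look like:
#   subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Converting the SAM output to a sorted BAM file (for example with samtools) is left out of this sketch.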

isoformant.pca_pipeline(adata_reads, groupby_list, components=['1,2'], n_comps=100, plot=False)[source]

PCA pipeline performed in-place. Scale data and perform PCA. Option to print PC scatter plot.

Parameters
  • adata_reads (Anndata object) – Read x k-mer data

  • groupby_list (list) – List of feature names to color-code in plot

  • components (list, optional) – List of PC pairs to plot (e.g. [‘1,2’, …]), defaults to [‘1,2’]

  • n_comps (int, optional) – Number of PCs to compute, defaults to 100

  • plot (bool, optional) – True to plot, False to skip plotting, defaults to False

isoformant.plot_cluster_occupancy(adata, cond_list=[], cond_label='sample_id', cluster_label='leiden', fig_size=(6, 6), subplot_hjust=0.3, save=None)[source]

Plots cluster occupancy by condition label.

Parameters
  • adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)

  • cond_list (list, optional) – List of condition labels to be considered, defaults to []

  • cond_label (str, optional) – Condition label, defaults to ‘sample_id’

  • cluster_label (str, optional) – Cluster label, defaults to ‘leiden’

  • fig_size (tuple, optional) – Figure size as (width, height), defaults to (6, 6)

  • subplot_hjust (float, optional) – Subplot horizontal spacing, defaults to 0.3

  • save (str, optional) – Path to save plot, defaults to None

Returns

Frequency table

Return type

pandas dataframe
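
One plausible reading of a cluster occupancy table is the fraction of each condition's reads that fall in each cluster. The sketch below builds such a table from plain Python data; the helper `occupancy_table` and the per-condition normalization are assumptions, not isoformant's documented behavior.

```python
from collections import Counter


def occupancy_table(labels):
    """Fraction of each condition's reads that fall in each cluster.

    `labels` is a list of (condition, cluster) pairs; hypothetical helper.
    """
    pair_counts = Counter(labels)
    cond_totals = Counter(cond for cond, _ in labels)
    clusters = sorted({cluster for _, cluster in labels})
    return {
        cond: {cl: pair_counts[(cond, cl)] / total for cl in clusters}
        for cond, total in cond_totals.items()
    }


reads = [("A", "0")] * 3 + [("A", "1")] * 1 + [("B", "1")] * 4
table = occupancy_table(reads)
print(table)
```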

isoformant.plot_highlight_COI(input_adata, COI, cluster_label='leiden')[source]

Highlight cluster of interest in UMAP plot.

Parameters
  • input_adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)

  • COI (str) – Label of the cluster of interest to highlight

  • cluster_label (str, optional) – Cluster label, defaults to ‘leiden’

isoformant.plot_tracks(track_dict, chrom, start, end, ref_fa, left_margin=0, save=None)[source]

Plot read tracks from a dictionary of BAM files.

Parameters
  • track_dict (dict) – Dictionary of track labels mapped to BAM files

  • chrom (str) – Reference sequence name of ROI

  • start (int) – 1-based start coordinate of ROI

  • end (int) – 1-based end coordinate of ROI

  • ref_fa (str) – Path to reference FASTA file

  • left_margin (int, optional) – Left margin in genomic coordinates, defaults to 0

  • save (str, optional) – Path to save plot, defaults to None

isoformant.preprocess_reads(bams_list, chrom, start, end, dest_dir, max_reads=1000, ncore=1, qual_cutoff=10, len_cutoff=300)[source]

Processes a list of BAM files. Coordinates the ‘process_bam’ task to extract passing reads from each file, then writes a merged BAM file and a merged, min-depth balanced CSV file to disk.

Parameters
  • bams_list (list) – List of sorted BAM file paths (.bai required in same directory)

  • chrom (str) – Reference sequence name of ROI

  • start (int) – 1-based start coordinate of ROI

  • end (int) – 1-based end coordinate of ROI

  • dest_dir (str) – Path to directory for output files

  • max_reads (int, optional) – Maximum possible reads considered, defaults to 1000

  • ncore (int, optional) – Number of cores, defaults to 1

  • qual_cutoff (float, optional) – Minimum mean base quality, defaults to 10

  • len_cutoff (int, optional) – Minimum read length, defaults to 300

Returns

Path to merged CSV file

Return type

str

Returns

Path to sorted merged BAM file

Return type

str
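
The idea behind min-depth balancing can be sketched as follows: every sample is cut down to the depth of the shallowest sample, capped at `max_reads`. The helper `balance_to_min_depth` and the dictionary input are hypothetical, and the sketch assumes each read list has already been shuffled so that truncation amounts to random sampling.

```python
def balance_to_min_depth(reads_by_sample, max_reads=1000):
    """Truncate every sample to the same depth.

    The depth is the smallest sample size, capped at `max_reads`.
    Hypothetical helper; assumes each list is already shuffled.
    """
    depth = min([max_reads, *map(len, reads_by_sample.values())])
    return {sample: reads[:depth] for sample, reads in reads_by_sample.items()}


samples = {"s1": list(range(1500)), "s2": list(range(400)), "s3": list(range(900))}
balanced = balance_to_min_depth(samples)
print({sample: len(reads) for sample, reads in balanced.items()})
```

Equal depth per sample is what makes a class-balance assumption in downstream tests reasonable.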

isoformant.process_bam(bamfile, chrom, start, end, suffix, outcsv, outbam, qual_cutoff=10, len_cutoff=300)[source]

Filters and exports reads from a BAM file. A read passes if it (1) is not a secondary or supplementary alignment, (2) meets the minimum mean base quality, and (3) meets the minimum read length. Passing reads are written to disk in BAM format and in CSV format. Records are randomly shuffled.

Parameters
  • bamfile (str) – Path to sorted BAM file (.bai index file required in same directory)

  • chrom (str) – Reference sequence name of ROI

  • start (int) – 1-based start coordinate of ROI

  • end (int) – 1-based end coordinate of ROI

  • suffix (str) – Identifier to be appended to read names

  • outcsv (str) – Path to CSV output file

  • outbam (str) – Path to BAM output file

  • qual_cutoff (float, optional) – Minimum mean base quality, defaults to 10

  • len_cutoff (int, optional) – Minimum read length, defaults to 300

Returns

Number of records passing filters

Return type

int
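
The three filters can be sketched on plain Python values as shown below. The helper `passes_filters` is hypothetical; `flag`, `seq`, and `quals` stand in for the fields a BAM reader would supply, and treating both cutoffs as inclusive minimums is an assumption.

```python
from statistics import mean

SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


def passes_filters(flag, seq, quals, qual_cutoff=10, len_cutoff=300):
    """Apply the three read filters to one alignment record.

    `quals` is a list of per-base Phred scores. Hypothetical helper.
    """
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False
    if len(seq) < len_cutoff:
        return False
    return bool(quals) and mean(quals) >= qual_cutoff


print(passes_filters(0, "A" * 350, [12] * 350))      # primary, long enough, good quality
print(passes_filters(0x800, "A" * 350, [12] * 350))  # supplementary alignment
print(passes_filters(16, "A" * 120, [12] * 120))     # shorter than len_cutoff
```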

isoformant.roi_fa(ref_fa, chrom, out_fa)[source]

Trim FASTA to reference sequence of interest.

Parameters
  • ref_fa (str) – Path to reference FASTA file

  • chrom (str) – Reference sequence name of ROI

  • out_fa (str) – Path to output file
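
Trimming a FASTA file to one reference sequence can be sketched with the standard library alone. The helper `trim_fasta` is hypothetical; it takes the record name to be the header text up to the first whitespace, which is an assumption about how names are matched.

```python
import os
import tempfile


def trim_fasta(ref_fa, chrom, out_fa):
    """Copy only the record named `chrom` from `ref_fa` to `out_fa`.

    Returns True if the record was found. Hypothetical helper.
    """
    found = keep = False
    with open(ref_fa) as src, open(out_fa, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # A header line starts a new record; decide whether to keep it.
                fields = line[1:].split()
                keep = bool(fields) and fields[0] == chrom
                found = found or keep
            if keep:
                dst.write(line)
    return found


with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "ref.fa")
    out = os.path.join(tmp, "roi.fa")
    with open(ref, "w") as fh:
        fh.write(">chr1 test\nACGT\nACGT\n>chr2\nGGGG\n")
    found = trim_fasta(ref, "chr2", out)
    with open(out) as fh:
        roi_text = fh.read()
print(found, repr(roi_text))
```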

isoformant.shuffle_csv(incsv)[source]

Shuffle CSV records. Performs task in-place.

Parameters

incsv (str) – CSV file path
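
An in-place CSV shuffle can be sketched as follows. The helper `shuffle_csv_records` and its `seed` parameter are hypothetical; the sketch assumes the first row is a header that must stay first and that the file fits in memory, neither of which is confirmed for isoformant's CSV files.

```python
import csv
import os
import random
import tempfile


def shuffle_csv_records(path, seed=None):
    """Shuffle the data rows of a CSV file and overwrite it in place.

    Assumes the first row is a header. Hypothetical helper.
    """
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    if not rows:
        return
    header, records = rows[0], rows[1:]
    random.Random(seed).shuffle(records)
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows([header] + records)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.csv")
    rows_in = [["read_id", "seq"]] + [[f"r{i}", "ACGT"] for i in range(5)]
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows_in)
    shuffle_csv_records(path, seed=0)
    with open(path, newline="") as fh:
        shuffled = list(csv.reader(fh))
print(shuffled[0], sorted(row[0] for row in shuffled[1:]))
```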

isoformant.umap_pipeline(input_adata, groupby_list, n_pcs=3, n_neighbors=15, min_dist=0.5, resolution=1, components=['1,2'], plot=False)[source]

UMAP pipeline performed in-place. Build KNN graph, compute UMAP, and cluster reads using Leiden algorithm. Option to print UMAP scatter plot.

Parameters
  • input_adata (Anndata object) – Read x k-mer data (processed by pca_pipeline)

  • groupby_list (list) – List of feature names to color-code in plot

  • n_pcs (int, optional) – Number of PCs for KNN graph, defaults to 3

  • n_neighbors (int, optional) – Number of neighbors per neighborhood, defaults to 15

  • min_dist (float, optional) – Minimum distance to neighbor, defaults to 0.5

  • resolution (float, optional) – Leiden clustering resolution, defaults to 1

  • components (list, optional) – List of UMAP component pairs to plot (e.g. [‘1,2’, …]), defaults to [‘1,2’]

  • plot (bool, optional) – True to plot, False to skip plotting, defaults to False