isoformant module

isoformant.balanced_chisq_test(adata, cond_list=[], cond_label='sample_id', cluster_label='leiden')[source]

Chi-squared goodness-of-fit test. Assumes class balance.

Parameters

adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)
cond_list (list, optional) – List of condition labels to be considered, defaults to []
cond_label (str, optional) – Condition label, defaults to ‘sample_id’
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’

Returns

Test results

Return type

pandas dataframe
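
As a rough illustration of what a goodness-of-fit test under class balance computes, the sketch below evaluates the chi-squared statistic for one cluster's read counts against an equal split across conditions. It is not the library's implementation: `balanced_chisq_statistic` and the example counts are hypothetical, and a p-value would additionally require the chi-squared distribution with `len(observed) - 1` degrees of freedom (for instance via `scipy.stats.chisquare`).

```python
def balanced_chisq_statistic(observed):
    """Chi-squared statistic of observed counts against a uniform expectation."""
    total = sum(observed)
    # Class balance: every condition is expected to contribute an equal share.
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


# Hypothetical read counts for one cluster across three conditions.
counts = [30, 50, 40]
stat = balanced_chisq_statistic(counts)
dof = len(counts) - 1
print(stat, dof)
```

A large statistic relative to the degrees of freedom suggests that the cluster is occupied unevenly across conditions.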

isoformant.cns(minimap2_path, seq_dict, dest_dir, ref_fa, cluster_label='leiden', ncore=1, sirv=False, messages=False)[source]

Consensus calling module.

Parameters

minimap2_path (str) – Path to minimap2 executable
seq_dict (dict) – Dictionary mapping cluster to read sequences
dest_dir (str) – Path to directory for output files
ref_fa (str) – Path to reference FASTA file
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’
ncore (int, optional) – Number of cores, defaults to 1
sirv (bool, optional) – True to benchmark SIRVs, defaults to False
messages (bool, optional) – True to print STDOUT/STDERR messages, defaults to False

Returns

Path to consensus sequence set BAM file, and dictionary mapping cluster-specific consensus sequence to BAM path

Return type

str, dict

isoformant.create_viz_bams(input_adata, combined_bam, dest_dir, cluster_label='leiden', bam_n=10, ncore=1)[source]

Create BAM files for visualizations. Returns a dictionary mapping cluster ID to BAM file.

Parameters

input_adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)
combined_bam (str) – Path to BAM file associated with processed reads
dest_dir (str) – Path to directory for output files
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’
bam_n (int, optional) – Maximum number of records per BAM file, defaults to 10
ncore (int, optional) – Number of cores, defaults to 1

Returns

Dictionary mapping cluster to BAM path, and dictionary mapping cluster to read sequences

Return type

dict, dict

isoformant.kmerize(combined_csv, ksize=5)[source]

Ingests CSV file and produces k-mer frequency anndata object.

Parameters

combined_csv (str) – Path to CSV file
ksize (int, optional) – k-mer size, defaults to 5

Returns

k-mer frequencies as anndata object

Return type

anndata object
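
To make the k-mer representation concrete, the sketch below counts overlapping k-mers in a single read and normalises by the number of windows. It is an assumption-laden illustration rather than the library's code: `kmer_frequencies` is a hypothetical helper, and whether isoformant stores raw counts or normalised frequencies is not stated here.

```python
from collections import Counter


def kmer_frequencies(seq, ksize=5):
    """Return the relative frequency of each k-mer observed in seq."""
    n_windows = len(seq) - ksize + 1
    if n_windows <= 0:
        return {}  # read is shorter than the k-mer size
    counts = Counter(seq[i:i + ksize] for i in range(n_windows))
    return {kmer: n / n_windows for kmer, n in counts.items()}


# Toy read; real reads would come from the combined CSV file.
freqs = kmer_frequencies("ACGTACGTAC", ksize=3)
print(freqs)
```

Stacking one such vector per read gives a read x k-mer matrix of the kind the downstream pipelines operate on.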

isoformant.minimap2_launcher(minimap2_path, ref_fa, cnsfa, cnsbam, sirv=False, messages=False)[source]

Launch minimap2 alignment.

Parameters

minimap2_path (str) – Path to minimap2 executable
ref_fa (str) – Path to reference FASTA file
cnsfa (str) – Path to input query FASTA file
cnsbam (str) – Path to output query BAM file
sirv (bool, optional) – True to benchmark SIRVs, defaults to False
messages (bool, optional) – True to print STDOUT/STDERR messages, defaults to False

isoformant.pca_pipeline(adata_reads, groupby_list, components=['1,2'], n_comps=100, plot=False)[source]

PCA pipeline performed in-place. Scales the data and performs PCA. Optionally draws a PC scatter plot.

Parameters

adata_reads (Anndata object) – Read x k-mer data
groupby_list (list) – List of feature names to color-code in plot
components (list, optional) – List of PC pairs to plot (e.g. [‘1,2’, …]), defaults to [‘1,2’]
n_comps (int, optional) – Number of PCs to compute, defaults to 100
plot (bool, optional) – True to draw the PC scatter plot, defaults to False

isoformant.plot_cluster_occupancy(adata, cond_list=[], cond_label='sample_id', cluster_label='leiden', fig_size=(6, 6), subplot_hjust=0.3, save=None)[source]

Plots cluster occupancy by condition label.

Parameters

adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)
cond_list (list, optional) – List of condition labels to be considered, defaults to []
cond_label (str, optional) – Condition label, defaults to ‘sample_id’
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’
fig_size (tuple, optional) – Figure size, defaults to (6, 6)
subplot_hjust (float, optional) – Subplot horizontal spacing, defaults to 0.3
save (str, optional) – Path to save plot, defaults to None

Returns

Frequency table

Return type

pandas dataframe
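
One way to picture the frequency table is as a cross-tabulation of condition against cluster. The sketch below is a hypothetical stand-in, not the library's code: `occupancy_table` is an invented name, and normalising within each condition is an assumption, since the entry does not say whether raw counts or fractions are reported.

```python
from collections import Counter


def occupancy_table(clusters, conditions):
    """Fraction of each condition's reads that falls in each cluster."""
    pair_counts = Counter(zip(conditions, clusters))
    cond_totals = Counter(conditions)
    return {
        (cond, clus): n / cond_totals[cond]
        for (cond, clus), n in pair_counts.items()
    }


# Toy per-read labels: one cluster label and one condition label per read.
clusters = ["0", "0", "1", "0", "1", "1"]
conditions = ["a", "a", "a", "b", "b", "b"]
table = occupancy_table(clusters, conditions)
print(table)
```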

isoformant.plot_highlight_COI(input_adata, COI, cluster_label='leiden')[source]

Highlight cluster of interest in UMAP plot.

Parameters

input_adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)
COI (str) – Label of the cluster of interest
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’

isoformant.plot_tracks(track_dict, chrom, start, end, ref_fa, left_margin=0, save=None)[source]

Plot read tracks from a dictionary of BAM files.

Parameters

track_dict (dict) – Dictionary of track labels mapped to BAM files
chrom (str) – Reference sequence name of ROI
start (int) – 1-based start coordinate of ROI
end (int) – 1-based end coordinate of ROI
ref_fa (str) – Path to reference FASTA file
left_margin (int, optional) – Left margin in genomic coordinates, defaults to 0
save (str, optional) – Path to save plot, defaults to None

isoformant.preprocess_reads(bams_list, chrom, start, end, dest_dir, max_reads=1000, ncore=1, qual_cutoff=10, len_cutoff=300)[source]

Processes a list of BAM files. Runs the ‘process_bam’ task on each file to extract passing reads, then writes a merged BAM file and a merged, min-depth balanced CSV file to disk.

Parameters

bams_list (list) – List of sorted BAM file paths (.bai required in same directory)
chrom (str) – Reference sequence name of ROI
start (int) – 1-based start coordinate of ROI
end (int) – 1-based end coordinate of ROI
dest_dir (str) – Path to directory for output files
max_reads (int, optional) – Maximum possible reads considered, defaults to 1000
ncore (int, optional) – Number of cores, defaults to 1
qual_cutoff (float, optional) – Minimum mean base quality, defaults to 10
len_cutoff (int, optional) – Minimum read length, defaults to 300

Returns

Path to merged CSV file, and path to sorted merged BAM file

Return type

str, str
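
The phrase "min-depth balanced" is read here as downsampling every input to the depth of the shallowest one, capped at max_reads. That reading is an assumption, and the sketch below (with the hypothetical helper `balance_to_min_depth` and made-up read names) only illustrates the idea.

```python
import random


def balance_to_min_depth(reads_by_sample, max_reads=1000, seed=0):
    """Randomly downsample each sample to the smallest sample's depth."""
    rng = random.Random(seed)
    min_depth = min(len(reads) for reads in reads_by_sample.values())
    depth = min(max_reads, min_depth)
    return {
        sample: rng.sample(reads, depth)
        for sample, reads in reads_by_sample.items()
    }


# Toy input: sample s1 has 8 passing reads, sample s2 has only 3.
reads = {
    "s1": [f"s1_read{i}" for i in range(8)],
    "s2": [f"s2_read{i}" for i in range(3)],
}
balanced = balance_to_min_depth(reads)
print({sample: len(kept) for sample, kept in balanced.items()})
```

Equal depth per sample is what makes a class-balance assumption reasonable in later tests.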

isoformant.process_bam(bamfile, chrom, start, end, suffix, outcsv, outbam, qual_cutoff=10, len_cutoff=300)[source]

Filters the reads in a BAM file, keeping those that (1) are not secondary or supplementary alignments, (2) meet the minimum mean base quality, and (3) meet the minimum read length. Passing reads are written to disk in BAM format and in CSV format. Records are randomly shuffled.

Parameters

bamfile (str) – Path to sorted BAM file (.bai index file required in same directory)
chrom (str) – Reference sequence name of ROI
start (int) – 1-based start coordinate of ROI
end (int) – 1-based end coordinate of ROI
suffix (str) – Identifier to be appended to read names
outcsv (str) – Path to CSV output file
outbam (str) – Path to BAM output file
qual_cutoff (float, optional) – Minimum mean base quality, defaults to 10
len_cutoff (int, optional) – Minimum read length, defaults to 300

Returns

Number of records passing filters

Return type

int
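
The three filters can be sketched on plain Python records as follows. This is an illustration under stated assumptions, not the library's code: the `Read` class and `passes_filters` are hypothetical, both cutoffs are treated as inclusive, and a real implementation would read alignments from the BAM file. The flag values 0x100 (secondary) and 0x800 (supplementary) come from the SAM specification.

```python
from dataclasses import dataclass
from statistics import mean

SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


@dataclass
class Read:
    name: str
    flag: int
    seq: str
    quals: list


def passes_filters(read, qual_cutoff=10, len_cutoff=300):
    """Apply the alignment-type, length, and mean-quality filters."""
    if read.flag & (SECONDARY | SUPPLEMENTARY):
        return False
    if len(read.seq) < len_cutoff:
        return False
    return mean(read.quals) >= qual_cutoff


reads = [
    Read("r1", 0, "A" * 400, [20] * 400),      # passes all filters
    Read("r2", 0x100, "A" * 400, [20] * 400),  # secondary alignment
    Read("r3", 16, "A" * 100, [20] * 100),     # too short
    Read("r4", 0, "A" * 400, [5] * 400),       # low mean base quality
]
kept = [read.name for read in reads if passes_filters(read)]
print(kept, len(kept))
```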

isoformant.roi_fa(ref_fa, chrom, out_fa)[source]

Trim FASTA to reference sequence of interest.

Parameters

ref_fa (str) – Path to reference FASTA file
chrom (str) – Reference sequence name of ROI
out_fa (str) – Path to output file
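
Trimming a FASTA file to a single reference sequence amounts to keeping one record and dropping the rest. The sketch below shows that idea on an in-memory file; `extract_record` is a hypothetical helper, and matching on the header text up to the first whitespace is an assumption about how sequence names are compared.

```python
import io


def extract_record(fasta_lines, chrom):
    """Yield the lines of the FASTA record whose name equals chrom."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Record name: header text up to the first whitespace.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] == chrom
        if keep:
            yield line


# Toy two-record FASTA held in memory instead of on disk.
fasta = io.StringIO(">chr1 description\nACGT\nACGT\n>chr2\nGGCC\n")
roi = "".join(extract_record(fasta, "chr2"))
print(roi)
```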

isoformant.shuffle_csv(incsv)[source]

Shuffle CSV records. Performs task in-place.

Parameters

incsv (str) – CSV file path
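
An in-place shuffle can be sketched by reading all records, permuting them, and rewriting the same path. The helper `shuffle_lines_in_place` below is hypothetical and assumes one record per line with no header row; how the library handles headers or very large files is not stated here.

```python
import os
import random
import tempfile


def shuffle_lines_in_place(path, seed=None):
    """Rewrite the file at path with its lines in random order."""
    with open(path) as handle:
        lines = handle.readlines()
    random.Random(seed).shuffle(lines)
    with open(path, "w") as handle:
        handle.writelines(lines)


# Demonstrate on a throwaway file so nothing on disk is modified.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "reads.csv")
    original = [f"read{i},ACGT\n" for i in range(5)]
    with open(csv_path, "w") as handle:
        handle.writelines(original)
    shuffle_lines_in_place(csv_path, seed=1)
    with open(csv_path) as handle:
        shuffled = handle.readlines()

print(shuffled)
```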

isoformant.umap_pipeline(input_adata, groupby_list, n_pcs=3, n_neighbors=15, min_dist=0.5, resolution=1, components=['1,2'], plot=False)[source]

UMAP pipeline performed in-place. Builds a KNN graph, computes UMAP, and clusters reads using the Leiden algorithm. Optionally draws a UMAP scatter plot.

Parameters

input_adata (Anndata object) – Read x k-mer data (processed by pca_pipeline)
groupby_list (list) – List of feature names to color-code in plot
n_pcs (int, optional) – Number of PCs for KNN graph, defaults to 3
n_neighbors (int, optional) – Number of neighbors per neighborhood, defaults to 15
min_dist (float, optional) – Minimum distance to neighbor, defaults to 0.5
resolution (float, optional) – Leiden clustering resolution, defaults to 1
components (list, optional) – List of UMAP component pairs to plot (e.g. [‘1,2’, …]), defaults to [‘1,2’]
plot (bool, optional) – True to draw the UMAP scatter plot, defaults to False
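
The entries on this page suggest a natural order of calls: preprocess the BAM files, build k-mer frequencies, run PCA, run UMAP and clustering, then test cluster occupancy. The sketch below chains them using only the signatures documented here. Several details are assumptions: that preprocess_reads hands back its two documented values as a (CSV, BAM) pair in that order, that ‘sample_id’ is a valid feature name for groupby_list, and the wrapper name `run_isoformant_sketch` itself. The import is deferred so that the function can be defined without isoformant installed.

```python
def run_isoformant_sketch(bams_list, chrom, start, end, dest_dir):
    """Chain the documented functions; calling this requires isoformant."""
    import isoformant  # deferred: only needed when the function is called

    # Assumption: the two documented return values arrive as (csv, bam).
    merged_csv, merged_bam = isoformant.preprocess_reads(
        bams_list, chrom, start, end, dest_dir
    )
    adata = isoformant.kmerize(merged_csv, ksize=5)

    # Both pipelines are documented as in-place, so nothing is reassigned.
    # Assumption: 'sample_id' is an available feature name.
    isoformant.pca_pipeline(adata, ["sample_id"])
    isoformant.umap_pipeline(adata, ["sample_id"])

    results = isoformant.balanced_chisq_test(adata)
    return merged_bam, adata, results
```

All optional parameters are left at their documented defaults; in practice n_pcs, n_neighbors, and resolution would likely need tuning per locus.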