isoformant module
- isoformant.balanced_chisq_test(adata, cond_list=[], cond_label='sample_id', cluster_label='leiden')
Chi-squared goodness-of-fit test. Assumes class balance.
- Parameters
adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)
cond_list (list, optional) – List of condition labels to be considered, defaults to []
cond_label (str, optional) – Condition label, defaults to ‘sample_id’
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’
- Returns
Test results
- Return type
pandas dataframe
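As a rough illustration of what a goodness-of-fit test under class balance computes, the sketch below compares per-condition read counts within one cluster against a uniform expectation. The function name `balanced_chisq_statistic` and the counts are hypothetical, and this is not isoformant's implementation; it returns only the statistic and the degrees of freedom, not a p-value.

```python
def balanced_chisq_statistic(observed):
    """Chi-squared goodness-of-fit statistic against a uniform expectation.

    `observed` maps condition label -> read count within one cluster.
    """
    total = sum(observed.values())
    k = len(observed)
    if k < 2 or total == 0:
        raise ValueError("need at least two conditions and at least one read")
    # Class balance: every condition is expected to contribute equally.
    expected = total / k
    stat = sum((count - expected) ** 2 / expected for count in observed.values())
    return stat, k - 1  # statistic and degrees of freedom


# Hypothetical counts for one cluster across two conditions.
stat, dof = balanced_chisq_statistic({"sample_A": 30, "sample_B": 10})
print(stat, dof)
```

The uniform expectation is only meaningful when each condition contributes the same number of reads overall, which is what "assumes class balance" refers to.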
- isoformant.cns(minimap2_path, seq_dict, dest_dir, ref_fa, cluster_label='leiden', ncore=1, sirv=False, messages=False)
Consensus calling module.
- Parameters
minimap2_path (str) – Path to minimap2 executable
seq_dict (dict) – Dictionary mapping cluster to read sequences
dest_dir (str) – Path to directory for output files
ref_fa (str) – Path to reference FASTA file
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’
ncore (int, optional) – Number of cores, defaults to 1
sirv (bool, optional) – True to benchmark SIRVs, defaults to False
messages (bool, optional) – True to print STDOUT/STDERR messages, defaults to False
- Returns
Path to consensus sequence set BAM file
- Return type
str
- Returns
Dictionary mapping cluster-specific consensus sequence to BAM path
- Return type
dict
- isoformant.create_viz_bams(input_adata, combined_bam, dest_dir, cluster_label='leiden', bam_n=10, ncore=1)
Create BAM files for visualizations. Returns a dictionary mapping cluster ID to BAM file.
- Parameters
input_adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)
combined_bam (str) – Path to BAM file associated with processed reads
dest_dir (str) – Path to directory for output files
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’
bam_n (int, optional) – Maximum number of records per BAM file, defaults to 10
ncore (int, optional) – Number of cores, defaults to 1
- Returns
Dictionary mapping cluster to BAM path
- Return type
dict
- Returns
Dictionary mapping cluster to read sequences
- Return type
dict
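The effect of `bam_n` can be pictured as grouping reads by cluster and keeping at most `bam_n` per group. The sketch below works on plain (read ID, cluster) pairs rather than BAM records; the helper name `cap_reads_per_cluster` is hypothetical, and keeping the first `bam_n` reads encountered is an assumption, since the documentation does not say how records are chosen.

```python
from collections import defaultdict


def cap_reads_per_cluster(read_clusters, bam_n=10):
    """Group read IDs by cluster, keeping at most `bam_n` per cluster."""
    grouped = defaultdict(list)
    for read_id, cluster in read_clusters:
        # Assumption: keep the first `bam_n` reads seen for each cluster.
        if len(grouped[cluster]) < bam_n:
            grouped[cluster].append(read_id)
    return dict(grouped)


# Hypothetical read-to-cluster assignments.
pairs = [("r1", "0"), ("r2", "0"), ("r3", "1"), ("r4", "0")]
capped = cap_reads_per_cluster(pairs, bam_n=2)
print(capped)
```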
- isoformant.kmerize(combined_csv, ksize=5)
Ingests CSV file and produces k-mer frequency anndata object.
- Parameters
combined_csv (str) – Path to CSV file
ksize (int, optional) – k-mer size, defaults to 5
- Returns
k-mer frequencies as anndata object
- Return type
anndata object
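To make the idea of a k-mer frequency profile concrete, here is a minimal sketch for a single read using only the standard library. The function `kmer_frequencies` is hypothetical, and the choices it makes (overlapping windows, skipping windows with ambiguous bases, normalizing counts to relative frequencies) are assumptions rather than documented isoformant behavior.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, ksize=5):
    """Return the relative frequency of every overlapping k-mer in `seq`."""
    seq = seq.upper()
    kmers = [seq[i:i + ksize] for i in range(len(seq) - ksize + 1)]
    # Skip windows that contain ambiguous bases such as N.
    counts = Counter(k for k in kmers if set(k) <= set("ACGT"))
    total = sum(counts.values())
    # Fixed vocabulary order so that every read yields a vector of equal length.
    vocab = ["".join(p) for p in product("ACGT", repeat=ksize)]
    return {k: (counts[k] / total if total else 0.0) for k in vocab}


freqs = kmer_frequencies("ACGTACGT", ksize=2)
print(freqs["AC"], freqs["TA"], len(freqs))
```

Stacking one such vector per read gives a read x k-mer matrix, with 4**ksize columns when the vocabulary is restricted to A, C, G, and T.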
- isoformant.minimap2_launcher(minimap2_path, ref_fa, cnsfa, cnsbam, sirv=False, messages=False)
Launch minimap2 alignment.
- Parameters
minimap2_path (str) – Path to minimap2 executable
ref_fa (str) – Path to reference FASTA file
cnsfa (str) – Path to input query FASTA file
cnsbam (str) – Path to output query BAM file
sirv (bool, optional) – True to benchmark SIRVs, defaults to False
messages (bool, optional) – True to print STDOUT/STDERR messages, defaults to False
- isoformant.pca_pipeline(adata_reads, groupby_list, components=['1,2'], n_comps=100, plot=False)
PCA pipeline performed in-place. Scale data and perform PCA. Option to print PC scatter plot.
- Parameters
adata_reads (Anndata object) – Read x k-mer data
groupby_list (list) – List of feature names to color-code in plot
components (list, optional) – List of PC pairs to plot (e.g. [‘1,2’, …]), defaults to [‘1,2’]
n_comps (int, optional) – Number of PCs to compute, defaults to 100
plot (bool, optional) – True to plot, False to skip plotting, defaults to False
- isoformant.plot_cluster_occupancy(adata, cond_list=[], cond_label='sample_id', cluster_label='leiden', fig_size=(6, 6), subplot_hjust=0.3, save=None)
Plots cluster occupancy by condition label.
- Parameters
adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)
cond_list (list, optional) – List of condition labels to be considered, defaults to []
cond_label (str, optional) – Condition label, defaults to ‘sample_id’
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’
fig_size (tuple, optional) – Figure size, defaults to (6, 6)
subplot_hjust (float, optional) – Subplot horizontal spacing, defaults to 0.3
save (str, optional) – Path to save plot, defaults to None
- Returns
Frequency table
- Return type
pandas dataframe
- isoformant.plot_highlight_COI(input_adata, COI, cluster_label='leiden')
Highlight cluster of interest in UMAP plot.
- Parameters
input_adata (Anndata object) – Read x k-mer data (processed by umap_pipeline)
COI (str) – Label of the cluster of interest
cluster_label (str, optional) – Cluster label, defaults to ‘leiden’
- isoformant.plot_tracks(track_dict, chrom, start, end, ref_fa, left_margin=0, save=None)
Plot read tracks from a dictionary of BAM files.
- Parameters
track_dict (dict) – Dictionary of track labels mapped to BAM files
chrom (str) – Reference sequence name of ROI
start (int) – 1-based start coordinate of ROI
end (int) – 1-based end coordinate of ROI
ref_fa (str) – Path to reference FASTA file
left_margin (int, optional) – Left margin in genomic coordinates, defaults to 0
save (str, optional) – Path to save plot, defaults to None
- isoformant.preprocess_reads(bams_list, chrom, start, end, dest_dir, max_reads=1000, ncore=1, qual_cutoff=10, len_cutoff=300)
Processes a list of BAM files. Coordinates the ‘process_bam’ task to extract passing reads from each file, then writes a merged BAM file and a merged, min-depth-balanced CSV file to disk.
- Parameters
bams_list (list) – List of sorted BAM file paths (.bai required in same directory)
chrom (str) – Reference sequence name of ROI
start (int) – 1-based start coordinate of ROI
end (int) – 1-based end coordinate of ROI
dest_dir (str) – Path to directory for output files
max_reads (int, optional) – Maximum possible reads considered, defaults to 1000
ncore (int, optional) – Number of cores, defaults to 1
qual_cutoff (float, optional) – Minimum mean base quality, defaults to 10
len_cutoff (int, optional) – Minimum read length, defaults to 300
- Returns
Path to merged CSV file
- Return type
str
- Returns
Path to sorted merged BAM file
- Return type
str
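"Min-depth balanced" can be read as downsampling every input to the depth of the shallowest one. The sketch below shows that idea on plain lists of read names; `balance_to_min_depth` is a hypothetical helper, and both the random subsampling and the way `max_reads` caps the target depth are assumptions, not confirmed isoformant behavior.

```python
import random


def balance_to_min_depth(reads_by_sample, max_reads=1000, seed=0):
    """Downsample every sample to the same number of reads.

    The target depth is the smallest sample size, capped at `max_reads`.
    """
    rng = random.Random(seed)
    depth = min(max_reads, min(len(reads) for reads in reads_by_sample.values()))
    return {
        sample: rng.sample(reads, depth)
        for sample, reads in reads_by_sample.items()
    }


# Hypothetical inputs: one deep sample and one shallow sample.
reads = {
    "s1": [f"s1_read{i}" for i in range(8)],
    "s2": [f"s2_read{i}" for i in range(3)],
}
balanced = balance_to_min_depth(reads)
print({sample: len(kept) for sample, kept in balanced.items()})
```

Balancing the inputs this way is what makes a uniform expectation reasonable in a downstream test that assumes class balance.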
- isoformant.process_bam(bamfile, chrom, start, end, suffix, outcsv, outbam, qual_cutoff=10, len_cutoff=300)
Performs a series of tasks on a BAM file. Filters reads by three criteria: (1) not a secondary or supplementary alignment, (2) minimum mean base quality, (3) minimum read length. Writes passing reads to disk in BAM format and in CSV format, randomly shuffled.
- Parameters
bamfile (str) – Path to sorted BAM file (.bai index file required in same directory)
chrom (str) – Reference sequence name of ROI
start (int) – 1-based start coordinate of ROI
end (int) – 1-based end coordinate of ROI
suffix (str) – Identifier to be appended to read names
outcsv (str) – Path to CSV output file
outbam (str) – Path to BAM output file
qual_cutoff (float, optional) – Minimum mean base quality, defaults to 10
len_cutoff (int, optional) – Minimum read length, defaults to 300
- Returns
Number of records passing filters
- Return type
int
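The three filters can be sketched without any BAM library by modeling a read as a small record. In the example below, the `Read` type and `passes_filters` are hypothetical, the flag bits 0x100 (secondary) and 0x800 (supplementary) come from the SAM specification, and two details are assumptions: the quality criterion uses the arithmetic mean of the Phred scores, and both cutoffs are inclusive.

```python
from statistics import mean
from typing import List, NamedTuple

SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


class Read(NamedTuple):
    name: str
    flag: int
    sequence: str
    qualities: List[int]  # per-base Phred scores


def passes_filters(read, qual_cutoff=10, len_cutoff=300):
    """Apply the alignment-type, length, and mean-quality filters in turn."""
    if read.flag & (SECONDARY | SUPPLEMENTARY):
        return False
    if len(read.sequence) < len_cutoff:
        return False
    # Assumption: arithmetic mean of Phred scores, cutoff inclusive.
    return mean(read.qualities) >= qual_cutoff


reads = [
    Read("r1", 0, "A" * 400, [20] * 400),      # passes all filters
    Read("r2", 0x100, "A" * 400, [20] * 400),  # secondary alignment
    Read("r3", 0, "A" * 100, [20] * 100),      # too short
    Read("r4", 16, "A" * 400, [5] * 400),      # mean quality too low
]
kept = [r.name for r in reads if passes_filters(r)]
print(kept)
```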
- isoformant.roi_fa(ref_fa, chrom, out_fa)
Trim FASTA to reference sequence of interest.
- Parameters
ref_fa (str) – Path to reference FASTA file
chrom (str) – Reference sequence name of ROI
out_fa (str) – Path to output file
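Trimming a FASTA file to one reference sequence amounts to copying a single record. The following is a minimal standard-library sketch under that reading; `extract_reference` is a hypothetical helper, not isoformant's code, and it assumes the record name is the first whitespace-delimited token of the header line.

```python
import os
import tempfile


def extract_reference(ref_fa, chrom, out_fa):
    """Copy the FASTA record named `chrom` from `ref_fa` to `out_fa`.

    Returns True if the record was found.
    """
    found = False
    writing = False
    with open(ref_fa) as src, open(out_fa, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # The record name is the first whitespace-delimited token.
                fields = line[1:].split()
                writing = bool(fields) and fields[0] == chrom
                found = found or writing
            if writing:
                dst.write(line)
    return found


# Build a small two-record FASTA file and extract the second record.
with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "ref.fa")
    out = os.path.join(tmp, "roi.fa")
    with open(ref, "w") as fh:
        fh.write(">chr1 test\nACGT\nACGT\n>chr2\nGGGG\n")
    found = extract_reference(ref, "chr2", out)
    with open(out) as fh:
        roi_text = fh.read()
print(found, repr(roi_text))
```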
- isoformant.shuffle_csv(incsv)
Shuffle CSV records. Performs task in-place.
- Parameters
incsv (str) – CSV file path
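An in-place shuffle of CSV records can be sketched as read, shuffle, and overwrite. The helper below, `shuffle_csv_in_place`, is hypothetical; whether isoformant's CSV files carry a header row is not stated, so the `has_header` switch and the column names in the example are assumptions.

```python
import csv
import os
import random
import tempfile


def shuffle_csv_in_place(path, has_header=True, seed=None):
    """Shuffle the data rows of a CSV file and overwrite the original."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    header, body = (rows[:1], rows[1:]) if has_header else ([], rows)
    random.Random(seed).shuffle(body)
    # Write to a temporary file first, then replace the original,
    # so that a failure mid-write does not truncate the input.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", newline="") as fh:
        csv.writer(fh).writerows(header + body)
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "reads.csv")
    with open(csv_path, "w", newline="") as fh:
        csv.writer(fh).writerows(
            [["read_id", "seq"], ["r1", "AC"], ["r2", "GT"], ["r3", "TT"]]
        )
    shuffle_csv_in_place(csv_path, seed=1)
    with open(csv_path, newline="") as fh:
        shuffled = list(csv.reader(fh))
print(shuffled[0], sorted(shuffled[1:]))
```

This version loads the whole file into memory, which is reasonable for read sets capped at a few thousand records.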
- isoformant.umap_pipeline(input_adata, groupby_list, n_pcs=3, n_neighbors=15, min_dist=0.5, resolution=1, components=['1,2'], plot=False)
UMAP pipeline performed in-place. Build KNN graph, compute UMAP, and cluster reads using Leiden algorithm. Option to print UMAP scatter plot.
- Parameters
input_adata (Anndata object) – Read x k-mer data (processed by pca_pipeline)
groupby_list (list) – List of feature names to color-code in plot
n_pcs (int, optional) – Number of PCs for KNN graph, defaults to 3
n_neighbors (int, optional) – Number of neighbors per neighborhood, defaults to 15
min_dist (float, optional) – Minimum distance to neighbor, defaults to 0.5
resolution (float, optional) – Leiden clustering resolution, defaults to 1
components (list, optional) – List of UMAP component pairs to plot (e.g. [‘1,2’, …]), defaults to [‘1,2’]
plot (bool, optional) – True to plot, False to skip plotting, defaults to False