Prepare STAR parameters
About STAR options for TEs
We’ll pass them from CLI via --extra_star_align_args (this parameter is officially supported in the STAR-Salmon route). Use:
--extra_star_align_args '--twopassMode Basic \
--alignEndsType EndToEnd \
--outSAMtype BAM Unsorted \
--outSAMunmapped None \
--alignSJoverhangMin 8 --alignSJDBoverhangMin 1 \
--outFilterMismatchNoverLmax 0.06 \
--outFilterMismatchNmax 999 \
--alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 \
--outFilterScoreMinOverLread 0.4 \
--outFilterMatchNminOverLread 0.4 \
--outFilterType BySJout \
--outFilterMultimapNmax 100 \
--winAnchorMultimapNmax 200 \
--outMultimapperOrder Random \
--outSAMmultNmax 1 \
--outSAMprimaryFlag OneBestScore \
--runRNGseed 777'Brief interpretation
--twopassMode Basic # discover novel SJs in pass1; improves splice accuracy
--alignEndsType EndToEnd # no soft-clipping; reduces opportunistic partial hits in repeats
--outSAMunmapped None # don’t emit unmapped reads (smaller BAM; QC modules won’t use them)
--alignSJoverhangMin 8 --alignSJDBoverhangMin 1 # reasonable SJ support (discovery + annotated)
--outFilterMismatchNoverLmax 0.06 # ≤6% mismatches per read; keeps alignments clean
--outFilterMismatchNmax 999 # (paired with the relative cap above)
--alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 # mammalian intron/gap ranges
--outFilterScoreMinOverLread 0.4 # ≥40% of read must be mappable by score
--outFilterMatchNminOverLread 0.4 # ≥40% of read must match; avoids tiny anchors in repeats
--outFilterType BySJout # prefer alignments consistent with known/learned SJs; fewer spurious SJs
--outFilterMultimapNmax 100 # retain multi-mappers (needed for TE signal)
--winAnchorMultimapNmax 200 # allow many candidate sites in repetitive sequence
--outMultimapperOrder Random # randomize among equally good hits
--outSAMmultNmax 1 # **emit ONE alignment per read** (random-one strategy)
--outSAMprimaryFlag OneBestScore # never write multiple primaries on ties
--runRNGseed 777 # reproducible random selection across runsTell STAR to emit a genome BAM
Add back: --outSAMtype BAM Unsorted
- Good:
--twopassMode Basic+--outFilterType BySJoutimproves splice junction calling and reduces mis-spliced alignments. - EndToEnd ensures reads won’t be soft-clipped to fit repeats; if adapters/low-quality tails remain, they simply fail (nf-core trims by default, so you’re fine).
- Multi-mappers: harmless for exons because you emit one best/random alignment per read. You’ll get consistent, integer counts with featureCounts (no
-M, no--fraction).
- If you derive introns as genebody − exons (recommended) or from an intron SAF, this mapping favors clean separation of spliced (exonic) vs unspliced (intronic) signal.
- Two-pass + SJ filtering reduces the chance that intronic reads get “forced” into dubious splice alignments.
- Broad first-to-last exon spans benefit from permissive intron windows, so reads across large introns remain alignable. Your
alignIntronMax 1e6/MatesGapMax 1e6cover that.
- This is exactly the random-one policy: keep many candidate loci (
Nmax 100) but write one per read at random (outSAMmultNmax 1+OneBestScore+Random+ fixed seed). - That harmonizes with featureCounts without
--fractionto produce integer family/subfamily counts, avoiding multi-mapping inflation. - Caveat: don’t use this BAM for locus-specific TE inference (assignments are random). If you later need locus resolution, do a second, unique-only STAR run.
- Two-pass + BySJout: higher confidence in splice junctions improves all gene-centric analyses (exon/intron/genebody) and prevents spurious intron signal arising from mis-spliced alignments.
- EndToEnd: avoids soft-clipping that can create misleading partial matches in repetitive DNA—useful when you also care about TE signal and clean intron/exon boundaries.
- Relative mismatch cap (6%): keeps alignments specific without being so strict that you lose real signal. If your RNA is lower quality/longer reads, you can relax to 0.08–0.10.
- High multimap retention + single emitted alignment: preserves TE information during mapping and implements the random-one TE strategy at the alignment stage, so downstream counting is simple and reproducible.
- Reproducibility: fixed seed guarantees identical random assignments across re-runs.
Relax mismatch proportion if needed
If you see lower than expected alignment rates (e.g., older libraries, long reads), try:
--outFilterMismatchNoverLmax 0.08
Junction permissiveness
Your BySJout is conservative (good default). If you work with unusual splicing (e.g., immune hyper-editing, many non-canonical SJs), you might drop it—but only if you have a reason.
- ✅ Green (keep): two-pass + BySJout; EndToEnd; intron windows; multimap 100; random-one trio; RNG seed.
- 🟨 Yellow (situational):
outFilterMismatchNoverLmax 0.06(relax to 0.08–0.10 if needed);outSAMunmapped None(flip toWithinfor QC). - 🟥 Red (don’t mix): Don’t use this BAM for TE locus-specific counting or for fractional TE counting—its policy is random-one. If you want fractional, re-run STAR allowing multiple alignments (
--outSAMmultNmax >1) and then count TEs withfeatureCounts -M --fraction.
With this configuration, you can align once and then:
- count exons/introns/genebodies with featureCounts on disjoint gene tracks, and
- count TE families/subfamilies (integer) with your grouped SAF (no
--fraction).
This guide shows how we:
- convert the TE GTF into a grouped SAF so featureCounts outputs rows like
HAL1:L1:LINE, - (optionally) remove TE loci that overlap annotated exons with bedtools,
- run TE counts (fractional multi-mappers; fragments; unstranded),
- run gene counts per library strandedness,
- do quick QA checks so we know we did the right thing.
Everything below uses your actual paths.
- Gene GTF
GENE_GTF=/data1/users/antonz/data/DM_summer_2025/13441-DM/results_mm39_TE/genome/gencode.vM37.primary_assembly.annotation.filtered.gtf- TE GTF (from TEtranscripts website, mm39 RepeatMasker in GTF form)
TE_GTF=/data1/shared/ref/mouse/Ensembl/mm39/GRCm39_Ensembl_rmsk_TE.gtf.gz- BAMs (per project):
- 13441-DM (reverse-stranded genes):
/data1/users/antonz/data/DM_summer_2025/13441-DM/results_mm39_TE/star_salmon/bam_fin/*.bam - 13444-DM (unstranded genes):
/data1/users/antonz/data/DM_summer_2025/13444-DM/results_mm39_TE/star_salmon/bam_fin/*.bam
- 13441-DM (reverse-stranded genes):
This lets featureCounts sum all loci for the same TE group into one meta-feature row in the output.
We keep class info so labels look like
HAL1:L1:LINE.
TE_GTF=/data1/shared/ref/mouse/Ensembl/mm39/GRCm39_Ensembl_rmsk_TE.gtf.gz
SAF_ALL=/data1/shared/ref/mouse/Ensembl/mm39/GRCm39_rmsk_TE_GROUPED_all.saf
zcat "$TE_GTF" |
awk 'BEGIN{OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}
$0 !~ /^#/ {
match($0,/gene_id "([^"]+)"/,g); # subfamily / repName
match($0,/family_id "([^"]+)"/,f); # family
match($0,/class_id "([^"]+)"/,c); # class (LINE/LTR/SINE/DNA/RC/...)
gid = g[1] ":" f[1] ":" c[1]; # TEtranscripts-like label
print gid, $1, $4, $5, $7
}' > "$SAF_ALL"Checks:
head -3 "$SAF_ALL"
# GeneID Chr Start End Strand
# Lx2B2:L1:LINE 1 8387807 8388657 +
# B3:B2:SINE 1 41942995 41943142 +
wc -l "$SAF_ALL" # rows = loci + 1 header
cut -f1 "$SAF_ALL" | tail -n +2 | sort -u | wc -l # unique TE groups (≈ 1–2k; you saw 1243)SAF_RETRO=/data1/shared/ref/mouse/Ensembl/mm39/GRCm39_rmsk_TE_GROUPED_retro.saf
zcat "$TE_GTF" |
awk 'BEGIN{OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}
$0 !~ /^#/ {
match($0,/class_id "([^"]+)"/,cl); cls = cl[1]
if (cls=="LINE" || cls=="SINE" || cls=="LTR" || cls=="RC") {
match($0,/gene_id "([^"]+)"/,g);
match($0,/family_id "([^"]+)"/,f);
gid = g[1] ":" f[1] ":" cls;
print gid, $1, $4, $5, $7
}
}' > "$SAF_RETRO"Why: This mirrors common TE-expression analyses and avoids low-complexity/simple repeats and RNA genes that can dominate counts.
We use bedtools to subtract exon intervals from TE intervals, then rebuild the SAF.
Make exon BED (0-based) from the gene GTF:
GENE_GTF=/data1/users/antonz/data/DM_summer_2025/13441-DM/results_mm39_TE/genome/gencode.vM37.primary_assembly.annotation.filtered.gtf
awk 'BEGIN{OFS="\t"} $3=="exon"{print $1,$4-1,$5,".",".",$7}' "$GENE_GTF" > exons.bedTE_GTF=/data1/shared/ref/mouse/Ensembl/mm39/GRCm39_Ensembl_rmsk_TE.gtf.gz
# Unique classes, descending by locus count
zcat "$TE_GTF" |
awk '$0!~/^#/{ if (match($0,/class_id "([^"]+)"/,m)) print m[1]; else print "NA" }' |
sort | uniq -c | sort -nrThat will show things like LINE, SINE, LTR, DNA, RC, Simple_repeat, Low_complexity, Satellite, rRNA, tRNA, snRNA, scRNA, srpRNA, Unknown, ... (exact list depends on the file).
# families per class, with counts
zcat "$TE_GTF" |
awk '$0!~/^#/{ match($0,/family_id "([^"]+)"/,f); match($0,/class_id "([^"]+)"/,c);
if(f[1]!="" && c[1]!="") print c[1]"\t"f[1]; }' |
sort | uniq -c | sort -nr | headOnce you’ve seen what’s present, set a regex for the classes to keep:
KEEP_CLASSES='^(LINE|SINE|LTR|RC)$' # edit if you want DNA or others
SAF_RETRO=/data1/shared/ref/mouse/Ensembl/mm39/GRCm39_rmsk_TE_GROUPED_retro.saf
zcat "$TE_GTF" |
awk -v pat="$KEEP_CLASSES" 'BEGIN{OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}
$0!~/^#/{
match($0,/class_id "([^"]+)"/,cl); cls=cl[1];
if (cls ~ pat){
match($0,/gene_id "([^"]+)"/,g);
match($0,/family_id "([^"]+)"/,f);
print g[1] ":" f[1] ":" cls, $1, $4, $5, $7
}
}' > "$SAF_RETRO"# classes present in the SAF (3rd token in GeneID)
cut -f1 "$SAF_RETRO" | tail -n +2 | awk -F: '{print $3}' | sort -uIf that prints only LINE, LTR, RC, SINE (or whatever you chose), you’re good.
zcat "$TE_GTF" | grep 'class_id "LTR"' | headJust extend the regex, e.g.:
KEEP_CLASSES='^(LINE|SINE|LTR)$' # drop RC (Helitrons)That’s all you need to inspect and control class filtering with confidence.
IN_SAF="$SAF_ALL" # or $SAF_RETRO
awk 'BEGIN{OFS="\t"} NR>1{print $2,$3-1,$4,$1,".",$5}' "$IN_SAF" > te_grouped.bed
# Make sure we normalized contig names the same way in both inputs before running bedtools
# (either both with 'chr' or both without)
grep -m1 '^chr' exons.bed || echo "exons: no chr prefix"
grep -m1 '^chr' te_grouped.bed || echo "te: no chr prefix"
# If exons.bed uses chr* but TE does not (or vice versa), normalize:
# Strip 'chr' from exons to match TE (Ensembl mm39 often has no 'chr'):
# (If your exons DO have 'chr' and TE does NOT, use this)
sed -E 's/^chr//' exons.bed > exons.tmp && mv exons.tmp exons.bedSort both files (bedtools is happier & faster on sorted inputs):
LC_ALL=C sort -k1,1 -k2,2n exons.bed -o exons.bed
LC_ALL=C sort -k1,1 -k2,2n te_grouped.bed -o te_grouped.bedMeasure overlaps, subtract, verify:
bedtools intersect -a te_grouped.bed -b exons.bed -u | wc -l # how many TE loci overlap exons?
bedtools subtract -a te_grouped.bed -b exons.bed > te_grouped_noExon.bed
bedtools intersect -a te_grouped_noExon.bed -b exons.bed -u | wc -l # should be 0Rebuild a SAF from the exon-removed BED:
#SAF_ALL=/data1/shared/ref/mouse/Ensembl/mm39/GRCm39_rmsk_TE_GROUPED_all.saf
SAF_NOEXON=/data1/shared/ref/mouse/Ensembl/mm39/GRCm39_rmsk_TE_GROUPED_all_noExon.saf
awk 'BEGIN{OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}
{print $4,$1,$2+1,$3,$6}' te_grouped_noExon.bed > "$SAF_NOEXON"
# Check counts again
wc -l "$SAF_NOEXON"
cut -f1 "$SAF_NOEXON" | tail -n +2 | sort -u | wc -l # groups should remain ~ the same (≈1243)How bedtools works (brief):
intersect -ureports intervals from A that have any overlap with B (we used it to count overlaps).subtractremoves the parts of A that overlap B (we used it to drop exon-overlapping TE loci segments).- We used 0-based BED (hence
start-1when converting from GTF/SAF to BED, and+1when converting back).