When I work with two close relatives, it is interesting which genes are homologous and how chromosomal structures are conserved among those relatives. Synteny plot is useful to visualize how to compare multiple genomic sequences. MCScan is a tool I would like to introduce here. As an example I used Solanum lycopersicum and Solanum pennellii, which had been used in my lab.
Preparation
Donwload genome fasta files and gff files.
Install last
cd /usr/local/bin
sudo wget https://gitlab.com/mcfrith/last/-/archive/1584/last-1584.tar.gz
sudo tar -xvzf last-1584.tar.gz
cd last-1584
sudo make
cd /usr/local/bin
sudo ln -s sudo ln -s last-1584/bin/* ./
Now install JCVI (which inlcudes MCscan)
conda create -n jcvi python=3.10 scipy # as of 12/2024, version 3.13 does not allow installation to complete.
conda activate jcvi
pip install jcvi
brew install texlive #note: conda version of texlive is broken
activate jcvi
cd /Volumes/data_work/data8/NGS_related/Sinha_lab/Sinteny_MCscan/input
conda activate jcvi
format fasta files
python -m jcvi.formats.fasta format /Volumes/data_work/Data8/NGS_related/Sinha_lab/ITAG4.0_release/ITAG4.0_proteins.fasta SLitag4_0.pep
python -m jcvi.formats.fasta format /Volumes/data_work/Data8/NGS_related/Sinha_lab/Solanum_pennellii_v2/annotations/Spenn-v2-aa-annot.fa SPenn.pep
Convert GFF to BED (genome)
# Convert GFF to BED (genome)
#python -m jcvi.formats.gff bed --type=mRNA --key=Name /Volumes/data_work/Data8/NGS_related/Sinha_lab/ITAG4.0_release/ITAG4.0_gene_models.gff -o SLitag4_0.bed
python -m jcvi.formats.gff bed --type=mRNA --key=Name --primary_only /Volumes/data_work/Data8/NGS_related/Sinha_lab/ITAG4.0_release/ITAG4.0_gene_models.gff -o SLitag4_0.bed
python -m jcvi.formats.gff bed --type=mRNA --key=Name --primary_only /Volumes/data_work/Data8/NGS_related/Sinha_lab/Solanum_pennellii_v2/spenn_v2.0_gene_models_annot.gff -o SPenn.bed
Modify bed file for working orthologw
If Solyc00g500001.1.1 in fasta file, bed file has to use Solyc00g500001.1
(Do not) To remove .* at the end in fasta file gene name to match gene names in both files.
(Do) To remove .* at the endo in bed file gene name.
grep "\.5\.1" SLitag4_0.bed # specific to *.5.1.1
head SLitag4_0.bed # others
# format
sed -i '' -E 's/(Solyc[0-9]+g[0-9]+\.[0-9]+)\.[0-9]/\1/' SLitag4_0.bed
head SLitag4_0.bed
grep "\.5\.1" SLitag4_0.bed
head SLitag4_0.bed
# Explanation gy ChatGPT
# Explanation of Parts:
#sed: The stream editor used for text transformations.
#
#-i '':
#
#The -i flag means "edit the file in place."
#The empty '' argument is specific to macOS sed and is used to indicate no backup file should be created. On Linux, you could use -i without the empty quotes or specify a backup suffix like -i.bak.
#-E: Enables extended regular expressions (ERE), which allow more advanced pattern matching than basic sed syntax.
#'s/.../.../':
#This is the sed substitution command. The syntax is:
#s/<pattern>/<replacement>/
#s/ indicates a substitution.
#The first part (<pattern>) defines what to search for.
#The second part (<replacement>) defines what to replace it with.
#Pattern Explanation ((Solyc[0-9]+g[0-9]+\.[0-9]+)\.[0-9]):
#Solyc: Matches the literal string Solyc.
#[0-9]+: Matches one or more digits (representing chromosome or gene numbers).
#g: Matches the literal character g.
#[0-9]+: Matches one or more digits (representing additional gene details).
#\.: Matches a literal period (.).
#[0-9]+: Matches one or more digits (possibly a version or sub-feature number).
#\.[0-9]: Matches another period followed by a single digit. This is the part you want to remove.
#( ... ): Groups part of the pattern so it can be referenced in the replacement using \1.
#Replacement (\1):
#\1 refers to the first group captured in parentheses ((Solyc[0-9]+g[0-9]+\.[0-9]+)).
#This keeps the matched part before the second .[0-9] while excluding the trailing .X.
#File (SLitag4_0.bed):
#This is the input file being edited in place.
#What It Does:
#This command removes the trailing .X (where X is a single digit) from strings that match the pattern Solyc<number>g<number>.<number>.<number> in the file SLitag4_0.bed. The modified file is saved in place without creating a backup.
# For SPenn.bed
head SPenn.bed > SPenn.test.bed
sed -i '' -E 's/(Sopen[0-9]+g[0-9]+)\.[0-9]$/\1/' SPenn.test.bed
cat SPenn.test.bed
# real
sed -i '' -E 's/(Sopen[0-9]+g[0-9]+)\.[0-9]+/\1/' SPenn.bed
cat SPenn.bed
find orthologs
cd ../../output/
ln -s ../input/* ./
python -m jcvi.compara.catalog ortholog --dbtype prot SLitag4_0 SPenn
#You can use awk to remove the “-” from the first column (chromosome name) and save the result. This is to have simple chromosome name in the synteny plot.
## for M82
awk '{gsub("SL4.0", "", $1); print}' SLitag4_0.bed > SLitag4_0_cleaned.bed
## for SPenn
awk '{gsub("Spenn-", "", $1); print}' SPenn.bed > SPenn_cleaned.bed
change space to “ in .bed using BBE editor
Write layout file
write_file(
"# y, xstart, xend, rotation, color, label, va, bed
.6, .1, .8, 0, m, M82, top, SLitag4_0_cleaned.bed
.4, .1, .8, 0, k, SPenn, top, SPenn_cleaned.bed
# edges
e, 0, 1, SLitag4.0.SPenn.anchors.simple
",
file="/Volumes/data_work/data8/NGS_related/Sinha_lab/Sinteny_MCscan/output/layout3")
Set chromosomes to plot and order.
This chromosome names has to be matched to the bed files described in layout file and each line correspond to each genotype (i.e. two genotypes needs two lines).
# M82 chromosome name
M82_ids <- read_delim("/Volumes/data_work/data8/NGS_related/Sinha_lab/Sinteny_MCscan/output/SLitag4_0_cleaned.bed",col_names = FALSE) %>% count(X1) %>% dplyr::select(X1) %>% as.vector()
# SPenn chromosome name
SPenn_ids <- read_delim("/Volumes/data_work/data8/NGS_related/Sinha_lab/Sinteny_MCscan/output/SPenn_cleaned.bed",col_names = FALSE) %>% count(X1) %>% dplyr::select(X1) %>% as.vector()
# combine them
seqIDs <- list(str_c(M82_ids$X1,collapse=","),
str_c(SPenn_ids$X1,collapse=",")
)
# write two lines
write_lines(seqIDs, "/Volumes/data_work/data8/NGS_related/Sinha_lab/Sinteny_MCscan/output/seqids4")
Plot
python -m jcvi.graphics.karyotype --outfile S.lycoM82_S.Penn_karyotype2.pdf seqids4 layout3
Microsynteny
Let’s move to local synteny.
python -m jcvi.compara.synteny mcscan SLitag4_0_cleaned.bed SLitag4_0.SPenn.lifted.anchors --iter=1 -o SLitag4_0.SPenn.i1.blocks
Select only chromosome 6 in M82
grep "Solyc06g" SLitag4_0.SPenn.i1.blocks > blocks.M82chr06
write block layout
write_file(
"# x, y, rotation, ha, va, color, ratio, label
0.5, 0.6, 0, left, center, m, 1, M82 Chr6
0.5, 0.4, 0, left, center, #fc8d62, 1, Penn
# edges
e, 0, 1",
file="/Volumes/data_work/data8/NGS_related/Sinha_lab/Sinteny_MCscan/output/blocks1.layout")
microsynteny plot (M82 chromosome 6)
python -m jcvi.formats.bed merge SLitag4_0_cleaned.bed SPenn_cleaned.bed -o SLitag4_0_SPenn.bed
python -m jcvi.graphics.synteny blocks.M82chr06 SLitag4_0_SPenn.bed blocks1.layout --outputprefix S.lycoM82_blocks.M82chr06
label “Solyc06g052070” ~ “Solyc06g072930”
python -m jcvi.graphics.synteny blocks.M82chr06 SLitag4_0_SPenn.bed blocks1.layout --outputprefix S.lycoM82_blocks.M82chr06.v2 --genelabelsize=10 --genelabels=Solyc06g052070.3,Solyc06g072930.4
how about SPenn chr06?
Make SPenn blocks
python -m jcvi.compara.synteny mcscan SPenn_cleaned.bed SLitag4_0.SPenn.lifted.anchors --iter=1 -o SPenn_SLitag4_0.i1.blocks
select only S. Penn chromosome 6
grep "Sopen06g" SPenn_SLitag4_0.i1.blocks > blocks.SPennchr06
write block layout
write_file(
"# x, y, rotation, ha, va, color, ratio, label
0.5, 0.6, 0, left, center, m, 1, M82
0.5, 0.4, 0, left, center, #fc8d62, 1, Penn Chr6
# edges
e, 0, 1",
file="/Volumes/data_work/data8/NGS_related/Sinha_lab/Sinteny_MCscan/output/blocks.SpennChr06.layout")
microsynteny plot (SPenn chromosome 6)
python -m jcvi.graphics.synteny blocks.SPennchr06 SLitag4_0_SPenn.bed blocks.SpennChr06.layout --outputprefix S.lycoM82_blocks.SPennchr06
visualization of synteny plot by JBrowse2
Install JBrowse2 app on a computer
Import .anchor.simple file together with .bed files
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [9] ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 jsonlite_1.8.9 compiler_4.4.0 tidyselect_1.2.1
## [5] jquerylib_0.1.4 scales_1.3.0 yaml_2.3.10 fastmap_1.2.0
## [9] R6_2.5.1 generics_0.1.3 knitr_1.48 bookdown_0.41
## [13] munsell_0.5.1 tzdb_0.4.0 bslib_0.8.0 pillar_1.9.0
## [17] rlang_1.1.4 utf8_1.2.4 stringi_1.8.4 cachem_1.1.0
## [21] xfun_0.49 sass_0.4.9 timechange_0.3.0 cli_3.6.3
## [25] withr_3.0.1 magrittr_2.0.3 digest_0.6.37 grid_4.4.0
## [29] rstudioapi_0.16.0 hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5
## [33] evaluate_1.0.0 glue_1.7.0 blogdown_1.19 fansi_1.0.6
## [37] colorspace_2.1-1 rmarkdown_2.28 tools_4.4.0 pkgconfig_2.0.3
## [41] htmltools_0.5.8.1