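Basic Workflow
(The complete Molo Pipeline workflow for each dataset is given below; copying and running the code reproduces all results shown on the website. Official links to the datasets and prepackaged molo objects are also provided, and users can download whichever they need.)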
10X Visium
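Data requirements
10X Visium data for the Molo Pipeline must include the filtered_feature_bc_matrix.h5 file and a spatial folder. The spatial folder must contain three files: scalefactors_json.json, tissue_lowres_image.png, and tissue_positions_list.csv. All of these files are generated by Space Ranger. The file tree of the example data is shown below: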
visium
├── filtered_feature_bc_matrix.h5
└── spatial
    ├── aligned_fiducials.jpg
    ├── detected_tissue_image.jpg
    ├── scalefactors_json.json
    ├── tissue_hires_image.png
    ├── tissue_lowres_image.png
    └── tissue_positions_list.csv
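Before running the pipeline, it can help to confirm that the folder actually contains the required files listed above. The following is a minimal sketch in base R; `check_visium_folder` is a hypothetical helper written for illustration and is not part of Molo.

```r
# Hypothetical helper (not part of Molo): report which of the required
# Visium files are missing from a Space Ranger output folder.
check_visium_folder <- function(folder_path) {
  required <- c(
    "filtered_feature_bc_matrix.h5",
    file.path("spatial", "scalefactors_json.json"),
    file.path("spatial", "tissue_lowres_image.png"),
    file.path("spatial", "tissue_positions_list.csv")
  )
  present <- file.exists(file.path(folder_path, required))
  # Return the relative paths that were not found (empty if all are present)
  required[!present]
}

missing_files <- check_visium_folder("./visium")
if (length(missing_files) > 0) {
  message("Missing files: ", paste(missing_files, collapse = ", "))
}
```

An empty result means all four required files were found; the optional images in the spatial folder are not checked.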
Standard workflow
molo_obj <- create_molo("Visium", folder_path = "./visium")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "mouse",
                          molo_env = "/anaconda3/envs/Molo/",
                          plot_path = "./plots/visium/",
                          labeled_ref = "./ref/mouse_brain.rds",
                          run_Banksy = FALSE)
Slide-seq v1/v2
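Data requirements
Slide-seq data for the Molo Pipeline must include the raw_counts.csv.gz and spatial_info.csv.gz files. Other formats such as .tsv.gz and .txt.gz are also supported. The file tree of the example data is shown below: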
slide_seq
├── raw_counts.csv.gz
└── spatial_info.csv.gz
Standard workflow
molo_obj <- create_molo("Slide-seq", folder_path = "./slide_seq")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "mouse",
                          plot_path = "./plots/slide_seq/",
                          labeled_ref = "./ref/mouse_brain.rds",
                          run_Banksy = FALSE)
Vizgen MERFISH
Data requirements
Vizgen MERFISH data for the Molo Pipeline must include the cell_by_gene.csv file and a cell_boundaries folder. The cell_boundaries folder contains a number of feature_data_*.hdf5 files. The file tree of the example data is shown below:
vizgen
├── cell_boundaries
│   ├── feature_data_0.hdf5
│   ├── feature_data_1.hdf5
│   ├── ...
│   └── feature_data_1225.hdf5
└── cell_by_gene.csv
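Because the cell_boundaries folder can hold well over a thousand files, a quick count is an easy way to spot an incomplete download. The sketch below uses base R only; `summarise_merfish_folder` is a hypothetical helper that is not part of Molo, and the pattern assumes the index in each file name is numeric, as in the example tree.

```r
# Hypothetical helper (not part of Molo): summarise a Vizgen MERFISH folder.
summarise_merfish_folder <- function(folder_path) {
  boundary_dir <- file.path(folder_path, "cell_boundaries")
  # Match feature_data_<number>.hdf5, e.g. feature_data_0.hdf5
  # (numeric index is an assumption based on the example tree)
  boundary_files <- list.files(boundary_dir,
                               pattern = "^feature_data_[0-9]+\\.hdf5$")
  list(
    has_cell_by_gene = file.exists(file.path(folder_path, "cell_by_gene.csv")),
    n_boundary_files = length(boundary_files)
  )
}

str(summarise_merfish_folder("./vizgen"))
```

If the folder does not exist, the helper reports `has_cell_by_gene = FALSE` and zero boundary files rather than raising an error.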
Standard workflow
molo_obj <- create_molo("MERFISH", folder_path = "./merfish")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "mouse",
                          plot_path = "./plots/vizgen/",
                          labeled_ref = "./ref/mouse_brain.rds",
                          run_Banksy = FALSE)
Nanostring CosMx
Data requirements
Nanostring CosMx data for the Molo Pipeline must include several files. The file tree of the example data is shown below:
cosmx
├── Run5642_S3_Quarter_exprMat_file.csv
├── Run5642_S3_Quarter_fov_positions_file.csv
├── Run5642_S3_Quarter_metadata_file.csv
├── Run5642_S3_Quarter-polygons.csv
└── Run5642_S3_Quarter_tx_file.csv
Standard workflow
molo_obj <- create_molo("CosMx", folder_path = "./cosmx")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "human",
                          plot_path = "./plots/cosmx/",
                          labeled_ref = NULL,
                          run_Banksy = FALSE)
10X Xenium
Standard workflow
molo_obj <- create_molo("Xenium", folder_path = "./xenium")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "mouse",
                          plot_path = "./plots/xenium/",
                          labeled_ref = "./ref/mouse_brain.rds",
                          run_Banksy = FALSE)
spatial ATAC-seq
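Preprocessing
1) Download the SRA data from NCBI and extract it;
2) Fill in config.yaml as prompted; do not modify any file other than config.yaml;
3) Run snakemake --configfile config.yaml in a terminal;
4) Check the specified output directory for fragments.tsv.gz and the other output files.
Tip: Preprocessing requires you to install Cell Ranger ATAC yourself and enter its installation path in config.yaml as prompted.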
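Once preprocessing finishes, a quick look at the first few records of fragments.tsv.gz can confirm that the file is readable before it is passed to the pipeline. The sketch below is base R only; `peek_fragments` is a hypothetical helper that is not part of Molo, the path is a placeholder, and the check assumes the usual 10x fragments layout (tab-separated chromosome, start, end, barcode, and read count, with optional header lines starting with "#").

```r
# Hypothetical helper (not part of Molo): return the first n records of a
# gzipped fragments file after checking each has at least four fields.
peek_fragments <- function(path, n = 5) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con, n = 1000)
  # Skip header comment lines beginning with "#"
  records <- head(lines[!startsWith(lines, "#")], n)
  fields <- strsplit(records, "\t", fixed = TRUE)
  stopifnot(length(fields) > 0, all(lengths(fields) >= 4))
  records
}

# Placeholder path: replace with the output directory set in config.yaml
fragments_path <- "path/to/output/fragments.tsv.gz"
if (file.exists(fragments_path)) {
  print(peek_fragments(fragments_path))
}
```

Only the first 1000 lines are read, so the check stays fast even for large files.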
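Data requirements
Spatial ATAC-seq data for the Molo Pipeline must include the fragments.tsv.gz file and a spatial folder. The spatial folder contains three files: scalefactors_json.json, tissue_lowres_image.png, and tissue_positions_list.csv. The file tree of the example data is shown below: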
sp_ATAC
├── fragments.tsv.gz
└── spatial
    ├── scalefactors_json.json
    ├── tissue_lowres_image.png
    └── tissue_positions_list.csv
Standard workflow
molo_obj <- create_molo("ATAC", folder_path = "./atac", ATAC_dataset = "mouse")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "mouse",
                          plot_path = "./plots/atac/",
                          labeled_ref = "./ref/mouse_brain.rds",
                          run_Banksy = FALSE)
CODEX
Standard workflow
molo_obj <- create_molo("CODEX", folder_path = "./codex")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "human",
                          plot_path = "./plots/codex/",
                          labeled_ref = NULL,
                          run_Banksy = FALSE)
Stereo-seq
Standard workflow
molo_obj <- create_molo("Stereo-seq", folder_path = "./stereo_seq/")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "mouse",
                          plot_path = "./plots/stereo_seq/",
                          labeled_ref = NULL,
                          run_Banksy = FALSE)
Custom
Data requirements
Custom data for the Molo Pipeline must include matrix.parquet; several other files are optional. The file tree of the example data is shown below:
custom
├── matrix.parquet
├── embedding.parquet (optional)
├── umap.parquet (optional)
├── meta.csv (optional)
└── spatial.csv (optional)
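Tip: Because the matrix and embedding data are often very large, we use parquet rather than traditional csv to improve I/O efficiency. You can create parquet files with the arrow package in R, or with to_parquet(engine = 'pyarrow') from the pandas package in Python.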
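As an illustration of the R route mentioned in the tip, the sketch below writes a small table to parquet with the arrow package and reads it back. It assumes arrow is installed. The toy table is purely illustrative: the layout that Molo expects inside matrix.parquet (orientation, column names) is not specified here, so the columns shown are an assumption.

```r
# Assumes the arrow package is installed: install.packages("arrow")
library(arrow)

# Toy expression table for illustration only; the real layout expected
# in matrix.parquet is an assumption, not taken from the Molo documentation.
toy_matrix <- data.frame(
  gene  = c("GeneA", "GeneB", "GeneC"),
  cell1 = c(0L, 3L, 1L),
  cell2 = c(2L, 0L, 5L)
)

out_file <- file.path(tempdir(), "matrix.parquet")
write_parquet(toy_matrix, out_file)

# Read the file back to confirm the round trip preserved the shape
round_trip <- as.data.frame(read_parquet(out_file))
stopifnot(identical(dim(round_trip), dim(toy_matrix)))
```

The example writes to a temporary directory; for real data, point `out_file` at the custom folder instead.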
Standard workflow
molo_obj <- create_molo("Custom", folder_path = "./custom/")
molo_obj <- Molo_Pipeline(molo_obj,
                          nCount_range = c(10, 2000),
                          nFeature_range = c(10, 2000),
                          dataset = "mouse",
                          plot_path = "./plots/custom/",
                          labeled_ref = NULL,
                          run_Banksy = FALSE)