Nanopore full-length 16S metagenomics report

I. Project information

II. Workflow

At the start of the analysis workflow, the raw pod5 files were basecalled, and the resulting raw FASTQ files were demultiplexed with dorado_basecall_server v7.1.4. Reads were then filtered with nanoq, discarding those with a Q score < 10 or a length below 1,300 base pairs (bp) or above 1,950 bp. Sequencing summaries and statistics were calculated and visualized with NanoPlot. Each filtered read was transformed into a normalized 5-mer frequency vector. The vector set was then processed with Uniform Manifold Approximation and Projection (UMAP) and clustered with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), which assigned each read to a cluster. For each cluster, Canu was used for read correction, FastANI for draft selection, and Racon and Medaka for polishing. Finally, representative sequences selected from each cluster were classified using BLAST against the SILVA 138 ribosomal RNA database to generate the OTU table.
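The filtering and vectorization steps described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names are hypothetical, the mean read quality is assumed to be computed by averaging error probabilities, and "normalized" is assumed to mean dividing each 5-mer count by the total count for the read.

```python
import math
from itertools import product

K = 5
# All 4**5 = 1024 possible DNA 5-mers in a fixed order, so every read maps to the same axes.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def mean_read_quality(phred_scores):
    """Mean read quality from averaged error probabilities (assumed convention)."""
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def passes_filter(sequence, phred_scores, min_q=10, min_len=1300, max_len=1950):
    """Keep reads with Q >= 10 and a length between 1,300 and 1,950 bp."""
    return (min_len <= len(sequence) <= max_len
            and mean_read_quality(phred_scores) >= min_q)


def kmer_frequency_vector(sequence):
    """Return the 5-mer counts of a read divided by the total count (assumed normalization)."""
    counts = [0] * len(KMERS)
    seq = sequence.upper()
    for i in range(len(seq) - K + 1):
        # Windows containing N or other non-ACGT symbols are skipped.
        idx = KMER_INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts
```

Each read then becomes a point in a 1,024-dimensional space, which is the kind of input that UMAP and HDBSCAN operate on.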

III. Analysis results

1. RAW QC report

Path: ./result/QC/NanoPlot_QC_raw

Sample1

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample2

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample3

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample4

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample5

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample6

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

2. Filter QC report

Path: ./result/QC/NanoPlot_QC_filter

Sample1

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample2

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample3

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample4

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample5

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

Sample6

Summary

LengthvsQualityScatterPlot_dot

HistogramReadlength

3. Analysis report

Path: ./result/Analysis_report.txt

Filter: number of sequencing reads that passed the quality threshold and the read-length range
Cluster: number of sequencing reads assigned to a cluster
Represent.OTU: number of clusters
Mapped.OTU: number of clusters mapped to the SILVA database
Number.of.Tax: number of taxa

4. Group information

groupA: Sample1, Sample2, Sample3
groupB: Sample4, Sample5, Sample6

5. Taxonomy results

OTU table

Path: ./result/DEMO-OTU.txt

Barplot

Based on the taxonomy annotation, the 10 bacteria with the highest relative abundance at each level (Phylum, Class, Order, Family, Genus, Species) are selected for each sample or group and shown in a bar plot of relative abundance. Bacteria with lower relative abundance are merged into “Others”.
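A minimal sketch of the top-10 selection for one sample at one taxonomic level is shown below. The function name and the input layout (a dictionary of taxon counts) are assumptions made for illustration.

```python
def top_n_with_others(abundance, n=10):
    """Keep the n most abundant taxa and merge the remainder into 'Others'.

    abundance: {taxon name: read count or abundance} for one sample (hypothetical layout).
    Returns relative abundances that sum to 1.
    """
    total = sum(abundance.values())
    if total <= 0:
        raise ValueError("abundance table has no reads")
    rel = {taxon: count / total for taxon, count in abundance.items()}
    # Rank taxa from most to least abundant.
    ranked = sorted(rel.items(), key=lambda item: item[1], reverse=True)
    top = dict(ranked[:n])
    others = sum(value for _, value in ranked[n:])
    if others > 0:
        top["Others"] = others
    return top
```

The returned dictionary holds at most eleven entries, one per bar segment.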

Path: ./result/Barplot

Phylum

Class

Order

Family

Genus

Species

Heatmap

The heatmap is drawn from the taxonomy annotation and the relative abundance of all samples (groups) at each level. Relative abundances are converted by Z-score normalization before plotting.
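The Z-score conversion can be sketched as follows. The report does not state the direction of scaling or the standard deviation used, so this example assumes each taxon is scaled across samples (row-wise) with the sample standard deviation.

```python
from statistics import mean, stdev


def zscore_rows(table):
    """Z-score each taxon across samples.

    table: {taxon: [relative abundance per sample]} (hypothetical layout).
    Taxa with zero variance are returned as all zeros.
    """
    scaled = {}
    for taxon, values in table.items():
        mu = mean(values)
        sd = stdev(values) if len(values) > 1 else 0.0
        scaled[taxon] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled
```

After this conversion, the heatmap colors show how far each sample lies above or below a taxon's own mean, so abundant and rare taxa share one color scale.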

Path: ./result/Heatmap

Phylum

Class

Order

Family

Genus

Species

6. Alpha Diversity

Alpha diversity analysis evaluates the diversity of the microbial community within a sample or group. For a single sample, it reflects both the overall microbial diversity (Chao index) and the diversity of the main bacteria (Shannon index, Gini-Simpson index). The Chao index disregards relative abundance: each bacterium is coded as present (1) or absent (0), so the index evaluates the number of bacteria in the community. The Shannon and Gini-Simpson indices weight bacteria by relative abundance, so bacteria with low relative abundance contribute little and the indices mainly evaluate the number of bacteria with high relative abundance.
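The indices named above can be computed from a vector of OTU counts as in the sketch below. Two choices are assumptions, since the report does not specify them: the Chao index is taken to be the bias-corrected Chao1 estimator, which adds a correction based on singletons and doubletons to the observed count, and the Shannon index uses the natural logarithm.

```python
import math


def observed_richness(counts):
    """Number of OTUs present (count > 0)."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return observed_richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over OTUs that are present."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def gini_simpson(counts):
    """Gini-Simpson index 1 - sum(p ** 2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)
```

For a perfectly even community of four OTUs, the Shannon index equals ln 4 and the Gini-Simpson index equals 0.75. Both fall as a few OTUs come to dominate, while the richness estimate stays the same.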

Path: ./result/ALPHA

Index

Phylum

Class

Order

Family

Genus

Species

Rarefaction curve

A rarefaction curve is built by randomly selecting a given number of individuals (sequences) from a sample, counting the number of species (OTUs) these individuals represent, and plotting the number of species against the number of individuals drawn. It can be used to compare species richness among samples with different amounts of sequencing data, and to show whether the amount of sequencing data for a sample is reasonable. When the curve flattens, the amount of sequencing data is reasonable, and more data would yield only a small number of new OTUs. If the curve is still rising, continued sequencing may yield more new OTUs.
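The random-sampling idea can be sketched in Python as below. This is a single random draw order for one sample; averaging over repeated draws is common in practice, but the report does not say whether that was done. The function name, the `step` parameter, and the input layout are hypothetical.

```python
import random


def rarefaction_curve(otu_counts, step=100, seed=0):
    """Return [(reads drawn, OTUs observed), ...] for one random read order.

    otu_counts: {OTU id: read count} for one sample (hypothetical layout).
    """
    rng = random.Random(seed)
    # Expand the table into one label per read, then shuffle to simulate random sampling.
    reads = [otu for otu, count in otu_counts.items() for _ in range(count)]
    rng.shuffle(reads)
    curve, seen = [], set()
    for i, otu in enumerate(reads, start=1):
        seen.add(otu)
        # Record a point every `step` reads and at the final read.
        if i % step == 0 or i == len(reads):
            curve.append((i, len(seen)))
    return curve
```

Plotting the second value against the first gives the curve, and a plateau in the OTU count at higher depths corresponds to the flattening described above.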

7. Beta Diversity

Beta diversity (the distance matrix between samples) compares the composition of microbial communities between samples. A larger value indicates a greater difference in species distribution between samples. Based on the taxonomy annotation and OTU abundance information, OTUs with the same classification are merged to obtain an abundance table. The phylogenetic relationships and relative abundances of the microbial sequences in each sample are then used to calculate the distance matrix between samples or groups. Principal Coordinates Analysis (PCoA) is used to reduce the dimensionality of the distance matrix so that samples and groups can be compared.
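To illustrate what a between-sample distance matrix looks like, the sketch below uses Bray-Curtis dissimilarity. This is a deliberately simpler stand-in: the report describes a distance that also uses phylogenetic relationships, which Bray-Curtis ignores, and the PCoA step is omitted. Function names and the input layout are hypothetical.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def distance_matrix(samples):
    """samples: {sample name: abundance vector} -> (names, symmetric nested-list matrix)."""
    names = list(samples)
    matrix = [[bray_curtis(samples[a], samples[b]) for b in names] for a in names]
    return names, matrix
```

A value of 0 means two samples have identical composition and a value of 1 means they share no taxa, matching the reading that a larger value indicates a greater difference.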

Path: ./result/BETA

8. Differential analysis

Significant difference analysis

The Kruskal-Wallis test (also known as the H test), a nonparametric method, is used to determine whether the medians of two or more groups differ. If the chi-square value of the overall test is statistically significant, the null hypothesis is rejected, meaning that at least one pair of groups differs. Identifying which pairs differ requires a post hoc comparison. Each group should contain at least 5 samples for the Kruskal-Wallis test.
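A minimal sketch of the H statistic, using only the Python standard library, is given below. The function names are hypothetical. The p value helper covers only the two-group case, where the chi-square approximation has 1 degree of freedom, and that approximation is rough for small groups.

```python
import math


def kruskal_wallis_h(*groups):
    """Tie-corrected Kruskal-Wallis H statistic for two or more groups."""
    values = sorted(v for g in groups for v in g)
    n = len(values)
    rank_of, tie_term = {}, 0
    i = 0
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        # Tied values share the mean of ranks i+1 .. j.
        rank_of[values[i]] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sums = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sums - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def p_value_two_groups(h):
    """Chi-square survival function with 1 degree of freedom (two groups only)."""
    return math.erfc(math.sqrt(max(h, 0.0) / 2))
```

With two completely separated groups of three samples, for example `kruskal_wallis_h([1, 2, 3], [4, 5, 6])`, H is 27/7 and the approximate p value falls just below 0.05. This is the smallest p value such a design can reach.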

In this analysis, three conditions were used for significance screening: (a) H test p value < 0.05; (b) fold change greater than 2; (c) average relative abundance above 0.5% in at least one of the Case and Control groups.
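The three screening conditions can be combined as in the sketch below. The function name and arguments are hypothetical. The fold change is assumed to count in either direction (higher in Case or higher in Control), and a taxon absent from one group is assumed to pass the fold-change condition.

```python
def is_significant(p_value, mean_case, mean_control,
                   alpha=0.05, min_fold=2.0, min_abundance=0.005):
    """Apply conditions (a), (b), and (c) to one taxon.

    Abundances are fractions, so 0.005 corresponds to 0.5%.
    """
    # (a) H test p value below 0.05.
    if p_value >= alpha:
        return False
    # (c) At least one group exceeds 0.5% average relative abundance.
    if max(mean_case, mean_control) <= min_abundance:
        return False
    # (b) Fold change greater than 2, in either direction (assumed).
    low, high = sorted((mean_case, mean_control))
    if low == 0:
        return high > 0  # unbounded fold change (assumed handling)
    return high / low > min_fold
```

For example, a taxon with p = 0.01 and group means of 4% and 1% passes all three conditions, while one with means of 3% and 2% fails on fold change.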

Path: ./result/Diff

Phylum

Class

Order

Family

Genus

Species

LEfSe analysis

LDA Effect Size (LEfSe) is an algorithm for discovering biomarkers and features in high-dimensional data. First, the Kruskal-Wallis (KW) rank-sum test is used to test for differences in the abundance of each species between the two groups. Linear Discriminant Analysis (LDA) is then used to evaluate how well each differential species discriminates between the groups.

Cutoff: Kruskal-Wallis test p value < 0.05 and |LDA score| > 2

Path: ./result/LEfSe

Summarize_table

Barplot

Cladogram

9. Advanced analysis

Tax4fun_pathway

The KEGG (Kyoto Encyclopedia of Genes and Genomes) database was launched in 1995 and has since developed into one of the most representative metabolic pathway databases in biology. The database divides pathways into six categories: Cellular Processes, Environmental Information Processing, Genetic Information Processing, Human Diseases, Metabolism, and Organismal Systems.

This analysis uses the 16S sequencing data, the SILVA database, and the Tax4Fun tool to predict KEGG metabolic pathway functions. The Kruskal-Wallis two-sample test provides the p value, and a 1.5-fold change in pathway activity is used as the screening criterion.

Path: ./result/PATHWAY