Nanopore 16S full-length metagenomics report
I、Project information
II、Workflow
The analysis workflow began by basecalling the raw pod5 files and demultiplexing the resulting raw fastq files with dorado_basecall_server v7.1.4. Next, reads were filtered using nanoq, discarding reads with a Q score < 10 or a length below 1,300 base pairs (bp) or above 1,950 bp. Sequencing summaries and statistics were calculated and visualized with NanoPlot. Each filtered read was then transformed into a normalized 5-mer frequency vector. The vector set was reduced in dimensionality with Uniform Manifold Approximation and Projection (UMAP) and clustered with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), and each read was given a cluster assignment. For each cluster, Canu was used for read correction, FastANI for draft selection, and Racon and Medaka for polishing. Finally, the representative sequences selected from each cluster were classified using BLAST against the SILVA 138 ribosomal RNA database to generate the OTU table.
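The read filter and the 5-mer transformation described above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the helper names (mean_qscore, passes_filter, kmer_vector) are hypothetical, the mean Q score is assumed to be derived from the average per-base error probability, and "normalized" is assumed to mean that the 5-mer counts are divided by their total.

```python
import math
from itertools import product

K = 5
# All 4**5 = 1024 DNA 5-mers in a fixed order, so every read maps to the same vector layout.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def mean_qscore(phred_scores):
    """Mean read quality from the average error probability (assumed convention)."""
    probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(seq, phred_scores, min_q=10, min_len=1300, max_len=1950):
    """Keep reads with mean Q >= 10 and a length between 1,300 and 1,950 bp."""
    if not (min_len <= len(seq) <= max_len):
        return False
    return mean_qscore(phred_scores) >= min_q


def kmer_vector(seq):
    """Return the 1024-element 5-mer frequency vector of a read (sums to 1)."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:  # windows containing N or other ambiguous bases are skipped
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]
```

Each read that passes the filter yields one vector of this kind; the set of vectors is what would then be handed to the UMAP and HDBSCAN steps.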
III、Analysis results
1. RAW QC report
Path: ./result/QC/NanoPlot_QC_raw
Sample1
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample2
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample3
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample4
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample5
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample6
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
2. Filter QC report
Path: ./result/QC/NanoPlot_QC_filter
Sample1
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample2
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample3
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample4
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample5
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
Sample6
Summary
LengthvsQualityScatterPlot_dot
HistogramReadlength
3. Analysis report
Path: ./result/Analysis_report.txt
Filter: # of sequencing reads that passed the quality threshold and the read-length range
Cluster: # of sequencing reads assigned to a cluster
Represent.OTU: # of clusters
Mapped.OTU: # of clusters mapped to the SILVA database
Number.of.Tax: # of taxa
4. Group information
groupA: Sample1, Sample2, Sample3
groupB: Sample4, Sample5, Sample6
5. Taxonomy results
OTU table
Path: ./result/DEMO-OTU.txt
Barplot
Based on the taxonomy annotation, the 10 bacteria with the highest relative abundance at each level (Phylum, Class, Order, Family, Genus, Species) are selected for each sample or group and plotted as a bar plot of relative abundance. Bacteria with lower relative abundance are merged into “Others”.
Path: ./result/Barplot
Phylum
Class
Order
Family
Genus
Species
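The top-10 selection behind the bar plots can be sketched as follows. This is an illustration under stated assumptions, not the report's plotting code: the function name top_taxa_with_others and the example taxon names and values are made up, and relative abundances are assumed to be given as fractions for one sample at one taxonomic level.

```python
def top_taxa_with_others(rel_abund, n=10):
    """Keep the n most abundant taxa and merge all remaining taxa into 'Others'."""
    ranked = sorted(rel_abund.items(), key=lambda kv: kv[1], reverse=True)
    top = dict(ranked[:n])
    rest = sum(value for _, value in ranked[n:])
    if rest > 0:
        top["Others"] = rest
    return top


# Made-up example: with n=2, TaxonC and TaxonD are merged into "Others".
example = {"TaxonA": 0.50, "TaxonB": 0.30, "TaxonC": 0.15, "TaxonD": 0.05}
print(top_taxa_with_others(example, n=2))
```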
Heatmap
The heatmap is drawn from the taxonomy annotation and the relative abundance of all samples (groups) at each level. The relative abundance values are converted by Z-score normalization before plotting.
Path: ./result/Heatmap
Phylum
Class
Order
Family
Genus
Species
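The Z-score conversion applied to the heatmap values can be sketched as follows. The report does not state along which axis the normalization is applied or which standard deviation is used, so this sketch assumes one Z-score per taxon across samples with the population standard deviation; the function name zscore_rows is hypothetical.

```python
from statistics import mean, pstdev


def zscore_rows(table):
    """table maps taxon -> list of relative abundances (one per sample).

    Returns the same layout with each taxon's values centered on its mean
    and divided by its population standard deviation.
    """
    scaled = {}
    for taxon, values in table.items():
        mu = mean(values)
        sigma = pstdev(values)
        # A taxon with identical abundance in every sample has no spread; use 0.0.
        scaled[taxon] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return scaled
```

After this conversion each taxon has mean 0 across samples, so the heatmap colors show in which samples a taxon is above or below its own average rather than its absolute abundance.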
6. Alpha Diversity
Alpha diversity analysis evaluates the diversity of the microbial community within a sample or group. For a single sample, it reflects both the overall microbial diversity (Chao index) and the diversity of the main bacteria (Shannon index, Gini-Simpson index). The Chao index disregards relative abundance: bacteria are coded only as present (1) or absent (0), so the index evaluates the number of bacteria. The diversity of the main bacteria is weighted by relative abundance: bacteria with low relative abundance contribute little to the measurement, so these indices mainly evaluate the number of bacteria with high relative abundance.
Path: ./result/ALPHA
Index
Phylum
Class
Order
Family
Genus
Species
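The three indices named above can be sketched from a list of per-taxon read counts as follows. The report does not state which formulas its pipeline uses, so the natural-log Shannon index and the bias-corrected Chao1 estimator shown here are assumptions; the function names are hypothetical.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def gini_simpson(counts):
    """Gini-Simpson index 1 - sum(p^2): chance that two random reads differ in taxon."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Bias-corrected Chao1: observed taxa plus a term from singletons and doubletons."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

Chao1 starts from the presence/absence count of taxa, while Shannon and Gini-Simpson depend on the proportions, which is why rare taxa move them very little.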
Rarefaction curve
A rarefaction curve is built by randomly selecting a given number of individuals from a sample, counting the number of species those individuals represent, and plotting the number of species against the number of individuals. It can be used to compare species richness among samples with different amounts of sequencing data, and to show whether the amount of sequencing data for a sample is sufficient. Here, sequences are randomly sampled, and the rarefaction curve is constructed from the number of sequences drawn and the number of OTUs they represent. When the curve flattens, the amount of sequencing data is sufficient, and more data would yield only a small number of new OTUs. If the curve is still rising, continued sequencing may yield more new OTUs.
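The random-sampling procedure described above can be sketched as follows. This is a simplified illustration that follows a single random ordering of the reads; a real analysis would typically average many repetitions. The function name rarefaction_curve, the step size, and the seed are assumptions made for the example.

```python
import random


def rarefaction_curve(otu_counts, step, seed=0):
    """Return (sequences drawn, OTUs observed) points for one random ordering.

    otu_counts maps OTU id -> number of reads assigned to that OTU.
    Reads are drawn without replacement; a trailing partial step is not reported.
    """
    rng = random.Random(seed)
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    rng.shuffle(reads)
    curve = []
    seen = set()
    previous = 0
    for depth in range(step, len(reads) + 1, step):
        seen.update(reads[previous:depth])  # add only the newly drawn reads
        previous = depth
        curve.append((depth, len(seen)))
    return curve
```

A curve whose OTU count stops increasing over the last few points corresponds to the flattened case described in the text.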
7. Beta Diversity
Beta diversity (the distance matrix between samples) compares the composition of microbial communities between different samples. A larger value indicates a greater difference in species distribution between samples. Based on the taxonomy annotation and OTU abundance information, OTUs with the same classification are merged to obtain an abundance table. The phylogenetic relationships and relative abundances of the microbial sequences in each sample are then used to calculate the distance matrix between samples or groups. Principal Coordinates Analysis (PCoA) is used to project the samples or groups into a low-dimensional space.
Path: ./result/BETA
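A between-sample distance matrix can be sketched as follows. Note the swap: the report describes a distance that also uses phylogenetic relationships, which requires a tree; this sketch substitutes the abundance-only Bray-Curtis dissimilarity purely to show the shape of the computation. The function names and the input layout (sample name mapped to an abundance vector in a shared taxon order) are assumptions.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i), from 0 to 1."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def distance_matrix(samples):
    """samples maps sample name -> abundance vector; returns a nested dict of distances."""
    names = list(samples)
    return {
        a: {b: bray_curtis(samples[a], samples[b]) for b in names}
        for a in names
    }
```

A matrix of this form, symmetric with zeros on the diagonal, is the input that a PCoA step would take.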
8. Differential analysis
Significant difference analysis
The Kruskal-Wallis test (also known as the H test), a nonparametric method, is used to determine whether the medians of two or more groups differ. If the chi-square value of the overall test is statistically significant, the null hypothesis is rejected, which means that at least one pair of groups differs in average level. Identifying which pairs differ requires a post hoc comparison. The KW test requires at least 5 samples in each group.
In this analysis, three conditions were used for significance screening: (a) H test p value < 0.05; (b) fold change greater than 2; (c) average relative abundance above 0.5% in at least one group (case or control).
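The H statistic can be sketched as follows, using average ranks and the usual tie correction. This is a textbook-style illustration rather than the report's implementation; the function names are hypothetical, and the p value helper covers only the two-group case, where H is compared with a chi-square distribution with 1 degree of freedom.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of groups of observations."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_sizes = {}
    for x in pooled:
        tie_sizes[x] = tie_sizes.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def p_value_two_groups(h):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(h / 2))
```

For example, kruskal_wallis_h([[1, 2, 3], [4, 5, 6]]) evaluates the statistic for two fully separated groups of three samples each.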
Path: ./result/Diff
Phylum
Class
Order
Family
Genus
Species
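The three screening conditions (a) to (c) can be sketched as a single check, as follows. The function name passes_screen and its parameters are hypothetical; relative abundances are assumed to be fractions (0.5% written as 0.005), and the fold change is assumed to be taken in either direction (the larger group mean divided by the smaller).

```python
def passes_screen(p_value, mean_a, mean_b, alpha=0.05, min_fold=2.0, min_abund=0.005):
    """Return True if a taxon meets all three screening conditions.

    mean_a and mean_b are the average relative abundances of the two groups.
    """
    # (a) H test p value below 0.05
    if p_value >= alpha:
        return False
    # (c) at least one group has an average relative abundance above 0.5%
    if max(mean_a, mean_b) <= min_abund:
        return False
    # (b) fold change greater than 2, in either direction
    low, high = sorted((mean_a, mean_b))
    if low == 0:
        # Undefined ratio; treating it as passing is an assumption of this sketch.
        return high > 0
    return high / low > min_fold
```

For instance, a taxon with p = 0.01 and group means of 2% and 0.5% meets all three conditions, while the same taxon with p = 0.2 does not.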
LEfSe analysis
LDA Effect Size (LEfSe) is an algorithm for discovering biomarkers and features in high-dimensional data. First, the Kruskal-Wallis (KW) rank-sum test is used to test for differences in the abundance of each species between the two groups. Then Linear Discriminant Analysis (LDA) is used to evaluate how well each differentially abundant species discriminates between the groups.
Cutoff: features are retained when the Kruskal-Wallis test p value < 0.05 and |LDA score| > 2
Path: ./result/LEfSe
Summarize_table
Barplot
Cladogram
9. Advanced analysis
Tax4fun_pathway
The KEGG (Kyoto Encyclopedia of Genes and Genomes) database was launched in 1995 and has since developed into one of the most representative metabolic pathway databases in biology. The database divides pathways into six categories: Cellular Processes, Environmental Information Processing, Genetic Information Processing, Human Diseases, Metabolism, and Organismal Systems.
This analysis uses the 16S sequencing data, the SILVA database, and the Tax4Fun tool to predict KEGG metabolic pathway functions. The P value is calculated with the Kruskal-Wallis two-sample test, and a 1.5-fold change in pathway activity is used as the screening criterion.
Path: ./result/PATHWAY