Find outliers
Introduction
Outliers are data files that differ significantly from the rest of the dataset. In MS experiments, outliers can arise due to various factors such as instrument error, sample preparation issues, or biological variation. Identifying and removing outliers before downstream analysis is crucial to avoid misleading results.
masscube evaluates the analytical sequence and reports problematic samples in an unsupervised manner. It assesses the quality of the raw data by analyzing the total peak height of all detected features. Files with a Z-score lower than -2 (by default) are recognized as outliers, ensuring that only high-quality data are included in the downstream analysis.
How to use
Step 1. Organize the data
Put all processed raw data files in a folder. The data should be organized in the following structure:
my_project
├── single_files
│ ├── sample1.txt
│ ├── sample2.txt
| └── ...
└── sample_table.csv
Note: Please provide a sample table to specify the blank samples so that they can be excluded from the outlier detection.
Step 2. Run the outlier detection
In the data folder, open a terminal and run the following command:
find-outliers
Output
After the processing, you will find the following files and folders in the data folder:
my_project
├── single_files
│ ├── sample1.txt
│ ├── sample2.txt
| └── ...
|── sample_table.csv
└── problematic_files.txt
problematic_samples.txt
: a text file containing the names of the problematic samples.single_files
: a folder containing the feature detection results for each sample.
Explanation of the workflow
|
|
The run_evaluation
function evaluates the run and reports the problematic files. It checks the quality of the raw data and identifies the problematic samples based on the number of detected features.
The function generates a problematic_samples.txt
file that lists the names of the problematic samples. Users can further investigate the outliers and decide whether to remove them before downstream analysis.