params
Overview
The params module defines a class Params that stores and manages parameters for mass spectrometry-based untargeted metabolomics data processing. It also exposes a helper function find_ms_info and two dictionaries of parameter ranges and defaults.
This documentation is synchronized with the current implementation in params.py and reflects all attributes, methods, behaviors, defaults, and edge cases present in the code.
Classes
Params
A configuration container for project-level and file-level processing parameters, including project setup, raw data reading/cleaning, feature detection, grouping, alignment, annotation, normalization, statistics, visualization, and output controls.
Attributes
Project & Metadata
- sample_metadata (pandas.DataFrame | None): sample table held in memory.
- project_dir (str | None): project root directory.
- sample_dir (str | None): directory for raw MS data; set during workflow prep.
- single_file_dir (str | None): outputs for single-file processing.
- tmp_file_dir (str | None): temporary/intermediate files.
- ms2_matching_dir (str | None): MS/MS matching outputs.
- bpc_dir (str | None): base peak chromatogram outputs.
- project_file_dir (str | None): auxiliary project files (sample table with time, etc.).
- normalization_dir (str | None): normalization results.
- statistics_dir (str | None): statistical analysis results.
- problematic_files (dict): mapping of problematic files, {file_name: error_message}.
Raw Data Reading & Cleaning
- file_name (str | None): file name of the raw data.
- file_path (str | None): absolute path of the raw data.
- ion_mode (str): "positive" (default) or "negative".
- ms_type (str | None): "orbitrap", "qtof", "tripletof", or "others".
- is_centroid (bool): whether data is centroided (True by default).
- file_format (str | None): lower-case file type ("mzml", "mzxml", "mzjson", or "mzjson.gz").
- scan_time_unit (str): "minute" (default) or "second".
- mz_lower_limit (float): lower m/z bound (default 0.0).
- mz_upper_limit (float): upper m/z bound (default 100000.0).
- rt_lower_limit (float): lower RT bound in minutes (default 0.0).
- rt_upper_limit (float): upper RT bound in minutes (default 10000.0).
- scan_levels (list[int]): scan levels to read (default [1, 2]).
- centroid_mz_tol (float | None): m/z tolerance for centroiding (0.005 by default; set to None to disable centroiding).
- ms1_abs_int_tol (float): MS1 absolute intensity threshold (recommended: 30000 for Orbitrap, 1000 for QTOF).
- ms2_abs_int_tol (float): MS2 absolute intensity threshold (recommended: 10000 for Orbitrap, 500 for QTOF).
- ms2_rel_int_tol (float): MS2 intensity threshold relative to the base peak (default 0.01).
- precursor_mz_offset (float): m/z offset for defining MS2 range (default 2.0).
Feature Detection
- mz_tol_ms1 (float): MS1 m/z tolerance (default 0.01).
- mz_tol_ms2 (float): MS2 m/z tolerance (default 0.015).
- feature_gap_tol (int): number of consecutive scans without signal tolerated inside a feature (default 10 in Params; see the note below about PARAMETER_DEFAULT).
- batch_size (int): parallel processing batch size (default 100).
- percent_cpu_to_use (float): fraction of CPU to use (default 0.8).
Feature Grouping
- group_features_single_file (bool): group features within a single file (default False).
- scan_scan_cor_tol (float): scan-to-scan correlation threshold (default 0.7).
- mz_tol_feature_grouping (float): m/z tolerance for grouping (default 0.015).
- rt_tol_feature_grouping (float): RT tolerance for grouping (default 0.1).
- valid_charge_states (list[int]): allowed charge states (default [1]).
Feature Alignment
- mz_tol_alignment (float): m/z tolerance for alignment (default 0.01).
- rt_tol_alignment (float): RT tolerance for alignment (default 0.2).
- rt_tol_rt_correction (float): expected max RT shift for RT correction (default 0.5 min).
- correct_rt (bool): perform RT correction (default True).
- scan_number_cutoff (int): minimum number of non-zero scans for a feature to be aligned (default 5).
- detection_rate_cutoff (float): required detection rate across QC+samples (default 0.1).
- merge_features (bool): merge near-duplicate features (default True).
- mz_tol_merge_features (float): m/z tolerance for merging (default 0.01).
- rt_tol_merge_features (float): RT tolerance for merging (default 0.02).
- group_features_after_alignment (bool): group after alignment (default True).
- fill_gaps (bool): fill gaps in aligned features (default True).
- gap_filling_method (str): method used in gap filling (default "local_maximum").
- gap_filling_rt_window (float): RT window for finding local maxima (default 0.05 min).
- isotope_rel_int_limit (float): isotope intensity upper limit relative to base peak (default 1.5).
Feature Annotation
- ms2_library_path (str | None): path to MS2 library (.msp or .pickle); set to None if the path does not exist.
- fuzzy_search (bool): enable fuzzy search (default False).
- consider_rt (bool): consider RT in MS2 matching (default False).
- rt_tol_annotation (float): RT tolerance for annotation (default 0.2).
- ms2_sim_tol (float): MS2 similarity threshold (default 0.7).
- spectral_similarity_method (str): similarity method (default "unweighted_entropy").
Normalization
- sample_normalization (bool): sample-wise normalization by total amount/concentration (default False).
- sample_norm_method (str): method for sample normalization (default "pqn").
- signal_normalization (bool): feature-wise drift correction (default False).
- signal_norm_method (str): drift correction method (default "lowess").
Statistics
- run_statistics (bool): run statistical analysis (default False).
Visualization
- plot_bpc (bool): plot BPC chromatograms (default False).
- plot_ms2 (bool): plot MS2 mirror plots (default False).
- plot_normalization (bool): plot normalization results (default False).
Classifier Building
- by_group_name (str | None): group name for classifier training (if used).
Output
- output_single_file (bool): export processed single-file outputs (default False; set to True during workflow prep).
- output_ms1_scans (bool): export all MS1 scans to pickle for fast reloading (default False; set to True during workflow prep).
- output_aligned_file (bool): export aligned features (default False; set to True during workflow prep).
- quant_method (str): "peak_height" (default), "peak_area", or "top_average".
Methods
read_parameters_from_csv(path)
Reads a CSV of key-value pairs and sets attributes. Values convertible to float are cast; otherwise, "true"/"yes" → True and "false"/"no" → False. Calls check_parameters() afterward.
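The conversion rule described above can be sketched in plain Python. This is a minimal illustration, not the code in params.py: the helper names parse_value and read_key_value_csv are hypothetical, the two-column, header-less CSV layout is an assumption, and leaving unrecognized strings unchanged is also an assumption.

```python
import csv
import io


def parse_value(raw):
    """Convert one CSV cell to float or bool, or leave it as a string."""
    text = raw.strip()
    try:
        return float(text)              # numeric values are cast first
    except ValueError:
        pass
    lowered = text.lower()
    if lowered in ("true", "yes"):
        return True
    if lowered in ("false", "no"):
        return False
    return text                         # assumption: other strings stay as-is


def read_key_value_csv(handle):
    """Read rows of the form key,value into a dict (assumed layout)."""
    params = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            continue                    # skip blank or malformed rows
        params[row[0].strip()] = parse_value(row[1])
    return params


example = io.StringIO("mz_tol_ms1,0.005\ncorrect_rt,yes\nion_mode,negative\nbatch_size,100\n")
parsed = read_key_value_csv(example)
```

Under this scheme an integer-looking value such as 100 comes back as the float 100.0, which is consistent with check_parameters() casting batch_size to int afterward.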
read_sample_metadata(path)
Loads sample metadata from CSV. Lower-cases columns named "is_qc" and "is_blank"; converts "yes"/"no" strings to booleans (otherwise defaults both columns to False). Sorts rows so that QC samples appear first and blanks last; adds VALID and ABSOLUTE_PATH columns; stores the result in sample_metadata.
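The ordering rule (QC first, blanks last) can be illustrated without pandas. The sketch below uses plain dictionaries as a stand-in for DataFrame rows; to_bool and order_samples are hypothetical names, and treating any value other than "yes" as False is an assumption about the fallback behavior.

```python
def to_bool(value):
    """Map a "yes"/"no" cell to a boolean; anything else becomes False (assumption)."""
    return str(value).strip().lower() == "yes"


def order_samples(rows):
    """Return copies of rows sorted with QC samples first and blanks last."""
    cleaned = []
    for row in rows:
        cleaned.append({
            **row,
            "is_qc": to_bool(row.get("is_qc", "no")),
            "is_blank": to_bool(row.get("is_blank", "no")),
        })
    # sorted() is stable, so rows keep their original relative order
    # inside each group (QC, regular sample, blank).
    return sorted(cleaned, key=lambda r: (not r["is_qc"], r["is_blank"]))


rows = [
    {"name": "s1", "is_qc": "no", "is_blank": "no"},
    {"name": "blank1", "is_qc": "no", "is_blank": "yes"},
    {"name": "qc1", "is_qc": "Yes", "is_blank": "no"},
    {"name": "s2"},                      # missing columns default to False
]
ordered = [r["name"] for r in order_samples(rows)]
```

The sort key is a tuple of booleans, so QC rows (key starting with False) sort ahead of everything else and blank rows (key ending with True) sort behind regular samples.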
_untargeted_metabolomics_workflow_preparation()
Prepares a project for the untargeted metabolomics workflow:
- Validates project_dir; derives standard subdirectories and creates them if missing.
- Ensures raw data exist in project_dir/data (currently auto-detects only .mzML and .mzXML).
- If sample_table.csv is missing, disables normalization/statistics and prints notices.
- If parameters.csv is missing, prints notices, infers (ms_type, ion_mode) from the first sample via find_ms_info, calls set_default, and enables plot_bpc.
- Loads the sample table if present; otherwise builds it from discovered file basenames. Validates presence of raw files, computes acquisition time via get_start_time, filters out invalid entries, sorts by time, adds a sequential analytical_order, and assigns batch IDs via label_batch_id. Writes project_files/sample_table_with_time.csv.
- Sets output toggles: output_single_file=True, output_ms1_scans=True, output_aligned_file=True.
set_default(ms_type, ion_mode)
For "orbitrap", sets ms1_abs_int_tol=30000 and ms2_abs_int_tol=10000; for all other instrument types, sets ms1_abs_int_tol=1000 and ms2_abs_int_tol=500. Also sets ion_mode.
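The selection logic is small enough to sketch directly. MiniParams below is a hypothetical stand-in that holds only the three fields involved; it is not the real Params class.

```python
def default_intensity_thresholds(ms_type):
    """Return (ms1_abs_int_tol, ms2_abs_int_tol) for an instrument type."""
    if ms_type == "orbitrap":
        return 30000, 10000
    return 1000, 500                    # every other instrument type


class MiniParams:
    """Hypothetical stand-in holding only the fields touched by set_default."""

    def __init__(self):
        self.ion_mode = "positive"
        self.ms1_abs_int_tol = None
        self.ms2_abs_int_tol = None

    def set_default(self, ms_type, ion_mode):
        self.ms1_abs_int_tol, self.ms2_abs_int_tol = default_intensity_thresholds(ms_type)
        self.ion_mode = ion_mode


p = MiniParams()
p.set_default("qtof", "negative")
```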
check_parameters()
Validates all numeric parameters against PARAMETER_RAGES. If a value is out of range, it prints a warning and resets that parameter to the value in PARAMETER_DEFAULT. If the path in ms2_library_path does not exist, sets ms2_library_path to None. Casts batch_size to int.
Note: The code refers to PARAMETER_RAGES throughout (typo intentional to match the implementation).
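A minimal sketch of the range check and reset follows, using two entries copied from the dictionaries documented on this page. The function name check_ranges is hypothetical, SimpleNamespace stands in for a Params instance, and treating the bounds as inclusive is an assumption.

```python
from types import SimpleNamespace

# Two entries taken from the module-level dictionaries documented on this page.
PARAMETER_RAGES = {"mz_tol_ms1": (0.0, 0.02), "feature_gap_tol": (0, 100)}
PARAMETER_DEFAULT = {"mz_tol_ms1": 0.01, "feature_gap_tol": 30}


def check_ranges(params):
    """Reset out-of-range attributes in place; return the names that were reset."""
    reset = []
    for name, (low, high) in PARAMETER_RAGES.items():
        value = getattr(params, name)
        if not (low <= value <= high):  # inclusive bounds are an assumption
            print(f"Warning: {name}={value} is outside [{low}, {high}]; using default.")
            setattr(params, name, PARAMETER_DEFAULT[name])
            reset.append(name)
    params.batch_size = int(params.batch_size)
    return reset


params = SimpleNamespace(mz_tol_ms1=0.5, feature_gap_tol=10, batch_size=100.0)
reset_names = check_ranges(params)
```

Only mz_tol_ms1 is reset in this run; feature_gap_tol stays at 10 because it is within range, so a reset to the dictionary default happens only on a range violation.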
output_parameters(path, format="json")
Exports all parameters (except project_dir) to JSON, including "MassCube_version" obtained from importlib.metadata.version("masscube"). Only "json" is supported.
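The export can be sketched as follows. The function name export_parameters, the plain dict input, and the example path are hypothetical; the "unknown" fallback for environments where masscube is not installed is an assumption that the documented method does not describe.

```python
import json
from importlib.metadata import PackageNotFoundError, version


def export_parameters(params):
    """Serialize a parameter dict to JSON text, leaving out project_dir."""
    payload = {k: v for k, v in params.items() if k != "project_dir"}
    try:
        payload["MassCube_version"] = version("masscube")
    except PackageNotFoundError:
        # Assumption: fall back to a placeholder when masscube is not installed.
        payload["MassCube_version"] = "unknown"
    return json.dumps(payload, indent=4)


text = export_parameters(
    {"project_dir": "/tmp/project", "ion_mode": "positive", "mz_tol_ms1": 0.01}
)
exported = json.loads(text)
```

Dropping project_dir keeps machine-specific paths out of the exported file, which makes the JSON easier to share between machines.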
_check_raw_files_in_data_dir()
Cross-references basenames from sample_metadata with files in sample_dir (currently only .mzML/.mzXML), sets VALID, and populates ABSOLUTE_PATH accordingly.
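The cross-referencing step amounts to a lookup of each sample name among the basenames of the recognized raw files. The sketch below uses plain lists and dictionaries instead of a DataFrame; mark_valid_files and the example paths are hypothetical, and using None as the path for missing files is an assumption.

```python
from pathlib import PurePosixPath


def mark_valid_files(sample_names, raw_paths):
    """Return one row per sample with VALID and ABSOLUTE_PATH filled in."""
    by_stem = {}
    for raw in raw_paths:
        path = PurePosixPath(raw)
        if path.suffix.lower() in (".mzml", ".mzxml"):
            by_stem[path.stem] = raw    # index recognized files by basename
    rows = []
    for name in sample_names:
        match = by_stem.get(name)
        rows.append({"name": name, "VALID": match is not None, "ABSOLUTE_PATH": match})
    return rows


checked = mark_valid_files(
    ["qc_01", "sample_01", "sample_02"],
    ["/proj/data/qc_01.mzML", "/proj/data/sample_01.mzXML", "/proj/data/sample_02.raw"],
)
```

Here sample_02 is marked invalid because its only file has an unrecognized extension.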
Functions
find_ms_info(file_name)
Reads up to the first 200 lines of an .mzML or .mzXML file (lower-cased text) to infer:
- ms_type: "orbitrap" if the text contains "orbitrap"/"q exactive", "tripletof" if it contains "tripletof", else "qtof" if it contains "tof".
- ion_mode: "positive" or "negative" if mentioned.
- centroid: True if "centroid spectrum" or centroided="1" is present.
Returns (ms_type, ion_mode, centroid).
Constants
PARAMETER_RAGES: Valid Ranges
PARAMETER_RAGES = {
"mz_lower_limit": (0.0, 100000.0),
"mz_upper_limit": (0.0, 100000.0),
"rt_lower_limit": (0.0, 10000.0),
"rt_upper_limit": (0.0, 10000.0),
"centroid_mz_tol": (0.0, 0.1),
"ms1_abs_int_tol": (0, 1e10),
"ms2_abs_int_tol": (0, 1e10),
"ms2_rel_int_tol": (0.0, 1.0),
"precursor_mz_offset": (0.0, 100000.0),
"mz_tol_ms1": (0.0, 0.02),
"mz_tol_ms2": (0.0, 0.02),
"feature_gap_tol": (0, 100),
"scan_scan_cor_tol": (0.0, 1.0),
"mz_tol_alignment": (0.0, 0.02),
"rt_tol_alignment": (0.0, 2.0),
"scan_number_cutoff": (0, 100),
"detection_rate_cutoff": (0.0, 1.0),
"mz_tol_merge_features": (0.0, 0.02),
"rt_tol_merge_features": (0.0, 0.5),
"ms2_sim_tol": (0.0, 1.0)
}
PARAMETER_DEFAULT: Defaults Used on Reset
PARAMETER_DEFAULT = {
"mz_lower_limit": 0.0,
"mz_upper_limit": 100000.0,
"rt_lower_limit": 0.0,
"rt_upper_limit": 10000.0,
"centroid_mz_tol": 0.005,
"ms1_abs_int_tol": 1000.0,
"ms2_abs_int_tol": 500,
"ms2_rel_int_tol": 0.01,
"precursor_mz_offset": 2.0,
"mz_tol_ms1": 0.01,
"mz_tol_ms2": 0.015,
"feature_gap_tol": 30,
"scan_scan_cor_tol": 0.7,
"mz_tol_alignment": 0.01,
"rt_tol_alignment": 0.2,
"scan_number_cutoff": 5,
"detection_rate_cutoff": 0.1,
"mz_tol_merge_features": 0.01,
"rt_tol_merge_features": 0.05,
"ms2_sim_tol": 0.7
}
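Discrepancy note: In the Params initializer, feature_gap_tol defaults to 10, whereas PARAMETER_DEFAULT["feature_gap_tol"] is 30. If check_parameters() resets values (due to range violations), it will use the 30 from PARAMETER_DEFAULT. The values listed in this document show the same pattern for rt_tol_merge_features: the attribute default is 0.02, while PARAMETER_DEFAULT["rt_tol_merge_features"] is 0.05.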
Additional Notes & Gotchas
- Raw file discovery in workflow prep and path validation currently recognize only .mzML and .mzXML files, even though file_format supports "mzjson"/"mzjson.gz" in principle.
- group_features_after_alignment is True in the initializer (despite an inline comment that mentions False), and gap_filling_method is the string "local_maximum".
- If ms2_library_path points to a non-existent path, it is automatically set to None during parameter checking.
- Exported JSON via output_parameters() omits project_dir and includes a "MassCube_version" field.
- The code consistently uses the identifier PARAMETER_RAGES (RAGES rather than RANGES), and method docs reflect that spelling.
This documentation mirrors the current params.py implementation.