params
Overview
The params module defines a class Params that stores and manages parameters for mass spectrometry-based untargeted metabolomics data processing. It also exposes a helper function find_ms_info and two dictionaries of parameter ranges and defaults.
This documentation is synchronized with the current implementation in params.py and reflects all attributes, methods, behaviors, defaults, and edge cases present in the code.
Classes
Params
A configuration container for project-level and file-level processing parameters, including project setup, raw data reading/cleaning, feature detection, grouping, alignment, annotation, normalization, statistics, visualization, and output controls.
Attributes
Project & Metadata
- sample_metadata (pandas.DataFrame | None): sample table held in memory.
- project_dir (str | None): project root directory.
- sample_dir (str | None): directory for raw MS data; set during workflow prep.
- single_file_dir (str | None): outputs for single-file processing.
- tmp_file_dir (str | None): temporary/intermediate files.
- ms2_matching_dir (str | None): MS/MS matching outputs.
- bpc_dir (str | None): base peak chromatogram outputs.
- project_file_dir (str | None): auxiliary project files (sample table with time, etc.).
- normalization_dir (str | None): normalization results.
- statistics_dir (str | None): statistical analysis results.
- problematic_files (dict): mapping of problematic files, {file_name: error_message}.
Raw Data Reading & Cleaning
- file_name (str | None): file name of the raw data.
- file_path (str | None): absolute path of the raw data.
- ion_mode (str): "positive" (default) or "negative".
- ms_type (str | None): "orbitrap", "qtof", "tripletof", or "others".
- is_centroid (bool): whether the data is centroided (default True).
- file_format (str | None): lower-case file type ("mzml", "mzxml", "mzjson", or "mzjson.gz").
- scan_time_unit (str): "minute" (default) or "second".
- mz_lower_limit (float): lower m/z bound (default 0.0).
- mz_upper_limit (float): upper m/z bound (default 100000.0).
- rt_lower_limit (float): lower RT bound in minutes (default 0.0).
- rt_upper_limit (float): upper RT bound in minutes (default 10000.0).
- scan_levels (list[int]): scan levels to read (default [1, 2]).
- centroid_mz_tol (float | None): m/z tolerance for centroiding (default 0.005; set to None to disable centroiding).
- ms1_abs_int_tol (float): MS1 absolute intensity threshold (recommended: 30000 for Orbitrap, 1000 for QTOF).
- ms2_abs_int_tol (float): MS2 absolute intensity threshold (recommended: 10000 for Orbitrap, 500 for QTOF).
- ms2_rel_int_tol (float): MS2 intensity threshold relative to the base peak (default 0.01).
- precursor_mz_offset (float): m/z offset for defining the MS2 range (default 2.0).
Feature Detection
- mz_tol_ms1 (float): MS1 m/z tolerance (default 0.01).
- mz_tol_ms2 (float): MS2 m/z tolerance (default 0.015).
- feature_gap_tol (int): number of consecutive scans without signal tolerated inside a feature (default 10 in Params; see the note below about PARAMETER_DEFAULT).
- batch_size (int): parallel processing batch size (default 100).
- percent_cpu_to_use (float): fraction of CPU to use (default 0.8).
Feature Grouping
- group_features_single_file (bool): group features within a single file (default False).
- scan_scan_cor_tol (float): scan-to-scan correlation threshold (default 0.7).
- mz_tol_feature_grouping (float): m/z tolerance for grouping (default 0.015).
- rt_tol_feature_grouping (float): RT tolerance for grouping (default 0.1).
- valid_charge_states (list[int]): allowed charge states (default [1]).
Feature Alignment
- mz_tol_alignment (float): m/z tolerance for alignment (default 0.01).
- rt_tol_alignment (float): RT tolerance for alignment (default 0.2).
- rt_tol_rt_correction (float): expected maximum RT shift for RT correction (default 0.5 min).
- correct_rt (bool): perform RT correction (default True).
- scan_number_cutoff (int): minimum number of non-zero scans for a feature to be aligned (default 5).
- detection_rate_cutoff (float): required detection rate across QC and samples (default 0.1).
- merge_features (bool): merge near-duplicate features (default True).
- mz_tol_merge_features (float): m/z tolerance for merging (default 0.01).
- rt_tol_merge_features (float): RT tolerance for merging (default 0.02).
- group_features_after_alignment (bool): group features after alignment (default True).
- fill_gaps (bool): fill gaps in aligned features (default True).
- gap_filling_method (str): method used in gap filling (default "local_maximum").
- gap_filling_rt_window (float): RT window for finding local maxima (default 0.05 min).
- isotope_rel_int_limit (float): upper limit of isotope intensity relative to the base peak (default 1.5).
Feature Annotation
- ms2_library_path (str | None): path to the MS2 library (.msp or .pickle); set to None if the path does not exist.
- fuzzy_search (bool): enable fuzzy search (default False).
- consider_rt (bool): consider RT in MS2 matching (default False).
- rt_tol_annotation (float): RT tolerance for annotation (default 0.2).
- ms2_sim_tol (float): MS2 similarity threshold (default 0.7).
- spectral_similarity_method (str): similarity method (default "unweighted_entropy").
Normalization
- sample_normalization (bool): sample-wise normalization by total amount/concentration (default False).
- sample_norm_method (str): method for sample normalization (default "pqn").
- signal_normalization (bool): feature-wise drift correction (default False).
- signal_norm_method (str): drift correction method (default "lowess").
Statistics
- run_statistics (bool): run statistical analysis (default False).
Visualization
- plot_bpc (bool): plot BPC chromatograms (default False).
- plot_ms2 (bool): plot MS2 mirror plots (default False).
- plot_normalization (bool): plot normalization results (default False).
Classifier Building
- by_group_name (str | None): group name for classifier training (if used).
Output
- output_single_file (bool): export processed single-file outputs (default False; set to True during workflow prep).
- output_ms1_scans (bool): export all MS1 scans to pickle for fast reloading (default False; set to True during workflow prep).
- output_aligned_file (bool): export aligned features (default False; set to True during workflow prep).
- quant_method (str): "peak_height" (default), "peak_area", or "top_average".
Methods
read_parameters_from_csv(path)
Reads a CSV of key–value pairs and sets attributes. Values convertible to float are cast; otherwise, "true"/"yes" → True, "false"/"no" → False. Calls check_parameters() afterward.
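The value coercion described above can be sketched as follows. This is a minimal illustration, not the code in params.py: the helper name coerce_value is hypothetical, and three behaviors are assumptions (case-insensitive matching of "true"/"yes"/"false"/"no", keeping any other value as a string, and a header-less two-column CSV layout).

```python
import csv
import io


def coerce_value(raw: str):
    """Convert a CSV cell to float or bool where possible (hypothetical helper)."""
    text = raw.strip()
    try:
        return float(text)  # numeric values are cast to float first
    except ValueError:
        pass
    lowered = text.lower()  # assumption: matching is case-insensitive
    if lowered in ("true", "yes"):
        return True
    if lowered in ("false", "no"):
        return False
    return text  # assumption: anything else stays a string


# Assumed layout: one "key,value" pair per row, no header row.
sample_csv = "mz_tol_ms1,0.01\ncorrect_rt,yes\nquant_method,peak_area\n"
settings = {key: coerce_value(value) for key, value in csv.reader(io.StringIO(sample_csv))}
```

Trying the float cast before the boolean check mirrors the order given in the description.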
read_sample_metadata(path)
Loads sample metadata from CSV. Lower-cases columns named "is_qc" and "is_blank"; converts "yes"/"no" strings to booleans (otherwise defaults both columns to False). Sorts the rows so that QC samples come first and blanks come last; adds VALID and ABSOLUTE_PATH columns; stores the result in sample_metadata.
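The QC-first, blanks-last ordering can be illustrated without pandas. The sketch below works on plain dictionaries rather than the DataFrame that Params stores, and both helper names are hypothetical. Treating any value other than "yes" as False is an assumption.

```python
def yes_no_to_bool(value) -> bool:
    """Map "yes" to True; anything else becomes False (assumed fallback)."""
    return str(value).strip().lower() == "yes"


def order_samples(rows):
    """Return rows ordered QC first, regular samples next, blanks last."""
    def rank(row):
        if row["is_qc"]:
            return 0
        if row["is_blank"]:
            return 2
        return 1
    # sorted() is stable, so rows with the same rank keep their input order.
    return sorted(rows, key=rank)


rows = [
    {"name": "blank_1", "is_qc": "no", "is_blank": "yes"},
    {"name": "sample_1", "is_qc": "no", "is_blank": "no"},
    {"name": "qc_1", "is_qc": "yes", "is_blank": "no"},
]
for row in rows:
    row["is_qc"] = yes_no_to_bool(row.get("is_qc", "no"))
    row["is_blank"] = yes_no_to_bool(row.get("is_blank", "no"))

ordered_names = [row["name"] for row in order_samples(rows)]
```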
_untargeted_metabolomics_workflow_preparation()
Prepares a project for the untargeted metabolomics workflow:
- Validates project_dir; derives standard subdirectories and creates them if missing.
- Ensures raw data exist in project_dir/data (currently auto-detects only .mzML and .mzXML).
- If sample_table.csv is missing, disables normalization/statistics and prints notices.
- If parameters.csv is missing, prints notices, infers (ms_type, ion_mode) from the first sample via find_ms_info, calls set_default, and enables plot_bpc.
- Loads the sample table if present; otherwise builds it from discovered file basenames. Validates the presence of raw files, computes acquisition time via get_start_time, filters out invalid entries, sorts by time, adds a sequential analytical_order, and assigns batch IDs via label_batch_id. Writes project_files/sample_table_with_time.csv.
- Sets output toggles: output_single_file=True, output_ms1_scans=True, output_aligned_file=True.
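The filter, sort, and numbering steps of the sample-table preparation might look like the sketch below. It uses plain dictionaries instead of a DataFrame. The function name assign_analytical_order is hypothetical, starting the order at 1 is an assumption, and batch labelling via label_batch_id is not reproduced.

```python
from datetime import datetime


def assign_analytical_order(records):
    """Drop invalid records, sort by acquisition time, and number the rest."""
    valid = [record for record in records if record["VALID"]]
    valid.sort(key=lambda record: record["time"])
    # Assumption: analytical_order starts at 1; the input dicts are updated in place.
    for position, record in enumerate(valid, start=1):
        record["analytical_order"] = position
    return valid


# Illustrative records; the timestamps are made up for the example.
records = [
    {"name": "sample_b", "time": datetime(2024, 1, 1, 10, 0), "VALID": True},
    {"name": "missing", "time": None, "VALID": False},
    {"name": "sample_a", "time": datetime(2024, 1, 1, 9, 0), "VALID": True},
]
ordered = assign_analytical_order(records)
order_by_name = {record["name"]: record["analytical_order"] for record in ordered}
```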
set_default(ms_type, ion_mode)
For "orbitrap", sets ms1_abs_int_tol=30000, ms2_abs_int_tol=10000; for others, ms1_abs_int_tol=1000, ms2_abs_int_tol=500. Also sets ion_mode.
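As a rough sketch of this rule, the instrument-dependent thresholds can be expressed as a small lookup. The function names are hypothetical, and a SimpleNamespace stands in for a Params instance.

```python
from types import SimpleNamespace


def default_intensity_thresholds(ms_type: str) -> dict:
    """Return the intensity thresholds described for set_default."""
    if ms_type == "orbitrap":
        return {"ms1_abs_int_tol": 30000, "ms2_abs_int_tol": 10000}
    # Every other instrument type uses the lower thresholds.
    return {"ms1_abs_int_tol": 1000, "ms2_abs_int_tol": 500}


def apply_defaults(params, ms_type: str, ion_mode: str) -> None:
    """Copy the thresholds and the ion mode onto a parameter object."""
    for name, value in default_intensity_thresholds(ms_type).items():
        setattr(params, name, value)
    params.ion_mode = ion_mode


demo_params = SimpleNamespace()  # stand-in for a Params instance
apply_defaults(demo_params, "qtof", "negative")
```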
check_parameters()
Validates all numeric parameters against PARAMETER_RAGES. If a value is out of range, it prints a warning and resets that parameter to the value in PARAMETER_DEFAULT. If ms2_library_path does not exist, sets it to None. Casts batch_size to int.
Note: The code refers to PARAMETER_RAGES throughout (typo intentional to match the implementation).
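The validate-and-reset behavior can be sketched with a subset of the two dictionaries. This is an illustration under assumptions: the function name reset_out_of_range is hypothetical, the bounds are treated as inclusive, the warning text is invented, and a SimpleNamespace stands in for a Params instance.

```python
from types import SimpleNamespace

# Subset of the documented ranges and reset values, copied for illustration.
RANGES = {"mz_tol_ms1": (0.0, 0.02), "feature_gap_tol": (0, 100), "ms2_sim_tol": (0.0, 1.0)}
DEFAULTS = {"mz_tol_ms1": 0.01, "feature_gap_tol": 30, "ms2_sim_tol": 0.7}


def reset_out_of_range(params, ranges=RANGES, defaults=DEFAULTS):
    """Reset any out-of-range attribute to its default; return the names reset."""
    reset = []
    for name, (low, high) in ranges.items():
        value = getattr(params, name)
        if not (low <= value <= high):  # assumption: bounds are inclusive
            print(f"Warning: {name}={value} is outside [{low}, {high}]; "
                  f"resetting to {defaults[name]}.")
            setattr(params, name, defaults[name])
            reset.append(name)
    return reset


demo_params = SimpleNamespace(mz_tol_ms1=0.5, feature_gap_tol=10, ms2_sim_tol=0.7)
reset_names = reset_out_of_range(demo_params)
```

In this run only mz_tol_ms1 is reset. feature_gap_tol stays at 10 because 10 lies inside its range, so the reset value of 30 applies only after a range violation.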
output_parameters(path, format="json")
Exports all parameters (except project_dir) to JSON, including "MassCube_version" obtained from importlib.metadata.version("masscube"). Only "json" is supported.
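A minimal sketch of such an export is shown below. It operates on a plain dictionary rather than a Params object, the function name export_parameters is hypothetical, and the "unknown" fallback for a missing package is an assumption added so the example runs without masscube installed.

```python
import json
import os
import tempfile
from importlib.metadata import PackageNotFoundError, version


def export_parameters(params: dict, path: str, package: str = "masscube") -> dict:
    """Write params to JSON, omitting project_dir and adding a version field."""
    payload = {key: value for key, value in params.items() if key != "project_dir"}
    try:
        payload["MassCube_version"] = version(package)
    except PackageNotFoundError:
        # Assumption: use a placeholder when the package is not installed.
        payload["MassCube_version"] = "unknown"
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=4)
    return payload


with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "parameters.json")
    export_parameters({"project_dir": "/tmp/project", "ion_mode": "negative"}, json_path)
    with open(json_path, encoding="utf-8") as handle:
        exported = json.load(handle)
```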
_check_raw_files_in_data_dir()
Cross-references basenames from sample_metadata with files in sample_dir (currently only .mzML/.mzXML), sets VALID and populates ABSOLUTE_PATH accordingly.
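The cross-referencing step can be sketched as a lookup from basename to absolute path. The function name match_raw_files is hypothetical, and matching the extension case-insensitively is an assumption. The returned (valid, absolute_path) tuples correspond to the VALID and ABSOLUTE_PATH columns.

```python
import os
import tempfile

RAW_EXTENSIONS = (".mzml", ".mzxml")  # compared in lower case (assumption)


def match_raw_files(sample_names, sample_dir):
    """Map each sample name to (valid, absolute_path_or_None)."""
    found = {}
    for entry in os.listdir(sample_dir):
        base, ext = os.path.splitext(entry)
        if ext.lower() in RAW_EXTENSIONS:
            found[base] = os.path.abspath(os.path.join(sample_dir, entry))
    return {name: (name in found, found.get(name)) for name in sample_names}


with tempfile.TemporaryDirectory() as tmp_dir:
    for file_name in ("qc_1.mzML", "sample_1.mzXML", "notes.txt"):
        open(os.path.join(tmp_dir, file_name), "w").close()
    matches = match_raw_files(["qc_1", "sample_1", "sample_2", "notes"], tmp_dir)
```

Here sample_2 has no file and notes has an unsupported extension, so both come back as invalid with no path.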
Functions
find_ms_info(file_name)
Reads up to the first 200 lines of an .mzML or .mzXML file (lower-cased text) to infer:
- ms_type: "orbitrap" if the text contains "orbitrap" or "q exactive"; "tripletof" if it contains "tripletof"; otherwise "qtof" if it contains "tof".
- ion_mode: "positive" or "negative", if mentioned.
- centroid: True if "centroid spectrum" or centroided="1" is present.
Returns (ms_type, ion_mode, centroid).
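The header sniffing can be approximated as below. This is a sketch, not the function in params.py: the name sniff_ms_info is hypothetical, returning None when nothing matches is an assumption, and ion mode detection by plain substring search is also an assumption. Checking "tripletof" before "tof" matters because the former contains the latter.

```python
import os
import tempfile
from itertools import islice


def sniff_ms_info(file_name, max_lines=200):
    """Infer (ms_type, ion_mode, centroid) from the first lines of a raw file."""
    with open(file_name, "r", encoding="utf-8", errors="ignore") as handle:
        head = "".join(islice(handle, max_lines)).lower()

    if "orbitrap" in head or "q exactive" in head:
        ms_type = "orbitrap"
    elif "tripletof" in head:
        ms_type = "tripletof"
    elif "tof" in head:
        ms_type = "qtof"
    else:
        ms_type = None  # assumption: no match leaves the type undetermined

    if "positive" in head:
        ion_mode = "positive"
    elif "negative" in head:
        ion_mode = "negative"
    else:
        ion_mode = None  # assumption, as above

    centroid = "centroid spectrum" in head or 'centroided="1"' in head
    return ms_type, ion_mode, centroid


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.mzML")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('<cvParam name="Q Exactive"/>\n')
        handle.write('<cvParam name="positive scan"/>\n')
        handle.write('<cvParam name="centroid spectrum"/>\n')
    info = sniff_ms_info(demo_path)
```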
Constants
PARAMETER_RAGES — Valid Ranges
PARAMETER_RAGES = {
"mz_lower_limit": (0.0, 100000.0),
"mz_upper_limit": (0.0, 100000.0),
"rt_lower_limit": (0.0, 10000.0),
"rt_upper_limit": (0.0, 10000.0),
"centroid_mz_tol": (0.0, 0.1),
"ms1_abs_int_tol": (0, 1e10),
"ms2_abs_int_tol": (0, 1e10),
"ms2_rel_int_tol": (0.0, 1.0),
"precursor_mz_offset": (0.0, 100000.0),
"mz_tol_ms1": (0.0, 0.02),
"mz_tol_ms2": (0.0, 0.02),
"feature_gap_tol": (0, 100),
"scan_scan_cor_tol": (0.0, 1.0),
"mz_tol_alignment": (0.0, 0.02),
"rt_tol_alignment": (0.0, 2.0),
"scan_number_cutoff": (0, 100),
"detection_rate_cutoff": (0.0, 1.0),
"mz_tol_merge_features": (0.0, 0.02),
"rt_tol_merge_features": (0.0, 0.5),
"ms2_sim_tol": (0.0, 1.0)
}
PARAMETER_DEFAULT — Defaults Used on Reset
PARAMETER_DEFAULT = {
"mz_lower_limit": 0.0,
"mz_upper_limit": 100000.0,
"rt_lower_limit": 0.0,
"rt_upper_limit": 10000.0,
"centroid_mz_tol": 0.005,
"ms1_abs_int_tol": 1000.0,
"ms2_abs_int_tol": 500,
"ms2_rel_int_tol": 0.01,
"precursor_mz_offset": 2.0,
"mz_tol_ms1": 0.01,
"mz_tol_ms2": 0.015,
"feature_gap_tol": 30,
"scan_scan_cor_tol": 0.7,
"mz_tol_alignment": 0.01,
"rt_tol_alignment": 0.2,
"scan_number_cutoff": 5,
"detection_rate_cutoff": 0.1,
"mz_tol_merge_features": 0.01,
"rt_tol_merge_features": 0.05,
"ms2_sim_tol": 0.7
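}
Discrepancy note: In the Params initializer, feature_gap_tol defaults to 10, whereas PARAMETER_DEFAULT["feature_gap_tol"] is 30. If check_parameters() resets values (due to range violations), it will use the 30 from PARAMETER_DEFAULT. The same pattern applies to rt_tol_merge_features, which is listed with an initializer default of 0.02 but has a reset value of 0.05 in PARAMETER_DEFAULT.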
Additional Notes & Gotchas
- Raw file discovery in workflow prep and path validation currently recognizes only .mzML and .mzXML files, even though file_format supports "mzjson"/"mzjson.gz" in principle.
- group_features_after_alignment is True in the initializer (despite an inline comment that mentions False), and gap_filling_method is the string "local_maximum".
- If ms2_library_path points to a non-existent path, it is automatically set to None during parameter checking.
- Exported JSON via output_parameters() omits project_dir and includes a "MassCube_version" field.
- The code consistently uses the identifier PARAMETER_RAGES (with a G), and method docs reflect that spelling.
This documentation mirrors the current params.py implementation.