alignment

Overview

This module provides functionality for aligning metabolic features from different samples in mass spectrometry data.

  • Feature alignment: Align features across different samples, considering parameters like m/z tolerance and retention time tolerance.
  • Gap filling: Fill in missing features across aligned samples using various strategies.
  • Merge features: Clean the feature table by merging features with nearly identical m/z and retention time.
  • Retention time correction: Correct retention times to align features more accurately.
  • Output feature table: Save the aligned features to a file.

Classes

AlignedFeature

A class to model a feature in mass spectrometry data. Generally, a feature is defined as a unique pair of m/z and retention time.

Attributes:

  • feature_id_arr (np.array): Feature ID from individual files (-1 if not detected or gap filled).
  • mz_arr (np.array): m/z values.
  • rt_arr (np.array): Retention times.
  • scan_idx_arr (np.array): Scan index of the peak apex.
  • peak_height_arr (np.array): Peak height.
  • peak_area_arr (np.array): Peak area.
  • top_average_arr (np.array): Average of the highest three intensities.
  • ms2_seq (list): Representative MS2 spectrum from each file (default: highest total intensity).
  • length_arr (np.array): Peak length (i.e. the number of non-zero scans in the peak).
  • gaussian_similarity_arr (np.array): Gaussian similarity.
  • noise_score_arr (np.array): Noise score.
  • asymmetry_factor_arr (np.array): Asymmetry factor.
  • sse_arr (np.array): Squared error to the smoothed curve.
  • is_segmented_arr (np.array): Whether the peak is segmented.
  • id (int): Index of the feature.
  • feature_group_id (int): Feature group ID.
  • mz (float): m/z.
  • rt (float): Retention time.
  • reference_file (str): The reference file with the highest peak height.
  • reference_scan_idx (int): The scan index of the peak apex from the reference file.
  • highest_intensity (float): The highest peak height across individual files (i.e. the peak height in the reference file).
  • ms2 (str): Representative MS2 spectrum.
  • ms2_reference_file (str): The reference file for the representative MS2 spectrum.
  • gaussian_similarity (float): Gaussian similarity from the reference file.
  • noise_score (float): Noise score from the reference file.
  • asymmetry_factor (float): Asymmetry factor from the reference file.
  • detection_rate (float): Number of files in which the feature was detected / total number of files (blanks not included).
  • detection_rate_gap_filled (float): Number of files in which the feature was detected after gap filling / total number of files (blanks not included).
  • charge_state (int): Charge state.
  • is_isotope (bool): Whether it is an isotope.
  • isotope_signals (list): Isotope signals [[m/z, intensity], …].
  • is_in_source_fragment (bool): Whether it is an in-source fragment.
  • adduct_type (str): Adduct type.
  • annotation_algorithm (str): Annotation algorithm. Not used now.
  • search_mode (str): ‘identity search’, ‘fuzzy search’, or ‘mzrt_search’.
  • similarity (float): Similarity score (0-1).
  • annotation (str): Name of annotated compound.
  • formula (str): Molecular formula.
  • matched_peak_number (int): Number of matched peaks.
  • smiles (str): SMILES.
  • inchikey (str): InChIKey.
  • matched_precursor_mz (float): Matched precursor m/z.
  • matched_adduct_type (str): Matched adduct type.
  • matched_ms2 (str): Matched MS2 spectra.
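
The relation between feature_id_arr and detection_rate can be illustrated with a minimal sketch. It assumes that -1 marks a file in which the feature was not detected (as described for feature_id_arr) and uses a hypothetical boolean mask, is_blank, to exclude blank files. The helper name is illustrative only and is not part of the module.

```python
import numpy as np

def detection_rate(feature_id_arr, is_blank):
    """Fraction of non-blank files in which the feature was detected.

    Hypothetical helper: `is_blank` is an assumed per-file boolean mask.
    """
    feature_id_arr = np.asarray(feature_id_arr)
    is_blank = np.asarray(is_blank, dtype=bool)
    sample_ids = feature_id_arr[~is_blank]      # drop blank files
    if sample_ids.size == 0:
        return 0.0
    # -1 means "not detected or gap filled", so it does not count as detected
    return float(np.count_nonzero(sample_ids != -1) / sample_ids.size)

ids = [12, -1, 7, 3, -1]                  # feature IDs in five files
blank = [False, False, False, False, True]
rate = detection_rate(ids, blank)         # 3 detected out of 4 non-blank files
```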

Functions

feature_alignment

feature_alignment(path: str, params: Params)

Align the features from multiple individually processed files stored in .txt format.

Parameters:

  • path (str): The path to the feature tables of individual files.
  • params (Params object): The parameters for alignment including sample names and sample groups.

Returns:

  • features (list of AlignedFeature objects)
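
The idea of matching features across files by m/z and retention time tolerance can be sketched as follows. This is not the module's implementation: the function names, the default tolerances, and the greedy closest-match rule are assumptions made for illustration. Each aligned feature keeps one feature ID per file, with -1 where nothing was matched.

```python
def find_match(aligned, mz, rt, mz_tol, rt_tol):
    """Return the index of the closest aligned feature within both tolerances, or -1."""
    best_idx, best_dist = -1, float("inf")
    for idx, (ref_mz, ref_rt) in enumerate(aligned):
        d_mz, d_rt = abs(ref_mz - mz), abs(ref_rt - rt)
        if d_mz <= mz_tol and d_rt <= rt_tol:
            dist = d_mz / mz_tol + d_rt / rt_tol   # tolerance-scaled distance
            if dist < best_dist:
                best_idx, best_dist = idx, dist
    return best_idx

def align_files(files, mz_tol=0.01, rt_tol=0.2):
    """Greedy alignment of per-file (mz, rt) lists (hypothetical sketch)."""
    aligned = []   # one (mz, rt) per aligned feature
    id_rows = []   # per aligned feature: feature ID in each file, -1 if absent
    for file_idx, feats in enumerate(files):
        for feat_id, (mz, rt) in enumerate(feats):
            idx = find_match(aligned, mz, rt, mz_tol, rt_tol)
            # start a new aligned feature if nothing matches or the slot is taken
            if idx == -1 or id_rows[idx][file_idx] != -1:
                aligned.append((mz, rt))
                id_rows.append([-1] * len(files))
                idx = len(aligned) - 1
            id_rows[idx][file_idx] = feat_id
    return aligned, id_rows

files = [
    [(100.000, 1.00), (200.000, 5.00)],   # file 0
    [(100.004, 1.05), (300.000, 9.00)],   # file 1
]
aligned, id_rows = align_files(files)
```

In this example the first feature of each file falls within both tolerances and is aligned, while the other two features remain specific to one file.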

gap_filling

gap_filling(features, params: Params)

Fill the gaps for aligned features.

Parameters:

  • features (list of AlignedFeature objects): The aligned features.
  • params (Params object): The parameters used for gap filling.

Returns:

  • features (list of AlignedFeature objects).

merge_features

merge_features(features: list, params: Params)

Clean features by merging features with almost the same m/z and retention time.

Parameters:

  • features (list of AlignedFeature objects): The aligned features.
  • params (Params object): The parameters used for merging features.

Returns:

  • features (list of AlignedFeature objects).
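
A minimal sketch of merging near-duplicate features, assuming each feature is a plain dict and that the most intense feature of each cluster is kept. The tolerances, the dict keys, and the keep-the-highest rule are illustrative assumptions, not the module's documented behavior.

```python
def merge_close_features(features, mz_tol=0.01, rt_tol=0.05):
    """Keep one feature per (m/z, RT) neighbourhood, preferring the highest peak."""
    kept = []
    # visit features from most to least intense so the strongest one survives
    for feat in sorted(features, key=lambda f: f["height"], reverse=True):
        duplicate = any(
            abs(feat["mz"] - k["mz"]) <= mz_tol and abs(feat["rt"] - k["rt"]) <= rt_tol
            for k in kept
        )
        if not duplicate:
            kept.append(feat)
    return kept

features = [
    {"id": 0, "mz": 150.000, "rt": 3.00, "height": 1.0e5},
    {"id": 1, "mz": 150.002, "rt": 3.01, "height": 4.0e4},  # near-duplicate of id 0
    {"id": 2, "mz": 150.002, "rt": 6.50, "height": 2.0e4},  # same m/z, different RT
]
merged = merge_close_features(features)
```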

convert_features_to_df

convert_features_to_df(features, sample_names, quant_method="peak_height")

Convert the aligned features to a DataFrame.

Parameters:

  • features (list of AlignedFeature objects): The aligned features.
  • sample_names (list): The sample names.
  • quant_method (str): The quantification method, “peak_height”, “peak_area” or “top_average”.

Returns:

  • feature_table (pd.DataFrame): The feature DataFrame.

output_feature_to_msp

output_feature_to_msp(feature_table, output_path)

Output MS2 spectra to MSP format.

Parameters:

  • feature_table (pd.DataFrame): The feature table.
  • output_path (str): The path to the output MSP file.
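
For orientation, an MSP file is a plain-text sequence of records, each with header lines followed by a peak list. The sketch below formats one such record. The exact fields written by output_feature_to_msp are not listed here, so the field names (NAME, PRECURSORMZ, Num Peaks) and the helper itself are assumptions based on common MSP usage.

```python
def format_msp_record(name, precursor_mz, peaks):
    """Format one MS2 spectrum as an MSP-style text record (hypothetical helper)."""
    lines = [
        f"NAME: {name}",
        f"PRECURSORMZ: {precursor_mz:.4f}",
        f"Num Peaks: {len(peaks)}",
    ]
    # one "m/z<TAB>intensity" line per fragment peak
    lines += [f"{mz:.4f}\t{intensity:.0f}" for mz, intensity in peaks]
    # a blank line separates consecutive records
    return "\n".join(lines) + "\n\n"

record = format_msp_record(
    "feature_1", 180.0634, [(89.0239, 1200.0), (119.0344, 560.0)]
)
```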

output_feature_table

output_feature_table(feature_table, output_path)

Output the aligned feature table.

Parameters:

  • feature_table (pd.DataFrame): The aligned feature table.
  • output_path (str): The path to save the aligned feature table.

retention_time_correction

retention_time_correction(mz_ref, rt_ref, mz_arr, rt_arr, mz_tol=0.01, rt_tol=2.0, mode='linear_interpolation', rt_max=None)

Correct retention times for feature alignment. The correction has three steps: (1) find the selected anchors in the given data, (2) build a model that maps observed retention times to reference retention times, and (3) apply the model to correct the retention times.

Parameters:

  • mz_ref (np.array): The m/z values of the selected anchors from another reference file.
  • rt_ref (np.array): The retention times of the selected anchors from another reference file.
  • mz_arr (np.array): Feature m/z values in the current file.
  • rt_arr (np.array): Feature retention times in the current file.
  • mz_tol (float): The m/z tolerance for selecting anchors.
  • rt_tol (float): The retention time tolerance for selecting anchors.
  • mode (str): The mode for retention time correction. Only ‘linear_interpolation’ is available now.
  • rt_max (float): End of the retention time range.

Returns:

  • rt_arr (np.array): The corrected retention times.
  • f (interp1d): The model for retention time correction.
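
The three steps can be sketched with NumPy alone. This is an illustration under stated assumptions, not the module's code: np.interp stands in for the interp1d model, anchors with zero or several matches are skipped, rt_max is required, and the mapping is pinned at (0, 0) and (rt_max, rt_max) so that features outside the anchor range are left nearly unchanged. The function names are hypothetical.

```python
import numpy as np

def find_anchors(mz_ref, rt_ref, mz_arr, rt_arr, mz_tol=0.01, rt_tol=2.0):
    """Step 1: pair each reference anchor with its single matching feature."""
    rt_obs, rt_exp = [], []
    for mz_a, rt_a in zip(mz_ref, rt_ref):
        hits = np.where(
            (np.abs(mz_arr - mz_a) <= mz_tol) & (np.abs(rt_arr - rt_a) <= rt_tol)
        )[0]
        if hits.size == 1:           # skip missing or ambiguous anchors
            rt_obs.append(rt_arr[hits[0]])
            rt_exp.append(rt_a)
    return np.array(rt_obs), np.array(rt_exp)

def correct_rt(mz_ref, rt_ref, mz_arr, rt_arr, rt_max, mz_tol=0.01, rt_tol=2.0):
    """Steps 2 and 3: build a piecewise-linear map and apply it."""
    rt_obs, rt_exp = find_anchors(mz_ref, rt_ref, mz_arr, rt_arr, mz_tol, rt_tol)
    if rt_obs.size == 0:
        return rt_arr.copy()         # nothing to anchor on
    order = np.argsort(rt_obs)       # np.interp needs increasing x values
    # assumption: all anchors lie strictly between 0 and rt_max
    x = np.concatenate(([0.0], rt_obs[order], [rt_max]))
    y = np.concatenate(([0.0], rt_exp[order], [rt_max]))
    return np.interp(rt_arr, x, y)

mz_ref = np.array([100.0, 200.0, 300.0])
rt_ref = np.array([2.0, 5.0, 8.0])
mz_arr = np.array([100.001, 150.0, 200.002, 300.001])
rt_arr = np.array([2.2, 3.7, 5.2, 8.2])   # this file elutes 0.2 later
corrected = correct_rt(mz_ref, rt_ref, mz_arr, rt_arr, rt_max=10.0)
```

Here the three anchors are shifted back onto their reference retention times, and the non-anchor feature at 3.7 is interpolated between its two neighbouring anchors.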

rt_anchor_selection

rt_anchor_selection(data_path, num=50, noise_score_tol=0.1, mz_tol=0.01)

Select retention time anchors from the feature tables. Retention time anchors have unique m/z values and low noise scores. From all candidate features, the top num features with the highest peak heights are selected as anchors.

Parameters:

  • data_path (str): The absolute path to the directory containing the feature tables.
  • num (int): The number of anchors to be selected.
  • noise_score_tol (float): The maximum noise score allowed for an anchor.
  • mz_tol (float): The m/z tolerance for selecting anchors.

Returns:

  • anchors (list): A list of anchors (dict) for retention time correction.
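
The selection rule described above (unique m/z, low noise score, then the top num by peak height) can be sketched on in-memory data. The real function reads feature tables from data_path. The dict keys and the helper name below are assumptions for illustration.

```python
def select_anchors(features, num=50, noise_score_tol=0.1, mz_tol=0.01):
    """Pick up to `num` clean, m/z-unique features with the highest peaks."""
    candidates = []
    for i, feat in enumerate(features):
        if feat["noise_score"] > noise_score_tol:
            continue                 # too noisy to serve as an anchor
        # unique m/z: no other feature lies within mz_tol
        unique = all(
            abs(feat["mz"] - other["mz"]) > mz_tol
            for j, other in enumerate(features)
            if j != i
        )
        if unique:
            candidates.append(feat)
    candidates.sort(key=lambda f: f["peak_height"], reverse=True)
    return candidates[:num]

features = [
    {"mz": 100.000, "rt": 1.2, "peak_height": 5e5, "noise_score": 0.02},
    {"mz": 100.005, "rt": 4.8, "peak_height": 9e5, "noise_score": 0.01},  # m/z not unique
    {"mz": 250.100, "rt": 3.1, "peak_height": 3e5, "noise_score": 0.05},
    {"mz": 310.200, "rt": 6.4, "peak_height": 8e5, "noise_score": 0.40},  # too noisy
    {"mz": 420.300, "rt": 7.9, "peak_height": 7e5, "noise_score": 0.00},
]
anchors = select_anchors(features, num=2)
```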