classifier_builder

classifier_builder

Overview

This module provides a set of tools for building and using a random forest classifier for metabolomics data analysis. It supports feature selection, model training, cross-validation, and prediction on new data.

Functions

feature_selection

feature_selection(X, y, k=None)

Parameters:

  • X (numpy.ndarray): The feature matrix.
  • y (numpy.ndarray): The target variable.
  • k (int): The number of features to select. By default, it is set to the number of samples divided by 10 (1/10 rule) and rounded up.

Returns:

  • X_new (numpy.ndarray): The selected features.
  • selected_features (numpy.ndarray): The indices of the selected features.

def feature_selection(X, y, k=None): """ Select features for the classification model.

Parameters
----------
X : two-dimensional numpy array
    The feature matrix.
y : one-dimensional numpy array
    The target variable.
k : int
    The number of features to select. By default, it
    is set to the number of samples divided by 10 (1/10 rule)
    and rounded up.

Returns
-------
X_new : two-dimensional numpy array
    The fit-transformed feature matrix.
selected_features : one-dimensional numpy array
    The indices of the selected features.
"""

train_rdf_model

train_rdf_model(X_train, y_train)

Parameters:

  • X_train (numpy.ndarray): The feature matrix for training.
  • y_train (numpy.ndarray): The target variable for training.

Returns:

  • model (RandomForestClassifier): The trained random forest model.

cross_validate_model

cross_validate_model(X, y, model, k=5, random_state=0)

Parameters:

  • X (numpy.ndarray): The feature matrix.
  • y (numpy.ndarray): The target variable.
  • model (RandomForestClassifier): The trained random forest model.
  • k (int): The number of folds for cross-validation.
  • random_state (int): The random state for the shuffle in KFold.

Returns:

  • scores (list): The accuracy scores for each fold.

predict

predict(model, X_test)

Parameters:

  • model (RandomForestClassifier): The trained random forest model.
  • X_test (numpy.ndarray): The feature matrix for testing.

Returns:

  • predictions (numpy.ndarray): The predicted classes.

evaluate_model

evaluate_model(predictions, y_test)

Parameters:

  • predictions (numpy.ndarray): The predicted classes.
  • y_test (numpy.ndarray): The true classes.

Returns:

  • accuracy (float): The accuracy of the model.

build_classifier

build_classifier(path=None, by_group=None, feature_num=None, gaussian_cutoff=0.6, detection_rate_cutoff=0.9, fill_ratio=0.5, cross_validation_k=5)

Parameters:

  • path (str): Path to the project file.
  • feature_num (int): The number of features to select for building the model.
  • gaussian_cutoff (float): The Gaussian similarity cutoff. Default is 0.6.
  • fill_ratio (float): The zero values will be replaced by the minimum value in the feature matrix times fill_ratio. Default is 0.5.
  • cross_validation_k (int): The number of folds for cross-validation. Default is 5.

predict_samples

predict_samples(path, mz_tol=0.01, rt_tol=0.3)

Parameters:

  • path (str): Path to the project file.
  • mz_tol (float): The m/z tolerance for matching the features. Default is 0.01.
  • rt_tol (float): The retention time tolerance for matching the features. Default is 0.3.