Functions to run the analyses:

Feature selection methods

class Run.General_analyses.Fselection(n_features=10)

Perform feature selection based on the F-test. The data must be normalized before F-test feature selection is applied.
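
The normalization requirement above can be illustrated with scikit-learn's own F-test selector. This is only a minimal sketch, not the implementation of Fselection: it assumes that log(1+x) normalisation followed by SelectKBest with f_classif is a reasonable stand-in, and the count matrix and labels are made-up toy data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Toy count matrix: 60 cells x 50 genes, two cell types (hypothetical data).
labels = np.array(["A"] * 30 + ["B"] * 30)
counts = rng.poisson(lam=5.0, size=(60, 50)).astype(float)
# Make the first 10 genes informative: much higher counts in type B.
counts[30:, :10] += rng.poisson(lam=20.0, size=(30, 10))

# Normalise first (log(1+x)), then rank genes by their ANOVA F-statistic.
normalised = np.log1p(counts)
selector = SelectKBest(score_func=f_classif, k=10)
selected = selector.fit_transform(normalised, labels)

# Column indices of the genes that were kept.
kept_genes = selector.get_support(indices=True)
print(selected.shape)  # (60, 10)
print(kept_genes)
```

Here k plays the role that n_features plays in Fselection; whether the two map onto each other exactly is an assumption.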

class Run.General_analyses.HVGselection(flavour, top_genes=500)

Perform highly variable feature selection using scanpy.

Functions to run the analyses

No K-fold cross-validation

Run.General_analyses.Run_H_NoKF(classifier, data, labels, parameters, n_jobsHCL, Norm=True, greedy_=False)

Function to run hierarchical classification without K-fold cross validation with a dense data matrix.

Parameters:
  • classifier – Scikit-learn classifier

  • data (dense matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default False

Returns:

  • Final Classifier – Trained scikit-learn classifier

  • Xtest (pandas dataframe) – Data test splits

  • ytest (list) – Label test splits

  • predicted (list) – Predictions

  • probs (matrix) – Prediction probabilities for all the classes

  • Bestparam (float or int) – Best hyperparameter(s)
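
The parameters argument follows the input format of scikit-learn's ParameterGrid. A small sketch of how such a grid expands into candidate settings; the hyperparameter names C and penalty are hypothetical and depend on the classifier that is passed in.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid for a classifier with a regularisation parameter C.
parameters = {"C": [0.01, 0.1, 1.0], "penalty": ["l2"]}

# Every combination of the listed values becomes one candidate setting.
grid = list(ParameterGrid(parameters))
for candidate in grid:
    print(candidate)

# A sequence of dicts is also accepted; the individual grids are concatenated.
parameters_seq = [{"C": [0.1, 1.0]}, {"C": [10.0], "penalty": ["l1", "l2"]}]
n_candidates = len(list(ParameterGrid(parameters_seq)))
print(n_candidates)  # 4
```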

Run.General_analyses.Run_H_NoKF_sparse(classifier, data, labels, parameters, n_jobsHCL, Norm=True, greedy_=False)

Function to run hierarchical classification without K-fold cross validation with a sparse data matrix.

Parameters:
  • classifier – Scikit-learn classifier

  • data (sparse matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default False

Returns:

  • Final Classifier – Trained scikit-learn classifier

  • Xtest (pandas dataframe) – Data test splits

  • ytest (list) – Label test splits

  • predicted (list) – Predictions

  • probs (matrix) – Prediction probabilities for all the classes

  • Bestparam (float or int) – Best hyperparameter(s)
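
For sparse input, a log(1+x) normalisation such as the one enabled by Norm=True can be applied without densifying the matrix. The following is a sketch under the assumption that the normalisation is a plain element-wise log(1+x); how the package implements it internally is not documented here.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0.0, 3.0, 0.0],
                  [1.0, 0.0, 7.0]])
counts = sparse.csr_matrix(dense)

# log1p maps 0 to 0, so only the stored non-zero entries change
# and the result is still a sparse matrix.
normalised = counts.log1p()

print(sparse.issparse(normalised))                         # True
print(normalised.nnz == counts.nnz)                        # True
print(np.allclose(normalised.toarray(), np.log1p(dense)))  # True
```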

Run.General_analyses.Run_Flat_NoKF(classifier, data, labels, parameters, Norm=True)

Function to run flat classification without K-fold cross validation.

Parameters:
  • classifier – Scikit-learn classifier

  • data (dense matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

Returns:

  • Final Classifier – Trained scikit-learn classifier

  • Xtest (pandas dataframe) – Data test splits

  • ytest (list) – Label test splits

  • predicted (list) – Predictions

With K-fold cross-validation

Flat analyses

Run.General_analyses.Run_Flat_KF_sparse(classifier_, n_folds, data, labels, parameters, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run flat classification with K-fold cross validation and a sparse data matrix.

Parameters:
  • classifier_ – Scikit-learn classifier

  • n_folds (int) – Number of folds, must be at least 2

  • data (sparse matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters. By default False

  • F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters. By default False

  • save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False

  • metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

  • AllPredictedValues (list of lists)

  • AllProbabilities (list of matrices)

  • AllActualValues (list of lists)

  • AccuraciesFolds (list)

  • Bestparams (list)

  • Classifiers (list of classifiers (optional))

  • Xtests (list of matrices (optional))

  • ytests (list of matrices (optional))
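
The per-fold outputs listed above can be pictured with a plain scikit-learn loop. This is a simplified sketch, not the package's implementation: it assumes a KFold splitter and a LogisticRegression classifier, uses synthetic data, and leaves out hyperparameter tuning and feature selection entirely.

```python
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in for a sparse expression matrix with three cell types.
X_dense, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                                 n_classes=3, random_state=0)
X = sparse.csr_matrix(X_dense)

n_folds = 3
all_predicted, all_probabilities, all_actual, accuracies = [], [], [], []

kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])

    # Collect one entry per fold, mirroring the "list of lists" outputs.
    predicted = clf.predict(X[test_idx])
    all_predicted.append(predicted.tolist())
    all_probabilities.append(clf.predict_proba(X[test_idx]))
    all_actual.append(y[test_idx].tolist())
    accuracies.append(accuracy_score(y[test_idx], predicted))

print([len(p) for p in all_predicted])  # [40, 40, 40]
print(accuracies)
```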

Run.General_analyses.Run_Flat_KF(classifier_, n_folds, data, labels, parameters, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run flat classification with K-fold cross validation and a dense data matrix.

Parameters:
  • classifier_ – Scikit-learn classifier

  • n_folds (int) – Number of folds, must be at least 2

  • data (dense matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters. By default False

  • F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters. By default False

  • save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False

  • metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

  • AllPredictedValues (list of lists)

  • AllProbabilities (list of matrices)

  • AllActualValues (list of lists)

  • AccuraciesFolds (list)

  • Bestparams (list)

  • Classifiers (list of classifiers (optional))

  • Xtests (list of matrices (optional))

  • ytests (list of matrices (optional))
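
The two values accepted by metric correspond to scikit-learn metrics of the same names. A short sketch of how they differ on the same predictions; the labels and probabilities are made up, and how the package uses the scores internally is not covered here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = ["B cell", "T cell", "T cell", "NK cell"]
classes = ["B cell", "NK cell", "T cell"]  # column order of the probabilities
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.1, 0.7],
                  [0.1, 0.5, 0.4],
                  [0.1, 0.6, 0.3]])

# Hard predictions: the most probable class per cell.
y_pred = [classes[i] for i in probs.argmax(axis=1)]

# Accuracy only looks at the hard predictions (higher is better).
acc = accuracy_score(y_true, y_pred)
# Log loss scores the predicted probabilities themselves (lower is better).
loss = log_loss(y_true, probs, labels=classes)

print(acc)  # 0.75
print(loss)
```

Note that the two metrics point in opposite directions: a better model has a higher accuracy but a lower log loss.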

Run.General_analyses.Run_Flat_KF_sparse_splitted(classifier_, n_folds, data, labels, parameters, fold, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run flat classification for one fold of K-fold cross validation on a sparse data matrix.

Parameters:
  • classifier_ – Scikit-learn classifier

  • n_folds (int) – Number of folds, must be at least 2

  • data (sparse matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • fold (int) – Specific fold that is currently considered

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters. By default False

  • F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters. By default False

  • save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False

  • metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

  • predicted (list)

  • prob (matrix)

  • y_test (list)

  • accuracy_fold (float)

  • Bestparams (list)

  • acc (list)

  • Final_classifier (trained classifier (optional))

Run.General_analyses.Run_Flat_KF_splitted(classifier_, n_folds, data, labels, parameters, fold, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run flat classification for one fold of K-fold cross validation on a dense data matrix.

Parameters:
  • classifier_ – Scikit-learn classifier

  • n_folds (int) – Number of folds, must be at least 2

  • data (dense matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • fold (int) – Specific fold that is currently considered

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters. By default False

  • F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters. By default False

  • save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False

  • metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

  • predicted (list)

  • prob (matrix)

  • X_test (matrix)

  • y_test (list)

  • accuracy_fold (float)

  • Bestparams (list)

  • acc (list)

  • Final_classifier (trained classifier (optional))
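
The splitted variants take a fold argument and handle a single fold per call. Below is a minimal sketch of how one fold can be picked out of a reproducible K-fold partition. It assumes scikit-learn's KFold and a zero-based fold index; neither the splitter nor the indexing convention used by the package is stated in this documentation.

```python
import numpy as np
from sklearn.model_selection import KFold

n_folds = 5
fold = 2  # zero-based index of the fold to process (an assumption)

# Toy data: 20 samples with 2 features each (hypothetical).
data = np.arange(40).reshape(20, 2)
labels = np.array(["A", "B"] * 10)

# A fixed random_state gives the same partition on every call,
# so separate runs for different folds stay consistent with each other.
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
train_idx, test_idx = list(kf.split(data))[fold]

X_train, X_test = data[train_idx], data[test_idx]
y_train, y_test = labels[train_idx], labels[test_idx]
print(len(train_idx), len(test_idx))  # 16 4
```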

Hierarchical analyses

Run.General_analyses.Run_H_KF(classifier_, n_folds, data, labels, parameters, n_jobsHCL, reject_thresh, greedy_=True, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run hierarchical classification with K-fold cross validation and a dense data matrix.

Parameters:
  • classifier_ – Scikit-learn classifier

  • n_folds (int) – Number of folds, must be at least 2

  • data (dense matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification

  • reject_thresh (float or None (str)) – If not None, annotation stops as soon as the probability of a label drops below this value

  • greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default True

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters. By default False

  • F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters. By default False

  • save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False

  • metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

  • AllPredictedValues (list of lists)

  • AllProbabilities (list of matrices)

  • AllActualValues (list of lists)

  • AccuraciesFolds (list)

  • Bestparams (list)

  • Classifiers (list of classifiers (optional))

  • Xtests (list of matrices (optional))

  • ytests (list of matrices (optional))
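
The effect of reject_thresh can be sketched on a toy hierarchy. This is an illustration only: the hierarchy, the labels and the local probabilities are hypothetical, and it assumes that the "probability of a label" is the product of the local probabilities along the path from the root, which this documentation does not specify.

```python
# Toy hierarchy: each internal node maps to its children (hypothetical labels).
hierarchy = {
    "root": ["Lymphoid", "Myeloid"],
    "Lymphoid": ["B cell", "T cell"],
    "T cell": ["CD4 T", "CD8 T"],
}

# Hypothetical local classifier outputs for one cell: P(child | parent).
local_probs = {
    "root": {"Lymphoid": 0.9, "Myeloid": 0.1},
    "Lymphoid": {"B cell": 0.2, "T cell": 0.8},
    "T cell": {"CD4 T": 0.55, "CD8 T": 0.45},
}


def annotate(reject_thresh=None):
    """Walk down the hierarchy and stop once the path probability drops below the threshold."""
    node, path_prob = "root", 1.0
    while node in hierarchy:
        child, p = max(local_probs[node].items(), key=lambda kv: kv[1])
        if reject_thresh is not None and path_prob * p < reject_thresh:
            break  # reject: keep the last label that was confident enough
        node, path_prob = child, path_prob * p
    return node, path_prob


print(annotate())      # no threshold: descends to the leaf "CD4 T"
print(annotate(0.5))   # stops at the internal node "T cell"
print(annotate(0.8))   # stops one level higher, at "Lymphoid"
```

A higher threshold therefore yields less specific but more confident labels.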

Run.General_analyses.Run_H_KF_sparse(classifier_, n_folds, data, labels, parameters, n_jobsHCL, reject_thresh, greedy_=True, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run hierarchical classification with K-fold cross validation and a sparse data matrix.

Parameters:
  • classifier_ – Scikit-learn classifier

  • n_folds (int) – Number of folds, must be at least 2

  • data (sparse matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification

  • reject_thresh (float or None (str)) – If not None, annotation stops as soon as the probability of a label drops below this value

  • greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default True

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters. By default False

  • F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters. By default False

  • save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False

  • metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

  • AllPredictedValues (list of lists)

  • AllProbabilities (list of matrices)

  • AllActualValues (list of lists)

  • AccuraciesFolds (list)

  • Bestparams (list)

  • Classifiers (list of classifiers (optional))

  • Xtests (list of matrices (optional))

  • ytests (list of matrices (optional))
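
The difference between greedy and non-greedy hierarchical classification can be sketched as follows. The definitions used here are assumptions rather than the package's documented behaviour: greedy is taken to mean following the most probable child at every node, non-greedy to mean scoring every leaf by the product of the probabilities along its path. The hierarchy and probabilities are hypothetical.

```python
# Toy two-level hierarchy with local probabilities P(child | parent).
hierarchy = {"root": ["X", "Y"], "X": ["x1", "x2"], "Y": ["y1", "y2"]}
local_probs = {
    "root": {"X": 0.55, "Y": 0.45},
    "X": {"x1": 0.6, "x2": 0.4},
    "Y": {"y1": 0.95, "y2": 0.05},
}


def greedy_predict():
    """Follow the locally most probable child until a leaf is reached."""
    node = "root"
    while node in hierarchy:
        node = max(local_probs[node], key=local_probs[node].get)
    return node


def leaf_scores(node="root", prob=1.0):
    """Return the path probability of every leaf below the given node."""
    if node not in hierarchy:
        return {node: prob}
    scores = {}
    for child in hierarchy[node]:
        scores.update(leaf_scores(child, prob * local_probs[node][child]))
    return scores


def non_greedy_predict():
    """Pick the leaf with the highest path probability over the whole tree."""
    scores = leaf_scores()
    return max(scores, key=scores.get)


print(greedy_predict())      # x1
print(non_greedy_predict())  # y1
```

In this example the greedy walk commits to branch X at the root, although the best leaf overall (y1, path probability 0.4275 versus 0.33 for x1) lies under Y. The greedy walk evaluates only one path, whereas the non-greedy variant has to score every leaf.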

Run.General_analyses.Run_H_KF_splitted(classifier_, n_folds, data, labels, parameters, n_jobsHCL, reject_thresh, fold, greedy_=True, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run hierarchical classification for one fold of K-fold cross validation on a dense data matrix.

Parameters:
  • classifier_ – Scikit-learn classifier

  • n_folds (int) – Number of folds, must be at least 2

  • data (dense matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • fold (int) – Specific fold that is currently considered

  • n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification

  • reject_thresh (float or None (str)) – If not None, annotation stops as soon as the probability of a label drops below this value

  • greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default True

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters. By default False

  • F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters. By default False

  • save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False

  • metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

  • predicted (list)

  • y_test (list)

  • accuracy_fold (float)

  • Bestparam (float)

  • acc (list)

  • Final_Classifiers (classifiers (optional))

  • Xtest (matrix (optional))

Run.General_analyses.Run_H_KF_sparse_splitted(classifier_, n_folds, data, labels, parameters, n_jobsHCL, reject_thresh, fold, greedy_=True, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run hierarchical classification for one fold of K-fold cross validation on a sparse data matrix.

Parameters:
  • classifier_ – Scikit-learn classifier

  • n_folds (int) – Number of folds, must be at least 2

  • data (sparse matrix) – Data matrix

  • labels (list) – Cell type labels

  • parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)

  • fold (int) – Specific fold that is currently considered

  • n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification

  • reject_thresh (float or None (str)) – If not None, annotation stops as soon as the probability of a label drops below this value

  • greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default True

  • Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

  • HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters. By default False

  • F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters. By default False

  • save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False

  • metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

  • predicted (list)

  • y_test (list)

  • accuracy_fold (float)

  • Bestparam (float)

  • acc (list)

  • Final_Classifiers (classifiers (optional))

  • Xtest (matrix (optional))

To save the results

Run.General_analyses.SaveResultsKF(PredictedValues, ActualValues, AccuraciesFolds, AllAccuracies, overall_best_params, directory, namespecific)

Function to save K-fold cross-validation results.

Parameters:
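  • PredictedValues (list of lists) – Predicted labels per fold

  • ActualValues (list of lists) – Actual labels per fold

  • AccuraciesFolds (list) – Accuracy per fold

  • overall_best_params (list) – Best hyperparameters per fold

  • directory (str) – Directory in which to save the results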

  • namespecific (str) – Name of the analysis

Returns:

  • namespecific_Other.csv – Contains accuracies and best parameters per fold

  • namespecific_ActualValueslist.csv

  • namespecific_Preslist.csv
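
The naming scheme of the output files can be sketched with the standard library. The helper save_results_sketch below is hypothetical and is not SaveResultsKF itself: the file names follow the list above, but the column layout inside each file is an assumption.

```python
import csv
import os
import tempfile


def save_results_sketch(predicted, actual, accuracies, best_params, directory, namespecific):
    """Write per-fold results to three CSV files (the layout is an assumption)."""
    os.makedirs(directory, exist_ok=True)
    paths = {}
    for suffix, rows in (("ActualValueslist", actual), ("Preslist", predicted)):
        paths[suffix] = os.path.join(directory, f"{namespecific}_{suffix}.csv")
        with open(paths[suffix], "w", newline="") as handle:
            csv.writer(handle).writerows(rows)  # one row per fold

    paths["Other"] = os.path.join(directory, f"{namespecific}_Other.csv")
    with open(paths["Other"], "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["fold", "accuracy", "best_params"])
        for fold, (acc, params) in enumerate(zip(accuracies, best_params)):
            writer.writerow([fold, acc, params])
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = save_results_sketch(
        predicted=[["A", "B"], ["B", "B"]],
        actual=[["A", "B"], ["A", "B"]],
        accuracies=[1.0, 0.5],
        best_params=[{"C": 1.0}, {"C": 0.1}],
        directory=tmp,
        namespecific="demo",
    )
    files = sorted(os.listdir(tmp))
    with open(written["Other"], newline="") as handle:
        other_rows = list(csv.reader(handle))

print(files)
print(other_rows[0])  # ['fold', 'accuracy', 'best_params']
```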