Functions to run the analyses:

Feature selection methods

class Run.General_analyses.Fselection(n_features=10): Perform feature selection based on the F-test. Note that the data must be normalized before F-test-based feature selection is applied.
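The order of operations described above (normalise first, then select with the F-test) can be sketched with scikit-learn directly. This is only an illustration: it assumes that F-test selection behaves like scikit-learn's SelectKBest with f_classif, which this page does not confirm, and the toy data and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
# Toy count matrix: 60 cells x 50 genes, three cell types of 20 cells each.
counts = rng.poisson(lam=2.0, size=(60, 50)).astype(float)
labels = np.repeat(["B cell", "T cell", "NK cell"], 20)
# Make the first 5 genes informative by shifting them per cell type.
counts[labels == "T cell", :5] += 10
counts[labels == "NK cell", :5] += 20

# Normalise first (log(1+x)), then select features with the F-test.
normalised = np.log1p(counts)
selector = SelectKBest(score_func=f_classif, k=10)  # k plays the role of n_features
selected = selector.fit_transform(normalised, labels)

# The informative genes should be among the selected columns.
informative_kept = set(range(5)) <= set(selector.get_support(indices=True))
```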

class Run.General_analyses.HVGselection(flavour, top_genes=500): Perform highly variable feature selection based on the scanpy function.

Functions to run the analyses

No K-fold cross-validation

Run.General_analyses.Run_H_NoKF(classifier, data, labels, parameters, n_jobsHCL, Norm=True, greedy_=False)

Function to run hierarchical classification without K-fold cross validation with a dense data matrix.

Parameters:

classifier – Scikit-learn classifier
data (dense matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default False

Returns:

Final Classifier – Trained scikit-learn classifier
Xtest (pandas dataframe) – Data test splits
ytest (list) – Label test splits
predicted (list) – Predictions
probs (matrix) – Prediction probabilities for all the classes
Bestparam (float or int) – Best hyperparameter(s)
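The parameters argument described above follows the input format of scikit-learn's ParameterGrid: either one dict mapping hyperparameter names to sequences of values, or a sequence of such dicts. The short sketch below shows both forms; the hyperparameter names are illustrative and depend on the classifier that is passed.

```python
from sklearn.model_selection import ParameterGrid

# A single grid: every combination of C and penalty is evaluated.
single_grid = {"C": [0.1, 1, 10], "penalty": ["l2"]}

# A sequence of grids: useful when some options are only valid together.
multiple_grids = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

single_combos = list(ParameterGrid(single_grid))
multi_combos = list(ParameterGrid(multiple_grids))

print(len(single_combos))  # 3
print(len(multi_combos))   # 6 (2 linear + 4 rbf combinations)
```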

Run.General_analyses.Run_H_NoKF_sparse(classifier, data, labels, parameters, n_jobsHCL, Norm=True, greedy_=False)

Function to run hierarchical classification without K-fold cross validation with a sparse data matrix.

Parameters:

classifier – Scikit-learn classifier
data (sparse matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default False

Returns:

Final Classifier – Trained scikit-learn classifier
Xtest (pandas dataframe) – Data test splits
ytest (list) – Label test splits
predicted (list) – Predictions
probs (matrix) – Prediction probabilities for all the classes
Bestparam (float or int) – Best hyperparameter(s)
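The log(1+x) normalisation controlled by Norm can be applied to a sparse matrix without densifying it, because log(1+0) = 0 leaves the implicit zeros untouched. The sketch below uses SciPy's own log1p method on a CSR matrix; whether the function above normalises in exactly this way is an assumption.

```python
import numpy as np
from scipy import sparse

dense_counts = np.array([[0, 3, 0], [1, 0, 0], [0, 0, 7]], dtype=float)
sparse_counts = sparse.csr_matrix(dense_counts)

# log(1+x) maps 0 to 0, so only the stored (non-zero) values change.
sparse_norm = sparse_counts.log1p()
dense_norm = np.log1p(dense_counts)

same_values = np.allclose(sparse_norm.toarray(), dense_norm)
still_sparse = sparse.issparse(sparse_norm) and sparse_norm.nnz == sparse_counts.nnz
```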

Run.General_analyses.Run_Flat_NoKF(classifier, data, labels, parameters, Norm=True)

Function to run flat classification without K-fold cross validation.

Parameters:

classifier – Scikit-learn classifier
data (dense matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True

Returns:

Final Classifier – Trained scikit-learn classifier
Xtest (pandas dataframe) – Data test splits
ytest (list) – Label test splits
predicted (list) – Predictions

With K-fold cross-validation

Flat analyses

Run.General_analyses.Run_Flat_KF_sparse(classifier_, n_folds, data, labels, parameters, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run flat classification with K-fold cross validation and a sparse data matrix.

Parameters:

classifier_ – Scikit-learn classifier
n_folds (int) – Number of folds, must be at least 2
data (sparse matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters, by default False
F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters, by default False
save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the train-test splits, by default False
metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

AllPredictedValues (list of lists)
AllProbabilities (list of matrices)
AllActualValues (list of lists)
AccuraciesFolds (list)
Bestparams (list)
Classifiers (list of classifiers (optional))
Xtests (list of matrices (optional))
ytests (list of matrices (optional))
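The two values accepted by metric correspond to scikit-learn functions of the same names, and they point in opposite directions: accuracy_score is higher for better predictions, log_loss is lower. A minimal comparison on hypothetical predictions is sketched below; how the function above uses the metric internally is not shown on this page.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

classes = ["B cell", "NK cell", "T cell"]
y_true = ["B cell", "T cell", "NK cell", "T cell"]
# One row per cell, one column per class (same order as `classes`).
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
    [0.5, 0.1, 0.4],  # misclassified: the true label is "T cell"
])
y_pred = [classes[i] for i in probs.argmax(axis=1)]

acc = accuracy_score(y_true, y_pred)            # fraction correct, higher is better
loss = log_loss(y_true, probs, labels=classes)  # uses probabilities, lower is better
```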

Run.General_analyses.Run_Flat_KF(classifier_, n_folds, data, labels, parameters, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run flat classification with K-fold cross validation and a dense data matrix.

Parameters:

classifier_ – Scikit-learn classifier
n_folds (int) – Number of folds, must be at least 2
data (dense matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters, by default False
F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters, by default False
save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False
metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

AllPredictedValues (list of lists)
AllProbabilities (list of matrices)
AllActualValues (list of lists)
AccuraciesFolds (list)
Bestparams (list)
Classifiers (list of classifiers (optional))
Xtests (list of matrices (optional))
ytests (list of matrices (optional))
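The per-fold structure of the return values listed above (one list of predictions, one list of true labels and one accuracy per fold) can be illustrated with a plain scikit-learn loop. This is a sketch under assumptions: the use of StratifiedKFold, the fixed classifier and the absence of hyperparameter tuning are simplifications, not a description of how the function above is implemented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
# Toy dense data: 3 cell types, 30 cells each, 4 features.
data = np.vstack([rng.normal(loc=c * 3, size=(30, 4)) for c in range(3)])
labels = np.repeat(["A", "B", "C"], 30)

n_folds = 5
all_predicted, all_actual, accuracies_folds = [], [], []
splitter = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(data, labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(data[train_idx], labels[train_idx])
    predicted = clf.predict(data[test_idx])
    # Collect results fold by fold, giving lists of lists.
    all_predicted.append(list(predicted))
    all_actual.append(list(labels[test_idx]))
    accuracies_folds.append(accuracy_score(labels[test_idx], predicted))
```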

Run.General_analyses.Run_Flat_KF_sparse_splitted(classifier_, n_folds, data, labels, parameters, fold, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run flat classification for a single fold of K-fold cross validation on a sparse data matrix.

Parameters:

classifier_ – Scikit-learn classifier
n_folds (int) – Number of folds, must be at least 2
data (sparse matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
fold (int) – Specific fold that is currently considered
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters, by default False
F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters, by default False
save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False
metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

predicted (list)
prob (matrix)
y_test (list)
accuracy_fold (float)
Bestparams (list)
acc (list)
Final_classifier (trained classifier (optional))

Run.General_analyses.Run_Flat_KF_splitted(classifier_, n_folds, data, labels, parameters, fold, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run flat classification for a single fold of K-fold cross validation on a dense data matrix.

Parameters:

classifier_ – Scikit-learn classifier
n_folds (int) – Number of folds, must be at least 2
data (dense matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
fold (int) – Specific fold that is currently considered
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters, by default False
F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters, by default False
save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False
metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

predicted (list)
prob (matrix)
X_test (matrix)
y_test (list)
accuracy_fold (float)
Bestparams (list)
acc (list)
Final_classifier (trained classifier (optional))
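The "splitted" variants process one fold at a time, which makes it possible to run folds as separate jobs. For this to work, every job has to reproduce the same partition and pick its own fold from it. The sketch below shows one way to do that with scikit-learn's KFold; the helper name split_for_fold, the fixed seed and the 0-based fold index are assumptions made for illustration and may differ from the conventions of the functions above.

```python
import numpy as np
from sklearn.model_selection import KFold

def split_for_fold(n_samples, n_folds, fold, seed=0):
    """Return (train_idx, test_idx) for one fold; `fold` is 0-based here."""
    # A fixed seed makes every call produce the same partition.
    splitter = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    splits = list(splitter.split(np.arange(n_samples)))
    return splits[fold]

train_idx, test_idx = split_for_fold(n_samples=20, n_folds=5, fold=2)

# Calling the helper once per fold covers every sample exactly once.
all_test = np.concatenate(
    [split_for_fold(20, 5, f)[1] for f in range(5)]
)
```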

Hierarchical analyses

Run.General_analyses.Run_H_KF(classifier_, n_folds, data, labels, parameters, n_jobsHCL, reject_thresh, greedy_=True, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run hierarchical classification with K-fold cross validation and a dense data matrix.

Parameters:

classifier_ – Scikit-learn classifier
n_folds (int) – Number of folds, must be at least 2
data (dense matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification
reject_thresh (float or None (str)) – If not None, annotation stops when the probability of a label drops below the given value
greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default True
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters, by default False
F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters, by default False
save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False
metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

AllPredictedValues (list of lists)
AllProbabilities (list of matrices)
AllActualValues (list of lists)
AccuraciesFolds (list)
Bestparams (list)
Classifiers (list of classifiers (optional))
Xtests (list of matrices (optional))
ytests (list of matrices (optional))
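The effect of reject_thresh combined with greedy classification can be pictured as a walk down the label hierarchy that stops early when confidence becomes too low. The sketch below is a hypothetical illustration only: the hierarchy, the function annotate_greedy, and the choice to compare the product of conditional probabilities along the path against the threshold are all assumptions, not the implementation behind the function above.

```python
# Hypothetical two-level hierarchy: parent -> children.
hierarchy = {
    "root": ["Lymphoid", "Myeloid"],
    "Lymphoid": ["T cell", "B cell"],
    "Myeloid": ["Monocyte", "Dendritic cell"],
}

def annotate_greedy(child_probs, reject_thresh=None):
    """Walk down from the root, always taking the most probable child.

    child_probs maps each parent to {child: P(child | parent)}.
    The walk stops early when the probability of the next label
    (product of conditionals along the path) drops below reject_thresh.
    """
    node, path_prob = "root", 1.0
    while node in hierarchy:
        best = max(hierarchy[node], key=lambda c: child_probs[node][c])
        next_prob = path_prob * child_probs[node][best]
        if reject_thresh is not None and next_prob < reject_thresh:
            break  # keep the last label that was confident enough
        node, path_prob = best, next_prob
    return node, path_prob

probs = {
    "root": {"Lymphoid": 0.9, "Myeloid": 0.1},
    "Lymphoid": {"T cell": 0.55, "B cell": 0.45},
    "Myeloid": {"Monocyte": 0.5, "Dendritic cell": 0.5},
}
full_label, full_prob = annotate_greedy(probs)                       # reaches a leaf
early_label, early_prob = annotate_greedy(probs, reject_thresh=0.6)  # stops at an internal node
```

Without a threshold the walk ends at the leaf "T cell"; with a threshold of 0.6 it stops at "Lymphoid", because going one level deeper would bring the path probability below the threshold.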

Run.General_analyses.Run_H_KF_sparse(classifier_, n_folds, data, labels, parameters, n_jobsHCL, reject_thresh, greedy_=True, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run hierarchical classification with K-fold cross validation and a sparse data matrix.

Parameters:

classifier_ – Scikit-learn classifier
n_folds (int) – Number of folds, must be at least 2
data (sparse matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification
reject_thresh (float or None (str)) – If not None, annotation stops when the probability of a label drops below the given value
greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default True
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters, by default False
F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters, by default False
save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False
metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

AllPredictedValues (list of lists)
AllProbabilities (list of matrices)
AllActualValues (list of lists)
AccuraciesFolds (list)
Bestparams (list)
Classifiers (list of classifiers (optional))
Xtests (list of matrices (optional))
ytests (list of matrices (optional))

Run.General_analyses.Run_H_KF_splitted(classifier_, n_folds, data, labels, parameters, n_jobsHCL, reject_thresh, fold, greedy_=True, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run hierarchical classification for a single fold of K-fold cross validation on a dense data matrix.

Parameters:

classifier_ – Scikit-learn classifier
n_folds (int) – Number of folds, must be at least 2
data (dense matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
fold (int) – Specific fold that is currently considered
n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification
reject_thresh (float or None (str)) – If not None, annotation stops when the probability of a label drops below the given value
greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default True
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters, by default False
F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters, by default False
save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False
metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

predicted (list)
y_test (list)
accuracy_fold (float)
Bestparam (float)
acc (list)
Final_Classifiers (classifiers (optional))
Xtest (matrix (optional))

Run.General_analyses.Run_H_KF_sparse_splitted(classifier_, n_folds, data, labels, parameters, n_jobsHCL, reject_thresh, fold, greedy_=True, Norm=True, HVG=False, F_test=False, save_clf=False, metric='accuracy_score')

Function to run hierarchical classification for a single fold of K-fold cross validation on a sparse data matrix.

Parameters:

classifier_ – Scikit-learn classifier
n_folds (int) – Number of folds, must be at least 2
data (sparse matrix) – Data matrix
labels (list) – Cell type labels
parameters (dict of str to sequence, or sequence of such) – Hyperparameters to be evaluated (for more information on the correct input format of this parameter check ParameterGrid by scikit-learn)
fold (int) – Specific fold that is currently considered
n_jobsHCL (int) – Number of CPU cores used for parallelization of the hierarchical classification
reject_thresh (float or None (str)) – If not None, annotation stops when the probability of a label drops below the given value
greedy_ (bool, optional) – Perform greedy (True) or non-greedy (False) hierarchical classification, by default True
Norm (bool, optional) – Perform log(1+x) normalisation before running the analysis, by default True
HVG (bool, optional) – If True, perform highly variable feature selection; the number of selected features (‘top_genes’) is a hyperparameter that should be incorporated in parameters, by default False
F_test (bool, optional) – If True, perform feature selection based on the F-test; the number of selected features (‘n_features’) is a hyperparameter that should be incorporated in parameters, by default False
save_clf (bool, optional) – If True, outputs the trained classifiers per fold together with the data and label test splits, by default False
metric (str, optional) – Use “accuracy_score” or “log_loss” for test and train evaluation, by default “accuracy_score”

Returns:

predicted (list)
y_test (list)
accuracy_fold (float)
Bestparam (float)
acc (list)
Final_Classifiers (classifiers (optional))
Xtest (matrix (optional))

To save the results

Run.General_analyses.SaveResultsKF(PredictedValues, ActualValues, AccuraciesFolds, AllAccuracies, overall_best_params, directory, namespecific)

Function to save K-Fold cross-validation results.

Parameters:

PredictedValues (list of lists) –
ActualValues (list of lists) –
AccuraciesFolds (list) –
AllAccuracies –
overall_best_params (list) –
directory (str) – Directory where you want to save it
namespecific (str) – Name of the analysis

Returns:

namespecific_Other.csv – Contains accuracies and best parameters per fold
namespecific_ActualValueslist.csv
namespecific_Preslist.csv
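The three output files listed above share the namespecific prefix and are written into directory. The sketch below mimics that naming scheme with the standard library only. The function save_results_sketch is hypothetical, and the column layout inside each file is an assumption; only the file names are taken from this page.

```python
import csv
import os
import tempfile

def save_results_sketch(predicted, actual, accuracies, best_params,
                        directory, namespecific):
    """Write three CSV files named after `namespecific` into `directory`."""
    def write_rows(suffix, rows):
        path = os.path.join(directory, f"{namespecific}_{suffix}.csv")
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        return path

    # Assumed layout: one row per fold with its accuracy and best parameters.
    other = [["fold", "accuracy", "best_params"]]
    other += [[i, a, p] for i, (a, p) in enumerate(zip(accuracies, best_params))]
    return [
        write_rows("Other", other),
        write_rows("ActualValueslist", actual),   # one row per fold
        write_rows("Preslist", predicted),        # one row per fold
    ]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_results_sketch(
        predicted=[["T cell", "B cell"], ["B cell", "B cell"]],
        actual=[["T cell", "B cell"], ["T cell", "B cell"]],
        accuracies=[1.0, 0.5],
        best_params=[{"C": 1}, {"C": 10}],
        directory=tmp,
        namespecific="demo",
    )
    written = sorted(os.path.basename(p) for p in paths)
```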