Tutorial

This guide shows how to use the functions in this repository. It assumes that the notebook is run inside the Hierarchical_reject folder of the repository. If you download the notebook and run it outside of this folder, you will have to adapt the paths in the import statements.

For the following analyses, we assume that the AMB dataset has been downloaded from DOI and that the paths to the data and label files are specified below.

[ ]:
AMBPath = ...
LabelsAMBPath = ...

Preprocessing

The data can be preprocessed with the dataset-specific preprocessing functions.

[ ]:
from Preprocessing.Preprocessing_AMB import Preprocessing_AMB
Data, Labels = Preprocessing_AMB(AMBPath, LabelsAMBPath)

Flat annotation and evaluation

The AMB data is not loaded in a sparse format during preprocessing (this is only the case for the Azimuth PBMC dataset), so the non-sparse functions can be used.
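
If you are unsure which format a preprocessed matrix is in, a quick check along the following lines may help. This is only a sketch on toy matrices: `is_sparse_input` is a hypothetical helper that is not part of the repository, and it relies solely on `scipy.sparse.issparse`.

```python
import numpy as np
from scipy import sparse

# Toy matrices for illustration only (not the AMB data)
dense = np.array([[0.0, 2.0], [1.0, 0.0]])
as_sparse = sparse.csr_matrix(dense)

def is_sparse_input(X):
    """Hypothetical helper: True if X is a SciPy sparse matrix or array."""
    return sparse.issparse(X)

print(is_sparse_input(dense))
print(is_sparse_input(as_sparse))
```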

In the following block of code, we show how to perform flat annotation with 5-fold cross-validation and HVG selection, using the logistic regression classifier of scikit-learn. The regularization strength parameter of the logistic regression classifier (‘C’) is tuned as well.

[ ]:
from Run.General_analyses import Run_Flat_KF, SaveResultsKF
from sklearn.linear_model import LogisticRegression

## Define the classifier and parameters
clf = LogisticRegression(penalty = 'l2', multi_class = 'multinomial', n_jobs = 1)
params = {'C': [1, 100, 10000], 'top_genes': [10000, 30000, 50000]}

## Run the analyses
Predictions, Probs, Act, Acc, Bestparams, Classifiers, Xtest, ytests = Run_Flat_KF(clf, 5, Data, Labels, params, Norm = True, HVG = True, save_clf = True)

## (Optional) save the results
dir_ = ...
name = ...

SaveResultsKF(Predictions, Act, Acc, Bestparams, dir_, name)

Based on these results, the accuracy score or other metrics can be calculated.
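
As a rough illustration, per-fold metrics could be computed with scikit-learn as sketched below. The sketch assumes that the predicted and actual labels are available as one sequence per fold; the exact structure of `Predictions` and `Act` is not described here, so toy labels are used instead, and the variable names are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical per-fold true and predicted labels (toy values, not AMB results)
true_per_fold = [["GABAergic", "Glutamatergic", "GABAergic"],
                 ["Glutamatergic", "Non-Neuronal", "GABAergic"]]
pred_per_fold = [["GABAergic", "Glutamatergic", "Glutamatergic"],
                 ["Glutamatergic", "Non-Neuronal", "GABAergic"]]

# Accuracy and macro-averaged F1 score for each fold
fold_acc = [accuracy_score(t, p) for t, p in zip(true_per_fold, pred_per_fold)]
fold_f1 = [f1_score(t, p, average="macro") for t, p in zip(true_per_fold, pred_per_fold)]

# Average the per-fold scores
mean_acc = sum(fold_acc) / len(fold_acc)
mean_f1 = sum(fold_f1) / len(fold_f1)
print(mean_acc, mean_f1)
```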

To construct accuracy-rejection curves, the Evaluate_AR_Flat function can be used.

Note that the AMB label hierarchy is balanced (all cell type labels have three levels), unlike all the other datasets in the repository DOI.

[ ]:
from Evaluation.Functions_Accuracy_Reject import Evaluate_AR_Flat

results = Evaluate_AR_Flat(Classifiers, Xtest, ytests, Predictions, Probs, b = True, scores = False)

[ ]:
import matplotlib.pyplot as plt

# accuracy rejection curves
plt.plot(results['steps'], results['acc'])

# rejection percentage curves
plt.plot(results['steps'], results['perc'])
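
To make the idea behind such curves concrete, here is a minimal sketch of an accuracy-rejection curve for flat predictions. It is not the repository's `Evaluate_AR_Flat`: the function `accuracy_rejection_curve` is hypothetical, the confidence is assumed to be the probability of the predicted label, and the dictionary keys only mirror the ones plotted above (whether the repository reports `perc` as a percentage is an assumption).

```python
import numpy as np

def accuracy_rejection_curve(y_true, y_pred, confidence, thresholds):
    """Hypothetical sketch: reject predictions whose confidence is below each threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    confidence = np.asarray(confidence, dtype=float)
    acc, perc = [], []
    for t in thresholds:
        kept = confidence >= t                    # predictions that are not rejected
        perc.append(100.0 * (1.0 - kept.mean()))  # percentage of rejected predictions
        if kept.any():
            acc.append(float((y_true[kept] == y_pred[kept]).mean()))
        else:
            acc.append(float("nan"))              # accuracy is undefined if all are rejected
    return {"steps": list(thresholds), "acc": acc, "perc": perc}

# Toy example (not AMB results)
y_true = ["A", "B", "A", "C"]
y_pred = ["A", "B", "C", "C"]
confidence = [0.9, 0.8, 0.4, 0.6]
toy = accuracy_rejection_curve(y_true, y_pred, confidence, [0.0, 0.5, 0.95])
print(toy)
```

Raising the threshold removes the least confident predictions first, so accuracy on the retained cells typically increases while the rejection percentage grows.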

Hierarchical annotation and evaluation

The same set-up as above is illustrated here, only with hierarchical annotation instead of flat annotation.

[ ]:
from Run.General_analyses import Run_H_KF, SaveResultsKF
from Evaluation.Functions_Accuracy_Reject import Evaluate_AR
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

## Define the classifier and parameters
clf = LogisticRegression(penalty = 'l2', multi_class = 'multinomial', n_jobs = 1)
params = {'C': [1, 100, 10000], 'top_genes': [10000, 30000, 50000]}

## Run the analyses
Predictions, Probs, Act, Acc, Bestparams, Classifiers, Xtests, ytests = Run_H_KF(clf, 5, Data, Labels, params, 1, None, greedy_ = False, Norm = True, HVG = True, save_clf = True)
# Note: be careful with the number of cores, as up to n_jobs (classifier) * n_jobsHCL cores can be used
# If you don't want to make accuracy-rejection curves, but just want to perform partial rejection directly, modify reject_thresh.
# Full rejection can easily be applied afterwards through simple thresholding, based on the entire label

## (Optional) save the results
dir_ = ...
name = ...

SaveResultsKF(Predictions, Act, Acc, Bestparams, dir_, name)

results = Evaluate_AR(Classifiers, Xtests, ytests, Predictions, greedy = False)

# accuracy rejection curves
plt.plot(results['steps'], results['acc'])

# rejection percentage curves
plt.plot(results['steps'], results['perc'])
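
The difference between partial and full rejection mentioned in the comments above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: `partial_reject` and `full_reject` are hypothetical helpers, each level of the label path is assumed to come with its own probability, and the confidence of the entire label is assumed to be the product of these per-level probabilities. The labels and probabilities are toy values.

```python
def partial_reject(path, probs, threshold):
    """Keep the longest prefix of the label path whose per-level probabilities reach the threshold."""
    kept = []
    for label, p in zip(path, probs):
        if p < threshold:
            break  # stop descending the hierarchy at the first uncertain level
        kept.append(label)
    return kept

def full_reject(path, probs, threshold):
    """Return the whole path if the probability of the entire label reaches the threshold, else nothing."""
    total = 1.0
    for p in probs:
        total *= p  # assumed: entire-label confidence is the product over levels
    return list(path) if total >= threshold else []

# Toy three-level label with hypothetical per-level probabilities
path = ["Neuron", "GABAergic", "Pvalb"]
probs = [0.99, 0.9, 0.55]

print(partial_reject(path, probs, 0.7))
print(full_reject(path, probs, 0.7))
```

With partial rejection a cell can still receive a label at an intermediate level of the hierarchy, whereas full rejection either keeps or discards the complete label.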