The main page to manage your methods and upload your results is My Results. The workflow for uploading estimated causal networks on the CauseMe platform is outlined below.
Below, we provide code in different programming languages illustrating the method module and the script that iterates over the datasets of an experiment.
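As a minimal sketch of that structure (not the platform's exact interface), the following Python example shows a method module returning score, p-value, and lag matrices, plus a script looping over datasets. The function name my_method, the toy lag-1 cross-correlation statistic, and the file layout experiment/dataset_<b>.txt are assumptions for illustration only.

```python
import math
import numpy as np

def my_method(data):
    """Hypothetical method module: takes a (T, N) dataset and returns
    (N, N) score, p-value, and lag matrices. The lag-1 cross-correlation
    used here is a toy statistic that only illustrates the output format."""
    T, N = data.shape
    scores = np.zeros((N, N))
    pvalues = np.ones((N, N))
    lags = np.zeros((N, N), dtype=int)
    for i in range(N):          # candidate cause
        for j in range(N):      # candidate effect
            if i == j:
                continue        # self-links are not scored
            r = np.corrcoef(data[:-1, i], data[1:, j])[0, 1]
            scores[i, j] = abs(r)
            # Crude two-sided p-value from a normal approximation
            # (for illustration only, not a calibrated test)
            z = abs(r) * math.sqrt(T - 1)
            pvalues[i, j] = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
            lags[i, j] = 1      # the toy statistic always predicts lag 1
    return scores, pvalues, lags

# Script iterating over the B datasets of one experiment
# (the file layout "experiment/dataset_<b>.txt" is an assumption):
all_scores, all_pvalues, all_lags = [], [], []
for b in range(200):
    data = np.loadtxt(f"experiment/dataset_{b:04d}.txt")  # shape (T, N)
    s, p, l = my_method(data)
    all_scores.append(s)
    all_pvalues.append(p)
    all_lags.append(l)
```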
Area Under the Receiver Operating Characteristic Curve (ROC AUC). This metric is based on the score matrices (non-negative real numbers) uploaded by users. A high score in entry (i, j) indicates high confidence in a link i --> j. The true label of a link is binary: 0 (absent) or 1 (causal link). The ROC curve characterizes the ability of a binary classifier (here, your scores) as its discrimination threshold is varied. The AUC is computed across the entries of all N x N score matrices (excluding self-links) of the B different datasets in an experiment (in most experiments, B = 200) and provides a summary metric, here computed using the trapezoidal rule. Higher AUC values (the maximum is 1) indicate better performance.
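For concreteness, the pooled AUC could be computed as in the following Python sketch, assuming lists of N x N score matrices and binary ground-truth matrices (scikit-learn's roc_auc_score also uses the trapezoidal rule):

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # trapezoidal-rule AUC

def experiment_auc(score_matrices, truth_matrices):
    """Pool all off-diagonal entries across the B datasets and compute one AUC."""
    y_score, y_true = [], []
    for S, G in zip(score_matrices, truth_matrices):
        N = S.shape[0]
        mask = ~np.eye(N, dtype=bool)   # exclude self-links i -> i
        y_score.append(S[mask])
        y_true.append(G[mask])
    return roc_auc_score(np.concatenate(y_true), np.concatenate(y_score))
```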
False Positive Rate and True Positive Rate. These metrics are based on the p-value matrices thresholded at a 5% significance level. The FPR is the number of falsely predicted links divided by the number of absent links among all N*(N-1)*B link entries of an experiment. The TPR is the number of correctly identified links divided by the number of true links. These metrics are only computed if a p-value matrix is available. The FPR should be at 0.05 (well calibrated) or below (over-conservative), while a higher TPR indicates better performance.
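A minimal Python sketch of this computation, under the same assumptions as above (lists of p-value and binary ground-truth matrices):

```python
import numpy as np

def fpr_tpr(pvalue_matrices, truth_matrices, alpha=0.05):
    """FPR and TPR pooled over all N*(N-1)*B off-diagonal link entries."""
    fp = tp = absent = present = 0
    for P, G in zip(pvalue_matrices, truth_matrices):
        N = P.shape[0]
        mask = ~np.eye(N, dtype=bool)   # exclude self-links
        pred = P[mask] <= alpha         # link predicted if p <= 5%
        true = G[mask].astype(bool)
        fp += np.sum(pred & ~true)
        tp += np.sum(pred & true)
        absent += np.sum(~true)
        present += np.sum(true)
    return fp / absent, tp / present
```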
The F-measure (or F-beta score) is likewise based on the p-value matrices thresholded at a 5% significance level. The F-measure is the weighted harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. Here we set beta = 0.5, which puts more weight on precision (by attenuating the influence of false negatives).
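In terms of pooled counts of true positives (tp), false positives (fp), and false negatives (fn), the score is (1 + beta^2) * precision * recall / (beta^2 * precision + recall), as in this sketch:

```python
def fbeta(tp, fp, fn, beta=0.5):
    """F-beta from pooled counts; beta=0.5 weights precision more than recall.
    Assumes at least one predicted link and one true link (no zero divisions)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```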
True Lag Rate. Like the TPR, this metric is based on the p-value matrix thresholded at a 5% significance level and, additionally, on the predicted lag matrix. The TLR is the fraction of correctly identified lags among the correctly predicted links. This metric is only computed if both a p-value and a lag matrix are available. The TLR ranges from 0 (no lag correctly predicted) to 1 (all lags correctly predicted). Note, however, that a very high rate may not mean much if the TPR is very low.
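Under the same assumptions as the sketches above, with additional lists of predicted and true lag matrices, the TLR could be computed as follows:

```python
import numpy as np

def true_lag_rate(pvalue_matrices, lag_matrices, truth_matrices, true_lags,
                  alpha=0.05):
    """Fraction of correctly predicted links whose predicted lag matches
    the true lag, pooled over the B datasets of an experiment."""
    correct_links = correct_lags = 0
    for P, L, G, Lt in zip(pvalue_matrices, lag_matrices,
                           truth_matrices, true_lags):
        N = P.shape[0]
        mask = ~np.eye(N, dtype=bool)
        hit = (P[mask] <= alpha) & G[mask].astype(bool)  # correctly predicted
        correct_links += np.sum(hit)
        correct_lags += np.sum(hit & (L[mask] == Lt[mask]))
    return correct_lags / correct_links if correct_links else np.nan
```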
If users upload code alongside a method registration, we can, in principle, validate whether the code produces the uploaded results. We plan to run such validations in the future, which will also enable us to provide runtime estimates on the same computing machine. Currently, only results uploaded by us are marked as validated.
For selected experiments, we are also able to compute FPR and TPR for individual links / non-links. These are the experiments where multiple realizations of different ground truth models exist. This setup allows us to evaluate how a method performs not only on average, but also on individual links; a method may, for example, be sensitive to the local coupling topology of a link. If the lower quartile of the boxplot for TPR is at 40%, this means that the method predicted more than 25% of the links for this experiment with a TPR below 40%. A sketch of such a per-link computation is given below.
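As a hedged illustration, assuming an (R, N, N) stack of p-value matrices over R realizations of one ground truth model, the per-link detection rates could be computed like this:

```python
import numpy as np

def per_link_tpr(pvalue_stack, truth, alpha=0.05):
    """pvalue_stack: (R, N, N) p-values over R realizations of one ground
    truth model; truth: (N, N) binary adjacency. Returns the detection
    rate (per-link TPR) of each individual true link."""
    pred = pvalue_stack <= alpha                       # (R, N, N) detections
    true_links = truth.astype(bool) & ~np.eye(truth.shape[0], dtype=bool)
    return pred[:, true_links].mean(axis=0)            # rate per true link

# For example, if np.percentile(rates, 25) is 0.4, more than 25% of the
# true links are recovered with a TPR below 40%.
```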