HowTo


The main page to manage your methods and upload your results is My Results. The workflow to upload estimated causal networks on the CauseMe platform is as follows:

  1. Register your method by providing its name, parameters, a description, optionally a URL of a paper and/or code, and further fields. Ideally, you upload the code of your method (you do not need to include publicly available packages that can be imported). Uploading code enables us to validate methods and estimate inter-comparable runtimes. The function module of your method must follow a specific format, as shown in the script "causeme_my_method". You may also include a requirements.txt in the zip file to specify packages and versions.
  2. You will receive a method hash as an identifier of your method in Your Methods, where you can also update the information of your methods. However, you cannot change uploaded code (since we might have already used it to validate results); instead, you would need to register a new method.
  3. You apply your method to datasets downloaded from Data and Models using the provided script "causeme_example", which calls your method module and writes the results into a JSON dictionary. This dictionary must also contain the hash of the method and is uploaded on My Results. There are three major fields: scores, pvalues, and lags. Each of these is a list (over the several hundred datasets of an experiment) of flattened matrices. Important note: the order of the matrices in the results list must match the order of the datasets in the downloaded zip file.
    • scores (required): Each list entry contains a flattened estimated causation matrix. For example, if a dataset contains N=20 time series, then the causation matrix is of size 20x20. The A_ij element of this matrix, corresponding to the i-th row and j-th column, indicates the score of a causal link from i to j, which is a non-negative real number that can be a probability estimate, a confidence value, or a binary decision. Higher values indicate more confidence in a link. If the absence of a link is estimated with maximum certainty, then A_ij = 0; real values between 0 and 1 indicate probabilities in between. The matrix must be stored flattened in row-major (C-style) order. The scores are evaluated based on metrics suitable for probabilistic predictions, see Metrics.
    • pvalues (optional): If available, you can in addition store the estimated p-values in the same format as above. The p-values are thresholded at a 5% significance level and the binary predictions are then evaluated based on a number of metrics, see Metrics.
    • lags (optional): If available, you can further store the estimated causal time lags in the same format as above, except that now A_ij = lag, where lag is a positive integer indicating the time lag in units of the data step size. If you predict a zero time lag (instantaneous causation), store it as A_ij = 0. Only time lags A_ij whose corresponding entries are marked as causal links in the causation matrix will be taken into account.
    In the dictionary you also need to provide the fields "model", "experiment", "method_sha", and "parameter_values", which identify which method and which parameters you used to generate the results. This information allows us to validate results; a minimal sketch of such a results dictionary is shown after this list. You can upload multiple results files at once, but we restrict the number of uploads per user.
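    As a rough illustration, the following Python sketch assembles such a results dictionary and writes it to a JSON file. All concrete values (method hash, model and experiment names, parameter string) are placeholders, and the provided "causeme_example" script defines the exact format expected by the platform.

```python
import json
import numpy as np

# Placeholder: list of estimated (N, N) score matrices, one per dataset,
# e.g. 200 datasets with N=20 variables each.
score_matrices = [np.random.rand(20, 20) for _ in range(200)]

results = {
    # Fields identifying method and data (all values here are placeholders).
    "method_sha": "0123456789abcdef",   # hash obtained when registering the method
    "parameter_values": "maxlags=5",    # parameters used for this run
    "model": "some-model-name",
    "experiment": "some-experiment-name",
    # Each entry is a flattened (row-major, C-style) N*N matrix,
    # in the same order as the datasets in the downloaded zip file.
    "scores": [m.flatten(order="C").tolist() for m in score_matrices],
    # Optional, only if your method provides them:
    # "pvalues": [...],
    # "lags": [...],
}

with open("results.json", "w") as f:
    json.dump(results, f)
```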

Example code


Below, we provide code in different programming languages illustrating the method module and a script to iterate over datasets.
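For orientation, here is a minimal Python sketch of what a method module might look like. The function name, signature, and return values (a score matrix and optional p-value and lag matrices computed from data of shape (T, N)) are assumptions for illustration; the provided "causeme_my_method" script defines the exact required format.

```python
import numpy as np

def my_method(data, maxlags=1):
    """Toy method module (illustration only).

    Assumes `data` is a numpy array of shape (T, N) holding the time series
    and returns a score matrix and, optionally, p-value and lag matrices of
    shape (N, N). See the provided "causeme_my_method" script for the exact
    required format.
    """
    T, N = data.shape
    scores = np.zeros((N, N))
    lags = np.zeros((N, N), dtype=int)

    # Score each link i -> j by its maximum absolute lagged cross-correlation.
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            for lag in range(1, maxlags + 1):
                corr = abs(np.corrcoef(data[:-lag, i], data[lag:, j])[0, 1])
                if corr > scores[i, j]:
                    scores[i, j] = corr
                    lags[i, j] = lag

    # This toy method does not provide p-values.
    return scores, None, lags
```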


Evaluation metrics

AUC

Area Under the Receiver Operating Characteristic Curve (ROC AUC). This metric is based on the score matrices (non-negative real numbers) uploaded by users. A high score entry (i,j) indicates high confidence in a link i --> j. The true label of a link is binary: 0 (missing) or 1 (causal link). The ROC curve measures the ability of a binary classifier system (your score) as its discrimination threshold is varied. The AUC is computed across all (N, N) matrices (excluding self-links) of the B different datasets in an experiment (in most experiments, B=200) and provides a summary metric, here computed using the trapezoidal rule. High AUC values (maximum is 1) indicate better performance.
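The sketch below illustrates how such a pooled AUC could be computed with scikit-learn's roc_auc_score (which uses the trapezoidal rule); the function name pooled_auc is hypothetical and the platform's exact implementation may differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pooled_auc(true_graphs, score_matrices):
    """Illustrative pooled ROC AUC over all off-diagonal links of all datasets.

    `true_graphs` and `score_matrices` are lists of (N, N) arrays; self-links
    (the diagonal) are excluded, as described above. Sketch only, not the
    platform's exact implementation.
    """
    labels, scores = [], []
    for truth, score in zip(true_graphs, score_matrices):
        mask = ~np.eye(truth.shape[0], dtype=bool)   # exclude self-links
        labels.append((truth[mask] != 0).astype(int))
        scores.append(score[mask])
    return roc_auc_score(np.concatenate(labels), np.concatenate(scores))
```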

FPR and TPR

False Positive Rate and True Positive Rate. These metrics are based on the p-value matrices thresholded at a 5% significance level. The FPR is the number of falsely predicted links divided by the number of absent links among all N*(N-1)*B links of an experiment. The TPR is the number of correctly identified links divided by the number of true links. These metrics are only computed if a p-value matrix is available. The FPR should be at 0.05 (well-calibrated) or below (over-conservative), while a higher TPR indicates better performance.
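A minimal sketch of these definitions, assuming lists of (N, N) ground-truth and p-value matrices (the function name fpr_tpr is hypothetical):

```python
import numpy as np

def fpr_tpr(true_graphs, pvalue_matrices, alpha=0.05):
    """Illustrative FPR and TPR from p-value matrices thresholded at `alpha`.

    Counts are pooled over all N*(N-1)*B off-diagonal links of an experiment.
    Sketch only, not the platform's exact implementation.
    """
    fp = tp = absent = present = 0
    for truth, pvals in zip(true_graphs, pvalue_matrices):
        mask = ~np.eye(truth.shape[0], dtype=bool)   # exclude self-links
        predicted = pvals[mask] <= alpha
        actual = truth[mask] != 0
        fp += np.sum(predicted & ~actual)
        tp += np.sum(predicted & actual)
        absent += np.sum(~actual)
        present += np.sum(actual)
    return fp / absent, tp / present
```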

F-measure

The F-measure (or F-beta score) metric is also based on the p-value matrices thresholded at a 5% significance level. The F-measure is the harmonic average of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. Here we set beta=0.5, which puts more weight on precision (by attenuating the influence of false negatives).
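For illustration, such a score can be computed with scikit-learn's fbeta_score on pooled binary labels and predictions (the arrays below are placeholders; in practice the predictions would come from thresholding the p-values at 0.05, as in the FPR/TPR sketch above):

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Placeholder pooled binary labels and predictions over all off-diagonal links.
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predictions = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# beta=0.5 puts more weight on precision than on recall.
print(fbeta_score(labels, predictions, beta=0.5))
```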

TLR

True Lag Rate. Like the TPR, this metric is based on the p-value matrix thresholded at a 5% significance level and, in addition, on the predicted lag matrix. The TLR is the fraction of correctly identified lags among the correctly predicted links. This metric is only computed if both a p-value and a lag matrix are available. The TLR ranges between 0 (no lag correctly predicted) and 1 (all lags correctly predicted). Note, however, that a very high rate may not mean much if the TPR is very low.
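A minimal sketch of this definition, assuming lists of ground-truth graphs and lags alongside the predicted p-value and lag matrices (the function name tlr is hypothetical):

```python
import numpy as np

def tlr(true_graphs, true_lags, pvalue_matrices, lag_matrices, alpha=0.05):
    """Illustrative True Lag Rate: fraction of correctly estimated lags among
    the correctly predicted (true positive) links. Sketch only."""
    correct_lags = true_positives = 0
    for truth, tlags, pvals, plags in zip(true_graphs, true_lags,
                                          pvalue_matrices, lag_matrices):
        mask = ~np.eye(truth.shape[0], dtype=bool)            # exclude self-links
        tp = (pvals[mask] <= alpha) & (truth[mask] != 0)       # true positive links
        true_positives += np.sum(tp)
        correct_lags += np.sum(tp & (plags[mask] == tlags[mask]))
    return correct_lags / true_positives if true_positives else float("nan")
```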

Validation and runtime

If users upload code alongside a method registration, then we can, in principle, validate whether the code produces the uploaded results. We plan to run such validations in the future, which will also enable us to provide runtime estimates on the same computing machine. Currently, only results uploaded by us are marked as validated.

Boxplots of FPR and TPR

For selected experiments, we are able to compute FPR and TPR also for individual links / non-links. These are the experiments where multiple realizations of different ground-truth models exist. This setup allows us to evaluate how a method performs not only on average, but for individual links; for example, a method may be sensitive to the local coupling topology of a link. If the lower quartile of the boxplot for TPR is at 40%, this means that 25% of the links in this experiment were predicted with a TPR of 40% or below.