Overview

Thank you for submitting to the WILDS leaderboards.

We welcome submissions of new algorithms and/or models, and we encourage contributors to test their new methods on as many datasets as applicable. This is valuable even if (or especially if) your method performs well on some datasets but not others.

We also welcome re-implementations of existing methods. On the leaderboards, we distinguish between official submissions (made by the authors of a method) and unofficial submissions (re-implementations by other contributors). Unofficial submissions are equally valuable, especially if the re-implementations achieve better performance than the original implementations because of better tuning or simple tweaks.

All submissions must use the dataset classes and evaluators in the WILDS package. In addition, they must report results on multiple replicates: 5 random seeds for CivilComments; 10 random seeds for Camelyon17; 5 folds for PovertyMap; and 3 random seeds for all other datasets.
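
For reference, below is a minimal sketch of loading a dataset and scoring predictions with the official evaluator through the WILDS package; the dataset choice, transform, and untrained placeholder model are illustrative assumptions, not part of the submission requirements.

# Minimal sketch: load a WILDS dataset, run a model over the OOD test split,
# and score the predictions with the official evaluator.
# The dataset choice, transform, and placeholder model are assumptions.
import torch
import torchvision
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_eval_loader

dataset = get_dataset(dataset="iwildcam", download=True)
test_data = dataset.get_subset(
    "test",
    transform=transforms.Compose([transforms.Resize((448, 448)), transforms.ToTensor()]),
)
test_loader = get_eval_loader("standard", test_data, batch_size=16)

model = torchvision.models.resnet50(num_classes=dataset.n_classes)  # placeholder; use your trained model
model.eval()

all_y_pred, all_y_true, all_metadata = [], [], []
with torch.no_grad():
    for x, y_true, metadata in test_loader:
        all_y_pred.append(model(x).argmax(dim=-1))
        all_y_true.append(y_true)
        all_metadata.append(metadata)

# The evaluator returns a dict of metrics and a printable summary string.
results, results_str = dataset.eval(
    torch.cat(all_y_pred), torch.cat(all_y_true), torch.cat(all_metadata)
)
print(results_str)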

Submissions fall into two categories: standard submissions and non-standard submissions.

Standard submissions

Standard submissions must follow these guidelines:

  1. Results must be reported on at least 3 random seeds; the following datasets require more replicates: 5 random seeds for CivilComments, 10 random seeds for Camelyon17, and 5 folds for PovertyMap.
  2. The test set must not be used in any form for model training or selection.
  3. The validation set must be either the official out-of-distribution (OOD) validation set or, if applicable, the official in-distribution (ID) validation set.
  4. The validation set should only be used for hyperparameter selection. For example, after hyperparameters have been selected, do not combine the validation set with the training set and retrain the model.
  5. Training and model selection should not use any additional data, labeled or unlabeled, beyond the official training and validation data.
  6. To avoid unintended adaptation, models should not use batch statistics during evaluation. BatchNorm is acceptable in its default mode, where it uses batch statistics during training and then fixed running statistics during evaluation (see the sketch after this list).
  7. Other dataset-specific guidelines:
    • For Camelyon17, models should not be pretrained on external data. Note: We have relaxed the constraint that models should not use color augmentation, since unlabeled data methods typically rely on data augmentation suites that include color augmentation.
    • For iWildCam, models should not be pretrained on external data. This includes off-the-shelf detectors (e.g., MegaDetector) that have been trained on external data.
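
To illustrate guideline 6, here is a minimal sketch (not from the WILDS codebase) of BatchNorm in its default mode: batch statistics are used and running statistics are updated during training, and the fixed running statistics are used once the model is switched to evaluation mode.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

model.train()                          # training mode: batch statistics are used,
_ = model(torch.randn(4, 3, 32, 32))   # and running statistics are updated

model.eval()                           # evaluation mode: fixed running statistics are used,
with torch.no_grad():                  # so there is no unintended test-time adaptation
    _ = model(torch.randn(4, 3, 32, 32))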

Non-standard submissions

Non-standard submissions only need to follow the first two guidelines from above:

  1. Results must be reported on at least 3 random seeds; the following datasets require more replicates: 5 random seeds for CivilComments, 10 random seeds for Camelyon17, and 5 folds for PovertyMap.
  2. The test set must not be used in any form for model training or selection.

These submissions will be differentiated from standard submissions on our leaderboards. They are meant to let the community try out different approaches to these tasks. Examples of non-standard submissions might include:

  • Using unlabeled data from external sources
  • Specialized methods for particular datasets/domains, such as color augmentation for Camelyon17
  • Using leave-one-domain-out cross-validation instead of the fixed OOD validation set

Making a submission

Submitting to the WILDS leaderboard consists of two steps: first, uploading your prediction files, and second, filling out our submission form.

Submission formatting

Please submit your predictions in .csv format for all datasets except GlobalWheat, and .pth format for the GlobalWheat dataset. The example scripts in the examples/ folder will automatically train models and save their predictions in the right format; see the Get Started page for information on how to use these scripts.

If you are not using the example scripts, see the last section on this page for details on the expected format.

Step 1: Uploading your predictions

Upload a .tar.gz or .zip file containing your predictions in the format specified above. Feel free to use any standard host for your file (Google Drive, Dropbox, etc.).

Check that your predictions are valid by running the evaluate.py script on them. To do so, run python3 examples/evaluate.py [path_to_predictions] [path_to_output_results] --root_dir [path_to_data].

Please upload a separate .tar.gz or .zip file per method that you are submitting. For example, if you are submitting algorithm A and algorithm B, both of which are evaluated on 6 different datasets, then you should submit two different .tar.gz or .zip files: one corresponding to algorithm A (and containing predictions for all 6 datasets) and the other corresponding to algorithm B (also containing predictions for all 6 datasets).
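
For instance, a minimal sketch of packaging one method's predictions folder into a single archive (the folder and archive names below are illustrative assumptions):

import tarfile

# Bundle the entire predictions directory for one method into a single .tar.gz.
with tarfile.open("algorithm_A_predictions.tar.gz", "w:gz") as tar:
    tar.add("predictions_algorithm_A", arcname="predictions_algorithm_A")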

Step 2: Filling out the submission form

Next, fill out the submission form; you will need one form per .tar.gz/.zip file submitted. The form will ask for the URL of your submission file.

Once these steps have been completed, we will evaluate the predictions using the evaluate.py script and update the leaderboard within a week.

Detailed submission format

If you are manually generating the submission without using the example scripts, it should be structured in the following way:

  • Each submission should have its own predictions folder (the name of the folder does not matter).
  • Every dataset that you have results for should have its own subfolder: amazon, camelyon17, civilcomments, fmow, globalwheat, iwildcam, ogb-molpcba, poverty, py150.
  • In each subfolder, there should be one .csv (or .pth for GlobalWheat) per available evaluation split (e.g., val, test, id_val, id_test) and replicate. Please use the following naming convention: {dataset}_split:{split}_seed:{seed}_epoch:{epoch}_pred.csv. It does not matter what {epoch} is, so long as only one epoch is selected per dataset, split, and seed. For poverty, replace seed with fold.

As an example, this would be a valid predictions directory:

predictions [the name of the top-level folder is arbitrary]
|-- iwildcam
|   |-- iwildcam_split:id_test_seed:0_epoch:best_pred.csv
|   |-- iwildcam_split:id_val_seed:0_epoch:best_pred.csv
|   |-- iwildcam_split:test_seed:0_epoch:best_pred.csv
|   |-- iwildcam_split:val_seed:0_epoch:best_pred.csv
|   |-- iwildcam_split:id_test_seed:1_epoch:best_pred.csv
|   |-- iwildcam_split:id_val_seed:1_epoch:best_pred.csv
|   |-- iwildcam_split:test_seed:1_epoch:best_pred.csv
|   |-- iwildcam_split:val_seed:1_epoch:best_pred.csv
|   |-- iwildcam_split:id_test_seed:2_epoch:best_pred.csv
|   |-- iwildcam_split:id_val_seed:2_epoch:best_pred.csv
|   |-- iwildcam_split:test_seed:2_epoch:best_pred.csv
|   |-- iwildcam_split:val_seed:2_epoch:best_pred.csv
|-- poverty
|   |-- poverty_split:id_test_fold:A_epoch:best_pred.csv
|   |-- poverty_split:id_val_fold:A_epoch:best_pred.csv
|   |-- poverty_split:test_fold:A_epoch:best_pred.csv
|   |-- poverty_split:val_fold:A_epoch:best_pred.csv
|   |-- poverty_split:id_test_fold:B_epoch:best_pred.csv
|   |-- poverty_split:id_val_fold:B_epoch:best_pred.csv
|   |-- poverty_split:test_fold:B_epoch:best_pred.csv
|   |-- poverty_split:val_fold:B_epoch:best_pred.csv
|   |-- poverty_split:id_test_fold:C_epoch:best_pred.csv
|   |-- poverty_split:id_val_fold:C_epoch:best_pred.csv
|   |-- poverty_split:test_fold:C_epoch:best_pred.csv
|   |-- poverty_split:val_fold:C_epoch:best_pred.csv
|   |-- poverty_split:id_test_fold:D_epoch:best_pred.csv
|   |-- poverty_split:id_val_fold:D_epoch:best_pred.csv
|   |-- poverty_split:test_fold:D_epoch:best_pred.csv
|   |-- poverty_split:val_fold:D_epoch:best_pred.csv
|   |-- poverty_split:id_test_fold:E_epoch:best_pred.csv
|   |-- poverty_split:id_val_fold:E_epoch:best_pred.csv
|   |-- poverty_split:test_fold:E_epoch:best_pred.csv
|   |-- poverty_split:val_fold:E_epoch:best_pred.csv
...

Each .csv should be structured in the following way:

  • Each row should correspond to an example, in the order of the dataset (i.e., the first row should correspond to the first example of the dataset, etc.).
  • Specifically, each row should contain the entry of y_pred corresponding to that example. The format of y_pred should be exactly what is passed to that dataset’s eval function. The columns correspond to the different dimensions of y_pred. For example, binary classification datasets will have only one column (of integers), while ogb-molpcba will have 128 columns (of logits), and py150 will have 255 columns (of integers).

For example, iwildcam_split:id_test_seed:0_epoch:best_pred.csv might look like

4
172
0
24
...

representing a prediction of class 4 for the first example, class 172 for the second example, etc.
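
If you are generating these files yourself, the following is a minimal sketch of writing one prediction file in the expected format and naming convention; the helper name, paths, and example values are assumptions.

import os
import pandas as pd
import torch

def save_predictions(y_pred, dataset, split, replicate, out_dir, epoch="best", replicate_key="seed"):
    # One row per example, in dataset order; no header or index column.
    # For poverty, pass replicate_key="fold" so the filename uses fold:{fold} instead of seed:{seed}.
    os.makedirs(os.path.join(out_dir, dataset), exist_ok=True)
    fname = f"{dataset}_split:{split}_{replicate_key}:{replicate}_epoch:{epoch}_pred.csv"
    values = y_pred.numpy() if torch.is_tensor(y_pred) else y_pred
    pd.DataFrame(values).to_csv(os.path.join(out_dir, dataset, fname), index=False, header=False)

# e.g., integer class predictions for the iWildCam ID test split, seed 0
save_predictions(torch.tensor([4, 172, 0, 24]), "iwildcam", "id_test", 0, "predictions")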

For GlobalWheat-WILDS, the .pth should be structured in the following way:

  • The .pth should contain a list with one element per example.
  • Each element of the list should be a dictionary containing at least the following keys: boxes and scores.
  • boxes should be an M x 4 tensor, where M is the number of bounding boxes predicted in the example, and the columns correspond to (x_min, y_min, x_max, y_max); the coordinates should not be normalized.
  • scores should be an M-dimensional tensor containing the predicted probability of the corresponding bounding box representing a wheat head.
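
As a minimal sketch (the box coordinates and scores below are dummy values, and the filename assumes the same naming convention as above with a .pth extension):

import torch

predictions = []
for _ in range(3):  # one dictionary per example; three dummy examples for illustration
    predictions.append({
        # M x 4 boxes in unnormalized (x_min, y_min, x_max, y_max) pixel coordinates
        "boxes": torch.tensor([[10.0, 20.0, 110.0, 150.0],
                               [200.0, 40.0, 260.0, 120.0]]),
        # M predicted probabilities that each box contains a wheat head
        "scores": torch.tensor([0.92, 0.71]),
    })

torch.save(predictions, "globalwheat_split:test_seed:0_epoch:best_pred.pth")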