Overview
Thank you for submitting to the WILDS leaderboards.
We welcome submissions of new algorithms and/or models, and we encourage contributors to test their new methods on as many datasets as applicable. This is valuable even if (or especially if) your method performs well on some datasets but not others.
We also welcome re-implementations of existing methods. On the leaderboards, we distinguish between official submissions (made by the authors of a method) and unofficial submissions (re-implementations by other contributors). Unofficial submissions are equally valuable, especially if the re-implementations achieve better performance than the original implementations because of better tuning or simple tweaks.
All submissions must use the dataset classes and evaluators in the WILDS package. In addition, they must report results on multiple replicates: 5 random seeds for CivilComments; 10 random seeds for Camelyon17; 5 folds for PovertyMap; and 3 random seeds for all other datasets.
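For reference, here is a minimal sketch (not a full training script) of loading a WILDS dataset and scoring predictions with the official evaluator; the dataset name camelyon17, the transform, the batch size, and the constant placeholder predictor are all arbitrary choices to keep the example self-contained.

```python
# Minimal sketch: load an official WILDS dataset and score predictions with
# its evaluator. Replace the placeholder predictor with your own model.
import torch
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_eval_loader

dataset = get_dataset(dataset="camelyon17", download=True)
test_data = dataset.get_subset("test", transform=transforms.ToTensor())
test_loader = get_eval_loader("standard", test_data, batch_size=32)

all_y_pred, all_y_true, all_metadata = [], [], []
for x, y_true, metadata in test_loader:
    y_pred = torch.zeros_like(y_true)  # placeholder: predicts class 0 everywhere
    all_y_pred.append(y_pred)
    all_y_true.append(y_true)
    all_metadata.append(metadata)

# The official evaluator computes the metrics reported on the leaderboard.
results, results_str = dataset.eval(
    torch.cat(all_y_pred), torch.cat(all_y_true), torch.cat(all_metadata)
)
print(results_str)
```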
Submissions fall into two categories: standard submissions and non-standard submissions.
Standard submissions
Standard submissions must follow these guidelines:
- Results must be reported on at least 3 random seeds. The following datasets must have more replicates: 5 random seeds for CivilComments; 10 random seeds for Camelyon17; and 5 folds for PovertyMap.
- The test set must not be used in any form for model training or selection.
- The validation set must be either the official out-of-distribution (OOD) validation set or, if applicable, the official in-distribution (ID) validation set.
- The validation set should only be used for hyperparameter selection. For example, after hyperparameters have been selected, do not combine the validation set with the training set and retrain the model.
- Training and model selection should not use any additional data, labeled or unlabeled, beyond the official training and validation data.
- To avoid unintended adaptation, models should not use batch statistics during evaluation. BatchNorm is accepted in its default mode, where it uses batch statistics during training and then fixes them during evaluation (see the sketch after this list).
- Other dataset-specific guidelines:
- For Camelyon17, models should not be pretrained on external data. Note: We have relaxed the constraint that models should not use color augmentation, since unlabeled data methods typically rely on data augmentation suites that include color augmentation.
- For iWildCam, models should not be pretrained on external data. This includes off-the-shelf detectors (e.g., MegaDetector) that have been trained on external data.
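To illustrate the BatchNorm guideline above, here is a minimal PyTorch sketch: batch statistics may be used (and running statistics updated) during training, but the model must be switched to eval mode before evaluation so that the running statistics are fixed. The toy model and input shapes are arbitrary.

```python
# Sketch of the intended BatchNorm behaviour: batch statistics during training,
# fixed running statistics (no batch statistics) during evaluation.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

model.train()  # training: BatchNorm normalizes with batch statistics and updates running stats
_ = model(torch.randn(4, 3, 32, 32))

model.eval()   # evaluation: BatchNorm normalizes with its fixed running statistics
with torch.no_grad():
    y_pred = model(torch.randn(4, 3, 32, 32))
```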
Non-standard submissions
Non-standard submissions only need to follow the first two guidelines from above:
- Results must be reported on at least 3 random seeds. The following datasets must have more replicates: 5 random seeds for CivilComments; 10 random seeds for Camelyon17; and 5 folds for PovertyMap.
- The test set must not be used in any form for model training or selection.
These submissions will be differentiated from standard submissions in our leaderboards. They are meant for the community to try out different approaches to solving these tasks. Examples of non-standard submissions might include:
- Using unlabeled data from external sources
- Specialized methods for particular datasets/domains, such as color augmentation for Camelyon17
- Using leave-one-domain-out cross-validation instead of the fixed OOD validation set
Making a submission
Submitting to the WILDS leaderboard consists of two steps: first, uploading your predictions in the format described below, and second, filling out our submission form.
Submission formatting
Please submit your predictions in .csv format for all datasets except GlobalWheat, and in .pth format for the GlobalWheat dataset.
The example scripts in the examples/ folder will automatically train models and save their predictions in the right format; see the Get Started page for information on how to use these scripts.
If you are not using the example scripts, see the last section on this page for details on the expected format.
Step 1: Uploading your predictions
Upload a .tar.gz or .zip file containing your predictions in the format specified above. Feel free to use any standard host for your file (Google Drive, Dropbox, etc.).
Check that your predictions are valid by running the evaluate.py script on them. To do so, run: python3 examples/evaluate.py [path_to_predictions] [path_to_output_results] --root_dir [path_to_data].
Please upload a separate .tar.gz or .zip file per method that you are submitting. For example, if you are submitting algorithm A and algorithm B, both of which are evaluated on 6 different datasets, then you should submit two different .tar.gz or .zip files: one corresponding to algorithm A (and containing predictions for all 6 datasets) and the other corresponding to algorithm B (also containing predictions for all 6 datasets).
Step 2: Filling out the submission form
Next, fill out the submission form. You will need to fill out one form per .tar.gz/.zip file submitted. The form will ask for the URL to your submission file.
Once these steps have been completed, we will evaluate the predictions using the evaluate.py script and update the leaderboard within a week.
Detailed submission format
If you are manually generating the submission without using the example scripts, it should be structured in the following way:
- Each submission should have its own predictions folder (the name of the folder does not matter).
- Every dataset that you have results for should have its own subfolder: amazon, camelyon17, civilcomments, fmow, iwildcam, ogb-molpcba, poverty, py150.
- In each directory, there should be one .csv or .pth file per available evaluation split (e.g., val, test, id-val, id-test) and per replicate. Please use the following naming convention: {dataset}_split:{split}_seed:{seed}_epoch:{epoch}_pred.csv. It does not matter what {epoch} is, so long as there is only one epoch selected per dataset, split, and seed. For poverty, replace seed with fold.
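For illustration, here is a small hypothetical helper (not part of the WILDS package) that builds paths following this naming convention; the folder name my_predictions and the epoch tag best are arbitrary placeholders.

```python
# Hypothetical helper for constructing prediction file paths that follow the
# naming convention above. Not part of the WILDS package.
from pathlib import Path

def prediction_path(root, dataset, split, replicate, epoch="best"):
    # poverty uses fold letters (A-E) in place of integer seeds
    key = "fold" if dataset == "poverty" else "seed"
    fname = f"{dataset}_split:{split}_{key}:{replicate}_epoch:{epoch}_pred.csv"
    return Path(root) / dataset / fname

print(prediction_path("my_predictions", "iwildcam", "test", 0))
# my_predictions/iwildcam/iwildcam_split:test_seed:0_epoch:best_pred.csv
print(prediction_path("my_predictions", "poverty", "val", "A"))
# my_predictions/poverty/poverty_split:val_fold:A_epoch:best_pred.csv
```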
As an example, this would be a valid predictions directory:
predictions [the name of the top-level folder is arbitrary]
|-- iwildcam
| |-- iwildcam_split:id_test_seed:0_epoch:best_pred.csv
| |-- iwildcam_split:id_val_seed:0_epoch:best_pred.csv
| |-- iwildcam_split:test_seed:0_epoch:best_pred.csv
| |-- iwildcam_split:val_seed:0_epoch:best_pred.csv
| |-- iwildcam_split:id_test_seed:1_epoch:best_pred.csv
| |-- iwildcam_split:id_val_seed:1_epoch:best_pred.csv
| |-- iwildcam_split:test_seed:1_epoch:best_pred.csv
| |-- iwildcam_split:val_seed:1_epoch:best_pred.csv
| |-- iwildcam_split:id_test_seed:2_epoch:best_pred.csv
| |-- iwildcam_split:id_val_seed:2_epoch:best_pred.csv
| |-- iwildcam_split:test_seed:2_epoch:best_pred.csv
| |-- iwildcam_split:val_seed:2_epoch:best_pred.csv
|-- poverty
| |-- poverty_split:id_test_fold:A_epoch:best_pred.csv
| |-- poverty_split:id_val_fold:A_epoch:best_pred.csv
| |-- poverty_split:test_fold:A_epoch:best_pred.csv
| |-- poverty_split:val_fold:A_epoch:best_pred.csv
| |-- poverty_split:id_test_fold:B_epoch:best_pred.csv
| |-- poverty_split:id_val_fold:B_epoch:best_pred.csv
| |-- poverty_split:test_fold:B_epoch:best_pred.csv
| |-- poverty_split:val_fold:B_epoch:best_pred.csv
| |-- poverty_split:id_test_fold:C_epoch:best_pred.csv
| |-- poverty_split:id_val_fold:C_epoch:best_pred.csv
| |-- poverty_split:test_fold:C_epoch:best_pred.csv
| |-- poverty_split:val_fold:C_epoch:best_pred.csv
| |-- poverty_split:id_test_fold:D_epoch:best_pred.csv
| |-- poverty_split:id_val_fold:D_epoch:best_pred.csv
| |-- poverty_split:test_fold:D_epoch:best_pred.csv
| |-- poverty_split:val_fold:D_epoch:best_pred.csv
| |-- poverty_split:id_test_fold:E_epoch:best_pred.csv
| |-- poverty_split:id_val_fold:E_epoch:best_pred.csv
| |-- poverty_split:test_fold:E_epoch:best_pred.csv
| |-- poverty_split:val_fold:E_epoch:best_pred.csv
...
Each .csv should be structured in the following way:
- Each row should correspond to an example, in the order of the dataset (i.e., the first row should correspond to the first example of the dataset, etc.).
- Specifically, each row should contain the entry of y_pred corresponding to that example. The format of y_pred should be exactly what is passed to that dataset’s eval function. The columns correspond to the different dimensions of y_pred. For example, binary classification datasets will have only one column (of integers), while ogb-molpcba will have 128 columns (of logits), and py150 will have 255 columns (of integers).
For example, iwildcam_split:id_test_seed:0_epoch:best_pred.csv might look like
4
172
0
24
...
representing a prediction of class 4 for the first example, class 172 for the second example, etc.
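As a rough sketch (assuming numpy and integer class predictions), the snippet below writes such a file: one row per example, in dataset order, with no header or index column. The toy values match the example above; for multi-dimensional y_pred (e.g., ogb-molpcba logits), y_pred would be a 2-D float array and a float format string would be used instead.

```python
# Sketch: save integer class predictions, one example per row, no header.
# The predictions below are toy values; replace them with your model's output.
import numpy as np

y_pred = np.array([4, 172, 0, 24])  # one entry per example, in dataset order
np.savetxt(
    "iwildcam_split:id_test_seed:0_epoch:best_pred.csv",
    y_pred,
    fmt="%d",        # integer labels; use a float format for logits
    delimiter=",",   # columns (if any) are comma-separated
)
```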
For GlobalWheat-WILDS, the .pth should be structured in the following way:
- The .pth should contain a list with one element per example.
- Each element of the list should be a dictionary containing at least the following keys: boxes and scores.
- boxes should be an M x 4 tensor, where M is the number of bounding boxes predicted in the example, and the columns correspond to (x_min, y_min, x_max, y_max); the coordinates are not normalized.
- scores should be an M-dimensional tensor containing the predicted probability of the corresponding bounding box representing a wheat head.
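As a sketch, the snippet below builds and saves a prediction list in this format with torch.save. The box coordinates and scores are made-up values, and the file name assumes that GlobalWheat follows the same naming convention as the other datasets.

```python
# Sketch of the expected GlobalWheat .pth structure: a list with one dictionary
# per example, holding un-normalized box coordinates and per-box probabilities.
import torch

predictions = [
    {
        "boxes": torch.tensor([[10.0, 20.0, 110.0, 140.0],    # (x_min, y_min, x_max, y_max)
                               [300.0, 50.0, 380.0, 150.0]]),
        "scores": torch.tensor([0.92, 0.47]),                  # one probability per box
    },
    # ... one dictionary per example, in dataset order ...
]

torch.save(predictions, "globalwheat_split:test_seed:0_epoch:best_pred.pth")
```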