Overview

To submit, please read our submission guidelines.

Higher numbers are better for all metrics. In parentheses, we show corrected sample standard deviations across random replicates.
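
As an illustration of the "mean (std)" format used throughout, the sketch below computes a mean and a corrected sample standard deviation (dividing by n - 1) over replicate scores. The helper names and the example scores are hypothetical, not taken from the leaderboard or from the WILDS code base.

```python
import statistics


def summarize(replicates):
    """Return (mean, corrected sample std) for a list of per-replicate scores."""
    mean = statistics.mean(replicates)
    # statistics.stdev uses Bessel's correction (divides by n - 1),
    # i.e. it is the corrected sample standard deviation.
    std = statistics.stdev(replicates)
    return mean, std


def format_entry(replicates, digits=1):
    """Format replicate scores as 'mean (std)', as in the tables below."""
    mean, std = summarize(replicates)
    return f"{mean:.{digits}f} ({std:.{digits}f})"


# Hypothetical scores from three random replicates (illustrative only).
scores = [70.0, 72.0, 74.0]
print(format_entry(scores))  # prints "72.0 (2.0)"
```

With only a few replicates per entry, the n - 1 denominator gives a noticeably larger value than the uncorrected (population) standard deviation.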

A bold algorithm or model name indicates an official implementation submitted by an author of the original paper.

An asterisk next to a value indicates that the entry deviates from the official submission guidelines, for example because it uses a non-default model or additional pre-training data. The deviations are described in the notes in the dataset-specific leaderboards.

This overall leaderboard shows out-of-distribution test performance across all datasets. For each dataset, we highlight in green the best-performing algorithm that conforms to the official submission guidelines.
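
The highlighting rule can be sketched as follows, assuming each table cell is a string of the form "mean (std)", optionally followed by an asterisk, or "-" for a missing entry. The function names and the parsing approach are assumptions made for illustration; this is not the code that generates the leaderboard. The example cells are copied from the Camelyon17 column of the table below.

```python
import re

# Matches cells such as "77.1 (6.9)" or "82.0 (7.4) *".
CELL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*\((\d+(?:\.\d+)?)\)\s*(\*)?\s*$")


def parse_cell(cell):
    """Parse one cell; return None for missing entries such as '-'."""
    match = CELL.match(cell)
    if match is None:
        return None
    return {
        "mean": float(match.group(1)),
        "std": float(match.group(2)),
        # An asterisk marks a deviation from the submission guidelines.
        "deviates": match.group(3) is not None,
    }


def best_conforming(column):
    """Return the algorithm with the highest mean among conforming entries."""
    parsed = {name: parse_cell(cell) for name, cell in column.items()}
    eligible = {
        name: entry
        for name, entry in parsed.items()
        if entry is not None and not entry["deviates"]
    }
    if not eligible:
        return None
    # Higher is better for all metrics; ties resolve to the first entry seen.
    return max(eligible, key=lambda name: eligible[name]["mean"])


camelyon17 = {
    "LISA": "77.1 (6.9)",
    "Fish": "74.7 (7.1)",
    "DFR": "-",
    "ERM w/ data aug": "82.0 (7.4) *",
    "SGD (Freeze-Embed)": "96.5 (0.4) *",
}
print(best_conforming(camelyon17))  # prints "LISA"
```

Entries marked with an asterisk are excluded before taking the maximum, which is why a conforming entry can be highlighted even when a deviating entry has a higher score.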

Without unlabeled data
Algorithm Amazon (10% Acc) Camelyon17 (Avg Acc) CivilComments (Worst-Group Acc) FMoW (Worst-Reg Acc) GlobalWheat (Avg Domain Acc) iWildCam (Macro F1) OGB-MolPCBA (Avg Precision) PovertyMap (Worst-U/R r) Py150 (Mtd/Cls Acc) RxRx1 (Avg Acc) Contact References
LISA 54.7 (0.0) 77.1 (6.9) 72.9 (1.0) 35.5 (0.81) - - - - - 31.9 (1.0) Yu Wang
Paper / Code
Fish 53.3 (0.0) 74.7 (7.1) 75.3 (0.6) 34.6 (0.18) - 22.0 (1.8) - - - - Yuge Shi
Paper / Code
DFR - - 72.5 (0.9) 42.8 (0.42) * - - - - - - Pavel Izmailov
Paper / Code
CORAL 52.9 (0.8) 59.5 (7.7) 65.6 (1.3) 32.8 (0.66) - 32.7 (0.2) 17.9 (0.5) 0.44 (0.07) 65.9 (0.1) 28.4 (0.3) WILDS
Paper / Code
CGD - 69.4 (7.9) 69.1 (1.9) 32.0 (2.26) - - - 0.43 (0.04) - - Vihari Piratla
Paper / Code
Group DRO 53.3 (0.0) 68.4 (7.3) 70.0 (2.0) 31.1 (1.66) 47.9 (2.0) 23.8 (2.0) 22.4 (0.6) 0.39 (0.06) 66.0 (0.1) 22.5 (0.3) WILDS
Paper / Code
ARM-BN - - - 24.4 (0.54) - 23.3 (2.8) - - - 31.2 (0.1) Marvin Zhang
Paper / Code
IID representation learning - - - - - - - - - 39.2 (0.2) Jiqing Wu
Paper / Code
ABSGD - - - - - 33.0 (0.6) - - - - Qi Qi
Paper / Code
ERM w/ data aug - 82.0 (7.4) * - 35.7 (0.26) - 32.2 (1.2) - 0.49 (0.06) - - WILDS
Paper / Code
ERM (CutMix) - - - - - - - - - 38.4 (0.2) Jiqing Wu
Paper / Code
ERM (more checkpoints) - - - 34.8 (1.9) - 32.0 (1.5) - - - - Kazuki Irie
Paper / Code
ERM (grid search) 53.8 (0.8) 70.3 (6.4) 56.0 (3.6) 31.3 (0.17) 51.2 (1.8) 30.8 (1.3) 27.2 (0.3) 0.45 (0.06) 67.9 (0.1) 29.9 (0.4) WILDS
Paper / Code
ERM (rand search) 54.2 (0.8) 70.8 (7.2) - 34.1 (1.42) 50.5 (1.7) 30.6 (1.1) 28.3 (0.1) 0.5 (0.07) - - WILDS
Paper / Code
IRM 52.4 (0.8) 64.2 (8.1) 66.3 (2.1) 32.8 (2.09) - 15.1 (4.9) 15.6 (0.3) 0.43 (0.07) 64.3 (0.2) 8.2 (1.1) WILDS
Paper / Code
Test-time BN adaptation - - - 30.0 (0.23) - 13.8 (0.6) - - - 20.1 (0.2) Marvin Zhang
Paper / Code
SGD (Freeze-Embed) - 96.5 (0.4) * - 50.3 (1.1) * - - - - - - Ananya Kumar
Paper / Code
MBDG - 93.3 (1.0) * - - - - - - - - Alex Robey
Paper / Code
MixUp - 63.5 (0.9) * - - - 13.8 (0.8) - - - - Olivia Wiles
Paper / Code
JTT - 63.8 (1.4) * - - - 11.0 (2.5) - - - - Olivia Wiles
Paper / Code
With unlabeled data
Algorithm Amazon (10% Acc) Camelyon17 (Avg Acc) CivilComments (Worst-Group Acc) FMoW (Worst-Reg Acc) GlobalWheat (Avg Domain Acc) iWildCam (Macro F1) OGB-MolPCBA (Avg Precision) PovertyMap (Worst-U/R r) Contact References
ICON 54.7 (0.0) 93.8 (0.3) 68.8 (1.3) 39.9 (1.12) 52.3 (0.2) 34.5 (1.4) 28.3 (0.0) 0.49 (0.04) Nick Y.
Paper / Code
Noisy Student - 86.7 (1.7) - 37.8 (0.62) 49.3 (3.7) 32.1 (0.7) 27.5 (0.1) 0.42 (0.11) WILDS
Paper / Code
SwAV - 91.4 (2.0) - 36.3 (1.01) - 29.0 (2.0) - 0.45 (0.05) WILDS
Paper / Code
AFN 54.2 (0.8) 83.2 (6.2) - 38.3 (0.95) - 30.8 (0.5) 14.9 (1.3) 0.39 (0.08) WILDS
Paper / Code
DANN 53.3 (0.0) 68.4 (9.2) - 34.6 (1.71) - 31.9 (1.4) 20.4 (0.8) 0.33 (0.1) WILDS
Paper / Code
Pseudo-Label 52.3 (1.1) 67.7 (8.2) 66.9 (2.6) 33.7 (0.24) 42.9 (2.3) 30.3 (0.4) 19.7 (0.1) - WILDS
Paper / Code
FixMatch - 71.0 (4.9) - 32.6 (2.05) - 31.0 (1.3) - 0.3 (0.11) WILDS
Paper / Code
CORAL 53.3 (0.0) 77.9 (6.6) - 33.7 (0.23) - 27.9 (0.4) 26.6 (0.2) 0.36 (0.08) WILDS
Paper / Code
Masked LM 53.5 (0.2) - 65.7 (2.3) - - - - - WILDS
Paper / Code

Below, we list individual leaderboards with more details on each submission.

iWildCam

Without unlabeled data
Rank Algorithm Model Test ID Macro F1 Test ID Avg Acc Test OOD Macro F1 ▼ Test OOD Avg Acc Contact References Date Notes
1 Model Soups (CLIP ViT-L) ViT-L 57.6 (1.9) * 79.1 (0.4) * 43.3 (1.0) * 79.3 (0.3) * Mitchell Wortsman
Paper / Code March 12, 2022 Model soups on top of a random hyperparameter search over LR, iterations, data augmentation, label smoothing.
2 ERM (CLIP ViT-L) ViT-L 55.8 (1.9) * 77.0 (0.7) * 41.4 (0.5) * 78.3 (1.1) * Mitchell Wortsman
Paper / Code July 28, 2022 Random hyperparameter search over LR, iterations, data augmentation, label smoothing.
3 ERM PNASNet-5-Large 52.8 (1.4) * 77.3 (0.7) * 38.5 (0.6) * 78.3 (1.4) * John Miller
Paper / Code July 20, 2021 Does not use the default model.
4 ABSGD ResNet50 47.5 (1.6) 74.8 (0.5) 33.0 (0.6) 72.7 (1.8) Qi Qi
Paper / Code October 13, 2022 The learning rate is tuned in {3e-05, *4e-05}. The hyperparameters for ABSGD are tuned over {1.1, 1.5, 2}, and gamma is 0.9. Trained for 36 epochs with a batch size of 16. The learning rate is decayed by a factor of 2 at the 18th epoch.
5 CORAL ResNet50 43.6 (3.3) 73.8 (0.3) 32.7 (0.2) 73.3 (4.3) WILDS
Paper / Code July 15, 2021
6 ERM w/ data aug ResNet50 47.0 (1.4) 76.9 (0.6) 32.2 (1.2) 73.0 (0.4) WILDS
Paper / Code December 09, 2021 Implements RandAugment. Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
7 ERM (more checkpoints) ResNet50 47.9 (2.6) 76.2 (0.1) 32.0 (1.5) 69.0 (0.4) Kazuki Irie
Paper / Code February 10, 2022 We used the default hyper-parameters from the official code base, but conducted cross validation every 1000 training steps.
8 ERM (grid search) ResNet50 47.1 (1.5) 75.7 (0.4) 30.8 (1.3) 71.5 (2.6) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
9 ERM (rand search) ResNet50 46.7 (0.6) 74.9 (1.2) 30.6 (1.1) 72.5 (3.2) WILDS
Paper / Code December 09, 2021 Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
10 Group DRO ResNet50 37.5 (1.9) 71.6 (2.7) 23.8 (2.0) 72.7 (2.0) WILDS
Paper / Code July 15, 2021
11 ARM-BN ResNet50 27.5 (5.4) 62.0 (4.0) 23.3 (2.8) 70.2 (2.4) Marvin Zhang
Paper / Code April 19, 2022 Requires test data to be batched by groups
12 Fish ResNet50 40.3 (0.6) 73.8 (0.1) 22.0 (1.8) 64.7 (2.6) Yuge Shi
Paper / Code December 14, 2021
13 IRM ResNet50 22.4 (7.7) 59.8 (8.2) 15.1 (4.9) 59.7 (3.8) WILDS
Paper / Code July 15, 2021
14 MixUp ResNet50 31.2 (3.1) 66.1 (1.8) 13.8 (0.8) 48.6 (1.1) Olivia Wiles
Paper / Code June 16, 2022 lr: [0.01, 0.001, 0.0001]; alpha: [0.2, 0.5, 1.0]
15 Test-time BN adaptation ResNet50 12.0 (0.3) 37.2 (0.7) 13.8 (0.6) 46.6 (0.9) Marvin Zhang
Paper / Code April 19, 2022 Requires test data to be batched by groups
16 JTT ResNet50 32.6 (4.4) 64.9 (2.8) 11.0 (2.5) 47.4 (2.2) Olivia Wiles
Paper / Code June 16, 2022 lr: [0.01, 0.001, 0.0001]; lambda: [0.2, 2, 20, 200]
With unlabeled data

Unlabeled data is available from the extra domain.

Rank Algorithm Model Test ID Macro F1 Test ID Avg Acc Test OOD Macro F1 ▼ Test OOD Avg Acc Contact References Date Notes
1 ICON ResNet50 50.6 (1.3) 77.5 (0.2) 34.5 (1.4) 72.0 (0.2) Nick Y.
Paper / Code November 03, 2022 Uses unlabeled data from extra domains. lr: 0.0002253717686699905*, dropout: 0.5*; follows NoisyStudent.
2 Noisy Student ResNet50 48.3 (1.6) 77.6 (1.4) 32.1 (0.7) 71.0 (3.1) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from extra domains.
3 DANN ResNet50 48.5 (2.8) 77.2 (0.8) 31.9 (1.4) 70.7 (2.6) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from extra domains.
4 FixMatch ResNet50 46.3 (0.5) 78.1 (0.5) 31.0 (1.3) 71.9 (3.1) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from extra domains.
5 AFN ResNet50 46.8 (0.8) 77.2 (0.5) 30.8 (0.5) 74.4 (1.6) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from extra domains.
6 Pseudo-Label ResNet50 47.3 (0.4) 77.5 (0.1) 30.3 (0.4) 68.7 (3.4) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from extra domains.
7 SwAV ResNet50 47.3 (1.4) 74.8 (1.3) 29.0 (2.0) 63.4 (1.5) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from extra domains.
8 CORAL ResNet50 40.5 (1.4) 77.0 (0.3) 27.9 (0.4) 69.7 (0.8) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from extra domains.

Camelyon17

Without unlabeled data
Rank Algorithm Model Val Acc Test Acc ▼ Contact References Date Notes
1 SGD (Freeze-Embed) CLIP ViT-L 95.2 (0.3) * 96.5 (0.4) * Ananya Kumar
Paper / Code October 18, 2022 Uses a pretrained model. lr: [3e-5, 1e-4, 3e-4*, 1e-3, 3e-3, 1e-2]
2 MBDG DenseNet121 88.1 (1.8) * 93.3 (1.0) * Alex Robey
Paper / Code March 17, 2022 Uses a pretrained model. lr: [1e-4*, 1e-3, 1e-2], wd: [0*, 1e-3, 1e-3], gamma (mbdg margin): [0.01, 0.1*, 0.5], eta_d (mbdg dual step size): [5e-3, 5e-2*, 5e-1]
3 ERM w/ H&E jitter se_resnext101_32x4d 88.0 (4.2) * 91.6 (1.9) * Rohan Taori
Paper / Code July 19, 2021 Implements a specialized H&E staining color jitter. Does not use the default model.
4 ERM w/ data aug DenseNet121 90.6 (1.2) * 82.0 (7.4) * WILDS
Paper / Code December 09, 2021 Implements RandAugment. Uses color augmentation as part of RandAugment. Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
5 LISA DenseNet121 81.8 (1.4) 77.1 (6.9) Yu Wang
Paper / Code March 13, 2022 Uses all default hyperparameters except mix_alpha, which is tuned over [0.5, 2]; the final choice is 2.
6 Fish DenseNet121 83.9 (1.2) 74.7 (7.1) Yuge Shi
Paper / Code December 14, 2021
7 ERM (rand search) DenseNet121 85.8 (1.9) 70.8 (7.2) WILDS
Paper / Code December 09, 2021 Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
8 ERM (grid search) DenseNet121 84.9 (3.1) 70.3 (6.4) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
9 CGD DenseNet121 86.8 (1.4) 69.4 (7.9) Vihari Piratla
Paper / Code April 10, 2022 No hyperparameter tuning. CG step size: 0.05. LR, optimizer, decay rate etc. all set to default.
10 Group DRO DenseNet121 85.5 (2.2) 68.4 (7.3) WILDS
Paper / Code July 15, 2021
11 IRM DenseNet121 86.2 (1.4) 64.2 (8.1) WILDS
Paper / Code July 15, 2021
12 JTT ResNet50 66.9 (1.5) * 63.8 (1.4) * Olivia Wiles
Paper / Code June 16, 2022 lr: [0.01, 0.001, 0.0001]; lambda: [0.2, 2, 20, 200]
13 MixUp ResNet50 65.5 (1.4) * 63.5 (0.9) * Olivia Wiles
Paper / Code June 16, 2022 lr: [0.01, 0.001, 0.0001]; alpha: [0.2, 0.5, 1.0]
14 CORAL DenseNet121 86.2 (1.4) 59.5 (7.7) WILDS
Paper / Code July 15, 2021
With unlabeled data

Unlabeled data is available from the source, validation and target domains.

Rank Algorithm Model Val Acc Test Acc ▼ Contact References Date Notes
1 ICON DenseNet121 90.1 (0.4) 93.8 (0.3) Nick Y.
Paper / Code November 03, 2022 Uses unlabeled data from the target domain. lr: 0.0027267495451878732*, dropout: [0.0*, 0.5].
2 SwAV DenseNet121 92.3 (0.4) 91.4 (2.0) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain. Uses color augmentation as part of RandAugment.
3 Noisy Student DenseNet121 93.2 (0.5) 86.7 (1.7) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain. Uses color augmentation as part of RandAugment.
4 AFN DenseNet121 91.1 (0.9) 83.2 (6.2) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain. Uses color augmentation as part of RandAugment.
5 CORAL DenseNet121 90.4 (0.9) 77.9 (6.6) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain. Uses color augmentation as part of RandAugment.
6 FixMatch DenseNet121 91.3 (1.1) 71.0 (4.9) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain. Uses color augmentation as part of RandAugment.
7 DANN DenseNet121 86.9 (2.2) 68.4 (9.2) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain. Uses color augmentation as part of RandAugment.
8 Pseudo-Label DenseNet121 91.3 (1.3) 67.7 (8.2) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain. Uses color augmentation as part of RandAugment.

RxRx1

Rank Algorithm Model Val Acc Test ID Acc Test Acc ▼ Contact References Date Notes
1 IID representation learning ResNet50 23.9 (0.3) 49.9 (0.5) 39.2 (0.2) Jiqing Wu
Paper / Code February 20, 2022 Uses CutMix regularizer.
2 ERM (CutMix) ResNet50 23.6 (0.3) 47.4 (1.0) 38.4 (0.2) Jiqing Wu
Paper / Code February 20, 2022 Uses CutMix regularizer.
3 LISA ResNet50 20.1 (0.4) 41.1 (1.3) 31.9 (1.0) Yu Wang
Paper / Code March 13, 2022 Uses all default hyperparameters except mix_alpha, which is tuned over [0.5, 2]; the final choice is 2.
4 ARM-BN ResNet50 20.9 (0.2) 34.9 (0.2) 31.2 (0.1) Marvin Zhang
Paper / Code April 19, 2022 Requires test data to be batched by groups
5 ERM (grid search) ResNet50 19.4 (0.2) 35.9 (0.4) 29.9 (0.4) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
6 CORAL ResNet50 18.5 (0.4) 34.0 (0.3) 28.4 (0.3) WILDS
Paper / Code July 15, 2021
7 Group DRO ResNet50 15.2 (0.1) 28.1 (0.3) 22.5 (0.3) WILDS
Paper / Code July 15, 2021
8 Test-time BN adaptation ResNet50 12.3 (0.2) 21.5 (0.2) 20.1 (0.2) Marvin Zhang
Paper / Code April 19, 2022 Requires test data to be batched by groups
9 IRM ResNet50 5.6 (0.4) 9.9 (1.4) 8.2 (1.1) WILDS
Paper / Code July 15, 2021

OGB-MolPCBA

Without unlabeled data
Rank Algorithm Model Val Avg Precision Test Avg Precision ▼ Contact References Date Notes
1 ERM (rand search) GIN-virtual 29.3 (0.3) 28.3 (0.1) WILDS
Paper / Code December 09, 2021 Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
2 ERM (grid search) GIN-virtual 27.8 (0.1) 27.2 (0.3) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
3 Group DRO GIN-virtual 23.1 (0.6) 22.4 (0.6) WILDS
Paper / Code July 15, 2021
4 CORAL GIN-virtual 18.4 (0.2) 17.9 (0.5) WILDS
Paper / Code July 15, 2021
5 IRM GIN-virtual 15.8 (0.2) 15.6 (0.3) WILDS
Paper / Code July 15, 2021
With unlabeled data

Unlabeled data is available from the source, validation and target domains.

Rank Algorithm Model Val Avg Precision Test Avg Precision ▼ Contact References Date Notes
1 ICON GIN-virtual 29.6 (0.0) 28.3 (0.0) Nick Y.
Paper / Code November 03, 2022 Uses unlabeled data from the target domain. lr: 0.000398452164375177*, dropout: 0.0*.
2 Noisy Student GIN-virtual 28.9 (0.1) 27.5 (0.1) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
3 CORAL GIN-virtual 27.0 (0.4) 26.6 (0.2) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
4 DANN GIN-virtual 20.7 (0.8) 20.4 (0.8) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
5 Pseudo-Label GIN-virtual 21.9 (0.6) 19.7 (0.1) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
6 AFN GIN-virtual 15.1 (1.3) 14.9 (1.3) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.

GlobalWheat

Without unlabeled data
Rank Algorithm Model Val ID Acc Val Acc Test ID Acc Test Acc ▼ Contact References Date Notes
1 ERM (grid search) Faster R-CNN 68.6 (0.4) 51.2 (1.8) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
2 ERM (rand search) Faster R-CNN 68.7 (0.9) 50.5 (1.7) WILDS
Paper / Code December 09, 2021 Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
3 Group DRO Faster R-CNN 66.2 (0.4) 47.9 (2.0) WILDS
Paper / Code July 15, 2021
With unlabeled data

Unlabeled data is available from the source, validation, target and extra domains.

Rank Algorithm Model Val ID Acc Val Acc Test ID Acc Test Acc ▼ Contact References Date Notes
1 ICON Faster R-CNN 68.9 (0.3) 52.3 (0.2) Nick Y.
Paper / Code November 03, 2022 Uses unlabeled data from the target domain. lr: 0.000001*.
2 Noisy Student Faster R-CNN 68.9 (0.4) 49.3 (3.7) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
3 Pseudo-Label Faster R-CNN 65.0 (0.7) 42.9 (2.3) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.

CivilComments

Without unlabeled data
Rank Algorithm Model Val Avg Acc Val Worst-Group Acc Test Avg Acc Test Worst-Group Acc ▼ Contact References Date Notes
1 Fish DistillBERT-base-uncased 88.9 (0.6) 70.5 (1.1) 89.3 (0.3) 75.3 (0.6) Yuge Shi
Paper / Code December 14, 2021
2 LISA DistillBERT-base-uncased 90.3 (0.3) 71.2 (0.9) 90.1 (0.3) 72.9 (1.0) Yu Wang
Paper / Code March 13, 2022 Uses all default hyperparameters except mix_alpha, which is tuned over [0.5, 2]; the final choice is 2.
3 DFR DistillBERT-base-uncased 87.5 (0.7) 71.7 (0.9) 87.3 (0.7) 72.5 (0.9) Pavel Izmailov
Paper / Code September 26, 2022 Uses the default WILDS training script for the base models, holding out 20% of the training set for last-layer retraining. The last layer is then retrained on the held-out data with two hyperparameters, regularization coefficient C: [1., 0.3*, 0.1, 0.07, 0.03, 0.01, 0.003] and bias vector correction t: [-0.15, -0.1*, -0.05, 0., 0.05, 0.1, 0.15], selected according to the best worst-group accuracy (WGA) on the validation set. All group identities are used for retraining the last layer. Each example is assigned a group [0-8] according to its attributes; if multiple attributes are present, the least common attribute (the one with the fewest occurrences) is used as the group for that example.
4 Group DRO DistillBERT-base-uncased 90.1 (0.4) 67.7 (1.8) 89.9 (0.5) 70.0 (2.0) WILDS
Paper / Code July 15, 2021 Groups data according to (label, Black).
5 Reweighted (Label) DistillBERT-base-uncased 90.1 (0.4) 65.9 (1.8) 89.8 (0.4) 69.2 (0.9) WILDS
Paper / Code July 15, 2021 Groups data according to labels.
6 Group DRO (Label) DistillBERT-base-uncased 90.4 (0.4) 65.0 (3.8) 90.2 (0.3) 69.1 (1.8) WILDS
Paper / Code July 15, 2021 Groups data according to labels.
7 CGD distilbert-base-uncased 89.6 (0.5) 68.3 (1.6) 89.6 (0.4) 69.1 (1.9) Vihari Piratla
Paper / Code April 10, 2022 CG step size: [0.005, 0.01, 0.05, 0.1]; 0.05 performed the best. LR, optimizer, decay rate etc. all set to default.
8 DFR (label, Black) DistillBERT-base-uncased 88.1 (1.1) 69.9 (1.0) 87.9 (1.2) 68.2 (2.3) Pavel Izmailov
Paper / Code September 26, 2022 Uses the default WILDS training script for the base models, holding out 20% of the training set for last-layer retraining. The last layer is then retrained on the held-out data with two hyperparameters, regularization coefficient C: [1., 0.3*, 0.1, 0.07, 0.03, 0.01, 0.003] and bias vector correction t: [-0.15, -0.1*, -0.05, 0., 0.05, 0.1, 0.15], selected according to the best worst-group accuracy (WGA) on the validation set. Only the (label, Black) attributes are used to form the groups for last-layer retraining.
9 IRM DistillBERT-base-uncased 89.0 (0.7) 65.9 (2.8) 88.8 (0.7) 66.3 (2.1) WILDS
Paper / Code July 15, 2021 Groups data according to (label, Black).
10 Reweighted (Label x Black) DistillBERT-base-uncased 89.6 (0.6) 66.6 (1.5) 89.2 (0.6) 66.2 (1.2) WILDS
Paper / Code July 15, 2021 Groups data according to (label, Black).
11 CORAL DistillBERT-base-uncased 88.9 (0.6) 64.7 (1.4) 88.7 (0.5) 65.6 (1.3) WILDS
Paper / Code July 15, 2021 Groups data according to (label, Black).
12 ERM (grid search) DistillBERT-base-uncased 92.3 (0.2) 50.5 (1.9) 92.2 (0.1) 56.0 (3.6) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
With unlabeled data

Unlabeled data is available from the extra domain.

Rank Algorithm Model Val Avg Acc Val Worst-Group Acc Test Avg Acc Test Worst-Group Acc ▼ Contact References Date Notes
1 ICON DistillBERT-base-uncased 89.9 (0.1) 66.4 (0.7) 89.7 (0.1) 68.8 (1.3) Nick Y.
Paper / Code November 03, 2022 Uses unlabeled data from the same distribution. lr: 7.324042204632364e-05*, dropout: 0.0*.
2 Pseudo-Label DistillBERT-base-uncased 90.5 (0.6) 63.9 (1.7) 90.3 (0.5) 66.9 (2.6) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the same distribution.
3 Masked LM DistillBERT-base-uncased 89.7 (1.1) 64.5 (2.5) 89.4 (1.2) 65.7 (2.3) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the same distribution.

FMoW

Without unlabeled data
Rank Algorithm Model Val Avg Acc Test Avg Acc Val Worst-region Acc Test Worst-region Acc ▼ Contact References Date Notes
1 SGD (Freeze-Embed) CLIP-ViT-L/14 @ 338px 73.7 (0.34) * 68.3 (0.42) * 60.0 (0.07) * 50.3 (1.1) * Ananya Kumar
Paper / Code October 18, 2022 lr: [3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
2 Model Soups (CLIP ViT-L) ViT-L 75.7 (0.07) * 69.5 (0.08) * 59.8 (0.43) * 47.6 (0.33) * Mitchell Wortsman
Paper / Code March 12, 2022 Model soups on top of a random hyperparameter search over LR, iterations, data augmentation, label smoothing.
3 ERM (CLIP ViT-L) ViT-L 73.6 (0.23) * 66.9 (0.17) * 59.5 (1.31) * 46.1 (0.59) * Mitchell Wortsman
Paper / Code July 28, 2022 Random hyperparameter search over LR, iterations, data augmentation, label smoothing.
4 DFR DenseNet121 68.4 (1.32) * 53.4 (0.44) * 64.3 (1.39) * 42.8 (0.42) * Pavel Izmailov
Paper / Code September 26, 2022 Uses the OOD validation set to retrain the last layer of the model. Trains the base model with standard ERM training scripts from the WILDS repo, and only tunes the regularization strength parameter C: [1.*, 0.3, 0.1, 0.07, 0.03, 0.01, 0.003] for last layer retraining.
5 ERM w/ data aug DenseNet121 62.1 (0.23) 55.5 (0.42) 53.2 (0.61) 35.7 (0.26) WILDS
Paper / Code December 09, 2021 Implements RandAugment. Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
6 LISA DenseNet121 58.7 (1.12) 52.8 (1.15) 48.7 (0.92) 35.5 (0.81) Yu Wang
Paper / Code March 13, 2022 Uses all default hyperparameters except mix_alpha, which is tuned over [0.5, 2]; the final choice is 2.
7 ERM se_resnext101_32x4d 62.1 (0.24) * 55.5 (0.14) * 51.3 (2.93) * 35.0 (0.78) * John Miller
Paper / Code July 15, 2021 Does not use the default model.
8 ERM (more checkpoints) DenseNet121 62.0 (0.06) 55.6 (0.23) 52.5 (1.25) 34.8 (1.9) Kazuki Irie
Paper / Code February 10, 2022 batch_size: [20*, 32, 64], lr: [1e-4, 3e-4*]. We conducted cross validation every 200 training steps.
9 Fish DenseNet121 57.8 (0.15) 51.8 (0.32) 49.5 (2.34) 34.6 (0.18) Yuge Shi
Paper / Code December 14, 2021
10 ERM (rand search) DenseNet121 60.6 (0.57) 54.0 (0.4) 52.6 (0.25) 34.1 (1.42) WILDS
Paper / Code December 09, 2021 Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
11 IRM DenseNet121 56.1 (0.61) 50.4 (0.75) 49.7 (0.97) 32.8 (2.09) WILDS
Paper / Code July 15, 2021
12 CORAL DenseNet121 56.5 (0.15) 50.1 (0.07) 48.9 (1.31) 32.8 (0.66) WILDS
Paper / Code July 15, 2021
13 CGD DenseNet121 57.0 (1.03) 50.6 (1.39) 49.8 (1.04) 32.0 (2.26) Vihari Piratla
Paper / Code April 10, 2022 CG step size: [0.05, 0.01, 0.2]; 0.2 performed the best. LR, optimizer, decay rate etc. all set to default.
14 ERM (grid search) DenseNet121 59.2 (0.07) 52.7 (0.23) 49.8 (0.36) 31.3 (0.17) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
15 Group DRO DenseNet121 57.6 (0.7) 51.2 (0.38) 49.4 (0.45) 31.1 (1.66) WILDS
Paper / Code July 15, 2021
16 Test-time BN adaptation DenseNet121 57.9 (0.36) 51.5 (0.25) 47.8 (0.52) 30.0 (0.23) Marvin Zhang
Paper / Code April 19, 2022 Requires test data to be batched by groups
17 ARM-BN DenseNet121 48.0 (0.65) 42.1 (0.26) 38.9 (2.17) 24.4 (0.54) Marvin Zhang
Paper / Code April 19, 2022 Requires test data to be batched by groups
With unlabeled data

Unlabeled data is available from the source, validation and target domains.

Rank Algorithm Model Val Avg Acc Test Avg Acc Val Worst-region Acc Test Worst-region Acc ▼ Contact References Date Notes
1 ICON DenseNet121 64.4 (0.18) 58.5 (0.16) 55.6 (0.44) 39.9 (1.12) Nick Y.
Paper / Code November 03, 2022 Uses unlabeled data from the target domain. lr: 0.00013443151989778619*, dropout: 0.5*.
2 AFN DenseNet121 61.7 (0.49) 55.6 (0.23) 53.4 (0.78) 38.3 (0.95) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
3 Noisy Student DenseNet121 64.0 (0.37) 58.4 (0.4) 55.4 (0.47) 37.8 (0.62) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
4 SwAV DenseNet121 63.1 (0.38) 56.3 (0.67) 51.6 (0.57) 36.3 (1.01) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
5 DANN DenseNet121 59.5 (0.45) 53.0 (0.58) 50.8 (2.18) 34.6 (1.71) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
6 CORAL DenseNet121 60.1 (0.56) 53.3 (0.61) 51.7 (1.23) 33.7 (0.23) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
7 Pseudo-Label DenseNet121 62.5 (0.08) 55.6 (0.2) 51.5 (0.52) 33.7 (0.24) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
8 FixMatch DenseNet121 58.9 (2.07) 52.5 (1.86) 50.8 (1.11) 32.6 (2.05) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.

PovertyMap

Without unlabeled data
Rank Algorithm Model Val Pearson r Test Pearson r Val Worst-U/R Pearson r Test Worst-U/R Pearson r ▼ Contact References Date Notes
1 ERM (rand search) ResNet18-MS 0.81 (0.03) 0.8 (0.04) 0.53 (0.06) 0.5 (0.07) WILDS
Paper / Code December 09, 2021 Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
2 ERM w/ data aug ResNet18-MS 0.81 (0.03) 0.79 (0.04) 0.54 (0.06) 0.49 (0.06) WILDS
Paper / Code December 09, 2021 Implements composition of random horizontal flip, random affine transformation, color jitter on the RGB channels, and Cutout on all channels. Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
3 ERM (grid search) ResNet18-MS 0.8 (0.04) 0.78 (0.04) 0.51 (0.06) 0.45 (0.06) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
4 CORAL ResNet18-MS 0.8 (0.04) 0.78 (0.05) 0.51 (0.06) 0.44 (0.07) WILDS
Paper / Code July 15, 2021
5 IRM ResNet18-MS 0.81 (0.03) 0.77 (0.05) 0.53 (0.05) 0.43 (0.07) WILDS
Paper / Code July 15, 2021
6 CGD ResNet18-MS 0.81 (0.03) 0.77 (0.04) 0.51 (0.05) 0.43 (0.04) Vihari Piratla
Paper / Code April 10, 2022 No hyperparameter search; CG step size: 0.05, LR, optimizer, decay rate etc. all set to default.
7 Group DRO ResNet18-MS 0.78 (0.05) 0.75 (0.07) 0.46 (0.04) 0.39 (0.06) WILDS
Paper / Code July 15, 2021
With unlabeled data

Unlabeled data is available from the source, validation and target domains.

Rank Algorithm Model Val Pearson r Test Pearson r Val Worst-U/R Pearson r Test Worst-U/R Pearson r ▼ Contact References Date Notes
1 ICON ResNet18-MS 0.8 (0.04) 0.77 (0.04) 0.52 (0.08) 0.49 (0.04) Nick Y.
Paper / Code November 03, 2022 Uses unlabeled data from the target domain. lr: 0.0009738391232813829*, dropout: 0.5*.
2 SwAV ResNet18-MS 0.81 (0.05) 0.78 (0.06) 0.54 (0.07) 0.45 (0.05) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
3 Noisy Student ResNet18-MS 0.8 (0.05) 0.76 (0.08) 0.52 (0.08) 0.42 (0.11) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
4 AFN ResNet18-MS 0.76 (0.05) 0.75 (0.08) 0.44 (0.07) 0.39 (0.08) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
5 CORAL ResNet18-MS 0.79 (0.04) 0.74 (0.05) 0.5 (0.09) 0.36 (0.08) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
6 DANN ResNet18-MS 0.77 (0.04) 0.69 (0.04) 0.44 (0.11) 0.33 (0.1) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
7 FixMatch ResNet18-MS 0.76 (0.07) 0.64 (0.11) 0.48 (0.05) 0.3 (0.11) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.

Amazon

Without unlabeled data
Rank Algorithm Model Val Avg Acc Test Avg Acc Val 10% Acc Test 10% Acc ▼ Contact References Date Notes
1 LISA DistillBERT-base-uncased 71.4 (0.4) 70.7 (0.3) 54.8 (0.2) 54.7 (0.0) Yu Wang
Paper / Code March 13, 2022 Uses all default hyperparameters except mix_alpha, which is tuned over [0.5, 2]; the final choice is 2.
2 ERM (rand search) DistillBERT-base-uncased 72.8 (0.1) 72.0 (0.1) 56.0 (0.0) 54.2 (0.8) WILDS
Paper / Code December 09, 2021 Hyperparameters tuned via random search, consistent with the WILDS unlabeled data baselines.
3 ERM (grid search) DistillBERT-base-uncased 72.7 (0.1) 71.9 (0.1) 55.2 (0.7) 53.8 (0.8) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
4 Group DRO DistillBERT-base-uncased 70.7 (0.6) 70.0 (0.5) 54.7 (0.0) 53.3 (0.0) WILDS
Paper / Code July 15, 2021
5 Fish DistillBERT-base-uncased 72.5 (0.0) 71.7 (0.1) 54.2 (0.8) 53.3 (0.0) Yuge Shi
Paper / Code December 14, 2021
6 CORAL DistillBERT-base-uncased 72.0 (0.3) 71.1 (0.3) 54.7 (0.0) 52.9 (0.8) WILDS
Paper / Code July 15, 2021
7 IRM DistillBERT-base-uncased 71.3 (0.5) 70.3 (0.6) 54.2 (0.8) 52.4 (0.8) WILDS
Paper / Code July 15, 2021
8 Reweighted (Label) DistillBERT-base-uncased 68.9 (0.9) 68.3 (0.9) 52.1 (0.2) 51.6 (0.8) WILDS
Paper / Code July 15, 2021
With unlabeled data

Unlabeled data is available from the validation, target and extra domains.

Rank Algorithm Model Val Avg Acc Test Avg Acc Val 10% Acc Test 10% Acc ▼ Contact References Date Notes
1 ICON DistillBERT-base-uncased 72.7 (0.2) 71.9 (0.1) 55.2 (0.7) 54.7 (0.0) Nick Y.
Paper / Code November 03, 2022 Uses unlabeled data from the target domain. lr: 1.5581425972502133e-05*, self_training_lambda: 1*, self_training_threshold: 0.7681640736450283*.
2 AFN DistillBERT-base-uncased 73.0 (0.4) 72.1 (0.3) 56.0 (0.0) 54.2 (0.8) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
3 Masked LM DistillBERT-base-uncased 72.6 (0.5) 71.7 (0.4) 55.1 (0.8) 53.5 (0.2) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
4 CORAL DistillBERT-base-uncased 72.5 (0.1) 71.7 (0.1) 54.2 (0.8) 53.3 (0.0) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
5 DANN DistillBERT-base-uncased 72.6 (0.1) 71.7 (0.1) 54.7 (0.0) 53.3 (0.0) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.
6 Pseudo-Label DistillBERT-base-uncased 72.5 (0.1) 71.6 (0.1) 54.2 (0.8) 52.3 (1.1) WILDS
Paper / Code December 09, 2021 Uses unlabeled data from the target domain.

Py150

Rank Algorithm Model Test ID Method/Class Acc Test ID All Acc Test OOD Method/Class Acc ▼ Test OOD All Acc Contact References Date Notes
1 ERM (grid search) CodeGPT 75.4 (0.4) 74.5 (0.4) 67.9 (0.1) 69.6 (0.1) WILDS
Paper / Code July 15, 2021 Hyperparameters tuned via grid search, consistent with other WILDS baselines that do not use unlabeled data.
2 Group DRO CodeGPT 70.8 (0.0) 71.0 (0.0) 66.0 (0.1) 67.9 (0.0) WILDS
Paper / Code July 15, 2021
3 CORAL CodeGPT 70.6 (0.0) 70.8 (0.1) 65.9 (0.1) 67.9 (0.0) WILDS
Paper / Code July 15, 2021
4 IRM CodeGPT 67.3 (1.1) 68.3 (0.7) 64.3 (0.2) 66.4 (0.1) WILDS
Paper / Code July 15, 2021