Reproducing results from papers #21

Open
orionw opened this issue Jun 9, 2023 · 7 comments

orionw commented Jun 9, 2023

Hi there! Great work - this is a very interesting line of research!

I was hoping to replicate your results on BEIR but seem to be having some trouble. For example, in both the InPars v1 and v2 papers you mention using a learning rate of 1e-3, but I can't find any example scripts that use it (in legacy or otherwise, they seem to use 3e-4). When I use the hyperparameters from the papers (or the default example), I get much worse results.

I'm sure it's just some config I'm missing from the papers/code, but if you happen to have the commands that reproduce the numbers in the paper, I'd really appreciate it!

Thanks for your time!

@lhbonifacio (Collaborator)

Hey @orionw
Thank you for your interest in our work!
Could you give us more information about how you are trying to replicate the results? (the dataset you are using, whether you are generating new synthetic data or using the data we made available, whether you are fine-tuning/evaluating on TPU or GPU, etc.)
And regarding the learning rate, we used 3e-4 (we will correct this in the paper).

Moreover, we are about to release a reproduction paper of InPars with further details on how to reproduce the results.

Thank you!

orionw (Author) commented Jun 21, 2023

Thanks for the reply @lhbonifacio!

I've tried a couple of datasets (SciFact, SciDocs) but can't reproduce the results. I'm using GPUs and the code in inpars, not in legacy. I am generating new questions using Hugging Face models (not the available InPars v1 questions; I haven't seen the InPars v2 generated questions made publicly available).
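
(For concreteness, here is a minimal sketch of this kind of few-shot question generation with a Hugging Face causal LM; the model name, prompt, and decoding settings are illustrative placeholders, not the InPars prompt or generator.)

```python
# Minimal sketch of few-shot synthetic query generation with a Hugging Face
# causal LM. The model name, prompt, and decoding settings are placeholders,
# not the InPars prompt or generator.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"  # placeholder open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

document = "Vitamin C is an essential nutrient involved in tissue repair."
prompt = (
    "Example 1:\n"
    "Document: Aspirin is commonly used for short-term relief of headaches.\n"
    "Relevant Query: does aspirin help with headaches\n\n"
    "Example 2:\n"
    f"Document: {document}\n"
    "Relevant Query:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated text and take its first line as the query.
generated = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
query = generated.strip().splitlines()[0] if generated.strip() else ""
print(query)
```

In practice this would be run over many sampled corpus documents and followed by a filtering step before fine-tuning; the sketch only shows the shape of a single generation call.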

I've tried several learning rates (including 3e-4) and optimizers, but for both datasets any amount of re-ranker fine-tuning on the synthetic data makes performance worse than just using castorini/monot5-3b-msmarco-10k without fine-tuning (and worse than reported in the paper).

If you have the fine-tuning hyperparameters for any of the BEIR runs, that would be great (optimizer, learning rate, scheduler, steps, etc.).

Obviously there will be some randomness in the generated questions and in training, but I was hoping to minimize differences due to the training setup.
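
(For reference, a minimal, hypothetical sketch of the kind of fine-tuning setup being asked about, written against the Hugging Face Trainer API. The 3e-4 learning rate is the value confirmed above; the optimizer, scheduler, batch size, and step count are placeholders, not the authors' settings.)

```python
# Hedged sketch only: a Trainer-based monoT5-style re-ranker fine-tuning setup.
# Aside from the 3e-4 learning rate confirmed in this thread, every
# hyperparameter below is a placeholder, not a setting confirmed by the authors.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# The 3B checkpoint mentioned in the thread; a smaller monoT5 checkpoint can be
# substituted for quick tests.
model_name = "castorini/monot5-3b-msmarco-10k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def to_features(example):
    # monoT5-style formatting: the model is trained to emit "true" for relevant
    # query-passage pairs and "false" for sampled negatives.
    enc = tokenizer(
        f"Query: {example['query']} Document: {example['passage']} Relevant:",
        truncation=True,
        max_length=512,
    )
    enc["labels"] = tokenizer("true" if example["relevant"] else "false").input_ids
    return enc

# Toy triples standing in for the synthetic data; in practice these come from
# the generation and filtering steps plus negative sampling.
triples = [
    {"query": "does vitamin c aid tissue repair",
     "passage": "Vitamin C is an essential nutrient involved in tissue repair.",
     "relevant": True},
    {"query": "does vitamin c aid tissue repair",
     "passage": "The Eiffel Tower was completed in 1889.",
     "relevant": False},
]
train_dataset = Dataset.from_list(triples).map(
    to_features, remove_columns=["query", "passage", "relevant"]
)

args = Seq2SeqTrainingArguments(
    output_dir="monot5-inpars-ft",
    learning_rate=3e-4,            # value confirmed earlier in this thread
    lr_scheduler_type="constant",  # assumption, not a confirmed setting
    per_device_train_batch_size=2, # placeholder
    max_steps=10,                  # placeholder for illustration only
    logging_steps=1,
)

# Trainer defaults to AdamW; the optimizer used in the papers is not confirmed here.
Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
).train()
```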

cramraj8 commented Nov 7, 2023

Hi @orionw, I wonder whether you fine-tuned from the castorini/monot5-3b-msmarco-10k checkpoint or from the t5-base checkpoint. Any luck sorting this out?

orionw (Author) commented Nov 7, 2023

Hi @cramraj8! I didn't use t5-base, but I don't think they did either? I never did sort it out and moved on from this, as it didn't seem like the reproduction details would be released soon.

If they do (or if you have time to figure it out), I would love to see the results become reproducible.

cramraj8 commented Nov 9, 2023

@orionw Got it. I tried generating unsupervised data with other tools, and with all of them the performance seems to drop in some cases.

cramraj8 commented Apr 8, 2024

Hi @orionw, I found out the reason behind the performance drop and proposed an effective solution in my recent NAACL paper. You can find it here: https://arxiv.org/pdf/2404.02489.pdf

orionw (Author) commented Apr 8, 2024

Awesome @cramraj8! Thank you, I'm very excited to read the paper 🙏
