Prompts for adjunct_island subset of blimp dataset #736
Conversation
Thanks for the PR!
- A bunch of prompts just hardcode the target correct answer as, e.g., Sentence 1, even when Sentence 2 may be the correct one. In general it's bad to hardcode it like this; you should use the "random" filter in Jinja. Something like this:
{% set shuffled_order = [0, 1] | random %}
Which one of the following sentences is grammatical? Please answer A or B.
{% if shuffled_order == 0 %}
A: {{ sentence_good }}
B: {{ sentence_bad }}
{% else %}
A: {{ sentence_bad }}
B: {{ sentence_good }}
{% endif %}
|||
{% if shuffled_order == 0 %}
A
{% else %}
B
{% endif %}
- Prompts like "adjunct_bad_first" expect models to answer "Sentence 1". If so, you need to tell models to "answer with Sentence 1 or Sentence 2". Or better yet, just ask models to reply with a single token, "1"/"2" or "A"/"B" (a minimal sketch follows this list). Be sure to update the answer choices accordingly.
- Why are the "A/B", "B/A" prompts not marked as original task?
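For illustration, here's a minimal sketch of the single-token style, adapting the shuffled-order pattern above; the question wording is just a placeholder, and the template's answer choices would need to list the same tokens (e.g. 1 ||| 2) to match:
{# pick whether the grammatical sentence is shown first or second #}
{% set shuffled_order = [0, 1] | random %}
Which of the following two sentences is grammatical? Reply with a single token: 1 or 2.
{% if shuffled_order == 0 %}
1: {{ sentence_good }}
2: {{ sentence_bad }}
{% else %}
1: {{ sentence_bad }}
2: {{ sentence_good }}
{% endif %}
|||
{# the target must agree with the order chosen above #}
{% if shuffled_order == 0 %}
1
{% else %}
2
{% endif %}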
Thank you! There are some duplicate ones:
Also there is an extra space in:
{% if shuffled_order == 0 %}
Sentence A: {{ sentence_good }}
Extra space after "Sentence A:".
Good catch, fixed! Thanks!
I think the choice ordering is nontrivial (speaking from experience; I don't have a source to cite). I've been running local tests with T0-3B on anaphor_number_agreement, and here are some results, accuracy (%) over 1000 examples: A-B order (spacing fixed) = 55.7%. This is what motivates the minimally different prompts, but if you feel strongly about removing the order-swapped versions I can remove them.
Oh wow, great to know about these order differences! I trust that you will pay extra attention when you analyze the variance of these minimal prompts and when we plot scatters like Fig 4 in the T0 paper, as we discussed before.
Well, I guess some part of me still thinks it's a bit irresponsible to ignore this variance, so here's a happy medium: randomizing the presentation order of in-prompt options within a single prompt :)
…der swapped prompts.
Well done! Thanks Najoung and Urmish! I didn't check all 50-something subsets of Blimp, but I trust that you programmatically copied the identical prompts?
Yep used
Note - Still WIP, but would like a quick review to see if the choices being made are correct.
Choices -