
Prompts for adjunct_island subset of blimp dataset #736


Merged
merged 16 commits into bigscience-workshop:eval-hackathon on Apr 28, 2022

Conversation

Urmish
Contributor

@Urmish Urmish commented Apr 25, 2022

Note: still WIP, but I would like a quick review to check that the choices being made are correct.

Choices:

  1. Metric: accuracy. I avoided BLEU or ROUGE scores, since the good and bad sentences are just permutations of each other, and I wasn't sure BLEU/ROUGE-style metrics would capture the goal of this subset under prompt-style evaluation.
  2. Choices in template: to enable an accuracy-style metric, I had to add answer choices to the template (see the sketch below).
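
For illustration, a choice-based template of this kind might look roughly like the sketch below. This is only a sketch, not the exact prompts in the PR; it assumes the HF blimp fields sentence_good and sentence_bad, promptsource's ||| prompt/target separator, and an answer_choices field declared alongside the template.

answer_choices: Sentence 1 ||| Sentence 2

Sentence 1: {{ sentence_good }}
Sentence 2: {{ sentence_bad }}
Which of the two sentences is grammatically acceptable?
|||
Sentence 1

With fixed answer choices like this, evaluation reduces to checking whether the model picks the correct choice, which is why accuracy is a natural metric here.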

@Urmish Urmish marked this pull request as ready for review April 25, 2022 20:11
@awebson awebson self-assigned this Apr 25, 2022
@awebson awebson self-requested a review April 25, 2022 23:23
Contributor

@awebson awebson left a comment


Thanks for the PR!

  1. A bunch of prompts just hardcode the target correct answer as, say, Sentence 1, even when Sentence 2 may be the correct answer. In general it's bad to hardcode it like this; you should use random in Jinja. Something like this:
{% set shuffled_order = [0, 1] | random %}
Which one of the following sentences is grammatical? Please answer A or B.
{% if shuffled_order == 0 %}
A: {{ sentence_good }}
B: {{ sentence_bad }}
{% else %}
A: {{ sentence_bad }}
B: {{ sentence_good }}
{% endif %}
|||
{% if shuffled_order == 0 %}
A
{% else %}
B
{% endif %}
  2. Prompts like "adjunct_bad_first" expect models to answer "Sentence 1". If so, you need to tell models to "answer with Sentence 1 or Sentence 2". Or better yet, just ask models to reply with a single token, "1"/"2" or "A"/"B". Be sure to update the answer choices accordingly (see the sketch after this list).
  3. Why are the "A/B", "B/A" prompts not marked as original task?
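
For point 2, one concrete shape this could take is sketched below; the answer_choices line and the exact wording are illustrative assumptions rather than part of this PR:

answer_choices: Sentence 1 ||| Sentence 2

{% set shuffled_order = [0, 1] | random %}
Sentence 1: {{ sentence_good if shuffled_order == 0 else sentence_bad }}
Sentence 2: {{ sentence_bad if shuffled_order == 0 else sentence_good }}
Which of the two sentences is grammatical? Answer with "Sentence 1" or "Sentence 2".
|||
{% if shuffled_order == 0 %}Sentence 1{% else %}Sentence 2{% endif %}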

@awebson
Contributor

awebson commented Apr 26, 2022

Thank you! There are some duplicate ones:

  • A/B choice randomized (choice order: B-A) vs. order: A-B
  • Some prompts just swap "Yes and No" with "No and Yes". Let's just keep the "Yes and No" ones

Also there is an extra space in Sentence A: {{ sentence_good }}


{% if shuffled_order == 0 %}

Sentence A: {{ sentence_good }}

extra space after Sentence A:


Good catch, fixed! Thanks!

@najoungkim

I think the choice ordering effect is nontrivial (speaking from experience; I don't have a source to cite). I've been running local tests with T0-3B on anaphor_number_agreement, and here are some results:

Accuracy (%), 1000 examples:
Yes-No order for good sentences: 46.2%
No-Yes order for good sentences: 50.3%

A-B order (spacing fixed): 55.7%
B-A order (spacing fixed): 48.1%

This is what motivates the minimally different prompts, but if you feel strongly about removing the order-swapped versions, I can remove them.

@jzf2101 jzf2101 changed the base branch from main to eval-hackathon April 26, 2022 23:24
@awebson
Contributor

awebson commented Apr 27, 2022

Oh wow, great to know about these order differences! I trust that you will pay extra attention when you analyze the variance of these minimal prompts and when we plot the scatter plots like Fig. 4 in the T0 paper, as we discussed before.

@najoungkim

Well, I guess some part of me still thinks it's a bit irresponsible to ignore this variance, so here's a happy medium: randomizing the presentation order of the in-prompt options within a single prompt :)
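
Concretely, for the Yes/No-style prompts that within-prompt randomization might look something like the sketch below (an illustrative assumption about the final template, shown here for the sentence_good case with Yes ||| No answer choices):

answer_choices: Yes ||| No

{{ sentence_good }}
{% set flip = [0, 1] | random %}
Is this sentence grammatically acceptable? Answer {% if flip == 0 %}Yes or No{% else %}No or Yes{% endif %}.
|||
Yes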

@awebson
Contributor

awebson commented Apr 28, 2022

Well done! Thanks Najoung and Urmish!

I didn't check all 50-something subsets of Blimp but I trust that you programmatically copied the identical prompts?

@awebson awebson merged commit b99bfc2 into bigscience-workshop:eval-hackathon Apr 28, 2022
@najoungkim

Yep, used the copy_templates.py you shared!
