Find a new regression dataset #938

Open
artidoro opened this issue Sep 18, 2018 · 5 comments
Labels
P2 (Priority of the issue for triage purpose: Needs to be fixed at some point.), test (related to tests)

Comments

@artidoro
Contributor

Some regression tests rely on a machine-generated regression dataset (Gaussian noise added to a linear function of a vector input). The file was introduced by #937.

We should replace this dataset with a real one. @justinormont suggested finding something on data.gov, for example predicting SF employee pay: https://catalog.data.gov/dataset/employee-compensation-53987
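
For context, a synthetic dataset of the kind described above (a linear function of a vector input plus Gaussian noise) can be sketched as follows. This is only an illustration, not the generator used for the file introduced by #937: the function name, its parameters, and the uniform feature distribution are all assumptions.

```python
import random


def make_linear_dataset(n_rows, weights, bias=0.0, noise_std=1.0, seed=0):
    """Return (features, labels) with label = w . x + bias + Gaussian noise.

    Hypothetical helper: the feature range and parameter names are
    assumptions made for illustration only.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    features, labels = [], []
    for _ in range(n_rows):
        # One feature per weight, drawn uniformly from [-1, 1].
        x = [rng.uniform(-1.0, 1.0) for _ in weights]
        # Linear signal plus zero-mean Gaussian noise.
        y = sum(w * xi for w, xi in zip(weights, x)) + bias
        y += rng.gauss(0.0, noise_std)
        features.append(x)
        labels.append(y)
    return features, labels


features, labels = make_linear_dataset(
    n_rows=5, weights=[2.0, -1.0, 0.5], noise_std=0.1
)
```

Because the labels are generated from an exactly linear model, a test built on such data mostly checks that a trainer can recover known weights, which is why a dataset closer to real user data is being requested here.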

@wschin
Member

wschin commented Sep 19, 2018

The LIBSVM datasets are also commonly used in research.

@rogancarr
Contributor

We have the following datasets that can be used for regression:

  • housing
  • taxi-fare

The following can be reformulated as regression problems:

  • adult (predicting age from all the other variables)
  • breast-cancer (predict any feature)
  • iris (predict any feature)
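
As a rough sketch of the reformulation suggested in the list above, any numeric column can be pulled out as the regression label while the remaining columns become the features. The helper name, the column names, and the sample rows below are hypothetical and only illustrate the idea; they are not taken from the test data files in this repository. The sketch also assumes every column is numeric, so a dataset with categorical columns (such as adult) would need encoding first.

```python
def as_regression(rows, label_column):
    """Split dict-style rows into (features, labels) for regression.

    Hypothetical helper: treats `label_column` as the target and every
    other column as a numeric feature.
    """
    features, labels = [], []
    for row in rows:
        labels.append(float(row[label_column]))
        features.append(
            [float(value) for name, value in row.items() if name != label_column]
        )
    return features, labels


# Illustrative iris-style rows (assumed column names, values as strings,
# as they would arrive from a CSV reader).
rows = [
    {"sepal_length": "5.1", "sepal_width": "3.5", "petal_length": "1.4", "petal_width": "0.2"},
    {"sepal_length": "4.9", "sepal_width": "3.0", "petal_length": "1.4", "petal_width": "0.2"},
]

# Predict petal width from the other three measurements.
features, labels = as_regression(rows, "petal_width")
```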

@codemzs
Member

codemzs commented Jun 30, 2019

Rogan seems to have answered this question.

@codemzs closed this as completed on Jun 30, 2019

@justinormont
Contributor

justinormont commented Jul 2, 2019

The work item is to replace the synthetic datasets with ones more representative of user datasets. Rogan has pointed out great ones we can use as replacements in our tests.

@justinormont reopened this on Jul 2, 2019

@codemzs
Member

codemzs commented Jul 2, 2019

@justinormont The ones that Rogan pointed out are real datasets; the breast-cancer dataset is from 1992.

@mstfbl added the P2 and test labels on Jan 9, 2020