Skip to content

Tickets/sp 1771 #21

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Tickets/sp 1771 #21

wants to merge 13 commits into from

Conversation

beckynevin
Copy link

I'm submitting this for PR because my last day is Friday (5/9). However, there's still work to be done here. Namely, I'm realizing that the kron fluxes are probably best interpreted in logarithms. I'm planning on looking into how changing this analysis to logscale would affect the fit in the next couple of days.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@beckynevin
Copy link
Author

beckynevin commented May 9, 2025

Okay final commits (from me are being submitted). To dos for the next person to take over this project:

  • Explore whether log scaling is necessary. I'm just not that familiar with fitting cron fluxes, where the scaling is in nJy.
  • For the first fit, I show an example of scaling the y variable as well as the x. I'm not sure this is necessary, but might be necessary to show how to use the inverse transform.
  • One potential modification to the SQL: restricting the S/N as in Christina's notebook (number 19)
  • Make sure figure captions are up to snuff
  • Being more careful with variable names so they are not reused between models
  • Add a warning about expected time to wait for the grid search to complete (I haven't yet tested this).
  • Currently the MSE for the multiple linear regression is better than the random forest regressor - figure out why this is and if it remains so, change the grid search to iterate on the MLR instead.

Good luck! Notes are bolded (covering the above) and left throughout the notebook.

@beckynevin
Copy link
Author

These to dos are also added to the JIRA ticket.

@MelissaGraham
Copy link
Contributor

Thank you so much for opening this PR and getting it ready for transfer, especially that list, looks very useful thanks!!

@bnord
Copy link

bnord commented May 18, 2025

review by @bnord for additional edits he will make

  • general
    • make sure we're using sentence case, not title case everywhere
    • where necessary, add appropriate emph to proper nouns, like pandas and seaborn
    • make sure exercises for learners are declarative instructions, not questions
    • make sure exercises are tested; include a hint for the exercises
    • convert some calculations and plots to functions, but keep them right before they are used.
    • add more exercises
    • remove colloquialisms and informal language.
    • change questions to declarative instructions
    • set plot sizes for for all plot sizes fig = plt.figure(figsize=(6, 4))
    • use the required seaborn colorblind choice: plt.style.use('seaborn-v0_8-colorblind')
    • make python variables a little more human-readable
    • Add figure numbers for all figures
  • section 1
    • add distinct Related tutorials before 1.1
    • clarify and link to tutorials
    • rename package imports to "Import Packages"
    • rename to "Define parameters"
    • change indented quotations to just quotations (rtn045); only use indents for warnings and figures.
    • when to use embedded images instead images created by the notebook cell processing?
    • move the tap query start closer to where it is used
    rsp_tap = get_tap_service("tap")
    job = rsp_tap.submit_job(query)
    job.run()
    job.wait(phases=['COMPLETED', 'ERROR'])
    print('Job phase is', job.phase)
    assert job.phase == 'COMPLETED'
    
  • section 2
    • change title --> Query for Kron fluxes of galaxies
    • give a journal article and a web page citation for Kron flux
    • update grammar in first paragraphs
    • These will be used in Sections 2 and 3.
    • Clean up name of the tutorial notebook;
    • no reason to print the query
  • section 3
    • link to pandas docs
    • move the quotation out of indentation format
    • remove display results
    • is type(results) useful? probably not?
    • Are the pandas instructions in and of themselves useful? they should probably be tailored to a data science task.
    • consider more discussion on how to review and understand data used in a data science study. make this relevant to scikit learn. how can pandas be used from a data science perspective?
  • section 4
    • state the goals of visualizing information
    • consider drawing on nord's ai tutorial and its visualizations
    • explain the unique value (pros/cons) of a box plot for data science for astro)
    • draw conclusions from the box plots
    • explain the unique value (pros/cons) of a violin plot for data science for astro)
    • draw conclusions from the violin plots
    • probably a few additional histograms and correlation plots
  • section 5
    • probably a better term than 'clean' what are we trying to do?
    • provide more introduction to the section and why we're doing it
    • discuss how both tabular data and visualizations help plan and inform how to edit the data.
    • why do predictive work based on g and r band flux correlations?
    • remove nans first and make it more matter of fact
    • use assert for nan removal
    • remove zeroes as a matter of fact; use assert
    • make subsections for nan and zero removal
  • section 6
    • cleaning data is actually a part of preparing the data set, so we'll subsume sec 5 into sec 6
    • the goal of the notebook should be stated early on, not in section 6.
    • Is predicting one flux based on another band the appropriate goal?
    • perhaps this should be star-galaxy separation based on
    • visualizations may also indicate how to 'clean the data'
    • change xtraintt variable name to be more human readable
    • use assert for checking the inverse function?
    • there needs to also be a validation data set
  • section 7
    • Change title of section to be declarative instructive
    • write an explanation for the section
    • Explain linear regression; include some math and a reference
    • add figure numbers for figures
    • explain why we're testing in transformed and un-transformed spaces
    • separate the cell on data prep, fitting, predicting and plotting.
    • make the 1-1 plot a function
    • explain why random forest is useful.
    • why is mse worse for random forest than for linear regression?
  • section 8
    • report expected training times
  • section 9
    • more concise section title
    • why are we doing this section? not sure yet
  • section 10
    • either perform a study with the other model classes or just mention. not useful to do the in-between thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants