Tickets/sp 1771 #21

beckynevin · 2025-05-06T21:57:17Z

I'm submitting this for PR because my last day is Friday (5/9). However, there's still work to be done here. Namely, I'm realizing that the kron fluxes are probably best interpreted in logarithms. I'm planning on looking into how changing this analysis to logscale would affect the fit in the next couple of days.

review-notebook-app · 2025-05-06T21:57:22Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

…ust doing standard scaler

beckynevin · 2025-05-09T16:09:45Z

Okay final commits (from me are being submitted). To dos for the next person to take over this project:

Explore whether log scaling is necessary. I'm just not that familiar with fitting cron fluxes, where the scaling is in nJy.
For the first fit, I show an example of scaling the y variable as well as the x. I'm not sure this is necessary, but might be necessary to show how to use the inverse transform.
One potential modification to the SQL: restricting the S/N as in Christina's notebook (number 19)
Make sure figure captions are up to snuff
Being more careful with variable names so they are not reused between models
Add a warning about expected time to wait for the grid search to complete (I haven't yet tested this).
Currently the MSE for the multiple linear regression is better than the random forest regressor - figure out why this is and if it remains so, change the grid search to iterate on the MLR instead.

Good luck! Notes are bolded (covering the above) and left throughout the notebook.

beckynevin · 2025-05-09T16:09:59Z

These to dos are also added to the JIRA ticket.

MelissaGraham · 2025-05-09T16:26:33Z

Thank you so much for opening this PR and getting it ready for transfer, especially that list, looks very useful thanks!!

bnord · 2025-05-18T19:21:19Z

review by @bnord for additional edits he will make

general
- make sure we're using sentence case, not title case everywhere
- where necessary, add appropriate emph to proper nouns, like pandas and seaborn
- make sure exercises for learners are declarative instructions, not questions
- make sure exercises are tested; include a hint for the exercises
- convert some calculations and plots to functions, but keep them right before they are used.
- add more exercises
- remove colloquialisms and informal language.
- change questions to declarative instructions
- set plot sizes for for all plot sizes fig = plt.figure(figsize=(6, 4))
- use the required seaborn colorblind choice: plt.style.use('seaborn-v0_8-colorblind')
- make python variables a little more human-readable
- Add figure numbers for all figures
section 1
- add distinct Related tutorials before 1.1
- clarify and link to tutorials
- rename package imports to "Import Packages"
- rename to "Define parameters"
- change indented quotations to just quotations (rtn045); only use indents for warnings and figures.
- when to use embedded images instead images created by the notebook cell processing?
- move the tap query start closer to where it is used
```
rsp_tap = get_tap_service("tap")
job = rsp_tap.submit_job(query)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)
assert job.phase == 'COMPLETED'
```
section 2
- change title --> Query for Kron fluxes of galaxies
- give a journal article and a web page citation for Kron flux
- update grammar in first paragraphs
- These will be used in Sections 2 and 3.
- Clean up name of the tutorial notebook;
- no reason to print the query
section 3
- link to pandas docs
- move the quotation out of indentation format
- remove display results
- is type(results) useful? probably not?
- Are the pandas instructions in and of themselves useful? they should probably be tailored to a data science task.
- consider more discussion on how to review and understand data used in a data science study. make this relevant to scikit learn. how can pandas be used from a data science perspective?
section 4
- state the goals of visualizing information
- consider drawing on nord's ai tutorial and its visualizations
- explain the unique value (pros/cons) of a box plot for data science for astro)
- draw conclusions from the box plots
- explain the unique value (pros/cons) of a violin plot for data science for astro)
- draw conclusions from the violin plots
- probably a few additional histograms and correlation plots
section 5
- probably a better term than 'clean' what are we trying to do?
- provide more introduction to the section and why we're doing it
- discuss how both tabular data and visualizations help plan and inform how to edit the data.
- why do predictive work based on g and r band flux correlations?
- remove nans first and make it more matter of fact
- use assert for nan removal
- remove zeroes as a matter of fact; use assert
- make subsections for nan and zero removal
section 6
- cleaning data is actually a part of preparing the data set, so we'll subsume sec 5 into sec 6
- the goal of the notebook should be stated early on, not in section 6.
- Is predicting one flux based on another band the appropriate goal?
- perhaps this should be star-galaxy separation based on
- visualizations may also indicate how to 'clean the data'
- change xtraintt variable name to be more human readable
- use assert for checking the inverse function?
- there needs to also be a validation data set
section 7
- Change title of section to be declarative instructive
- write an explanation for the section
- Explain linear regression; include some math and a reference
- add figure numbers for figures
- explain why we're testing in transformed and un-transformed spaces
- separate the cell on data prep, fitting, predicting and plotting.
- make the 1-1 plot a function
- explain why random forest is useful.
- why is mse worse for random forest than for linear regression?
section 8
- report expected training times
section 9
- more concise section title
- why are we doing this section? not sure yet
section 10
- either perform a study with the other model classes or just mention. not useful to do the in-between thing.

beckynevin added 5 commits May 6, 2025 18:11

adding data science notebook

38dc11e

renaming

0d79f19

updating intro to be dp0.2 style

40984ec

working on outliers in histogram

3a45901

facelift to text in intersection section

a809627

beckynevin assigned MelissaGraham May 6, 2025

beckynevin unassigned MelissaGraham May 6, 2025

beckynevin requested review from bnord and MelissaGraham May 6, 2025 22:05

beckynevin added 6 commits May 7, 2025 21:59

correcting palette

44f6cdd

starting to change to log

e7d048d

struggling to get inverse log transformation

f82a32f

finished up transform section, added text for optional log, for now j…

d1512f6

…ust doing standard scaler

adding captions

a6ff72c

adding section for prediction of flagged r-band values

8f18b0b

beckynevin added 2 commits May 9, 2025 16:28

Ran all the way through with printouts

3cbc7a4

outputs cleared last run date updated

3566d77

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tickets/sp 1771 #21

Tickets/sp 1771 #21

Uh oh!

beckynevin commented May 6, 2025

Uh oh!

review-notebook-app bot commented May 6, 2025

Uh oh!

beckynevin commented May 9, 2025 •

edited

Loading

Uh oh!

beckynevin commented May 9, 2025

Uh oh!

MelissaGraham commented May 9, 2025

Uh oh!

bnord commented May 18, 2025

Uh oh!

Uh oh!

Tickets/sp 1771 #21

Are you sure you want to change the base?

Tickets/sp 1771 #21

Uh oh!

Conversation

beckynevin commented May 6, 2025

Uh oh!

review-notebook-app bot commented May 6, 2025

Uh oh!

beckynevin commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

beckynevin commented May 9, 2025

Uh oh!

MelissaGraham commented May 9, 2025

Uh oh!

bnord commented May 18, 2025

Uh oh!

Uh oh!

beckynevin commented May 9, 2025 •

edited

Loading