[WIP] NIST strong line retrieval #29

Closed
wants to merge 4 commits into from

Conversation

simontorres

This PR is an example of how to load NIST data into a pandas DataFrame.
I'm not sure this qualifies for merging, but it should still be useful. Otherwise, give me feedback and I can improve it.
See #28

  • There is at least one other way to do the same thing, using BeautifulSoup, but I used HTMLParser, which I think is lower level.
  • It assumes two-character names for the chemical elements
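
For readers unfamiliar with the HTMLParser approach mentioned above, here is a minimal sketch using only the standard library. It is not the code in this PR: the class name `TableRowParser` and the sample HTML are made up for illustration, and the real NIST pages are laid out differently.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Hypothetical input, NOT the actual NIST markup.
SAMPLE_HTML = """
<table>
  <tr><th>rel_int</th><th>wavelength</th><th>spectrum</th><th>reference</th></tr>
  <tr><td>100</td><td>2809.485</td><td>Ne II</td><td>P71</td></tr>
  <tr><td>80</td><td>2906.592</td><td>Ne II</td><td>P71</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)
parser.close()

# First row is the header; the rest are records (all values are strings).
header, *records = parser.rows
lines = [dict(zip(header, rec)) for rec in records]
```

All cell values come back as strings, so a caller would still need to convert the numeric columns, for example by passing `records` and `header` to `pd.DataFrame(records, columns=header)` and then applying `pd.to_numeric` to the relevant columns.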

@tepickering
Contributor

looks like this and #30 cover some of the same ground. a few comments:

  • i think we should pick a convention for returning wavelength/intensity and be consistent. i'm fairly agnostic whether it's an astropy Table, a DataFrame, or a tuple of numpy arrays, but all methods that return them should do so in the same way.
  • if we use pandas, it needs to be added to the conda dependencies in .travis.yml.
  • BeautifulSoup is a cleaner, more concise way of parsing HTML, but it does add an extra dependency vs. the built-in HTMLParser. OTOH, it's available via the standard conda channel so it's not that onerous to install and it is widely used. we should look at what, if any, redundancies there are between parsing the CGI interface in #30 ("Created method to retrieve the wavelength lines from NIST.") and the HTML tables here, and generalize as much as possible.
  • new code should include tests that cover it.

@simontorres
Author

They do indeed do roughly the same thing. My idea was to provide a "proof of concept" (if it can be called that) in order to contribute to the discussion. I agree that BeautifulSoup is cleaner, and I must admit I have not used it.

Regarding the convention for retrieving the data I would definitely go with pandas.DataFrame for its flexibility, for instance, when filtering the data.

I have created a quick example.

import pandas as pd

data_neon = {'rel_int': [100,
                         80,
                         80,
                         90,
                         90],
             'wavelength' : [2809.485,
                             2906.592,
                             2906.816,
                             2910.061,
                             2910.408],
             'spectrum' : ['Ne II',
                           'Ne II',
                           'Ne II',
                           'Ne II',
                           'Ne II'],
             'reference' : ['P71',
                            'P71',
                            'P71',
                            'P71',
                            'P71']}


df = pd.DataFrame(data=data_neon, columns=['rel_int',
                                           'wavelength',
                                           'spectrum',
                                           'reference'])
# the DataFrame object
print("The DataFrame Object")
print(df)
The DataFrame Object
   rel_int  wavelength spectrum reference
0      100    2809.485    Ne II       P71
1       80    2906.592    Ne II       P71
2       80    2906.816    Ne II       P71
3       90    2910.061    Ne II       P71
4       90    2910.408    Ne II       P71
# selecting the three most intense
print("Selecting the three most intense")

three_most_intense = df.sort_values('rel_int', ascending=False)
three_most_intense = three_most_intense[:3].sort_values('wavelength')
print(three_most_intense)
Selecting the three most intense
   rel_int  wavelength spectrum reference
0      100    2809.485    Ne II       P71
3       90    2910.061    Ne II       P71
4       90    2910.408    Ne II       P71
print('select between 2900 and 2910')

print(df[((df.wavelength > 2900) & (df.wavelength < 2910))])
select between 2900 and 2910
   rel_int  wavelength spectrum reference
1       80    2906.592    Ne II       P71
2       80    2906.816    Ne II       P71

@tepickering
Contributor

i'm a big pandas fan, but i'll point out one big disadvantage that i just realized: lack of units support. even in your example, it's not clear if the wavelengths are in Å or nm.

it also doesn't appear that units support is coming to pandas any time soon: pandas-dev/pandas#15698

to avoid headaches down the road, i think it's important that at least the wavelengths returned by line list utilities contain explicit metadata describing the wavelength units. this leaves either a Quantity array or a QTable. using pandas within the methods/functions is fine and very handy, though, as you show.

@bsipocz
Member

bsipocz commented Jun 22, 2018

hmm, for the sake of compatibility with the rest of the stack, imo Table/QTable should be preferred over pandas, unless of course a crucial functionality of the latter is used that is not available with the astropy framework.

@simontorres
Author

I guess I should be closing this as well, based on the conclusion of #30
