Skip to content

SyntaxError on HDF queries where right-hand contains string delimiter #6901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ariddell opened this issue Apr 17, 2014 · 20 comments · Fixed by #6905
Closed

SyntaxError on HDF queries where right-hand contains string delimiter #6901

ariddell opened this issue Apr 17, 2014 · 20 comments · Fixed by #6905
Labels
Docs IO HDF5 read_hdf, HDFStore
Milestone

Comments

@ariddell
Copy link

If your index is a string that contains a ' or " there is potentially no consistent way to do an HDF query from disk. The following is real-ish data from the Wikipedia pagecounts dataset. Worked in 0.12, SyntaxError in 0.13.

Easy to reproduce:
This is test.csv

title,hits
Al Lawson',30
'Blind' Willie McTell,20
df = pd.read_csv('test.csv', index_col=0)
store = pd.HDFStore('test.h5')
store.append('df', df)

# bug, SyntaxError
title = """Al Lawson'"""
result = store.select('df', pd.Term('index', '=', title))
result = store.select('df', 'index == {}'.format(title))

# fancy escaping helps, but it isn't a cure all
result = store.select('df', 'index == """{}"""'.format(title))
# SyntaxError
result = store.select('df', "index == '''{}'''".format(title))
@ariddell
Copy link
Author

You know, I think it might be as simple as adding a repr() in the right place. (But I think pandas should be doing this, not the user.)

This works:

result = store.select('df', pd.Term('index', '=', repr(title)))

@jreback
Copy link
Contributor

jreback commented Apr 17, 2014

this is fixed in 0.13.1 iirc

in any event you should use the new syntax in your queries rather than Term directly

http://pandas.pydata.org/pandas-docs/stable/io.html#querying-a-table

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

This isn't and never was a bug. The query contains unmatched quotes therefore it's invalid syntax

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

There was an issue with strings containing or or and, but never with querying strings without those tokens

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

The first and second case are really the same thing. If you were to write them outside of the query string they would be the following (invalid) code:

index == Al Lawson'

The third works because it results in

index == """Al Lawson'"""

And the fourth doesn't work because you have the following invalid code

index == '''Al Lawson''''

notice there are 3 quotes on the lhs and 4 on the rhs

@ariddell
Copy link
Author

So the user is responsible for doing repr in this case?
So the programmatic way to do it would be this?

title = "Al Lawson'"
result = store.select('df', "index={}".format(repr(title)))

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

@ariddell Yep. Or "index=%r" % title, which is less verbose or "index={!r}".format(title)

@ariddell
Copy link
Author

Ok. Thanks!

@jreback
Copy link
Contributor

jreback commented Apr 17, 2014

@ariddell it's impossible for the parser to know that you meant an exact string (and not just an error) if u don't quite it

just like python syntax (which is doing the parsing)

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

@jreback u stole my ⚡ ! :)

@ariddell
Copy link
Author

Might be nice to have a note in the docs? (especially since it worked in 0.12)

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

@ariddell Agreed! What did it do there? automatically quote? care to do a doc PR?

@ariddell
Copy link
Author

It must have automatically done the repr since it didn't raise an error. I'll admit I don't understand the underlying code or what change happened between 0.12 and 0.13 so perhaps I'm not the best person to update the docs.

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

@ariddell Ok np. I'll take this one then.

@cpcloud cpcloud self-assigned this Apr 17, 2014
@cpcloud cpcloud added this to the 0.14.0 milestone Apr 17, 2014
@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

Reopening for doc issue

@cpcloud cpcloud reopened this Apr 17, 2014
@jreback
Copy link
Contributor

jreback commented Apr 17, 2014

yah I guess just make a mention that u need to quote strings (although all strings are quoted in examples) - it IS python syntax - but I suppose u can clarify

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

@ariddell @jreback

I just realized that none of this is even necessary if you just pass in the variable that you're interested in

In [7]: title
Out[7]: "Al Lawson'"

In [8]: store
Out[8]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable,nrows->2,ncols->1,indexers->[index])

In [9]: store.select('df', 'index == title')
Out[9]:
            hits
title
Al Lawson'    30

[1 rows x 1 columns]

I'll make a doc note about this, but there's really no need to mention quoting

@ariddell
Copy link
Author

Good to know. I have to say, I'm a little concerned about grabbing the contents of local variables by name. It strikes me as something borrowed from R.

In any event, thanks for your work on this.

@cpcloud
Copy link
Member

cpcloud commented Apr 17, 2014

@ariddell What are you concerned about? I'm happy to explain any of the machinery to you. There's actually a full parse of the query string into something like

expr = Eq(lhs=Term('index', df.index), rhs=Term('title', "Al Lawson'"))

which allows pandas to do some alignment if necessary but ultimately it's passed to PyTables (which passes to numexpr). numexpr's evaluate function does a lookup of the global and local variables of the calling stack frame. We do something similar with the DataFrame.query()/eval() methods.

@ariddell
Copy link
Author

numexpr seems like a special case -- since it's departing from the land of Python. But I suppose that's true for PyTables too.

What causes me concern is just the "principle of least surprise" -- in Python you never refer to a local variable in a string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants