-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TST (string dtype): fix groupby xfails with using_infer_string + update error message #59430
Conversation
Personally I am fine with not supporting that. I listed this as one of the breaking changes in #59328, but we can definitely discuss whether we want to keep support for it or not (let's use that issue for it) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
pandas/tests/groupby/test_raises.py
Outdated
msg = "No matching signature found" | ||
elif groupby_func == "corrwith": | ||
msg = ( | ||
"'ArrowStringArrayNumpySemantics' with dtype string does " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily for this PR, but personally I think we should avoid including such full name of the array in the error message, and just stick to saying that "dtype xx does not support operation yy"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not possible to get the same error message for corrwith across all types? Probably related to @jorisvandenbossche comment on the pa.compute errors but generally I think we can catch these and make them consistent?
pandas/tests/groupby/test_raises.py
Outdated
import pyarrow as pa | ||
|
||
klass = pa.lib.ArrowNotImplementedError | ||
if groupby_func == "pct_change": | ||
msg = "Function 'divide' has no kernel matching input types" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be different for the object-dtype based variant (as you also noted), but so in general I think we should avoid bubbling up this pyarrow error to the user (the name of the function there is also different, and that should be an implementation detail). If pandas does not support substraction for string dtype (as we do), then we should just raise a TypeError with a specific message about that?
(in #59437 where I am enabling the object-dtype tests, I am adding xfails for such cases right now (not yet pushed that))
ca30d58
to
4224a52
Compare
Updated to make the exceptions match for object/pyarrow variants. Still needs some work, but now is a good time to bikeshed exception messages. |
well if no one else wants to bikeshed, i'll start: using |
Yeah, to avoid that something reads a bit strange depending on the exact dtype and operation that is filled in, I would maybe do both with quotes, something like |
Updated to fix most of the remaining affected tests. This PR changes the behavior of groupby.sum with string dtype to raise rather than cast to object and attempt to sum, which causes 11 new test failures that this PR does not yet address. im on the fence as to whether we should implement StringArray.sum and allow groupby.sum to go unchanged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which causes 11 new test failures that this PR does not yet address
Feel free to add new xfails if that is easier for this PR, either way
im on the fence as to whether we should implement StringArray.sum and allow groupby.sum to go unchanged.
You mean generally supporting "sum" for strings, i.e. for both plain sum and groupby sum?
As mentioned above (#59430 (comment)), I am personally fine with making this a breaking change, but also don't feel strongly about it
7f2e1c8
to
c498d20
Compare
Thanks for the updates! Added a few more comments |
6c70bc6
to
87eac8a
Compare
87eac8a
to
c8ebe07
Compare
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Still active |
I merged the sum PR, so this could be updated now, I think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update!
I pushed a commit with a few more cases where some changes could be reverte (i.e. some explicit astype(object)
no longer being needed because string dtype now supports sum as well)
def test_cython_fail_agg(): | ||
dr = bdate_range("1/1/2000", periods=50) | ||
ts = Series(["A", "B", "C", "D", "E"] * 10, index=dr) | ||
ts = Series(["A", "B", "C", "D", "E"] * 10, dtype=object, index=dr) | ||
|
||
grouped = ts.groupby(lambda x: x.month) | ||
summed = grouped.sum() | ||
expected = grouped.agg(np.sum) | ||
expected = grouped.agg(np.sum).astype(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a specific reason you added an explicit dtype=object
here (since it seems you only added this in the last commit, after updating for sum()
being implemented, so now this is actually no longer needed, I think) ?
Thanks @jbrockmendel! |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
…te error message (pandas-dev#59430) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> (cherry picked from commit e5dd89d)
Manual backport -> #60246 |
…te error message (pandas-dev#59430) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…fer_string + update error message (#59430) (#60246) * TST (string dtype): fix groupby xfails with using_infer_string + update error message (#59430) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> (cherry picked from commit e5dd89d) * fix test --------- Co-authored-by: jbrockmendel <jbrockmendel@gmail.com>
This fixes 590 xfails.
Two potential follow-ons: 1) should ArrowStringArrayNumpySemantics support sum (done in the meantime), 2) make the exception types and messages match between pyarrow and non-pyarrow cases (see update at the top)
xref #54792