[go: up one dir, main page]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST (string dtype): fix groupby xfails with using_infer_string + update error message #59430

Merged
merged 13 commits into from
Nov 8, 2024

Conversation

jbrockmendel
Copy link
Member
@jbrockmendel jbrockmendel commented Aug 6, 2024

This fixes 590 xfails.

Two potential follow-ons: 1) should ArrowStringArrayNumpySemantics support sum (done in the meantime), 2) make the exception types and messages match between pyarrow and non-pyarrow cases (see update at the top)

xref #54792

@jorisvandenbossche
Copy link
Member

should ArrowStringArrayNumpySemantics support sum

Personally I am fine with not supporting that. I listed this as one of the breaking changes in #59328, but we can definitely discuss whether we want to keep support for it or not (let's use that issue for it)

@jorisvandenbossche jorisvandenbossche changed the title TST: fix groupby xfails with using_infer_string TST (string dtype): fix groupby xfails with using_infer_string Aug 7, 2024
Copy link
Member
@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

msg = "No matching signature found"
elif groupby_func == "corrwith":
msg = (
"'ArrowStringArrayNumpySemantics' with dtype string does "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily for this PR, but personally I think we should avoid including such full name of the array in the error message, and just stick to saying that "dtype xx does not support operation yy"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to get the same error message for corrwith across all types? Probably related to @jorisvandenbossche comment on the pa.compute errors but generally I think we can catch these and make them consistent?

Comment on lines 205 to 209
import pyarrow as pa

klass = pa.lib.ArrowNotImplementedError
if groupby_func == "pct_change":
msg = "Function 'divide' has no kernel matching input types"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be different for the object-dtype based variant (as you also noted), but so in general I think we should avoid bubbling up this pyarrow error to the user (the name of the function there is also different, and that should be an implementation detail). If pandas does not support substraction for string dtype (as we do), then we should just raise a TypeError with a specific message about that?

(in #59437 where I am enabling the object-dtype tests, I am adding xfails for such cases right now (not yet pushed that))

@jbrockmendel
Copy link
Member Author

Updated to make the exceptions match for object/pyarrow variants. Still needs some work, but now is a good time to bikeshed exception messages.

@jbrockmendel
Copy link
Member Author

well if no one else wants to bikeshed, i'll start: using f"{self.dtype} dtype does not support prod operations" renders as "str dtype [...]" which seems weird. maybe add ticks around 'str'?

@jorisvandenbossche
Copy link
Member

well if no one else wants to bikeshed, i'll start: using f"{self.dtype} dtype does not support prod operations" renders as "str dtype [...]" which seems weird. maybe add ticks around 'str'?

Yeah, to avoid that something reads a bit strange depending on the exact dtype and operation that is filled in, I would maybe do both with quotes, something like f"dtype '{dtype}' does not support operation '{op}'"

@jbrockmendel
Copy link
Member Author

Updated to fix most of the remaining affected tests. This PR changes the behavior of groupby.sum with string dtype to raise rather than cast to object and attempt to sum, which causes 11 new test failures that this PR does not yet address. im on the fence as to whether we should implement StringArray.sum and allow groupby.sum to go unchanged.

Copy link
Member
@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which causes 11 new test failures that this PR does not yet address

Feel free to add new xfails if that is easier for this PR, either way

im on the fence as to whether we should implement StringArray.sum and allow groupby.sum to go unchanged.

You mean generally supporting "sum" for strings, i.e. for both plain sum and groupby sum?
As mentioned above (#59430 (comment)), I am personally fine with making this a breaking change, but also don't feel strongly about it

pandas/core/arrays/arrow/array.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_raises.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_raises.py Outdated Show resolved Hide resolved
pandas/tests/reshape/test_pivot.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_raises.py Outdated Show resolved Hide resolved
asv_bench/benchmarks/groupby.py Outdated Show resolved Hide resolved
pandas/tests/groupby/aggregate/test_aggregate.py Outdated Show resolved Hide resolved
pandas/tests/groupby/methods/test_quantile.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_groupby.py Show resolved Hide resolved
pandas/tests/groupby/test_groupby.py Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

Thanks for the updates! Added a few more comments

@jbrockmendel jbrockmendel force-pushed the tst-str-gb branch 2 times, most recently from 6c70bc6 to 87eac8a Compare August 28, 2024 14:53
@mroeschke mroeschke added this to the 2.3 milestone Aug 28, 2024
@jbrockmendel jbrockmendel marked this pull request as ready for review August 28, 2024 19:59
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 28, 2024
@jbrockmendel
Copy link
Member Author

Still active

@rhshadrach rhshadrach added Groupby Strings String extension data type and string data and removed Stale labels Oct 1, 2024
@jorisvandenbossche
Copy link
Member

I merged the sum PR, so this could be updated now, I think

Copy link
Member
@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

I pushed a commit with a few more cases where some changes could be reverte (i.e. some explicit astype(object) no longer being needed because string dtype now supports sum as well)

Comment on lines 149 to +155
def test_cython_fail_agg():
dr = bdate_range("1/1/2000", periods=50)
ts = Series(["A", "B", "C", "D", "E"] * 10, index=dr)
ts = Series(["A", "B", "C", "D", "E"] * 10, dtype=object, index=dr)

grouped = ts.groupby(lambda x: x.month)
summed = grouped.sum()
expected = grouped.agg(np.sum)
expected = grouped.agg(np.sum).astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a specific reason you added an explicit dtype=object here (since it seems you only added this in the last commit, after updating for sum() being implemented, so now this is actually no longer needed, I think) ?

@jorisvandenbossche jorisvandenbossche changed the title TST (string dtype): fix groupby xfails with using_infer_string TST (string dtype): fix groupby xfails with using_infer_string + update error message Nov 4, 2024
@jorisvandenbossche jorisvandenbossche merged commit e5dd89d into pandas-dev:main Nov 8, 2024
50 of 51 checks passed
@jorisvandenbossche
Copy link
Member

Thanks @jbrockmendel!

Copy link
lumberbot-app bot commented Nov 8, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 e5dd89d4d74d8e2a06256023717880788f2b10ed
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #59430: TST (string dtype): fix groupby xfails with using_infer_string + update error message'
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-59430-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #59430 on branch 2.3.x (TST (string dtype): fix groupby xfails with using_infer_string + update error message)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Nov 8, 2024
…te error message (pandas-dev#59430)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
(cherry picked from commit e5dd89d)
@jorisvandenbossche
Copy link
Member

Manual backport -> #60246

ZKaoChi pushed a commit to ZKaoChi/pandas that referenced this pull request Nov 9, 2024
…te error message (pandas-dev#59430)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
mroeschke pushed a commit that referenced this pull request Nov 9, 2024
…fer_string + update error message (#59430) (#60246)

* TST (string dtype): fix groupby xfails with using_infer_string + update error message (#59430)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
(cherry picked from commit e5dd89d)

* fix test

---------

Co-authored-by: jbrockmendel <jbrockmendel@gmail.com>
@jbrockmendel jbrockmendel deleted the tst-str-gb branch November 10, 2024 03:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants