TST (string dtype): fix groupby xfails with using_infer_string + update error message #59430

jbrockmendel · 2024-08-06T21:29:22Z

This fixes 590 xfails.

Two potential follow-ons: 1) should ArrowStringArrayNumpySemantics support sum (done in the meantime), 2) make the exception types and messages match between pyarrow and non-pyarrow cases (see update at the top)

xref #54792

jorisvandenbossche · 2024-08-06T21:37:29Z

should ArrowStringArrayNumpySemantics support sum

Personally I am fine with not supporting that. I listed this as one of the breaking changes in #59328, but we can definitely discuss whether we want to keep support for it or not (let's use that issue for it)

jorisvandenbossche

Thanks!

jorisvandenbossche · 2024-08-07T15:37:28Z

pandas/tests/groupby/test_raises.py

+                msg = "No matching signature found"
+            elif groupby_func == "corrwith":
+                msg = (
+                    "'ArrowStringArrayNumpySemantics' with dtype string does "


Not necessarily for this PR, but personally I think we should avoid including the full name of the array class in the error message, and just stick to saying that "dtype xx does not support operation yy"

Is it not possible to get the same error message for corrwith across all types? Probably related to @jorisvandenbossche's comment on the pa.compute errors, but generally I think we can catch these and make them consistent?

jorisvandenbossche · 2024-08-07T15:40:52Z

pandas/tests/groupby/test_raises.py

+                import pyarrow as pa
+
+                klass = pa.lib.ArrowNotImplementedError
+                if groupby_func == "pct_change":
+                    msg = "Function 'divide' has no kernel matching input types"


This is going to be different for the object-dtype based variant (as you also noted), so in general I think we should avoid bubbling up this pyarrow error to the user (the name of the function there is also different, and that should be an implementation detail). If pandas does not support subtraction for string dtype (and it does not), then we should just raise a TypeError with a specific message about that?

(in #59437 where I am enabling the object-dtype tests, I am adding xfails for such cases right now (not yet pushed that))
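
A minimal sketch of the pattern suggested above, in plain Python so it runs without pandas or pyarrow. `BackendNotImplementedError` is a hypothetical stand-in for `pa.lib.ArrowNotImplementedError`, and `backend_divide` / `divide` are illustrative names rather than pandas functions; the TypeError wording is an assumption as well.

```python
class BackendNotImplementedError(NotImplementedError):
    """Hypothetical stand-in for pa.lib.ArrowNotImplementedError."""


def backend_divide(left, right):
    # Pretend backend kernel: strings are unsupported, and the error text
    # exposes the backend's own function name.
    if isinstance(left, str) or isinstance(right, str):
        raise BackendNotImplementedError(
            "Function 'divide' has no kernel matching input types"
        )
    return left / right


def divide(left, right, dtype_name, op_name):
    # Public wrapper: translate the backend error into a TypeError that
    # names only the dtype and the user-facing operation.
    try:
        return backend_divide(left, right)
    except BackendNotImplementedError as err:
        raise TypeError(
            f"dtype '{dtype_name}' does not support operation '{op_name}'"
        ) from err


try:
    divide("a", "b", dtype_name="str", op_name="pct_change")
except TypeError as exc:
    message = str(exc)
    cause = type(exc.__cause__).__name__

print(message)
print(cause)
```

Chaining with `from err` keeps the backend error available for debugging while the user-facing type and message stay the same regardless of which backend raised.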

jbrockmendel · 2024-08-13T21:22:04Z

Updated to make the exceptions match for object/pyarrow variants. Still needs some work, but now is a good time to bikeshed exception messages.

jbrockmendel · 2024-08-14T20:57:32Z

well if no one else wants to bikeshed, i'll start: using f"{self.dtype} dtype does not support prod operations" renders as "str dtype [...]" which seems weird. maybe add ticks around 'str'?

jorisvandenbossche · 2024-08-14T21:10:42Z

well if no one else wants to bikeshed, i'll start: using f"{self.dtype} dtype does not support prod operations" renders as "str dtype [...]" which seems weird. maybe add ticks around 'str'?

Yeah, to avoid the message reading a bit strangely depending on the exact dtype and operation that are filled in, I would maybe put both in quotes, something like f"dtype '{dtype}' does not support operation '{op}'"
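
To make the difference concrete, here is a small plain-Python comparison of the two wordings discussed in this thread. The helper names `unquoted_message` and `quoted_message` are hypothetical and exist only for this illustration.

```python
import re


def unquoted_message(dtype, op):
    # Earlier wording from the thread: the dtype name is interpolated bare.
    return f"{dtype} dtype does not support {op} operations"


def quoted_message(dtype, op):
    # Suggested wording: quote both the dtype and the operation.
    return f"dtype '{dtype}' does not support operation '{op}'"


old = unquoted_message("str", "prod")
new = quoted_message("str", "prod")
print(old)  # str dtype does not support prod operations
print(new)  # dtype 'str' does not support operation 'prod'

# Tests usually match messages as regular expressions, so escape the
# expected text before searching for it.
pattern = re.escape("dtype 'str' does not support operation 'prod'")
matched = re.search(pattern, new) is not None
```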

jbrockmendel · 2024-08-14T21:17:05Z

Updated to fix most of the remaining affected tests. This PR changes the behavior of groupby.sum with string dtype to raise rather than cast to object and attempt to sum, which causes 11 new test failures that this PR does not yet address. I'm on the fence as to whether we should implement StringArray.sum and allow groupby.sum to go unchanged.

jorisvandenbossche

which causes 11 new test failures that this PR does not yet address

Feel free to add new xfails if that is easier for this PR, either way

I'm on the fence as to whether we should implement StringArray.sum and allow groupby.sum to go unchanged.

You mean generally supporting "sum" for strings, i.e. for both plain sum and groupby sum?
As mentioned above (#59430 (comment)), I am personally fine with making this a breaking change, but also don't feel strongly about it
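
For context, "summing" strings with `+` means concatenating them. The following plain-Python sketch shows what a group-wise sum over string values amounts to; `grouped_sum` is a hypothetical helper for illustration and is not how pandas implements `groupby.sum`.

```python
from collections import defaultdict
from functools import reduce
import operator


def grouped_sum(keys, values):
    # Collect the values that share a key, preserving their order.
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    # Reducing with + adds numbers and concatenates strings.
    return {key: reduce(operator.add, vals) for key, vals in groups.items()}


result = grouped_sum([1, 1, 2], ["A", "B", "C"])
print(result)  # {1: 'AB', 2: 'C'}
```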

jorisvandenbossche · 2024-08-27T10:17:21Z

Thanks for the updates! Added a few more comments

github-actions · 2024-09-28T00:07:06Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

jbrockmendel · 2024-10-01T18:48:26Z

Still active

jorisvandenbossche · 2024-10-31T20:51:10Z

I merged the sum PR, so this could be updated now, I think

jorisvandenbossche

Thanks for the update!

I pushed a commit with a few more cases where some changes could be reverted (i.e. some explicit astype(object) calls are no longer needed because string dtype now supports sum as well)

jorisvandenbossche · 2024-11-04T08:31:46Z

pandas/tests/groupby/aggregate/test_cython.py

 def test_cython_fail_agg():
    dr = bdate_range("1/1/2000", periods=50)
-    ts = Series(["A", "B", "C", "D", "E"] * 10, index=dr)
+    ts = Series(["A", "B", "C", "D", "E"] * 10, dtype=object, index=dr)

    grouped = ts.groupby(lambda x: x.month)
    summed = grouped.sum()
-    expected = grouped.agg(np.sum)
+    expected = grouped.agg(np.sum).astype(object)


Was there a specific reason you added an explicit dtype=object here? It seems you only added this in the last commit, after updating for sum() being implemented, so I think it is actually no longer needed.

jorisvandenbossche · 2024-11-08T13:34:35Z

Thanks @jbrockmendel!

lumberbot-app · 2024-11-08T13:34:49Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 e5dd89d4d74d8e2a06256023717880788f2b10ed

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #59430: TST (string dtype): fix groupby xfails with using_infer_string + update error message'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-59430-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #59430 on branch 2.3.x (TST (string dtype): fix groupby xfails with using_infer_string + update error message)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

jorisvandenbossche · 2024-11-08T14:40:44Z

Manual backport -> #60246

…fer_string + update error message (#59430) (#60246) * TST (string dtype): fix groupby xfails with using_infer_string + update error message (#59430) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> (cherry picked from commit e5dd89d) * fix test --------- Co-authored-by: jbrockmendel <jbrockmendel@gmail.com>

jorisvandenbossche changed the title ~~TST: fix groupby xfails with using_infer_string~~ TST (string dtype): fix groupby xfails with using_infer_string Aug 7, 2024

This was referenced Aug 7, 2024

String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451

Merged

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

jorisvandenbossche reviewed Aug 7, 2024

jbrockmendel marked this pull request as draft August 13, 2024 17:39

jbrockmendel force-pushed the tst-str-gb branch from ca30d58 to 4224a52 Compare August 13, 2024 21:21

jorisvandenbossche reviewed Aug 22, 2024

jbrockmendel force-pushed the tst-str-gb branch from 7f2e1c8 to c498d20 Compare August 22, 2024 21:21

jorisvandenbossche reviewed Aug 27, 2024

jbrockmendel force-pushed the tst-str-gb branch 2 times, most recently from 6c70bc6 to 87eac8a Compare August 28, 2024 14:53

jbrockmendel added 10 commits August 28, 2024 08:26

TST: fix groupby xfails with using_infer_string

34b36fb

TST: update _groupby_op to raise

9127829

update tests

e7ae735

Fix failing test_in_numeric_groupby

4ca5a2f

update exception messages

2c28a2c

update message

708e5d3

skip no-longer-supported

75eddea

update exception messages

72c59cf

update exception message

10be506

update exception message

c8ebe07

jbrockmendel force-pushed the tst-str-gb branch from 87eac8a to c8ebe07 Compare August 28, 2024 15:27

mroeschke added this to the 2.3 milestone Aug 28, 2024

jbrockmendel marked this pull request as ready for review August 28, 2024 19:59

jbrockmendel requested a review from rhshadrach as a code owner August 28, 2024 19:59

github-actions bot added the Stale label Sep 28, 2024

rhshadrach added the Groupby and Strings (String extension data type and string data) labels and removed the Stale label Oct 1, 2024

jbrockmendel and others added 3 commits November 1, 2024 08:51

Merge branch 'main' into tst-str-gb

f3c44cb

Update now that .sum() is supported

0871326

more cleanups now sum is implemented

baa1dd9

jorisvandenbossche approved these changes Nov 4, 2024

jorisvandenbossche changed the title ~~TST (string dtype): fix groupby xfails with using_infer_string~~ TST (string dtype): fix groupby xfails with using_infer_string + update error message Nov 4, 2024

jorisvandenbossche merged commit e5dd89d into pandas-dev:main Nov 8, 2024
50 of 51 checks passed

lumberbot-app bot added the Still Needs Manual Backport label Nov 8, 2024

jorisvandenbossche mentioned this pull request Nov 8, 2024

[backport 2.3.x] TST (string dtype): fix groupby xfails with using_infer_string + update error message (#59430) #60246

Merged

jorisvandenbossche removed the Still Needs Manual Backport label Nov 8, 2024

ZKaoChi pushed a commit to ZKaoChi/pandas that referenced this pull request Nov 9, 2024

TST (string dtype): fix groupby xfails with using_infer_string + upda…

979fb2e

…te error message (pandas-dev#59430) Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

jbrockmendel deleted the tst-str-gb branch November 10, 2024 03:49
