String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451

jorisvandenbossche · 2024-04-27T12:59:34Z

Similar to how #54533 added a variant of the pyarrow-backed string dtype that uses NumPy (NaN) missing-value semantics (now named StringDtype(storage="pyarrow", na_value=np.nan), with ArrowStringArrayNumpySemantics), this PR does the same for our StringArray backed by an object-dtype NumPy array: a new StringDtype(storage="python", na_value=np.nan) with the corresponding StringArrayNumpySemantics array.

Illustration for discussion in #57073

xref #54792

jorisvandenbossche · 2024-04-27T18:37:32Z

pandas/_testing/asserters.py

+    # Specifically for StringArrayNumpySemantics, validate here we have a valid array
+    if isinstance(left.dtype, StringDtype) and left.dtype.storage == "python_numpy":
+        assert np.all(
+            [np.isnan(val) for val in left._ndarray[left_na]]  # type: ignore[attr-defined]
+        ), "wrong missing value sentinels"


This is a somewhat custom check (and we don't do anything similar for other types). But I initially overlooked a case where we were creating string arrays with the wrong missing value sentinel, because the tests don't actually catch that (two arrays with different missing value sentinels still pass as equal in the case of EAs), so I would prefer to keep this in, at least in the short term.
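To illustrate why such a check can be useful, here is a minimal pure-Python sketch. It does not use pandas; `is_missing`, `masked_equal` and `has_only_nan_sentinels` are hypothetical helpers that only model the idea that a mask-based comparison treats any two missing values as equal, so a wrong sentinel can slip through unless it is validated separately.

```python
import math


def is_missing(val):
    """Toy stand-in for an isna check: None and float NaN count as missing."""
    return val is None or (isinstance(val, float) and math.isnan(val))


def masked_equal(left, right):
    """Compare two lists the way a mask-based assert might: missing == missing."""
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if is_missing(a) or is_missing(b):
            # Both must be missing, but the concrete sentinel is ignored.
            if not (is_missing(a) and is_missing(b)):
                return False
        elif a != b:
            return False
    return True


def has_only_nan_sentinels(values):
    """Extra validation: every missing slot must hold a float NaN, not None."""
    return all(
        isinstance(v, float) and math.isnan(v) for v in values if is_missing(v)
    )


good = ["a", float("nan"), "c"]  # NaN sentinel, as intended for the NaN variant
bad = ["a", None, "c"]           # wrong sentinel for the NaN variant

print(masked_equal(good, bad))       # True: the comparison cannot tell them apart
print(has_only_nan_sentinels(good))  # True
print(has_only_nan_sentinels(bad))   # False: the extra check catches it
```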

pandas/_libs/lib.pyx

pandas/_testing/asserters.py

pandas/tests/extension/test_string.py

WillAyd · 2024-07-30T12:14:46Z

pandas/_testing/asserters.py

        left = repr(left)
+    elif isinstance(left, StringDtype):
+        left = f"StringDtype(storage={left.storage}, na_value={left.na_value})"


For the NA variants, do we really even need to expose/check against the storage or can we always just check that we have a StringDtype with a pd.NA missing value marker?

The main thing I want to decouple users from is relying heavily on things like StringDtype(storage="python"), because it makes it really hard to move them away from our internals.

Taking one of the examples we talked about for a possible implementation in 3.0, if we used nanoarrow to back a StringArray without taking on a pyarrow dependency, having a bunch of scattered calls to "python" makes that much harder than it needs to be, without a ton of benefit (?)

I realize the storage= keyword is documented in PDEP-14 so not surprised to see it here; I just generally am hoping to minimize how often it appears to users in non-developmental / internal contexts

For the NA variants, do we really even need to expose/check against the storage or can we always just check that we have a StringDtype with a pd.NA missing value marker?

You mean that our asserters (assert_frame_equal et al) would consider a column with string dtype backed by pyarrow vs python as equal? (as long as the na_value is equal)

The main thing I want to decouple users from is relying heavily on things like StringDtype(storage="python"), because it makes it really hard to move them away from our internals.

Yes, I also want to ensure that our general goal is that essentially almost no user should have to be explicit about the storage (that's also one of the reasons that I do not want to include the storage in the string alias for the new-to-be-default string dtype).
But, this is about a developer tool. Currently we regard pd.StringDtype("pyarrow") == pd.StringDtype("python") as False, and so assert_frame_equal will fail for that. And in that case, for the developer UX, the assert error message should be clear.

Right now, you can get something like:

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="col1") are different

Attribute "dtype" are different
[left]:  string[pyarrow]
[right]: string[pyarrow]

which is not very helpful ...

The reason is that I did not bake the pd.NA vs np.nan information into the string alias / representation.

See also #59342 for an issue about this that was just opened. We should probably continue the main discussion about an informative repr there, but for now I want to include something like the above as a short-term solution until we decide on the best way forward to address the issues in #59342.

FWIW I also added this code change in #59352 (to fix up failing tests on main) but with a comment explaining the issue.
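As a rough illustration of the problem described in this comment, the following sketch uses a hypothetical `ToyStringDtype` class (not the pandas `StringDtype`) whose short alias omits the missing-value marker. The `describe` helper mirrors the idea of the asserter change quoted above: it spells out both `storage` and `na_value` so that two otherwise identical-looking dtypes can be told apart in an error message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStringDtype:
    """Hypothetical stand-in for a string dtype with two parameters."""

    storage: str
    na_value: object

    def __str__(self):
        # Mimics a short alias that leaves out the missing-value marker.
        return f"string[{self.storage}]"


def describe(dtype):
    """Render a dtype for an assertion message, including na_value."""
    if isinstance(dtype, ToyStringDtype):
        return f"StringDtype(storage={dtype.storage}, na_value={dtype.na_value})"
    return repr(dtype)


left = ToyStringDtype("pyarrow", "<NA>")         # "<NA>" is a placeholder marker
right = ToyStringDtype("pyarrow", float("nan"))

print(str(left), "|", str(right))  # identical short aliases
print(describe(left))
print(describe(right))
```

With only the short alias, the left and right sides of the message look the same; the longer description makes the difference visible.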

You mean that our asserters (assert_frame_equal et al) would consider a column with string dtype backed by pyarrow vs python as equal? (as long as the na_value is equal)

Yea exactly. I'm not sure I 100% feel that way and I see both sides to the argument, but wanted to raise it for discussion

Understood and agreed on all of your other points

For the asserters and how to treat variants of the same dtype, that is a topic we certainly have to discuss more in general as well when expanding this topic to other dtypes (PDEP 13).

Personally, I think there is indeed something to be said for treating those dtypes as equal, or at least for having an option to toggle how "strict" the dtype check is.
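One way to picture such a toggle is the sketch below. Everything in it is an assumption made for illustration: `ToyDtype`, `same_na` and `dtypes_match` are hypothetical names, and no `strict` option like this is claimed to exist in the pandas asserters. The idea shown is that the missing-value marker always has to match, while the storage backend is only compared in strict mode.

```python
import math
from typing import NamedTuple


class ToyDtype(NamedTuple):
    """Hypothetical dtype description: storage backend plus missing marker."""

    storage: str
    na_value: object


def same_na(a, b):
    """Compare missing-value markers, treating NaN as equal to NaN."""
    both_nan = (
        isinstance(a, float) and isinstance(b, float)
        and math.isnan(a) and math.isnan(b)
    )
    return both_nan or a == b


def dtypes_match(left, right, strict=True):
    """Return True if two toy dtypes are considered equal.

    The na_value must always agree; the storage is only compared when
    strict=True.
    """
    if not same_na(left.na_value, right.na_value):
        return False
    if strict:
        return left.storage == right.storage
    return True


nan = float("nan")
python_nan = ToyDtype("python", nan)
pyarrow_nan = ToyDtype("pyarrow", nan)
python_na = ToyDtype("python", "NA")  # "NA" is a placeholder marker

print(dtypes_match(python_nan, pyarrow_nan))                # False
print(dtypes_match(python_nan, pyarrow_nan, strict=False))  # True
print(dtypes_match(python_nan, python_na, strict=False))    # False
```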

WillAyd · 2024-07-30T12:19:31Z

pandas/core/arrays/string_.py

+            )
+            return type(self)(result)
+        else:
+            # This is when the result type is object. We reach this when


Shouldn't this raise an error or not be possible in the first place?

Some str methods are weird (i.e. what's in the comment here)

And not only weird: there are some methods that genuinely return an object dtype (of course for lack of a better, proper dtype, but right now, with the default dtypes, this is object dtype). For example, ser.str.split() returns list elements.

Makes sense. The list-returning functions are more good use cases for PDEP-13 #58455
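The fallback discussed in this thread can be pictured with a small pure-Python model. `map_with_result_kind` is a hypothetical helper, and inferring the kind from the results is a simplification (the code under review branches on the requested dtype). It only shows why a method such as split, which yields lists, ends up in the generic "object" branch.

```python
def map_with_result_kind(values, func):
    """Apply func to each non-missing string and classify the results.

    Hypothetical helper: the kind names loosely mimic the branches of a
    string-mapping routine and are not the pandas implementation.
    """
    results = [None if v is None else func(v) for v in values]
    present = [r for r in results if r is not None]
    if present and all(isinstance(r, bool) for r in present):
        kind = "bool"
    elif present and all(
        isinstance(r, int) and not isinstance(r, bool) for r in present
    ):
        kind = "int"
    elif present and all(isinstance(r, str) for r in present):
        kind = "string"
    else:
        # No dedicated result type fits, e.g. list elements from split().
        kind = "object"
    return kind, results


values = ["a b", "c d e", None]

print(map_with_result_kind(values, str.upper))  # kind is "string"
print(map_with_result_kind(values, len))        # kind is "int"
print(map_with_result_kind(values, str.split))  # kind is "object"
```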

WillAyd · 2024-07-30T12:20:34Z

pandas/tests/arrays/string_/test_string.py

@@ -655,7 +663,11 @@ def test_isin(dtype, fixed_now_ts):
    tm.assert_series_equal(result, expected)

    result = s.isin(["a", pd.NA])
-    expected = pd.Series([True, False, True])
+    if dtype.storage == "python" and dtype.na_value is np.nan:
+        # TODO what do we want here?


I think this is the best outcome

In any case, we should be consistent between the object and pyarrow backed version of the NaN-variant of this dtype (currently it is only StringDtype("python", na_value=np.nan) that deviates from the three others).

But given that with plain object dtype we currently also match pd.NA with None, I would maybe rather keep that behaviour?

Hmm OK - maybe we should just have this be a branch between np.nan and pd.NA then?

But given that with plain object dtype we currently also match pd.NA with None, I would maybe rather keep that behaviour?

I see the reasoning for it but I am hesitant to keep repeating this behavior. Seems like it's really a quirk of how our historical Python object storage has worked, and I don't see that aging well as we move beyond it

Thinking about this a bit more: when doing ser.isin(["a", pd.NA]), the user is providing a list here, i.e. not something that already has a well-defined dtype (numpy array or pandas array/series).
So what I would expect to happen here is that the list of values is coerced to the calling ser.dtype. And then this should give the same result as ser.isin(pd.array(["a", pd.NA], ser.dtype)).

pd.array(["a", pd.NA], ser.dtype) will coerce the NA to a missing value for that dtype, so in case of the new string dtype, this is NaN.

(to be clear, what actually happens is clearly not matching what I describe above as what I would expect)
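The expected behaviour described in this comment can be modelled without pandas. In the sketch below, `NA`, `is_missing`, `coerce` and `isin` are hypothetical toy definitions: the list of candidate values is first coerced to the caller's missing-value marker, and only then is membership tested, so a `pd.NA`-like entry in the list matches a NaN in the data.

```python
import math

NA = object()  # toy stand-in for pd.NA


def is_missing(v):
    """Treat None, the toy NA object and float NaN as missing."""
    return v is None or v is NA or (isinstance(v, float) and math.isnan(v))


def coerce(values, na_value):
    """Map every missing marker in values to the dtype's own na_value."""
    return [na_value if is_missing(v) else v for v in values]


def isin(data, values, na_value):
    """Membership test after coercing values to the caller's missing marker."""
    candidates = coerce(values, na_value)
    has_missing = any(is_missing(c) for c in candidates)
    present = {c for c in candidates if not is_missing(c)}
    # A missing element matches only if the candidates contain a missing value.
    return [has_missing if is_missing(x) else x in present for x in data]


nan = float("nan")
data = ["a", "b", nan]

print(isin(data, ["a", NA], na_value=nan))  # [True, False, True]
print(isin(data, ["a"], na_value=nan))      # [True, False, False]
```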

I think I agree with @jorisvandenbossche's description of the result (and object sucks...)

pandas/tests/strings/test_find_replace.py

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche · 2024-08-07T12:50:05Z

This PR is certainly not yet perfect (and further reviews are welcome!), but I am going to merge it: we still have to fix a lot of things in general anyway (all the xfailing tests with the current implementation), and by getting this in now, when we start addressing those xfails (e.g. #59430), we can fix the tests/implementations for both dtype variants at the same time.

jbrockmendel · 2024-08-08T17:44:49Z

pandas/core/arrays/string_.py

+                na_value=na_value,
+                dtype=np.dtype(cast(type, dtype)),
+            )
+            if na_value_is_na and mask.any():


this method (which has now been refactored to _str_map_nan_semantics) is slightly different in StringArray vs ArrowStringArray and im trying to sort out whether the differences are intentional or just cosmetic. could use some help from the author

1. the Arrow version handles this doing the check before map_infer_mask and changing the dtype passed there (also doesn't check for na_value_is_na)

2. the Arrow version sets na_value = np.nan/False on the analogue to L837/839 (again without a na_value_is_na check)

3. the Arrow version doesn't have the L831 convert = convert and not np.all(mask); AFAICT no existing tests rely on that line

Woops, my claim in 3 about it not mattering was incorrect. it matters for test_contains_nan and test_empty_str_methods

could use some help from the author

Although an author who wrote this code almost 4 months ago ;)

Will take a closer look at it later today, but one quick find is that there were changes to the arrow version after I started this PR, so I might not have taken those into account in this version, eg #58483

ive convinced myself that the arrow version doesnt need the na_value_is_na check bc it is always True

... and that 'convert' is never used

String dtype: implement object-dtype based StringArray variant with N…

63a7fc5

…umPy semantics

jorisvandenbossche added the Strings String extension data type and string data label Apr 27, 2024

jorisvandenbossche requested a review from WillAyd as a code owner April 27, 2024 12:59

jorisvandenbossche mentioned this pull request Apr 27, 2024

DISC: Consider not requiring PyArrow in 3.0 #57073

Open

jorisvandenbossche added 2 commits April 27, 2024 20:27

fix constructor to not convert to NA

0eee625

fix typing

607b95e

jorisvandenbossche commented Apr 27, 2024

pandas/_libs/lib.pyx (outdated, resolved)

improve logic in str_map

bca157d

jorisvandenbossche requested a review from phofl April 27, 2024 20:13

jorisvandenbossche mentioned this pull request May 6, 2024

PDEP-14: Dedicated string data type for pandas 3.0 #58551

Merged

github-actions bot added the Stale label May 28, 2024

jorisvandenbossche mentioned this pull request Jun 27, 2024

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

41 tasks

jorisvandenbossche removed the Stale label Jul 26, 2024

Merge remote-tracking branch 'upstream/main' into string-dtype-object

79eb3b4

WillAyd requested changes Jul 26, 2024

pandas/_libs/lib.pyx (outdated, resolved)

pandas/_testing/asserters.py (outdated, resolved)

jorisvandenbossche added 2 commits July 30, 2024 08:48

Merge remote-tracking branch 'upstream/main' into string-dtype-object

c063298

remove most usage of python_numpy

ab96aa4

jorisvandenbossche commented Jul 30, 2024

pandas/tests/extension/test_string.py (outdated, resolved)

update tests to avoid string[python_numpy]

bae8d65

WillAyd reviewed Jul 30, 2024

jorisvandenbossche mentioned this pull request Jul 30, 2024

TST (string dtype): follow-up on GH-59329 fixing new xfails #59352

Merged

jorisvandenbossche added 7 commits July 31, 2024 21:32

Merge remote-tracking branch 'upstream/main' into string-dtype-object

31f1c33

Merge remote-tracking branch 'upstream/main' into string-dtype-object

cbd0820

remove all python_numpy usage

864c166

remove hardcoded storage

d3ad7b0

implement any/all reductions

028dc2c

Merge remote-tracking branch 'upstream/main' into string-dtype-object

1750bcb

fix typing

7f4baf7

jorisvandenbossche mentioned this pull request Aug 5, 2024

API/TST: expand tests for string any/all reduction + fix pyarrow-based implementation #59414

Merged

1 task

Merge remote-tracking branch 'upstream/main' into string-dtype-object

fdf1454

jorisvandenbossche mentioned this pull request Aug 7, 2024

TST (string dtype): add test build with future strings enabled without pyarrow #59437

Merged

jorisvandenbossche and others added 2 commits August 7, 2024 14:36

Update pandas/core/arrays/string_.py

fe6fce6

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

update todo comment

70325d4

jorisvandenbossche requested a review from WillAyd August 7, 2024 12:49

WillAyd approved these changes Aug 7, 2024

jorisvandenbossche merged commit 1272cb1 into pandas-dev:main Aug 7, 2024
39 of 45 checks passed

jorisvandenbossche deleted the string-dtype-object branch August 7, 2024 14:07

jorisvandenbossche mentioned this pull request Aug 8, 2024

String dtype: fix alignment sorting in case of python storage #59448

Merged

jbrockmendel reviewed Aug 8, 2024

WillAyd pushed a commit that referenced this pull request Aug 13, 2024

String dtype: implement object-dtype based StringArray variant with N…

348ae83

…umPy semantics (#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 14, 2024

String dtype: implement object-dtype based StringArray variant with N…

c3daf91

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024

String dtype: implement object-dtype based StringArray variant with N…

d0b84e6

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024

String dtype: implement object-dtype based StringArray variant with N…

93b72a8

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024

String dtype: implement object-dtype based StringArray variant with N…

f739b59

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 21, 2024

String dtype: implement object-dtype based StringArray variant with N…

539a5e5

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024

String dtype: implement object-dtype based StringArray variant with N…

003385d

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024

String dtype: implement object-dtype based StringArray variant with N…

1a5dd7d

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024

String dtype: implement object-dtype based StringArray variant with N…

4fb4478

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

WillAyd added a commit to WillAyd/pandas that referenced this pull request Aug 27, 2024

String dtype: implement object-dtype based StringArray variant with N…

c760c00

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche mentioned this pull request Sep 9, 2024

String dtype: fix isin() values handling for python storage #59759

Merged

WillAyd added a commit to WillAyd/pandas that referenced this pull request Sep 20, 2024

String dtype: implement object-dtype based StringArray variant with N…

3add2e6

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024

String dtype: implement object-dtype based StringArray variant with N…

4a430cf

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024

String dtype: implement object-dtype based StringArray variant with N…

fdefe64

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 3, 2024

String dtype: implement object-dtype based StringArray variant with N…

5ee61c3

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche pushed a commit to WillAyd/pandas that referenced this pull request Oct 7, 2024

String dtype: implement object-dtype based StringArray variant with N…

463fd91

…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche pushed a commit that referenced this pull request Oct 9, 2024

String dtype: implement object-dtype based StringArray variant with N…

67f9df4

…umPy semantics (#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>

jorisvandenbossche added the backported label Oct 10, 2024
