String dtype: rename the storage options and add `na_value` keyword in `StringDtype()` #59330

jorisvandenbossche · 2024-07-26T14:44:30Z

Rename the storage options and add the `na_value` keyword, following the PDEP: the current `storage="pyarrow_numpy"` becomes `storage="pyarrow", na_value=np.nan`.

This does not yet deprecate the "pyarrow_numpy" string option or "string[pyarrow_numpy]" alias, as our tests are using that a lot (I would leave that for a follow-up PR to first update all tests and then deprecate it).
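To illustrate the rename described above, here is a minimal sketch of how a legacy storage name could be translated into the new storage/`na_value` pair. It is plain Python rather than pandas code: `NA` is a stand-in for `pd.NA`, and `split_legacy_storage` is a hypothetical helper, not a function that exists in pandas.

```python
import math

# Stand-in for pd.NA so the sketch has no third-party dependency (assumption).
NA = object()


def split_legacy_storage(storage, na_value=NA):
    """Return a (storage, na_value) pair for a storage option.

    Hypothetical helper: it only mirrors the rename in this PR, where
    "pyarrow_numpy" is spelled as storage="pyarrow" with na_value=NaN.
    """
    if storage == "pyarrow_numpy":
        # Legacy spelling: pyarrow-backed storage with NaN as missing value.
        return "pyarrow", math.nan
    if storage not in ("python", "pyarrow"):
        raise ValueError(f"unknown storage: {storage!r}")
    return storage, na_value


legacy = split_legacy_storage("pyarrow_numpy")
explicit = split_legacy_storage("pyarrow", na_value=math.nan)
# Both spellings describe the same pyarrow-backed NaN variant.
print(legacy[0] == explicit[0], math.isnan(legacy[1]), math.isnan(explicit[1]))
```

Note that the two NaN values are compared with `math.isnan` rather than `==`, since NaN never compares equal to itself.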

WillAyd

Minor comments / questions but generally lgtm

pandas/core/arrays/string_.py

WillAyd · 2024-07-27T15:02:29Z

pandas/tests/arrays/string_/test_string.py

        exp_dtype = "int64"
+    elif dtype.storage == "pyarrow":
+        exp_dtype = "int64[pyarrow]"


Somewhat surprised by this behavior - so depending on whether pyarrow is installed or not, you will get back two different types? Is it not possible to just always return pd.Int64Dtype()?

Yeah, I am personally not a fan of this behaviour (and if I knew we were changing this I think I would have objected). This seems to have changed somewhere in the 2.x releases:

With pandas 1.5:

    In [3]: pd.Series(["a", "b", "a"], dtype="string[python]").value_counts()
    Out[3]:
    a    2
    b    1
    dtype: Int64

    In [4]: pd.Series(["a", "b", "a"], dtype="string[pyarrow]").value_counts()
    Out[4]:
    a    2
    b    1
    dtype: Int64

With pandas 2.2:

    In [1]: pd.Series(["a", "b", "a"], dtype="string[pyarrow]").value_counts()
    Out[1]:
    a    2
    b    1
    Name: count, dtype: int64[pyarrow]

    In [2]: pd.Series(["a", "b", "a"], dtype="string[python]").value_counts()
    Out[2]:
    a    2
    b    1
    Name: count, dtype: Int64

However, this is not introduced by this PR (I just reordered the clauses here, which makes the diff look bigger). Given that it is strictly speaking unrelated and potentially contentious to change, it's definitely a topic for a different issue/PR :)

Fair enough on this PR, though this behavior is a little stranger now that we have one pd.StringDtype class instead of separate python / arrow types. This leaks implementation details into the API space

Blame suggests it started with #51542 so cc @mroeschke for any insights, but I would really like to move away from this behavior in follow ups

> though this behavior is a little stranger now that we have one pd.StringDtype class instead of separate python / arrow types

To be clear, that's something we have already had for quite some time. This PR only touches how the NaN-variant of StringDtype is constructed; it does not touch the NA-variants of StringDtype (or the fact that we also have an ArrowDtype("string")). The behaviour you (rightfully IMO) bring up only applies to the NA-variant.

Ah ok makes sense - misinterpreted this as changing the NA variant.

Still a bit strange that the NA and NaN variant return different types, but not something to solve here

> Still a bit strange that the NA and NaN variant return different types, but not something to solve here

That part is actually intentional (although also not changed in this PR), as it is one of the essential parts of the PDEP. Quoting from the section about missing value semantics:

> In practice, this means that the default string dtype will use NaN as
> the missing value sentinel, and:
>
> - String columns will follow NaN-semantics for missing values, where NaN gives
>   False in boolean operations such as comparisons or predicates.
> - Operations on the string column that give a numeric or boolean result will use
>   the default data types (i.e. numpy int64/float64/bool).
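The dtype choices discussed in this thread can be summarised in a small sketch. It assumes the three outcomes visible in the test diff and the pandas 2.2 output above (`int64`, `int64[pyarrow]`, `Int64`); `expected_counts_dtype` and its `na_is_nan` flag are hypothetical names used for illustration only, not part of pandas.

```python
def expected_counts_dtype(storage: str, na_is_nan: bool) -> str:
    """Name the dtype that value_counts() is expected to return.

    Hypothetical helper summarising the behaviour discussed in this thread;
    it does not call pandas.
    """
    if na_is_nan:
        # NaN variant: numeric results use the default NumPy data types.
        return "int64"
    if storage == "pyarrow":
        # NA variant backed by pyarrow: the result is an Arrow-backed integer.
        return "int64[pyarrow]"
    # NA variant backed by Python objects: nullable masked integer.
    return "Int64"


for storage, na_is_nan in [("pyarrow", True), ("pyarrow", False), ("python", False)]:
    print(storage, na_is_nan, expected_counts_dtype(storage, na_is_nan))
```

Only the NA variant depends on the storage backend, which is the behaviour questioned above; the NaN variant always falls back to the NumPy default.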

Actually need to take another look at this. I still think pd.StringDtype + pd.NA should always return a pd.Int64Dtype here regardless of if pyarrow is installed or not

> Actually need to take another look at this. I still think pd.StringDtype + pd.NA should always return a pd.Int64Dtype here regardless of if pyarrow is installed or not

Can you open a new issue about that to discuss this?
As mentioned above, this is neither affected by nor really related to this PR (this PR only deals with the NaN variant, not the NA variant of the dtype, and the behaviour you question is solely related to the NA variant).

Ah OK - I did not realize this was already in main. Opened up #59346 for discussion; hoping we can revert that behavior before too long

WillAyd · 2024-07-27T20:26:02Z

@mroeschke over to you

WillAyd

Need to discuss API a bit more

mroeschke · 2024-07-29T17:51:52Z

pandas/core/arrays/string_.py:123: error: Incompatible types in assignment (expression has type "tuple[str, str]", base class "StorageExtensionDtype" defined the type as "tuple[str]")  [assignment]
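The mypy error above comes from a subclass overriding a class attribute with a tuple of a different length. A minimal sketch of that pattern follows, with hypothetical class names and attribute values standing in for the pandas ones; the `# type: ignore[assignment]` comment corresponds to the "suppress typing error with _metadata attribute" commit, but the exact spelling used in pandas is an assumption here.

```python
from __future__ import annotations


class StorageBase:
    # The base class fixes the attribute to a one-element tuple type.
    _metadata: tuple[str] = ("storage",)


class StringLike(StorageBase):
    # A two-element tuple is not assignable to tuple[str], so mypy reports
    # "Incompatible types in assignment" unless the check is suppressed.
    _metadata = ("storage", "na_value")  # type: ignore[assignment]


print(StorageBase._metadata, StringLike._metadata)
```

An alternative that avoids the suppression would be to declare the base attribute as `tuple[str, ...]`, which accepts any length; whether that fits the real base class is not something this sketch can confirm.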

mroeschke · 2024-07-29T19:32:13Z

pandas/core/arrays/arrow/array.py

-                    "pyarrow_numpy",
-                ):
+                if self._dtype.name == "string" and self._dtype.storage == "pyarrow":
+                    # TODO(infer_string) should this be large_string?


Yeah I think this was overlooked when the large_string transition happened. Might be nice if this pyarrow type was an attribute on StringDtype?

mroeschke · 2024-07-29T19:41:00Z

Thanks @jorisvandenbossche

…n `StringDtype()` (#59330)

* rename storage option and add na_value keyword
* update init
* fix propagating na_value to Array class + fix some tests
* fix more tests
* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy
* restore pyarrow_numpy as option for now
* linting
* try fix typing
* try fix typing
* fix dtype equality to take into account the NaN vs NA
* fix pickling of dtype
* fix test_convert_dtypes
* update expected result for dtype='string'
* suppress typing error with _metadata attribute


jorisvandenbossche added 2 commits July 26, 2024 16:39

rename storage option and add na_value keyword

2f1bc37

Merge remote-tracking branch 'upstream/main' into string-dtype-naming

ff95a83

jorisvandenbossche mentioned this pull request Jul 26, 2024

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

41 tasks

jorisvandenbossche added 5 commits July 26, 2024 21:44

Merge remote-tracking branch 'upstream/main' into string-dtype-naming

10c14fb

update init

e29ca8d

fix propagating na_value to Array class + fix some tests

cb7410f

fix more tests

ffa7ead

disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

a9c466b

jorisvandenbossche added the Strings String extension data type and string data label Jul 27, 2024

jorisvandenbossche added 2 commits July 27, 2024 16:09

restore pyarrow_numpy as option for now

1fc2113

linting

b347b94

jorisvandenbossche marked this pull request as ready for review July 27, 2024 14:39

jorisvandenbossche requested a review from WillAyd as a code owner July 27, 2024 14:39

WillAyd reviewed Jul 27, 2024

WillAyd approved these changes Jul 27, 2024

jorisvandenbossche mentioned this pull request Jul 27, 2024

String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451

Merged

WillAyd requested changes Jul 28, 2024

jorisvandenbossche requested a review from WillAyd July 29, 2024 13:31

jorisvandenbossche added 3 commits July 29, 2024 15:36

try fix typing

80489fe

try fix typing

8587297

fix dtype equality to take into account the NaN vs NA

a9650bb

jorisvandenbossche mentioned this pull request Jul 29, 2024

TST (string dtype): change any_string_dtype fixture to use actual dtype instances #59345

Merged

jorisvandenbossche added 3 commits July 29, 2024 18:13

fix pickling of dtype

4136c9e

fix test_convert_dtypes

c33e14a

update expected result for dtype='string'

151e3d1

WillAyd approved these changes Jul 29, 2024

jorisvandenbossche added 2 commits July 29, 2024 20:10

suppress typing error with _metadata attribute

899e3fc

Merge remote-tracking branch 'upstream/main' into string-dtype-naming

fc952e0

mroeschke added this to the 3.0 milestone Jul 29, 2024

mroeschke reviewed Jul 29, 2024

mroeschke approved these changes Jul 29, 2024

mroeschke merged commit f25a09e into pandas-dev:main Jul 29, 2024
40 of 46 checks passed

jorisvandenbossche deleted the string-dtype-naming branch July 30, 2024 06:37

jorisvandenbossche mentioned this pull request Jul 30, 2024

TST (string dtype): xfail all currently failing tests with future.infer_string #59329

Merged

jorisvandenbossche modified the milestones: 3.0, 2.3 Aug 20, 2024

jorisvandenbossche added the backported label Oct 10, 2024
