String dtype: map builtin str alias to StringDtype #59685

jorisvandenbossche · 2024-09-02T12:03:43Z

This ensures that the builtin type str is also mapped to the future default string dtype (if infer_string is enabled), i.e. to make .astype(str) behave the same as .astype('str").

In general we rely on numpy for mapping builtin types to dtypes. For example int is mapped to np.dtype("int64") by passing that value to np.dtype(..) in pandas_dtype().
But now that we have a default dtype that is mapped to one of those builtin types, we can no longer rely on numpy doing that. Implementation wise, for pandas_dtype(str) I currently added a special case checking explicitly for str and then importing the dtype lazily and returning it.

An alternative could be to handle this through the ExtensionDtype Registry.find(). However, that currently relies on using ExtensionDtype.construct_from_string() (i.e. we would have to expand the typing there to allow type objects as well next to strings, which is of course also possible). And maybe it is better to have control centrally about builtin aliases, instead of allowing external dtypes to map to those as well.

xref #54792

…-str

jorisvandenbossche · 2024-09-02T19:07:01Z

pandas/_libs/lib.pyx

+        if (
+            hasattr(arr, "dtype")
+            and arr.dtype.kind in "mM"
+            and not hasattr(arr, "_pa_array")


This is a bit ugly, but essentially this is avoiding that astype(str) for ArrowExtensionArray (if that happens to have a datetimelike dtype) is called here, because that will use ensure_string_array for that implementation, causing a recursion error.

An alternative would be a better check than arr.dtype.kind in "mM" that restricts itself to DatetimeLikeArrayMixin (essentially I need a isinstance(arr, DatetimeLikeArrayMixin), I think). Could potentially do that with some ABC check.

Another alternative is to handle astype(str) specifically in ArrowExtensionArray (currently astype is not implemented there, and it falls back entirely on the base class implementation)

I think eventually it would be good to implement ArrowExtensionArray.astype so I would support eventually handing str there. Could you add a TODO comment here describing when this could be removed when astype is implemented?

Added a TODO comment

…-str

pandas/_libs/lib.pyx

…erring StringDtype

…-str

mroeschke

One comment otherwise looks good

WillAyd

Generally looks great - some minor things

WillAyd · 2024-09-16T17:26:08Z

pandas/core/dtypes/common.py

@@ -1448,6 +1450,9 @@ def is_extension_array_dtype(arr_or_dtype) -> bool:
    elif isinstance(dtype, np.dtype):
        return False
    else:
+        # TODO ugly -> move into registry find()? Or make this work with pandas_dtype?
+        if dtype is str and using_string_dtype():


Are there other places where maybe we should just have a function like is_builtin_string_dtype instead of this branch? I think its hard to mentally process what logic like this is doing - would be good to abstract that as much as we can

@WillAyd I replaced the registry.find() with pandas_dtype(), then I don't need to handle the case of str here specifically because pandas_dtype() already does that.

In general pandas_dtype() should be our general preprocessing function. I only needed to suppress any warning or error here, because in this case if pandas_dtype fails then is_extension_array_dtype should return False.
(potentially that could be factored out into something like pandas_dtype_safe() if that would be needed in multiple places)

WillAyd · 2024-09-16T17:26:40Z

pandas/tests/arrays/sparse/test_dtype.py

@@ -184,7 +184,7 @@ def test_construct_from_string_fill_value_raises(string):
    [
        (SparseDtype(int, 0), float, SparseDtype(float, 0.0)),
        (SparseDtype(int, 1), float, SparseDtype(float, 1.0)),
-        (SparseDtype(int, 1), str, SparseDtype(object, "1")),
+        (SparseDtype(int, 1), np.str_, SparseDtype(object, "1")),


Do we even need this test case? I'm not sure what expectations we have around the np.str_ data type

Yeah, can also leave out this test case entirely. I am not entirely sure yet what dtype=np.str_ should mean, I was planning to open an issue about that.

I'd be fine to remove; I feel like this opens up a pandora's box of issues that aren't worth tracking down at this point in time

pandas/tests/frame/test_constructors.py

pandas/tests/indexes/period/methods/test_astype.py

…-str

mroeschke · 2024-09-25T17:57:55Z

Thanks @jorisvandenbossche

jbrockmendel · 2024-09-25T18:08:28Z

pandas/core/indexes/base.py

+        return (
+            self._is_comparable_dtype(dtype)
+            or is_object_dtype(dtype)
+            or is_string_dtype(dtype)


why is the is_string_dtype case needed here?

Because we allow to compare non-strings with strings for certain cases, I guess?

In [4]: pd.date_range("2020-01-01", periods=3) == "2020-01-01" Out[4]: array([ True, False, False])

hmm looks like two tests fail if this is disabled, test_map_missing_mixed, test_get_indexer_datetime. The map test I'd have to look into, but the get_indexer one make sense:

ii = IntervalIndex.from_breaks(date_range("2018-01-01", periods=4)) target = DatetimeIndex(["2018-01-02"], dtype="M8[ns]") result = ii.get_indexer(target.astype(str)) > tm.assert_numpy_array_equal(result, expected)

I suspect there are dtype combinations for which we don't need the is_string_dtype check here, might be able to move it to _is_comparable_dtype and restore fast-paths in some cases.

jbrockmendel · 2024-09-25T18:09:38Z

pandas/core/indexes/interval.py

@@ -712,7 +713,7 @@ def _get_indexer(
            # left/right get_indexer, compare elementwise, equality -> match
            indexer = self._get_indexer_unique_sides(target)

-        elif not is_object_dtype(target.dtype):
+        elif not (is_object_dtype(target.dtype) or is_string_dtype(target.dtype)):


why do we exclude string dtype here? id expect _get_indexer to fastpath to all-minus-ones

* String dtype: map builtin str alias to StringDtype * fix tests * fix datetimelike astype and more tests * remove xfails * try fix typing * fix copy_view tests * fix remaining tests with infer_string enabled * ignore typing issue for now * move to common.py * simplify Categorical._str_get_dummies * small cleanup * fix ensure_string_array to not modify extension arrays inplace * fix ensure_string_array once more + fix is_extension_array_dtype for str * still xfail TestArrowArray::test_astype_str when not using infer_string * ensure maybe_convert_objects copies object dtype input array when inferring StringDtype * update test_1d_object_array_does_not_copy test * update constructor copy test + do not copy in maybe_convert_objects? * skip str.get_dummies test for now * use pandas_dtype() instead of registry.find * fix corner cases for calling pandas_dtype * add TODO comment in ensure_string_array

String dtype: map builtin str alias to StringDtype

f444105

jorisvandenbossche added the Strings String extension data type and string data label Sep 2, 2024

jorisvandenbossche added this to the 2.3 milestone Sep 2, 2024

jorisvandenbossche added 11 commits September 2, 2024 15:55

fix tests

630d41c

fix datetimelike astype and more tests

d127770

remove xfails

38d011a

try fix typing

c4ed9d3

fix copy_view tests

cad7d8f

fix remaining tests with infer_string enabled

189e26d

ignore typing issue for now

1089eb3

move to common.py

8f7e968

Merge remote-tracking branch 'upstream/main' into string-dtype-astype…

51900f1

…-str

simplify Categorical._str_get_dummies

15f45d2

small cleanup

4464fb1

jorisvandenbossche commented Sep 2, 2024

View reviewed changes

jorisvandenbossche added 2 commits September 2, 2024 21:27

fix ensure_string_array to not modify extension arrays inplace

650f694

Merge remote-tracking branch 'upstream/main' into string-dtype-astype…

49297f0

…-str

jorisvandenbossche marked this pull request as ready for review September 2, 2024 19:27

jorisvandenbossche requested a review from WillAyd as a code owner September 2, 2024 19:27

jorisvandenbossche requested a review from jbrockmendel September 2, 2024 19:28

jorisvandenbossche commented Sep 2, 2024

View reviewed changes

pandas/_libs/lib.pyx Outdated Show resolved Hide resolved

jorisvandenbossche added 8 commits September 3, 2024 09:47

fix ensure_string_array once more + fix is_extension_array_dtype for str

9164dbb

still xfail TestArrowArray::test_astype_str when not using infer_string

cf9f855

ensure maybe_convert_objects copies object dtype input array when inf…

bd79fc9

…erring StringDtype

update test_1d_object_array_does_not_copy test

4c775d1

Merge remote-tracking branch 'upstream/main' into string-dtype-astype…

3b0b779

…-str

update constructor copy test + do not copy in maybe_convert_objects?

b0276b2

skip str.get_dummies test for now

d413fc6

Merge remote-tracking branch 'upstream/main' into string-dtype-astype…

d634da2

…-str

mroeschke reviewed Sep 16, 2024

View reviewed changes

WillAyd requested changes Sep 16, 2024

View reviewed changes

jorisvandenbossche added 3 commits September 21, 2024 10:51

Merge remote-tracking branch 'upstream/main' into string-dtype-astype…

e791330

…-str

use pandas_dtype() instead of registry.find

db8900c

fix corner cases for calling pandas_dtype

e6aad17

This was referenced Sep 21, 2024

String dtype: overview of breaking behaviour changes #59328

Open

astype(str) / astype_unicode: np.nan converted to "nan" (checknull, skipna) #25353

Open

add TODO comment in ensure_string_array

4e6cf04

WillAyd approved these changes Sep 21, 2024

View reviewed changes

mroeschke approved these changes Sep 25, 2024

View reviewed changes

mroeschke merged commit a9e30c5 into pandas-dev:main Sep 25, 2024
50 of 51 checks passed

jbrockmendel reviewed Sep 25, 2024

View reviewed changes

jorisvandenbossche mentioned this pull request Oct 10, 2024

BUG (string dtype): fix inplace mutation with copy=False in ensure_string_array #59756

Merged

jorisvandenbossche deleted the string-dtype-astype-str branch October 10, 2024 08:48

jorisvandenbossche added the backported label Oct 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String dtype: map builtin str alias to StringDtype #59685

String dtype: map builtin str alias to StringDtype #59685

String dtype: map builtin str alias to StringDtype #59685

String dtype: map builtin str alias to StringDtype #59685

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment