-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: map builtin str alias to StringDtype #59685
String dtype: map builtin str alias to StringDtype #59685
Conversation
if ( | ||
hasattr(arr, "dtype") | ||
and arr.dtype.kind in "mM" | ||
and not hasattr(arr, "_pa_array") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit ugly, but essentially this is avoiding that astype(str)
for ArrowExtensionArray (if that happens to have a datetimelike dtype) is called here, because that will use ensure_string_array
for that implementation, causing a recursion error.
An alternative would be a better check than arr.dtype.kind in "mM"
that restricts itself to DatetimeLikeArrayMixin
(essentially I need a isinstance(arr, DatetimeLikeArrayMixin)
, I think). Could potentially do that with some ABC check.
Another alternative is to handle astype(str)
specifically in ArrowExtensionArray (currently astype
is not implemented there, and it falls back entirely on the base class implementation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think eventually it would be good to implement ArrowExtensionArray.astype
so I would support eventually handing str
there. Could you add a TODO
comment here describing when this could be removed when astype
is implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a TODO comment
…erring StringDtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment otherwise looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks great - some minor things
pandas/core/dtypes/common.py
Outdated
@@ -1448,6 +1450,9 @@ def is_extension_array_dtype(arr_or_dtype) -> bool: | |||
elif isinstance(dtype, np.dtype): | |||
return False | |||
else: | |||
# TODO ugly -> move into registry find()? Or make this work with pandas_dtype? | |||
if dtype is str and using_string_dtype(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there other places where maybe we should just have a function like is_builtin_string_dtype
instead of this branch? I think its hard to mentally process what logic like this is doing - would be good to abstract that as much as we can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd I replaced the registry.find()
with pandas_dtype()
, then I don't need to handle the case of str
here specifically because pandas_dtype()
already does that.
In general pandas_dtype()
should be our general preprocessing function. I only needed to suppress any warning or error here, because in this case if pandas_dtype
fails then is_extension_array_dtype
should return False
.
(potentially that could be factored out into something like pandas_dtype_safe()
if that would be needed in multiple places)
@@ -184,7 +184,7 @@ def test_construct_from_string_fill_value_raises(string): | |||
[ | |||
(SparseDtype(int, 0), float, SparseDtype(float, 0.0)), | |||
(SparseDtype(int, 1), float, SparseDtype(float, 1.0)), | |||
(SparseDtype(int, 1), str, SparseDtype(object, "1")), | |||
(SparseDtype(int, 1), np.str_, SparseDtype(object, "1")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need this test case? I'm not sure what expectations we have around the np.str_
data type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, can also leave out this test case entirely. I am not entirely sure yet what dtype=np.str_
should mean, I was planning to open an issue about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be fine to remove; I feel like this opens up a pandora's box of issues that aren't worth tracking down at this point in time
Thanks @jorisvandenbossche |
return ( | ||
self._is_comparable_dtype(dtype) | ||
or is_object_dtype(dtype) | ||
or is_string_dtype(dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the is_string_dtype case needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we allow to compare non-strings with strings for certain cases, I guess?
In [4]: pd.date_range("2020-01-01", periods=3) == "2020-01-01"
Out[4]: array([ True, False, False])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm looks like two tests fail if this is disabled, test_map_missing_mixed, test_get_indexer_datetime. The map test I'd have to look into, but the get_indexer one make sense:
ii = IntervalIndex.from_breaks(date_range("2018-01-01", periods=4))
target = DatetimeIndex(["2018-01-02"], dtype="M8[ns]")
result = ii.get_indexer(target.astype(str))
> tm.assert_numpy_array_equal(result, expected)
I suspect there are dtype combinations for which we don't need the is_string_dtype check here, might be able to move it to _is_comparable_dtype and restore fast-paths in some cases.
@@ -712,7 +713,7 @@ def _get_indexer( | |||
# left/right get_indexer, compare elementwise, equality -> match | |||
indexer = self._get_indexer_unique_sides(target) | |||
|
|||
elif not is_object_dtype(target.dtype): | |||
elif not (is_object_dtype(target.dtype) or is_string_dtype(target.dtype)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we exclude string dtype here? id expect _get_indexer to fastpath to all-minus-ones
* String dtype: map builtin str alias to StringDtype * fix tests * fix datetimelike astype and more tests * remove xfails * try fix typing * fix copy_view tests * fix remaining tests with infer_string enabled * ignore typing issue for now * move to common.py * simplify Categorical._str_get_dummies * small cleanup * fix ensure_string_array to not modify extension arrays inplace * fix ensure_string_array once more + fix is_extension_array_dtype for str * still xfail TestArrowArray::test_astype_str when not using infer_string * ensure maybe_convert_objects copies object dtype input array when inferring StringDtype * update test_1d_object_array_does_not_copy test * update constructor copy test + do not copy in maybe_convert_objects? * skip str.get_dummies test for now * use pandas_dtype() instead of registry.find * fix corner cases for calling pandas_dtype * add TODO comment in ensure_string_array
* String dtype: map builtin str alias to StringDtype * fix tests * fix datetimelike astype and more tests * remove xfails * try fix typing * fix copy_view tests * fix remaining tests with infer_string enabled * ignore typing issue for now * move to common.py * simplify Categorical._str_get_dummies * small cleanup * fix ensure_string_array to not modify extension arrays inplace * fix ensure_string_array once more + fix is_extension_array_dtype for str * still xfail TestArrowArray::test_astype_str when not using infer_string * ensure maybe_convert_objects copies object dtype input array when inferring StringDtype * update test_1d_object_array_does_not_copy test * update constructor copy test + do not copy in maybe_convert_objects? * skip str.get_dummies test for now * use pandas_dtype() instead of registry.find * fix corner cases for calling pandas_dtype * add TODO comment in ensure_string_array
This ensures that the builtin type
str
is also mapped to the future default string dtype (ifinfer_string
is enabled), i.e. to make.astype(str)
behave the same as.astype('str")
.In general we rely on numpy for mapping builtin types to dtypes. For example
int
is mapped tonp.dtype("int64")
by passing that value tonp.dtype(..)
inpandas_dtype()
.But now that we have a default dtype that is mapped to one of those builtin types, we can no longer rely on numpy doing that. Implementation wise, for
pandas_dtype(str)
I currently added a special case checking explicitly forstr
and then importing the dtype lazily and returning it.An alternative could be to handle this through the ExtensionDtype
Registry.find()
. However, that currently relies on usingExtensionDtype.construct_from_string()
(i.e. we would have to expand the typing there to allow type objects as well next to strings, which is of course also possible). And maybe it is better to have control centrally about builtin aliases, instead of allowing external dtypes to map to those as well.xref #54792