-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451
String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451
Conversation
pandas/_testing/asserters.py
Outdated
# Specifically for StringArrayNumpySemantics, validate here we have a valid array | ||
if isinstance(left.dtype, StringDtype) and left.dtype.storage == "python_numpy": | ||
assert np.all( | ||
[np.isnan(val) for val in left._ndarray[left_na]] # type: ignore[attr-defined] | ||
), "wrong missing value sentinels" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit a custom check (and we don't do anything similarly for other types), but given I initially overlooked a case where we were creating string arrays with the wrong missing value sentinel because the tests don't actually catch that (two arrays with different missing value sentinels still pass as equal in case of EAs), I would prefer keeping this in at least on the short term.
This comment was marked as outdated.
This comment was marked as outdated.
pandas/_testing/asserters.py
Outdated
left = repr(left) | ||
elif isinstance(left, StringDtype): | ||
left = f"StringDtype(storage={left.storage}, na_value={left.na_value})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the NA variants, do we really even need to expose/check against the storage or can we always just check that we have a StringDtype with a pd.NA missing value marker?
The main thing I want to decouple users from is relying heavily on things like StringDtype(storage="python")
, because it makes it really hard to move them away from our internals.
Taking one of the examples we talked about for a possible implementation in 3.0, if we used nanoarrow to back a StringArray without taking on a pyarrow dependency, having a bunch of scattered calls to "python" makes that much harder than it needs to be, without a ton of benefit (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize the storage= keyword is documented in PDEP-14 so not surprised to see it here; I just generally am hoping to minimize how often it appears to users in non-developmental / internal contexts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the NA variants, do we really even need to expose/check against the storage or can we always just check that we have a StringDtype with a pd.NA missing value marker?
You mean that our asserters (assert_frame_equal
et al) would consider a column with string dtype backed by pyarrow vs python as equal? (as long as the na_value
is equal)
The main thing I want to decouple users from is relying heavily on things like
StringDtype(storage="python")
, because it makes it really hard to move them away from our internals.
Yes, I also want to ensure that our general goal is that essentially almost no user should have to be explicit about the storage (that's also one of the reasons that I do not want to include the storage in the string alias for the new-to-be-default string dtype).
But, this is about a developer tool. Currently we regard pd.StringDtype("pyarrow") == pd.StringDtype("python")
as False, and so assert_frame_equal
will fail for that. And in that case, for the developer UX, the assert error message should be clear.
Right now, you can get something like:
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="col1") are different
Attribute "dtype" are different
[left]: string[pyarrow]
[right]: string[pyarrow]
which is not very helpful ...
The reason for that is because I did not bake the pd.NA
vs np.nan
information in the string alias / representation.
See also #59342 for an issue about this that was just opened (we should probably continue the main discussion about an informative repr there, but short term I want to include something like the above as a short-term solution until we decide on the best way forward to address the issues in #59342)
FWIW I also added this code change in #59352 (to fix up failing tests on main) but with a comment explaining the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean that our asserters (
assert_frame_equal
et al) would consider a column with string dtype backed by pyarrow vs python as equal? (as long as thena_value
is equal)
Yea exactly. I'm not sure I 100% feel that way and I see both sides to the argument, but wanted to raise it for discussion
Understood and agreed on all of your other points
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the asserters and how to treat variants of the same dtype, that is a topic we certainly have to discuss more in general as well when expanding this topic to other dtypes (PDEP 13).
Personally, I think there is indeed something to say about treating those dtypes as equal, or at least have an option to toggle how "strict" the dtype check is.
) | ||
return type(self)(result) | ||
else: | ||
# This is when the result type is object. We reach this when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this raise an error or not be possible in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some str methods are weird (i.e. what's In the comment here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And not only weird, there are some methods that genuinely return an object dtype (of course because of lack of a better proper dtype, but right not with the default dtype this is object dtype). For example ser.str.split()
returns list elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. The list-returning functions are more good use cases for PDEP-13 #58455
@@ -655,7 +663,11 @@ def test_isin(dtype, fixed_now_ts): | |||
tm.assert_series_equal(result, expected) | |||
|
|||
result = s.isin(["a", pd.NA]) | |||
expected = pd.Series([True, False, True]) | |||
if dtype.storage == "python" and dtype.na_value is np.nan: | |||
# TODO what do we want here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the best outcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, we should be consistent between the object and pyarrow backed version of the NaN-variant of this dtype (currently it is only StringDtype("python", na_value=np.nan)
that deviates from the three others).
But given that with plain object dtype we currently also match pd.NA with None, I would maybe rather keep that behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK - maybe we should just have this be a branch between np.nan
and pd.NA
then?
But given that with plain object dtype we currently also match pd.NA with None, I would maybe rather keep that behaviour?
I see the reasoning for it but I am hesitant to keep repeating this behavior. Seems like its really a quirk of how our historical Python object storage has worked, and I don't see that aging well as we move beyond it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this a bit more: when doing ser.isin(["a", pd.NA])
, the user is providing a list here, i.e. not something with already a well defined dtype (numpy array or pandas array/series).
So what I expect what would happen here is that the list of values is coerced to the calling ser.dtype
. And then this should give the same result as ser.isin(pd.array("a", pd.NA], ser.dtype))
.
pd.array("a", pd.NA], ser.dtype)
will coerce the NA to a missing value for that dtype, so in case of the new string dtype, this is NaN.
(to be clear, what actually happens is clearly not matching what I describe above as what I would expect)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with @jorisvandenbossche description of the result (and object sucks...)
Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
This PR is certainly not yet perfect (and further reviews are welcome!), but I am going to merge it: we still have to fix a lot of things in general anyway (all the xfailing tests with the current implementation), and by getting this in now we start addressing those xfails (e.g. #59430), we can fix the tests/implementations for both dtype variants at the same time. |
na_value=na_value, | ||
dtype=np.dtype(cast(type, dtype)), | ||
) | ||
if na_value_is_na and mask.any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method (which has now been refactored to _str_map_nan_semantics) is slightly different in StringArray vs ArrowStringArray and im trying to sort out whether the differences are intentional or just cosmetic. could use some help from the author
-
the Arrow version handles this doing the check before map_infer_mask and changing the dtype passed there (also doesn't check for na_value_is_na)
-
the Arrow version sets na_value = np.nan/False on the analogue to L837/839 (again without a na_value_is_na check)
-
the Arrow version doesn't have the L831
convert = convert and not np.all(mask)
; AFAICT no existing tests rely on that line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, my claim in 3 about it not mattering was incorrect. it matters for test_contains_nan and test_empty_str_methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use some help from the author
Although an author who wrote this code almost 4 months ago ;)
Will take a closer look at it later today, but one quick find is that there were changes to the arrow version after I started this PR, so I might not have taken those into account in this version, eg #58483
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ive convinced myself that the arrow version doesnt need the na_value_is_na check bc it is always True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and that 'convert' is never used
…umPy semantics (#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (pandas-dev#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
…umPy semantics (#58451) Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
Similarly like #54533 added a variant of the pyarrow-backed string dtype to use numpy nullable (NaN) semantics (now named
StringDtype(storage="pyarrow", na_value=np.nan)
withArrowStringArrayNumpySemantics
), this PR does the same for our object-dtype numpy array backed StringArray: a newStringDtype(storage="python", na_value=np.nan)
with the correspondingStringArrayNumpySemantics
array.Illustration for discussion in #57073
xref #54792