-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: rename the storage options and add na_value
keyword in StringDtype()
#59330
String dtype: rename the storage options and add na_value
keyword in StringDtype()
#59330
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments / questions but generally lgtm
exp_dtype = "int64" | ||
elif dtype.storage == "pyarrow": | ||
exp_dtype = "int64[pyarrow]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat surprised by this behavior - so depending on if pyarrow is installed or not you will get back two different types? Is it not possible to just always return pd.Int64Dtype()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am personally not a fan of this behaviour (and if I knew we were changing this I think I would have objected). This seems to have changed somewhere in the 2.x releases:
With pandas 1.5
In [3]: pd.Series(["a", "b", "a"], dtype="string[python]").value_counts()
Out[3]:
a 2
b 1
dtype: Int64
In [4]: pd.Series(["a", "b", "a"], dtype="string[pyarrow]").value_counts()
Out[4]:
a 2
b 1
dtype: Int64
With pandas 2.2:
In [1]: pd.Series(["a", "b", "a"], dtype="string[pyarrow]").value_counts()
Out[1]:
a 2
b 1
Name: count, dtype: int64[pyarrow]
In [2]: pd.Series(["a", "b", "a"], dtype="string[python]").value_counts()
Out[2]:
a 2
b 1
Name: count, dtype: Int64
But so this is not introduced by this PR (I just reordered the clauses here, seemingly giving a bigger diff), and given this is strictly not related and potentially contentious to change, it's definitely a topic for a different issue/PR :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough on this PR, though this behavior is a little stranger now that we have one pd.StringDtype
class instead of separate python / arrow types. This leaks implementation details into the API space
Blame suggests it started with #51542 so cc @mroeschke for any insights, but I would really like to move away from this behavior in follow ups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though this behavior is a little stranger now that we have one
pd.StringDtype
class instead of separate python / arrow types
To be clear, that's something we already have for quite a time. This PR is only touching how the NaN-variant of StringDtype
is constructed, it does not touch the NA-variants of the StringDtype
(and the fact that we also have a ArrowDtype("string")
..). The behaviour you (rightfully IMO) bring up is only for the NA-variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok makes sense - misinterpreted this as changing the NA variant.
Still a bit strange that the NA and NaN variant return different types, but not something to solve here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still a bit strange that the NA and NaN variant return different types, but not something to solve here
That part is then actually intentionally (although also not changed in this PR), as that is one of the essential parts of the PDEP, quoting from the section about missing value semantics:
In practice, this means that the default string dtype will use
NaN
as
the missing value sentinel, and:
- String columns will follow NaN-semantics for missing values, where
NaN
gives
False in boolean operations such as comparisons or predicates.- Operations on the string column that give a numeric or boolean result will use
the default data types (i.e. numpyint64
/float64
/bool
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually need to take another look at this. I still think pd.StringDtype + pd.NA should always return a pd.Int64Dtype here regardless of if pyarrow is installed or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually need to take another look at this. I still think pd.StringDtype + pd.NA should always return a pd.Int64Dtype here regardless of if pyarrow is installed or not
Can you open a new issue about that to discuss this?
As mentioned above, this is both not affected by as not really related to this PR (this PR only deals with the NaN variant, not the NA variant of the dtype, and the behaviour you question is solely related to the NA variant).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK - I did not realize this was already in main. Opened up #59346 for discussion; hoping we can revert that behavior before too long
@mroeschke over to you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to discuss API a bit more
pandas/core/arrays/string_.py:123: error: Incompatible types in assignment (expression has type "tuple[str, str]", base class "StorageExtensionDtype" defined the type as "tuple[str]") [assignment] |
"pyarrow_numpy", | ||
): | ||
if self._dtype.name == "string" and self._dtype.storage == "pyarrow": | ||
# TODO(infer_string) should this be large_string? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think this was overlooked when the large_string transition happened. Might be nice if this pyarrow type was an attribute on StringDtype
?
Thanks @jorisvandenbossche |
…n `StringDtype()` (#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (pandas-dev#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
…n `StringDtype()` (#59330) * rename storage option and add na_value keyword * update init * fix propagating na_value to Array class + fix some tests * fix more tests * disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy * restore pyarrow_numpy as option for now * linting * try fix typing * try fix typing * fix dtype equality to take into account the NaN vs NA * fix pickling of dtype * fix test_convert_dtypes * update expected result for dtype='string' * suppress typing error with _metadata attribute
xref #54792
Rename the storage options and add
na_value
keyword, i.e. from the PDEP: from currentstorage="pyarrow_numpy"
tostorage="pyarrow", na_value=np.nan
This does not yet deprecate the
"pyarrow_numpy"
string option or"string[pyarrow_numpy]"
alias, as our tests are using that a lot (I would leave that for a follow-up PR to first update all tests and then deprecate it).