[go: up one dir, main page]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: rename the storage options and add na_value keyword in StringDtype() #59330

Merged
merged 17 commits into from
Jul 29, 2024

Conversation

jorisvandenbossche
Copy link
Member
@jorisvandenbossche jorisvandenbossche commented Jul 26, 2024

xref #54792

Rename the storage options and add na_value keyword, i.e. from the PDEP: from current storage="pyarrow_numpy" to storage="pyarrow", na_value=np.nan

This does not yet deprecate the "pyarrow_numpy" string option or "string[pyarrow_numpy]" alias, as our tests are using that a lot (I would leave that for a follow-up PR to first update all tests and then deprecate it).

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Jul 27, 2024
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review July 27, 2024 14:39
Copy link
Member
@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments / questions but generally lgtm

pandas/core/arrays/string_.py Show resolved Hide resolved
exp_dtype = "int64"
elif dtype.storage == "pyarrow":
exp_dtype = "int64[pyarrow]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat surprised by this behavior - so depending on if pyarrow is installed or not you will get back two different types? Is it not possible to just always return pd.Int64Dtype()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am personally not a fan of this behaviour (and if I knew we were changing this I think I would have objected). This seems to have changed somewhere in the 2.x releases:

With pandas 1.5

In [3]: pd.Series(["a", "b", "a"], dtype="string[python]").value_counts()
Out[3]: 
a    2
b    1
dtype: Int64

In [4]: pd.Series(["a", "b", "a"], dtype="string[pyarrow]").value_counts()
Out[4]: 
a    2
b    1
dtype: Int64

With pandas 2.2:

In [1]: pd.Series(["a", "b", "a"], dtype="string[pyarrow]").value_counts()
Out[1]: 
a    2
b    1
Name: count, dtype: int64[pyarrow]

In [2]: pd.Series(["a", "b", "a"], dtype="string[python]").value_counts()
Out[2]: 
a    2
b    1
Name: count, dtype: Int64

But so this is not introduced by this PR (I just reordered the clauses here, seemingly giving a bigger diff), and given this is strictly not related and potentially contentious to change, it's definitely a topic for a different issue/PR :)

Copy link
Member
@WillAyd WillAyd Jul 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough on this PR, though this behavior is a little stranger now that we have one pd.StringDtype class instead of separate python / arrow types. This leaks implementation details into the API space

Blame suggests it started with #51542 so cc @mroeschke for any insights, but I would really like to move away from this behavior in follow ups

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though this behavior is a little stranger now that we have one pd.StringDtype class instead of separate python / arrow types

To be clear, that's something we already have for quite a time. This PR is only touching how the NaN-variant of StringDtype is constructed, it does not touch the NA-variants of the StringDtype (and the fact that we also have a ArrowDtype("string") ..). The behaviour you (rightfully IMO) bring up is only for the NA-variant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok makes sense - misinterpreted this as changing the NA variant.

Still a bit strange that the NA and NaN variant return different types, but not something to solve here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still a bit strange that the NA and NaN variant return different types, but not something to solve here

That part is then actually intentionally (although also not changed in this PR), as that is one of the essential parts of the PDEP, quoting from the section about missing value semantics:

In practice, this means that the default string dtype will use NaN as
the missing value sentinel, and:

  • String columns will follow NaN-semantics for missing values, where NaN gives
    False in boolean operations such as comparisons or predicates.
  • Operations on the string column that give a numeric or boolean result will use
    the default data types (i.e. numpy int64/float64/bool).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually need to take another look at this. I still think pd.StringDtype + pd.NA should always return a pd.Int64Dtype here regardless of if pyarrow is installed or not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually need to take another look at this. I still think pd.StringDtype + pd.NA should always return a pd.Int64Dtype here regardless of if pyarrow is installed or not

Can you open a new issue about that to discuss this?
As mentioned above, this is both not affected by as not really related to this PR (this PR only deals with the NaN variant, not the NA variant of the dtype, and the behaviour you question is solely related to the NA variant).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK - I did not realize this was already in main. Opened up #59346 for discussion; hoping we can revert that behavior before too long

@WillAyd
Copy link
Member
WillAyd commented Jul 27, 2024

@mroeschke over to you

Copy link
Member
@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to discuss API a bit more

@mroeschke
Copy link
Member
pandas/core/arrays/string_.py:123: error: Incompatible types in assignment (expression has type "tuple[str, str]", base class "StorageExtensionDtype" defined the type as "tuple[str]")  [assignment]

@mroeschke mroeschke added this to the 3.0 milestone Jul 29, 2024
"pyarrow_numpy",
):
if self._dtype.name == "string" and self._dtype.storage == "pyarrow":
# TODO(infer_string) should this be large_string?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this was overlooked when the large_string transition happened. Might be nice if this pyarrow type was an attribute on StringDtype?

@mroeschke mroeschke merged commit f25a09e into pandas-dev:main Jul 29, 2024
40 of 46 checks passed
@mroeschke
Copy link
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the string-dtype-naming branch July 30, 2024 06:37
WillAyd pushed a commit that referenced this pull request Aug 13, 2024
…n `StringDtype()` (#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 14, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 15, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
@jorisvandenbossche jorisvandenbossche modified the milestones: 3.0, 2.3 Aug 20, 2024
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 21, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 22, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Aug 27, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
WillAyd pushed a commit to WillAyd/pandas that referenced this pull request Sep 20, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 2, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 3, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
jorisvandenbossche added a commit to WillAyd/pandas that referenced this pull request Oct 7, 2024
…n `StringDtype()` (pandas-dev#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
jorisvandenbossche added a commit that referenced this pull request Oct 9, 2024
…n `StringDtype()` (#59330)

* rename storage option and add na_value keyword

* update init

* fix propagating na_value to Array class + fix some tests

* fix more tests

* disallow pyarrow_numpy as option + fix more cases of checking storage to be pyarrow_numpy

* restore pyarrow_numpy as option for now

* linting

* try fix typing

* try fix typing

* fix dtype equality to take into account the NaN vs NA

* fix pickling of dtype

* fix test_convert_dtypes

* update expected result for dtype='string'

* suppress typing error with _metadata attribute
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants