BUG: Dataframe.astype(...) Fails With 'bool' object has no attribute 'all' #7364

Open
2 of 3 tasks
zombie-einstein opened this issue Aug 7, 2024 · 2 comments
Labels
bug 🦗 Something isn't working Triage 🩹 Issues that need triage

Comments

@zombie-einstein

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd
import numpy as np

# This will cast to a float type due to NaNs
df = pd.DataFrame({"a": [1, 2, 3, np.nan]})

# This passes
df.astype({"a": pd.Int64Dtype()})
#       a
# 0     1
# 1     2
# 2     3
# 3  <NA>

# This fails
df.astype(pd.Int64Dtype())
# AttributeError: 'bool' object has no attribute 'all'

Issue Description

I think this is the same error as #7276, but that issue was closed. Calling astype on a DataFrame fails at this check:

File .../modin/core/dataframe/pandas/dataframe/dataframe.py:
   1736, in PandasDataframe.astype(self, col_dtypes, errors)
   1730         return df.astype(
   1731             {k: v for k, v in col_dtypes.items() if k in df}, errors=errors
   1732         )
   1734 else:
   1735     # Assume that the dtype is a scalar.
-> 1736     if not (col_dtypes == self_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()
   1738         new_dtype = pandas.api.types.pandas_dtype(col_dtypes)

AttributeError: 'bool' object has no attribute 'all'

When the type argument is a single value (i.e. astype(pd.Int64Dtype())) then it seems that col_dtypes == self_dtypes works out as a single bool value (hence no all attribute).

Note that this works OK if the argument is a dictionary of column names to dtypes.

This also seems to be the same for Series, i.e.:

df["a"].astype(pd.Int64Dtype())

Fails with the same error

Expected Behavior

In native Pandas

df.astype(pd.Int64Dtype())

casts the DataFrame/series to the argument type

Error Logs

No response

Installed Versions

'0.31.0'

@zombie-einstein zombie-einstein added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Aug 7, 2024
@zombie-einstein
Author

In fact this seems to work OK if, at line 1736 of .../modin/core/dataframe/pandas/dataframe/dataframe.py,

   1735     # Assume that the dtype is a scalar.
-> 1736     if not (col_dtypes == self_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()

the ordering of the arguments is swapped, i.e.:

   1735     # Assume that the dtype is a scalar.
-> 1736     if not (self_dtypes == col_dtypes).all():
   1737         new_dtypes = self_dtypes.copy()

I guess there is some ordering dependency in how the result is calculated?

I can open a PR to change this, but I'm not sure whether there is some deeper reasoning here.
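For illustration, the swapped check can be tried in plain pandas, outside Modin. This is only a sketch: `needs_cast` is a hypothetical helper name, not something in Modin, and it assumes `self_dtypes` behaves like the object-dtype Series returned by `DataFrame.dtypes`.

```python
import numpy as np
import pandas as pd


def needs_cast(self_dtypes: pd.Series, col_dtype) -> bool:
    """Hypothetical helper: the check from line 1736 with the operands swapped."""
    # With the Series on the left, Series.__eq__ handles the comparison and
    # returns a boolean Series, so .all() is available.
    return not (self_dtypes == col_dtype).all()


# A float column (because of the NaN) and an already-nullable integer column.
float_dtypes = pd.DataFrame({"a": [1, 2, 3, np.nan]}).dtypes
int_dtypes = pd.DataFrame({"a": pd.array([1, 2, 3, None], dtype="Int64")}).dtypes

print(needs_cast(float_dtypes, pd.Int64Dtype()))  # a cast is required
print(needs_cast(int_dtypes, pd.Int64Dtype()))    # dtypes already match
```

With the original ordering, `pd.Int64Dtype() == float_dtypes` evaluates to a plain bool, which is where the AttributeError comes from.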

@aivanoved
aivanoved commented Aug 16, 2024

I encountered this issue while working on a separate project. Looking into it, as the code suggests:

self_dtypes is of type pandas.Series
col_dtypes in the else branch is a scalar object

col_dtypes == self_dtypes calls col_dtypes.__eq__(self_dtypes) first. pandas dtypes implement __eq__ unconditionally, so it returns a plain False here, since the left-hand side is a dtype and the right-hand side is a pandas.Series
for strings it works because str.__eq__ does not handle a Series operand (it returns NotImplemented), so Python then tries the reflected self_dtypes.__eq__(col_dtypes)

flipping the comparison, self_dtypes == col_dtypes sees that the lhs is a Series and the rhs is a scalar, so it broadcasts to a pandas.Series of type bool, which has an all method to call
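The dispatch described above can be observed by calling `__eq__` directly. This is a minimal sketch in plain pandas (not Modin); the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for self_dtypes: an object-dtype Series of column dtypes.
self_dtypes = pd.DataFrame({"a": [1, 2, 3, np.nan]}).dtypes
col_dtype = pd.Int64Dtype()

# str.__eq__ does not know how to compare with a Series, so it signals
# NotImplemented and Python falls back to Series.__eq__.
str_result = "float64".__eq__(self_dtypes)

# The pandas dtype answers the comparison itself with a plain bool,
# so Python never falls back to Series.__eq__.
dtype_result = col_dtype.__eq__(self_dtypes)

# With the Series on the left, the comparison is broadcast element-wise.
flipped = self_dtypes == col_dtype

print(str_result)
print(dtype_result)
print(type(flipped).__name__)
```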

@zombie-einstein, to me the solution seems correct, but I am new to modin

hope one of the maintainers of modin can take a further look into this issue

More general context below

here is an example of why the flipping works:

import pandas as pd

s = pd.Series(['a', 'b', 'c'])

int_64 = pd.Int64Dtype()
string = 'a'

print(s == string)
# 0     True
# 1    False
# 2    False
# dtype: bool



print(string == s)
# 0     True
# 1    False
# 2    False
# dtype: bool



print(int_64 == s)
# False



print(s == int_64)
# 0    False
# 1    False
# 2    False
# dtype: bool

note on usage and thoughts on it:

Modin's own docstring for astype says the following:

def astype(self, col_dtypes, errors: str = "raise"):
    """
    Convert the columns dtypes to given dtypes.
    
    Parameters
    ----------
    col_dtypes : dictionary of {col: dtype,...} or str
        Where col is the column name and dtype is a NumPy dtype.
    errors : {'raise', 'ignore'}, default: 'raise'
        Control raising of exceptions on invalid data for provided dtype.
    
    Returns
    -------
    BaseDataFrame
        Dataframe with updated dtypes.
    """

So, going by the docstring, assert isinstance(col_dtypes, dict | str) should hold and dtype objects should not be passed at all. However, I would argue this is bad UX; for reference, pandas accepts dtype objects for astype (documentation here).
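One way to accept both strings and dtype objects would be to normalize the scalar argument before comparing. The sketch below is an assumption about how that could look, not Modin's implementation: `normalize_scalar_dtype` is a hypothetical name, and it relies only on `pandas.api.types.pandas_dtype`, which the code around line 1738 already calls.

```python
import pandas as pd
from pandas.api.types import pandas_dtype


def normalize_scalar_dtype(col_dtypes):
    """Hypothetical helper: map a string alias or a dtype object to a dtype instance."""
    return pandas_dtype(col_dtypes)


# Both spellings end up as the same nullable integer dtype.
from_string = normalize_scalar_dtype("Int64")
from_object = normalize_scalar_dtype(pd.Int64Dtype())

print(from_string == from_object)
```

After normalization the value is always a dtype object, so the comparison against self_dtypes would still need the Series on the left-hand side.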
