ERR: consistent error messages for unsupported setitem values #60218

Open
jorisvandenbossche opened this issue Nov 6, 2024 · 4 comments
Labels: API - Consistency (Internal Consistency of API/Behavior); Error Reporting (Incorrect or improved errors from pandas); Indexing (Related to indexing on series/frames, not to indexes themselves)

@jorisvandenbossche
Member

While cleaning up some string tests, I noticed that the setitem validation error message differs between pyarrow and python storage for StringDtype (I will open a PR to make that consistent), which made me wonder what the situation is in general. Creating an overview here, similar to #59580 (error messages in reduction operations).

| dtype | val | exception | message |
|:---|:---|:---|:---|
| string | int | TypeError | Scalar must be NA or str |
| datetime | int | TypeError | value should be a 'Timestamp', 'NaT', or array of those. Got 'int' instead. |
| datetime-tz | int | TypeError | value should be a 'Timestamp', 'NaT', or array of those. Got 'int' instead. |
| datetime-tz | timestamp | TypeError | Cannot compare tz-naive and tz-aware datetime-like objects |
| period | timestamp | TypeError | value should be a 'Period', 'NaT', or array of those. Got 'Timestamp' instead. |
| timedelta | timestamp | TypeError | value should be a 'Timedelta', 'NaT', or array of those. Got 'Timestamp' instead. |
| range | str | ValueError | invalid literal for int() with base 10: 'str' |
| range | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| int8 | str | ValueError | invalid literal for int() with base 10: 'str' |
| int8 | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| int8 | interval | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'pandas._libs.interval.Interval' |
| int16 | str | ValueError | invalid literal for int() with base 10: 'str' |
| int16 | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| int32 | str | ValueError | invalid literal for int() with base 10: 'str' |
| int32 | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| int64 | str | ValueError | invalid literal for int() with base 10: 'str' |
| int64 | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| uint8 | str | ValueError | invalid literal for int() with base 10: 'str' |
| uint8 | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| uint16 | str | ValueError | invalid literal for int() with base 10: 'str' |
| uint16 | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| uint32 | str | ValueError | invalid literal for int() with base 10: 'str' |
| uint32 | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| uint64 | str | ValueError | invalid literal for int() with base 10: 'str' |
| uint64 | timestamp | TypeError | int() argument must be a string, a bytes-like object or a real number, not 'Timestamp' |
| float32 | str | ValueError | could not convert string to float: 'str' |
| float32 | timestamp | TypeError | float() argument must be a string or a real number, not 'Timestamp' |
| float64 | str | ValueError | could not convert string to float: 'str' |
| float64 | timestamp | TypeError | float() argument must be a string or a real number, not 'Timestamp' |
| complex64 | str | ValueError | complex() arg is a malformed string |
| complex64 | timestamp | TypeError | must be real number, not Timestamp |
| complex128 | str | ValueError | complex() arg is a malformed string |
| complex128 | timestamp | TypeError | must be real number, not Timestamp |
| categorical | int | TypeError | Cannot setitem on a Categorical with a new category (1), set the categories first |
| categorical | timestamp | TypeError | Cannot setitem on a Categorical with a new category (2020-01-01 00:00:00), set the categories first |
| interval | int | TypeError | 'value' should be an interval type, got <class 'int'> instead. |
| nullable_int | str | TypeError | Invalid value 'str' for dtype Int64 |
| nullable_int | timestamp | TypeError | Invalid value '2020-01-01 00:00:00' for dtype Int64 |
| nullable_int | interval | TypeError | Invalid value '(0, 1]' for dtype Int64 |
| nullable_uint | str | TypeError | Invalid value 'str' for dtype UInt16 |
| nullable_float | str | TypeError | Invalid value 'str' for dtype Float32 |
| nullable_bool | int | TypeError | Invalid value '1' for dtype boolean |
| nullable_bool | str | TypeError | Invalid value 'str' for dtype boolean |
| string-python | int | TypeError | Cannot set non-string value '1' into a StringArray. |
| string-python | timestamp | TypeError | Cannot set non-string value '2020-01-01 00:00:00' into a StringArray. |
| string-pyarrow | int | TypeError | Scalar must be NA or str |
| string-pyarrow | timestamp | TypeError | Scalar must be NA or str |

The code to generate the table above (the table is a trimmed version of the result, with some rows that had identical results removed):

import numpy as np

import pandas as pd
from pandas import Index, CategoricalIndex, IntervalIndex

# from conftest.py
indices_dict = {
    "object": Index([f"pandas_{i}" for i in range(10)], dtype=object),
    "string": Index([f"pandas_{i}" for i in range(10)], dtype="str"),
    "datetime": pd.date_range("2020-01-01", periods=10),
    "datetime-tz": pd.date_range("2020-01-01", periods=10, tz="US/Pacific"),
    "period": pd.period_range("2020-01-01", periods=10, freq="D"),
    "timedelta": pd.timedelta_range(start="1 day", periods=10, freq="D"),
    "range": pd.RangeIndex(10),
    "int8": Index(np.arange(10), dtype="int8"),
    "int16": Index(np.arange(10), dtype="int16"),
    "int32": Index(np.arange(10), dtype="int32"),
    "int64": Index(np.arange(10), dtype="int64"),
    "uint8": Index(np.arange(10), dtype="uint8"),
    "uint16": Index(np.arange(10), dtype="uint16"),
    "uint32": Index(np.arange(10), dtype="uint32"),
    "uint64": Index(np.arange(10), dtype="uint64"),
    "float32": Index(np.arange(10), dtype="float32"),
    "float64": Index(np.arange(10), dtype="float64"),
    "bool-object": Index([True, False] * 5, dtype=object),
    "bool-dtype": Index([True, False] * 5, dtype=bool),
    "complex64": Index(
        np.arange(10, dtype="complex64") + 1.0j * np.arange(10, dtype="complex64")
    ),
    "complex128": Index(
        np.arange(10, dtype="complex128") + 1.0j * np.arange(10, dtype="complex128")
    ),
    "categorical": CategoricalIndex(list("abcd") * 2),
    "interval": IntervalIndex.from_breaks(np.linspace(0, 100, num=11)),
    # "empty": Index([]),
    "nullable_int": Index(np.arange(10), dtype="Int64"),
    "nullable_uint": Index(np.arange(10), dtype="UInt16"),
    "nullable_float": Index(np.arange(10), dtype="Float32"),
    "nullable_bool": Index(np.arange(10).astype(bool), dtype="boolean"),
    "string-python": Index(
        pd.array([f"pandas_{i}" for i in range(10)], dtype="string[python]")
    ),
    "string-pyarrow": Index(pd.array([f"pandas_{i}" for i in range(10)], dtype="string[pyarrow]"))
}

results = []

for dtype, data in indices_dict.items():
    for val, val_type in [
        (1, "int"),
        ("str", "str"),
        (pd.Timestamp("2020-01-01"), "timestamp"),
        (pd.Interval(0, 1), "interval")
    ]:
        try:
            data.array[0] = val
        except Exception as e:
            # print(dtype, val, type(e), e)
            results.append((dtype, val_type, str(type(e).__name__), str(e)))

df = pd.DataFrame(results, columns=["dtype", "val", "exception", "message"])
print(df)

print(df.set_index("dtype").to_markdown())
@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Error Reporting Incorrect or improved errors from pandas API - Consistency Internal Consistency of API/Behavior labels Nov 6, 2024
@rhshadrach
Member

rhshadrach commented Nov 6, 2024

Very nice! I believe in the ValueError cases pandas is trying to convert in setitem.

data = Index(np.arange(5), dtype="int64")
data.array[0] = "1"
print(data)
# Index([1, 1, 2, 3, 4], dtype='int64')

Some of the TypeError cases are also a result of pandas attempting to convert, e.g. "int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'", whereas others arise when no conversion is attempted, e.g. "Invalid value 'str' for dtype Int64".

Whether we try to convert or raise seems to me to be the most important question from an API consistency perspective. That then determines whether we should be raising a ValueError (conversion failed) or TypeError (RHS has the wrong type).
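
As a rough illustration of that split, here is a minimal sketch using only the builtin int(), which is where several of the messages in the table come from. The helper name classify_conversion is made up for this example and is not part of pandas or NumPy.

```python
def classify_conversion(value):
    """Label what int(value) does (illustrative helper, not a pandas API)."""
    try:
        int(value)
    except ValueError:
        # The type is accepted for conversion, but this particular value fails.
        return "ValueError: conversion attempted and failed"
    except TypeError:
        # The type itself is rejected before any value is parsed.
        return "TypeError: type not accepted"
    return "converted"


for candidate in ["1", "str", [1]]:
    print(repr(candidate), "->", classify_conversion(candidate))
```

Under a "try to convert" policy both exception types can surface, depending on the input; under a "no conversion" policy only the TypeError branch would be expected.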

@jorisvandenbossche
Member Author

> I believe in the ValueError cases pandas is trying to convert in setitem.

I think it is actually coming from numpy, but indeed from trying to convert the input (because numpy actually allows you to set strings, as long as they can be converted to an int):

>>> arr = np.array([1, 2, 3])
>>> arr[0] = "10"
>>> arr
array([10,  2,  3])

>>> arr[0] = "not an integer"
...
ValueError: invalid literal for int() with base 10: 'not an integer'
>>> int("not an integer")
...
ValueError: invalid literal for int() with base 10: 'not an integer'

For our nullable dtypes, we don't allow setting strings, so in that case there is no "failed conversion", but the value's type is just considered wrong:

>>> arr = pd.array([1, 2, 3])
>>> arr[0] = "10"
...
File ~/scipy/repos/pandas/pandas/core/arrays/masked.py:289, in BaseMaskedArray._validate_setitem_value(self, value)
...
TypeError: Invalid value '10' for dtype Int64
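
The two snippets above can be combined into one script that records only the exception type, since the exact messages may differ between versions. This is a sketch that assumes numpy and pandas are installed; setitem_outcome is an illustrative helper, not a pandas function.

```python
import numpy as np
import pandas as pd


def setitem_outcome(arr, value):
    """Try arr[0] = value and report the result (illustrative helper)."""
    try:
        arr[0] = value
    except (TypeError, ValueError) as err:
        return type(err).__name__
    return "ok"


numpy_arr = np.array([1, 2, 3])
masked_arr = pd.array([1, 2, 3], dtype="Int64")

outcomes = {
    # numpy converts the string before storing it
    "numpy, convertible str": setitem_outcome(numpy_arr, "10"),
    # numpy attempts the conversion, which fails
    "numpy, non-convertible str": setitem_outcome(numpy_arr, "not an integer"),
    # the nullable array rejects strings without attempting a conversion
    "Int64, convertible str": setitem_outcome(masked_arr, "10"),
}
for case, outcome in outcomes.items():
    print(f"{case}: {outcome}")
```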

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 6, 2024

In terms of messages in the case of a TypeError, we currently have these variations:

  • value should be a 'xx' or 'xx', or array of those. Got 'yy' instead.
  • argument must be a xx or xx, not 'yy'
  • 'value' should be an xx type, got yy instead.
  • Invalid value '{value}' for dtype xx
  • Cannot set non-xx value '{value}' into a xx array.
  • Scalar must be xx

Any preference there?

I was thinking of some combination like "Invalid value '{value}' for dtype xx. Value should be a 'xx' or .., got '{type(value)}' instead."
But maybe that's getting too long, with somewhat duplicated information?
Another question is whether the error message should say explicitly that it is about setting values into an array.
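
To get a feel for how long the combined message would be, here is a minimal sketch of a formatter for it. The function name and its parameters are hypothetical and not existing pandas code, and it uses type(value).__name__ rather than type(value) so the output reads 'str' instead of <class 'str'>, which is an assumption about the intended wording.

```python
def format_setitem_type_error(value, dtype_name, expected):
    """Build the combined message discussed above (hypothetical helper)."""
    quoted = [f"'{name}'" for name in expected]
    if len(quoted) > 1:
        # Join as: 'a', 'b' or 'c'
        expected_str = ", ".join(quoted[:-1]) + f" or {quoted[-1]}"
    else:
        expected_str = quoted[0]
    return (
        f"Invalid value '{value}' for dtype {dtype_name}. "
        f"Value should be a {expected_str}, got '{type(value).__name__}' instead."
    )


message = format_setitem_type_error("str", "Int64", ["int", "NA"])
print(message)
```

For a string value the value and its type name both appear, which shows the kind of duplication the question above is about.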

@rhshadrach
Member

Thanks @jorisvandenbossche, I didn't realize conversion was a NumPy behavior. Should we be unifying the behavior between NumPy-backed (converting & raising ValueError on failure) and EA-backed (no attempt to convert & raising TypeError)?
