ERR: consistent error messages for unsupported setitem values #60218

jorisvandenbossche · 2024-11-06T16:36:32Z

While cleaning up some string tests, I noticed that the setitem validation error message differs between pyarrow and python storage for StringDtype (I will do a PR to make that consistent), and that made me wonder what the situation is in general. Creating an overview here, similar to #59580 (error messages in reduction operations).

dtype	val	exception	message
string	int	TypeError	Scalar must be NA or str
datetime	int	TypeError	value should be a 'Timestamp', 'NaT', or array of those. Got 'int' instead.
datetime-tz	int	TypeError	value should be a 'Timestamp', 'NaT', or array of those. Got 'int' instead.
datetime-tz	timestamp	TypeError	Cannot compare tz-naive and tz-aware datetime-like objects
period	timestamp	TypeError	value should be a 'Period', 'NaT', or array of those. Got 'Timestamp' instead.
timedelta	timestamp	TypeError	value should be a 'Timedelta', 'NaT', or array of those. Got 'Timestamp' instead.
range	str	ValueError	invalid literal for int() with base 10: 'str'
range	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
int8	str	ValueError	invalid literal for int() with base 10: 'str'
int8	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
int8	interval	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'pandas._libs.interval.Interval'
int16	str	ValueError	invalid literal for int() with base 10: 'str'
int16	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
int32	str	ValueError	invalid literal for int() with base 10: 'str'
int32	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
int64	str	ValueError	invalid literal for int() with base 10: 'str'
int64	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
uint8	str	ValueError	invalid literal for int() with base 10: 'str'
uint8	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
uint16	str	ValueError	invalid literal for int() with base 10: 'str'
uint16	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
uint32	str	ValueError	invalid literal for int() with base 10: 'str'
uint32	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
uint64	str	ValueError	invalid literal for int() with base 10: 'str'
uint64	timestamp	TypeError	int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'
float32	str	ValueError	could not convert string to float: 'str'
float32	timestamp	TypeError	float() argument must be a string or a real number, not 'Timestamp'
float64	str	ValueError	could not convert string to float: 'str'
float64	timestamp	TypeError	float() argument must be a string or a real number, not 'Timestamp'
complex64	str	ValueError	complex() arg is a malformed string
complex64	timestamp	TypeError	must be real number, not Timestamp
complex128	str	ValueError	complex() arg is a malformed string
complex128	timestamp	TypeError	must be real number, not Timestamp
categorical	int	TypeError	Cannot setitem on a Categorical with a new category (1), set the categories first
categorical	timestamp	TypeError	Cannot setitem on a Categorical with a new category (2020-01-01 00:00:00), set the categories first
interval	int	TypeError	'value' should be an interval type, got <class 'int'> instead.
nullable_int	str	TypeError	Invalid value 'str' for dtype Int64
nullable_int	timestamp	TypeError	Invalid value '2020-01-01 00:00:00' for dtype Int64
nullable_int	interval	TypeError	Invalid value '(0, 1]' for dtype Int64
nullable_uint	str	TypeError	Invalid value 'str' for dtype UInt16
nullable_float	str	TypeError	Invalid value 'str' for dtype Float32
nullable_bool	int	TypeError	Invalid value '1' for dtype boolean
nullable_bool	str	TypeError	Invalid value 'str' for dtype boolean
string-python	int	TypeError	Cannot set non-string value '1' into a StringArray.
string-python	timestamp	TypeError	Cannot set non-string value '2020-01-01 00:00:00' into a StringArray.
string-pyarrow	int	TypeError	Scalar must be NA or str
string-pyarrow	timestamp	TypeError	Scalar must be NA or str

The code to generate the table above (the above table is a trimmed version of the result, removing some lines with identical results):

import numpy as np

import pandas as pd
from pandas import Index, CategoricalIndex, IntervalIndex

# from conftest.py
indices_dict = {
    "object": Index([f"pandas_{i}" for i in range(10)], dtype=object),
    "string": Index([f"pandas_{i}" for i in range(10)], dtype="str"),
    "datetime": pd.date_range("2020-01-01", periods=10),
    "datetime-tz": pd.date_range("2020-01-01", periods=10, tz="US/Pacific"),
    "period": pd.period_range("2020-01-01", periods=10, freq="D"),
    "timedelta": pd.timedelta_range(start="1 day", periods=10, freq="D"),
    "range": pd.RangeIndex(10),
    "int8": Index(np.arange(10), dtype="int8"),
    "int16": Index(np.arange(10), dtype="int16"),
    "int32": Index(np.arange(10), dtype="int32"),
    "int64": Index(np.arange(10), dtype="int64"),
    "uint8": Index(np.arange(10), dtype="uint8"),
    "uint16": Index(np.arange(10), dtype="uint16"),
    "uint32": Index(np.arange(10), dtype="uint32"),
    "uint64": Index(np.arange(10), dtype="uint64"),
    "float32": Index(np.arange(10), dtype="float32"),
    "float64": Index(np.arange(10), dtype="float64"),
    "bool-object": Index([True, False] * 5, dtype=object),
    "bool-dtype": Index([True, False] * 5, dtype=bool),
    "complex64": Index(
        np.arange(10, dtype="complex64") + 1.0j * np.arange(10, dtype="complex64")
    ),
    "complex128": Index(
        np.arange(10, dtype="complex128") + 1.0j * np.arange(10, dtype="complex128")
    ),
    "categorical": CategoricalIndex(list("abcd") * 2),
    "interval": IntervalIndex.from_breaks(np.linspace(0, 100, num=11)),
    # "empty": Index([]),
    "nullable_int": Index(np.arange(10), dtype="Int64"),
    "nullable_uint": Index(np.arange(10), dtype="UInt16"),
    "nullable_float": Index(np.arange(10), dtype="Float32"),
    "nullable_bool": Index(np.arange(10).astype(bool), dtype="boolean"),
    "string-python": Index(
        pd.array([f"pandas_{i}" for i in range(10)], dtype="string[python]")
    ),
    "string-pyarrow": Index(
        pd.array([f"pandas_{i}" for i in range(10)], dtype="string[pyarrow]")
    ),
}

results = []

for dtype, data in indices_dict.items():
    for val, val_type in [
        (1, "int"),
        ("str", "str"),
        (pd.Timestamp("2020-01-01"), "timestamp"),
        (pd.Interval(0, 1), "interval")
    ]:
        try:
            data.array[0] = val
        except Exception as e:
            # print(dtype, val, type(e), e)
            results.append((dtype, val_type, str(type(e).__name__), str(e)))

df = pd.DataFrame(results, columns=["dtype", "val", "exception", "message"])
print(df)

print(df.set_index("dtype").to_markdown())

rhshadrach · 2024-11-06T17:32:59Z

Very nice! I believe in the ValueError cases pandas is trying to convert in setitem.

data = Index(np.arange(5), dtype="int64")
data.array[0] = "1"
print(data)
# Index([1, 1, 2, 3, 4], dtype='int64')

Some of the TypeError cases are also a result of pandas attempting to convert, e.g. "int() argument must be a string, a bytes-like object or a real number, not 'Timestamp'", whereas others arise when no conversion is attempted, e.g. "Invalid value 'str' for dtype Int64".

Whether we try to convert or raise seems to me to be the most important question from an API consistency perspective. That then determines whether we should be raising a ValueError (conversion failed) or a TypeError (RHS has the wrong type).
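
To make the distinction concrete, here is a minimal pure-Python sketch of the two policies. It is a toy model and not pandas code: `set_with_conversion`, `set_strict` and `outcome` are hypothetical names, a plain list stands in for the array, and the strict message simply reuses the nullable Int64 wording from the table above.

```python
def set_with_conversion(values, pos, value):
    """Converting policy (hypothetical): coerce the value with int() first."""
    # int() raises ValueError for an unparsable string such as "str" and
    # TypeError for objects it cannot interpret as a number at all.
    values[pos] = int(value)


def set_strict(values, pos, value):
    """Strict policy (hypothetical): never convert, reject wrong types."""
    if not isinstance(value, int):
        raise TypeError(f"Invalid value '{value}' for dtype Int64")
    values[pos] = value


def outcome(setter, value):
    """Return the stored value, or the name of the exception that was raised."""
    data = [0, 1, 2]
    try:
        setter(data, 0, value)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return data[0]


for value in ["1", "str", object()]:
    print(type(value).__name__, outcome(set_with_conversion, value), outcome(set_strict, value))
```

Under the converting policy a wrong-typed value can fail with either exception depending on how far the conversion gets, while under the strict policy every rejected value is a TypeError.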

jorisvandenbossche · 2024-11-06T19:02:32Z

I believe in the ValueError cases pandas is trying to convert in setitem.

I think it is actually coming from numpy, but indeed from trying to convert the input (because numpy actually allows you to set strings, as long as they can be converted to an int):

>>> arr = np.array([1, 2, 3])
>>> arr[0] = "10"
>>> arr
array([10,  2,  3])

>>> arr[0] = "not an integer"
...
ValueError: invalid literal for int() with base 10: 'not an integer'
>>> int("not an integer")
...
ValueError: invalid literal for int() with base 10: 'not an integer'

For our nullable dtypes, we don't allow setting strings, so in that case there is no "failed conversion", but the value's type is just considered wrong:

>>> arr = pd.array([1, 2, 3])
>>> arr[0] = "10"
...
File ~/scipy/repos/pandas/pandas/core/arrays/masked.py:289, in BaseMaskedArray._validate_setitem_value(self, value)
...
TypeError: Invalid value '10' for dtype Int64

jorisvandenbossche · 2024-11-06T19:12:11Z

In terms of messages in the case of a TypeError, we currently have these variations:

value should be a 'xx' or 'xx', or array of those. Got 'yy' instead.
argument must be a xx or xx, not 'yy'
'value' should be an xx type, got yy instead.
Invalid value '{value}' for dtype xx
Cannot set non-xx value '{value}' into a xx array.
Scalar must be xx

Any preference there?

I was thinking of some combo like "Invalid value '{value}' for dtype xx. Value should be a 'xx' or .., got '{type(value)}' instead."
But maybe that's getting too long / with somewhat duplicative information?
Another question is whether we want the error message to state specifically that it is about setting values into an array.
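
For a sense of how long the combined message gets, here is a rough sketch of a formatter for it. The helper name `setitem_error_message` and its parameters are hypothetical and not part of pandas, and using `type(value).__name__` for the "got ..." part is an assumption about how the type would be rendered.

```python
def setitem_error_message(value, dtype_name, expected):
    """Build the proposed combined message (hypothetical helper, not pandas API).

    `expected` is a list of accepted type names, e.g. ["int", "NA"].
    """
    expected_part = " or ".join(f"'{name}'" for name in expected)
    return (
        f"Invalid value '{value}' for dtype {dtype_name}. "
        f"Value should be a {expected_part}, "
        f"got '{type(value).__name__}' instead."
    )


print(setitem_error_message("str", "Int64", ["int", "NA"]))
# Invalid value 'str' for dtype Int64. Value should be a 'int' or 'NA', got 'str' instead.
```

The value and its type both appear in the output, which is the duplication concern raised above; dropping either half would shorten it.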

rhshadrach · 2024-11-07T21:48:44Z

Thanks @jorisvandenbossche, I didn't realize conversion was a NumPy behavior. Should we be unifying the behavior between NumPy-backed (converting & raising ValueError on failure) and EA-backed (no attempt to convert & raising TypeError)?

jorisvandenbossche added the Indexing, Error Reporting, and API - Consistency labels on Nov 6, 2024

jorisvandenbossche mentioned this issue Nov 6, 2024

ERR (string dtype): harmonize setitem error message for python and pyarrow storage #60219

Merged
