pandas_to_arrow tries to take len() of int

Description

ap_verify failed with this error:

lsst.ctrl.mpexec.singleQuantumExecutor ERROR: Execution of task 'diaPipe' on quantum {instrument: 'HSC', detector: 50, visit: 59150, ...} failed. Exception RuntimeError: Failed to serialize dataset fakes_deepDiff_assocDiaSrc@{instrument: 'HSC', detector: 50, visit: 59150, ...}, sc=DataFrame] (id=ed6f86ed-b7f0-4a73-8816-12805a1235e2) of type <class 'pandas.core.frame.DataFrame'> to temporary location file:///j/ws/scipipe/ap_verify/cosmos_pdr2-main%5Egen3%5Eap_verify-installed/run/ap_verify_ci_cosmos_pdr2/repo/ap_verify-output/20221028T132311Z/fakes_deepDiff_assocDiaSrc/20160307/g/HSC-G/59150/qejve6gc1es0zx5w.parq
Process task-{instrument: 'HSC', detector: 50, visit: 59150, ...}:
Traceback (most recent call last):
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-4.1.0/Linux64/daf_butler/gee902a2e5e+c02f467b15/python/lsst/daf/butler/datastores/fileDatastore.py", line 1162, in _write_in_memory_to_artifact
    formatter.write(inMemoryDataset)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-4.1.0/Linux64/daf_butler/gee902a2e5e+c02f467b15/python/lsst/daf/butler/formatters/parquet.py", line 148, in write
    arrow_table = pandas_to_arrow(inMemoryDataset)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-4.1.0/Linux64/daf_butler/gee902a2e5e+c02f467b15/python/lsst/daf/butler/formatters/parquet.py", line 374, in pandas_to_arrow
    strlen = max(len(row) for row in dataframe[name].values)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-4.1.0/Linux64/daf_butler/gee902a2e5e+c02f467b15/python/lsst/daf/butler/formatters/parquet.py", line 374, in <genexpr>
    strlen = max(len(row) for row in dataframe[name].values)
TypeError: object of type 'int' has no len()
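The failing pattern can be sketched outside the butler. This is a hypothetical minimal reproduction (the dataframe and column name "a" are made up, not taken from the pipeline): an object-dtype column is assumed to hold strings, so a maximum string length is computed with len(), which raises as soon as the column actually holds integers.

```python
import pandas as pd

# An integer column stored with dtype=object, mimicking the state
# pandas apparently got the real dataframe into.
df = pd.DataFrame({"a": pd.Series([1, 2, 3], dtype=object)})

try:
    # Same expression as parquet.py line 374 in the traceback above.
    strlen = max(len(row) for row in df["a"].values)
except TypeError as exc:
    print(exc)  # object of type 'int' has no len()
```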

Activity

Eli Rykoff October 29, 2022 at 12:00 AM

Thanks! The problem is that pandas dataframes store all string arrays as opaque "object" columns, and my code assumed any "object" column holds strings. Under normal usage that is true, but apparently pandas can sometimes decide to store an integer column as "object" (which I'm sure does wonders for performance), and this was tripping up my code. I couldn't write a specific test for this because I couldn't figure out how to get a pandas dataframe into that state in over an hour of banging my head against it. Inspecting the arrow table to find the string columns works because arrow itself checks all the values in a pandas dataframe to figure out the datatype anyway (I assume because pyarrow knows that pandas can't be trusted when it says what type a column is). (FYI, this is all buried in the pyarrow cython code, and not obviously documented anywhere.)

Clare Saunders October 28, 2022 at 11:32 PM

Ok, I approved the PR, and it's fine to not add tests. However, if pandas_to_arrow is well covered, shouldn't the tests have been failing? Surely the arrays used aren't all things with a len()? I'm sure I'm missing some subtlety.

Eli Rykoff October 28, 2022 at 9:26 PM

This is a very subtle thing, but this code is exercised via configuration in the formatter such that a write of a dataframe always uses pandas_to_arrow and a read of an arrow table to a dataframe always uses arrow_to_pandas. If it weren't run at all, the coverage tool would pop up boxes on the files in the PR. (And trust me, when I first did this there were a lot of boxes.)

Clare Saunders October 28, 2022 at 8:50 PM

I saw that nothing in the unit tests uses pandas_to_arrow except a test with an empty array. Would it be worthwhile to add some tests that do a round trip pandas_to_arrow and arrow_to_pandas, or something similar? 


Details

Reviewers: Clare Saunders
RubinTeam: Data Release Production
Created: October 28, 2022 at 2:18 PM
Updated: October 29, 2022 at 1:11 PM
Resolved: October 28, 2022 at 11:54 PM