pandas_to_arrow tries to take len() of int

Description

ap_verify failed with this error:

lsst.ctrl.mpexec.singleQuantumExecutor ERROR: Execution of task 'diaPipe' on quantum {instrument: 'HSC', detector: 50, visit: 59150, ...} failed. Exception RuntimeError: Failed to serialize dataset fakes_deepDiff_assocDiaSrc@{instrument: 'HSC', detector: 50, visit: 59150, ...}, sc=DataFrame] (id=ed6f86ed-b7f0-4a73-8816-12805a1235e2) of type <class 'pandas.core.frame.DataFrame'> to temporary location file:///j/ws/scipipe/ap_verify/cosmos_pdr2-main%5Egen3%5Eap_verify-installed/run/ap_verify_ci_cosmos_pdr2/repo/ap_verify-output/20221028T132311Z/fakes_deepDiff_assocDiaSrc/20160307/g/HSC-G/59150/qejve6gc1es0zx5w.parq
Process task-{instrument: 'HSC', detector: 50, visit: 59150, ...}:
Traceback (most recent call last):
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-4.1.0/Linux64/daf_butler/gee902a2e5e+c02f467b15/python/lsst/daf/butler/datastores/fileDatastore.py", line 1162, in _write_in_memory_to_artifact
    formatter.write(inMemoryDataset)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-4.1.0/Linux64/daf_butler/gee902a2e5e+c02f467b15/python/lsst/daf/butler/formatters/parquet.py", line 148, in write
    arrow_table = pandas_to_arrow(inMemoryDataset)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-4.1.0/Linux64/daf_butler/gee902a2e5e+c02f467b15/python/lsst/daf/butler/formatters/parquet.py", line 374, in pandas_to_arrow
    strlen = max(len(row) for row in dataframe[name].values)
  File "/opt/lsst/software/stack/stack/miniconda3-py38_4.9.2-4.1.0/Linux64/daf_butler/gee902a2e5e+c02f467b15/python/lsst/daf/butler/formatters/parquet.py", line 374, in <genexpr>
    strlen = max(len(row) for row in dataframe[name].values)
TypeError: object of type 'int' has no len()
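The failing pattern can be sketched outside the butler. This is a hypothetical minimal reproduction (the dataframe and column name "a" are made up, not taken from the pipeline): an object-dtype column is assumed to hold strings, so a maximum string length is computed with len(), which raises as soon as the column actually holds integers.

```python
import pandas as pd

# An integer column stored with dtype=object, mimicking the state
# pandas apparently got the real dataframe into.
df = pd.DataFrame({"a": pd.Series([1, 2, 3], dtype=object)})

try:
    # Same expression as parquet.py line 374 in the traceback above.
    strlen = max(len(row) for row in df["a"].values)
except TypeError as exc:
    print(exc)  # object of type 'int' has no len()
```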

Activity

Eli Rykoff October 29, 2022 at 12:00 AM

Thanks! The problem is that pandas dataframes store all string arrays as opaque "object" columns, and my code assumed any "object" column holds strings. Under normal usage that is true, but apparently pandas can sometimes decide to store an integer column as "object" (which I'm sure does wonders for performance), and this was tripping up my code. I couldn't write a specific test for this because I couldn't figure out how to get a pandas dataframe into that state in over an hour of banging my head against it. Inspecting the arrow table to find the string columns works because arrow itself checks all the values in a pandas dataframe to figure out the datatype anyway (I assume because pyarrow knows that pandas can't be trusted when it says what type a column is). (FYI, this is all buried in the pyarrow cython code, and not obviously documented anywhere.)

Clare Saunders October 28, 2022 at 11:32 PM

Ok, I approved the PR, and it's fine to not add tests. However, if pandas_to_arrow is well covered, shouldn't the tests have been failing? Surely the arrays used aren't all things with a len()? I'm sure I'm missing some subtlety.

Eli Rykoff October 28, 2022 at 9:26 PM

This is a very subtle thing, but this code is exercised via configuration in the formatter such that a write of a dataframe always uses pandas_to_arrow and a read of an arrow table to a dataframe always uses arrow_to_pandas. If it weren't run at all, the coverage tool would pop up boxes on the files in the PR. (And trust me, when I first did this there were a lot of boxes.)

Clare Saunders October 28, 2022 at 8:50 PM

I saw that nothing in the unit tests uses pandas_to_arrow except a test with an empty array. Would it be worthwhile to add some tests that do a round trip pandas_to_arrow and arrow_to_pandas, or something similar? 


Details

Reviewers: Clare Saunders
RubinTeam: Data Release Production
Created: October 28, 2022 at 2:18 PM
Updated: October 29, 2022 at 1:11 PM
Resolved: October 28, 2022 at 11:54 PM