Trouble reading documents with empty embedded arrays #208

ccrouch · 2024-05-05T14:41:26Z

Goal:
Read a MongoDB document whose embedded object contains an empty array into a pyarrow table, then write it out as a Parquet file.

Expected result:
Parquet file created

Actual Result:
pymongoarrow raises an error when creating the pyarrow.Table. Interestingly, reading the same document from mongo directly and using pyarrow.json to create the table works fine. Embedded objects with non-empty arrays also work fine with pymongoarrow.

Steps to reproduce:

from pymongo import MongoClient

import pymongoarrow.api as pmaapi

import pyarrow.parquet as papq
import pyarrow.json as pajson

import io
import json
import bson.json_util  # import the submodule explicitly so bson.json_util.dumps is always available


client = MongoClient()
collection = client.testdb.data
collection.drop()

client.testdb.data.insert_many([
    { '_id': 1, 'foo':  { 'bar': ['1','2'] } },
    { '_id': 2, 'foo':  { 'bar': [] } }
])

# get document out of mongo, put it in a file and read it with pyarrow and write it to parquet
doc1 = client.testdb.data.find_one({'_id': 1})
string1 = bson.json_util.dumps(doc1, indent = 2) 
file1 = io.BytesIO(bytes(string1, encoding='utf-8'))
papatable1 = pajson.read_json(file1)
print(str(papatable1))
papq.write_table(papatable1, 'pyarrow' + str(1) + '.parquet')

# read document with pymongoarrow and write it to parquet
pmapatable1 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 1}})
print(str(pmapatable1))
papq.write_table(pmapatable1, 'pymongoarrow' + str(1) + '.parquet')



doc2 = client.testdb.data.find_one({'_id': 2})
string2 = bson.json_util.dumps(doc2, indent = 2) 
file2 = io.BytesIO(bytes(string2, encoding='utf-8'))
papatable2 = pajson.read_json(file2)
print(str(papatable2))
papq.write_table(papatable2, 'pyarrow' + str(2) + '.parquet')

pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
papq.write_table(pmapatable2, 'pymongoarrow' + str(2) + '.parquet')

Running it produces:

$ python repro.py
pyarrow.Table
_id: int64
foo: struct<bar: list<item: string>>
  child 0, bar: list<item: string>
      child 0, item: string
----
_id: [[1]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int32
foo: struct<bar: list<item: string>>
  child 0, bar: list<item: string>
      child 0, item: string
----
_id: [[1]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int64
foo: struct<bar: list<item: null>>
  child 0, bar: list<item: null>
      child 0, item: null
----
_id: [[2]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: null>
[0 nulls]]
Traceback (most recent call last):
  File "/workspaces/vscode-python/pymongoarrow/repro.py", line 45, in <module>
    pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/Envs/pma1/lib/python3.11/site-packages/pymongoarrow/api.py", line 112, in find_arrow_all
    process_bson_stream(batch, context)
  File "pymongoarrow/lib.pyx", line 159, in pymongoarrow.lib.process_bson_stream
  File "pymongoarrow/lib.pyx", line 246, in pymongoarrow.lib.process_raw_bson_stream
  File "pymongoarrow/lib.pyx", line 133, in pymongoarrow.lib.extract_document_dtype
  File "pymongoarrow/lib.pyx", line 108, in pymongoarrow.lib.extract_field_dtype
  File "pyarrow/types.pxi", line 4452, in pyarrow.lib.list_
TypeError: List requires DataType or Field

FWIW, here is what duckdb shows for the three parquet files that are produced:

D select * from 'pyarrow1.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int64 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│     1 │ {'bar': [1, 2]}       │
└───────┴───────────────────────┘
D select * from 'pymongoarrow1.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int32 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│     1 │ {'bar': [1, 2]}       │
└───────┴───────────────────────┘
D select * from 'pyarrow2.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int64 │ struct(bar integer[]) │
├───────┼───────────────────────┤
│     2 │ {'bar': []}           │
└───────┴───────────────────────┘
D

Versions:

Python 3.11.8 (main, Mar 12 2024, 11:41:52) [GCC 12.2.0] on linux
Successfully installed dnspython-2.6.1 numpy-1.26.4 packaging-23.2 pandas-2.2.2 pyarrow-15.0.2 pymongo-4.7.1 pymongoarrow-1.3.0 python-dateutil-2.9.0.post0 pytz-2024.1 six-1.16.0 tzdata-2024.1

blink1073 · 2024-05-07T01:16:26Z

Hi @ccrouch, thanks for pointing out the limitation in our parser. I opened https://jira.mongodb.org/browse/ARROW-230 to track the fix.

blink1073 added the bug label May 7, 2024
