Trouble reading documents with empty embedded arrays #208

Open
ccrouch opened this issue May 5, 2024 · 1 comment
Labels: bug (Something isn't working)

ccrouch commented May 5, 2024

Goal:
I'm trying to read a MongoDB document that has an embedded object containing an empty array into a pyarrow Table, then write it out as a parquet file.

Expected result:
Parquet file created

Actual Result:
pymongoarrow raises an error while creating the pyarrow.Table. Interestingly, reading the same document from MongoDB directly and using pyarrow.json to create the table works fine. Embedded objects with non-empty arrays also work fine with pymongoarrow.
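
To see which documents in a collection would hit this, something like the following might help. It is only a sketch over plain dicts: `empty_array_paths` is a hypothetical helper, not part of pymongo or pymongoarrow, and the sample documents mirror the two used in the repro.

```python
from typing import Any, Iterator


def empty_array_paths(doc: Any, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every empty list in a nested document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from empty_array_paths(value, path)
    elif isinstance(doc, list):
        if not doc:
            # An empty array carries no element type to infer from.
            yield prefix
        for index, item in enumerate(doc):
            yield from empty_array_paths(item, f"{prefix}.{index}")


# Same shapes as the two documents inserted in the repro script.
docs = [
    {"_id": 1, "foo": {"bar": ["1", "2"]}},
    {"_id": 2, "foo": {"bar": []}},
]
affected = {d["_id"]: list(empty_array_paths(d)) for d in docs}
print(affected)
```

Only the document with `_id` 2 reports a path (`foo.bar`), which matches the document that fails with pymongoarrow.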

Steps to reproduce:

from pymongo import MongoClient

import pymongoarrow.api as pmaapi

import pyarrow.parquet as papq
import pyarrow.json as pajson

import io
import json
import bson.json_util  # a bare "import bson" does not guarantee the json_util submodule is loaded


client = MongoClient()
collection = client.testdb.data
collection.drop()

client.testdb.data.insert_many([
    { '_id': 1, 'foo':  { 'bar': ['1','2'] } },
    { '_id': 2, 'foo':  { 'bar': [] } }
])

# get document out of mongo, put it in a file and read it with pyarrow and write it to parquet
doc1 = client.testdb.data.find_one({'_id': 1})
string1 = bson.json_util.dumps(doc1, indent = 2) 
file1 = io.BytesIO(bytes(string1, encoding='utf-8'))
papatable1 = pajson.read_json(file1)
print(str(papatable1))
papq.write_table(papatable1, 'pyarrow' + str(1) + '.parquet')

# read document with pymongoarrow and write it to parquet
pmapatable1 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 1}})
print(str(pmapatable1))
papq.write_table(pmapatable1, 'pymongoarrow' + str(1) + '.parquet')



doc2 = client.testdb.data.find_one({'_id': 2})
string2 = bson.json_util.dumps(doc2, indent = 2) 
file2 = io.BytesIO(bytes(string2, encoding='utf-8'))
papatable2 = pajson.read_json(file2)
print(str(papatable2))
papq.write_table(papatable2, 'pyarrow' + str(2) + '.parquet')

pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
papq.write_table(pmapatable2, 'pymongoarrow' + str(2) + '.parquet')

This produces:

$ python repro.py
pyarrow.Table
_id: int64
foo: struct<bar: list<item: string>>
  child 0, bar: list<item: string>
      child 0, item: string
----
_id: [[1]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int32
foo: struct<bar: list<item: string>>
  child 0, bar: list<item: string>
      child 0, item: string
----
_id: [[1]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: string>
[["1","2"]]]
pyarrow.Table
_id: int64
foo: struct<bar: list<item: null>>
  child 0, bar: list<item: null>
      child 0, item: null
----
_id: [[2]]
foo: [
  -- is_valid: all not null
  -- child 0 type: list<item: null>
[0 nulls]]
Traceback (most recent call last):
  File "/workspaces/vscode-python/pymongoarrow/repro.py", line 45, in <module>
    pmapatable2 = pmaapi.find_arrow_all(client.testdb.data,{'_id': {'$eq': 2}})
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vscode/Envs/pma1/lib/python3.11/site-packages/pymongoarrow/api.py", line 112, in find_arrow_all
    process_bson_stream(batch, context)
  File "pymongoarrow/lib.pyx", line 159, in pymongoarrow.lib.process_bson_stream
  File "pymongoarrow/lib.pyx", line 246, in pymongoarrow.lib.process_raw_bson_stream
  File "pymongoarrow/lib.pyx", line 133, in pymongoarrow.lib.extract_document_dtype
  File "pymongoarrow/lib.pyx", line 108, in pymongoarrow.lib.extract_field_dtype
  File "pyarrow/types.pxi", line 4452, in pyarrow.lib.list_
TypeError: List requires DataType or Field

FWIW, here is what duckdb shows for the three parquet files that were produced:

D select * from 'pyarrow1.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int64 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│     1 │ {'bar': [1, 2]}       │
└───────┴───────────────────────┘
D select * from 'pymongoarrow1.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int32 │ struct(bar varchar[]) │
├───────┼───────────────────────┤
│     1 │ {'bar': [1, 2]}       │
└───────┴───────────────────────┘
D select * from 'pyarrow2.parquet';
┌───────┬───────────────────────┐
│  _id  │          foo          │
│ int64 │ struct(bar integer[]) │
├───────┼───────────────────────┤
│     2 │ {'bar': []}           │
└───────┴───────────────────────┘
D 

Versions:

Python 3.11.8 (main, Mar 12 2024, 11:41:52) [GCC 12.2.0] on linux
Successfully installed dnspython-2.6.1 numpy-1.26.4 packaging-23.2 pandas-2.2.2 pyarrow-15.0.2 pymongo-4.7.1 pymongoarrow-1.3.0 python-dateutil-2.9.0.post0 pytz-2024.1 six-1.16.0 tzdata-2024.1

blink1073 (Member) commented

Hi @ccrouch, thanks for pointing out the limitation in our parser. I opened https://jira.mongodb.org/browse/ARROW-230 to track the fix.

blink1073 added the bug label May 7, 2024