[Bug]: IcebergIO opens a writer using the table's schema, which can cause data loss #32050

Closed
2 of 17 tasks
ahmedabu98 opened this issue Aug 1, 2024 · 2 comments · Fixed by #32095

Comments

@ahmedabu98
Contributor

What happened?

I'm experimenting with HiveCatalog and noticed data loss when writing rows that contain nested records. Specifically, the nested records are never committed to the table; I read back null values instead. I can confirm this happens even without our Beam library:

    TableIdentifier tableIdentifier =
        TableIdentifier.parse(String.format("%s.%s", TEST_DB, TEST_TABLE));
    org.apache.iceberg.Schema icebergSchema = IcebergUtils.beamSchemaToIcebergSchema(ROW_SCHEMA);
    Table table = catalog.createTable(tableIdentifier, icebergSchema);

    // write
    List<Record> records =
        LongStream.range(1, 10)
            .boxed()
            .map(l -> IcebergUtils.beamRowToIcebergRecord(icebergSchema, ROW_FUNC.apply(l)))
            .collect(Collectors.toList());
    String filepath = table.location() + "/" + UUID.randomUUID();
    OutputFile file = table.io().newOutputFile(filepath);
    DataWriter<Record> writer =
            Parquet.writeData(file)
//                    .schema(table.schema())    <---- xxxx this is the problematic line xxxx
                    .schema(icebergSchema)
                    .createWriterFunc(GenericParquetWriter::buildWriter)
                    .overwrite()
                    .withSpec(table.spec())
                    .build();
    for (Record rec: records) {
      System.out.println("xxx writing: " + rec);
      writer.write(rec);
    }
    writer.close();
    AppendFiles appendFiles = table.newAppend();
    String manifestFilename = FileFormat.AVRO.addExtension(filepath + ".manifest");
    OutputFile outputFile = table.io().newOutputFile(manifestFilename);
    ManifestWriter<DataFile> manifestWriter;
    try (ManifestWriter<DataFile> openWriter = ManifestFiles.write(table.spec(), outputFile)) {
      openWriter.add(writer.toDataFile());
      manifestWriter = openWriter;
    }
    appendFiles.appendManifest(manifestWriter.toManifestFile());
    appendFiles.commit();

    
    // read
    table = catalog.loadTable(tableIdentifier);
    TableScan tableScan = table.newScan().project(icebergSchema);
    for (CombinedScanTask task : tableScan.planTasks()) {
      InputFilesDecryptor decryptor = new InputFilesDecryptor(task, table.io(), table.encryption());
      for (FileScanTask fileTask : task.files()) {
        InputFile inputFile = decryptor.getInputFile(fileTask);
        CloseableIterable<Record> iterable =
                Parquet.read(inputFile)
                        .split(fileTask.start(), fileTask.length())
                        .project(icebergSchema)
                        .createReaderFunc(
                                fileSchema -> GenericParquetReaders.buildReader(icebergSchema, fileSchema))
                        .filter(fileTask.residual())
                        .build();

        for (Record rec : iterable) {
          System.out.println("xxx reading: " + rec);
        }
      }
    }

I've tried the same with HadoopCatalog and everything works fine. I'm not sure why I'm only seeing this with HiveCatalog; there may be something peculiar in how that catalog fetches and returns the table schema.
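
For comparison, a minimal HadoopCatalog setup might look like the sketch below (the warehouse path and variable names are placeholders, not the exact test setup); the same write/read code above can be run against it unchanged:

    // uses org.apache.hadoop.conf.Configuration and org.apache.iceberg.hadoop.HadoopCatalog
    Configuration conf = new Configuration();
    // Placeholder warehouse location; any local or HDFS path works for the comparison.
    HadoopCatalog hadoopCatalog = new HadoopCatalog(conf, "/tmp/iceberg-warehouse");
    Table table = hadoopCatalog.createTable(tableIdentifier, icebergSchema);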

I believe in this case we shouldn't rely on the catalog, and should instead create our writers using the schema of the records in our PCollection, i.e. line 70 here:

    case PARQUET:
      icebergDataWriter =
          Parquet.writeData(outputFile)
              .createWriterFunc(GenericParquetWriter::buildWriter)
              .schema(table.schema())
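
A sketch of what that change might look like, reusing the conversion already shown in the repro above (rowSchema here is a stand-in for the Beam schema of the PCollection's elements; names are illustrative, not the final patch):

    // Derive the Iceberg schema from the rows being written rather than from
    // the catalog-reported table schema.
    org.apache.iceberg.Schema dataSchema = IcebergUtils.beamSchemaToIcebergSchema(rowSchema);

    DataWriter<Record> icebergDataWriter =
        Parquet.writeData(outputFile)
            .createWriterFunc(GenericParquetWriter::buildWriter)
            .schema(dataSchema) // instead of table.schema()
            .withSpec(table.spec())
            .overwrite()
            .build();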

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@ahmedabu98
Contributor Author

I see similar behavior when using the BigQuery Metastore catalog.

@ahmedabu98
Contributor Author

Update: I realized this is actually due to a bug in our Iceberg utils. Fixing in #32095
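
(For anyone hitting something similar: a quick way to check whether the schema conversion is the culprit is to compare the converted schema against what the catalog reports, e.g. with Iceberg's Schema#sameSchema. A hedged sketch, where beamRowSchema is a placeholder for the nested Beam schema being written:)

    org.apache.iceberg.Schema converted = IcebergUtils.beamSchemaToIcebergSchema(beamRowSchema);
    Table table = catalog.loadTable(tableIdentifier);
    // A mismatch here (especially on nested fields) points at the conversion
    // rather than at the catalog itself.
    System.out.println("same schema:  " + converted.sameSchema(table.schema()));
    System.out.println("converted:    " + converted);
    System.out.println("from catalog: " + table.schema());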
