Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to parquet 1.8.1 #20

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zhenxiao
Copy link
Contributor

@zhenxiao zhenxiao commented Mar 8, 2016

@dain @electrum here is my draft. Need to update presto parquet code namespaces when a new presto-hive-apache version is released

@electrum
Copy link
Contributor

This looks good, but I like to have the Presto changes ready first instead so that we know if anything is missing. I made most of the changes here in this branch so that it compiles:

https://github.com/electrum/presto/tree/parquet

With the exception of TestParquetTimestampUtils which uses NanoTimeUtils from Hive which references the old Parquet classes. I'm wondering if there are other classes in Hive that we use which will need the old classes (and would fail with ClassNotFoundException or similar).

@nezihyigitbasi
Copy link

nezihyigitbasi commented Jul 13, 2016

There are other parts of the code that imports hive's parquet classes so this change should be tested properly to see whether we hit old/new parquet package/class conflicts.

nyigitbasi@lgml-nyigitbasi /tmp/presto (master) $ grep -r org.apache.hadoop.hive.ql.io.parquet --include \*.java *
presto-hive/src/main/java/com/facebook/presto/hive/HiveStorageFormat.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
presto-hive/src/main/java/com/facebook/presto/hive/HiveStorageFormat.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
presto-hive/src/main/java/com/facebook/presto/hive/HiveStorageFormat.java:import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java:import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java:            .add("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetRecordCursorProvider.java:            .add("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetTimestampUtils.java: * This class is equivalent of @see org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTime,
presto-hive/src/test/java/com/facebook/presto/hive/parquet/ParquetTester.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
presto-hive/src/test/java/com/facebook/presto/hive/parquet/ParquetTester.java:import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;
presto-hive/src/test/java/com/facebook/presto/hive/parquet/TestParquetTimestampUtils.java:import org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils;
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveFileFormats.java:import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
presto-hive/src/test/java/com/facebook/presto/hive/TestHiveFileFormats.java:import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;

@nezihyigitbasi
Copy link

@electrum @zhenxiao I am currently looking at upgrading presto to parquet 1.8.1 and hit some issues during runtime. Presto's hive-apache package depends on hive 1.2 and that version still uses the old parquet packages (parquet.*), which causes these runtime issues.

For example, in my tests when querying a parquet table presto creates an instance of org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat and that class tries loading parquet.hadoop.ParquetInputFormat, which it cannot find. How do we want to handle this? Do we want to upgrade to hive 2.0, which uses the new parquet packages, or do we want to handle it some other way?

@ghost ghost added the CLA Signed label Aug 4, 2016
@electrum
Copy link
Contributor

@nezihyigitbasi MapredParquetInputFormat is a simple class. Might be easiest to fork it? Or bypass it entirely? Given we have a native Parquet reader, it might not be needed anymore.

@ducky427
Copy link

ducky427 commented Mar 2, 2017

+1

@drdee
Copy link

drdee commented Sep 1, 2017

@nezihyigitbasi -- what's the status of this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants