From 2f76dacced7c0b01996d125e9e3e891c553e756f Mon Sep 17 00:00:00 2001
From: Philip Durbin <philip_durbin@harvard.edu>
Date: Mon, 3 Jun 2024 10:06:13 -0400
Subject: [PATCH] add section on differences from Kaggle

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/README.md b/README.md
index b28fa14..c57a0f3 100644
--- a/README.md
+++ b/README.md
@@ -149,6 +149,11 @@ Same as above but use a JVM option in domain.xml such as the example below.
 ```
 <jvm-options>-Ddataverse.spi.exporters.directory=/home/dataverse/dataverse-exporters/croissant/target</jvm-options>
 ```
+### Differences from Kaggle
+
+- I see an `encodingFormat` of `text/comma-separated-values`. I'm curious about that, since I think `text/csv` is the MIME type listed at https://www.iana.org/assignments/media-types/media-types.xhtml and https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types
+- One big difference I see is that you have many `recordSet`s (each containing a single `field`) despite there being only one CSV. My understanding was that a `recordSet` maps roughly to a table and a `field` maps roughly to a column, so you'll see that our implementation has only one `recordSet` with many `field`s. This might be a good thing to get clarification on.
+- Another thing that sticks out is that all of the `field`s have a `dataType` of `sc:Integer`, but nearly all of the columns (excluding `quality` and `Id`) are `sc:Float`. On the Kaggle side, we have a column type of "Id", and if that's set on a column, we set the `dataType` to `sc:Text`, since Ids can often be non-numerical. That's a minor difference, though, so nothing alarming to me personally.
 
 ### Differences from pyDataverse