From 2f76dacced7c0b01996d125e9e3e891c553e756f Mon Sep 17 00:00:00 2001
From: Philip Durbin
Date: Mon, 3 Jun 2024 10:06:13 -0400
Subject: [PATCH] add section on differences from Kaggle

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index b28fa14..c57a0f3 100644
--- a/README.md
+++ b/README.md
@@ -149,6 +149,11 @@ Same as above but use a JVM option in domain.xml such as the example below.
 ```
 -Ddataverse.spi.exporters.directory=/home/dataverse/dataverse-exporters/croissant/target
 ```
+### Differences from Kaggle
+
+- I see an `encodingFormat` of `text/comma-separated-values`. Kind of curious about that since I think `text/csv` is more the MIME type that's on https://www.iana.org/assignments/media-types/media-types.xhtml and https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types
+- One big difference I see is that you have many `recordSets` (and each one containing a single `field`) despite there being only 1 CSV. My understanding was that a `recordSet` maps roughly to a table and a `field` maps roughly to a column. So you'll see that our implementation has only 1 `recordSet` with many `field`s. This might be a good thing to get clarification on.
+- Another thing that sticks out is that I see all of the `field`s have a `dataType` of `sc:Integer`. But nearly all of the columns (excluding `quality` and `Id`) are `sc:Float`. On the Kaggle side, we have a column type of "Id" and so if that's set on a column, we set the `dataType` to `sc:Text` since Ids can often be non-numerical. Just a minor difference there, though, so nothing alarming to me personally.
 ### Differences from pyDataverse