Add function to recursively export all Socrata data to local directory #126
Really glad to see this enhancement request. I've used this function combination before and it works well, so I'm glad to see that you're thinking of combining them. An interesting extension of this enhancement -- and one perhaps not vulnerable to throttling -- would be to enable RSocrata users to export a list of column names associated with each dataset (rather than the datasets themselves). That would allow users to investigate which datasets share which fields.
Thanks, @joshmwolff. It's an interesting idea and I can see that being generally useful. But, as I think about it, would it be most useful to retain all of the columns in memory instead of writing them to disk?
@tomschenkjr: I think you're right: it would be inefficient to use the `export.socrata()` function to get a list of all column names for all datasets within a domain if that means downloading all datasets into memory instead of writing them to disk. That said, I'm not sure I'd necessarily want to download all datasets to disk if I were only interested in keeping column names and ultimately combining those names into a single data table. I see, however, that Socrata's Discovery API will in fact return dataset column names for datasets in a particular domain. Can RSocrata hit the Discovery API? That might enable the package to capture the column names without having to also download the column data itself.
@joshmwolff - right now we're not using the Discovery API, but it's something we're planning for an upcoming release. In either case, would you mind opening a new issue on this? I think it's a worthwhile conversation to track as a separate feature from this one.
Outlining some thoughts on unit tests for this function:
FWIW, I've been playing with Socrata's "Discovery API" and it works well as a means for creating a small dataframe of dataset names and other metadata. The following worked for me:
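Roughly, a call along these lines (the endpoint, query parameters, and field names here are assumptions based on the public Discovery API documentation):

```r
library(httr)
library(jsonlite)

# Pull dataset-level metadata for one domain from the Discovery API
resp <- GET("https://api.us.socrata.com/api/catalog/v1",
            query = list(domains = "data.cambridgema.gov", limit = 1000))
stop_for_status(resp)
catalog <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)

# One row per dataset; columns include the dataset name, id, and column names
datasets <- catalog$results
dim(datasets)
```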
This avoids having to write any files to disk, as you're just storing a single small dataframe in memory (in Cambridge's case, 88 rows by 17 columns).
Whoops. You're right: my comment belongs in #128 rather than here. Feel free to ignore.
I've pushed the branch to the repo.
I've pushed my first stab at downloading non-tabular data files. Here's how it works: if the first download URL available for a dataset is not a CSV, the file is downloaded and saved in that format instead. For geographical data with multiple download choices, it looks like KML is listed first, so that is what gets saved. There are still a few items left to do.
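A sketch of that logic (not the branch code; the `ls.socrata()` column names used here are assumptions):

```r
library(RSocrata)
library(httr)

catalog <- ls.socrata("https://data.cityofchicago.org")
dist <- catalog$distribution[[1]]   # download choices for the first dataset
first_url  <- dist$downloadURL[1]   # first choice: CSV for tabular data, KML for geographic data
first_type <- dist$mediaType[1]

# Save the file in whatever format the first download choice advertises
resp <- GET(first_url)
stop_for_status(resp)
ext <- switch(first_type,
              "text/csv" = "csv",
              "application/vnd.google-earth.kml+xml" = "kml",
              "dat")                # generic fallback extension
writeBin(content(resp, as = "raw"), paste0("dataset_1.", ext))
```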
@nicklucius - that's great. I'll check off export for Shapefile/KML/KMZ for now. Looks like this will also work for geojson, but I will play with that for a bit to see how it works. The compression is surprisingly tricky.
Scratch the geojson remark, that's certainly taken care of. I'll check that off, too.
Getting an error when testing with the bulk files. Appears to be an error with the file name structure. Looking into it.

```
> export.socrata("https://data.cityofchicago.org/")
Error in file(con, "wb") : cannot open the connection
3. file(con, "wb")
2. writeBin(response$content, filename)
1. export.socrata("https://data.cityofchicago.org/")
In addition: Warning message:
In file(con, "wb") :
  cannot open file 'data.cityofchicago.org/qixn-wjxu_2017-05-06_133501."Street Sweeping - 2017 - Map.kml"': Invalid argument
```
Ok - I've fixed the above error. Later today, I can move this over to the branch.
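One way to avoid this kind of failure is to strip characters that are not valid in file names from the Content-Disposition value before writing; a small sketch of that idea (not the exact change):

```r
# Remove quotes and other characters that are invalid in file names
clean_filename <- function(raw_name) {
  gsub('["*?<>|:]', "", raw_name)
}

clean_filename('"Street Sweeping - 2017 - Map.kml"')
#> [1] "Street Sweeping - 2017 - Map.kml"
```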
As for unit testing, one option is to set up a valid but fake data.json file that is just a subset of data and placed at a file location we can control. Those files can point to actual data from the Chicago data portal. This is a bit of work, so I don't necessarily like it, but it is one option.
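For instance, a trimmed-down data.json along these lines, written somewhere the tests control (the structure follows the catalog schema that `ls.socrata()` reads; the dataset entry below is a placeholder, not a real Chicago id):

```r
library(jsonlite)

fixture <- list(
  `@type` = "dcat:Catalog",
  dataset = list(
    list(
      title        = "Example dataset",
      identifier   = "https://data.cityofchicago.org/api/views/abcd-1234",
      distribution = list(list(
        mediaType   = "text/csv",
        downloadURL = "https://data.cityofchicago.org/api/views/abcd-1234/rows.csv?accessType=DOWNLOAD"
      ))
    )
  )
)

write_json(fixture, file.path(tempdir(), "data.json"), auto_unbox = TRUE, pretty = TRUE)
```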
It's now on the branch.
Hey guys, I've been on paternity leave (👶🍼💩) so I haven't had my head in GitHub enough, but @tomschenkjr was kind enough to alert me via email to what is going on.

Big question that'll help this be performant on Socrata portals - for the actual data dumping, are you using the export links listed in data.json, or paging through each dataset via the API? The former will allow you to download the dataset export in one big file, and it takes better advantage of caching where available. It should be faster for you.
@chrismetcalf - Congrats! I can understand how GitHub might not be the first thing on your mind right now. The function uses the download links listed in data.json for each dataset - does that mean we're already using the export links?
Yes, @nicklucius, it looks like you're all good! The only other recommendation I would make would be to watch for eTag headers if you've got local caching, but if you're using the export links you're already taking advantage of our bulk export and server-side caching.
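For reference, a rough illustration of the two request styles being compared; `abcd-1234` is a placeholder four-by-four id, not a real dataset.

```r
# Paging through the SODA API: many small requests per dataset
paged_url  <- "https://data.cityofchicago.org/resource/abcd-1234.csv?$limit=50000&$offset=0"

# Bulk export link (the kind of URL data.json's downloadURL points at):
# one request per dataset, served from Socrata's cached export
export_url <- "https://data.cityofchicago.org/api/views/abcd-1234/rows.csv?accessType=DOWNLOAD"
```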
Documenting two ideas from the analytics meeting today:
Not sure if this error message should be a concern?
It appears to have extracted some files, but I'm not sure it is the complete set of public files.
@James-SR - thanks. It's still a beta feature, so it's good to see some use cases that produce warnings. When we develop it more, we will look at these warnings and handle them more elegantly. The easiest way to check is whether the number of exported documents is the same as the number of entries at https://opendata.cheshireeast.gov.uk/data.json
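A quick way to run that check from R (a sketch; it assumes the export directory is named after the portal domain):

```r
library(RSocrata)

catalog <- ls.socrata("https://opendata.cheshireeast.gov.uk")
nrow(catalog)                                       # datasets listed in data.json
length(list.files("opendata.cheshireeast.gov.uk"))  # files exported so far
```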
@James-SR - The latest build is on the branch. Feel free to test this again and I hope it is still useful.
There is a quirk with some data listed in Socrata that I wanted to document. First, Socrata supports several HTML-based "non-data" formats, such as Socrata Pages, which show up in the data.json catalog. Second, Socrata also supports "external data". Sometimes these external data are links to HTML web pages, while other times they link to actual data. For instance, this dataset is linked to a web page not hosted by Socrata. Our function simply does not handle this well because HTML websites do not have a content disposition. I've made a change to ignore links that do not have a content disposition so these kinds of sites are skipped. Sometimes external data is actually data (e.g., CSV), so those should still be downloaded. Other content, e.g., HTML, will be skipped.

The downside of these approaches is that some information will be ignored, which could cause confusion. This also makes it more difficult to write unit tests, since the number of files downloaded will not match the number of datasets listed in data.json. I will think of ways to resolve these issues.
I've resolved the issue by having the function follow these rules:

* Ignores HTML files (e.g., Socrata Pages)
* Ignores occasions when there isn't any data
* Will download (uncompressed) PDF, Word, Excel, PowerPoint, and plain text attachments
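A minimal sketch of that decision rule (not the exact code on the branch; the header handling here is an assumption):

```r
library(httr)

# Decide whether a download link should be saved to disk, based on its headers
should_download <- function(response) {
  cd   <- headers(response)[["content-disposition"]]
  type <- headers(response)[["content-type"]]
  if (is.null(cd)) return(FALSE)                                 # no attachment header: likely an HTML page
  if (!is.null(type) && grepl("text/html", type)) return(FALSE)  # explicit HTML: skip
  TRUE                                                           # CSV, PDF, Office documents, plain text, ...
}
```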
The `ls.socrata()` function supports the listing of all data on a data portal while `read.socrata()` downloads individual datasets. Thus, the two can be combined under `export.socrata()` to download all of the files, neatly compress them, and place them in a single directory. This sort of functionality can be used by people wishing to archive data portals or to help in the migration from one platform to another. The function should be focused on saving the data to local or cloud-based storage (e.g., S3) and should avoid loading all of the data into memory.

Solving #124 will allow an easier integration between `ls.socrata()` and `read.socrata()`, so RSocrata v1.7.2-7 or above will be required.

A few items to work through:

* Where to save the files locally (e.g., `data.cityofchicago.org/`)
* Pacing requests in `export.socrata()` to avoid throttling
* Which format to download (CSV is `data.frame` friendly, but JSON downloads faster from Socrata)

An initial alpha is on this gist. This is now on the `issue126` branch. Feedback is encouraged as I've done limited testing at this point.
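For illustration, a minimal sketch of that combination (not the code on the `issue126` branch; the catalog field names and resource URL construction are assumptions):

```r
library(RSocrata)

# List every dataset with ls.socrata(), read each one with read.socrata(),
# and write one gzipped CSV per dataset into a directory named after the domain
export_portal <- function(url = "https://data.cityofchicago.org") {
  catalog  <- ls.socrata(url)
  dir_name <- gsub("^https?://|/+$", "", url)
  dir.create(dir_name, showWarnings = FALSE)

  for (i in seq_len(nrow(catalog))) {
    id <- basename(catalog$identifier[i])                 # dataset four-by-four id
    df <- tryCatch(read.socrata(paste0(url, "/resource/", id, ".csv")),
                   error = function(e) NULL)
    if (is.null(df)) next                                 # skip non-tabular or failed reads
    write.csv(df, gzfile(file.path(dir_name, paste0(id, ".csv.gz"))),
              row.names = FALSE)
  }
}

# export_portal()  # would download the entire portal
```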