Formalize an "exporting WOF documents" specification #35
Here are the rules such as they have been formalized in code and/or explicitly written down anywhere. 1-4 have been copied from the comments in py-mapzen-whosonfirst-geojson.
|
To @Joxit's point about "No use of utf-8 hexa codes" in the
|
Thanks for bringing this together! On point 6: it's worth mentioning that there are a lot of documents out there with mixed space indentation, varying between 2, 4 and 0 spaces depending on the depth and key! To try and summarise, I think we're trying to decide on:
Things that I'd consider fixed are:
Have I missed anything? |
That looks about right, yeah. Can you talk a bit more about these and how/why they are important:
One thing that hasn't been mentioned so far are "remarks" files, which were developed to address the need for comments (remarks, even) that weren't suited for a WOF document specifically: https://github.com/whosonfirst/whosonfirst-cookbook/blob/master/definition/remarks_files.md

I mention them because I guess I've never imagined a situation where any given property would need to be wrapped at a given length (outside of having an explicit newline character, which I think we assume(d) would never happen). Is there a particular use case you're thinking of?

As far as indenting offsets go, is there value/benefit in being strict about the length of those offsets? The principal aim of indenting was to make it easier to read a WOF document in a text editor or a browser window, and not anything else. There is something to be said for being able to compare the bytes of two documents in canonical form, but maybe that's a secondary formatting that follows all the default rules but has no indenting? I don't know if that just makes things more confusing or not... |
a) true, b) true, c) nobody except cyborgs 🤖 😆 I think your list is complete @tomtaylor. If a spec is created, the update should be lazy => only for new and updated documents. My feelings:
|
Different formatters do different things with arrays, sometimes guided by an ideal maximum line length. For example:

```json
{
  "key": [1, 2, 3]
}
```

vs

```json
{
  "key": [
    1,
    2,
    3
  ]
}
```
|
Apologies for the radio silence on this. For indenting, I think the rule(s) should be:
Trailing whitespace SHOULD be avoided as a rule, but as with indentation it is not worth making a big deal over. As with indentation, if this is not the representation used to compare bytes, insignificant whitespace should be left to the discretion of users, I think.

I agree with @Joxit about encodings, but can someone confirm that Python 3 does the "right" thing?

I also agree that 14 decimal points is a lot, but there were reasons (not that I remember them... @nvkelso ?). They can be shorter (per the rule about trimming zeros from coordinates) but I think the compromise was that they just can't be longer than 14 decimal points.

@tomtaylor: Arrays have always been indented with one item per line, since that's what Python's JSON encoder has done, so I'd prefer to just stick with that. Thoughts? |
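For reference, the array behaviour mentioned here can be checked directly against Python's stdlib `json` encoder: whenever an indent is given, arrays are always emitted one element per line.

```python
import json

doc = {"key": [1, 2, 3]}

# When an indent is given, Python's json encoder always emits one
# array element per line; there is no "max line length" heuristic.
print(json.dumps(doc, indent=2))
# {
#   "key": [
#     1,
#     2,
#     3
#   ]
# }
```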
I'm ok with your 4 rules! I'm not a Python guy, but I did two tests:

```
$ python2
>>> print("அஜாக்ஸியோ")
அஜாக்ஸியோ
>>> print("\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb")
\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb
>>> unicode(u"அஜாக்ஸியோ")
u'\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb'
```

```
$ python3
>>> print("\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb")
அஜாக்ஸியோ
>>> print("அஜாக்ஸியோ")
அஜாக்ஸியோ
```

I found the use of the package […]. So, if I'm understanding correctly, the default encoding was ASCII for Python 2, which is why we need to declare unicode strings (with […]). |
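To see the same thing at the JSON level: Python 3's stdlib `json` module escapes non-ASCII characters by default, and `ensure_ascii=False` writes them through as UTF-8 text.

```python
import json

name = "அஜாக்ஸியோ"

# Default: every non-ASCII character is escaped as \uXXXX
print(json.dumps(name))                      # "\u0b85\u0b9c\u0bbe\u0b95\u0bcd\u0bb8\u0bbf\u0baf\u0bcb"

# ensure_ascii=False keeps the characters as-is
print(json.dumps(name, ensure_ascii=False))  # "அஜாக்ஸியோ"
```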
I think I've got a slightly different perspective to you, having just wrangled a ~2.5 million file repo of UK postcodes. I wrote a Go tool to sync against the source data and manipulate the files. It rewrites every file, because it's difficult to compare the representation precisely in Go (as it stands). Because there is no canonical formatting spec, I ended up creating a lot of unnecessary changes in the diffs, because my tool formatted things slightly differently. Now it's difficult to see what I introduced/edited/deprecated, because there's a load of whitespace noise.

Without a tight canonical spec for how a WOF file should be formatted, this'll happen again and again, as multiple tools serving different purposes write and rewrite files. It'll produce a git history where it's difficult to work out what has actually changed and why.

That might be fine, and maybe we can live with that. Maybe hand editing is a big enough use case that it should take priority? But I feel like most of the WOF changes I see are being performed by tools and scripts. Maybe there's a way of solving this in the tools themselves, by preserving some of the formatting as the files pass in/out, but that sounds quite tricky to me. I think I'd prefer to decide on a tight formatting spec, but I'd like to hear other arguments. |
So, what you'd like is to have a spec, reformat all the WOF documents across all repositories, and then work from that new base? |
No need to touch everything at once. But by updating the formatting libraries in each implementation it'll tend towards that over time. I'm not pushing this hard - but I think it's worth thinking about. |
Okay, in a lazy way then. That way, only the first contribution will be hard to compare. I feel like your case is covered by @thisisaaronland's rules; he wrote permissive rules in order to limit side effects on diffs. But you're right, this is useful only for hand editing... 🤔 |
Heya looks like I'm a bit late to the party on this one 🎉 I was actually thinking over the holiday break about how to simplify reading/writing different WOF collection formats. Talking about interoperability between different tools/languages, I think this quote is pertinent:
As well as an excerpt from the 'Unix Philosophy', which is an oldie (1978) but a goodie:
So I wrote a reference implementation of how the 'Unix Philosophy' could work for WOF data. The code is only two days old, but it's capable of reading/writing all the major formats:

```
# convert git repo to sqlite database
wof git export /data/whosonfirst-data-admin-nz | wof sqlite import nz.db

sqlite3 nz.db '.tables'
ancestors concordances geojson names spr
```

```
# convert a sqlite database to a tar.bz2 bundle
wof sqlite export nz.db | wof bundle import nz.tar.bz2

tar -tf nz.tar.bz2
... many geojson files
data/857/829/01/85782901.geojson
data/857/829/05/85782905.geojson
data/857/842/67/85784267.geojson
data/857/846/57/85784657.geojson
meta/locality.csv
meta/county.csv
meta/localadmin.csv
meta/neighbourhood.csv
meta/dependency.csv
meta/country.csv
meta/region.csv
```

The magic here is in the interface: each process just knows it's either reading or writing a stream of geojson features. It doesn't care how the bytes are encoded/marshalled, only that the contents are valid geojson.

The only time the marshalling is actually relevant is when using byte-for-byte comparison tools. These traditional diff tools are not a good fit for this task, but can still be used when the json marshal algorithm is the same for both data being compared, which I think is what we're discussing here 😄

The current marshalling format makes working with existing line-based unix tooling difficult:

```
wof git export /data/whosonfirst-data-admin-nz --path=data/857/846/57/85784657.geojson \
  | wc -l
105
```

But it's simple enough to reformat the stream so that each feature is minified and printed one-per-line, by piping the stream to a formatter:

```
wof git export /data/whosonfirst-data-admin-nz --path=data/857/846/57/85784657.geojson \
  | wof feature format \
  | wc -l
1
```

The same idea can be applied to a 'canonical marshalling'. As per the 'Unix Philosophy', we could simply have one program which accepts a stream of geojson. The nice thing about this is that tooling written in other languages doesn't need to worry about what the git format is exactly; they can simply pipe their own output to this one program to be guaranteed bug-for-bug compatibility.

I could be wrong about this, but I suspect that if we all try to implement a common marshalling format in various languages it's going to be time-consuming, error-prone, and will lose out on the performance benefits of using the native serialisation provided by each environment.

I haven't looked into it deeply, but I'm assuming the string & number encoding issues will no longer be an issue so long as each implementation agrees to being lossless, by avoiding truncating float precision or otherwise mutating the underlying data? |
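The "minify each feature onto one line" formatter step is cheap to reproduce in any language; a sketch in Python (the function name is illustrative, not any real tool's API):

```python
import json

def format_feature(raw: str) -> str:
    """Re-emit a pretty-printed GeoJSON feature as a single minified line."""
    return json.dumps(json.loads(raw), separators=(",", ":"), ensure_ascii=False)

pretty = """{
  "type": "Feature",
  "properties": {},
  "geometry": null
}"""

print(format_feature(pretty))
# {"type":"Feature","properties":{},"geometry":null}
```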
I think there are two separate, but equally valid, use cases here:
That's why I suggested earlier that perhaps we settle on two different marshaling formats, one for publishing and one for comparisons. Rules for the latter might be as simple as:
The former would be the list as we've discussed so far, with recommendations about indenting and white space, but deviations from those suggestions would not trigger errors. Would that work for you @tomtaylor?

@missinglink There is on-going work towards something along the lines of what you're suggesting, but most of the data "source" abstraction happens at the reader (and writer) level:

In a WOF context the idea is that bytes (specifically

And then exported:

To generic "writer" interfaces:

(This one lacking documentation but it is just like

All of which can then be encapsulated in WOF specific code and rules:

Related is the standard WOF GeoJSON package (which is probably due for an overhaul as plain old

And the validation package which

There are some outstanding inconsistencies in many of these interfaces, but that's what I am trying to figure out. |
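As a sketch of what the "comparison" marshaling rules could look like in practice (the exact rule set here, sorted keys, no insignificant whitespace, ASCII-escaped strings, is an assumption for illustration, not settled spec):

```python
import json

def comparison_bytes(feature: dict) -> bytes:
    # Hypothetical "comparison" form: sorted keys, no indenting or
    # insignificant whitespace, non-ASCII escaped. Key order and
    # whitespace differences disappear; only content differences remain.
    return json.dumps(
        feature,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    ).encode("ascii")

a = {"wof:name": "Ajaccio", "wof:id": 1}
b = {"wof:id": 1, "wof:name": "Ajaccio"}

# Same logical record, so the comparison bytes are identical
assert comparison_bytes(a) == comparison_bytes(b)
```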
I was actually hoping for 0 marshalling formats but it seems 1 will be required 😝 Insofar as I see it, there are three distinct problems:
Regarding the last 2 points, I think the only viable option is a program (preferably a single binary) which encodes/marshals geojson in a predictable way. For 🤖 it doesn't really matter which format is chosen, as long as it is deterministic.

...and I think getting different languages to marshal in a deterministic way is going to be difficult or near impossible!

```
# some examples:

# a python2 serializer using 'indent=2, sort_keys=True'
function wof_write_python(){ python2 -c 'import sys, json; print json.dumps(json.load(sys.stdin), separators=(",", ":"), indent=2, sort_keys=True);' }

# a nodejs serializer using 'indent=2' (sorted keys requires user code)
function wof_write_javascript(){ node -e 'fs=require("fs"); console.log(JSON.stringify(JSON.parse(fs.readFileSync("/dev/stdin")), null, 2))' }

# a jq serializer using '--indent 2' and '-S' to sort keys
function wof_write_jq(){ jq -rMS --indent 2 }
```

I exported a random record from git and serialized it using these various methods, and none of them produced the same result.

This leads me to believe that we should just have a single program which is responsible for this task. The program will need to be deterministic, and ideally, for portability reasons, it would be a compiled binary available for multiple architectures (here's looking at you Go 😉).

So, what format does […] produce? What is quite nice about the current format (formats?!) I see in git is that the […] Another bonus is that the […] So I actually like the format as-is and would be 👍 for keeping that the same, albeit resolving the string and number encoding issues of python2.

One final thought is that it would be nice if […] |
If anyone fancies playing around with the node wof CLI, you can install it with:

```
npm i -g @whosonfirst/wof
```

```
wof --help
wof <cmd> [args]

Commands:
  wof bundle   interact with tar bundles
  wof feature  misc functions for working with feature streams
  wof fs       interact with files in the filesystem
  wof git      interact with git repositories
  wof sqlite   interact with SQLite databases

Options:
  --version      Show version number     [boolean]
  --verbose, -v  enable verbose logging  [boolean] [default: false]
  --help         Show help               [boolean]
```
|
Hey @missinglink, great minds think alike! I was also working on a wof CLI during my holidays (in Rust): https://github.com/joxit/wof-cli
|
@missinglink I think we are saying the same thing? So-called […] rules should be simple enough that they can be implemented in any language.

The benefit of Go, for example, is that it can be cross-compiled for people who don't know, or care to know, the boring details of computers. For those who do, it's easiest to assume that they have legitimate reasons for using [ some other language ] and shouldn't be forced to use [ some specific language ] just to compare WOF representations.

What I am proposing introduces non-zero CPU/brain cycles when comparing WOF records. Specifically, a program must first read and parse a source record, for example the human-readable "published" document, and then marshal it into a second byte-level representation. The reasons this seems like a reasonable trade-off are:

Thoughts? |
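The read-parse-remarshal flow described above could be sketched like this (`canonicalize` and its rules are assumptions for illustration, not an agreed spec; the record values are made up):

```python
import hashlib
import json

def canonicalize(raw: bytes) -> bytes:
    # Parse whatever "published" representation we were handed...
    feature = json.loads(raw)
    # ...and re-marshal it into a deterministic byte-level form.
    return json.dumps(feature, sort_keys=True, separators=(",", ":")).encode("utf-8")

# Two representations of the same logical record, formatted by different tools
published = b'{\n  "wof:id": 1234,\n  "wof:name": "Null Island"\n}'
minified = b'{"wof:name":"Null Island","wof:id":1234}'

# Different bytes on disk, identical canonical digests
assert hashlib.sha256(canonicalize(published)).digest() == \
       hashlib.sha256(canonicalize(minified)).digest()
```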
@Joxit @missinglink On the subject of your respective CLI tools: Blog posts about their "theory, practice and gotchas" would be welcomed and encouraged: |
@Joxit that's awesome 😸 It would be amazing if we all wrote small programs which read and write geojson streams, then we could just pipe them all together to achieve complex workflows 🎉 |
Trying to implement whosonfirst/whosonfirst-cookbook#35
👋 Lots of great discussion here :) The above items Aaron lists are workable for me. Stephen and I rely extensively on the git diff tooling, on the command line and in the web interface, so that's a primary concern for me; even as we develop better edit tooling, we will still be relying on diffs for review.

I'm willing to relax the 14 decimal precision if it makes everyone's life easier. A few years ago I wanted the option to preserve the original geometries from providers, as they provided them, in all their crazy precision... but we end up needing to modify them for topology reasons and other consistency reasons, and it just balloons file size otherwise. |
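The coordinate trimming rule, as I understand it from this thread (at most 14 decimal places, with trailing zeros stripped), can be sketched as follows; the function name and exact rounding behaviour are assumptions:

```python
def trim_coordinate(value: float, max_places: int = 14) -> str:
    # Render with at most `max_places` decimals, then strip trailing
    # zeros and any dangling decimal point. A sketch of the rule
    # discussed in the thread, not an agreed implementation.
    return f"{value:.{max_places}f}".rstrip("0").rstrip(".")

print(trim_coordinate(2.0))         # 2
print(trim_coordinate(2.5))         # 2.5
print(trim_coordinate(174.765625))  # 174.765625
```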
(Now that my newborn is sleeping)... I'm fine with pretty Unicode versus the escaped form; it's often challenging for Stephen and me to manually decode the escapes during diff reviews. We could build more GUI around that, but if this is for humans then I have a mild preference towards switching them in the src.

We've done several large every-file, every-repo change sets in the last 12 months; my preference is that we settle on something and then submit PRs to update all files in all repos, so we're not half-half old/new.

And I have a mild preference for following the GitHub convention of a trailing newline, so it's easier to make quick changes. But then they have to be exportified anyhow, so really whatever works best for our tooling. |
Some background:
@Joxit noted that:
To which I replied:
And then @tomtaylor said:
cc @nvkelso @stepps00
References:
Which are used respectively by:
For the purposes of this issue I think we are only discussing the first two packages. As in: How do blobs of GeoJSON get marshaled and not what the content of those blobs is.