Mimetypes as codes #4
Comments
Yeah, we could have a mime type based multicodec, something like:
not everything is registered under mime. also, mime is not generally specific enough. In most circumstances, |
Hi @jbenet @Fstub42, have there been any further discussions or agreements on getting this supported? I'm experimenting with Please let me know if you have any more info; I'd be up for sending a PR we could work on towards such support. |
Based on the information found at https://www.iana.org/assignments/media-types/media-types.xhtml, plus the suggestions in previous posts here, I'm thinking the following ranges can be reserved for the different mime types/subtypes (I put some examples I'm using to play with from the
// 0x1000 - 0x17ff (11 bits) reserved for application/* (there currently are ~1,300 subtypes)
multicodec.addCodec('mime/application/json', Buffer.from('1000', 'hex'));
multicodec.addCodec('mime/application/octet-stream', Buffer.from('1001', 'hex'));
multicodec.addCodec('mime/application/ld+json', Buffer.from('1002', 'hex'));
multicodec.addCodec('mime/application/rdf+xml', Buffer.from('1003', 'hex'));
// 0x1800 - 0x18ff (8 bits) reserved for audio/* (there currently are ~150 subtypes)
multicodec.addCodec('mime/audio/mp4', Buffer.from('1800', 'hex'));
// 0x1900 - 0x190f (4 bits) reserved for font/* (there currently are ~8 subtypes)
multicodec.addCodec('mime/font/ttf', Buffer.from('1900', 'hex'));
// 0x1910 - 0x197f (7 bits) reserved for image/* (there currently are ~60 subtypes)
multicodec.addCodec('mime/image/png', Buffer.from('1910', 'hex'));
// 0x1980 - 0x19cf (5 bits) reserved for message/* (there currently are ~18 subtypes)
multicodec.addCodec('mime/message/sip', Buffer.from('1980', 'hex'));
// 0x19d0 - 0x1a3f (6 bits) reserved for model/* (there currently are ~24 subtypes)
multicodec.addCodec('mime/model/3mf', Buffer.from('19d0', 'hex'));
// 0x1a40 - 0x1a8f (5 bits) reserved for multipart/* (there currently are ~13 subtypes)
multicodec.addCodec('mime/multipart/byteranges', Buffer.from('1a40', 'hex'));
// 0x1a90 - 0x1aff (7 bits) reserved for text/* (there currently are ~71 subtypes)
multicodec.addCodec('mime/text/html', Buffer.from('1a90', 'hex'));
multicodec.addCodec('mime/text/csv', Buffer.from('1a91', 'hex'));
multicodec.addCodec('mime/text/turtle', Buffer.from('1a92', 'hex'));
multicodec.addCodec('mime/text/xml', Buffer.from('1a93', 'hex'));
// 0x1b00 - 0x1b6f (7 bits) reserved for video/* (there currently are ~78 subtypes)
multicodec.addCodec('mime/video/JPEG', Buffer.from('1b00', 'hex'));
multicodec.addCodec('mime/video/mp4', Buffer.from('1b01', 'hex')); |
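To make the proposed layout easier to check, here is a minimal sketch that maps a code back to its top-level type. The range bounds are copied from the comments above; the names `MIME_RANGES`, `topLevelTypeForCode` and `rangeSize` are hypothetical and not part of any multicodec library. Slot counts are computed from the bounds themselves rather than from the bit labels in the comments.

```javascript
// Range bounds copied from the proposal above. All names here are
// hypothetical helpers for illustration, not an existing API.
const MIME_RANGES = [
  { type: 'application', start: 0x1000, end: 0x17ff },
  { type: 'audio', start: 0x1800, end: 0x18ff },
  { type: 'font', start: 0x1900, end: 0x190f },
  { type: 'image', start: 0x1910, end: 0x197f },
  { type: 'message', start: 0x1980, end: 0x19cf },
  { type: 'model', start: 0x19d0, end: 0x1a3f },
  { type: 'multipart', start: 0x1a40, end: 0x1a8f },
  { type: 'text', start: 0x1a90, end: 0x1aff },
  { type: 'video', start: 0x1b00, end: 0x1b6f },
];

// Return the top-level type whose range contains `code`, or null if none does.
function topLevelTypeForCode(code) {
  const hit = MIME_RANGES.find((r) => code >= r.start && code <= r.end);
  return hit ? hit.type : null;
}

// Number of codes available in a range (bounds are inclusive).
function rangeSize(range) {
  return range.end - range.start + 1;
}

console.log(topLevelTypeForCode(0x1a91)); // code proposed for text/csv
console.log(rangeSize(MIME_RANGES[3]));   // slots available for image/*
```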
This is cool, thanks for giving it a push :) It might be nicer to start with just one bucket of numbers. Most will inevitably run full anyway, so there's little sense in pushing the problem down a few years. An approach that feels more accommodating of simple future change is to start with a single bucket that includes a snapshot of the whole mediatypes table, and then regularly add a new bucket with the mediatypes added in the meantime. Mimetypes seem like a category of multicodecs that would be fine with fragmented numbers, i.e. they don't seem to benefit from being strictly consecutively numbered. (While something like the various variable-length multihash functions clearly do.)
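A rough sketch of the snapshot-plus-buckets idea, under assumptions: the start codes (0x1000 and 0x8000), the tiny snapshots and the helper name `addBucket` are all made up for illustration. Each bucket assigns consecutive codes to names that do not have one yet, so existing assignments never move.

```javascript
// Hypothetical helper: assign consecutive codes, starting at `start`, to every
// name in `snapshot` that is not already in `table`. Returns the first unused code.
function addBucket(table, snapshot, start) {
  let next = start;
  for (const name of snapshot) {
    const key = name.toLowerCase();
    if (table.has(key)) continue; // keep codes from earlier buckets stable
    table.set(key, next);
    next += 1;
  }
  return next;
}

const table = new Map();

// First bucket: a snapshot of the media types table (three entries for brevity).
const firstUnused = addBucket(
  table,
  ['application/json', 'text/html', 'image/png'],
  0x1000 // assumed start of the first bucket
);

// Later bucket: a newer snapshot; only the entries added in the meantime get codes.
addBucket(
  table,
  ['application/json', 'text/html', 'image/png', 'font/ttf'],
  0x8000 // assumed start of the second bucket
);

console.log(table);
```

Codes follow the order of the snapshot, so the snapshot would need a stable ordering for the assignment to be reproducible.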
I was always under the impression that mimetypes were case-insensitive -- is that the case? Important question for decoding/encoding.
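For reference, RFC 6838 states that top-level type and subtype names are case-insensitive. Assuming the codec table stores lowercase names, a lookup could normalize its input first, along these lines. The function name `normalizeMediaType` is hypothetical, and the sketch only handles a bare `type/subtype` pair, not parameters such as `; charset=...`.

```javascript
// Hypothetical helper: lowercase a bare "type/subtype" name so that
// "video/JPEG" and "video/jpeg" resolve to the same table entry.
function normalizeMediaType(name) {
  if (name.includes(';')) {
    // Parameter values can be case-sensitive, so they are out of scope here.
    throw new Error(`parameters are not supported: ${name}`);
  }
  const parts = name.trim().toLowerCase().split('/');
  if (parts.length !== 2 || parts[0] === '' || parts[1] === '') {
    throw new Error(`not a type/subtype pair: ${name}`);
  }
  return `${parts[0]}/${parts[1]}`;
}

console.log(normalizeMediaType('video/JPEG'));
```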
@jbenet There's a longstanding convention for this, e.g. |
It would also be interesting to look at the complete mediatype syntax ( |
Thanks @lgierth .
As for the ranges, I think you have a valid point. I was proposing to reserve enough bits to cope with quite a large number of additions to each of the types, but I wouldn't disagree with having them fragmented, as long as we at least get the initial bucket of the most common/popular ones all together now. |
Since multicodec is represented as a varint and MIME is a hierarchical classification system anyway, wouldn't it make sense to define one big range for MIME with a "prefix" (most significant septet of a 3-byte varint) followed by 7 bits each for type and subtype? That gives a range of 128 types and 128 subtypes for each type - types with more subtypes can use multiple type septets. Squabbling over bit real estate is unnecessarily complex and less valuable than simpler decoding logic. Speaking of all this, isn't multicodec essentially a broader (though non-hierarchical) version of mimetypes with a binary encoding? |
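One way to read the prefix idea, as a sketch only: pack the prefix, type and subtype into a 21-bit number and encode it as an unsigned varint. The prefix value `0x01`, the packing order and all function names below are assumptions; the thread does not fix any of them.

```javascript
// Hypothetical prefix septet; the thread does not pick a value.
const MIME_PREFIX = 0x01;

function checkSeptet(value, label) {
  if (!Number.isInteger(value) || value < 0 || value > 0x7f) {
    throw new RangeError(`${label} must fit in 7 bits`);
  }
}

// Pack prefix (most significant septet), type and subtype into one number.
function packMimeCode(type, subtype) {
  checkSeptet(type, 'type');
  checkSeptet(subtype, 'subtype');
  return (MIME_PREFIX << 14) | (type << 7) | subtype;
}

// Split a packed code back into its three septets.
function unpackMimeCode(code) {
  return {
    prefix: (code >> 14) & 0x7f,
    type: (code >> 7) & 0x7f,
    subtype: code & 0x7f,
  };
}

// Unsigned varint: least significant septet first, high bit set on every
// byte except the last.
function encodeVarint(n) {
  const bytes = [];
  while (n >= 0x80) {
    bytes.push((n & 0x7f) | 0x80);
    n >>>= 7;
  }
  bytes.push(n);
  return Uint8Array.from(bytes);
}

const code = packMimeCode(3, 5);        // type 3, subtype 5 (arbitrary indices)
console.log(encodeVarint(code));        // three bytes on the wire
console.log(unpackMimeCode(code));
```

Because the varint puts the least significant septet first, the bytes appear in the order subtype, type, prefix. A decoder that wants to branch on the prefix therefore has to read all three bytes first.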
The missing piece is actually making the mapping. |
Is it outside of the scope of this topic to suggest bijective improvements of MIME in deciding the mapping? For instance, in #84 @Stebalien mentions that Some ideas:
Some of these may belong in their own multicodecs or be outright unnecessary. Also, none of the suggestions thus far offers a future-proof way of representing higher-level interpretations of a lower-level data format, eg |
Redefining MIME types is likely way outside of the scope of this project. This project is primarily concerned with defining short "codes" for arbitrary things. |
Not sure where to put this comment, whether in a new issue or elsewhere, but I'd like to see a mime-type column in the table. It should be unique, as codec parameters can be specified for things that are not. This would help ensure that duplicate entries are not added, and would also provide a way to automatically map from mime-types to multicodecs without people building their own (and possibly incorrect) tables. |
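A minimal sketch of how such a column could be used, assuming each table row carries an optional `mime` field. The row shape, the example values and the name `buildMimeIndex` are hypothetical; the codes are the ones proposed earlier in this thread, not assigned ones.

```javascript
// Hypothetical helper: index table rows by their mime-type column and
// reject duplicates, so every mime type maps to exactly one codec.
function buildMimeIndex(rows) {
  const byMime = new Map();
  for (const row of rows) {
    if (!row.mime) continue; // rows without a mime type are allowed
    const key = row.mime.toLowerCase();
    if (byMime.has(key)) {
      const other = byMime.get(key);
      throw new Error(`duplicate mime type ${key}: ${other.name} and ${row.name}`);
    }
    byMime.set(key, row);
  }
  return byMime;
}

// Illustrative rows only; the codes come from the proposal in this thread.
const rows = [
  { name: 'mime/application/json', code: 0x1000, mime: 'application/json' },
  { name: 'mime/text/html', code: 0x1a90, mime: 'text/html' },
  { name: 'some-codec-without-mime', code: 0x9999 },
];

const index = buildMimeIndex(rows);
console.log(index.get('text/html').code.toString(16));
```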
Made this comment in the PR, but reposting here for visibility:
|
Just my two cents: it looks like concern over achieving the perfect encoding of mime types is what has stalled this (long overdue) work. I would suggest being much less precious about it, and just treating mime types like a legacy format. The existing codec name Content that has been encoded with the intent of running on an HTTP server (a legacy protocol in the context of multiformats) can and should use multicodec-encoded mime type mappings because that is the nature of the content. If in the future a more idiomatic multicodec schema is designed for stuff like images, then those codecs can be added in addition to the existing legacy mime-type mappings. There's plenty of byte real estate. No need to be precious. |
Naming every possible type is already done with mime types.
I think it would make sense to use them as codes, instead of defining your own.
"image/jpg" or "text/plain" make nice paths btw ;)
What are your thoughts on it, did I miss something?