Add SoftWare Heritage persistent IDentifiers #203

Open
wants to merge 1 commit into
base: master

Ericson2314
Contributor

@Ericson2314 Ericson2314 commented Jan 15, 2021

See https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html for the technical details of that spec.

Software Heritage has done a superb job promoting content addressing in general, and their identifier scheme (SWHIDs, for short) in particular. By supporting them in CIDs / IPLD, I hope the IPFS ecosystem can align itself with that effort.

Per the linked documentation, SWHIDs have their own nested grammar and versioning scheme. I have taken the version 1 core identifier grammar, unrolled it, and replaced : with - per the guidelines on separators, with the result being these 5 rows.

Also note that some of those schemes coincide with certain forms of git-raw, already in this table. However, adding them here is not redundant, because of the deserialization direction. While a git-raw CID may legally point to any sort of git object, the relevant SWHIDs specify in the identifier what sort of git object is expected, and if it points to a different type of valid git object, deserialization fails. That means there is no way to losslessly convert a git-coinciding SWHID into a git-raw CID.
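As a rough illustration of the round-trip argument, here is a minimal Python sketch. The `swhid-1-*` names follow the "replace : with -" convention described above; the helper functions are hypothetical and work on (codec name, digest) pairs rather than real CID bytes.

```python
import re
from typing import Tuple

# Core SWHID v1 grammar: swh:1:<object_type>:<40 hex digits>
SWHID_RE = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")
PREFIX = "swhid-1-"


def swhid_to_typed(swhid: str) -> Tuple[str, bytes]:
    """Map a core SWHID to a (codec name, digest) pair, keeping the object type."""
    match = SWHID_RE.match(swhid)
    if match is None:
        raise ValueError(f"not a core SWHID v1: {swhid!r}")
    object_type, hex_digest = match.groups()
    return PREFIX + object_type, bytes.fromhex(hex_digest)


def typed_to_swhid(codec: str, digest: bytes) -> str:
    """Inverse of swhid_to_typed; only defined for the swhid-1-* names."""
    if not codec.startswith(PREFIX):
        raise ValueError(f"cannot recover the object type from codec {codec!r}")
    return f"swh:1:{codec[len(PREFIX):]}:{digest.hex()}"


def swhid_to_git_raw(swhid: str) -> Tuple[str, bytes]:
    """Lossy mapping: every git-compatible type collapses to git-raw."""
    codec, digest = swhid_to_typed(swhid)
    if codec == PREFIX + "snp":
        raise ValueError("snapshots are not git objects")
    return "git-raw", digest
```

With the typed names the round trip is exact; after going through `swhid_to_git_raw`, the object type is gone and `typed_to_swhid` has nothing to rebuild it from.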

Contains #207.

@vmx
Member

vmx commented Jan 18, 2021

From having a quick look at the spec you linked, it seems that they are identifiers and not codecs (in the IPLD sense). The multicodec within a CID is about how the data is encoded and not what contextual meaning it has. If I understand it correctly and SoftWare Heritage identifiers always point to Git blobs, then the codec for the CID to use would indeed be git-raw.

@Ericson2314
Contributor Author

@vmx

  • SWHID snapshots are not an official git format, unlike the others.

  • SWHID releases, revisions, directories, contents, are things covered by git-raw, but git-raw is no substitute as I tried to explain in that last paragraph. If I have a SWHID directory hash and convert it to a git-raw, I cannot automatically convert it back because I don't know which of those 4 cases it should be. Likewise, those 4 codecs guarantee it will be that exact sort of object when decoding (or get a decode failure) and git-raw cannot do that.

The only way to embed SWHIDs losslessly in IPLD is to have 5 codecs, one for each of the 5 possible prefixes before the hash of a core SWHID.

@vmx
Member

vmx commented Jan 18, 2021

  • SWHID snapshots are not an official git format, unlike the others.

This might be its own IPLD Codec.

  • If I have a SWHID directory hash and convert it to a git-raw, I cannot automatically convert it back because I don't know which of those 4 cases it should be.

Yes, this information would need to be transmitted out of band. Similar to git-raw (except that in case of Git it's part of the encoded data). From the CID you only know that it is a Git Object, you can't tell whether it's a commit or a tree. The CID only gives you the information on how to decode the data and not the semantics of the data.

@Ericson2314
Contributor Author

Trying to transfer the information out of band is rather sub-par though.

Remember that these objects are not just leaves, they contain SWHIDs themselves (after decoding). SWHID snapshots point to the other git objects, and when one dereferences a directory, one gets full SWHIDs rather than mere git hashes (the permission bits in the git directory object in this case provide enough info to recover a full SWHID for each entry).

A goal here would be to try to provide access to Software Heritage's archive over IPFS. But if all this logic has to happen out of band, then we need to modify IPFS in an arbitrary way? It's much nicer to just make a codec that does this work, and fits everything within the normal IPLD data model without extra steps. It also means that a bitswap query doesn't need to be amplified into 4 SWHID queries if the mirroring is done on demand.

The CID only gives you the information on how to decode the data and not the semantics of the data.

To be clear, this "how to decode" vs "semantics" of the data is purely human interpretation.

If the git object grammar is something like:

<git-object> ::= <git-blob>
              |  <git-tree>
              |  ...

git-tree isn't any less a non-terminal just because git-object includes it. If we really didn't want any codec to ever agree with a "super codec", well, then we would just have raw bytes because everything can be recovered from raw bytes! And again, the SWHID codecs wouldn't actually be "sub codecs" by the letter, because one gives you git-raw child links and the other gives swhid-1-* child links.

@rvagg
Member

rvagg commented Jan 19, 2021

Trying to transfer the information out of band is rather sub-par though.

Well, yes, this is unfortunately a limitation of CIDs as they are currently framed, but there has to be a bound somewhere in how much context you can jam into an identifier and the current incarnation of CID defines the codec portion as something like: what piece of decoding software do I need to reach for to turn the arbitrary bytes associated with this identifier into something not arbitrary. As you point out, the boundaries of this are a bit squishy because the same bytes could be interpreted in different forms at different levels. We try our best though to keep CIDs at the most basic level, which I think is the basis of @vmx's objection. Additional context that involves questions of where, why or how this object fits into a larger picture is (mostly) out of scope of CIDs.

Let's take one of the SWH examples: https://archive.softwareheritage.org/browse/content/sha1_git:94a9ed024d3859793618152ea559a168bbcbb5e2/raw/

If I take that file locally and calculate its SHA-1 on it, I get this:

$ curl -sL https://archive.softwareheritage.org/browse/content/sha1_git:94a9ed024d3859793618152ea559a168bbcbb5e2/raw/ > license
$ sha1sum license
8624bcdae55baeef00cd11d5dfcfa60f68710a02  license

So for that file as a blob, I could make a CID that takes that digest, wraps it in a SHA-1 multihash, then wraps that in a raw codec to say "when you get the bytes associated with this digest, pass it through the raw decoder to get the usable data" (raw being a special pass-through decoder).

I could then add it to a git repository and see how it bundles it:

$ git init
$ git add license
$ git commit -m 'added licence' -a
[master (root-commit) fa1d860] added licence
 1 file changed, 674 insertions(+)
 create mode 100644 license
$ git rev-list --objects --all
fa1d8600ae9c38e1e96361649f2e214b00c6d485
158a4f5296f0b94433acb8057517ec6ce7925732
94a9ed024d3859793618152ea559a168bbcbb5e2 license

And now we see the 94a9ed024d3859793618152ea559a168bbcbb5e2 identifier that SWH is using. But what's in that?

cat .git/objects/94/a9ed024d3859793618152ea559a168bbcbb5e2 just gives me gibberish. It's encoded with git-raw which is distinctly not raw since I can't do anything useful with it without unwrapping using a codec. So the git-raw codec is doing two main things on this file - prepending blob <size>\0 and deflating the results. So there's a two-way codec process going on here before I can get to the usable content.
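The difference between the two digests above can be reproduced with a short Python sketch. The function names are illustrative only; the sketch assumes the standard git loose-object layout, in which the object id is the SHA-1 of the header plus content before deflating, and the file on disk is the zlib-compressed form of those same bytes.

```python
import hashlib
import zlib


def git_blob_object(content: bytes) -> bytes:
    """Uncompressed git object: 'blob <size>', a NUL byte, then the content."""
    return b"blob " + str(len(content)).encode("ascii") + b"\x00" + content


def git_blob_id(content: bytes) -> str:
    """The object id is the SHA-1 of the uncompressed header plus content."""
    return hashlib.sha1(git_blob_object(content)).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """Bytes as stored under .git/objects/: the same object, zlib-deflated."""
    return zlib.compress(git_blob_object(content))


content = b"hello\n"
print(hashlib.sha1(content).hexdigest())  # plain SHA-1 of the file, as sha1sum reports
print(git_blob_id(content))               # object id, as `git hash-object` reports
assert zlib.decompress(loose_object_bytes(content)) == git_blob_object(content)
```

Running the same functions over the downloaded license file should yield the two different digests shown in the shell session above.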

So a CID that wraps 94a9ed024d3859793618152ea559a168bbcbb5e2 in a SHA-1 multihash and identifies it as git-raw tells me (a) how to verify the contents of the associated binary with a hashing function and (b) how to turn the contents into usable data.
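For concreteness, a minimal sketch of how such a CID could be assembled by hand in Python. It assumes the CIDv1 binary layout (version, codec, multihash) and the multicodec table values 0x78 for git-raw and 0x11 for sha1; the helper names are made up for this example, and a real application would use a CID library instead.

```python
import base64

CID_V1 = 0x01
GIT_RAW = 0x78  # multicodec code for git-raw (assumed from the multicodec table)
SHA1 = 0x11     # multihash code for sha1 (assumed from the multicodec table)


def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used by multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def git_raw_cid(sha1_hex: str) -> bytes:
    """CIDv1 bytes: version, codec, then the multihash (code, length, digest)."""
    digest = bytes.fromhex(sha1_hex)
    if len(digest) != 20:
        raise ValueError("SHA-1 digests are 20 bytes")
    multihash = varint(SHA1) + varint(len(digest)) + digest
    return varint(CID_V1) + varint(GIT_RAW) + multihash


def to_base32(cid: bytes) -> str:
    """Multibase base32 (lowercase, unpadded) with its 'b' prefix."""
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")


cid = git_raw_cid("94a9ed024d3859793618152ea559a168bbcbb5e2")
print(cid.hex())
print(to_base32(cid))
```

Note that nothing in these bytes says "blob": the only type information is the codec, which is exactly the point being debated in this thread.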

What happens beyond that is out of scope for CIDs unfortunately. If you want to bring some special context about what to do with this file then that has to come from elsewhere, which is a very common activity in content addressing. Just because I have a file that has the text contents of the GPL doesn't tell you anything about why I have it or how it fits into a larger picture. I could be using it to say "this is my project's license", or as part of a wider project that compares different licenses, or maybe it's part of my test fixtures - this is context that's important to the collection of data that must sit outside of how you identify a single piece of a large content-addressed bundle of data. Typically you bring your context with you as you navigate down to individual pieces - if my GPL file is part of my test fixtures then the context would likely come via some directory structure that I've built to contain it, but I can't fit that directory structure into a CID, it has to come before the CID and the application that is consuming my collection of blobs is building the context as it goes.

I hope that explanation helps get toward the heart of the objection here. The codec field for a CID should tell you how to extract the usable data from raw bytes.

Regarding your comment:

While a git-raw CID may legally point to any sort of git object, the relevant SWHIDs specify in the identifier what sort of git object is expected, and if it points to a different type of valid git object, deserialization fails.

The git-raw codec has a decoding phase which disambiguates the binary data encountered. The initial bytes after deflate tell us how to interpret the rest of the content, blob, commit, tag, tree. (e.g. https://github.com/ipld/js-ipld-git/blob/ac45e5d6fa9d84dd5d4588f7c614eb61053c42c2/src/util.js#L56-L80)
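A minimal Python sketch of that disambiguation step, assuming the object has already been inflated. This is not the js-ipld-git code linked above, only an illustration of the same header check, and the function name is hypothetical.

```python
from typing import Tuple

GIT_TYPES = {b"blob", b"tree", b"commit", b"tag"}


def split_git_object(inflated: bytes) -> Tuple[str, bytes]:
    """Split an inflated git object into (type, body) using its header."""
    header, sep, body = inflated.partition(b"\x00")
    if not sep:
        raise ValueError("missing NUL after header")
    kind, _, size = header.partition(b" ")
    if kind not in GIT_TYPES:
        raise ValueError(f"unknown git object type: {kind!r}")
    if not size.isdigit() or int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body


kind, body = split_git_object(b"blob 6\x00hello\n")
print(kind, body)
```

A full decoder would then dispatch on `kind` to parse the body as a blob, tree, commit, or tag.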

SWH recognises this at https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html#git-compatibility, but does point out that snapshots are something different. Do you know anything about what the binary format of these objects that are being hashed is? There may be a case for adding a codec for this missing one if it's something that can't be covered by an existing codec (i.e. new code would have to be written to decode such an object, not just re-using existing code, like for git-raw).

Otherwise, wrapping these SWH identifiers in SHA-1 multihash and git-raw should give you as much information as CIDs are intended to provide - and they would also point to identical objects that are stored in any other system outside of SWH because it's the same object with the same decoding mechanism.

@warpfork
Contributor

warpfork commented Jan 19, 2021

Quick nonfinal thoughts:

  • We tend to try to say that multicodecs are not meant to act as types. We try to say this because trying to make a numbering for every type of data in the universe is basically a Gödelian quest -- it can become somewhat silly if every new concept requires a new number, so we try to avoid it.

    • By contrast the SWHID docs say "A SWHID points to a single object, whose type is explicitly captured by <object_type>".
      • So... indeed. If we map SWHID's object_type info into multicodec identifiers, we're kinda tipping towards the Gödelian. It definitely gives pause for review.
  • But "tend" and "try" are carefully chosen words there.

    • Really, we prioritize encouraging new developments to try to reuse existing codecs as much as possible, and build higher level features on top of the Data Model semantics, maybe check out IPLD Schemas, etc.
      • We like a world with fewer novel codecs better because such a world is just plain more legible (not just for us, but for everyone in the world).
    • Bridging to existing systems can be worth spending a multicodec identifier on.
      • Especially if they already have large adoption, or are in relevant ecosystems, or both.

So what's the balance, in this case?

  • I don't yet have a strongly held opinion.

  • It's very head-scratchy that four out of the five SWHID object_type's are essentially git.

    • I doubly don't know how I feel about this, yet.
  • ✔️ that the SWHID snapshot object_type isn't git, and so would seem to need a separate multicodec indicator regardless.

    • I've looked at the encoding spec for snapshot object_type and... this is definitely one of those things where if it was a design doc draft, and somebody asked my opinion on it, I'd say "have you considered using CBOR instead of inventing a new one-off?". But, it's not a draft, and they did, and people are using this, so. Ship: sailed. Fine.
  • I'm (I'll confess frankly) rather fond of the Software Heritage project. I think they're doing good stuff for good reasons. And to put it slightly more objectively: I'm aware that they have some significant and relevant adoption.

    • This makes me generally more open to voting in favor of this being worth spending one (or even several) multicodec identifiers on.

@vmx
Member

vmx commented Jan 19, 2021

I had another closer look at the SWHIDs.

At the section about how to compute the identifiers it reads:

An important property of any SWHID is that its core identifier is intrinsic: it can be computed from the object itself, without having to rely on any third party.

This is exactly what CIDs are about. CIDs are computed from the object itself and contain just enough information to make it interpretable. Ideally objects are self-describing and if I understand things correctly all of your objects are, else you couldn't compute your identifier.

If you use the git-raw Multicodec you could retrieve an object from your archive. You know it's a Git object so you could even use existing IPLD decoders to decode those. Once you've decoded the Git Object you have enough information to decode the rest, which is specific to SoftWare Heritage.

It's similar to what IPFS is doing with the filesystem (UnixFSv1) implementation. UnixFSv1 contains information about files and directories (like your "directories", "contents", etc). This specific information is then wrapped in a more general purpose container called DAG-PB, where you use Git Objects instead. From the outside, UnixFSv1 is only DAG-PB, and so are its CIDs. Thanks @aschmahmann for coming up with this analogy.

Even the snapshots would work without a special SWHID Multicodec, as you wrap it in a valid Git object and just use a different identifier (or whatever this prefix is called). The Git IPLD Codec would need to be adapted to just parse the format and not care about the exact identifier. But I'd totally be open to relax those rules (if we want to keep the strictness, we could even introduce a new codec called git-container or so). You did a great job at re-using the Git object format for snapshots, that really plays well with IPLD and multicodecs.

@Ericson2314
Contributor Author

@vmx

Even the snapshots would work without a special SWHID Multicodec

But if git officially decides to call anything else a snapshot, then IPLD is in a very unfortunate situation.

@Ericson2314
Contributor Author

@rvagg @warpfork

I hope that explanation helps get toward the heart of the objection here. The codec field for a CID should tell you how to extract the usable data from raw bytes.

We tend to try to say that multicodecs are not meant to act as types. We try to say this because trying to make a numbering for every type of data in the universe is basically a Gödelian quest -- it can become somewhat silly if every new concept requires a new number, so we try to avoid it.

Indeed, and I am sympathetic: a multicodec like "prime number" would be a classic violator here, because validating at codec time would be quite expensive. Even worse would be "valid RSA public key" as a codec. These are types that clearly violate that maxim and don't belong as codecs.

Now, I am very sympathetic that it feels a little icky to be adding these "subcodecs" when all the extra processing could be done out-of-band. And if SWH were in the process of designing SWHID v1 right now, I too might ask them "do we really need this information in the reference when we already have a nice tag in the referent?"

But at least we can cheaply incorporate the distinction here into the codec: instead of dispatching on that tag, just require it to be an exact match and take that branch. This makes it a far less bad offender than "prime number" and "valid RSA public key": they are types, but types with a clear and efficient interpretation for a codec. Also, precisely because it overlaps with git-raw, it shouldn't be hard to implement as we can reuse the code that exists.
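A minimal Python sketch of "require an exact match instead of dispatching", assuming inflated git object bytes as input. The `swhid-1-*` codec names follow the naming proposed in this PR, the mapping to git header tags is taken from the SWHID git-compatibility notes, and `decode_strict` is a hypothetical name.

```python
# Proposed codec name -> git header tag it must match (hypothetical table).
EXPECTED_GIT_TYPE = {
    "swhid-1-rel": b"tag",
    "swhid-1-rev": b"commit",
    "swhid-1-dir": b"tree",
    "swhid-1-cnt": b"blob",
}


def decode_strict(codec: str, inflated: bytes) -> bytes:
    """Return the object body, or fail if the header tag is not the expected one."""
    expected = EXPECTED_GIT_TYPE[codec]
    header, sep, body = inflated.partition(b"\x00")
    kind, _, _size = header.partition(b" ")
    if not sep or kind != expected:
        raise ValueError(f"{codec} expects a git {expected.decode()} object")
    return body


print(decode_strict("swhid-1-cnt", b"blob 6\x00hello\n"))
```

The check costs one comparison on the header, which is the sense in which it is cheap compared with validating something like primality.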


And finally, let me step back a bit from "pure engineering" considerations and appeal to the real world context:

As @warpfork says:

  • Bridging to existing systems can be worth spending a multicodec identifier on.

    • Especially if they already have large adoption, or are in relevant ecosystems, or both.
  • I'm (I'll confess frankly) rather fond of the Software Heritage project. I think they're doing good stuff for good reasons. And to put it slightly more objectively: I'm aware that they have some significant and relevant adoption.

    • This makes me generally more open to voting in favor of this being worth spending one (or even several) multicodec identifiers on.

These are exactly my thoughts, too. IPFS (and Filecoin) has long done good work with major internet-accessible archives. SWH, however, is the only major archive I'm aware of which is already totally on board with the principles of content addressing, Merkle DAGs, etc. There is the possibility here not just to migrate data from the archives and store it on IPFS nodes, or even build an archive-accessing app that uses IPFS behind the scenes, but to turn the SWH archive into bona fide "super nodes" in the IPFS network that share the archive using just regular IPFS interfaces (bitswap etc.), are accessible to regular IPFS nodes, and can be easily embedded in other IPFS data.

Accepting SWHIDs is a fait accompli for anything that wants to bridge so deeply with SWH, but that is also the only impediment. On every other consideration, SWH and IPFS are in perfect alignment. Maybe when it's time to make SWHIDs v2, the embedding in IPLD will be a paramount consideration from the get-go :).

@vmx
Member

vmx commented Jan 19, 2021

But if git officially decides to call anything else a snapshot, then IPLD is in a very unfortunate situation.

Good point. I also looked at the Git implementation again and it does indeed decode the whole thing (and not just the container) so it would make sense to have a codec for the snapshots.

@mikeal
Contributor

mikeal commented Jan 20, 2021

Sorry for showing up a bit late but I’m going to have to go back over a few things other people covered because I have some different recommendations.

It looks like (I could be wrong since I haven’t seen the implementation) the various codec identifiers are not variations in block format (each one can be parsed with an identical parser without any out-of-band information). Each one is meant to contain some typing information so that the link itself is more meaningful. This is something we try not to do but have been forced into accepting in a few cases. To be clear, we recommend against this because it’s just not the best way to add type information to linking and we try to steer people to better alternatives; we aren’t trying to over-police the codec table.

It’s hard to see when just looking at multicodecs and CID’s but it’s actually expected that context about linked data is held in a parent linking to that subtree. It’s not the case that we expect applications to be able to fully interpret that application’s meaning of a link by looking only at the link’s multicodec. There may be context about the link inferred from the multicodec and the multicodec must tell us how to parse the linked block data, but there’s often more information than this that an application will need in order to figure out how to use the link.

Ideally, I’d like to see this pared down to one or two new multicodecs (not pared down to just the git multicodec as has been suggested) by putting the additional typing information (snapshot, release, revision, directory or content) somewhere else.

  • If there is something in the block data you could encode in order to contain this additional typing information that would be ideal, but potentially not possible.
  • If you can get by with the typing information being in the parent node and not in the link or the linked data you could encode something like [ ‘release’, CID ] into the parent and parse out the type information along with the link when you traverse. This works for cases in which you are not passing around links to each of these subtrees and are instead passing around a reference to the root of the tree along with paths to relevant subtrees.
  • However, if you can’t do either of the prior options (which may be the case) you could create a compound link using an identity CID. An identity CID has a block body as its “hash.” It’s the CID equivalent of a Data URI.
    • Option 1: Use dag-cbor for encoding the link as [ ‘release’, CID ] in the identity.
    • Option 2: Add a new multicodec for your typed link that corresponds to a very small block format for encoding these specialized links, something like | integer-code-for-type | CID | in bytes and put that in the identity.
      • Option 2a: Use the git multicodec for the CID inside this new compound link.
      • Option 2b: Use a second multicodec for your extended git blocks.
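To make Option 2 with 2a more tangible, here is a minimal Python sketch of a compound link packed into an identity CID. Everything specific to the typed link is hypothetical: the codec code is a placeholder chosen from what I understand to be the multicodec private-use range, and the integer type codes are invented for illustration. Only the identity multihash code (0x00) and the git-raw and sha1 codes are taken from the multicodec table.

```python
from typing import Tuple

IDENTITY = 0x00        # identity multihash: the "digest" is the block body itself
TYPED_LINK = 0x300000  # HYPOTHETICAL codec code, placeholder from the private-use range
TYPE_CODES = {"snapshot": 0, "release": 1, "revision": 2, "directory": 3, "content": 4}
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}  # hypothetical numbering


def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used by multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint at pos; return (value, position after it)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def typed_link_cid(kind: str, inner_cid: bytes) -> bytes:
    """Identity CID whose body is | integer-code-for-type | CID |."""
    body = varint(TYPE_CODES[kind]) + inner_cid
    return varint(1) + varint(TYPED_LINK) + varint(IDENTITY) + varint(len(body)) + body


def parse_typed_link(cid: bytes) -> Tuple[str, bytes]:
    """Recover (type name, inner CID bytes) from a typed-link identity CID."""
    version, pos = read_varint(cid, 0)
    codec, pos = read_varint(cid, pos)
    hash_code, pos = read_varint(cid, pos)
    length, pos = read_varint(cid, pos)
    if (version, codec, hash_code) != (1, TYPED_LINK, IDENTITY):
        raise ValueError("not a typed-link identity CID")
    body = cid[pos:pos + length]
    type_code, offset = read_varint(body, 0)
    return TYPE_NAMES[type_code], body[offset:]


# Inner link: CIDv1, git-raw (0x78), sha1 (0x11), 20-byte digest.
inner = bytes([0x01, 0x78, 0x11, 0x14]) + bytes.fromhex("94a9ed024d3859793618152ea559a168bbcbb5e2")
link = typed_link_cid("content", inner)
print(parse_typed_link(link) == ("content", inner))
```

The trade-off is that the type travels with the link without a per-type codec, at the cost of every consumer having to understand the compound-link format.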

@Ericson2314
Contributor Author

@mikeal thanks for your input. Seeing your link to this in ipld/specs#349, I left a comment there describing my understanding of the principles for multicodecs ipld/specs#349 (comment), and indeed it is ambiguous whether having this information in the multicodec passes that litmus test.

To me, the real deciding factor here is not technical but social. As @warpfork mentions, SWH is "doing good stuff for good reasons" and "they have some significant and relevant adoption". With the bare minimum of multicodecs: swhid-1-snp + git-raw, we can share SWH's immense archive on IPFS, but it will be lossy, because even if nodes are isomorphic, edges/references/links are not. SWH and IPFS will both be second-class citizens within each other's ecosystems. Conversely, if we spend just a few more multicodecs, we have 1st class interoperability and 1st class citizenship. Again, the goal isn't building a new end-application, but bridging two already extant things, each of which is the "thin waist" in their respective ecosystems.

@Ericson2314
Contributor Author

To get really specific about what "first-class" interop might mean, what I'm envisioning is a modified IPFS node which knows how to relay Bitswap requests and responses to/from SWH's native interfaces. I call this node the bridge, since the requests and responses are 1-1 and the translation is stateless. Regular IPFS nodes can connect to this node (certainly by manual intervention, and hopefully eventually via the DHT if we also make the effort to populate it), and then do normal IPFS things, and accessing SWH-provided data will transparently work.

I actually think the answer to @mikeal's two questions is in principle "yes": there is enough information in both the parent and child blocks alike. But Bitswap requesting is by CID: the bridge won't have access to either the parent or child blocks when it goes to translate the request.

Any other way to try to get the information from the parent and child blocks to the bridge node or from the node making the ultimate request seems strictly worse to me. The codec is the one place where format-specific logic is supposed to be; everything else is supposed to be format agnostic (dag-pb backcompat notwithstanding). If off-the-shelf IPFS is to work effortlessly with the existing SWH data model and existing storage and retrieval infrastructure, enough new multicodecs to faithfully translate SWHIDs to CIDs seems the least invasive way to do it.
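A minimal Python sketch of the translation step such a bridge would perform, assuming it sees only the codec name and digest of a requested CID. The function name and the `swhid-1-*` names for the four git-compatible types are hypothetical; the actual calls to SWH's interfaces are left out.

```python
from typing import List

PREFIX = "swhid-1-"
SWH_TYPES = ("snp", "rel", "rev", "dir", "cnt")
GIT_COMPATIBLE = ("rel", "rev", "dir", "cnt")


def candidate_swhids(codec: str, digest: bytes) -> List[str]:
    """SWHIDs a stateless bridge would have to try for one requested CID."""
    hex_digest = digest.hex()
    if codec.startswith(PREFIX) and codec[len(PREFIX):] in SWH_TYPES:
        # Typed codec: the object type is in the CID, so exactly one lookup.
        return [f"swh:1:{codec[len(PREFIX):]}:{hex_digest}"]
    if codec == "git-raw":
        # Untyped codec: the bridge has no block to inspect, so it must guess.
        return [f"swh:1:{t}:{hex_digest}" for t in GIT_COMPATIBLE]
    raise ValueError(f"codec {codec!r} is not served by the bridge")


digest = bytes.fromhex("94a9ed024d3859793618152ea559a168bbcbb5e2")
print(candidate_swhids("swhid-1-cnt", digest))
print(candidate_swhids("git-raw", digest))
```

With git-raw the single Bitswap request fans out into up to four SWH lookups, which is the amplification mentioned earlier in the thread.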

@rvagg
Member

rvagg commented Jan 22, 2021

OK, so it seems to me that we're boiling down to the "content routing" problem here, would that be correct @Ericson2314? That it's not practical to just throw all of the objects in SWH into the IPFS DHT but instead the CIDs themselves should provide a hint for where to go to retrieve such objects. Or something like that.

This came up recently for likecoin, discussion from here onward: #200 (comment) including @aschmahmann's excellent input from the go-ipfs side.

Is content routing the primary need here? Your OP suggested a bi-directionality problem, with loss of information:

While a git-raw CID may legally point to any sort of git object, the relevant SWHIDs specify in the identifier what sort of git object is expected, and if it points to a different type of valid git object, deserialization fails.

I can see this being true for snapshots, but it's not strictly true for the other types, is it, because git-raw retains that information internally? Does this simply come back to the content routing problem of wanting to intercept requests for certain CIDs and convert them to requests to SWH, and without this additional information you can't properly query SWH's systems because they need it in order to form a valid identifier for their current API to handle? In terms of object identification, this additional prefix information is redundant (perhaps except for the fact that it slightly hardens the use of SHA-1), so I suppose there are some other limitations in their systems that require partitioning of these objects.

@Ericson2314
Contributor Author

OK, so it seems to me that we're boiling down to the "content routing" problem here, would that be correct @Ericson2314?

Routing yes, but forwarding in particular. Populating the DHT is great, and I hope to see other IPFS nodes mirror popular data, but the main thing is being able to translate bitswap to the SWH APIs with the bridge node. For that last bit, we can ignore whether someone is manually connecting to the bridge node or got there via a DHT entry.

That it's not practical to just throw all of the objects in SWH into the IPFS DHT but instead the CIDs themselves should provide a hint for where to go to retrieve such objects. Or something like that.

Err it's my understanding that only small objects reside directly in the DHT, and otherwise the DHT just tells you what nodes to try dialing? In any event, I'm not against the DHT and storing things natively on regular IPFS nodes, I just want the basics to work first. It's very much analogous to using IP just for the internetwork, building momentum, and then trying to use it for the LAN too :).

Does this simply come back to the content routing problem of wanting to intercept requests for certain CIDs and convert them to request from SWH, and without this additional information you can't properly query SWH's systems because it needs this additional information in order to form a valid identifier for their current API to handle?

Exactly! See https://docs.softwareheritage.org/devel/swh-graph/api.html and https://docs.softwareheritage.org/devel/swh-web/uri-scheme-api.html for the interfaces in question.

In terms of object identification, this additional prefix information is redundant (perhaps except for the fact that it slightly hardens the use of SHA-1), so I suppose there's some other limitations in their systems that require partitioning of these objects.

Yes it indeed isn't strictly-needed information. I don't actually know the engineering backstory. I could forward the question if you like.

@vmx
Member

vmx commented Jan 28, 2021

I've taken a closer look at how the SWHIDs are generated as I see the need to be able to go from something stored in IPFS to a SWHID and back. I've read some of the docs and a bit of the source code. @Ericson2314 please let me know if I understand things correctly. Let's leave the snapshots aside for now. The other four types have a 1:1 mapping to Git objects. They are not only like Git objects, they are actual Git objects. So if you just look at the bytes of the objects, you couldn't tell whether they come from the SoftWare Heritage system or from some Git repository. Is that correct?

The mapping is (I'm using the identifiers Git is using in their objects):

  • commit => SWH Revision (rev)
  • tree => SWH Directory (dir)
  • blob => SWH Content (cnt)
  • tag => SWH Release (rel)

So by having a CID, which contains the same hash that the SWHID is using, and by looking at the data (which is compressed, which makes things more complicated), one can construct a full SWHID. Is that correct?
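The mapping above can be sketched in a few lines of Python. This assumes the object bytes have already been inflated and that, as with git itself, the hash is taken over those uncompressed bytes; snapshots are left aside, and the function name is hypothetical.

```python
import hashlib

# Git header tag -> SWHID object type, per the mapping listed above.
GIT_TO_SWH = {b"commit": "rev", b"tree": "dir", b"blob": "cnt", b"tag": "rel"}


def swhid_from_git_object(inflated: bytes) -> str:
    """Build a core SWHID from the inflated bytes of a git object."""
    kind = inflated.split(b" ", 1)[0]
    if kind not in GIT_TO_SWH:
        raise ValueError(f"not a git object SWH can type: {kind!r}")
    digest = hashlib.sha1(inflated).hexdigest()
    return f"swh:1:{GIT_TO_SWH[kind]}:{digest}"


print(swhid_from_git_object(b"blob 6\x00hello\n"))
```

So the full SWHID is recoverable once the block is in hand; what a bare git-raw CID lacks is the type before the block has been fetched.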

@Ericson2314
Contributor Author

@vmx I think that is all correct, except for the small part about compression (I should have mentioned this earlier in the thread, but I forgot). It is my recollection from the IPFS and Nix work that uncompressed data is hashed with git.

Still, SWHIDs and Git are in agreement on the compression part, whatever the answer may be, so it doesn't subtract from your larger point.

@Ericson2314
Contributor Author

(Also "git tag hashes" are very hard to find information about. I'm not sure what's up with that.)

@vmx
Member

vmx commented Jan 29, 2021

(Also "git tag hashes" are very hard to find information about. I'm not sure what's up with that.)

I don't understand what you are referring to :)

@Ericson2314
Contributor Author

Oh I just mean I had never heard of tag hashes before reading about them in the SWH docs, and I have a hard time finding hashes associated with tags in the wild that aren't just that of the commit being tagged. This is something I should ask SWH about, but I wanted to mention it here in case someone else is as confused as I am.

@ribasushi

Ironically github lists it, but can not render it

@Ericson2314
Contributor Author

I see https://git-scm.com/docs/git-tag. I guess I was confusing lightweight and annotated tags. Thanks @ribasushi.

@vmx
Member

vmx commented Feb 16, 2021

@Ericson2314 This PR has been stale for a while, so I want to make sure you're not blocked. As the discussion at #204 shows, there are still things that need to be figured out: what exactly qualifies as a codec, and whether your use case should fall into that or be its own separate thing.

The only thing currently everyone in this discussion agrees on, is having a codec for SWH Snapshots. Can you move forward with the developments you've planned when we only add that codec and use git-raw for the rest? Once there is a better understanding based on the actual implementation and it turns out more codecs are needed, we could then revisit it again. Would this be a workable way forward?

@Ericson2314
Contributor Author

@vmx To be clear, I am in no rush until I know when I will be able to start working on the stuff proposed here in earnest. But if you like, I broke it into two commits and opened #207 which just has swhid-1-snp.

@vmx
Member

vmx commented Feb 17, 2021

Thanks @Ericson2314 for being so patient and working together on that. I really hope we find a good solution for maximum interoperability for all systems involved.

@warpfork
Contributor

I see that @Ericson2314 has tersely summarized the reason to include the additional four codes in the message of the most recently rebased commit pushed here.

I see no reason to oppose the use of these code numbers.

Does anyone want to object to merging these?

Contributor

@warpfork warpfork left a comment

For me, I do not see a reason to oppose merging all of these codes. While we have had some interesting discussion of the subtleties here, ultimately, this range does not seem overly precious to me, and if there is code and community that will use them, then I support enabling that.

@Ericson2314
Contributor Author

I suppose now that there is the draft vs permanent distinction, there is the option of initially adding all 5 but down the road making snapshots permanent before the other 4.

Member

@vmx vmx left a comment

I'm still objecting to it. As mentioned in this discussion and also at #204 (comment), I think only one codec should be added (the one that doesn't have a corresponding one in git), and the other ones should use the git-raw codec for maximum interoperability. I still think the problem SoftWare Heritage has is a real one, but it shouldn't be solved by a new IPLD codec.

@vmx
Member

vmx commented Aug 22, 2021

Sorry for the last comment. I wasn't fully concentrating and thought we were about to merge this PR, but the e-mail I got was about #207, which I already approved and is totally fine. Sorry for the noise :)

@vmx
Member

vmx commented Aug 22, 2021

And, to avoid even more confusion: I'd prefer merging #207, but not merging this one, as mentioned above.

@Ericson2314
Contributor Author

Just doing #207 unblocks us so that's fine for now.

@rvagg
Member

rvagg commented Aug 25, 2021

I wrote down some additional thoughts that I've been having regarding this (and related) topics, but decided to post them in a gist because they're not short and it's maybe best not to clog this thread up: https://gist.github.com/rvagg/1b34ca32e572896ad0e56707c9cfe289

I think in there I might be suggesting one or two ways ahead with this PR:

  1. New codecs for SWH data that overlap heavily with git* codecs but see additional data within the bytes they're decoding
  2. Compromise, because we keep hitting problems like this where CIDs can't extend far enough into territory that users want them to go, and we just don't have solutions for them yet. We either have to be rigid and force everything through the narrow definitions we create, be flexible and recognise that we don't yet have the tools to answer everything properly, or actually do some work to extend our specs to open up additional use cases; perhaps that even means a CIDv2.

@aschmahmann

aschmahmann commented Aug 25, 2021

> While we have had some interesting discussion of the subtleties here, ultimately, this range does not seem overly precious to me, and if there is code and community that will use them, then I support enabling that.

There seems to be some misunderstanding that this is strictly about which code ranges are valuable and which are not. It isn't: if it were, people would just suggest a higher range for these values.


I too think that this is not a good idea. As @rvagg mentioned earlier, we get these requests to basically embed content routing hints in the CIDs every so often, and generally they're not a good idea (e.g. #200 (comment)).

I also added some comments to the more general "what is an IPLD codec" thread #204 (comment). The TLDR there is that these location-addressing/hinting codec proposals generally plan to integrate with go-ipfs by leveraging quirks of how Bitswap works today, quirks that may not continue to be there once we fix some outstanding issues. Saying yes here potentially sets up the proposer for failure by encouraging them to build a system for which there is no compatibility guarantee, because it abuses the codec and CIDv1 in general.

My major objection here basically comes from wanting to set up SWHIDs' interoperability with go-ipfs, and the rest of the IPLD ecosystem, for success, and to require a relatively low software maintenance burden from the contributors, without adding new requirements to base elements of the stack.

It sounds like for the time being #207 should be enough to move things along. If we run into more problems, we can revisit whether it's worth doing, especially given the unsupported nature of the proposal.

See
https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html
for the technical details of that spec.

These remaining four sorts of objects coincide with `git-raw`, which is
already in this table. However, adding them here is not redundant,
because of the deserialization direction. While a git-raw CID may legally
point to any sort of git object, the relevant SWHIDs specify in the
identifier what sort of git object is expected, and if it points to a
different type of valid git object, deserialization fails. That means
there is no way to losslessly convert a git-coinciding SWHID into a
git-raw CID.
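
To illustrate the deserialization argument in the commit message above, here is a minimal Python sketch. It is not taken from any IPLD or Software Heritage implementation: the function names `git_serialize`, `decode_git_raw`, and `decode_swhid_typed` are hypothetical, and the mapping from SWHID type tags (`cnt`, `dir`, `rev`, `rel`) to git object types is an assumption based on the "coincide with `git-raw`" claim. The sketch only models git's `<type> <size>\0<payload>` object framing.

```python
from typing import Tuple

# Object types a git-raw decoder may legally encounter.
GIT_TYPES = {"blob", "tree", "commit", "tag"}

# Assumed correspondence between SWHID v1 type tags and git object types.
# Snapshots ("snp") have no git counterpart and are left out on purpose.
SWHID_TO_GIT = {"cnt": "blob", "dir": "tree", "rev": "commit", "rel": "tag"}


def git_serialize(obj_type: str, payload: bytes) -> bytes:
    """Frame a payload the way git does: b"<type> <size>" + NUL + payload."""
    return obj_type.encode("ascii") + b" " + str(len(payload)).encode("ascii") + b"\0" + payload


def decode_git_raw(raw: bytes) -> Tuple[str, bytes]:
    """Accept any valid git object and report which type it turned out to be."""
    header, sep, payload = raw.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL terminator after git object header")
    obj_type, _, size = header.decode("ascii").partition(" ")
    if obj_type not in GIT_TYPES:
        raise ValueError("unknown git object type: %r" % obj_type)
    if not size.isdigit() or int(size) != len(payload):
        raise ValueError("declared size does not match payload length")
    return obj_type, payload


def decode_swhid_typed(swh_tag: str, raw: bytes) -> bytes:
    """Accept only the git object type that the identifier announced up front."""
    expected = SWHID_TO_GIT[swh_tag]
    obj_type, payload = decode_git_raw(raw)
    if obj_type != expected:
        # Valid git object, but the wrong kind for this identifier: fail.
        raise ValueError("expected git %s, found git %s" % (expected, obj_type))
    return payload


if __name__ == "__main__":
    blob = git_serialize("blob", b"hello\n")
    print(decode_git_raw(blob)[0])          # git-raw accepts it and reports the type
    print(decode_swhid_typed("cnt", blob))  # the typed decoder accepts a blob for "cnt"
    try:
        decode_swhid_typed("rev", blob)     # same bytes, but a commit was expected
    except ValueError as err:
        print("rejected:", err)
```

In this sketch the type expectation lives in the identifier rather than in the bytes, which is why converting a typed identifier to a plain git-raw one would drop information.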

7 participants