explore 'AST fingerprinting' for module/function identification (eg. to assist smart / stable renames, etc) #74
Comments
Have been spending some more time in binary reverse engineering land lately, and (re-)stumbled across this tool (Diaphora). While its focus is on binary reverse engineering, some of the features it mentions sounded like they would be interesting/useful to look deeper into for this 'AST Fingerprinting' sort of idea, eg.
There might be some ideas/patterns/algorithms/similar that we could use from there for implementing AST fingerprinting on JS code.
|
The Stack Graph / Scope Graph links/references I shared in #34 (comment) may be relevant to this issue as well. |
Some more 'prior art' from the binary reverse engineering world:
While the exact specifics of that method won't be relevant here (since we're operating on JS, and not raw bytes), some of the more general concepts might be. Interestingly, that ends up being a more refined version of some binary offset finding code I wrote for another project:
|
I've been thinking about this topic and found this repo while searching to see whether someone had done it before and/or for a debundler to build it on top of. The basic idea I was imagining was:
After this we can look up the hashes in a database of known hashes for library functions. If we have information on which libraries may have been used, e.g. through a license file, creating this database should be fairly simple.

This approach doesn't do any kind of fuzzy matching, but it should be fairly reliable as long as the normalization works well enough and the output doesn't vary in too many hard-to-normalize ways depending on e.g. bundler config. The approach is similar to how the content-addressed programming language Unison does its hashing.

If we want more fine-grained fingerprinting, we could instead use some kind of De Bruijn index for the local variables, so local snippets would get the same variable names regardless of their context. This wouldn't produce valid JS code, but that doesn't matter since the result is only used for hashing, not for output. But I think focusing on just entire modules would be a good start and provide a lot of value. |
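A minimal sketch of the hash-and-lookup idea, using the De Bruijn-style variant mentioned above. Everything here is an assumption for illustration: the AST is a toy nested-dict shape (not a real JS parser's output), `normalize`, `fingerprint`, `identify` and the `KNOWN_HASHES` entry are hypothetical names, and SHA-256 over canonical JSON is just one convenient choice of hash.

```python
import hashlib
import json


def normalize(node, env=()):
    """Copy a toy AST, replacing bound identifiers with De Bruijn-style indices."""
    if isinstance(node, list):
        return [normalize(child, env) for child in node]
    if not isinstance(node, dict):
        return node  # literal values, operators, property names
    if node["type"] == "Function":
        # Parameter names are dropped; only the arity is kept.
        inner = env + tuple(node["params"])
        return {"type": "Function", "arity": len(node["params"]),
                "body": normalize(node["body"], inner)}
    if node["type"] == "Identifier":
        # Index = distance to the binder, counting from the innermost one.
        for distance, bound in enumerate(reversed(env)):
            if bound == node["name"]:
                return {"type": "Bound", "index": distance}
        return {"type": "Free", "name": node["name"]}  # e.g. globals
    return {key: normalize(value, env) for key, value in node.items()}


def fingerprint(node):
    """Hash the normalized AST via a canonical JSON serialization."""
    canonical = json.dumps(normalize(node), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ident(name):
    return {"type": "Identifier", "name": name}


def sample_function(a, b):
    """Toy AST for: function (a, b) { return Math.max(a, b) + 1; }"""
    callee = {"type": "Member", "object": ident("Math"), "property": "max"}
    call = {"type": "Call", "callee": callee, "args": [ident(a), ident(b)]}
    total = {"type": "Binary", "op": "+", "left": call,
             "right": {"type": "Literal", "value": 1}}
    return {"type": "Function", "params": [a, b],
            "body": [{"type": "Return", "argument": total}]}


# Hypothetical database: fingerprint of the unminified source -> library name.
KNOWN_HASHES = {
    fingerprint(sample_function("value", "limit")): "hypothetical-lib/maxPlusOne",
}


def identify(node):
    """Exact-match lookup; returns None when the fingerprint is unknown."""
    return KNOWN_HASHES.get(fingerprint(node))


print(identify(sample_function("e", "t")))  # hypothetical-lib/maxPlusOne
```

Bound names disappear entirely in this sketch, so `value`/`limit` and `e`/`t` hash the same, while free names such as `Math` and property names such as `max` are kept. As noted above, there is no fuzzy matching: any structural change (for example a different literal) produces an unrelated hash.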
@anka-213 Curious (if you're open to/able to share), what your use case for this sort of thing would be?
@anka-213 This basically aligns with one of the ways I was thinking it would work at a high level as well. I think the crux would be figuring out the normalisation (including stabilising, or excluding, variable/function identifiers that churn) in a way that is resilient to all the 'optimisations' a bundler/minifier might choose to make. That may mean it would need to run on 'partially unminified' code. In the ideal case, though, it should work with as little 'pre-processing' of the minified code as possible, since this module identification would itself be used as part of the unminification process (for certain aspects).
@anka-213 Just had a read through that blog, and it sounds like a really interesting approach!
@anka-213 I only quickly skimmed the wiki pages for De Bruijn index / De Bruijn notation, so I might not be grasping it fully, but from what I saw, it seems like you could probably model it in a way that fits the semantics and still produces valid JS variable names/code. Another method (that I can't remember if I've ever written out in full here) is somewhat based on the more manual approach I was taking at one point:
Specifically, identifying the types of things that are usually not minified/mangled by a bundler/minifier (

One benefit of this approach is that those same 'key identifiers' can be used with GitHub Code search or similar tools to help narrow down and identify an otherwise unknown module/library. This could probably even be partially automated using the GitHub API, and then provide an easy way for users to contribute the relevant details/hash/etc for an identified module back to the 'core database' (in a similar way to how

Here is some further 'prior art' from a tool that seems to use this sort of method to target the functions it wants to interact with:
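Returning to the 'key identifiers' idea (the following is not taken from the tool mentioned above), a rough sketch of how it might look. It assumes that string literals and property names are the things that survive minification, which depends on the bundler/minifier config. The toy nested-dict AST shape, the function names, and the entries in `known` are all hypothetical, and Jaccard overlap is just one simple way to score a match.

```python
def key_identifiers(node, found=None):
    """Collect string literals and property names from a toy AST (nested dicts/lists)."""
    if found is None:
        found = set()
    if isinstance(node, list):
        for child in node:
            key_identifiers(child, found)
    elif isinstance(node, dict):
        if node.get("type") == "Literal" and isinstance(node.get("value"), str):
            found.add(("string", node["value"]))
        if node.get("type") == "Member" and isinstance(node.get("property"), str):
            found.add(("property", node["property"]))
        for value in node.values():
            key_identifiers(value, found)
    return found


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(node, known, threshold=0.5):
    """Return (library name or None, score) for the closest known signature."""
    observed = key_identifiers(node)
    scored = [(jaccard(observed, keys), name) for name, keys in known.items()]
    score, name = max(scored, default=(0.0, None))
    return (name, score) if score >= threshold else (None, score)


# Toy AST for minified code along the lines of:
#   e.addEventListener("visibilitychange", t);
#   console.warn("hypothetical-lib: deprecated");
minified = [
    {"type": "Call",
     "callee": {"type": "Member",
                "object": {"type": "Identifier", "name": "e"},
                "property": "addEventListener"},
     "args": [{"type": "Literal", "value": "visibilitychange"},
              {"type": "Identifier", "name": "t"}]},
    {"type": "Call",
     "callee": {"type": "Member",
                "object": {"type": "Identifier", "name": "console"},
                "property": "warn"},
     "args": [{"type": "Literal", "value": "hypothetical-lib: deprecated"}]},
]

# Hypothetical 'core database' of key identifiers per known module.
known = {
    "hypothetical-lib": {
        ("string", "visibilitychange"),
        ("string", "hypothetical-lib: deprecated"),
        ("property", "addEventListener"),
        ("property", "warn"),
    },
    "other-lib": {("string", "unrelated"), ("property", "render")},
}

print(best_match(minified, known))
```

Mangled names like `e` and `t` are ignored on purpose, so the signature stays stable across builds. The same collected strings could also be used as search terms when trying to track down an unknown module by hand.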
This is potentially more of a generalised/'naive' approach to the problem, but it would also be interesting to see if/how well an embedding model tuned for code would do at solving this sort of problem space:
Also, here's the latest version of my open tabs 'reading list' in this space of things, in case any of it is relevant/interesting/useful here: Unsorted/Unreviewed Link Dump RE: 'AST fingerprinting' / Code Similarity (v2)
You can also find the first link dump of content in the collapsible in the first post on this issue. Edit: Started a new gist to keep my notes/references all together in one place in a better way + added the above linkdump + the previous one to it: |
A little more (speculative) 'prior art' from the binary reversing world:
Edit: Tracked in my gist in this revision (Ref) |
Further 'prior art', an example of an 'obfuscation detector' based on AST structure:
|
There has recently been a new source of discussion around code fingerprinting and module identification over on the |
See Also
un-mangle identifiers #34
un-mangle identifiers #34 (comment)
un-mangle identifiers #34 (comment)
un-mangle identifiers #34 (comment)