Semantic detection PoC #103

Open
wants to merge 1 commit into
base: master
from

Contributor

@brokad brokad commented Aug 14, 2021

Semantic detection PoC

This defines a framework for more advanced, statistics-based ways of importing data into synth. It paves the way for more automation in writing synth schemas tailored to a specific data source.

Underpinning this is the semdet crate, which aims to give synth fast, zero-copy, in-memory, trainable analytics over the table instances a user provides as an import data source. It is built on arrow, ndarray and tch.

The PoC is an end-to-end implementation of a dummy model that detects the most likely fake generator using a simple dictionary lookup. The example is simple enough to be done quickly, yet involves enough moving parts to show that more complex data-driven inference mechanisms are feasible.
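
To make the dictionary lookup idea concrete, here is a minimal sketch of it as a majority vote over the values of one string column. It is illustrative only and is not the semdet implementation: the function name `detect_generator`, the generator labels, the per-column framing, and the dictionary contents are all assumptions, and it uses only the standard library rather than arrow, ndarray or tch.

```rust
use std::collections::HashMap;

/// Returns the generator label matched by the most values in `column`,
/// or `None` if no value appears in the dictionary.
/// Hypothetical helper for illustration; not part of the semdet API.
pub fn detect_generator<'a>(
    column: &[&str],
    dictionary: &HashMap<&str, &'a str>,
) -> Option<&'a str> {
    let mut votes: HashMap<&'a str, usize> = HashMap::new();
    for value in column {
        // Normalise so that "Alice" and " alice " hit the same entry.
        let key = value.trim().to_lowercase();
        if let Some(label) = dictionary.get(key.as_str()) {
            *votes.entry(*label).or_insert(0) += 1;
        }
    }
    // Highest vote count wins; ties go to the alphabetically smallest
    // label so the result does not depend on hash iteration order.
    votes
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(label, _)| label)
}
```

With a dictionary mapping known first names to a hypothetical `"name"` label, a column in which most values are known first names would resolve to `Some("name")`, and a column with no dictionary hits would resolve to `None`.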

How to test it

Running cargo test --features torch in semdet/ runs the dummy E2E scenario, which should pass.

Roadmap to readiness

  • Composable API for the embedding of input data as valid module inputs
  • Composable API for handling prediction targets in our domain-specific application
  • Load a 'pre-trained' dummy module embedded at compile-time
  • Document the Encoder/Decoder/Module APIs
  • Attach to the CLI's import logic
    • Project down string columns from sqlx query results
  • Windows build needs fixing
  • Make tch optional so the built binary does not have to carry a dynamic dependency on libtorch
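
For the last roadmap item, one common approach is Cargo feature gating. The sketch below assumes that the torch feature used in the test command maps to an optional tch dependency in Cargo.toml. The function name `backend_name` and the "dictionary" fallback are hypothetical, and the torch branch is a placeholder that does not call into tch.

```rust
/// Name of the inference backend compiled into this build.
/// Hypothetical helper for illustration only.
#[cfg(feature = "torch")]
pub fn backend_name() -> &'static str {
    // Code in this branch could `use tch::...`, because it is only
    // compiled when the `torch` feature (and so libtorch) is enabled.
    "torch"
}

/// Fallback compiled when the `torch` feature is off, so the default
/// binary never links against libtorch.
#[cfg(not(feature = "torch"))]
pub fn backend_name() -> &'static str {
    "dictionary"
}
```

Callers use `backend_name()` the same way in both configurations; only one of the two definitions exists in any given build.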

@brokad brokad marked this pull request as ready for review August 23, 2021 08:13
@brokad brokad force-pushed the feat/semantic-detection branch 2 times, most recently from 0e9e05e to 004c467 Compare August 25, 2021 08:52