Semantic inference: better imports #75

brokad · 2021-07-28T06:46:47Z

TL;DR: We're thinking about adding model-driven inference to the synth import stage. This will make importing from complex schemas much more automated (e.g. by recognizing countries, zip codes, phone numbers, etc., and auto-generating the correct JSON profile).

Synth currently only has an elementary ability to generate namespaces from data sources. At this point, we rely on simple heuristics such as

If a field is not always present, tag it as nullable,

or

The min-max range for an integer value is the lowest-highest range observed in a small sample of the data.
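To make these two heuristics concrete, here is a minimal Python sketch of what they amount to. It is not Synth's actual import code: the function name, the input shape (a list of dicts) and the output keys are all made up for illustration.

```python
from typing import Any


def infer_field_profiles(rows: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Infer a rough per-field profile from a small sample of records.

    Hypothetical helper: the name and the output keys are illustrative only.
    """
    field_names = {name for row in rows for name in row}
    profiles: dict[str, dict[str, Any]] = {}
    for name in sorted(field_names):
        # Collect the values that are actually present (and not null).
        values = [row[name] for row in rows if row.get(name) is not None]
        # Heuristic 1: a field missing from any sampled record is nullable.
        profile: dict[str, Any] = {"nullable": len(values) < len(rows)}
        # Heuristic 2: an integer field's range is the min/max seen in the sample.
        # (bool is a subclass of int in Python, so exclude it explicitly.)
        if values and all(isinstance(v, int) and not isinstance(v, bool) for v in values):
            profile["type"] = "integer"
            profile["min"] = min(values)
            profile["max"] = max(values)
        profiles[name] = profile
    return profiles


sample = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 19, "nickname": "bo"},
    {"id": 3},
]
profiles = infer_field_profiles(sample)
```

Note that nothing here looks at what the values mean: nickname ends up with no type information beyond being nullable.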

However, Synth is not able to look into the content of the imported data to make a decision on its semantics. This makes importing certain types very hit-or-miss. For example, a column named state of type VARCHAR in a PostgreSQL database, with entries like Georgia, California, etc., will be imported as the default Synth string type. This default type happens to be hard-coded as a regex, but the user probably wants something like state_name.

We're exploring ways that Synth could look at the content of the imported data in order to auto-generate more relevant Synth schemas. For now, we think string data sources (i.e. VARCHARs and the like) are a good first candidate to zero in on. In the medium term, this will make the importing experience much better, ideally sparing users from hand-configuring most of the VARCHAR columns they import.
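As a rough idea of the simplest form this could take for string columns, the sketch below scores a sample of values against a few hand-written detectors and picks a label only if enough values match. Everything in it is an assumption for illustration: the detector table, the zip_code label, the 0.9 threshold and the deliberately tiny list of states are not part of Synth. Only state_name is taken from the example above.

```python
import re
from typing import Callable, Iterable, Optional

# Deliberately incomplete lookup set, for illustration only.
KNOWN_STATES = {"alabama", "california", "georgia", "new york", "texas"}

# Hypothetical detectors: semantic label -> predicate on a single string value.
DETECTORS: dict[str, Callable[[str], bool]] = {
    "state_name": lambda s: s.strip().lower() in KNOWN_STATES,
    "zip_code": lambda s: re.fullmatch(r"\d{5}(-\d{4})?", s.strip()) is not None,
}


def infer_semantic_type(values: Iterable[str], threshold: float = 0.9) -> Optional[str]:
    """Return the best-matching semantic label, or None to keep the default string type."""
    sample = [v for v in values if v]
    if not sample:
        return None
    best_label, best_score = None, 0.0
    for label, matches in DETECTORS.items():
        # Fraction of sampled values this detector recognizes.
        score = sum(1 for v in sample if matches(v)) / len(sample)
        if score > best_score:
            best_label, best_score = label, score
    # Only commit to a label when nearly all sampled values agree with it.
    return best_label if best_score >= threshold else None


state_guess = infer_semantic_type(["Georgia", "California", "Texas"])
zip_guess = infer_semantic_type(["30301", "94105-1234"])
fallback = infer_semantic_type(["hello", "Georgia", "42"])
```

The threshold is there so that a column with only a few coincidental matches falls back to the default string type rather than being mislabeled.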

By the way, this kind of feature is not new. For instance, Tableau can look at column names and match them to custom semantic types. More recent research [1, 2] explores sophisticated models that feed sets of custom-engineered features into an NN ensemble. We think they are all good approaches, and we especially view [2] as an appropriate loose guideline for our own design. But we believe the best implementation of semantic inference in Synth need not be that complex to be very effective, especially at the beginning.

We'd love to know if there is interest out there for such a semantic inference feature to make life easier when running synth import! And if you guys have any comments or ideas, please share them here!
