TL;DR: We're thinking about adding model-driven inference to the synth import stage. This would make importing from complex schemas much more automated, e.g. by recognizing countries, zip codes, phone numbers, etc. and auto-generating the correct JSON profile.
Synth currently has only an elementary ability to generate namespaces from data sources. At this point, we rely on simple heuristics such as:
If a field is not always present, tag it as nullable,
or
The min-max range for an integer value is the range between the lowest and highest values observed in a small sample of the data.
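To make the two heuristics above concrete, here is a minimal Python sketch. It is not Synth's actual implementation: the function name `infer_field`, the shape of the returned profile, and the choice to treat an explicit null the same as a missing field are all assumptions made for illustration.

```python
from typing import Any


def infer_field(records: list[dict[str, Any]], field: str) -> dict[str, Any]:
    """Infer a rough profile for one field from a small sample of records.

    Hypothetical helper: the returned dict is an illustrative format,
    not Synth's real schema format.
    """
    # Keep only the values that are actually present (assumption: None counts as absent).
    values = [r[field] for r in records if r.get(field) is not None]

    profile: dict[str, Any] = {
        # Heuristic 1: a field that is absent in at least one record is nullable.
        "nullable": len(values) < len(records),
    }

    # Heuristic 2: for integers, the range is the lowest and highest observed values.
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if values and all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        profile["type"] = "integer"
        profile["min"] = min(values)
        profile["max"] = max(values)

    return profile


sample = [
    {"id": 1, "age": 34},
    {"id": 2},
    {"id": 3, "age": 19},
]
age_profile = infer_field(sample, "age")
id_profile = infer_field(sample, "id")
```

In this sample, `age` is missing from one record, so it comes out as nullable with an observed range of 19 to 34, while `id` is non-nullable. Note that neither heuristic looks at what the values mean, only at their presence and magnitude.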
However, Synth is not able to look into the content of the imported data to make a decision about its semantics. This makes importing certain types very hit or miss. For example, a column state of type VARCHAR in a PostgreSQL database, with entries like Georgia, California, etc., will be imported as the default Synth string type. This default type happens to be hard-coded as a regex, but the user probably wants something like state_name.
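As a rough illustration of what looking at the content could mean for the state example, the Python sketch below checks sampled string values against a small dictionary. Everything here is hypothetical: the function `infer_string_semantics`, the truncated `STATE_NAMES` set, the 0.9 threshold, and the fallback label `"string"` are assumptions for the sake of the example, and this is far simpler than the model-driven approaches under consideration.

```python
# Hypothetical, deliberately tiny dictionary; a real one would list every state.
STATE_NAMES = {"alabama", "california", "georgia", "new york", "texas"}


def infer_string_semantics(values, known=STATE_NAMES, label="state_name", threshold=0.9):
    """Return `label` if at least `threshold` of the non-empty values are in `known`.

    Falls back to the placeholder label "string" otherwise.
    """
    # Normalize: drop empty values, trim whitespace, compare case-insensitively.
    cleaned = [v.strip().lower() for v in values if v and v.strip()]
    if not cleaned:
        return "string"
    hits = sum(1 for v in cleaned if v in known)
    return label if hits / len(cleaned) >= threshold else "string"


state_guess = infer_string_semantics(["Georgia", "California", "Texas"])
other_guess = infer_string_semantics(["Georgia", "hello", "world"])
```

Here the first column is recognized as `state_name` because every sampled value is in the dictionary, while the second falls back to the generic label because only one of three values matches. A threshold below 1.0 leaves room for a few dirty entries in real data.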
We're exploring potential ways for Synth to look at the imported data content in order to auto-generate more relevant Synth schemas. For now, we think string data sources (i.e. VARCHARs and the like) are a good first candidate to zero in on. In the medium term, this will make the importing experience much better, ideally sparing users from having to hand-configure most of the VARCHAR types that are imported.
By the way, this kind of feature is not new. For instance, Tableau has the ability to look at column names and match them with custom semantic types. More recent is research [1, 2] into sophisticated models that feed sets of custom-engineered features into a NN ensemble. We think they are all good approaches, and we especially view [2] as an appropriate loose guideline for our own design. But we believe the best implementation of semantic inference in Synth need not be that complex to be very effective - especially at the beginning.
We'd love to know if there is interest out there for such a semantic inference feature to make life easier when running synth import! And if you guys have any comments or ideas, please share them here!