2024‐11‐05 The 5+1 Levels of IDL translation approaches

The translation challenge

Following up on the previous article about the purpose of the IFEX project, I now want to highlight how an easily understandable "translation" description conveys the primary purpose of the IFEX project (let's call that the Core Challenge), and also presents the results of investigations into semantic mapping in a readable but still formal way. Results in that form can more easily be contributed to and built upon by others.

Therefore, if a translation mapping can be described in a minimal form, we can more easily communicate to anyone what this challenge is about. Translation of interfaces / IDLs is sometimes trivial and sometimes very complex. A mapping table makes the complexity of each individual mapping easier to see. The table also answers the question "what does this IFEX thing actually do?" more directly. Once the complexity of the challenge is understood, the validity of the IFEX project as a place to get together and work on this challenging topic is established.

Throughout the IFEX project my vision has always been to reach a point of describing the required transformation steps from input to output in a simple and declarative way, and that this description is not only a specification but also machine-readable and able to be interpreted/executed by the program.

In the sections below I describe the incremental steps towards this vision.

Learning by doing

If the vision is clear, it might seem wasteful to go through each stage. Why not just jump directly to the end result? I think an iterative process is very useful for learning more about the problem. Trying things out gives you a clear idea of the limitations of each approach. In fact, in some cases you may find that the result of an earlier approach is fine, and the implementation of that particular case can then keep using the more primitive approach. If it's "good enough" then just use it.

However, it has always been clear to me that reaching a point where we can describe a machine-processable "translation table" from input to output would have great advantages for complex cases. Not only could it increase the pace of implementing more translations, it would also enable feedback from people who understand the meaning of the IDLs but don't want to dig through detailed implementation code. The complexity (or simplicity, depending on the case) of the project's Core Challenge becomes clearly visible in the translation definition.

Furthermore, any translation/mapping requires some document to explain the semantic meaning and the intended translation. Such an attempt usually starts by listing all the features of the input and output languages, and then considering step by step whether the information content in each is equal, or where the gaps may be. Where gaps or different approaches are found, we can start to see what the translation might be, or whether some things must be deemed not feasible and therefore "unsupported".

That type of analysis document is useful, and usually required for human understanding of the problem. But if we can use the same (or very similar) declarative descriptions that both a human and the program can understand (often known as an "executable specification"), then that is even better!

5+1 levels of maturity for implementing programming language translation tools

The IFEX project has basically gone through all of these approaches, even if the vision towards the later stages was always there.

(Level 0: Ad-hoc text processing)

Text-to-text, with ad-hoc input processing, print-statements, and data structures as needed
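
As a rough illustration of what Level 0 tends to look like, here is a minimal sketch in Python. The one-line "method" input syntax and the C-like output are hypothetical and invented for this example; they are not real IFEX or any real IDL syntax.

```python
import re

# Hypothetical toy input, invented for illustration only.
SOURCE = """\
method get_speed() -> int
method set_name(name: string) -> void
"""

def translate_adhoc(text: str) -> str:
    """Level 0: match each line with a regex and build output strings directly."""
    out = []
    for line in text.splitlines():
        m = re.match(r"method (\w+)\((.*)\) -> (\w+)", line)
        if m is None:
            continue  # anything unexpected is silently dropped
        name, args, ret = m.groups()
        # The argument text is copied through untranslated.
        out.append(f"{ret} {name}({args});")
    return "\n".join(out)

result = translate_adhoc(SOURCE)
print(result)
```

There is no model of the input at all here: parsing, translation decisions and output formatting are mixed together in one loop, which is quick to write but hard to extend.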

Level 1: Input parser and templated text output

Proper input parser -> Abstract Syntax Tree (AST) internal "model" is created from the input
Read AST model and output result using text-templating language (e.g. Jinja). A mix of control logic in code and embedded into templates using Jinja template directives.
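
A minimal sketch of the Level 1 shape could look as follows. To keep it runnable with only the standard library, it uses Python's string.Template as a stand-in for a real templating language such as Jinja. The Method class, the input syntax and the template text are assumptions made for this example.

```python
from dataclasses import dataclass
from string import Template

@dataclass
class Method:
    """One node of a (hypothetical) input AST."""
    name: str
    return_type: str

def parse(text: str) -> list[Method]:
    """Parse lines of the invented form 'method <name> -> <type>' into AST nodes."""
    methods = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "method" and parts[2] == "->":
            methods.append(Method(name=parts[1], return_type=parts[3]))
    return methods

# The output format lives in a template instead of in print statements.
METHOD_TEMPLATE = Template("$return_type $name();")

def render(methods: list[Method]) -> str:
    """Walk the AST and fill in the template once per method."""
    return "\n".join(
        METHOD_TEMPLATE.substitute(name=m.name, return_type=m.return_type)
        for m in methods
    )

rendered = render(parse("method get_speed -> int\nmethod reset -> void"))
print(rendered)
```

With Jinja, the loop in render() would typically move into the template itself, which is where the mix of control logic in code and in templates comes from.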

Level 2: Model-to-model transformation, read/print support for models, imperative code

Parse the input as in Level 1 into one AST for the input, but define another AST for the output.
Perform model-to-model transformation with explicit, imperative code.
"Print out" the new model

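The three Level 2 steps can be sketched roughly like this. The two model classes, the type mapping and the printed output format are all hypothetical and chosen only to show the structure: one model per side, hand-written transformation code in between, and a separate printer.

```python
from dataclasses import dataclass

@dataclass
class SrcMethod:
    """Node in the (hypothetical) input model."""
    name: str
    output: str

@dataclass
class DstFunction:
    """Node in the (hypothetical) output model."""
    identifier: str
    returns: str

# Assumed type correspondence, for illustration only.
TYPE_MAP = {"int32": "int", "string": "std::string"}

def transform(methods: list[SrcMethod]) -> list[DstFunction]:
    """Explicit, imperative model-to-model transformation."""
    functions = []
    for m in methods:
        if m.output not in TYPE_MAP:
            raise ValueError(f"unsupported type: {m.output}")
        functions.append(DstFunction(identifier=m.name, returns=TYPE_MAP[m.output]))
    return functions

def print_model(functions: list[DstFunction]) -> str:
    """'Print out' the output model as text."""
    return "\n".join(f"{f.returns} {f.identifier}();" for f in functions)

source_model = [SrcMethod("get_speed", "int32"), SrcMethod("get_name", "string")]
printed = print_model(transform(source_model))
print(printed)
```

The translation knowledge is now in one place (transform), but it is still buried in code that a reader must step through to understand the mapping.
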
Level 3: Declarative description of model translation rules

Replace explicit imperative code with declarative transformation rules
A "table" of translation-rules configures a general model-to-model transformation engine
The table is expressed using native data types in the programming language, which is not optimal but reasonably readable.
The generic translation code reads the declarative rules for this exact case, and performs them.
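
To make the idea of a rule table in native data types concrete, here is a minimal sketch. It is not the actual IFEX rule format; the tuple layout, the class names and the type mapping are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class SrcMethod:
    name: str
    output: str

@dataclass
class DstFunction:
    identifier: str
    returns: str

TYPE_MAP = {"int32": "int", "string": "std::string"}  # assumed mapping

# The "table": (source class, target class, field rules), where each field
# rule is (source field, target field, optional value converter).
RULES = [
    (SrcMethod, DstFunction, [
        ("name", "identifier", None),
        ("output", "returns", lambda t: TYPE_MAP[t]),
    ]),
]

def translate(node, rules):
    """Generic engine: knows nothing about the two models, only about rules."""
    for src_cls, dst_cls, field_rules in rules:
        if isinstance(node, src_cls):
            kwargs = {}
            for src_field, dst_field, convert in field_rules:
                value = getattr(node, src_field)
                kwargs[dst_field] = convert(value) if convert else value
            return dst_cls(**kwargs)
    raise TypeError(f"no rule for {type(node).__name__}")

translated = [translate(m, RULES) for m in [SrcMethod("get_speed", "int32")]]
print(translated)
```

Supporting another node type in this sketch means adding a row to RULES, not writing more engine code, and the table itself can be read as a description of the mapping.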

Level 4: Domain-Specific syntax for IDL translation

Define an optimized Domain-Specific Language (DSL) for the translation rules
Replace native data types with a DSL expression of the rules => More expressive and more readable.
Updated "Translation Engine" reads the input format and the output format, and now also parses the DSL to understand the translation rules (likely resulting in another internal AST representation). The generic code then executes the translation according to the defined rules.
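
What such a DSL could look like is an open design question. The sketch below invents a tiny two-symbol syntax ("=>" for node rules, "->" for field rules) purely for illustration, parses it into an internal rule structure, and lets generic code execute it on dictionary-based nodes. None of this syntax comes from IFEX.

```python
# Hypothetical rule DSL, invented for this example.
RULES_DSL = """\
Method => Function
    name   -> identifier
    output -> returns
"""

def parse_rules(text: str) -> dict:
    """Parse the DSL into {source type: (target type, [(src field, dst field)])}."""
    rules = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "=>" in line:
            src, dst = (part.strip() for part in line.split("=>"))
            current = (dst, [])
            rules[src] = current
        elif "->" in line:
            if current is None:
                raise ValueError("field rule before any node rule")
            src_field, dst_field = (part.strip() for part in line.split("->"))
            current[1].append((src_field, dst_field))
        else:
            raise ValueError(f"cannot parse rule line: {line!r}")
    return rules

def translate(node: dict, rules: dict) -> dict:
    """Generic engine: look up the rule for the node type and copy fields."""
    dst_type, field_rules = rules[node["type"]]
    result = {"type": dst_type}
    for src_field, dst_field in field_rules:
        result[dst_field] = node[src_field]
    return result

rules = parse_rules(RULES_DSL)
translated = translate({"type": "Method", "name": "get_speed", "output": "int32"}, rules)
print(translated)
```

Compared with a table written in native data types, the rule text can be read and reviewed by people who know the IDLs but not the implementation language.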

Level 1000 = AI magic?

We are all of course aware of the remarkable ability of Large Language Models to understand the structure (and seemingly, the meaning) of programming languages. So, to a certain level of detail, it is even today possible to ask any of the world's top LLMs to simply translate an input from format A to format B.

It should also be said that knowing how to translate in each case is an ongoing effort. Many decisions about how we want the output to look concern "corner cases", or are tied to the particular way a technology is used or interpreted in a given company. There is not always one and only one semantic mapping between IDL standards, and the end result is decided by the users of IFEX-related tools. Tools need to be crafted to take into account the peculiarities of a company's preferred way of handling things, which a generalized LLM trained on public data might not understand (without significant prompting?).

Real-world usage needs to be able to tweak the implementation of how we want things translated in a particular situation. At least currently, humans are likely needed to decide on those matters and encode them into tool logic or parameters (possibly using Layers).

A number of years down the line, LLM-based tools might be so deeply integrated into working methods, computationally efficient enough for large-scale usage, and reliable enough that we can trust them to do translations as part of our software build infrastructure. Until then, the algorithmic approach of IFEX tools seems to be useful.

We could also envision training a bespoke LLM on the specifics of the IFEX environment. The result of that could potentially meet all the demands of reliability and reasonable resource use. If someone is interested in doing some research on this - let me know!

November 2024

Written by Gunnar Andersson
