This is an evolving toolkit of capabilities helpful for reverse engineering legacy Cobol code. As of now, the following capabilities are available:
- Program / section / customisable node level flowchart generation based on AST (SVG or PNG)
- Parse Tree generation (with export to JSON)
- Control Flow Tree generation (with export to JSON)
- Allows embedding code comments as comment nodes in the graph
- The SMOJOL Interpreter (WIP)
- Injecting AST and Control Flow into Neo4J
- Injecting Cobol data layouts from Data Division into Neo4J (with dependencies like MOVE, COMPUTE, etc.) + export to JSON
- Injecting execution traces from the SMOJOL interpreter into Neo4J
- Integration with OpenAI GPT to summarise nodes using bottom-up node traversal (AST nodes or Data Structure nodes)
- Exposes a unified model (AST, CFG, Data Structures with appropriate interconnections) which can be analysed through JGraphT, together with export to GraphML format and JSON.
- Support for namespaces to allow unique addressing of (possibly same) graphs
- ALPHA: Support for building Glossary of Variables from data structures using LLMs
- ALPHA: Support for extracting Capability Graph from paragraphs of a program using LLMs
- ALPHA: Injecting inter-program dependencies into Neo4J (with export to JSON)
- ALPHA: Paragraph similarity map
Cobol-REKT is more of a library of useful things intended to be embedded in more formal reverse engineering workflows/pipelines, rather than being a standalone tool (though you can certainly use it as such). Many of the higher-level wrappers are merely sensible defaults; you are encouraged to modify them to suit your needs.
The toolkit consists of Java components, which mostly handle parsing, ingestion, and export of Cobol sources, and Python components, which carry out the analyses. In addition, the Neo4J support lets you build custom graph analyses to suit your needs.
If you are trying out Cobol-REKT in your project, I'd love to know and feature it in the README (simply open a PR, and we can go from there)! Contributions and feature requests for your use cases are also welcome.
- The toolkit uses the grammar available in the Eclipse Che4z Cobol Support project to create the parse tree.
- The toolkit uses the API from Woof to interact with Neo4J.
- Graphviz for flowchart generation; see its documentation for OS-specific installation instructions.
- JGraphT for Java-based graph analyses and some export functionality.
- NetworkX for Python-based graph analyses
Some reverse engineering use cases are listed below. Descriptions of the capabilities which support these use cases are provided in later sections.
- The Parse Tree can be fed to a Graph DB for consumption by an LLM through agents to answer questions about the codebase
- Static analysis of the parse tree to reveal important operations (database operations, variable dependencies)
- The interpreter can be used to trace flows to experiment with different conditions
- Trace variable impact analysis (Which variables are affected by which in a particular run)
- Serve as input for LLM to explain specific flows in the program
- Serve as a proxy for testing behaviour at section/paragraph level if needed
- Identify dead code?
- Try out new rules?
- Identify different flows in the report - use cases for forward engineering
- Graph Edit distance between paragraphs through NetworkX
- Similar Code Detection (WIP)
- Integrating Domain Knowledge
This capability allows the engineer to transform Cobol source (or part of it) into a flowchart. The flowchart stays true to the source but omits syntactic noise to produce a detailed flow of logic through the source. The capability uses Graphviz to generate the flowchart images. The flowchart can be generated by specifying the `DRAW_FLOWCHART` task.
- Specifying the `--generation` parameter as `SECTION` creates one flowchart per section, while specifying the value as `PROGRAM` generates one flowchart for the entire program.
- You can specify the output format as SVG or PNG by setting `--fileOutputFormat` to `SVG` or `PNG` respectively.
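For instance, assuming the same CLI shape as the full `run` example later in this README (all paths below are placeholders), a section-level flowchart run might look like this:

```sh
java -jar smojol-cli/target/smojol-cli.jar run YOUR_PROGRAM --commands="DRAW_FLOWCHART" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --dialect IDMS --reportDir /path/to/report/dir --generation=SECTION
```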
The dotted lines indicate things that are inside a node. So, in the above example, after the first beige "Processing" block, there is a node which leads to an IF statement through the dotted line. What happens inside the IF statement can be understood by "stepping into" this dotted line. The normal flow after the IF statement finishes can be continued by returning to the node where the dotted line originates.
This allows the engineer to produce the parse tree of Cobol source. This is suitable for use in further static analysis, transformation (into a control flow tree, for example), and as input to other systems (informed chunking for an LLM, loading into a graph database for further exploration, etc.). See the Reverse Engineering Use Cases section for more examples.
Most of the capabilities are already present in the Che4z library. Some new grammars have been added. They are:
- IDMS panel definitions which are used in user interfaces
- Cobol Data Layouts, which are used in defining the records in the DATA DIVISION
This capability can be used by specifying the `WRITE_RAW_AST` task.
This capability allows the engineer to produce a control flow tree for the Cobol source. This can be used for straight-up visualisation (the flowchart capability actually uses the control flow tree behind the scenes), or more dynamic analysis through an interpreter. See SMOJOL (SMol Java-powered CobOL Interpreter) for a description of how this can help.
The CFG generation is part of the `INJECT_INTO_NEO4J` task.
Both the AST and the Control Flow Graph can be injected directly into Neo4J. The injected AST can either be in the format used by the SMOJOL interpreter, or the bare parse tree generated by ANTLR. The Control Flow Graph is in the SMOJOL AST format.
When generating the AST and CFG, the library allows configuring them to be the same, i.e., the same nodes are reused for creating both AST and CFG connections. For example, in the screenshot below, the same graph has `CONTAINS` (AST relationship), `FOLLOWED_BY` (CFG relationship), and `MODIFIES`/`ACCESSES` (data structure relationship) edges.
This provides a rich, unified view of the entire program for analysis, without having to jump between multiple disconnected views of the source code.
This can be done by specifying the `INJECT_INTO_NEO4J` task.
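Once injected, the unified model can be explored with plain Cypher. A minimal sketch is shown below; the relationship types come from this README, but node labels and properties depend on how the graph was ingested, so treat this as illustrative only:

```cypher
// Illustrative only: walk AST containment and control flow from the same nodes.
MATCH (parent)-[:CONTAINS]->(child)
OPTIONAL MATCH (child)-[:FOLLOWED_BY]->(next)
RETURN parent, child, next
LIMIT 25;
```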
The OpenAI integration can be leveraged to summarise nodes in a bottom-up fashion (i.e., summarise leaf nodes first, then use those summaries to summarise the parent nodes, and so on).
The following diagram shows the AST, the Control Flow Graph, and the data structures graph. The yellow nodes are the summary nodes (generated through an LLM) attached to the AST (to provide explanations) and the data structures (to infer domains).
This capability connects records which modify other records with a `FLOWS_INTO` relation. The dependencies traced include variables which are used in expressions, as well as free-standing literals. It also generates connections based on REDEFINES clauses. Below is an example of a set of record dependencies from a program.
This capability connects comments to the nearest subsequent node with a `HAS_COMMENT` connection. This works for comments in the PROCEDURE DIVISION and all data structures. Comments before copybooks are connected to the main program node. Any comments which cannot be attached to a found node end up connected to the main program node.
This can be done by specifying the `ATTACH_COMMENTS` task. Note that for the comment nodes to appear in the graph, the `INJECT_INTO_NEO4J` task must appear after the `ATTACH_COMMENTS` task.
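As a quick sanity check after ingestion, a query along the following lines (again illustrative; node labels depend on your graph) lists nodes together with their attached comments:

```cypher
// Illustrative only: list nodes and their attached comment nodes.
MATCH (n)-[:HAS_COMMENT]->(comment)
RETURN n, comment
LIMIT 25;
```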
The example below shows all node-comment sets for a reasonably large program.
The interpreter is a natural extension to building the parse tree for a Cobol source. Since syntax nodes are temporally ordered, it is possible to build an execution tree which covers all possible flows. This is the basis for flowchart generation, and also the basis for a tree-walk interpreter. The interpreter sets up the AST which is a projection of the parse tree more suitable for execution. Parts of the interpreter convert specific nodes in the AST into more suitable forms at runtime (for example, expression evaluation).
The primary motivation for the interpreter is to be able to simulate the execution of programs (or fragments of programs) in a sandboxed environment where the engineer needn't worry about fulfilling dependencies required to run the code in a true mainframe environment. Rather, they can inject these dependencies (values of variables, for example) as they see fit, to perform their true task: namely, performing control flow analysis.
The interpreter can run in two modes:
- No-Operation mode: In this mode, none of the processing statements like MOVE, ADD, etc. are actually executed, but control flow is still respected. This mode is useful in many contexts where the actual change in variables isn't as important as knowing / logging the action that is taking place. This is a good default starting point for ingesting runtime execution paths into a graph. Decisions which affect control flow are evaluated based on the evaluation strategy specified; since variable state isn't actually updated in this mode, the full expression evaluation strategy will not be effective. More specific strategies can be written, or interactive resolution through the console can be used.
- Full evaluation mode (Experimental): In this mode, expressions are actually evaluated to their final results; this is the closest to actual execution of the program, including storing variable state. Note that this is a work in progress, since not every nook and cranny of the Cobol standard is supported yet.
- Support for most control constructs: IF/THEN, NEXT SENTENCE, GO TO, PERFORM, SEARCH...WHEN, IDMS ON
- Support for expression evaluation in COMPUTE, MOVE, ADD, SUBTRACT, MULTIPLY, DIVIDE
- Support for interactive resolution of conditions
- Most common class comparisons supported
- Support for abbreviated relation condition forms (IF A > 10 OR 20 AND 30...)
- Functioning type system (supports zoned decimals and alphanumerics) with a large subset of z/OS behaviour compatibility for scenarios undefined in the Cobol standard
- Support for fixed-size tables and single subscripting
- Support for elementary, composite, and recursive REDEFINES (REDEFINES of REDEFINES)
- Multiple subscript access
- Automatic detection of types from DATA DIVISION specifications
- Supports evaluation of level 88 variables
- Support for tracking variable state
- Set breakpoints based on conditions or specific AST node
- View current stack at a breakpoint
- View variable values at a breakpoint
- Support for different strategies to deal with unresolved record references (ignore / throw exception)
- Support for listeners to extract specific information about the current state of the program (all the Neo4J integrations are via these listeners)
- Handle GIVING phrases in ADD/SUBTRACT/MULTIPLY/DIVIDE
- PERFORM VARYING
- PERFORM INLINE...VARYING
- Initialise values of variables from DATA DIVISION
- Support for floating point and alphabetic
- Support for REDEFINES larger than original record
- Variable snapshot per stack frame
- Evaluate IDMS expressions
- ON clauses on common operations
- ...
The interpreter also supports injecting a complete execution path through the program into Neo4J. The screenshot below shows the execution trace of a reasonably complex program.
You can export to GraphML from Neo4J pretty easily. If you don't have Neo4J set up, though, the toolkit allows exporting the following entities to the GraphML format (with all interconnections) as a single supergraph, so that they can be consumed further downstream for analysis:
- Abstract Syntax Tree (the `WRITE_FLOW_AST` task)
- Control Flow Graph (the `WRITE_CFG` task)
- Data Structures + Dependencies (included under both of the above tasks)
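For instance, assuming the same CLI shape as the `run` command described later (all paths below are placeholders), something like this should produce the GraphML export:

```sh
java -jar smojol-cli/target/smojol-cli.jar run YOUR_PROGRAM --commands="EXPORT_TO_GRAPHML" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --dialect IDMS --reportDir /path/to/report/dir
```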
The screenshot below shows a sample program's unified model exported to GraphML and loaded through the yEd Graph Editor.
To export the unified graph to GraphML from Neo4J, you can run the following command:
CALL apoc.export.graphml.all("<export.graphml>", {})
The file will be in the `import` directory inside the directory where the current database files are stored (in Neo4J Desktop).
In addition to writing to Neo4J and leveraging its data science capabilities to analyse the graph(s), the library also embeds JGraphT, a powerful library of graph algorithms. The `JGraphTBuilder` class converts the unified model (AST, CFG, Data Structures) into a `DirectedPseudograph` (because there can be both loops and parallel edges between two nodes) for subsequent analysis through the JGraphT API.
Custom analyses are a work in progress.
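As a rough sketch of the kind of analysis this enables, the example below runs two stock JGraphT algorithms over a `DirectedPseudograph`. The vertex and edge types here are simplified stand-ins (plain strings and `DefaultEdge`), not the toolkit's actual unified model classes:

```java
import org.jgrapht.alg.connectivity.ConnectivityInspector;
import org.jgrapht.alg.cycle.CycleDetector;
import org.jgrapht.graph.DefaultEdge;
import org.jgrapht.graph.DirectedPseudograph;

public class UnifiedModelAnalysisSketch {
    public static void main(String[] args) {
        // Stand-in graph; in practice, this would be the graph produced by JGraphTBuilder.
        DirectedPseudograph<String, DefaultEdge> graph = new DirectedPseudograph<>(DefaultEdge.class);
        graph.addVertex("PARA-A");
        graph.addVertex("PARA-B");
        graph.addVertex("PARA-C");
        graph.addEdge("PARA-A", "PARA-B");
        graph.addEdge("PARA-B", "PARA-C");
        graph.addEdge("PARA-C", "PARA-A"); // a control flow loop

        // Find vertices that participate in cycles (e.g. looping control flow)...
        CycleDetector<String, DefaultEdge> cycles = new CycleDetector<>(graph);
        System.out.println("Vertices on cycles: " + cycles.findCycles());

        // ...and connected components (isolated islands of code/data).
        ConnectivityInspector<String, DefaultEdge> connectivity = new ConnectivityInspector<>(graph);
        System.out.println("Number of components: " + connectivity.connectedSets().size());
    }
}
```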
You can find some useful Neo4J-based analysis queries in Analysis
The toolkit supports building glossaries of variables given the data structures in a program. This capability is provided through Python in the `smojol_python` component.
To use this facility, start by exporting the data structures to JSON, through the JAR, like so:
java -jar smojol-cli/target/smojol-cli.jar run YOUR_PROGRAM --commands="WRITE_DATA_STRUCTURES" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --dialect IDMS --reportDir /path/to/report/dir
This will generate a JSON file in `/path/to/report/dir`. After this, you can run:
cd smojol_python
python -m src.llm.glossary_builder.main /path/to/report/dir/program-data.json out/glossary.md
This will generate the glossary in `out/glossary.md`. Integrating other out-of-band data sources is a work in progress.
The toolkit supports extracting a capability map from the paragraphs of a source. For this, you need to generate both the AST in Neo4J and the data structures JSON; you can do this via:
java -jar smojol-cli/target/smojol-cli.jar run YOUR_PROGRAM --commands="INJECT_INTO_NEO4J WRITE_DATA_STRUCTURES" --srcDir /path/to/sources --copyBooksDir /path/to/copybooks --dialectJarPath che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --dialect IDMS --reportDir /path/to/report/dir
After this, you will want to extract the paragraph capabilities, like so:
python -m src.llm.capability_extractor.paragraph_capabilities /path/to/data/structures/json /paragraph/capabilities/output/path ../paragraph/variables/explanation/output/path
This will generate the capabilities in `/paragraph/capabilities/output/path`. At this point, you may need to clean parts of the output manually if some entries do not correspond to a comma-separated list of domain terms (efforts to eliminate this manual step are in progress).
The final step is to actually generate the capability map:
python -m src.llm.capability_extractor.capabilities_graph /paragraph/capabilities/output/path
This will take a little time, depending upon the number of paragraphs and their sizes. At the end, it will generate a dendrogram visualisation, as well as the capability map in Neo4J, as illustrated below (for a 10,000+ line COBOL program).
The toolkit uses JDK 21 features, so you'll need an appropriate JDK set up.
Run: `mvn clean install`.
The Checkstyle step is only applicable for the Eclipse Cobol parser project. You can skip the Checkstyle targets with:
mvn clean verify package -Dcheckstyle.skip=true
You can skip the tests as well, using:
mvn clean verify package -Dmaven.test.skip=true
For generating flowcharts, you will need to set up Graphviz on your box; see the site for OS-specific installation instructions.
For running the Python scripts, you can simply do:
cd smojol_python
pip install -r requirements.txt
- See `FlowChartBuildMain.java` for examples of how flowcharts are created.
- See `InterpreterMain.java` for an example of how to run the interpreter on your code, as well as how to inject execution traces into Neo4J.
- See `GraphExplorerMain.java` for an example of how to inject ASTs, data structures, and CFGs into Neo4J.
- CLI support is ongoing.
- More detailed guides on programmatic use are on the way.
The individual functionalities can be invoked using different commands. Further tasks / commands will be added.
This command encapsulates almost all the tasks that you are likely to run. The descriptions of the various commands are listed below.
- `WRITE_FLOW_AST`: Writes a more useful form of the AST to JSON. This form is used by the interpreter and other analyses.
- `INJECT_INTO_NEO4J`: Injects the unified model into Neo4J. Exposing more fine-grained options is in progress. This requires the environment variables `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` to be defined. If you wish to include comments in the graph, the `ATTACH_COMMENTS` task needs to have run first.
- `ATTACH_COMMENTS`: Parses the original source file (excluding copybooks for now) to find comments and attach them to the nearest subsequent AST node.
- `EXPORT_TO_GRAPHML`: Exports the unified model to GraphML. Exposing more fine-grained options is in progress.
- `WRITE_RAW_AST`: Writes the original parse tree to JSON. Useful for downstream code to build its own AST representations.
- `DRAW_FLOWCHART`: Outputs flowcharts for the whole program or section-by-section of the program in PNG format.
- `WRITE_CFG`: Outputs the control flow graph of the program as JSON.
- `WRITE_DATA_STRUCTURES`: Exports the data structure hierarchy of the program as JSON.
- `BUILD_PROGRAM_DEPENDENCIES` (ALPHA): Builds direct program dependencies from `CALL` and IDMS `TRANSFER CONTROL` statements. Indirect dependencies are not traced. For tracing the full dependency graph, see the `dependency` command.
- `COMPARE_CODE` (ALPHA): Builds a map of inter-paragraph similarity based on node edit distances (using the Zhang-Shasha algorithm). Work in progress.
For example, if you wanted to run all of the above, you could run the following command:
java -jar smojol-cli/target/smojol-cli.jar run test-exp.cbl hello.cbl --commands="WRITE_FLOW_AST INJECT_INTO_NEO4J EXPORT_TO_GRAPHML WRITE_RAW_AST DRAW_FLOWCHART WRITE_CFG" --srcDir /Users/asgupta/code/smojol/smojol-test-code --copyBooksDir /Users/asgupta/code/smojol/smojol-test-code --dialectJarPath ./che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --reportDir out/report --generation=PROGRAM
Passing the validation flag (`--validate` or `-v`) skips running all tasks and simply validates whether the source is syntactically correct.
The help text is reproduced below (obtained by adding `--help`):
Usage: app run [-hvV] [-d=<dialect>] [-dp=<dialectJarPath>]
[-f=<flowchartOutputFormat>] [-g=<flowchartGenerationStrategy>]
-r=<reportRootDir> -s=<sourceDir> -c=<commands>
[-c=<commands>]... -cp=<copyBookDirs>[,<copyBookDirs>...]
[-cp=<copyBookDirs>[,<copyBookDirs>...]]... [<programNames>...]
Implements various operations useful for reverse engineering Cobol code
[<programNames>...] The programs to analyse
-c, --commands=<commands> The commands to run (INJECT_INTO_NEO4J,
EXPORT_TO_GRAPHML, WRITE_RAW_AST,
DRAW_FLOWCHART, WRITE_FLOW_AST, WRITE_CFG,
ATTACH_COMMENTS, WRITE_DATA_STRUCTURES,
BUILD_PROGRAM_DEPENDENCIES, COMPARE_CODE)
-cp, --copyBooksDir=<copyBookDirs>[,<copyBookDirs>...]
Copybook directories (repeatable)
-d, --dialect=<dialect> The COBOL dialect (COBOL, IDMS)
-dp, --dialectJarPath=<dialectJarPath>
Path to dialect .JAR
-f, --flowchartOutputFormat=<flowchartOutputFormat>
Format of the flowchart output (PNG, SVG)
-g, --generation=<flowchartGenerationStrategy>
The flowchart generation strategy. Valid values
are SECTION, PROGRAM, and NODRAW
-h, --help Show this help message and exit.
-r, --reportDir=<reportRootDir>
Output report directory
-s, --srcDir=<sourceDir> The Cobol source directory
-v, --validate Only run syntax validation on the input
-V, --version Print version information and exit.
This command is used to trace the inter-program dependencies starting from a root program. To run this, use something like:
java -jar smojol-cli/target/smojol-cli.jar dependency if-test.cbl --srcDir /Users/asgupta/code/smojol/smojol-test-code --copyBooksDir /Users/asgupta/code/smojol/smojol-test-code --dialectJarPath ./che-che4z-lsp-for-cobol-integration/server/dialect-idms/target/dialect-idms.jar --export=out.json
Specifying the `--neo4j` flag injects those dependencies into Neo4J, while setting a path through `--export` will export them to JSON. The screenshot below shows a very simple dependency graph generated from `if-test.cbl` (in the `smojol-test-code` directory). Dynamic dependencies (resolved through variables and expressions) are not resolved any deeper.
The help text for this command is reproduced below:
Usage: app dependency [-hnV] [-d=<dialect>] [-dp=<dialectJarPath>]
-s=<sourceDir> [-x=<exportPath>] -cp=<copyBookDirs>[,
<copyBookDirs>...] [-cp=<copyBookDirs>[,
<copyBookDirs>...]]... <programName>
Implements various operations useful for reverse engineering Cobol code
<programName> The program to analyse
-cp, --copyBooksDir=<copyBookDirs>[,<copyBookDirs>...]
Copybook directories (repeatable)
-d, --dialect=<dialect> The COBOL dialect (COBOL, IDMS)
-dp, --dialectJarPath=<dialectJarPath>
Path to dialect .JAR
-h, --help Show this help message and exit.
-n, --neo4j Export to Neo4J
-s, --srcDir=<sourceDir> The Cobol source directory
-V, --version Print version information and exit.
-x, --export=<exportPath> Export path
NOTE: The API is under active development, and may be subject to change.
The simplest way to invoke tasks through the API is using `CodeTaskRunner`, like so:
new CodeTaskRunner(
"/path/to/src",
"path/to/report-dir",
ImmutableList.of(new File("/path/to/copybooks")),
"/path/to/dialect-idms.jar",
LanguageDialect.COBOL,
new FullProgram(FlowchartOutputFormat.SVG),
new UUIDProvider(),
new OccursIgnoringFormat1DataStructureBuilder())
.generateForPrograms(ImmutableList.of(
WRITE_RAW_AST,
WRITE_FLOW_AST
), ImmutableList.of("program.cbl"));
This invocation uses some specific conventions when deciding where to output file artifacts under the `report-dir` directory.
If you want more fine-grained control of the location of output artifacts, you can use the `SmojolTasks` class, which gives you more configurability in exchange for having to provide more detailed specifications.
NOTE: For all analyses, specifying the `OccursIgnoringFormat1DataStructureBuilder` class is preferable, to prevent extra noise that can be generated when creating arrays for structures using `OCCURS` clauses. However, the `DefaultFormat1DataStructureBuilder` should be specified when running the interpreter, because it will require the correct number of elements in array data structures.
This toolkit is distributed under the MIT License. However, the Eclipse Cobol Parser project is distributed under the Eclipse Public License V2. Accordingly, all modifications to the parser fall under the EPL v2 license, while the toolkit proper falls under the MIT License.
- This was built based on a time-boxed PoC, and thus isn't well-covered by tests yet. More are being added on an ongoing basis.
- Cobol is a large language, and thus the interpreter's capabilities are not exhaustive. However, the hope is that the subset currently present is useful enough to get started with reverse engineering legacy code. Obviously, more capabilities are being added on an ongoing basis.
- There are 4 superfluous directories at the top (engine, parser, dialect-daco, dialect-idms), which are a hack to get the Che4z Checkstyle targets to run because of a path configuration issue.
The rest of this file is mostly technical notes for my personal documentation.
Sym-1 / Sym-2 | S (Sign) | P (Left) | P (Right) | V (Decimal Point) | X (Alphanumeric) | 9 (Number) |
---|---|---|---|---|---|---|
S (Sign) | - | X | X | X | - | - |
P (Left) | - | X | - | X | - | X |
P (Right) | - | - | X | - | - | - |
V (Decimal Point) | - | - | X | - | - | X |
X (Alphanumeric) | - | - | - | - | X | X |
9 (Number) | - | - | X | X | X | X |
- Sentences which are GO TO must not be connected to the immediately following sentence in the code; the internal flow branches off correctly.
dot -Kdot -v5 -Gsize=200,200\! -Goverlap=scale -Tpng -Gnslimit=4 -Gnslimit1=4 -Gmaxiter=2000 -Gsplines=line dotfile.dot -oflowchart-level5.png
These are some other commands tried on larger graphs:
dot -Kneato -v5 -Tpng dotfile.dot -oflowchart-level5.png
dot -Kdot -v5 -Gsize=200,200\! -Goverlap=scale -Tpng -Gnslimit=2 -Gnslimit1=2 -Gmaxiter=2000 -Gsplines=line dotfile.dot -oflowchart-level5.png
dot -Kfdp -v5 -Goverlap=scale -Gsize=200,200\! -Tpng dotfile.dot -oflowchart-level5.png
dot -Ktwopi -v5 -Gsize=200,200\! -Tpng dotfile.dot -oflowchart-level5.png
dot -Kdot -v5 -Gsize=200,200\! -Goverlap=scale -Tpng -Gnslimit=4 -Gnslimit1=4 -Gmaxiter=2000 -Gsplines=line dotfile.dot -oflowchart-level5.png
Run:
act -W workflows-local/local-generate-flowchart.yml --container-architecture linux/amd64