User guide documentation update #5

Open · wants to merge 6 commits into master
Changes from 5 commits
213 changes: 210 additions & 3 deletions user_guide.adoc
@@ -380,6 +380,21 @@ enums:
17: udp
----

Alternatively, hexadecimal notation can be used to define an enumeration:
Member:

Totally OK, but I'd also note that this is a service provided by YAML, not something specific to KS.

Contributor Author:

I was thinking that a new section of the document could be created for general syntax and a very brief overview of YAML and what it provides. This example I provided may be better suited there.

Member:

Some Construct features are Python features, but I would advertise them just the same. The purpose of documentation is to show capabilities, not attribution. =) Just saying.


[source,yaml]
----
seq:
  - id: key
    type: u4
    enum: keys
enums:
  keys:
    0x77696474: width  # "widt"
    0x68656967: height # "heig"
    0x64657074: depth  # "dept"
----

There are two things that should be done to declare an enum:

1. We add `enums` key on the type level (i.e. on the same level as
@@ -472,7 +487,25 @@ structure:

[source,yaml]
----
TODO
seq:
  - id: header
    type: file_header
  - id: metadata
    type: metadata_section
types:
  file_header:
    seq:
      - id: version
        type: u2
  metadata_section:
    seq:
      - id: author
        type: strz
        encoding: UTF-8
      - id: publisher
        type: strz
        encoding: UTF-8
        if: _parent.header.version >= 2
----

==== `_root`
@@ -799,6 +832,39 @@ other value which was not listed explicitly.
_: rec_type_unknown
----

If an enumeration has already been defined, you can use references to
items in the enumeration instead of specifying integers a second time:
Member:

Actually, if you defined `key` as an enum, then you don't have much choice. You can't compare enums to integers without additional conversions.

Contributor Author:

Hmm, good point, I'll update the text accordingly.


[source,yaml]
----
seq:
  - id: key
    type: u4
    enum: keys
  - id: data
    type:
      switch-on: key
      cases:
        keys::width: data_field_width
        keys::height: data_field_height
        keys::depth: data_field_depth
types:
  data_field_width:
    seq:
      # ...
  data_field_height:
    seq:
      # ...
  data_field_depth:
    seq:
      # ...
Member:

The pedantic person in me cries for that misaligned #... ;)
And anyway, `seq` is totally optional, so maybe it's better to wrap it up as:

types:
  data_field_width: # ...
  data_field_height: # ...
  data_field_depth: # ...

for brevity.

Contributor Author:

Thanks, agreed

enums:
  keys:
    0x77696474: width  # "widt"
    0x68656967: height # "heig"
    0x64657074: depth  # "dept"
----

=== Instances: data beyond the sequence

So far we've done all the data specifications in `seq` - thus they'll
@@ -1024,7 +1090,117 @@ bytes sparsely.

=== Streams and substreams

TODO
==== Introduction and simple example

A stream is a flow of data from an input file into a parser generated from a KS script. The parser can request one or more bytes of data from the stream at a time, but it cannot request the same data twice, and it cannot request that data be provided out of sequential order. A stream knows the maximum amount of data available to be requested by the parser, and the actual amount of data which has already been requested by the parser.
Member:

This explanation is pretty abstract and somewhat misleading. "Stream" can be re-read as many times as needed, and it can be seeked: that's exactly how positional parse instances work, they use seek operations on a stream.

Contributor Author:

I'll think of another way to explain streams then, especially with reference to how `pos:` works (seeking) and how `io:` can be used to designate which stream to use.


When a file is first opened for parsing by a parser generated by KS, a root stream is created. This root stream can be accessed via `_root._io` at any time and in any place: `_root` refers to the top-level object defined in a script, and `_io` is a property which can be accessed on an object to return the associated stream. For the root stream, the maximum amount of data available to be requested by the parser is the file size of the input file being parsed. Initially, the root stream will know that 0 bytes of data have been requested by the parser.
Member:

Streams can be used on in-memory byte arrays too, not necessarily files (which have file sizes). And, actually, a stream does not "know" the full file size, but it can query it on demand. The file size can change if the file is modified while KS parsing is in progress, so it's actually OK for `_io.size` to return varying values at different points in time.

Contributor Author:

That's a great point, probably one worth adding to the pitfalls section (or troubleshooting or similar) for the few people who may encounter the issue and not understand what is going on.


Below is an example script used to generate a parser, which is then used to parse an input file. Assume that this input file simply consists of a 32-bit unsigned integer with a value of 1000, followed by 1000 bytes of payload data; the input file thus has a total size of 1004 bytes.

[source,yaml]
----
meta:
  id: example_file
seq:
  - id: header
    type: file_header
  - id: body
    type: file_body
    size: header.body_size
types:
  file_header:
    seq:
      - id: body_size
        type: u4
  file_body:
    seq:
      - id: payload
        size-eos: true
----

The parser generated by the script will first request 4 bytes of data
from the root stream to copy into the object `header.body_size`. After
the stream has returned the 4 bytes of data to the parser, the stream
will know that it has returned 4 out of the 1004 bytes of data available
to the parser. The parser is now only able to request 1000 bytes of
additional data from the stream.

The definition of the `body` object in the example script specifies the
size of the `body` object to be the already-parsed value of
`header.body_size`. Defining an object size results in something
interesting happening with the KS-generated parser--a new substream is
created to specifically parse the `body` object.

Similar to how the root stream operates, the new substream initially
knows the maximum amount of data available to be requested, and the
actual amount of data already returned. In this example, the substream
upon creation has a maximum of 1000 bytes of data which can be
requested by the parser. The substream will know the actual amount of
data which has been provided is 0 bytes.

The parser will then continuously request data from the new substream to copy into the object `body.payload`. As the substream receives requests for more data, it passes those requests on to the root stream. Unlike the root stream, substreams are only able to request data from the root stream or from other substreams; substreams do not read from an input file directly.

Because `size-eos: true` is specified for the `body.payload` object, the parser will continue requesting data from the substream until the actual amount of data provided by the substream is 1000 bytes (the maximum amount of data which the substream is able to provide). Once all 1000 bytes of data have been copied from the input file, via the root stream and then via the substream, into the `body.payload` object, the internal state of the two streams would be:

* root stream--maximum bytes of data available remains 1004, actual amount of data already requested is 1004 bytes
* substream--maximum bytes of data available remains 1000, actual amount of data already requested is 1000 bytes

Alternatively, if `header.body_size` happens to be a value larger than
the input file size, the root stream would be unable to fulfill this
request, and the KS-generated parser would abruptly raise an exception
for trying to read non-existent data beyond the end of the input file.

The `_io` property can be used to access the stream associated with an object. The object can be obtained by identifier, or alternatively via `_root` and `_parent`. Once a stream has been obtained through `_io`, several properties can be used to inspect the internal state of the stream (see the sketch after this list):

* `size` returns the maximum amount of data which is available to be requested from the stream
* `pos` returns the actual amount of data which has already been requested from the stream
* `eof` returns a boolean value: `false` when `pos != size` and `true` when `pos == size` (i.e. has the maximum amount of data available via the stream already been requested?)
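
These properties can be combined in value instances, for example to expose how much data remains in the current stream. A minimal sketch (the instance names are illustrative):

[source,yaml]
----
instances:
  bytes_remaining:
    value: _io.size - _io.pos  # data not yet requested from this stream
  at_end:
    value: _io.eof             # true once pos == size
----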

Substreams can be nested many layers deep by defining the `size` of
each object in the nested tree.
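
A minimal sketch of such nesting (all names here are hypothetical): each `size` below creates another substream inside the one that encloses it.

[source,yaml]
----
seq:
  - id: outer
    type: outer_type
    size: 1000            # substream of the root stream: 1000 bytes
types:
  outer_type:
    seq:
      - id: inner
        type: inner_type
        size: 100         # substream of the outer substream: 100 bytes
  inner_type:
    seq:
      - id: payload
        size-eos: true    # reads to the end of the 100-byte substream
----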

Related expressions which are useful when working with streams include the following (a sketch using `repeat: eos` follows this list):

* `repeat: eos`
* `size-eos: true`
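
A minimal sketch of `repeat: eos` (the `record` type and its fields are hypothetical): records are parsed until the current stream or substream reports end of stream.

[source,yaml]
----
seq:
  - id: records
    type: record
    repeat: eos      # keep parsing records until _io.eof of the current stream
types:
  record:
    seq:
      - id: len
        type: u1
      - id: data
        size: len
----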

=== Processing: dealing with compressed, obfuscated and encrypted data

@@ -1903,7 +2079,38 @@ beginner Kaitai Struct users.

=== Specifying size creates a substream

TODO
In the following example script, an erroneous attempt is made to parse
an input file with a file size of 2000 bytes:

[source,yaml]
----
seq:
  - id: body
    type: some_body_type
    size: 1000
types:
  some_body_type:
    seq:
      - id: payload
        size: 999
      - id: overflow
        size: 2
----

The parser can successfully copy the required 999 bytes into
`body.payload` as the `body` substream has 1000 bytes available to
be requested, and the root stream has 2000 bytes available.

An exception occurs upon attempting to copy data from the `body` substream into the `overflow` object. After data has been copied from the `body` substream into the `payload` object, the `body` substream has only 1 byte of data still available for the parser to request. When 2 bytes of data are requested, the `body` substream is exhausted of available data and an exception is raised. The fact that the root stream still has 1001 bytes available to be requested from the input file does not matter: the `body` substream never has the opportunity to request any more than the first 1000 bytes of the input file.
Member:

This is actually not a pitfall but legitimate behavior, and it is well explained in the previous section.

The "pitfall" I was thinking about in this section is the following: when a new substream is created, all parse instances with positions act within that substream by default.

So, this one works as expected:

seq:
  - id: skipped
    size: 1000
  - id: indexing
    type: file_index_entry
    # but adding "size: 24" here will ruin "file_body" instance,
    # although it looks legitimate at the first glance
types:
  file_index_entry:
    seq:
      - id: file_name
        type: str
        size: 16
      - id: file_pos
        type: u4
      - id: file_len
        type: u4
    instances:
      file_body:
        pos: file_pos
        size: file_len

To overcome that, one needs to use something like `io: _root._io` in `file_body`. Of course, the documentation warrants a somewhat better example and explanation.

Contributor Author:

Excellent. I didn't know about `io:` either, so that's a good one to document! Nice feature!
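
For reference, applying that suggestion, the reviewer's `file_index_entry` type might be written as follows (a sketch based on the comment above: `io: _root._io` makes the positional instance seek within the root stream rather than within any enclosing substream, and `encoding` is added so the sketch compiles standalone):

[source,yaml]
----
types:
  file_index_entry:
    seq:
      - id: file_name
        type: str
        size: 16
        encoding: ASCII
      - id: file_pos
        type: u4
      - id: file_len
        type: u4
    instances:
      file_body:
        io: _root._io   # seek in the root stream, not the local substream
        pos: file_pos
        size: file_len
----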


=== Applying `process` without a size
