
More sixel renderer optimizations. #13

Open
ismail-yilmaz opened this issue Jun 28, 2021 · 92 comments
ismail-yilmaz commented Jun 28, 2021

Hello @jerch,

I don't want to pollute your repos or other active discussion threads with my findings until I can come up with some practical optimization suggestions for your renderer. So I'll put this here.

I've been doing some heavy profiling on my SixelRenderer since last week. Here are some preliminary results:

  • I identified both some bottlenecks in my code and the performance-sensitive critical parts of sixel rendering in general.
  • I achieved some nice progress with some simple tweaks since last week:
    • I have managed to squeeze out a steady 42 MiB/s (2.90 secs) (parsing + rendering) on your sixel-bench animation, which means a ~12 MiB/s gain for me.
    • There is a small performance penalty associated with displaying the animation due to the GUI event handling (latency) of our GUI toolkit, but my SixelViewer app (a small testing and benchmarking app) can display the full sixel-bench video at 38.00 MiB/s (3.10 secs) on average. As you can guess, this does not translate 1:1 to final terminal performance, because other variables also affect the vte rendering performance. Still, it can achieve slightly higher performance than MLTerm now.

(screenshot: sixel-renderer-optimized)

The above improvements all come from some really simple tricks (one of which I borrowed from you) and from examining the produced assembly code. I am confident that we can raise the throughput bar even higher, because there is still room for optimization.

Later this week I'd like to share my findings (code changes/alterations) with you, which you might find interesting and, hopefully, useful for your C/wasm decoder.

@ismail-yilmaz ismail-yilmaz added enhancement Terminal Terminal package related issues. labels Jun 28, 2021
@ismail-yilmaz ismail-yilmaz self-assigned this Jun 28, 2021
jerch commented Jun 28, 2021

@ismail-yilmaz Wow, sounds promising - and ofc I am really curious about these improvements on the asm level and whether they also apply to wasm to some degree or could be ported back. Its pure stack-machine design makes some optimizations behave worse (even pointer arithmetic tends to run slower, from my findings).

jerch commented Jun 28, 2021

@ismail-yilmaz Minor note on what I've found worth looking into, but did not get down to it yet:

  • The sixel bit deconstruction with branching is actually much more expensive than always writing something somewhere. With more clever index "routing", like writing empty bits to dummy pixels, I can gain a ~30% speedup in put_single. This somewhat surprised me; I would have thought that the memory access to dummy pixels would outweigh the branching + bit testing by far. But it seems the branching creates a heavy load here.
  • Imho the conditional check whether the byte at hand is a sixel (i.e. in range 63 to 126) can be made much faster with clever bit shifting. I only profiled that one so far; it adds up to quite some runtime if the sixel data contains only few compression directives.

Both ideas are somewhat branching related, and I think that is promising ground for further optimizations.

Edit: Btw my decoder function kinda invites the use of computed gotos; sadly they are not yet natively supported in wasm. If you want to squeeze 10 - 20% more perf, this might be worth a look as well.
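For reference, computed-goto dispatch in native code could look roughly like the sketch below (it relies on the "labels as values" GNU C/C++ extension, which, as noted, has no wasm equivalent; the byte classes and the count_sixels helper are purely illustrative, not the actual decoder states):

```cpp
#include <cassert>

// Minimal sketch of computed-goto dispatch (GNU C/C++ extension, not
// standard C++ and not available in wasm). Instead of a switch inside a
// loop, every handler fetches the next byte and jumps straight to the next
// handler through label addresses, so there is no central dispatch branch.
// The two byte classes here (sixel data vs. everything else) are
// illustrative stand-ins for the real decoder states.
static int count_sixels(const unsigned char* data, int len) {
    int sixels = 0;
    int i = 0;

#define DISPATCH()                                                  \
    do {                                                            \
        if (i >= len) goto done;                                    \
        unsigned char c = data[i++];                                \
        goto *((c >= 63 && c <= 126) ? &&on_sixel : &&on_other);    \
    } while (0)

    DISPATCH();
on_sixel:
    ++sixels;   // a real decoder would paint the sixel here
    DISPATCH();
on_other:
    // a real decoder would handle control bytes/parameters here
    DISPATCH();
done:
#undef DISPATCH
    return sixels;
}
```

Each handler ends by jumping directly to the next one, which gives the branch predictor one indirect-jump site per handler instead of a single shared dispatch point.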

jerch commented Jul 1, 2021

@ismail-yilmaz I implemented the "write always something" in put_single by offsetting the canvas pixels and routing empty bits to canvas[0]. Boosted the raw decoding throughput from ~85 MB/s to 95-100 MB/s for my test images (~15% boost). Still not sure why it shows that much of a speedup; could be a cache side effect. Wild theory: I have the data chunk for reading right before that in memory, so maybe it always gets loaded into cache due to this "accidental" locality.
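The "write always something" idea can be sketched like this (a minimal illustration assuming a simple linear pixel buffer; the Canvas type and this put_single are hypothetical stand-ins, not the actual node-sixel internals):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Branch-free sixel column write: the canvas gets one extra dummy pixel at
// index 0, and empty sixel bits are routed there instead of being skipped
// by a branch. Illustrative sketch only.
struct Canvas {
    int width, height;
    std::vector<uint32_t> pixels;  // pixels[0] is the dummy target
    Canvas(int w, int h) : width(w), height(h), pixels(1 + w * h, 0) {}

    // Paint one sixel column (6 vertical pixels) at (x, y_offset).
    void put_single(int x, int y_offset, int code, uint32_t color) {
        for (int bit = 0; bit < 6; ++bit) {
            int on  = (code >> bit) & 1;                 // 1 if pixel set
            int idx = 1 + (y_offset + bit) * width + x;  // real pixel index
            // on == 0 routes the store to the dummy pixel at index 0,
            // so the write happens unconditionally.
            pixels[idx * on] = color;
        }
    }
};
```

Multiplying the index by the bit value replaces the per-bit branch with plain arithmetic, which is the effect described above: all six stores always execute, and the "empty" ones land on the sacrificial slot.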

ismail-yilmaz commented Jul 1, 2021

@jerch ,

implemented the "write always something"

I apologize for my late reply (really busy now; I'll comment on the things you've written tonight, GMT+3). But I think I finally understand why you get such high numbers (which is very impressive, by the way, congrats!). When I do the same thing, the performance of my decoder falls to ~25 MiB/s (from 42). I now know why (at least, I know what the main culprit is). I'll write the whole thing tonight (GMT+3) and share some findings, with a suggestion on your decoder's main loop.

Best regards,
İsmail

jerch commented Jul 1, 2021

I apologize for my late reply ...

Well wouldn't call 30 min "late" 😸

Yes I am really interested in your asm findings and whether I can backport some ideas to wasm as well. Wasm itself is pretty reduced in op codes and its stack mechanics, so some things might not work out there.

When I do the same thing, the performance of my decoder falls to ~25 MiB/s (from 42).

Oh, that's weird. Note that those numbers are raw decoder speed; the whole turnover in xterm.js with the sixel-bench test is just at 36 MB/s (it gets slowed down by the JS handling before and after). Furthermore my test machine is an i7-2760QM at 2.40 GHz with 1333 MHz (0.8 ns) double-bank memory. The CPU is kinda old, but I remember seeing big differences between i3/i5 types and i7, esp. due to their different cache settings. If we want comparable metrics across machines, maybe we should switch to cycles instead.

ismail-yilmaz commented Jul 1, 2021

@jerch

A disclaimer: my experience with, and knowledge of, WebAssembly is pretty limited at the moment (not so with asm/C/C++, though).

I will be using sixel-bench as my main reference here. The machine I use for testing is an old enough AMD FX 6100 (six cores) @ 3.3 GHz, 8 MB cache, dual-channel memory @ 1866 MHz.

My findings suggest that there are three critical sections that impact the sixel rendering performance. Let's go step by step, shall we?

1. Parameter collection

This was the first interesting part. The fact that higher color palette settings (>= 256) and sixel compression rely on a supplied parameter list makes this section a good candidate for optimization. Throughout my tests and benchmarking I came to the conclusion that, for effective CPU cache utilization, it is better to collect the consecutive parameters in a single, separate loop at once (i.e. via a dedicated inline function, for better instruction and data cache utilization).
To this end I created three versions of the ReadParams() function, each with a slight modification, and ran them all 100 times on a 10 MB file that contains random, valid sixel parameters ranging from 0 to 999999.

The code of the said three functions can be inspected here. (Note that I use Upp::Stream instead of std::stringstream in the actual code, which is faster, and its Peek() and Get() methods are trivially inlineable.)

Here are the results (with -O3 optimization):

// TIMING     - readparams_plain    :  2.97 s  - 29.71 ms ( 2.97 s  / 100 ), min: 29.00 ms, max: 32.00 ms, nesting: 0 - 100
// TIMING     - readparams_faster   :  1.94 s  - 19.37 ms ( 1.94 s  / 100 ), min: 19.00 ms, max: 27.00 ms, nesting: 0 - 100
// TIMING     - readparams_unrolled :  1.72 s  - 17.25 ms ( 1.73 s  / 100 ), min: 17.00 ms, max: 18.00 ms, nesting: 0 - 100

// THROUGHPUT - readparams_plain    :  3.40 MiB/s
// THROUGHPUT - readparams_faster   :  5.20 MiB/s
// THROUGHPUT - readparams_unrolled :  5.80 MiB/s

While the difference between the faster and unrolled versions can be considered negligible, it is still there. It also shows that conditional branching may not always be the worst path to take. Which brings me to the question: does this also apply to wasm? Did you already consider this approach for your decoder.cpp? Because you seem to be handling the parameters at the main loop level.
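The "collect all consecutive parameters in one dedicated loop" idea can be sketched like this (a simplified illustration over a raw byte span; the actual code uses Upp::Stream, and this read_params is a hypothetical stand-in for the variants benchmarked above):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Collect a whole run of ';'-separated numeric parameters in one tight
// loop, instead of accumulating digits byte-by-byte inside the main state
// machine. Keeps the digit loop hot in the instruction cache. Sketch only.
static const char* read_params(const char* p, const char* end,
                               std::vector<int>& params) {
    params.clear();
    while (p < end) {
        int value = 0;
        while (p < end && *p >= '0' && *p <= '9')
            value = value * 10 + (*p++ - '0');
        params.push_back(value);   // missing digits yield the default 0
        if (p < end && *p == ';')  // parameter separator, keep collecting
            ++p;
        else
            break;                 // first non-parameter byte ends the run
    }
    return p;                      // caller resumes the main loop here
}
```

The main loop only pays the function-call-shaped detour once per parameter run, which matches the cache-utilization argument above.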

A side note: a code line from decoder.cpp written in two different ways (Clang targeting WebAssembly, -O3):

else if (code > 62 && code != 127) // original version.
        local.get       3
        i32.const       63
        i32.lt_u
        br_if           0
# %bb.32:               
        local.get       3
        i32.const       127
        i32.eq  
        br_if           0


else if (unsigned(code - 63) < 64) // My altered version
        local.get       3
        i32.const       -63
        i32.add 
        local.tee       2
        i32.const       63
        i32.gt_u
        br_if           0

Now, I don't know which one would be faster, as I don't have in-depth knowledge of wasm, so please feel free to correct me at every level. But the fact is that this code, used for put_pixel, is on a very perf-sensitive path, and the altered version is at least shorter and involves less branching (I don't know how many cycles each instruction takes on wasm), so it may have an impact on the performance, albeit a small one.
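For what it's worth, the two checks agree over the whole masked byte range 0..127 (the decoder masks its input with & 0x7F), which is why the single unsigned comparison is a safe substitute there; a quick equivalence check:

```cpp
#include <cassert>

// The two range checks from the snippet above, side by side. Subtracting 63
// in unsigned arithmetic wraps values below 63 to huge numbers, so a single
// `< 64` test covers both bounds of the sixel byte range 63..126. The two
// forms agree for inputs already masked to 0..127 (as in the decoder);
// outside that domain they differ (e.g. for 128+).
inline bool is_sixel_branchy(int code)  { return code > 62 && code != 127; }
inline bool is_sixel_unsigned(int code) { return unsigned(code - 63) < 64u; }
```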

To be continued...

jerch commented Jul 2, 2021

@ismail-yilmaz Thx a bunch, I'll check these things out.

About critical sections - I kinda have a hard time profiling in wasm; I simply don't know any tools that can achieve it within wasm. So I went the faulty route of deactivating code paths and comparing the runtimes, which ofc carries the risk of creating bigger dead code branches that get eliminated by the compiler. Still, my findings in wasm are these:

  • only inner state handling (deactivated sixel, cursor, params and color processing): 150 - 200 MB/s
  • all minus sixel processing (bit decomposition + pixel save): 95 - 180 MB/s
  • all minus params and color processing: 140 - 190 MB/s
  • all: 93 - 165 MB/s

Note that the big spread in the runtimes is a result of one test image with 4096 colors and random pixels. This image has almost no compression, thus very high throughput numbers. Normal images are grouped around the lower number with only small variations (< +20 MB/s).

Interpretation of the numbers:

  • Sixel processing and pixel save account for ~40% of the runtime. This is actually lower than I expected.
  • Params and color processing accounts for ~7% of the runtime. No surprise here; it is within my expectations, with a slightly worse runtime ratio compared to data ratio, thus slower in processing than the sixel path. This could be further optimized, but the gains are unlikely to surpass 5% of overall runtime.
  • Most runtime, ~50%, is spent in the inner state handling. WTH - the inner state handling only switches branches based on ps.state. I have to admit that the branches are quite scattered/nested with that weird jump helper; still, my expectations were in the 15 - 30% range, not 50%.

So I'm gonna try to find a simpler state model, as it promises great savings. My original JS implementation uses a DFA switching on every input byte. While this is fairly equal in JS, it actually performs worse in wasm. The current wasm state model is a direct transfer of the DFA states into switch and if-else conditions, with less switching per byte but deeper nested side paths (the jump helper results from preserving the DFA states, but might not be needed everywhere).
A first dig into the branching direction shows that the switch statements might be problematic. They get compiled to br_table instructions in wasm, which don't seem to get the jump table optimization in the final machine code created by the wasm runtime (needs more investigation; I only did a quick test with different numbers of cases, which revealed an O(n) dependency, bummer).

Edit: Fixed the mixed numbers above.

ismail-yilmaz commented Jul 2, 2021

@jerch ,

I will hopefully write the second and third parts this weekend, and try to clear up some points with the numbers and profiling.
But let me say that I found the culprit in my decoder, and changing a single thing skyrockets my decoder's throughput to 61 MiB/s (on sixel-bench). But it is costly, and not really ideal for my decoder's use cases.

I gonna try to find a simpler state model, as it promises great savings.

This is my main loop, by the way:

			for(;;) {
				int c = stream.Get();
				switch(c) {
				case 0x21:
					GetRepeatCount();
					break;
				case 0x22:
					GetRasterInfo();
					break;
				case 0x23:
					SetPalette();
					break;
				case 0x24:
					Return();
					break;
				case 0x2D:
					LineFeed();
					break;
				case -1: // eof or err
					goto Finalize;
				default:
					if(check_range(c, 0x3F, 0x7E))
						PaintSixel(buffer, c - 0x3F);
					break;
				}
			}

Each function is inlined, and the only other two inner loops are in ReadParams() (inlined inside SetPalette, GetRepeatCount, GetRasterInfo) and PaintSixel(). Also, on sixel-bench, ReadParams gets called 17,575,647 times (total), while PaintSixel is called 69,508,135 times (total, for both compressed and single sixels). And these numbers do not even account for the inner loops of these functions! That's why I think they are very perf-sensitive.

jerch commented Jul 2, 2021

@ismail-yilmaz 61 MiB/s for the full deal with sequence parsing, sixel decoding and painting in the terminal? Wow, that's far beyond what I can achieve (xterm.js input processing alone is slower, lol).

May I ask you to run the benchmarks of my lib? I have the feeling that our machines differ quite a bit in throughput; sadly I have no clue how to measure wasm execution in terms of CPU cycles, which would be a better number for comparison.

If you are up to testing my wasm decoder, I can give you the compiled version, so you won't need to install the emscripten SDK.

ismail-yilmaz commented Jul 2, 2021

@jerch

61 MiB/s for the full deal with sequence parsing, sixel decoding and painting in the terminal? Wow, that's far beyond what I can achieve (xterm.js input processing alone is slower, lol).

Unfortunately no. It is sequence parsing + sixel decoding.

  • If I display the full animation in the sixel viewer, I get ~56 MiB/s (GUI-processing penalty).
  • If I display the full animation in the terminal via pty -> ~31 MiB/s (where mlterm gets 23 MiB/s).
  • If I display the full animation in the terminal by directly feeding it -> ~40 MiB/s.

The drop has several reasons, but the most important one is that I have to stick with the GUI calls and rules provided by our framework to ensure backend compatibility, as our vte is a widget, not an app. But the raw throughput (with only the actual image display disabled) for sixel-bench in our vte is around 44 MiB/s. So it is pretty much optimized for its environment.

But this "extremely fast" rendering is not feasible for me, because I need to use a static buffer with a fixed length to get it. You can see what I'm getting at... :))
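To illustrate why a fixed-size static buffer "makes compilers happy" (a hypothetical sketch, not the actual decoder code): the row stride becomes a compile-time constant, so pixel indexing compiles down to plain shift/add arithmetic with no indirection or allocation bookkeeping, at the cost of a hard-wired maximum image size.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative contrast: a statically sized canvas with compile-time
// dimensions. The compiler can strength-reduce `y * WIDTH + x`, keep the
// base address constant, and never worry about reallocation or aliasing.
// The price is the fixed 1920x1080 footprint and the inability to decode
// larger images - the trade-off described above.
constexpr int WIDTH  = 1920;
constexpr int HEIGHT = 1080;
static uint32_t canvas[WIDTH * HEIGHT];

inline void put_pixel(int x, int y, uint32_t color) {
    canvas[y * WIDTH + x] = color;  // constant stride, no pointer loads
}
```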

It appears that the 42-44 MiB/s range I've recently achieved is the limit for my decoder, barring a fundamental change in my assumptions and decoder design.

If you are up to test my wasm decoder I can give you the compiled version, thus you wont need to install the emscripten SDK.

Ah, I totally forgot to mention that I already compiled it just two days ago, but did not have time to test it, sorry. However, if you have an additional script to test sixel-bench with your wasm, that would be very nice. :)

I will run the existing tests and compare the results with mine. (I'll share the results within next week)

I am really interested in optimizing sixel rendering now, and I really like to see xterm.js with top-notch sixel support. Besides, your decoder.cpp already gets a lot of things right. In fact I was planning to build an experimental renderer around it to see how well it fares against mine in the same environment, but I am very busy, so this will take some time...

jerch commented Jul 2, 2021

Ah, I totally forgot to mention that I already compiled it just two days ago, but did not have time to test it, sorry. However, if you have an additional script to test sixel-bench with your wasm, that would be very nice. :)

I can prepare a script for the sixel-bench test, sure thing.

If you pulled the whole node-sixel repo, you can run its benchmarks with:

cd node-sixel
git checkout faster_decode
# edit wasm/build.sh to point EMSCRIPTEN_PATH to your emsdk_env.sh
# then run
npm install  # will abort, just ignore it
npm run build-wasm
npm install
# benchmark
node_modules/.bin/xterm-benchmark ./lib/index.benchmark.js

I have not tested that yet on a clean checkout, so bear with me if it's not working out of the box. Also, my newer optimizations are not yet checked in, thus you will see slightly lower numbers.

Edit: Fixed the checkout/build commands.

@ismail-yilmaz

I can prepare a script for the sixel-bench test, sure thing.

I'd be grateful, thanks!

I will benchmark it on this sunday.

jerch commented Jul 2, 2021

Damn switch statement - with a flat switch statement over byte values I get worse numbers (80 - 150 MB/s), but reshaped into sorted if-else cascades I get slightly higher numbers (100 - 180 MB/s). Lol, not a good sign if the conditional cascade actually performs better; normally the switch statement should outperform it by far with proper jump table optimization for multiple cases, grrrrr. So the neckbreaker for the current complicated state model seems to be the extensive usage of switch.

Will see if I can get a simpler branching model from the cascading thingy (it still misses several edge case transitions).

jerch commented Jul 2, 2021

This is the best compromise I could find with reduced state handling:

inline void maybe_color() {
  if (ps.state == ST_COLOR) {
    if (ps.p_length == 1) {
      ps.color = ps.palette[ps.params[0] % ps.palette_length];
    } else if (ps.p_length == 5) {
      // parentheses around the ternary: && binds tighter than ?:, so the
      // grouping is needed to express "HLS hue up to 360, else up to 100"
      if (ps.params[1] < 3
        && (ps.params[1] == 1 ? ps.params[2] <= 360 : ps.params[2] <= 100)
        && ps.params[3] <= 100
        && ps.params[4] <= 100) {
        switch (ps.params[1]) {
          case 2:  // RGB
            ps.palette[ps.params[0] % ps.palette_length] = ps.color = normalize_rgb(
              ps.params[2], ps.params[3], ps.params[4]);
            break;
          case 1:  // HLS
            ps.palette[ps.params[0] % ps.palette_length] = ps.color = normalize_hls(
              ps.params[2], ps.params[3], ps.params[4]);
            break;
          case 0:  // illegal, only apply color switch
            ps.color = ps.palette[ps.params[0] % ps.palette_length];
        }
      }
    }
  }
}

void decode(int length) {
  if (ps.not_aborted && ps.y_offset < ps.height) {
    for (int i = 0; i < length; ++i) {
      int code = ps.chunk[i] & 0x7F;
      if (62 < code && code < 127) {
        switch (ps.state) {
          case ST_COMPRESSION:
            put(code - 63, ps.color, ps.params[0]);
            ps.cursor += ps.params[0];
            ps.state = ST_DATA;
            break;
          case ST_COLOR:
            maybe_color();
            ps.state = ST_DATA;
            // fall through: the current byte is still a sixel to paint
          default:
            put_single(code - 63, ps.color);
            ps.cursor++;
        }
      } else if (47 < code && code < 58) {
        params_add_digit(code - 48);
      } else
      switch (code) {
        case 59:
          params_add_param();
          break;
        case 33:
          maybe_color();
          params_reset();
          ps.state = ST_COMPRESSION;
          break;
        case 35:
          maybe_color();
          params_reset();
          ps.state = ST_COLOR;
          break;
        case 36:
          ps.cursor = 0;
          break;
        case 45:
          ps.y_offset += 6;
          ps.offset = ps.y_offset * ps.width + 8;
          ps.cursor = 0;
          break;
        case 34:
          maybe_color();
          params_reset();
          ps.state = ST_ATTR;
          break;
      }
    }
  }
}

While if-cascades are slightly faster in wasm, they kinda make the code a lot uglier. Furthermore I don't want to optimize for shortcomings of wasm runtimes (I only tested with nodeJS so far); if they have a hard time to opt for real jump tables, they better get that fixed imho. Throughput numbers are between 92 - 180 MB/s, binary size dropped from ~6KB to ~3.8KB.

maybe_color still needs some cleanup (just copied it over), also the reduced state handling needs extensive testing.

Well, it seems I cannot find a significantly faster version in wasm even with the reduced state model, and I don't think that the states can be further reduced without introducing faulty handling. I did not yet try your suggested optimizations above.


ismail-yilmaz commented Jul 3, 2021

@jerch

In the meantime, some preliminary benchmarks, including wasm:

(Sixel tests, on my development decoder, the one that gets ~42 MiB/s on sixel-bench)

Clean decoding, 20 rounds.
SixelRenderer_Clean(sem_clean.six):                 1.40ms   (64.00 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           1.80ms   (29.00 MB/s)
SixelRenderer_Clean(screen_clean.six):              2.50ms   (21.00 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             4.50ms   (70.00 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           30.00ms  (66.00 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             18.00ms  (33.00 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.70ms   (42.00 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  120.00ms (131.40 MB/s)
SixelRenderer_Clean(oriole_clean.six):              1.10ms   (9.90 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.70ms   (7.40 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.45ms   (1.10 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            1.10ms   (13.00 MB/s)
SixelRenderer_Clean(michael_clean.six):             1.20ms   (17.00 MB/s)
SixelRenderer_Clean(trontank_clean.six):            1.10ms   (15.00 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    20.00ms  (31.00 MB/s)
SixelRenderer_Clean(time_clean.six):                3.70ms   (42.00 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.95ms   (11.00 MB/s)
SixelRenderer_Clean(biplane_clean.six):             1.50ms   (24.00 MB/s)
SixelRenderer_Clean(demo05_clean.six):              1.00ms   (34.00 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.45ms   (0.07 MB/s)
SixelRenderer_Clean(chess_clean.six):               1.40ms   (17.00 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             1.70ms   (19.00 MB/s)
SixelRenderer_Clean(space_clean.six):               0.85ms   (9.70 MB/s)


"Fringe" decoder, the 61 MiB/s version (static RGBA buffer with a fixed size of 1920x1080. Yeah, kinda insane, but this can happen when we make compilers really happy. And making compilers happy can make life miserable for a developer. Hence this decoder is not really usable):

Clean decoding, 20 rounds.
SixelRenderer_Clean(sem_clean.six):                 0.90ms  (99.97 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           0.70ms  (73.60 MB/s)
SixelRenderer_Clean(screen_clean.six):              0.70ms  (72.50 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             2.20ms  (143.40 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           12.00ms (168.40 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             3.50ms  (168.70 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.55ms  (52.91 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  74.00ms (210.30 MB/s)
SixelRenderer_Clean(oriole_clean.six):              0.50ms  (21.73 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.45ms  (11.53 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.45ms  (1.15 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            0.50ms  (28.75 MB/s)
SixelRenderer_Clean(michael_clean.six):             0.55ms  (35.07 MB/s)
SixelRenderer_Clean(trontank_clean.six):            0.50ms  (32.82 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    3.40ms  (189.30 MB/s)
SixelRenderer_Clean(time_clean.six):                1.30ms  (118.20 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.45ms  (22.98 MB/s)
SixelRenderer_Clean(biplane_clean.six):             0.60ms  (59.01 MB/s)
SixelRenderer_Clean(demo05_clean.six):              0.65ms  (51.93 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.45ms  (0.07 MB/s)
SixelRenderer_Clean(chess_clean.six):               0.55ms  (41.01 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             0.65ms  (48.84 MB/s)
SixelRenderer_Clean(space_clean.six):               0.45ms  (18.35 MB/s)

And wasm on my base testing machine which I mentioned in my earlier messages:

   Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.11 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.20 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.91 ms
            Case "decodeString" : 20 runs - average runtime: 5.32 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.86 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.84 ms
            Case "decodeString" : 20 runs - average runtime: 2.26 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.89 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.23 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 32.78 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.76 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.61 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.66 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 234.74 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 66.26 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.17 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 58.24 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.36 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 72.63 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.45 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 56.30 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 124.64 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 124.38 MB/s

and perf on sixel-bench

       Output:
       SixelRenderer(data.sixel):          2.90s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)

       Performance counter stats for './SixelBench' (4 runs):
     

      3.214,11 msec task-clock:u              #    0,998 CPUs utilized            ( +-  0,23% )
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
         4.940      page-faults:u             #    0,002 M/sec                    ( +-  0,03% )
11.763.759.200      cycles:u                  #    3,660 GHz                      ( +-  0,10% )  (83,12%)
   837.240.854      stalled-cycles-frontend:u #    7,12% frontend cycles idle     ( +-  0,26% )  (83,36%)
 2.512.438.090      stalled-cycles-backend:u  #   21,36% backend cycles idle      ( +-  1,25% )  (33,52%)
14.418.053.773      instructions:u            #    1,23  insn per cycle         
                                              #    0,17  stalled cycles per insn  ( +-  0,53% )  (50,13%)
 3.225.163.249      branches:u                # 1003,440 M/sec                    ( +-  0,17% )  (66,73%)
   112.190.910      branch-misses:u           #    3,48% of all branches          ( +-  0,35% )  (83,26%)

       # Table of individual measurements:
       3,24364 (+0,02208) #
       3,22007 (-0,00148) #
       3,20740 (-0,01416) #
       3,21511 (-0,00644) #

       # Final result:
       3,22156 +- 0,00781 seconds time elapsed  ( +-  0,24% )


I'll continue to explore the optimization strategies, part 2, with some more findings, tomorrow. Have a nice weekend.


ismail-yilmaz commented Jul 4, 2021

@jerch ,

Another treat: a modified version of your wasm decoder (I adapted the logic of mine to yours; it seems slightly faster, but contains less checking). You can compare it with the above benchmarks. (Warning: the patch code is ugly - by C++ standards, at least...)

Results:

 Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.20 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.29 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 5.01 ms
            Case "decodeString" : 20 runs - average runtime: 5.43 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.77 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.66 ms
            Case "decodeString" : 20 runs - average runtime: 2.17 ms
          Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.77 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.07 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.13 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.76 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.69 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.51 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 236.52 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 65.68 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.34 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 57.31 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.33 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 73.11 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.73 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 54.91 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 115.23 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 134.54 MB/s

Please find attached the modified code.
decoder.zip


ismail-yilmaz commented Jul 4, 2021

@jerch

I've made some corrections to my patch, and as a result it seems a bit faster than the previous version:

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.16 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.22 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.77 ms
            Case "decodeString" : 20 runs - average runtime: 5.59 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.84 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.59 ms
            Case "decodeString" : 20 runs - average runtime: 2.08 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.83 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.10 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.12 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.65 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 41.38 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.68 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.54 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 250.48 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 62.10 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.38 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 57.28 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.30 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 73.60 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.83 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 54.51 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 111.23 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 139.39 MB/s

The code can be easily modified and unrolled, which might let you gain even faster speeds, and test the assembly behavior for your final decoder.

decoder_corrected.zip

@jerch

jerch commented Jul 4, 2021

@ismail-yilmaz Hmm weird, with your last code I get slightly worse runtimes than with my latest optimizations (still have to clean up and push that). Btw in line 187 you do this:

		for(int n = 0; n < 6; ++n) {
			if(c & (1 << n))
				ps.canvas[p + ps.jump_offsets[n]] = ps.color;
		}

Replaced with:

		for(int n = 0; n < 6; ++n) {
				ps.canvas[((c >> n) & 1) * (p + ps.jump_offsets[n])] = ps.color;
		}

I get much higher throughput. It is again the "write always something" trick. Can you try whether this gives you higher numbers as well? (You said above that this was worse for you, so I am not sure whether it misuses some CPU cache behavior.)
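For reference, the "write always something" trick only works if index 0 of the canvas is reserved as a throwaway cell that absorbs the redirected writes. A minimal stand-alone sketch (buffer layout and names are illustrative, not taken from either decoder):

```c
#include <stdint.h>

/* Branchless sixel paint: bit n of the sixel selects between the real
   target cell and slot 0, which is reserved as a sacrificial dummy cell,
   so the loop body contains no branch. */
static void paint_branchless(uint32_t *canvas, const int *jump_offsets,
                             int p, int sixel, uint32_t color) {
    for (int n = 0; n < 6; ++n) {
        /* (bit == 0) -> index 0 (dummy), (bit == 1) -> real cell */
        canvas[((sixel >> n) & 1) * (p + jump_offsets[n])] = color;
    }
}
```

With a canvas laid out as `1 + width * 6` cells (slot 0 being the dummy), unset bits only ever dirty slot 0, which is what makes the store unconditional.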

Btw my numbers are constantly higher than yours, so your 42-ish number is more like 65 for my machine. Maybe CPU and bus speed differences can explain most of that, not sure if there are bigger differences in cache loading times between AMD and Intel, I also think that the different pipeline lengths will equal out in the end (well I have not really an informed opinion in that field anymore, lost interest in CPU differences around Pentium 4, which was known to have insane long pipelines).

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 4, 2021

@jerch ,

Sure (not much difference here, interesting...):

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.07 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.19 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.62 ms
            Case "decodeString" : 20 runs - average runtime: 5.38 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.74 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.58 ms
            Case "decodeString" : 20 runs - average runtime: 2.04 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.98 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.20 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 32.96 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.88 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.08 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.51 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.83 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 243.80 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 63.77 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.55 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 56.14 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.35 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 72.80 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 12.24 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 52.60 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 111.89 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 138.63 MB/s

(Because you said above that this is worse for you, so I am not sure, if it misuses some CPU cache settings).

Well, for one, it is because you are using a static integer array with a fixed size. Compilers usually vectorize the hell out of those (on x86). I allocate the memory on the heap (resizeable) and use an aligned RGBA structure. Integer operations can use registers, while setting an RGBA buffer (even when the lengths are the same) usually goes through memcpy. That's why my renderer's performance takes a hit if I "always write".

Here's a synthetic benchmark stressing the difference between RGBA/integer and static vs dynamic allocation:

OP         - 500 rounds of random color fill.
BUFFERSIZE - 1920 * 1080,
THROUGHPUT - Buffer<RGBA>  : 56.00 MiB/s
THROUGHPUT - Buffer<int>   : 58.00 MiB/s
TIMING     - Buffer<int>   : 17.06 s  - 34.12 ms (17.06 s  / 500 ), min: 33.00 ms, max: 42.00 ms, nesting: 0 - 500
TIMING     - Buffer<RGBA>  : 17.69 s  - 35.38 ms (17.69 s  / 500 ), min: 34.00 ms, max: 49.00 ms, nesting: 0 - 500


OP         - 100000 rounds of random fill.
BUFFERSIZE - 1920 * 1080,
THROUGHPUT - RGBA   buffer[]  : 5700.00 MiB/s
THROUGHPUT - uint32 buffer[]  : 6200.00 MiB/s
TIMING     - RGBA   buffer[]  :  7.72 ms - 77.20 ns (99.00 ms / 100000 ),  min:  0.00 ns, max:  1.00 ms, nesting: 0 - 100000
TIMING     - uint32 buffer[]  :  9.72 ms - 97.20 ns (101.00 ms / 100000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 100000
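The two cell layouts under comparison can be sketched roughly like this (struct name and fill helpers are made up for illustration; both cells are 4 bytes wide, so the loops move the same amount of memory and the difference comes down to the stores the compiler emits):

```c
#include <stdint.h>
#include <string.h>

/* An aligned 4-byte RGBA struct vs a raw uint32 word as the pixel cell.
   Filling either touches identical memory; the codegen differs. */
typedef struct RGBA { uint8_t r, g, b, a; } RGBA;

static void fill_rgba(RGBA *buf, size_t n, RGBA c) {
    for (size_t i = 0; i < n; ++i) buf[i] = c;  /* struct copy per cell */
}

static void fill_u32(uint32_t *buf, size_t n, uint32_t c) {
    for (size_t i = 0; i < n; ++i) buf[i] = c;  /* plain word store */
}
```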

if there are bigger differences in cache loading times between AMD and Intel

Possibly, because it appears that Athlon FX 6100 is somewhere in between i5 and i7, closer to i5.

@ismail-yilmaz
Owner Author

@jerch
Oops, I've made a mistake. Here are the real results with always write:

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.13 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.23 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.96 ms
            Case "decodeString" : 20 runs - average runtime: 5.56 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.87 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.51 ms
            Case "decodeString" : 20 runs - average runtime: 2.08 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 28.15 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 18.95 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.44 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.76 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.72 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.73 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.40 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 235.48 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 65.95 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.88 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 66.80 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.80 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 83.18 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 9.37 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 68.81 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 107.65 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 144.03 MB/s

@jerch

jerch commented Jul 4, 2021

@ismail-yilmaz Pushed my latest optimizations, which are actually the fastest with all proper state transitions. That's the one where I get >90 MB/s for all tests of the benchmark, and ~170 MB/s for the 12bit noise image (which should not be taken too seriously, as it is "very opinionated" in its data). The reduced state model is also contained as decode_ (so you can flip and test as well).

Edit: Ah I see, your last numbers are more like mine only showing a small gap now.

@jerch

jerch commented Jul 4, 2021

Yes the static nature is much easier to deal with, but thats all I need for the wasm instance. The level to get multiple instances here is the wasm container itself, thus one would just spawn multiple wasm decoders if needed.

Ofc with the pixels on the heap you can not get the pseudo cache locality from chunk bytes, as it could be allocated elsewhere (and the chunk bytes prolly live elsewhere as well).

@ismail-yilmaz
Owner Author

Yes the static nature is much easier to deal with, but thats all I need for the wasm instance. The level to get multiple instances here is the wasm container itself, thus one would just spawn multiple wasm decoders if needed.

Yeah, that's unfortunately not really affordable for me. That's why I'd better explore other options...

@jerch

jerch commented Jul 4, 2021

I did not like the static memory in the first place, because it creates tons of frictions for the higher level JS integration. But emscripten currently does not allow "memory.grow" for direct wasm builds (its simply not yet implemented lol), thus removing the allocators and going fully static was the easiest way for me to overcome the shortcomings. And since it shows better performance I am not really mad about it and will reshape the JS integration to suit that model.

Edit: But cant you do something alike - like allocating a bigger area once and do the memory semantics on your own within that area? I remember doing that for a game AI once, where I misused a bigger portion of stack memory with alloca resulting in a real big performance jump during the negamax descent. Ofc the stack is not suitable here, but maybe some own byte sink on the heap...
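The suggested "own memory semantics within one big area" amounts to a bump allocator; a minimal sketch under that assumption (all names and the alignment policy are hypothetical, not from either decoder):

```c
#include <stdint.h>
#include <stdlib.h>

/* One big allocation up front; sub-buffers are carved out of it with a
   bump pointer, so they end up close together without per-buffer malloc. */
typedef struct { uint8_t *base; size_t cap; size_t used; } Arena;

static int arena_init(Arena *a, size_t cap) {
    a->base = (uint8_t *) malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base != NULL;
}

static void *arena_alloc(Arena *a, size_t n, size_t align) {
    size_t p = (a->used + align - 1) & ~(align - 1);  /* round up */
    if (p + n > a->cap) return NULL;                  /* out of space */
    a->used = p + n;
    return a->base + p;
}
```

A nice property for a decoder: "freeing" everything between frames is just `a->used = 0`, no allocator roundtrip at all.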

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 4, 2021

@jerch

I'll update the node-sixel shortly and report the results later today.

Edit: But cant you do something alike - like allocating a bigger area once and do the memory semantics on your own within that area? I remember doing that for a game AI once, where I misused a bigger portion of stack memory with alloca resulting in a real big performance jump during the negamax descent. Ofc the stack is not suitable here, but maybe some own byte sink on the heap...

Of course. I already did some "cool/clumsy" tricks that "just work" and failed miserably at others, but they were not feasible in general, because the real problem is that our vte is a widget: it can, and should continue to, work on a variety of devices and backends, ranging from limited hobby hardware and Raspberry Pi to hi-perf desktops, on SDL/linuxfb and even in web browsers. That's another reason why I find your work on wasm and your opinions important and very productive for me. You see, I am also the co-author and maintainer of our HTML5 backend, which allows apps that use our framework to run inside web browsers (canvas + websockets). It is called Turtle. As its name suggests, it uses javascript :)) Thus I am also exploring the possibilities that webassembly can offer us. After all, our HTML backend boils down to image and input handling, which I believe can be satisfyingly optimized using webassembly.

@jerch

jerch commented Jul 4, 2021

Oh I see, well thats interesting. What JS engine do you use for that? If you want to go that route we prolly should share more general ideas about wasm and JS interactions.

I did some more testing before getting down to a wasm sixel decoder implementation, mainly around nasty tasks in xterm.js like base64 and utf8 decoding. Wasm beats all of my JS implementations by far (at least 2 times faster), and those are already among the fastest in JS land (all tested on V8). Furthermore - calling into wasm creates almost no overhead, if done right. Currently I think that wasm can be used to replace performance critical JS sections, even if called over and over on a hot path.
But note that I have only limited experience with other JS/wasm engines (did only surface tests with Firefox/spidermonkey and WebKit/JSScriptCore - both are a bit slower than V8 from what Ive seen so far).

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 4, 2021

@jerch

Oh I see, well thats interesting. What JS engine do you use for that? If you want to go that route we prolly should share more general ideas about wasm and JS interactions.

I did some more testing before getting down to a wasm sixel decoder implementation, mainly around nasty tasks in xterm.js like base64 and utf8 decoding. Wasm beats all of my JS implementations by far (at least 2 times faster), and those are already among the fastest in JS land (all tested on V8). Furthermore - calling into wasm creates almost no overhead, if done right. Currently I think that wasm can be used to replace performance critical JS sections, even if called over and over on a hot path.
But note that I have only limited experience with other JS/wasm engines (did only surface tests with Firefox/spidermonkey and WebKit/JSScriptCore - both are a bit slower than V8 from what Ive seen so far).

Well, turtle is pretty lightweight as it does not try to recreate a gui with JS. What it does is simply redirect the app window's output (and an associated background) to a web browser via a simple base64 encoded binary protocol. It gets decoded by javascript and displayed and key + mouse input are tracked. This all happens in a simple loop so it can work on all major browsers. So all it does is some image processing and blitting partial or complete images from a server application to a client web browser capable of canvas and web sockets. That's why I think it can be easily optimized using webassembly (I may be dead wrong of course.) To give you a clear idea of what I am talking about, here is an example. (Note that this is two years ago and neither TerminalCtrl nor turtle are as slow as this now.)

But then again, I don't really want to pollute further our sixel and wasm optimization discussion wih other stuff. Later when I implement this I'd love to discuss this and learn more about webassembly from your experience with it.

@jerch

jerch commented Jul 10, 2021

Oh thx for finding this one, yeah the unaligned load is a lot slower (had to get rid of it with __attribute__((aligned(16))) all over the place, which boosted native, but not wasm). Now I know why...
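For reference, the aligned-attribute fix mentioned here boils down to forcing the buffer alignment at declaration time, so the aligned load (_mm_load_si128) becomes legal on the chunk. Illustrative declaration, not the actual decoder state:

```c
#include <stdint.h>

/* Force 16-byte alignment on the chunk buffer so SSE aligned loads can
   be used instead of _mm_loadu_si128 (GCC/Clang attribute syntax). */
static uint8_t chunk[4096] __attribute__((aligned(16)));

static int chunk_is_aligned(void) {
    return ((uintptr_t) chunk & 15) == 0;
}
```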

@jerch

jerch commented Jul 10, 2021

@ismail-yilmaz Making baby steps towards SIMD parsing. Got the easy part working - slicing data at single state changing bytes (CR, LF, color/attribs/compression introducer). This high level lexing runs at 2GB/s native, and 700 MB/s in wasm (heavy usage of _mm_load_si128, no clue how to avoid that). It is ~8 times faster in native than the byte-by-byte loop, and still ~4 times faster under wasm.

The real challenge now is not to waste too many cycles on the actual parsing of the subchunk data. 😸
Basically I see there two options:

  • go with minimal copying - will run alot into unaligned SIMD instructions
  • move SIMD digestable bytes into aligned memory before processing them

Normally I am a big fan of less data moving, but the penalty of unaligned SIMD access seems rather high, so an additional copy step might help, if many SIMD instructions will follow. Also not sure if my SIMD-fu is good enough to get number/color parsing done with it, might just drop to the old code here (which is no biggy, as those are low on the runtime meter). Still the sixel data bytes will need special care, and better get moved over to SIMD paint in aligned batches.

Regarding parsing numbers, this might help: http://0x80.pl/articles/simd-parsing-int-sequences.html
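The core idea of the linked article can be sketched in SSE2 as follows: find the run of leading ASCII digits in a 16-byte block with two compares and a movemask, then accumulate the value with the usual scalar loop. Hypothetical helper, not from either codebase; the caller must guarantee 16 readable bytes at s:

```c
#include <emmintrin.h>  /* SSE2 */

/* Returns the length of the leading digit run (0..16) and writes the
   accumulated value to *out. */
static int parse_leading_int(const char *s, int *out) {
    __m128i data = _mm_loadu_si128((const __m128i *) s);
    __m128i ge_0 = _mm_cmpgt_epi8(data, _mm_set1_epi8('0' - 1));
    __m128i le_9 = _mm_cmplt_epi8(data, _mm_set1_epi8('9' + 1));
    int mask = _mm_movemask_epi8(_mm_and_si128(ge_0, le_9));
    int len = __builtin_ctz(~mask);          /* digit run length */
    int v = 0;
    for (int i = 0; i < len; ++i) v = v * 10 + (s[i] - '0');
    *out = v;
    return len;                              /* bytes consumed */
}
```

The scalar accumulation stays, but the digit/non-digit classification and run-length detection are done in one SIMD pass instead of per-byte branching.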

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 10, 2021

This high level lexing runs at 2GB/s native, and 700 MB/s in wasm (heavy usage of _mm_load_si128, no clue how to avoid that). It is ~8 times faster in native than the byte-by-byte loop, and still ~4 times faster under wasm.

Those numbers are impressive and sounds very promising!

Still the sixel data bytes will need special care, and better get moved over to SIMD paint in aligned batches.

Yeah this is what I have in mind too. For the next round, my strategy will be to focus on batching single sixels with their color information (map them), without modifying the original loop much, then flush them at once using SIMD where count % 4 == 0. I'd like to see the difference, for better or worse.

@jerch

jerch commented Jul 11, 2021

Did some thinking/modelling of parsing with SIMD. My idea is currently the following:

input          ['#n!nnsss#nnss#nns#nss...']
load as epi8   ['#n!nnsss#nnss#nn']
slice singles  color:['n'] compression:['nnsss'] color:['nnss'] color:['nn']

These are "typed" fragments containing:

  • color /color definition (+ sixel bytes)
  • compression (+ sixel bytes)
  • pure sixel bytes (as continuation of previous vector)

The first fragment might be continuing from the previous vector, thus it needs some sort of state carry. In the same sense the last fragment might not be finished yet. Not sure yet how to deal with that efficiently.

All other fragments are "final", thus have proper outer state boundaries and can be processed atomic. The main problem here I currently have is to efficiently identify, whether a fragment has follow-up sixel bytes or not. In SIMD that could be done as follows:

__m128i data = _mm_loadu_si128((const __m128i *) fragment);
__m128i lesser_63 = _mm_cmplt_epi8(data, _mm_set1_epi8(63));
__m128i lesser_127 = _mm_cmplt_epi8(data, _mm_set1_epi8(127));
__m128i sixels_marked = _mm_andnot_si128(lesser_63, lesser_127);

While this works, it creates a rather big penalty, and is not yet helpful in the later sixel processing.

And there is another problem with that fixed fragment idea - the biggest atomic directive the sixel format can have, is a non-disturbed color definition like:

#nnnn;2;xxx;xxx;xxx

This can only be processed with 128-bit SIMD in one go if we limit the register numbering to one digit, bummer. Means at least for this one it is not possible without looping again.
Furthermore sixel is a very relaxed format, thus any data can be disturbed by NOOP chars. Means we either need a pre-cleanup step, or have to do some sanitizing on the fly. Both are rather expensive. Also not sure yet how to deal with that.
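A possible shape for such a pre-cleanup step, sketched as scalar in-place compaction (the retained byte set is illustrative; a SIMD version would additionally need a shuffle-table based compaction driven by the movemask):

```c
/* Compact buf in place, dropping everything outside the byte classes a
   sixel parser cares about: sixel data (63..126), digits, separators and
   the introducers '#', '!', '$', '-'. Returns the new length. */
static int strip_noop(unsigned char *buf, int len) {
    int w = 0;
    for (int r = 0; r < len; ++r) {
        unsigned char c = buf[r];
        if ((c >= 63 && c <= 126) ||            /* sixel data bytes */
            (c >= '0' && c <= '9') || c == ';' ||
            c == '#' || c == '!' || c == '$' || c == '-')
            buf[w++] = c;
    }
    return w;
}
```

After one such pass the parser can assume a dense stream, which makes the fixed-width fragment slicing above much less fragile.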

@jerch

jerch commented Jul 11, 2021

Modified my top-level parser a bit, which makes it slower (at 605 MB/s native), but now it also precalculates the sixel byte offsets:

void decode(int length) {
  __m128i LF = _mm_set1_epi8(45);
  __m128i CR = _mm_set1_epi8(36);
  __m128i COLOR = _mm_set1_epi8(35);
  __m128i COMP = _mm_set1_epi8(33);

  int l = length / 16 * 16;
  int rem = length - l;
  for (int i = 0; i < l; i += 16) {
    __m128i part = _mm_load_si128((__m128i *) &ps.chunk[i]); // this is actually very slow in wasm!
    int start = i;
    int end = i;

    // test for singles: CR, LF, COMPRESSION and COLOR introducer
    __m128i testLF = _mm_cmpeq_epi8(part, LF);
    __m128i testCR = _mm_cmpeq_epi8(part, CR);
    __m128i testCRLF = _mm_or_si128(testLF, testCR);

    __m128i testCOLOR = _mm_cmpeq_epi8(part, COLOR);
    __m128i testCOMP = _mm_cmpeq_epi8(part, COMP);
    __m128i testCOLORCOMP = _mm_or_si128(testCOLOR, testCOMP);

    __m128i testSINGLE = _mm_or_si128(testCRLF, testCOLORCOMP);
    int hasSINGLE = _mm_movemask_epi8(testSINGLE);

    // identify sixels
    __m128i lesser_63 = _mm_cmplt_epi8(part, _mm_set1_epi8(63));
    __m128i lesser_127 = _mm_cmplt_epi8(part, _mm_set1_epi8(127));
    __m128i sixels_marked = _mm_andnot_si128(lesser_63, lesser_127);
    int sixelPos = _mm_movemask_epi8(sixels_marked);
    
    while (hasSINGLE) {
      // get LSB: __builtin_ctz (better with _tzcnt_u32, but not supported in wasm)
      int adv = __builtin_ctz(hasSINGLE);
      end = i + adv;
      if (end - start) parse_fragment(start, end, i + (sixelPos ? __builtin_ctz(sixelPos) : 16));
      handle_single(i + adv, ps.chunk[i + adv]);
      start = end + 1;
      hasSINGLE &= ~(1 << adv);
      sixelPos &= ~((1 << adv) - 1);
    }
    end = i + 16;
    if (end - start) parse_fragment(start, end, (sixelPos ? __builtin_ctz(sixelPos) : 16) + i);
  }

  // TODO: rem handling...
}

handle_single does the top level state switching for LF, CR, compression and color. This is pretty expensive because compression is handled here as well (throughput dropped from 2 GB/s to 800 MB/s by that); it might need to be handled separately, not sure yet. (There is one thing that would support this idea - simd_paint does not need to flush output at compression borders if the 4-pixel area is not yet full. Currently I cannot spot that from the fragment borders alone, leading to superfluous paint calls.)

Now parse_fragment gets its own start and end offsets, and the start index of the next sixel data (which might be in fragment range or above). With this precalc'ed sixel byte offsets I hope I can do faster aligned sixel-pixel writes. No clue yet if the first part of a fragment (color/compression numbers) should be done in SIMD as well, prolly yes.

Edit: Fixed lousy do-while loop in code.

@jerch

jerch commented Jul 11, 2021

@ismail-yilmaz The simd_paint code above is incomplete, it needs an additional ~bitmask operation on old before mixing, otherwise the OR'ing will blend with a previously set color (eg. from fillColor).

@jerch

jerch commented Jul 12, 2021

More findings about sixel parsing - my basic mixed sixel painter looks like this now:

inline void simd_paint(__m128i sixels, int cur) {
  __m128i colors = _mm_set1_epi32(ps.color);
  __m128i ones = _mm_set1_epi32(1);

  int p = cur * 4 + ps.offset;

  for (int i = 0; i < 6; ++i) {
    __m128i singles = _mm_and_si128(sixels, ones);
    __m128i bitmask = _mm_cmpeq_epi32(ones, singles);
    __m128i updated = _mm_and_si128(colors, bitmask);

    __m128i prev = _mm_load_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]]);
    __m128i keep = _mm_andnot_si128(bitmask, prev);

    __m128i final = _mm_or_si128(keep, updated);
    _mm_store_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]], final);

    sixels = _mm_srai_epi32(sixels, 1);
  }
}

It is meant to be fed with 4 consecutive sixels (minus 63) and the current 4-pixel aligned cursor position (128-bit progression). The color blending with the previous color is fixed with a & ~bitmask.
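For checking the blend semantics, a scalar mirror of that select logic might look like this (stand-alone sketch for diffing SIMD output against, not taken from the decoder):

```c
#include <stdint.h>

/* Per pixel and row: keep the previous canvas value unless the sixel bit
   is set - the scalar equivalent of keep = prev & ~bitmask, updated =
   color & bitmask, final = keep | updated. */
static void paint4_scalar(uint32_t *canvas, const int *jump_offsets,
                          int p, const int sixels[4], uint32_t color) {
    for (int row = 0; row < 6; ++row)
        for (int i = 0; i < 4; ++i)
            if ((sixels[i] >> row) & 1)
                canvas[p + jump_offsets[row] + i] = color;
}
```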

On caller level I found 2 ways being suitable to digest the mixed sixels:

  • additional "register array" holding vector components
    This was my first naive approach. It does an additional copy step, which is subpar, but turned out to be very flexible. It can literally be hooked on top of any byte-by-byte parser impl as follows:
    // defined somewhere (initialized on decode entry)
    int reg[4] __attribute__((aligned(16)));
    int cursor4;
    int has_data;
    ...
    // on sixel write
    reg[cursor & 3] = sixel;
    has_data |= sixel;
    ...
    // on cursor position change
    if (cursor4 != cursor / 4) {
      if (has_data) {
        simd_paint(_mm_setr_epi32(reg[0], reg[1], reg[2], reg[3]), cursor4);
        // or with _mm_load_si128((__m128i *) reg) (faster on native)
        has_data = 0;
        reg[0] = 0; reg[1] = 0; reg[2] = 0; reg[3] = 0;
      }
      cursor4 = cursor / 4;
    }
This scattered sixel collecting, writing only when cursor4 changes, reduces the paint calls by far. On the other hand the array resetting is rather expensive (can be optimized with int64 or int128 nulling). has_data is a needed optimization due to sixel's way of moving the cursor around with empty sixels; without it, nullish dummy paint calls would be triggered over and over.
  • faster local processing in sixel state
    Next I tried to speedup the sixel processing for multiple sixels. The fastest isolated version I've found, is a direct load into SIMD from aligned memory:
    __m128i zero = _mm_set1_epi8(0);
    ...
    __m128i chunk = _mm_load_si128((__m128i *) chunk_p);
    __m128i spead1 = _mm_unpacklo_epi8(chunk, zero);
    __m128i sixels = _mm_unpacklo_epi16(spead1, zero);
    ...
    simd_paint(sixels, cursor4);
    With shifting one can access the other bytes in chunk likewise. But this fastest solution only holds its promise, if the chunk data is properly aligned and you already have ruled/filtered out other bytes. Unless taking expensive pre-steps, thats not given for "natural" chunk data, in fact we have sixel bytes scattered all over the place behind color and compression directives at various positions. I started to investigate top level SIMD parsing with fragment slicing as described above, but thats still work in progress (currently 30% slower with painting applied).
    A better interim solution for the existing parsers might be, to benefit from the sixel byte state, thus we already know that the current byte describes a sixel, and the next one has a high chance to be a sixel byte as well, without taking chunk alignment into account. Here the fastest version I've found is a shift-or one:
    // pretest: check of code fits sixel range
    sixels = _mm_srli_si128(sixels, 4);
    if (code) {
      sixels = _mm_or_si128(sixels, _mm_set_epi32(code, 0, 0, 0));
      has_data |= code;
    }
This can be used for any sixels, but needs shift corrections at the beginning and end of the sixel bytes to move sixels in half-done 4-pixel ranges into the right spot (we still have to respect the pixel alignment).

The second variant is in general ~20% faster (prolly due to omitting the extra memory roundtrip of the register array), but degrades for images with highly scattered sixel bytes. The reason is obvious - if you always do the shift variant but only one sixel byte at a time comes in, you have to do the shift correction and dummy paints over and over. Imho a combination of both will help here in local sixel byte state:

  • Do the first unaligned sixels with array register (automatically deals with the "one sixel a time" issue).
  • Once you hit the cursor mod 4 border, switch to shift variant eagerly, hoping more sixels to read.
  • When leaving, either do a shift correction + paint or save sixels back to the register. Also measured this one - for cursor % 4 == 1 the register variant is faster, for 2 and 3 the instant paint is faster.

About paint performance:
Applying the register array variant to my fastest byte-by-byte parser gives ~20% boost (110 - 120 MB/s) and the shift variant ~30% for most images (120 - 130 MB/s). The paint runtime dropped from ~40% down to <20% total runtime, thus more than halved (did not count the instructions nor cycles for simd_paint vs. put_single, guess SIMD is >2 times lower). I did not yet try an optimized combination of both, but dont expect much higher values with this, just more sustainable numbers across different image data. There still might be perf gems hidden in the sixel to pixel paint path (esp. lowering the amount of dummy paints), currently I hope to settle around stable 130 MB/s on my machine with the byte-by-byte parser loop (all with wasm, native is ~15% faster already).

Things not yet covered:

  • SIMD paint for compressed sixels - not done yet (expected: 5-10% boost from your numbers)
  • SIMD number parsing for compression and colors - is that worth to be considered?
  • SIMD top level slicing - promising interim results as described above, but way to go to cover all edge cases in a speedy manner (expect this one to be >>200 MB/s once all is done)

Cheers 😸

Edit:
Oh sweet, just found out that wasm directly supports _mm_insert_epi32, although it is sse4.1. This makes both variants above obsolete, we can simply carry around a __m128i variable to stack things up and paint on cursor change. Will further speedup things. - That's actually not better, the dev guide even warns about these intrinsics showing poor performance...

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 12, 2021

@jerch Just to be clear: I can't comment or give feedback much these days because of my job right now.

I am reading your findings and really appreciate your hard work. Thus I think we should later gather these findings for a recommendation guide. Also maybe, just maybe, we can later use these findings to create a reference SIMD optimized encoder and decoder.

(Btw. Last night I have started to implement the parser in SIMD but now it has to wait for the next week until I can be free to finish & test it and give you some feedback. )

@jerch

jerch commented Jul 12, 2021

Ah no worries, no need to feel pushed or anything, to me this is mostly optimization for fun. If we get somewhere down to a ref decoder/encoder - awesome, but if not - I dont care much either.
Furthermore SIMD introduces machine dependencies, not even sure how that wasm thingy will do on an ARM (whether they abstracted/shimmed things away).

@jerch

jerch commented Jul 12, 2021

One more idea to further lower half-filled paint calls - by stacking the colors in a ringbuffer[4], we could lower paint calls to real 4-pixel progression (including CRLF jumps). Not sure if this will show a real benefit (or even run worse because of the ringbuffer overhead), as far as I know most encoder libs do a CR reset on a color change, not mixing colors from the current line cursor position onwards. Well, would need some investigation on typical sixel encoding schemes...

@jerch

jerch commented Jul 13, 2021

Some callgrind profiling data (shortened):

file: test1_clean.sixel
sixels:	380283
bytes:	619290

--------------------------------------------------------------------------------
        Ir        Dr        Dw  I1mr D1mr   D1mw ILmr  DLmr   DLmw 
--------------------------------------------------------------------------------

inline put_single:
32,406,294 3,793,752 2,740,265 1,988 29,0 151,80 1,88 7,867 68,950  PROGRAM TOTALS
28,843,839 3,216,772 2,262,066   85 4,806 78,876   85     3     16  ???:decode

inline simd_paint:
31,131,809 4,417,437 2,439,052 1,980 113,0 90,21 1,87 7,868 68,950  PROGRAM TOTALS
27,622,662 3,802,861 1,922,498   79 89,072 16,28   79     4     15  ???:decode

noinline put_single:
33,145,259 4,627,920 3,670,545 1,97 29,08 151,48 1,87 7,867 68,950  PROGRAM TOTALS
15,212,388 2,151,142 1,292,548   69 4,836 15,803   69     3     16  ???:decode.part
14,370,416 1,899,798 1,899,798    4     0 62,724    3     .      .  ???:put_single(int, int)

noinline simd_paint (with prepare_paint helper):
32,095,566 4,907,242 2,565,930 1,97 113,59 90,17 1,874 7,86 68,950  PROGRAM TOTALS
16,726,671 1,835,401 1,093,683   64  6,272 15,24   64     3     15  ???:decode
11,261,790 2,377,489   875,917    6 83,365     0    6     1      .  ???:prepare_paint(int, int)
   597,958    79,776    79,776    4      0   954    4     .      .  ???:put_single(int, int)

noinline simd_paint (optimization - avoid dummy paints):
29,182,602 4,320,719 2,380,693 1,97 102,42 92,29 1,87 7,868 68,949  PROGRAM TOTALS
15,910,154 1,856,080 1,020,775   22  5,775    36   22     3     15  ???:decode
 8,271,960 1,711,440   665,560    6 72,695     0    6     .      .  ???:prepare_paint(int, int)
   798,301    58,846    98,027   39      6 15,94   39     .      .  ???:put(int, int, int)
   597,958    79,776    79,776    4      0  2,83    4     .      .  ???:put_single(int, int)
  • Instructions/sixel for single paints dropped from ~38 to ~22. One SIMD paint now calculates to ~88 instructions, which is pretty close to my rough estimate from the code (84 instructions), meaning we have literally no dummy paints anymore. 22 per sixel byte still sounds pretty high, but I currently see no way to make that any faster.
  • Instructions/byte for non-sixel bytes is really high at ~90, guess we found the slowpoke 😸

@jerch

jerch commented Jul 14, 2021

Started to wonder, how about a simple thing like the RGB color conversion, thus went on and tried SIMD versions of it:

int normalize_rgb(int r, int g, int b) {
  return 0xFF000000 | ((b * 255 + 99) / 100) << 16 | ((g * 255 + 99) / 100) << 8 | ((r * 255 + 99) / 100);
}

int normalize_rgb_simd_int(int r, int g, int b) {
  // algo: ((x * 255 + 99) * 0xA3D7 + 0x8000) >> 22
  __m128i reg = _mm_set_epi32(r, g, b, 100);
  reg = _mm_mullo_epi32(reg, _mm_set1_epi32(255));
  reg = _mm_add_epi32(reg, _mm_set1_epi32(99));
  reg = _mm_mullo_epi32(reg, _mm_set1_epi32(0xA3D7));
  reg = _mm_add_epi32(reg, _mm_set1_epi32(0x8000));
  reg = _mm_srli_epi32(reg, 22);
  __m128i result = _mm_shuffle_epi8(reg, _mm_set_epi8(
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x00, 0x04, 0x08, 0x0C
  ));
  return _mm_cvtsi128_si32(result);
}

int normalize_rgb_simd_float(int r, int g, int b) {
  __m128 reg = _mm_set_ps(r, g, b, 100);
  reg = _mm_mul_ps(reg, _mm_set1_ps(2.55f));
  reg = _mm_round_ps(reg, _MM_FROUND_TO_NEAREST_INT);
  __m128i result = _mm_cvtps_epi32(reg);
  result = _mm_shuffle_epi8(result, _mm_set_epi8(
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x00, 0x04, 0x08, 0x0C
  ));
  return _mm_cvtsi128_si32(result);
}

and here is the asm for those (gcc 11) with cycle count in front (taken from intel optimization guide):

  normalize_rgb(int, int, int):
1         mov     r8d, edx
1         mov     edx, esi
1         sal     edx, 8
1         sub     edx, esi
1         add     edx, 99
1         movsx   rax, edx
1         sar     edx, 31
3         imul    rax, rax, 1374389535
1         sar     rax, 37
1         sub     eax, edx
1         mov     edx, edi
1         sal     edx, 8
1         sal     eax, 8
1         sub     edx, edi
1         add     edx, 99
1         movsx   rcx, edx
1         sar     edx, 31
3         imul    rcx, rcx, 1374389535
1         sar     rcx, 37
1         sub     ecx, edx
1         or      eax, ecx
1         mov     ecx, r8d
1         sal     ecx, 8
1         sub     ecx, r8d
1         add     ecx, 99
1         movsx   rdx, ecx
1         sar     ecx, 31
3         imul    rdx, rdx, 1374389535
1         sar     rdx, 37
1         sub     edx, ecx
1         sal     edx, 16
1         or      eax, edx
1         or      eax, -16777216
39        ret

 normalize_rgb_simd_int(int, int, int):
1         mov     eax, 100
1         movd    xmm0, esi
1         movd    xmm1, eax
2         pinsrd  xmm0, edi, 1
2         pinsrd  xmm1, edx, 1
1         punpcklqdq      xmm1, xmm0
1         movdqa  xmm0, xmm1
1         pslld   xmm0, 8
1         psubd   xmm0, xmm1
1         paddd   xmm0, XMMWORD PTR .LC0[rip]
10        pmulld  xmm0, XMMWORD PTR .LC1[rip]
1         paddd   xmm0, XMMWORD PTR .LC2[rip]
1         psrld   xmm0, 22
1         pshufb  xmm0, XMMWORD PTR .LC3[rip]
1         movd    eax, xmm0
26        ret

 normalize_rgb_simd_float(int, int, int):
1         pxor    xmm2, xmm2
1         pxor    xmm1, xmm1
1         pxor    xmm3, xmm3
1         movss   xmm0, DWORD PTR .LC4[rip]
5         cvtsi2ss        xmm2, edx
5         cvtsi2ss        xmm1, esi
5         cvtsi2ss        xmm3, edi
1         unpcklps        xmm0, xmm2
1         unpcklps        xmm1, xmm3
1         movlhps xmm0, xmm1
5         mulps   xmm0, XMMWORD PTR .LC5[rip]
6         roundps xmm0, xmm0, 0
3         cvtps2dq        xmm0, xmm0
1         pshufb  xmm0, XMMWORD PTR .LC3[rip]
1         movd    eax, xmm0
38        ret

Performance: the normal version wins, simd_int is 2% slower, simd_float is 5% slower. This looks weird when only considering the cycle counts, but makes sense when taking parallel µ-ops into account. It will change if the channel values are not provided in general-purpose registers but already sit in SIMD registers (the alignment and preparation steps are currently quite expensive). Comparing simd_int to simd_float, the int variant of the "inner algo" is one cycle faster (15 vs. 16 cycles), even with the nasty pmulld.

Edit: the float version drops down to 20 cycles if already fed with floats (and currently wins the race in the decoder). I still think the int variant will be the winner in the end, as the values are not meant to be floats. I really wonder if I could get rid of that pmulld somehow...
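As a side note, the fixed-point constants from the `// algo` comment above can be verified in plain scalar code; this check (my addition, not from the thread) confirms that the multiply-shift reproduces the exact integer division by 100 for every value the channel conversion can produce:

```c
#include <assert.h>

/* (v * 0xA3D7 + 0x8000) >> 22 implements v / 100 exactly for the range
 * that occurs in the color conversion (v = x * 255 + 99, x in 0..100).
 * 0xA3D7 = 41943 ~ 2^22 / 100; the +0x8000 compensates the rounding error. */
static int div100_fixed(unsigned v) {
    return (int)((v * 0xA3D7u + 0x8000u) >> 22);
}

static int channels_ok(void) {
    for (int x = 0; x <= 100; ++x) {   /* sixel channel values are 0..100 */
        unsigned v = (unsigned)(x * 255 + 99);
        if (div100_fixed(v) != (int)(v / 100))
            return 0;
    }
    return 1;
}
```

The maximum input is 100 * 255 + 99 = 25599, so the intermediate product stays far below 32-bit overflow.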

@jerch

jerch commented Jul 16, 2021

@ismail-yilmaz I kinda give up on the high level SIMD tokenization. I have tried 3 different approaches now (with the last being the fastest, but still 25% slower than my fastest byte-by-byte loop). It turns out that the high fragmentation of sixel data needs many state changes for one 16 byte slice (one register load), which creates lots of single byte extractions (cumbersome with SIMD) and alignment corrections on number and sixel bytes (the only ones that are consecutive to some degree). But even those consecutive bytes cannot be processed directly in SIMD, as they need special preparations:

  • sixel bytes must be padded to 32bit and aligned to the 4-pixel progression
  • number bytes must be padded to 16bit to use _mm_madd_epi16, and different digit counts need different code paths
    The number conversion with madd is indeed much faster for 3+-digit numbers (3 times faster), but falls short for 1-digit numbers (4 times slower). The additional need for branching over the number of digits introduces another nonsense penalty, esp. in wasm (that digit switch made the wasm build 30% slower). Ofc chances are high that I overlooked something here, yet sixel data often contains 1- and 2-digit numbers, thus I think there is not much to be gained here.
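For illustration, a madd-based conversion for exactly 4 digits could look like the following sketch (my own code, assuming an x86 target with SSE2; the digits are taken as already unpacked to 16-bit lanes, which is where the padding cost mentioned above comes from):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */

/* Convert 4 decimal digits (most significant first) with one
 * _mm_madd_epi16: the madd collapses digit pairs into d0*10+d1 and
 * d2*10+d3; a final scalar step combines the two pairs. */
static int parse4_madd(short d0, short d1, short d2, short d3) {
    /* _mm_set_epi16 takes lanes high-to-low, so lane 0 ends up as d0 */
    __m128i digits  = _mm_set_epi16(0, 0, 0, 0, d3, d2, d1, d0);
    __m128i weights = _mm_set_epi16(0, 0, 0, 0, 1, 10, 1, 10);
    __m128i pairs   = _mm_madd_epi16(digits, weights);
    int hi = _mm_cvtsi128_si32(pairs);                     /* d0*10 + d1 */
    int lo = _mm_cvtsi128_si32(_mm_srli_si128(pairs, 4));  /* d2*10 + d3 */
    return hi * 100 + lo;
}
```

Shorter numbers can be handled by zero-padding the leading lanes, but detecting the digit count first is exactly the branching penalty described above.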

Nonetheless here is my last and fastest top level SIMD attempt, in case you want to play with it or have better idea, how to approach the fragment data:

typedef union Vec128 {
  __m128i  vector;
  long long i64[2];
  int i32[4];
  short i16[8];
  char byte[16];
} Vec128;

inline void handle_unclear(char code) {
  switch (code) {  // <---- that's the speed showstopper...
    case '!':
      ps.state = ST_COMPRESSION;
      break;
    case '#':
      ps.state = ST_COLOR;
      break;
    case '$':
      ps.cursor = 0;
      break;
    case '-':
      ps.y_offset += 6;
      ps.offset = ps.y_offset * ps.width + 16;
      ps.cursor = 0;
      break;
    case ';':
      break;
  }
}

static int sixels_or_numbers_counter = 0;
inline void handle_sixels_or_numbers(Vec128 reg, int start, int end, int number_bits, int sixel_bits) {
  // to not get removed by optimization
  sixels_or_numbers_counter++;

  // follow-up code removed for less cluttering
}

// like BLSR, but not available in wasm
inline int clear_bits_below(int x) {
  return x & (x - 1);
}

inline __m128i mask_sixels(const __m128i input) {
  const __m128i tmp = _mm_add_epi8(input, _mm_set1_epi8(65));
  return _mm_cmplt_epi8(tmp, _mm_set1_epi8(-64));
}

inline __m128i mask_numbers(const __m128i input) {
  const __m128i tmp = _mm_add_epi8(input, _mm_set1_epi8(80));
  return _mm_cmplt_epi8(tmp, _mm_set1_epi8(-118));
}

void decode____(int length) {
  Vec128 reg;

  int l = length / 16 * 16;
  int rem = length - l;
  if (rem) {
    for (int k = l + rem; k < l + 16; ++k) ps.chunk[k] = 0;
    l += 16;
  }
  for (int i = 0; i < l; i += 16) {
    reg.vector = _mm_lddqu_si128((__m128i *) &ps.chunk[i]);

    // strip high bit
    reg.vector = _mm_and_si128(reg.vector, _mm_set1_epi8(0x7F));

    // identify sixel & numbers
    int sixels = _mm_movemask_epi8(mask_sixels(reg.vector));
    int numbers = _mm_movemask_epi8(mask_numbers(reg.vector));

    // identify unclear bytes
    int unclear = 0xFFFF & ~(sixels | numbers);

    int pos = 0;
    while (unclear) {
      int adv = __builtin_ctz(unclear);
      if (pos < adv) {
        handle_sixels_or_numbers(reg, pos, adv, numbers, sixels);
      }
      handle_unclear(reg.byte[adv]);
      pos = adv + 1;
      unclear = clear_bits_below(unclear);
    }
    if (pos < 16) {
      handle_sixels_or_numbers(reg, pos, 16, numbers, sixels);
    }
  }
}

This top level SIMD loop needs a lot fewer instructions per 16-byte slice than my first version, and groups slices into unclear and sixels_or_numbers. The __builtin_ctz is problematic for machines without native TZCNT (< Haswell, I think); typically it gets triggered around 2 - 6 times for every 16 bytes, from my data analysis. Everything else is a piece of cake in that main loop (throughput ~3 GB/s).
The real perf showstopper is the switch statement in handle_unclear (throughput drops down to 700 MB/s). At first I wondered if the byte extraction in reg.byte[adv] is the problem, but replacing the switch with a simple add on a static variable restores most of the speed. It seems the CPU doesn't like the branching here. Maybe you have a good idea how to prevent that perf decrease; I didn't find one so far, the code paths are way too different to think about a branch-less version.

When applying the real data parsing (color/compression/sixels) down the code paths, things get ugly due to extra work needed for padding and alignment (not shown above). So far the only ground that shows real speed improvement with SIMD is the sixel-to-pixel path.

I gonna stop those top level SIMD attempts for now, until you have a fundamentally better idea how to approach the data. In the meantime I gonna try to micro-optimize my byte-by-byte loop (repeated SIMD painting is still missing). The last trick I have in mind is to reduce the dummy paints* to almost 0, at least across a compression-sixels-compression progression (not really useful across color changes, as those almost always contain cursor movements with CR). Looking at the real asm output, simd_paint is only 72 instructions, which means my calculated 88 instructions still contain 20% dummy calls. Getting rid of those should give another 3-5% speedup. That's not much, but would give me 130-160 MB/s for most images I have for testing here, which is still good compared to 92-95 MB/s without SIMD. After that change I am out of ideas. 😸


[*] You might have wondered in my last posts about "dummy paints". I call any paint a "dummy paint", if the 4-pixel range is underfull, thus sixels are missing, which create additional "nonsense" paints with simd_paint. Example:

$!5abc!3def creates the following:

  • repeated paint a 5 times --> 2 4-pixel aligned paints: [a, a, a, a], [a, 0, 0, 0]
  • paint bc --> 1 4-pixel aligned paint: [0, b, c, 0]
  • repeated paint d 3 times --> 2 4-pixel aligned paints: [0, 0, 0, d], [d, d, 0, 0]
  • paint ef --> 1 4-pixel aligned paint: [0, 0, e, f]

In total simd_paint gets called 6 times, but only writes [a, a, a, a], [a, b, c, d], [d, d, e, f] in the end (3 calls would do), thus we have 3 dummy paints.
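That bookkeeping is easy to reproduce: a paint is issued for every 4-aligned chunk a run touches, so the call count per run is just the number of touched chunks (a sketch for illustration, not decoder code):

```c
#include <assert.h>

/* count the 4-pixel aligned paint calls a run of `len` sixels starting
 * at cursor position `start` produces (each touched chunk = one paint) */
static int paints_for_run(int start, int len) {
    int end = start + len - 1;
    return end / 4 - start / 4 + 1;
}

/* "$!5abc!3def": runs at cursor 0 (len 5), 5 (len 2), 7 (len 3), 10 (len 2) */
static int example_paints(void) {
    return paints_for_run(0, 5)    /* [a,a,a,a], [a,0,0,0] -> 2 */
         + paints_for_run(5, 2)    /* [0,b,c,0]            -> 1 */
         + paints_for_run(7, 3)    /* [0,0,0,d], [d,d,0,0] -> 2 */
         + paints_for_run(10, 2);  /* [0,0,e,f]            -> 1 */
}
```

An ideal decoder would need only ceil(12 / 4) = 3 paints for these 12 pixels, so half of the 6 calls are dummy paints.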

@jerch

jerch commented Jul 17, 2021

Did some more local changes to my byte-to-byte loop (mostly introducing inner loops with less conditions):

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.85 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 127.80 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 116.20 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.41 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 147.54 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 84.26 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 184.07 MB/s

This gets me to 39-40 MB/s synchronous in xterm.js. I think I gonna stop with the optimizations here, as the numbers in xterm.js start to look funny (one run of sixel-bench):
[screenshot: xterm.js profiler numbers for one sixel-bench run]

(The utf8 conversion does no more than a uint8 --> uint32 padding expansion (xterm.js works with utf32 internally), and the sequence parser already has a fast inner loop with only 3 conditions to check for DCS data. I hate JS for being so slow on such simple tasks. Would not have happened with wasm.)
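For clarity, that padding expansion just widens every byte into its own 32-bit slot - a trivial loop like this C sketch of mine (valid for 7-bit sixel data, where every byte is its own codepoint):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* widen raw input bytes to 32-bit code units (UTF-32 style), as done
 * before feeding data into a utf32-based sequence parser */
static void widen_u8_to_u32(const uint8_t *src, uint32_t *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```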

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 17, 2021

@jerch ,

This gets me to 39-40 MB/s synchronous in xterm.js.

That sounds like a very good number for a JS sixel decoder (or so I assume, because the only one I know of in the wild is yours).

I'll take your findings and try moving them forward on a test setup, starting this Monday, as I will have a window to work on something fun (we'll have a national holiday season this whole week). Will post my findings on x86_64. So, stay tuned. :)

@jerch

jerch commented Jul 20, 2021

Latest numbers:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.68 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 134.27 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.58 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 124.60 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.79 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 142.69 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 77.71 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 199.62 MB/s

While I missed my >130 MB/s goal with "test2_clean.sixel" (not even sure why), I got a nice boost on the fullHD noise image, which is now at 12 fps. That's still very low; on the other hand the image is really hard to chew on, with its 4096 colors and the single pixel addressing (the cursor kinda moves from 0 to the target position for tons of 1-bit sixels). I have no test data for normal fullHD stuff, but I think the decoder is capable of staying >30 fps for less degenerate data.

@jerch

jerch commented Jul 22, 2021

@ismail-yilmaz Not sure how well you know callgrind; I have an interesting case which I don't quite understand, maybe you have an idea or can explain what's going on.

Playing around with the byte-by-byte loop I found this loop schematics to be the fastest:

int decode(int length) {
  for (int i = 0; i < length; ++i) {
    int code = ps.chunk[i] & 0x7F;
    switch (ps.state) {
      case ST_DATA:
        while (62 < code && code < 127) {
          ...  // process sixel bytes
          if (++i >= length) break;  // <--------
          code = ps.chunk[i] & 0x7F;
        }
        break;
      case ST_COMPRESSION:
        while (47 < code && code < 58) {
          ...  // process number bytes
          if (++i >= length) break;  // <--------
          code = ps.chunk[i] & 0x7F;
        }
        break;
    }
  }
}

It basically sub-loops at all positions where multiple occurrences are possible, skipping the outer loop and switch statement. Ofc now I need those inner break conditions on length (marked with <-----).

Now looking into callgrind I see many instructions being burnt on those length-break checks. Thus I went further and restructured the break conditions as outer fall-through:

int decode(int length) {
  ps.chunk[length] = 0xFF;  // new fall-through break condition in data
  for (int i = 0; i < length; ++i) {
    int code = ps.chunk[i] & 0x7F;
    switch (ps.state) {
      case ST_DATA:
        while (62 < code && code < 127) {
          ...  // process sixel bytes
          code = ps.chunk[++i] & 0x7F;
        }
        break;
      case ST_COMPRESSION:
        while (47 < code && code < 58) {
          ...  // process number bytes
          code = ps.chunk[++i] & 0x7F;
        }
        break;
    }
  }  // 0x7F in code will reach here as NOOP breaking the for loop eventually on next i < length check
}

This works great and reduces the counted instructions for test1_clean.sixel by almost 4%! But the runtime did not change at all for native, and in wasm it is even ~2% slower. Any clue what's going on? Is the instruction counter flawed in callgrind? Or is this some sort of pipeline effect, where those excess instructions are hidden as µ-ops not altering the whole pipeline?

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 22, 2021

@jerch

Or is this some sort of a pipeline effect, where those excess instructions are hidden as uop not altering whole pipeline?

I haven't checked it yet (I will), but it is likely a pipeline effect (on x86, at least):

Remember this? That is exactly how I process parameters (and also how I optimized my sequence parser) on x86_64. In almost all scenarios the while loop works faster than the for loop for operations that are likely to loop more than once.

E.g. two styles of reading parameter chunks (input size: 95.37 MB):

For loop:

int GetNum1(const String& s)
{
	int n = 0;
	for(const int& c : s) {
		if(dword(c - '0') > 9)
			break;
		 n = n * 10 + (c - '0');
	}
	
	return n;
}

While loop

int GetNum2(const String& s)
{
	const char *p = ~s;
	int c = 0, n = 0;
	while(*p && dword((c = *p++) - '0') < 10)
		n = n * 10 + (c - '0');

	return n;
}

Timing (20 rounds, O3):

TIMING GetNum2 (while):  2.75 s  - 137.50 ms ( 2.75 s  / 20 ), min: 135.00 ms, max: 141.00 ms, nesting: 0 - 20
TIMING GetNum1 (for)    :  2.94 s  - 147.20 ms ( 2.94 s  / 20 ), min: 144.00 ms, max: 151.00 ms, nesting: 0 - 20

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 22, 2021

The problem with this approach, in my experience at least, is that it performs worse with single sixel paints (branch misses tend to be high, and winding/unwinding the stack for the loop at that "hot" point has a greater cost than its potential gains).

@jerch

jerch commented Jul 22, 2021

The problem with this approach, in my experience at least, is it performs worse with single sixel paints (branch misses tends to be high and winding/unwinding the stack for the loop at that point has a greater cost than its gain.)

Yes, that's true, but the image triggers that case only 157 times. Overall the sub-loops are much faster in wasm (not so much in native). Ofc the stack unwinding back from the inner switch case might be much more expensive here (br_table moves quite some stuff on the stack in wasm). Well, I don't have something like callgrind for wasm, thus I can only profile things indirectly.

Nonetheless I will try to reshape the outer thing into a while loop as well; it is a bit annoying that I cannot use pointer stuff that easily in wasm, no clue why I get such a bad penalty for using it.

@jerch

jerch commented Jul 23, 2021

@ismail-yilmaz Note there is an issue with your while loop above - your loop condition tests for *p not being zero. Imho you cannot assume null termination here. Although NULL in terminal data is a NOOP for all states I know, it does not mark end of data by any means. And the parser from vt100.net would pass NULL along, thus it has to be handled by the subparser (sixel decoder here).

(Some background on this: NULL was used by some very early terminals as "time padding bytes" (give the device some time to keep up), thus it is theoretically possible to see cascades of NULLs in old data.)

@ismail-yilmaz
Owner Author

ismail-yilmaz commented Jul 23, 2021

@jerch ,

Ah, that's just a synthetic test loop to display the difference; the actual parser is based on Upp::Stream and uses Stream::Get() and Stream::Peek(), which return -1 on eof or stream error (meaning 0 is considered a valid byte).

Besides, that loop can be used to parse the parameters in my use cases, because our parser first collects and then splits the parsed valid parameter chunks into a vector of Strings. Strings are null terminated, so at that point it is guaranteed that there is no '\0' in the string except for the null terminator.

@jerch

jerch commented Jul 28, 2021

Another optimization idea I had while looking at the callgrind metrics - currently simd_paint puts colors for bits into the final canvas target cells, which creates a rather high cache miss rate for the load/store ops (~7%; the reason is the massive cell offset created by the 6-pixel progression in y-direction, plus the loading/blending with previous colors). This could be avoided by doing a local construction with contiguous writes:

input sixel: [a, b, c, d]
local canvas line writes: [[a1, b1, c1, d1], ..., [a6, b6, c6, d6]]

and a final copy step on LF:

memcpy: *[a1-d1] --> canvas_offset_1
...
memcpy: *[a6-d6] --> canvas_offset_6

This has several advantages in terms of cache and SIMD utilisation:

  • load/blend/write dance during construction has much better cache locality due to contiguous mem access
  • writes are guaranteed to be 4 * 32bit aligned (good for simd load/store)
  • final copy step is still cache-unfriendly, but can be done in steps with high cache utilization (e.g. in 16 pixel progression)
  • final copy step is a "write-through", no loading/blending needed anymore

Ofc this has a major drawback - the additional copy step itself, which might just be more expensive than the savings from better cache utilization. That remains uncertain until actually tried.
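A minimal sketch of that local construction (my own illustration with hypothetical names; 32-bit RGBA pixels and a fill color of 0 assumed):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_W 1024   /* hypothetical max line width for the sketch */

/* small local construction buffer: 6 contiguous pixel rows */
static uint32_t line[6][MAX_W];

/* write one sixel (6 vertical bits) at cursor x - all writes land in
 * the small `line` buffer with good cache locality */
static void put_sixel(int x, int bits, uint32_t color) {
    for (int row = 0; row < 6; ++row)
        if (bits & (1 << row))
            line[row][x] = color;
}

/* on '-' (LF): write-through copy of the 6 finished rows into the big
 * canvas, then reset the local buffer for the next sixel line */
static void flush_line(uint32_t *canvas, int canvas_width,
                       int y_offset, int used_width) {
    for (int row = 0; row < 6; ++row) {
        memcpy(canvas + (size_t)(y_offset + row) * canvas_width,
               line[row], (size_t)used_width * sizeof(uint32_t));
        memset(line[row], 0, (size_t)used_width * sizeof(uint32_t));
    }
}
```

The construction writes stay inside ~24 KB regardless of image height; only the per-line memcpy touches the far canvas memory, and it does so as a pure write-through with no load/blend cycle.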

Another similar idea would be to spread sixel bits into SIMD registers (horizontal spreading), instead of the current shift-or looping (vertical spreading):

input sixel: [a, b, c, d]
local canvas line writes: [[a1, a2, a3, a4], [a5, a6, b1, b2], ..., [d3, d4, d5, d6]]

Well, I don't expect that to be any faster, for 3 reasons:

  • bit spreading is quite cumbersome with SIMD (needs several shuffle ops)
  • 6 bits of sixel don't align well with 4 * 32bit in registers - needs branching/looping over [a1-a4], [a5-a6, b1-b2], [b3-b6] writes, with an expensive construction step for the interleaved 2nd case
  • final copy step gets really awkward and prolly really expensive (has to extract/shuffle bit positions into contiguous order)

Not sure if I will try any of these ideas; they need quite some canvas construction refactoring before they would work at all...

@jerch

jerch commented Jul 30, 2021

Some more microoptimizations here and there:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.16 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 149.45 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.42 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 132.83 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.05 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 161.45 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 67.41 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 230.24 MB/s

Also tested against 5 fullHD screenshots, they all run at 40-50 FPS, and the degraded noise example is now at 15 FPS.

Furthermore I partially tested the first idea from my last post - well, it is not worth the trouble. A cache optimized load/store in paint_simd gives only a 3% boost, which gets more than eaten by the final canvas construction/copy, which suffers from the same cache problem (less often, but still taking >5%).

Will push the code once I have cleaned it up a bit.

@jerch

jerch commented Aug 2, 2021

Extended the benchmark script with more image relevant metrics to get some more explanation of bottlenecks:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 50 runs - average runtime: 3.68 ms
         Case "test1_clean.sixel" : 50 runs - average throughput: 164.39 MB/s
          --> image throughput { FPS: '271.50', PPS: '250.21 M', pixelWrite: '1.00 GB/s' }
         Case "test2_clean.sixel" : 50 runs - average runtime: 2.15 ms
         Case "test2_clean.sixel" : 50 runs - average throughput: 148.45 MB/s
          --> image throughput { FPS: '465.29', PPS: '104.46 M', pixelWrite: '417.86 MB/s' }
         Case "sampsa_reencoded_clean.six" : 50 runs - average runtime: 3.60 ms
         Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 181.07 MB/s
          --> image throughput { FPS: '277.93', PPS: '305.00 M', pixelWrite: '1.22 GB/s' }
         Case "FullHD 12bit noise" : 50 runs - average runtime: 64.36 ms
         Case "FullHD 12bit noise" : 50 runs - average throughput: 240.98 MB/s
          --> image throughput { FPS: '15.54', PPS: '32.22 M', pixelWrite: '128.87 MB/s' }
         Case "640x480 9bit tiles" : 50 runs - average runtime: 0.67 ms
         Case "640x480 9bit tiles" : 50 runs - average throughput: 151.24 MB/s
          --> image throughput { FPS: '1494.92', PPS: '459.24 M', pixelWrite: '1.84 GB/s' }
         Case "FullHD 1" : 50 runs - average runtime: 20.36 ms
         Case "FullHD 1" : 50 runs - average throughput: 166.72 MB/s
          --> image throughput { FPS: '49.11', PPS: '101.84 M', pixelWrite: '407.36 MB/s' }
         Case "FullHD 2" : 50 runs - average runtime: 22.69 ms
         Case "FullHD 2" : 50 runs - average throughput: 167.93 MB/s
          --> image throughput { FPS: '44.07', PPS: '91.38 M', pixelWrite: '365.50 MB/s' }

FPS is frame per second (here full images decoded per second), PPS is pixels per second and pixelWrite the effective pixel write memory throughput.

To make sense of these numbers, we need to know some facts about the images itself:

  • real world images:
    • test1: palette 256, RGB encoded, no transparency, dimensions in 4^n, encoder node-sixel
    • test2: palette 256, RGB encoded, no transparency, dimensions not in 4^n, encoder node-sixel
    • sampsa: palette 256, RGB encoded (monochrome), no transparency, dimensions in 4^n, encoder node-sixel
    • FullHD 1 & 2: palette 256, RGB encoded, no transparency, dimensions in 4^n, encoder img2sixel
  • constructed images:
    • worst case - FullHD 12bit noise: palette 4096, RGB encoded, no transparency, dimensions in 4^n, degraded with tons of CR repositioning and single sixel bits
    • best case - 640x480 9bit tiles: palette 512, RGB encoded, no transparency, dimensions in 4^n, compressed sixels with high bit density (10x10 color tiles)

Interpretation:

  • Real world images that did not see special care seem to settle around ~100M PPS or 400 MB/s effective pixel write throughput (that is "test2", "FullHD 1", "FullHD 2"). Still sounds rather lowish to me.
  • "test1" runs much better than the other real images, maybe some sort of over-optimization (it is my standard profiling image).
  • No clue yet, why "sampsa" runs that much better (only difference is it being monochrome, but the RGB converter does not special case that).
  • "test2" dimensions are not 4^n aligned, which leads to unaligned SIMD access. While I suspected this to be a major bottleneck, it only accounts for -5% speed on "test2" (tested with an aligned version of that image).
  • The images all show the same cache behavior (4-5% miss rate); the results are not affected by different cache behavior for different image sizes (tested up to 1920x1080). This was further confirmed by tests with pixel writes against a fixed zero offset - the throughput only changes by +-3%. I expect a big perf dent for very "wide" images, where 6 * sizeof(RGBA8888) * image_width will overflow the L3 cache size / invalidate often (not tested). But since most CPUs have at least 2 MB cache these days, this will not happen for normally shaped images. (Ofc there is another cache drain step at the L1/L2 border, which will be triggered much earlier, somewhere around 600 - 1300px width; did not yet test for that.) To make it short - the cache unfriendly sixel format is not a problem on the L2/L3 vs. DRAM level.
  • pixelWrite varies highly across the images, while the input throughputs are pretty close. This indicates that the input loop and the preparation are the main limiting factor, not the pixel writes (we already knew that). It also reveals that, while "worst case" has the highest absolute input throughput, it really sucks in pixel writes. Not sure yet if the degraded CR + empty sixel jumps can be optimized further. This image also has the worst sixel bit density, with mainly 1 bit per sixel byte. Also the 4-pixel SIMD write cannot help much here, as a color change is very likely for every sixel, which leads to more "dummy paints" with excess 4-pixel load/store cycles - theoretically up to 4 times more often for single-color single-pixel addressing (have not analysed the data; at least the PPS numbers suggest degradation around 3x).

Summary:
These new numbers do not help as much as I had hoped for the decoder alone; they mostly confirm what we already knew - the input consumption is slow, especially for low sixel bit density. But they might come in handy for an encoder to try different strategies and maybe speed up the decoder indirectly by outputting more decoder-friendly data.

@jerch

jerch commented Aug 19, 2021

@ismail-yilmaz Some updates from my side. I have not yet pushed the code, as the changes and integration still take time because of the level 1 vs. level 2 handling and the question whether to truncate at raster dimensions.

While trying to find a working API to cover all these details I made an interesting observation that might also help you to get better cache utilization:

My wasm implementation started as level 2 truncating only, as it promised way greater optimizations. In the end we are only interested in the final pixel array, thus it seemed obvious to do that as one big blob of memory to avoid copy costs (even static, to save pointer indirections).
Now, while trying to solve the level 1 / non-truncating issue, I started playing around with a sixel-line decoder (6 pixels in height), where after finishing a line the pixels need to be copied over before the next line can be processed. This involves one additional copy action for the whole pixel data in the end. To my surprise this is slightly faster in wasm than the big blob approach (~1-3%; I think in native as well, but I only did a few checks).

Imho this is a pretty big cache side effect overcompensating the additional copy step (prolly due to the big blob reaching into far memory regions for big images, with bad cache locality again). I still have to understand the details; also I am not settled yet whether to keep the level 1/2 distinction or go with just one general purpose decoder in the end (that would be way easier to maintain).

If you want to try something similar, I did the following:

  • set the line length to 32768 pixels at max (it should be some time before we get to 64k)
  • spread the 6 pixel rows of a sixel line at those offsets (needs 12 memory pages of 64 KiB in wasm, ~ 0.75 MB)
  • whenever a sixel line is finished (or at end of image), copy the written pixels elsewhere
  • flush line memory with fill_color
  • repeat with next sixel line

At first sight it is not really obvious why this might lead to better cache utilization, as the memory usage is spread across those 0.75 MB. Imho the following happens: really every memory interaction (sixel-to-pixel write, off-copy, flush) touches these 6 memory areas, prolly keeping them hot in the cache. Only for very wide images will thrashing occur, when the cursor gets too far to the right. Just a theory atm - if it is true, there should be a significant throughput drop at some line width, depending on the cache size.
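The steps above can be sketched roughly like this (a hedged sketch in plain C - the buffer layout matches the description, but all names and the exact API are mine, not the actual wasm code):

```c
#include <stdint.h>
#include <string.h>

#define MAX_LINE_WIDTH 32768  /* max pixels per line, as described above */

/* 6 pixel rows of the active sixel line in one contiguous buffer:
   6 * 32768 * 4 bytes = 0.75 MB (12 wasm pages of 64 KiB). */
static uint32_t line_buf[6 * MAX_LINE_WIDTH];

/* write one sixel byte (6 vertical pixels) at cursor column x */
static void put_sixel(int x, unsigned sixel_bits, uint32_t color) {
    for (int bit = 0; bit < 6; ++bit)
        if (sixel_bits & (1u << bit))
            line_buf[bit * MAX_LINE_WIDTH + x] = color;
}

/* at the end of a sixel line (or end of image): copy the written pixels
   over to the final image, then flush the line buffer with fill_color */
static void finish_line(uint32_t *image, int img_width, int sixel_row,
                        int used_width, uint32_t fill_color) {
    for (int bit = 0; bit < 6; ++bit)
        memcpy(image + (sixel_row * 6 + bit) * img_width,
               line_buf + bit * MAX_LINE_WIDTH,
               (size_t)used_width * sizeof(uint32_t));
    for (int i = 0; i < 6 * MAX_LINE_WIDTH; ++i)
        line_buf[i] = fill_color;
}
```

Every write, copy, and flush keeps touching the same 6 buffer regions, which is presumably what keeps them hot in the cache.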

@ismail-yilmaz
Owner Author

Hello @jerch ,

Thanks for the update! I am reading your posts, passively. As you prolly noted, I've taken a short break after a long and very exhausting season. I'll be back and try to implement + continue testing + reporting back as usual starting from Aug 29.

@jerch

jerch commented Aug 20, 2021

Trying to get a hold of the cache effects - with a very simple autogenerated image across various widths I see the following behavior in my benchmark:

      Context "Cache"
         Case "16" : 100 runs - average throughput: 49.26 MB/s
         Case "64" : 100 runs - average throughput: 95.27 MB/s
         Case "128" : 100 runs - average throughput: 116.08 MB/s
         Case "256" : 100 runs - average throughput: 111.16 MB/s
         Case "512" : 100 runs - average throughput: 106.59 MB/s
         Case "1024" : 100 runs - average throughput: 106.55 MB/s
         Case "4096" : 100 runs - average throughput: 93.51 MB/s
         Case "16384" : 100 runs - average throughput: 90.70 MB/s

It seems the throughput peaks somewhere around 128px width and has a bigger drop between 1024 and 4096. Examining these areas in more detail reveals that the throughput peaks between 120-180px at 125 MB/s max, and around 1200-1400px it suddenly drops from ~107 to ~95 MB/s.

The drop around 1300px is pretty close to my L1 cache size of 32kB, which strongly suggests a change in cache behavior here. I have no good explanation for the lower peak yet; it might simply be a summation effect of overall buffer/cache utilization. Furthermore my tests are partially skewed by GC actions, which may show up non-deterministically (might be the reason for the rather big ranges across the runs). Trying to test cache behavior with JS is def. not a good idea 😅.
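The ~1300px drop point fits a back-of-the-envelope estimate, assuming 32-bit pixels and all six rows of the active sixel line staying resident in L1 (the helper name below is mine, for illustration only):

```c
/* Widest line whose 6-row working set of 32-bit pixels still fits into
   an L1 data cache of the given size (assumption: nothing else competes
   for the cache). */
static int max_resident_width(int l1_data_cache_bytes) {
    return l1_data_cache_bytes / (6 * 4);  /* 6 rows x 4 bytes/pixel */
}
/* max_resident_width(32 * 1024) == 1365 */
```

With a 32 kB L1 this gives 1365 px, which lines up reasonably well with the observed drop around 1200-1400px.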

(Btw very small images <64px width are much slower in throughput, I guess that the function call overhead for off-copying pixels gets significant here.)

@jerch

jerch commented Aug 21, 2021

The new line-based decoder opened the field for further optimizations, which kinda makes the SIMD attempts obsolete under wasm.

  • line-based w'o SIMD:
        Context "decode - testfiles (WasmDecoder)"
           Case "test1_clean.sixel" : 50 runs - average throughput: 158.07 MB/s
           Case "test2_clean.sixel" : 50 runs - average throughput: 157.47 MB/s
           Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 156.20 MB/s
           Case "FullHD 12bit noise" : 50 runs - average throughput: 325.19 MB/s
           Case "640x480 9bit tiles" : 50 runs - average throughput: 146.03 MB/s
           Case "FullHD 1" : 50 runs - average throughput: 172.81 MB/s
           Case "FullHD 2" : 50 runs - average throughput: 180.82 MB/s
    
  • big blob with SIMD:
        Context "decode - testfiles (WasmDecoder-simd)"
           Case "test1_clean.sixel" : 50 runs - average throughput: 164.39 MB/s
           Case "test2_clean.sixel" : 50 runs - average throughput: 148.45 MB/s
           Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 181.07 MB/s
           Case "FullHD 12bit noise" : 50 runs - average throughput: 240.98 MB/s
           Case "640x480 9bit tiles" : 50 runs - average throughput: 151.24 MB/s
           Case "FullHD 1" : 50 runs - average throughput: 166.72 MB/s
           Case "FullHD 2" : 50 runs - average throughput: 167.93 MB/s
    

It seems SIMD is not yet that usable and well optimized for wasm, thus I will stick with the generic version for now, which also covers more browsers. Note that for x86-native the sixel SIMD decoder runs "test1_clean.sixel" at 220 MB/s (~40% faster).

offtopic:
Kinda made an even worse observation with my base64 lib, where under wasm SIMD actually runs at half the speed of the generic version, while under x86-native it is 5x faster. Well, the base64 SIMD decoder is very stack-var demanding with cumbersome access patterns, which hurts a lot more on a stack machine like wasm, while a register machine can carry those values along in registers.

@jerch

jerch commented Aug 28, 2021

@ismail-yilmaz FYI - pushed my latest wasm code to the PR: jerch/node-sixel#20.
