Add AVX2 assembly code for LMCS filter #46

nuomi2021 · 2023-03-09T15:06:49Z

Based on this profile, LMCS consumes about 2.81% of the time for Tango2_3840x2160_60_10_420_27_LD.266. Maybe we can use VPGATHERDD to optimize it.

11.96% ffmpeg_g [.] put_vvc_luma_hv_10
5.88% ffmpeg_g [.] alf_get_coeff_and_clip_10
5.25% ffmpeg_g [.] ff_vvc_inv_dct2_64
4.30% [kernel] [k] __lock_text_start
4.22% ffmpeg_g [.] ff_vvc_alf_filter_luma_w16_16bpc_avx2
3.46% ffmpeg_g [.] put_vvc_luma_bi_hv_10
3.45% ffmpeg_g [.] alf_filter_luma_vb_10
3.13% ffmpeg_g [.] vvc_loop_filter_luma_10
2.81% ffmpeg_g [.] lmcs_filter_luma_10
2.46% ffmpeg_g [.] put_vvc_luma_uni_hv_10
2.27% ffmpeg_g [.] put_vvc_chroma_hv_10
2.21% libc-2.31.so [.] 0x000000000018b733
2.05% libc-2.31.so [.] 0x000000000018bb41
1.95% ffmpeg_g [.] put_vvc_chroma_uni_hv_10
1.84% ffmpeg_g [.] put_vvc_chroma_bi_hv_10
1.81% ffmpeg_g [.] vvc_deblock_bs
1.41% ffmpeg_g [.] ff_vvc_predict_inter
1.25% libpthread-2.31.so [.] __pthread_mutex_lock
1.24% libpthread-2.31.so [.] __pthread_mutex_unlock
1.22% ffmpeg_g [.] ff_vvc_residual_coding
1.08% ffmpeg_g [.] alf_filter_cc_10
1.03% ffmpeg_g [.] apply_prof_uni_10
0.99% ffmpeg_g [.] ff_vvc_alf_filter
0.98% ffmpeg_g [.] ff_vvc_inv_dct2_32
0.94% ffmpeg_g [.] vvc_deblock_bs_luma_vertical
0.92% ffmpeg_g [.] add_residual_10

Anant-2005 · 2023-04-01T02:27:01Z

Can you explain this issue, please?
I want to work on it as a qualification task for GSOC 2023.

nuomi2021 · 2023-04-01T03:22:41Z

@Anant-2005, thank you for your interest.
If you check lmcs_filter_luma, it's very simple.
Using the VPGATHERDD instruction we can gather 8 pixels at a time, which may speed up the process.
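
For readers who have not opened the source, the scalar routine amounts to one table lookup per pixel. The following is a minimal sketch under stated assumptions, not the decoder's actual code: the name `lmcs_filter_luma_ref`, the 16-bit pixel type, and a stride counted in pixels are all chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar LMCS luma mapping for 16-bit pixel storage (e.g. 10-bit video).
 * Every pixel is replaced in place by its entry in the lookup table.
 * Assumption of this sketch: stride is given in pixels, not bytes. */
void lmcs_filter_luma_ref(uint16_t *dst, ptrdiff_t stride,
                          int width, int height, const uint16_t *lut)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = lut[dst[x]];   /* one data-dependent load per pixel */
        dst += stride;
    }
}
```

Because the load address depends on the pixel value, the loop cannot be vectorized with plain loads, which is why a gather instruction is the natural candidate.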

Anant-2005 · 2023-04-12T12:25:44Z

@nuomi2021 I was asking whether I have to create an asm file or embed inline assembly. My second question is which assembler we use, because I found a lot of assemblers available when I researched it.

nuomi2021 · 2023-04-12T13:55:27Z

Please follow https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-using-x86inc-asm/
An example is #58.

stone-d-chen · 2024-03-04T00:37:06Z

Hiya,

Is anyone currently working on this? If not could I try it?

nuomi2021 · 2024-03-04T06:13:25Z

Hi @stone-d-chen,
thank you for being interested in this.
Is this for GSOC, or do you just want to help with the project?
If it is for GSOC, please start work on inter or sao for arm. You can check the upstream C and x86 asm code to see how to do it on arm.
If you want to help the project, please go ahead and do what interests you most.

Thank you

stone-d-chen · 2024-03-04T11:26:50Z

Hi @nuomi2021

Oh has the non-arm vvc project been taken? I've already met the qualification requirement (patch accepted for ffmpeg) but I wanted to try this as well.

I will attempt this regardless, more asking so I can plan my time re: setting up an arm dev environment, etc.

Thanks!

nuomi2021 · 2024-03-04T13:14:01Z

> I've already met the qualification requirement (patch accepted for ffmpeg) but I wanted to try this as well.

Good to know.

> Oh has the non-arm vvc project been taken?

No, but maybe you can choose a tough one (one that will be used by many phones/Macs).

> I will attempt this regardless

👍

stone-d-chen · 2024-03-04T13:45:14Z

> No, but maybe you can choose a tough one

Fair enough 😂 I'll give it an attempt

Quick question re: VPGATHERDD. It only operates on int32, while the arrays (at least with the example video) are 16-bit. So I was thinking one way to do it would be to load the pixels, use punpcklwd with a register of 0s to pad them out to 32 bits, gather, and then shift off the garbage bits.

Rough outline:

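    mova             m1, [srcq]               ; load 16-bit pixels
    punpcklwd        m1, m0                   ; m0 = 0 (pxor m0, m0): zero-extend the low words of each lane to dwords
    vpgatherdd       m2, [lutq + m1 * 2], m4  ; gather 32 bits per index; m4 is the gather mask
    vpslld           m3, m2, 16               ; shift the upper 16 garbage bits out...
    vpsrld           m3, 16                   ; ...and move the LUT value back down

    ; final pack and write back
    packssdw         m0, m3, m4
    mova             [srcq], m0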

Mainly wondering if I'm missing an int16 version of vpgather.
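
For comparison, the same idea can be written with C intrinsics. This is a minimal sketch, not the code from any PR in this thread. The `lut_map8*` names are made up for illustration. It assumes GCC or Clang (for the `target` attribute and `__builtin_cpu_supports`), and it assumes the LUT has at least 2 readable bytes past its last entry, because each 32-bit gather element reads 4 bytes. It swaps in a zero-extending load for punpcklwd and an AND mask for the shift pair.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Map 8 16-bit pixels through a 16-bit LUT using a 32-bit gather. */
__attribute__((target("avx2")))
void lut_map8_avx2(uint16_t *px, const uint16_t *lut)
{
    /* Load 8 words and zero-extend them to 8 dwords (vpmovzxwd). */
    __m128i words = _mm_loadu_si128((const __m128i *)px);
    __m256i idx   = _mm256_cvtepu16_epi32(words);

    /* Gather 32 bits per index with a byte scale of 2 (vpgatherdd). */
    __m256i g = _mm256_i32gather_epi32((const int *)lut, idx, 2);

    /* Keep the low 16 bits; the high half is the next LUT entry. */
    g = _mm256_and_si256(g, _mm256_set1_epi32(0xFFFF));

    /* Pack dwords back to words, handling the two 128-bit lanes explicitly. */
    __m128i lo = _mm256_castsi256_si128(g);
    __m128i hi = _mm256_extracti128_si256(g, 1);
    _mm_storeu_si128((__m128i *)px, _mm_packus_epi32(lo, hi));
}
#endif

/* Scalar fallback with the same contract. */
void lut_map8_c(uint16_t *px, const uint16_t *lut)
{
    for (int i = 0; i < 8; i++)
        px[i] = lut[px[i]];
}

/* Pick the AVX2 path only when the CPU reports support for it. */
void lut_map8(uint16_t *px, const uint16_t *lut)
{
    assert(px && lut);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        lut_map8_avx2(px, lut);
        return;
    }
#endif
    lut_map8_c(px, lut);
}
```

The over-read past the last entry is the main thing to watch with this trick: a 1024-entry table of 16-bit values needs one extra padding element (or an otherwise readable 2 bytes) behind it.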

frankplow · 2024-03-04T14:11:49Z

@stone-d-chen ~~Could you use PBLENDW?~~

Ignore me, I didn't realise lut was signalled rather than a constant. I don't think there's an equivalent to VPGATHERDD which acts on words in AVX2.

nuomi2021 · 2024-03-04T14:30:20Z

vvdec has an implementation, you can refer to it :)

stone-d-chen · 2024-03-06T22:02:16Z

> vvdec has an implementation, you can refer to it :)

ah took me a bit to realize that vvdec was a different repo haha. They are using a shuf instead of shifting.

Results from my rough draft show a speedup. I've only used 2 vpgatherdds per loop so far to keep things simple. It seems like having another set would help, since according to Agner Fog's instruction tables it has a latency of 24 and a CPI of 5.

Next I think I should probably generalize this? I assumed width = 128, a 2-byte pixel, etc. I've been looking more into how the macro system works.

I was also wondering if there is a more official way of comparing outputs. I've just eyeballed it so far, since any errors were very obvious. I saw there were some conformance tests.

Before
+    4.48%     4.48%  ffmpeg_g  ffmpeg_g  [.] lmcs_filter_luma_10
     0.16%     0.16%  ffmpeg_g  ffmpeg_g  [.] lmcs_scale_chroma_10
     0.00%     0.00%  ffmpeg_g  ffmpeg_g  [.] run_lmcs
     0.00%     0.00%  ffmpeg_g  ffmpeg_g  [.] ff_vvc_lmcs_filter

After
+    1.26%     1.26%  ffmpeg_g  ffmpeg_g  [.] lmcs_filter_luma_10
     0.15%     0.15%  ffmpeg_g  ffmpeg_g  [.] lmcs_scale_chroma_10
     0.01%     0.01%  ffmpeg_g  ffmpeg_g  [.] run_lmcs
     0.01%     0.01%  ffmpeg_g  ffmpeg_g  [.] ff_vvc_lmcs_filter

nuomi2021 · 2024-03-07T04:29:03Z

You are so fast.
The width is not always 128; it can be 32 or 64. A pixel is not always 2 bytes; it can be 1 byte for 8 bpc.
Better to start like this:

1. Write a checkasm test like https://github.com/ffvvc/FFmpeg/blob/up/tests/checkasm/vvc_mc.c and make sure it passes.
2. Make sure the CI test case passes: https://github.com/ffvvc/FFmpeg/pull/198/files#diff-ab2f63759f09e4a4c7d039b6831fe49d10a7b39c9ea91d86ee5e516c84966003R85
3. Make sure "valgrind ffmpeg -i a_lmcs_clip.vvc -f null -" passes.
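
The core idea behind the first step is to run the reference and the optimized routine on identical random input and compare the full buffers. The sketch below models that idea in standalone C; it does not use the real checkasm API. All names (`lmcs_ref`, `lmcs_candidate`, `compare_lmcs`) are hypothetical, and `lmcs_candidate` is only a stand-in for the assembly under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_W    128
#define MAX_H    8
#define LUT_SIZE 1024   /* 10-bit pixels */

typedef void (*lmcs_fn)(uint16_t *dst, ptrdiff_t stride, int w, int h,
                        const uint16_t *lut);

/* Reference: plain per-pixel table lookup. */
void lmcs_ref(uint16_t *dst, ptrdiff_t stride, int w, int h,
              const uint16_t *lut)
{
    for (int y = 0; y < h; y++, dst += stride)
        for (int x = 0; x < w; x++)
            dst[x] = lut[dst[x]];
}

/* Stand-in for the routine under test (a real test would call the asm).
 * It handles 4 pixels per iteration, so it requires w % 4 == 0. */
void lmcs_candidate(uint16_t *dst, ptrdiff_t stride, int w, int h,
                    const uint16_t *lut)
{
    assert(w % 4 == 0);
    for (int y = 0; y < h; y++, dst += stride)
        for (int x = 0; x < w; x += 4) {
            dst[x + 0] = lut[dst[x + 0]];
            dst[x + 1] = lut[dst[x + 1]];
            dst[x + 2] = lut[dst[x + 2]];
            dst[x + 3] = lut[dst[x + 3]];
        }
}

/* Returns 1 if both functions leave identical buffers for every tested
 * width. Whole buffers are compared, so writes past `w` are caught too. */
int compare_lmcs(lmcs_fn ref, lmcs_fn cand, unsigned seed)
{
    static const int widths[] = { 8, 16, 32, 64, 128 };
    uint16_t lut[LUT_SIZE];
    uint16_t a[MAX_W * MAX_H], b[MAX_W * MAX_H];

    srand(seed);
    for (int i = 0; i < LUT_SIZE; i++)
        lut[i] = (uint16_t)(rand() % LUT_SIZE);

    for (size_t k = 0; k < sizeof(widths) / sizeof(widths[0]); k++) {
        for (int i = 0; i < MAX_W * MAX_H; i++)
            a[i] = (uint16_t)(rand() % LUT_SIZE);
        memcpy(b, a, sizeof(a));

        ref(a, MAX_W, widths[k], MAX_H, lut);
        cand(b, MAX_W, widths[k], MAX_H, lut);

        if (memcmp(a, b, sizeof(a)) != 0)
            return 0;
    }
    return 1;
}
```

Calling `compare_lmcs(lmcs_ref, lmcs_candidate, 1234)` should return 1 for a correct candidate. This is far stricter than eyeballing decoded frames, because a single wrong pixel fails the comparison.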

stone-d-chen · 2024-03-09T17:11:01Z

Quick update and questions:
I updated the 2-byte version to take multiple widths. However, I noticed some 8- and 16-pixel widths in predict_inter (by printing cu->cb_width). Are these also possible widths?

New profiling numbers (I messed up the profiling originally):
before ~= 4.29%
after ~= 3.11%
So maybe 35% faster.

I have a 1-byte version working and am going to start writing the checkasm tests. Also, is there an 8-bit video I can test against?

  Children      Self  Command   Shared O  Symbol
+    4.29%     4.29%  ffmpeg_g  ffmpeg_g  [.] lmcs_filter_luma_10
     0.14%     0.14%  ffmpeg_g  ffmpeg_g  [.] lmcs_scale_chroma_10
     0.01%     0.01%  ffmpeg_g  ffmpeg_g  [.] run_lmcs
     0.00%     0.00%  ffmpeg_g  ffmpeg_g  [.] ff_vvc_lmcs_filter
     
  Children      Self  Command   Shared O  Symbol
+    2.97%     2.97%  ffmpeg_g  ffmpeg_g  [.] ff_lmcs_128_16bpc_avx2
     0.14%     0.14%  ffmpeg_g  ffmpeg_g  [.] lmcs_scale_chroma_10
     0.14%     0.14%  ffmpeg_g  ffmpeg_g  [.] lmcs_filter_luma_10 
     0.01%     0.01%  ffmpeg_g  ffmpeg_g  [.] run_lmcs
     0.01%     0.01%  ffmpeg_g  ffmpeg_g  [.] ff_vvc_lmcs_filter

stone-d-chen · 2024-03-16T21:09:56Z

In my fork I've created a PR with my current implementation.
stone-d-chen#1

- AVX2 code paths for widths 8, 16, 32, 64 and 128 (width 4 falls back to the scalar version).
- 16bpc: currently only loaded for 10 bits, but it should "just work" for 12 bits as well.
- Currently it just uses if-else statements to decide which function to call based on width; this should probably be changed to a proper jump table.
- Tested on conformance. One issue was that the ffvvc main branch also seems to fail 10 tests, and my implementation fails the same ones. I'm still looking for a build where all 10 pass so that I can add my changes to it.
- The 8-bit version is still in the works; I took a pause to figure out how to set up the checkasm code.
- Somehow I had a bit of a performance regression when I was cleaning the code up. I think jump tables will help with branch misses, and maybe I can tune the prefetch a bit.

// fails
    IBC_E_Tencent_1.bit
    CodingToolsSets_D_Tencent_2.bit
    IBC_D_Tencent_2.bit
    IBC_C_Tencent_2.bit
    IBC_A_Tencent_2.bit
    IBC_B_Tencent_2.bit
    sintel_120.266
    LOSSLESS_B_HHI_3.bit
    10b444_A_Kwai_3.bit
    10b444_B_Kwai_3.bit
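
On the if-else versus jump table point in the update above, one common C-side shape is a small table of function pointers indexed by log2 of the width. The sketch below only illustrates that idea. The kernel names (`lmcs_w8` and so on) are hypothetical placeholders that forward to a scalar loop, and the real dispatch in the PR may be structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*lmcs_fn)(uint16_t *dst, ptrdiff_t stride, int w, int h,
                        const uint16_t *lut);

/* Generic scalar kernel, used for width 4 and anything unexpected. */
void lmcs_scalar(uint16_t *dst, ptrdiff_t stride, int w, int h,
                 const uint16_t *lut)
{
    for (int y = 0; y < h; y++, dst += stride)
        for (int x = 0; x < w; x++)
            dst[x] = lut[dst[x]];
}

/* Hypothetical width-specialised kernels. In real code these would be the
 * asm entry points; here they just forward to the scalar loop. */
#define DEFINE_KERNEL(W)                                              \
    void lmcs_w##W(uint16_t *dst, ptrdiff_t stride, int w, int h,     \
                   const uint16_t *lut)                               \
    {                                                                 \
        assert(w == W);                                               \
        lmcs_scalar(dst, stride, W, h, lut);                          \
    }
DEFINE_KERNEL(8)
DEFINE_KERNEL(16)
DEFINE_KERNEL(32)
DEFINE_KERNEL(64)
DEFINE_KERNEL(128)

/* Integer log2 for positive values (rounds down). */
static int log2_width(int w)
{
    int n = 0;
    while (w > 1) {
        w >>= 1;
        n++;
    }
    return n;
}

/* Select a kernel with one table lookup instead of an if-else chain. */
lmcs_fn lmcs_select(int width)
{
    static const lmcs_fn table[8] = {
        [3] = lmcs_w8,  [4] = lmcs_w16, [5] = lmcs_w32,
        [6] = lmcs_w64, [7] = lmcs_w128,
    };
    int idx = log2_width(width);

    /* Non-power-of-two, out-of-range or unpopulated entries go scalar. */
    if (width <= 0 || (width & (width - 1)) || idx >= 8 || !table[idx])
        return lmcs_scalar;
    return table[idx];
}
```

A caller would do `lmcs_select(width)(dst, stride, width, height, lut)`. Whether this actually reduces branch misses compared with the if-else chain would need to be measured.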

nuomi2021 · 2024-03-18T14:21:14Z

@stone-d-chen good progress.
Please use the upstream version. It supports more conformance clips, especially for IBC.

stone-d-chen · 2024-03-21T22:46:11Z

Yep that fixed the issue! All conformance tests pass now
stone-d-chen#3

- Cleaned up 16bpc and redid the 8-pixel path to take advantage of YMM registers.
- 8bpc for widths 32-128; I plan to do 16 pixels next.
- Still need to do the checkasm tests for 8bpc.

nuomi2021 · 2024-03-23T15:05:59Z

will check it next week,
thank you @stone-d-chen

stone-d-chen · 2024-03-24T13:52:55Z

Sounds good no rush, @nuomi2021

Latest update:

8bpc should be fully complete now; I still need to clean up my checkasm.

stone-d-chen#4

I'll probably take a pause on this for now until you take a look. I might start looking at the arm instructions and/or the x86 deblocking AVX code; I've been spending some time reading how those algorithms are implemented.

nuomi2021 · 2024-03-27T13:07:10Z

@QSXW could you also help review stone-d-chen#4?
Thank you

nuomi2021 added the asm and good first issue labels on Mar 9, 2023

nuomi2021 added this to the FFmpeg7.1 milestone Mar 26, 2024
