Add AVX2 assembly code for LMCS filter #46
Comments
Can you explain this issue, please?
@Anant-2005, thank you for your interest.
@nuomi2021 I was asking whether I have to create an asm file or embed inline assembly. My second question is which assembler we use, because there were a lot of assemblers available when I researched it.
Please use https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-using-x86inc-asm/
Hiya, is anyone currently working on this? If not, could I try it?
Hi @stone-d-chen, thank you.
Hi @nuomi2021, oh, has the non-ARM VVC project been taken? I've already met the qualification requirement (patch accepted for FFmpeg), but I wanted to try this as well. I will attempt this regardless; I'm mostly asking so I can plan my time re: setting up an ARM dev environment, etc. Thanks!
Good to know.
No, but maybe you can choose a tough one (one that will be used by many phones/Macs).
👍
Fair enough 😂 I'll give it an attempt. Quick question re: VPGATHERDD: it seems to operate only on int32, while the arrays (at least with the example video) are 16 bit. So I was thinking a way to do it would be loading using Rough outline:
Mainly wondering if I'm missing an int16 version of vpgather.
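For readers following the question: AVX2 gathers work on 32-bit and 64-bit elements, so one common workaround for a 16-bit table is to gather a 32-bit word at a 2-byte scale and keep only the low half. The block below is a scalar C model of a single gather lane under that assumption, not code from the PR. The helper names are hypothetical, a little-endian host is assumed, and the table is assumed to carry one padding entry because the 32-bit load reads 2 bytes past the selected entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one VPGATHERDD lane with scale 2: load 32 bits starting
 * at lut[idx], then keep only the low 16 bits (little-endian host assumed).
 * The load touches 2 bytes past lut[idx], so the table needs one extra
 * padding entry after its last real entry. */
static uint16_t gather_lane_u16(const uint16_t *lut, uint32_t idx)
{
    uint32_t word;
    memcpy(&word, (const uint8_t *)lut + 2 * (size_t)idx, sizeof(word));
    return (uint16_t)(word & 0xFFFF);
}

/* Hypothetical helper: remap n 16-bit pixels in place through lut, which
 * must hold lut_size real entries followed by one padding entry. */
static void lmcs_remap_gather_model(uint16_t *pix, size_t n,
                                    const uint16_t *lut, size_t lut_size)
{
    for (size_t i = 0; i < n; i++) {
        assert(pix[i] < lut_size);  /* index must stay inside the table */
        pix[i] = gather_lane_u16(lut, pix[i]);
    }
}
```

In a vector version the same idea would zero-extend eight 16-bit pixels to 32-bit indices, gather, mask, and pack back down to 16 bits; that part is left out here to keep the sketch portable.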
@stone-d-chen Ignore me, I didn't realise
vvdec has an implementation, you can refer to it :)
Ah, it took me a bit to realize that vvdec was a different repo, haha. They are using a shuffle instead of shifting. Results from my rough draft show a speedup. I've only used 2 VPGATHERDDs per loop so far to simplify. It seems like having another set would be helpful, since according to Agner Fog's instruction tables it has a latency of 24 and a CPI of 5. Next I think I should probably generalize this? I assumed width = 128, a pixel was 2 bytes, etc. I've been looking more into how the macro system works. I was also wondering if there is a more official way of comparing outputs; I've just eyeballed it so far since any errors were very obvious. I saw there are some conformance tests.
You are so fast.
Quick update/Q's. New profiling numbers (I messed up the profiling originally). I have a 1-byte version working and am going to start writing checkasm tests. Also, is there an 8-bit video I can test against as well?
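As a rough illustration of the kind of check being discussed (comparing an optimized routine against the scalar one on random input rather than eyeballing frames), here is a minimal plain-C harness. It is not the checkasm API; all function names are hypothetical, samples are assumed to be 8-bit with a 256-entry table, and the "candidate" is just an unrolled stand-in for whatever optimized version is under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define W 128         /* block width, matching the width assumed in the draft */
#define H 4           /* arbitrary block height for the test */
#define LUT_SIZE 256  /* 8-bit samples: one table entry per sample value */

/* Reference: plain per-pixel table lookup. */
static void lmcs_ref_u8(uint8_t *dst, ptrdiff_t stride, int w, int h,
                        const uint8_t *lut)
{
    for (int y = 0; y < h; y++, dst += stride)
        for (int x = 0; x < w; x++)
            dst[x] = lut[dst[x]];
}

/* Hypothetical stand-in for the optimized routine (unrolled by 4). */
static void lmcs_candidate_u8(uint8_t *dst, ptrdiff_t stride, int w, int h,
                              const uint8_t *lut)
{
    assert(w % 4 == 0);
    for (int y = 0; y < h; y++, dst += stride)
        for (int x = 0; x < w; x += 4) {
            dst[x + 0] = lut[dst[x + 0]];
            dst[x + 1] = lut[dst[x + 1]];
            dst[x + 2] = lut[dst[x + 2]];
            dst[x + 3] = lut[dst[x + 3]];
        }
}

/* Small deterministic generator so failures are reproducible from the seed. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 24;  /* top 8 bits */
}

/* Returns 1 when both routines produce byte-identical output. */
static int check_lmcs_u8(uint32_t seed)
{
    uint8_t lut[LUT_SIZE], ref[W * H], out[W * H];

    for (int i = 0; i < LUT_SIZE; i++)
        lut[i] = (uint8_t)lcg_next(&seed);
    for (int i = 0; i < W * H; i++)
        ref[i] = (uint8_t)lcg_next(&seed);
    memcpy(out, ref, sizeof(ref));

    lmcs_ref_u8(ref, W, W, H, lut);
    lmcs_candidate_u8(out, W, W, H, lut);
    return memcmp(ref, out, sizeof(ref)) == 0;
}
```

Calling `check_lmcs_u8` with a handful of seeds gives a quick pass/fail signal; a real checkasm test would additionally cover multiple widths, strides, and bit depths.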
In my fork I've created a PR with my current implementation.
@stone-d-chen good progress.
Yep, that fixed the issue! All conformance tests pass now.
Will check it next week.
Sounds good, no rush, @nuomi2021. Latest update:
I'll probably take a pause on this for now until you take a look. I might start looking at the ARM instructions and/or the x86 deblocking AVX code; I've been spending some time reading how those algorithms are implemented.
@QSXW could you also help review stone-d-chen#4
Based on the profile below, LMCS consumes about 2.81% of the time for Tango2_3840x2160_60_10_420_27_LD.266; maybe we can use VPGATHERDD to optimize it.
11.96% ffmpeg_g [.] put_vvc_luma_hv_10
5.88% ffmpeg_g [.] alf_get_coeff_and_clip_10
5.25% ffmpeg_g [.] ff_vvc_inv_dct2_64
4.30% [kernel] [k] __lock_text_start
4.22% ffmpeg_g [.] ff_vvc_alf_filter_luma_w16_16bpc_avx2
3.46% ffmpeg_g [.] put_vvc_luma_bi_hv_10
3.45% ffmpeg_g [.] alf_filter_luma_vb_10
3.13% ffmpeg_g [.] vvc_loop_filter_luma_10
2.81% ffmpeg_g [.] lmcs_filter_luma_10
2.46% ffmpeg_g [.] put_vvc_luma_uni_hv_10
2.27% ffmpeg_g [.] put_vvc_chroma_hv_10
2.21% libc-2.31.so [.] 0x000000000018b733
2.05% libc-2.31.so [.] 0x000000000018bb41
1.95% ffmpeg_g [.] put_vvc_chroma_uni_hv_10
1.84% ffmpeg_g [.] put_vvc_chroma_bi_hv_10
1.81% ffmpeg_g [.] vvc_deblock_bs
1.41% ffmpeg_g [.] ff_vvc_predict_inter
1.25% libpthread-2.31.so [.] __pthread_mutex_lock
1.24% libpthread-2.31.so [.] __pthread_mutex_unlock
1.22% ffmpeg_g [.] ff_vvc_residual_coding
1.08% ffmpeg_g [.] alf_filter_cc_10
1.03% ffmpeg_g [.] apply_prof_uni_10
0.99% ffmpeg_g [.] ff_vvc_alf_filter
0.98% ffmpeg_g [.] ff_vvc_inv_dct2_32
0.94% ffmpeg_g [.] vvc_deblock_bs_luma_vertical
0.92% ffmpeg_g [.] add_residual_10
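For context on why a gather instruction is a candidate here: the luma LMCS step is, in essence, a per-pixel table lookup. The sketch below shows that shape in scalar C, assuming 10-bit samples stored as `uint16_t`, a 1024-entry table, and a stride measured in pixels. The function name and signature are illustrative and are not taken from FFmpeg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative scalar form of a luma LUT remap: every sample is replaced
 * by its table entry. Assumes each sample value is a valid index into lut
 * (values below 1024 for 10-bit video) and stride is given in pixels. */
static void lmcs_luma_lookup(uint16_t *dst, ptrdiff_t stride,
                             int width, int height, const uint16_t *lut)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = lut[dst[x]];  /* data-dependent load: one per pixel */
        dst += stride;
    }
}
```

Each iteration performs a load whose address depends on the pixel value, which is the access pattern VPGATHERDD handles several lanes at a time.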