Very impressive work! (Feel free to discuss the paper here!) #1
Comments
Hi Yongxu, thanks for the impressive questions! These are good questions to discuss~
For further discussion, you can contact me on WeChat: haoningnanyangtu (also open to all friends interested in this topic).
@teowu Thanks for your kind reply. This is just an open discussion: as far as you know, is there any literature showing that disentangled representations perform better than entangled ones (and thus that we must do representation disentanglement)?
"thus we must do representation disentanglement": not a must. As for whether disentanglement enhances representations, I think this is a common idea in higher-level tasks (our related-work section also cites some examples). Most recently, I read a paper at this year's NeurIPS sharing similar ideas, but I cannot find it now... if I find it, I will post its link here.
BTW, I like this discussion, so I pinned it here (as if I were on OpenReview for ICLR or NeurIPS lol).
@teowu Thanks a lot :)
@Sissuire @teowu I have a question about fine-tuning DOVER on VQA datasets that provide only an overall video quality score. Take KoNViD-1k as an example: the metadata contains only the overall video quality (hereinafter Q_o); there is no technical (Q_t) or aesthetic (Q_a) quality score. But if you look at the labels for this dataset, there are three values per video, which look like Q_a, Q_t, and Q_o. How did the authors obtain Q_a and Q_t if the original dataset contains only Q_o? How was the data in labels.txt obtained? I also found that some labels.txt files have the structure: -1, -1, MOS (Q_o). If I have no Q_a and Q_t but do have Q_o, can I just set Q_a and Q_t to -1 in labels.txt?
Hi Alex, Q_o is all there is. The DIVIDE-3k database (the only database with Q_a and Q_t, as proposed in our paper) will be released soon. In https://github.com/VQAssessment/DOVER/blob/master/examplar_data_labels/KoNViD/labels.txt, the second and third values are the video length and frame rate; these fields are deprecated in the other datasets, so they are left as -1 placeholders.
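For anyone wiring this up, here is a minimal parsing sketch based on the format described above (comma-separated lines of name, length, frame rate, MOS, with -1 placeholders for unused fields). The helper and field names are assumptions for illustration, not code from the DOVER repo:

```python
# Hypothetical helper for reading a labels.txt in the format described above.
# Assumed line format: "<video_name>, <length>, <framerate>, <MOS>", where
# length/framerate may be -1 placeholders in datasets where they are unused.
def read_labels(path: str):
    entries = []
    with open(path) as f:
        for line in f:
            parts = [p.strip() for p in line.split(",")]
            if len(parts) != 4:
                continue  # skip malformed or empty lines
            name = parts[0]
            length, framerate = float(parts[1]), float(parts[2])  # may be -1
            mos = float(parts[3])  # the overall quality label Q_o
            entries.append({"name": name, "length": length,
                            "framerate": framerate, "mos": mos})
    return entries
```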
Video quality is always entangled between aesthetic and technical effects, especially for UGC videos. The idea is quite clear and the performance is good!
After reading the work, a few questions came to mind, and I'd like to discuss them with everyone interested in this topic.
Throughout the work, it seems that the disentanglement is what yields the large improvement. My confusion is why disentanglement improves performance. Should we simply believe that representations entangling both aesthetic and technical features restrict the task, and that disentangled ones work better?
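To make the contrast concrete, here is a minimal PyTorch sketch of the two designs being compared. This is a hypothetical illustration, not DOVER's actual architecture; in particular, the separate branch inputs and the equal-weight fusion are placeholder assumptions:

```python
import torch.nn as nn

class EntangledVQA(nn.Module):
    """One shared representation regresses a single quality score."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, x):
        return self.head(self.backbone(x))

class DisentangledVQA(nn.Module):
    """Separate aesthetic/technical branches, fused into an overall score."""
    def __init__(self, aes_branch: nn.Module, tech_branch: nn.Module, feat_dim: int):
        super().__init__()
        self.aes_branch, self.tech_branch = aes_branch, tech_branch
        self.aes_head = nn.Linear(feat_dim, 1)   # predicts Q_a
        self.tech_head = nn.Linear(feat_dim, 1)  # predicts Q_t

    def forward(self, x_aes, x_tech):
        q_a = self.aes_head(self.aes_branch(x_aes))
        q_t = self.tech_head(self.tech_branch(x_tech))
        # Placeholder equal-weight fusion; the paper's actual fusion
        # strategy and weights may differ.
        q_o = 0.5 * q_a + 0.5 * q_t
        return q_a, q_t, q_o
```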
I've seen different network structures adopted in the work (e.g., inflated ConvNeXt, Swin Transformer), whereas the most popular backbone in VQA, I thought, must be ResNet-50. So how do the different networks affect performance? Has anyone conducted a detailed experiment on the different network structures?
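On the backbone question, a small ablation harness like the sketch below is one way to isolate the effect of the network structure. It uses 2D image backbones from timm for simplicity (DOVER itself uses inflated/video variants), and the model names and elided training loop are assumptions for illustration:

```python
import timm
import torch.nn as nn

def build_vqa_model(backbone_name: str) -> nn.Module:
    # num_classes=0 strips the classification head so the backbone
    # outputs pooled features; a linear head then regresses one score.
    backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
    return nn.Sequential(backbone, nn.Linear(backbone.num_features, 1))

# Candidate backbones mirroring the discussion: ResNet-50 vs. the newer
# ConvNeXt and Swin architectures (image variants, as an assumption).
for name in ["resnet50", "convnext_tiny", "swin_tiny_patch4_window7_224"]:
    model = build_vqa_model(name)
    # ... fine-tune on the target VQA dataset and compare SRCC/PLCC ...
```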