[🐛 Bug]: Cannot connect through NoVNC #2045

Closed
Earlopain opened this issue Dec 3, 2023 · 24 comments · Fixed by #2058

Comments

@Earlopain
Contributor

Earlopain commented Dec 3, 2023

What happened?

When upgrading standalone-chrome from 4.15.0-20231108 to 4.15.0-20231110, the NoVNC web interface is no longer able to connect. The issue also exists on the latest version, 4.15.0-20231129; I only tested older tags to find the version in which it started.

It's perpetually stuck on the "Connecting..." screen; the websocket is opened but does not receive any data.

The only difference between these two versions is the upgrade from Focal to Jammy in PR #1923

Command used to start Selenium Grid with Docker (or Kubernetes)

version: "3"

services:
  selenium:
    image: selenium/standalone-chrome:4.15.0-20231110
    environment:
      - SE_VNC_NO_PASSWORD=1
    shm_size: 2gb
    ports:
      - ${EXPOSED_VNC_PORT:-7900}:7900

Relevant log output

None that I can see

Operating System

Arch Linux

Docker Selenium version (tag or chart version)

4.15.0-20231110


github-actions bot commented Dec 3, 2023

@Earlopain, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

VietND96 commented Dec 4, 2023

Hi @Earlopain, I could not reproduce this with the docker compose file you shared, using image tag 4.15.0-20231129.
Can you check in DevTools whether there is any request error?

@Earlopain
Contributor Author

Hi there,

No error in the console. The websocket is opened but does not receive any data. The first two packets are supposed to be some kind of ping/pong exchange, but again, that's just not happening.

I did just now test on another machine, Windows this time, and I have no trouble getting it to work there. I tested with Firefox running on the host, and with Firefox running through WSL as well because why not. Both worked without problems.

I'm going to set up a fresh Linux VM and check how it behaves there.

@zhaoyaohui0

I have encountered the same problem, but on K8s. My VNC interface is blank, but my requests run normally. The behavior is the same from version 20231110 through version 20231129.

@zhaoyaohui0

I don't understand why the container shows the VNC port as 7900, but the service opens port 6900:5900. Could you please explain it? @VietND96

@Earlopain
Contributor Author

Hi @VietND96,

I installed Docker in a fresh Linux VM running EndeavourOS (https://endeavouros.com/). After setting up Docker with the following commands and starting the Selenium image, I observe the same symptoms as in my initial report:

yay -S docker
yay -S docker-compose
sudo systemctl start docker

It may have something to do with Arch/EndeavourOS being a rolling release and as such always having the latest versions, or it may be Linux specific. I'm not sure which host OS you were testing with.

For the record, here are the docker/compose versions in use:

$ docker compose version
Docker Compose version 2.23.3
$ docker -v
Docker version 24.0.7, build afdd53b4e3

@VietND96
Member

VietND96 commented Dec 5, 2023

I don't understand why the container shows the VNC port as 7900, but the service opens port 6900:5900. Could you please explain it? @VietND96

Hi @zhaoyaohui0, as I understand it, 5900 is the container port for VNC (you can connect with any tool that supports the VNC protocol, e.g. VNC Viewer, Remmina, TigerVNC Viewer), and 7900 is the container port for NoVNC (which streams over a websocket to provide the live preview in the Grid UI).

  • When starting the container, -p 6900:5900 maps port 5900 in the container to port 6900 on the host. Can we skip this mapping? Yes, it does not affect how the grid works.
  • When do we need to set -p 6900:5900? When you want to debug something or watch a test execute using one of the VNC tools mentioned above. Not everyone has such a tool at hand, which is the reason for NoVNC: you can quickly watch a live preview of each session in the Grid UI.
  • Why -p 6900:5900? Can we map -p 5900:5900 or any other host port? Yes. I guess 6900 is used in the documentation to avoid port clashes. If a host runs e.g. vncserver for remote access, VNC by default uses TCP port 5900+N, where N is the display number (usually :0 for a physical display). If that host is itself accessed via VNC, mapping 590x host ports to the Selenium container's VNC port would cause a port clash.

@diemol
Member

diemol commented Dec 5, 2023

@Earlopain @zhaoyaohui0 where is this failing? Which environments? The report is very ambiguous.

@Earlopain
Contributor Author

@diemol I have provided additional information in my follow-up comment; is that not enough? I unfortunately don't have more than "install this OS, set up docker there and try again". How would I go about gathering more useful information for you, or what are you looking for?

@diemol
Member

diemol commented Dec 5, 2023

You also mention Kubernetes at the beginning of the issue. Hence my question.

Also, how popular is that OS? I mean, we try to provide something that works on most OSes, but if it fails on a few and the user base is small, we won't troubleshoot that because we are a small team, and we try to focus on the common use cases.

Having said that, do you see the same with Ubuntu? macOS? Windows?

@Earlopain
Contributor Author

Earlopain commented Dec 5, 2023

Kubernetes was the other person; I'm just using Docker. EndeavourOS is Arch with a GUI installer; it ships exactly the same software plus some small GUI applications on top. I used it because it is convenient and easy to set up, unlike setting up Arch on your own.

I did test on Windows and had no trouble there. I don't own an Apple device so nothing for me to do there.

I can try out Ubuntu in a bit when I'm at my home PC. I will install latest docker versions, see how that turns out and let you know then.

@Earlopain
Contributor Author

I gave it a try with Ubuntu 23.10 and it just worked as well.

Ended up installing plain Arch instead of EndeavourOS just to make sure and it doesn't work with that.

Here are some other findings: I enabled stdout logging for the other services and, as expected, NoVNC is trying to establish a connection. I accidentally left it open while testing and after a whopping 2.5 minutes it actually managed to connect.

selenium-1  | 172.23.0.1 - - [05/Dec/2023 16:32:44] 172.23.0.1: Plain non-SSL (ws://) WebSocket connection
selenium-1  | 172.23.0.1 - - [05/Dec/2023 16:32:44] 172.23.0.1: Path: '/websockify'
selenium-1  | 172.23.0.1 - - [05/Dec/2023 16:32:44] connecting to: localhost:5900
selenium-1  | 05/12/2023 16:35:18 Got connection from client 127.0.0.1

After establishing a connection once, future connections still take the 2.5 minutes to establish.

It doesn't seem to have anything to do with NoVNC. I exposed port 5900, wanting to connect with a local client, and that takes this long as well. I did a few runs, and the duration seems consistent. For 5 runs, it always took 154 seconds.

I don't know what one would do with this information though. This all seems very nonsensical to me, especially considering it works with other OSes and it's just Docker in the end.
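As a quick cross-check, the gap between the "connecting to: localhost:5900" line and the "Got connection from client" line in the log above matches the 154 seconds measured with a local client. A minimal sketch of that arithmetic follows; the helper name `to_seconds` is hypothetical and the two timestamps are copied from the log.

```shell
#!/usr/bin/env bash
# Convert an HH:MM:SS timestamp to seconds since midnight.
# 10# forces base 10 so values such as "08" are not parsed as octal.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Timestamps taken from the log excerpt above.
start="16:32:44"   # websockify starts connecting to localhost:5900
end="16:35:18"     # the VNC server reports the connection

delay=$(( $(to_seconds "$end") - $(to_seconds "$start") ))
echo "delay: ${delay}s"
```

This only handles two timestamps on the same day, which is all the log excerpt needs.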

@VietND96
Member

VietND96 commented Dec 6, 2023

I have encountered the same problem, but on K8s. My VNC interface is blank, but my requests run normally. The behavior is the same from version 20231110 through version 20231129.

For K8s, are you accessing the Grid UI through a URL with the http:// scheme?
If so, can you try https:// instead (ignore the insecure-connection warning, if any) and check whether the live preview is accessible?

@Earlopain
Contributor Author

I've started reducing the docker image and with a majority of the selenium things removed I still run into this issue.

At this point I'm almost certain it has nothing to do with anything in this repo, so feel free to close this issue, from my side at least. I'll continue to investigate myself and report this at the proper place, if I manage to actually find it.

@diemol
Member

diemol commented Dec 7, 2023

Thank you for your troubleshooting. I will close this based on your comments but feel free to add your findings in additional comments.

@diemol closed this as not planned on Dec 7, 2023
@Earlopain
Contributor Author

Earlopain commented Dec 7, 2023

I did some digging and have found the root cause. Inside the docker container ulimit -n is incredibly high for some reason. ulimit -n => 1073741816

This code in libvncserver enumerates them all, taking up huge amounts of CPU time. I didn't notice CPU spinning beforehand.
https://github.com/LibVNC/libvncserver/blob/784cccbb724517ee4e36d9938f93b9ee168a29e7/src/libvncserver/sockets.c#L508-L527

The temporary solution is quite simple: set the ulimit for docker manually:

version: "3"

services:
  selenium:
    image: selenium/standalone-chrome:4.15.0-20231110
    environment:
      - SE_VNC_NO_PASSWORD=1
    shm_size: 2gb
    ports:
      - ${EXPOSED_VNC_PORT:-7900}:7900
    ulimits:
      nofile:
        soft: 65536
        hard: 65536

I don't know why these limits would differ from the host; the documentation states they are inherited. My host value is just a measly 524288, but it is what it is.

As for why it worked with focal but not with jammy, perhaps this code path wasn't hit before. The limit is just as high inside Docker with focal, so I can't say for sure.

Here's some prior art: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=920913
Here's a thread on the arch forum where I'm going to probably talk a bit more about this: https://bbs.archlinux.org/viewtopic.php?id=290863
Here's an issue on the docker engine repo which I think is most relevant: moby/moby#44547
And a PR that supposedly fixes this but hasn't been part of a release yet: containerd/containerd#8924
Here's an issue I made in libvncserver talking about the consequences of having an incredibly high RLIMIT_NOFILE: LibVNC/libvncserver#600
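
For a rough sense of scale, the two numbers reported in this thread (154 seconds per connection at `ulimit -n` 1073741816) can be turned into an approximate per-descriptor cost and applied to the other limits mentioned. This is a back-of-the-envelope sketch only: it assumes the whole delay is spent enumerating descriptors and that the cost grows linearly with the limit, neither of which was measured here. The function name `estimate_seconds` is hypothetical.

```shell
#!/usr/bin/env bash
# Estimate the connection delay for a given nofile limit.
# Assumption: delay scales linearly with the limit, calibrated on the
# 154 s observed at a limit of 1073741816.
estimate_seconds() {
  # $1 = nofile limit to estimate for
  awk -v limit="$1" 'BEGIN {
    per_fd = 154 / 1073741816   # seconds per descriptor, from the thread
    printf "%.4f\n", limit * per_fd
  }'
}

estimate_seconds 1073741816   # limit seen inside the container
estimate_seconds 524288       # host limit mentioned above
estimate_seconds 65536        # limit used in the compose workaround
```

Under these assumptions the per-descriptor cost is on the order of 140 nanoseconds, so the 65536 limit from the workaround would bring the delay down to roughly ten milliseconds.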

@diemol
Member

diemol commented Dec 8, 2023

Wow, great troubleshooting! Thanks for sharing.

@Abdillah

Abdillah commented Dec 10, 2023

I verified the fix above.

In my case, both VNC and noVNC took a very long time to connect, close to forever. On rare occasions it reached the password prompt, but then it kept waiting and timed out.

Can we put this in the README, in the troubleshooting section?

@Earlopain
Contributor Author

I'm not so sure about the value of that. This only happens when distros use the prepackaged systemd unit files with very recent Docker and systemd versions, which in reality not very many do.

Once upstream releases versions that contain a fix, this section would pretty much become obsolete. You seem to have found this through the issues just fine; I think that is good enough.

@VietND96
Member

I saw that a few Dockerfiles display a warning if ulimit -n is too high when running Docker. I also tried adding one to notify the user: acda753
@Earlopain, do you think a workaround like the one below would work while waiting for the upstream fix?

[program:vnc]
priority=5
command=ulimit -n 65536 && /opt/bin/start-vnc.sh

@Earlopain
Contributor Author

Earlopain commented Dec 11, 2023

The idea is there, yes. However, if ulimit is already set to a lower value in the container, then trying to set it to something higher will return a non-zero exit code, at least for an unprivileged user. That needs to be accounted for.

In addition, TIL that ulimit is a shell builtin and supervisord seems to only start actual binaries (so I think && would not work either). It needs to be part of the start script.

After doing both, it works fine for me. Nice that a workaround is being considered here (:
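
The two requirements above (never try to raise the limit, and apply it inside the start script rather than in the supervisord command) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in #2058: the variable `TARGET_NOFILE` and the function `pick_nofile` are hypothetical names, and 65536 is simply the value used elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical guard for the top of a start script; not the actual #2058 change.
# TARGET_NOFILE is an illustrative name, not a documented option of the image.
TARGET_NOFILE="${TARGET_NOFILE:-65536}"

# Print the limit to apply: the smaller of the current limit and the target.
# "unlimited" is treated as higher than any target.
pick_nofile() {
  local current="$1" target="$2"
  if [ "$current" = "unlimited" ] || [ "$current" -gt "$target" ]; then
    echo "$target"
  else
    echo "$current"
  fi
}

current="$(ulimit -n)"
wanted="$(pick_nofile "$current" "$TARGET_NOFILE")"

if [ "$wanted" != "$current" ]; then
  # Only ever lowers the limit, which an unprivileged user is allowed to do.
  ulimit -n "$wanted" || echo "warning: could not set ulimit -n to $wanted" >&2
fi

# The real VNC start command would follow here, in the same shell process,
# so that it inherits the lowered limit.
```

Because `ulimit` is a shell builtin, the limit only affects this shell and the processes it launches afterwards, which is why the guard has to live in the script that starts the VNC server.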

@hirowatari

    ulimits:
      nofile:
        soft: 65536
        hard: 65536

Thank you. This fixed my issue as well. selenium/standalone-chrome:118.0 worked, but selenium/standalone-chrome:119.0 and 120 needed this fix.

@VietND96 VietND96 linked a pull request Dec 11, 2023 that will close this issue
VietND96 pushed a commit that referenced this issue Dec 12, 2023
* Guard against high `ulimit -n` when starting vnc

Recent versions of docker in combination with the upstream systemd
unit files pass an incredibly high `ulimit -n` to the docker container, up
to 1 billion. That causes minutes-long delays and CPU spinning when
connecting to VNC while it enumerates all the file descriptors.

See #2045

* Update failure message

* Allow the ulimit to be configurable by env
@Earlopain
Contributor Author

New releases will contain a workaround, so a section in the README for this shouldn't be needed anymore. See #2058


This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 12, 2024