Fix: Interface being busy prevented instance creation #579

Merged: hoh merged 1 commit into main from hoh-fix-error-interface-busy on Mar 19, 2024

@hoh hoh commented Mar 19, 2024

When attempting to schedule a Firecracker instance, the error below occurred in a loop.

Solution: Log a warning instead of raising an exception.

```
 2024-03-19 15:45:49,346 | ERROR |   File "<frozen runpy>", line 198, in _run_module_as_main
   File "<frozen runpy>", line 88, in _run_code
   File "/opt/aleph-vm/aleph/vm/orchestrator/__main__.py", line 4, in <module>
     main()
   File "/opt/aleph-vm/aleph/vm/orchestrator/cli.py", line 371, in main
     supervisor.run()
   File "/opt/aleph-vm/aleph/vm/orchestrator/supervisor.py", line 163, in run
     web.run_app(app, host=settings.SUPERVISOR_HOST, port=settings.SUPERVISOR_PORT)
   File "/opt/aleph-vm/aiohttp/web.py", line 544, in run_app
     loop.run_until_complete(main_task)
   File "/usr/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
     self.run_forever()
   File "/usr/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
     self._run_once()
   File "/usr/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once
     handle._run()
   File "/usr/lib/python3.11/asyncio/events.py", line 80, in _run
     self._context.run(self._callback, *self._args)
   File "/opt/aleph-vm/aiohttp/web_protocol.py", line 452, in _handle_request
     resp = await request_handler(request)
   File "/opt/aleph-vm/sentry_sdk/integrations/aiohttp.py", line 129, in sentry_app_handle
     response = await old_handle(self, request)
   File "/opt/aleph-vm/aiohttp/web_app.py", line 543, in _handle
     resp = await handler(request)
   File "/opt/aleph-vm/aiohttp/web_middlewares.py", line 114, in impl
     return await handler(request)
   File "/opt/aleph-vm/aleph/vm/orchestrator/supervisor.py", line 65, in server_version_middleware
     resp: web.StreamResponse = await handler(request)
   File "/opt/aleph-vm/aiohttp/web_urldispatcher.py", line 200, in handler_wrapper
     result = await result
   File "/opt/aleph-vm/aleph/vm/orchestrator/run.py", line 129, in run_code_on_request
     execution = await create_vm_execution_or_raise_http_error(vm_hash=vm_hash, pool=pool)
   File "/opt/aleph-vm/aleph/vm/orchestrator/run.py", line 90, in create_vm_execution_or_raise_http_error
     return await create_vm_execution(vm_hash=vm_hash, pool=pool)
   File "/opt/aleph-vm/aleph/vm/orchestrator/run.py", line 60, in create_vm_execution
     execution = await pool.create_a_vm(
   File "/opt/aleph-vm/aleph/vm/pool.py", line 113, in create_a_vm
     await self.network.create_tap(vm_id, tap_interface)
   File "/opt/aleph-vm/aleph/vm/network/hostnetwork.py", line 221, in create_tap
     await interface.create()
   File "/opt/aleph-vm/aleph/vm/network/interfaces.py", line 128, in create
     create_tap_interface(ipr, self.device_name)
   File "/opt/aleph-vm/aleph/vm/network/interfaces.py", line 32, in create_tap_interface
     ipr.link("add", ifname=device_name, kind="tuntap", mode="tap")
   File "/opt/aleph-vm/pyroute2/iproute/linux.py", line 1696, in link
     ret = self.nlm_request(msg, msg_type=msg_type, msg_flags=msg_flags)
   File "/opt/aleph-vm/pyroute2/netlink/nlsocket.py", line 870, in nlm_request
     return tuple(self._genlm_request(*argv, **kwarg))
   File "/opt/aleph-vm/pyroute2/netlink/nlsocket.py", line 1209, in nlm_request
     self.put(msg, msg_type, msg_flags, msg_seq=msg_seq)
   File "/opt/aleph-vm/pyroute2/netlink/nlsocket.py", line 906, in put
     return self.engine.put(
   File "/opt/aleph-vm/pyroute2/netlink/nlsocket.py", line 443, in put
     self.socket.sendto_gate(msg, addr)
   File "/opt/aleph-vm/pyroute2/netlink/rtnl/iprsocket.py", line 52, in sendto_gate
     ret = self._sproxy.handle(msg)
   File "/opt/aleph-vm/pyroute2/netlink/proxy.py", line 61, in handle
     log.error(''.join(traceback.format_stack()))
 2024-03-19 15:45:49,353 | ERROR | Traceback (most recent call last):
   File "/opt/aleph-vm/pyroute2/netlink/proxy.py", line 43, in handle
     ret = plugin(msg, self.nl)
           ^^^^^^^^^^^^^^^^^^^^
   File "/opt/aleph-vm/pyroute2/netlink/rtnl/ifinfmsg/proxy.py", line 73, in proxy_newlink
     return manage_tuntap(msg)
            ^^^^^^^^^^^^^^^^^^
   File "/opt/aleph-vm/pyroute2/netlink/rtnl/ifinfmsg/sync.py", line 60, in decorated
     ret = f(msg)
           ^^^^^^
   File "/opt/aleph-vm/pyroute2/netlink/rtnl/ifinfmsg/tuntap.py", line 135, in manage_tuntap
     ioctl(fd, TUNSETIFF, ifr)
 OSError: [Errno 16] Device or resource busy
 2024-03-19 15:45:49,356 | ERROR | Interface vmtap4 is busy - is there another process using it ?
 Traceback (most recent call last):
   File "/opt/aleph-vm/aleph/vm/network/interfaces.py", line 32, in create_tap_interface
     ipr.link("add", ifname=device_name, kind="tuntap", mode="tap")
   File "/opt/aleph-vm/pyroute2/iproute/linux.py", line 1696, in link
     ret = self.nlm_request(msg, msg_type=msg_type, msg_flags=msg_flags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/opt/aleph-vm/pyroute2/netlink/nlsocket.py", line 870, in nlm_request
     return tuple(self._genlm_request(*argv, **kwarg))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/opt/aleph-vm/pyroute2/netlink/nlsocket.py", line 1214, in nlm_request
     for msg in self.get(
                ^^^^^^^^^
   File "/opt/aleph-vm/pyroute2/netlink/nlsocket.py", line 873, in get
     return tuple(self._genlm_get(*argv, **kwarg))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/opt/aleph-vm/pyroute2/netlink/nlsocket.py", line 550, in get
     raise msg['header']['error']
 pyroute2.netlink.exceptions.NetlinkError: (16, 'Device or resource busy')
 The above exception was the direct cause of the following exception:
 Traceback (most recent call last):
   File "/opt/aleph-vm/aleph/vm/orchestrator/run.py", line 90, in create_vm_execution_or_raise_http_error
     return await create_vm_execution(vm_hash=vm_hash, pool=pool)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/opt/aleph-vm/aleph/vm/orchestrator/run.py", line 60, in create_vm_execution
     execution = await pool.create_a_vm(
                 ^^^^^^^^^^^^^^^^^^^^^^^
   File "/opt/aleph-vm/aleph/vm/pool.py", line 113, in create_a_vm
     await self.network.create_tap(vm_id, tap_interface)
   File "/opt/aleph-vm/aleph/vm/network/hostnetwork.py", line 221, in create_tap
     await interface.create()
   File "/opt/aleph-vm/aleph/vm/network/interfaces.py", line 128, in create
     create_tap_interface(ipr, self.device_name)
   File "/opt/aleph-vm/aleph/vm/network/interfaces.py", line 37, in create_tap_interface
     raise InterfaceBusyError(
 aleph.vm.network.interfaces.InterfaceBusyError: Interface vmtap4 is busy - is there another process using it ?
 2024-03-19 15:45:49,362 | INFO | 127.0.0.1 [19/Mar/2024:15:45:30 +0000] "GET /vm/3fc0aa9569da840c43e7bd2033c3c580abb4
 ```
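
For readers who want to see the shape of the change, here is a minimal sketch of "log a warning instead of raising" for the busy-interface case. It is not the merged diff: `NetlinkErrorStandIn` and `BusyIPRoute` are hypothetical stand-ins so the example runs without pyroute2 or root privileges, and the boolean return value is an assumption added for illustration. Only the `ipr.link(...)` call and the warning text are taken from the traceback above.

```python
import errno
import logging

logger = logging.getLogger("tap_sketch")


class NetlinkErrorStandIn(Exception):
    """Hypothetical stand-in for a netlink error that carries an errno in `code`."""

    def __init__(self, code, message):
        super().__init__(code, message)
        self.code = code


class BusyIPRoute:
    """Hypothetical fake netlink handle that always reports the device as busy."""

    def link(self, command, **kwargs):
        raise NetlinkErrorStandIn(errno.EBUSY, "Device or resource busy")


def create_tap_interface(ipr, device_name):
    """Try to create a TAP interface.

    Returns True on success and False if the device was busy (assumed
    convention for this sketch). Any other netlink error is re-raised.
    """
    try:
        ipr.link("add", ifname=device_name, kind="tuntap", mode="tap")
    except NetlinkErrorStandIn as error:
        if error.code == errno.EBUSY:
            # Previously this path raised InterfaceBusyError, which aborted
            # instance creation; here it only leaves a warning in the log.
            logger.warning(
                "Interface %s is busy - is there another process using it ?",
                device_name,
            )
            return False
        raise
    return True
```

With the fake handle, `create_tap_interface(BusyIPRoute(), "vmtap4")` returns `False` and emits one warning rather than propagating an exception up to the HTTP handler.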
codecov bot commented Mar 19, 2024

Codecov Report

Attention: Patch coverage is 0%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 35.25%. Comparing base (0b93f6a) to head (22c5a70).

| Files | Patch % | Lines |
|---|---|---|
| src/aleph/vm/network/interfaces.py | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #579   +/-   ##
=======================================
  Coverage   35.25%   35.25%           
=======================================
  Files          53       53           
  Lines        4862     4862           
  Branches      577      577           
=======================================
  Hits         1714     1714           
  Misses       3127     3127           
  Partials       21       21           

@github-actions bot added the BLACK label ("This PR has critical implications and must be reviewed by a senior engineer.") on Mar 19, 2024
@aleph-im deleted a comment from github-actions bot on Mar 19, 2024
@hoh merged commit e7d97fa into main on Mar 19, 2024
20 of 21 checks passed
@hoh deleted the hoh-fix-error-interface-busy branch on March 19, 2024 16:15