Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow launcher to tracker processes across restarts #521

Open
jpmckinney opened this issue Jul 18, 2024 · 2 comments
Open

Allow launcher to tracker processes across restarts #521

jpmckinney opened this issue Jul 18, 2024 · 2 comments

Comments

@jpmckinney
Copy link
Contributor

jpmckinney commented Jul 18, 2024

The default Launcher stores self.processes as a dict in memory, so when Scrapyd restarts, it loses track of the processes. The processes are stored as ScrapyProcessProtocol instances.

A new Launcher could perhaps store these in a backend, that it loads at startup.

Scrapyd only needs to restart rarely (e.g. version upgrade, configuration change, or server restart). If #519 is done, then configuration changes won't be a concern here.

@jpmckinney
Copy link
Contributor Author

jpmckinney commented Jul 23, 2024

Having now read the code, the protocol stores a deferred, such that when the subprocess returns, it runs the _process_finished callback. Attaching this callback can be done on restart. The harder part is reestablishing whatever calls the protocol's outReceived, errReceived and processEnded methods (maybe originally done by reactor.spawnProcess?).

We would need to change the model significantly to resolve this issue. So, closing unless there are new ideas on how to resolve.

Edit: This was also discusssed starting from this comment #12 (comment) and one idea was to somehow use Scrapy's JOBDIR logic - I haven't looked into it.

Anyway, I'll at least re-open in case anyone has new ideas, but for now this seems hard to resolve.

@jpmckinney jpmckinney removed this from the Priority milestone Jul 24, 2024
@jpmckinney jpmckinney reopened this Jul 24, 2024
@jpmckinney
Copy link
Contributor Author

jpmckinney commented Jul 24, 2024

Okay, so I did a little experimenting and wrote this small app that can run test files.

The test files are print1.py

import time

while True:
    print(1)
    time.sleep(1)

and printonce.py

print(1)

The app is:

import sys

from twisted.application.service import Application, Service
from twisted.internet import protocol, reactor
from twisted.python.runtime import platformType
from twisted.python import log
from twisted.scripts import twistd

from scrapyd.launcher import ScrapyProcessProtocol

if platformType == "win32":
    from twisted.internet._dumbwin32proc import Process
else:
    from twisted.internet.process import Process


def finish(proto):
    print("done")


class Launcher(Service):
    def startService(self):
        args = [sys.executable, "print1.py"]
        env = {}
        process = ScrapyProcessProtocol("p1", "s1", "j1", env, args)
        process.deferred.addBoth(finish)

        # reactor.spawnProcess(process, sys.executable, args=args, env=env)
        Process(reactor, sys.executable, args, env, path=None, proto=process)


def app():
    app = Application("test")
    launcher = Launcher()
    launcher.setServiceParent(app)
    return app


application = app()

And the app is run with:

twistd -n -y spawn.py

To explain:

Importing twisted.internet.reactor calls twisted.internet.default.install(), which in turn calls one of:

  • twisted.internet.epollreactor.install(), installing EPollReactor
  • twisted.internet.pollreactor.install(), installing PollReactor
  • twisted.internet.selectreactor.install(), installing SelectReactor

which all inherit from twisted.internet.posixbase.PosixReactorBase. This class implements spawnProcess, which, for Scrapyd, instantiates twisted.internet.process.Process if twisted.python.runtime.platformType == "posix", and twisted.internet._dumbwin32proc.Process if "win32" (that is, os.name in ("nt", "ce")).

Anyway, Process.__init__ always calls its _fork method, which always calls os.fork() or os.posix_spawnp to create a subprocess – so there doesn't seem to be an option to hook into a running process using Twisted. We would have to figure out our own way, while also reestablishing the logic for processEnded, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant