QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once. #313

Open
JWBWork opened this issue Jun 25, 2024 · 0 comments

JWBWork commented Jun 25, 2024

This specific website (http://www.crazyplumbers.com/, used in the minimal reproduction below) triggers an error I can't make sense of:

QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.

It causes the Splash Docker container to hang: it becomes unresponsive to all subsequent requests. Running with more verbose logging didn't reveal any additional information.

The logs

(.venv) C:\Users\me\path\to\project>docker run -p 8050:8050 scrapinghub/splash:latest                                                                          
2024-06-25 20:39:41+0000 [-] Log opened.
2024-06-25 20:39:41.947216 [-] Xvfb is started: ['Xvfb', ':769163157', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2024-06-25 20:39:42.012362 [-] Splash version: 3.5
2024-06-25 20:39:42.045852 [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
2024-06-25 20:39:42.046036 [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
2024-06-25 20:39:42.046099 [-] Open files limit: 1048576
2024-06-25 20:39:42.046140 [-] Can't bump open files limit
2024-06-25 20:39:42.061355 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2024-06-25 20:39:42.061513 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2024-06-25 20:39:42.170427 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2024-06-25 20:39:42.170695 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2024-06-25 20:39:42.171427 [-] Site starting on 8050
2024-06-25 20:39:42.171615 [-] Starting factory <twisted.web.server.Site object at 0x7f96c40ae5c0>
2024-06-25 20:39:42.172103 [-] Server listening on http://0.0.0.0:8050
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.

Minimal reproduction

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest


class ResearchSpider(scrapy.Spider):
    name = "research_spider"

    # scrapy-splash settings (middlewares and dupefilter) as described in the scrapy-splash README
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'ROBOTSTXT_OBEY': True,
        'DOWNLOAD_DELAY': 2,
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse
            )

    def parse(self, response):
        print(f"parsing {response.url=}")


def crawl_process(websites: list[str]):
    print(f"Initializing crawler process - {websites=}")
    process = CrawlerProcess()
    process.crawl(ResearchSpider, start_urls=websites)
    process.start()
    print(f"Completed crawl")


if __name__ == "__main__":
    crawl_process([
        "http://www.crazyplumbers.com/",
    ])
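
For what it's worth, a direct call to Splash's render.html HTTP endpoint (bypassing Scrapy and scrapy-splash entirely) might help confirm whether the hang comes from Splash itself. This is just an untested sketch; the timeout values are assumptions, not something I've verified against this site:

# Untested sketch: hit Splash's render.html endpoint directly so that
# Scrapy and the scrapy-splash middlewares are out of the picture.
import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={
        "url": "http://www.crazyplumbers.com/",
        "timeout": 30,  # Splash-side render timeout (seconds), assumed value
        "wait": 1,      # let the page settle before rendering, assumed value
    },
    timeout=60,         # client-side timeout so this call can't block forever
)
print(resp.status_code, len(resp.text))

If that call also leaves the container unresponsive, the problem would seem to be in Splash/Qt itself rather than in the scrapy-splash middleware stack.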