[Python] URL Scraper finder #1013

Draft
wants to merge 8 commits into
base: master

Conversation

Belleyy
Contributor

@Belleyy Belleyy commented Jun 5, 2022

This scraper is triggered by any URL (its URL pattern matches a dot, .).

What do I need to do?

  • You have a URL you want to scrape but don't know whether a scraper exists for it.
  • Put the URL in the URL field, as you would for any other URL scraper.
  • Press the scrape button.
  • The scrape window either appears or it doesn't; check the log for more info 😄

Process

  • It creates a folder called tmp inside your scrapers folder, downloads the whole repo (master.zip), and places it in this new folder.
    • It re-downloads the zip every week (7 days).
    • This folder should permanently contain only 2 files: the zip and the list.
  • It extracts the list (SCRAPERS-LIST.md) and looks for a scraper matching your URL. It also checks your local scraper files.
  • ❌ If no scraper matches the URL, it does nothing. Request the scraper, because it probably doesn't exist. (Some scrapers may exist that my script fails to match, like mgstage.)
  • ✔ If a scraper is found, it extracts it, reloads the scrapers in Stash, scrapes, removes the scraper, and reloads the scrapers again.
    • ⚠ If the scraper is a Python script, it won't be extracted. These often require setup, so I don't want to deal with that. It will warn you in the log.
    • If you have the scraper file in your scrapers folder but it lacks the URL, the script will tell you that an update to this file is available that adds the new URL.
  • 🔷 If you already have the scraper locally, it scrapes your scene normally.
    • This script sometimes takes precedence over the correct scraper when you press the URL scrape button. There is no fixed order in which scrapers run when you press it.
  • To avoid calling itself, the script renames its own .yml (to .yml.tmp) during the operation.
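
The three mechanical parts of the process above (the weekly re-download check, the lookup in the list, and the self-rename) could be sketched roughly as follows. This is only an illustration, not the code from this PR: the function names are hypothetical, and the layout of SCRAPERS-LIST.md is assumed to be a pipe-delimited table whose first cell is the site domain and whose second cell is the scraper file name.

```python
import contextlib
import os
import time
from urllib.parse import urlparse

MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # re-download the zip after 7 days


def zip_is_stale(zip_path, now=None):
    """Return True if the cached zip is missing or older than 7 days."""
    if not os.path.exists(zip_path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(zip_path) > MAX_AGE_SECONDS


def find_scraper(list_text, url):
    """Return the scraper file name listed for the URL's site, or None.

    ASSUMPTION: each table row looks like "site.com|Scraper.yml|...".
    The real SCRAPERS-LIST.md layout may differ.
    """
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    for line in list_text.splitlines():
        # Drop optional leading/trailing pipes, then split into cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2 or not cells[0]:
            continue
        site = cells[0].lower()
        # Match the exact domain or any of its subdomains.
        if host == site or host.endswith("." + site):
            return cells[1]
    return None


@contextlib.contextmanager
def hidden_from_stash(yml_path):
    """Rename our own .yml to .yml.tmp for the duration, then restore it."""
    tmp_path = yml_path + ".tmp"
    os.rename(yml_path, tmp_path)
    try:
        yield tmp_path
    finally:
        # Restore the original name even if the scrape raised an error.
        os.rename(tmp_path, yml_path)
```

Doing the rename in a context manager means the .yml is restored even when the scrape fails partway, so the finder does not stay disabled after an error. Matching on the host name alone would likely miss sites whose listed name differs from their domain, which may be the kind of case mentioned above for mgstage.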

Why?

  • You are sure to have the latest scraper.
  • You don't end up with tons of files in your scrapers folder, keeping only the scrapers you use often:
    • Scrapers that need setup (like Python & ThePornDB)
    • Scrapers that you use for ScraperByName & Fragment
  • For when you're too lazy to check the scraper list / download the file.

Draft, because I don't know whether it should be in the repo. If someone downloads the whole repo without knowing about it, this scraper will most likely be triggered.

@JaseNZC
Contributor

JaseNZC commented Jun 6, 2022

Awesome idea @Belleyy

@bnkai bnkai added the script Scraper executes a script label Jun 7, 2022