move list of searx-instances from searx to searx-stats2 #7

Closed
return42 opened this issue Jan 6, 2020 · 20 comments

Comments

@return42
Member

return42 commented Jan 6, 2020

Starting with commit 200c3a31 from PR 1791, the list of public searx instances moves to the documentation tree. If this PR is merged, SEARX_INSTANCES_URL has to be changed to the new location.

TL;DR: everything discussed here ended up in #12 and #13.

@dalf
Member

dalf commented Jan 6, 2020

Some thoughts about the instance list:

  • If searx-stats2 displays the public instance list, it can make sense to store the instance list directly in "searx-stats2".
  • if not, and searx-stats2 still makes sense, how should searx-stats2 be integrated?
  • anyway, the current process (fetch/parse from the searx project) will always work.
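The fetch/parse step mentioned above can be sketched roughly as follows. This is only an illustration, not the searx-stats2 code: it assumes the fetched document contains the instance URLs as plain http(s) links, and `extract_instance_urls` and `URL_PATTERN` are hypothetical names.

```python
import re

# Assumption: instance URLs appear as plain http(s) links in the fetched text.
URL_PATTERN = re.compile(r"https?://[^\s<>`\"']+")


def extract_instance_urls(text: str) -> list:
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop punctuation that often trails a URL in running text.
        url = match.group(0).rstrip(".,;)")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = """
* `searx.example.org <https://searx.example.org/>`__
* https://search.example.net
* `again <https://searx.example.org/>`__
"""
print(extract_instance_urls(sample))
```

A regex scan like this is fragile compared to a structured file, which is part of why the storage question matters.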

@return42
Member Author

return42 commented Jan 6, 2020

See searx/searx#1791 (comment)

First, let's move the wiki entry to the docs folder; this will also close a bug at searx.

@return42 return42 changed the title from "move wiki/Searx-instances from wiki to docs" to "move list of searx-instances from searx to searx-stats2" Jan 6, 2020
@return42
Member Author

return42 commented Jan 6, 2020

FYI: I removed the URLs from the wiki entry at searx: https://github.com/asciimoo/searx/wiki/Searx-instances

@fgossel

fgossel commented Jan 10, 2020

I'm just curious. What is actually the reason for keeping the list of offline instances? Most of them will never be back online, I guess.

@dalf
Member

dalf commented Jan 10, 2020

What is actually the reason for keeping the list of offline instances?

Right now, the list lacks maintenance (all sections, including the online and incorrect SSL certificate sections).
As long as the instance list was stored in the wiki, it was difficult to do this maintenance.

Note: It would be easier if searx-stats2 would record the last time the instance was seen online.
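A minimal sketch of that "last seen online" idea, assuming an in-memory dict keyed by URL. The names `record_check` and `stale_instances` and the 30-day threshold are assumptions for illustration, not part of searx-stats2.

```python
from datetime import datetime, timedelta, timezone


def record_check(last_seen: dict, url: str, online: bool, now: datetime) -> None:
    """Remember the time of the most recent successful check for url."""
    if online:
        last_seen[url] = now


def stale_instances(last_seen: dict, urls: list, now: datetime,
                    max_age: timedelta = timedelta(days=30)) -> list:
    """Return the URLs never seen online, or not seen within max_age."""
    stale = []
    for url in urls:
        seen = last_seen.get(url)
        if seen is None or now - seen > max_age:
            stale.append(url)
    return stale


now = datetime(2020, 1, 10, tzinfo=timezone.utc)
last_seen = {}
record_check(last_seen, "https://a.example/", True, now - timedelta(days=2))
record_check(last_seen, "https://b.example/", True, now - timedelta(days=90))
record_check(last_seen, "https://c.example/", False, now)
urls = ["https://a.example/", "https://b.example/", "https://c.example/"]
print(stale_instances(last_seen, urls, now))
```

With such a record, the offline section could be pruned on a rule instead of by hand.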

@dalf
Member

dalf commented Jan 11, 2020

How to store the instance list is linked to workflows to add / update / delete items in this list.

I guess the main workflows are:

  • anyone can suggest a new instance to add to the list.
  • a new instance needs to be approved by a moderator.
  • an existing instance can be tagged: privacy problem, ...
  • an existing instance can be removed by the author or a moderator.
  • once an instance has been added or removed, the stats should be updated (the cache must be cleared for that one instance only)

Here are some ideas on how to store the instance list:

  1. one yaml file modified with github PR or patch sent by email. (Move away the documentation from the Wiki searx/searx#1785 (comment))
    • possible merge conflict for the PR: an issue can be used instead of a PR (but see the last solution).
    • an order can be enforced (a post-commit script can check this).
  2. one yaml file per instance:
    • easy to keep the track of the history per instance.
    • no conflicts, compared to the first solution
    • https / onion URLs can be in the same .yaml file.
    • perhaps, a file name convention must be defined.
  3. a database
    • it is possible to use OAuth from different providers.
    • it requires a server rather than a few static HTML pages.
  4. one github issue per instance in a dedicated project:
    • use issue template.
    • allow a comment thread per instance.
    • a label set by the moderators for the approved instances.
    • other labels can be added by moderators (tracker, etc...)
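For the first solution, the "an order can be enforced" check could look roughly like this. It is a sketch under the assumption that entries are sorted by plain string comparison of their URLs; `check_sorted` is a hypothetical name, and how the script is hooked into git is left open.

```python
def check_sorted(urls: list) -> list:
    """Return one error message per adjacent pair that is duplicated or out of order."""
    errors = []
    for previous, current in zip(urls, urls[1:]):
        if current == previous:
            errors.append(f"duplicate entry: {current}")
        elif current < previous:
            errors.append(f"{current} should come before {previous}")
    return errors


print(check_sorted(["https://a.example/", "https://c.example/", "https://b.example/"]))
```

An empty result means the file is in order; a hook would reject the commit otherwise.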

About the first solution, a yaml format:

- url: str # mandatory, https URL
  addtional_urls: # optional
    - url: str # searx instance URL (example https://search.gibberfish.org/tor/ )
      relation: str # comment about the link (example _Proxied through Tor_ )
  comments: str # optional, str or a list of str (?)
  unsafe: bool # optional, see https://github.com/dalf/searx-stats2/issues/6

Example:

- url: "https://search.gibberfish.org/"
  addtional_urls:
    - url: "http://o2jdk5mdsijm2b7l.onion/"
      relation: "Hidden Service"   
    - url: "https://search.gibberfish.org/"
      relation: "Proxied through Tor"
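To make the proposed format concrete, here is a sketch of a validator for one entry, working on the already-parsed data (a plain dict, so no YAML library is needed here). `validate_entry` is a hypothetical helper; the key names, including the spelling `addtional_urls`, are copied from the proposal above.

```python
def validate_entry(entry: dict) -> list:
    """Return a list of error messages; an empty list means the entry is valid."""
    errors = []
    url = entry.get("url")
    if not isinstance(url, str) or not url.startswith("https://"):
        errors.append("url: mandatory, must be an https URL")
    # Key spelled as in the proposal above.
    for index, extra in enumerate(entry.get("addtional_urls") or []):
        if not isinstance(extra, dict):
            errors.append(f"addtional_urls[{index}]: must be a mapping")
            continue
        for key in ("url", "relation"):
            if not isinstance(extra.get(key), str):
                errors.append(f"addtional_urls[{index}].{key}: must be a string")
    comments = entry.get("comments")
    if comments is not None and not isinstance(comments, (str, list)):
        errors.append("comments: optional, must be a string or a list of strings")
    unsafe = entry.get("unsafe")
    if unsafe is not None and not isinstance(unsafe, bool):
        errors.append("unsafe: optional, must be a boolean")
    return errors


entry = {
    "url": "https://search.gibberfish.org/",
    "addtional_urls": [
        {"url": "http://o2jdk5mdsijm2b7l.onion/", "relation": "Hidden Service"},
        {"url": "https://search.gibberfish.org/", "relation": "Proxied through Tor"},
    ],
}
print(validate_entry(entry))
```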

@dalf
Member

dalf commented Jan 13, 2020

Question: should the instance list be in this git repository or another one ?

Why another repository:

  • after all, the tool (code to check the instances) and the data (instance list) are two different things with different life cycles.
  • the commit messages don't have to follow the same rules.
  • perhaps in the long term, different management rights.

On the downside, it is another repository to manage.

@return42
Member Author

Maybe I was unclear. I want to replace the lists below https://asciimoo.github.io/searx/user/public_instances.html#alive-and-running with a paragraph similar to this:

At https://searx.space you will find a list of public instances. If you want to see your searx instance added to or removed from the https://searx.space/ list, please add a comment to issue https://github.com/dalf/searx-stats2/issues/12

This way, it is up to searx-stats2 how to maintain the (internal) list, no need for a separately maintained list.

@dalf
Member

dalf commented Jan 14, 2020

Note: for now, searx-stats2 scrapes the searx github repository a few times per day.

It is up to searx-stats2 how to maintain the (internal) list, no need for a separate maintained list.

👍

My previous comments try to address the "how to store and manage this list ?" question.
My wish is to make sure we all agree on the way the instance list is managed, that's why I put some answers on the table:

  • Heavy solution: a web UI (OAuth, database, etc...).
  • Lighter / simpler: a yaml file in the searx-stats2 repository. Next question: how to update this file ?
    • People send a PR to update the file.
    • OR people create one github issue per instance:
      • Maintainers process the issues ( = add a commit to change the .yaml file).
      • OR searx-stats2 scrapes the open issues (no need for a .yaml file, issues are created with an issue template, a label tags issues about instances, closed issue = delete the instance).
    • OR what you suggest, one unique github issue ( Change request to the list of searx instances  #12 ). Maintainers process the comments ( = add a commit to change the .yaml file).
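The "scrape the open issues" variant could be sketched as below. The issue dicts are a simplified, hypothetical shape (not the GitHub API response format), and the sketch assumes the issue title holds the instance URL; `instances_from_issues` is an illustrative name.

```python
def instances_from_issues(issues: list, label: str = "instance") -> list:
    """Build the instance list from issues: open and labelled = listed, closed = removed."""
    urls = []
    for issue in issues:
        if issue["state"] == "open" and label in issue["labels"]:
            # Assumption: the issue template puts the instance URL in the title.
            urls.append(issue["title"].strip())
    return sorted(urls)


issues = [
    {"title": "https://searx.example.org/", "state": "open", "labels": ["instance"]},
    {"title": "https://gone.example/", "state": "closed", "labels": ["instance"]},
    {"title": "Crash on startup", "state": "open", "labels": ["code"]},
]
print(instances_from_issues(issues))
```

Note that this variant makes github a hard dependency for the data, unlike the .yaml file options.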

Regarding the central issue: why not. Question: wouldn't it be difficult to follow the add / remove requests? Perhaps we can add a 👍 (or 👀) to the comments that have been processed (and add a notice about that).

About emails: I prefer a mailing list rather than receiving emails directly. I can create something like request at searx . space (gandi mail).

Note a mailing list also exists : searx/searx#578

@unixfox @asciimoo > what are your viewpoints?

@return42
Member Author

for now, searx-stats2 scrapes https://raw.githubusercontent.com/asciimoo/searx/master/dpublic_instances.rst

Really? What is SEARX_INSTANCES_URL needed for? (sorry if the question is dumb, I haven't looked through the whole sources).

Lighter / simpler: a yaml file in the searx-stats2 ... People send a PR to update the file.

is what I vote for

Question: wouldn't it be difficult to follow the add / remove requests?

Adding a link to the commit message should be enough to track.

mailing list

is dead

Do not try to make it perfect from the beginning: 80/20 rule

Most often it is better to establish a simple workflow initially; when you see it fail in some respects in practical usage, you can fix and optimize your workflows with the experience gained from practice.

rather than receiving emails directly

That's OK, adding an issue comment should be enough to start (BTW I modified #12 that way).

@dalf
Member

dalf commented Jan 14, 2020

Really? What is SEARX_INSTANCES_URL needed for? (sorry if the question is dumb, I haven't looked through the whole sources).

https://github.com/dalf/searx-stats2/blob/master/searxstats/source/searx_docs.py#L6
I haven't deleted the previous code.

Do not try to make it perfect from the beginning: 80/20 rule

Sure, but:

  • since it is a public "workflow", if we change it later, there will be some lag / confusion.
  • so I prefer to double check before: no big deal.

I'm okay with the #12 solution.

BTW, I've created #13

@unixfox
Member

unixfox commented Jan 15, 2020

I think it would be better to have a dedicated issue template rather than one general issue, because:

  • This won't create a huge mess like it would when there are too many comments in one single issue.
  • When there are too many comments in an issue scrolling becomes awful.
  • GitHub tends to hide a huge number of comments if the issue is too big, this makes viewing historical comments a nightmare.
  • Having an issue for each new instance request makes the debate about adding the instance more pleasant than having one big issue.

@dalf
Member

dalf commented Jan 16, 2020

@unixfox > makes sense.

In this case, the issues about the instances and the ones about the code will merge into one big list. I think that will be confusing?

Labels can be a way to solve this :

  • according to the github documentation, an issue template can assign a label automatically.
  • I don't know if there is a way to make template usage mandatory? That way an issue would have either the label "code" or the label "instance".

Another way is to create an additional github repository. The user rights can be different between this project and the new one.

@return42
Member Author

The repository and the commits do matter, github dependencies only reduce the degrees of freedom.

@dalf you are the master of searx-stats2 and the decision is up to you. I can only repeat myself: let's keep things simple and make progress.

@dalf
Member

dalf commented Jan 19, 2020

Why was the instance list hosted on the wiki a problem?
As I understand it, anyone could modify the content, in particular delete an instance without notice.

The solution here is to add a human review:

  • anyone can send a request to add or to delete an instance.
  • a human reviews the request then accepts or denies it.

How to review a delete request ?
Should the request to add the instance and the request to delete the instance come from the same github account? If it comes from a different account, I don't know how to deal with it.

Here is a solution:

  • the instance list is stored in a yaml file: the instance list doesn't rely on github.
    This file is stored in a branch named instance: code and data are not in the same place.
    Example:
    - url: "https://search.gibberfish.org/"
      addtional_urls:
      - url: "http://o2jdk5mdsijm2b7l.onion/"
        relation: "Hidden Service"   
      - url: "https://search.gibberfish.org/"
        relation: "Proxied through Tor"
      comments: 
      safe: true  # safe instead of "unsafe" (compare to previous example).
  • how to modify this list ?
    One issue per action: from the reviewers point of view, an opened issue is a review to do.
    Issue templates are mandatory (all the requests have the "instance" label).
    Template:
    • an issue to add a new instance --> it is reviewed, the instance list is modified. Issue closed.
    • an issue to remove the instance, with a reference to the "add a new instance" issue --> it is reviewed, the instance list is modified. Issue closed.
    • an issue to change the "safe" status --> it is reviewed, commit, issue closed.

When a reviewer accepts the change, the instance list is modified with a commit (no need for a PR): reviewers are trusted to write good commit messages.
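The three request types can be sketched as one function that applies an accepted request to the list and returns a commit message. Everything here is illustrative: the request dict shape, the action names, the default `safe: true` on add and the commit message layout are assumptions, not an agreed format.

```python
def apply_request(instances: dict, request: dict) -> str:
    """Apply one accepted request in place and return a commit message for it."""
    action, url = request["action"], request["url"]
    if action == "add":
        if url in instances:
            raise ValueError(f"{url} is already listed")
        # Assumption: a newly added instance starts as safe.
        instances[url] = {"safe": True}
    elif action == "remove":
        if url not in instances:
            raise ValueError(f"{url} is not listed")
        del instances[url]
    elif action == "set_safe":
        if url not in instances:
            raise ValueError(f"{url} is not listed")
        instances[url]["safe"] = request["safe"]
    else:
        raise ValueError(f"unknown action: {action}")
    # Referencing the issue keeps the request traceable from the git history.
    return f"{action} {url}\n\nClose #{request['issue']}"


instances = {}
print(apply_request(instances, {"action": "add", "url": "https://searx.example.org/", "issue": 1}))
```

Because each request maps to exactly one commit, the issue list and the git log stay in step.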

The drawbacks:

  • a searx instance administrator won't know when the "safe" status changes, unless (s)he checks the issue list.
  • instance requests and bug / feature requests about the code are mixed in the same place: it is possible to exclude issues with a given label from the issue list.
  • a github account is required for the reviewers and searx administrators (side note: to remove the need for a github account, I thought about something like letsencrypt: an HTTP challenge to add the searx instance, a deny entry in the robots.txt to delete the instance, all managed automatically by a setting in settings.yml, but that's a heavy solution).

@return42 : it is basically what you have suggested, except there is one issue per request instead of one long list. I think it makes the reviewer's life easier.

@unixfox
Member

unixfox commented Jan 19, 2020

I thought about something like letsencrypt: an HTTP challenge to add the searx instance, a deny entry in the robots.txt to delete the instance, all managed automatically by a setting in settings.yml but that's a heavy solution

Why not instead a TXT entry in the DNS?

@dalf
Member

dalf commented Jan 19, 2020

Why not instead a TXT entry in the DNS?

With the HTTP challenge / robots.txt solution, searx code can deal with it automatically:

  • make the instance public: change a boolean in settings.yml, reload, done (frankly speaking: can be tricky and error prone since uwsgi / multiple processes can run at the same time).
  • remove the instance: change back the boolean, reload, done.

The DNS solution requires another layer of complexity: most probably it requires a "check my DNS configuration" step in searx-stats2.

Anyway, both can be implemented, but each requires a database and a web server.
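To illustrate the ownership check both variants rely on, here is a sketch of a token challenge. It is not an existing searx or searx-stats2 feature: the path `/.well-known/searx-challenge` and the function names are hypothetical, and the `fetch` callable is injected so that the same check could sit behind an HTTP request or a DNS TXT lookup.

```python
import hmac
import secrets
from typing import Callable, Optional

# Hypothetical location where the instance would publish the token.
CHALLENGE_PATH = "/.well-known/searx-challenge"


def new_token() -> str:
    """Create a random token to hand to the instance administrator."""
    return secrets.token_urlsafe(32)


def verify_challenge(base_url: str, token: str,
                     fetch: Callable[[str], Optional[str]]) -> bool:
    """Return True if the instance publishes the expected token."""
    body = fetch(base_url.rstrip("/") + CHALLENGE_PATH)
    if body is None:
        return False
    # Constant-time comparison on bytes, so non-ASCII bodies do not raise.
    return hmac.compare_digest(body.strip().encode(), token.encode())


token = new_token()
# Stand-in for real HTTP requests: a dict mapping URL to response body.
pages = {"https://searx.example.org" + CHALLENGE_PATH: token + "\n"}
print(verify_challenge("https://searx.example.org/", token, pages.get))
```

Either way the issued tokens have to be stored somewhere, which is the database requirement mentioned above.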

Are you saying that you prefer this solution to the ".yaml file + github issues" solution ?

@dalf
Member

dalf commented Feb 2, 2020

So here is a proposal:

  • if we create one issue per instance in searx-stats2, the issues about the code will be lost
  • one issue to manage all the instances will be a mess as soon as someone wants to talk about an instance.
  • so I've created another project: https://github.com/dalf/searx-instances/ : code and data are (nearly) split.
  • @asciimoo it would make sense to move it to searx/searx-instances what do you think ?
  • the instance list is stored in searxstats/instances.yml
  • the current instances.yml is the result of the importation using searxinstances/utils/import_rst.py
  • on each commit, instances.yml is checked
  • python -m searxinstances.update allows editing instances.yml
usage: update.py [-h]
                 [--github-issues [GITHUB_ISSUE_LIST [GITHUB_ISSUE_LIST ...]]]
                 [--add [ADD_INSTANCES [ADD_INSTANCES ...]]]
                 [--delete [DELETE_INSTANCES [DELETE_INSTANCES ...]]]
                 [--edit [EDIT_INSTANCES [EDIT_INSTANCES ...]]]

Update the instance list according to the github issues.

optional arguments:
  -h, --help            show this help message and exit
  --github-issues [GITHUB_ISSUE_LIST [GITHUB_ISSUE_LIST ...]]
                        Github issue number to process, by default all
  --add [ADD_INSTANCES [ADD_INSTANCES ...]]
                        Add instance(s)
  --delete [DELETE_INSTANCES [DELETE_INSTANCES ...]]
                        Delete instance(s)
  --edit [EDIT_INSTANCES [EDIT_INSTANCES ...]]
                        Edit instance(s)

The tool :

  • opens the default editor to edit only one instance at a time.
  • once the user quits the editor, the script checks everything is okay, if not it goes back to the editor with the error added at the end of the buffer.
  • if everything is okay, the script modifies the instances.yml file.
  • then it creates a commit.
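The edit / check / re-edit loop described above can be sketched like this. It is not the code of searxinstances.update: `edit` and `validate` are injected callables (the real tool would launch the editor and check the YAML), and the `#> ERROR: ` marker and the retry limit are assumptions.

```python
ERROR_PREFIX = "#> ERROR: "  # hypothetical marker for errors appended to the buffer


def edit_until_valid(buffer: str, edit, validate, max_rounds: int = 10):
    """Loop between editor and validation; return the accepted buffer or None if cancelled."""
    for _ in range(max_rounds):
        buffer = edit(buffer)
        if not buffer.strip():
            return None  # an empty buffer cancels the request
        # Drop error lines left over from a previous round before re-checking.
        lines = [line for line in buffer.splitlines() if not line.startswith(ERROR_PREFIX)]
        buffer = "\n".join(lines) + "\n"
        errors = validate(buffer)
        if not errors:
            return buffer
        # Go back to the editor with the errors added at the end of the buffer.
        buffer += "".join(ERROR_PREFIX + error + "\n" for error in errors)
    raise RuntimeError("too many failed edits")
```

Only after this function returns a buffer would the script write instances.yml and create the commit.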

--github-issues reads the github issues.

There are issue templates : https://github.com/dalf/searx-instances/issues/new/choose

So:

  • the instance list is stored on a git repository.
  • github is used as a helper, but it doesn't create a hard dependency
  • the commit messages make sense.

An example of what is shown in the default editor:

https://nibblehole.com:
  safe: false

# Add https://nibblehole.com
#
# Close https://github.com/dalf/searx-instances/issues/2
# From @dalf

#> The above text is the commit message
#> Delete the whole buffer to cancel the request

#> -- MESSAGE -----------------------
#> See https://github.com/asciimoo/searx/pull/1818

Here it is possible to modify the yaml and the commit message and then validate, or to delete the whole buffer to cancel.
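A sketch of how such a buffer could be split back into its parts, assuming the conventions visible in the example: lines starting with `#>` are help text, other lines starting with `#` form the commit message, and everything else is YAML. `split_buffer` is a hypothetical name; the actual tool may parse the buffer differently.

```python
def split_buffer(buffer: str):
    """Return (yaml_text, commit_message), or None if the buffer was emptied to cancel."""
    if not buffer.strip():
        return None
    yaml_lines, message_lines = [], []
    for line in buffer.splitlines():
        if line.startswith("#>"):
            continue  # help text and context, not part of the result
        if line.startswith("#"):
            message_lines.append(line[1:].strip())
        else:
            yaml_lines.append(line)
    return "\n".join(yaml_lines).strip("\n"), "\n".join(message_lines).strip()


example = """https://nibblehole.com:
  safe: false

# Add https://nibblehole.com
#
# Close https://github.com/dalf/searx-instances/issues/2
# From @dalf

#> The above text is the commit message
#> Delete the whole buffer to cancel the request
"""
yaml_text, message = split_buffer(example)
print(yaml_text)
print(message)
```

One consequence of this convention is that plain YAML comments cannot be used inside the buffer, since they would be read as commit message lines.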

Note: this tool is not mandatory, it is only a helper.

searx-stats integration: pip install does not update a package referenced from a git repository. So here is PR #16, which basically runs git clone https://github.com/dalf/searx-instances/ or git pull on each run, and makes an ugly change to the PYTHONPATH.
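The clone-or-pull step could be sketched as follows. This is not the code from PR #16: `sync_command` and `sync_and_expose` are hypothetical names, and inserting into `sys.path` is used here as the in-process counterpart of changing PYTHONPATH.

```python
import os
import subprocess
import sys

REPO_URL = "https://github.com/dalf/searx-instances/"


def sync_command(checkout_dir: str) -> list:
    """Return the git command to run: pull if a checkout exists, clone otherwise."""
    if os.path.isdir(os.path.join(checkout_dir, ".git")):
        return ["git", "-C", checkout_dir, "pull"]
    return ["git", "clone", REPO_URL, checkout_dir]


def sync_and_expose(checkout_dir: str) -> None:
    """Update the checkout, then make its packages importable in this process."""
    subprocess.run(sync_command(checkout_dir), check=True)
    if checkout_dir not in sys.path:
        sys.path.insert(0, checkout_dir)
```

Calling `sync_and_expose` at the start of each run keeps the instance list current without reinstalling anything.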

@dalf
Member

dalf commented Feb 13, 2020

The PR #16 has been merged.
The instance list is hosted here: https://github.com/dalf/searx-instances/

You can see the result in https://searx.space/
In the top right corner, the Show comments checkbox displays something similar to https://asciimoo.github.io/searx/user/public_instances.html with the exception of the "Useful information" and "Meta-searx instances" sections.

@return42
Member Author

@dalf excellent work, much more than I ever expected / thanks a lot!!
