Caching hosts lists to avoid unnecessary downloads for speed and searching the cache to find which source blocked which domain #88

yuannan · 2021-12-02T22:11:54Z

Is your feature request related to a problem? Please describe:
Sometimes a list blocks a domain and I want to either exclude that host list or report it upstream if it's a false positive.

Currently this is hard, if not impossible, without the help of scripts, as the user has to download the host lists individually and then search them.

Describe the solution you'd like:
The host lists would be cached within /etc/hblock/host_lists/
A new option, -fhs or --find-host-source, would then find the host list that blocked a given domain.
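To make the request concrete, here is a minimal sketch of what such a lookup could do against the proposed cache directory. The function name find_host_source and the assumption that each cached list is a plain-text file with one entry per line (hosts format or a bare domain) are mine for illustration; this is not an existing hBlock option.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of hBlock): print every cached list that
# contains an exact entry for the given domain.
# Assumption: each file in the cache directory is a plain-text list with
# lines such as "0.0.0.0 example.com" or just "example.com".
find_host_source() {
	local domain="$1" cache_dir="${2:-/etc/hblock/host_lists}"
	local file
	for file in "$cache_dir"/*; do
		[ -f "$file" ] || continue
		# Compare the domain against whole whitespace-separated fields,
		# skipping comment lines, so "example.com" does not match
		# "ads.example.com".
		if awk -v d="$domain" '
			/^[[:space:]]*#/ { next }
			{ for (i = 1; i <= NF; i++) if ($i == d) { found = 1; exit } }
			END { exit !found }
		' "$file"; then
			printf '%s\n' "$file"
		fi
	done
}
```

Calling find_host_source ads.example.com would print the path of each cached list that blocks that exact domain, which is the list to exclude or report upstream.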

Describe alternatives you've considered:
So far I've written my own script to search the lists. It has to download them first and then search them, which is why I think a cache is a good idea.

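#!/usr/bin/env bash
# Download every source in the hBlock sources list, then grep them for a domain.
block_list_path="/etc/hblock/sources.list"

if [[ -z "$1" ]]
then
	echo "Needs domain to search for..." >&2
	exit 1
fi

# Work in a throwaway directory and print it so the downloads can be inspected.
dir=$(mktemp -d)
echo "$dir"
cd "$dir" || exit 1

while read -r s
do
	# Skip blank lines and comment lines.
	if [[ -n "$s" && ! "$s" =~ ^# ]]
	then
		wget "$s" 2>/dev/null &
	fi
done < "$block_list_path"
wait

# -F treats the domain as a literal string, so its dots are not regex wildcards.
grep -F -- "$1" "$dir"/*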

The cache deserves its own feature request, but I think it should either:

1. Keep the host lists as they are right now, but add a meta file at https://raw.githubusercontent.com/hectorm/hmirror/master/data/lists.meta.txt that keeps track of when each list was last updated. This speeds up downloads and avoids fetching unnecessary files. It moves hBlock towards being something like a package manager for hosts files.
2. Have a header at the top of each file indicating when it was last updated, prefixed with # like a comment. The header can be downloaded without downloading the entire file. The "raw" version of the file should be kept cached. When it is processed, all lines starting with '#' are ignored before it is turned into /etc/hosts. I've tested this with

curl -r 0-100 https://raw.githubusercontent.com/hectorm/hmirror/master/data/ublock/list.txt

to get roughly the first 100 bytes of the file. The header can easily be downloaded first and checked against the database on the device; if the header in the remote file is newer, the rest is downloaded. I think this option should be avoided in favour of the first, as I imagine others have set up their own scripts with the assumption that your mirrored lists have no comments. Option 1 is much easier to implement and has fewer side effects.
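The header check described for the second option could look roughly like the sketch below. The header format "# Updated: <UTC timestamp>", the function names, and the cache path in the commented wiring are all assumptions made for illustration; the mirrored lists do not carry such a header today. Timestamps in this fixed ISO 8601 UTC form sort correctly as plain strings, which is what the comparison relies on.

```shell
#!/usr/bin/env bash
# Hypothetical header line (an assumption, not something the mirror publishes):
#   # Updated: 2021-12-02T22:11:54Z

# Read the start of a list on stdin and print the first header timestamp found.
header_timestamp() {
	sed -n 's/^#[[:space:]]*Updated:[[:space:]]*\([0-9TZ:-]*\).*/\1/p' | head -n 1
}

# Succeed when a full download is needed: the remote header is missing,
# nothing is cached yet, or the remote timestamp is newer than the cached one.
needs_update() {
	local remote="$1" cached="$2"
	if [ -z "$remote" ] || [ -z "$cached" ]; then
		return 0
	fi
	# Fixed-format UTC timestamps compare correctly as strings.
	[[ "$remote" > "$cached" ]]
}

# Example wiring (needs network access, so it is left commented out;
# the cache path is hypothetical):
# url="https://raw.githubusercontent.com/hectorm/hmirror/master/data/ublock/list.txt"
# cache="/etc/hblock/host_lists/ublock.txt"
# remote=$(curl -fsS -r 0-100 "$url" | header_timestamp)
# cached=$(header_timestamp < "$cache")
# if needs_update "$remote" "$cached"; then curl -fsS -o "$cache" "$url"; fi
```

Falling back to a full download when no header is found keeps the sketch safe for lists that are served without comments.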


hectorm · 2021-12-06T14:24:04Z

Hi, thank you for taking the time to write this request.

I think a cache is outside the scope of the project; I would not like hBlock to create more files than the one specified in the --output option.

However, I understand that it would be useful to easily know which sources are blocking a particular domain, so I'm thinking about adding a feature that downloads the sources and prints this information, quite similar to what your script does.
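A rough sketch of how such a feature might behave without writing any files: each source is streamed through a matcher and only the URLs that contain the domain are printed. The function name sources_blocking and the FETCH variable are hypothetical, and the sketch assumes the sources file holds one URL per line with # comments; it does not describe how hBlock implements anything.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not hBlock code): print each source URL whose list
# contains an exact entry for the domain. Downloads are piped straight into
# awk, so nothing is written to disk.
# FETCH is an assumption made for this example; it defaults to curl.
FETCH=${FETCH:-curl -fsSL}

sources_blocking() {
	local domain="$1" sources_file="${2:-/etc/hblock/sources.list}"
	local url _
	while read -r url _; do
		# Skip blank lines and comments in the sources file.
		case "$url" in '' | '#'*) continue ;; esac
		# $FETCH is deliberately unquoted so its options are split into words.
		if $FETCH "$url" 2>/dev/null | awk -v d="$domain" '
			/^[[:space:]]*#/ { next }
			{ for (i = 1; i <= NF; i++) if ($i == d) { found = 1; exit } }
			END { exit !found }
		'; then
			printf '%s\n' "$url"
		fi
	done < "$sources_file"
}
```

Printing the source URL rather than a downloaded file name also sidesteps the problem that many mirrored lists share the same file name, list.txt.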

yuannan added the enhancement label Dec 2, 2021
