different results for same search, two years later #496

Open
sofiatipa opened this issue Dec 19, 2023 · 7 comments

Comments

@sofiatipa

Hi,

I repeated a search in Hyphe that I first ran nearly two years ago. I am trying to find the co-linkages between two web entities, but the results are quite different: the original search came up with 6 pages that were used by both sites, while the new search shows 3 different pages. Why is that happening? And is there any way to retrieve the original search from your online version?

@boogheta
Member

Hello @sofiatipa, I can only guess, but it seems reasonable that the websites you crawled have changed quite a bit over two years, which would logically return different results today.
You can try using the web archives to retrieve the corpus as it was back then (by activating the option in the Settings tab of an empty corpus), but archives are not always complete, so there is no guarantee.

@sofiatipa
Author

sofiatipa commented Dec 19, 2023 via email

@boogheta
Member

boogheta commented Dec 19, 2023

Hello again,

It looks like the Geopolitika.ru website takes quite an aggressive approach towards web crawlers: it refuses most robots through some (quite smart) methods, which apparently also prevent Web.Archive.org from archiving it (see for instance https://web.archive.org/web/20200417113623/https://www.geopolitika.ru/).

Unfortunately, there is no way to make Hyphe work with this website as of today.

You can, however, go back far enough in time, to before they put those measures in place: just explore the web archives until you find a functional version and ask Hyphe to crawl at that date.
You can do so by entering the URL of the web archive directly into the IMPORT box of Hyphe.

For instance, I got a crawl working, with more than 70 pages visited as of 2018, by using this URL as the start point: https://web.archive.org/web/20180212120000/https://www.geopolitika.ru
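
To try other dates, the archive URL can be built by hand or with a few lines of code. This is only a minimal sketch, assuming the usual Wayback Machine URL pattern visible in the links above (a 14-digit `YYYYMMDDhhmmss` timestamp followed by the original URL). The `wayback_url` helper is a hypothetical name and is not part of Hyphe.

```python
from datetime import datetime

# Prefix seen in the archive links quoted in this thread.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine URL for original_url at the given date.

    Hypothetical helper for illustration only: it just formats the
    timestamp as YYYYMMDDhhmmss and prepends it to the original URL.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Same start point as the 2018 example above.
start_url = wayback_url("https://www.geopolitika.ru", datetime(2018, 2, 12, 12, 0, 0))
print(start_url)
```

The resulting URL can then be pasted into the IMPORT box as described. The archive generally serves the capture closest to the requested timestamp, so the date actually crawled may differ somewhat from the one asked for.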

@sofiatipa
Author

Hi Benjamin,

I have a new question: the installed version of Hyphe has stopped creating web entities from some sites it previously crawled (the last successful crawl was in August). I tried the same crawl in the online demo version and it works perfectly. Any idea why this might be happening with the desktop version?

Also, is it possible that the number of pages Hyphe crawls varies from one day to another?

Many thanks!

Sofia

@boogheta
Member

Hello @sofiatipa, it's hard to tell without more information, but a priori there is no reason your desktop version of Hyphe would behave differently from the online demo. Did you try in a new corpus or in a preexisting one?

@sofiatipa
Author

sofiatipa commented Sep 20, 2024 via email

@boogheta
Member

I apologize, Sofia, but I don't really understand what you mean by "unable to define the web entities". Could you explain precisely which steps you took and where you get stuck? It might be that your whole local Hyphe instance needs to be restarted; have you tried that?

Regarding the crawl depth, you can think of it as the number of links a user would click from the starting page.
For instance, if you start from a specific start page with a depth of 2, the crawler will visit all pages linked from that page (only those belonging to that specific website), and these are the pages of depth 1. Then it will similarly visit all pages of the website linked from those depth 1 pages, and these will be depth 2. Then it will stop.
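
The depth rule described above can be sketched as a breadth-first traversal. This is an illustration only, not Hyphe's actual crawler: the `crawl_depths` function and the `toy_links` graph are hypothetical, the links come from an in-memory dictionary instead of real HTTP requests, and "belonging to the website" is approximated by comparing host names, which is simpler than Hyphe's web entity boundaries.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_depths(start, links, max_depth):
    """Return {url: depth} for same-site pages within max_depth clicks of start.

    links maps each page URL to the list of URLs it links to
    (a stand-in for fetching and parsing the page).
    """
    site = urlparse(start).netloc
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Pages at the maximum depth are visited but their links are not followed.
        if depths[page] == max_depth:
            continue
        for target in links.get(page, []):
            # Skip links that leave the website being crawled.
            if urlparse(target).netloc != site:
                continue
            # Keep the first (shortest) depth at which a page is reached.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical link graph for illustration.
toy_links = {
    "https://example.org/": [
        "https://example.org/a",
        "https://example.org/b",
        "https://other.net/x",
    ],
    "https://example.org/a": ["https://example.org/c"],
    "https://example.org/c": ["https://example.org/d"],
}

result = crawl_depths("https://example.org/", toy_links, max_depth=2)
for url, depth in result.items():
    print(depth, url)
```

With a depth of 2, pages `a` and `b` are at depth 1 and `c` is at depth 2. Page `d` would be depth 3, so it is never visited, and the link to `other.net` is ignored because it belongs to another site.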


2 participants