Adding separator to the Web Crawler
spugachev committed Feb 22, 2024
1 parent bc6f120 commit d73f057
Showing 1 changed file with 1 addition and 1 deletion.
@@ -110,7 +110,7 @@ def parse_url(url: str):
             raise Exception(
                 f"Invalid content type {response.headers['Content-Type']}")
         soup = BeautifulSoup(response.content, "html.parser")
-        content = soup.text
+        content = soup.get_text(separator=' ')
         content = re.sub(r"[ \n]+", " ", content)
 
         links = list(set([a["href"] for a in soup.find_all("a", href=True)]))
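
For context, a minimal sketch of what the change appears to address: `soup.text` concatenates adjacent text nodes with no delimiter, so text from neighbouring block elements can run together, while `get_text(separator=' ')` joins the nodes with a space. The helper `extract_text` and the sample HTML are hypothetical and not part of the repository; only the `BeautifulSoup` and `re.sub` calls mirror the diff.

```python
import re

from bs4 import BeautifulSoup


def extract_text(html: str, use_separator: bool) -> str:
    """Hypothetical helper mirroring the two variants of the changed line."""
    soup = BeautifulSoup(html, "html.parser")
    # Old behaviour: soup.text joins text nodes with an empty string.
    # New behaviour: get_text(separator=" ") joins them with a space.
    content = soup.get_text(separator=" ") if use_separator else soup.text
    # Same whitespace normalisation as in the diff.
    return re.sub(r"[ \n]+", " ", content)


# Sample markup (an assumption for illustration only).
html = "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"

before = extract_text(html, use_separator=False)
after = extract_text(html, use_separator=True)

print(before)  # TitleFirst paragraph.Second paragraph.
print(after)   # Title First paragraph. Second paragraph.
```

One trade-off to keep in mind: the separator is inserted between every pair of text nodes, so text split across inline tags (for example `<b>` inside a word) also gains a space.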