Commit

Improve website text extractor
Elehiggle committed May 24, 2024
1 parent 5164f8f commit 09c820b
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions chatbot.py
@@ -1045,6 +1045,12 @@ def request_link_text_content(link, prev_response):
     soup = BeautifulSoup(raw_content, "html.parser")
     website_content = soup.get_text(" | ", strip=True)

+    # Replace with a tokenizer once there is one for latest Anthropic models
+    if len(website_content) > 1_000_000:
+        logger.debug("Website text content too large, trying to extract article content only")
+        article_texts = [article.get_text(" | ", strip=True) for article in soup.find_all('article')]
+        website_content = " | ".join(article_texts)
+
     if not website_content:
         raise Exception("No text content found on website")
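
The fallback added in this hunk can be tried in isolation with a minimal sketch like the one below. The function name `extract_text`, the `max_chars` parameter, and the sample HTML are hypothetical and not part of chatbot.py. The sketch raises `ValueError` where the commit raises a plain `Exception`, and it leaves out the logging call.

```python
from bs4 import BeautifulSoup

# Character budget taken from the commit; it stands in for a real token limit.
MAX_CHARS = 1_000_000


def extract_text(raw_content: str, max_chars: int = MAX_CHARS) -> str:
    """Return the page text, keeping only <article> text if the page is too large."""
    soup = BeautifulSoup(raw_content, "html.parser")
    text = soup.get_text(" | ", strip=True)

    if len(text) > max_chars:
        # Too large: keep only the text found inside <article> elements.
        articles = [a.get_text(" | ", strip=True) for a in soup.find_all("article")]
        text = " | ".join(articles)

    if not text:
        # Also reached when an oversized page contains no <article> element.
        raise ValueError("No text content found on website")
    return text


html = "<html><body><nav>Home</nav><article><h1>Title</h1><p>Body</p></article></body></html>"
print(extract_text(html))                # Home | Title | Body
print(extract_text(html, max_chars=10))  # Title | Body
```

Two properties of this approach are worth keeping in mind: the article-only text is not checked against the limit again, so it may still exceed the budget, and an oversized page without any `<article>` element ends up in the "no text content" error path.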
