Address failing google sources tests
Two Google sources failed to return the expected output. I looked into
why parsing failed in each case:

- lyrics on musica.com contain Google Ads inside <aside> tags
- each lyrics line on lacoccinelle.net is wrapped in alternating
  <em> and <strong> tags

Thus strip the <aside> blocks and the <em> / <strong> tags as part of
the HTML cleanup logic.
snejus committed Oct 30, 2024
1 parent fbca24c commit 1ad2619
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions beetsplug/lyrics.py
@@ -536,6 +536,8 @@ def _scrape_strip_cruft(html, plain_text_out=False):
     html = BREAK_RE.sub("\n", html)  # <br> eats up surrounding '\n'.
     html = re.sub(r"(?s)<(script).*?</\1>", "", html)  # Strip script tags.
     html = re.sub("\u2005", " ", html)  # replace unicode with regular space
+    html = re.sub("<aside .+?</aside>", "", html)  # remove Google Ads tags
+    html = re.sub(r"</?(em|strong)[^>]*>", "", html)  # remove italics / bold
 
     if plain_text_out:  # Strip remaining HTML tags
         html = COMMENT_RE.sub("", html)
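
To illustrate what the two added substitutions do, here is a minimal sketch that applies them outside of beets. The helper name `strip_ads_and_emphasis` and the sample markup are hypothetical and made up for illustration. In the commit itself, the two `re.sub` calls sit inline in `_scrape_strip_cruft`.

```python
import re


def strip_ads_and_emphasis(html: str) -> str:
    """Hypothetical helper wrapping the two substitutions added in this commit."""
    # Remove whole <aside ...>...</aside> blocks (the pattern requires a space
    # after "aside", i.e. a tag that carries at least one attribute).
    html = re.sub("<aside .+?</aside>", "", html)
    # Remove opening and closing <em> / <strong> tags but keep their text.
    html = re.sub(r"</?(em|strong)[^>]*>", "", html)
    return html


# Made-up sample input, not real markup from musica.com or lacoccinelle.net.
sample = (
    "First line\n"
    '<aside class="ad">Google Ad</aside>\n'
    "<em>Second line</em>\n"
    "<strong>Third line</strong>"
)

cleaned = strip_ads_and_emphasis(sample)
print(cleaned)
```

Because the `<aside` pattern is used without `re.DOTALL` (or an inline `(?s)`), `.` does not match newlines, so an `<aside>` block that spans several lines would be left in place by this sketch. Whether that case occurs in the real pages is not stated in the commit.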
