How to achieve the effect of BeautifulSoup get_text? #443
Comments
Hello, I'm not familiar with BeautifulSoup; what does this achieve? It seems like it would be something like https://pkg.go.dev/github.com/PuerkitoBio/goquery#Selection.Text, but with some text handling applied (space normalization or something)?

Martin
@mna Thank you for your reply. I want to do data extraction, but with goquery I cannot get a result similar to BeautifulSoup's. The comparison between the two is below. This is the goquery output, which contains a lot of whitespace and JS code. This is the BeautifulSoup output, which is simple and clean.
It's hard to tell from those screenshots, but it looks like (and the function documentation seems to confirm this) it optionally trims each text node and concatenates them using the provided separator, and it ignores comments and some other nodes ("processing instructions", not sure what that means in this context). Based on your screenshots, doing this would indeed get you something similar.

This is not supported in goquery out of the box, but it should be doable relatively easily using […]

I wouldn't be opposed to adding a top-level function (i.e. not a […]

But yeah, to answer your initial question, there's nothing equivalent, but it should be possible using the methods I linked above.

Hope this helps,
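The behaviour described in this reply (optionally trim each text node, join the pieces with a separator, ignore comments) could look roughly like the Go sketch below. It is only an illustration under stated assumptions, not goquery API: `Node`, `NodeType`, `GetText`, and the `skipped` set are hypothetical names, and `Node` is a stdlib-only stand-in that mirrors the `Type`, `Data`, `FirstChild`, and `NextSibling` fields of `golang.org/x/net/html.Node`. Skipping `script` and `style` content is also an assumption, motivated by the JS code reported in the goquery output.

```go
package main

import "strings"

// NodeType and Node are hypothetical stand-ins that mirror the fields of
// golang.org/x/net/html.Node used by the walk below.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
	CommentNode
)

type Node struct {
	Type        NodeType
	Data        string // text for TextNode, tag name for ElementNode
	FirstChild  *Node
	NextSibling *Node
}

// skipped lists elements whose text is assumed not to be meant for readers.
var skipped = map[string]bool{"script": true, "style": true}

// GetText walks the tree depth-first, collects text nodes, and joins them
// with sep. When strip is true, each piece is trimmed and empty pieces are
// dropped.
func GetText(root *Node, sep string, strip bool) string {
	var parts []string
	var walk func(n *Node)
	walk = func(n *Node) {
		switch n.Type {
		case TextNode:
			text := n.Data
			if strip {
				text = strings.TrimSpace(text)
				if text == "" {
					return
				}
			}
			parts = append(parts, text)
			return
		case CommentNode:
			return // comments never contribute text
		case ElementNode:
			if skipped[n.Data] {
				return // ignore the whole subtree
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(root)
	return strings.Join(parts, sep)
}
```

With a real document, the same walk should carry over to each `*html.Node` in a selection's `Nodes` slice, swapping the stand-in constants for `html.TextNode`, `html.ElementNode`, and `html.CommentNode`.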
@mna Thank you for your reply. This is the code I wrote; could you help me see why many nested nodes, such as div and form, do not have their child nodes parsed out?