Capture function only returns one match #342

Open
lev-tonkean opened this issue May 21, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@lev-tonkean

I'm trying to get all the img src URLs from an HTML body stored in one of the JSON fields:
{ "body" : "<div class=\"intercom-container\"><img src=\"https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34\"></div>. <div class=\"intercom-container\"><img src=\"https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34\"></div>" }

If I run the JSLT expression capture($node.body, "<img src=\"(?<url>https://[^\"]+)\">"), I get only the first img URL, not the second.

There should be a way to return all matches...

@catull

catull commented May 22, 2024

You are assuming that the capture function works in a way that is not documented.

How is JSLT supposed to know that more than 1 URL appears in the text?

As it stands, capture finds at most one occurrence.

Currently, you have to restructure the body attribute as an array of strings:

{
    "body": [
        "<div class='intercom-container'><img src='https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34'></div>",
        "<div class='intercom-container'><img src='https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34'></div>"
    ]
}

The JSLT transformation is then:

[
  for (.body)
     capture (., "<img src='(?<url>https://[^']+)'>")
]

resulting in:

[ {
  "url" : "https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34"
}, {
  "url" : "https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34"
} ]

@larsga
Collaborator

larsga commented May 22, 2024

> How is JSLT supposed to know that more than 1 URL appears in the text?

I don't think this is the right way to view the issue. The issue being raised is:

> There should be a way to return all matches...

And clearly there has to be a way to do that. We can't require people to structure the input in a way that fits JSLT. The language has to be designed to handle all JSON inputs.

It is possible to do this now by using capture(), then finding the url match in the string, then slicing the string to remove the matched part, and then using capture() again. A recursive function can do this for any number of matches. It's slow, however, and pretty cumbersome.
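The capture-slice-repeat strategy described above can be sketched as follows. This is an illustration in Python, not JSLT: the helper names `capture` and `capture_recursive` are hypothetical, and Python spells named groups `(?P<name>...)` where the Java regex syntax used by JSLT spells them `(?<name>...)`.

```python
import re

def capture(text, pattern):
    """Mimic a single capture() call: named groups of the first match, or {}."""
    match = re.search(pattern, text)
    return match.groupdict() if match else {}

def capture_recursive(text, pattern, key):
    """Capture once, slice off everything up to the captured value, recurse."""
    found = capture(text, pattern)
    value = found.get(key)
    if not value:  # no match, or an empty capture that would never advance
        return []
    # Locate the captured value in the string and cut just past it.
    cut = text.find(value) + len(value)
    return [found] + capture_recursive(text[cut:], pattern, key)

body = ('<div><img src="https://example.com/a.png"></div>. '
        '<div><img src="https://example.com/b.png"></div>')
urls = capture_recursive(body, r'<img src="(?P<url>https://[^"]+)">', "url")
print(urls)
```

Each step rescans and copies the remainder of the string, which is why this approach is slow on long inputs. It also assumes the captured value does not occur earlier in the text outside a match, since the slice position is found by searching for the value itself.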

One way to solve it would be to give capture() a third argument to tell it to return all matches. This would then be an array of dicts instead of just a dict. It's a bit ugly to have different return signatures, so one might add capture-all() as an alternative. I see the -all() variant makes no sense for the other two regexp functions, so we don't risk suddenly having to make 6 regexp functions.
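A rough picture of what such a function could return, sketched in Python rather than JSLT. The name `capture_all` is hypothetical and only mirrors the proposal; nothing here is an existing JSLT function. Python writes named groups as `(?P<name>...)` instead of `(?<name>...)`.

```python
import re

def capture_all(text, pattern):
    """Return one dict of named groups per match, in order of appearance."""
    return [match.groupdict() for match in re.finditer(pattern, text)]

body = ('<div><img src="https://example.com/a.png"></div>. '
        '<div><img src="https://example.com/b.png"></div>')
matches = capture_all(body, r'<img src="(?P<url>https://[^"]+)">')
print(matches)
```

With this shape the return type is always an array, so an input with no matches yields an empty array instead of an empty dict.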

larsga added the enhancement label on May 22, 2024
@catull

catull commented May 22, 2024

I see your points regarding capture().

Until we have a capture-all(), you have to use what's there.

Neither implementing a recursive function nor restructuring the input is elegant.

I don't know much about the original use case, but if the developer can wrap the source HTML in a JSON object with a body attribute, it is safe to assume that the same HTML can be split into one chunk per div, for example by selecting //div[@class='intercom-container'].

@catull

catull commented May 22, 2024

Found another solution, without having to change the input:

[ for (split (.body, "</div>"))
   capture (., "<img src=\"(?<url>https://[^\"]+)\">")
]

@lev-tonkean
Author

> Found another solution, without having to change the input:
>
> [ for (split (.body, "</div>"))
>    capture (., "<img src=\"(?<url>https://[^\"]+)\">")
> ]

This solution worked! Thanks.

@lev-tonkean
Author

I do think having a capture-all function makes a lot of sense.

@catull

catull commented May 23, 2024

Try this one:

[ for (split (.body, "<img ")[1:])
  capture (., "^src=\"(?<url>[^\"']+)\"")
]

It supports all kinds of URLs, not just https ones, as long as src is the first attribute of the img tag and is double-quoted.
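To see what the split-then-capture expression does step by step, here is a Python emulation. It is a sketch, not JSLT: the function name `img_srcs` is hypothetical, the named group uses Python's `(?P<url>...)` spelling, and it assumes that a capture with no match yields an empty object.

```python
import re

# Same pattern as in the JSLT snippet, with Python's named-group spelling.
SRC_AT_START = re.compile(r'^src="(?P<url>[^"\']+)"')

def img_srcs(body):
    # Split on the tag opener and drop the text before the first <img,
    # like split(.body, "<img ")[1:] in the JSLT version.
    chunks = body.split("<img ")[1:]
    results = []
    for chunk in chunks:
        match = SRC_AT_START.search(chunk)
        # Assumption: no match is represented as an empty dict.
        results.append(match.groupdict() if match else {})
    return results

body = ('<p>intro</p><img src="http://example.com/a.png">'
        '<img alt="x" src="https://example.com/b.png">')
print(img_srcs(body))
```

The second tag in the sample input produces an empty result because its src is not the first attribute, so the `^src=` anchor does not match that chunk.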
