Capture function only returns one match #342

lev-tonkean · 2024-05-21T22:55:27Z

I'm trying to get all the img src URLs from an HTML body in one of the json fields:
{ "body" : "<div class=\"intercom-container\"><img src=\"https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34\"></div>. <div class=\"intercom-container\"><img src=\"https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34\"></div>" }

If I run the JSLT expression capture($node.body, "<img src=\"(?<url>https://[^\"]+)\">"), I get only the first img URL, not the second.

There should be a way to return all matches...


catull · 2024-05-22T10:00:18Z

You are assuming that the capture function works in a way that is not documented.

How is JSLT supposed to know that more than 1 URL appears in the text ?

Instead, it finds at most 1 occurrence.

Currently you have to restructure your body attribute as an array, with one HTML fragment per element:

{
    "body": [
        "<div class='intercom-container'><img src='https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34'></div>",
        "<div class='intercom-container'><img src='https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34'></div>"
    ]
}

The JSLT transformation then is

[
  for (.body)
     capture (., "<img src='(?<url>https://[^']+)'>")
]

resulting in:

[ {
  "url" : "https://downloads.intercomcdn.com/i/o/243069600/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34"
}, {
  "url" : "https://downloads.intercomcdn.com/i/o/24306960011/b5cf534d3975fafc7eafa9e7/IMG_2568.PNG?expires=1671464976&amp;signature=6dc0e026cb490829b7e333f8254dd50358356f9bf3369339ddb3aaf11d14ca34"
} ]

larsga · 2024-05-22T10:09:41Z

How is JSLT supposed to know that more than 1 URL appears in the text ?

I don't think this is the right way to view the issue. The issue being raised is:

There should be a way to return all matches...

And clearly there has to be a way to do that. We can't require people to structure the input in a way that fits JSLT. The language has to be designed to handle all JSON inputs.

It is possible to do this now by using capture(), then finding the url match in the string, then slicing the string to remove the matched part, and then using capture() again. A recursive function can do this for any number of matches. It's slow, however, and pretty cumbersome.
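
A minimal sketch of that recursive approach, under some assumptions: all-img-urls is a hypothetical helper name, not a JSLT built-in, and instead of slicing the string by hand it adds a second named group, rest, that captures everything after the first match. The (?s) flag is there so that rest also spans line breaks. It relies on capture() returning an empty object when nothing matches.

```jslt
// Hypothetical helper: collects every img src URL found in $text.
def all-img-urls(text)
  // "url" is the first match; "rest" is whatever follows it in the string.
  let m = capture($text, "(?s)<img src=\"(?<url>https://[^\"]+)\">(?<rest>.*)")
  if ($m.url)
    // Keep this URL, then search the remainder of the string.
    [$m.url] + all-img-urls($m.rest)
  else
    // No match: capture() returned an empty object, so stop recursing.
    []

all-img-urls(.body)
```

For the input in the original report this should yield an array of the two URL strings rather than an array of objects. Each step re-scans the remaining text, which is part of why this is slow on long inputs.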

One way to solve it would be to give capture() a third argument to tell it to return all matches. This would then be an array of dicts instead of just a dict. It's a bit ugly to have different return signatures, so one might add capture-all() as an alternative. I see the -all() variant makes no sense for the other two regexp functions, so we don't risk suddenly having to make 6 regexp functions.

catull · 2024-05-22T11:01:53Z

I see your points regarding capture().

Until we have a capture-all(), you have to use what's there.

Neither implementing a recursive function nor restructuring the input is elegant.

I don't know much about the original use case, but if the developer is capable of chunking some source HTML into a JSON object carrying a body attribute, it seems safe to assume that the same source HTML can be split into chunks of divs, for example with //div[@class='intercom-container'].

catull · 2024-05-22T11:24:08Z

Found another solution, without having to change the input:

[ for (split (.body, "</div>"))
   capture (., "<img src=\"(?<url>https://[^\"]+)\">")
]

lev-tonkean · 2024-05-23T19:11:37Z

Found another solution, without having to change the input:
[ for (split (.body, "</div>"))
   capture (., "<img src=\"(?<url>https://[^\"]+)\">")
]

This solution worked! Thanks.

lev-tonkean · 2024-05-23T19:11:55Z

I do think having a capture-all function makes a lot of sense.

catull · 2024-05-23T20:20:10Z

Try this one:

[ for (split (.body, "<img ")[1:])
  capture (., "^src=\"(?<url>[^\"']+)\"")
]

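It matches URLs of any scheme, not just https, as long as src is the first attribute of the img tag and its value is double-quoted.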

larsga added the enhancement label May 22, 2024

samer1977 mentioned this issue May 23, 2024

string replace should have options for both literal string replacement and regex replacement #346

Open
