Lushu (short for the Chinese 记录树, 录树), is a system that detects and reacts to user-defined string events in a never-ending stream of text, in real time. The idea is to have pugglable reactions (JVM functions) be triggered whenever the string event occurs. That reaction can be obfuscation of the string, counting occurrences, sending an alert email, etc. To know more about Lushu, read its companion paper or watch its video tutorial.
Check out this 3-minute Lushu Tutorial on YouTube.
Run gradle fatJar
to generate the file ./Lushu/build/libs/Lushu.jar
. Run it
following the example:
cat example/log/test/cpf-is-sensitive.log | \
java -jar ./Lushu/build/libs/Lushu.jar \
./example/config.yaml ./example/log/train/cpf-is-sensitive.log
You should see an output like the following:
Training Lushu Grammar with file './example/log/train/cpf-is-sensitive.log'
----------------------------------------
Training with log: The user <s>000.000.000-01</s> logged in on 2023-04-29 21:42:04.
Training with log: A new user <s>586.431.715-65</s> was created on 2023-04-30 12:48:53.
Training with log: The user <s>000.000.000-01</s> sent a message to user <s>417.231.715-86</s> on 2023-04-30 12:52:47.
Training with log: The product with ID RZbhCMwa was added to the cart by user <s>316.819.054-49</s> on 2023-04-30 12:53:36.
----------------------------------------
Finished training grammar
A new user ***** was created on 2023-04-30 13:16:51.
A payment of $1957800,00 was processed on 2023-04-30 13:16:51.
The user ***** downloaded video.mp4 on 2023-04-30 13:16:51.
...
Run gradle grammarJar
to generate the file
./Lushu/build/libs/Grammar.jar
. Run it following the example:
cat example/log/test/simple-ip.log | \
java -jar ./Lushu/build/libs/Grammar.jar ./example/config.yaml
You should see an ouput like the following:
R0 :: [023]{4,4}[-]{1,1}[04]{2,2}[-]{1,1}[29]{2,2} | R1
R1 :: [0]{2,2}[:]{1,1}[0]{2,2}[:]{1,1}[0]{2,2}[,]{1,1}[123456789]{3,3} | R2
R2 :: [RScdeimov]{4,8} | R3
R3 :: [ehoqrstu]{5,7} | R4
R4 :: [acfmoryz]{4,5} | R5
R5 :: [0123456789]{1,3}[.]{1,1}[0123456789]{1,3}[.]{1,1}[0123456789]{1,3}[.]{1,1}[0123456789]{1,3} | [glo]{3,3} | R6
R6 :: [ehr]{4,4} | R7
R7 :: [abl]{3,3} | R8
R8 :: [abl]{3,3}
Note that the first production of the grammar in rule R5
has the format of an
IP address. This is because the file example/log/test/simple-ip.log
we gave as
an input contains examples of IP addresses at that position.
Run gradle mergerJar
to generate the file ./Lushu/build/libs/Merger.jar
. Run
it following the example:
echo '8.8.8.8 0.0.0.0' | java -jar ./Lushu/build/libs/Merger.jar ./example/config.yaml
You should get the result:
[08]{1,1}[.]{1,1}[08]{1,1}[.]{1,1}[08]{1,1}[.]{1,1}[08]{1,1}
Notice that both IP addresses 8.8.8.8
and 0.0.0.0
were merged into a single
regular expression. Try different combinations, and different number of words!
Here are some more examples of words you can input:
- Date:
2023/03/26 2023/02/26 2023/12/11 1999/09/09
- Timestamp:
00:00:00 12:34:56 12:34:57
- Key in KV database:
key1#secondary key2#secondary
Also, try specifying different YAML configuration files. You may find it easier
to edit the example file in ./example/config.yaml
.
To test, run gradle test
. Find all source code for the tests under
./Lushu/src/test/
.
Lushu includes a novel way to merge regular expressions, based on a lattice we
call the Regex Lattice. The meet of two regexes in the Regex Lattice indicates
the result of their merge. A single word may be composed of multiple lattice
nodes. It all depends on how we structure the lattice. For instance, if we say
that punctuations are "blacklisted" by "alpha" characters, then their meet will
go to the lattice top. This can be configured by the following config.yaml
file:
latticeBase:
alpha:
interval: 1,32
charset: "abcdefghijklmnopqrstuvwxyz"
punct:
interval: 1,2
charset: "\"!#\\$%&'()*+,-./:;<>=?@\\[\\]^_`{}|~\\\\"
blacklist:
- alpha
Arbitrary text is not in the format we require, originally. So the first thing
we do with text is divide it into words separated by space. We call these words
tokens. Each token might be composed of multiple lattice nodes. For instance,
suppose we have two tokens, ab:c
and de:fg
. They are first transformed to
primitive lattice nodes:
[a]{1,1}[b]{1,1}[:]{1,1}[c]{1,1}
[d]{1,1}[e]{1,1}[:]{1,1}[f]{1,1}[g]{1,1}
These are called primitive because the charset for each node is a single character, and the interval is (1,1). Then, we reduce these primitive nodes into a more compact format. We collapse as much as possible, using the lattice meet to check if the GLB is the Top node. If it is the top node, we do not merge the nodes. For our example:
reduce([a]{1,1}[b]{1,1}[:]{1,1}[c]{1,1}) ==>
[ab]{2,2}[:]{1,1}[c]{1,1}
reduce([d]{1,1}[e]{1,1}[:]{1,1}[f]{1,1}[g]{1,1}) ==>
[de]{2,2}[:]{1,1}[fg]{2,2}
Finally, two turn these two regular expressions into one, we perform a zip and
then a map operation (in the functional sense). The zip operation checks
that the lists must have the same size and forms pairs like ([ab]{2,2}, [de]{2,2})
. For each pair, we map their elements to their lattice meet. In
a pseudo-functional syntax:
map(zip(nodes1, nodes2), (first, second) => {
lattice.meet(first, second)
})
If the lattice goes to top, the words are not mergeable. Otherwise, we merge them. The result for our example would be:
merge(ab:c, de:fg) =
map(zip(reduce(ab:c), reduce(de:fg)), (first, second) -> {
lattice.meet(first, second).then { it ->
when(it) {
is Top: not mergeable
else: it
}
})
==> merge(ab:c, de:fg) = [abde]{2,2}[:]{1,1}[cfg]{1,2}