OpenBSD support #106

Open
gbluma opened this issue Sep 29, 2017 · 4 comments

Comments

@gbluma
Contributor

gbluma commented Sep 29, 2017

Just a thing to link commits to.

@gbluma
Contributor Author

gbluma commented Oct 8, 2017

What I find interesting about the OpenBSD port is that there are a lot of extra run-time checks for memory safety. Maybe it can lead to pinning down these garbage collection bugs.

Take the following backtrace from a segfault on flx_pkgconfig:

#0  0x05bd5137 in thrkill () at {standard input}:5
#1  0x05b9558b in *_libc___stack_smash_handler (func=0x38aab014 "j__udyInsWalk", damaged=5) at /usr/src/lib/libc/sys/stack_protector.c:79
#2  0x18b46be4 in j__udyInsWalk () from /root/Projects/felix/build/release/host/bin/flx_pkgconfig
#3  0x18b3dc19 in JudyIns () from /root/Projects/felix/build/release/host/bin/flx_pkgconfig

BTW, this segfault is triggered in a version of felix that works on Windows, Linux, and OSX.

@skaller
Member

skaller commented Oct 8, 2017

You mean "appears to work", which is a different animal. Many small tests work because the GC is never triggered. You can force it to be triggered in two ways:

  1. Set the environment variable FLX_FINALISE=1 and make sure to use the correct English, not American, spelling. This forces the GC to run just before the process terminates. By default it doesn't because this speeds up termination.

  2. Set FLX_MIN_MEM=N to reduce the trigger point for the first GC to N megabytes. You can also set FLX_FREE_FACTOR=N.M, where N.M is a floating-point number telling Felix where to set the threshold for the next GC after a collection. A value of 1.1 sets it at 10% more than the used memory, so you get lots of GCs that way. Note that it's 10% above the memory in use after the collection, not 10% above the previous trigger.
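To make the FLX_FREE_FACTOR rule concrete, here is a minimal sketch of the arithmetic as described in this comment. The function name `next_gc_threshold` is made up for illustration and is not taken from the Felix runtime; the real collector may round or clamp differently.

```c
#include <stddef.h>

/* Hypothetical model of the rule described above, not Felix's actual code:
   the next GC trigger is the memory still in use after a collection,
   scaled by FLX_FREE_FACTOR. The previous trigger plays no part. */
size_t next_gc_threshold(size_t used_after_gc, double free_factor) {
    return (size_t)((double)used_after_gc * free_factor);
}
```

With a factor of 1.1 and roughly 40 MB still live after a collection, the next collection would fire at roughly 44 MB, regardless of where the previous trigger was.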

Now run the test suite or the build process and you get lots more crashes.
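For anyone who wants to drive this from a small launcher rather than the shell, the following is a rough sketch. Only the three variable names come from this comment; the helper names `set_gc_stress_env` and `exec_with_gc_stress` and the two-decimal formatting of the factor are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Export the GC-stress variables named above into this process's environment.
   Returns 0 on success, -1 if any setenv call fails. */
int set_gc_stress_env(unsigned min_mem_mb, double free_factor) {
    char min_mem[32], factor[32];
    snprintf(min_mem, sizeof min_mem, "%u", min_mem_mb);
    snprintf(factor, sizeof factor, "%.2f", free_factor);
    if (setenv("FLX_FINALISE", "1", 1) != 0) return -1;   /* English spelling */
    if (setenv("FLX_MIN_MEM", min_mem, 1) != 0) return -1;
    if (setenv("FLX_FREE_FACTOR", factor, 1) != 0) return -1;
    return 0;
}

/* Replace the current process with argv[0], which inherits the environment.
   Only returns (with -1) if setting the environment or the exec fails. */
int exec_with_gc_stress(char *const argv[], unsigned min_mem_mb, double free_factor) {
    if (set_gc_stress_env(min_mem_mb, free_factor) != 0) return -1;
    execvp(argv[0], argv);
    return -1;
}
```

A `main` that forwards `argv + 1` to `exec_with_gc_stress` would let you prefix any test or build command with the launcher.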

@gbluma
Contributor Author

gbluma commented Oct 8, 2017

Of course, but I'm not talking specifically about triggered GC cleanup events here.

I mean that, with both systems in the same configuration (i.e. not cleaning up garbage), one will run programs and the other will not. OpenBSD seems to have some special protection mechanisms that help diagnose memory misuse on insertion, which is where the bugs seem to be lurking.

I'm mentioning GC here because fixing these particular stack-smashing issues may help isolate why the GC is buggy later on.

@skaller
Member

skaller commented Oct 8, 2017

Yes, but the question is: is this a bug in Judy, or is Judy OK and something else is corrupting its indices? In the latter case, the bug may be detected when Judy functions run. One may ask, why only Judy functions? The answer may be that it could be any function. However, most Felix programs have a simple collection of linked heap objects, often almost none at all because the optimiser gets rid of them, whereas Judy is a digital trie with cache-line-sized objects, and lots and lots of them appear very fast.

One may also ask, why insertion? Because every allocation causes insertions. There is no other Judy action until the GC is run. At that time it does lookups, scans, and removal of keys.
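The pattern can be pictured with a toy address registry. This is only an illustration of "insert on allocation, look up and remove at GC time": it uses a flat array instead of a Judy trie, calls no Judy functions, and every name in it (`registry_t`, `tracked_alloc`, `registry_find`, `registry_remove`) is hypothetical. Resolving interior pointers in the lookup is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define REGISTRY_CAP 1024

/* Toy stand-in for the Judy array: maps block start address -> block size. */
typedef struct {
    uintptr_t addr[REGISTRY_CAP];
    size_t    size[REGISTRY_CAP];
    size_t    count;
} registry_t;

/* Allocation path: every successful allocation causes exactly one insertion. */
void *tracked_alloc(registry_t *r, size_t nbytes) {
    if (r->count == REGISTRY_CAP) return NULL;
    void *p = malloc(nbytes);
    if (p == NULL) return NULL;
    r->addr[r->count] = (uintptr_t)p;
    r->size[r->count] = nbytes;
    r->count++;
    return p;
}

/* GC path, lookup: find the tracked block that contains pointer p.
   Returns its index, or -1 if p does not point into any tracked block. */
long registry_find(const registry_t *r, const void *p) {
    uintptr_t a = (uintptr_t)p;
    for (size_t i = 0; i < r->count; i++)
        if (a >= r->addr[i] && a - r->addr[i] < r->size[i]) return (long)i;
    return -1;
}

/* GC path, removal: free the block and drop its key (swap with the last slot). */
void registry_remove(registry_t *r, long i) {
    free((void *)r->addr[i]);
    r->count--;
    r->addr[i] = r->addr[r->count];
    r->size[i] = r->size[r->count];
}
```

Until a collection runs, only `tracked_alloc` touches the registry, which is why corruption would first show up on the insertion path.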

From Judy's perspective, it isn't possible for two OSes to run the program in the same environment. Judy is managing machine addresses returned by malloc, so the shape of the Judy array tries is heavily sensitive to the actual values malloc returns. Those values also depend on the exact binary code being run, dynamic linkage, and all the other system facilities that also use memory.

The key here is that the crashes we get are highly sensitive BUT they're quite determinate for the same binary on the same OS (because the process image is identical each time). Actually even a tiny change like setting an environment variable may matter because the program is ultimately run under shell in the same process as the shell.

Actually we could check this by simply running a loop a random number of times that mallocs some random amount of memory, before doing anything else. The randomness has to be really random though (seeded by the date and time or something). If some runs go and some crash, that tells us something, but I don't know what or how it helps. Changing the GC parameters also changes the behaviour, but again, it's not clear if Judy just bugs out earlier, or something else bugs it out.
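A rough sketch of that experiment follows. The names `perturb_heap` and `wall_clock_seed`, the bounds, and the use of `rand` are all assumptions for illustration; nothing like this exists in the Felix runtime as far as this thread says. The blocks are leaked on purpose so that every later malloc lands at a shifted address.

```c
#include <stdlib.h>
#include <time.h>

/* Keeps the compiler from optimising the otherwise unused allocations away. */
static void *volatile perturb_sink;

/* Allocate between 1 and max_blocks blocks of 1..max_block_size bytes each
   and never free them. Returns the total number of bytes requested. */
size_t perturb_heap(unsigned seed, int max_blocks, size_t max_block_size) {
    srand(seed);
    int blocks = 1 + rand() % max_blocks;
    size_t total = 0;
    for (int i = 0; i < blocks; i++) {
        size_t n = 1 + (size_t)rand() % max_block_size;
        void *p = malloc(n);
        if (p == NULL) break;
        perturb_sink = p;   /* deliberately leaked */
        total += n;
    }
    return total;
}

/* Seed from the date and time, as suggested above. */
unsigned wall_clock_seed(void) {
    return (unsigned)time(NULL);
}
```

Calling `perturb_heap(wall_clock_seed(), 64, 4096)` first thing at startup and logging the seed would make a crashing run reproducible by replaying the same seed.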

The problem here is that it is not just the code that matters. Judy is driven by RTTI tables which are hand written for RTL objects and generated for the rest of the program by the compiler, and any error in any of them will screw up the GC. The RTTI is used on allocation to calculate how much store to allocate (n-objects x size). If it is too small an amount we get a corruption from ordinary Felix usage.
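To illustrate the failure mode, here is a simplified sketch. The `shape_t` record and the helper names are hypothetical and far smaller than Felix's real RTTI; the point is only that a wrong size in a hand-written entry under-allocates, and that such an entry could be cross-checked against `sizeof`.

```c
#include <stddef.h>

/* Hypothetical, simplified shape record: not Felix's actual RTTI layout. */
typedef struct {
    const char *type_name;
    size_t      object_size;   /* bytes per object */
} shape_t;

/* Bytes requested for n objects: n-objects x size. Returns 0 on overflow. */
size_t shape_alloc_bytes(const shape_t *s, size_t n) {
    if (s->object_size != 0 && n > (size_t)-1 / s->object_size) return 0;
    return n * s->object_size;
}

/* Cross-check a hand-written table entry against the compiler's sizeof. */
int shape_matches(const shape_t *s, size_t real_size) {
    return s->object_size == real_size;
}

typedef struct { void *next; long value; } node_t;

/* A correct entry, and a hand-written one that forgot the 'value' field. */
static const shape_t node_shape_ok  = { "node_t", sizeof(node_t) };
static const shape_t node_shape_bad = { "node_t", sizeof(void *) };
```

Allocating four nodes through `node_shape_bad` requests fewer bytes than four `node_t` objects need, so ordinary writes to the later fields run past the block without any bug in the code that does the writing.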

2 participants