[#220] avoid resource leaking in scanners #236
base: master
Conversation
Problem: Currently we use lazy `readFile`, and xrefcheck opens all the files immediately for reading, while file descriptors are a finite resource. Solution: Replaced lazy `readFile` with the strict version.
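For illustration, here is a minimal sketch of the difference (the helper name is hypothetical, not xrefcheck's actual code): a strict read pulls the whole file into memory and closes the descriptor before returning, whereas lazy `readFile` keeps the handle open until the contents are fully forced, so scanning many files at once can exhaust the descriptor limit.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

-- Hypothetical helper: read a file strictly, so the file descriptor is
-- closed as soon as the call returns, rather than staying open until
-- the lazily-read contents happen to be consumed.
readFileStrict :: FilePath -> IO T.Text
readFileStrict path = T.decodeUtf8 <$> BS.readFile path
```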
When I tried to parallelize file parsing, I ran into some strange issues. I ran the program on a repository with ~100 md files, with 8 cores available. Here is the timing info:
And here is the threadscope profile: we see that CPU activity is not very high, although there is work to do. Something prevents the threads from working normally... This is what can be found in the eventlog for that time: there are a lot of foreign calls, although it seems there is nothing in my code that could result in such behaviour. I also tried to run our program under a system call tracer. Googling this returns one Stackoverflow entry, but I didn't understand much there 😄
Woow, elaborate investigation 👍 However, what do you think exactly is wrong? I'll lay out my thoughts defending the idea that everything seems fine, and you show where I'm wrong.

In the threadscope report, to me it looks like what I would expect: low CPU consumption during the scanning stage because it is computation interleaved with I/O (and I/O is not shown on the chart). It may be reasonable that I/O takes more time than parsing, since the former is reading from the disc, and the latter is reading from memory plus some parsing logic written in C.

On what you see in the eventlog: by default, foreign calls cannot be interrupted; at that moment the runtime system gives up on trying to manage anything until the foreign call is finished. Maybe that's why we see those messages, but I think they are fine - they may just mean that control is passed to the C logic while everything is fine with the threads (I'm not sure here though, it may be worth googling the exact meaning of that eventlog message).

If my understanding is correct, most of the witnessed things will still be there even if we try parsing with no parallelization, as before. If you had some intuition that told you "something is wrong here" or "things are too slow" that pushed you onto this investigation, could you share it too?
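For context on the foreign-call remark, here is a small illustration (the C symbol and the names are made up; this is not cmark-gfm's actual binding): the safety annotation on a `foreign import` determines whether the RTS can keep running other Haskell threads on that capability while the C function executes.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module ForeignCallDemo where

import Foreign.C.String (CString)
import Foreign.C.Types (CSize (..))
import Foreign.Ptr (Ptr)

-- An "unsafe" call has very low overhead, but the calling capability is
-- blocked until the C function returns (no GC, no thread switch there).
foreign import ccall unsafe "demo_parse"
  cParseUnsafe :: CString -> CSize -> IO (Ptr ())

-- A "safe" call is more expensive, but the RTS releases the capability,
-- so other Haskell threads can keep running during the call.
foreign import ccall safe "demo_parse"
  cParseSafe :: CString -> CSize -> IO (Ptr ())
```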
Well, I also tried writing the code such that all of the files were read at once, and then only pure computations were performed in parallel, but the results were the same. I can show the same graphs for this case later.
In my mind, when we have a big list of pure tasks, they can be parallelized very effectively, using ~100% load on all available CPUs. But here, although there are several CPUs, total activity is on average not higher than a single CPU at 100% load (I'm looking at the dashed line on the activity graph), and this is not really satisfying. When I tried to read all the files first and then parse them in parallel, reading was performed by one thread at 100% CPU load, but total activity was lower during the subsequent parsing stage, even with 8 CPUs. You can even see how activity grows after 0.47 seconds (the start of repo verifying).

Also, I don't like the long runs of errors in the system calls log; it looks similar to busy waiting, something that good programs usually (in my mind) don't want to do. But the GHC runtime is written by highly skilled people, so that may indicate that we are using their creation in a wrong way. So I believe that there's a problem we could solve, although I don't quite understand what it actually is.

Well, I also forgot that our raw md parsing is performed by the C library... Maybe that can influence this, too.
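For reference, a minimal sketch (names are illustrative, not the PR's actual code) of the "read everything first, then parse purely in parallel" shape using the parallel package; forcing each result to normal form inside the spark keeps the parsing work from being deferred and later forced on a single thread.

```haskell
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Hypothetical: parse already-read file contents in parallel, forcing
-- each parsed result to normal form so the work actually happens inside
-- the sparks rather than later, when the results are consumed.
parseAllParallel :: NFData b => (a -> b) -> [a] -> [b]
parseAllParallel parse contents =
  map parse contents `using` parListChunk 16 rdeepseq
```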
Yeah, that's true. Also not sure why it is done this way.
Oh, that's a very good setup for testing; then we have to investigate what happens there. I'll try to think about it; if you have more info to share, it is welcome.
I'm not sure if this will reveal anything, but could you try the following in order to check whether the parallel performance looks more like what you expected? Derive `NFData` for the types involved, and replace

```haskell
return . ScanResult errs $ RepoInfo
  { riFiles = M.fromList $ processedFiles <> notProcessedFiles
  , riDirectories = M.fromList
      $ map (, TrackedDirectory) trackedDirs
      <> map (, UntrackedDirectory) untrackedDirs
  , riRoot
  }
```

by

```haskell
let result = ScanResult errs $ RepoInfo
      { riFiles = M.fromList $ processedFiles <> notProcessedFiles
      , riDirectories = M.fromList
          $ map (, TrackedDirectory) trackedDirs
          <> map (, UntrackedDirectory) untrackedDirs
      , riRoot
      }
evaluateNF result
```
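The idea behind this suggestion: `evaluateNF` forces the whole `ScanResult` to normal form at this point, so any parsing work that laziness would otherwise defer (and later force sequentially, when the result is consumed) has to happen inside the scanning stage instead. It also requires `NFData` instances for the types involved, as in the diff below.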
```diff
@@ -118,7 +119,8 @@ data ScanErrorDescription
   | FileErr
   | ParagraphErr Text
   | UnrecognisedErr Text
-  deriving stock (Show, Eq)
+  deriving stock (Show, Eq, Generic)
+  deriving anyclass (NFData)
```
You can derive `NFData` as in the `Xrefcheck.Core` module if you prefer to follow that style:

```haskell
instance NFData Position
instance NFData Reference
instance NFData AnchorType
instance NFData Anchor
instance NFData FileInfo
instance NFData RepoInfo
```
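Both styles amount to the same thing, since the standalone empty instance also relies on the `Generic`-based default for `rnf`; a small sketch with hypothetical types:

```haskell
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE DerivingStrategies #-}

import Control.DeepSeq (NFData)
import GHC.Generics (Generic)

-- Style from the diff above: derive NFData via DeriveAnyClass.
data Foo = Foo | FooErr
  deriving stock (Show, Eq, Generic)
  deriving anyclass (NFData)

-- Style from Xrefcheck.Core: an empty standalone instance, which picks
-- up the Generic-based default implementation of rnf.
data Bar = Bar | BarErr
  deriving stock (Show, Eq, Generic)

instance NFData Bar
```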
Interesting. I guess foreign calls should be parallelizable... So from the recent picture we can tell that, if you don't use sparks for that stage... I'd really like to see the threadscope report from this case:
Although the 0.07 seconds that this unparallelized call to cmark-gfm currently takes is really not that much, I agree we could stop optimizing here if we get stuck on it.
I'm sorry for confusing you, but the highlighted period is not the full period of CMark parsing, just an example.
Yep, later I'll show more accurate examples for different cases.
Oh I see, okay. Then we should try harder, I suppose 😸 That's just a wild guess, but as one of the measures I would try replacing the sparks with something else. Mm, probably to have this organized, I'd also suggest preserving all your attempts in separate branches.
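The exact alternative to sparks was not preserved in this thread, but as an assumption, one common option is explicit concurrency via the async package, roughly like this (names are hypothetical):

```haskell
import Control.Concurrent.Async (mapConcurrently)

-- Hypothetical sketch: parse each file in its own IO thread instead of
-- relying on sparks; the RTS schedules the threads across capabilities.
parseFilesConcurrently :: (FilePath -> IO a) -> [FilePath] -> IO [a]
parseFilesConcurrently parseOne = mapConcurrently parseOne
```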
On the current state of this PR:
@YuriRomanowski I'm not sure whether you will manage to work on this PR during the remaining time, but that's fine, we can transfer this PR to someone else. In case you manage to allocate time for this, I suggest asking in the Haskell channel in Slack; maybe someone will manage to explain what we observe. If not, let's conclude that we did quite a lot to make it performant and stop at what we have gained.
Description
[#220] Avoid resource leaking in scanners
Problem: Currently we use lazy `readFile`, and xrefcheck opens all the files immediately for reading, while file descriptors are a finite resource.

Solution: Replaced lazy `readFile` with the strict version.

To be done: Understand issues with parallelism
Related issue(s)
Fixes #220
✅ Checklist for your Pull Request
Ideally a PR has all of the checkmarks set. If something in this list is irrelevant to your PR, you should still set this checkmark, indicating that you are sure it is dealt with (be that by irrelevance).
Related changes (conditional)

- Tests
  - If I fix a bug, I added a regression test to prevent the bug from silently reappearing again.
- Documentation
- Public contracts
  - Any modification of public contracts complies with the Evolution of Public Contracts policy.

Stylistic guide (mandatory)
✓ Release Checklist

- I updated the version number in `package.yaml`.
- I moved everything under the "Unreleased" section to a new section for this release version.
- I created a tag `vX.Y.Z`.
- I updated `xrefcheck-action`.