## Data collection
- Add an option for launching the command to be monitored, instead of requiring
  the process to already exist. Might be useful for short-lived processes.
- Use eBPF for acquiring advanced kernel statistics:
- https://pypi.org/project/pyebpf/
- https://github.com/iovisor/bcc
- Explore collecting Unique Set Size (USS): probably has a significant
  performance impact (a quick way to measure that is sketched after this list),
  see
  - https://psutil.readthedocs.io/en/release-5.3.0/#psutil.Process.memory_info
  - http://grodola.blogspot.com/2016/02/psutil-4-real-process-memory-and-environ.html
  - https://github.com/astrofrog/psrecord/issues/48
- Disk write latency: use BPF for getting the *max* and, for comparison, the
  *mean*, for starters. The mean alone does not say much; more data about the
  actual distribution is needed.
  See https://github.com/cloudflare/ebpf_exporter/blob/master/examples/bio.yaml
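
A minimal sketch (not Goeffel code) for getting a feel for how much more
expensive the USS read is compared to the plain RSS read via psutil;
`memory_full_info()` parses `/proc/<pid>/smaps` and is known to be slow:

```python
# Compare the cost of a cheap RSS read with a USS read, using psutil.
import time
import psutil

proc = psutil.Process()  # measuring this Python process itself, for illustration

t0 = time.monotonic()
rss = proc.memory_info().rss
t1 = time.monotonic()
uss = proc.memory_full_info().uss  # walks /proc/<pid>/smaps, much slower
t2 = time.monotonic()

print(f"rss read: {(t1 - t0) * 1e6:.0f} us, uss read: {(t2 - t1) * 1e6:.0f} us")
```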
## Data emission
- Allow for emitting data into a metrics pipeline via the network (statsd, ...),
  roughly as sketched below.
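
Minimal sketch of what the statsd variant could look like: push individual
sample values as gauges via UDP using the statsd plaintext protocol. Host,
port, and the metric name below are assumptions, not existing Goeffel
configuration.

```python
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # conventional statsd UDP port
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit_gauge(name: str, value: float) -> None:
    # statsd plaintext format for a gauge: <metric name>:<value>|g
    _sock.sendto(f"{name}:{value}|g".encode("ascii"), STATSD_ADDR)

# Hypothetical metric name, for illustration only.
emit_gauge("goeffel.proc_cpu_util_percent", 12.5)
```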
## goeffel-analysis
- Control the log level with a command line flag.
- Add an argument to simpleplot (magic) to set the rolling window size (see the
  sketch below for what that window controls).
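
Illustration of what a configurable rolling window size would control in
goeffel-analysis: the number of samples used for smoothing a column before
plotting. The file name and column name below are placeholders, not Goeffel's
actual schema.

```python
import pandas as pd

df = pd.read_hdf("goeffel_timeseries.hdf5", key="goeffel_timeseries")
window = 20  # number of samples; this is what the new CLI argument would set
smoothed = df["proc_cpu_util_percent_total"].rolling(window=window, center=True).mean()
smoothed.plot()
```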
## Ideas for walk-through / tutorial / blog:
- Show how PID command picks up new PID
## Docs
- Add a note that this is alpha/beta software and that the CLI in particular
  may be subject to change.
## Misc notes
- Add metric(s) related to the network throughput of a process and/or TCP
  metrics in particular. This is more difficult on Linux than it should be.
  Also see the nethogs source, and
  - https://unix.stackexchange.com/q/224201
  - https://stackoverflow.com/a/1983000
  - http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html
- goeffel cmdline args: make the first argument a positional argument, and
  auto-detect whether it is an integer (PID) or otherwise a PID command
  (cleaner semantics)?
  - I am not quite sure about this (a detection sketch follows after this
    list). Compare these in terms of readability:

        goeffel 'pgrep stress'
        goeffel 1337
        goeffel --pid-command 'pgrep stress'
        goeffel --pid 1337
- Gracefully detect which data sources are accessible based on the privilege
  level, and then skip the corresponding metrics when not running as root? It
  would be a bad surprise for a user to find that certain data is missing after
  measuring for a long time.
- build a mode without process-specific monitoring (no pid, no pid command)?
- measure resource utilization of goeffel with another instance of goeffel.
- Change the HDF5 type of columns with unit 'percent' from Float32 to Float16?
- Architecture: if the sample writer crashes for whatever reason, we want to
  crash the entire program, right?
- Allow for installing `goeffel` without the dependencies required solely for
  `goeffel-analysis`.
- Non-interactive mode is the default because of the figure size issue
  (https://github.com/matplotlib/matplotlib/issues/7338). Write PNG by default;
  writing PDF is not really an option without thinning out the data.
- Add an option to not measure IP connections? This can be CPU-costly; measure
  and compare (see the timing sketch after this list). Does this matter?
- On a loaded system, profile goeffel and see where it spends its time. It pops
  up in top with 6 % CPU utilization on a really busy machine. A quick `perf top`
  on the sampling process shows that it spends most of its time in kernel space,
  which is not too bad:

        Overhead  Shared Object  Symbol
        14.90%    [kernel]       [k] vsnprintf
        14.87%    [kernel]       [k] established_get_first.isra.40
        11.22%    [kernel]       [k] format_decode
         6.13%    [kernel]       [k] number.isra.2
         3.98%    [kernel]       [k] __memcpy
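
Sketch of the auto-detection idea from the bullet on positional arguments above
(not Goeffel's current CLI): treat the positional argument as a PID if it
parses as an integer, otherwise as a PID command.

```python
import argparse

def pid_or_pid_command(value: str):
    # PID if the string is an integer, PID command otherwise
    try:
        return ("pid", int(value))
    except ValueError:
        return ("pid_command", value)

parser = argparse.ArgumentParser(prog="goeffel")
parser.add_argument("target", type=pid_or_pid_command,
                    help="a PID (integer) or a PID command (any other string)")

print(parser.parse_args(["1337"]).target)          # ('pid', 1337)
print(parser.parse_args(["pgrep stress"]).target)  # ('pid_command', 'pgrep stress')
```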
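
Rough way to quantify the cost of the IP connection enumeration mentioned in
the bullet above, assuming it is psutil-based; run it against a busy process to
get a ballpark number.

```python
import time
import psutil

proc = psutil.Process()  # replace with the monitored process' PID

t0 = time.monotonic()
conns = proc.connections(kind="inet")  # enumerate TCP/UDP connections
t1 = time.monotonic()

print(f"{len(conns)} connections, enumerated in {(t1 - t0) * 1e3:.2f} ms")
```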
## Precision of sampling interval
This shows the systematic error of sleeping for a constant time (instead of
using a deadline-based approach) pretty well:

    In [7]: df['unixtime'].diff().mean()
    Out[7]: 1.0908292996759117

Here, Goeffel was started using a sampling interval of 1.0 seconds.
With two conceptual optimizations this is now much better (a large number of
samples, 0.5 s sampling interval):
- mean: 0.5003 s
- max: 0.5014 s
- min: 0.4989 s
- std: 0.0004 s
By the way, this was computed with pandas like so:

    pd.read_hdf('./goeffel_timeseries__20190806_205617.hdf5', key="goeffel_timeseries")['unixtime'].diff().max()
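
To illustrate the difference between constant-time sleeping and a
deadline-based loop, here is a minimal sketch (not Goeffel's actual
implementation): the constant sleep adds the per-iteration work time to the
effective interval, while sleeping until the next deadline absorbs it.

```python
import time

INTERVAL = 0.5  # seconds


def do_sample_work():
    # placeholder for reading /proc, psutil calls, writing the sample, ...
    time.sleep(0.05)


def constant_sleep_loop(n):
    # effective interval is roughly INTERVAL + work time (systematic error)
    for _ in range(n):
        do_sample_work()
        time.sleep(INTERVAL)


def deadline_loop(n):
    # effective interval stays close to INTERVAL as long as the work fits in it
    deadline = time.monotonic()
    for _ in range(n):
        do_sample_work()
        deadline += INTERVAL
        time.sleep(max(0.0, deadline - time.monotonic()))
```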