Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schedstats instrument #123

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bjackman
Copy link
Contributor

@bjackman bjackman commented May 11, 2017

Not sure how useful this will actually be hence RFC marker and not too much care or documentation. This basically jimmies /proc/schedstat into the Instrument API. What do you think?

Low priority at the moment, don't spend ages reviewing if you're busy :)

msg = 'Unexpected channel "{}"; must be in {}'
raise ValueError(msg.format(chan_name, self.channels.keys()))
else:
for chan in self.channels.values():
Copy link
Collaborator

@setrofim setrofim May 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the else clause, channels is None so this will raise a NoAttributeError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

channels is None but not self.channels

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my bad.

``active_channels`` attribute of the ``Instrument``.

If ``channels`` is provided, it is a list of names of channels to enable and
``sites`` and ``kinds`` are ignored.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the fact that user-provided parameters may be silently ignored. Either, they should be combined with the channels or an error should be raised.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll make it raise an error if you provide both channels and sites or kinds

# params. Does it matter?
measurement_type = MeasurementType(
measurement_name, '', measurement_category)
self.add_channel(site=site,
Copy link
Collaborator

@setrofim setrofim May 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this should be done int the __init__ -- setup() is not guaranteed to be invoked on every execution. (Set up is intended for persistent changes, such as deploying assets to the target.

measure=measurement_type)

def teardown(self):
self.target.write_value(self.sysctl_path, self.old_sysctl_value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably happen in reset() instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, ignore the above. Doing this in teardown is correct.

based on identifiers from the scheduler code.
"""
sysctl_path = '/proc/sys/kernel/sched_schedstats'
schedstat_path = '/proc/schedstat'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should set mode to INSTANTANEOUS

measures = DOMAIN_MEASURES
else:
raise TargetError(
'Unrecognised schedstats line: "{}"'.format(line))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fails in all our Android systems, since we have additional stats reporting EAS codepaths.
In Android kernel we will have stats like:

cpu0 9703 0 227370 45501 209763 87404 75510237990 76927923766 168389
eas 50510 0 0 29148 0 21362 51469 11340 3022 43 14180 169 22712 40129 3 0 3023 28011 15055 11640
domain0 0f 5237 5089 86 95279 66 0 0 5089 2021 1945 62 74554 19 1 3 1942 43224 36754 5499 4393208 971 0 7128 29626 1 0 1 0 0 0 0 0 0 39102 1466 0
eas 0 0 0 9389 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13480 0
domain1 ff 3981 3733 220 673383 33 2 5 1149 1076 1069 6 36930 2 0 3 176 42131 36952 4441 18425660 738 3 261 36691 1 0 1 0 0 0 0 0 0 83215 794 0
eas 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15055 0

Where basically, after each cpu and domain we do have an eas entry with EAS specific counters.

I was wondering if we can end up with a more "generic" parser which allows to accept custom defined lines.

Maybe we can use a "map" to define what are the columns for each specific "channel", the map can be defined by default using the hardcoded values you have now... but we can allow the user to specify a custom map.
In that case, the custom map should allow to parse additional fields (e.g. eas stats) as well as open the door to parse stats versions not supported by default.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not design & document an API for that given that these EAS stats are probably the only ever user (also TBH we have broken schedstats with those additions, we should really change the version string to "15.eas0" or something IMO). How about if:

  1. Unrecognised lines are ignored with a warning
  2. I rewrite this so we can inherit from it and write a subclass that knows about the EAS stats
  3. Then, if we decide that the EAS stats format is "stable", or come up with a way to fixup the version string so we can detect changes, we just add support for the EAS stats directly here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrite this so we can inherit from it and write a subclass that knows about the EAS stats
+1 for this - if we handle each line in a class member which can access curr_cpu and add curr_domain, then 'eas' lines can be parsed easily in a subclassed instrument

Copy link
Contributor Author

@bjackman bjackman May 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started doing part 2. and realised that it's also pretty unpalatable; the parsing state (curr_cpu and curr_domain) makes the code quite arcane and it still isn't generic. If we want to keep fiddling with schedstats we're going to have to fiddle with this code again anyway so I propose just skipping part 2. and only doing parts 1. and 3.

@bjackman bjackman force-pushed the schedstats branch 5 times, most recently from f7c2a09 to 2e4943f Compare May 23, 2017 16:20
@bjackman
Copy link
Contributor Author

Latest version

  • Doesn't fail with the "eas" schedstats data but does spam WARNINGs. Could always just silently ignore "eas" data without a warning, until we add the ability to parse it.
  • Doesn't use the sysctl unless it's necessary, as the sysctl is v4.6+

@setrofim
Copy link
Collaborator

@derkling @bjackman sorry for the delay in following up on this. Is this still needed? Is it ready for merge? If so, please could you rebase onto tip of master.

@bjackman
Copy link
Contributor Author

I think this could be really useful for use with WA3 but I would say it's low priority. It's also just occurred to me that I should think about how to get this to report diffs, i.e. you'd like to take a sample at the beginning and a sample at the end, and have the metric report the diff in schedstats during workload execution (maybe WA already does that?). Also would like some input from @derkling on the EAS stats thing. Are you OK with having this PR sit around a little bit longer?

@setrofim
Copy link
Collaborator

maybe WA already does that?

WA has a generic "sysfs_extractor", which takes a list of sysfs locations, which it then pulls before and after each iteration, and then performs a generic diff; but it has no awareness of what it's diff'ing, so it doesn't generate any metrics (you just get a directory structure with the diffs).

Are you OK with having this PR sit around a little bit longer?

Yeah, no problem. There is no rush. It's just that there hasn't been any activity on this for a while, so I just wanted to make sure this is still relevant.

@bjackman bjackman force-pushed the schedstats branch 2 times, most recently from 0821c1f to 3675dee Compare October 11, 2017 13:56
@bjackman
Copy link
Contributor Author

I've just refreshed this and plugged it into WA3 as an experiment.. definitely interesting but the only problem is it produces literally hundreds of metrics for each workload!

@setrofim
Copy link
Collaborator

I've just refreshed this and plugged it into WA3 as an experiment.. definitely interesting but the only problem is it produces literally hundreds of metrics for each workload!

That's not necessarily an issue, if those metrics are actually individually useful. An alternative though, if they're more of a "state dump" for debug/deep-dive analysis, is to write them out to a file that is than added as an Artifact, rather than individually adding them as metrics in WA (or adding only a few key/summary ones as metrics).

@bjackman
Copy link
Contributor Author

Hmm yeah.. my dream for this Instrument is that you can immediately see that your kernel change e.g. "increased the average number of failed load balances" for a certain workload. All the added stats are potentially individually useful, but the particular one that's individually useful is probably going to change in every situation.

On the other hand I've just realised that's only going to happen if you explicitly ask for schedstats data, so maybe it's fine.

There are also a few potential aggregates I can think of (aggregate across CPU/sched_domain/idle type/any combination of those 3) but none of them makes any more sense than the others, so probably best considered a future feature. Subject to a bit more testing this is probably ready for merge.

@setrofim
Copy link
Collaborator

That sounds good to me. There is nothing in WA which demands or even encourages reducing the number of metrics. We specifically use __slots__ for metrics to minimize the impact of having vast numbers of these objects. And, as you've say, this only matters if the user enables the instrument.

Ultimately, it comes down to what's easier to work with. As you'll likely be, at least initially, the only/the primary user of this instrument, this is for you to decide.

I should also point out that there is also the option of tagging the metrics with some kind of classifier to make it easier to filter/split them from the rest of the results.

@bjackman
Copy link
Contributor Author

OK sweet. Last time I tested this I saw some suspicious values, once I've investigated that, let's go ahead (this is still low priority though so might be a while).

@bjackman bjackman force-pushed the schedstats branch 2 times, most recently from 5db1f83 to 0f45fbc Compare December 1, 2017 16:03
@bjackman bjackman changed the title [RFC] Add schedstats instrument Add schedstats instrument Dec 1, 2017
@bjackman
Copy link
Contributor Author

bjackman commented Dec 1, 2017

OK I am now using this instrument in anger and getting sane results.


- If :method:`reset` is called with ``kinds=['alb_pushed']`` then the count
of migrations successfully triggered by active_load_balance will be
colelcted for each sched domain, with a channel for each domain.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/colelcted/collected/

match = re.search(r'version ([0-9]+)', lines[0])
if not match or match.group(1) != '15':
raise TargetError(
'Unsupported schedstat version string: "{}"'.format(lines[0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth to report which versions we support?


# On 4.6+ kernels, schedstats needs to be enabled via kernel cmdline or
# sysctl. If not, we'll just get a load of zeroes.
self.old_sysctl_value = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for code readability, maybe by using some "returns" on success/failure, can we move the following checks (lines 128-145) into a dedicated function, something like _check_availability() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, seems like a good idea, will try it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I split out this particular chunk (that enables the runtime config) into a separate function but kept the bit that checks for the presence and version of schedstat in the kernel, since the lines variable is re-used later to add the channels (to avoid more round-trips to the target slowing things down). If you like I can split it up further and just drop the "optimisation".

# create.
# We'll create a site for each CPU and a site for each sched_domain.
for site, measures in self._get_sample().iteritems():
if site.startswith('cpu'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not more generic to parse the first column (up to excluding the ID) and create a category with that name?
Something like that should do the job:

site_re = regexp.compile("^(?P<category>[a-z]+)(?P<id>[0-9]+)")

Copy link
Contributor Author

@bjackman bjackman Dec 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC I thought about that, but the thing is that domains are a sub-category of cpus, and there's no way to detect that: the parser has to be aware of this relationship. I can't imagine any other types of category (tasks? They have a separate schedstats file though. Task groups & sched entities would too, if they had schedstats), so I can't predict whether it makes sense to assume that the other category is a sub-entity of cpus, a sub-category of domains, or a new category entirely. Does that make sense?

Edit: Wait, that makes no sense.. I thought this comment was on different code (I assumed it was attached to a line in _get_sample) and interpreted it totally wrong. Will think about this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK now I understand you!

To be honest while it would be more generic I don't think this is worth it unless there are more categories.. even with the EAS stats there are still only 2 category types, and even with the addition of tasks there are only 3. I don't think it's worth adding a regex (which nobody likes reading) just to deal with that..

# it might not be stable.
continue
else:
self.logger.warning(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno if it really can make sense... but... can we think about generating at run-time a category which has the name of this line's first column and a "generic list of measures", e.g. measure1, measure2, measure3... etc..

This will allows to parse Android specific fileds and then map the measures to proper names in the client code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that seems like a sensible idea (although in fact it might be easier to just add the EAS-specific schedstats?). Do you mind if we leave that for a separate feature though?


return ret


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove empty lines?

Copy link
Contributor Author

@bjackman bjackman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I think I will have to wait until monday to fix it up.


# On 4.6+ kernels, schedstats needs to be enabled via kernel cmdline or
# sysctl. If not, we'll just get a load of zeroes.
self.old_sysctl_value = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, seems like a good idea, will try it.

# create.
# We'll create a site for each CPU and a site for each sched_domain.
for site, measures in self._get_sample().iteritems():
if site.startswith('cpu'):
Copy link
Contributor Author

@bjackman bjackman Dec 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC I thought about that, but the thing is that domains are a sub-category of cpus, and there's no way to detect that: the parser has to be aware of this relationship. I can't imagine any other types of category (tasks? They have a separate schedstats file though. Task groups & sched entities would too, if they had schedstats), so I can't predict whether it makes sense to assume that the other category is a sub-entity of cpus, a sub-category of domains, or a new category entirely. Does that make sense?

Edit: Wait, that makes no sense.. I thought this comment was on different code (I assumed it was attached to a line in _get_sample) and interpreted it totally wrong. Will think about this!

# it might not be stable.
continue
else:
self.logger.warning(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that seems like a sensible idea (although in fact it might be easier to just add the EAS-specific schedstats?). Do you mind if we leave that for a separate feature though?

This instrument collects snapshots of the CPU/sched_domain scheduler
statistics in /proc/schedstat, and reports a Measure for each of
them. It currently doesn't support the per-task stats reported in
/proc/<pid>/schedstat, and it ignores the additional EAS schedstats
reported by Android kernels.

The measures are named according to the corresponding identifiers in
the kernel source code. Although This might appear to be an obscure
way to refer to them, the information in /proc/schedstats is not
going to be any use to anyone who doesn't have the kernel code
handy. There is an English description in
Documentation/scheduler/sched-stats.txt but it can only really be
considered a reference for those who already understand the
scheduler.

Some of the measures are representing time intervals, but their units
are jiffies, the meaning of which depends on the kernel
configuration. Therefore no units are reported for the measurements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants