Large variations in signal/background distributions #73
Hi Tommy, when estimating properties of the final classifier, I would not recommend looking at individual components. Just as a BDT's behavior is poorly explained by its individual trees, uBoost's behavior is poorly explained by looking at the individual efficiency-targeted BDTs inside it. Instead, analyze the predictions of uBoost as a whole: make a sweep over thresholds and plot signal efficiency and background efficiency. I expect your plots to be smoother (at the very least, both should be monotonically increasing as the threshold is lowered). Cheers
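For example, something along these lines (a minimal sketch, assuming a fitted uBoostClassifier `clf` and a held-out set `X_test`, `y_test` with `y_test == 1` for signal; these names are placeholders, not from your setup):

```python
import numpy as np

# Minimal sketch: sweep thresholds over the full uBoost output and compute
# signal/background efficiencies. `clf`, `X_test`, `y_test` are placeholder names.
scores = clf.predict_proba(X_test)[:, 1]   # output of the full ensemble, used as a score

thresholds = np.linspace(scores.min(), scores.max(), 200)
sig_eff = np.array([(scores[y_test == 1] > t).mean() for t in thresholds])
bkg_eff = np.array([(scores[y_test == 0] > t).mean() for t in thresholds])
# Plot sig_eff and bkg_eff vs. threshold: both curves should rise monotonically
# as the threshold is lowered, without the kinks seen per-component.
```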
Hi Alex, thank you for your answer. I understand the conceptual workflow of uBoost better now. However, I'm struggling on the technical side. I guess you imply using the predictions of the full classifier. I'm obviously doing something wrong but I can't quite understand what :/
Yeah, the thinking is that you use the full model (that is, uBoost itself, which is an ensemble of ensembles). It is notoriously slow, but that's how it was designed.
That's surprising. What about the area under the ROC curve? It may be that all predictions are shifted (e.g. all probabilities are > 0.5), but the resulting classifier still has acceptable properties in terms of signal vs. background separation and flatness.
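For instance (same placeholder `clf`, `X_test`, `y_test` as in the sketch above), the ordering of the scores is all that matters here:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Even if all scores sit in a narrow band (e.g. everything above 0.5), the ROC
# curve and its AUC depend only on how signal and background are ordered.
scores = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, scores))
fpr, tpr, thr = roc_curve(y_test, scores)   # background eff. vs. signal eff. per threshold
```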
All the probabilities for the signal are between 0.5 and 0.7, and I get 100% signal and background efficiency in this case. I looked at the probabilities at every stage (and therefore at different target efficiencies) with the staged prediction function. I noticed this behaviour quite some time ago, which is why I stopped using the full classifier.
I see, the reason is the squashing function applied to the raw scores. An important comment, though: just don't interpret uBoost outputs as probabilities (I know the name of the function says so, but in reality you'd need additional calibration steps to turn them into probabilities). The proper way is to think of the outputs as a new discriminating variable that is more useful than the existing ones. (So it's not great that uBoost returns its output in such a narrow range, but it's not a problem either: users shouldn't expect it to behave like probabilities, and should select thresholds according to their needs.) As for the .predict method that is part of the sklearn interface, there are practically no cases in HEP where you should use it. Better to just forget about its existence :)
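In practice that means choosing your own working point from the score distribution, e.g. (a sketch with the same placeholder names; the 70% target below is purely illustrative):

```python
import numpy as np

# Instead of .predict's fixed 0.5 cut, pick the threshold that gives a desired
# signal efficiency on a reference sample and treat the output as a plain
# discriminating variable. Placeholder names; 0.70 is an arbitrary example.
target_signal_efficiency = 0.70
signal_scores = clf.predict_proba(X_test[y_test == 1])[:, 1]
threshold = np.quantile(signal_scores, 1.0 - target_signal_efficiency)
selected = clf.predict_proba(X_test)[:, 1] > threshold
```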
Thank you for all your answers! I managed to make things work now. I still have one more question: I cannot see parameters such as the learning rate exposed in the classifier. Is that intentional?
Not really, they can be exposed. At the time, the idea was just to follow the original uBoost paper (and in the original paper, the modification of 'vanilla' AdaBoost does not have a learning rate as a parameter). From a practical perspective, I think a learning rate would be very helpful.
Hi everyone,
I'm currently using uBoost for my Belle II analysis. For context, I'm trying to separate B -> Xu l nu signal from B -> Xc l nu background.
While investigating the best target efficiency for my case, I plotted the signal and background counts after uBoost classification for 100 target efficiencies. I noticed kinks at various points in the distributions. Please see the attached plots for more clarity.
Even though these variations don't seem large visually, they actually correspond to ~10% fluctuations and eventually impact derived quantities such as the significance. Again, see the attached plots for an example.
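Schematically, the scan looks something like this (a rough sketch with illustrative names and parameters, not the exact configuration I use):

```python
import numpy as np
from hep_ml.uboost import uBoostBDT

# Rough sketch of the target-efficiency scan. `X_train`, `y_train`, `X_test`,
# `y_test` and `uniform_features` are placeholders; parameters are illustrative.
efficiencies = np.linspace(0.01, 0.99, 100)
signal_counts, background_counts = [], []
for eff in efficiencies:
    bdt = uBoostBDT(uniform_features=uniform_features, uniform_label=1,
                    target_efficiency=eff, n_estimators=50)
    bdt.fit(X_train, y_train)
    passed = bdt.predict(X_test) == 1          # events classified as signal
    signal_counts.append(np.sum(passed & (y_test == 1)))
    background_counts.append(np.sum(passed & (y_test == 0)))
# signal_counts / background_counts vs. efficiencies is where the kinks show up.
```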
I'm wondering if this is a known feature of uBoost or if this behaviour is caused by my sample or choice of variables/parameters.
Thank you for your answers!
If you need more info/context for my question, please tell me; I realise my explanation is quite shallow for now.
Cheers,
Tommy