This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

[8] Find a way how to get project name from NVD CVE data #2485

Closed
5 tasks done
msrb opened this issue Mar 7, 2018 · 5 comments

Comments

@msrb
Collaborator

msrb commented Mar 7, 2018

Description

Mapping CVE entries to actual package names is much easier when we at least know the name of the project (e.g. "Apache NiFi" or "Apache POI") that is affected by a given vulnerability. Knowing the project name will help us get better results and fewer false positives.

The output of this task should be a function that takes one NVD CVE record as input and returns a list of possible project name candidates. Having a confidence score for each candidate would be nice, but is not necessary.
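To make the requested interface concrete, here is a minimal sketch of such a function. Everything in it is an assumption made for illustration: the name `project_name_candidates`, the simplified record with a flat `"description"` key, and the capitalized-word heuristic with rank-based scores. It only shows the input and output shape, not the approach that was eventually implemented.

```python
import re
from typing import Dict, List, Tuple

# Runs of capitalized words, e.g. "Apache NiFi" or "Apache POI".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][A-Za-z0-9]*(?:\s+[A-Z][A-Za-z0-9]*)*")


def project_name_candidates(record: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (candidate, confidence) pairs, best candidate first.

    `record` is a hypothetical, simplified CVE record that carries its
    text under the "description" key; real NVD records are nested deeper.
    """
    description = record.get("description", "")
    candidates: List[str] = []
    for match in CAPITALIZED_RUN.findall(description):
        if match not in candidates:  # keep the first occurrence only
            candidates.append(match)
    # Placeholder confidence: earlier mentions score higher (1, 1/2, 1/3, ...).
    return [(name, 1.0 / (rank + 1)) for rank, name in enumerate(candidates)]


record = {"description": "Apache NiFi before 1.5.0 allows XSS in the UI."}
print(project_name_candidates(record))
```

The heuristic also returns noise such as acronyms, which is one reason a trained classifier with real confidence scores is preferable.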

Sub tasks for sprint #2433

  • Have a set of labeled data to train, validate and test accuracy with.
    Update: It is possible to label the NVD feeds that reference GitHub, and hence the project name can be inferred from the description of the CPE. This set might not be sufficient, however, and the approach should be discussed further.
  • Discover whether the data evinces a latent pattern.
  • Model selection based on the description pattern properties
    Update: Based on the description properties, Naive Bayes classifier was selected for the implementation.
  • Classifier implementation
  • Accuracy evaluation
    Update: Accuracy has been evaluated on a relatively small dataset (approx. 20% of real data) due to lack of labeled data.
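The sub tasks above can be sketched end to end as follows. This is not the notebook code: the thread only states that a Naive Bayes classifier was selected, so the token features, the `BooleanNaiveBayes` class and the three hand-labeled descriptions are all assumptions chosen to keep the example small and dependency-free.

```python
import math
from collections import Counter, defaultdict


def token_features(tokens, i):
    """Hypothetical boolean features for token i of a tokenized description."""
    return {
        "capitalized": tokens[i][:1].isupper(),
        "among_first_three": i < 3,
        "follows_capitalized": i > 0 and tokens[i - 1][:1].isupper(),
    }


class BooleanNaiveBayes:
    """Naive Bayes over boolean features with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts

    def fit(self, samples):
        for features, label in samples:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1

    def predict_proba(self, features):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            log_p = math.log(count / total)  # class prior
            for name, value in features.items():
                seen = self.value_counts[(label, name)][value]
                log_p += math.log((seen + 1) / (count + 2))  # two possible values
            log_scores[label] = log_p
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


# Toy, hand-labeled data: (tokens, indices of the project name tokens).
labeled = [
    ("Apache NiFi before 1.5.0 allows remote attackers to read files".split(), {0, 1}),
    ("Apache POI before 3.17 allows denial of service".split(), {0, 1}),
    ("The parser in Mozilla Firefox mishandles long input".split(), {3, 4}),
]
samples = [
    (token_features(tokens, i), i in name_indices)
    for tokens, name_indices in labeled
    for i in range(len(tokens))
]
model = BooleanNaiveBayes()
model.fit(samples)

# Score every token of an unseen description and keep the likely name tokens.
tokens = "Apache Tika before 1.18 allows command injection".split()
scores = [model.predict_proba(token_features(tokens, i))[True] for i in range(len(tokens))]
name_tokens = [t for t, p in zip(tokens, scores) if p > 0.5]
print(" ".join(name_tokens))
```

The per-token probabilities double as the confidence scores mentioned in the description, so candidates can be ranked or thresholded instead of being accepted blindly.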
@CermakM

CermakM commented Mar 7, 2018

A sub task has been added. In order to be able to at least estimate the accuracy of a model, we need an accessible toy set of labeled data.

@CermakM CermakM changed the title Find a way how to get project name from NVD CVE data [8] Find a way how to get project name from NVD CVE data Mar 7, 2018
@krishnapaparaju
Collaborator

@CermakM can you please share the approach for retrieving the project name? Which ecosystems are being considered for this work?

CermakM pushed a commit to CermakM/nvdlib-msrb- that referenced this issue Mar 14, 2018
- commit jupyter notebook
  DISCLAIMER: the code in the notebook is in NON-production quality and
serves only as sketch of possible solution
- the notebook provides a POC for project name inference
from cpe description
- the notebook is supposed to visualize possible results when
implementing such kind of classifier for the task

GitHub issue: openshiftio/openshift.io#2485

Signed-off-by: Marek Cermak <[email protected]>

new file:   cve-desrciption-cracker.ipynb
@CermakM

CermakM commented Mar 14, 2018

@krishnapaparaju sure thing,
This approach is not ecosystem specific, since it is based on extracting the project name from the CPE description only, without using any external search engines.

The notebook suggests an approach that could greatly improve on our current one (it also includes a brief comparison).
This is still a WIP and a lot of things need to be discussed / implemented / verified.
Also, please expect and tolerate some slapdash code.

@msrb
Collaborator Author

msrb commented Mar 27, 2018

This experiment will continue in Sprint 147.

@CermakM please update this issue, thanks 👍

@msrb msrb modified the milestones: Sprint 146, Sprint 147 Mar 27, 2018
@CermakM

CermakM commented Mar 27, 2018

Conclusion for the current sprint #2433

We were able to show that the description data evince a pattern. With a suitable feature extractor, a classifier can be trained to provide decent predictions of project name candidates. Each candidate is assigned a numeric confidence score and can be processed further (ordered, filtered, etc.).

To be done in sprint #2775:

  • It is not yet known whether the accuracy of the classifier is sufficient. Only a basic accuracy evaluation has been implemented so far; cross-validation remains to be done, as suggested by @rootAvish.
  • Also, the classifier used for this proof was an implicit one and can be improved further to provide a more flexible and suitable way to integrate with our current tools. The decision about the implementation will be based on the cross-validation results and team members' opinions.
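For reference, the outstanding cross-validation step could look roughly like the sketch below. The helper `k_fold_accuracy`, the `train` callback convention and the majority-class placeholder model are assumptions for illustration; the actual classifier would be plugged in through `train`.

```python
import random
from collections import Counter


def k_fold_accuracy(samples, train, k=5, seed=0):
    """Return the accuracy on each of k held-out folds.

    `train(training_samples)` must return a `predict(features)` callable.
    """
    if not 2 <= k <= len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        predict = train(training)
        correct = sum(predict(features) == label for features, label in held_out)
        scores.append(correct / len(held_out))
    return scores


def train_majority(training):
    """Placeholder model: always predicts the most common training label."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: majority


# Toy samples: (features, label) pairs where one token in five is a name token.
samples = [({"capitalized": i % 5 == 0}, i % 5 == 0) for i in range(20)]
scores = k_fold_accuracy(samples, train_majority, k=5)
print(sum(scores) / len(scores))
```

The majority-class baseline is worth keeping around: with name tokens being rare, a classifier has to beat its accuracy clearly before the result means anything.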

cc @msrb
