
Evaluate Parallel Hashmap for potential performance benefits #258

Open
donpellegrino opened this issue Mar 13, 2022 · 2 comments

@donpellegrino
Contributor

The libhdt/src/dictionary/PlainDictionary.hpp implementation previously made reference to "GOOGLE_HASH" with some compiler directives to include <sparsehash/sparse_hash_map> in place of <ext/hash_map>. No corresponding configuration options for that library were found in configure.ac.

It looks like greg7mdp/parallel-hashmap is the currently maintained evolution of that codebase. It may offer performance benefits over standard <unordered_map>.

Enabling Parallel Hashmap can be done by pointing the compiler's include path at a clone of the GitHub repository. The following changes would be made in libhdt/src/dictionary/PlainDictionary.hpp, as sketched below:

  • Replace #include <unordered_map> with #include <parallel_hashmap/phmap.h>
  • Replace typedef std::unordered_map<const char *, DictionaryEntry *, hash<std::string>, str_cmp> DictEntryHash; with typedef phmap::parallel_node_hash_map<const char *, DictionaryEntry *, hash<std::string>, str_cmp> DictEntryHash;
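
A minimal sketch of how the modified header could look, assuming the DictionaryEntry type and str_cmp comparator already defined in PlainDictionary.hpp (the comparator body shown here is illustrative, not copied from the source):

    #include <cstring>
    #include <functional>
    #include <string>
    #include <parallel_hashmap/phmap.h>   // was: #include <unordered_map>

    class DictionaryEntry;                // already declared in libhdt

    // Equality comparator over C strings, as assumed for the existing str_cmp.
    struct str_cmp {
        bool operator()(const char *a, const char *b) const {
            return std::strcmp(a, b) == 0;
        }
    };

    // Only the container template changes; the key, value, hash, and equality
    // parameters stay the same as in the std::unordered_map typedef.
    typedef phmap::parallel_node_hash_map<const char *, DictionaryEntry *,
                                          std::hash<std::string>, str_cmp>
            DictEntryHash;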

Corresponding changes would be needed to the GNU Build System (e.g., configure.ac) and installation instructions in README.md.

Testing would need to be done to determine if Parallel Hashmap could speed up HDT operations.

Related issues include #245 and #256. Getting more of the test suite operational may be a good prerequisite for a change like this, to ensure nothing breaks.

@mielvds
Member

mielvds commented Mar 14, 2022

Hi @donpellegrino! Thanks for doing these cleanup PRs. Most of them seem like straightforward changes to keep the C++ modern, so no problem merging those. Could you tell me a little about your interest in HDT?

Back to this issue: I'm not that into C++, so I can't really assess it, but that parallel hashmap looks like a nice addition. Would it make sense to have this in a testing branch first to see whether there are performance impacts?

@donpellegrino
Contributor Author

Yes, this change needs regression testing to ensure the operations give equivalent results when done in parallel and performance testing to ensure there is justification for switching from the standard library. I intend to baseline with Intel Advisor and then try out Parallel Hashmap in place of the standard library.
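
As a first sanity check before wiring anything into the build, a rough standalone micro-benchmark along these lines could compare the two containers on synthetic string keys (the key shape, sizes, and value type here are illustrative, not taken from an HDT dataset):

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>
    #include <parallel_hashmap/phmap.h>

    // Insert and then look up every key, returning the elapsed wall time.
    template <typename Map>
    static double run(const std::vector<std::string> &keys) {
        const auto start = std::chrono::steady_clock::now();
        Map map;
        for (size_t i = 0; i < keys.size(); ++i)
            map[keys[i]] = i;
        size_t hits = 0;
        for (const auto &k : keys)
            hits += map.count(k);
        const auto stop = std::chrono::steady_clock::now();
        std::printf("hits=%zu  ", hits);  // keep the lookups from being optimized away
        return std::chrono::duration<double>(stop - start).count();
    }

    int main() {
        std::vector<std::string> keys;
        for (int i = 0; i < 1000000; ++i)
            keys.push_back("http://example.org/resource/" + std::to_string(i));

        std::printf("std::unordered_map:            %.3fs\n",
                    run<std::unordered_map<std::string, size_t>>(keys));
        std::printf("phmap::parallel_node_hash_map: %.3fs\n",
                    run<phmap::parallel_node_hash_map<std::string, size_t>>(keys));
        return 0;
    }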

One challenge is identifying a good dataset that is large enough to use for performance testing. Ideally, it would be standardized and applicable to other approaches for processing triples to support future comparative analyses. If there are any suggestions for datasets, please let me know.

donpellegrino added a commit to DeciSym/hdt-cpp that referenced this issue Aug 9, 2022
…_set not found to be in use. Replaced some std::map with phmap, but more remain. std:set still to be replaced with phmap.