
it = map.erase(it) potentially passes entries twice #20

Open
martinus opened this issue Jul 31, 2019 · 6 comments

martinus commented Jul 31, 2019

I've stumbled upon a bug in my robin_hood map, which also uses backward shift deletion, and wondered how you solve this issue in tsl::robin_map; it seems you have exactly the same problem. (see martinus/robin-hood-hashing#42)

The problem is that, due to backward shift deletion, when iterating over a map and erasing elements it is possible for the same elements to be visited twice.

Here is a reproducer where I insert two elements into the map. When the second element is erased, the first element wraps around and is iterated over a second time. Currently I think it is best to simply document this behavior, because I am afraid there is no easy workaround (short of abandoning backward shift deletion).

#include <iostream>
#include <tsl/robin_map.h>

template <typename T>
struct BadHash {
    size_t operator()(T const& val) const {
        return static_cast<size_t>(val);
    }
};

int main(int, char**) {
    tsl::robin_map<int, int, BadHash<int>> map;
    map.reserve(16); // bucket count becomes 32, so the bucket mask is 31

    map.emplace(31, 1);        // hashes to the last bucket (31)
    map.emplace(1024 + 31, 2); // also hashes to bucket 31, probes past the end and wraps to bucket 0

    // first entry in the map
    auto it = map.begin();
    std::cout << it->first << "->" << it->second << std::endl;

    // second entry in the map, erase it
    ++it;
    std::cout << it->first << "->" << it->second << std::endl;
    it = map.erase(it);

    // both elements have now been visited, so we should be at the map's end, but
    // due to backward shift deletion the first element was moved behind the cursor
    if (it != map.end()) {
        std::cout << it->first << "->" << it->second << std::endl;
        it = map.erase(it); // guard the erase: erasing end() would be undefined behaviour
    }
    std::cout << "are we now at the end? " << (it == map.end()) << std::endl;
}

I wonder what you think about this issue?

martinus changed the title from "it = map.erase(it) potentially passes object twice" to "it = map.erase(it) potentially passes entries twice" on Jul 31, 2019
Tessil added a commit that referenced this issue on Jul 31, 2019

Tessil commented Jul 31, 2019

Hello,

Thank you very much for the report. I indeed completely missed that case when implementing the backward shift deletion.

I started implementing a fix in the fix_issue_20 branch. I still have to check that everything works correctly and improve the documentation before merging the changes.

Basically, I check whether we wrap around during the backward shift deletion, and if so I just return end().

Thibaut


martinus commented Aug 1, 2019

I think simply checking for a wrap-around is not enough. E.g. when you insert 3 elements at location 30, two elements sit at the end of the table and one wraps around. When erasing the second-to-last element, the backward shift wraps around, but there is still one element left to be iterated.


Tessil commented Aug 1, 2019

Indeed, I went a bit too fast with the proposed solution; it will not work in multiple cases. I'll check this evening to see how we can handle it.


Tessil commented Aug 10, 2019

Hi,

I checked a bit and currently I see the following possible solutions:

  • Keep the same behaviour as std::unordered_map:
    • Don't wrap around; instead, add a buffer at the end of the buckets (not counted in bucket_count) to store the elements that would have wrapped around. Rehash the table if the buffer gets full, regardless of the current load factor. The problem is the risk of a memory DoS attack, as an attacker may force the hash table to rehash repeatedly by inserting elements that land at the end.
    • When creating an iterator, store a reference to the current last element of the hash table, and propagate that reference to the iterator returned by erase. If on erase the iterator's element is the same as the stored last element, return end() after erasing it instead of the next element.
    • When calling begin(), return the first element that was not wrapped around. When incrementing the iterator, wrap around and continue until the last wrapped element.
  • Accept a different behaviour compared to std::unordered_map:
    • Leave it as it is and document the difference. When using erase in a loop, it's usually to erase elements that meet a condition (e.g. if(it->first % 2 == 0) it = erase(it);), not to erase everything that comes after an element.
    • Don't return anything, i.e. change the return type of erase to void. Not really viable, as erasing while iterating can be quite useful.

The problem is more complex than expected, and I have to think a bit to see whether I can find another solution, or decide which of these to pick.

Thibaut

martinus commented:

I also thought about adding an overflow buffer at the end. It might even have a slight performance advantage, because there is less need to mask with & (tablesize - 1). The DoS attack could also be mitigated with a hash that e.g. mixes in the this pointer, so the positions become less predictable.

Your iterator idea brings me to a new one: how about storing a counter of the number of remaining items instead of a reference? begin() sets the counter to size(), and end() sets it to 0. Each ++it or it = erase(it) decrements the counter, and comparing iterators just compares the counters. That needs no special wrap-around handling, and begin() and end() stay fast operations.


Tessil commented Aug 11, 2019

Adding a counter would also be possible, but the problem is that you also need to initialize it in methods other than begin(). For example, insertion must return an iterator to the inserted element; how can you efficiently compute the number of remaining elements until end()?

I also realised that a reference to the last element is not really possible either, as it may be invalidated. It also has the same problem as the counter: how do you efficiently get a reference to the last element in some methods? You could maintain a pointer to the last element in the class and update it on each insert/erase, but it's getting quite complex.
