-
Notifications
You must be signed in to change notification settings - Fork 5
Home
Q: What is pyrsistence?
A: pyrsistence is a Python extension, developed in C, which allows a developer to use standard Python objects like dict and list, with the difference that data is transparently stored in files using memory mapped I/O. Class names exported by pyrsistence have names that begin with EM (stands for "External Memory") followed by a capitalized standard Python object name e.g. EMList or EMDict.
To use an external memory dictionary, for example, one can do the following:
import pyrsistence
# Constructor takes the directory where memory mapped files will be stored.
em_dict = pyrsistence.EMDict('/tmp/em_dict')
em_dict['key'] = 'value'
em_dict.close()
Keys and values can be objects of any type just like in normal Python dictionaries.
Q: What can I do with pyrsistence?
A: Personally I've been using pyrsistence to store large graphs representing disassembled programs' CFGs. Even for small programs with a few thousands of CFG nodes, RAM consumption increases rapidly when assigning properties to each node. An external memory dictionary, like EMDict can solve this problem (Python dictionaries can be used to build graphs where node identifiers are keys and the corresponding values are sets of child nodes).
pyrsistence, however, was not developed to target reverse engineering specific problems only. It can be used to process large datasets of any kind. At its current version only external memory dictionaries and lists are supported, but more data structures will be implemented in the near future.
Q: Why not use Redis or Memcached?
A: Redis and Memcached are two famous open source projects that do store their memory contents to disk, so that their state can be restored on restart. It's important to understand, however, that the actual data resides in main memory and thus physical RAM is consumed. On the contrary pyrsistence uses memory mapped files; VA range is consumed, but physical RAM is not (apart from a limited amount of memory used by OS kernels for caching purposes).
Q: Why not use SQL? SQLite is simple and doesn't require server software!
A: SQL is optimal for relational problems. On the contrary, dictionaries, lists, trees etc cannot be efficiently modeled using SQL. Dictionaries can be implemented in SQL using unique indices and trees using recursive queries. For example, a dictionary can be represented in SQLite using the following:
CREATE TABLE IF NOT EXISTS dict (key BLOB, value BLOB);
CREATE UNIQUE INDEX IF NOT EXISTS dict_key_unq ON dict(key);
To insert, select or delete entries, one would have to issue queries like the following:
INSERT OR REPLACE INTO dict (key, value) VALUES (?, ?);
SELECT value FROM dict WHERE key = ?;
DELETE OR IGNORE FROM dict WHERE key = ?;
Unfortunately, the unique indices result in serious performance degradation.
Q: Why not use no-SQL software?
A: Indeed, key-value stores are a good option for implementing dictionaries. Unfortunately most no-SQL solutions require server software to be set up. Additionally these implementations are optimized for and scale well on networks with several nodes (they are products of the cloud-era after all). Others are commercial and cost a lot. Ah... and Hadoop is developed in Java.
Q: What about shelve, bsddb, dbhash, dbm, dumpdbm and gdbm?
A: They are all DBM-based and thus very slow. Will soon publish some benchmarks.
Q: Is there any similar software?
A: Yes! The only alternative I can think of is GraphChi, however, I'd be glad to hear about others. Unfortunately, even though the latest version of GraphChi allows a graph to be dynamically modified, it requires an initial graph description file for the input graph e.g. a CSV file. Additionally, the code is not optimized and definitely not clean.