Try to answer these question.
Whatare dictionaries and sets good for?
How are dictionaries and set the same?
What is the overhead when using a dictionary?
How can I optimize the performance of a dictionary?
How does Python use dictionaries to keep track of namespace?
Dictionary object type – mapping from hashable object to object
In the implemenation, Python don’t choose
Red-Black Tree as the basic data structure but Hash Table with a special technology –
About open addressing:
In computer science, lazy deletion refers to a method of deleting elements from a hash table that uses open addressing. In this method, deletions are done by marking an element as deleted, rather than erasing it entirely. Deleted locations are treated as empty when inserting and as occupied during a search. – Wikipedia
There three kinds of slots in the table:
Does not hold an active (key, value) pair now and never did. Unused can transition to Active upon key insertion. This is the only case in which
me_keyis NULL, and is each slot’s initial state.
me_key!= NULL and
me_key!= dummy and
me_value!= NULL Holds an active (key, value) pair. Active can transition to Dummy upon key deletion. This is the only case in which
me_key== dummy and
Previously held an active (key, value) pair, but that was deleted and an active pair has not yet overwritten the slot. Dummy can transition to Active upon key insertion. Dummy slots cannot be made Unused again (cannot have
me_keyset to NULL), else the probe sequence in case of collision would have no way to know they were once active.
Here, Python define the entry data type of dict in python
PyDictObejct. A entry is like a paire (key, value). To avoid calculating the hash value of
me_hash used as a cached value for it.
To ensure the lookup algorithm terminates, there must be at least one Unused slot (NULL key) in the table.
ma_fill is the number of non-NULL keys (sum of Active and Dummy);
ma_used is the number of non-NULL, non-dummy keys (== the number of non-NULL values == the number of Active items).
To avoid slowing down lookups on a near-full table, we resize the table when it’s two-thirds full.
PyDict_MINSIZE is the minimum size of a dictionary. This many slots are allocated directly in the dict object.
It must be a power of 2, and at least 4. 8 allows dicts with no more than 5 active entries to live in
ma_smalltable (and so avoid an additional malloc); instrumentation suggested this suffices for the majority of dicts (consisting mostly of usually-small instance dicts and usually-small dicts created to pass keyword arguments).
Here defined a array
ma_smalltable which store 8
ma_table points to
ma_smalltable for small tables, else to
additional malloc’ed memory.
ma_table is never NULL!
This rule saves repeated runtime null-tests in the workhorse getitem and setitem calls.
I draw a figure to make the idea easy to be understand.
The question is coming. When Python will resize the diction table of that object?
The answer is in function
dictresize in file
mp-ma_fill bigger or equal to (2/3)
mp->ma_mask+1 and we have finished a insert operation(
n_used), we should resize the container(ma_table) of the dictionary.
Yes, we should resize the dictionary object if there have 6 or more object in the container but not 8.
Here is the result of hacking.
dummy PyDictObject is just a
PyStringObject in Python. Initially, dummy is a pointer in C but it finally comes to a
PyStringObject after initialization of
PyDictObject first time.
It’s important to note that resizing can happen to make a hash table larger or smaller. This is , if sufficiently many element of a hash table are deleted, the table can be scaled down in size. However, resizing only happens during an insert.
There are two ways to create a dict:
PyDict_New() is the main C API function,
tp_new slot maps to
dict_new(). In the latter case we
can save a little time over what PyDict_New does because it’s guaranteed
that the PyDictObject struct is already zeroed out.
Everyone except dict_new() should use EMPTY_TO_MINSIZE (unless they have an excellent reason not to).
Photo By Jason Leaster
作者: Jason Leaster
本文采用知识共享署名-非商业性使用 4.0 国际许可协议进行许可