how best to organize a container for storing IP addresses (1-3Mb)

Alexander2010-12-15 14:35:34

Algorithms

Alexander, 2010-12-15 14:35:34

to organize a cache of IP addresses, you need a ready-made solution or an effective algorithm for quick access to data (something like a hash table)
data may become outdated, it is necessary to store the last access time.
Cleaning or extrusion of irrelevant data should be provided,
which is better in this case?
the access rate is high (200-300 requests per second)!

the solution should be similar to memcash
fast and efficient!

input IP - 32-bit number
output - no more than 64 bytes - fixed size data
additional parameter - access time

Purpose - to check the time of the last access.

possible options:
1)
I would not like to pull out information about IP, scan the entire table of addresses.
in this case, inserting a new IP is inserting into an empty data slot, instead of the first expired slot.
2)
ordering by IP is not suitable due to the high frequency of access (maybe it is ???).
we are looking for IP by the method of half divisions. If you don't find it, paste it. Which slot storage model to use in this case?
3)
organize a queue by access time - it turns out a scan of the entire (part starting from the first not expired IP - before it - the method of half divisions) of the address table.
add to the end.

what other options are there????

Answer the question

In order to leave comments, you need to log in

8 answer(s)

Dervish66, 2010-12-15
@Dervish66

You need to access the same data in two ways. One way is by IP, the second is by access time. At the same time, it is necessary that the search for a change in the data (change of IP in the data slot and change of access time) be performed quickly and efficiently.
I would solve this problem by separating the data itself (an array of slots) from the indexes that need to be accessed. If we take the implementation in C ++, then something like this:

// Описатель слота данных
struct CDataItem {
    __int64 accessTime; // Любое представление времени
    DWORD ip;
    BYTE  userData [64];
};

// Индекс по IP
std::map<DWORD, CDataItem *> ipIndex;

// Индекс по времени доступа
std::map<__int64, CDataItem *> accessIndex;

// Память для хранения массива слотов
CDataItem * dataArray = new CDataItem[32000];

In this case, of course, after each change (update) of the slot, you will have to update the indexes.

amc, 2010-12-15
@amc

Remember that an IP address is perfectly represented as a 4-byte number.

outself, 2010-12-15
@outself

For a quick search, like an index, you can use the Bloom Filter ( en.wikipedia.org/wiki/Bloom_filter )
And Chrome uses Bloom Filters to pre-judgment whether a website is malicious. In practice, - Compact storage of a million addresses in ~ 18 megabytes.

Dervish66, 2010-12-15
@Dervish66

It is unlikely that there will be a container that would provide efficient access to data using two indexes at once. But instead of std::map, you can use anything, for example, balanced trees, especially since the code will be written in C. The main idea of \u200b\u200bmy proposal is to separate the indices and the data itself. Then the costs of searching for slots, inserting slots and updating indexes can be minimized.
Preemption is also implemented quite simply: by the access time index, we find the slot with the lowest access time value (the oldest) and replace all the fields in it. In this case, of course, you will have to update both indexes.
Updating an index can be done by deleting an index entry and adding a new one.

shsmad, 2010-12-15
@shsmad

and than actually memcached described by you did not approach?

eternals, 2010-12-15
@eternals

If standard tools cannot be used, then build a simple binary tree. Lock individual nodes when rebuilding.

catap, 2010-12-16
@catap

A mixture of LRU cache and some radix32 tree

Alexander, 2010-12-17
@akalend

interested, I decided to do this:
search by IP structure in the form of a b-Tree,
then check for time, the
times are stored in the form of a reverse list
, the last one in the queue is forced out.
When updating IP, we transfer the value of the element to the end of the list.