How to count the number of unique rows greater than 1 million while typing?

F

Fridary2019-08-06 22:10:46

Python

Fridary, 2019-08-06 22:10:46

I can not solve the problem in any way, maybe someone knows. The input is N < 1m (one million) number and then N lines of length less than 1k.
I need to output the number of unique rows as a result. But an important limitation is that you can only use the numpy library, the time is no more than 5s and 5 MB of memory (!)
The script below took me 97ms and 12 MB of memory:

import numpy as np

N = int(input())
a = np.array([])
for i in range(N):
    x = input()
    if not np.any(a == x):
        a = np.append(a, x)

print(len(a))

This code took 10MB of memory:

N = int(input())
results = np.empty(N, dtype=object)
for i in range(N):
    results[i] = input()

print(len(np.unique(results)))

Any ideas what else can be done?

Reply

Answer the question

In order to leave comments, you need to log in

4 answer(s)

R

Roman Kitaev, 2019-08-06
@deliro

N = int(input())
s = set()
for i in range(N):
    s.add(input())
print(len(s))

More optimal in terms of memory - the strings themselves are not stored, only their hashes are stored:

N = int(input())
s = set()
for i in range(N):
    s.add(hash(input()))
print(len(s))

L

longclaps, 2019-08-07
@longclaps

The problem has no solution within the stated constraints. If someone's solutions are rolled, then the editors are cheating with test data. Here's a demo for that. You can finish it by throwing out the excess and replacing it randrangewith hash(input()), and try to shove it as a solution.

from numpy import zeros, uint32
from random import randrange
from sys import getsizeof


N = 10 ** 6
hashes = zeros(N, uint32)
print(f'hashes занимает  {getsizeof(hashes)} байт')
control = set()  # здесь считаем по-честному
for i in range(N):
    # вместо строк я использую большие случайные числа
    r = randrange(0x4000000000000000)
    control.add(r)
    # сохраняем последние 4 байта r - больше не лезет
    hashes[i] = r & 0xffffffff
hashes.sort()
a, cnt = hashes[0], 1
for b in hashes:
    if a != b:
        a = b
        cnt += 1
print(f'control - целых {getsizeof(control)} байт (для строк длиной до 1к было бы больше)')
print(f'{cnt:8} разных хэшей\n{len(control):8} разных чисел')

A hash that is too short (32 bits) per 10^6 lines generates too many collisions (see the birthday paradox ). You can't shove the unimpressed.
UPDATE
Roman Kitaev suggested using bloom filter, here is a solution based on this idea. It carries the disadvantages of the Bloom filter: it works slowly and makes mistakes; it is also possible that my simplifications killed the filter, but maybe it will work.

bitmap, cnt = bytearray(0x400000), 0
for _ in range(int(input())):
    h, f = hash(input()), 0
    for _ in range(16):
        m = b'\x01\x02\x04\x08\x10\x20\x40\x80'[h & 7]
        h = ((h >> 4) ^ i) | ((h & 15) << 60)
        if not bitmap[h & 0x3fffff] & m:
            bitmap[h & 0x3fffff] |= m
            f = 1
    cnt += f
print(cnt)

A

Astrohas, 2019-08-06
@Astrohas

Have you tried using multiples? Do you have an example input?
If the main thing is the use of memory and time is not important, then you can hash the strings and store the hashes in the set. Memory must be less than 5 MB. By time 2 - 3 seconds

M

Michael, 2019-08-06
@moonz

The task of knowledge of algorithms) try Counter