Python
Ivan Yakushenko, 2019-08-14 16:03:55

Why does RAM consumption continue to rise?

Example code used:

import requests
from bs4 import BeautifulSoup as BS
from multiprocessing import Manager, Pool


def get_data(url, data, data_counter):
    r = requests.get(url)
    soup = BS(r.text, 'lxml')
    ...
    ...
    ...
    scraped_data = {
        'title': title,
        'name': name,
        'description': description,
        'image': image,
        'rating': rating,
        'category': category,
        'link': link,
        'rss': extra_data['rss'],
        'email': extra_data['email'],
        'latest_date': latest_date
    }
    for i, article in enumerate(articles, start=1):
        scraped_data['listener_{}'.format(i)] = article
    data.append(scraped_data)
    data_counter.value += 1
    print('DONE №{}: {}'.format(data_counter.value, url))


if __name__ == "__main__":
    manager = Manager()
    data = manager.list()
    data_counter = manager.Value('i', 0)
    with Pool(999) as pool:
        for url in urls:
            pool.apply_async(get_data, (url, data, data_counter))
        pool.close()
        pool.join()
    result = []
    for d in data:
        result.append(d)
    create_csv(result)
    print(len(result))

I first ran this code on a VPS with 12 CPUs and 64 GB of RAM to process 600,000 pages. It ran for almost 10 hours, and then I saw this message in the logs:
Traceback (most recent call last):
  File "scrape_mp.py", line 46, in get_email
    match = re.search(r'[\w\.-]+@[\w\.-]+', r.text )
  File "/home/kshnkvn/.local/lib/python3.6/site-packages/requests/models.py", line 861, in text
    content = str(self.content, encoding, errors='replace')
MemoryError

After this error the logs are empty, i.e. the script stopped running.
I tried to connect to the VPS over SSH, got a connection error, and had to reboot it.
I then upgraded the VPS to 16 CPUs and 102 GB of RAM and ran the script again. About 55 GB was free right after the script started, and I began monitoring memory consumption.
The script consumed about 5 GB of RAM in the first hour, 3 GB in the second, and 2.5 GB in the third. Consumption keeps growing, although more slowly each hour.
So the question is: what is the RAM being spent on, and can this be prevented for this particular code?


1 answer(s)
Roman Kitaev, 2019-08-14
@kshnkvn

You create a pool of about 1,000 processes and are surprised that memory keeps draining away? Well, my friend, that is a perverse thing to do.
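
To illustrate the point: every worker in a `Pool(999)` is a separate OS process with its own Python interpreter, so the per-worker memory is multiplied by 999. Below is a minimal sketch of one possible alternative, not a drop-in fix. It swaps the process pool for a small, bounded thread pool, on the assumption that the work is mostly waiting on the network, and it returns each row to the parent instead of appending to a `Manager` list. The `MAX_WORKERS` value, the stand-in body of `get_data`, the helper `scrape_all` and the example URLs are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: a small fixed worker count; tune it for your host and target site.
MAX_WORKERS = 32


def get_data(url):
    """Hypothetical stand-in for the real scraper: returns one row per URL."""
    # The real function would call requests.get(url) and parse the page here.
    return {'link': url, 'title': 'title for {}'.format(url)}


def scrape_all(urls, max_workers=MAX_WORKERS):
    """Run get_data over urls with a bounded pool; rows are collected in the parent."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input URLs.
        return list(executor.map(get_data, urls))


# Hypothetical URL list, used only to make the sketch runnable.
urls = ['https://example.com/page/{}'.format(i) for i in range(100)]
result = scrape_all(urls)
print(len(result))
```

If the parsing turns out to be CPU-bound, the same shape should work with a process pool sized close to the number of CPUs rather than 999. Whether this removes the memory growth in the original script would still need to be measured.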
