I
I
Ivan2022-03-17 22:52:47
Python
Ivan, 2022-03-17 22:52:47

What is the most efficient way to read and work with a large csv in Python?

There is a large list of real estate transactions of approximately 26m lines.
You need to find all the properties that have been dealt with more than once.

File contents

0   {F887F88E-7D15-4415-804E-52EAC2F10958}   70000  1995-07-07 00:00  MK15 9HP             D       N        F             31   NaN       ALDRICH DRIVE             WILLEN      MILTON KEYNES              MILTON KEYNES     MILTON KEYNES                 A                                 A
1   {40FD4DF2-5362-407C-92BC-566E2CCE89E9}   44500  1995-02-03 00:00   SR6 0AQ             T       N        F             50   NaN         HOWICK PARK         SUNDERLAND         SUNDERLAND                 SUNDERLAND     TYNE AND WEAR                 A                                 A
2   {7A99F89E-7D81-4E45-ABD5-566E49A045EA}   56500  1995-01-13 00:00   CO6 1SQ             T       N        F             19   NaN    BRICK KILN CLOSE         COGGESHALL         COLCHESTER                  BRAINTREE             ESSEX                 A                                 A
3   {28225260-E61C-4E57-8B56-566E5285B1C1}   58000  1995-07-28 00:00   B90 4TG             T       N        F             37   NaN    RAINSBROOK DRIVE            SHIRLEY           SOLIHULL                   SOLIHULL     WEST MIDLANDS                 A                                 A
4   {444D34D7-9BA6-43A7-B695-4F48980E0176}   51000  1995-06-28 00:00   DY5 1SA             S       N        F             59   NaN          MERRY HILL      BRIERLEY HILL      BRIERLEY HILL                     DUDLEY     WEST MIDLANDS                 A                                 A
6   {709FB471-3690-4945-A9D6-4F48CE65AAB6}   58000  1995-04-28 00:00   PE7 3AL             D       Y        F              4   NaN          BROOK LANE             FARCET       PETERBOROUGH               PETERBOROUGH    CAMBRIDGESHIRE                 A                                 A
8   {E78710AD-ED1A-4B11-AB99-5A0614D519AD}   20000  1995-01-16 00:00   SA6 5AY             D       N        F            592   NaN        CLYDACH ROAD           YNYSTAWE            SWANSEA                    SWANSEA           SWANSEA                 A                                 A
9   {1DFBF83E-53A7-4813-A37C-5A06247A09A8}  137500  1995-03-31 00:00   NR2 2NQ             D       N        F             26   NaN      LIME TREE ROAD            NORWICH            NORWICH                    NORWICH           NORFOLK                 A                                 A
10  {BCD607D8-7698-4C5A-97D8-52EB0A6ED632}   57500  1995-12-20 00:00   WS6 7BQ             D       N        F              6   NaN       MERRILL CLOSE            WALSALL            WALSALL        SOUTH STAFFORDSHIRE     STAFFORDSHIRE                 A                                 A
11  {461BAE3D-DD53-40C1-BE48-52EB1511C5F3}   70000  1995-09-20 00:00  GL52 3LH             D       N        F           139B   NaN       NEW BARN LANE         CHELTENHAM         CHELTENHAM                 CHELTENHAM   GLOUCESTERSHIRE                 A                                 A
12  {003DB740-4F22-46E5-AD3C-52EB15407DE5}   62750  1995-10-02 00:00   BR3 4AT             T       N        F             56   NaN           EDEN ROAD          BECKENHAM          BECKENHAM                    BROMLEY    GREATER LONDON                 A                                 A
14  {8B5FB0C0-01CF-4584-9777-4F4908B02FEA}   78000  1995-09-29 00:00  SW16 3BL             S       N        F            208   NaN          GREEN LANE             LONDON             LONDON                    CROYDON    GREATER LONDON                 A                                 A
15  {38008828-F33E-4F54-927C-4BB90B83248E}   51000  1995-10-31 00:00   UB5 5LH             S       N        F             57   NaN       BENGARTH ROAD           NORTHOLT           NORTHOLT                     EALING    GREATER LONDON                 A                                 A
17  {8982E939-2927-457E-B695-5A066720BF59}   41000  1995-02-10 00:00  NE45 5AP             T       N        F             21   NaN        FRONT STREET          CORBRIDGE          CORBRIDGE                   TYNEDALE    NORTHUMBERLAND                 A                                 A
18  {C6AC299D-A75A-481D-BE03-5A06749177E3}  100000  1995-09-29 00:00   WR5 3EU             D       N        F              8   NaN   BARNESHALL AVENUE          WORCESTER          WORCESTER                  WORCESTER    WORCESTERSHIRE                 A                                 A
19  {EAFDBDAA-0C4D-4A9B-8CA2-5A06776A1BBB}  123000  1995-09-05 00:00  NG34 7TF             D       Y        F              5   NaN      CLAY HILL ROAD           SLEAFORD           SLEAFORD             NORTH KESTEVEN      LINCOLNSHIRE                 A                                 A
20  {3A947F9C-D5C3-4178-8382-566E952D20A1}   47000  1995-10-30 00:00  BS15 9UR             S       N        F              9   NaN         AMBLE CLOSE            BRISTOL            BRISTOL                  KINGSWOOD              AVON                 A                                 A
21  {EFC499C5-9329-46AF-92FA-4F492E0A8BC3}   58000  1995-11-23 00:00   SN6 8LH             S       N        F  THYME COTTAGE   NaN                 NaN            IDSTONE            SWINDON        VALE OF WHITE HORSE       OXFORDSHIRE                 A                                 A
22  {6D414E04-656D-41AB-98F0-4F493154CE27}   85000  1995-06-09 00:00  KT16 8DW             S       N        F             40   NaN      EASTWORTH ROAD           CHERTSEY           CHERTSEY                  RUNNYMEDE            SURREY                 A                                 A
23  {2AF070FA-D882-4A62-9ABD-4F494D410DB1}   38000  1995-08-31 00:00   LU2 7LZ             S       N        F             36   NaN         HERON DRIVE              LUTON              LUTON                      LUTON             LUTON                 A                                 A
25  {245992BA-92D4-4529-B5AD-4BB942B6F435}   56500  1995-04-20 00:00   BN3 5AG             T       N        F             15   NaN  SHAKESPEARE STREET               HOVE               HOVE                       HOVE       EAST SUSSEX                 A                                 A
26  {444EAAD2-A559-4847-9A33-5A06847257AF}   94500  1995-06-01 00:00  LE65 1HF             D       Y        F              8   NaN         KELSO CLOSE  ASHBY-DE-LA-ZOUCH  ASHBY-DE-LA-ZOUCH  NORTH WEST LEICESTERSHIRE    LEICESTERSHIRE                 A                                 A
27  {852368A9-09D6-4BA6-B33A-5A068BAD2D48}   89000  1995-11-24 00:00   B75 5PG             D       N        F             28   NaN          SLADE ROAD   SUTTON COLDFIELD   SUTTON COLDFIELD                 BIRMINGHAM     WEST MIDLANDS                 A                                 A
28  {CB70FDBB-49CC-412F-911C-5A069B724EE2}  110000  1995-12-21 00:00  BS15 3HH             D       N        F            148   NaN         HIGH STREET             HANHAM            BRISTOL                  KINGSWOOD              AVON                 A                                 A
29  {2A9D6B70-FC5D-46A1-A8EF-5A06AF633839}   51000  1995-02-17 00:00  NN15 6UQ             S       N        F             80   NaN          GOTCH ROAD    BARTON SEAGRAVE          KETTERING                  KETTERING  NORTHAMPTONSHIRE                 A                                 A



I want to complete the task as efficiently as possible in terms of resource and time consumption.

def main(filename):
    st = time.time()
    df = pd.read_csv(filename, delimiter=",")
    print(df.info(memory_usage='deep'))
    print(time.time() - st)

But reading the file is already out of bounds.

Data columns (total 16 columns):
 #   Column                             Dtype
---  ------                             -----
 0   Transaction unique identifier      object
 1   Price                              int64
 2   Date of Transfer                   object
 3   Postcode                           object
 4   Property Type                      object
 5   Old/New                            object
 6   Duration                           object
 7   PAON                               object
 8   SAON                               object
 9   Street                             object
 10  Locality                           object
 11  Town/City                          object
 12  District                           object
 13  County                             object
 14  PPD Category Type                  object
 15  Record Status - monthly file only  object
dtypes: int64(1), object(15)
memory usage: 23.6 GB
221.08039593696594 сек.

When splitting into chunks of 10 ** 5, we managed to reduce the performance, but this approach greatly complicates further work, since they still need to be stored.

memory usage: 91.4 MB
140.3509292602539 сек.


Dask also did not give any increase in speed.

The goal is to write a program that should consume as little memory and CPU time as possible.

Answer the question

In order to leave comments, you need to log in

3 answer(s)
E
Elvis, 2022-03-17
@Dr_Elvis

You can import csv into the database. In the same DBeaver , you can simply open cvs as a database and write the desired select

V
Vindicar, 2022-03-17
@Vindicar

It's not very clear which columns identify the property in the file. The GUID seems to refer to the fact of the sale. So tell me what to look for.
I would read the file line by line, choosing Postcode and counting the number of values. It will be voluminous, but certainly not 20 gigs. Then I would look at the histogram - are there many unique Postcodes (that is, those that meet once). If there are many, then it is enough to look only at those Postcodes that occur repeatedly. This will reduce the file size. In the future, it can be filtered further according to the same scheme, but with different fields.

S
ScriptKiddo, 2022-03-18
@ScriptKiddo

Count the number of transactions in a dictionary with a counter.
The dictionary key will be a tuple that explicitly points to a specific house.
Define the required fields yourself

import csv
from collections import defaultdict


result = defaultdict(int)

with open('1.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        full_address_tuple =(row['town'], row['street'], )
        result[full_address_tuple] += 1

for key, value in result.items():
    print(key, ' - ', value)

Result:
('town1', 'street1')  -  1
('town1', 'street2')  -  2
('town2', 'street2')  -  1
('town2', 'street1')  -  1

1.csv

town,street
town1,street1
town1,street2
town1,street2
town2,street2
town2,street1

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question