Answer the question
In order to leave comments, you need to log in
What is the most efficient way to read and work with a large csv in Python?
There is a large list of real estate transactions of approximately 26m lines.
You need to find all the properties that have been dealt with more than once.
0 {F887F88E-7D15-4415-804E-52EAC2F10958} 70000 1995-07-07 00:00 MK15 9HP D N F 31 NaN ALDRICH DRIVE WILLEN MILTON KEYNES MILTON KEYNES MILTON KEYNES A A
1 {40FD4DF2-5362-407C-92BC-566E2CCE89E9} 44500 1995-02-03 00:00 SR6 0AQ T N F 50 NaN HOWICK PARK SUNDERLAND SUNDERLAND SUNDERLAND TYNE AND WEAR A A
2 {7A99F89E-7D81-4E45-ABD5-566E49A045EA} 56500 1995-01-13 00:00 CO6 1SQ T N F 19 NaN BRICK KILN CLOSE COGGESHALL COLCHESTER BRAINTREE ESSEX A A
3 {28225260-E61C-4E57-8B56-566E5285B1C1} 58000 1995-07-28 00:00 B90 4TG T N F 37 NaN RAINSBROOK DRIVE SHIRLEY SOLIHULL SOLIHULL WEST MIDLANDS A A
4 {444D34D7-9BA6-43A7-B695-4F48980E0176} 51000 1995-06-28 00:00 DY5 1SA S N F 59 NaN MERRY HILL BRIERLEY HILL BRIERLEY HILL DUDLEY WEST MIDLANDS A A
6 {709FB471-3690-4945-A9D6-4F48CE65AAB6} 58000 1995-04-28 00:00 PE7 3AL D Y F 4 NaN BROOK LANE FARCET PETERBOROUGH PETERBOROUGH CAMBRIDGESHIRE A A
8 {E78710AD-ED1A-4B11-AB99-5A0614D519AD} 20000 1995-01-16 00:00 SA6 5AY D N F 592 NaN CLYDACH ROAD YNYSTAWE SWANSEA SWANSEA SWANSEA A A
9 {1DFBF83E-53A7-4813-A37C-5A06247A09A8} 137500 1995-03-31 00:00 NR2 2NQ D N F 26 NaN LIME TREE ROAD NORWICH NORWICH NORWICH NORFOLK A A
10 {BCD607D8-7698-4C5A-97D8-52EB0A6ED632} 57500 1995-12-20 00:00 WS6 7BQ D N F 6 NaN MERRILL CLOSE WALSALL WALSALL SOUTH STAFFORDSHIRE STAFFORDSHIRE A A
11 {461BAE3D-DD53-40C1-BE48-52EB1511C5F3} 70000 1995-09-20 00:00 GL52 3LH D N F 139B NaN NEW BARN LANE CHELTENHAM CHELTENHAM CHELTENHAM GLOUCESTERSHIRE A A
12 {003DB740-4F22-46E5-AD3C-52EB15407DE5} 62750 1995-10-02 00:00 BR3 4AT T N F 56 NaN EDEN ROAD BECKENHAM BECKENHAM BROMLEY GREATER LONDON A A
14 {8B5FB0C0-01CF-4584-9777-4F4908B02FEA} 78000 1995-09-29 00:00 SW16 3BL S N F 208 NaN GREEN LANE LONDON LONDON CROYDON GREATER LONDON A A
15 {38008828-F33E-4F54-927C-4BB90B83248E} 51000 1995-10-31 00:00 UB5 5LH S N F 57 NaN BENGARTH ROAD NORTHOLT NORTHOLT EALING GREATER LONDON A A
17 {8982E939-2927-457E-B695-5A066720BF59} 41000 1995-02-10 00:00 NE45 5AP T N F 21 NaN FRONT STREET CORBRIDGE CORBRIDGE TYNEDALE NORTHUMBERLAND A A
18 {C6AC299D-A75A-481D-BE03-5A06749177E3} 100000 1995-09-29 00:00 WR5 3EU D N F 8 NaN BARNESHALL AVENUE WORCESTER WORCESTER WORCESTER WORCESTERSHIRE A A
19 {EAFDBDAA-0C4D-4A9B-8CA2-5A06776A1BBB} 123000 1995-09-05 00:00 NG34 7TF D Y F 5 NaN CLAY HILL ROAD SLEAFORD SLEAFORD NORTH KESTEVEN LINCOLNSHIRE A A
20 {3A947F9C-D5C3-4178-8382-566E952D20A1} 47000 1995-10-30 00:00 BS15 9UR S N F 9 NaN AMBLE CLOSE BRISTOL BRISTOL KINGSWOOD AVON A A
21 {EFC499C5-9329-46AF-92FA-4F492E0A8BC3} 58000 1995-11-23 00:00 SN6 8LH S N F THYME COTTAGE NaN NaN IDSTONE SWINDON VALE OF WHITE HORSE OXFORDSHIRE A A
22 {6D414E04-656D-41AB-98F0-4F493154CE27} 85000 1995-06-09 00:00 KT16 8DW S N F 40 NaN EASTWORTH ROAD CHERTSEY CHERTSEY RUNNYMEDE SURREY A A
23 {2AF070FA-D882-4A62-9ABD-4F494D410DB1} 38000 1995-08-31 00:00 LU2 7LZ S N F 36 NaN HERON DRIVE LUTON LUTON LUTON LUTON A A
25 {245992BA-92D4-4529-B5AD-4BB942B6F435} 56500 1995-04-20 00:00 BN3 5AG T N F 15 NaN SHAKESPEARE STREET HOVE HOVE HOVE EAST SUSSEX A A
26 {444EAAD2-A559-4847-9A33-5A06847257AF} 94500 1995-06-01 00:00 LE65 1HF D Y F 8 NaN KELSO CLOSE ASHBY-DE-LA-ZOUCH ASHBY-DE-LA-ZOUCH NORTH WEST LEICESTERSHIRE LEICESTERSHIRE A A
27 {852368A9-09D6-4BA6-B33A-5A068BAD2D48} 89000 1995-11-24 00:00 B75 5PG D N F 28 NaN SLADE ROAD SUTTON COLDFIELD SUTTON COLDFIELD BIRMINGHAM WEST MIDLANDS A A
28 {CB70FDBB-49CC-412F-911C-5A069B724EE2} 110000 1995-12-21 00:00 BS15 3HH D N F 148 NaN HIGH STREET HANHAM BRISTOL KINGSWOOD AVON A A
29 {2A9D6B70-FC5D-46A1-A8EF-5A06AF633839} 51000 1995-02-17 00:00 NN15 6UQ S N F 80 NaN GOTCH ROAD BARTON SEAGRAVE KETTERING KETTERING NORTHAMPTONSHIRE A A
def main(filename):
st = time.time()
df = pd.read_csv(filename, delimiter=",")
print(df.info(memory_usage='deep'))
print(time.time() - st)
Data columns (total 16 columns):
# Column Dtype
--- ------ -----
0 Transaction unique identifier object
1 Price int64
2 Date of Transfer object
3 Postcode object
4 Property Type object
5 Old/New object
6 Duration object
7 PAON object
8 SAON object
9 Street object
10 Locality object
11 Town/City object
12 District object
13 County object
14 PPD Category Type object
15 Record Status - monthly file only object
dtypes: int64(1), object(15)
memory usage: 23.6 GB
221.08039593696594 сек.
memory usage: 91.4 MB
140.3509292602539 сек.
Answer the question
In order to leave comments, you need to log in
You can import csv into the database. In the same DBeaver , you can simply open cvs as a database and write the desired select
It's not very clear which columns identify the property in the file. The GUID seems to refer to the fact of the sale. So tell me what to look for.
I would read the file line by line, choosing Postcode and counting the number of values. It will be voluminous, but certainly not 20 gigs. Then I would look at the histogram - are there many unique Postcodes (that is, those that meet once). If there are many, then it is enough to look only at those Postcodes that occur repeatedly. This will reduce the file size. In the future, it can be filtered further according to the same scheme, but with different fields.
Count the number of transactions in a dictionary with a counter.
The dictionary key will be a tuple that explicitly points to a specific house.
Define the required fields yourself
import csv
from collections import defaultdict
result = defaultdict(int)
with open('1.csv') as f:
reader = csv.DictReader(f)
for row in reader:
full_address_tuple =(row['town'], row['street'], )
result[full_address_tuple] += 1
for key, value in result.items():
print(key, ' - ', value)
('town1', 'street1') - 1
('town1', 'street2') - 2
('town2', 'street2') - 1
('town2', 'street1') - 1
town,street
town1,street1
town1,street2
town1,street2
town2,street2
town2,street1
Didn't find what you were looking for?
Ask your questionAsk a Question
731 491 924 answers to any question