PowerShell
snrt5osp, 2021-08-29 17:18:27

How to find and display duplicate lines in a text file?

There are two text files, file_0.txt and file_1.txt. The number of lines and the length of each line vary, and both files are large. I need to efficiently write the lines that occur in both files to a third file.
Example:
Contents of file_0.txt:

j43j72h531
b2x891ow52
rr35986z77
x77jm9lp7g
q0pprcp52yawc10
wh3h476m2u
e7h0cv6rh5
5l7i700939
l3ri0p8p2f
l1h14no300

Contents of file_1.txt:
l1h14no300
j2615a2e0y
815555v33h
q0pprcp52yawc10
2vhhh0ugxv
rc2jl8lhdl
79qn640321
b2x891ow52

Required contents of file_2.txt after the program/command has run:
b2x891ow52
q0pprcp52yawc10
l1h14no300

I tried to do this with the CMD findstr command, but for some reason the output did not contain all the matching lines, even though I verified manually that they are present in both files. Its speed suits me fine: on an i5-8400, comparing 100,000 lines in one file against 100,000 in the other takes 10-15 seconds.
Please suggest a CMD/PowerShell command or a program that does this.

4 answer(s)
MaxKozlov, 2021-08-29
@MaxKozlov

Something like this (it is almost C#) should help:

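# Load each file into a HashSet of strings (everything is held in memory).
$c = [string[]](Get-Content .\0.txt)
$sk1 = [System.Collections.Generic.HashSet[string]]::new($c)
$c = [string[]](Get-Content .\1.txt)
$sk2 = [System.Collections.Generic.HashSet[string]]::new($c)
# Keep in $sk1 only the lines that are also in $sk2, then output them.
$sk1.IntersectWith($sk2)
$sk1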

On your sample data it returned the required lines (though not in the same order as in your example).
As for memory: everything is loaded into memory.
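
If holding both files in memory is a concern, one possible variation is to load only one file into the set and stream the other. The sketch below is an assumption-laden illustration, not part of the answer above: the function name Get-CommonLine and its parameters are hypothetical, and it assumes PowerShell 5 or later and an exact, case-sensitive comparison. Output follows the order of the streamed file, and each common line is emitted once.

```powershell
# Hypothetical helper; the name and parameters are illustrative only.
function Get-CommonLine {
    param(
        [Parameter(Mandatory)] [string] $ReferencePath,  # file loaded into memory
        [Parameter(Mandatory)] [string] $StreamPath      # file read line by line
    )

    # .NET methods ignore PowerShell's current location, so resolve full paths first.
    $ref    = (Resolve-Path -LiteralPath $ReferencePath).ProviderPath
    $stream = (Resolve-Path -LiteralPath $StreamPath).ProviderPath

    # Case-sensitive (ordinal) set built from the reference file's lines.
    $set = [System.Collections.Generic.HashSet[string]]::new([System.IO.File]::ReadLines($ref), [System.StringComparer]::Ordinal)

    # Remove() returns $true only the first time a line is found,
    # so a line repeated in the streamed file is output once.
    foreach ($line in [System.IO.File]::ReadLines($stream)) {
        if ($set.Remove($line)) { $line }
    }
}
```

With the file names from the question, a call might look like `Get-CommonLine -ReferencePath .\file_1.txt -StreamPath .\file_0.txt | Set-Content .\file_2.txt`, which keeps the lines in the order they appear in file_0.txt.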

Andrew AT, 2021-08-30
@AAT666

$f0 = Get-Content -Path C:\tmp\file_0.txt
$f1 = Get-Content -Path C:\tmp\file_1.txt

[system.linq.enumerable]::Intersect([object[]]$f0, [object[]]$f1) | Out-File -FilePath C:\tmp\file_2.txt

kalapanga, 2021-08-29
@kalapanga

If I were you, I would first work out which lines are missing from the result. Maybe they are not exactly identical in the two files.
findstr is the natural tool for this job, and you are already happy with its speed.
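
One way to check whether the missing lines are "not exactly the same" is to list lines that match only after trimming whitespace or ignoring case. This is a rough diagnostic sketch under stated assumptions: the function name Get-NearMatchLine and its parameters are hypothetical, and stray spaces or case differences are only a guess at what might differ.

```powershell
# Hypothetical diagnostic helper; the name and parameters are illustrative only.
function Get-NearMatchLine {
    param(
        [Parameter(Mandatory)] [AllowEmptyString()] [string[]] $Reference,
        [Parameter(Mandatory)] [AllowEmptyString()] [string[]] $Candidate
    )

    # Exact, case-sensitive set of the reference lines.
    $exact = [System.Collections.Generic.HashSet[string]]::new($Reference, [System.StringComparer]::Ordinal)

    # Loose set: trimmed lines, compared without regard to case.
    $loose = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::OrdinalIgnoreCase)
    foreach ($line in $Reference) { [void]$loose.Add($line.Trim()) }

    foreach ($line in $Candidate) {
        # Report lines that fail the exact test but pass the loose one.
        if (-not $exact.Contains($line) -and $loose.Contains($line.Trim())) {
            # Brackets and the length make stray spaces visible.
            '[{0}] (length {1})' -f $line, $line.Length
        }
    }
}
```

For example, `Get-NearMatchLine -Reference (Get-Content .\file_0.txt) -Candidate (Get-Content .\file_1.txt)` would print the lines of file_1.txt that have a near match, but no exact match, in file_0.txt.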

azarij, 2021-08-29
@azarij

How about this?

(Compare-Object -ReferenceObject (get-content c:\test\1.txt) -DifferenceObject (get-content c:\test\0.txt) -ExcludeDifferent -IncludeEqual).inputobject | out-file c:\test\2.txt
