PHP
PO6OT, 2015-10-19 15:22:55

How to avoid duplicate entries in a large list?

Let's say we have this code:

<?php
// Each line in list.txt is one character plus "\n", i.e. 2 bytes per
// entry, so entry number $index starts at byte offset $index*2-2.
function getlist($index){
 if($index>0 && file_exists('./list.txt') && filesize('./list.txt')>=($index*2)){
  $list=fopen('./list.txt', 'r');
  fseek($list, $index*2-2); // fseek() needs the file handle as its first argument
  $char=fgets($list, 2);    // fgets() with length 2 reads exactly 1 character
  fclose($list);
  return $char;
 }
}

// Appends one single-character entry to list.txt.
function putlist($data){
 if($data){
  $data=substr($data, 0, 1);
  file_put_contents('./list.txt', $data."\n", FILE_APPEND);
 }
}

This is a simplified version; there is no locking.
Here the input is reduced to a single character, but in reality the entries will be longer and there will be many lines (list.txt is over 1 TB in size).
How can I avoid duplicate entries without losing performance?
You could try something like this:
$data=$_GET['data']; //$data=='d';
$cancel=false;
for($i=1; ($d=getlist($i))!==null; $i++){
 if($d===$data){
  $cancel=true;
  break;
 }
}
if(!$cancel)
 putlist($data);

But with large amounts of data this approach will not do.
How does Google, for example, filter duplicate links out of its index?

1 answer
Alexey Ostin, 2015-10-19
@woonem

Is this a one-off job (to clean up the existing data), or do you need it on an ongoing basis?

  • You can split the entire volume into chunks and apply a map-reduce approach (much like Google does)
  • You can make one pass over all the data and build an index to use later (see the first sketch below)
  • Or you can simply put all the data into any database that supports this out of the box (see the second sketch below)
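
For the index option, here is a minimal sketch. It assumes the entries in list.txt are newline-terminated; the ./index/ directory, the 3-character bucket scheme and the isDuplicate() helper are illustrative, not from the answer:

<?php
// One pass over list.txt: bucket the SHA-1 of every entry by its first
// three hex characters (4096 buckets), so a later duplicate check scans
// one small bucket file instead of the whole 1 TB list.
@mkdir('./index');
$in=fopen('./list.txt', 'r');
while(($line=fgets($in))!==false){
 $hash=sha1(rtrim($line, "\n"));
 file_put_contents('./index/'.substr($hash, 0, 3), $hash."\n", FILE_APPEND);
}
fclose($in);

// Later, an entry can only be a duplicate if its hash is in its bucket.
function isDuplicate($data){
 $bucket='./index/'.substr(sha1($data), 0, 3);
 return file_exists($bucket)
  && strpos(file_get_contents($bucket), sha1($data)."\n")!==false;
}

For the database option, a sketch using SQLite via PDO (it assumes the pdo_sqlite extension; the table name is made up). A PRIMARY KEY or UNIQUE constraint turns the duplicate check into a single indexed lookup inside the engine:

<?php
$db=new PDO('sqlite:./list.db');
$db->exec('CREATE TABLE IF NOT EXISTS list (data TEXT PRIMARY KEY)');

// INSERT OR IGNORE silently skips values that are already stored.
$stmt=$db->prepare('INSERT OR IGNORE INTO list (data) VALUES (?)');
$stmt->execute([$_GET['data']]);

For 1 TB of data SQLite itself may be a stretch, but any server database (MySQL, PostgreSQL) gives the same behaviour with a UNIQUE index.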
