Parsing
bozuriciyu, 2019-10-09 17:05:00

How to download a million pictures?

I have a million links to pictures. Assume the source site has no rate limit. What is the fastest way to download them? Downloading them one at a time at, say, 1 picture per second would take more than 11 days of non-stop downloading. With 10 parallel streams it would take a little over a day. That is still a long time (and there may actually be as many as 10 million pictures).
I planned to download them with a Node script, which raises another question: how do I parallelize the work across 10 threads? Naively iterating in a loop will not work, since it would apparently fire off a thousand requests within milliseconds. So I need to limit the concurrency somehow.
Maybe there are existing tools suited to this kind of bulk download? Node modules or Python would do; as a last resort, even some GUI program.


2 answer(s)
Andrew, 2019-10-09
@bozuriciyu

download_images_from_csv.sh (adapt it to your needs if necessary)

#!/bin/bash
COLUMN=1 # csv column to extract
RENAME=false # whether to rename downloaded files; note that the renaming rule was specific to my problem.
THREADS=16 # threads to use by parallel

#Set Script Name variable
SCRIPT=$(basename "${BASH_SOURCE[0]}")

#Set fonts for Help.
NORM=$(tput sgr0)
BOLD=$(tput bold)
REV=$(tput smso)

# Help function
function HELP {
  echo -e \\n"Help documentation for ${SCRIPT}."\\n
  echo -e "Basic usage: ./$SCRIPT [-t threads] [-c column] [-r] -f file.csv"\\n
  echo "Command line switches are optional. The following switches are recognized."
  echo "-f csv file, required; must be the last argument"
  echo "-c column, default $COLUMN"
  echo "-t threads, default $THREADS"
  echo "-r rename downloaded files (work in progress; the renaming rule is specific to my data)"
  echo -e "-h --Displays this help message. No further functions are performed."\\n
  echo -e "Example: ${BOLD}./$SCRIPT -rc 2 -f file.csv${NORM}"\\n
  exit 1
}


#Check the number of arguments. If none are passed, print help and exit.
NUMARGS=$#
if [ $NUMARGS -eq 0 ]; then
  HELP
  exit 1
fi

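# Leading ':' enables silent error handling; -t and -c take a value, while -r, -h and -f are plain flags.
# -f only marks the filename, which is read as the first positional argument after the options.
while getopts :t:c:rhf FLAG; do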
  case $FLAG in
    t)
        THREADS=$OPTARG
      ;;
    c)
        COLUMN=$OPTARG
      ;;
    r)
        RENAME=true
      ;;
    h)  #show help
      HELP
      ;;
    \?)
      echo -e \\n"Option -${BOLD}$OPTARG${NORM} not allowed."
      HELP
      ;;
  esac
done

shift $((OPTIND-1))

# After the shift, all options have been consumed, so $1 must be the CSV filename.
FILE=$1
if [ -z "$FILE" ]; then
  echo "Error: no CSV file given."
  HELP
fi

# Skip the CSV header, take the chosen column, keep only URLs, strip leading whitespace,
# then download into ./images with $THREADS parallel wget jobs.
if [ "$RENAME" = true ]; then
    mkdir -p images && tail -n +2 "$FILE" | cut -d ',' -f"$COLUMN" | grep http | sed -e 's/^[ \t\r]*//' | \
        (cd images; parallel -j"$THREADS" -d'\r\n' --gnu 'wget {}; mv {/} `echo "{/}" | tr "." "_" | cut -d "_" -f1,3 | tr "_" "."`')
else
    mkdir -p images && tail -n +2 "$FILE" | cut -d ',' -f"$COLUMN" | grep http | sed -e 's/^[ \t\r]*//' | \
        (cd images; parallel -j"$THREADS" -d'\r\n' --gnu 'wget {};')
fi
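
If GNU parallel is not installed, the same idea of a fixed number of simultaneous downloads can probably be sketched with xargs -P. This is only a rough sketch under assumptions: the function name download_all and the file name urls.txt are made up for illustration, the list is expected to hold one URL per line, and curl is assumed as the default downloader.

```shell
#!/bin/bash
# download_all LIST JOBS [COMMAND...]
# Hypothetical helper: runs COMMAND once per line of LIST, at most JOBS at a time.
# If no COMMAND is given, curl is used (-f: fail on HTTP errors, -sS: quiet but
# still print errors, -O: save under the remote file name).
download_all() {
  local list=$1 jobs=$2
  shift 2
  local cmd=("$@")
  if [ ${#cmd[@]} -eq 0 ]; then
    cmd=(curl -fsS -O)
  fi
  # -P limits the number of parallel processes, -n 1 passes one URL per process.
  # Note: xargs treats quotes and backslashes in the input specially.
  xargs -P "$jobs" -n 1 "${cmd[@]}" < "$list"
}

# Example (not executed here): download_all urls.txt 10
```

Files that share the same remote name would overwrite each other in this sketch, so a renaming step like the one in the script above may still be needed.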

htr, 2019-10-17
@htr

Why reinvent the wheel and write your own file downloader if you already have a list of files? There are plenty of such utilities for any OS. As for timing, everything depends on your bandwidth: I would measure it in MB/s rather than in files per second, since photos can range from 10 KB to 25 MB.
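
To illustrate the point about bandwidth, here is a rough estimate as a small sketch. The function name estimate_hours and the 100 Mbit/s link speed are assumptions chosen only for the example; latency and per-request overhead are ignored.

```shell
#!/bin/bash
# estimate_hours COUNT AVG_SIZE_KB BANDWIDTH_MBIT
# Hypothetical helper: rough transfer time in hours for COUNT files of
# AVG_SIZE_KB kilobytes each over a link of BANDWIDTH_MBIT megabits per second.
estimate_hours() {
  awk -v n="$1" -v kb="$2" -v mbit="$3" 'BEGIN {
    bytes = n * kb * 1024        # total payload in bytes
    bps   = mbit * 1000000 / 8   # link speed in bytes per second
    printf "%.1f\n", bytes / bps / 3600
  }'
}

estimate_hours 1000000 10 100      # a million 10 KB pictures
estimate_hours 1000000 25600 100   # a million 25 MB pictures
```

With these assumed numbers the first case comes out at a fraction of an hour and the second at several hundred hours, so the average file size matters far more than the file count.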
