How to download a million pictures?
There are a million links to pictures. Suppose the source site has no rate limit. What is the fastest way to download them? Downloading sequentially at, say, 1 picture per second takes about 11.5 days nonstop. With 10 parallel streams it takes a little over a day. That is still a long time (and there may well be 10 million pictures in total).
I planned to download them with a Node.js script. Hence a second question: how do I parallelize the downloads into 10 threads? Naively looping over the links will not work, since the loop would apparently fire off a thousand requests within milliseconds. So I need to throttle it somehow.
Maybe there are suitable tools for this kind of bulk download? Node.js modules or Python; as a last resort, some kind of GUI tool would do.
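As a rough check of these estimates, here is a small sketch. The function name estimate_days is made up for illustration, and the model assumes every stream sustains the same per-picture time with no overhead.

```shell
#!/bin/bash
# estimate_days ITEMS SECONDS_PER_ITEM STREAMS
# Prints the wall-clock time in days, assuming perfectly even parallelism.
estimate_days() {
  LC_ALL=C awk -v n="$1" -v s="$2" -v p="$3" \
    'BEGIN { printf "%.1f\n", n * s / p / 86400 }'
}

estimate_days 1000000 1 1     # 1 million pictures, 1 stream   -> 11.6
estimate_days 1000000 1 10    # 1 million pictures, 10 streams -> 1.2
estimate_days 10000000 1 10   # 10 million pictures, 10 streams -> 11.6
```

In practice the limit is more likely to be bandwidth or the remote server than the number of streams, so these figures are only an upper-level guide.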
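One possible way to cap the number of simultaneous downloads without writing the throttling loop by hand is the -P option of xargs (available in GNU and BSD xargs, not required by POSIX). This is only a sketch: the helper name run_limited and the file name urls.txt are hypothetical, and it assumes one URL per line with no whitespace or quote characters inside the URLs.

```shell
#!/bin/bash
# run_limited JOBS CMD [ARGS...]
# Reads items from stdin and runs "CMD ARGS item" for each one,
# keeping at most JOBS processes alive at any moment.
run_limited() {
  local jobs=$1
  shift
  xargs -P "$jobs" -n 1 "$@"
}

# Usage (hypothetical urls.txt with one link per line):
#   run_limited 10 wget -q < urls.txt
```

Because xargs starts a new process only when a slot frees up, the number of open connections never exceeds the chosen limit, which is the behaviour a plain loop lacks.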
download_images_from_csv.sh (adapt it to your needs if necessary)
#!/bin/bash
COLUMN=1 # csv column to extract
RENAME=false # whether to rename downloaded files; note that the renaming rule was really specific to my problem
THREADS=16 # threads to use by parallel
#Set Script Name variable
SCRIPT=$(basename "${BASH_SOURCE[0]}")
#Set fonts for Help.
NORM=`tput sgr0`
BOLD=`tput bold`
REV=`tput smso`
# Help function
function HELP {
echo -e \\n"Help documentation for ${SCRIPT}."\\n
echo -e "Basic usage: ./$SCRIPT"\\n
echo "Command line switches are optional. The following switches are recognized."
echo "-f csv file, required, should be the last argument"
echo "-c column, default $COLUMN"
echo "-t threads, default $THREADS"
echo "-r rename, downloaded files should be renamed - work in progress, because this renaming is really specific"
echo -e "-h --Displays this help message. No further functions are performed."\\n
echo -e "Example: ./${BOLD}$SCRIPT -rc 2 -f file.csv"\\n
exit 1
}
#Check the number of arguments. If none are passed, print help and exit.
NUMARGS=$#
if [ $NUMARGS -eq 0 ]; then
HELP
exit 1
fi
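# Options: -t, -c and -f take a value; -r and -h are plain flags.
while getopts ":t:c:rhf:" FLAG; do
case $FLAG in
t)
THREADS=$OPTARG
;;
c)
COLUMN=$OPTARG
;;
r)
RENAME=true
;;
f)
FILE=$OPTARG
;;
h) #show help
HELP
;;
:)
echo -e \\n"Option -${BOLD}$OPTARG${NORM} requires an argument."
HELP
;;
\?)
echo -e \\n"Option -${BOLD}$OPTARG${NORM} not allowed."
HELP
;;
esac
done
# Drop the parsed options; if -f was not given, the first remaining argument is the filename
shift $((OPTIND-1))
FILE=${FILE:-$1}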
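# Skip the CSV header, take the URL column, keep only lines containing http, trim leading whitespace,
# then download everything into images/ with $THREADS parallel jobs.
if [ "$RENAME" = true ]; then
# After each download, rename the file: keep the 1st and 3rd of its "."/"_"-separated parts, joined by "."
mkdir -p images && tail -n +2 "$FILE" | cut -d ',' -f"$COLUMN" | grep http | sed -e 's/^[ \t\r]*//' | \
(cd images && parallel -j"$THREADS" -d'\r\n' --gnu 'wget {}; mv {/} `echo "{/}" | tr "." "_" | cut -d "_" -f1,3 | tr "_" "."`')
else
mkdir -p images && tail -n +2 "$FILE" | cut -d ',' -f"$COLUMN" | grep http | sed -e 's/^[ \t\r]*//' | \
(cd images && parallel -j"$THREADS" -d'\r\n' --gnu 'wget {}')
fi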
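The URL extraction stage of the script can be tried on its own before starting a long download. The sketch below is a simplified variant, not the script's exact pipeline: the function name extract_urls is hypothetical, carriage returns are removed with tr, and only lines that start with http are kept. Like the original, it splits on plain commas, so it assumes the CSV has no quoted fields containing commas.

```shell
#!/bin/bash
# extract_urls FILE [COLUMN]
# Prints the URLs found in the given CSV column, one per line.
extract_urls() {
  local file=$1 column=${2:-1}
  tail -n +2 "$file" |          # skip the header row
    cut -d ',' -f"$column" |    # pick the requested column
    tr -d '\r' |                # drop Windows line endings
    sed 's/^[[:space:]]*//' |   # trim leading whitespace
    grep '^http'                # keep only lines that look like links
}

# Usage (hypothetical file name):
#   extract_urls file.csv 2 | head
```

Checking the first few lines of this output is a cheap way to confirm that the column number is right before a million requests are sent.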