How to crawl a website using wget until 300 HTML pages are saved
I want to crawl a website recursively using wget in Ubuntu and stop it after 300 pages have been downloaded. I only save the HTML file of each page. This is the command I am currently using:

wget -r --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL --follow-tags=a

I want the command to somehow count the HTML files inside LOCAL-DIR and stop the crawl once the counter reaches 300. Is there any way to do this?
Answer
You could try something like this:
- background your wget command and record its PID ($!)
- set up an inotifywait watch on the receiving directory to count files
- kill the wget process when the count exceeds a threshold
To illustrate, using a shell function to simulate the recursive wget:
#!/bin/bash
local_dir=tmp
mkdir -p "$local_dir"   # the directory must exist before touch/inotifywait use it
wgetcmd() {
    local i=0
    while :
    do
        # simulate page download
        echo "Downloading... $((++i))"
        touch "$local_dir/file${i}.html"
        sleep 2
    done
}
wgetcmd & pid=$!
j=1
while kill -s 0 $pid 2>/dev/null && read path action file
do
    # stop once 30 files have been seen (use 300 for the real crawl)
    if (( ++j >= 30 )); then
        echo "Reached page limit"
        kill $pid
        break
    fi
done < <(inotifywait -m "$local_dir" -e close_write)
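If inotify-tools isn't installed, a plain polling loop achieves the same effect with only standard tools: count the saved .html files periodically and kill the downloader once the limit is reached. This is a sketch under the same assumptions as above (the directory name local_dir and the function wgetcmd simulating the real recursive wget), using the question's limit of 300:

```shell
#!/bin/bash
# Polling alternative: no inotify-tools required.
# wgetcmd is a stand-in for the real recursive wget command.
local_dir=tmp
mkdir -p "$local_dir"

wgetcmd() {
    local i=0
    while :; do
        touch "$local_dir/file$((++i)).html"
        sleep 0.01   # fast simulation; a real crawl is much slower
    done
}
wgetcmd & pid=$!

limit=300
while kill -s 0 "$pid" 2>/dev/null; do
    # count the HTML pages saved so far
    count=$(find "$local_dir" -name '*.html' | wc -l)
    if (( count >= limit )); then
        echo "Reached page limit"
        kill "$pid"
        break
    fi
    sleep 1   # poll interval; tune to taste
done
```

Polling once a second is coarser than inotifywait, so a few pages beyond the limit may slip through before the kill lands; for a hard cap of exactly 300, the inotifywait version above reacts per file.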