How to crawl a website using wget until 300 HTML pages are saved
I want to crawl a website recursively using wget in Ubuntu and stop it after 300 pages have been downloaded. I only save the HTML file of each page. This is the command I am currently using:

wget -r --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL --follow-tags=a

I want the command to somehow count the HTML files inside LOCAL-DIR and stop the crawl once the counter reaches 300. Is there any way to do this?
Answer
You could try something like this:
- background your wget command and record its PID ($!)
- set up an inotifywait watch on the receiving directory to count files
- kill the wget process when the count exceeds a threshold
To illustrate, using a shell function to simulate the recursive wget:
#!/bin/bash
local_dir=tmp
mkdir -p "$local_dir"   # the directory must exist before touch/inotifywait use it
wgetcmd() {
    local i=0
    while :
    do
        # simulate page download
        echo "Downloading... $((++i))"
        touch "$local_dir/file${i}.html"
        sleep 2
    done
}
wgetcmd & pid=$!
j=1
while kill -s 0 $pid 2>/dev/null && read path action file
do
    # stop once 30 files have been seen (use 300 for the real crawl)
    if (( ++j >= 30 )); then
        echo "Reached page limit"
        kill $pid
        break
    fi
done < <(inotifywait -m "$local_dir" -e close_write)
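If inotify-tools isn't installed, a plain polling loop achieves the same effect with only standard tools: count the saved .html files periodically and kill the downloader once the limit is reached. This is a sketch under the same assumptions as above (the directory name local_dir and the function wgetcmd simulating the real recursive wget), using the question's limit of 300:

```shell
#!/bin/bash
# Polling alternative: no inotify-tools required.
# wgetcmd is a stand-in for the real recursive wget command.
local_dir=tmp
mkdir -p "$local_dir"

wgetcmd() {
    local i=0
    while :; do
        touch "$local_dir/file$((++i)).html"
        sleep 0.01   # fast simulation; a real crawl is much slower
    done
}
wgetcmd & pid=$!

limit=300
while kill -s 0 "$pid" 2>/dev/null; do
    # count the HTML pages saved so far
    count=$(find "$local_dir" -name '*.html' | wc -l)
    if (( count >= limit )); then
        echo "Reached page limit"
        kill "$pid"
        break
    fi
    sleep 1   # poll interval; tune to taste
done
```

Polling once a second is coarser than inotifywait, so a few pages beyond the limit may slip through before the kill lands; for a hard cap of exactly 300, the inotifywait version above reacts per file.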