
How to crawl a website using wget until 300 HTML pages are saved

I want to crawl a website recursively using wget on Ubuntu and stop it after 300 pages have been downloaded. I only want to save the HTML file of each page. This is the command I am currently using:

wget -r --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL --follow-tags=a

I want the command to somehow count the HTML files inside LOCAL-DIR and, once the counter reaches 300, stop the crawl. Is there any way to do this?
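For reference, the kind of count I have in mind would be something like this (assuming each saved page ends in .html):

# count the HTML files saved so far under ./LOCAL-DIR
find ./LOCAL-DIR -type f -name '*.html' | wc -l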


1 Answer

You could try something like this:

  1. background your wget command and record its PID ($!)

  2. set up an inotifywait watch on the receiving directory to count files

  3. kill the wget process when the count exceeds a threshold

To illustrate, here is a script that uses a shell function to simulate the recursive wget:

#!/bin/bash
local_dir=tmp
mkdir -p "$local_dir"    # directory that inotifywait will watch

# Simulate the recursive wget: write one .html file every 2 seconds
wgetcmd() {
    local i=0
    while :; do
        echo "Downloading... $((++i))"
        touch "$local_dir/file${i}.html"
        sleep 2
    done
}
wgetcmd & pid=$!         # background the download and record its PID

# Count files as inotifywait reports them; stop the download at the limit
j=1
while kill -s 0 "$pid" && read -r path action file; do
    if (( ++j >= 30 )); then echo "Reached page limit"; kill "$pid"; break; fi
done < <(inotifywait -m "$local_dir" -e close_write)
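
To adapt this to the actual wget command from the question, one option is to poll the output directory with find instead of watching it with inotifywait, since the mirror writes into nested subdirectories. The following is an untested sketch: WEBSITE-URL and the 300-page limit come from the question, and it assumes the saved pages end in .html; everything else is illustrative.

#!/bin/bash
local_dir=./LOCAL-DIR
mkdir -p "$local_dir"

# Start the real crawl in the background, using the options from the question
wget -r --mirror -p --convert-links -P "$local_dir" WEBSITE-URL --follow-tags=a &
pid=$!

# Poll every few seconds; stop wget once 300 .html files have been saved
while kill -0 "$pid" 2>/dev/null; do
    count=$(find "$local_dir" -type f -name '*.html' | wc -l)
    if (( count >= 300 )); then
        echo "Reached 300 pages, stopping wget"
        kill "$pid"
        break
    fi
    sleep 2
done

One caveat: wget only performs the --convert-links pass after the whole download finishes, so killing it mid-run means links in the saved pages will still point at the original site.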
