wget (Recursive Download): Copy Entire Websites

The recursive download feature of wget copies entire websites locally or downloads sets of files while preserving the server's directory structure. This is very useful for website mirroring, offline browsing, and collecting specific types of files. The `-r` option enables recursive downloading, and additional options control the download depth, accepted file types, and link conversion.

Overview

wget is a non-interactive network downloader that downloads files from web servers using HTTP, HTTPS, and FTP protocols. Its recursive download capability is a powerful tool for copying all or parts of a website locally for offline access or for bulk collection of specific file types.
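
For comparison, a plain invocation without any recursive options downloads only the single resource named in the URL (the file name here is hypothetical):

wget http://example.com/archive.tar.gz

The options described below extend this behavior so that wget follows links and reproduces whole directory trees.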

Key Features

  • Website mirroring and offline browsing
  • Preserves directory structure up to a specified depth
  • Selectively downloads specific file types
  • Converts links to local file paths after download (with the `-k` option)
  • Resumes interrupted downloads

Key Options

Key options related to recursive downloading.

Recursive Download Control

  • `-r`, `--recursive`: Enable recursive downloading
  • `-l <depth>`, `--level=<depth>`: Limit the recursion depth (default: 5)
  • `-m`, `--mirror`: Mirror mode; turns on recursion, infinite depth, and timestamping
  • `-N`, `--timestamping`: Re-download a file only if the server copy is newer
  • `-np`, `--no-parent`: Never ascend to the parent directory when recursing

Download Filtering and Behavior

  • `-A <list>`, `--accept=<list>`: Download only files matching the listed extensions or patterns
  • `-R <list>`, `--reject=<list>`: Skip files matching the listed extensions or patterns
  • `-k`, `--convert-links`: Convert links to local paths for offline viewing
  • `-p`, `--page-requisites`: Download all elements needed to display a page (images, CSS, JS)
  • `-U <agent>`, `--user-agent=<agent>`: Set the User-Agent header
  • `--wait=<seconds>`, `--random-wait`: Pause between requests, optionally with randomized delays
  • `--limit-rate=<rate>`: Limit the download speed (e.g., `200k`)
  • `-c`, `--continue`: Resume a partially downloaded file
  • `-e robots=off`: Ignore `robots.txt` (use with caution)
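
These options can be combined freely in a single command. As an illustration (the URL and path are hypothetical), the following fetches only PDF files up to two levels below the starting directory, never ascends above it, and pauses 2 seconds between requests:

wget -r -l 2 -np -A "*.pdf" --wait=2 http://example.com/files/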

Usage Examples

Various scenarios using wget's recursive download feature.

Basic Recursive Download

wget -r http://example.com/docs/

Starts from the specified URL and recursively follows the links it finds on the same host, downloading pages and files up to the default depth of five levels.
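
In practice, `-np` (`--no-parent`) is often added so that the recursion never climbs above the starting directory; the URL is the same hypothetical one as above:

wget -r -np http://example.com/docs/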

Mirror Entire Website

wget -m -k -p http://example.com/

Mirrors a website to the local system: `-m` turns on recursion with infinite depth and timestamping so that only updated files are re-downloaded on later runs, `-k` converts links to local paths, and `-p` downloads every element needed to render each page.
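
For reference, `-m` is roughly shorthand for recursion with timestamping and unlimited depth, so an expanded form of the same mirror command (hypothetical URL) looks like this:

wget -r -N -l inf -k -p http://example.com/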

Download to a Specific Depth

wget -r -l 2 http://example.com/blog/

Limits the recursion depth to 2, following sub-links only up to two levels from the starting URL.

Download HTML and Related Files (for Offline Viewing)

wget -r -p -k http://example.com/article.html

Downloads a specific HTML page and all files (images, CSS, JS, etc.) required to display it correctly, converting links to local paths.
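
If pages are served without a `.html` extension, adding `-E` (`--adjust-extension`) saves them with a `.html` suffix so they open cleanly in a browser; the URL is hypothetical:

wget -r -p -k -E http://example.com/article.html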

Download Only Specific Extensions

wget -r -A "*.pdf,*.doc" http://example.com/documents/

Recursively downloads only PDF and DOC files from the specified directory.
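
The inverse filter is `-R` (`--reject`), which skips the listed patterns instead; the URL is hypothetical:

wget -r -R "*.zip,*.exe" http://example.com/documents/

Note that even with `-A`, HTML pages are typically still fetched so their links can be followed; pages that do not match the accept list are deleted after parsing.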

Set User-Agent and Ignore robots.txt

wget -r -U "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36" -e robots=off http://example.com/

Sets the User-Agent to a common browser string and ignores the `robots.txt` file to access all content (use with caution).

Limit Download Rate and Set Wait Time

wget -r --limit-rate=200k --wait=5 http://example.com/

Limits the download speed to 200 KB/s and waits 5 seconds between requests to reduce server load.
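
Adding `--random-wait` varies the delay around the base `--wait` value so the request pattern is less uniform, and `-c` lets an interrupted run be resumed; the URL is hypothetical:

wget -r --limit-rate=200k --wait=5 --random-wait -c http://example.com/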

Tips & Precautions

wget's recursive download feature is powerful, but it can put excessive load on servers or download unnecessary data, so it should be used with caution.

Key Tips

  • **Prevent Server Overload**: Use the `--wait` option to introduce delays between requests, preventing excessive server load. You can also use `--random-wait` for randomized delays.
  • **Respect `robots.txt`**: By default, `wget` respects the `robots.txt` file. Unless there's a specific reason, avoid using the `-e robots=off` option. Check the website's policy.
  • **Set User-Agent**: Some websites may block specific User-Agents or serve different content. Setting a common browser User-Agent with the `--user-agent` option can be helpful.
  • **Limit Download Depth**: Use the `-l` option to limit the recursion depth, preventing unnecessary downloads of sub-pages and saving disk space.
  • **Certificate Warnings**: `--no-check-certificate` disables SSL/TLS certificate validation, posing a security risk. It's best not to use it on untrusted sites.
  • **Resume Downloads**: Use the `-c` or `--continue` option to resume interrupted downloads. This is useful for large files or unstable network conditions; a command combining several of these tips follows below.
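
A sketch that puts several of these tips together (the URL is hypothetical and the User-Agent string is only an example):

wget -r -l 3 --wait=2 --random-wait --user-agent="Mozilla/5.0" -c http://example.com/

This keeps the recursion shallow, spaces out requests with a randomized delay, identifies the client with a browser-like User-Agent, and can be re-run to resume after an interruption.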
