wget (Recursive Download): Copy Entire Websites

The recursive download feature of wget copies entire websites locally or downloads sets of files while preserving the server's directory structure. This is very useful for website mirroring, offline browsing, and collecting specific types of files. The `-r` option enables recursive downloading, and additional options control the download depth, accepted file types, and link conversion.

Overview

wget is a non-interactive network downloader that downloads files from web servers using HTTP, HTTPS, and FTP protocols. Its recursive download capability is a powerful tool for copying all or parts of a website locally for offline access or for bulk collection of specific file types.
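
For comparison, a plain invocation without any recursive options downloads only the single resource named in the URL (the file name here is hypothetical):

wget http://example.com/archive.tar.gz

The options described below extend this behavior so that wget follows links and reproduces whole directory trees.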

Key Features

  • Website mirroring and offline browsing
  • Preserves directory structure up to a specified depth
  • Selectively downloads specific file types
  • Converts links to local file paths after download (with the `-k` option)
  • Resumes interrupted downloads

Key Options

Key options related to recursive downloading.

Recursive Download Control

  • `-r`, `--recursive`: Enable recursive downloading
  • `-l <depth>`, `--level=<depth>`: Limit the recursion depth (default: 5)
  • `-m`, `--mirror`: Mirror mode; turns on recursion, infinite depth, and timestamping
  • `-N`, `--timestamping`: Re-download a file only if the server copy is newer
  • `-np`, `--no-parent`: Never ascend to the parent directory when recursing

Download Filtering and Behavior

  • `-A <list>`, `--accept=<list>`: Download only files matching the listed extensions or patterns
  • `-R <list>`, `--reject=<list>`: Skip files matching the listed extensions or patterns
  • `-k`, `--convert-links`: Convert links to local paths for offline viewing
  • `-p`, `--page-requisites`: Download all elements needed to display a page (images, CSS, JS)
  • `-U <agent>`, `--user-agent=<agent>`: Set the User-Agent header
  • `--wait=<seconds>`, `--random-wait`: Pause between requests, optionally with randomized delays
  • `--limit-rate=<rate>`: Limit the download speed (e.g., `200k`)
  • `-c`, `--continue`: Resume a partially downloaded file
  • `-e robots=off`: Ignore `robots.txt` (use with caution)
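
These options can be combined freely in a single command. As an illustration (the URL and path are hypothetical), the following fetches only PDF files up to two levels below the starting directory, never ascends above it, and pauses 2 seconds between requests:

wget -r -l 2 -np -A "*.pdf" --wait=2 http://example.com/files/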

Usage Examples

Various scenarios using wget's recursive download feature.

Basic Recursive Download

wget -r http://example.com/docs/

Starts from the specified URL and recursively follows the links it finds on the same host, downloading pages and files up to the default depth of five levels.
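
In practice, `-np` (`--no-parent`) is often added so that the recursion never climbs above the starting directory; the URL is the same hypothetical one as above:

wget -r -np http://example.com/docs/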

Mirror Entire Website

wget -m -k -p http://example.com/

Mirrors a website to the local system: `-m` turns on recursion with infinite depth and timestamping so that only updated files are re-downloaded on later runs, `-k` converts links to local paths, and `-p` downloads every element needed to render each page.
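
For reference, `-m` is roughly shorthand for recursion with timestamping and unlimited depth, so an expanded form of the same mirror command (hypothetical URL) looks like this:

wget -r -N -l inf -k -p http://example.com/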

Download to a Specific Depth

wget -r -l 2 http://example.com/blog/

Limits the recursion depth to 2, following sub-links only up to two levels from the starting URL.

Download HTML and Related Files (for Offline Viewing)

wget -r -p -k http://example.com/article.html

Downloads a specific HTML page and all files (images, CSS, JS, etc.) required to display it correctly, converting links to local paths.
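
If pages are served without a `.html` extension, adding `-E` (`--adjust-extension`) saves them with a `.html` suffix so they open cleanly in a browser; the URL is hypothetical:

wget -r -p -k -E http://example.com/article.html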

Download Only Specific Extensions

wget -r -A "*.pdf,*.doc" http://example.com/documents/

Recursively downloads only PDF and DOC files from the specified directory.
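
The inverse filter is `-R` (`--reject`), which skips the listed patterns instead; the URL is hypothetical:

wget -r -R "*.zip,*.exe" http://example.com/documents/

Note that even with `-A`, HTML pages are typically still fetched so their links can be followed; pages that do not match the accept list are deleted after parsing.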

Set User-Agent and Ignore robots.txt

wget -r -U "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36" -e robots=off http://example.com/

Sets the User-Agent to a common browser string and ignores the `robots.txt` file to access all content (use with caution).

Limit Download Rate and Set Wait Time

wget -r --limit-rate=200k --wait=5 http://example.com/

Limits the download speed to 200 KB/s and waits 5 seconds between requests to reduce server load.
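
Adding `--random-wait` varies the delay around the base `--wait` value so the request pattern is less uniform, and `-c` lets an interrupted run be resumed; the URL is hypothetical:

wget -r --limit-rate=200k --wait=5 --random-wait -c http://example.com/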

Tips & Precautions

wget's recursive download feature is powerful, but it can put excessive load on servers or download unnecessary data, so it should be used with caution.

Key Tips

  • **Prevent Server Overload**: Use the `--wait` option to introduce delays between requests, preventing excessive server load. You can also use `--random-wait` for randomized delays.
  • **Respect `robots.txt`**: By default, `wget` respects the `robots.txt` file. Unless there's a specific reason, avoid using the `-e robots=off` option. Check the website's policy.
  • **Set User-Agent**: Some websites may block specific User-Agents or serve different content. Setting a common browser User-Agent with the `--user-agent` option can be helpful.
  • **Limit Download Depth**: Use the `-l` option to limit the recursion depth, preventing unnecessary downloads of sub-pages and saving disk space.
  • **Certificate Warnings**: `--no-check-certificate` disables SSL/TLS certificate validation, posing a security risk. It's best not to use it on untrusted sites.
  • **Resume Downloads**: Use the `-c` or `--continue` option to resume interrupted downloads. This is useful for large files or unstable network conditions; a command combining several of these tips follows below.
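
A sketch that puts several of these tips together (the URL is hypothetical and the User-Agent string is only an example):

wget -r -l 3 --wait=2 --random-wait --user-agent="Mozilla/5.0" -c http://example.com/

This keeps the recursion shallow, spaces out requests with a randomized delay, identifies the client with a browser-like User-Agent, and can be re-run to resume after an interruption.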
