Overview
wget is a non-interactive network downloader that retrieves files from web servers over HTTP, HTTPS, and FTP. Its recursive download mode makes it possible to copy all or part of a website locally for offline access, or to bulk-collect files of specific types.
Key Features
- Website mirroring and offline browsing
- Follows links up to a specified recursion depth while preserving the server's directory structure
- Selectively downloads specific file types
- Converts links in downloaded pages to local file paths (with `-k`)
- Resumes interrupted downloads (see the example after this list)
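For example, a recursive download that was cut off can be restarted with `-c` (a sketch; the URL is a placeholder, and resuming partial files requires the server to support ranged requests):
wget -c -r -l 3 http://example.com/docs/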
Key Options
Key options related to recursive downloading.
Recursive Download Control
- `-r`, `--recursive`: enable recursive downloading
- `-l <depth>`, `--level=<depth>`: limit the recursion depth (default 5; `inf` for unlimited)
- `-m`, `--mirror`: mirroring shortcut (recursive, infinite depth, timestamping)
- `-np`, `--no-parent`: never ascend to the parent directory
Download Filtering and Behavior
- `-A <list>`, `--accept=<list>`: download only files matching the listed suffixes or patterns
- `-R <list>`, `--reject=<list>`: skip files matching the listed suffixes or patterns
- `-k`, `--convert-links`: rewrite links in downloaded pages for local viewing
- `-p`, `--page-requisites`: also download images, CSS, and other files a page needs
- `--wait=<seconds>`, `--limit-rate=<rate>`: throttle requests to reduce server load
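These options are usually combined. For example, a sketch that grabs a documentation subtree for offline reading (the URL, depth, and delay are illustrative):
wget -r -l 3 -np -k -p --wait=2 http://example.com/docs/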
Usage Examples
Various scenarios using wget's recursive download feature.
Basic Recursive Download
wget -r http://example.com/docs/
Starts from the specified URL and follows links recursively (up to wget's default depth of 5), downloading each file it finds.
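A common refinement is to add `-np` (`--no-parent`) so that wget never climbs above the starting directory; a sketch using the same placeholder URL:
wget -r -np http://example.com/docs/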
Mirror Entire Website
wget -m -k -p http://example.com/
Mirrors a website completely to the local system. Converts links to local paths, downloads all necessary page elements, and uses timestamps to download only updated files.
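For reference, the wget manual describes `-m` as shorthand for recursion with infinite depth, timestamping, and keeping FTP listings, so the command above is roughly equivalent to:
wget -r -N -l inf --no-remove-listing -k -p http://example.com/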
Download to a Specific Depth
wget -r -l 2 http://example.com/blog/
Limits the recursion depth to 2, following sub-links only up to two levels from the starting URL.
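If two levels are not enough, the depth can be raised or made unlimited; for example (a sketch with the same placeholder URL):
wget -r -l inf http://example.com/blog/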
Download HTML and Related Files (for Offline Viewing)
wget -r -p -k http://example.com/article.html
Downloads a specific HTML page and all files (images, CSS, JS, etc.) required to display it correctly, converting links to local paths.
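If the server serves pages without an .html suffix, adding `-E` (`--adjust-extension`) saves them with an .html extension so they open cleanly in a browser; a sketch:
wget -r -p -k -E http://example.com/article.html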
Download Only Specific Extensions
wget -r -A "*.pdf,*.doc" http://example.com/documents/
Recursively downloads only PDF and DOC files from the specified directory.
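The inverse filter, `-R` (`--reject`), skips matching files instead. Note that during recursion wget may still fetch HTML pages temporarily to discover links and delete them afterwards if they do not match the filter. A sketch that skips archive files:
wget -r -R "*.zip,*.iso" http://example.com/documents/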
Set User-Agent and Ignore robots.txt
wget -r -U "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36" -e robots=off http://example.com/
Sets the User-Agent and ignores the robots.txt file to access all content. (Use with caution)
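Some sites also check the Referer header or require cookies from a prior login. wget can supply both; in this sketch, cookies.txt is a placeholder for a Netscape-format cookie file:
wget -r --referer=http://example.com/ --load-cookies cookies.txt http://example.com/members/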
Limit Download Rate and Set Wait Time
wget -r --limit-rate=200k --wait=5 http://example.com/
Limits the download speed to 200KB/s and waits for 5 seconds between each request to reduce server load.
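To make the access pattern less uniform, `--random-wait` varies the delay around the `--wait` value; a sketch:
wget -r --limit-rate=200k --wait=5 --random-wait http://example.com/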
Tips & Precautions
wget's recursive download feature is powerful, but it can put excessive load on servers or download unnecessary data, so it should be used with caution.
Key Tips
- **Prevent Server Overload**: Use the `--wait` option to introduce delays between requests, preventing excessive server load. You can also use `--random-wait` for randomized delays.
- **Respect `robots.txt`**: By default, `wget` respects the `robots.txt` file. Unless there's a specific reason, avoid using the `-e robots=off` option. Check the website's policy.
- **Set User-Agent**: Some websites may block specific User-Agents or serve different content. Setting a common browser User-Agent with the `--user-agent` option can be helpful.
- **Limit Download Depth**: Use the `-l` option to limit the recursion depth, preventing unnecessary downloads of sub-pages and saving disk space.
- **Certificate Warnings**: `--no-check-certificate` disables SSL/TLS certificate validation, posing a security risk. It's best not to use it on untrusted sites.
- **Resume Downloads**: Use the `-c` or `--continue` option to resume interrupted downloads. This is useful for large files or unstable network conditions.
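Putting these tips together, a polite mirroring run might look like the following sketch (the delay, rate limit, and URL are illustrative):
wget -m -k -p -np --wait=2 --random-wait --limit-rate=500k http://example.com/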