yastasoti
Yet another script to archive stuff off teh internets.
Was split off from Feedmark, which doesn't itself need to support this function.
Features
- input is a JSON list of objects containing links (such as those produced by Feedmark)
- output is a JSON list of objects that could not be retrieved, which can be fed back into the script as input
- checks links with
HEAD
requests by default; if--archive-links-to
is given, fetches a copy of each resource withGET
and saves it to disk - tries to be idempotent and not create a new local file if the remote file hasn't changed
- handles links that are local files; checks if the file exists locally
Planned features
- archive youtube links with youtube-dl.
- logging
- ignore certain URLs
- Handle failures (redirects, etc) better. Fall back to external tool like
wget
orcurl
.
Examples
Check that the links in a set of Feedmark documents all resolve:
feedmark --output-links article/*.md | yastasoti --article-root=article/ - | tee results.json
Since no --archive-links
options were given, this will make only HEAD
requests to check that the resources exist. It will not fetch them.
Archive stuff off teh internets:
yastasoti --archive-links-to=downloads links.json
If it is only desired that the links be checked, --check-links
will
make HEAD
requests and will not save any of the responses.
Requirements: requests
Optional dependencies: tqdm
TODO: Update this documentation and make it make sense
Commit History
@2c21ff1571e673ec4ca99d26c89381318e3b928d
git clone https://git.catseye.tc/yastasoti/
- The return value of handle_link() is not a Response object. Chris Pressey 5 years ago
- Add --fragile command-line argument. Chris Pressey 5 years ago
- Checkpoint an experimental thing, badly. Chris Pressey 5 years ago
- --delay-between-requests argument. Chris Pressey 5 years ago
- Continue to clean up. Chris Pressey 5 years ago
- Continue to clean up. Chris Pressey 5 years ago
- Refactor again. Chris Pressey 5 years ago
- Refactor. Chris Pressey 5 years ago
- Allow configuring a list of URLs that are simply ignored. Chris Pressey 5 years ago
- Some updates to documentation Chris Pressey 5 years ago