git @ Cat's Eye Technologies yastasoti / 4b8bda3
Don't even download file if routed to /dev/null.
Chris Pressey, 1 year, 10 months ago
2 changed files with 22 additions and 4 deletions.
  0   0   yastasoti
  1   1   =========
  2   2
      3 + _Version 0.1-PRE_
      4 +
  3   5   Yet another script to archive stuff off teh internets.
  4   6
  5     - Was split off from Feedmark, which doesn't itself need to support this function.
      7 + It's not a spider that automatically crawls previously undiscovered pages — it's intended
      8 + to be run by a human to make backups of pages they have already read and recorded.
      9 +
     10 + It was split off from [Feedmark][], which doesn't itself need to support this function.
  6  11
  7  12   ### Features ###
  8  13
 15  20   which is selected based on the URL of the link.
 16  21   * tries to be idempotent and not create a new local file if the remote file hasn't changed
 17  22   * handles links that are local files; checks if the file exists locally
     23 + * can log its actions verbosely to a specified logfile
     24 + * source code is a single, public-domain file with a single dependency (`requests`)
 18  25
 19  26   ### Examples ###
 20  27
 49  56   "*": "archive/"
 50  57   }
 51  58
 52     - Three guesses as to what these parts mean. Then you use it like
     59 + If a URL matches more than one pattern, the longest pattern will be selected.
     60 + If the destination is `/dev/null` it will be treated specially — the file will
     61 + not be retrieved at all. If no pattern matches, an error will be raised.
     62 +
     63 + To use an archive router once it has been written:
 53  64
 54  65   yastasoti --archive-via=router.json links.json
 55  66
 56  67   ### Requirements ###
 57  68
 58  69   Tested under Python 2.7.12. Seems to work under Python 3.5.2 as well,
 59     - at least the link-checking parts.
     70 + but this is not so official.
 60  71
 61  72   Requires `requests` Python library to make network requests. Tested
 62     - with version 2.17.3.
     73 + with `requests` version 2.17.3.
 63  74
 64  75   If `tqdm` Python library is installed, will display a nice progress bar.
 65  76
 68  79   * Archive youtube links with youtube-dl.
 69  80   * Handle failures (redirects, etc) better (detect 503 / "connection refused" better.)
 70  81   * Allow use of an external tool like `wget` or `curl` to do fetching.
     82 +
     83 + [Feedmark]: http://catseye.tc/node/Feedmark
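
The routing rules added to the README in this commit (longest matching pattern wins, `/dev/null` means the file is not retrieved, no match is an error) can be sketched roughly as follows. This is an illustration only, not the project's code: the body of the real `select_dest_dir` method is not part of this diff, so `pick_destination`, `should_retrieve`, and the example router are hypothetical names, and the use of glob-style matching via `fnmatchcase` is an assumption about how patterns such as `"*"` are interpreted.

```python
from fnmatch import fnmatchcase

DEV_NULL = '/dev/null'


def pick_destination(url, router):
    """Return the destination for `url`; the longest matching pattern wins.

    Glob-style matching is an assumption of this sketch, not something
    the diff confirms.
    """
    matches = [pattern for pattern in router if fnmatchcase(url, pattern)]
    if not matches:
        # The README says an error is raised when no pattern matches.
        raise ValueError("no pattern matches {}".format(url))
    return router[max(matches, key=len)]


def should_retrieve(url, router):
    """Return False when the URL is routed to /dev/null (nothing is fetched)."""
    return pick_destination(url, router) != DEV_NULL


# Hypothetical router, in the same pattern-to-directory shape as router.json.
router = {
    "http://example.com/ads/*": "/dev/null",
    "http://example.com/*": "example/",
    "*": "archive/",
}

print(pick_destination("http://example.com/page.html", router))
print(should_retrieve("http://example.com/ads/banner.png", router))
```

Deciding on the destination before making any network request is what allows a `/dev/null` route to skip the download entirely, which is the behaviour the commit message describes.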
178 178       def handle_link(self, url):
179 179           dirname, filename = url_to_dirname_and_filename(url)
180 180           dest_dir = self.select_dest_dir(url)
    181 +         if dest_dir == '/dev/null':
    182 +             logger.info(u"{} routed to {}, skipping".format(url, dest_dir).encode('utf-8'))
    183 +             return {
    184 +                 'status_code': 200
    185 +             }
181 186           dirname = os.path.join(dest_dir, dirname)
182 187           logger.info(u"archiving {} to {}".format(url, dirname).encode('utf-8'))
183 188           if not os.path.exists(dirname):