git @ Cat's Eye Technologies NaNoGenLab / 533c91d
Add an IllustrationCleaner and a --strip-illustrations option.
Chris Pressey, 10 years ago
2 changed files with 28 additions and 5 deletions.
@@ -13,8 +13,8 @@
 "Produced by" lines in, and fails completely on some less-standard texts.
 
 This tool attempts to be a more complete, more robust, and public domain
-replacement for it. It is probably not as robust yet, but it works on at
-least some files.
+replacement for it. It is probably not much more robust than the gutenizer
+yet, but it works on at least my personal collection of Gutenberg files.
 
 Requirements
 ------------
@@ -25,6 +25,14 @@
 -----
 
     $ ./guten-gutter pg18613.txt > The_Golden_Scorpion.txt
+
+You can also give the `--output-dir=DIR` option, which will place the
+cleaned version of each file in that directory, with the same name as
+the original.
+
+You can also give the `--strip-illustrations` option, which will cause
+the cleaner to strip out `[Illustration: foo]` lines. (Doesn't yet work
+for illustration descriptions that span multiple lines.)
 
 Theory of Operation
 -------------------
@@ -38,6 +38,15 @@
     def clean(self, lines, name=''):
         for line in lines:
             yield line.rstrip()
+
+
+class IllustrationCleaner(AbstractBaseCleaner):
+
+    def clean(self, lines, name=''):
+        for line in lines:
+            match = re.match(r'^\s*\[Illustration.*?\]\s*$', line)
+            if not match:
+                yield line
 
 
 class SentinelCleaner(AbstractBaseCleaner):
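To make the behaviour of the new pattern concrete, here is a minimal standalone sketch that applies the same regular expression outside the class hierarchy. The function name `strip_illustrations` and the sample lines are made up for illustration and are not part of the commit. It shows that only lines consisting entirely of a bracketed `[Illustration...]` marker are dropped, and that a description spanning two lines passes through untouched, which matches the limitation noted in the README hunk.

```python
import re

# Same pattern as the one used in IllustrationCleaner.clean() in this commit.
ILLUSTRATION_RE = re.compile(r'^\s*\[Illustration.*?\]\s*$')


def strip_illustrations(lines):
    """Yield every line that is not a one-line [Illustration...] marker.

    Hypothetical helper for illustration only; the commit implements this
    logic as a method on IllustrationCleaner.
    """
    for line in lines:
        if not ILLUSTRATION_RE.match(line):
            yield line


# Made-up sample input.
sample = [
    "CHAPTER I",
    "[Illustration]",
    "  [Illustration: foo]  ",
    "He said [Illustration: foo] and left.",   # not a whole-line marker
    "[Illustration: a description that continues",
    "on the next line]",                        # multi-line: not handled
]
cleaned = list(strip_illustrations(sample))
```

The two whole-line markers are removed; the inline mention and both halves of the multi-line description are kept.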
@@ -131,6 +140,9 @@
 
 def main(argv):
     optparser = OptionParser(__doc__.strip())
+    optparser.add_option("--strip-illustrations", default=False,
+                         action='store_true',
+                         help="also try to remove [Illustration: foo]'s")
     optparser.add_option("--output-dir", default=None, metavar='DIR',
                          help="if given, save the resulting files to this "
                               "directory (under their original names)"
@@ -144,11 +156,14 @@
                 options.output_dir, os.path.basename(filename)
             )
             out = open(out_filename, 'w')
-        cleaner = MultiCleaner((
+        cleaners = [
             TrailingWhitespaceCleaner(),
             GutenbergCleaner(),
-            ProducedByCleaner()
-        ))
+        ]
+        if options.strip_illustrations:
+            cleaners.append(IllustrationCleaner())
+        cleaners.append(ProducedByCleaner())
+        cleaner = MultiCleaner(cleaners)
         with open(filename, 'r') as f:
             for line in cleaner.clean(f, name=filename):
                 out.write(line + '\n')
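The last hunk turns the fixed tuple of cleaners into a list so that a stage can be added conditionally. The sketch below illustrates that idea with plain generator functions. `MultiCleaner` itself is not shown in this diff, so `chain_cleaners` is only an assumed stand-in that feeds each stage's output into the next; `build_cleaners` and the sample text are likewise hypothetical.

```python
import re


def strip_trailing_whitespace(lines):
    # Mirrors the line.rstrip() behaviour visible in the diff context.
    for line in lines:
        yield line.rstrip()


def strip_illustrations(lines):
    # Same pattern as IllustrationCleaner in this commit.
    for line in lines:
        if not re.match(r'^\s*\[Illustration.*?\]\s*$', line):
            yield line


def chain_cleaners(cleaners, lines):
    """Assumed stand-in for MultiCleaner: run the stages in list order."""
    for cleaner in cleaners:
        lines = cleaner(lines)
    return lines


def build_cleaners(strip_illustrations_flag):
    """Build the stage list, adding the optional stage only when requested."""
    cleaners = [strip_trailing_whitespace]
    if strip_illustrations_flag:
        cleaners.append(strip_illustrations)
    return cleaners


# Made-up sample input.
text = ["Title  \n", "[Illustration: foo]\n", "Body text\n"]
with_flag = list(chain_cleaners(build_cleaners(True), text))
without_flag = list(chain_cleaners(build_cleaners(False), text))
```

Because the stages are lazy generators, nothing is read from the input until the final result is iterated, so the order of the list is the order in which each line passes through the stages.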