Guten-gutter is a command-line filter that strips the boilerplate from Project Gutenberg text files. I had been using gutenizer for this purpose, but it has some shortcomings and failed to properly strip several Project Gutenberg texts, so I wrote Guten-gutter as a more robust replacement. Like the Project Gutenberg texts themselves, it is in the public domain.
If you want to get just the book's text out of a Project Gutenberg text file:
script/guten-gutter pg10662.txt > The_Night_Land.txt
If you want to do that to an entire collection of Project Gutenberg files:
mkdir cleaned
script/guten-gutter ../gutenberg/*.txt --output-dir=cleaned
To use Guten-gutter from any working directory, add the `script` directory in this repository to your `PATH`. For example, you might add a line to your shell startup file.
An easy way to accomplish this is to dock Guten-gutter using toolshelf:
toolshelf dock gh:catseye/guten-gutter
This is a test suite, written in Falderal format, for the `guten-gutter` utility. (Note that this isn't a very paradigmatic usage of Falderal!)
    -> Functionality "Count lines in processed Project Gutenberg file"
    -> is implemented by shell command "%(test-body-text) | wc -l | sed 's/ //g'"

    -> Tests for functionality "Count lines in processed Project Gutenberg file"
Our basic tests will be on Peter Rabbit.
    | cat fixture/pg14838.txt
    = 618
In its default invocation, it tries to strip most things.
    | script/guten-gutter fixture/pg14838.txt
    = 230
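As a rough illustration of what this kind of stripping involves, the sketch below keeps only the lines between a start marker and an end marker, and returns the input unchanged when no start marker is found. It is not Guten-gutter's actual implementation: the function name `strip_boilerplate` and both regular expressions are assumptions, modelled on the `*** START OF ... PROJECT GUTENBERG EBOOK` and `*** END OF ... PROJECT GUTENBERG EBOOK` lines found in many Project Gutenberg files. Older etexts use other markers and would need additional patterns.

```python
import re

# Assumed marker patterns; real Project Gutenberg files vary, and these
# are not taken from Guten-gutter's source.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (THIS|THE) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (THIS|THE) PROJECT GUTENBERG EBOOK", re.IGNORECASE
)


def strip_boilerplate(lines):
    """Return the lines strictly between the start and end markers.

    If no start marker is found, the input is returned unchanged.
    If a start marker is found but no end marker follows it, everything
    after the start marker is kept.
    """
    start = None
    end = None
    for i, line in enumerate(lines):
        if start is None and START_RE.match(line):
            start = i
        elif start is not None and END_RE.match(line):
            end = i
            break
    if start is None:
        return list(lines)
    if end is None:
        end = len(lines)
    return list(lines[start + 1:end])
```

Treating "no start marker" as "not a Project Gutenberg file" is one simple way to make the filter safe to run over a mixed directory of text files.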
It can be told to strip illustrations, too...
    | script/guten-gutter --strip-illustrations fixture/pg14838.txt
    = 201
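Project Gutenberg plain-text files commonly mark images with bracketed lines such as `[Illustration]` or `[Illustration: caption]`. The following is a minimal sketch of removing them, assuming each marker sits on a single line. The name `strip_illustrations` and its pattern are hypothetical and are not taken from Guten-gutter's source.

```python
import re

# Assumed form of a single-line illustration marker, e.g. "[Illustration]"
# or "[Illustration: A rabbit]". Markers that span several lines are not
# matched by this pattern.
ILLUSTRATION_RE = re.compile(r"^\s*\[Illustration[^\]]*\]\s*$")


def strip_illustrations(lines):
    """Return the lines with single-line illustration markers removed."""
    return [line for line in lines if not ILLUSTRATION_RE.match(line)]
```

A marker whose caption wraps onto a second line would pass through this sketch untouched, so a fuller version would need to track state across lines.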
If it's not given a Project Gutenberg file, it doesn't strip anything.
    | cat fixture/plain.txt
    = 10

    | script/guten-gutter fixture/plain.txt
    = 10
Rewrite ProducedByProcessor as a StartSentinelProcessor (or otherwise have it ignore the end sentinel)
Make IllustrationProcessor handle multiple lines
git clone https://git.catseye.tc/Guten-gutter/
Commit history (all commits by Chris Pressey, 4 years ago):
- Allow tests to pass on systems that pad `wc -l` output w/spaces.
- Added tag 0.2 for changeset 9e9fbea5f416
- The etext can also have been "created" by someone (for pg159)
- Handle ancient etexts where the boilerplate ends with "end of small print" instead of with "start of text".
- When --output-dir option is given, open output file UTF-8 encoded.
- Added tag 0.1 for changeset ad88d2cfbeec
- Missed two spots.
- Follow our distribution organization guidelines: bin/ -> script/
- Make standalone - don't require T-Rext.
- Initial import of guten-gutter sources.