Guten-gutter is a command-line filter that strips the boilerplate from Project Gutenberg text files. I had been using gutenizer for this purpose, but it has some shortcomings and failed to strip several Project Gutenberg texts properly, so I wrote Guten-gutter as a more robust replacement. Like the Project Gutenberg texts themselves, it is in the public domain.
If you want to get just the book's text out of a Project Gutenberg text file:
script/guten-gutter pg10662.txt > The_Night_Land.txt
If you want to do that to an entire collection of Project Gutenberg files:
mkdir cleaned
script/guten-gutter ../gutenberg/*.txt --output-dir=cleaned
To use Guten-gutter from any working directory, add the script directory in this repository to your PATH. For example, you might add this line to your
An easy way to accomplish this is to dock Guten-gutter using toolshelf:
toolshelf dock gh:catseye/guten-gutter
This is a test suite, written in Falderal format, for the guten-gutter utility. (Note that this isn't a very paradigmatic usage of Falderal!)
-> Functionality "Extract text from Project Gutenberg file" is implemented by
-> shell command "%(test-body-text)"

-> Tests for functionality "Extract text from Project Gutenberg file"
Our basic tests will be on Peter Rabbit.
| cat fixture/pg14838.txt | wc -l
= 618
In its default invocation, it tries to strip most things.
| script/guten-gutter fixture/pg14838.txt | wc -l
= 230
It can be told to strip illustrations, too...
| script/guten-gutter --strip-illustrations fixture/pg14838.txt | wc -l
= 201
If it's not given a Project Gutenberg file, it doesn't strip anything.
| cat fixture/plain.txt | wc -l
= 10

| script/guten-gutter fixture/plain.txt | wc -l
= 10
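The behaviour these tests exercise can be illustrated with a minimal Python sketch. This is not Guten-gutter's actual implementation: it assumes the body of the file is delimited by lines beginning with `*** START OF` and `*** END OF`, as in many Project Gutenberg files, and the function name `strip_boilerplate` is hypothetical. Input that lacks either marker is returned unchanged, which mirrors the last test above.

```python
def strip_boilerplate(lines):
    """Return only the lines between the start and end sentinels.

    Assumption: sentinels are lines starting with "*** START OF" and
    "*** END OF". If either is missing, the input is returned unchanged.
    """
    start = end = None
    for i, line in enumerate(lines):
        if start is None and line.startswith("*** START OF"):
            start = i
        elif start is not None and line.startswith("*** END OF"):
            end = i
            break
    if start is None or end is None:
        # Not recognisably a Project Gutenberg file: strip nothing.
        return list(lines)
    return list(lines[start + 1:end])


sample = [
    "The Project Gutenberg EBook of Example",
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "Once upon a time...",
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "License text",
]
print(strip_boilerplate(sample))                     # ['Once upon a time...']
print(strip_boilerplate(["just", "plain", "text"]))  # ['just', 'plain', 'text']
```

A real file needs more than this (blank-line trimming, "Produced by" credits, variant sentinel wording), which is why the sketch should be read only as an outline of the idea.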
Rewrite ProducedByProcessor as a StartSentinelProcessor (or otherwise have it ignore the end sentinel)
Make IllustrationProcessor handle multiple lines
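One possible approach to the multi-line illustration case, as a minimal Python sketch. It is not the existing IllustrationProcessor code; the function name `strip_illustrations` is hypothetical, and it assumes illustrations appear as bracketed `[Illustration ...]` blocks whose closing `]` may fall on a later line.

```python
def strip_illustrations(lines):
    """Drop [Illustration ...] blocks, including ones spanning several lines.

    Simplification: the first "]" seen ends the block, and any text after
    it on the same line is dropped along with the block.
    """
    kept = []
    inside = False  # True while within an unclosed [Illustration block
    for line in lines:
        if not inside and line.lstrip().startswith("[Illustration"):
            inside = True
        if inside:
            if "]" in line:
                inside = False
            continue
        kept.append(line)
    return kept


text = [
    "Text before",
    "[Illustration]",
    "Middle",
    "[Illustration: a long",
    "caption]",
    "Text after",
]
print(strip_illustrations(text))  # ['Text before', 'Middle', 'Text after']
```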