Guten-gutter
Guten-gutter is a command-line filter that strips the boilerplate from Project Gutenberg text files. I was using gutenizer for this purpose, but it has some shortcomings and failed to strip several Project Gutenberg texts properly, so I wrote this as a more robust replacement. Like Project Gutenberg texts themselves, it is in the public domain.
Usage
If you want to get just the book's text out of a Project Gutenberg text file:
script/guten-gutter pg10662.txt > The_Night_Land.txt
If you want to do that to an entire collection of Project Gutenberg files:
mkdir cleaned
script/guten-gutter ../gutenberg/*.txt --output-dir=cleaned
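If you would rather process a collection one file at a time, for example to see which file fails, a loop over the single-file form is one option. The following is a minimal sketch for a POSIX shell. `clean_all` and `GUTEN_GUTTER` are hypothetical names, not part of Guten-gutter. The sketch assumes the script exits non-zero on failure, which this README does not confirm, and it names each output after the input's base name, which may differ from what `--output-dir` does.

```shell
# Hypothetical helper, not part of Guten-gutter.
# GUTEN_GUTTER names the command to run; the default assumes you are
# in the root of this repository.
GUTEN_GUTTER="${GUTEN_GUTTER:-script/guten-gutter}"

# clean_all SRC_DIR OUT_DIR
# Runs the filter on every .txt file in SRC_DIR, one file at a time,
# and writes each result to OUT_DIR under the same base name.
clean_all() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.txt; do
        [ -e "$f" ] || continue    # skip when the glob matches nothing
        if ! "$GUTEN_GUTTER" "$f" > "$out_dir/$(basename "$f")"; then
            # Assumption: a non-zero exit status signals a failed strip.
            echo "failed: $f" >&2
        fi
    done
}
```

With the function defined, `clean_all ../gutenberg cleaned` would process the same files as the command above.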
To use Guten-gutter from any working directory, add the script directory in this repository to your PATH. For example, you might add this line to your .bashrc:
export PATH=/path/to/this/repo/script:$PATH
An easy way to accomplish this is to dock Guten-gutter using shelf:
shelf_dockgh catseye/Guten-gutter
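If you edit PATH by hand and your .bashrc is sourced more than once, the same directory can end up on PATH several times. One way to avoid that is to prepend the directory only when it is missing. This is a sketch only; `path_prepend` is a hypothetical helper name, and `/path/to/this/repo/script` is the same placeholder used above.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already there.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present; leave PATH unchanged
        *) PATH="$1:$PATH" ;;     # not present; put it at the front
    esac
}

# Example use in .bashrc (placeholder path):
#   path_prepend /path/to/this/repo/script
#   export PATH
```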
Tests
A small test script, test.sh, is included with this distribution.
TODO
Rewrite ProducedByProcessor as a StartSentinelProcessor (or otherwise have it ignore the end sentinel)
Make IllustrationProcessor handle multiple lines
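The TODO items above refer to start and end sentinels. As a rough illustration of the general idea only, and not of how Guten-gutter's processors are actually written, the sketch below keeps just the lines between a start marker and an end marker. `strip_between_sentinels` is a hypothetical name, and the two patterns are assumptions based on common Project Gutenberg markers. Real files vary, and older etexts use different wording.

```shell
# Hypothetical sketch, not Guten-gutter's implementation.
# Prints only the lines strictly between a start sentinel and an end sentinel.
# Reads the named files, or standard input when no file is given.
strip_between_sentinels() {
    awk '
        # End sentinel: stop printing (checked first so the line is dropped).
        /^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK/ { inside = 0 }
        inside { print }
        # Start sentinel: begin printing from the next line.
        /^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK/ { inside = 1 }
    ' "$@"
}
```

A filter this naive prints nothing when a file lacks the start marker, which is the kind of case a more robust tool has to handle.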
Commit History
git clone https://git.catseye.tc/Guten-gutter/
- Don't bother with Falderal for something like this. (Chris Pressey, 2 years ago)
- toolshelf is deprecated. Use shelf instead. (Chris Pressey, 2 years ago)
- Allow tests to pass on systems that pad `wc -l` output w/spaces. (Chris Pressey, 4 years ago)
- Added tag 0.2 for changeset 9e9fbea5f416 (Chris Pressey, 4 years ago)
- The etext can also have been "created" by someone (for pg159) (Chris Pressey, 5 years ago)
- Handle ancient etexts where the boilerplate ends with "end of small print" instead of with "start of text". (Chris Pressey, 5 years ago)
- When --output-dir option is given, open output file UTF-8 encoded. (Chris Pressey, 5 years ago)
- Added tag 0.1 for changeset ad88d2cfbeec (Chris Pressey, 5 years ago)
- Missed two spots. (Chris Pressey, 5 years ago)
- Follow our distribution organization guidelines: bin/ -> script/ (Chris Pressey, 5 years ago)