T-Rext
T-Rext is a command-line filter that attempts to clean up spacing, punctuation, and capitalization in a text file. It exists so that, when you are writing a text generator such as a Markov processor, you need not worry too much about the format of its output; just run that output through T-Rext when you're done to make it more presentable.
The current version of T-Rext is 0.4, which runs under either Python 2.7 or Python 3.x. Docker images based on the appropriate version of CPython for each are available on Docker Hub.
Usage
Usage from the Command Line
bin/t-rext raw_output.txt > cleaned_output.txt
This will take lines that look like this:
" Well , " said the king , , " no . "
and reformat them to look like this:
“Well,” said the king, “no.”
To use T-Rext from any working directory, add the bin directory in this repository to your PATH. For example, you might add this line to your .bashrc:
export PATH=/path/to/this/repo/bin:$PATH
An easy way to accomplish the above is to install shelf, then dock T-Rext by cd-ing into one of your shelf directories and running
git clone https://codeberg.org/catseye/T-Rext
shelf_link T-Rext
Usage from Python
T-Rext is built on an over-engineered library of pipeline processors, which
you can use directly. Note that its interface is not stable and is liable to change.
To use the T-Rext Python modules in other Python programs, make sure the src directory of this repository is on your PYTHONPATH. For example, you might add this line to your .bashrc:
export PYTHONPATH=/path/to/this/repo/src:$PYTHONPATH
Then you can add imports like this to the top of your script:
from t_rext.processors import TrailingWhitespaceProcessor
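Since the interface of the real library is not stable, the following is only a rough sketch of the general "pipeline of processors" idea, not the t_rext API. The names Processor, StripTrailingWhitespace, CollapseSpaces and run_pipeline are hypothetical and were made up for this illustration; each stage wraps an iterable of lines and yields processed lines, so stages can be chained.

```python
import re


class Processor(object):
    """Hypothetical base class (not part of t_rext): wraps an iterable
    of lines and yields each line after processing it."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        for line in self.source:
            yield self.process(line)

    def process(self, line):
        # Default stage passes lines through unchanged.
        return line


class StripTrailingWhitespace(Processor):
    def process(self, line):
        return line.rstrip()


class CollapseSpaces(Processor):
    def process(self, line):
        # Replace any run of two or more spaces with a single space.
        return re.sub(r' {2,}', ' ', line)


def run_pipeline(lines, stages):
    """Feed the output of each stage into the next one, in order."""
    stream = lines
    for stage in stages:
        stream = stage(stream)
    return list(stream)
```

For instance, run_pipeline(lines, [StripTrailingWhitespace, CollapseSpaces]) would strip trailing whitespace first and then collapse runs of spaces.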
Tests
This is a test suite, written in Falderal format, for the t-rext
utility. It also serves as documentation for said utility.
-> Tests for functionality "Clean up punctuation and spaces"
Spaces before commas and periods are elided.
| Well , that is good .
= Well, that is good.
Multiple commas are collapsed into a single comma.
| Well , , that is good .
= Well, that is good.
Multiple periods are not collapsed into a single period.
| Well . . . that is good.
= Well... that is good.
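The three rules above can be approximated with two regular expressions. This is a minimal sketch under the assumption that plain regex substitution is good enough; it is not how T-Rext itself implements them, and the function name tidy_commas_and_periods is made up for this example.

```python
import re


def tidy_commas_and_periods(line):
    # Collapse a run of commas (optionally separated by whitespace)
    # into a single comma. Periods are deliberately left alone.
    line = re.sub(r',(?:\s*,)+', ',', line)
    # Drop any spaces that sit directly before a comma or a period.
    line = re.sub(r' +([,.])', r'\1', line)
    return line
```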
Quotes are oriented.
| "Yes," he said.
= “Yes,” he said.
Single spaces after opening quotes and before closing quotes are elided.
| " Yes , " he said.
= “Yes,” he said.
But not the other way 'round.
| Muttering "Yes," he turned around.
= Muttering “Yes,” he turned around.
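One way to picture quote orientation is as a toggle: each straight double quote alternately opens and closes. The sketch below assumes that simple alternation within a single line, which may differ from what T-Rext actually does; orient_quotes is a hypothetical name. Spaces are removed only on the inside of the quotes, so the spaces around a quoted phrase survive.

```python
import re

OPEN_QUOTE = '\u201c'   # left double quotation mark
CLOSE_QUOTE = '\u201d'  # right double quotation mark


def orient_quotes(line):
    out = []
    opening = True
    for ch in line:
        if ch == '"':
            # Alternate between opening and closing quotes.
            out.append(OPEN_QUOTE if opening else CLOSE_QUOTE)
            opening = not opening
        else:
            out.append(ch)
    text = ''.join(out)
    # Elide spaces just after an opening quote and just before a closing one.
    text = re.sub(OPEN_QUOTE + ' +', OPEN_QUOTE, text)
    text = re.sub(' +' + CLOSE_QUOTE, CLOSE_QUOTE, text)
    return text
```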
Multiple spaces after opening quotes and before closing quotes are elided.
| "   Yes   ,   " he said.
= “Yes,” he said.
This is the case even if the quotes are oriented single quotes.
| Don’t reply ‘ Yes ’ .
= Don’t reply ‘Yes’.
But not the other way 'round.
| Muttering "Yes," he turned around.
= Muttering “Yes,” he turned around.
Quotes do not match across paragraphs.
| Turbid "Waters" that "leak.
|
| You "don't" have a clue.
= Turbid “Waters” that “leak.
=
= You “don't” have a clue.
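The paragraph rule can be sketched by resetting the open/close state whenever a blank line is seen. This assumes that a paragraph boundary is simply an empty (or whitespace-only) line; the function name orient_quotes_per_paragraph is hypothetical and the code is an illustration, not T-Rext's implementation.

```python
OPEN_QUOTE = '\u201c'   # left double quotation mark
CLOSE_QUOTE = '\u201d'  # right double quotation mark


def orient_quotes_per_paragraph(text):
    out_lines = []
    opening = True
    for line in text.split('\n'):
        if line.strip() == '':
            # A blank line ends the paragraph, so an unclosed quote
            # does not leak into the next one.
            opening = True
            out_lines.append(line)
            continue
        chars = []
        for ch in line:
            if ch == '"':
                chars.append(OPEN_QUOTE if opening else CLOSE_QUOTE)
                opening = not opening
            else:
                chars.append(ch)
        out_lines.append(''.join(chars))
    return '\n'.join(out_lines)
```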
Single spaces before apostrophes are elided in some situations.
| It wasn 't Arthur 's car.
= It wasn't Arthur's car.
This is the case even if the apostrophes are oriented single quotes. In fact, in this case, trailing spaces are elided too.
| It wasn ’t Arthur ’s car.
= It wasn’t Arthur’s car.
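The text above does not say which situations qualify. As an illustration only, the sketch below assumes the rule applies when the apostrophe is followed by a common English contraction or possessive suffix (s, t, d, m, re, ve, ll); that list, and the name close_up_apostrophes, are assumptions of this example rather than T-Rext's actual behaviour.

```python
import re

# Straight apostrophe and right single quotation mark.
APOSTROPHES = "'\u2019"

# word character, one space, apostrophe, then an assumed suffix list.
PATTERN = re.compile(
    r"(\w) ([" + APOSTROPHES + r"])(s|t|d|m|re|ve|ll)\b"
)


def close_up_apostrophes(line):
    # Re-join the word and its suffix by dropping the single space.
    return PATTERN.sub(r"\1\2\3", line)
```

Quoted words such as 'yes' are left alone under this assumption, because the text after the apostrophe is not one of the listed suffixes.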
Punctuation at the beginning of a line is elided in some cases.
| , where he said so.
= Where he said so.
Capitalization is applied at the beginning of a line, and the beginning of a sentence.
| , where. he said so.
= Where. He said so.
| Really? that was... so
= Really? That was... so
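The capitalization examples above can be reproduced with a short sketch. It assumes that only leading commas and spaces are stripped, and that an ellipsis does not end a sentence (which is what the second example suggests); capitalize_sentences is a hypothetical name, not a T-Rext function.

```python
import re


def capitalize_sentences(line):
    # Drop commas and spaces at the very start of the line.
    line = re.sub(r'^[\s,]+', '', line)
    # Capitalize the first character of the line.
    line = line[:1].upper() + line[1:]
    # Capitalize a lowercase letter that follows ., ! or ? and spaces,
    # unless that mark is the tail of a run of periods (an ellipsis).
    return re.sub(
        r'(?<![.])([.!?] +)([a-z])',
        lambda m: m.group(1) + m.group(2).upper(),
        line,
    )
```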
Two full stops become an ellipsis. A full stop followed by a comma becomes just a comma.
| It was.. the nice., thing.
= It was... the nice, thing.
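These two rules can also be sketched as regular expressions. The lookarounds are an assumption of this example: they make sure an existing three-dot ellipsis is left untouched. The name fix_full_stops is made up here and is not part of T-Rext.

```python
import re


def fix_full_stops(line):
    # Exactly two full stops (not part of a longer run) become three.
    line = re.sub(r'(?<!\.)\.\.(?!\.)', '...', line)
    # A single full stop directly before a comma is dropped.
    line = re.sub(r'(?<!\.)\.,', ',', line)
    return line
```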