T-Rext

T-Rext is a command-line filter that attempts to clean up spacing, punctuation, and capitalization in a text file. Its purpose is to spare you from worrying too much about output format when you are writing a text generator, such as a Markov processor; just pipe the generator's output through T-Rext when you're done to make it more presentable.

The current version of T-Rext is 0.4, which runs under either Python 2.7 or Python 3.x. Docker images based on appropriate versions of CPython for each Python version are available on Docker Hub.

Usage

Usage from the Command Line

bin/t-rext raw_output.txt > cleaned_output.txt

This will take lines that look like this:

" Well , " said the king , , " no . "

and reformat them to look like this:

“Well,” said the king, “no.”

To use T-Rext from any working directory, add the bin directory in this repository to your PATH. For example, you might add this line to your .bashrc:

export PATH=/path/to/this/repo/bin:$PATH

An easy way to accomplish the above is to install shelf, then dock T-Rext by cd-ing into one of your shelf directories and running:

git clone https://codeberg.org/catseye/T-Rext
shelf_link T-Rext

Usage from Python

T-Rext is built on an over-engineered library of pipeline processors, which you can use directly (note that its interface is not stable and is liable to change). To use the T-Rext Python modules in other Python programs, make sure the src directory of this repository is on your PYTHONPATH. For example, you might add this line to your .bashrc:

export PYTHONPATH=/path/to/this/repo/src:$PYTHONPATH

Then you can add imports like this to the top of your script:

from t_rext.processors import TrailingWhitespaceProcessor
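
The interface of the classes in t_rext.processors is not described here, so the following is only a sketch of the general idea of chaining line processors into a pipeline. The names LineProcessor, StripTrailingWhitespace, and CollapseCommas are hypothetical and are not T-Rext's actual API.

```python
import re


class LineProcessor:
    """Hypothetical base class: wraps an iterable of lines, yields processed lines."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield self.process(line)

    def process(self, line):
        # Default behaviour: pass the line through unchanged.
        return line


class StripTrailingWhitespace(LineProcessor):
    def process(self, line):
        return line.rstrip()


class CollapseCommas(LineProcessor):
    def process(self, line):
        # A run of commas, with optional spaces around them, becomes one comma.
        return re.sub(r"\s*,(?:\s*,)*", ",", line)


def run_pipeline(lines):
    # Each stage consumes the previous stage lazily, one line at a time.
    return list(CollapseCommas(StripTrailingWhitespace(lines)))
```

Because each stage only needs an iterable, stages can be stacked in any order, and a file object can be passed directly as the innermost source.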

Tests

This is a test suite, written in Falderal format, for the t-rext utility. It also serves as documentation for said utility.

-> Tests for functionality "Clean up punctuation and spaces"

Spaces before commas and periods are elided.

| Well , that is good .
= Well, that is good.

Multiple commas are collapsed into a single comma.

| Well , , that is good .
= Well, that is good.

Multiple periods are not collapsed into a single period, although the spaces between them are elided.

| Well . . . that is good.
= Well... that is good.
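
The three rules above can be approximated with two regular expressions. This is a minimal sketch written for illustration, not T-Rext's own implementation; the function name tidy_commas_and_periods is hypothetical.

```python
import re


def tidy_commas_and_periods(line):
    # Collapse a run of commas (possibly separated by spaces) into a single
    # comma, dropping any spaces in front of it.
    line = re.sub(r"\s*,(?:\s*,)*", ",", line)
    # Drop spaces in front of each period; the periods themselves are kept,
    # so ". . ." becomes "..." rather than ".".
    line = re.sub(r"\s+\.", ".", line)
    return line
```

Applied to the three inputs above, this sketch reproduces the three expected outputs.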

Quotes are oriented.

| "Yes," he said.
= “Yes,” he said.

Single spaces after opening quotes and before closing quotes are elided.

| " Yes , " he said.
= “Yes,” he said.

But not the other way 'round: the space before an opening quote and the space after a closing quote are kept.

| Muttering "Yes," he turned around.
= Muttering “Yes,” he turned around.

Multiple spaces after opening quotes and before closing quotes are elided.

| "   Yes ,   " he said.
= “Yes,” he said.

This is the case even if the quotes are oriented single quotes.

| Don’t reply ‘   Yes    ’ .
= Don’t reply ‘Yes’.

But not the other way 'round: multiple spaces before an opening quote and after a closing quote are kept.

| Muttering   "Yes,"    he turned around.
= Muttering   “Yes,”    he turned around.
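
One way to sketch the spacing rules for quotes, assuming the quotes have already been oriented, is to remove whitespace only on the inner side of each quote character. The function name trim_inside_quotes is hypothetical, and this is not necessarily how T-Rext does it.

```python
import re

OPENING = "\u201c\u2018"  # left double and left single quotation marks
CLOSING = "\u201d\u2019"  # right double and right single quotation marks


def trim_inside_quotes(text):
    # Remove whitespace that directly follows an opening quote.
    text = re.sub("([" + OPENING + "])\\s+", r"\1", text)
    # Remove whitespace that directly precedes a closing quote.
    text = re.sub("\\s+([" + CLOSING + "])", r"\1", text)
    return text
```

Whitespace on the outer side of a quote never matches either pattern, so it is left alone. Note that the right single quotation mark doubles as an apostrophe, so this sketch also removes a space in front of an oriented apostrophe.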

Quotes do not match across paragraphs.

| Turbid "Waters" that "leak.
| 
| You "don't" have a clue.
= Turbid “Waters” that “leak.
= 
= You “don't” have a clue.
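
A simple way to get this behaviour is to alternate between opening and closing quotes and to reset that state at every paragraph break. The sketch below assumes paragraphs are separated by a single blank line; the function names are hypothetical and the approach is illustrative, not taken from T-Rext's source.

```python
OPEN_QUOTE = "\u201c"   # left double quotation mark
CLOSE_QUOTE = "\u201d"  # right double quotation mark


def orient_paragraph(paragraph):
    out = []
    opening = True  # the first straight quote in a paragraph opens
    for ch in paragraph:
        if ch == '"':
            out.append(OPEN_QUOTE if opening else CLOSE_QUOTE)
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)


def orient_quotes(text):
    # Orient each paragraph independently, so an unmatched quote in one
    # paragraph does not flip the orientation of quotes in the next.
    return "\n\n".join(orient_paragraph(p) for p in text.split("\n\n"))
```

With the input above, the unmatched quote before "leak" stays an opening quote, and the next paragraph starts again with an opening quote.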

Single spaces before apostrophes are elided in some situations.

| It wasn 't Arthur 's car.
= It wasn't Arthur's car.

This is the case even if the apostrophes are oriented single quotes. In fact, in this case, trailing spaces are elided too.

| It wasn ’t Arthur ’s car.
= It wasn’t Arthur’s car.
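
The text says only that this happens "in some situations" and does not list them. The sketch below assumes those situations are common English contraction and possessive suffixes; that suffix list and the function name join_apostrophes are assumptions made for illustration.

```python
import re

# Assumed suffix list; the real rule in T-Rext may differ.
SUFFIXES = "s|t|d|m|ll|re|ve"

# A word character, one space, a straight or oriented apostrophe, a suffix.
PATTERN = re.compile("(\\w) (['\u2019])(" + SUFFIXES + ")\\b")


def join_apostrophes(line):
    # Re-attach the apostrophe and its suffix to the preceding word.
    return PATTERN.sub(r"\1\2\3", line)
```

This handles only the single space before the apostrophe, as in the two examples above.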

Punctuation at the beginning of a line is elided in some cases.

| , where he said so.
= Where he said so.

Capitalization is applied at the beginning of a line, and the beginning of a sentence.

| , where. he said so.
= Where. He said so.

| Really?    that was... so
= Really?    That was... so
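
The two rules above (leading punctuation and capitalization) can be sketched as follows. This sketch assumes that the only leading punctuation removed is a comma, and that a sentence ends with a single ".", "!", or "?" followed by whitespace; an ellipsis does not count as a sentence end, in keeping with the last example. The function names are hypothetical.

```python
import re


def strip_leading_comma(line):
    # Drop a comma (and the spaces around it) at the very start of the line.
    return re.sub(r"^\s*,\s*", "", line)


def capitalize_sentences(line):
    # Uppercase a lowercase letter at the start of the line, or after a
    # sentence-ending mark plus whitespace. The lookbehind rejects a period
    # that is itself preceded by a period, so "..." does not end a sentence.
    pattern = r"(^|(?<!\.)[.!?]\s+)([a-z])"
    return re.sub(pattern, lambda m: m.group(1) + m.group(2).upper(), line)


def clean_line(line):
    return capitalize_sentences(strip_leading_comma(line))
```

The whitespace matched after the sentence-ending mark is copied back unchanged, which is why the run of spaces after "Really?" survives.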

Two full stops become an ellipsis. A full stop followed by a comma becomes just a comma.

| It was.. the nice., thing.
= It was... the nice, thing.
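
Both substitutions can be sketched with regular expressions. This is an illustration of the rule as stated, not T-Rext's own code; the function name fix_stops is hypothetical.

```python
import re


def fix_stops(line):
    # Exactly two full stops (not part of a longer run) become three.
    line = re.sub(r"(?<!\.)\.\.(?!\.)", "...", line)
    # A full stop immediately followed by a comma becomes just the comma.
    line = re.sub(r"\.,", ",", line)
    return line
```

The lookbehind and lookahead keep an existing three-dot ellipsis from being lengthened.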