git @ Cat's Eye Technologies T-Rext / master README.md
master

Tree @master (Download .tar.gz)

README.md @masterview rendered · raw · history · blame

T-Rext
======

<!--
SPDX-FileCopyrightText: (c) 2024 Chris Pressey, Cat's Eye Technologies
This file is distributed under an MIT license.  For details, see LICENSES/ dir.
SPDX-License-Identifier: LicenseRef-MIT-X-T-Rext
-->

T-Rext is a command-line filter that attempts to clean up spacing,
punctuation, and capitalization in a text file.  Its purpose is so that,
when you are writing a text generator, such as a Markov processor, you
need not worry too much about its output format; just toss its output
through T-Rext when you're done to make it more presentable.

The current version of T-Rext is 0.4, which runs under either Python 2.7
or Python 3.x.  Docker images based on appropriate versions of cPython
for each version are [available on Docker Hub][].

Usage
-----

### Usage from the Command Line

    bin/t-rext raw_output.txt > cleaned_output.txt

This will take lines that look like this:

    " Well , " said the king , , " no . "

and reformat them to look like this:

    “Well,” said the king, “no.”

To use T-Rext from any working directory, add the `bin` directory in this
repository to your `PATH`.  For example, you might add this line to your
`.bashrc`:

    export PATH=/path/to/this/repo/bin:$PATH

An easy way to accomplish the above is to install [shelf][], then
dock T-Rext by `cd`ing into one of your shelf directories and running

    git clone https://codeberg.org/catseye/T-Rext
    shelf_link T-Rext

### Usage from Python

T-Rext is built on an over-engineered library of pipeline processors, which
you can use directly (note, its interface is not stable and liable to change.)
To use the T-Rext Python modules in other Python programs, make sure the
`src` directory of this repository is on your `PYTHONPATH`.  For example,
you might add this line to your `.bashrc`:

    export PYTHONPATH=/path/to/this/repo/src:$PYTHONPATH

Then you can add imports like this to the top of your script:

    from t_rext.processors import TrailingWhitespaceProcessor

Tests
-----

This is a test suite, written in [Falderal][] format, for the `t-rext`
utility.  It also serves as documentation for said utility.

    -> Tests for functionality "Clean up punctuation and spaces"

Spaces before commas and periods are elided.

    | Well , that is good .
    = Well, that is good.

Multiple commas are collapsed into a single comma.

    | Well , , that is good .
    = Well, that is good.

Multiple periods are not collapsed into a single period.

    | Well . . . that is good.
    = Well... that is good.

Quotes are oriented.

    | "Yes," he said.
    = “Yes,” he said.

Single spaces after opening quotes and before closing quotes are elided.

    | " Yes , " he said.
    = “Yes,” he said.

But not the other way 'round.

    | Muttering "Yes," he turned around.
    = Muttering “Yes,” he turned around.

Multiple spaces after opening quotes and before closing quotes are elided.

    | "   Yes ,   " he said.
    = “Yes,” he said.

This is the case even if the quotes are oriented single quotes.

    | Don’t reply ‘   Yes    ’ .
    = Don’t reply ‘Yes’.

But not the other way 'round.

    | Muttering   "Yes,"    he turned around.
    = Muttering   “Yes,”    he turned around.

Quotes do not match across paragraphs.

    | Turbid "Waters" that "leak.
    | 
    | You "don't" have a clue.
    = Turbid “Waters” that “leak.
    = 
    = You “don't” have a clue.

Single spaces before apostrophes are elided in some situations.

    | It wasn 't Arthur 's car.
    = It wasn't Arthur's car.

This is the case even if the apostrophes are oriented single quotes.
In fact, in this case, trailing spaces are elided too.

    | It wasn ’t Arthur ’s car.
    = It wasn’t Arthur’s car.

Punctuation at the beginning of a line is elided in some cases.

    | , where he said so.
    = Where he said so.

Capitalization is applied at the beginning of a line, and the
beginning of a sentence.

    | , where. he said so.
    = Where. He said so.

    | Really?    that was... so
    = Really?    That was... so

Two full stops becomes an ellipsis.  Full stop then comma becomes
just a comma.

    | It was.. the nice., thing.
    = It was... the nice, thing.

[Falderal]:                https://catseye.tc/node/Falderal
[shelf]:                   https://catseye.tc/node/shelf
[available on Docker Hub]: https://hub.docker.com/r/catseye/t-rext