Practical Matters
=================

This document is a collection of notes I've made over the years about the
practical matters of production programming languages — usually stemming
from being irked by some existing programming language's lack of adequate
(in my opinion) support for them.  As such, these thoughts may be overblown
and sophistry-laden.  But it is nice to have a place to put them.

Fundamental Abstractions
------------------------

The following facilities should be either built-in to the language, or part
of the standard (highly standardized) libraries:

* Tracing.  Ideally, the programmer should be able to easily browse all the
  relevant reduction steps, and the relevant data being manipulated therein,
  in the part of the program's execution that interests them.  In addition,
  this should be something that can be enabled without polluting the source
  code (overmuch).
  
  This could be done, and fairly well, with techniques from aspect-oriented
  programming.  The rules to describe what to trace (or to highlight in a
  full trace) could be specified in what amounts to a configuration file,
  and thus be an implementation issue rather than a language issue.
  
  Unfortunately, this ideal is hard to achieve, so the system should also
  support...

* Logging.  Logging is basically an ad-hoc way to explicitly achieve
  selective tracing: the programmer knows what points in the program, and
  what data, are of interest to them, and outputs that data to the log at
  those points.

  Whether this is "debug logging" during development, or to support post-
  mortem analysis of issues in production, it amounts to the same thing:
  debugging, just on different time scales.
  
  The use of a "log level" is mostly just a way to filter the trace built
  up in the log files.  This is not necessarily a bad idea, but it should
  probably not be linear; information should be logged based on the reason
  that it is being logged, probably in the form of some sort of "tag", and
  filterable on that (whether at the time the log is being recorded, or
  being read.)
  
  Logging should not count as a side-effect.

  The logging function itself should have some properties:

  - Should not have side-effects (for example from evaluating its arguments),
    so that if it is not executed (because we are not interested in that
    part of the execution trace) the behaviour of the program is not changed.

  - In fact, should ensure that its arguments have no side-effects, and
    ideally, be total, with no chance of hanging or crashing.

  - Should pretty-print the relevant values, include the type and other
    metadata of the values, and put clearly visible delimiters around the
    values so printed.

  - Should include the source filename and line number.

  - Should not be overridable (shadowed?  not sure what I meant here.)

* History.  This is more relevant in a language with mutable values, but
  as part of tracing, it is useful to know the history of mutations of a
  value.  With immutable values, it would be useful to be able to view
  all the reductions which fed into the computation of the value at a
  point.  Either way, however, this is expensive, so should be specified
  selectively.  Again, an external, aspect-like configuration language
  for specifying which values to watch makes this an implementation issue.

* Command-line option parsing.  This should not rely on the Unix or DOS
  idea of a command line, and it should be unified with parameter passing
  in the language itself; calling an executable built in the language with
  arguments `a b c` should be no different from calling a function from
  within the language with the arguments `a b c` (probably as string values.)
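To make the Logging item above a little more concrete, here is a minimal Python sketch of tag-based (rather than level-based) filtering.  The names `log`, `ENABLED_TAGS` and `sink` are made up for illustration.  Python cannot guarantee that the arguments are free of side-effects, so that property is not modelled here.

```python
import inspect
import pprint

# Hypothetical filter: the set of reasons we currently care about.
# In practice this would come from configuration, not from the source code.
ENABLED_TAGS = {"billing"}


def log(tag, label, value, sink=print):
    """Emit one record if `tag` is enabled; return whether it was emitted."""
    if tag not in ENABLED_TAGS:
        return False
    caller = inspect.stack()[1]          # frame of whoever called log()
    rendered = pprint.pformat(value)     # pretty-printed value
    type_name = type(value).__name__     # a little metadata about the value
    # "<<" and ">>" serve as clearly visible delimiters around the value.
    sink(f"{caller.filename}:{caller.lineno} [{tag}] {label} "
         f"({type_name}) <<{rendered}>>")
    return True


records = []
log("billing", "invoice", {"id": 7, "total": 12.5}, sink=records.append)
log("cache", "hit ratio", 0.93, sink=records.append)  # filtered out by tag
print(records)
```

The same tags could just as well be used to filter when the log is read instead of when it is written; only the write-time case is shown.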

Reflection
----------

* First-class tracebacks.  When a program, for example, encounters an error
  parsing an external file such as a configuration file, it should be able to
  report the position in that file that caused the error as part of the
  traceback, for consistency.  Java has some limited facilities for this, and
  some Python libraries do this (Jinja2? werkzeug?) using frame hacks, but
  a less clumsy solution would be nice.

* Tracebacks are *not* a special case of logging, or an artefact of throwing
  exceptions.  Since the traceback is basically a formatted version of the
  current continuation, this suggests that tracebacks and continuations
  should be unified, perhaps not totally, but to a high degree.
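As a rough approximation of the first point, Python's built-in `SyntaxError` can carry a position in an arbitrary file, and the `traceback` module prints that position in the same `File ..., line ...` shape as an ordinary frame.  This is only a sketch of the idea, not a first-class facility; the `parse_config` function and the `key = integer` file format are invented for the example.

```python
import traceback


def parse_config(text, filename):
    """Parse 'key = integer' lines, reporting bad lines with their position."""
    settings = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        key, _, value = line.partition("=")
        try:
            settings[key.strip()] = int(value)
        except ValueError:
            # SyntaxError accepts (filename, lineno, offset, text); the
            # traceback module then reports that position after the frames.
            raise SyntaxError("expected an integer",
                              (filename, lineno, 1, line)) from None
    return settings


def report(text, filename):
    """Return the formatted traceback for a bad file, or '' if it parses."""
    try:
        parse_config(text, filename)
    except SyntaxError:
        return traceback.format_exc()
    return ""


output = report("port = 8080\nworkers = many\n", "app.conf")
print(output)
```

Borrowing `SyntaxError` for a configuration file is arguably its own kind of hack, which is rather the point: the language offers no cleaner way to say "this position in this other file is part of the story".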

Abstractions, not Wrappers
--------------------------

The basic principle here is that the existing APIs of most libraries are
(let's be polite) less than ideal, especially when they were designed for
some other language (such as C), and instead of blindly wrapping them in a
new language, the designer should at least *try* to make something nicer.

The abstractions should also recognize that modern computer systems are
generally not resource-starved (or at least that truly high-level
programming languages should not treat them that way.)

This applies to very basic facilities as well as what are usually thought
of as external libraries.  Specifically,

* Date and time: We can do better than simply copycatting interfaces like
  `strftime`.  All time data should be stored consistently, in GMT, always
  with a time zone.

* String formatting: We can do better than simply copycatting interfaces
  like `printf`.  We can use visual formatting strings, where fixed-size
  slots appear as fixed-sized placeholders (of the same size) in the
  formatting string.  (See also the scathing prog21 criticism of the
  vertical tab character.)

* Line-oriented communication: We can look at line-oriented communication
  more generally, as a form of record-oriented communication where the
  "delimiter set" for each record is {LF, CR, CRLF}.
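The record-oriented view in the last item can be sketched in a few lines of Python.  The function name `split_records` and its default delimiter set are assumptions for illustration; the only subtlety is trying longer delimiters first, so that CRLF is not read as a CR followed by an LF.

```python
import re


def split_records(data, delimiters=("\r\n", "\r", "\n")):
    """Split `data` into records using any of the given delimiters."""
    # Longest delimiter first, so CRLF wins over a lone CR or LF.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    records = re.split(pattern, data)
    # A trailing delimiter ends the last record; it does not start a new one.
    if records and records[-1] == "":
        records.pop()
    return records


print(split_records("a\r\nb\rc\nd\n"))
print(split_records("x;y;z", delimiters=(";",)))
```

Line-oriented input is then just the special case with the default delimiter set; any other record format differs only in the set it passes.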

The programmer who really wants atavistic interfaces like those mentioned
above can always implement them as "compatibility modules" if they wish.
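For the visual formatting strings mentioned above, a minimal sketch in Python might treat each run of `#` as a slot whose width is the length of the run.  The function name `visual_format`, the choice of `#` as the slot character, and the left-justify-and-truncate policy are all assumptions made for the example.

```python
import re


def visual_format(template, *values, slot_char="#"):
    """Fill each run of `slot_char` with the next value, keeping its width."""
    remaining = iter(values)

    def fill(match):
        width = len(match.group(0))
        text = str(next(remaining))
        # Truncate or pad so the slot keeps exactly the width it was drawn with.
        return text[:width].ljust(width)

    return re.sub(re.escape(slot_char) + "+", fill, template)


template = "Name: ######## Qty: ###"
line = visual_format(template, "Widget", 42)
print(repr(line))
```

The output is always exactly as wide as the template, which is the whole appeal: what you see in the formatting string is the layout you get.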

Separation from the Implementation
----------------------------------

This is just a repeat of the above section in slightly different terms.

A language should avoid tying any language construct (e.g. imports,
include files) to the file system or the operating system.  Instead,
have mappings between e.g. module names and where they live in the file
system, and between our model of a running computer and a real OS.
These mappings could be specified in configuration files which are
in the domain of the implementation and outside the domain of the
language, i.e. they never appear in programs.
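A small Python sketch of this idea follows: the program asks for a module by logical name, and a mapping held outside the program says where that name currently lives.  The `load_module` helper and the `module_map` dictionary are hypothetical; in a real system the mapping would be read from an implementation-level configuration file rather than built inline.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module(name, module_map):
    """Load the module known as `name` from wherever the map says it lives."""
    path = module_map[name]
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # The file name deliberately has nothing to do with the module name.
    source = Path(tmp) / "impl_v2.py"
    source.write_text("def greet():\n    return 'hello'\n")

    module_map = {"greeting": str(source)}   # stands in for a config file
    greeting = load_module("greeting", module_map)
    result = greeting.greet()

print(result)
```

The program only ever mentions `greeting`; moving or renaming the file means editing the mapping, not the program.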

Standard modules supplied with the language should expose *models* of
commonplace artefacts out in the world, for example operating systems.
The models are similar to the artefacts, in order that the burden of
implementing an interface from the model to any given artefact is not
too great.  However, the models are *not* the artefacts.  Programs
should be written to the model, not to the artefact.

People who construct bindings to the language should be encouraged
(only because they can't effectively be required) to create models
more abstract than the libraries that they are binding.

Insofar as possible, we can have a compiler optimize things so that they
match the underlying architecture.  The language should allow and even
encourage definitions in the most general sense; special cases are to be
detected and optimized when they occur, instead of instituting those
special cases into the language itself.

Another aspect of this point of philosophy is that it should be possible
to specify and change the performance characteristics of the program
(but ideally not its behaviour) from outside the program, using
configuration files.

This counts as a practical matter because maintaining code which is
cluttered with implementation-specific artefacts is burdensome.

Serialization
-------------

(This section needs to be rewritten)

- All primitive values must be serializable
- All primitive values must be round-trippable
- All primitive values must thus have an order to them (like Ruby 1.9's
  hashes) because in this world of representations, orderless things don't
  really exist
- When building user-defined values from primitive values it must be
  easy to retain these serialization properties in the composite value
- This is actually fairly agnostic of the particular serialization format
  (yaml, xml, binary, etc)
- S-expressions are trivially serializable, except for functions
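As a sketch of what these properties look like in practice, here is a round-trip check in Python using JSON as the (arbitrarily chosen) format, plus a user-defined value that inherits serializability from its primitive parts.  The `round_trips` helper and the `Point` class are invented for the example.

```python
import json
from dataclasses import asdict, dataclass


def round_trips(value):
    """True if serializing and then deserializing gives back an equal value."""
    return json.loads(json.dumps(value)) == value


@dataclass
class Point:
    """A composite value built only from primitives."""
    x: int
    y: int
    label: str

    def serialize(self):
        return json.dumps(asdict(self))

    @classmethod
    def deserialize(cls, text):
        return cls(**json.loads(text))


print(round_trips({"b": 1, "a": [1, 2.5, "x", None, True]}))
print(list(json.loads(json.dumps({"b": 1, "a": 2}))))   # key order survives
print(round_trips((1, 2)))    # tuples come back as lists, so this fails
print(Point.deserialize(Point(1, 2, "origin-ish").serialize()))
```

The tuple case shows how easily a language's own primitives can fail the round-trip requirement in a given format.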

Formatting
----------

Closely related to serialization.

Many languages support a "standard" operation to convert an arbitrary value to
a string.  Some even have two (e.g. Python's `str` and `repr`).

But in reality, there are any number of ways to convert a value to a string.
Why should the string representation of 16 necessarily be `"16"` — why not
`"0x10"` or `"XVI"`?  `"16"` is fine, but it should be explicitly noted to be
the default for the reason that it's the most convenient for the audience of
humans who use the decimal Arabic notation when dealing with numbers.

How can we support both a reasonable (and possibly configurable) default
formatting, as well as any number of other ways to format values which would
be more appropriate in different contexts?

Can we pass a "style" argument to the string-conversion function?

Should we establish a "design pattern" for writing formatting functions, and
provide support for implementing such patterns?

(Also, `format` is probably a better name for this function than `str`.)
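One possible answer to the "style" question, sketched in Python: a single conversion function that looks up a named style in a table of formatters, with decimal as the explicit default.  The names `format_value`, `STYLES` and `to_roman` are assumptions for the example, and the table could be extended (or its default changed) from configuration.

```python
def to_roman(number):
    """Convert a positive integer to Roman numerals."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    parts = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


# Each style is just a function from a value to a string.
STYLES = {"decimal": str, "hex": hex, "roman": to_roman}


def format_value(value, style="decimal"):
    """Convert `value` to a string in the requested style."""
    return STYLES[style](value)


print(format_value(16))
print(format_value(16, style="hex"))
print(format_value(16, style="roman"))
```

This only handles integers; a fuller design would need the table to be keyed on the type of the value as well as on the style.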

Multiple Environments
---------------------

(This section needs to be rewritten)

- Lots of software runs in multiple environments - "development", "qa",
  "production"
- Inherently support that idea

Assertions
----------

(This section needs to be rewritten)

- Software engineering is more about defining invariants than writing code.
- An "assert" command which produces detailed errors in development, but only
  logs warnings in production environments
- Very lightweight so that programmers use it without thinking
  (Python's `self.assertEqual()` is *not* lightweight)
  (Erlang's `A = {foo,B}` IS lightweight)
- So a conditional, by itself, is an assertion. (?)
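A minimal Python sketch of an environment-aware assertion follows.  The `check` function and the `ENVIRONMENT` setting are hypothetical; how the current environment is actually determined is left open, as it is in the notes above.

```python
import logging

# Hypothetical setting; would normally be supplied from outside the program.
ENVIRONMENT = "development"


def check(condition, message, environment=None):
    """Fail loudly in development; only log a warning in production."""
    if condition:
        return True
    if (environment or ENVIRONMENT) == "production":
        logging.getLogger("checks").warning("check failed: %s", message)
        return False
    raise AssertionError(message)


check(1 + 1 == 2, "arithmetic still works")
check(False, "cache is warm", environment="production")   # warning only

try:
    check(False, "cache is warm")                          # development
except AssertionError as error:
    print("raised:", error)
```

Whether something this wordy counts as "lightweight" is debatable; it is closer to `self.assertEqual()` than to an Erlang pattern match.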

Interfaces
----------

(This section needs to be rewritten)

One way or another, it should be possible to discover (programmatically,
through reflection of some sort) the set of operations that a value supports —
its interface.  Each operation has a name and a signature of some sort.
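In Python, something along these lines is already possible with the `inspect` module.  The sketch below collects the name and signature of each public operation on a value; the helper `interface_of` and the `Stack` class are made up for the example, and treating a leading underscore as "private" is Python convention rather than anything enforced.

```python
import inspect


def interface_of(value):
    """Map each public operation of `value` to its signature, as a string."""
    operations = {}
    for name, member in inspect.getmembers(value, callable):
        if name.startswith("_"):
            continue                      # skip conventionally private names
        operations[name] = str(inspect.signature(member))
    return operations


class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()


print(interface_of(Stack()))
```

Note that `inspect.signature` cannot describe every built-in callable, so this sketch is only reliable for operations defined in Python itself.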

Collections are interfaces.

Some parts of an interface might be "private".  This — information hiding —
is obviously a somewhat complex topic.  The obvious bit is that information
hiding is useful to prevent unintended changes to program state, but it also
hinders debugging and testing.

Usability
---------

Memorization is not a good thing to make programmers do.  This can be
addressed by either copying things from an existing language that the
programmer base can be expected to already have memorized, or by providing
a more orthogonal set of things which maps to the culture which programmers,
as people, already live in.  (For example, few people in the Western world
do not know that `&` means "and".)

Non-alphabetic symbols should, ideally, have the same meaning regardless of
the context they're used in — in other words, the language should avoid
using the same symbol for different purposes in different contexts.

(Lots of languages are lacking here.  In C, `*` is both multiplication and
dereferencing. In Python, `.` is both object attribute access and package
hierarchy — although packages are, at least, kind of like objects.  In Lua,
`=` is both assignment and key value association.)

Programming Languages vs. Operating Systems
-------------------------------------------

(this section needs to be cleaned up — not sure where to put it, and it
arguably doesn't belong here)

What you see before you in this distribution can be described as a
programming language, but many of the ideas took root while thinking about
operating systems.

What's the difference between a programming language and an operating system?

Well, maybe less than you think.

Programming languages do need to define the environment in which they can
express programs.  Sometimes this is a specific OS (like early C on Unix) --
or they claim to be "portable", but then they're really just defining an
abstraction against all the possible OS'es they think they'll run on.  Often
this abstraction is clumsy, but some languages put a lot of thought into it,
like Smalltalk.

Operating systems, on the other hand, don't tell you what programming
language to use -- or do they?  A modern OS insists everything is, at some
point, in native machine language, and a running instance will almost always
be limited to a single machine language of a single architecture.  Somewhat
more alternative OS'es define a virtual machine language to abstract away
from the concrete machine language.  Usually this virtual machine language
looks like a machine language, but sometimes it's a tad more high-level,
like Lisp.  Any way you slice it, the OS does sanction a particular, albeit
usually low-level, programming language.

Where PL's and OS's seem to meet more-or-less neatly is in the idea of the
VM, so let's examine that.

Most modern virtual machines are designed to implement high-level languages
in a modern operating system environment.  The JVM was specifically designed
for running Java, and while .NET was ostensibly designed for multiple
languages, the bytecode is pretty closely tuned to C\#.

What these VMs were not designed to do, but what a VM "should really" be
designed to do (if it, at least, wants to live up to the name "virtual
machine") is to abstract the *hardware* and provide virtualizations
(abstractions) of the available devices.

An environment contains zero or more devices.  A device exposes zero
or more services.  Each service conforms to one or more interfaces.
Each service may additionally require one or more services be available
(by interface).
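The environment/device/service relationships just described can be written down as a small data model.  This Python sketch is only one reading of the paragraph above: the class names follow the prose, but the field names, the `unmet_requirements` method and the example devices are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    interfaces: frozenset                 # interfaces this service conforms to
    requires: frozenset = frozenset()     # interfaces it needs from others

    def __post_init__(self):
        if not self.interfaces:
            raise ValueError("a service conforms to one or more interfaces")


@dataclass
class Device:
    name: str
    services: list = field(default_factory=list)   # zero or more services


@dataclass
class Environment:
    devices: list = field(default_factory=list)    # zero or more devices

    def services(self):
        return [s for d in self.devices for s in d.services]

    def unmet_requirements(self):
        """Map each service name to the required interfaces nobody provides."""
        provided = {i for s in self.services() for i in s.interfaces}
        return {s.name: sorted(s.requires - provided)
                for s in self.services() if s.requires - provided}


screen = Device("screen", [Service("text-screen",
                                   frozenset({"CharacterGrid"}),
                                   requires=frozenset({"Clock"}))])
timer = Device("timer", [Service("tick", frozenset({"Clock"}))])

print(Environment([screen]).unmet_requirements())
print(Environment([screen, timer]).unmet_requirements())
```

Because requirements are stated by interface rather than by service, any device that exposes a `Clock` will satisfy the screen, which is what keeps programs written to the model rather than to a particular piece of hardware.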

At one point I was calling this place where programming language and
operating system meet a "CE" (Computational Environment) because
"operating system" is far too generic-sounding and "programming language"
doesn't address the important environmental aspect here.  Whether I would
continue to use the term CE or not, I'm not sure — it could just add to the
confusion.

How do most programming languages deal with the abstraction of available
(or virtual) devices?  Terribly, I would say.  Take, as a simple example,
an addressable character screen device.  Someone writes a library, in C,
to access it (e.g. `ncurses`), providing an API comprising C functions
and C structs.  Someone then writes a binding or a wrapper (e.g. using
`swig`) or otherwise foreign-function interfaces it to the language, usually
exposing the exact same C-level API naively adapted to the programming
language.  Then you, the programmer in this language, wrestle with working
with the device almost exactly as a C programmer would, initializing and
releasing it as a C programmer would, with limitations on how you may or
may not use it from multithreaded code like a C programmer would (which
might be brutally different from how the runtime for your programming
language implementation assumes that its world works.)  All this, with the
added hassle of having to make sure you have all these bindings for the
device for your chosen implementation of your language built and installed
correctly.