John's ECMA-55 Minimal BASIC Compiler

Fri Dec 12 17:57:29 2014 UTC

This software is a compiler for Minimal BASIC as specified by the ECMA-55 Minimal BASIC (81KB, updated) standard from the ECMA International organization (formerly known as the European Computer Manufacturers Association). A version of the ECMA-55 Minimal BASIC (83KB, updated) better suited to printing (uses backspace underline sequences to more correctly match the original scanned input) is also available. Both text files are in UTF-8 encoding. Finally, the original source for this is the ECMA-55 Minimal BASIC PDF but that is not easily searchable and is relatively huge (15 MB) since it is a picture-based scan and not text-based.

The target is AMD64/EM64T/x86-64 machines running a modern Linux distribution. This compiler creates assembly language output files. These are then assembled into object files and linked to create an executable. The assembly dialect used is that of GNU gas from GNU binutils, since that will be available on any modern general purpose x86-64 Linux distribution. No libc or libm is used by the generated code, which allows creating very small executables. To keep the generated code small and simple, generated code output of SIN(), COS(), TAN(), ATAN(), EXP(), POW, LOG(), RND, and RANDOMIZE is only emitted if those features are required by the BASIC program.

Paper Published!

A journal article has been published about this compiler, as of version 1.7. You can download the article from this URL:
http://www.mdpi.com/2073-431X/3/3/69

Current Status

Version 1.9 Released

The code has been updated to use native SSE register-based floating point math. This required implementing local (to each arithmetic expression) register allocation. All tests pass even limited to just 3 XMM registers. The self-test harness was updated to run valgrind on all successful compiles to ensure the compiler does not leak memory. On failed compiles, abort() is often called (to allow getting a stack trace) so valgrind is not run on compiles that are known to fail. See the NEWS file for more details.

You can download my presentation slides from my talk entitled ResurrectMinimalBASIC in PDF format.

You can try Version 1.9 today. If you are OK with mercurial then you can get a read-only copy of the repository with this command:

hg clone http://hg.code.sf.net/p/buraphakit/MinimalBASIC/

Test Results

Emmanuel Roche from France kindly provided paper copies of the missing NBS test sources. All tests are now typed in and are part of the self-test 'check' and 'check32' targets. And yes, the new (to me) tests did find problems.

Jorge Giner Cordero repoprted bugs in NBS tests 12, 14, and 43 which I fixed using his helpful bug reports.

The complete status of the compiler's ability to pass the NBS tests can be seen in README.NBS. The complete status of the compiler's ability to pass the new HAM tests can be seen in README.HAM.

Licensing Details

The license for the compiler itself is the GNU General Public License version 2 only, see the file included file COPYING for details. Parts of the runtime support code have difference licenses, but all are open source and free to use.

This author of the actual compiler software is John Gatewood Ham, and that code is available under the GPL version 2 license only.

The included runtime library assembly routines for SIN(), COS(), TAN(), ATAN(), LOG(), EXP(), and POW (used by ^ operator) are public domain from SLEEF-2.80 (tweaked), from Naoki Shibata.

The included runtime library assembly routines for RND, and RANDOMIZE are public domain from ISAAC-64 (tweaked), from Bob Jenkins.

The included runtime library assembly routines for floating point output are derived from David M. Gay's dtoa and g_fmt routines. That code is copyrighted by Lucent Technologies, but is open source and freely redistributable.

The file dumpregs.s is not part of the compiler, but is optionally used for debugging. John Gatewood Ham wrote dumpregs.s and places that particular file into the public domain.

Why this dialect of BASIC?

The ECMA-55 standard was chosen over the "ANSI X3.60-1978 minimal BASIC" standard since it is free. ANSI, despite cancelling the standard, still keeps the 35 year old standard locked down and available only if you pay for it, which is a quite mean-spirited attempt to prevent any compliant free and open source implementations from being written. The same attitude exists with ISO for the "ISO 6373:1984 Data processing -- Programming languages -- Minimal BASIC" standard. This standard has many other names, such as "AS 2797-1985 Programming language - Minimal BASIC", and the only free one is ECMA-55, since all the other standards bodies are trying to kill BASIC forever.

Many products exist today that call themselves BASIC, but if it doesn't have line numbers, it's not really BASIC. If it has multi-line statements, it's not really BASIC. The modern un-BASICs have nice features, but they are not really BASIC at all, but instead seem to be more Pascal-like in nature. The nicest thing I can say is that they are derivatives of BASIC.

I could not find any Minimal BASIC implementation at all for modern machines. Few true compilers (compilers to bytecode do not count) for any BASIC with line numbers exist, and most of them do not compile to assembly. None of them (that I could find) were both FOSS and ready for 64bit. Thus this project was undertaken to fill the need for a FOSS Minimal BASIC compiler for modern 64bit Linux machines.

http://www.ecma-international.org/publications/files/ECMA-ST-WITHDRAWN/

How can others benefit from this project?

Well, obviously if you need a Minimal BASIC implementation on Linux this will be useful. However, I suspect mostly it will show how to generate some simple but effective code for 64bit Linux on AMD64/EM64T with a small enough system that a normal programmer can learn it and understand it. Projects like gcc and llvm are huge and learning how to modify those compilers is so hard most people just give up. This project is much simpler, which no need to learn LISP, any intermediate languages, etc. You just need to know C99 and AMD64 assembly language in the GNU gas dialect. That's still a lot, but it is a tiny fraction of what is required for production compilers.

With production compilers, especially ones with multiple front and back ends, it is easy to get overwhelmed by the amount of information required to understand it. With this Minimal BASIC compiler, that problem does not exist. Many compilers need many dependencies (cmake, or the GNU autotools, or flex/bison, etc.) but this compiler just needs make, gcc (or clang), and binutils to build. Most compilers today emit code that requires a dynamic linker. Even what gcc calls static is actually dynamic in most cases, since GNU libc wants to be able to dynamically load code to resolve things for NSS. Here the generated code doesn't link with glibc so this problem is avoided.

This project can serve as an example of how to produce programs that don't link to glibc but instead call the kernel directly. Most assembly examples for the GNU gcc that I could find are embedded asm in C or C++ programs; very little stand-alone example code exists (that I could find) for GNU as dialect beyond a few simple hello world programs. There were no good examples of dealing with floating point exceptions that I could find at the assembly level for GNU as either.

Now is this generated code great? No, there is a lot of room for improvement. I've aimed for correctness and simplicity, and have avoided all optimization for now. Exception handling for math operations is exceptionally ugly, but even the ugly code took me a long time to get working. Constructive feedback with examples of doing it in a way that is cleaner, but still gives the same results and doesn't require any external libraries, would be welcome.

Why was a new compiler not using SSE4.x or AVX?

My work machine while developing this compiler from August 2013 through August 2104 was a Core 2 Duo E4700 (Conroe), and my request for something newer was placed on hold. Since the processor I had available did not support those instructions, the original design didn't use them.

I am very pleased to report that my employer, Faculty of Informatics at Burapha University, provided me with a new Haswell refresh machine in September 2014. On September 13, 2014, I completed adding support for SSE4.1's PINSR[QD] & PINEXTR[QD] and now they are used when you supply the -4 option. The compiler still generates code for a Conroe by default, but when the -4 option is supplied the nicer SSE4.1 instruction sequences are used. Surprisingly, when testing on the i7-4790 the run time is essentially unchanged. However, reading the code is much nicer with the new instructions. The just released 1.7 version includes this work.

I have already started planning on adding support for AVX, but will maintain the ability to run on ancient Conroe hardware for people who need that ability. As of Septeber 14, 2014, AVX detection is working (use the -A option) and it automatically also includes the SSE4.1 support since AVX is a superset of SSE4.X, but no AVX code (beyond the detection) is actually generated yet.

Implementation Details

The overall structure is best described by the structure diagram. This compiler is simple and generates the assembly directly from the parse without using any explicit syntax tree or intermediate representation.

Here is a list of the files that make up the compiler and their purpose.

COPYING
This contains a copy of the GNU GPL version 2 license for the compiler itself.

ChangeLog
This contains a high-level overview of changes sorted by time in ascending date order with the newest changes at the end of the file.

globals.c
and
globals.h
This contains global variables that must be shared across all modules.

scanner2.c
and
scanner2.h
This is the scanner that converts the input bytestream into tokens for the parser. This uses a hand-coded finite state machine (deterministic).

parser2.c
and
parser2.h
This contains the parser that uses the token stream created by the scanner. It is a hand-coded top-down, recursive descent parser.

lineno.c
and
lineno.h
This contains the line number module. It is used when checking branch targets.

symbol_table.c
and
symbol_table.h
This contains the symbol table module. It tracks the variables in use, as well as constants, and generates the GAS assembly data definition statements.

codegen.c
and
codegen.h
This contains routines that emit the GAS assembler output except for the symbol table output. It also contains the runtime functions and macros. Much of the runtime library code is from other people and is not GPLv2 like the compiler itself.

gay.c
and
gay.h
This contains routines that emit the GAS assembler output for David M. Gay's code for converting between ASCIIZ strings and floating point.

jenkins.c
and
jenkins.h
This contains routines that emit the GAS assembler output for Bob Jenkin's code for the ISAAC-64 random number generator.

shibata.c
and
shibata.h
This contains routines that emit the GAS assembler output for Naoki Shibata's SLEEF code for elementary functions.

main.c
This contains the main routine that calls everything else. It does the command-line argument processing, loads the input file into a buffer, calls the scanner to convert that into a token stream, then calls the parser to process the token stream.

dtoa5.s
This contains the assembly version for David M. Gay's dtoa.c file. The process to generate this is in the magic.txt file in the dgay sub-directory.

g_fmt_BASIC.s
This contains the assembly version for my tweaked version of David M. Gay's g_fmt.c file. The process to generate this is in the magic.txt file in the dgay sub-directory.

Makefile
This is the project build file for use by the make program.

ecma55.1
This is the man page for the compiler.

ECMA-55.TXT
This file contains the text of the ECMA-55 standard for "Minimal BASIC". This was retyped by me from the PDF version both to get a smaller file and to allow easy searching. It is in UTF-8 encoding because of the address that has an o with a hat on it.

dumpregs.s
This file contains a dumpregs procedure that will show a register dump for debugging. The compiler never links to this or uses it. The file is included for debugging of the compiler generated code at the assembly level. The BASICC script will automatically link dumpregs.o if it exists, and you can build that with 'make dumpregs.o'.

datum.dot
This is the graphviz dot source file for the diagram of the finite state machine used by the INPUT runtime subsystem. If you do not have graphviz dot, you can download the FSM diagram directly.

structure.dot
This is the graphviz dot source file for the diagram of the overall structure of the compiler. If you do not have graphviz dot, you can download the structure diagram directly.

parseinput.c
This is the C source code for the INPUT runtime subsystem. Compile with -DTROUBLE to get a trace of the states as the transitions occur.

zonermore.c
This is the C source code for the PRINT runtime subsystem.

robert1.c
This is the C source code for the RND function and RANDOMIZE statement. It is public domain and derived from Bob Jenkins' ISAAC-64.

peephole.c
This is the C source code for the special purpose peephole optimizer that finds some silly redundant pushxmm/popxmm macro sequences and removes them.

Example Showing How To Use The Compiler

There is an included

BASICC
script that you can modify that automates the compilation process. If you want 32bit math for add, subtract, multiply, and divide you can use
BASICCS
instead. If you want wide output with more significant digits and support for larger exponents, you can use
BASICCW
instead. You can optionally strip the generated executable with the strip command and it will still work, but be slightly smaller.

If you want to know the low-level details, please keep reading. The next section describes how to use the compiler directly without any scripts.

  1. Create source with text editor
    vi WHATEVER.BAS
    
    You must use a text editor that creates 7bit ASCII files. Remember, this dialect of BASIC requires line numbers and has 72 columns lines. String variables are limited to 18 bytes.

  2. Compile source to assembly language
    ecma55 WHATEVER.BAS
    
    If you have trouble, try again with -v option and redirect the output to a file to get very verbose diagnostic information and read it like this:
    ecma55 -v WHATEVER.BAS 2>&1 | tee WHATEVER.BAS.LOG | less
    
    Use 'q' to exit the less output viewer. Another advantage of using the '-v' option if you have trouble is that it will leave the partially generated assembly file for you to examine.

  3. Assemble and create object file
    as -n --divide --64 --size-check=error WHATEVER.BAS -o WHATEVER.o
    
    You will get warnings about jumps - while I can specify a .d32 suffix to instructions, the warn in the assembler fires anyway. However, the code is generated correctly. If you decide to modify the assembly code before assembling, please remember this is AT&T syntax and both the opcodes and the order of operands are different from Intel syntax. Also, the compiler emits code for the large memory model, and the code is not RIP-relative (PIC). This is not what most compilers generate by default for x86-64; it is much simpler, but may run a bit more slowly.

  4. Link and create static executable
    ld -nostdlib -z defs -z nodefaultlib -z nodlopen -z noexecstack -Bstatic \
       --no-omagic -m elf_x86_64 -o WHATEVER WHATEVER.o
    
    I do not support dynamic linking, and no system libraries are used, so this long line will carefully link the object file to create an executable that is able to run on 64bit Linux on AMD64 compatible processors (with SSE2 support, which is all of them I think) without any system libraries. If you are using dumpregs, you'll need to append dumpregs.o to the end of the ld command.

  5. Run executable
    ./WHATEVER