John's ECMA-55 Minimal BASIC Compiler

Fri Sep 4 21:52:17 2015 UTC

This software is a compiler for Minimal BASIC as specified by the ECMA-55 Minimal BASIC (81KB, updated) standard from the ECMA International organization (formerly known as the European Computer Manufacturers Association). A version of the ECMA-55 Minimal BASIC (83KB, updated) better suited to printing (uses backspace underline sequences to more correctly match the original scanned input) is also available. Both text files are in UTF-8 encoding. Finally, the original source for this is the ECMA-55 Minimal BASIC PDF but that is not easily searchable and is relatively huge (15 MB) since it is a picture-based scan and not text-based.


Alternative Implementation: bas55

If you want an interpreter instead of a compiler, a new open source interpreter called bas55 was written in 2014 by Jorge Giner Cordero. The bas55 system even has a nice online manual including a tutorial on programming in ECMA-55 Minimal BASIC.


The target is AMD64/EM64T/x86-64 machines running a modern Linux distribution. This compiler creates assembly language output files. These are then assembled into object files and linked to create an executable. The assembly dialect used is that of GNU gas from GNU binutils, since that will be available on any modern general purpose x86-64 Linux distribution. No libc or libm is used by the generated code, which allows creating very small executables. To keep the generated code small and simple, generated code output of SIN(), COS(), TAN(), ATAN(), EXP(), POW, LOG(), RND, and RANDOMIZE is only emitted if those features are required by the BASIC program.

Paper Published!

A journal article has been published about this compiler, as of version 1.7. You can download the article from this URL:
http://www.mdpi.com/2073-431X/3/3/69

Current Status

Version 2.12 Released

This release fixes many corner cases in the math support where infinite values were involved. This version also finally forces programs to be all upper-case as the ECMA-55 standard requires. Finally, it adds the -X switch to enable extensions, and adds several useful extensions.

See the NEWS file for more details.

Programming Guide

I am writing an introduction to programming using ECMA-55 Minimal BASIC as the language. You can download a draft copy in PDF format. The file was last updated Fri Sep 4 21:49:06 2015 UTC and includes a new chapter on subroutines, as well as a fix for an embarassing error in the sequential search flowchart where the loops were starting on one instead of zero.

Computer Science Programs

I have coded several `classic' computer science algorithms using ECMA-55 Minimal BASIC as demonstration programs. These are included in the compiler download file, or you can grab each one separately from this list.

SEQSEARCH.BAS Sequential Search
BSEARCH.BAS Binary Search
BSORT.BAS Bubble Sort
COMBSORT.BAS Comb Sort
FACTORIAL.BAS Recursive Factorial
FIBONACCI.BAS Generate Fibonacci numbers
HSORT.BAS Heap Sort
ISORT.BAS Insertion Sort
MSORT.BAS Merge Sort
QSORT.BAS Quick Sort
SSORT.BAS Selection Sort
SLLIST.BAS Singly Linked List Demo
SLLISORT.BAS Singly Linked List Insertion Sort

I have also found this program and ported it to ECMA-55 Minimal BASIC.

ERATOSTHENESE.BAS Sieve of Eratosthenes to find prime numbers

You can download my old presentation slides from my talk entitled Resurrect MinimalBASIC in PDF format. Updated slides reflecting recent changes are available in the ECMA55-slideshow.pdf file.

You can try Version 2.12 today. If you are OK with mercurial then you can get a read-only copy of the repository with this command:

hg clone http://hg.code.sf.net/p/buraphakit/MinimalBASIC/

Test Results

Emmanuel Roche from France kindly provided paper copies of the missing NBS test sources. All tests are now typed in and are part of the self-test 'check' and 'check32' targets. And yes, the new (to me) tests did find problems.

Jorge Giner Cordero from Spain repoprted bugs in NBS tests 12, 14, and 43 which I fixed using his helpful bug reports. Test P030.BAS had a typo which was fixed for the 1.97 release. He is the same Mr. Cordero who wrote the bas55 interpreter.

The complete status of the compiler's ability to pass the NBS tests can be seen in README.NBS. The complete status of the compiler's ability to pass the new HAM tests can be seen in README.HAM.

Licensing Details

The license for the compiler itself is the GNU General Public License version 2 only, see the file included file COPYING for details. Parts of the runtime support code have difference licenses, but all are open source and free to use.

This author of the actual compiler software is John Gatewood Ham, and that code is available under the GPL version 2 license only.

The included runtime library assembly routines for SIN(), COS(), TAN(), ATAN(), LOG(), EXP(), and POW (used by ^ operator) are public domain from SLEEF-2.80 (tweaked), from Naoki Shibata.

The included runtime library assembly routines for RND, and RANDOMIZE are public domain from ISAAC-64 (tweaked), from Bob Jenkins.

The included runtime library assembly routines for floating point output are derived from David M. Gay's dtoa and g_fmt routines. That code is copyrighted by Lucent Technologies, but is open source and freely redistributable.

The file dumpregs.s is not part of the compiler, but is optionally used for debugging. John Gatewood Ham wrote dumpregs.s and places that particular file into the public domain.

Why this dialect of BASIC?

The ECMA-55 standard was chosen over the "ANSI X3.60-1978 minimal BASIC" standard since it is free. ANSI, despite cancelling the standard, still keeps the 35 year old standard locked down and available only if you pay for it, which is a quite mean-spirited attempt to prevent any compliant free and open source implementations from being written. The same attitude exists with ISO for the "ISO 6373:1984 Data processing -- Programming languages -- Minimal BASIC" standard. This standard has many other names, such as "AS 2797-1985 Programming language - Minimal BASIC", and the only free one is ECMA-55, since all the other standards bodies are trying to kill BASIC forever.

Many products exist today that call themselves BASIC, but if it doesn't have line numbers, it's not really BASIC. If it has multi-line statements, it's not really BASIC. The modern un-BASICs have nice features, but they are not really BASIC at all, but instead seem to be more Pascal-like in nature (without the semicolons). The nicest thing I can say is that they are derivatives of BASIC. Sadly, when Kemeny and Kurtz decided to make this change, they didn't change the name of the language. One reference about the intention of the original designers to change BASIC to a more Pascal-like language I found was "BASIC Becomes a Structured Language" from Computer Language magazine's premier issue in 1984, page 21, written by Kemeny, Kurtz, and Brig Elliot. Whether the new language was better or not, this article proves that the new language was not BASIC in the original 1960's sense. It is unfortunate that the new language was considered by many to be a small evolution and not a totally new language since the syntax was intentionally entirely incompatible. The new style is called True BASIC in the article, but few people make that distiction, and the article still hints it's an evolution, and not a total rewrite of something completely different that has the same goal (a teaching langauge) but little else in common with the original 1960's Dartmouth BASIC. The style proposed in "On the Way to Standard BASIC" from BYTE Volume 7 Number 6 page 182, by Kurtz in June of 1982 is something more palatable and still uses line numbers.

I could not find any Minimal BASIC implementation at all for modern machines. Few true compilers (compilers to bytecode do not count) for any BASIC with line numbers exist, and most of them do not compile to assembly. None of them (that I could find) were both FOSS and ready for 64bit. Thus this project was undertaken to fill the need for a FOSS Minimal BASIC compiler for modern 64bit Linux machines.

http://www.ecma-international.org/publications/files/ECMA-ST-WITHDRAWN/

How can others benefit from this project?

Well, obviously if you need a Minimal BASIC implementation on Linux this will be useful. However, I suspect mostly it will show how to generate some simple but effective code for 64bit Linux on AMD64/EM64T with a small enough system that a normal programmer can learn it and understand it. Projects like gcc and llvm are huge and learning how to modify those compilers is so hard most people just give up. This project is much simpler, which no need to learn LISP, any intermediate languages, etc. You just need to know C99 and AMD64 assembly language in the GNU gas dialect. That's still a lot, but it is a tiny fraction of what is required for production compilers.

With production compilers, especially ones with multiple front and back ends, it is easy to get overwhelmed by the amount of information required to understand it. With this Minimal BASIC compiler, that problem does not exist. Many compilers need many dependencies (cmake, or the GNU autotools, or flex/bison, etc.) but this compiler just needs make, gcc (or clang), and binutils to build. Most compilers today emit code that requires a dynamic linker. Even what gcc calls static is actually dynamic in most cases, since GNU libc wants to be able to dynamically load code to resolve things for NSS. Here the generated code doesn't link with glibc so this problem is avoided.

This project can serve as an example of how to produce programs that don't link to glibc but instead call the kernel directly. Most assembly examples for the GNU gcc that I could find are embedded asm in C or C++ programs; very little stand-alone example code exists (that I could find) for GNU as dialect beyond a few simple hello world programs. There were no good examples of dealing with floating point exceptions that I could find at the assembly level for GNU as either.

Now is this generated code great? No, there is a lot of room for improvement. I've aimed for correctness and simplicity, and have avoided all optimization for now. Exception handling for math operations is exceptionally ugly, but even the ugly code took me a long time to get working. Constructive feedback with examples of doing it in a way that is cleaner, but still gives the same results and doesn't require any external libraries, would be welcome.

Why was a new compiler not using SSE4.x or AVX?

My work machine while developing this compiler from August 2013 through August 2104 was a Core 2 Duo E4700 (Conroe), and my request for something newer was placed on hold. Since the processor I had available did not support those instructions, the original design didn't use them.

I am very pleased to report that my employer, Faculty of Informatics at Burapha University, provided me with a new Haswell refresh machine in September 2014. On September 13, 2014, I completed adding support for SSE4.1's PINSR[QD] & PINEXTR[QD] and now they are used when you supply the -4 option. The compiler still generates code for a Conroe by default, but when the -4 option is supplied the nicer SSE4.1 instruction sequences are used. Surprisingly, when testing on the i7-4790 the run time is essentially unchanged. However, reading the code is much nicer with the new instructions. Version 1.7 and greater include this work.

I have already completed scalar support for AVX, but will maintain the ability to run on ancient Conroe hardware for people who need that ability. As of version 2.00 released on May 26, 2015, AVX scalar code generation is working and complete.

Implementation Details

The overall structure over version 1.7 is best described by the structure diagram. Version 2.0 is more complex and I need to generate a new diagram for that. This parser calls the scanner for tokens and generates an abstract syntax tree. This tree is then passed to a semantic analyzer pass which updates the tree and symbol table. Finally the updated tree is passed to the high-level code generator and it makes calls to actually emit the assembly code.

Here is a list of the files that make up the compiler and their purpose.

COPYING
This contains a copy of the GNU GPL version 2 license for the compiler itself.

ChangeLog
This contains a high-level overview of changes sorted by time in ascending date order with the newest changes at the end of the file.

globals.c
and
globals.h
This contains global variables that must be shared across all modules.

scanner2.c
and
scanner2.h
This is the scanner that converts the input bytestream into tokens for the parser. This uses a hand-coded finite state machine (deterministic).

parser2.c
and
parser2.h
This contains the parser that calls the scanner2 for tokens and generates the AST used by the semantic_checks and asmgen modules. It is a hand-coded, top-down, recursive descent parser.

registers.c
and
registers.h
This is the C source code for the register allocator module.

optimizer.c
and
optimizer.h
This is the C source code for the simple optimizer that does constant folding.

tree.c
and
tree.h
This is the C source code for the n-ary tree code used to support the parser creating ASTs.

ast.c
and
ast.h
This is the C source code for the pretty-printing which traverses the AST that is generated during a parse and regenerates a semantically equivalent program.

symbol_table.c
and
symbol_table.h
This contains the symbol table module. It tracks the variables in use, as well as constants, and generates the GAS assembly data definition statements.

semantic_checks.c
and
semantic_checks.h
This contains the code that walks the AST the parser creates and performs semantic checks, symbol table population, and jump target checking with help from the lineno and symbol_table modules.

asmgen.c
and
asmgen.h
This contains the high-level assembly generating module. It walks the AST created by the parser and populated by the semantic_checks module and makes calls to the codegen module to actually emit the assembly code.

codegen.c
and
codegen.h
This contains routines that emit the GAS assembler output except for the symbol table output. It also contains the runtime functions and macros. Much of the runtime library code is from other people and is not GPLv2 like the compiler itself.

gay.c
and
gay.h
This contains routines that emit the GAS assembler output for David M. Gay's code for converting between ASCIIZ strings and floating point.

jenkins.c
and
jenkins.h
This contains routines that emit the GAS assembler output for Bob Jenkin's code for the ISAAC-64 random number generator.

shibata.c
and
shibata.h
This contains routines that emit the GAS assembler output for Naoki Shibata's SLEEF code for elementary functions.

main.c
This contains the main routine that calls everything else. It does the command-line argument processing, loads the input file into a buffer, calls the scanner to convert that into a token stream, then calls the parser to process the token stream.

dtoa5.s
This contains the assembly version for David M. Gay's dtoa.c file. The process to generate this is in the magic.txt file in the dgay sub-directory.

g_fmt_BASIC.s
This contains the assembly version for my tweaked version of David M. Gay's g_fmt.c file. The process to generate this is in the magic.txt file in the dgay sub-directory.

Makefile
This is the project build file for use by the make program.

ecma55.1
This is the man page for the compiler.

ECMA-55.TXT
This file contains the text of the ECMA-55 standard for "Minimal BASIC". This was retyped by me from the PDF version both to get a smaller file and to allow easy searching. It is in UTF-8 encoding because of the address that has an o with a hat on it.

dumpregs.s
This file contains a dumpregs procedure that will show a register dump for debugging. The compiler never links to this or uses it. The file is included for debugging of the compiler generated code at the assembly level. The BASICC script will automatically link dumpregs.o if it exists, and you can build that with 'make dumpregs.o'.

datum.dot
This is the graphviz dot source file for the diagram of the finite state machine used by the INPUT runtime subsystem. If you do not have graphviz dot, you can download the FSM diagram directly.

structure.dot
This is the graphviz dot source file for the diagram of the overall structure of the compiler. If you do not have graphviz dot, you can download the structure diagram directly.

parseinput.c
This is the C source code for the INPUT runtime subsystem. Compile with -DTROUBLE to get a trace of the states as the transitions occur.

zonermore.c
This is the C source code for the PRINT runtime subsystem.

robert1.c
This is the C source code for the RND function and RANDOMIZE statement. It is public domain and derived from Bob Jenkins' ISAAC-64.

peephole.c
This is the C source code for the special purpose peephole optimizer that finds some silly redundant pushxmm/popxmm macro sequences and removes them.

Example Showing How To Use The Compiler

There is an included

BASICC
script that you can modify that automates the compilation process. If you want 32bit math for add, subtract, multiply, and divide you can use
BASICCS
instead. If you want wide output with more significant digits and support for larger exponents, you can use
BASICCW
instead. You can optionally strip the generated executable with the strip command and it will still work, but be slightly smaller.

If you want to know the low-level details, please keep reading. The next section describes how to use the compiler directly without any scripts.

  1. Create source with text editor
    vi WHATEVER.BAS
    
    You must use a text editor that creates 7bit ASCII files. Remember, this dialect of BASIC requires line numbers and has 72 columns lines. String variables are limited to 18 bytes.

  2. Compile source to assembly language
    ecma55 WHATEVER.BAS
    
    If you have trouble, try again with -v option and redirect the output to a file to get very verbose diagnostic information and read it like this:
    ecma55 -v WHATEVER.BAS 2>&1 | tee WHATEVER.BAS.LOG | less
    
    Use 'q' to exit the less output viewer. Another advantage of using the '-v' option if you have trouble is that it will leave the partially generated assembly file for you to examine.

  3. Assemble and create object file
    as -n --divide --64 --size-check=error WHATEVER.BAS -o WHATEVER.o
    
    You will get warnings about jumps - while I can specify a .d32 suffix to instructions, the warn in the assembler fires anyway. However, the code is generated correctly. If you decide to modify the assembly code before assembling, please remember this is AT&T syntax and both the opcodes and the order of operands are different from Intel syntax. Also, the compiler emits code for the large memory model, and the code is not RIP-relative (PIC). This is not what most compilers generate by default for x86-64; it is much simpler, but may run a bit more slowly.

  4. Link and create static executable
    ld -nostdlib -z defs -z nodefaultlib -z nodlopen -z noexecstack -Bstatic \
       --no-omagic -m elf_x86_64 -o WHATEVER WHATEVER.o
    
    I do not support dynamic linking, and no system libraries are used, so this long line will carefully link the object file to create an executable that is able to run on 64bit Linux on AMD64 compatible processors (with SSE2 support, which is all of them I think) without any system libraries. If you are using dumpregs, you'll need to append dumpregs.o to the end of the ld command.

  5. Run executable
    ./WHATEVER