“Let’s Build a Compiler”

Ever wanted to write your own programming language? Defining one is easy (relatively speaking). Building the compiler for it…well, not so much. Usually, that’s where most people (including myself) stop. To address that, Jack W. Crenshaw started a series titled Let’s Build a Compiler, which covers the basics of building a simple compiler before working toward a real one as the series continues. It was never finished; the last installment appeared in 1995. Despite stopping short of a “real” compiler, it is an excellent starting point for anyone who wants to start building one.

The code used in the series was designed for Turbo Pascal 4.0 (from back in the 1980s). While the code is old and that compiler isn’t used anymore, it still compiles thanks to the Free Pascal project. Alternatively, just grab a copy of Bloodshed’s open source Dev-Pascal IDE, which includes Free Pascal.

The assembly code being generated targets the Motorola 68000 with the SK*DOS operating system. I don’t know of anyone with a device still running that processor and operating system, so the code may not be very useful in itself. However, M68K assembly is by far easier to read and understand than more modern assembly languages such as x86. That really helps with understanding the code generator for those (including, again, myself) who have never written anything in assembly.
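
To give a feel for this, here is a minimal sketch in C (one of the port languages mentioned later in this post) of the kind of single-digit expression compiler the series starts with, together with the 68000 code it emits. The structure only loosely follows the series; the names and details are my own, not Crenshaw’s Pascal:

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    static int look;                          /* lookahead character */

    static void next(void) { look = getchar(); }

    static void expected(const char *what)
    {
        fprintf(stderr, "Error: %s expected\n", what);
        exit(1);
    }

    /* <term> ::= <digit>  -- load a literal into D0 */
    static void term(void)
    {
        if (!isdigit(look)) expected("integer");
        printf("\tMOVE #%c,D0\n", look);
        next();
    }

    /* <expression> ::= <term> [ ('+'|'-') <term> ]* */
    static void expression(void)
    {
        term();
        while (look == '+' || look == '-') {
            int op = look;
            next();
            printf("\tMOVE D0,D1\n");         /* save the left operand */
            term();                           /* right operand lands in D0 */
            if (op == '+') {
                printf("\tADD D1,D0\n");      /* D0 := left + right */
            } else {
                printf("\tSUB D1,D0\n");      /* D0 := right - left */
                printf("\tNEG D0\n");         /* flip to left - right */
            }
        }
    }

    int main(void)
    {
        next();
        expression();     /* e.g. echo "1+2-3" | ./tiny */
        return 0;
    }

Run on the input “1+2”, it prints MOVE #1,D0, MOVE D0,D1, MOVE #2,D0 and ADD D1,D0, which is easy to follow even if you have never written assembly before.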

Not all usefulness is lost for the generated code, however! The EASy68K project provides an open source editor, assembler, and simulator for M68K assembly code. You can use it to make sure that the compiler works. Note that you will have to change the Header() procedure, which generates an SK*DOS-specific header (“WARMST EQU $A01E”) that is of no use here. Once you change that to something like “ORG $1000”, you should be able to run the generated code properly.
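
In the tutorial’s Pascal source this amounts to changing a single output line. As a rough illustration only (the function name and formatting here are my own, in C rather than Pascal), the swap looks something like this:

    #include <stdio.h>

    /* Illustrative sketch: emit the assembly header. The tutorial's version
       writes an SK*DOS warm-start equate; for EASy68K an ORG directive that
       places the code is enough. */
    static void header(void)
    {
        /* printf("WARMST\tEQU $A01E\n"); */   /* original, SK*DOS-specific */
        printf("\tORG\t$1000\n");              /* EASy68K-friendly replacement */
    }

    int main(void)
    {
        header();
        return 0;
    }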

If you really have that much of an issue with Pascal and M68K, there are alternatives:

  • A translation of the entire series to Forth and x86 from 2005.
  • A version ported to C that generates P-code. The site is unavailable at the time of this post, so I’m attaching the compiler to the post. Note that it isn’t a direct translation and there are many changes.
  • Two more versions of the Tiny language derived from the series, on this page, one written in C and the other in C++. The first, attached to the first post, is in C and generates Propeller Assembly (PASM) code for small devices; I tested it and it works quite well. The second, written in C++, is attached to the 8th post and generates x86 assembly for GNU/Linux that is compatible with the GNU Assembler, so it may be of value to those wanting to extend their compilers for real-world use.
  • For those running on Intel Macs (I couldn’t find any PowerPC version), here is another translation to C that produces x86 assembly with Mac headers.
  • Finally, a Haskell version complete with its own tutorial paralleling the original. The code it generates is x86 and appears to work on Mac and Windows (I have not tested it, so I can’t say for sure).

That should be more than enough to get you started on your own compiler. So how exactly should you write it? That’s really up to you, but I would recommend first writing the compiler in whatever language you choose while keeping the given M68K code generator, due to its readability. After thoroughly understanding the compiler and the code it produces, start writing a code generator for x86 using any of the above implementations as a reference.

Keep in mind that the Tiny compiler is designed for learning rather than for production. The code it produces is horrible at best (though it may still beat out certain interpreters, which will remain unnamed here). For one, absolutely no optimization is performed, so don’t expect to use it in a production environment where other compilers exist. You will still have to read through other (dizzying) compiler books before that becomes a possibility.
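
To make that concrete, here is a small hypothetical illustration (the function and its parameters are mine, not part of Tiny) of the kind of optimization it never attempts: folding an addition of two constants at compile time instead of emitting the full load/save/add sequence.

    #include <stdio.h>

    /* Emit 68000 code for "left + right". A generator with no optimization
       takes the first branch; even a trivial folding pass could take the second. */
    static void emit_add(int left, int right, int fold_constants)
    {
        if (!fold_constants) {
            printf("\tMOVE #%d,D0\n", left);
            printf("\tMOVE D0,D1\n");
            printf("\tMOVE #%d,D0\n", right);
            printf("\tADD D1,D0\n");
        } else {
            printf("\tMOVE #%d,D0\n", left + right);  /* folded at compile time */
        }
    }

    int main(void)
    {
        emit_add(1, 2, 0);   /* what an unoptimizing generator emits */
        emit_add(1, 2, 1);   /* what even a trivial folding pass would emit */
        return 0;
    }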

I mentioned earlier that I gave up writing an actual compiler from scratch. I’m currently in the process of writing another one and am following the guidelines I set out above. Perhaps by next year, it will finally be usable. Until then, there’s still a lot of work to be done.

Good luck to all those compiler writers out there!

Preliminary Results for Lua and Python (Cygwin)

Here are some preliminary benchmark data I’ve obtained for Python and Lua. Many of the programs failed, probably because they required extra libraries which I did not install. The following results show only those benchmarks where BOTH a Lua and a Python program worked. In cases where multiple programs per language worked, only the fastest (on average) was taken. Each program was run 6 times and the average of the last 5 was taken (except nbody, which was run 10 times with the average of the last 9 taken); a rough sketch of that timing loop appears after the setup list below. The unit used is seconds (lower = better):

  Benchmark      Lua (s)   Python (s)
  binarytrees     0.939       1.789
  fasta           0.872       2.362
  mandelbrot      0.904       1.399
  nbody           1.672       4.357
  spectralnorm    0.912      13.924

Using:

  • Lua 5.1.4 (Cygwin 1.7)
  • Python 2.6.5 (Cygwin 1.7)
  • Windows XP SP3
  • Intel Pentium 4 Processor: 2.8 GHz (single-core)
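
For reference, here is a rough sketch in C of the timing loop described above (run the program six times, discard the first run as a warm-up, and average the rest). It is not the harness I actually used, and the command line is only a placeholder:

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const char *cmd = "lua binarytrees.lua 16 > /dev/null";  /* placeholder */
        const int runs = 6;
        double total = 0.0;

        for (int i = 0; i < runs; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            if (system(cmd) != 0) {
                fprintf(stderr, "run %d failed\n", i + 1);
                return 1;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double elapsed = (t1.tv_sec - t0.tv_sec)
                           + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("run %d: %.3f s\n", i + 1, elapsed);
            if (i > 0)                     /* first run is a warm-up */
                total += elapsed;
        }
        printf("average of last %d runs: %.3f s\n", runs - 1, total / (runs - 1));
        return 0;
    }
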

Due to the uncontrolled nature of this run, the results should not be taken seriously. Nonetheless, they do allow for some (potentially erroneous) generalizations. For one, Lua appears to be significantly faster than Python on Cygwin. This could be because Python uses some slow OS-dependent features that Cygwin then needs to emulate as POSIX, slowing things down. I don’t know the inner workings of Python, so I have no idea what Cygwin could possibly be emulating for it. I do know that Lua is essentially ANSI C (except for the library-loading code, which is not used in these tests), so there’s probably not much for Cygwin to emulate.

More results may come in the near to distant future when I get more languages working.

Programming Language Benchmarks

Ever wonder why a program you wrote runs so slowly? Have you squeezed your brain out trying to optimize it with no luck? I’ve been in that situation many times. Usually, it’s something wrong with the code that I find after hours of searching. And then there are those other times when even that does no good. When that happens, the problem could be in the compiler or interpreter you are using.

To confirm whether that is the problem, there’s a website on Debian’s Alioth (similar to SourceForge) called the Computer Language Benchmarks Game (formerly the Great Computer Language Shootout, renamed in 2007 following the Virginia Tech shooting) that provides VERY detailed benchmark results for various languages. It also used to have results for different implementations of the same language, but those were removed from the website in favor of simplicity, which is unfortunate because some of the fastest implementations of a language are often the non-standard ones.

All is not lost, however. The website is a free and open source project under the revised BSD license. Everything from the benchmark programs to the measurement scripts to the collected data, and even the website itself, is included in that project and is available through the project’s CVS, so you can run your own benchmark tests. Be warned, though, that the CVS snapshot is HUGE: 234 MB when extracted. If you don’t want to deal with that much, you can also download just the Python script used to produce the benchmarks (plus 3 benchmarks: 2 for Python, 1 for Java).

If you can get it to work, please share your own results. I will publish my results if and when I get the script running.