THE JELLYBEAN MACHINE

MIT Artificial Intelligence Laboratory

This is http://www.cva.stanford.edu/projects/j-machine/cva_j_machine.html

Last updated July 7, 1998

The J-Machine is a fine grained concurrent computer designed by the MIT Concurrent VLSI Architecture group (now located at Stanford University) in conjunction with Intel Corporation.

Pictures of J-machine Hardware

1024-node J-machine with cover removed.

64-node J-machine board.

SCSI Interface for the J-machine.

Host Interface board for the J-machine.

Detail Showing SPARC host and interface to board stack.

Note

At the moment, we're trying to bring a few more users on-line with creative applications. If you have a cycle or bandwidth hungry application please contact Andrew Chang or Richard Lethin.

Overview

The J-machine project was started at MIT in about 1988 as an experiment in message-passing computing based on work that Bill Dally did at Caltech for his doctoral dissertation.

The work was driven by the VLSI philosophy "processors are cheap" and "memory is expensive". This philosophy is based on a idealistic view of VLSI economics, in which the cost of a function is based on the VLSI area dedicated to it. Although the standard view is that processors are much more expensive than memory (and this standard view was very much true before levels of VLSI integration allowed processors to be integrated on a single chip), if we look at a typical workstation with 32 Mbyte of memory, the amount of silicon area dedicated to memory is roughly 100 times that for the CPU. A bit of DRAM is 100 lambda^2, so 32Mbyte is 32G lambda^2, versus the arithmetic units in the CPU, which are about 300M lambda^2.

Of course, this ignores issues related to the relative production volumes and process technologies for logic vs. memory, and runs against the current "wisdom" that the best way to build a fast parallel processor is to bolt a network-interface and coherent-cached-shared-memory hardware onto a standard microprocessor. However, we're interested in technology imperatives much more than market imperatives.

With CPUs so cheap, in the silicon-area sense, the J-machine project set out to explore an architecture in which the processors are more "liberally scattered" through the machine. We envisioned a component with economies of scale like that for DRAMs: a "message-driven processor" with a small processor and network interface integrated *with* the memory. The "J" in "J-machine" stands for "Jellybean", in the sense that the processors would be cheap and plentiful, like jellybean candies.

The design of the J-machine incorporated several novel technologies. The machine architects immediately realized that a key to performance would be fast, low-overhead messaging. The processor and network interface are tightly coupled, so that user-level messages can be sent with very little overhead for copying. The network is a 3-dimensional deterministic wormhole-routed mesh. User-level message handlers dispatch on message arrival, with a small amount of queuing in on-chip memory at the destination. We also dispatch to handlers before the message has completely arrived, trying to speed the dispatch process.

The communication capacity of the J-machine is pattern-dependent, of course. Each node can inject into the network at 2 words (72 bits) per clock. Messages travel over the links in 18-bit "flits" at 12.5MHz. Message reception is into on-chip memory, buffered in "4-word (144-bit) queue row buffers" that write in one clock cycle to minimize interference with processing and instruction fetching. The major bisection of a 1024 node machine is 8 by 8 channels, each 18 bits wide and running at 12.5 Mhz, or 1.8 Gbyte/sec. Microbenchmark studies have shown that random traffic patterns can achieve about 40% utilization of this bandwidth.

Each processor node has about 4kb of on-chip memory and 1Mbyte of off-chip external memory. The off-chip memory was added when we discovered the amount of memory we could put on-chip in the available ~1989 process technology was too small. We'd have prefered to have put the memory on-chip and have had more processors. Furthermore, pinout constraints on our packaging technology restricted us to a narrow interface to external memory. This remote and narrow path result in an external access latency of 6 clock cycles, vs. one clock for the on-chip memory.

The node's processor itself is very modest: it runs at 10MHz internal clock, with a limited number of registers, and no floating point hardware, for performance that's about equivalent to a 25Mhz 386.

The J-machine also incorporates mechanisms to support a concurrent object-oriented programming model over a global object space. Instructions are included for a quick hashed lookup in an on-chip table. This can be used for caching, roughly equivalent to a TLB, except that it is managed by the compiler/os and policy is not set by the hardware. This is used to translate "object identifiers" (equivalent to a segmented global virtual address) to the node which handles methods for the object and to the memory address on the local node where the object resides in a single instruction.

Hardware also supports dynamic typing: each 32-bit word is tagged with 4 bits. Hardware instructions like "add" will trap if they see a type other than integer. The tag is also used to identify "futures"; these are used for place-markers for the results of asynchronous method calls. An attempt to use a future by an arithmetic instruction results in a fault; the OS suspends the thread waiting for the arrival of the result.

We currently have two programming environments for the J-machine.

Concurrent Smalltalk is a concurrent object-oriented language which looks more like Scheme than smalltalk. It offers support for concurrent distributed objects, a global address space, object migration, explicit object placement, inheritance, concurrency, futures, locks, etc. All computation is done in user level message handlers. This a really neat and underrated language, but it requires a bit of an adventurous spirit since it is experimental. Waldemar Horwat's revised CST manual is available by clicking here. (Note: ghostview has problems with this; this HREF prints ok though.) Someone should port this to something like the CM-5 or T3D to expand its user base.

Message-Driven C was developed by our collaborators at Caltech. The programming model is MIMD with asynchronous function call invocations around the machine. Daniel Maskit's MS thesis discussing this can be accessed by clicking here.

Three 1024-node J-machine systems have been built, and live at MIT, Caltech, and Argonne National Research Labs.

The 1024-node J-machine at MIT is hosted by the machine jelly-donut.ai.mit.edu, eg. it's on the Internet. The 1024-node machine has a peak performance of 1G instructions/sec, peak memory bandwidth of 6 GB/sec to external memory, 1.28 GB/sec bandwidth across the central bisection. The J-machine also includes a dedicated filesystem and a distributed graphics system.

Architectural Evaluation

Noakes, Michael D. and Wallach, Deborah A. and Dally, William J. "The J-Machine Multicomputer: An Architectural Evaluation, Proceedings of the 20th International Symposium on Computer Architecture, 1993.

Spertus, Ellen and others, "Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5", Proceedings of the International Symposium on Computer Architecture, 1993.

Fatovic, Jerko, "A Ray Tracer for the J-Machine", MS Thesis, Massachusetts Institute of Technology Department of Electrical Engineering, May, 1992.

Shaun Yoshie Kaneshiro "Branch and Bound Search on the J-Machine, MS Thesis, Massachusetts Institute of Technology Department of Electrical Engineering, September 1993.

Programming Systems

Chien, Andrew and Dally, William J., "CST: An Object-Oriented Concurrent Language" Object-Based Concurrent Programming Workshop, September, 1988, Conference held at San Diego, CA. SIGPLAN Notices, February 1989.

Dally, William J., "The J-Machine: System Support for Actors", in Towards Open Information Science, Editors, Hewitt, Carl and Agha, Gul, MIT Press, 1992.

Horwat, Waldemar, "A Concurrent Smalltalk Compiler for the Message-Driven Processor, MIT AI Memo, 545 Technology Sq., Cambridge, MA 02139, May, SB Thesis.

Horwat, Waldemar, Concurrent Smalltalk on the Message-Driven Processor, Master's Thesis, MIT, May 1989.

Horwat, Waldemar and Totty, Brian and Dally, William J., "COSMOS: An Operating System for a Fine-Grain Concurrent Computer", ?unpublushed?

Horwat, Waldemar and Andrew Chien and William J. Dally, "Experience with CST:Programming and Implementation", Proceedings of the ACM SIGPLAN 89 Conference on Programming Language Design and Implementation, 1989.

Horwat, Waldemar, "Revised CST Manual Version ?", CVA Memo #?. VERY WORTHWHILE (Note: this HREF points to a .ps file which prints but gives trouble to ghostview).

Totty, Brian, "An Operating Environment for the Jellybean Machine", SB Thesis, Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science, May, 1988.,

Design and Development

Dally, William J. and Fiske, J.A. Stuart and Keen, John S. and Lethin, Richard A. and Noakes, Michael D. and Nuth, Peter R. and Davison, Roy E. and Fyler, Gregory A., "The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms", "IEEE Micro", April, 1992.

Dally, William J. and others}, "Design and Implementation of the Message-Driven Processor", Proceedings of the 1992 Brown/MIT Conference on Advanced Research in VLSI and Parallel Systems, MIT Press, March, 1992.

Dally, William J. and others, The Message-Driven Processor: An Integrated Multicomputer Processing Element, Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, IEEE Press, 1992.

Lethin, Richard A. and Dally, William J., "MDP Tools and Methods", Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, 1992.

Richard A. Lethin, A Simulator for the Message-Driven Processor, Master's Thesis MIT, 1991.

Noakes, Michael and Dally, William J., "System Design of the J-Machine", Sixth MIT Conference of Advanced Research in VLSI, The MIT Press, 1990.

Nuth, Peter R. and Dally, William J., "The J-Machine Network", Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors}, October, 1992.

Early Architecture

Chao, Linda, "Architectural Features of a Message-Driven Processor", MIT SB Thesis, May, 1987.

Dally, William J. and others, "Architecture of a Message-Driven Processor", Proceedings of the 14th International Symposium on Computer Architecture, 1987.

Dally, William J. and Seitz, Charles L., "Deadlock Free Message Routing in Multiprocessor Interconnection Networks", IEEE Transactions on Computing, Volume C-36, May, 1987.

Dally, William J., Fine-Grain Message Passing Concurrent Computers, Proceedings of the Third Conference on Hypercube Concurrent Computers, Pasadena, CA, 1988.

Dally, William J., "The J-Machine System", in Artificial Intelligence at MIT: Expanding Frontiers, editor Patrick Winston with Sarah A. Shellard, MIT Press, 1990.,

Dally, William J. and Kajiya, James T., An Object Oriented Architecture, Proceedings of the 12th International Symposium on Computer Architecture, 1985.

Dally, William J., A VLSI Architecture for Concurrent Data Structures, Kluwer Academic Publishers, 1987.

Miscellaneous

An experimental QCD code has been implemented on the J-machine by Richard Lethin and Robert Rippel in order to evaluate communication and computation behavior (report in PostScript).

Marcus Alexander von Kapff, Industry Assesment and Market Analysis of Massively Parallel Computers, Master's Thesis, Sloan School at Massachusetts Institute of Technology, May, 1993.

[email protected]