User:Su-steve/SmartMemReduxPaper

Title: Does reconfigurability improve compute efficiency?

Previous versions: User:Su-steve/tmp-paper

Alternative titles:

  • Smart Memories Update
  • Revisiting reconfigurable computing
  • ISCA 2000 paper, updated


Plain English abstract

An earlier paper (ISCA 2000) made certain claims about the capabilities of a proposed machine called Smart Memories. Here we are, eight years later: how well does the final machine perform with regard to the original claims?

Perhaps more importantly, what can the Smart Memories experience tell us about reconfigurable computing? What open questions might be answered? For the larger context, see Reconfigurable computing.

Notes

  - use Wikipedia to develop the paper "reconfigurable computing"
  - maybe start a thread on comp.arch

  - premise: reconfigurable computing has been evaluated in
    various ways [list the ways, including the ISCA 2000 paper
    premise that RC would provide multiple different architectures
    at little extra cost].

  - check ISCA 2000 paper: were benchmarks unaltered?
  - talk to Bill Dally: is this study interesting?

  - Next: find the apps from the Big Experiment.
  -- use the apps from the original ISCA 2000 paper:
     Imagine: fft, fir, convolve, dct.
     Hydra: compress, grep, m88ksim, wc, ijpeg, mpeg, alvin, simplex.

The MIT RAW paper (see notes below) is an example of a chip
follow-through paper that made it to ISCA.

Pointer to the original ISCA 2000 paper

K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, M. Horowitz,
Smart Memories: A Modular Reconfigurable Architecture, International Symposium on Computer Architecture, June 2000.

Smart Memories combines the memory flexibility of TriMedia [2,3], Equator [4], Mpact [5], and IRAM [6] with the high-ILP / multi-CPU capabilities of RAW.

Claims from the original ISCA 2000 paper

Assumption from the original 2000 paper: Given three different compute systems

  • CCCM, a dedicated conventional cache-coherent machine;
  • STRM, a dedicated stream machine; and
  • TLSM, a dedicated TLS machine.

Sometimes you want CCCM; sometimes you want STRM; sometimes you want TLSM; and sometimes you might discover that you want a whole different sort of machine. (Are these assumptions valid? Why, exactly, do you want each of these machines?)

Thesis of the original 2000 paper: We can build a single reconfigurable machine SM (Smart Memories) that reasonably approaches the performance of dedicated machines CCCM, STRM and TLSM and, by extension, other as-yet-unnamed machines.
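
Stated a bit more formally (a formalization of mine, not notation
from the paper): writing T_X(a) for the runtime of application a on
machine X, the claim is that for each app a coded for model M in
{CCCM, STRM, TLSM},

    T_SM(a) <= (1 + eps) * T_M(a)

for some small overhead factor eps; the mapping results quoted below
suggest eps of roughly 0.5 or less.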

Is STRM really better than CCCM, or can CCCM reasonably suffice in most cases? See Leverich paper?

Is TLSM really better than CCCM? Another paper makes this case (whose?)

Assuming TLSM is better than CCCM, might some other idea arise that's yet better than TLSM? (Yes, see TCCM. Whose paper will make this claim?)

Can redo the original experiments with the new hardware; should get similar results; will this be interesting?

Can show that the machine supports TCC, the TLS follow-on. A good claim for flexibility to adapt to unforeseen circumstances!

General idea for the new paper

Assumption, from original 2000 paper:

  - some programs run better as cache-coherent programs; others
    run better as stream programs; still others run better as TLS.

  -  SM will run all three (and more!) types of program
     with little performance loss vs. a dedicated machine.

Put together a mix of cache-coherent, streaming and TCC benchmarks.

Run the entire mix in at least three modes, get performance 
numbers for each mode:

  1. Tailored for cache-coherent execution
  2. Tailored for streaming
  3. Tailored for TCC

Then, run the entire mix with each benchmark running in its best mode.

Quantify the difference between 1, 2 and 3.
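
To make "quantify the difference" concrete, here is a minimal sketch
of the bookkeeping. The runtime numbers and the choice of geometric
mean as the aggregate are assumptions of mine, not measured data:

    /* mode_compare.c: aggregate performance of each fixed mode
       vs. a best-mode-per-app mix.  Build: gcc mode_compare.c -lm */
    #include <stdio.h>
    #include <math.h>

    #define NAPPS  4
    #define NMODES 3   /* 0 = cache-coherent, 1 = streaming, 2 = TCC */

    /* runtime[a][m]: runtime of app a in mode m (hypothetical data) */
    static const double runtime[NAPPS][NMODES] = {
        {10.0, 14.0, 13.0},   /* cache-friendly app   */
        {20.0,  9.0, 18.0},   /* stream-friendly app  */
        {15.0, 16.0,  8.0},   /* TCC-friendly app     */
        {12.0, 12.5, 12.2},   /* mode-insensitive app */
    };

    static double geomean(const double *v, int n) {
        double logsum = 0.0;
        for (int i = 0; i < n; i++)
            logsum += log(v[i]);
        return exp(logsum / n);
    }

    int main(void) {
        double col[NAPPS], best[NAPPS];
        for (int m = 0; m < NMODES; m++) {          /* fixed modes */
            for (int a = 0; a < NAPPS; a++)
                col[a] = runtime[a][m];
            printf("mode %d: geomean runtime %.2f\n",
                   m, geomean(col, NAPPS));
        }
        for (int a = 0; a < NAPPS; a++) {           /* best per app */
            best[a] = runtime[a][0];
            for (int m = 1; m < NMODES; m++)
                if (runtime[a][m] < best[a])
                    best[a] = runtime[a][m];
        }
        printf("best-per-app mix: geomean runtime %.2f\n",
               geomean(best, NAPPS));
        return 0;
    }

The gap between each fixed-mode line and the best-per-app line is the
number the experiment is after.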

Problem: can the same code run in each mode, or does it have to be modified?

Meta-question: how do we compare the same program written 
in two different styles for two different machines?

Assertion: Application FOO runs better as a streaming application on a 
streaming machine than it will ever run as a CC app on a CC machine.

Difficult question: compare CC vs. stream vs. TCC/TLS when the coding
of the app changes with the mode (see the FIR sketch below).

Simpler question: just change the cache, don't change the code.
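
To see why the difficult version is difficult, here is the same
13-tap FIR (one of the Imagine kernels listed above) sketched in both
styles. The block staging is a generic stream-style idiom of mine,
not actual Imagine or SM code; memcpy stands in for whatever bulk
stream-transfer or DMA primitive the machine provides:

    #include <string.h>

    #define TAPS 13
    #define BLK  256

    /* Cache-coherent style: an ordinary loop; the cache moves data. */
    void fir_cc(const float *x, const float *h, float *y, int n) {
        for (int i = 0; i + TAPS <= n; i++) {
            float acc = 0.0f;
            for (int t = 0; t < TAPS; t++)
                acc += h[t] * x[i + t];
            y[i] = acc;
        }
    }

    /* Stream style: explicitly stage each block (plus a TAPS-1 halo)
       into a local buffer, compute, then write the block back.  On a
       real stream machine the memcpy calls would be bulk stream
       loads/stores into a scratchpad. */
    void fir_stream(const float *x, const float *h, float *y, int n) {
        float in[BLK + TAPS - 1], out[BLK];
        for (int base = 0; base + TAPS <= n; base += BLK) {
            int m = n - TAPS + 1 - base;      /* outputs this block */
            if (m > BLK) m = BLK;
            memcpy(in, x + base, (m + TAPS - 1) * sizeof(float));
            for (int i = 0; i < m; i++) {
                float acc = 0.0f;
                for (int t = 0; t < TAPS; t++)
                    acc += h[t] * in[i + t];
                out[i] = acc;
            }
            memcpy(y + base, out, m * sizeof(float));
        }
    }

The arithmetic is identical; what changes is who moves the data.
That difference is exactly what makes comparing "the same program"
across modes slippery.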

Meta-question: are users willing to recode for higher performance?

Meta-question: is programming language A easier or more efficient 
etc. than programming language B?

Ongoing open questions:  how do we find the optimal configuration 
for a given app?  And/or is this an interactive process?


Open questions

1. Dusty deck: are users willing to recode their app for better performance?


Notes from the ISCA 2000 paper

The ISCA 2000 paper indicated that RC would be able to provide 
multiple programming models in a single RC chip, at only a slight 
cost to performance versus a rival chip tuned for a given model.

In other words, TLS apps running on RC would run approximately as well
as TLS apps running on Hydra.  Streaming apps running on RC would run
approximately as well as streaming apps running on Imagine.  The
implied (?) upside for RC was that a single server with
an RC chip would be able to run a mix of TLS and streaming apps better
than a server with either a streaming or TLS chip alone.  Unproven(?) assumptions
include the following: 1) apps coded for and running on a TLS machine run better than apps
coded for and running on a GP CMP; 2) apps coded for and running on a
streaming machine run better than apps coded for and running on a GP
CMP; 3) people are willing to rewrite their apps for a special-purpose
processor so as to gain some advantage.

------------------------------------------------------------------------
From ISCA 2000:

    The 8-cluster Imagine is mapped to a 4-tile Smart Memories quad.

    ...

    The kernels simulated - a 1024-point FFT, a 13-tap FIR filter, a
    7x7 convolution, and an 8x8 DCT - were optimized for Imagine and
    were not re-optimized for the Smart Memories architecture.

    ...

    The Hydra speculative multiprocessor enables code from a
    sequential machine to be run on a parallel machine without
    requiring the code to be re-written [34][35]. A pre-processing
    script finds and marks loops in the original code. At run-time,
    different loop iterations from the marked loops are then
    speculatively distributed across all processors.

------------------------------------------------------------------------

[34] L. Hammond, et al. Data Speculation Support for a Chip
Multiprocessor. In Proceedings of Eighth International
Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS VIII), pages
58-69, Oct. 1998.

[35] K. Olukotun, et al. Improving the Performance of Speculatively
Parallel Applications on the Hydra CMP. In Proceedings
of the 1999 ACM International Conference on
Supercomputing, June 1999.
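
To illustrate the loop-speculation mechanism quoted above, a
hypothetical sketch (the pragma is invented, not Hydra's actual
marking; per the quote, a preprocessing script would insert the mark
automatically):

    #define NBUCKETS 1024

    void count_keys(const unsigned *key, int n, int *hist) {
        /* hypothetical speculation marker, not Hydra syntax */
        #pragma tls_loop
        for (int i = 0; i < n; i++) {
            /* cross-iteration dependence only when two keys collide;
               the TLS hardware detects the conflict and re-executes
               the later iteration, so sequential semantics hold
               without rewriting the loop */
            hist[key[i] % NBUCKETS]++;
        }
    }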

------------------------------------------------------------------------
Point of the original ISCA 2000 paper: SM can run Imagine-specific and
Hydra-specific benchmarks at speeds similar to that of Imagine and
Hydra (i.e., within roughly 50%).  Implication was that
Imagine would do poorly on the Hydra benchmarks and/or Hydra would do
poorly on the Imagine benchmarks, but that SM would do reasonably well
in either mode.  Missing was the overall performance comparison for a
combined Hydra/Imagine suite as measured on 1) Imagine, 2) Hydra and
3) Smart Memories.  Results would presumably look something like:

               Imagine     Hydra    Smart Memories
               -------     -----    --------------
 Imagine apps:  Good       Bad         Okay
   Hydra apps:  Bad        Good        Okay
Combined apps:  Bad        Bad         Good

Assumptions:

  Certain applications run better when re-coded for, and run on, a
  different style of architecture.  For instance, applications suited
  for streaming will run better on an Imagine processor than on a
  general-purpose processor.  Applications suited to multithreading
  will run better on a Hydra-like processor.

  Imagine is a reasonable target for applications that stream data.

  Hydra is a reasonable target for applications that can take good
  advantage of multithreading.

  Imagine, Hydra and Smart Memories simulators are sufficiently
  accurate, individually and in tandem, such that results are valid.
  I.e., not only must the characteristics and flaws of an Imagine
  simulator map to those of an actual Imagine processor, it must also
  reasonably match the characteristics and flaws of the Smart Memories
  simulator to which it is being compared, and so on.

Open questions:

  How hard is it to recode my applications for streaming, or for
  multithreading?

  How does actual performance of Smart Memories, Imagine, Hydra
  compare to a known latest-and-greatest real processor, such as
  PowerPC, SPARC or x86?

New paper 1:

  Concentrate on SPECthroughput.  This requires no recoding of
  applications.  Uses unaltered industry-standard applications of
  known characteristics.

  Compare Smart Memories only to itself.  This removes all the open
  variables associated with Imagine, Hydra, TCC or other theorized
  processor or system.

Leaves only two assumptions/open questions:

  How well does Smart Memories simulator match the actual chip?

  How does Smart Memories performance compare to that of a
  latest-and-greatest real processor?
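
For reference, a sketch of the SPECrate-style arithmetic that
"SPECthroughput" presumably implies: run several copies of each
benchmark, score each benchmark as copies * ref_time / elapsed_time,
and take the geometric mean. All numbers below are made up; real runs
would use SPEC's published reference times.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* per-benchmark reference and measured times, in seconds */
        double ref[]     = {1400.0, 1800.0, 1100.0};
        double elapsed[] = { 210.0,  300.0,  160.0};
        int copies = 4, nb = 3;
        double logsum = 0.0;
        for (int b = 0; b < nb; b++) {
            double score = copies * ref[b] / elapsed[b];
            printf("benchmark %d: %.1f\n", b, score);
            logsum += log(score);
        }
        printf("throughput (geomean): %.1f\n", exp(logsum / nb));
        return 0;
    }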

Proving the web-page claim


The web page, http://www-vlsi.stanford.edu/smart_memories/, says: "It
[Smart Memories] is a single chip multi processor system with coarse grain
reconfiguration capabilities, for supporting diverse computing models,
like speculative multi-threading and streaming architectures. These
features allow the system to run a broad range of applications
efficiently."  Do we still believe this? Have we shown it to be true?

The broader implication is that SM will run this wide range of
applications more efficiently than an equivalent general-purpose CMP,
presumably because of its reconfigurable memory system.

To prove this, we will need
  * multi-thread application(s) of interest;
  * streaming applications of interest;
  * TCC applications of interest.


Experiments we can do now:


  - Configure SM as a general-purpose multi-thread machine, run all
applications, and note performance A.  Tailor SM to each individual
app and note performance B.  Compare individual and aggregate
performance A to individual and aggregate performance B.

Experiments we have done


Experiments we have already done, based on the list of papers on the
web site (http://www-vlsi.stanford.edu/smart_memories/papers.html):

  - Compare stream performance on SVM to actual machine hardware, show
that SVM is a good predictor of actual performance.  Hardware: ATI,
Nvidia, Imagine.  Applications: Matrix Vector-Multiply, 2D FFT, Image
Segmentation.  Paper: F. Labonte, P. Mattson, I. Buck, C. Kozyrakis
and M. Horowitz, "The Stream Virtual Machine," PACT, September 2004.

  - Paper: K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally,
M. Horowitz, Smart Memories: A Modular Reconfigurable Architecture,
ISCA, June 2000.  "To show the applicability of this design, two very
different machines at opposite ends of the architectural spectrum, the
Imagine stream processor and the Hydra speculative multiprocessor, are
mapped onto the Smart Memories computing substrate. Simulations of the
mappings show that the Smart Memories architecture can successfully
map these architectures with only modest performance degradation."

Experiments: 1) Imagine vs. Imagine-on-SM; 2) Hydra vs. Hydra-on-SM.

Applications, Imagine: fft, fir, convolve, dct.
Applications, Hydra: compress, grep, m88ksim, wc, ijpeg, mpeg, alvin, simplex.

Conclusions: "The overheads of the coarse-grain configuration that
Smart Memories uses, although modest, are not negligible; and as the
mapping studies show, building a machine optimized for a specific
application will always be faster than configuring a general machine
for that task. Yet the results are promising, since the overheads and
resulting difference in performance are not large. So if an
application or set of applications needs more than one computing or
memory model, our reconfigurable architecture can exceed the
efficiency and performance of existing separate solutions."

Or, more concisely: SM's performance is comparable (?) to that of
non-reconfigurable hardware for two very different (?) architectures.

Missing: *wc* did well on Imagine, poorly on Hydra.  How would
suite-wide performance compare for Hydra vs. Imagine vs. tuned-per-app
SM?

  - Paper: R. Ho, K. Mai, and M. Horowitz, The Future of
Wires. Proceedings of the IEEE, April 2001, pp. 490-504.
"...increased delays for global communication will drive architectures
toward modular designs with explicit global latency mechanisms."

  - Paper: J. Leverich, H. Arakida, A. Solomatnikov, A. Firoozshahian,
M. Horowitz, C. Kozyrakis, "Comparing Memory Systems for Chip
Multiprocessors," International Symposium on Computer Architecture,
June 2007.  "...our results indicate that there is not sufficient
advantage in building streaming memory systems where all on-chip
memory structures are explicitly managed.  On the other hand, we show
that streaming at the programming model level is particularly
beneficial, even with the cache-based model, as it enhances locality
and creates opportunities for bandwidth optimizations. Moreover, we
observe that stream programming is actually easier with the
cache-based model because the hardware guarantees correct, best-effort
execution even when the programmer cannot fully regularize an
application's code."

MIT RAW paper, ISCA 2004

http://cag.csail.mit.edu/raw/documents/raw_isca_2004.pdf

"Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x to 9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw."

"an operation of the form c = a + b in a load-store RISC architecture will require a minimum of 4 operations – two loads, one add, and one store. Stream architectures such as Raw can accomplish the operation in a single operation (for a speedup of 4x) because the processor can issue bulk data stream requests and then process data directly from the network without going through the cache."

"The evaluation for this paper makes use of a validated cycle-accurate simulator of the Raw chip. Using the validated simulator as opposed to actual hardware allows us to better normalize differences with a reference system, e.g., DRAM memory latency, and instruction cache configuration."

"For fairness, this comparison system must be implemented in a process that uses the same lithography generation, 180 nm."

"Much like a VLIW architecture, Raw is designed to rely on the compiler to find and exploit ILP. We have developed Rawcc [5, 24, 25] to explore these compilation issues. Rawcc takes sequential C or Fortran programs and orchestrates them across the Raw tiles in two steps. First, Rawcc distributes the data and code across the tiles to attempt to balance the tradeoff between locality and parallelism. Then, it schedules the computation and communication to maximize parallelism and minimize communication stalls."

"unmodified Spec applications stretch [the rawcc compiler's] robustness. We are working on improving the robustness of Rawcc."

"The speedups attained in Table 8 shows the potential of automatic parallelization and ILP exploitation on Raw. Of the benchmarks compiled by Rawcc, Raw is able to outperform the P3 for all the scientific benchmarks and several irregular applications."

"We present performance of stream computations for Raw... We present two sets of results. First we show the performance of programs written in StreamIt, a high level stream language, and automatically compiled to Raw. Then, we show the performance of some hand written applications."