new benchmark: core-2-core transfer-speed

It's often said that the single-die design of AMD's Phenom is
superior to Intel's dual-die solution. Some time ago I wrote a
small benchmark that measures the speed of transfers from one
CPU core to another. It measures the throughput and latency of
linear and random memory accesses to cache lines that another
core has written just before. I ran it on my 3 GHz Core 2 Extreme
quad-core, and someone gave me numbers from running it on a
Phenom overclocked to 2.53 GHz.
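
To make the idea concrete, here is a minimal sketch of such a
hand-off measurement. It is not the benchmark's actual source:
the name measure_linear_throughput and its parameters are
hypothetical, it uses portable C++ threads, and it does not pin
the threads to particular cores (on Win32 that would be done
with something like SetThreadAffinityMask), so it only
illustrates the principle.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: a producer thread fills a block, a consumer thread
// reads it linearly, and the two take turns for `rounds` rounds.
// Returns the consumer's read throughput in bytes per second.
// Assumption: no core pinning here, so the OS decides where threads run.
double measure_linear_throughput(std::size_t block_bytes, int rounds,
                                 std::uint64_t* checksum_out = nullptr)
{
    assert(block_bytes % sizeof(std::uint64_t) == 0 && block_bytes > 0);
    std::vector<std::uint64_t> block(block_bytes / sizeof(std::uint64_t));
    std::atomic<int> turn{0};   // even: producer's turn, odd: consumer's turn
    std::uint64_t checksum = 0;
    std::chrono::duration<double> read_time{0};

    std::thread producer([&] {
        for (int r = 0; r < rounds; ++r) {
            // yield() keeps the sketch well-behaved on machines with few
            // cores; a real benchmark would more likely spin without it.
            while (turn.load(std::memory_order_acquire) != 2 * r)
                std::this_thread::yield();
            for (std::size_t i = 0; i < block.size(); ++i)
                block[i] = static_cast<std::uint64_t>(r) + i;
            turn.store(2 * r + 1, std::memory_order_release);
        }
    });

    std::thread consumer([&] {
        for (int r = 0; r < rounds; ++r) {
            while (turn.load(std::memory_order_acquire) != 2 * r + 1)
                std::this_thread::yield();
            auto t0 = std::chrono::steady_clock::now();
            std::uint64_t sum = 0;
            for (std::uint64_t v : block)   // linear read of the fresh block
                sum += v;
            auto t1 = std::chrono::steady_clock::now();
            read_time += t1 - t0;           // only the read phase is timed
            checksum += sum;                // keeps the reads from being dropped
            turn.store(2 * r + 2, std::memory_order_release);
        }
    });

    producer.join();
    consumer.join();
    if (checksum_out) *checksum_out = checksum;
    return static_cast<double>(block_bytes) * rounds / read_time.count();
}
```

Calling measure_linear_throughput(16 * 1024, 1000) would give a
rough bytes-per-second figure for a 16 kB block; without pinning
the two threads to chosen cores, the result says little about any
particular core pair.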

The numbers were surprising to me:

- First, it seems AMD didn't manage to get a real advantage from
   its single-die design. Random memory accesses to a 16 kB
   block that another core has just written into its L1 cache
   reach a throughput of about 1 GB/s (!) on the Phenom, whereas
   my Core 2 Extreme reaches 3.9 GB/s between cores on the same
   die and 2.7 GB/s between cores on different dice.
   The linear throughput for the same block size is about 3 GB/s
   on the Phenom system; on my Core 2 Extreme it is 6 GB/s
   between cores on the same die and 4.8 GB/s between cores on
   different dice.
- Second, on Core 2-based CPUs, transferring from the L1 cache of
   one core to the L1 cache of another core on the same die is
   slower than when the data has first been written back to the
   shared L2 cache and is transferred from there to the
   destination core.

Some people on a German board mentioned that these tests aren't
meaningful for real-world purposes because I only probe transfers
from one core to another in one direction while the other cores
do nothing.
So I wrote a new benchmark for Win32 with configurable behaviour
for:
- the pattern:
   Linear measures the throughput of linear memory accesses;
   random measures the throughput and latency of random memory
   accesses. (Measuring the latency of linear accesses doesn't
   make sense in my case because I don't do pointer-chasing on
   linear accesses, so the memory accesses become pipelined.)
- the direction - unidirectional vs. bidirectional:
   In a unidirectional transfer, one core produces some data
   and another consumes it; in a bidirectional transfer, both
   cores are producers and consumers.
- the block-size:
   The block is the entity produced by the thread on one core
   and consumed by the thread on the other core. Block-sizes
   range from "4k" to "64m".
- producers and consumers:
   You can give the benchmark a number of core pairs to test.
   When benchmarking unidirectional transfers, the first core
   of a pair is the producer and the second is the consumer;
   when benchmarking bidirectional transfers, both are
   producers and consumers.
   The core numbers run from 1 to N, where N is the number of
   cores in the system. With Intel's quad-cores, the cores on
   the same die are 1 and 2 or 3 and 4 (this relies on the APIC
   IDs, and I've never seen a BIOS that does this differently).
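
The pointer-chasing mentioned for the random pattern can be
sketched roughly as follows. This is an illustration under
assumptions, not code from the benchmark: it assumes 64-byte
cache lines and a C++17 compiler, and the names Line,
build_random_cycle and chase are hypothetical. Each cache line
holds the address of the next one to visit, so every load depends
on the previous one and the accesses cannot be pipelined. In the
benchmark another core would have written the block right before
the chase; only the chase itself is shown here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

constexpr std::size_t kLineBytes = 64;   // assumed cache-line size

// One node per cache line; the padding fills the rest of the line.
struct alignas(kLineBytes) Line {
    Line* next;
    char  pad[kLineBytes - sizeof(Line*)];
};

// Links all lines into a single random cycle (Sattolo's algorithm), so a
// chase starting anywhere visits every line exactly once per lap.
void build_random_cycle(std::vector<Line>& lines, unsigned seed)
{
    assert(!lines.empty());
    std::vector<std::size_t> order(lines.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i] = i;
    std::mt19937 rng(seed);
    for (std::size_t i = order.size() - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(order[i], order[pick(rng)]);
    }
    for (std::size_t i = 0; i < lines.size(); ++i)
        lines[i].next = &lines[order[i]];
}

// Follows `steps` dependent loads and returns the average seconds per load.
double chase(const Line* start, std::size_t steps, const Line** end_out)
{
    const Line* p = start;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i)
        p = p->next;                       // each load needs the previous result
    auto t1 = std::chrono::steady_clock::now();
    *end_out = p;                          // publishing p keeps the loop alive
    return std::chrono::duration<double>(t1 - t0).count() / steps;
}
```

For a 16 kB block this means 256 lines; building the cycle once
and calling chase with a multiple of 256 steps gives an average
time per dependent access. The linear pattern needs no such
linking because it only measures throughput.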

You can download the benchmark including the sources at [1].
There are two batch files in the .zip archive, quadcore.cmd
and dualcore.cmd. Both run a large number of benchmarks across
different patterns, directions, block sizes and core
configurations; one is for dual-core, the other for quad-core
systems. (I could also build a batch file for 8-core systems
with two CPUs, or even larger ones.)

It would be nice to see some results in any of the newsgroups
I posted to. You can copy the output of the batch file by
choosing the copy function of the console's system menu.

