[bgl-discuss] global tree bandwidth and socket write() performance
What kinds of tests have people done on these things?
I just did some rough tests, but the results were interesting. I've been
sending 10-200 MB buffers from a compute node to a front-end node, and I
get very different results if I divide up the buffer myself or let the OS
do it.
When I write() chunks of the buffer myself, I get a little over 60 MB/sec
from a single compute node to a front-end node, but when I let the OS
handle it, I only get about 2.8 MB/sec. The code for the first case is
like this, more or less:
    buf = (char *)malloc(totalbytes);
    ...
    while ((remainingbytes = totalbytes - sentbytes) > 0) {
        if (remainingbytes < MSGBUFSIZ) {
            r = write(fd, buf+sentbytes, remainingbytes);
        } else {
            r = write(fd, buf+sentbytes, MSGBUFSIZ);
        }
        ....
        sentbytes += r;
    }
The other way I tried was letting the OS deal with the large buffer:
    sentbytes = 0;
    while (sentbytes < totalbytes) {
        r = write(fd, buf+sentbytes, totalbytes-sentbytes);
        ...
        sentbytes += r;
    }
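
For anyone who wants to reproduce the comparison, both loops above can be folded into one helper that caps the size passed to each write() call. This is only a sketch, not the code used for the measurements: the name write_chunked and the EINTR retry are assumptions added for illustration. Passing chunk equal to total corresponds to the "let the OS do it" case.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `total` bytes from `buf` to `fd`, handing
 * at most `chunk` bytes to each write() call. Short writes are handled
 * by advancing past whatever write() accepted.
 * Returns the number of bytes written, or -1 on error (errno is set). */
static ssize_t write_chunked(int fd, const char *buf, size_t total, size_t chunk)
{
    size_t sent = 0;

    if (chunk == 0) {
        errno = EINVAL;
        return -1;
    }
    while (sent < total) {
        size_t want = total - sent;
        if (want > chunk)
            want = chunk;              /* cap this call at the chunk size */

        ssize_t r = write(fd, buf + sent, want);
        if (r < 0) {
            if (errno == EINTR)
                continue;              /* interrupted by a signal: retry */
            return -1;
        }
        sent += (size_t)r;             /* r may be less than `want` */
    }
    return (ssize_t)sent;
}
```

Calling write_chunked(fd, buf, totalbytes, MSGBUFSIZ) would match the first loop, and write_chunked(fd, buf, totalbytes, totalbytes) the second.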
When I let the OS do it, each write() returns either 72400 or 73848 bytes,
and throughput is over 20 times lower, even though my chunked version calls
write() more often whenever its payload is less than 72400 bytes. (I also
tried payloads of 80 KB.)
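
The throughput figures quoted here (a little over 60 MB/sec versus about 2.8 MB/sec, a ratio of roughly 21) can be measured with a small timing wrapper. The sketch below is an assumption about how such a measurement might be done, not the method actually used: the helper names are hypothetical, and it takes 1 MB to mean 10^6 bytes, which the post does not specify.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <stddef.h>
#include <time.h>

/* Sample the monotonic clock, which is unaffected by wall-clock changes. */
static struct timespec now_monotonic(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t;
}

/* Seconds elapsed between two samples. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Throughput in MB/s. ASSUMPTION: 1 MB = 10^6 bytes. */
static double mb_per_sec(size_t bytes, double seconds)
{
    return (double)bytes / seconds / 1e6;
}
```

Typical use would be to call now_monotonic() before and after the send loop and pass the byte count and elapsed_seconds() to mb_per_sec(). Timing on the sending side only shows how fast write() accepts data, so timing on the receiver as well is a useful cross-check.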
I ran the same test codes between two Linux/x86 machines at UChicago, and
they show almost identical results whether my code divides the buffer or
the OS does, so this appears to be a BGL quirk.
I thought it might have something to do with buffer alignment; a few IBM
docs briefly mentioned 16-byte memory alignments. I'm not really familiar
with any particular tricks for memory alignment, so I tried using a
payload size (MSGBUFSIZ, above) of 65535 and 65536, and I tried sending
the buffer through a loop 16 times, starting with sending buf+0, then
buf+1, etc. I was just guessing that if mod-16 user-buffer memory
addresses were significant, I would notice some performance difference,
but all those cases gave the same results.
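
One caveat on the alignment test: buf+0 through buf+15 only covers every residue mod 16 in a known order if the alignment of buf itself is known, and malloc() makes no 16-byte guarantee on every platform. A minimal sketch of pinning that down, assuming posix_memalign() is available on the node (the post does not confirm this) and using hypothetical helper names:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() */
#endif
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate `nbytes` whose address is a multiple of `alignment`.
 * `alignment` must be a power of two and a multiple of sizeof(void *).
 * Returns NULL on failure; release the result with free(). */
static void *alloc_aligned(size_t alignment, size_t nbytes)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, nbytes) != 0)
        return NULL;
    return p;
}

/* Distance of `p` past the previous `alignment` boundary
 * (0 means `p` is aligned). */
static unsigned misalignment(const void *p, unsigned alignment)
{
    return (unsigned)((uintptr_t)p % alignment);
}
```

With buf = alloc_aligned(16, totalbytes), sending from buf+k for k = 0..15 gives a misalignment of exactly k on each pass, so any alignment effect can be tied to a specific offset.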
For practical purposes, it's good to know that I can get 60 MB/s out of a
node, but I'm still curious what's going on. Are there other tests I
should try?
Thanks,
ccg