User talk:Ken Birman
You've added references to virtual synchrony on quite a few pages. Please stop; even if they were all topical, well-written and informative, you have a conflict of interest when doing so. Anaholic (talk) 12:47, 8 February 2008 (UTC)
- It appears that you are correct about no longer having a financial stake in things; I was working from outdated information. Obviously I should have checked first. Sorry about that; I'll try not to do it again. I also made a mistake in failing to assume good faith, though in my defence I have seen many contributions that looked like that and *were* pure advertisement, and it took some work to be certain this one was not.
- Still, I don't believe this invalidates the rest of my reasoning. You still have a stake in virtual synchrony - you'd have to be superhuman not to - and it does not deserve to be mentioned in the /introduction/ of the pub-sub article any more than logical clocks do. It is simply one technique among many, not a technique that is central to implementing pub-sub systems or one that is only useful in pub-sub systems.
- You would be very welcome to add a historical perspective to the article, however, preferably in a section on history. It may well be true that it was the *first* practical pub-sub system; it simply is no longer a very central one. Anaholic (talk) 14:05, 12 February 2008 (UTC)
- Thanks, I think your suggestion makes sense. I completely agree that as the inventor of virtual synchrony, I have an obvious interest there. But of course we're not actually discussing that. The citation was there because pub-sub was invented by Frank Schmuck and (accidentally) the first real pub-sub system was part of Isis. It was widely used for a while, but indeed, not very much on the table today. Still, history is what it is (what it was, at any rate). Ken Birman (talk) 23:20, 12 February 2008 (UTC)
On an entirely different note, would it be okay for you to tell me which companies are reporting stability issues, and/or with which technologies? I haven't seen problems myself, but then, research networks with a few dozen nodes probably wouldn't. Anaholic (talk) 14:15, 12 February 2008 (UTC)
- I'm nervous about the whole issue of public attribution. I've been told that at two of the largest .com datacenter operators, the people building the in-house infrastructure were forced to switch from publish-subscribe to other options because of instabilities in the products each was using. Both drastically reduced their use of the technology to get load and scalability issues under control, and one of them (call them RainForest.com) more or less eliminated it in favor of other options.
- I should add that I learned the above outside of any sort of NDA setting. In both cases, the context was that I was actually called and asked to help design alternative solutions. But the bottom line is that I'm hesitant to list the names of these two massive .com players, because I worry about lawsuits. Wikipedia is a very public place... Sorry.
- These sorts of stories are common, and the typical "thread" goes like this: "We loved pub-sub and sort of loved it to death: we deployed it very widely in our data center of xxx nodes. The reliability mechanisms then went haywire when our system became heavily loaded." (Often xxx will be thousands or more; the approach works better with small deployments.)
- On close examination, you find that the issue relates to the weak reliability model. These systems -- the commercial ones -- normally have some kind of NAK/retransmit mechanism. So RainForest.com, or whoever, starts using the technology heavily and eventually overloads receivers. The receivers start to demand retransmissions of pretty much every packet, since pretty much every packet is being dropped by someone. The load goes even higher, loss gets worse, and so on.
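- To make that feedback loop concrete, here is a minimal simulation sketch in plain Python. The capacity and loss model are illustrative assumptions of mine, not parameters from any real product: once the offered load crosses capacity, every dropped packet comes back as a retransmission, so offered load ratchets upward while goodput falls.

 # Minimal sketch of a NAK/retransmit feedback loop. The constants and
 # the loss model are illustrative assumptions, not measurements.
 def loss_rate(load, capacity=1.0):
     """Fraction of packets dropped once offered load exceeds capacity."""
     return 0.0 if load <= capacity else 1.0 - capacity / load
 
 def simulate(app_load, steps=12):
     """app_load is application traffic as a fraction of link capacity."""
     retransmit_load = 0.0
     for step in range(steps):
         total = app_load + retransmit_load
         p = loss_rate(total)
         # Each dropped packet triggers a NAK and a retransmission, so the
         # extra load next round is proportional to this round's loss.
         retransmit_load = total * p
         goodput = app_load * (1.0 - p)
         print(f"step {step:2d}: offered={total:5.2f}  loss={p:4.0%}  goodput={goodput:4.2f}")
 
 simulate(app_load=0.9)   # below capacity: no loss, stable
 simulate(app_load=1.1)   # above capacity: retransmissions snowball, goodput collapses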
- For technical reasons, this is especially serious when IP multicast is in use: on a 100Mbit or gigabit Ethernet, you can easily get back-to-back packets, but they "normally" will have disjoint sender/destination addresses, just because with point-to-point communication, randomness works in your favor; no single NIC can generate back-to-back packets. But with two or more IP multicast senders you can easily have receivers who need to receive long runs of back-to-back packets. Now the NIC gets overrun with data (few can even handle two packets in a row, much less a longer sequence), so the NIC drops packets.
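- The contrast shows up even in a toy traffic model (an assumption of mine for illustration, not a measurement): with unicast to random destinations, consecutive wire packets rarely share a receiver, while multicast delivers every packet to every group member, so back-to-back arrivals at one NIC become routine.

 # Toy traffic model (illustrative assumption): unicast picks a random
 # destination per packet; multicast delivers each packet to all members.
 import random
 
 def unicast_backtoback(n_receivers, n_packets=100_000):
     """Fraction of consecutive packets that land on the same NIC."""
     dests = [random.randrange(n_receivers) for _ in range(n_packets)]
     same = sum(dests[i] == dests[i + 1] for i in range(n_packets - 1))
     return same / (n_packets - 1)
 
 n = 100
 print(f"unicast, {n} receivers: ~{unicast_backtoback(n):.1%} back-to-back pairs")
 # Multicast: every member's NIC sees every packet, so with two or more
 # senders, back-to-back arrivals at the same NIC are effectively constant.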
- The upshot is that load drifts upwards, and then you see a massive load spike with 100% badput (retransmit requests and retransmissions, but no good data). The whole datacenter shuts down cold. 90 seconds later the timers in the pub-sub technology time out all the pending sends, and the thing recovers. A broadcast storm.
- Now, this is a stereotyped story, but a common one -- you see problems of this sort with many products, and even with the systems we've built as research technologies here at Cornell. Our new Quicksilver platform had such a problem, for example (we think we have it fixed, but we need to test in bigger settings to be sure). It's a tough research issue, and for commercial vendors it's a real limitation that is getting them thrown out of the biggest, most heavily loaded data centers around.
- As I said, both of these specific stories were told to me outside of NDAs by folks no longer employed at the two .com companies in question. I suspect that anyone actually at either company would refuse to comment. Lawsuits are common in this crowd... Ken Birman (talk) 23:20, 12 February 2008 (UTC)

