Performance hit of a List across threads

Well, Trey Hutcheson brings us some interesting news regarding the performance of a List shared across threads.

I’ve seen a lot of concern over performance during very early stages of development. Worse, I’ve actually seen code changes (very complicated ones) made to avoid a merely possible performance hit. So I find this discovery kind of funny, because at the end of the day it came about due to premature optimization. From the details Trey provides, it doesn’t appear that the original code was causing any real issues (especially since he switched back to it). By switching to an implementation he assumed would be faster, for its own sake, he actually caused a big performance hit.

The good news is that Trey noticed the big hit the switch had made and gathered actual metrics to show there was a problem. As a smart man once taught me: show me the performance problem and I’ll fix it. Until then, let me code with testability and maintainability as my primary focus.
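Measuring is the easy part to act on: time the code before and after a change. Below is a minimal sketch of that kind of measurement, written here in Java (the class and method names are illustrative, not from Trey’s post), comparing plain list appends against appends through a synchronized wrapper:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ListTiming {
    // Time n appends to the given list, in nanoseconds.
    public static long timeAppends(List<Integer> list, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Warm-up runs so the JIT compiles timeAppends before we measure.
        timeAppends(new ArrayList<Integer>(), n);
        timeAppends(Collections.synchronizedList(new ArrayList<Integer>()), n);

        long plain = timeAppends(new ArrayList<Integer>(), n);
        long synced = timeAppends(Collections.synchronizedList(new ArrayList<Integer>()), n);
        System.out.println("plain appends:        " + plain + " ns");
        System.out.println("synchronized appends: " + synced + " ns");
    }
}
```

The absolute numbers will vary by machine and JIT; the point is only to compare two implementations under the same conditions before deciding one is "faster."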

And if you do happen to read this, Trey, don’t consider it a hit on you; we all fall victim to the evil spirit of premature optimization at some point.


3 thoughts on “Performance hit of a List across threads”

  1. I don’t consider it a hit at all; I’m just surprised somebody found the blog 😛

    I generally agree with you. However, I do have to argue a few points. First, one should design performance into an application or system as a first-class principle. For example, imagine two data access approaches:

    1) Object readers that issue discrete SQL SELECT calls per row (1000 objects = 1000 queries)
    2) An object reader that executes a single statement and constructs 1000 discrete objects

    If one designs a system for maximum flexibility (flexibility that may never be exercised), one can design oneself into a corner when it comes to performance. Just make informed decisions, basically.
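    To make the round-trip difference concrete, here is a toy sketch in Java (the names `QueryShapes`, `loadOneByOne`, and `loadBatch` are hypothetical) that counts simulated queries instead of touching a real database:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryShapes {
    // Stand-in for database round trips; incremented once per simulated SELECT.
    public static int roundTrips = 0;

    // Per-row reader: one simulated query for every id (the N-query shape).
    public static List<String> loadOneByOne(Map<Integer, String> db, List<Integer> ids) {
        List<String> out = new ArrayList<>();
        for (int id : ids) {
            roundTrips++;            // one SELECT per row
            out.add(db.get(id));
        }
        return out;
    }

    // Batch reader: one simulated query constructs every object (the 1-query shape).
    public static List<String> loadBatch(Map<Integer, String> db, List<Integer> ids) {
        roundTrips++;                // single SELECT ... WHERE id IN (...)
        List<String> out = new ArrayList<>();
        for (int id : ids) {
            out.add(db.get(id));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> db = new HashMap<>();
        for (int i = 1; i <= 1000; i++) db.put(i, "row" + i);
        List<Integer> ids = new ArrayList<>(db.keySet());

        roundTrips = 0;
        loadOneByOne(db, ids);
        System.out.println("per-row reader: " + roundTrips + " round trips"); // 1000

        roundTrips = 0;
        loadBatch(db, ids);
        System.out.println("batch reader:   " + roundTrips + " round trip");  // 1
    }
}
```

    Both readers produce the same objects; only the number of round trips differs, which is exactly the property one wants to lock in at design time.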

    The specific case I described in my blog post, however, is quite different. The exercise I was undertaking was making a particular subsystem as fast as possible, before any code was written. More of an intellectual exercise, really.

    For example, I found out that on AMD64 processors, this code:

        if (condition)
            result = truePart;
        else
            result = falsePart;

    is faster than this code:

        result = condition ? truePart : falsePart;

    But the second form is faster on Intel Core Duo processors; at least, with whatever opcodes are being jitted and executed at runtime.
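    The two forms are semantically identical, so any measured difference is purely in the machine code the JIT emits (typically a conditional branch versus, on some CPUs, a conditional move). A small Java sketch of the two shapes (class and parameter names are illustrative):

```java
public class BranchForms {
    // Explicit if/else: typically compiled to a conditional branch.
    public static int pickWithBranch(boolean condition, int truePart, int falsePart) {
        int result;
        if (condition) {
            result = truePart;
        } else {
            result = falsePart;
        }
        return result;
    }

    // Ternary form: on some JIT/CPU combinations this becomes a conditional move.
    public static int pickWithTernary(boolean condition, int truePart, int falsePart) {
        return condition ? truePart : falsePart;
    }

    public static void main(String[] args) {
        System.out.println(pickWithBranch(true, 1, 2));   // prints 1
        System.out.println(pickWithTernary(false, 1, 2)); // prints 2
    }
}
```

    Because the two are interchangeable at the source level, this is the kind of choice worth deferring until a profiler on the target hardware says it matters.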

  2. I’ll agree that your data access case is an obvious example, and that anyone who avoids the single-statement call without good reason is probably making a mistake. At the same time, I’ve seen systems where it’s required that any action result in only one SQL call. This can produce code so complex and fragile that a simple change brings down the whole house of cards.

    While performance is something you should keep in the back of your mind, if it causes you to write code that is harder to maintain or test, then that is a problem, especially if you are coding some unimportant side piece that only gets called once in a blue moon.

    At the end of the day I prefer to code to the standard of “testable and fast enough” because I can always redefine “fast enough” as required.

  3. I completely agree. I tend to lean to testability and maintainability of a system, framework, or application.

    However, I did not make myself clear in the case of the threaded List post. I would like to elaborate.

    My company has an “engine” that has existed for nearly 20 years. This engine’s sole purpose is to perform calculations on the fly as fast as possible, using large amounts of native binary data. It was originally implemented in C as a DLL for use by desktop applications. The original performance requirement was to calculate all of the numbers for a specific report in 1 minute, on an 8 MHz 8086.

    Fast forward to today, and that same engine is still in use. It’s been tweaked here and there, and changes have been made over time to account for changing methodologies. And of course, its performance has increased with the progress in hardware. However, it’s still being used by desktop applications, and it was decided to create a new version of this engine to run on the server side, to aid in real-time web reporting.

    On the desktop, we’re a “mutt” shop: everything from old C/C++, VB, and Delphi to newer C# codebases. However, on the server, we’re a Java shop, so the new server-side engine has been implemented in Java. I won’t say the performance of the new engine is “slow”, but it does not match the performance of the native C engine.

    So I started an exercise. I had some ideas that might improve performance, and I wanted to test out these ideas. Most of these new ideas centered around in-memory data modeling, and new ways of pre-filtering this data, as well as data aggregation and post-processing. I also wanted to make it as thread-friendly as possible, to take advantage of multicore server boxes.

    I implemented a new engine from scratch, with two goals: make it as fast as possible, and demonstrate some of the new data modeling, filtering, and business rule models that help speed up the actual data processing. This was also a project developed solely in my private time. As such, I didn’t do any test-driven development, and the code is pretty complex (read: not the most maintainable).

    My version of the new engine was implemented in C# 2.0, using quite a bit of unmanaged code, which helped speed things up considerably. I was also able to break certain processing areas out into separate chunks, meaning I could take advantage of the thread pool. So far, I have not only matched, but exceeded, the performance of the native C engine for most execution scenarios on a single-core machine.
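    Splitting independent processing areas into chunks for a thread pool can be sketched like this in Java (the `ChunkedSum` example is illustrative, not the engine’s actual code); each chunk is pure and independent, so the chunks can run on separate cores and the partial results are combined at the end:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSum {
    // Split the work into independent chunks and run them on a thread pool.
    public static long parallelSum(int[] data, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            int len = (data.length + chunks - 1) / chunks; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int c = 0; c < chunks; c++) {
                final int from = c * len;
                final int to = Math.min(from + len, data.length);
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (int i = from; i < to; i++) s += data[i];
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get(); // combine partial sums
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[100];
        for (int i = 0; i < 100; i++) data[i] = i + 1;
        System.out.println(parallelSum(data, 4)); // prints 5050
    }
}
```

    The chunk count would normally be tied to the number of available cores; a CPU-bound workload like this gains nothing from more threads than cores.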

    And these are very complex, time-consuming operations. For example, when placed behind a web service on an Intel Core Duo laptop, the new engine is only able to service just under 10 requests per second for our largest dataset. This engine is not IO-bound, nor memory-bound; it is solely CPU cycle-bound. And with the support of threading, it can scale out predictably with more CPUs.

    I’ve probably gone into too much detail. But the point is that there are rare cases where performance is just as important as functionality. In this specific case, it’s so important that even microseconds matter; I spent two days just implementing and measuring several bit-counting algorithms. For this data engine, if we hadn’t engineered for performance from the beginning, and optimized as we went, the engine would be so slow as to be unusable.
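    Bit-counting is a good example of how much room there is at that level. Two classic algorithms, sketched here in Java (illustrative, not the engine’s code): a naive loop that inspects all 64 bit positions, versus Kernighan’s trick, which loops only once per set bit:

```java
public class BitCount {
    // Naive count: inspects all 64 bit positions every time.
    public static int countNaive(long v) {
        int n = 0;
        for (int i = 0; i < 64; i++) {
            if (((v >>> i) & 1L) != 0) n++;
        }
        return n;
    }

    // Kernighan's trick: v & (v - 1) clears the lowest set bit,
    // so the loop runs once per set bit instead of 64 times.
    public static int countKernighan(long v) {
        int n = 0;
        while (v != 0) {
            v &= v - 1;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countKernighan(255L)); // prints 8
        System.out.println(countNaive(-1L));      // prints 64
    }
}
```

    In practice one would also measure the platform intrinsic (Long.bitCount in Java, or a POPCNT instruction where available), which is usually faster than either hand-written loop.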
