Monday, May 09, 2011

Environment Cost: Are licensing costs damaging software quality?

Are you unable to test your solution fully except in production? This could well be because your business can't afford to buy licenses for the same software as is in production to be installed on UAT, QA and Development environments. I've seen clients where things like web servers, application servers, messaging middleware and storage solutions just can't be the same as production because it's too darn expensive.
One reason this happens is that these kinds of software purchases tend to be costed based only on production needs. Most often this occurs when procurement sits off to one side, away from the rest of the business. In fact, in one organisation I saw a few years ago, the development part of the organisation had to negotiate separately to buy the same software since procurement dealt solely with "operational" systems.
Frankly I also think it happens because some software vendors are hoping clients will have to come back and ask for lots more licences once they realise they need to start testing things. I've seen this happen too, although to be fair I also know of vendors who throw in test and dev licences for free. (Of course some vendors will gladly sell you expensive test tools to test their expensive complex software, but that's a whole other story...)
The lack of a representative "production like" environment causes numerous problems. In fact most outages I've seen could have been prevented had something more representative of production been available. When I say representative I mean environments where the performance deltas between them are relatively easy to measure and model. That gets much harder when you are running different software or radically different hardware. I'm now of the view that if you can't afford to buy the licences for all of production, UAT, QA and development then you can't afford the software at all.
Once you go down the road of relying on some expensive software (or hardware) to scale things on production you are running increasing risks: an outage becomes more and more likely and you are letting a software or hardware vendor control the cost of scaling your business. I think it's much better to go with commodity hardware, and software where you choose what to pay for and when - this seems to be what works for the best known "internet scale" companies.
I think software licensing and hardware costs can damage software quality as they make scaling the ability to test software too expensive. It's just not enough to scale production; you have to have a cost effective way to scale the means of getting high quality software into production as well.

Wednesday, December 15, 2010

Signal to Noise ratio in software testing

When using techniques like automated functional testing we want a high signal to noise ratio: we want a failing test to tell us something has really broken, not just that we changed something.
Tracking this ratio of 'useful failures' to 'wasteful failures' gives us a signal to noise ratio for our tests. Ideally we want something like 10:1, say, so only every 10th failure is spurious and just due to a software change and/or fragility in our tests. Unfortunately for many teams this is more like 1:10, so most test failures are due to test fragility and not to something really being broken.
When we get too much noise and not enough signal we tend to start ignoring problems and disabling tests. While this might make the ratio better, we do so at the cost of losing some of the signal. You might want to try tracking this ratio for a few weeks; tracking the trend over time can give a way to focus attention on eliminating areas of high noise.
My experience is that the sources of noise are often related and are often due to things like timeout issues, hard coding IDs or creating unnecessary dependencies on the order things get displayed or happen. Another common source of noise is teams working with very large backlogs of 'low impact' bugs. The issue here is more complex and probably worth a post on its own - but when prioritizing bugs it is worth considering the impact they have on team productivity and not just the production impact.
Whatever the cause, a relatively small effort can sometimes dramatically improve the signal to noise ratio.
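As a rough sketch of what tracking this might look like, assuming each failure gets triaged by hand into one of two buckets (the FailureKind and TestSignal names below are hypothetical, not taken from any test tool):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical classification recorded when triaging each failed test.
public enum FailureKind
{
    RealDefect,    // signal: something really was broken
    TestFragility  // noise: timeouts, hard coded IDs, ordering assumptions and so on
}

public static class TestSignal
{
    // Returns useful failures per wasteful failure, e.g. 10.0 for a 10:1 ratio.
    // With no noise at all the result is positive infinity; with no failures
    // of either kind the ratio is undefined and NaN is returned.
    public static double SignalToNoise(IEnumerable<FailureKind> failures)
    {
        var list = failures.ToList();
        int signal = list.Count(f => f == FailureKind.RealDefect);
        int noise = list.Count(f => f == FailureKind.TestFragility);
        if (noise == 0)
        {
            return signal == 0 ? double.NaN : double.PositiveInfinity;
        }
        return (double)signal / noise;
    }
}

public static class Program
{
    public static void Main()
    {
        // One week of triaged failures: 2 real defects and 8 spurious failures.
        var week = Enumerable.Repeat(FailureKind.RealDefect, 2)
            .Concat(Enumerable.Repeat(FailureKind.TestFragility, 8));
        Console.WriteLine("Signal to noise: {0}", TestSignal.SignalToNoise(week));
    }
}
```

Recording one number like this per week is probably enough to see whether the trend is heading the right way.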

Wednesday, March 10, 2010

Software that requires unnecessary constant internet connectivity

I bought a Logitech Harmony remote a while back and last night needed to reconfigure it, so I fired up the config tool and sat watching connection errors. I finally followed a link to the support web site only to see "routine maintenance" was in progress. I'm left frustrated and wondering why a connection is always required to just reconfigure a couple of buttons on the remote. Why not cache some data locally and have an offline mode?

I had the same issue with my Slingbox, which I use to stream TV to another room. When I moved house I had no broadband for a week or so and could not retune the Slingbox without internet connectivity, as this is required for authentication. Again, puzzlement and frustration.

Now in the last few days I've been reading about Ubisoft and issues people have had with their "always need to be connected" DRM solution, I'm very glad I've not bought the product involved.

Until the internet achieves utility status (like electricity or water) it seems a huge assumption and a big risk to release products that rely on "always on" connections where an offline mode seems entirely possible. I think some packaging that says "Needs an internet connection" needs to be revised to say "Needs a constant internet connection to function at all".

To introduce an unnecessary single point of failure is in my book poor design or poor decision making process, especially where most of the impact of the associated risk ends up with the consumer.
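To make the "cache some data locally" idea concrete, here is a minimal sketch of the kind of fallback I have in mind. The OfflineCapableStore class and its delegate are hypothetical; I have no knowledge of how either product actually fetches its configuration.

```csharp
using System;

// Hypothetical wrapper: try the online service first, fall back to the
// last known good copy when the service cannot be reached.
public sealed class OfflineCapableStore
{
    private readonly Func<string> fetchRemote; // assumed to throw when the service is down
    private string cached;                     // last successful result, null if none yet

    public OfflineCapableStore(Func<string> fetchRemote)
    {
        this.fetchRemote = fetchRemote;
    }

    // Returns the freshest data available and reports whether it came from the cache.
    public string Load(out bool fromCache)
    {
        try
        {
            cached = fetchRemote();
            fromCache = false;
        }
        catch (Exception)
        {
            // A real implementation would catch only network related exceptions.
            if (cached == null)
            {
                throw; // nothing to fall back to, so surface the original failure
            }
            fromCache = true;
        }
        return cached;
    }
}
```

A real product would persist the cached copy to disk so it survives a restart, but even this much removes the service as a single point of failure for simple reconfiguration.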

Wednesday, January 27, 2010

Using Retlang for multi-threaded Windows Forms code

I've been a fan of Retlang for a long time and have been meaning to write up some of the ways I've used it. This is the first of 3 blog entries and in this one I'll describe a way of using Retlang to avoid the dreaded InvalidOperationException "Control accessed from a thread other than the thread it was created on".


It avoids lots of nasty boiler plate checks around InvokeRequired. This isn't new and has been described before, but what I want to show is how it can also make testing easier and work with existing MVC patterns.


We are going to use Retlang to allow controller code and view code to communicate over message channels in a thread safe way.


Let's imagine a very simple implementation of MVC to illustrate the approach.


Here's the test; we are just adding numbers together and then displaying the answer in the view.



[TestFixture]
public class TestSimpleController
{
    private Model model;
    private SimpleView view;
    private MockRepository repository;

    [SetUp]
    public void SetUp()
    {
        repository = new MockRepository();
        model = repository.StrictMock<Model>();
        view = repository.StrictMock<SimpleView>();
    }

    [Test]
    public void testShouldDoSimpleAdd()
    {
        // Record phase: the calls we expect on the model and the view.
        model.PushNumber(10);
        model.PushNumber(5);
        model.SumStack();
        LastCall.Return(15);
        view.DisplayCurrentTotal(15);

        repository.ReplayAll();

        // Replay phase: drive the controller and verify the expectations.
        var simpleController = new SimpleController(model, view);
        simpleController.SendNumber(10);
        simpleController.SendNumber(5);

        simpleController.Sum();
        repository.VerifyAll();
    }
}


I'm using Rhino Mocks for mocking the view and the model. The controller implementation looks like this:


public class SimpleController
{
    private readonly Model model;
    private readonly SimpleView view;

    public SimpleController(Model model, SimpleView view)
    {
        this.model = model;
        this.view = view;
    }

    public void SendNumber(int i)
    {
        model.PushNumber(i);
    }

    public void Sum()
    {
        var result = model.SumStack();
        view.DisplayCurrentTotal(result);
    }
}


Now suppose we have to extend the implementation to deal with a long running task. We don't want to block the view thread as that is bad for user experience, so we create a new thread to call the model from. Here is the controller code for that:


public void DoPrediction()
{
    var thread = new Thread(InvokeModelWork);
    thread.Start();
}

private void InvokeModelWork()
{
    var result = model.LongRunningCalculation();
    view.DisplayCurrentTotal(result);
}


It's worth noting here that a test for this code written in the same way as for our SimpleAdd will quite likely start failing at this point, as asserts will be made on the calling thread before the "worker" thread has called the model and the view. Here is that naive and incorrect version of the test:


[Test]
public void testShouldInvokeLongRunningCalc()
{
    model.PushNumber(33);
    model.PushNumber(44);
    model.PushNumber(11);
    model.LongRunningCalculation();
    LastCall.Return(88);
    view.DisplayCurrentTotal(88);

    repository.ReplayAll();

    var simpleController = new SimpleController(model, view);
    simpleController.SendNumber(33);
    simpleController.SendNumber(44);
    simpleController.SendNumber(11);

    simpleController.DoPrediction();
    repository.VerifyAll();
}


This sort of test is a common problem in many code bases where the threading is not added until after complaints from the users about performance problems - this can create a lot of problems with the testing and is a very common source of bugs in MVC code. It is easy to tie yourself in knots using the mock framework to signal threads that a method was called or, even worse, using Thread.Sleep() to pause inside of the test. My experience has been that both of these are a source of "mysterious" test/build failures and are very hard to maintain.


Anyway even if we make our test pass we'll see the following when we try to run the code for real in a Windows Forms implementation of our view interface:


{"Cross-thread operation not valid: Control 'textBoxResult' accessed from a thread other than the thread it was created on."}


For completeness here is the very simple user control that implements the view.


public partial class SimpleControl : UserControl, SimpleView
{
    private readonly SimpleController controller;

    public SimpleControl()
    {
        controller = new SimpleController(new DoesMath(), this);
        InitializeComponent();
    }

    public void DisplayCurrentTotal(int i)
    {
        textBoxResult.Text = i.ToString();
    }

    private void buttonSum_Click(object sender, EventArgs e)
    {
        controller.Sum();
    }

    private void buttonPredict_Click(object sender, EventArgs e)
    {
        controller.DoPrediction();
    }

    private void buttonSubmit_Click(object sender, EventArgs e)
    {
        var input = int.Parse(textBoxInput.Text);
        controller.SendNumber(input);
    }
}


So how can Retlang help?


Retlang lets us create channels which different threads can use to communicate. We can create one of these channels from the dispatch thread of a Windows Forms class.


Here is the new version of the constructor for the controller:


private Channel<int> viewChannel;

public SimpleController(Model model, SimpleView view)
{
    this.model = model;
    viewChannel = new Channel<int>();
    viewChannel.Subscribe(view.Fibre, view.DisplayCurrentTotal);
}


In this simple case we have one channel over which we will send ints, and we have just one subscriber, which is the view.DisplayCurrentTotal() method. We also ask the view to provide us with the Fiber to use; this allows the view to provide a fiber that it knows will be safe to execute the subscriber method(s) on.


In more advanced cases you might want to have many channels, or multiplex different message types over the same channel and use the Dispatcher pattern to route those to the right subscribers on the view. I've done just that in a real example and used reflection to create mappings from each view method to the correct message type being received from the channel.
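A minimal sketch of that reflection based mapping is below. It is not the code from the real example: the message types, RecordingView and ViewDispatcher are all hypothetical names, and no Retlang types are used so the sketch stands on its own. In practice Dispatch would be the single subscriber on the channel and would run on the view's fiber.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical message types multiplexed over one channel.
public sealed class TotalChanged { public int Total; }
public sealed class StatusChanged { public string Text; }

// Hypothetical view: every handler has the form void Method(Message msg).
public sealed class RecordingView
{
    public readonly List<string> Calls = new List<string>();
    public void ShowTotal(TotalChanged msg) { Calls.Add("total=" + msg.Total); }
    public void ShowStatus(StatusChanged msg) { Calls.Add("status=" + msg.Text); }
}

public sealed class ViewDispatcher
{
    private readonly Dictionary<Type, Action<object>> handlers =
        new Dictionary<Type, Action<object>>();

    // Scan the target once, mapping each public void method that takes a
    // single parameter to that parameter's type. If two methods share a
    // parameter type the one found last wins.
    public ViewDispatcher(object target)
    {
        var flags = BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly;
        foreach (var method in target.GetType().GetMethods(flags))
        {
            var parameters = method.GetParameters();
            if (method.ReturnType != typeof(void) || parameters.Length != 1)
            {
                continue;
            }
            var captured = method;
            handlers[parameters[0].ParameterType] =
                msg => captured.Invoke(target, new[] { msg });
        }
    }

    // Route a message to the handler registered for its exact runtime type.
    // Returns false when no handler is mapped for that type.
    public bool Dispatch(object message)
    {
        Action<object> handler;
        if (!handlers.TryGetValue(message.GetType(), out handler))
        {
            return false;
        }
        handler(message);
        return true;
    }
}
```

Doing the reflection once in the constructor keeps the per message cost down to a dictionary lookup plus the reflective call.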


On with our simple example; so what do Sum() and DoPrediction() look like now?


public void Sum()
{
    var result = model.SumStack();
    viewChannel.Publish(result);
}

public void DoPrediction()
{
    var thread = new Thread(InvokeModelWork);
    thread.Start();
}

private void InvokeModelWork()
{
    var result = model.LongRunningCalculation();
    viewChannel.Publish(result);
}


So instead of calling methods on the view we publish messages instead, in this case just an int.


What about the test? We'll need a few changes to SetUp as well. Retlang gives us a nice stubbed implementation of fiber we can use for testing.


// New field on the test fixture, alongside model, view and repository.
private StubFiber fiber;

[SetUp]
public void SetUp()
{
    fiber = new StubFiber();

    repository = new MockRepository();
    model = repository.StrictMock<Model>();
    view = repository.StrictMock<SimpleView>();
    SetupResult.For(view.Fibre).Return(fiber);

    fiber.Start();
}

[Test]
public void testShouldInvokeLongRunningCalc()
{
    var finished = new ManualResetEvent(false);

    model.PushNumber(33);
    model.PushNumber(44);
    model.PushNumber(11);
    model.LongRunningCalculation();
    LastCall.Return(88);
    view.DisplayCurrentTotal(88);
    // note the type of m below must match the parameter type of the above method
    LastCall.Callback((int m) => finished.Set());

    repository.ReplayAll();

    var simpleController = new SimpleController(model, view);
    simpleController.SendNumber(33);
    simpleController.SendNumber(44);
    simpleController.SendNumber(11);

    simpleController.DoPrediction();

    // Block until the view has been called on the worker thread, or time out.
    Assert.IsTrue(finished.WaitOne(1000), "Timed out");
    repository.VerifyAll();
}


We use a manual reset event to allow us to wait for a particular event to occur, in this case the call to view.DisplayCurrentTotal() is followed by the use of a callback to allow this to happen. We then wait for the event (or a timeout) before calling VerifyAll(). So we can't avoid the fact that things happen asynchronously, but by using a message channel we make things explicitly so and importantly we have consistency across all the controller/view interactions. If we treat our test code as just code we can take advantage of this consistency to extract methods etc and this helps keep the code readable.
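As an example of the kind of method you might extract once every asynchronous test follows the same shape, here is a small sketch. AsyncAssert and CompletesWithin are hypothetical names of my own, and the helper uses only System.Threading so it does not depend on the mocking framework.

```csharp
using System;
using System.Threading;

public static class AsyncAssert
{
    // Starts the asynchronous work, then blocks the calling thread until the
    // signal is set or the timeout elapses. Returns true if the signal was set.
    public static bool CompletesWithin(Action start, WaitHandle signal, int timeoutMs)
    {
        start();
        return signal.WaitOne(timeoutMs);
    }
}

public static class Program
{
    public static void Main()
    {
        var finished = new ManualResetEvent(false);

        // Stand-in for controller.DoPrediction(): the work runs on another
        // thread and sets the event when it reaches the point we care about.
        bool completed = AsyncAssert.CompletesWithin(
            () => new Thread(() => finished.Set()).Start(),
            finished,
            1000);

        Console.WriteLine(completed ? "Completed" : "Timed out");
    }
}
```

In the test above, the last two steps would then collapse into a single assert on CompletesWithin followed by VerifyAll().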


Here are the required changes for the view:


private readonly FormFiber formFiber;

public SimpleControl()
{
    formFiber = new FormFiber(this, new BatchAndSingleExecutor());
    controller = new SimpleController(new DoesMath(), this);
    InitializeComponent();
    formFiber.Start();
}

public IDisposingExecutor Fibre
{
    get { return formFiber; }
}


We no longer need to worry about surrounding things with InvokeRequired checks, so our implementation of the view interface stays nice and clean. We also avoid having to create a view proxy, which is another solution to the InvokeRequired problem but one that requires a lot of repetitive boiler plate code.


I've tried to keep things simple for this example, some things to think about for a real world implementation are

  • That methods on the view interface will be of the form void Method(Message msg)
  • Any methods on the view that do need to return something must be called from the same thread as created the form (but to be thread safe this must be done anyway).
  • Whether to multiplex multiple message types over the same channel and then use a Dispatcher to call the correct method on the view or to have a channel per message type
  • Whether to create different message channels for different styles of message flow and quality of service

I wonder if this is more Model Channel Controller as opposed to Model View Controller? 


In the next entry I'll describe using a similar idea to allow high performance, low latency updates of a view direct from a domain model by sidestepping the controller but without seriously compromising encapsulation. In the final entry of the 3 I'll describe how to use Retlang channels to efficiently share work across multiple worker threads and hence CPU cores. The Retlang developers have done a great job and created a simple to use but highly useful library.

Tuesday, November 24, 2009

Ban The Debugger

How much logging do you really need in your application? I was visiting a client recently to help diagnose some problems with one of their applications. I asked their support people to send me the log file, which seemed like a good place to start. I got a file of about 2.5K, and that seemed kinda small to me. I opened it up and found just one exception and its stack trace, so I got back to the support people to check. Yep, that was it: logging cranked up to max and the only output was one stack trace. Not even the date and time of the problem. Wow!

A technique I find very useful prior to go live is to Ban the Debugger. Developers get very used to just firing up the debugger when fixing issues or diagnosing problems. This is fine, but it means that no one looks at the log files from the point of view of someone who only has those available to find out what is going on. For our colleagues in support and operations it is only the log files that they can use to find and fix issues.

So during development prior to go live I stop the developers from using the debugger. Instead I ask them to spend at least some time trying to fix the issue based solely on the log files and whatever else our friends in support will have available once we have gone live. This usually leads to a big upswing in the amount of logging, and it's logging we know helps to fix issues. Of course sometimes you do need the debugger, but hopefully after we've used the logging to narrow down the problem area and to understand what the users were trying to do at the time.

So back to the client above - it turned out the technical lead had never worked in support and that the support team had not really been represented during development. We owe our colleagues better than this; perhaps Banning the Debugger for a time during development might help.
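As an illustration of the sort of log line that tends to come out of this exercise, here is a small sketch. SupportLog and its parameters are hypothetical and not tied to any logging framework; the point is only that the entry carries when it happened, who was affected and what they were trying to do.

```csharp
using System;
using System.Globalization;

public static class SupportLog
{
    // Builds one log line with what support needs to start diagnosing:
    // a UTC timestamp, the user, the action in progress and the failure.
    public static string FormatFailure(DateTime utcNow, string user, string action, Exception error)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "{0} ERROR user={1} action={2} {3}: {4}",
            utcNow.ToString("o", CultureInfo.InvariantCulture), // round-trip ISO 8601 format
            user,
            action,
            error.GetType().Name,
            error.Message);
    }
}

public static class Program
{
    public static void Main()
    {
        try
        {
            // Made-up failure purely for the demonstration.
            throw new InvalidOperationException("Order has no delivery address");
        }
        catch (Exception ex)
        {
            Console.WriteLine(SupportLog.FormatFailure(DateTime.UtcNow, "jsmith", "SubmitOrder", ex));
        }
    }
}
```

A real application would also log the stack trace and the steps leading up to the failure, which is exactly what the client's 2.5K file was missing.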

Wednesday, November 04, 2009

Technology Lightning Talks in Chicago

Some of my colleagues from the ThoughtWorks Technology Advisory Board will be delivering Lightning Talks next week in Chicago. I'm not speaking myself but will be attending. If you are in Chicago and would like to come along follow this link for registration information. I think it's a pretty great list of speakers and am looking forward to hearing them myself.

Monday, November 02, 2009

Databases and Separation of Concerns

A continuing source of pain on projects is Object Relational Mappings and databases. A contributor to this pain is the mixture of two concerns, a mixture that seems to occur on nearly every project that uses a database. I think spending a moment to think about these two different kinds of usage is worthwhile. So what are these two concerns?

A. Persist State Needed for Recovery
This is just working state saved so that when the application restarts we can continue processing. For example, perhaps the current state of a customer order or work in progress on a very long running calculation.

B. Data Saved for Reporting and Querying
This is data saved so it can be queried later on, perhaps to allow end of month reports to be generated or tracking of user trends. It is not needed to recover the working state of an application.

Many teams try to overload (A) to achieve (B). The sorts of problems this can cause are:

i) Data Volume - the volumes of data needed for (A) tend to be smaller, as it's current working data as opposed to historical data. This can show itself as performance issues for the application as queries become slow over time.

ii) Object Design - the object design needed for (A) and (B) is not necessarily the same. This often shows itself as fields or object relationships being created with names like "history" or "recordOf", so an object design created for (A) becomes overloaded with things needed for (B). This again causes performance issues as the number of objects and the amount of data getting pulled into memory by the ORM can increase. It also means a simple state update can touch a lot of tables as we try to achieve (B) at the same time, and chances are the indexes needed for historical queries can start to impact the update speed for these.

iii) Confusing Code - as with any other area where we fail to achieve separation of concerns the code can become confusing, for example state change operations become implicitly overloaded to create and persist data needed for historical reasons.

iv) Archiving - archiving historical data becomes problematic, as you can't cleanly identify what data in the DB can be safely moved out to a historical database or deleted without impacting the functionality of the application itself.

It won't always help, but separation of these two concerns can provide clarity and address some kinds of performance issues. I certainly think it's worth calling out these concerns as different kinds of requirement even if you end up using the same implementation for both.
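One way to picture the separation is sketched below, using in-memory stand-ins rather than a real database or ORM. All of the names (RecoveryStore, ReportingLog, StatusChange, OrderService) are hypothetical. The state change operation writes to both stores explicitly, so neither concern is hidden inside the other.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Concern (A): only the current working state, overwritten on every change.
public sealed class RecoveryStore
{
    private readonly Dictionary<int, string> currentStatus = new Dictionary<int, string>();

    public void Save(int orderId, string status) { currentStatus[orderId] = status; }
    public string Load(int orderId) { return currentStatus[orderId]; }
    public int Count { get { return currentStatus.Count; } }
}

// One historical record, kept purely for later querying.
public sealed class StatusChange
{
    public int OrderId;
    public string Status;
    public DateTime At;
}

// Concern (B): append-only history, never read to recover working state,
// so it can be archived or moved without affecting the application.
public sealed class ReportingLog
{
    private readonly List<StatusChange> history = new List<StatusChange>();

    public void Append(StatusChange change) { history.Add(change); }
    public int CountFor(int orderId) { return history.Count(c => c.OrderId == orderId); }
}

public sealed class OrderService
{
    private readonly RecoveryStore recovery;
    private readonly ReportingLog reporting;

    public OrderService(RecoveryStore recovery, ReportingLog reporting)
    {
        this.recovery = recovery;
        this.reporting = reporting;
    }

    public void ChangeStatus(int orderId, string status, DateTime at)
    {
        recovery.Save(orderId, status); // small and current: needed after a restart
        reporting.Append(new StatusChange { OrderId = orderId, Status = status, At = at }); // grows over time
    }
}
```

After three status changes to one order the recovery store still holds a single row for it while the reporting log holds three, which is the volume difference described in (i).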