Wednesday, December 15, 2010

Code entropy and OO

On a related subject ...

The concept of OO is built on the assumption that it's good to couple data with the functions that operate on it. This is the concept of a class. The idea is that the functions in the class have a close relationship to the data and do all the tricky logic, while the outside world doesn't have to know about the details. This is the idea of modularization, which helps make sure that changes in one part of a system have minimal effects on other parts of the system.

It all goes wrong when class boundaries are defined incorrectly, and classes need to rely excessively on data from each other. Then instead of boundaries that hide details, we end up with boundaries that prevent proper usage.

In order to take full advantage of OO, it is imperative to continuously rearrange class boundaries and maintain the close relationship between data and the code that operates on it. In fact, I would go as far as to say this is what distinguishes good OO code from bad OO code. By default, all OO code tends to deteriorate, i.e. increase in entropy, precisely because data drifts out of alignment with the code that operates on it. This is perhaps the biggest (and very much fair) criticism leveled by OO-haters.

What I'm trying to arrive at here is a way to automate, or semi-automate, refactoring in order to take full advantage of OO.

Tuesday, December 14, 2010

Code entropy

Most of the time when I refactor, I strive for some sort of perfection, some sort of lofty goal of readability, maintainability, scalability, etc.
It's all nice and good, but as a proponent of the scientific method, I keep thinking:
"Wouldn't it be nice to quantify all that?"

I mean, if I had some sort of objective measure, a number I could put on my code, I could say "My code has improved by X", or "this code sucks ass because it has a quotient of Y". Then I could run my code thru some Turing machine that spits out this number, do the same thing after the refactoring, and get a better number.

So today I had this idea of measuring the quality of code via code entropy. The lower the number, the more maintainable the code (or some other such subjective adjective). Of course the code with the lowest entropy is code that does absolutely nothing. Code with high entropy would hopefully be code that either does a lot or is very messy. So we have to balance readability with functionality. In the meantime, we can define (good) refactoring as rewriting code in a way that preserves functionality but lowers entropy.

So what is this elusive entropy concept? Well, I don't really know yet, this is fresh off the press. But one idea I had was to measure code entropy in terms of "coupling", "interdependence", between modules, classes, and functions.
For example, we can measure it in terms of how often one class needs access to members of another.

Here's a very simple example of what I mean.


High entropy version:

class X {
public:
    void DoStuff() {
        if (y.CalculateStuff1() == y.CalculateStuff2())
            y.CalculateStuff3();
    }
private:
    Y y;
};

class Y
{
public:
    Stuff CalculateStuff1();
    Stuff CalculateStuff2();
    void CalculateStuff3();
};

Entropy of 3 (class X accesses members of Y 3 times).

Low entropy version:

class X {
public:
    void DoStuff() {
        y.CalculateStuff123();
    }
private:
    Y y;
};

class Y
{
public:
    void CalculateStuff123()
    {
        if (CalculateStuff1() == CalculateStuff2())
            CalculateStuff3();
    }
private:
    Stuff CalculateStuff1();
    Stuff CalculateStuff2();
    void CalculateStuff3();
};

Entropy of 1 (class X accesses members of Y only once).

Obviously in the 2nd case the interaction between X and Y is minimal, and because Y has access to its own data, while X may not, the low entropy version can be simplified further, depending on what CalculateStuff1, 2, and 3 are doing, and what data members they're using. The high entropy version cannot be simplified if Y's data is hidden from X.


There are obviously a lot of open questions here. How do we count when we call a protected function or access a protected member of a base class? How do we count when we call a virtual function? Does this count as 1 or 2?

The idea would be to reward code where logic is close to the data that it operates on, and penalize code that tries to operate on data that doesn't belong to it, which gets rid of the "code smell" called "feature envy" in the Refactoring book (p. 80).

Then we could have different types of entropy:
  • Static code entropy
    • A measure of the static structure of the code
    • If we analyze the code "on paper", how often do objects access each other's methods / data? (A toy sketch of such a counter follows after this list.)
  • Dynamic code entropy
    • A measure of the runtime properties of the code
    • If we actually run the code, how often do methods from other classes get called?
  • Potential code entropy
    • A measure of how badly the code could be misused
    • For example, making all your class members public or using global variables is not in and of itself bad, but it has the potential of being used the wrong way and becoming entangled in all sorts of undesirable relationships with other objects.
    • In the example above, the low entropy version also has lower potential entropy, because 3 of its functions are private, and in principle are protected from the hordes of malicious hackers or bad programmers
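As a first stab at the static variant, here is a deliberately crude toy of my own (a sketch, not a real analyzer): it simply counts textual occurrences of accesses to the member object y from the example above. A real tool would need an actual parser, but the idea is the same.

#include <iostream>
#include <string>

// counts non-overlapping occurrences of sNeedle in sHaystack
int CountOccurrences(const std::string& sHaystack, const std::string& sNeedle)
{
    int nCount = 0;
    for (std::string::size_type pos = sHaystack.find(sNeedle);
         pos != std::string::npos;
         pos = sHaystack.find(sNeedle, pos + sNeedle.length()))
        ++nCount;
    return nCount;
}

int main()
{
    // the bodies of X::DoStuff from the two versions above
    const std::string sHighEntropyX =
        "if (y.CalculateStuff1() == y.CalculateStuff2()) y.CalculateStuff3();";
    const std::string sLowEntropyX = "y.CalculateStuff123();";

    // "entropy" = number of times X reaches into its member object y
    std::cout << "high entropy version: " << CountOccurrences(sHighEntropyX, "y.") << "\n"; // prints 3
    std::cout << "low entropy version: "  << CountOccurrences(sLowEntropyX,  "y.") << "\n"; // prints 1
    return 0;
}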
I think that's all I have to say about the subject for now.

P.S. Apparently the concept of software entropy is not new, but so far I haven't seen it quantified quite like what I'm describing here. More research is needed.

Monday, December 13, 2010

The root of all evil

is duplication of logic. Nothing is worse than 2 subsystems trying to do almost the same thing in sufficiently different ways.

Friday, December 10, 2010

Booleans galore

Often I bump into code which contains convoluted logic involving a lot of booleans.

If the booleans are independent, then the number of states represented by the booleans is 2^n. All too often, however, the number of REAL states that the object can be in (based on those booleans) is much lower. That is because the booleans are interdependent and setting one to true may mean another one needs to be set to false, etc.

After 4-5 such interconnected booleans, the logic can get blurry and the code hard to read.

One thing that seems to help is to replace multiple booleans with one enum variable. The enum represents the true combined state of the object(s). Sometimes a state corresponds to one of the booleans being true; sometimes to a combination. It's important to get an overview of what can happen to the object, which of the booleans are dependent on each other, and which are completely independent (and perhaps should not be included in this state machine). The number of entries in the enum will then represent the actual states of the object.
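Here's a minimal before / after sketch of the idea (the download example and all its names are hypothetical, just to show the shape of the change):

// Before: 4 interdependent booleans, 2^4 = 16 nominal states
class ZDownload
{
    bool m_bStarted;   // m_bPaused only makes sense while this is true...
    bool m_bPaused;
    bool m_bFinished;  // ...and m_bFinished / m_bFailed exclude each other
    bool m_bFailed;
};

// After (replacing the above): one enum naming only the states that can actually occur
enum EDownloadState
{
    DOWNLOAD_IDLE,
    DOWNLOAD_RUNNING,
    DOWNLOAD_PAUSED,
    DOWNLOAD_FINISHED,
    DOWNLOAD_FAILED,
};

class ZDownload
{
    EDownloadState m_eState; // 5 real states instead of 16 combinations
};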

Now this one is very much on a case-by-case basis, but a few times I've managed to reduce logic from 4-5 interconnected booleans (16-32 states) to 5-10 enum entries, which is MUCH easier to reason about.

Anything else? Hmm... I think that's all I'm gonna say today. Refactoring awaits. MWAHAHAHA

Monday, December 6, 2010

Optimization vs. readability

It is not uncommon that I encounter situations where I know the code I'm writing or refactoring is clearly not performance-optimal, yet more readable.
Here's an example of code that I've encountered recently:

struct State {
    float m_Prop1;
    int m_Prop2;
    std::string m_Prop3;
    // ...
};

class Manager {
    State oldState, newState;
    std::vector<Element> m_Elements;
};

// ... somewhere in the Manager:
foreach elem in m_Elements {
    if (oldState.m_Prop1 != newState.m_Prop1)
        elem.SignalProp1Change();
    else if (oldState.m_Prop2 != newState.m_Prop2)
        elem.SignalProp2Change();
    // ...
}
There are 2 ways to go from here.
The first is to take the comparison outside the loop, so that we only do each comparison once. The second is to move both the comparison and the signaling into the elements, which would put all the important logic where it is actually needed. Here's the comparison.



Performance-optimal version

Code in the Manager header:

struct State {
    m_Prop1, m_Prop2, m_Prop3;
};
class Manager {
    State oldState, newState;
    std::vector<Element> m_Elements;
};

Code in the Manager:

if (oldState.m_Prop1 != newState.m_Prop1)
    foreach elem in m_Elements
        elem.SignalProp1Change();
if (oldState.m_Prop2 != newState.m_Prop2)
    foreach elem in m_Elements
        elem.SignalProp2Change();

Code in the element source file:

// nothing much happening here

Code in the element header file:

class Element {
public:
    void SignalProp1Change();
    void SignalProp2Change();
};

OO version

Code in the Manager header:

struct State {
    m_Prop1, m_Prop2, m_Prop3;
};
class Manager {
    State oldState, newState;
    std::vector<Element> m_Elements;
};

or

class ZState {
    friend class Element;
private:
    m_Prop1, m_Prop2, m_Prop3;
};
class Manager {
    ZState oldState, newState;
    std::vector<Element> m_Elements;
};

Code in the Manager:

foreach elem in m_Elements
    elem.SignalChanges(oldState, newState);

Code in the element source file:

void Element::SignalChanges(const State& oldState, const State& newState)
{
    if (oldState.m_Prop1 != newState.m_Prop1)
        SignalProp1Change();
    if (oldState.m_Prop2 != newState.m_Prop2)
        SignalProp2Change();
}

Code in the element header file:

class Element {
public:
    void SignalChanges(const State& oldState, const State& newState);
private:
    void SignalProp1Change();
    void SignalProp2Change();
};

or

class IElement {
    virtual void SignalChanges(const State& oldState, const State& newState) = 0;
};
class Element : public IElement {
    // the rest is the same as above
};
The first solution is more performance-optimal: if a property hasn't changed, we never call any of the functions, skipping the traversal of the list altogether. The second solution is more modular: all the logic is moved into the elements. This way we hide all that logic from the manager, and we can also hide the contents of the state from the manager. The latter approach is more robust, leaving the manager class completely oblivious to the logic of the elements. A new programmer looking at the code base also needs to inspect less code in order to figure out what to do.

So what to do, what to do? I typically prefer the second option, as you may have already guessed. The benefits of modularization are great for maintenance, and the performance gain may only be important if this code is executed very frequently. But I think this example does illustrate an important point. Data hiding can prevent us from getting code that is fully performance-optimized. The more we separate code into clean independent modules, the less we can take advantage of their interdependence. 

Sunday, November 28, 2010

Refactoring and performance

"The interesting thing about performance is that if you analyze most programs, you find that they waste most of their time in a small fraction of the code. If you optimize all the code equally, you end up with 90% of the optimizations wasted, because you are optimizing code that isn't run much. The time spent making the program fast, the time lost because of lack of clarity is all wasted time ... [When] the code is clearer, you have a better understanding of your options and what kind of tuning will work" (Refactoring, p.70)

I think Martin Fowler has a very good point here, which I have known for a while but couldn't quite articulate as well. Writing performance-optimized code comes at a cost: that cost is typically clarity, maintainability, and ultimately time. Another quote comes to mind:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified” - Donald Knuth

Very often the most unexpected things end up being performance bottlenecks. Someone may spend days, weeks, or even months trying to squeeze every ounce of performance out of a piece of code, and in the end it may not matter at all, as you discover it's something else (a part of the code that doesn't even look like it needs optimizing) that creates 90% of the performance loss.

What's worse, the prematurely optimized code can end up being rewritten altogether, because the user requirements have changed, and then the hours spent painstakingly squeezing out milliseconds of CPU time are wasted 100%!

Wednesday, November 24, 2010

Flavors of optimization

I was thinking the other day about optimization. What does it mean to optimize code? Well, it depends. What are you optimizing with respect to? It seems like everyone has their favorite type of optimization.

You can:
  • Optimize for the long-term speed of feature development and bug fixes
  • Optimize for performance (this is what people think of as optimization by default). 
  • Optimize for minimum time spent doing any work, which means
    • Offloading onto others
    • Making small features look more important
    • Making useless features look useful
    • Shoving off bugs for later (hopefully forever) or to others
  • Minimize the amount of wasted work
    • I don't want to do something now, which has to be redone later
    • I don't want to make a temporary fix now, because there might be a proper fix later
  • Maximize features per day
    • Short term speed of development above all
    • No time for design, refactoring, or performance optimizations 
    • Focus on appearance of progress, and the total volume of code produced, regardless of correctness, extensibility, and readability
    • Bug fixing is considered a completely separate process, which essentially gets in the way of progress, testers are annoying
    • Documentation is considered a nuisance
    • Other programmers who point out bugs are also annoying
  • Maximize for prettiness over functionality
What's funny is that in a large project, every one of these optimizations will be present, resulting in a mixture of crappy code, pretty code, buggy code, easily maintainable code, and performance-optimized code, all mish-mashed with parts trying to interact with each other. 

Thursday, November 18, 2010

A shiny new class

Having a new class in the right place is like having a new employee with the right skill set. As soon as you create the class, you begin to see things that were in chaos falling into place, things that didn't have an owner, having an owner, and things that you thought were difficult, now becoming easy. Hurray to new classes that make sense!

Friday, November 12, 2010

A manager for heterogeneous data (the solution)

OK, so here's the solution I ended up with (forgive the improper syntax, you should still be able to get the picture):

class ZDataManager
{
    public:
       void RegisterListener(IListener* pListener)
       { m_aListeners.push_back(pListener); }
       void UnregisterListener(IListener* pListener)
       { m_aListeners.remove(pListener); }
       void ProcessEvent(const SEventArgs& args) {
          foreach (pListener in m_aListeners)
              if (pListener->GetArgType() == args.GetArgType())
                   pListener->ProcessEvent(args);
       }
    private:
       std::list<IListener*> m_aListeners; // a std::list, so remove() above works
};


class IListener
{
public:
   virtual EArgType GetArgType() const = 0;
   virtual void ProcessEvent(const SEventArgs&) = 0;
};


struct SEventArgs
{
   SEventArgs(const EArgType& eType) { m_eType = eType; }
   EArgType GetArgType() const { return m_eType; }
 private:
   EArgType m_eType;
};


struct SEventArgsFruit : public SEventArgs
{
    SEventArgsFruit(ETexture eText, EColor eCol, ETaste eTaste) : SEventArgs(FRUIT_ARGS)
    { m_eText = eText; m_eColor = eCol; m_eTaste = eTaste; }
    ETexture m_eText;
    EColor m_eColor;
    ETaste m_eTaste;
};


class ZFruitListener : public IListener
{
public:
    ZFruitListener() { GetDataManager()->RegisterListener(this); }
    ~ZFruitListener() { GetDataManager()->UnregisterListener(this); }
    virtual EArgType GetArgType() const { return FRUIT_ARGS; }
    virtual void ProcessEvent(const SEventArgs& args) {
       const SEventArgsFruit* pFruitArgs = static_cast<const SEventArgsFruit*>(&args);
       // process new fruit
    }
};


class ZVegetableListener : public IListener
{
public:
     virtual EArgType GetArgType() const { return VEGGIE_ARGS; }
     virtual void ProcessEvent(const SEventArgs& args) {
        const SEventArgsVeggie* pVeggieArgs = static_cast<const SEventArgsVeggie*>(&args);
        // process new vegetable
     }
};
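For completeness, here's roughly what a pusher-side call might look like with this design (ZFruitPusher, OnFruitArrived, and the enum values are hypothetical; GetDataManager and ProcessEvent are from the code above):

void ZFruitPusher::OnFruitArrived()
{
    SEventArgsFruit args(SMOOTH, RED, SWEET); // hypothetical ETexture / EColor / ETaste values
    GetDataManager()->ProcessEvent(args);     // multicasts to every listener whose type is FRUIT_ARGS
}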



I'm quite satisfied with this solution: it relies on a static cast, but overall it is still elegant and does the job. The key to the solution is the generic SEventArgs struct and all of the wrappers that inherit from it, allowing us to package different types of data with an arbitrary number of parameters into something that can be processed generically thru the manager. It might seem awkward at first to be passing a wrapper to the manager each time, or to create that wrapper for each case. But remember, we may have started with a solution that had a separate list for each arg type, and separate registration functions. Now, only in the ProcessEvent function do we begin to treat the args differently.
One assumption I make (true for my set of use cases) is that multiple listeners can listen to the same type of data, but one listener cannot listen to 2 different types of data. We could also have a base class ZListenerBase (inheriting from IListener) to simplify registration, since registration / unregistration happens in exactly the same way in the constructor / destructor of every listener class.

Saturday, November 6, 2010

Replace method with method object

I have found this technique very useful when untangling complex functions with lots of local variable dependencies. I believe that when a function becomes too complex or too large, it is often a hint that there should be another class here, taking care of business in a more systematic way. At first it seems like a big deal to untangle the complex function, but as you start doing it, you usually realize how much cleaner the logic becomes when it's all separated into the other class.
  1. The original function becomes less cluttered, as it offloads the work to the new class
  2. The new class becomes cleaner, for the same reason, but often even more than expected, because the functionality offloaded to the new class can help clean out other functions in the original class
  3. Once the logic has been dumped into the new class, it may be clearer how to simplify it further, and reduce the coupling between the original and the new class
This is not a trivial process, so I'll refer to Martin Fowler's book for the mechanics (p. 136), with a small sketch of my own after the list:
  • Create a new class, name it after the method
  • Give the new class a final field for the object that hosted the original method (the source object) and a field for each temporary variable and each parameter in the method
  • Give the new class a constructor that takes the source object and each parameter
  • Give the new class a method named "compute"
  • Copy the body of the original method into compute. Use the source object field for any invocations of methods on the original object [my italics]
  • Compile
  • Replace the old method with one that creates the new object and calls compute
  • Now comes the fun part. Because all the local variables are now fields, you can freely decompose the method without having to pass any parameters
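Here's a minimal sketch of those mechanics (the Account / Gamma names echo Fowler's own example, but the body is invented):

class Account; // forward declaration

class GammaCalculator // the method object, named after the method
{
public:
    GammaCalculator(Account* pSource, int nInput, int nQuantity, int nYearToDate)
        : m_pSource(pSource), m_nInput(nInput), m_nQuantity(nQuantity),
          m_nYearToDate(nYearToDate), m_nValue1(0), m_nValue2(0) {}

    int Compute()
    {
        // the body of the original method moves here; former locals are now fields,
        // and any calls back into Account would go thru m_pSource
        m_nValue1 = m_nInput * m_nQuantity;
        m_nValue2 = m_nInput * m_nYearToDate + 100;
        // ...further decomposition is now easy: no parameters need to be passed around
        return m_nValue1 + m_nValue2;
    }
private:
    Account* m_pSource;                       // the source object
    int m_nInput, m_nQuantity, m_nYearToDate; // former parameters
    int m_nValue1, m_nValue2;                 // former temporary variables
};

class Account
{
public:
    int Gamma(int nInput, int nQuantity, int nYearToDate)
    {
        // the original method shrinks to a single delegation
        return GammaCalculator(this, nInput, nQuantity, nYearToDate).Compute();
    }
};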
Now notice the italicized step above. This can be an important glitch, which I'm sure Fowler is aware of but didn't want to get into for the sake of simplicity. Often you'll find that in your attempt to disentangle a function in this fashion, you end up creating 2 classes which are entangled with each other. The new class will (at first) need to make calls to functions of the original class. Of course it would be a shame to leave it there. If your new class ends up calling functions of the original class, that is often a signal that some more functions should be offloaded to the new class, or that some new parameters need to be passed in to the functions of the new class. There is an art to this, and at this point I don't really have a checklist for how that situation should be handled, as it comes on a case by case basis. Sorry, no real-world examples of that here, as the last time I did this type of refactoring was a few months ago.

Sunday, October 31, 2010

Code quality and communication

I find it interesting how much communication patterns and personalities influence code quality.

  • If a programmer thinks he's too cool for school, sooner or later he'll end up making changes to other people's code without asking, often with undesirable side effects
  • If a programmer is afraid to ask how to do this or that, lest he be perceived as weak, he ends up creating duplicate code and reinventing the wheel. When a lot of people reinvent the wheel in different places, maintenance of large code bases becomes problematic. This is also a major waste of time: instead of spending minutes to hours asking about an existing way to do X or Y, someone might spend a day or two (or god forbid even more) doing Z.
  • If everyone is afraid to take responsibility for a subsystem, that subsystem deteriorates. Each programmer sees the mess but doesn't want to / doesn't know how to clean it up, and ends up creating more of a mess. It's important that a software subsystem has an owner / a go-to guy, so that there is an authority on how to do things right, and on what wheels not to reinvent.
Unfortunately communication has an overhead. 
  • It takes time to get off your ass and go to a colleague's desk.
  • An expert / owner of a subsystem may be too busy to help
  • Even though the subsystem expert may have a solution, it may require some wrap-up work on his side to make it fit your requirements. In the end it's a trade off between (for example) 2 hours on your side + 2 hours of his, versus 2-3 days if you do it alone. 
Overprotective system owners have a detrimental effect on code quality. If you are afraid to touch the expert's code, or to ask them for help, you will end up doing the same thing twice.

Helpful and friendly system owners, on the other hand, can very much help reduce the overall maintenance cost, by making sure other people use their system properly and don't reinvent the wheel. They will also try to accommodate their subsystem to new use cases from others. It takes a little up-front effort, but is worth it in the end.

Saturday, October 30, 2010

A manager for heterogeneous data (a refactoring puzzle)

Currently I'm in the process of creating a software subsystem which is responsible for collecting data from multiple places, and then notifying some listener-type entities when that data has been received.

It looks something like this.

Pusher classes => __raw data__ => DataManager => __processed data__ => listener classes

So far so good. The left half of this graph seems to be easy to structure, as the DataManager simply provides a few public functions to push different kinds of data, and those are the calls made from the pusher classes. Even though those functions have different signatures for different types of data, there's only one place of access, namely the manager itself.

The right side of the graph is a bit more problematic. That's because every listener listens to different types of data. Thus the function signatures of the listeners have to reflect the signatures of the corresponding DataManager functions, creating a dependency, and a maintenance nightmare.

So my code currently looks like this.

class ZDataManager
{
public:
    //heterogeneous listener registration / unregistration
    void RegisterFruitListener(ZFruitListener*);
    void RegisterVegetableListener(ZVegetableListener*);
    void RegisterBerryListener(ZBerryListener*);
    //...plus corresponding unregistration functions

    //heterogeneous receiver functions for receiving different types of data
    void ReceiveNewFruit(ETexture, EColor, ETaste);
    void ReceiveNewVegetable(string sVeggieName, float fVeggieWeight);
    void ReceiveNewBerry(float fYumminessFactor, float fVitaminsPerGram);

private:
    //multiple arrays of heterogeneous listeners
    list<ZFruitListener*> m_FruitListeners;
    list<ZVegetableListener*> m_VegetableListeners;
    list<ZBerryListener*> m_BerryListeners;
};


void ZDataManager::ReceiveNewFruit(ETexture eText, EColor eCol, ETaste eTaste)
{
   list<ZFruitListener*>::iterator it = m_FruitListeners.begin();
   for (; it != m_FruitListeners.end(); ++it)
      (*it)->NewFruitReceived(eText, eCol, eTaste);
}

//...here we have almost identical definitions for ReceiveNewVegetable and ReceiveNewBerry, but with different function signatures!!


Then we have the listener classes, doing the following:

void ZFruitListener::Init() {
  GetDataManager()->RegisterFruitListener(this);
}

ZFruitListener::~ZFruitListener() {
  GetDataManager()->UnregisterFruitListener(this);
}


void ZFruitListener::NewFruitReceived(ETexture eText, EColor eCol, ETaste eTaste)
{
   //process new fruit
}


//same idea for ZVegetableListener / ZBerryListener Init, destructor, and receiving functions.


As you can see, I'm writing a lot of duplicate code here: maintaining similar lists, registering classes in a similar way, and multicasting to different functions in a similar way. The main stumbling block is the fact that the function signatures are different, making a smoother solution elusive.

Tuesday, October 26, 2010

Automation of refactoring

I'm really missing automation in my refactoring. A lot of what I do on a day-to-day basis is robotic code moving that has no impact on functionality (or should have none, if you don't make human mistakes). A nice tool for doing basic operations, such as moving functions from one class to another, changing a parameter list, extracting a function from a code chunk, etc., would do so much good!

Many refactoring operations aren't difficult, but even for a refactoraholic like me they can be
  • Robotic
  • Repetitive
  • A pain in the ass
    • The number of primitive actions can be quite large (up to 10-12)
  • Bug-prone
    • It's easy to make a mistake in the long list of primitive actions, and then your code has unnecessary bugs, which were introduced out of good intention
I use Visual Studio 2008, and luckily I can see that Visual Studio 2010 has a lot of the things I've been looking for.  A lot of the refactoring operations have in fact been automated, and I'll probably write about that in my future posts.

Sunday, October 24, 2010

Refactoring pitfalls

Here are some things I found that go wrong during refactoring, or reasons why refactoring is not as easy as it could be:
  • If you have a large enough project, iteration times tend to be long
    • Compilation times are long
    • Startup / loading times are long
    • Running tests takes a long time
  • Refactoring in small steps is difficult, since you either
    • End up spending a long time retesting each iteration, or
    • Try to squeeze in several rounds of refactoring in one sweep, which decreases turnaround time, but increases risk and testing time.
  • The number of test cases is too large and not all tests are (even in principle) automatable. 
    • Corner cases are not always obvious and the effect your refactoring will have is not always clear
The combination of long iteration times and a large number of test cases can make refactoring very costly (and believe me, I've paid the price). You end up doing it in large sweeps, while adding new features at the same time, after which everything is broken and a whole suite of tests needs to be designed (or redesigned). I've had refactorings that took between 1 week and 1 month, and that I regretted starting in the first place. I was happy with the end result, and in the long run it probably still paid off, but the time and energy it took to get the system back on track was enormous, and I'm still looking for better ways. Suggestions are welcome.

Saturday, October 23, 2010

800+ lines of code in 1 function

Yes, it's true, and I just saw it with my own eyes on a real project.

This is why I do refactoring.

Some nice quotes from "Refactoring"

Some nice quotes I found in the book (will update as I'm reading).

Any fool can write code that a computer can understand. Good programmers write code that humans can understand. (p.15)

When I look at unfamiliar code, I have to try to understand what it does. I look at a couple of lines and say to myself, oh yes, that's what this bit of code is doing. With refactoring I don't stop at the mental note. I actually change the code to better reflect my understanding, and then I test that understanding by rerunning the code to see if it still works. (p. 54)

I strongly believe that a good design is essential for rapid software development... Without a good design, you can progress quickly for a while, but soon the poor design starts to slow you down. You spend time finding and fixing bugs instead of adding new function. Changes take longer as you try to understand the system and find the duplicate code. New features need more coding as you patch over a patch that patches a patch on the original code base (p.57)

"Refactoring: Improving the Design of Existing Code" by Martin Fowler

So I started reading THE refactoring book. Admittedly I should have picked it up a long time ago, but now that I'm writing this blog, I have no excuse.

I'm on Chapter 1, and so far the book has met my expectations. It has confirmed the value of most of the techniques I've acquired through experience, and the principles behind them.

Here are some of the highlights in the very illustrative example that Fowler uses in Chapter 1.

  • Move code into a separate function
    • And make sure that the signature has the correct types
  • Rename variables to what makes sense
  • Functions that no longer reference the class they're in are good candidates to move to another class
  • Function signatures can typically be reduced when we move functions to the classes that are more relevant to them
  • After the function has been moved, we can either forward all calls to it from the original class, or replace all references to the original class's function with references to the new class's function
  • Getting rid of temporary variables
    • This is something I don't do as much, because temp variables are typically easier to debug than function calls.
    • Fowler uses the example: result += "\t" + ... + each.getCharge() + "\n"
    • I would very rarely want to place a direct function call inside a string concatenation like this, for 3 reasons
      • It's easier to debug the value if it's on a separate line
      • The return value of the function might be needed again
      • The string concatenation code becomes shorter and easier to read if it's not infested with multiple function calls (and god forbid those functions also have side effects, but that's another story altogether)
  • Replacing switch / case statements with polymorphism
    • I have done this before, but I find the overhead to be prohibitive most of the time
    • This may be worthwhile if the switch / case statement has large chunks of code of 10+ lines for each major case, but if it's a 2-liner for each case, then both readability and performance might be better off with the switch / case, and the overhead of going to the polymorphic solution is not worth your while (see the sketch right after this list).
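To illustrate that last point, here's a quick sketch of my own (a hypothetical shapes example, not from the book): the switch version is two lines per case, while the polymorphic version costs an interface, a subclass per case, and a virtual call.

// switch version: short, local, and easy to read when each case is small
enum EShapeType { CIRCLE, SQUARE };

float Area(EShapeType eType, float fRadius, float fSide)
{
    switch (eType)
    {
        case CIRCLE: return 3.14159f * fRadius * fRadius;
        case SQUARE: return fSide * fSide;
    }
    return 0.0f;
}

// polymorphic version: worthwhile mainly when each case grows large
class IShape
{
public:
    virtual ~IShape() {}
    virtual float Area() const = 0;
};

class ZCircle : public IShape
{
public:
    virtual float Area() const { return 3.14159f * m_fRadius * m_fRadius; }
    float m_fRadius;
};

class ZSquare : public IShape
{
public:
    virtual float Area() const { return m_fSide * m_fSide; }
    float m_fSide;
};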
Fowler also makes good use of UML class and sequence diagrams to illustrate the effect of each step of his refactoring. So far so good.

Tuesday, October 19, 2010

Refactoring pet peeves: Part II (Internal enums)

I dedicate this special rant to internal enums, as this is one I see a lot in practice. Here's a typical example:

class ZActor
{
public:
     //huge chunk of declarations
     enum EActorState
     {
         ALERT,
         BORED,
         HAPPY,
     };
     //more declarations here
     enum EActorType
     {
         MAN,
         WOMAN,
         SNAKE,
     };
     //more declarations
};

class ZDirector
{
public:
   void DirectManWomanScene()
   {
       ZActor::EActorType eFirstActor = m_aActors[0];
       ZActor::EActorState eFirstActorState = m_aActorStates[0];
       switch (eFirstActor)
       {
            case ZActor::MAN:
                   if (eFirstActorState == ZActor::BORED)
                   {
                       //do man stuff
                   }
                   break;
            case ZActor::WOMAN:
                   if (eFirstActorState == ZActor::ALERT)
                   {
                       //prevent man from doing man stuff
                   }
                   break;
       }
   }
private:
   ZActor::EActorType m_aActors[3];
   ZActor::EActorState m_aActorStates[3];
};
  • And so it goes on. You will notice the annoying overuse of the ZActor:: qualifier, and the cluttering of the ZActor declaration with the enums, which should serve as a warning signal to anyone writing or reading this type of code.
  • The problem with internal enums is that typically they are so small that people don't go thru the hassle of creating a separate file for them or taking them outside the class. In practice, however, enums tend to be small enough and conceptually clean enough that they're convenient to pass around and reuse between different classes.
  • In my experience, soon after an enum is created, it's often needed in another class. Internal enums are also a bad idea esthetically, because they clutter up the class declaration.
  • I've seen this again and again, where an internal enum constantly gets referenced from other classes and files, creating an absolutely unnecessary dependency between the user class and the class that contains the enum.
  • If your enum has a tendency to change a lot (as some enums do), then you'll find yourself recompiling unnecessary files every time you change the enum. If the enum is kept in a separate file, only the files that depend on the enum will need to be recompiled.
  • As a rule of thumb, I always move enums outside the class, and better yet into their own file. It is not necessary to have a separate file for each enum, but it's nice to have a separate file for a collection of enums related to the same concepts (a sketch follows below).
  • Of course when the enums are outside a class, we need to make sure there are no conflicts, and it's important to use an appropriate naming convention. Otherwise we get a lot of eUndefined, eUnlisted, eUnknown, eItemCount, eInvalid, eDefault from all over the project colliding.
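To make the last two bullets concrete, here's a minimal sketch of what I mean (the file name and the prefix convention are just my hypothetical choices):

// ActorEnums.h -- a separate file collecting related enums, outside any class
#ifndef ACTOR_ENUMS_H
#define ACTOR_ENUMS_H

enum EActorState
{
    ACTOR_STATE_ALERT,   // prefixed to avoid collisions now that we're at file scope
    ACTOR_STATE_BORED,
    ACTOR_STATE_HAPPY,
};

enum EActorType
{
    ACTOR_TYPE_MAN,
    ACTOR_TYPE_WOMAN,
    ACTOR_TYPE_SNAKE,
};

#endif // ACTOR_ENUMS_H

// ZDirector (or anyone else) now just includes ActorEnums.h:
// no ZActor:: qualifiers, and changing an enum no longer recompiles every user of ZActor.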