Wednesday, December 15, 2010

Code entropy and OO

On a related subject ...

The concept of OO is designed with the assumption that it's good to couple data together with the functions that operate on it. This is the concept of a class. The idea is that the functions in the class have a close relationship to the data, and do all the tricky logic, while the outside world doesn't have to know about the details. This is the idea of modularization, which helps make sure that changes in one part of a system have minimal effects on other parts of the system.

It all goes wrong when class boundaries are defined incorrectly, and classes then need to rely excessively on data from each other. Instead of boundaries that hide details, we end up with boundaries that get in the way of proper usage.

In order to take full advantage of OO, it is imperative to continuously rearrange class boundaries, and maintain the close relationship between data and the code that operates on it. In fact I would go as far as to say this is what distinguishes good OO code from bad OO code. By default, all OO code tends to deteriorate, or increase in entropy precisely because data goes out of alignment with the code that operates on it. This is perhaps the biggest (and very much fair) criticism of OO-haters.
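
To make that concrete, here's a minimal sketch of the kind of boundary rearrangement I mean; the Order/Customer names are made up for illustration. The logic that pokes at another class's data simply moves next to that data.

#include <string>

// Before: Order reaches into Customer's data to do Customer's job,
// so Customer's fields have to stay exposed.
namespace before {
class Customer {
public:
    std::string m_Name;
    std::string m_Street;
    std::string m_City;
};

class Order {
public:
    std::string ShippingLabel() const {
        // Depends on every detail of Customer's data layout.
        return m_Customer.m_Name + "\n" + m_Customer.m_Street + "\n" + m_Customer.m_City;
    }
private:
    Customer m_Customer;
};
}

// After: the label logic lives with the data it operates on,
// and Customer's fields can be hidden again.
namespace after {
class Customer {
public:
    std::string ShippingLabel() const {
        return m_Name + "\n" + m_Street + "\n" + m_City;
    }
private:
    std::string m_Name;
    std::string m_Street;
    std::string m_City;
};

class Order {
public:
    std::string ShippingLabel() const { return m_Customer.ShippingLabel(); }
private:
    Customer m_Customer;
};
}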

What I'm trying to arrive at here, is a way to automate, or semi-automate refactoring in order to take full advantage of OO.

Tuesday, December 14, 2010

Code entropy

Most of the time when I refactor, I strive for some sort of perfection, some sort of lofty goal of readability, maintainability, scalability, etc.
It's all nice and good, but as a proponent of the scientific method, I keep thinking:
"Wouldn't it be nice to quantify all that?"

I mean, if I had some sort of objective measure, a number I could put on my code, I could say "my code has improved by X", or "this code sucks ass because it has a quotient of Y". Then I could perhaps run my code through some Turing Machine and spit out this number, do the same thing after the refactoring, and get a better number.

So today I had this idea of measuring the quality of code via Code entropy. The lower the number, the more maintainable the code (or some other such subjective adjective). Of course, code that has the lowest entropy is code that does absolutely nothing. Code with high entropy would hopefully be code that either does a lot or is very messy. So we have to balance readability with functionality. In the meantime, we could define (good) refactoring as rewriting code in a way that preserves functionality, but lowers entropy.

So what is this elusive entropy concept? Well, I don't really know yet, this is fresh off the press. But one idea I had was to measure code entropy in terms of "coupling", "interdependence", between modules, classes, and functions.
For example, we can measure it in terms of how often one class needs access to members of another.

Here's a very simple example of what I mean.


High entropy version:

class Y {
public:
  Stuff CalculateStuff1();
  Stuff CalculateStuff2();
  void CalculateStuff3();
};

class X {
public:
  void DoStuff() {
    if (y.CalculateStuff1() == y.CalculateStuff2())
      y.CalculateStuff3();
  }
private:
  Y y;
};

Entropy of 3 (class X needs to access members of Y 3 times).

Low entropy version:

class Y {
public:
  void CalculateStuff123() {
    if (CalculateStuff1() == CalculateStuff2())
      CalculateStuff3();
  }
private:
  Stuff CalculateStuff1();
  Stuff CalculateStuff2();
  void CalculateStuff3();
};

class X {
public:
  void DoStuff() {
    y.CalculateStuff123();
  }
private:
  Y y;
};

Entropy of 1 (class X needs to access members of Y only once).

Obviously in the second case the interaction between X and Y is minimal, and because Y has access to its own data while X may not, the low entropy version can be simplified further, depending on what CalculateStuff1, 2, and 3 are doing and what data members they're using. The high entropy version cannot be simplified if Y's data is hidden from X.


There are obviously a lot of open questions here. How do we count when we call a protected function or access a protected member of a base class? How do we count when we call a virtual function? Does this count as 1 or 2?

The idea would be to reward code where the logic is close to the data it operates on, and penalize code that tries to operate on data that doesn't belong to it, which gets rid of the "code smell" called "Feature Envy" in the Refactoring book (p. 80).

Then we could have different types of entropy:
  • Static code entropy
    • A measure of the static structure of the code
    • If we analyze the code "on paper", how often would objects try to access each other's methods / data (see the sketch after this list)
  • Dynamic code entropy
    • A measure of the runtime properties of the code
    • If we actually run the code, how often would methods from other classes be called
  • Potential code entropy
    • A measure of how badly the code could be misused
    • For example, making all your class members public or using global variables is not in and of itself bad, but it has the potential of being used the wrong way and becoming entangled in all sorts of undesirable relationships with other objects.
    • In the example above, the low entropy version has lower potential entropy, because 3 of its functions are private, and in principle are protected from the hordes of malicious hackers or bad programmers
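
As a rough illustration of the static flavor, here's a minimal sketch of a counter that just counts textual member accesses on a foreign object, assuming we're happy with matching source text rather than doing real parsing. CountCrossClassAccesses and the whole approach are made up for illustration; a real tool would work on a syntax tree, not on raw source lines.

#include <iostream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Count how many times the given lines of code touch members of a
// particular object (e.g. occurrences of "y.Something").
int CountCrossClassAccesses(const std::vector<std::string>& lines,
                            const std::string& objectName)
{
    const std::regex access(objectName + R"(\s*\.\s*\w+)");
    int count = 0;
    for (const auto& line : lines)
    {
        auto begin = std::sregex_iterator(line.begin(), line.end(), access);
        count += static_cast<int>(std::distance(begin, std::sregex_iterator()));
    }
    return count;
}

int main()
{
    // Body of X::DoStuff from the high entropy version above.
    const std::vector<std::string> highEntropy = {
        "if (y.CalculateStuff1() == y.CalculateStuff2())",
        "    y.CalculateStuff3();"
    };
    // Body of X::DoStuff from the low entropy version above.
    const std::vector<std::string> lowEntropy = { "y.CalculateStuff123();" };

    std::cout << "high entropy version: " << CountCrossClassAccesses(highEntropy, "y") << "\n"; // 3
    std::cout << "low entropy version:  " << CountCrossClassAccesses(lowEntropy, "y") << "\n";  // 1
}

Run on the example above, it reproduces the 3 vs. 1 from the table, but it obviously says nothing yet about protected members, virtual calls, or inheritance.
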
I think that's all I have to say about the subject for now.

P.S. Apparently the concept of software entropy is not new, but so far I haven't seen it quantified quite like what I'm describing here. More research is needed.

Monday, December 13, 2010

The root of all evil

is duplication of logic. Nothing is worse than 2 subsystems trying to do almost the same thing in sufficiently different ways.

Friday, December 10, 2010

Booleans galore

Often I bump into code which contains convoluted logic involving a lot of booleans.

If the booleans are independent, then the number of states represented by the booleans is 2^n. All too often, however, the number of REAL states that the object can be in (based on those booleans) is much lower. That is because the booleans are interdependent and setting one to true may mean another one needs to be set to false, etc.

After 4-5 such interconnected booleans, the logic can get blurry and the code hard to read.

One thing that seems to help is to replace multiple booleans with one enum variable. The enum represents the true combined state of the object(s). Sometimes a state corresponds to one of the booleans being true, and sometimes to a combination of them. It's important to get an overview of what can happen to the object, which of the booleans depend on each other, and which are completely independent (and perhaps should not be included in this state machine). The number of entries in the enum then represents the actual states of the object.
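
Here's a minimal sketch of what I mean, assuming a made-up connection object whose state used to be tracked with three interdependent booleans; all the names are invented for illustration.

// Before: 2^3 = 8 combinations, but only a few of them make sense,
// and every caller has to keep the flags consistent by hand.
struct ConnectionBools
{
    bool m_IsConnecting;
    bool m_IsConnected;
    bool m_HasFailed;
};

// After: one enum that lists only the states the object can really be in.
enum ConnectionState
{
    ConnState_Idle,
    ConnState_Connecting,
    ConnState_Connected,
    ConnState_Failed
};

class Connection
{
public:
    Connection() : m_State(ConnState_Idle) {}

    bool CanSend() const
    {
        // One readable check instead of a boolean combination like
        // (m_IsConnected && !m_IsConnecting && !m_HasFailed).
        return m_State == ConnState_Connected;
    }

private:
    ConnectionState m_State;
};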

Now this one is very much on a case by case basis, but a few times I've managed to reduce logic from 4-5 interconnected booleans (16-32 states) to 5-10 enum entries, which gets MUCH easier to reason about.

Anything else? Hmm... I think that's all I'm gonna say today. Refactoring awaits. MWAHAHAHA

Monday, December 6, 2010

Optimization vs. readability

It is not uncommon for me to encounter situations where I know the code I'm writing or refactoring is clearly not performance-optimal, and yet more readable.
Here's an example of code that I've encountered recently (excuse the bad formatting, I'm experimenting here):

struct State {
  float m_Prop1;
  int m_Prop2;
  std::string m_Prop3;
  // ...
};

class Manager {
  State oldState, newState;
  std::vector<Element> m_Elements;
};

// ...
foreach elem in m_Elements {
  if (oldState.m_Prop1 != newState.m_Prop1)
    elem.SignalProp1Change();
  else if (oldState.m_Prop2 != newState.m_Prop2)
    elem.SignalProp2Change();
  // ...
}

There are 2 ways to go from here.
The first is to take the comparison outside the loop, so that each comparison only needs to be done once. The second is to move both the comparison and the signaling into the elements, which puts all the important logic where it is actually needed. Here's the comparison.



Performance-optimal version

Code in the Manager header:

struct State {
  m_Prop1, m_Prop2, m_Prop3;
};
class Manager {
  State oldState, newState;
  std::vector<Element> m_Elements;
};

Code in the Manager:

if (oldState.m_Prop1 != newState.m_Prop1)
    foreach elem in m_Elements
        elem.SignalProp1Change();
if (oldState.m_Prop2 != newState.m_Prop2)
    foreach elem in m_Elements
        elem.SignalProp2Change();

Code in the element source file:

// nothing much happening here

Code in the element header file:

class Element {
public:
    void SignalProp1Change();
    void SignalProp2Change();
};

OO version

Code in the Manager header:

struct State {
  m_Prop1, m_Prop2, m_Prop3;
};
class Manager {
  State oldState, newState;
  std::vector<Element> m_Elements;
};

or

class ZState {
  friend class Element;
private:
  m_Prop1, m_Prop2, m_Prop3;
};
class Manager {
  ZState oldState, newState;
  std::vector<Element> m_Elements;
};

Code in the Manager:

foreach elem in m_Elements
    elem.SignalChanges(oldState, newState);

Code in the element source file:

void Element::SignalChanges(const State& oldState, const State& newState) {
    if (oldState.m_Prop1 != newState.m_Prop1)
        SignalProp1Change();
    if (oldState.m_Prop2 != newState.m_Prop2)
        SignalProp2Change();
}

Code in the element header file:

class Element {
public:
    void SignalChanges(const State& oldState, const State& newState);
private:
    void SignalProp1Change();
    void SignalProp2Change();
};

or

class IElement {
    virtual void SignalChanges(const State& oldState, const State& newState) = 0;
};
class Element : public IElement {
    // the rest is the same as above
};

The first solution is more performance-optimal, since if a property hasn't changed we never call any of the functions, skipping the traversal of the list altogether. The second solution is more modular: all the logic is moved into the elements. This way we hide all that logic from the manager, and we can also hide the contents of the state from the manager. This latter approach is more robust, leaving the manager class completely oblivious to the logic of the elements. A new programmer looking at the code base also needs to inspect less code in order to figure out what to do.

So what to do, what to do? I typically prefer the second option, as you may have already guessed. The benefits of modularization are great for maintenance, and the performance gain may only be important if this code is executed very frequently. But I think this example does illustrate an important point. Data hiding can prevent us from getting code that is fully performance-optimized. The more we separate code into clean independent modules, the less we can take advantage of their interdependence.