Tuesday, December 14, 2010

Code entropy

Most of the time when I refactor, I strive for some sort of perfection, some sort of lofty goal of readability, maintainability, scalability, etc.
That's all well and good, but as a proponent of the scientific method, I keep thinking:
"Wouldn't it be nice to quantify all that?"

I mean, if I had some sort of objective measure, a number I could put on my code, I could say "my code has improved by X", or "this code sucks ass because it has a quotient of Y". Then I could perhaps run my code through some Turing machine that spits out this number, do the refactoring, run it again, and get a better number.

So today I had this idea of measuring quality of code via Code entropy. The lower the number, the more maintainable the code (or some other such subjective adjective). Of course, the code with the lowest entropy is code that does absolutely nothing. Code with high entropy would hopefully be code that either does a lot or is very messy. So we have to balance readability with functionality. In the meantime, we could define (good) refactoring as rewriting code in a way that preserves functionality but lowers entropy.

So what is this elusive entropy concept? Well, I don't really know yet; this is fresh off the press. But one idea I had was to measure code entropy in terms of "coupling", or "interdependence", between modules, classes, and functions.
For example, we could measure how often one class needs access to the members of another.

Here's a very simple example of what I mean.


(Stuff here is just a stand-in for whatever result type these calculations return.)

High entropy version:

class Y {
public:
    Stuff CalculateStuff1();
    Stuff CalculateStuff2();
    void CalculateStuff3();
};

class X {
public:
    void DoStuff() {
        if (y.CalculateStuff1() == y.CalculateStuff2())
            y.CalculateStuff3();
    }
private:
    Y y;
};

Entropy of 3 (class X needs to access members of Y 3 times).

Low entropy version:

class Y {
public:
    void CalculateStuff123() {
        if (CalculateStuff1() == CalculateStuff2())
            CalculateStuff3();
    }
private:
    Stuff CalculateStuff1();
    Stuff CalculateStuff2();
    void CalculateStuff3();
};

class X {
public:
    void DoStuff() {
        y.CalculateStuff123();
    }
private:
    Y y;
};

Entropy of 1 (class X needs to access members of Y only once).

Obviously, in the low entropy version the interaction between X and Y is minimal, and because Y has access to its own data while X may not, that version can be simplified further, depending on what CalculateStuff1, 2, and 3 are doing and what data members they use. The high entropy version cannot be simplified if Y's data is hidden from X.
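
To make the counting idea concrete, here is a very rough sketch of how the static flavour of this metric could be computed. It just counts textual "y." member accesses in a snippet of source with a regular expression, which is obviously naive (a real tool would walk the AST); the function CountMemberAccesses and the snippets are invented purely for illustration.

#include <iostream>
#include <regex>
#include <string>

// Naive "static entropy" counter: how many times does `object.something`
// appear in a piece of source text? A real tool would analyze the AST
// instead of pattern-matching on raw text.
int CountMemberAccesses(const std::string& source, const std::string& object)
{
    std::regex access("\\b" + object + R"(\s*\.\s*\w+)");
    auto first = std::sregex_iterator(source.begin(), source.end(), access);
    return static_cast<int>(std::distance(first, std::sregex_iterator()));
}

int main()
{
    const std::string high_entropy_body =
        "if (y.CalculateStuff1() == y.CalculateStuff2())\n"
        "    y.CalculateStuff3();\n";
    const std::string low_entropy_body = "y.CalculateStuff123();\n";

    std::cout << CountMemberAccesses(high_entropy_body, "y") << "\n";  // prints 3
    std::cout << CountMemberAccesses(low_entropy_body, "y") << "\n";   // prints 1
}

In the terms defined above, a refactoring would then count as "good" whenever this number goes down while the functionality stays the same.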


There are obviously a lot of open questions here. How do we count when we call a protected function or access a protected member of a base class? How do we count when we call a virtual function? Does this count as 1 or 2?

The idea would be to reward code where the logic lives close to the data it operates on, and to penalize code that tries to operate on data that doesn't belong to it, which would discourage the "Feature Envy" code smell described in the Refactoring book (p. 80).

Then we could have different types of entropy:
  • Static code entropy
    • A measure of the static structure of the code
    • If we analyze the code "on paper", how often would objects try to access each other's methods / data
  • Dynamic code entropy
    • A measure of the runtime properties of the code
    • If we actually run the code, how often would the methods of other classes actually be called (see the sketch after this list)
  • Potential code entropy
    • A measure of how badly the code could be misused
    • For example, making all your class members public or using global variables is not in and of itself bad, but it has the potential of being used the wrong way and becoming entangled in all sorts of undesirable relationships with other objects.
    • In the example above, the low entropy version also has lower potential entropy, because 3 of its functions are private and are, in principle, protected from the hordes of malicious hackers or bad programmers.
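
Here is an equally rough sketch of the dynamic flavour, with an invented global counter purely for illustration: Y's public entry point bumps the counter on every call from outside, so running the program tells us how many cross-class calls actually happened, as opposed to how many are visible on paper.

#include <iostream>

// Hypothetical instrumentation for "dynamic entropy": every call into Y from
// another class bumps a counter, so the total after a run shows how often X
// actually reached into Y.
static int g_cross_class_calls = 0;

class Y {
public:
    void CalculateStuff123()
    {
        ++g_cross_class_calls;
        // ... the real work would go here ...
    }
};

class X {
public:
    void DoStuff() { y.CalculateStuff123(); }
private:
    Y y;
};

int main()
{
    X x;
    for (int i = 0; i < 10; ++i)
        x.DoStuff();
    std::cout << "dynamic entropy: " << g_cross_class_calls << "\n";  // prints 10
}

In practice such instrumentation would of course have to be injected automatically by a tool rather than written into every class by hand.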
I think that's all I have to say about the subject for now.

P.S. Apparently the concept of software entropy is not new, but so far I haven't seen it quantified quite like what I'm describing here. More research is needed.

3 comments:

  1. The metric has already been invented.
    (in Russian): http://blog.gamedeff.com/?p=321

  2. You should check this paper out (from 1992)

    http://portal.acm.org/citation.cfm?id=141354

  3. This paper also seems to be relevant: http://www.cs.umd.edu/~basili/publications/technical/T94.pdf

    If I can only get myself through the set theory notation ...
