The confessions of a refactoraholic: May 2011

Wednesday, May 25, 2011

C++ woes

and the tedium of needless detail never ends ... it almost feels like a prison sentence. Do I have to write 50 for loops with an early exit every fucking day?

Sunday, May 22, 2011

Automated code analysis and refactoring

Here's another paper I found very practical:

http://www.ibm.com/developerworks/library/j-ap07088/

This paper talks about automated identification of "code smells" and a semi-automated approach for fixing the smells with refactoring.

Unfortunately the tools mentioned in the paper are for Java only, but it's a start. Hopefully I can find something similar for C++ and start applying it in my everyday development at work.

It's actually funny to see the low tolerance the authors have for certain practices:

"Many consider a cyclomatic complexity number of 10 or more to indicate an overly complex method" (p.4)

"A general rule of thumb I try to adhere to is to keep my methods to 20 lines of code or fewer. Of course, there may be exceptions to this rule, but if I've got methods over 20 lines, I want to know about" (p.8-9)

Code toxicity

So finally I found a piece of work that is in line with what I was thinking in terms of code entropy. What I like about it is that it's actually quite comprehensive, as it takes into account several factors, such as function length, cyclomatic complexity, nested-if depth, etc.

Well without further ado, here's the link:

Doesn't look like I'm gonna be taking the world by surprise by making a brilliant academic thesis on the matter.
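To make the factors mentioned above a bit more concrete, here's a small made-up function annotated with how I'd count them. The function itself and the counting convention (decision points plus one, not counting boolean operators) are my own assumptions; tools differ in the details.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical function, written only to illustrate the metrics.
// Counts positive even values, stopping early once `limit` are found.
int CountPositiveEvens(const std::vector<int>& values, int limit)
{
    int count = 0;
    for (std::size_t i = 0; i < values.size(); ++i)  // decision point 1
    {
        if (values[i] <= 0)                          // decision point 2
            continue;
        if (values[i] % 2 == 0)                      // decision point 3
            ++count;
        if (count == limit)                          // decision point 4
            break;                                   // the early exit
    }
    return count;
}
// Cyclomatic complexity (assumed convention): 4 decision points + 1 = 5
// Deepest nesting: 2 (an if inside the for)
```

By the thresholds quoted above, this one is still comfortably within limits: complexity 5 and well under 20 lines.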

Stages of development for code written from scratch

I'm going to try to reverse engineer the process I go through during software development. At the end of the day, this is some sort of optimization with respect to "maintainability" and "clarity". I haven't quite nailed down the formal definitions of these, but I'm going to try at a later stage.

So here's the process I go through if I start working on code from scratch:

Rapid prototyping. Use structs or classes with all public data members and functions. At this point we don't know what the structure of the problem is, so there's no point in "locking down" classes or hiding members. When the structure of the problem is not clear, "boundaries" between classes are arbitrary. Which data and what functions belong together will become clear later.
Exploration. Getting some results the quickest way, so as to get some iterations, user feedback, etc.
Paradigm adjustment. The problem space is clearer. Classes actually correspond to concepts in the real world, although the data they operate on might be elsewhere. Now that the problem space is clearer, the vocabulary used in the prototype no longer fits.
Consolidation. It is now clear what problem we're trying to solve and how we're trying to solve it. As concepts are clearer, so are the class boundaries. Most operations should now be on members of the same class. Most data members should be private, as they're no longer needed from the outside. Implementation details are now hidden, and client code only needs to know a small fraction of them: the "interface", the API via which the user can interact.
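Here's a minimal sketch of what the first and last stages might look like side by side. The polyline example and every name in it are hypothetical, invented just for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Stage 1 (rapid prototyping): everything is public, and the logic
// lives in a free function that reaches straight into the data.
// Nothing stops a caller from letting xs and ys get out of sync.
struct SProtoPolyline
{
    std::vector<double> xs;
    std::vector<double> ys;
};

double ProtoLength(const SProtoPolyline& p)
{
    double total = 0.0;
    for (std::size_t i = 1; i < p.xs.size(); ++i)
    {
        const double dx = p.xs[i] - p.xs[i - 1];
        const double dy = p.ys[i] - p.ys[i - 1];
        total += std::sqrt(dx * dx + dy * dy);
    }
    return total;
}

// Stage 4 (consolidation): the data is private, the operations that
// use it are members, and clients only see a small interface.
class CPolyline
{
public:
    // The only way to add data, so both vectors always match in size.
    void AddPoint(double x, double y)
    {
        m_Xs.push_back(x);
        m_Ys.push_back(y);
    }

    double Length() const
    {
        double total = 0.0;
        for (std::size_t i = 1; i < m_Xs.size(); ++i)
        {
            const double dx = m_Xs[i] - m_Xs[i - 1];
            const double dy = m_Ys[i] - m_Ys[i - 1];
            total += std::sqrt(dx * dx + dy * dy);
        }
        return total;
    }

private:
    std::vector<double> m_Xs;
    std::vector<double> m_Ys;
};
```

The computation is the same in both versions. What changes is who is allowed to touch the data, which is the part that only becomes clear once the problem is understood.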

I believe in the end this process is about reducing code entropy, which I'm trying to define. I will ultimately need to:

Define "code entropy" more formally
Refine the process described in this post
Show that the process described in this post leads to code with lower entropy

Relevant previous posts:

Code entropy

Sunday, May 15, 2011

Why manual linked lists are evil

Recently I've been reminded of a pattern in code I particularly dislike: manual traversal of linked lists. I couldn't quite put my finger on why I have an aversion to this practice, but I'm gonna try to pinpoint it right here in this post, and solve this moral dilemma once and for all.

The pattern looks something like this (here I'm not talking about design patterns, just a pattern of usage).

//declaration code
struct SPointsList
{
    SPoint m_CurPoint;
    SPointsList* m_Next;
};

void foo(SPoint& pt);

//...
//client code
SPointsList points;

//.. some annoying initialization code goes here, creating a non-empty valid list
SPointsList* pt = &points; // start at the head node
while (pt != NULL)
{
    foo(pt->m_CurPoint);
    pt = pt->m_Next;
}

So what don't you like about this code, Mr. Refactoraholic? It's efficient, it's compact, it's not using templates, it's doing just what it's supposed to do and no more. Doesn't that fit into your favorite K.I.S.S. principle? Aren't you just being a tight-ass abstractionist who is just waiting for an excuse to use some templates and iterators that you read about in your favorite design pattern book?

Well, I'm glad you asked. Here are some of the things I don't like about this code:

It reinvents the wheel. Implementing a linked list from scratch is an exercise I learned to do in high school, and yet in practice I try to avoid it as much as possible, for the reasons below.
It manipulates raw pointers directly, an overused feature of C / C++ that most other languages have wisely chosen to get away from.
It is prone to off-by-one errors, since you have to be careful not to hit uninitialized memory or to get yourself into an infinite loop.
Client code also has to be careful to construct a valid data structure (the last element needs to point to NULL, so that elsewhere in client code we can check for it). I didn't include an example of this, simply to keep my blood pressure low and my stomach from getting too upset.
All in all, the main negative thing about using manual linked lists is that this approach exposes implementation details to the client, and client code everywhere has to adjust to that. If the representation changes, client code has to be rewritten, and that's a real waste, as this code could have been written in a more flexible way from the start.
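For the record, here's a rough sketch of the kind of bookkeeping a manual list pushes onto client code. SPoint was never defined in this post, so its two int members are an assumption, and PushFront, CountPoints and FreeList are hypothetical helpers made up for this example.

```cpp
#include <cstddef>

// Assumed definition; the real SPoint could look quite different.
struct SPoint
{
    int x;
    int y;
};

struct SPointsList
{
    SPoint m_CurPoint;
    SPointsList* m_Next;
};

// Hypothetical helper: prepends a point and returns the new head.
// The caller owns every node and must eventually free them all.
SPointsList* PushFront(SPointsList* head, const SPoint& pt)
{
    SPointsList* node = new SPointsList;
    node->m_CurPoint = pt;
    node->m_Next = head;  // NULL for the first node, which terminates the list
    return node;
}

// Walks the list the manual way and counts the nodes.
int CountPoints(const SPointsList* head)
{
    int count = 0;
    for (const SPointsList* p = head; p != NULL; p = p->m_Next)
        ++count;
    return count;
}

// Frees every node. Forget to call this anywhere and you leak.
void FreeList(SPointsList* head)
{
    while (head != NULL)
    {
        SPointsList* next = head->m_Next;  // read the link before deleting
        delete head;
        head = next;
    }
}
```

Every one of those comments is a rule that client code has to remember, and none of them is enforced by the compiler.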

One way of writing this code would be:

#include <list>

typedef std::list<SPoint> SPointsList;

//client code:
SPointsList points;

//some initialization code goes here, using list::insert, or list::push_back

SPointsList::iterator itr = points.begin();
for (; itr != points.end(); ++itr) {
    foo(*itr);
}

We can argue about how pretty this code is or that it uses more characters than the original pointer version. But in the end, this approach hides all the nasty details and pointer manipulation, and changing the representation does not affect the client code at all.

There's also a more neutral approach that avoids using the STL, where we basically only implement the functions that are needed, but still use the iterator-based approach to hide the details of the pointer math. Here's a very sloppy version of that:
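As a quick sketch of that last claim, here the typedef points at std::vector instead of std::list, and the traversal is untouched. SPoint's members, the body of foo, and the VisitAll wrapper are all assumptions made up for this example.

```cpp
#include <vector>

// Assumed definition; the post never shows the real SPoint.
struct SPoint
{
    int x;
    int y;
};

// The only line that names the container. Swapping it back to
// std::list<SPoint> (with #include <list>) leaves VisitAll as-is.
typedef std::vector<SPoint> SPointsList;

// Stand-in for foo(); here it just shifts the point along x.
void foo(SPoint& pt)
{
    pt.x += 1;
}

// Hypothetical wrapper around the same loop shown above.
void VisitAll(SPointsList& points)
{
    for (SPointsList::iterator itr = points.begin(); itr != points.end(); ++itr)
        foo(*itr);
}
```

The loop only relies on begin(), end(), ++ and *, which both containers provide, so the client code does not care which one sits behind the typedef.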

struct SPointsList
{
    SPoint m_CurPoint;
    SPointsList* m_Next;

    void AddNewPoint(const SPoint&, SPointsList* pos);
    void RemovePoint(SPointsList* pos);

    SPointsList* GetNext() {return m_Next;}
    bool IsLast(SPointsList* pt) {return pt==NULL;}
};

Perhaps only implementing the functions you need can reduce code bloat and allow you to special-case optimize access to your data structure, but you have to do more work to reinvent the wheel, and error-proof your internal pointer math for adding / removing elements.
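A slightly less sloppy take on the same idea might be a tiny hand-rolled iterator, so that exactly one class knows about m_Next. This is only a sketch under my own assumptions: SPointNode, CPointIterator and SumX are hypothetical names, and SPoint's members are made up.

```cpp
#include <cstddef>

// Assumed definition; the post never shows the real SPoint.
struct SPoint
{
    int x;
    int y;
};

struct SPointNode
{
    SPoint m_Point;
    SPointNode* m_Next;
};

// Hypothetical iterator: the only code that follows m_Next.
class CPointIterator
{
public:
    explicit CPointIterator(SPointNode* node) : m_Node(node) {}

    bool IsDone() const { return m_Node == NULL; }
    void Next() { m_Node = m_Node->m_Next; }  // only valid while !IsDone()
    SPoint& Get() const { return m_Node->m_Point; }

private:
    SPointNode* m_Node;
};

// Example client: no raw pointer walking, no NULL checks.
int SumX(SPointNode* head)
{
    int sum = 0;
    for (CPointIterator it(head); !it.IsDone(); it.Next())
        sum += it.Get().x;
    return sum;
}
```

Client code still has to build and free the nodes somewhere, so this only fixes the traversal half of the problem.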

[It seems like Blogger is not the most code-snippet friendly of environments, so I'm gonna look into switching to WordPress or something else]
