You are viewing [info]fairweatherinbg's journal

Leaving LJ

Long story short, I have become so weary with LJ's latency, lack of features and its abysmal "support" for posting code, that I've decided to move my blog elsewhere. I've registered fairweatherinbg.blogger.com and will post my new articles there.

Ciao

Sorting Algorithms

I'm currently reading Robert Sedgewick's "Algorithms in C" and even though I like the book as such, I must say I'm disappointed with the quality of the translation. Apart from the numerous spelling errors and inconsistencies, (which it's the editor's job to point out and to ensure they're fixed), the translator keeps inventing idioms and translating word-for-word what was best retold, which seriously hampers the exposition. The rendering idioms and terminology begotten in another language than the one they were developed in is a tricky business, but I'd much rather have a colloquial explanation (with idioms that I'm likely to hear somewhere else, not just in this particular edition), accompanied by a footnote explaining the English origins and connotations, than a word-for-word-whole-sort-of-general-mish-mash or an English-(whatever my language is) soup.

On the other hand, what makes the book stand out and redeems it as a source of instruction, is the ingenuous presentation of the algorithms and structures - the book is full of small diagrams, slide-shows and charts which help you to understand what's going on, and there is a good proportion of exercises and tangential information to make it worth reading (the example program fragments are also quite good). In general, I would say it looks like a really good all-round book on its subject, but I'd rather have bought the English version.

Speaking of presentation, here is an interesting site I discovered yesterday. It contains a (small) gallery of visualisations of sorting algorithms as well as short annotations. It's quite interesting, especially as it allows you to compare different algorithms and scenarios side by side, but it's disadvantage is that the maximum problem size (50 elements) isn't large enough to allow the asymptotic differences between the algorithms to really kick in.

http://www.sorting-algorithms.com/

Edit: Here are a few more pages with algorithm animations listed by site/author

D. Thiebaut of Smith College - Very interesting visualization, including option to display the executed code or the number of swaps and comparisons

Thomas Baudel - Very good visualization, allows you to select a variety of starting distributions.

Hat Full Of Hollow - Static presentation of small samples (see for yourself)

Morin - Includes some algorithms I haven't heard of before

Staerk - Somewhat similar to sorting-algorithms.com, only with bigger test samples, lower latency and no explanations.

Sorting Contest - The difference in speed between the different algorithms is very obvious here, as well as just how much Bubblesort sucks. Also features bi-tonic sort

Programming tidbits

Some time ago a friend of mine asked me what the most difficult aspect of my job as a developer are and so I decided to share some small tips and lessons I've learned since I started working.

Learn the tools of the trade


The debugger is your friend. Source control is your friend. Virtual machines are your friends. Diff and grep are your friends. Regular expressions are your friends. If you know how to use the appropriate tool for the situation you can save yourself a lot of drudgery and avoid repetitive task. Seeing how programming is one part text manipulation, there is a lot to be gained from a well-placed regex or Perl script. Profilers and unit tests are also important, but I don't understand them well enough to discuss them yet.

Don't make your job more difficult

As time progresses, I can identify the most problematic places in the code as the ones I need to reread most often. Making sure your flow control makes sense, is consistent and consistent can mean a lot in the long run, especially when you're adding new features. Here are some examples:

Don't control flow with exceptions or return primitive error codes if you can avoid it. Instead of throwing a different exception from inside the function, slowing down and polluting the external code, or coming up with an ad-hoc scheme which forces you and the maintainer to remember what error-code "3" means, look for something more elegant. For example:


public enum DataRetrievalResult {
ConnectionFailed = 0,
DataNotAvailable,
InvalidData,
OK
}

public enum DateSelectionResult {
NoDateEntered = 0,
NonExistingDate,
DateOutOfRange,
OK
}

Enumerations like these mean that the calling code becomes much more cleaner and self-documenting, and makes handling different cases elegant.  Enums also have other nifty uses; imagine you have a bunch of functions that rely on a certain state to execute properly, and which a) can call each other and b) must restore the pre-execution state upon finishing (for example, opening and closing a file, or preparing and establishing a database connection). Here is a simple solution:

enum ConnectionState {
ObjectNotInstantiated = 0x001,
ConnectionNotEstablished = 0x010,
Connected = 0x100
}

public void Func1() {
ConnectionState state = MakeConnection();
try {
//...
int k = Func2();
//...
}
finally {
RestoreConnectionState(state);
 
}
}

public int Func2() {
ConnectionState state = MakeConnection();
try {
//...
}
finally {
RestoreConnectionState(state);
}
}

where MakeConnection establishes a connection and returns the state the program was before it was called (in the form of a bitwise union of ConnectionState's), and RestoreConnectionState takes a parameter ConnectionState and returns the program to that state. Clearly when calling Func2 in the brown code, it will not try to re-establish the connection that Func1 created, and its RestoreConnectionState will not destroy the connection, because it was started somewhere else (in Func1;s MakeConnection()).

Prepare for trouble

Trouble happens in code and it's best to prepare for it. You don't want to spend an exhausting afternoon chasing elusive bugs and you certainly do not want to learn that your customers are having problems and you have no idea what's going on. Implementing a centralized, sensible exception-handling system is one of the first thing you need to do when you're creating an application, and making sure you log all exceptions, stack-trace included, somewhere safe so that your customers can email it to you when problems occur. There is a lot to be said about exceptions, bur I don't think I'm qualified enough on the topic to give many opinions, but I'll just repeat - avoid using exceptions to control your program. Exceptions are meant to be unexpected, and if there is an error condition that you cannot prevent with conventional checks (with the exception of OutOfMemory and the like), it means someone somewhere made a mistake.


Don't make your job more difficult n2:

Assume as little as you can about anything. This principle really revealed its importance to me when I started event-driven programming, and probably holds just as strongly in concurrent programming.
  • Never assume an event is going to be fired
  • Never assume an event is not going to be fired
  • Never leave event handlers do the work for you.
The point of the first two points is obvious - in an event-driven context, every function that you use could potentially trigger an event, whose handlers can trigger other events, etc, usually in unpredictable order. A particularly bad aspect of this is that sometimes this isn't really obvious for a while, in cases when the interaction does not break the code; the surprise comes when a new addition or minor change causes everything to topple down or when you step inside the debugger to discover that the call stack is 20 stories high. Disentangling event-handlers is no fun at all, so I've come to the simple solution:

For each event-raiser you write, create a flag which will indicate that the event is not meant to be fired. For example, you have an OnTextChanged trigger, which is supposed to fire the event when the user types something. Chances are you do not want it to fire when you are changing the text programmatically, but even if you do, it's much better to do it explicitly, rather than allow an implementation detail (which could change) do it for you:

bool bf_text_changed;

void OnTextChanged(EventArgs e) {
if(bf_text_changed)
return;

 
//... do stuff

TextChanged(this, e);
}

Another idea I've found useful is to implement a catch-all flag which determines whether the event was raised by the user or by my own code.

Project Management

Take lots of notes, always carry around a convenient notebook and utilize personal information managers (I use AM Notebook). Nothing irritates your clients more than forgetting to do something you discussed with them at length, and being able to see your all your requirements on a page makes planning that much easier.

...

Of course, none of these guidelines are absolute and I'm sure someone out there disagrees or has a salient comment regarding them. Please feel free to comment if you have something to add/discuss.
[this post is based on an article by Eric Lippert which can be found here]

Last time, we devised an implementation of the stack structure which does not allow the user to change the contents of its elements any more than the rules of mathematics allow her to change the value of the number 10 through the variable x. To recap, the stack in question was an enumeration of elements which allows the user to get the information it contains without changing the in-memory data. As an added benefit, forking mechanism used to create new Stacks means that two or more stacks instantiated from a common ancestor will share all nodes from the point of forking down. A particular Stack, seen as a sequence of nodes, will exist in memory as long as any variable that points to a Stack constructed above it exists, thanks to the linked implementation; once all references have been popped, the C# or Java Garbage Collector will take care of all nodes which are not longer part of any stacks.

However, our forking implementation also suffers a drawback which results in suboptimal use of memory. Consider

IStack<int> stack_0 = Stack<int>.Empty.Push(0).Push(-1);
IStack<int> stack_1 = stack_0;
IStack<int> stack_2 = stack_0;

//What these queries do is basically push 10, 20, ..., 1000 on both stacks
//The
brown code segments are lambda functions
Enumerable.GetRange(1, 1000).Where((i) => i%10==0).Aggregate(stack_1, (s, j) => s = s.Push(j));

Enumerable.GetRange(1, 100).Select((i)=>i*10).Aggregate(stack_2, (s, j) => s = s.Push(j));

So what happens now? We have two identical 100-element Stacks, which were however created from a common parent. Clearly, both those structures contain the same information, and since they cannot be changed, we have a redundancy of 100 elements. Ideally, we would have used stack_1 to instantiate stack_2, thereby making them share 102 instead of just 2 elements (0 and -1). However, this is not always possible. By far the most likely scenario for me is that two threads with reference to the same memory location decide to perform the same sequence of Push()'es on it, which results in awful redundancy; the possibility that the same stack segment occurs randomly is a lot more improbable than two worker threads performing the same task.

The natural solution of this problem is to memoize the results of a particular Push() and give that address to all subsequent requests we consider identical. In the perfect case, we would store each node-value at most one, initializing it the first time it is requested. However, a way to do this while preserving the structural information of each Stack does not immediately occur to me, so let's do it differently - using a separate lookup table for each IStack, which will hold references to all stacks instantiated from it.

In passing, we will also make the IStack<T> interface covariant - i.e. a function which expects a IStack<T> variable will also accept any assignment-compatible type, such as IStack<U> where T < U.

First, let's remove the instance Push() from both the interface and the Stack<T> and EmptyStack<T> implementations (note that the template parameter for EmptyStack is implied by its containing class, Stack<T>

interface IStack<T> : IEnumerable<T>
{
T Peek();
IStack<T> Pop();
IStack<T> Push(T elem);
bool IsEmpty
{
get;
}
}

static class Stack<T>
{
//..
private static class EmptyStack : IStack<T>
{
public bool IsEmpty { get { return true; } }
 
             public IStack<T> Pop() {
                  throw new InvalidOperationException("Empty stack");
             }

             public T Peek() {
                  throw new InvalidOperationException("Empty stack");
             }

            
public IStack<T> Push(T value) {
                  return new Stack<T>(value, this);
            }

            //...

}

public bool IsEmpty {
get { return false; }
}

static EmptyStack _empty;
public IStack<T> Empty {
get { return _empty; }
}

private readonly IStack<T> _tail;
private readonly T _head;

public T Peek() {
return _head;
}
 
 
private Stack(T head, IStack<T> tail) {
_head = head;
_tail = tail;
}

public IStack<T> Push(T elem) {
return new IStack<T>(elem, this);
}


//...
}

 
There. Now we know that the stack can never be used as a consumer, so it is safe to assign to a superior type. But how do we create new stacks without Push() ? And what about the memoization ? Read on:

//...
public static Stack<T> Push(T head, IStack<T> tail) {
return new Stack(head, tail);
}

What's going on? Well, quite simply, given a Stack of some type, say Middle, we can only do

IStack<Bottom> st1 = Stack<Bottom>.Push(new Bottom(),  Stack<Bottom>.Empty);

st1 = Stack<Bottom>.Push(new Bottom(), st1);

IStack<Middle> st2 = Stack<Middle>.Push(new Bottom(), st1);

st2 = Stack<Middle>.Push(new Middle(), st2);

st2 = Stack<Middle>.Push(new Bottom2(), st2);

All the nodes, excuse me, stacks, which used to be Stack<Bottom> are still Stack<Bottom>. The type of the reference we have, however, has expanded to Stack<Middle> and now we can treat st2 as a stack of Middles, which, provided Bottom is indeed substitutible for Middle, is perfectly ok, even though the tail is not of the same type.

On the other hand, were we using the old instance method for Push(), the following would be impossible:

IStack<Middle> st_1 = Stack<Bottom>.Empty.Push(new Bottom());

st_1 = st_1.Push(new Bottom2());

In this case, assuming as before that Bottom and Bottom2 are siblings, this is what would happen in the constructor:

private Stack(T head,  //<-- Bottom2
           IStack<T> tail //<-- Stack<Bottom>
) {
_head = head;
_tail = tail;
}

Having the construction as a type-method rather than an instance-method allows us to expand the type of the structure.

And now for the memoization
 
Dictionary<T, IStack<T>> branches = new Dictionary<T, IStack<T>> ();
private object lock1;

public IStack<T> Push(T elem) {

lock(lock1) {
if (!branches.Keys.Contains(elem)) { //Ensure before-after semantics
branches[elem] = Stack.Push(this, elem);
}
}
return branches[elem];
}

This code ensures that each Stack<T> object can only create a single new Stack<T> for a value of T, and since all Stacks of T stem from Stack<T>.Empty, we can never have two identical stacks who do not share their tail. Note the synchronization structures (in brown) which make sure we get predictable behavior in multi-thread scenarios.

Immutable data structures and C# - 1

As you may know, functional programming is a programming paradigm which strives towards eliminating the program state. Unlike procedural programs, which are sequences of operations over stored data, an FP program is a set of functions definitions which can be evaluated under certain conditions, not unlike mathematical operations. This approach towards program structure is important not only because it promotes structured code in itself, but because the immutability of all objects can be a great asset in the face of concurrent programming, where access to data locations is uncertain and the interaction between diiferent execution paths introduces additional complexity.

Think about it: you don't need to worry whether the reference you have points to null, to an object which is in the middle of being modified or in an unacceptable state; you aren't passing a reference, but a definition, which is timeless and immutable; different threads or processes can compose and derive from your definition, but what you have remains the same. Seeing how concurrent execution constantly grows in importance, these are desirable features to have, as well as some others ( see if you can guess ).

One object-oriented language that has been adopting functional constructs is C#, Microsoft's derivation of C++ and Java. In a series of thoroughly enjoyable posts, their compiler engineer Eric Lippert discusses the FP approach towards data structures and how it can be implemented in contemporary C#. This discussion is going to be based on his treatment of the subject in the beginning, but I may have some more to say about it myself. I'm going to use C Sharp for the code samples, and while there won't be anything too sophisticated, I'm not going to use baby-talk either. A reasonable understanding of Java or C Sharp and some googling will get you through with no problems.

The precise term to refer to structures which are defined, rather than built, is immutable - which cannot be changed once defined; an example of this is the C#'s String type - every operation over a string produces a new instance without changing the original. However, when working with complex and potentially large structures such as stacks, the copying approach is too inefficient. The answer to this lies in the definition of the data structure - instead of instantiating a new copy with the applied, we produce a new definition. Let us describe this process:

1. Define an ADT's structure:

Stack of T = <empty> | ( <instance of T> and (Stack of T) )

Apparently, <empty> has to be a constant value; we need to define an <empty> object with some defined properties (in this case, the empty stack cannot be popped) and we also need to make sure that the rest of the program can treat it as a normal stack (for example, using the same interface), but is also able to distinguish it.

2. Define its behavior:

//<T> is equivalent to "of T" for those who aren't used to generics
//Deriving IEnumerable<T> means we can use the stack object inside a foreach statement or similar constructs

interface IStack<T> : IEnumerable<T>
{
T Peek();
IStack<T> Pop();
IStack<T> Push(T elem);
bool IsEmpty
{
get;
}
}

As you can see. Pop and Push return define another IStack (a Stack<T> or an empty stack) with either one element more or one element less. You will see how this turns out shortly.

3. Implement empty singleton

static class Stack<T>
{
//..
private static class EmptyStack : IStack<T>
{
public bool IsEmpty { get { return true; } } //Obviously
 
             public IStack<T> Pop() {
                  throw new InvalidOperationException("Empty stack");
             }

             public T Peek() {
                  throw new InvalidOperationException("Empty stack");
             }

             public IStack<T> Push(T value) {
                  return new Stack<T>(value, this);
            }

  //IEnumerable<T> - don't worry if you don't understand the implementation, it does just 
  //what we said it should - present the stack as a sequence of values ( empty sequence in
  //this case )

 
            public IEnumerator<T> GetEnumerator() { yield break; }
            IEnumerator IEnumerable.GetEnumerator() { return this.GetEnumerator(); }
}
}

4. Now for the real stack

sealed class Stack<T>
{
//EmptyStack definition
//...

//The real stack
public bool IsEmpty {
get { return false; }
}

static EmptyStack _empty; //one object for each  value of T we use

public IStack<T> Empty {
get { return _empty; }
}

private readonly IStack<T> _tail; //readonly means they cannot be changed outside the constructor
private readonly T _head;

public T Peek() {
return _head;
}
 
 
private Stack(T head, IStack<T> tail) {
_head = head;
_tail = tail;
}

public IStack<T> Pop() {
return _tail;
}

public IStack<T> Push(T elem) {
return new IStack<T>(elem, this);
}

//IEnumerable<T>
//See if you can follow the logic of the for loop

 
public IEnumerator<T> GetEnumerator() {
      for (IStack<T> stack = this; !stack.IsEmpty; stack = stack.Pop()) {
             yield return stack.Peek();
       }
 }

 IEnumerator IEnumerable.GetEnumerator() { return this.GetEnumerator(); }
}

As you remember, the key concept here is the ability to define a "stack" using another "stack". In this case, an instance of this class is nothing more than a single value of T that can be inspected using Peek()*, and a reference to another "stacky" object - which might turn out to be the the EmptyStack, or indeed any other object which conforms to IStack<T>, including a normal mutable stack**. The upshot of this is that we cannot modify an instance of a this class - we can make it produce another one or to give a reference to its tail***, but the object behind the reference will always be the same thing. You can ask aStack.IsEmpty and you will know that the reference aStack points to a non-empty stack until you assign it to something else, no matter what context changes take place and what other entities do - nobody can touch it.

(*) As you see, the retrieval and removal operations have been implemented as different methods; since neither of those has any side effects ( and no method in an immutable class should, anyway ) we are not concerned whether the operation is atomic. Since the operations which would otherwise mutate the object now need to return a new one, it feels more natural to define separate methods for them and the rest of the class' behavior. Whether or not this is a benefit is up to each programmer

(**) Which would be bad

(***) Hopefully, also immutable

Here is an the intended usage pattern of this class:

//.....
IStack<int> aStack = Stack<int>.Empty; //The constructor is for private use only so we start from another stack

aStack = aStack.Push(10);
//Even though the reference has been modified, the object it used to address is still the same; all other references //anyone may have to it still mean the same thing.

aStack.Push(10).Push(12).Push(100);
//Both the reference and the object are the same

//The (s,n) => s.Push(n) lambda includes implicit assignment.
Enumerable.GetRange(1,1000).Aggregate(aStack, (s,n) => s.Push(n));

//We now have a stack with 1001 objects 1001 lightweight stack objects, each one of which defines the next one

IStack<int> bStack = aStack.Push(5).Push(-100).Push(int.Max);

//aStack is unchanged; bStack is defined by 1004 small IStack<int> objects, 1001 of which are shared with //aStack.

//In some scenarios, chaining definitions instead of copying the whole object can yield a substantial memory //benefit.

foreach(int ii in bStack){
Console.Write(ii);
Console.Write(" ");
}

//10 1 2 3 4 5 6 7 8 9 10 11....

The next post will proceed to discuss some of the implications of using immutable data structures and examine more ADT implementations.

---
As an aside, did you notice that all stacks of a particular type have the same final member ?

CSV File Parsing, Part 1

This short article was ready more than a week ago. Afterwards, I spent no less than 4 hours battling with LJ's ridiculous posting system in order to post it with the proper formatting, got engaged in a project with a tight deadline and fell kinda sick. I won't waste any more of my time trying to twist the site's arms to do what I need it to do, so just in case anyone's interested in the subject of manipulating comma-separated-values with Perl, here is the link to the document

Word 2007
RTF

Hopefully by the time I finish part 2, I will have found a normal blogging site that supports code formatting.

Matching the Nth Letter of the Nth Row

Behold my latest creation:

$regex =~ qr/(?{$i = 0})
(?:
(?{$j = 0})
(?(?{$j < $i})
(?:\w(?{$j = $j+1}) |
[^\w\n]) |
(?!))*
(\w)
[^\n]*\n?
(?{$match .= $1;
$i++})
)+/x;

This Perl regex illustrates how regular expressions correspond to traditional programming constructs. Here's what's going on:



(?{$i = 0}) #Setting a temporary variable $i which hold the row number
(?:         #The (?: ... )+ construct is a loop over all rows
            #We're inside the row now
             (?{$j = 0})       #The temporary variable $j holds the letter number
             (?(?{$j &lt; $i}) #Condition: if the letter number is lower than the row...
               (?:\w(?{$j = $j+1}) | #Either match a letter and increment $i
                  [^\w\n]            #or match a non-letter, non-EOL character
             )|                #Else...
                  (?!)         #Fail the match
             )*            #Do this as long as you can
             (\w)          #Then store the next letter in $1
             [^\n]*\n?     #Carve to the end of the line
             (?{$match .= $1;   #Append the letter to the string $match
             $i++})             #And increment the line-variable
)+     #The closing parenthesis of the loop
/x;    #Use the free-form flag; i.e. the pattern spaces and newlines are ignored.



Note that in [^\n]*\n? the question mark denotes optional new-line; if it were omitted, the regex would not match the last line of text and would return one less letter. Also, the (?!) (always fail) assertion is not really needed, since the conditional will cause the regex to fail eventually, but I left it for clarity's sake.

Here's an example usage:


open(INPUT, "c:/users/fairweather/desktop/input.txt");

$string = join "", <INPUT>;

$string =~  m/(?{$i = 0})(?:
                         (?{$j = 0})
                         (?(?{$j < $i})(?:\w(?{$j = $j+1})|[^\w\n]))*
                         (\w)
                         [^\n]*\n?
                         (?{$match .= $1;
                            $i++})
                         )+/x;
print $match;


Supposing input.txt contains Poe's Valentine, the output is:

Francessargentosgood

Which just so happens to be the girl he had written the poem for.

Feb. 17th, 2009




(Found on the blog of Sebastien Lorien)