19
Jun 08

Emacs

Hmmm. I started programming on an old ICL OPD (anyone else remember microdrives?). There was some fooling about afterwards with an IBM PS/2 system (woo-hoo, 40Mb of hard disk space!), but the first time I hit what I thought of as a real machine was on a unix account in the first week of college. And that’s when I first hit the biggest question any starting computer engineer hits (or used to hit, back then): Vi or Emacs.

Back then, I chose Vi by default, in that it was the first thing I came across and command/edit mode switching actually made sense to me. And I used vi for a good while afterwards, until I finally found, and immediately switched to, vim. I’ve used vim ever since for any task where I could use it on any platform where I could get it (in fact, not being able to use it for wordpress blogging ticks me off). Being able to run around a source code file in command mode and dd or y lines from here and p them over there, or J them together or call up boxes with an <F4> remap to give me a nice easy-to-spot function header, those are things I’ve gotten down to the reflex level at this stage. In short, I like vim. It’s small, fast, uncluttered. And don’t talk to me about tabbed editing and intellisense and debugger integration please, that’s been in vim for quite some time if you want it (and frankly, if you’re using Visual Studio anything and telling me it has a better editor, you seriously need to stop wasting your time). Not to mention the fact that everytime I’ve seen something like intellisense in actual use, it’s been used by some rather lazy programmers. Which isn’t to say that brace completion and the like can’t be used by good programmers, just that my impression so far hasn’t been a good one. Lie down with dogs, get fleas in your keyboard…

Anyway, the thing is, I’m familiar with the claims about the immense customisability and power in Emacs, but by the time I’d started to hear about it, I’d learnt vi, and once you learn an editor and there’s work to be done, it’s hard to justify the effort to relearn another one. Feels too much like faffing about. But right now, work’s in the research phase (ie.  Feck! Why won’t that work? Damn, go read the devel mailing list archives and the code…), and I have my little side project for the rifle club tootling away, and apparently XEmacs/GNU Emacs is now a run race and GNU Emacs is now a full superset of xemacs so there’s an easy install choice, and GNU Emacs 22 supports mercurial (there’s a blog post in there on its own methinks) out of the package, and PHP with a simple apt-get install, and the more tools you can use the better, right?

So I’ve installed a basic GNU Emacs setup on the work and home boxen with a PHP major mode, and over the next week or so, I’m going to try to use it for my side project. If that takes off, I’ll think about using it for the main project and where it fits into my toolbox overall. I suspect vim’s still a winner where another app has to call an editor (VCS commits from a command line and such) or where you’re running round /etc trying to fix something, but we’ll have to see.

First thing I have to figure out is what the emacs equivalent to command mode and edit mode are…


19
Jun 08

Dadhacker

If war stories of things like shaking wire-wrapped boards to make the assembler code work right; combined with a degree of pragmatism towards programming that you only get after discovering for the umpteenth time that the problem wasn’t your code but a bad EPROM burn (or after discovering that your colleague with the wonderful new ideas on Agile programming has actually never seen a command line in his life) sounds like a stonking good read to you, you need to read DadHacker. Set aside a few hours. Add it to your google reader or whatever feedreader you’re using or just into your daily links. Yes, it is that good.
via Ewan’s blog.


06
May 08

Implementing HAT-Tries in Java

As I said in an earlier post detailing the CCHashTable, it formed a part of a larger data structure, the HAT-Trie. The HAT-Trie is a recent variant on the trie data structure, published by Askitis and Sinha in last year’s Australasian conference on Computer science (the original paper is here). The problem they were addressing was that while the trie is very fast, it’s also very expansive in memory; and while some variants of it like the radix trie (or Patricia trie) and the related burst-trie, are less expensive in terms of memory space, they aren’t as fast as they could be because they don’t take into account caches in their design.

Consider for a moment, the standard trie node, with its array of pointers to all possible following characters. Think of a string of these nodes, representing a character string of some type, and how the terminal node then stores a link to a piece of information which the overall character string is indexing to (there are obviously a lot of applications for such a data structure). The obvious problem is that the longer the string, the more nodes, and the nodes are large – even for standard ascii, that’s 128 pointers per node, which means that to store ten characters, you’re looking at 512 bytes. And few strings used in such applications are that short – certainly in the application I was looking at originally, we were looking at strings ten times that size at least. The burst-trie, the radix trie (and it’s predecessor the compact trie) all attempt to solve this problem through similar methods involving compressing parts of the trie. The compact trie looks at all the branches of the trie that lead straight to a terminal node and represents it with a single node (in other words, it takes the unique bits of the ending of the strings stored in the trie and compresses those). The radix tree takes this a step further and notes that most strings being indexed have similar chunks throughout the strings, not just at the ends, and it compresses these chunks from multiple trie nodes to one node.

The burst-trie takes a slightly different approach, in that it compresses chunks from the trie by grouping together those strings with the same chunks at the start of the thread, so that the trie works as a partial index, and points to a secondary data structure that stores the rest of the string and it’s associated data. Heinz used linked lists when documenting this data structure, but linked lists take a performance hit when caching is taken into account, because linked lists tend to stump caches as their memory access can’t be predicted reliably, even through things like moving the last accessed element to the front of the list can help.

Enter the HAT-Trie.The idea is simple enough – a burst trie, but using a cache-concious data structure, in this case the earlier mentioned CCHashTable. Implementation is somewhat complicated in that the actual algorithm starts with a single CCHashTable node, which on reaching a set threshold for the number of strings contained (chosen to keep the number of hash table collisions to a minimum), it bursts into a single root HAT-Trie node, and a number of new CCHashTable nodes attached to it. That bit of jiggery-pokery should be handled by a third class, a trie manager of sorts, but that bit of refactoring has yet to be done. For now, there’s a bit of icky code in the CCHashTable class that knows about the HAT-Trie class. Ah, the crappy code we write under deadlines…

The actual HAT-Trie object’s data is simple enough:

public class HATTrieNode {
    private HATTrieNode parent;
    private Object children[];
    private int id;

The reason for the children being Object pointers is that they could be other HATTrieNodes or CCHashTable objects. And id is the piece of information being indexed by the input string. Storing a string is a relatively straightforward recursive function:

public boolean addString(char inputString[], int id) {
    if (inputString.length > 1) {
        char newString[] = new char[inputString.length-1];
        System.arraycopy(inputString, 1, newString, 0, inputString.length-1);
        Object child = this.children[(int)inputString[0]];
        if (child != null) {
            if (child.getClass().getName() == "HATTrieNode") {
                return ((HATTrieNode)child).addString(newString,id);
            } else if (child.getClass().getName() == "CCHashTable") {
                return ((CCHashTable)child).addString(newString,id);
            }
        }
        return false;

which handles the terminal nodes like so:

         
    } else if (inputString.length == 1) {
        char newString[] = new char[0];
        Object child = this.children[(int)inputString[0]];
        if (child != null) {
            if (child.getClass().getName() == "HATTrieNode") {
                return ((HATTrieNode)child).addString(newString,id);
            } else if (child.getClass().getName() == "CCHashTable") {
                return ((CCHashTable)child).addString(newString,id);
            }
        }

And there’s some additional sanity checking boilerplate as well. Checking for a node then just involved running through the trie, again recursively, with code very similar to the above. The bursting function is the icky part – it requires the CCHashTable to know how to burst itself and stitch the parts back together using a new HAT-Trie node, which means that the CCHashTable and HATTrieNode classes are very, very tightly coupled. Which is a big OO no-no, and is in dire need of refactoring. The basic idea is to create a new HAT-Trie node, tie it to the parent node of the CCHashTable being burst, then burst the CCHashTable and link all the new CCHashTables created to the HAT-Trie node, like so:

public HATTrieNode burst() {
    HATTrieNode originalParent = this.parent;
    HATTrieNode newNode = new HATTrieNode(originalParent);
    StringBuffer tmp = new StringBuffer(8192);
    int j = 0;
    int k = 0;
    int length = 0;
    int id = 0;
    for (int i = 0; i < CCHashTable.tableSize; i++) {
        j=0;
        while (j < this.contents[i].length) {
            if (this.contents[i][j] == '') {
                break;
            }
            // first decode the length of the string
            length = (int) (this.contents[i][j++] | (this.contents[i][j++] << 16));
            for (; this.contents[i][j] != ''; j++) {
                tmp.append(this.contents[i][j]);
            }
            j++;
            id = (int) (this.contents[i][j++] | (this.contents[i][j++] << 16));
            if (newNode.addString(tmp.substring(0),id)) {
                tmp.setLength(0);
                //empty the stringbuffer for re-use
            } else {
                System.out.println("Error while bursting");
                return null;
            }
        }
    }
    return newNode;
}

This isn’t good design or good practise though. In fact, I’m really not happy with the design here at all. However, like the steam-power systems on a nuclear aircraft carrier, it’s a kludge that works.