06
May 08

Implementing HAT-Tries in Java

As I said in an earlier post detailing the CCHashTable, it formed a part of a larger data structure, the HAT-Trie. The HAT-Trie is a recent variant on the trie data structure, published by Askitis and Sinha in last year’s Australasian conference on Computer science (the original paper is here). The problem they were addressing was that while the trie is very fast, it’s also very expansive in memory; and while some variants of it like the radix trie (or Patricia trie) and the related burst-trie, are less expensive in terms of memory space, they aren’t as fast as they could be because they don’t take into account caches in their design.

Consider for a moment, the standard trie node, with its array of pointers to all possible following characters. Think of a string of these nodes, representing a character string of some type, and how the terminal node then stores a link to a piece of information which the overall character string is indexing to (there are obviously a lot of applications for such a data structure). The obvious problem is that the longer the string, the more nodes, and the nodes are large – even for standard ascii, that’s 128 pointers per node, which means that to store ten characters, you’re looking at 512 bytes. And few strings used in such applications are that short – certainly in the application I was looking at originally, we were looking at strings ten times that size at least. The burst-trie, the radix trie (and it’s predecessor the compact trie) all attempt to solve this problem through similar methods involving compressing parts of the trie. The compact trie looks at all the branches of the trie that lead straight to a terminal node and represents it with a single node (in other words, it takes the unique bits of the ending of the strings stored in the trie and compresses those). The radix tree takes this a step further and notes that most strings being indexed have similar chunks throughout the strings, not just at the ends, and it compresses these chunks from multiple trie nodes to one node.

The burst-trie takes a slightly different approach, in that it compresses chunks from the trie by grouping together those strings with the same chunks at the start of the thread, so that the trie works as a partial index, and points to a secondary data structure that stores the rest of the string and it’s associated data. Heinz used linked lists when documenting this data structure, but linked lists take a performance hit when caching is taken into account, because linked lists tend to stump caches as their memory access can’t be predicted reliably, even through things like moving the last accessed element to the front of the list can help.

Enter the HAT-Trie.The idea is simple enough – a burst trie, but using a cache-concious data structure, in this case the earlier mentioned CCHashTable. Implementation is somewhat complicated in that the actual algorithm starts with a single CCHashTable node, which on reaching a set threshold for the number of strings contained (chosen to keep the number of hash table collisions to a minimum), it bursts into a single root HAT-Trie node, and a number of new CCHashTable nodes attached to it. That bit of jiggery-pokery should be handled by a third class, a trie manager of sorts, but that bit of refactoring has yet to be done. For now, there’s a bit of icky code in the CCHashTable class that knows about the HAT-Trie class. Ah, the crappy code we write under deadlines…

The actual HAT-Trie object’s data is simple enough:

public class HATTrieNode {
    private HATTrieNode parent;
    private Object children[];
    private int id;

The reason for the children being Object pointers is that they could be other HATTrieNodes or CCHashTable objects. And id is the piece of information being indexed by the input string. Storing a string is a relatively straightforward recursive function:

public boolean addString(char inputString[], int id) {
    if (inputString.length > 1) {
        char newString[] = new char[inputString.length-1];
        System.arraycopy(inputString, 1, newString, 0, inputString.length-1);
        Object child = this.children[(int)inputString[0]];
        if (child != null) {
            if (child.getClass().getName() == "HATTrieNode") {
                return ((HATTrieNode)child).addString(newString,id);
            } else if (child.getClass().getName() == "CCHashTable") {
                return ((CCHashTable)child).addString(newString,id);
            }
        }
        return false;

which handles the terminal nodes like so:

         
    } else if (inputString.length == 1) {
        char newString[] = new char[0];
        Object child = this.children[(int)inputString[0]];
        if (child != null) {
            if (child.getClass().getName() == "HATTrieNode") {
                return ((HATTrieNode)child).addString(newString,id);
            } else if (child.getClass().getName() == "CCHashTable") {
                return ((CCHashTable)child).addString(newString,id);
            }
        }

And there’s some additional sanity checking boilerplate as well. Checking for a node then just involved running through the trie, again recursively, with code very similar to the above. The bursting function is the icky part – it requires the CCHashTable to know how to burst itself and stitch the parts back together using a new HAT-Trie node, which means that the CCHashTable and HATTrieNode classes are very, very tightly coupled. Which is a big OO no-no, and is in dire need of refactoring. The basic idea is to create a new HAT-Trie node, tie it to the parent node of the CCHashTable being burst, then burst the CCHashTable and link all the new CCHashTables created to the HAT-Trie node, like so:

public HATTrieNode burst() {
    HATTrieNode originalParent = this.parent;
    HATTrieNode newNode = new HATTrieNode(originalParent);
    StringBuffer tmp = new StringBuffer(8192);
    int j = 0;
    int k = 0;
    int length = 0;
    int id = 0;
    for (int i = 0; i < CCHashTable.tableSize; i++) {
        j=0;
        while (j < this.contents[i].length) {
            if (this.contents[i][j] == '') {
                break;
            }
            // first decode the length of the string
            length = (int) (this.contents[i][j++] | (this.contents[i][j++] << 16));
            for (; this.contents[i][j] != ''; j++) {
                tmp.append(this.contents[i][j]);
            }
            j++;
            id = (int) (this.contents[i][j++] | (this.contents[i][j++] << 16));
            if (newNode.addString(tmp.substring(0),id)) {
                tmp.setLength(0);
                //empty the stringbuffer for re-use
            } else {
                System.out.println("Error while bursting");
                return null;
            }
        }
    }
    return newNode;
}

This isn’t good design or good practise though. In fact, I’m really not happy with the design here at all. However, like the steam-power systems on a nuclear aircraft carrier, it’s a kludge that works.


01
May 08

Ubuntu upgrades and fundamental problems

New job, new machine. So the last two days I’ve been setting up hardware platforms for work (one linux desktop machine, one win32 laptop, and some cool toys like an iPod Touch and a Nokia N800 – there are more to come, but it’s more than enough to start with). I’ve been using Kubuntu for work for a few years, and Xubuntu on my ancient (and now breaking) laptop at home (an Inspiron 1100, which should tell you why it’s xubuntu and not kubuntu there as well). So, no-brainer, other people have had no problems so just download the latest ISO and do a quick install, right?

Well, yes and no. Continue reading →


25
Apr 08

Tips for hiring new engineers

Left dotMobi a few weeks back just after the DeviceAtlas project was successfully launched; took about three weeks to sleep (14-hour days for far too long to make the release date) and then starting back into the search for the next role. It was an interesting few weeks, and something I mentioned earlier about how “passion” is being seen as the new black when listing off what’s required of a new hire in a job ad really did come to bug me during this interval, so I thought that I’d blog about it when it was over.

Ladies and gentlefolk of HR who write these job adverts, might I make a few suggestions to you?

A job ad without salary information is like a CV without a name on it.

I will not read such an ad. I filter them out right at the search page. It’s not that I’m following the age-old business advice that people do jobs for the sole motivation of collecting their pay – if I truly believed that, I might be a lot better off in material terms, but I’d also be dealing in illegal narcotics because the profit margins there are so much higher. No, I check for the salary information for two reasons. Granted, the first is because the salary is actually important – it may not be the dominating factor in my decision (for example, my new role involves a pay cut, but it also brings other benefits to the package that make up for that), but it is a factor nonetheless. Without good reason, you really shouldn’t take pay cuts to get the next job. Parity would be the lowest you should go, that’s only common sense – forget being used to a set quality of life, think what the next company will think if you had to take a 10% paycut to get your last job.

The more immediate reason I check for the salary information in a job ad, however, is because it gives me information about the role and how the role is viewed by the company. Is it a junior role or a senior one? Is it spec’d as a senior role but with junior pay? Is it at or above the market average? Is it a set figure or a range of figures, and if so, how wide is the range? All of this gives you a feel for how the job will treat you – which is information that won’t be quite so easy to get during the interview.

As for a company that won’t tell you the salary, but instead has “Negotiatable” or “highly competitive” entered instead of a number? Nope, not interested. Walk on by. If you’re not willing to tell me how much you’re willing to pay, then why waste my time on you when several other companies will tell me?

Get your engineers to write the technical requirements

Look, seriously. HR people may know the buzzwords. Agile, Object-Oriented, Design Patterns, etc, etc. But if the requirements on about 20% of the ads I’ve seen in the past month were correct, then a group mind made up of Linus Torvalds, Steve Jobs and Sergey Brin wouldn’t be qualified for the roles being advertised.

Don’t have the HR people write the technical requirements. You have engineers, use them. Send round an email asking for a list of technical requirements. Give them about twenty minutes in a room together and remind them that they’re looking for another person at their technical skill level. They’re not trained monkeys, they know what skills the new hire must have to be able to keep up, and they know what skills the team is lacking and needs for the next job.

And don’t class all those requirements as being absolutely necessary. Have two lists – critical, mandatory skills; and skills which would be advantageous to have. Because many professional engineers will look at the mandatory skills listed and if we see some we don’t have, we won’t apply. If you put in AJAX as a critical skill and you’re looking for someone to work with the back end of the LAMP stack, you’ve just hurt your chances of a fast hire.

When writing the nontechnical requirements, remember you’ll be hiring a professional

Seriously. If one of your nontechnical requirements is that the person be able to work as part of a team, then you’ve not spent enough time thinking about your nontechnical requirements yet. The basics of professional work – getting along with others, being able to manage your time, being able to communicate clearly and well, being able to work to deadlines – it is a waste of your time and mine to put these in the job advert. You aren’t taking random people off the street here – you’re hiring trained professionals who’ve working in this field for some time. It’s nearly insulting to tell them not to apply unless they play well with others.

And pragmaticaly (and perhaps a little cynically), it’s hardly likely that if they can’t do these basic things, that they’d tell you, now is it?

And don’t start worrying about the few that don’t know how, but apply anyway. That’s what the CV and the interviews and the recommendations are for. That’s what LinkedIn and their own websites and portfolios demonstrate. Seriously, putting in basic skills into a job ad is a waste of space you could have used to convey important information.

Tell me about the job

You’d imagine this one wouldn’t need saying, wouldn’t you? That a job ad’s very raison d’etre is to tell me about the job, right? I know I used to, a long time ago. I think I’ve become inured to it though. Job ads these days tells me about how the company is successful/about to be successful, how the work is fast-paced/challanging/exciting, how the people are passionate/professional/the best in the field. There’ll be a skillset list a mile long, most of which later turns out to be utterly irrelevant or utterly over-specified. There might, if I’m actually reading this far, be a salary and maybe, in about 5% of the cases, they’ve listed other perks, like health insurance or a sports club.

But I’ve not yet, not once, not ever, come across a job advert that actually tells me what the job will be like. What I’ll actually be doing. What skills I’ll really be using. What tool chain the other engineers are using. What my workstation (where I’m now going to be spending a good third of my life) is like. What the dress code will be (okay, granted, that’s now becoming fairly standard, but still). Whether or not there’s a flexitime scheme (and with Dublin traffic, that’s a pretty big thing to leave out).

Seriously, what possible problem is there in revealing this kind of information? It’s hardly trade secrets we’re talking about here!

Anyway, I had two more observations, but they’re not really about the adverts, they’re really more about the interview.

There is no interviewer, but there are two interviewees

Most places I’ve been to of late actually do get this. But there are still quite a few where they’re quite set in their thinking that I should be lucky to get to work for them and I ought to know it. Truth is, I – and the other software engineers you’ll be interviewing – got really lucky in our lives. We have a highly marketable, rather lucrative skill. And we can take that skill anywhere we wish to. If I want to go to Geneva in the morning, there’s nothing stopping me from doing my job over there (Okay, so I’ve actually agreed to my current role here for the next year at least first, but the principle is sound).

An example from recently – I applied for a role, and was immediately told I was to come in for a technical test and it would either be at time A or time B. Now, I don’t mean that I spoke to someone who talked to me for five minutes, or that I had an email or two, I mean I was ordered to come in, as the first contact with the company. No phone call. No discussion of any kind. I didn’t pursue the role. If you’ll order a stranger about when you’ve been looking for someone for your role for over a year, how would you treat an actual employee?

And, as applicants put on a good face, so should prospective employers. Another recent interview saw me sitting in reception for 25 minutes waiting for my interview, after which I was bumped from room to room until the interview happened. Now I get that on some days people’s agendas just don’t survive contact with reality, but by the time the interview started (40 minutes late), I was about 80% sure I wasn’t going to take on the role before even greeting the interviewer. Though I must admit that during those 40 minutes I got to observe the day-to-day working environment and conditions, and they did not impress me either, so perhaps in this instance, I was more fortunate than put-upon.

And finally… know thy hire

It’s perhaps telling that the role I’ve now taken on had the most informal interview of them all. I came in, I had a cup of coffee with my new boss-to-be in his office and we discussed the project he wanted me to work on. We discussed the benefits when we knew I found the project interesting, and while it was less salary than the last role, I was given time to complete my PhD as a part of the package, which was more than worth the paycut to me – and my new boss-to-be knew this.

You don’t always have to have suits and interview panels and complicated procedures if you know how the people you hire are going to think.