Hackthology http://blog.hackthology.com Collected Sayings posterous.com Mon, 30 Jan 2012 14:11:00 -0800 Passmash - The Site Specific Password Munger http://blog.hackthology.com/passmash-the-site-specific-password-munger http://blog.hackthology.com/passmash-the-site-specific-password-munger

Passmash is a new command-line password munger. It has been tested on Linux with X and on MacOS. It should also work on Windows.

What is a Munger?

A munger takes a password and turns it into another password, "munging" it. In particular passmash takes

  • A password (supplied interactively at the prompt)
  • A URL (or other identifier) (supplied as a command line argument)
  • A secret key (kept at ~/.ssh/passmash.key)

and returns a password. It has the advantages of a password manager without having to worry about syncing a password database. The key file is static, so simply keep a (possibly encrypted) backup of it. If you lose the key file, you will not be able to recover your passwords.

Example Usage

In most circumstances you will want to use the pm command

$ pm myurlhere.com
Password:

$

This command automatically generates the password and copies it to your clipboard. On Linux it uses xclip -selection clipboard, on Mac OS X it uses pbcopy and on Windows it uses clip.

On any other operating system (such as OpenBSD) it will pretty-print the password for easy typing, e.g.:

$ pm myurlhere.com
We don't yet support OpenBSD for autoclipboard copying
Password:

5KrUw4pBgC89LGxggXEIFtjM41aPc+/GxH+cumCuTo4
5KrUw - 4pBgC - 89LGx - ggXEI - FtjM4 - 1aPc+ - /GxH+ - cumCu - To4
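The grouped form on the second line is the easier one to type by hand. A minimal sketch of that formatting, assuming groups of five characters joined by " - " as in the output above (the helper name `pretty` is hypothetical and not taken from passmash itself):

```python
def pretty(password, size=5, sep=" - "):
    """Split a password into fixed-size groups so it is easier to type."""
    groups = [password[i:i + size] for i in range(0, len(password), size)]
    return sep.join(groups)


if __name__ == "__main__":
    print(pretty("5KrUw4pBgC89LGxggXEIFtjM41aPc+/GxH+cumCuTo4"))
```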

Technical Details

Passmash uses a SHA256 based HMAC with key strengthening.

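import hmac
from hashlib import sha256

def mash(key, url, password):
    # key, url and password must all be byte strings
    h = hmac.new(key, password, sha256)  # HMAC-SHA256 keyed with the key file
    h.update(url)
    # key strengthening: feed the running digest back in 250000 times
    for i in range(250000):
        h.update(h.digest())
    return h.digest()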

On my machine (a 2.0 GHz Core 2) it takes around 1 second to derive a password using this function. A more secure version of the same utility could make use of bcrypt or scrypt. However, either would add an external dependency.
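To see the derivation end to end, here is a minimal Python 3 sketch. It restates the function with a `rounds` parameter and turns the 32-byte digest into printable text. The encoding step is an assumption: unpadded base64 is used only because it yields 43 characters like the sample output earlier in this post. The `rounds` parameter, `site_password` and `demo_key` are hypothetical names for illustration.

```python
import base64
import hmac
from hashlib import sha256


def mash(key, url, password, rounds=250000):
    """HMAC-SHA256 keyed with `key`, strengthened by re-feeding its own digest."""
    h = hmac.new(key, password, sha256)
    h.update(url)
    for _ in range(rounds):
        h.update(h.digest())
    return h.digest()


def site_password(key, url, password, rounds=250000):
    """Encode the digest as text (assumption: base64 with the padding removed)."""
    digest = mash(key, url.encode("utf-8"), password.encode("utf-8"), rounds)
    return base64.b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    demo_key = b"not-a-real-key"  # hypothetical; the real key lives in ~/.ssh/passmash.key
    print(site_password(demo_key, "myurlhere.com", "master password"))
```

The same master password gives an unrelated result for every URL and for every key file, which is why a leaked site password says nothing useful about the others.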

This password derivation function should provide strong defense against an attacker who has

  • A password generated from the function (perhaps obtained from a hacked website).
  • The algorithm (e.g. they know you use this program to generate your passwords).

And optionally:

  • The key file
  • or the "master" password (but not both)

If your "master" password has sufficient entropy then your other passwords generated with the same key should be reasonably secure against a brute force attack.

Happy Munging!

http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson
Thu, 02 Jun 2011 08:22:00 -0700 Ternary Search Tries for Fast Flexible String Search : Part 1 http://blog.hackthology.com/ternary-search-tries-for-fast-flexible-string http://blog.hackthology.com/ternary-search-tries-for-fast-flexible-string

Searching a large corpus of strings is a problem many applications have to solve, whether the application features autocomplete boxes or full-text search. Efficient methods for conducting such searches are not always readily apparent to the algorithm designer. In this series of articles I will present a data structure known as the Ternary Search Trie (TST) which is designed to assist in solving this problem. For this introductory article I will not discuss algorithms in detail but only provide a high level overview of the structure and algorithmic running time for various operations. In the next article I will detail the process of maintaining the structure with insertions and deletions. The final article will discuss different flexible search algorithms and their implementations.

Symbol Tables: A Short Review

A symbol table is a mapping between a string key and a value which can be any type of object from an integer to a complex nested structure. There are two classic symbol table implementations most programmers are immediately familiar with: Binary Search Trees (BSTs) and hash tables. Both of these structures work by exactly matching the search key to the keys stored in the structure. If there does not exist an exact match then there is a miss. Thus, neither of these structures can serve as a useful index for an autocomplete algorithm where only part of the key is known. They may be useful for a full text index, but they will not be as efficient as some of the other structures we will later discuss.

While these simple structures are limited, other symbol table implementations have properties more suited to modification for flexible string search. This series will focus on the Trie category of structures. These structures are more suited to partial matching, as we will see. But first, why are hash tables not suited for this job? They have such excellent running time characteristics, O(1) lookups! However, one cannot modify the hash table algorithm to effectively serve the purpose of a partial match or a range query. Why? Because hash functions transform strings into numbers (which correspond to buckets in an array). Good hash functions have wide variance in hashing strings and will not hash a similar string or a substring to the same bucket.
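A small sketch of the limitation, using Python's built-in dict as the hash table (the function names and sample keys are made up for illustration): an exact lookup is a single probe, but a prefix query has no choice except to look at every key.

```python
def exact_lookup(table, key):
    """Hash-table lookup: a hit only when the whole key matches."""
    return table.get(key)


def prefix_scan(table, prefix):
    """Prefix query against a hash table: every stored key must be examined."""
    return sorted(k for k in table if k.startswith(prefix))


if __name__ == "__main__":
    table = {"apple": 1, "apply": 2, "banana": 3}
    print(exact_lookup(table, "app"))  # a miss: "app" is not a stored key
    print(prefix_scan(table, "app"))   # found only by scanning all keys
```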

Introducing the Trie

Trie1

Figure 1. An Example Binary Search Trie

In general a Trie is a special form of a tree. However, instead of comparing the entire key at each node during traversal, it only compares parts of keys. The key/value pairs are kept in the leaves (like in a B+ Tree). We will first consider a Binary Search Trie. Like the Binary Search Tree, each node in the Trie has two children, left and right. The left child is defined as the "0" child and the right as the "1" child. As a key is inserted, a node is created or visited for each bit in the key. When visiting a node which already exists, the direction to descend is based on the current bit. That is:

  • let there be a function bit(i, s) which returns the ith bit in the string s.
  • let depth(r, n) return the depth of the node n in the tree rooted at r
  • bit(depth(r, n), s) is the bit in the string used to make the decision on which of n's children to visit.

Sedgewick gives the formal definition: "A [binary search] trie is a binary tree that has keys associated with each of its leaves, defined recursively as follows: The trie for an empty set of keys is a null link; the trie for a single key is a leaf containing that key; and the trie for a set of keys of cardinality greater than one is an internal node with left link referring to the trie for the keys whose initial bit is 0 and right link referring to the trie for the keys whose initial bit is 1, with the leading bit considered to be removed for the purpose of constructing the subtrees." [1]
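The definition translates fairly directly into code. The following is a minimal sketch, not Sedgewick's implementation. It assumes keys are byte strings, that bits are numbered from the most significant bit of the first byte, and that bits past the end of a key read as 0 (so two keys that differ only by trailing zero bytes are not supported). The class and function names are illustrative.

```python
def bit(i, s):
    """Return the i-th bit of the byte string s (0-based, most significant bit first)."""
    byte_index, offset = divmod(i, 8)
    if byte_index >= len(s):
        return 0  # assumption: bits past the end of the key read as 0
    return (s[byte_index] >> (7 - offset)) & 1


class Leaf:
    def __init__(self, key, value):
        self.key, self.value = key, value


class Node:
    def __init__(self):
        self.children = [None, None]  # index 0 is the "0" child, index 1 the "1" child


def insert(node, key, value, depth=0):
    """Insert key into the trie rooted at node; returns the new subtree root."""
    if node is None:
        return Leaf(key, value)
    if isinstance(node, Leaf):
        if node.key == key:
            node.value = value
            return node
        # Two keys now share this position: push the old leaf one level down
        internal = Node()
        internal.children[bit(depth, node.key)] = node
        return insert(internal, key, value, depth)
    b = bit(depth, key)
    node.children[b] = insert(node.children[b], key, value, depth + 1)
    return node


def search(node, key):
    """Follow one bit per level until a leaf (or a null link) is reached."""
    depth = 0
    while isinstance(node, Node):
        node = node.children[bit(depth, key)]
        depth += 1
    if node is not None and node.key == key:
        return node.value
    return None
```

Note that the decision at each level is exactly bit(depth, key), and that the full key is compared only once, at the leaf.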

A search using this structure is directed by the strings in the database. However, since only one bit is considered at a time in the search for a k-bit string the search will take in the worst case k bit comparisons. This makes for a very tall structure when using string keys, since single characters will be at least 8 bits long in ASCII and much longer in Unicode. Another unfortunate implementation detail is that modern processors typically work more efficiently when accessing bytes or words. Thus, a more efficient structure might consider multiple bits at once.

Multi-way Tries

Trie2

Figure 2. An Example R-Way Trie

If one considers multiple bits at once one has to increase the fanout (number of children per node) of the tree. Consider figure 2, in which each node has a fanout of 26 [2]. While this is useful if every node has 26 children, the space to store the pointers becomes wasteful for real data. However, despite the wasted space in comparison to the binary version, searches on the R-Way Trie will perform faster than on the Binary Trie. It will be faster for the CPU to compare the bytes under consideration and there will be fewer comparisons overall. In general, a Binary Trie will require about log2(N) comparisons to perform a search, while an R-way Trie will take about logR(N) comparisons.
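A minimal sketch of a 26-way trie makes the wasted space concrete. It assumes keys contain only the lowercase letters a to z, and for brevity it stores the value on the node where a key ends rather than in a separate leaf. All names are illustrative.

```python
R = 26  # fanout: one child slot per lowercase letter


class RWayNode:
    def __init__(self):
        self.children = [None] * R  # mostly empty for real data
        self.value = None
        self.is_key = False


def index(ch):
    """Map 'a'..'z' to 0..25 (assumes lowercase ASCII input)."""
    return ord(ch) - ord("a")


def insert(root, key, value):
    node = root
    for ch in key:
        i = index(ch)
        if node.children[i] is None:
            node.children[i] = RWayNode()
        node = node.children[i]
    node.value, node.is_key = value, True


def search(root, key):
    node = root
    for ch in key:  # one array access per character, no comparisons
        node = node.children[index(ch)]
        if node is None:
            return None
    return node.value if node.is_key else None


def count_slots(node):
    """Return (used, total) child slots in the subtree rooted at node."""
    used, total = 0, R
    for child in node.children:
        if child is not None:
            child_used, child_total = count_slots(child)
            used += 1 + child_used
            total += child_total
    return used, total
```

For the four strings abc, abs, awe and and, this trie has 9 nodes and therefore 234 child slots, of which only 8 are in use.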

 

However, to produce a usable structure for our purpose (a large in-memory string index) we need to cut down on the space wasted by the extra pointers in each node. The tricky bit is to do this while still maintaining the hard-fought gains in search speed. Simply using a dynamic structure like a hash table in each node to hold the array won't work either because hash tables are slower than an array access and if the hash table becomes overly full it may actually use more space than the array. Thus, a different structure is needed.

Ternary Search Tries

Tst1

Figure 3. An Example Ternary Search Trie with strings [abc, abs, awe, and].

The Ternary Search Trie helps avoid the unnecessary space needed by a traditional multi-way trie while still maintaining many of its advantages. In a Ternary Search Trie each node contains a character and three pointers. The pointers correspond to the current character under consideration being less than, greater than or equal to the character held by the node. In a sense this structure is like taking the Multi-way Trie and encoding it on to a Binary Search Tree with the keys as current character and the values as another BST corresponding to the next character.
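As a preview of the structure (insertion is covered properly in the next article), here is a minimal sketch of a node with one character and three pointers, plus a plain insert and search. It assumes non-empty string keys and, for brevity, stores the value on the node where a key ends. The names are illustrative.

```python
class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = None   # subtree for characters less than ch
        self.eq = None   # subtree for the next character of the key
        self.hi = None   # subtree for characters greater than ch
        self.value = None
        self.is_key = False


def insert(node, key, value, i=0):
    """Insert a non-empty key; returns the (possibly new) subtree root."""
    ch = key[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, key, value, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, key, value, i)
    elif i + 1 < len(key):
        node.eq = insert(node.eq, key, value, i + 1)
    else:
        node.value, node.is_key = value, True
    return node


def search(node, key):
    i = 0
    while node is not None and key:
        ch = key[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(key):
            node = node.eq   # matched this character, move to the next one
            i += 1
        else:
            return node.value if node.is_key else None
    return None


def count_nodes(node):
    if node is None:
        return 0
    return 1 + count_nodes(node.lo) + count_nodes(node.eq) + count_nodes(node.hi)
```

Inserting abc, abs, awe and and in that order produces 8 nodes with 3 pointers each, rather than a 26-slot array at every node.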

While a Multi-way Trie has about R*N/log2(R) pointers, a Ternary Search Trie has R + c*N pointers where c is a small constant, perhaps 3. Consider the graph of their performance:

Multiway_vs_trie

Figure 4. Links in a Multi-way Trie vs. a Ternary Search Trie

When R is small (2, 3, or 4), Multi-way Tries and Ternary Tries have a similar number of pointers, but a low branching factor destroys the advantage of the Multi-way Trie. When R grows to a larger, more reasonable size such as 256, the number of pointers explodes in comparison to the Ternary Trie. Thus, the Ternary formulation will be far more space efficient in the worst case than the Multi-way formulation.
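The two estimates are easy to tabulate. This sketch simply evaluates the formulas quoted above, R*N/log2(R) and R + c*N with c = 3. The function names and the choice of N = 1000 keys are assumptions for illustration.

```python
from math import log2


def multiway_links(R, N):
    """Approximate pointer count for a Multi-way Trie: R*N/log2(R)."""
    return R * N / log2(R)


def tst_links(R, N, c=3):
    """Approximate pointer count for a Ternary Search Trie: R + c*N."""
    return R + c * N


if __name__ == "__main__":
    N = 1000
    for R in (2, 4, 26, 256):
        print(R, round(multiway_links(R, N)), tst_links(R, N))
```

With N = 1000 and R = 256 the formulas give 32,000 pointers for the Multi-way Trie against 3,256 for the Ternary Search Trie.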

What is the cost for better space efficiency? In the Multi-way Trie we must traverse at most as many links as there are characters in the search key. In a TST we may need to traverse up to 3 times that many links in the worst case. However, this pathological case is rare. In the average case the situation can be made much better through a few small improvements to the basic structure.

Tst2

Figure 5. An Improved Ternary Search Trie.

The first improvement, illustrated in Figure 5, involves collapsing the leaf nodes. Instead of allowing long chains of nodes at the leaves, we collapse them into a single node. This allows the final check to be computed efficiently.

The second improvement, also illustrated in Figure 5, combines the best of the Multi-way Search Trie with the Ternary Search Trie. The root node is an R-Way node like in the Multi-way Search Trie. The rest of the tree is a Ternary Search Trie with the leaf nodes collapsed. In practice these improvements result in an enormous speedup. The theory also supports the practice; according to Sedgewick, these improvements cut the number of comparisons needed in half [3].

There is one final improvement to consider that is not shown in the above figure. Similar to the first improvement, it involves collapsing nodes, but instead of collapsing leaf nodes, internal nodes are collapsed. This idea is similar to the Patricia Trie. When a group of strings shares the same contiguous substring, instead of having a node for each character shared, collapse the shared nodes into a single node.

Conclusion and What's Next

In this post we discussed the theory behind Symbol Tables and the use of Tries as a symbol table implementation. The Trie, and in particular the TST, is an efficient way to implement a symbol table. A good implementation of a TST has comparable performance to a Hash Table. However, as we will see in the next post, TSTs allow much more flexible search operations. Stay tuned.

 


  1. Sedgewick R. Algorithms. Third Edition. Definition 15.1. http://www.amazon.com/dp/0201314525/ Hereafter: Sedgewick
  2. Note: Usually an R-Way or Multi-Way Trie has a fanout equal to the number of characters in the character set or the number of values representable in a machine word, half word or byte. So in practice a node in an R-Way Trie might have 256 children or perhaps 2^16 children. As the fanout (the number of children per node) increases the space efficiency of the Trie decreases. However, the search speed increases. A classic time/space trade-off.
  3. Sedgewick Table 15.2 and related text.

http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson
Mon, 23 May 2011 08:48:00 -0700 How To: Write Self Updating Python Programs Using Pip and Git http://blog.hackthology.com/how-to-write-self-updating-python-programs-us http://blog.hackthology.com/how-to-write-self-updating-python-programs-us

If you are a pip[1] and virtualenv user you already know how easy it is to install Python packages. Unlike the bad old days when I started programming in Python, 9 years ago, it is now easy to add, remove and manage Python modules. In fact, we can leverage pip to create an update command for a Python program; for ease of illustration, the example here is a shell utility.

Table of Contents

  1. Desired Features
  2. Basic Idea
  3. Implementation

Desired Features

  • Uses a server controlled by the owner (instead of the Python Package Index).
  • Install an arbitrary version of the program.
  • Defaults to updating to the newest version of the major release one is tracking.

I often want my programs to update themselves from a specific location, for instance an internal server or perhaps my GitHub account. Fortunately, pip already supports such niceties with the -e option of the install command.

Additionally, when running a generic update you often want to stay on the same major revision and simply get the bug fixes. However, it is important to provide the option to update to any arbitrary release including tracking the master branch.

Basic Idea

Use pip and the -e option plus a base URL to automatically update your software, i.e. have your software run pip for the user.

example command:

pip install --upgrade --src="$HOME/.src" -e git+<URL>@<REV>#egg=PACKAGE_NAME

Tracking Major Versions

To track major version updates some care must be taken in setting up the repository. I use branches instead of tags to track major releases. This allows me to push out bug fix updates for everyone tracking that release. I tag minor releases to allow users to install a specific version.

Branches

  • master
  • stable
  • r0.1
  • r0.2
  • ...
  • rN

Tags

  • r0.1
  • r0.1.1
  • r0.1.x
  • ...
  • rN

Pip Gotcha

When checking out branches using pip you have to supply origin/branchname, e.g.:

pip install --upgrade --src="$HOME/.src" -e git+https://github.com/user/repo.git@origin/branch#egg=PACKAGE_NAME

When checking out a commit, however, you should not supply origin:

pip install --upgrade --src="$HOME/.src" -e git+https://github.com/user/repo.git@COMMIT_ID#egg=PACKAGE_NAME

Why does pip work like this? Because of the commands it executes. For the command:

pip install --upgrade --src="$HOME/.src" -e git+https://github.com/user/repo.git@<VERSION>#egg=PACKAGE_NAME

pip runs

git fetch -q
git reset --hard -q <VERSION>
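The rule can be captured in a small helper. This is a sketch of the convention described in this post (branches get the origin/ prefix, commits do not, numeric versions map to r-prefixed branch names); newer pip releases may behave differently. The function names `git_ref` and `requirement` are hypothetical.

```python
def git_ref(release="master", commit=None):
    """Return the revision to place after '@' in the pip URL."""
    if commit is not None:
        return commit                 # commits are used as-is
    if release[0].isdigit():
        release = "r" + release       # "0.1" -> "r0.1", the branch naming used here
    return "origin/" + release        # branches need the origin/ prefix


def requirement(url, ref, package):
    """Build the argument that follows -e on the pip command line."""
    return "git+%s@%s#egg=%s" % (url, ref, package)


if __name__ == "__main__":
    ref = git_ref(release="0.1")
    print(requirement("https://github.com/user/repo.git", ref, "PACKAGE_NAME"))
```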

Store the tracked version in the source

To ensure the update command installs the correct updates, I put the release to check out in the source code. This allows me to "release" a version by creating a branch and then changing the RELEASE constant to point to the name of the branch.

Implementation

Note: This is example code only; you should modify it for the security and stability of your environment.

Note: I didn't include virtualenv support in this code but it is trivial to add.

from subprocess import check_call as run 
from getopt import getopt, GetoptError 
RELEASE = 'master' # default release 
SRC_DIR = "$HOME/.src" # checkout directory 
UPDATE_CMD = ( # base command 
'pip install --src="%s" --upgrade -e ' 
'git://github.com/timtadh/swork.git@%s#egg=swork' 
) 

@command 
def update(args): 
    try: 
        opts, args = getopt(args, 'sr:', ['sudo', 'src=', 'release=', 'commit=']) 
    except GetoptError as err:
        log(err) 
        usage(error_codes['option']) 

    sudo = False 
    src_dir = SRC_DIR 
    release = RELEASE 
    commit = None 
    for opt, arg in opts: 
        if opt in ('-s', '--sudo'): 
            sudo = True 
        elif opt in ('-r', '--release'): 
            release = arg 
        elif opt in ('--src',): 
            src_dir = arg 
        elif opt in ('--commit',): 
            commit = arg 

    if release[0].isdigit(): ## Check if it is a version 
        release = 'r' + release 
    release = 'origin/' + release ## assume it is a branch 

    if commit is not None: ## if a commit is supplied use that 
        cmd = UPDATE_CMD % (src_dir, commit) 
    else: 
        cmd = UPDATE_CMD % (src_dir, release) 
    
    if sudo: 
        # a command string needs shell=True; the shell also expands $HOME
        run('sudo %s' % cmd, shell=True) 
    else: 
        run(cmd, shell=True)
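In the spirit of the security note above, one possible hardening is to pass pip an argument list instead of a command string, so that no shell ever parses user-supplied values. This is a sketch under stated assumptions, not the code swork ships: `build_update_argv` is a hypothetical helper, and "~/.src" stands in for "$HOME/.src" because nothing expands shell variables when no shell is involved.

```python
import os
import subprocess


def build_update_argv(src_dir, ref, sudo=False):
    """Build the pip command as a list so no shell parsing is involved."""
    argv = [
        "pip", "install", "--upgrade",
        "--src=" + os.path.expanduser(src_dir),  # expand "~" ourselves
        "-e", "git://github.com/timtadh/swork.git@%s#egg=swork" % ref,
    ]
    return ["sudo"] + argv if sudo else argv


def update(src_dir="~/.src", ref="origin/master", sudo=False):
    """Run pip; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(build_update_argv(src_dir, ref, sudo))
```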

[1] http://www.pip-installer.org/en/latest/index.html "A Python package installer."

http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson
Wed, 18 May 2011 19:30:00 -0700 Announcing swork - Simplify your Shell Configuration http://blog.hackthology.com/announcing-swork-simply-your-shell-configurat http://blog.hackthology.com/announcing-swork-simply-your-shell-configurat

If you are like me, and if you are reading this you may very well be, you spend an inordinate amount of time juggling inane details, like shell environment variables, while programming. Now, there is nothing wrong with setting, exporting, and then unsetting variables, mounting and unmounting FUSE partitions, starting routine backups, and so on, but it does get tedious after a while. Eventually, you may have written a host of scripts to solve these various problems. Today I present swork (or start work), a command line utility to help manage these little one-off scripts with ease.

Don't Repeat Yourself

A typical pattern seen in scripts, such as virtualenv's activate script, is the storing of old environment variables such that the changes made by the script can be easily undone. Every non-trivial script I write seems to include this detail, and I am tired of it. It is boring, it is simple, and it is abstract-able. So I have abstracted. swork frees you from needing to write this code. When you want to go back to the original state of the shell, you simply type:

$ swork restore

As long as you have run swork at some point in the past in the current shell (or rather the current bash process), swork will restore the environment of the shell to the state in which it originally found it.
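The save-and-restore pattern itself is tiny, which is exactly why it is worth abstracting. Here is a minimal sketch of the idea over a Python mapping such as os.environ. It illustrates the pattern only and is not how swork is implemented; `snapshot` and `restore` are hypothetical names.

```python
def snapshot(env):
    """Remember every variable and its value."""
    return dict(env)


def restore(env, saved):
    """Put env back exactly as it was when snapshot() was taken."""
    for name in list(env):
        if name not in saved:
            del env[name]      # drop variables added since the snapshot
    env.update(saved)          # reset changed or removed variables


if __name__ == "__main__":
    env = {"PATH": "/usr/bin", "EDITOR": "vim"}
    saved = snapshot(env)
    env["PATH"] = "some/new/stuff:" + env["PATH"]
    env["SOMEVAR"] = "new value"
    restore(env, saved)
    print(env)
```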

Writing Configuration Scripts

While swork saves you the trouble of saving and restoring variables, you still have to write the scripts to run. Fortunately, this is very easy. You simply write a bash script (or any executable), then you add it to the ~/.sworkrc (located conveniently in your home directory).

Example setenv file:

#!/usr/bin/env bash  
source env/bin/activate # activate a virtualenv  
export SOMEVAR="new value"  
export PATH="some/new/stuff":$PATH  
export PYTHONPATH="more/new/stuff":$PYTHONPATH

example .sworkrc file:

{
    "project1" : {
        "root":"/path/to/project1/root",
        "start_cmd":"source /path/to/project1/root/then/setenv",
        "teardown_cmd":"echo 'project1 teardown'"
    },
    "project2" : {
        "root":"/path/to/project2/root",
        "start_cmd":"source /path/to/scripts/project2_setenv",
        "teardown_cmd":"echo 'project2 teardown'"
    }
}
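A quick way to sanity-check such a file before using it is to load it and look for the expected keys. This sketch assumes the file is plain JSON and that every project should carry the three keys shown in the example; whether swork itself requires all three is an assumption, and the function names are hypothetical.

```python
import json

EXPECTED_KEYS = ("root", "start_cmd", "teardown_cmd")  # keys seen in the example

EXAMPLE = """
{
    "project1": {
        "root": "/path/to/project1/root",
        "start_cmd": "source /path/to/project1/root/then/setenv",
        "teardown_cmd": "echo 'project1 teardown'"
    }
}
"""


def load_projects(text):
    """Parse the configuration text; raises ValueError on malformed JSON."""
    return json.loads(text)


def missing_keys(projects):
    """Map each project name to the expected keys it lacks (empty dict if none)."""
    problems = {}
    for name, project in projects.items():
        missing = [key for key in EXPECTED_KEYS if key not in project]
        if missing:
            problems[name] = missing
    return problems


if __name__ == "__main__":
    projects = load_projects(EXAMPLE)
    print(projects["project1"]["start_cmd"])
    print(missing_keys(projects))
```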

Wrapping Up

swork makes it easy for you to manage the environment in your shell, allowing you to switch contexts with minimum fuss. It currently implements the minimum functionality to be useful, but is just waiting for your feature request!

Check it out on GitHub: https://github.com/timtadh/swork

http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson
Thu, 17 Feb 2011 07:30:00 -0800 Grammars, Ambiguity, and Expressibility http://blog.hackthology.com/grammars-ambiguity-and-expressibility http://blog.hackthology.com/grammars-ambiguity-and-expressibility

Last night I gave a talk at CWRU Hacker Society about formal languages. This is the first talk in a series of lectures I will be giving on compilers. Unfortunately, unlike my regular expression talk I did not get a recording of the audio. I may do a write up of exactly what I talked about later. Until then enjoy my "slides."

http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson
Fri, 28 Jan 2011 07:39:00 -0800 Hacker Trading Cards http://blog.hackthology.com/hacker-trading-cards http://blog.hackthology.com/hacker-trading-cards

The spring career fair at Case Western Reserve University is coming up next week. Instead of collecting swag from all the employers, we decided to make Hacker Trading Cards to give to companies as CWRU Hacker Society swag! We would like employers looking to hire CS students to give talks at Hacker Society this semester and thought this was a creative way to get their attention.

This wasn't intended to be a comprehensive collection, so what cards would you have created? What other creative ways would you suggest for student groups to engage companies?

If you'd like to make your own Hacker Trading Cards, fork our gist! 

Name
Type
Description
Beard Bonus

Paul Buchheit
More Than Just The Gmail Guy
Paul's a CWRU grad. We put him on here to try to get him to give a talk or interview at the CWRU Hacker Society. So, Paul, if you see this please email sj@case.edu!

John McCarthy
Creator of Lisp
Lisp Chainsaw: +20 damage
Parentheses: Slow all enemies unless opponent's IDE autocompletes. New cards must be represented with eth-ekthprethionth.
Beard Bonus: 80

Alan Kay
Dynabook Librarian
Everything is now an object. OOPs!
Mustache Bonus: 15

Edsger W. Dijkstra
Formally Verified
Shortest path: +1 to speed
Ineffective if GOTO in play
Beard Bonus: 40

Richard Stallman
Source Code Freedom Fighter
GNU Manifesto: 30 damage to closed-source constructs in play belonging to any player. +10 additional damage if player can be convinced to work on the Hurd.
Beard Bonus: 100

Donald Knuth
Doctor of the Art of Computer Programming
High-TeX Typesetting: deals 1 damage for each as-yet-unwritten word in Volume 4 of TAoCP unless opponent pays $2.56.
+2 to all defenses if an organ is in play.
Beard Bonus: 0

Eric S. Raymond
Steward of Culture
Jargon Battle: Flip a coin. If heads, build a cathedral. If tails, build a bazaar.
Release Early, Release Often: +20 health to open-source constructs due to source code contributions.
Beard Bonus: 2

Tim Berners-Lee
Inventor of the World Wide Web
Hypertext Transfer: Deal opponent +10 damage unless they control a Spam Filter.
Web Standards: If anyone still uses IE6, randomly redistribute all cards in the play area.
Beard Bonus: 0

Dennis Ritchie
Man Who Could C The Future
Null Pointer: Opponent takes +15 damage. 10% chance of shooting yourself in the foot for +5 damage.
Doubles the effects of Thompson, Torvalds, and Stallman.
Beard Bonus: 90

Ken Thompson
Stand Back, He Invented Regular Expressions!
Minimax: Think ahead of your opponent and deal them +10 damage.
Doubles the effects of Ritchie, Torvalds, and Stallman.
Beard Bonus: 80

Steve Wozniak
Champion of the PC Revolution
+40 to hardware efficiency and wonkiness. All cards now fit in 4k of memory... somehow.
Beard Bonus: 20

Linus Torvalds
Unrepentant Git
Now that Linus is on your side, you are always right!
If Stallman is on the table, players must fistfight.
Beard Bonus: 0

John Carmack
Harbinger of Doom
Raycasting: Opponent takes +25 damage and is overrun by exquisitely rendered aliens.
Doubles the effects of all graphics cards.
Beard Bonus: 0

Ada Lovelace
The Original Hacker
Gain +10 to intelligence for corresponding with Lady Lovelace.
Analytical Engine: 20 damage to opponent for arriving at confused ideas.
Beard Bonus: -10

http://files.posterous.com/user_profile_pics/318449/bloggerCrop.png http://posterous.com/users/37lz98KC0XNT Toby Waite ybot Toby Waite
Mon, 10 Jan 2011 12:00:00 -0800 RE: BASIC (Or, The First Programming Book I Ever Read) http://blog.hackthology.com/re-basic-or-the-first-programming-book-i-ever http://blog.hackthology.com/re-basic-or-the-first-programming-book-i-ever

Cross-posted from stevejohnson.posterous.com

Over the holidays someone gave me a copy of the first programming book I ever read. In rereading it, I found much more than when I first read it at nine years old.

BASIC Programming for Kids
by Roz Ault

The first thing you need to know is that since it was published in 1983, I didn't know how to find an interpreter for the code in 1999. All examples were run via thought experiment. Because of this fact, I think that this book did more to get me excited about the idea of programming than it did to impart knowledge. In this way, I think it follows a higher-level version of the "give a man a fish..." saying.

This book will teach you how to write simple programs in BASIC for your computer. Its purpose, however, is not to make you a programmer. Its purpose is to help you understand computers, to think about how computers can help you in all kinds of ways, and to discover how much fun you can have when you learn how to talk to computers.

Content

BASIC Programming for Kids explains how to write simple BASIC programs for The Apple II+/e, Atari 400/800, Commodore 64/PET/VIC 20, TI 99/4A, Timex Sinclair 1000, and TRS-80. (The programs still run in Chipmunk Basic.) The fact that such a book could be written for ten different platforms is a testament to the ubiquity of BASIC in personal computers at the time, but the book does spend a lot of time explaining special cases and how to rewrite the examples so that they run on some of the platforms. (Apparently the Timex Sinclair 1000 was an awful machine.)

It begins with a 30-page chapter on how to type and use the prompts on the various platforms. Then the language is taught over ten chapters, with exercises at the end of each chapter. There is a final chapter with some example programs, and then appendices for reference, troubleshooting, editing, and information about computers in general.

Here's a quick example to refresh your BASIC memory:

NEW
10 FOR J = 1 TO 5
20 PRINT "JUMP";J
30 FOR C = 1 TO 3
40 PRINT "CLAP";C
50 NEXT C
60 NEXT J
RUN

After rereading the book cover to cover, I have only two new thoughts. First, that the author did a good job of conveying the joy of computing to young readers. Second, that the BASIC language was an awful mess but succeeded for very good reasons.

Review: Positive

This is a good book. I'm glad I found it when I went looking for it. Here's an example that follows a description of what variables are:

You can put variables in a program, like this:

NEW
10 N = 2
20 X = 5
30 PRINT N + X

What will that program print? Run it and see if you guessed correctly. Notice that lines 10 and 20 don't make anything happen on the screen when you run the program. They tell the program to do something inside the computer, but only the word PRINT makes a message appear on the screen.

Now change lines 10 and 20 to give the variables some different values.

It's a small, self-contained, understandable example with a concise, complete explanation and an invitation to experiment. In this way, it mirrors the style of Zed Shaw's Learn Python The Hard Way. It never blames the reader for being wrong, and in fact seems to encourage the reader to forgive his or her own mistakes while writing programs.

So yes, it's a good book. But about this BASIC thing...

BASIC Sucked, But Worse Was Better

Seriously, what is this crap? Specifying a line number for each statement before the program is finished? REM? No proper functions? How did anyone survive this?

Oh, the alternative was to use assembly, or to slip into an AI laboratory. Right.

The two great things about BASIC as it existed in personal computers were that it was extremely simple, and it was everywhere. Ault was able to describe almost the entire language with extensive examples in a hundred pages of large print, and those pages covered the vast majority of PCs on the market at the time.

The reason I was able to understand the book without actually using a computer was because of the simplicity of BASIC and because of Ault's ability to explain it using terms no more complex than necessary. At this point I should mention that this book may not actually be the very first programming book I read, but it was certainly the first one I understood. I really don't remember.

...And Also Worse

Before rereading this book, I had mostly forgotten what BASIC really was, and didn't necessarily agree with the statement that BASIC causes brain damage. Now I agree wholeheartedly, and I speak from experience. BASIC crippled me for years.

At its core, BASIC is a crappy way to express a state machine. The syntax encourages tight, unreadable balls of spaghetti. GOSUB is a poor way to break out functions. Most of the punctuation feels very ad hoc.

I can't help but contrast this mess with Scheme. All other concerns aside, Scheme makes a great teaching language because there is almost no syntax and code is inherently hierarchical. Beginners simply learn new words. Everything else is gravy.

But I didn't learn Scheme, I learned BASIC and stuck with it for about six years. I went from TrueBASIC to Visual MacStandardBasic to METAL BASIC to BlitzMax. I would occasionally try to learn a new language, but would quickly become frustrated with the lack of easy-to-use IDEs and graphics libraries. (During this period I was writing nothing but games.)

Languages like METAL BASIC had few features and libraries, but for me that was as much of a strength as it was a weakness. Rather than spending hours searching for which giant package to import, I could browse a complete list of commands less than twenty pages long to find what I needed, modules and namespaces be damned. When I was done writing a game, I could click "Compile" and email it to my friends instead of asking them to download Joe Shmoe Player 3000 or tearing my hair out over config files. (For this reason, I think Processing and BlitzMax are currently the best platforms to learn game programming with.)

I spent so much time in the game bubble that I missed out on many early opportunities to learn new language concepts.

What Everyone Knows Is Wrong Today

Today's popular languages are objectively better than BASIC in every way. Features, syntax, libraries, the works. But for a kid who wants to write games by typing into a window and clicking COMPILE or RUN, the language options are limited to Processing, BlitzMax, and whatever GameMaker uses*. None of these are real-world languages, so anyone starting out with them will not have a smooth transition to the next stage in their development as a master of technology.

I won't bother rehashing what others have said about this problem or yearn for the days of the BASIC-prompt-as-main-interface. I'll just say this: make better tools, write more books and tutorials.

*I'm sure I forgot your favorite. Just drop it in the comments.

Resources

Invent Your Own Computer Games with Python is comparable to BASIC Programming for Kids. Better, even. Python's tools are not ideal for children, but they are good enough to teach programming with.

Here are some ways to create games with a good write/run/distribute tool, but not necessarily with good documentation for those new to programming:


]]>
http://files.posterous.com/user_profile_pics/657640/Screen_shot_2010-07-25_at_11.37.04_PM.png http://posterous.com/users/4aB39Hf4sdBT Steve Johnson stevejohnson Steve Johnson
Wed, 08 Dec 2010 11:27:00 -0800 Interpreting the Free Software Movement as Religion http://blog.hackthology.com/interpreting-the-free-software-movement-as-re http://blog.hackthology.com/interpreting-the-free-software-movement-as-re

A person should aspire to live an upright life openly with pride, and this means saying “No” to proprietary software.

        - RMS

Introduction

The Free Software movement, which began in earnest twenty-five years ago, has become one of the most quietly influential movements of the Internet age. Today, many social phenomena occurring in our networked world, such as Wikileaks, can be understood more completely by understanding the Free Software movement. The Free Software movement can be usefully analyzed from many perspectives; however, this paper will use the lens of religion. Specifically, the movement will be analyzed in the context of selected writings from its founder, Richard M. Stallman, using the categories defined by Mircea Eliade. Through the use of Eliade's categories one understands Stallman to be demarcating the sacred from the profane in an attempt to return to an archaic past.

Eliade's Terms and Categories

In his 1957 treatise, The Sacred and the Profane, Mircea Eliade creates categories, such as sacred vs profane, and attempts to fit many different religious traditions into the categories. Methodologically this approach has a serious shortcoming: the supporting traditions do not always fit well into the chosen categories. However, Eliade has defined broadly useful, even if not universally applicable, categories for interpreting religious traditions, and the utilization of these categories highlights the religious nature of the Free Software movement.

Eliade's primary purpose in his treatise is to discuss the experiential demarcation between the sacred and the profane. Eliade defines the sacred in two senses. The first is recursive: the sacred is the opposite of the profane.1 In this sense one can place objects into two categories, sacred and profane, as long as they don't overlap. Eliade clarifies this slightly through use of Rudolf Otto's term, the wholly other,2 indicating the sacred manifests itself as wholly different from the profane. By using Otto's language, Eliade indicates the sacred has an element of the divine.

For the purposes of this paper we will stipulatively take the divine to mean: that which seems to the individual to have a numinous quality. An object which has a numinous quality is one which seems irreducible and toward which the individual thus feels a creature dependence.3 Thus, the sacred is the manifestation of the numinous into the corporeal. The profane, on the other hand, is the common, that which seems to be understandable. These stipulative definitions conform to Eliade's requirements: the sacred is opposite of the profane, and the sacred is wholly different from the profane.

In addition to his categories, sacred and profane, Eliade defines two related categories: archaic society and modern society. The man who lives in archaic society, homo religiosus, seeks to exist as much as possible in or around the sacred.4 In contrast the man of modern societies, modern man, exists in a desacralized environment.5 The modern man, depending on his temperament, may look back to the religious man either romantically or derisively. Thus, modern and archaic do not indicate, as they traditionally do, an essential ordering or time line. It may be that archaic and modern coexist, with each looking back to the other as a mythical past while eagerly looking forward to a time when man is more or less religious.

The Free Software Movement

What is the Free Software movement? How can it be understood as religious, using the terms and categories defined by Eliade? The Free Software movement was started in 1984 with the publication of the GNU Manifesto by Richard M. Stallman. Stallman had become disgusted with the unethical nature of software and computer usage and sought a return to an earlier time where users freely shared and modified programs. To enable this return, he set about creating an ecosystem of software which was protected from being made proprietary. In many ways Stallman succeeded: today there is a large amount of Free Software available. Every computer user unwittingly uses such software on a daily basis, and companies such as Google and Facebook could not exist without Free Software.

What unethical nature was Stallman disgusted with? The answer lies in the GNU Manifesto where Stallman states, "I consider that the golden rule requires that if I like a program I must share it with other people who like it. Software sellers want to divide the users and conquer them, making each user agree not to share with others. I refuse to break solidarity with other users in this way."6 It seems to Stallman he should obey the golden rule. One may guess from the text Stallman would define the golden rule as a principle of reciprocity. Thus, if an individual likes a program and would want another to share it with him, he is ethically required to share the program with another as well. Furthermore, an individual must not restrict those he shares software with from further sharing the software, or from modifying the software to suit their needs better. To prevent such sharing and modification would violate his golden rule.7

It is important to note Stallman does not believe that all things should be shared alike. He only considers such freedom ethical where there is no harm done to the person who shares by sharing. Stallman states: "Owners say that they suffer harm or economic loss when users copy programs themselves. But the copying has no direct effect on the owner, and it harms no one."8 Thus, Stallman believes that copying a program does not harm the original owner, because the owner does not lose the use of the program when a copy is made. Such copies can be made indefinitely. In the same way Stallman defends the right to modify software: "whether you run or change a program I wrote affects you directly and me only indirectly. Whether you give a copy to your friend affects you and your friend much more than it affects me. I shouldn't have the power to tell you not to do these things. No one should."9 Thus, Stallman has constructed his own ethical system based on how the golden rule seemed to him.

To explain and understand his movement, Stallman constructed a founding narrative. The narrative begins in ancient times when copyright as a concept did not exist. Stallman explains that in those times the roles of authors, copiers, scribes, and commentators were muddled. Everyone who participated in written culture freely copied, improved, and commented on previous works.10 Stallman holds this copyright-free society up as the exemplar from our historical past for how one should relate to written work.

Stallman continues his narrative by connecting the experiences of early computer programmers (his in particular) to the copyright-free society detailed above. While copyright existed when computing culture began in the 1940s and 1950s, it was not yet universally applied to computer source code. Stallman participated in this society when he joined the MIT Artificial Intelligence Lab in 1971: "When I started working at the MIT Artificial Intelligence Lab in 1971, I became part of a software-sharing community that had existed for many years. Sharing of software was not limited to our particular community; it is as old as computers, just as sharing of recipes is as old as cooking. But we did it more than most."11 Stallman holds his early experiences in the AI Lab as a second exemplar for the proper ordering of society, where software is freely shared, edited, commented, and ported.

However, Stallman's perfect society eventually fell into disrepair. Programmers were asked to sign software licenses and non-disclosure agreements when the university purchased new equipment and software. To Stallman these events had a chaogonic12 quality: "This meant that the first step in using a computer was to promise not to help your neighbor. A cooperating community was forbidden. The rule made by the owners of proprietary software was, 'If you share with your neighbor, you are a pirate. If you want any changes, beg us to make them.'"13 Thus, the ideal of community which had obeyed the golden rule began to unravel. No longer could a programmer freely help his neighbor, no longer could a programmer freely fix bugs, no longer could a programmer port software to new platforms. The programmers were now at the mercy of contracts and legal agreements.

In the depths of this disarray, Stallman experienced a hierophany, Eliade's term for the sacred manifesting itself. It began with the AI Lab being gifted a printer from Xerox. However, despite giving them the printer, Xerox refused to share the code for the driver.14 Unfortunately, the driver had bugs in it. When Stallman offered to fix the bugs if Xerox gave him the code, they refused. The experience was transformative. Stallman could no longer accept the status quo of license and non-disclosure agreements. He set out to change the world, so he could return to an ideal society where programmers helped their neighbors.15

Thus, according to Stallman's narrative detailed above, Stallman set out to purge his life of the corrupting influence of proprietary software. Unfortunately, to rid himself of proprietary software he needed to create a new ecosystem of Free Software. So he quit his job at MIT and began working on several Free Software programs. Stallman states: "I realized that I was elected to do the job."16 Thus, he began the GNU project to create a Free operating system and ecosystem of software. Without such an ecosystem, Stallman feels one cannot live an upright life as a programmer.17

Therefore, the Free Software movement seeks to establish an alternative reality where all software written is Free. Users are free to modify and redistribute software. No one is free to limit another's use of software. This movement can be understood using Eliade's categories of sacred vs profane, and archaic society vs modern society.

Interpretation

The copyright-free societies Stallman references in his narrative parallel the Eliadian concept of the archaic society. In both cases these societies are held up as exemplars of what it means to be truly religious, to be a homo religiosus. In Stallman's pre-copyright society, programmers shared with each other freely and modified programs without hesitation; unwittingly they obeyed his golden rule and helped their neighbors. In modern society programmers no longer share code and modify programs. They are prevented from doing so. Thus, they no longer obey the golden rule. By not obeying the golden rule they have become corrupt.18

Stallman's narrative fits into Eliade's categories of the modern society vs. the archaic society. Stallman represents himself as a truly religious person living in the modern society seeking to return to the archaic society. His method for returning to the archaic society is to resacralize the world.

To sacralize, one must have a concept of something that is sacred vs. something that is profane, for to sacralize is to make the profane sacred. For Stallman, Free Software itself is sacred. Free Software is opposite the profane proprietary software. Proprietary software cannot be shared and it cannot be modified. Free Software can be explicitly shared and modified. Free Software also manifests a numinous quality. Specifically, Free Software is the revelation of the ideal divine society today. In the ideal society all software is Free Software. To have Free Software in present society is to experience the revelation of the ideal. Thus, Free Software is not just sharable and modifiable, it is also sacred. Free Software is Sacred Software.

Stallman desires to return to his ideal archaic society where programmers were more religious and users could share and modify programs at will. To bring about the return of the archaic society he must resacralize the present profane society. To do so he created the GNU Project to manifest Sacred Software into the present profane space. Thus, Stallman marks off the sacred, Free Software, from the profane, proprietary software, in an attempt to return humanity to the ideal society.

Conclusion

The Free Software movement can be understood as a religious movement using Eliade's terms and categories. Stallman bases his movement on his understanding of the golden rule. He uses his understanding of the rule to construct an ethical system for the production and use of software. Stallman then constructs a narrative to explain how society has moved from a religious ethical past to a profane present. To return society to the ideal past, Stallman attempts to resacralize the present society by creating Free Software. Free Software is sacred. By introducing Free Software into present society, society becomes more sacred. The utilization of Eliade's categories clarifies the religious aspects of the Free Software movement.


  1. Eliade, M. The Sacred and the Profane: The Nature of Religion trans. Trask, W. Harcourt Inc. New York. 1957. pg 10. Hereafter: Sacred and Profane.
  2. Ibid. pg 2. and Otto, R. The Idea of the Holy trans. Harvey, J. Oxford University Press, New York. 1958. pg 25. Hereafter: The Holy
  3. The Holy pg. 6-7
  4. Sacred and Profane pg. 12, 15
  5. Sacred and Profane pg. 17
  6. Stallman, R. M. Free Software, Free Society: Selected Essays of Richard M. Stallman. GNU Press, Boston, 2002. pg. 34. Hereafter: Free Software.
  7. See Free Software pg. 43 for a discussion of the precise meaning of Free Software.
  8. Free Software pg. 48
  9. Free Software pg. 49
  10. Free Software pg. 39
  11. Free Software pg. 17
  12. Chaos creating, antonym of Cosmogonic. See Beal, T. K. Religion and its Monsters. Routledge, New York, 2002.
  13. Free Software pg. 18
  14. A driver is a piece of software which allows the computer to communicate with a piece of hardware. Every piece of hardware has a unique communication protocol, necessitating many different drivers.
  15. Free Software pg. 19
  16. Free Software pg. 19,20
  17. Free Software pg. 57
  18. For an example of Stallman using such language see for instance Free Software pg. 130


]]>
http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson
Wed, 07 Jul 2010 16:50:00 -0700 Writing an interactive REPL in Python http://blog.hackthology.com/writing-an-interactive-repl-in-python http://blog.hackthology.com/writing-an-interactive-repl-in-python

Today I figured out how to write a REPL prompt with history and editing in Python. Normally when you are just using the built-in "raw_input" in a while loop, things don't work quite right. The arrow keys don't necessarily work as expected, and there are other problems. I looked at curses and a couple of other options, but ended up writing my own solution.

To make it work, you need to execute a little magic, namely:

Figuring out the correct setting for the terminal took me a bit of time. But once you have the terminal set up like this the following things happen:

  1. input is no longer line buffered and is no longer auto echoed to the user
    • This means you can process every character sent to the terminal in near real time
    • More importantly since you get *every* character you can process special characters like the arrow keys which are actually multi-byte
    • Since it isn't auto echoed you have to send the characters you want back to the terminal
  2. control characters are not printed to the terminal
    • Meaning when you press [up arrow] you no longer see ^[[A
    • It also means the cursor in the terminal moves in response to the arrow key press
  3. you can programmatically move the cursor
    • For instance to move the cursor left
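As a rough illustration of the setup described in the list above, here is a minimal sketch. It assumes a POSIX terminal and Python 3 with the standard `termios` and `tty` modules. The names `raw_terminal`, `cursor_left`, and `cursor_right` are made up for this example, and `tty.setcbreak` is only one way to turn off line buffering and echo; the exact settings used in the original code are not shown here.

```python
import contextlib
import sys
import termios
import tty


def cursor_left(n=1):
    """ANSI escape sequence that moves the cursor n columns to the left."""
    return "\x1b[%dD" % n


def cursor_right(n=1):
    """ANSI escape sequence that moves the cursor n columns to the right."""
    return "\x1b[%dC" % n


@contextlib.contextmanager
def raw_terminal(stream=None):
    """Turn off line buffering and echo for the duration of a with-block."""
    stream = sys.stdin if stream is None else stream
    if not stream.isatty():
        # Not a terminal (for example piped input): nothing to change.
        yield
        return
    fd = stream.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)  # clears ICANON and ECHO, Ctrl-C still works
        yield
    finally:
        # Always restore the old settings, even after an exception.
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
```

Inside `with raw_terminal():` a call such as `sys.stdin.read(1)` returns one character at a time, and writing `cursor_left()` to `sys.stdout` moves the cursor one column left. Restoring the saved settings in `finally` matters, because otherwise the shell is left without echo after the program exits.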

Now writing the REPL is a matter of figuring out the control logic which goes inside the try block. Since REPL stands for Read, Eval, Print, Loop, I started by writing an infinite while loop. In my loop you can exit either by using the exit command I defined in my command language or by typing Control-C. Pretty simple. The EVAL_LOGIC will depend on what language you're implementing. Mine was simple: command [optional argument]. So my eval logic looked at the first word in the prompt and determined whether it was defined; if it was, it executed the command. Obviously many languages are much more complex than this. Mine was not, so I won't go any further into the details of implementing EVAL_LOGIC; instead I will talk about giving the READ LOGIC some nice features.
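The loop and the `command [optional argument]` evaluation described above could be sketched roughly as follows, in Python 3. The function names `evaluate` and `repl`, the `commands` dictionary, and the error message are all assumptions made for this illustration, not the original code.

```python
def evaluate(line, commands):
    """Evaluate `command [optional argument]` and return the text to print."""
    parts = line.strip().split(None, 1)
    if not parts:
        return ""
    name = parts[0]
    arg = parts[1] if len(parts) > 1 else None
    if name not in commands:
        return "unknown command: %s" % name
    return commands[name](arg)


def repl(read_line, write, commands):
    """Read, Eval, Print, Loop until the exit command or Control-C."""
    while True:
        try:
            line = read_line()
        except (KeyboardInterrupt, EOFError):
            break
        if line.strip() == "exit":
            break
        write(evaluate(line, commands) + "\n")
```

Passing `read_line` and `write` in as functions keeps the loop independent of how characters are actually read, so the read logic can be swapped out later without touching the eval logic.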

I wanted my REPL to have history, and I wanted the user to be able to edit the line by using the backspace key, and the arrow keys. Not complex features, but if you use raw_input() as your read logic they will not be there out of the box.

By necessity the READ LOGIC processes each character as it is entered by the user. To accomplish this I used an inner while 1 loop:

Before I get into the nitty-gritty of the control key logic, I am going to explain the function "clear_line" used in the above code. clear_line does exactly what one would think it would do: it clears the line and replaces the front of the line with whatever prompt is given by the user. Here is the code:

The most interesting thing about this code is that to delete a character in a terminal one moves the cursor left, enters a space, and then moves the cursor left again. You should note that in clear_line I assume the maximum size of the terminal is 150 characters. I am sure there is a way to get the actual size of your terminal from the environment, but for the first cut of my REPL I haven't bothered to look that up yet.
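A minimal sketch of a `clear_line` along the lines described above, in Python 3. Only the name `clear_line` and the 150 character assumption come from the post; the signature, the `out` and `width` parameters, and the helper `erase_previous_char` are assumptions for this illustration.

```python
import sys

MAX_WIDTH = 150  # the post's assumed maximum terminal width


def erase_previous_char(out=None):
    """Delete one character: cursor left, overwrite with a space, cursor left."""
    out = sys.stdout if out is None else out
    out.write("\x1b[1D \x1b[1D")
    out.flush()


def clear_line(prompt="", out=None, width=MAX_WIDTH):
    """Blank the current line and redraw the prompt at its start."""
    out = sys.stdout if out is None else out
    out.write("\x1b[%dD" % width)  # cursor left; the terminal stops at column 0
    out.write(" " * width)         # overwrite whatever was on the line
    out.write("\x1b[%dD" % width)  # back to the start of the line
    out.write(prompt)
    out.flush()
```

On a terminal narrower than `width` the run of spaces will wrap, which is the limitation the post mentions. In Python 3, `shutil.get_terminal_size()` from the standard library reports the actual width and could be used in place of the fixed constant.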

The logic for control characters works in a similar way to clear_line. First it detects what character has been entered. Then if it is an up character it must move the cursor down, and replace the prompt with a previously entered line from the history. If it is down it clears the prompt. Left and right allow the cursor to move left and right only if it stays within the characters the user has entered. If it strays, the code moves the cursor back. Finally backspace simply removes the previous character from the prompt.

That's it! That is all I needed to do to write an improved interactive REPL in Python. Hopefully this helps someone out there who is using raw_input (as I have done so many times before) and wanting a better solution.
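The bookkeeping behind those key presses can be sketched separately from the terminal output. The class below is a hypothetical illustration in Python 3, not the original code: it only tracks the line contents, the cursor position, and the history index, and it assumes the common ANSI arrow key sequences and that backspace arrives as `\x7f` (some terminals send `\x08` instead).

```python
UP, DOWN, RIGHT, LEFT = "\x1b[A", "\x1b[B", "\x1b[C", "\x1b[D"
BACKSPACE = "\x7f"  # assumption: some terminals send "\x08" instead


class LineEditor:
    """Tracks the edited line, the cursor position, and the history position."""

    def __init__(self, history=None):
        self.history = list(history or [])
        self.buffer = []                       # characters entered so far
        self.cursor = 0                        # index into buffer
        self.hist_index = len(self.history)    # one past the newest entry

    @property
    def text(self):
        return "".join(self.buffer)

    def feed(self, key):
        if key == UP:
            # Replace the line with the previous history entry, if any.
            if self.hist_index > 0:
                self.hist_index -= 1
                self.buffer = list(self.history[self.hist_index])
                self.cursor = len(self.buffer)
        elif key == DOWN:
            # Down clears the prompt, as in the post.
            self.hist_index = len(self.history)
            self.buffer = []
            self.cursor = 0
        elif key == LEFT:
            self.cursor = max(0, self.cursor - 1)
        elif key == RIGHT:
            self.cursor = min(len(self.buffer), self.cursor + 1)
        elif key == BACKSPACE:
            if self.cursor > 0:
                del self.buffer[self.cursor - 1]
                self.cursor -= 1
        else:
            self.buffer.insert(self.cursor, key)
            self.cursor += 1
```

A read loop would call `feed` once per key and then redraw the line from `text` and `cursor`. Clamping the cursor in `feed` is the state-side equivalent of moving a stray cursor back on screen.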

- Tim Henderson


]]>
http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson
Sat, 10 Apr 2010 12:11:00 -0700 Lessons Learned While Implementing a B+Tree http://blog.hackthology.com/lessons-learned-while-implementing-a-btree http://blog.hackthology.com/lessons-learned-while-implementing-a-btree

B+Trees are complex disk based trees used to index large amounts of data. They are used in everything from file systems, to relational databases, to the new style databases gaining popularity today. Sometimes a domain specific application needs to index a large amount of data, but cannot use a traditional database, or one of the NoSQL databases. In such instances the development team needs to roll their own indices. Here is an introduction to the B+Tree (one of the indexes my team created) and the lessons I learned while implementing it.

Introduction to the B+Tree

B+Trees are one of the fundamental index structures used by databases today. This includes new style SQL free databases. The B+Tree's popularity stems from its near-optimal performance, measured in disk reads, for range queries in a 1 dimensional space. What is a 1 dimensional space when talking about computer data which could be anything (not just numbers)? It is any collection of objects where the user accesses the object using only one attribute at a time.

For example, if we have an object which has X, Y, and Z as attributes, queries would only take place on X, or Y, or Z, but never on XY, or YZ, or XZ, or XYZ. Collections where multiple attributes are used to access the data elements are known as multidimensional spaces. For these spaces there are many other structures which have better performance than B+Trees.

B+Trees perform particularly well (in comparison to some other indices) when executing range queries. A range query is typically expressed as an inequality such as "give me all strings between 'blossom' and 'brunet.'"

When I say their performance is approaching optimal in number of disk reads, what do I mean? Why are we not measuring performance in number of instructions executed (like we do when we analyze a binary search)? In-memory algorithms and structures like sorted arrays and binary searches are largely bound by the number of CPU cycles it takes to execute the algorithm. We usually neglect CPU cache performance and memory locality when analyzing them, arguing these are constant in terms of the asymptotic performance of the algorithm. However, for a disk based structure like B+Trees the time it takes to read from (or write to) a disk becomes the dominant term, since disks are extremely slow in comparison to main memory. Therefore for disk structures we analyze their performance in terms of disk reads/writes.

Basic Structure of the B+ Tree

While I will not give a thorough explanation of the exact structure and properties of the B+Tree (I leave that to algorithm and database textbooks by the likes of Knuth, Sedgewick, and Ullman), I will describe its basic structure.

A B+Tree is best thought of as a key-value store. It is structured as a generalized tree. Instead of having only one key in each node it has N keys in each node, where N is referred to as the degree of the B+Tree. In the B+Tree there are 2 kinds of nodes: interior nodes and exterior (leaf) nodes. The interior nodes hold keys and pointers to nodes. The exterior nodes hold keys and their associated values. Since a pointer is usually smaller than a value, this means the interior nodes have a different (usually higher) degree than the exterior nodes.

The tree is structured this way because of the nature of disk access. Disks do not return 1 byte when you ask for 1 byte; instead they return the disk block to which that byte belongs, and the operating system then sorts out which byte you need. B+Trees exploit this by making their nodes fit exactly into one disk block. Since the degree of the interior nodes is high, the tree is extremely wide, which is a good thing since it means fewer disk reads to find the value associated with any one key.
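To put rough numbers on why a wide tree means fewer disk reads, here is a small back-of-the-envelope sketch in Python. The block size, key size, pointer size, and key count are illustrative assumptions only (they are not taken from the implementation discussed here), and block headers are ignored.

```python
def fanout(block_size, key_size, pointer_size):
    """How many key/pointer pairs fit in one interior block (headers ignored)."""
    return block_size // (key_size + pointer_size)


def levels(num_keys, degree):
    """Smallest number of levels h such that degree ** h >= num_keys."""
    h, capacity = 0, 1
    while capacity < num_keys:
        capacity *= degree
        h += 1
    return h


# Assumed figures: 4096-byte blocks, 8-byte keys, 8-byte pointers.
degree = fanout(4096, 8, 8)                    # 256 keys per interior node
wide = levels(100_000_000, degree)             # levels needed at degree 256
binary = levels(100_000_000, 2)                # levels needed at degree 2
```

Under these assumptions, 100 million keys are reachable in 4 levels at degree 256, against 27 levels for a binary tree. If each level costs one disk read, that is the difference the wide nodes buy.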


Figure 1. An Example B+Tree

 

In figure 1 you can see an example B+Tree. For this illustration I neglect showing the values, and have the order of the interior nodes equal to the order of the exterior nodes. In general this will not be the case. One thing to note in this simple example is how the exterior nodes are chained together in order. This is why it is efficient to execute a range query on the B+Tree. One can simply find the first key in the range, and then traverse the leaf nodes until the last key has been found.
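The leaf chaining can be sketched in a few lines of Python. This is an in-memory illustration only: the `Leaf` class, `link`, and `range_query` are hypothetical names, and the descent through the interior nodes that would locate the starting leaf is left out.

```python
import bisect


class Leaf:
    """An exterior node: sorted keys, their values, and a link to the next leaf."""

    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None


def link(leaves):
    """Chain the leaves together in order and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]


def range_query(start_leaf, lo, hi):
    """Yield (key, value) pairs with lo <= key <= hi by walking the leaf chain."""
    leaf = start_leaf
    while leaf is not None:
        start = bisect.bisect_left(leaf.keys, lo)
        for i in range(start, len(leaf.keys)):
            if leaf.keys[i] > hi:
                return  # past the end of the range: stop scanning
            yield leaf.keys[i], leaf.values[i]
        leaf = leaf.next
```

Once the first key in the range is found, the scan never goes back up the tree. It only follows `next` links, which is why each additional block of results costs a single read.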

Implementing the B+Tree

I made the decision to use TDD (Test Driven Development) for implementing the B+Tree. TDD has a lot of pluses when trying to create a data structure of any kind. When implementing a data structure one typically knows exactly how the structure should function, what it should do, and what it should never do. By writing tests first, you can ensure that when you finish a method, it actually works. This speeds development time, especially since you already know how the structure should function. It makes it quicker to find bugs, and to battle test the B+Tree. Since I released the B+Tree to the rest of my team to use, there has not yet been a bug filed against it.

So, knowing that we are using TDD, and knowing what the structure is and how it performs, what is the best way to begin implementing this complex structure? The way I started was to create a general structure called a block file. My block files abstracted the notion of reading and writing blocks (and buffering them). I also created objects to model a block that could contain either keys and pointers, or keys and records (from here on I will use the term records instead of values). Actually my blocks are even more general than that, as I intend to reuse them for other disk based index structures like linear hashing in the future.
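To make the idea of a block file concrete, here is a minimal sketch in Python. The class name `BlockFile`, its methods, the zero padding, and the unbounded dictionary cache are all assumptions for this illustration; the real block files described above may look quite different.

```python
class BlockFile:
    """Reads and writes fixed-size blocks on a seekable binary stream."""

    def __init__(self, stream, block_size=4096):
        self.stream = stream
        self.block_size = block_size
        self.cache = {}  # block index -> bytes (a very naive buffer)

    def write_block(self, index, data):
        if len(data) > self.block_size:
            raise ValueError("data does not fit in one block")
        # Pad with zero bytes so every block has exactly the same size.
        padded = data.ljust(self.block_size, b"\x00")
        self.stream.seek(index * self.block_size)
        self.stream.write(padded)
        self.cache[index] = padded

    def read_block(self, index):
        if index not in self.cache:
            self.stream.seek(index * self.block_size)
            block = self.stream.read(self.block_size)
            if len(block) != self.block_size:
                raise IndexError("no such block: %d" % index)
            self.cache[index] = block
        return self.cache[index]
```

Because every block lives at `index * block_size`, a pointer in the tree can simply be a block index. A real buffer would also need an eviction policy, which this sketch leaves out.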

I also created what I called a ByteSlice. My ByteSlice was an array of bytes of arbitrary length. I use it to represent keys, records, and pointers; everything in the B+Tree. My ByteSlice implemented a comparator, so it could be sorted, and conversions from integer types of various lengths to the ByteSlice and back again. By implementing this general type my B+Trees can easily deal with any kind of data and perform in exactly the same way.
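A ByteSlice-like type might be sketched in Python as below. Only the name `ByteSlice` comes from the text; the method names, the unsigned big-endian integer encoding, and the default length of 8 bytes are assumptions chosen for this illustration. Big-endian is used so that byte-wise comparison agrees with numeric order for unsigned integers of equal length.

```python
import functools


@functools.total_ordering
class ByteSlice:
    """An immutable run of bytes that can be compared, sorted, and hashed."""

    def __init__(self, data):
        self.data = bytes(data)

    @classmethod
    def from_int(cls, value, length=8):
        # Unsigned big-endian: byte order then matches numeric order.
        return cls(value.to_bytes(length, "big"))

    def to_int(self):
        return int.from_bytes(self.data, "big")

    def __eq__(self, other):
        if not isinstance(other, ByteSlice):
            return NotImplemented
        return self.data == other.data

    def __lt__(self, other):
        if not isinstance(other, ByteSlice):
            return NotImplemented
        return self.data < other.data

    def __hash__(self):
        return hash(self.data)

    def __repr__(self):
        return "ByteSlice(%r)" % self.data
```

With one comparable byte type, the tree code never needs to know whether a key started life as a string or an integer.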

After the infrastructure was created I began working on my first iteration of the B+Tree. The first iteration was based on the algorithms given by Robert Sedgewick in his excellent book "Algorithms in C++." I managed to get this implementation up, running, and fully tested in a matter of days. However, the version given by Sedgewick which inspired my implementation did not deal gracefully with duplicate keys. Thus, I needed to invent my own way of handling duplicate keys.

Approaches to Handling Duplicate Keys in B+Trees

There are several different ways of handling duplicate keys. One way is to use an unmodified insert algorithm which allows duplicate keys in blocks but is otherwise unchanged. The issue with a structure such as this is that the search algorithm must be modified to take into account several corner cases which arise. For instance, one of the invariants of a B+Tree may be violated in this structure. Specifically, if there are many duplicate keys, a copy of one of the keys may be in a non-leaf block. However, the key may also appear in blocks which come logically before the block pointed at by the key in the internal block. Thus the search algorithm must be modified to look in the blocks preceding the one suggested by the unmodified search algorithm. This will slow down the common case for search.

There is another issue with this straightforward implementation: if there are many duplicate keys in the index, the index may be taller than necessary. Consider a situation where for each unique key there are perhaps hundreds of duplicates. The index size will be proportional to the total number of keys in the main file; however, you only need an index on the unique keys. One of the files our program indexes has data with these characteristics. It indexes strings (as the keys) with associated instances where those strings show up in our documents. There can be hundreds to thousands of instances of each unique string.

Therefore the approach I took was to store only the unique keys in the index, and have the duplicates captured in overflow blocks in the main file. An example of such a tree can be seen in figure 2. Consider key 6; there are 5 instances of this key in the tree. The tree is order 3, meaning the keys cannot all fit in one block. To handle this situation an overflow block is chained to the block which is indexed by the tree structure. The overflow block then points to the next relevant block in the tree.


Figure 2. A B+Tree with duplicate keys and overflow blocks.
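The overflow idea can be illustrated with a heavily simplified Python model. This is not the tree's insert algorithm: a sorted list of unique keys stands in for the index, every unique key gets its own chain of blocks, and the names `Block`, `DuplicateIndex`, and `BLOCK_CAPACITY` are made up for this sketch.

```python
import bisect

BLOCK_CAPACITY = 3  # records per block, matching the order-3 tree in figure 2


class Block:
    """Holds records for one key; `overflow` links to the next block, if any."""

    def __init__(self, key):
        self.key = key
        self.records = []
        self.overflow = None


class DuplicateIndex:
    """Stand-in for the tree: only unique keys are indexed."""

    def __init__(self):
        self.keys = []    # sorted unique keys (what the index would contain)
        self.chains = {}  # key -> first Block of that key's chain

    def insert(self, key, record):
        if key not in self.chains:
            bisect.insort(self.keys, key)
            self.chains[key] = Block(key)
        block = self.chains[key]
        # Walk to the first block with room, adding an overflow block if needed.
        while len(block.records) >= BLOCK_CAPACITY:
            if block.overflow is None:
                block.overflow = Block(key)
            block = block.overflow
        block.records.append(record)

    def find(self, key):
        """Yield every record stored under key, following overflow links."""
        block = self.chains.get(key)
        while block is not None:
            for record in block.records:
                yield record
            block = block.overflow
```

Inserting five records under key 6, as in figure 2, leaves a single entry for 6 in `keys`, three records in the first block, and two in its overflow block. The index grows with the number of unique keys rather than with the number of records.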

 

To create a structure such as this, the insert algorithm had to be modified. As with the previous approach, these modifications do not come without a cost: in particular, the invariant which states that all blocks must be at least half full has been relaxed. In this B+Tree some blocks, like the one containing key number 7, are not half full. This problem could be partially solved by using key rotations to balance the tree better. However, there are still corner cases where there would be a block which is under-full. One such corner case occurs when a key falls between two keys which have overflow blocks. It must then be in a block by itself, since this B+Tree has the invariant which states that if a block is overflowing it can only contain one unique key. In the future we would like to implement key rotations to help partially alleviate the problem of under-full blocks.

The advantage of this approach to B+Trees with duplicate keys is that the index size is small no matter what the ratio of duplicate keys to the total number of keys in the file is. This property allows our searches to be conducted more quickly. Since the overflow blocks are chained into the B+Tree structure, we still have the property of being able to perform fast sequential scans. One consequence is that we have defined all queries on our B+Trees to be range queries. This is fine because all of our queries were already range queries. In conclusion, we relax the condition that all blocks must be at least half full to gain higher performance during search.

The Lessons Learned

The biggest lessons learned through the journey:

  1. The Value of Test Driven Development. The impact TDD had on the development time of the B+Tree vs. other structures in the project cannot be overstated. TDD dramatically reduced the time it took to develop the structure (from over a month for some of the other structures to under two weeks for the B+Tree), and has ensured reliability of the structure once it entered production.

  2. The Value of the Iterative Approach. By starting simple, testing, and then adding complexity I was able to get a better grasp on the problems posed by the modifications we needed to make to the structure. For instance, before I tried method 2 for duplicate keys, I modeled the data we would be putting into our tree and visualized the resulting structure. I found the structure would perform poorly. However, the same code allowed me to visualize method 2 and see that it would perform well.

  3. Visualizations as Part of Development. Writing code allowing you to visualize the structure you are developing can really help you find bugs quicker. The best tool to do this with is graphviz. The pictures used in this blog were generated as part of unit test cases. By building a visualization framework early as part of your unit tests you can further reduce development time. When a bug appears it can be enormously helpful to visualize the actual structure of the tree at the time the bug manifested.
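As a rough idea of what such visualization code can look like, here is a minimal Python sketch that emits Graphviz DOT text. The `(keys, children)` tuple representation of a node and the function name `to_dot` are assumptions for this example, and leaf links and overflow blocks are not drawn.

```python
def to_dot(root):
    """Return Graphviz DOT text for a tree of (keys, children) tuples."""
    lines = ["digraph bptree {", "  node [shape=record];"]
    counter = [0]  # next free node id

    def visit(node):
        keys, children = node
        name = "n%d" % counter[0]
        counter[0] += 1
        # A record-shaped node shows its keys side by side, separated by "|".
        label = "|".join(str(k) for k in keys)
        lines.append('  %s [label="%s"];' % (name, label))
        for child in children:
            child_name = visit(child)
            lines.append("  %s -> %s;" % (name, child_name))
        return name

    visit(root)
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned text to a file such as `tree.dot` and running `dot -Tpng tree.dot -o tree.png` renders the picture. Calling a function like this from a failing unit test captures the tree at the moment the bug shows up.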

Conclusions

When the right choice for your project isn't a DBMS, but you still need to index large data, don't fear: you can write the index structures yourself. By using TDD, iterating, and visualizing as you go you can ensure the index structure you create will perform well, and will never get into an incorrect state. Databases are not a black box, and they are not always the right answer. When required you can create your own system.


]]>
http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson
Thu, 08 Apr 2010 20:44:00 -0700 Regular Expression Talk http://blog.hackthology.com/regular-expression-talk-0 http://blog.hackthology.com/regular-expression-talk-0

A week or so ago I gave a talk at CWRU Hacker Society on how to implement a regular expression engine. The inspiration for the talk came from two things: first, I have long held a fascination for regular expressions, and second, an excellent article by Russ Cox inspired me to create my own engine [1]. Regular expression matching is largely a black box today. Rarely will a programmer need to implement their own matching algorithm. However, there are times when it cannot be avoided, such as when the match needs to be executed on many strings simultaneously.

This talk assumes minimal knowledge of regular expressions; indeed it starts by defining the syntax and the semantics before explaining finite automata and the matching algorithms themselves. To explain the matching algorithms I wrote simple implementations in Go (Google's new programming language) which are now hosted on github [2]. These implementations also serve to highlight using Go in a less trivial way than in a Go tutorial I gave last month. The slides and audio from the talk should be attached to this post.

[1] http://swtch.com/~rsc/regexp/regexp2.html

[2] http://github.com/timtadh/regex-machines

Audio:

regex-talk.mp3

Slides:


]]>
http://files.posterous.com/user_profile_pics/290393/akbar.jpeg http://posterous.com/users/36udceGAwRbz Tim Henderson timtadh Tim Henderson -