Friday, September 2, 2011

Algorithm-a-Day : Day 16 : Conway's Game of Life

Alright! I finished moving and my new apartment is mostly clean,
so it's back to AaD! And today is...

The Game of Life. A beautiful and wonderful no-player game.
Grab yourself some pygame, then give this a run:

Github :: Conway's Game of Life

Full notes in the repo!

Thursday, August 18, 2011

Moving Next Week

As I'm getting ready to move next week, I have less time lately to work on AaD. I'm therefore going to announce a slow-down, but not a cancellation. AaD will continue, just once I have everything settled down in my new place.

It's got an extra room for the same rent as the place I'm in now, and that's saying something for Japan.

Wish me luck! (or hard work)

Tuesday, August 16, 2011

Algorithm-a-Day : Day 15 : Email Address Parsing with Regex

Of course, inspired by: XKCD #208
And on that note: XKCD4ME (Python script)
 
ABOUT REGULAR EXPRESSIONS
-------------------------
  Regular expressions are a very powerful tool for finding and replacing
substring patterns in strings. Regex's are a language of their own,
separate from anything we've been using so far, but luckily they are mostly
uniform across programming languages.
(Find that prime solving regex)
PYTHON
------
  The regex we're given is:
    \b([a-zA-Z0-9-_.]+)@(\w+[.][a-zA-Z.]+)
but more specifically in a Pythonic context we can look at it directly as:
    r"\b([a-zA-Z0-9-_.]+)@(\w+[.][a-zA-Z.]+)"
  The r"" type of string is called a raw string, and it doesn't interpret
escape sequences. The string r"haha this won't \n print a new line"
will print:
    haha this won't \n print a new line
Essentially, backslashes are not interpreted. This is important for regex's
because backslashes also have special meaning, which means that in a regular
string you would have to double-escape your special characters. We won't
bother with that, and just go ahead and use raw strings.
  See line 6. Holy moon language. Note that we could have written the regex
like this:
    regex = r"""\b
([a-zA-Z0-9-_.]+)
@
(\w+
[.]
[a-zA-Z.]+)
"""
and then called it like this:
    pattern = re.compile(regex, re.VERBOSE)
Verbose mode allows whitespace and newlines to be ignored in the given regex.
This lets us break up longer regex's into more digestible chunks.
We can also comment them, like this:
    regex = r"""\b # Match a word boundary.
([a-zA-Z0-9-_.]+) # Match one or more characters from small
# 'a' to 'z', capital 'A' to 'Z', digits, and
# a few special characters.
@ # Match the 'at sign'.
(\w+ # Match one or more letters, digits, or underscores.
[.] # Match a period. This has to be in a []!
[a-zA-Z.]+) # Match one or more letters or periods.
"""
The last condition is important because some URLs from other countries have
multiple extensions to them. For example, a Yahoo email address from Japan
would end in: @yahoo.co.jp
  The rest is straightforward; we compile the regex, ask it for matches,
then store said matches after some string work.
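Putting those pieces together, a minimal sketch of the whole flow might look like this (the sample text is made up, and the repo's script does more string work than this):

```python
import re

# The article's regex in VERBOSE mode. Note: re.VERBOSE is a flag
# passed to re.compile(), which lets us spread the pattern out.
regex = r"""\b            # Match a word boundary.
([a-zA-Z0-9-_.]+)         # One or more letters, digits, or . _ -
@                         # The literal at sign.
(\w+                      # One or more word characters,
[.]                       # a period,
[a-zA-Z.]+)               # then one or more letters or periods.
"""

pattern = re.compile(regex, re.VERBOSE)

text = "Mail colin@yahoo.co.jp or admin@example.com for details."
# findall() returns (local, domain) tuples because of the two capture
# groups, so we stitch each pair back together with the at sign.
emails = ["@".join(match) for match in pattern.findall(text)]
```

The stitching step is the "string work" mentioned above: with two capture groups, the at sign itself isn't part of either group.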
  Try poking around at the regex itself to see what makes the script
explode and what doesn't. Then, try making one that parses URLs. Maybe to run
across an HTML file? Your life as a professional scraper begins today!

Monday, August 15, 2011

Algorithm-a-Day : Day 14 : Heap Sort

If you already have a Heap, this is the easiest sorting algorithm to code. Ever.

Github :: Heap Sort

ABOUT HEAPSORT
--------------
  Heap sort is one of the most conceptually simple sorts out there - that is,
if you already have a Heap. Simply grab the first (the maximum value!)
element, store it, perform head-deletion, repeat.
This process is demonstrated most elegantly in the Haskell version.
For the C and Python versions, however, instead of using the head
removal function outright, our implementations are going to go about things
just a little differently.
C
-
  The C and Python versions share a quality: they trick the Heap
into thinking it's shorter than it is, then hide the gradually sorted
values at the end of the Heap instead. In doing this, we are able to
perform many cascade-downs without disturbing those sorted values
that we accumulate at the end.
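Sketched in Python (the same shape as both the C and Python versions, though the repo's code differs in detail): build a max-heap, then repeatedly swap the max into the tail and cascade down over a shrinking prefix.

```python
def heap_sort(a):
    """In-place heap sort using the 'shrinking heap' trick above."""
    n = len(a)
    # Build a max-heap by cascading down from the last parent.
    for i in range(n // 2 - 1, -1, -1):
        _cascade_down(a, i, n)
    # Swap the max to the end, then shrink the heap's view of itself
    # so the sorted tail is never disturbed again.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        _cascade_down(a, 0, end)
    return a

def _cascade_down(a, i, size):
    """Sift a[i] down until the heap property holds within a[:size]."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < size and a[left] > a[largest]:
            largest = left
        if right < size and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest
```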
  For the sake of continuity, this version returns us the Heap's
inner array wrapped in an IntList. When doing a HeapSort in place,
the original Heap is essentially ruined, so this shouldn't be a problem.
PYTHON
------
  Nice and short, and essentially identical to the C version.
It's still running about 3 times slower than a CombSort11, which
is a bit pathetic. Unlike the C version (as has been the case with
all of our data structures so far) Python can handle any type, not just
ints.
HASKELL
-------
  At the moment, heapRemove is so inefficient that heapSort has trouble
sorting a list of only a few hundred elements. It's not the sort's fault,
the underlying algorithm just sucks.
  In the Heap Sort algorithm itself, however, we see Haskell's
powerful list creation functionality hard at work.

Sunday, August 14, 2011

Algorithm-a-Day : Day 13 : (Binary Max) Heap

Not the memory Heap! This is a tree based structure that a lot of fancy
things are built with :).

Github :: Heap

ABOUT HEAPS
-----------
  Heaps are essentially trees that satisfy something called "The Heap
Property". This property states that for every parent node,
its two (or more) children will be less than it (by whatever method
of comparison you're using). The order of the children doesn't matter.
Nor does it matter if a parent's children are greater than that
parent's sibling node.
  Heaps bless us with an interesting property: the children are always
a _computable_ distance away from their parent. The root, being the first
parent (Node 1), will always have children at Nodes 2 and 3. Likewise the
4th Node will belong to root's first child, etc.
  This pattern allows us to bypass implementing a tree completely,
and instead store the Heap as a list/array, simply computing the indices
of the child/parent nodes whenever we need them.
INDEX ARITHMETIC
----------------
  For a given node at index i:
    - children @ a[2i+1] and a[2i+2]
    - parent @ a[floor((i−1)/2)]
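In code, that arithmetic is just (0-indexed array, hypothetical helper names):

```python
def children(i):
    """Indices of node i's children in a 0-indexed array heap."""
    return 2 * i + 1, 2 * i + 2

def parent(i):
    """Index of node i's parent (floor division handles the floor)."""
    return (i - 1) // 2
```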
WHAT HEAPS DO
-------------
  Heaps need to be able to do a few things:
    - Insert and rebalance.
    - Delete the root and rebalance.
    - Yield the max (or min) value present.
    - Merge with another heap.
INSERTION
---------
  New values are inserted at the end of the Heap. This is trivial if the
Heap is implemented as an array internally. We must then 'cascade-up',
swapping the new child with its parent until the Heap Property is restored.
  At most this would only ever take as many operations as there are
layers in the Heap.
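A sketch of that cascade-up for an array-backed max-heap (names are mine, not the repo's):

```python
def heap_insert(heap, value):
    """Append value, then cascade up until the Heap Property is restored."""
    heap.append(value)
    i = len(heap) - 1
    while i > 0:
        p = (i - 1) // 2                      # parent's index
        if heap[p] >= heap[i]:                # property restored; stop
            break
        heap[p], heap[i] = heap[i], heap[p]   # swap child with parent
        i = p
```

Each loop iteration climbs one layer, so the number of swaps is bounded by the heap's height.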
ROOT DELETION
-------------
  When removing the root, take the last leaf and make it the new root.
Then 'cascade-down', always swapping the current node with its greater child
until, again, the Heap Property is restored.
YIELDING THE MAX/MIN
--------------------
  While this isn't the case for all types of Heaps, Binary Max/Min heaps
can complete this in a single operation.
MERGING HEAPS
-------------
  Simply feed one into the other, inserting each value and cascading as
necessary.
WHAT HEAPS ARE USED FOR
-----------------------
  Heaps are used in a particular sorting algorithm called Heap Sort,
as well in graphing algorithms like Dijkstra's Algorithm.
IMPLEMENTATIONS
---------------
C
-
  As an experiment, I wrote this version in the Linux Kernel style of
C coding. Turned out pretty neat, if you don't mind 8-space tabs.
  This does everything we promised Heaps have to do: insertion, deletion,
finding the max, and merging. It also prints itself out _like a tree_,
just like Binary Search Tree from before.
  I went all-out with error checking too, and just hope I did it
sensibly.
PYTHON
------
  Less than half the length of the C version... who'da thought? Python gives
us a lot of power here. Merging heaps is as simple as inserting all
the elements of a given iterator. It doesn't even have to be a heap ^^!
  Again, lots of power with __iter__ and __str__.
  You may notice that cascade_down() is quite different between the C
and Python versions. The Python version uses a more general approach to
determine the greater child (and its index) and actually allows us to
have more than 2 children in a future implementation. The C version doesn't
even begin to allow this.
HASKELL
-------
  As lists in Haskell are Linked Lists and not random access arrays,
implementing a Heap with a list would be a bit foolish. Instead,
we go ahead and create our own data type, much like we did with
the Binary Tree. Essentially, changing how new items are added is what
makes this a Heap. Notice that this Heap is foldable just like the BS Tree was,
and that this Heap will print out properly in a Tree shape, just like the C
and Python versions.
  I admit that heapInsert and cascade are so close to each other that it's
a bit disturbing. I'm sure the solution is obvious, and that it just hasn't
dawned on me yet.
  Removing the root in this version is also less complicated (and slower).
We simply ignore the root and heapMerge the two branches together.
  listToHeap isn't really necessary, but it helps out with testing.

Friday, August 12, 2011

Algorithm-a-Day : Day 12 : Binary Search Tree

Our first tree structure. Hurray!
Trees are important because a lot of things are implemented with them. Not all of them act like BS Trees, but it's the easy one so we're starting with it.
Also, make sure to check out the section below on how Trees are printed. This causes a lot of people trouble, and I found a good solution.

Github :: Binary Search Tree

ABOUT BINARY SEARCH TREE
------------------------
  A handy tree structure built for fast look-ups!
Our versions here today aren't self-balancing, meaning that if your root
sucks you could end up with a very lopsided tree.
Many built-in data structures in modern languages are implemented with trees,
so it's important to know what they are and how they work.
This includes knowing how the following things are done:
-- Recursion.
-- Queues.
-- Depth-first searches.
-- Breadth-first searches.
(But for a binary search tree we don't have to worry about those types
of searches)
ABOUT PRINTING TREE STRUCTURES
------------------------------
  This causes so much unnecessary suffering, so I created a slick recursive
way to print out tree structures opening out to the right. Each leaf gets
its own line, and everything is indented properly. It's quite easy to read.
Check out both the py_test.py file and the C unit test to see what
I'm talking about.
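The idea can be sketched in a few lines of Python. Here the tree is just nested (left, value, right) tuples for illustration; the repo uses proper node classes, but the recursion (right subtree first, then the node, then the left) is the same:

```python
def show_tree(node, depth=0):
    """Build a string of the tree opening out to the right.

    node is either None or a (left, value, right) tuple. Each value gets
    its own line, indented by its depth; printing the right subtree first
    puts it above its parent, so the tree reads naturally sideways.
    """
    if node is None:
        return ""
    left, value, right = node
    return (show_tree(right, depth + 1)
            + "    " * depth + str(value) + "\n"
            + show_tree(left, depth + 1))

tree = ((None, 1, None), 2, ((None, 3, None), 4, None))
print(show_tree(tree))
```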
C
-
  Here we had to introduce some new structs. Allocating memory for them is
standard run-of-the-mill stuff. Don't forget to check for NULL pointers
everywhere possible. I'll be there with an "I told you so" when a Segfault
cracks your computer in half.
  Insertion and checking for containment are both done with recursion. These
things can be done iteratively too, but it can get silly.
  Printing here, unlike the Python and Haskell versions, is done without
any string creation, and instead prints right in the function itself.
  There is a **unit test** file, so make sure to check that out too.
PYTHON
------
  Shorter and sweeter than the C version, but all around the same.
Just as we've had in other Python versions of algorithms thus far, we get
to specify 'special functions' that allow us to use keywords on our Tree.
    print(tree)
    if 5 in tree: ...
    etc.
  For printing, here we're gradually building up a list of strings of the
leaves, then concatenating it all at the end in the __str__ function.
Remember that in Python, str.join() is always better than str + str.
HASKELL
-------
  God functional programming rocks. Here we start by defining instances
of typeclasses for the Tree type, allowing us to (via Functor) map functions
across our Tree, and (via Foldable) use folds across it as well.
  While it may look a little different, the function showTree produces
the same result as the print functions in the C and Python versions. Once
again, a victory for recursion. showTree is called by 'show', as is standard.
  Everything else is essentially the same, but with the Haskell twist.
Even so, it's less than half the length of the C version (and that's with
all our instance definitions included!).
  In conclusion, Haskell continues to impress me with its power of
expressiveness.

Thursday, August 11, 2011

Algorithm-a-Day : Day 11 : Circular Binary Search

As far as I know, I invented this. It's a binary search algorithm that works on lists that are sorted, but not linearly. For example, this algorithm can determine containment in the following list:
[4,5,6,7,8,9,0,1,2,3]
without any restructuring of the list. This is handy when searching large batches of data like log files, where sorting beforehand isn't an option.
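The general idea can be sketched like this (a simplified reconstruction, not the repo's exact code; see the repo for the real thing and full notes). At every step, at least one half of the current range is properly sorted, and a couple of comparisons tell us whether the target could live in that half:

```python
def rotated_contains(a, target):
    """Binary search over a sorted-then-rotated list of distinct values."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == target:
            return True
        if a[lo] <= a[mid]:                 # left half is properly sorted
            if a[lo] <= target < a[mid]:    # target must be in it...
                hi = mid - 1
            else:                           # ...or in the other half
                lo = mid + 1
        else:                               # right half is properly sorted
            if a[mid] < target <= a[hi]:
                lo = mid + 1
            else:
                hi = mid - 1
    return False
```

This keeps the O(log n) running time of a normal binary search, with no restructuring of the list.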

Github :: Circular Binary Search

Check the notes in the repo, it's a good read.

Wednesday, August 10, 2011

Algorithm-a-Day : Day 10 : Binary Search

Jon Bentley goes on for quite a while about this in Programming Pearls.
Implementations can be buggy if you're not used to it, but forging one bug-free from scratch is something you can be proud of.

Github :: Binary Search

ABOUT BINARY SEARCH
-------------------
  You have a phone book, and are particularly violent about looking for
numbers. Your best friend, Mr. Ericson, has a number that you've forgotten.
You open the phone book up right into the middle: McLean. Ericson comes
earlier in the phone book than McLean, so you tear it in half, toss the
unneeded portion over your shoulder and repeat the process. Firth? Garbage!
Carlson? Garbage! You keep searching the middle name and tearing off
the unneeded half until you find Mr. Ericson's number. If it's in there,
hurray. If it isn't... well you've just wasted a phone book.
Welcome to binary search!
  Binary search is a very fast way to search for elements in a sorted list.
If the list is even one element off of being perfectly sorted, you cannot
guarantee the results of the search. Given a big enough list, however,
it can be a bother checking for sortedness before every search. There are
ways around this, but for us, we'll just assume the lists we're given
are sorted.
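The whole phone-book procedure fits in a few lines; a sketch in Python (returning -1 for "you've just wasted a phone book"):

```python
def binary_search(items, target):
    """Return the index of target in a sorted list, or -1 if absent."""
    lower, upper = 0, len(items) - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        if items[mid] == target:       # found Mr. Ericson
            return mid
        elif items[mid] < target:      # tear off the left half
            lower = mid + 1
        else:                          # tear off the right half
            upper = mid - 1
    return -1
```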
C
-
  Since C (before C99's stdbool.h) has no boolean data type, a failure
to find the desired item will result in a -1. Usually we use 0 to represent
False, but in this case 0 is a legitimate array index.
  Target less than mid? Go left. Target greater than mid? Go right.
Found it? Save and break. Using 'break' makes some people uncomfortable,
but I find it clearer than say, altering 'lower' or 'upper' manually
so that the loop will break on the next pass.
  Make sure to run that **unit test**!
PYTHON
------
  Sexier to look at than the C version. The only Python bonus we gain here
is (fewer lines and) the ability to infer the length of the list we're given,
instead of it being passed explicitly as in the C version.
  Watch out for line 7, that's Python3 only.
HASKELL
-------
  Nice and short. Notice a theme here? Guards and 'where' are our friends.
Since Haskell doesn't have loop structures, this is of course done via
recursion. It's safe to say that this will never blow the stack:
even a stack depth limit of 1000 would allow us 1000 (give or
take a few) layers, or 2 ^ 999 values. That number is over 3.5 lines long
on my 80-column terminal window. So no overflows :)
  'take' and 'drop' also make things easy for tearing our proverbial
phone book in half. Overall, just really short and clean!

Tuesday, August 9, 2011

Algorithm-a-Day : Day 09 : Quick Sort

TL;DR -> Do this over and over: Pick pivot, partition, sort.


C
-
  The C version is quite standard, implementing an in-place quicksort
which is not only fast, but also saves on memory.
While pivot selection is apparently a hot topic in some circles,
this simple version doesn't bother with any fancy work and simply
takes the middle element (of each subset) as the pivot. One could be even
lazier and skip this entirely, taking the first element as the pivot, but
if your list is sorted perfectly backwards and sufficiently long,
you could blow the stack.
-- I've included a short unit test for the C version as well, so feel
   free to compile that in your spare time.
PYTHON
------
  Nearly identical to the C version, and a lot shorter.
This implementation does its best not to inefficiently throw memory around,
and (in theory) is an in-place sort.
  As a bonus from using Python, we can specify default values in qsort()'s
function definition. Take a look! This allows us to simply pass something
like: qsort([4,5,2]) and everything will be taken care of.
(Recall the C version where we had to pass the starting lower and upper
bound values right off the bat)
  We also chose a different method of pivot selection. Here, we take
three values, the lower, upper, and middle, sort them, then choose the
pivot position to be the index of the median of the three. This
ensures that the pivot will never be an
outlier, needlessly extending the stack and the amount of work that needs
to be done. A combination of several built-in functions takes care of
this work nicely for us, as can be seen in the get_pivot_pos() function.
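A sketch of both pieces (a simplified take with hypothetical details; the repo's code differs):

```python
def get_pivot_pos(items, lower, upper):
    """Index of the median of items[lower], items[mid], items[upper]."""
    mid = (lower + upper) // 2
    candidates = [(items[lower], lower), (items[mid], mid), (items[upper], upper)]
    return sorted(candidates)[1][1]     # middle value's original index

def qsort(items, lower=0, upper=None):
    """In-place quicksort; the defaults let you call simply qsort([4, 5, 2])."""
    if upper is None:
        upper = len(items) - 1
    if lower >= upper:
        return items
    pivot = items[get_pivot_pos(items, lower, upper)]
    i, j = lower, upper
    while i <= j:                       # partition around the pivot value
        while items[i] < pivot:
            i += 1
        while items[j] > pivot:
            j -= 1
        if i <= j:
            items[i], items[j] = items[j], items[i]
            i, j = i + 1, j - 1
    qsort(items, lower, j)              # sort each partition
    qsort(items, i, upper)
    return items
```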
HASKELL
-------
  Ahhh filters and lazy evaluation. This solution is obviously the cleanest
out of the three. The pivot here is the same as the C version, simply
chosen as the middle element of the set. This can easily be made a one-liner
if you don't much care for names and neat-looking code.

Monday, August 8, 2011

Algorithm-a-Day : Day 08 : Primes

Yeah! Prime numbers! Who doesn't get excited about these? And they pop up all the time in Project Euler, too.

Written your own libraries, but don't think they're too hot? Can't figure them out in the first place? Check out implementations in C, Python, and Haskell! (The Haskell version is particularly fun. See below.)

Github :: Primes

(from the notes found in the repo)

ABOUT PRIMES
------------
  A prime number is an integer greater than 1 that is divisible only by
1 and itself. Thus, the lowest prime number is 2. Don't be fooled, 1 is
not included.
Prime numbers pop up in all sorts of places around the universe,
so it's handy to know about them and also have libraries to work with them.
  In all three versions I have included 3 things:
    - A function that tests for primality.
    - A function that, given a number, finds the nearest prime greater than it.
    - An infinite data structure that uses the above two to produce all
      the prime numbers ever.
Note: When testing for primality, one need only check up to the square
root of the number you're testing. We will see why this is with an example:
  Consider the number 101. Its square root is quite nearly 10.
We have so far tested up to 10, but no integer has divided evenly into 101
yet. For a number greater than ten to divide cleanly into 101, its 'partner'
(e.g. in 15 = 3 * 5, the partner of 5 is 3) would have to be lower
than 10. Yet since we've already checked up to ten, we know that no such
partners exist. Therefore, 101 is prime.
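That square-root bound turns into a very short primality test; a sketch in Python (trial division by odd numbers, as in the usual solution):

```python
def is_prime(n):
    """Trial division up to the square root, as reasoned above."""
    if n < 2:
        return False
    if n < 4:
        return True          # 2 and 3
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:        # only check up to sqrt(n)
        if n % d == 0:
            return False
        d += 2               # skip the even numbers
    return True
```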
C
-
  Standard procedural programming. We utilize the intlist library from
day 5 to implement the prime number list-producer.
PYTHON
------
  Here we differ from the C version only slightly, getting a true
iterator class for the prime number generator, and the helper function
_next_prime() is done in the functional style. Note that the function
head() is a function I wrote to mimic the behavior of the function of the same
name that appears in functional languages.
HASKELL
-------
  Here I present a unique solution of mine. The usual solution is to divide
only by odd numbers up to the root of the number you're checking, but this
process can be sped up by only dividing by other prime numbers. Remember
prime factors? All integers (save 1) are created by multiplying a set of
prime numbers together. A prime's only prime factor should thus be itself,
and no other primes. Therefore dividing by odd numbers (while quick) is still
doing more work than needs to be done.
  This Haskell solution works by having the primality test (isPrime) and the
infinite list of primes feed off each other. isPrime only divides by primes
found in the 'primes' list. But 'primes' gets its values by filtering
'isPrime' across all the odd integers... This might appear too circular
to work, but it turns out that supplying the original list with the values
2 and 3 is enough to produce all the primes, or check the primality of
any number.

Sunday, August 7, 2011

Algorithm-a-Day : Day 07 : Linked List

Yup, a linked list!
Implementing these in Python and Haskell is nearly pointless, but here they are anyway.

Github :: Linked List

We can be grateful for 'special functions' in Python (ones that start and end with two underscores) as they allow us to easily apply built-ins to our data structures.
For example, asking for:    len(our_list)   will yield its length, so far as we've defined it in the class definition. The keyword 'in' will also work, as will print()!
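A minimal sketch of how those special functions hook in (a toy class, not the repo's):

```python
class LinkedList:
    """A tiny singly linked list demonstrating 'special functions'."""

    class Node:
        def __init__(self, value, next=None):
            self.value, self.next = value, next

    def __init__(self, values=()):
        self.head = None
        for v in reversed(list(values)):        # build front-to-back
            self.head = self.Node(v, self.head)

    def __iter__(self):                         # enables for-loops
        node = self.head
        while node:
            yield node.value
            node = node.next

    def __len__(self):                          # enables len(our_list)
        return sum(1 for _ in self)

    def __contains__(self, value):              # enables 'in'
        return any(v == value for v in self)

    def __str__(self):                          # enables print()
        return " -> ".join(str(v) for v in self)

our_list = LinkedList([1, 2, 3])
```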

Saturday, August 6, 2011

Algorithm-a-Day : Day 06 : FizzBuzz

The idea is to print the numbers 1 to 100, except for multiples of 3 you print 'Fizz', for multiples of 5 you print 'Buzz', and for multiples of both you print 'FizzBuzz'.

Apparently this is hard? Or it separates the men from the boys?
Regardless, I wrote four versions. The C version is standard, the Python version uses funky string manipulation, and the Haskell version (the second one) is a neat little solution I thought up myself. Take a good look at it.
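For reference, a compact Python take on the rules above (not the repo's funky-string version):

```python
def fizzbuzz(n):
    """Return the FizzBuzz line for n: multiplying a string by a
    boolean gives either the string or '', so multiples of both
    3 and 5 naturally concatenate to 'FizzBuzz'."""
    line = "Fizz" * (n % 3 == 0) + "Buzz" * (n % 5 == 0)
    return line or str(n)

for n in range(1, 101):
    print(fizzbuzz(n))
```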

Github :: FizzBuzz

Friday, August 5, 2011

Algorithm-a-Day : Day 05 : itol (int-to-list)

itol was originally a function I wrote in Python to help me solve Project Euler problems. It's been through many revisions, and the Python version presented here is its simplest version. The Haskell solution is also typical with no surprises, but the C solution ended up being a bit longer, to (again) account for C's lack of a more powerful built-in list type. It essentially became a general int list library.

Github :: itol (int-to-list)

There is also a short unit test for the C version, which you can compile at your leisure.

Thursday, August 4, 2011

Algorithm-a-Day : Day 04 : Command-line arguments

Needless to say, these come up a lot. How could they not?
Fledgling programmers need to know how these are handled, and I remember a day when I didn't have a flying clue how to deal with them. I mean heck, who would ever use command-line args? Who even uses the command-line? (said he who, at the time, had not yet even compiled a Java program outside an IDE)
I was ignorant, to say the least.

Github :: Command-line arguments

The C and Haskell implementations are straightforward. They're more of an "Oh, okay" rather than any sort of programming revelation. My Python implementation, however, is a bit more stylised for real, well-defined use. It should be noted that it is also available in my python-libs repo on Github.

Wednesday, August 3, 2011

Algorithm-a-Day : Day 03 : cat

Everyone's favourite shell command. Goes well with pipes. And sugar.

Github :: cat

All three are fairly simple, with the C implementation probably being the most so (which is to be expected). The Haskell version still makes me giggle, though.

Tuesday, August 2, 2011

Algorithm-a-Day : Day 02 : Line Count

Counting the number of lines in a given file, in C, Python, and Haskell.
Quite simple today, but I needed some easy stuff right off the start to balance the more interesting algorithms to come later.

Github :: Line counting

Monday, August 1, 2011

Algorithm-a-Day : Day 01 : Reversing words in a string.

This seems to come up as an interview question more often than not. Here is the solution in C, Python, and Haskell.

Github :: Reversing words in a string

While the Python solution is pretty trivial, Haskell wins out of the three.
Nothing beats a one-liner built from library functions.
I think there's sense in "Don't reinvent the wheel" except when you're reinventing it as a nice exercise, as I did with the C solution.
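The "pretty trivial" Python solution really does come down to roughly one line:

```python
def reverse_words(s):
    """Reverse the order of words in a string."""
    return " ".join(reversed(s.split()))
```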

EDIT (8/11/2011) : Fixed link, and bug in code.

Monday, July 25, 2011

Algorithm-a-day :: Starting in August.

Come August first I'm going to be releasing an interesting algorithm once a day, in Haskell, Python and C. Not only will this be a good repertoire of common algorithms, but it'll also be good practice for me to keep me on my toes in the various languages.


I focus on Haskell mainly these days, but I can't let my most fluent language (Python) and my batter-on-deck (C) get rusty.

Please look forward to A.a.D!

Tuesday, June 28, 2011

Mind Blown by a Save Icon

My Mind Blown by a Save Icon

I was writing a short article on the use of 'it' in the English
language earlier today. I happened to be on my Linux Mint boot, and
was using the (very compliant so far) Libre Office Writer to get my work done.

I wrote a section of decent length before I realised I hadn't saved yet,
and for no reason in particular I decided to hit the save button on the tool
bar instead of 'Control + s' like any badass would've normally done.

I was surprised at first when I couldn't find the icon. Where the hell was
the little floppy? I need my doc sa-...

...which was the moment I saw an icon: a green arrow overtop a hard-drive.
"It couldn't be..." I thought vainly as I brought my cursor over the icon
to bring up its mouse-over text. Sure enough, "save" appeared before my eyes.
From a picture of an arrow and a hard-drive.
My foolishness hit me then. Why the hell would a save icon be indicated by a
floppy disk? When was the last time I even TOUCHED a floppy disk? I chuckled
to myself at how much sense it made. This was followed by rage, for the real
party at fault here was the developers who persist in maintaining
familiar icons for the sake of convenience and "ease of use."

Well ease-of-use can walk the plank as far as I'm concerned here. I have not
saved anything on a floppy in years, and neither have those developers.
My hat goes off to the champions at Libre Office for being true to what
is realistic.

We're big kids. We can handle change.

Monday, June 27, 2011

My Coding Rules

Rules I follow when coding. They keep me on track for solid code.

- Don't clutter the namespace. Use as few variables as possible,
  while still maintaining clear code. Aim for low amounts of
  plain-coloured text in your editor!
  Turn this,

  def two_n_four():
      result = 2 + 4
      return result

  into this,

  def two_n_four():
      return 2 + 4

- Use as many built-ins as possible. They will be faster than anything
  you can write.

- Early / multiple returns. This reduces indentation and increases readability.
  Turn this,

  def the_func(best_arg):
      if not isinstance(best_arg, int):
          result = 'Hey stop that'
      else:
          x = foo(best_arg)
          y = bar(x)
          result = baz(y)
      return result

  into this,

  def the_func(best_arg):
      if not isinstance(best_arg, int):
          return 'Hey stop that'
      return baz(bar(foo(best_arg)))

- Low line counts. Multiple returns can aid this. Multiple returns DO NOT
  necessarily make code harder to follow!
  Remember, the easiest way to do something is the easiest way to do something.

- Functional programming style = beauty.
  Turn this,

  def factorial(num):
      '''I'm pretty fast and typically procedural!'''
      result = 1
      while num > 1:
          result *= num
          num -= 1
      return result

  into this,

  def factorial(num):
      '''Look how many lines I didn't take!'''
      return reduce(lambda acc, n: acc * n, range(1, num+1), 1)

PYTHON RULES
- Write docstrings.

- Bind specific imports to local names!
  e.g.,
    from functools import reduce as _reduce

- Use iterators. Then use some more. A lot of functions accept iterators
  as arguments and they can really reduce the amount of work that needs
  to be done before an answer is arrived upon.

- Use decorators to rid your functions of 'administrative work'.
  Like this:

def non_empty(func):
    '''Does not allow an empty list to be passed to a function.'''
    def inner(arg):
        if len(arg) < 1:
            raise ValueError('Empty list given.')
        return func(arg)
    return inner

@non_empty
def mean(items):
    '''Calculates the mean of a list of numbers.'''
    return sum(items) / len(items)

This keeps the function clean and to the point. Anyone looking at the
decorator should also be able to tell what it does by its name.

---

My personal goal in programming is a harmonious mix of beauty and
efficiency. I believe programming to truly be an art (I often do it
to calm down) and I employ the rules above to make sure my code is
as clear and elegant as I can possibly make it.

Tuesday, May 31, 2011

The Amount of Data on Wikipedia

How much (uncompressed) data is there on Wikipedia?
The Wikipedia FAQ states:
  'Wikipedia currently has 3,646,436 articles in total in the English
  version alone. In a past comparison of encyclopedias, Wikipedia had about
  1,400,000 articles with 340,000,000 words in total, ...'

The average length of an English word is 5 letters, which means that
at the time of that past comparison, all the words through all of Wikipedia
would have contained:
  1,700,000,000 letters
As each character in an (English) file would be stored with one byte,
we can see that 1,700,000,000 would also be the number of bytes required
to store all the letters.
To put this in perspective (as big numbers are scary),
  1,700,000,000 B = 1.7 GB
So, at the time of that past comparison, a 2GB flash drive could store the
entire contents of Wikipedia.
But what about now, with their current figure of over 3.6 million articles?
Do I need to upgrade my flash drive?

We can see that the ratio of words to articles is:
  340,000,000 / 1,400,000 = ~242.857
meaning that there are that many words per article.
Now that there are over 3.6 million articles, how many words ought that be?
  3,646,436 * 242.857 = ~885,562,508 words
and 5 letters per word gives:
  885,562,508 * 5 = 4,427,812,540 letters
And since letter count = byte count, we can simplify everything down
to a number in gigabytes.
Thus, the sum total of all data on the English Wikipedia is:
  4.428 GB

Looks like I'll need a new drive.
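The whole estimate, redone in a few lines of Python:

```python
# Back-of-envelope figures from the Wikipedia FAQ comparison above.
words_then, articles_then = 340_000_000, 1_400_000
articles_now = 3_646_436
letters_per_word = 5
bytes_per_letter = 1            # one byte per (English) character

words_per_article = words_then / articles_then          # ~242.857
letters_now = articles_now * words_per_article * letters_per_word
gigabytes = letters_now * bytes_per_letter / 1_000_000_000
```

This comes out around 4.43 GB; the tiny difference from the hand-worked 4,427,812,540 letters above is just rounding of the words-per-article ratio.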

Note: I wonder how big of a file grep would produce from a search of "the"?

Wednesday, May 25, 2011

XKCD command-line app

Made an xkcd command-line app that allows you to download and view xkcd comics.
Check it out here! 

You'll need python3 and httplib2, which aren't hard to get.
Let me know what you think, and by all means report bugs!
I'm lacking a bit of Windows support, to be honest...

Saturday, May 14, 2011

The Beauty of Infinite Data Structures

While a similar concept is possible in Python (and other languages) using iterators or generators,
there is something about infinite data structures in Haskell that I find very attractive.
I'll introduce a few here to demonstrate how pretty they can be.

** Prime Numbers **
Every self-respecting programmer needs a prime number generator/solver
laying around in their coding tool box.
Assume we have a function 'next_prime' that tells us what
the next higher prime is relative to the one given. How could we
implement a generator in Python to yield a set of prime numbers?

Well...

def prime_gen(limit):
    '''Mmm. Primes.'''
    prime = 2
    for x in range(limit):
        yield prime
        prime = next_prime(prime)

Sure, that ought to do the trick. Now just add some of this:

>>> list(prime_gen(10))

into a script or the interpreter, and we get:

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

Great! Primes. We got them safe and sound, in probably one of the shortest
ways we could have. And there's nothing bad to be said about that, but
functional programming simply gives us a more elegant way.

primes = 2 : 3 : filter isPrime [5,7..]

This is assuming we have a function 'isPrime' of type:

isPrime :: Integral a => a -> Bool

which of course tells us if a given number is prime or not.

A simple:

take 10 primes

thus gives the same result as the Python code above. There are, however,
a few things to notice. 'primes' itself is a list. An infinite one, at that.
We declare it as an expression, which can be read as plain English:

We define primes to be a list, made up of 2, 3, and all the prime numbers
to be found among the odd natural numbers.

We could of course have written: primes = 2 : filter isPrime [3,5..]
but personally, I didn't want 2 to be lonely. He's the only even prime.
How would you feel if you were the only exception in an infinite set of
numbers? I know his pain... I'm a white guy living in Japan. So I gave him a friend.

Anyway...

Our list expression is an agreement with Haskell; it tells Haskell what
a list of all primes would look like, were it to exist. Haskell, being lazy,
doesn't calculate all the values (it couldn't, even if it tried), and so we are able
to define lists like this, skipping the idea of an intermediate generator
like in the Python code above.


** Fibonacci Numbers **
Next, we'll look at a nice equation for a list of all the fibonacci numbers:

fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

Here's a fibonacci iterator implemented in Python.

This expression is different from the one above, in that it references
itself in its own creation. If you're unfamiliar with what zipWith does, just check this out.

To get a feel for how this expression works, let's see what would happen
if we did, say: take 5 fibs

take 5 [0,1,??]  -- fibs doesn't exist yet, but we know its first two values.
0 : take 4 [1,??]  -- Haskell looked at our definition, and took the 0.
0 : 1 : take 3 [??]  -- None of the rest of the list has been calculated yet.
0 : 1 : (0 + 1) : take 2 [??]  -- zipWith in action! Element 0 of fibs plus element 0 of tail fibs.
0 : 1 : 1 : (1 + 1) : take 1 [??]  -- zipWith takes another step.
0 : 1 : 1 : 2 : (1 + 2) : take 0 [??]  -- And another.
0 : 1 : 1 : 2 : 3 : []  -- Done!
[0,1,1,2,3]  -- The return value.

My mind was blown the first time I was shown this. Never would I have thought
to define a list in terms of itself. I was still in procedural mode then,
but this really opened my eyes. Procedural languages give us "algorithms
which calculate the fibonacci numbers" while functional languages give us
the freedom to define infinite data structures and say:
"These ARE the fibonacci numbers".

I think it's an important distinction.
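
For comparison with the Haskell definition, here's a sketch of what a Python
generator for the Fibonacci numbers might look like (the linked script isn't
reproduced here, so this is just one plausible version):

```python
from itertools import islice

def fibs():
    """Infinite generator of Fibonacci numbers."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# islice lets us take a finite prefix of the infinite stream,
# much like 'take 5 fibs' in Haskell.
print(list(islice(fibs(), 5)))  # [0, 1, 1, 2, 3]
```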

Note to any Python programmers: Any better ways to generate the primes?
I'd love to see other solutions.

Tuesday, May 10, 2011

Haskell Type Woes

Everything has a learning curve, and Haskell is no exception. One thing that seems to take getting used to for new learners of
Haskell (myself included, especially) is its strong type system. Coming
from a background in Python (C and Java before that), I'm used to fluid
return types and the language taking care of almost all aspects of type for me.

Enter Haskell, where everyone has to be on the same page about what type
a function's argument is, and what type it is returning.

While not mandatory, function types can be included with their
definitions, as follows:

square :: Int -> Int
square n = n * n

which is clearly a function that takes one argument and squares it.
Haskell knows this function takes an Int, and returns an Int. If passed
a Char or a Float, it will explode. And rightly so. That would be breaking
your promise to it.

"But I want a function that can handle large squares."
And this happens, for sure (I'm looking at you, Project Euler).
In that case, we can change our function's type to look like this:

square :: Integer -> Integer

and our function will be able to handle a number as big as can be stored in
your computer's memory.

"But I want a function that can handle any number!"

Alright, fine. Here you go:

square :: Num a => a -> a

This essentially allows any number to be passed to the function, then squared.
The following will all work:

Prelude> square 5
25
Prelude> square 1.5
2.25
Prelude> square 123821935012837612834
15331871590323381123938231883275681511556

And this is all fairly straightforward.
My problems, however, have been coming from mid-function type changes,
in which you want a number of one type to be changed to another.

The following, for example, doesn't work.

toFloat :: Int -> Float
toFloat n = n

Nor does a vain attempt at C casting.

toFloat :: Int -> Float
toFloat n = n :: Float

But this will.

toFloat :: Int -> Float
toFloat n = fromIntegral n

Recall our definition of the function 'square' above. It takes a Num
and returns a Num. But Num itself isn't a type. The actual value returned
needs to be something concrete. Take a look at this:

*Main> let x = square 5
*Main> :t x
x :: Integer

Who told x to be an Integer? Well, Haskell did. Wait a minute...

*Main> let y = square 5.5
*Main> :t y
y :: Double

Aha! Proof. Haskell is dealing with the type change for us, since we only
told it that our function 'square' takes Nums and returns Nums.
(Both Integer and Double are part of the Num typeclass)

Here is where the magic of fromIntegral (and functions like it) comes in.
Doing,

toFloat :: Int -> Float
toFloat n = fromIntegral n

tells Haskell, "Yeah, I know n was an Int, but now it's an ambiguous Integral."
It is as if you had defined the type of toFloat to be:

toFloat :: Integral a => a -> Float  -- Though just this wouldn't fix the problem

Haskell then looks at the return type, since a value can't be just
an 'Integral' and says "I declare you a Float!" and everyone lives happily
ever after.

As a more contrived example, consider some library functions I wrote
to convert Integral numbers to lists of their digits (and back again).

import Data.Int (Int8())  -- I wish there were Int4s, even.

itol :: Integral a => a -> [Int8]
itol n | n == 0    = [0]
       | n < 0     = error "No negative numbers! Bad!"
       | otherwise = itol' n

itol' :: Integral a => a -> [Int8]
-- This does all the work.
itol' 0 = []
itol' n = itol' (n `div` 10) ++ [fromIntegral $ n `mod` 10]

ltoi :: [Int8] -> Integer
ltoi ns  = foldl1 (\acc n -> acc * 10 + n) $ map fromIntegral ns

The function itol turns any Integral into a list of Int8s. We don't need
normal 32-bit Ints, so we use the lowest we can, Int8s, to save memory.
Notice the appearance of fromIntegral in the code above.

Let's look at two of the lines from above. First,

itol' n = itol' (n `div` 10) ++ [fromIntegral $ n `mod` 10]

This function accepts any Integral and returns a list of Int8s.
Just,

itol' n = itol' (n `div` 10) ++ [n `mod` 10]

would explode: n `mod` 10 returns a number of the same type as n, which then
gets wrapped in a list and concatenated with itol' (n `div` 10).
Looking at the types, we get:

[Int8] ++ [Integral a]  -- Where 'Integral a' is whatever type n was.

This won't work, as lists with different element types can't be concatenated.
(Although admittedly, had n been an Int8 it would have worked, but then we
wouldn't have a very big range of numbers to work with.)

[fromIntegral $ n `mod` 10]

will thus convince the result of n `mod` 10 to become an ambiguous Integral,
which Haskell will then force to an Int8, as that is the return type
of the function. We then have,

[Int8] ++ [Int8]

and everything is fine.

The second line we'll look at is,

ltoi ns = foldl1 (\acc n -> acc * 10 + n) $ map fromIntegral ns

which takes a list of digits like one we'd create with itol, and turns it
back into an Integer (not just an Int).
The goal of this function is to take every little Int8 in the list and
squeeze them all together into an Integer using a fold. Since an 8-bit
integer would roll-over pretty quickly, all the values need to be forced out
of their 8-bitness from the get-go. Hence we have,

map fromIntegral ns

which moves through the list of Int8s, ns, and convinces them they aren't
just 8-bits anymore.

Our example is concluded. Hopefully this clears up type changes for someone.

Bottom line: When converting between number types, if the compiler is
whining at you, chances are you're missing a cleverly placed fromIntegral.
Really think about what types your variables are, and what the return types
of operations performed on them are!

Sunday, May 8, 2011

Walking to the Sun

   My girlfriend doesn't have a scientific education, and I was thinking of ways to teach her about some of the things I love in a way that she could easily understand. When pondering how to explain the size of the solar system (and how small we are), it occurred to me that describing the distance from the Earth to the Sun in terms of how long it would take to walk there could put things into perspective.
   Here is what I found.

Average adult human walking speed,
   Let ws = 5 km/h

Walking distance per day (walking 20 hours, allowing time for rest),
   Let dpd = ws * 20 = 100 km

Mean distance from the Earth to the Sun (1 AU),
   Let dts = 149,600,000 km

Thus,
Days to the sun,
   Let days = dts / dpd = 1,496,000

Years to the sun,
   Let years = days / 365.25 = 4095.82
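
As a quick check of the arithmetic (taking 20 hours of walking per day to get
the 100 km figure):

```python
ws = 5                 # walking speed, km/h
dpd = ws * 20          # km per day, walking 20 hours a day
dts = 149_600_000      # mean Earth-Sun distance, km (1 AU)

days = dts / dpd
years = days / 365.25
print(f"{days:,.0f} days, or about {years:.2f} years")
```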

Just shy of 4100 years. And what were humans doing 4100 years ago?
-Ancient Egypt was in its 'First Intermediate Period' where the first centralized power had collapsed and Egypt was governed by local rulers. The great pyramids had been built for hundreds of years.
-If the Xia Dynasty in China ever existed (I don't personally think it did), it would have been founded somewhere around now. Confucius wouldn't be born for another 1500 years.
-Japan was in the Middle Jomon period, and its peoples were starting to live in larger settlements and practice agriculture.

   I'm reminded of the episode of Carl Sagan's Cosmos, where he describes the cosmic calendar, and shows all of human history accumulated in the last seconds of December 31st.
Invent civilization, let it stew. Take a stroll to the sun while you wait.
You'd get there today.