## 30 August 2010

### REVISED: More fun pointless code metrics

My apologies to everyone who looked at the numbers for GCC before. I had run cloc from the wrong directory and it was counting all the lines of code in all the source trees on my computer. Oops! The GCC numbers listed at the bottom are corrected. While I'm at it, let's look at GHC vs. GCC. Here is GHC:
1177 text files.
1159 unique files.
247 files ignored.

http://cloc.sourceforge.net v 1.51  T=56.0 s (16.6 files/s, 6818.4 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
C                               139          10199          13583          46412
XML                              33           3839            539          29416
C/C++ Header                    170           3171           5236           8509
HTML                             41            708             82           6761
Perl                             31           1578           1526           6284
Bourne Shell                     26            410            794           3983
Pascal                            2            753            356           2711
m4                                2            293            171           1959
yacc                              4            261              0           1465
make                             42            216            397            606
Bourne Again Shell                5             75            138            393
Lisp                              1             65             76            291
Assembly                          1             34             30            125
Teamcenter def                    3             20              0             54
CSS                               2              8              0             36
D                                 1             10             29             23
C Shell                           2             24             41             20
--------------------------------------------------------------------------------
SUM:                            929          59877          87076         234878
--------------------------------------------------------------------------------

And GCC:
58499 text files.
57735 unique files.
116986 files ignored.

http://cloc.sourceforge.net v 1.51  T=2794.0 s (19.4 files/s, 2776.8 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
C                             15704         350409         365314        1790292
Java                           6324         169097         641639         680923
C/C++ Header                   9712         135787         126549         659612
C++                           12898          98958         127224         428592
Bourne Shell                    126          56208          48639         317192
HTML                            440          30509           5304         139342
Fortran 90                     2978          12630          22272          71604
m4                              163           6547           1879          57799
Assembly                        195           7543           9400          45902
XML                              56           3621            214          29679
make                            115           3193            941          20374
Teamcenter def                   76           2903            379          18601
Expect                          219           4214           7719          16402
Fortran 77                      381            917           2776          10119
Objective C                     275           1922           1092           7021
Perl                             24            760           1157           4121
XSLT                             20            563            436           2805
Bourne Again Shell               12            372            599           1675
CSS                               8            332            143           1427
awk                              10            221            373           1370
Python                            4            301            142           1311
yacc                              2            107            109            987
Pascal                            4            218            200            985
C#                                9            230            506            879
MUMPS                             4            121              0            521
Tcl/Tk                            1             72            112            393
ASP.Net                           7             37              0            203
lex                               1             36             27            150
NAnt scripts                      2             17              0            148
MSBuild scripts                   1              1              0            140
Javascript                        2             20             81            122
Lisp                              1              4             21             59
MATLAB                            3             13              0             52
DTD                               3             28             70             26
Fortran 95                        2             10              7             21
DOS Batch                         3              0              0              7
--------------------------------------------------------------------------------
SUM:                          54338        1115068        1673037        4970376
--------------------------------------------------------------------------------

Keep in mind, the GCC tree includes lots of languages because it's a compiler for lots of things. And it has a lot of tests.

## 29 August 2010

### Fun, but pointless, code metrics

I'm not really sure what motivated this, but I just used cloc to count the lines of code in both the darcs source and the git source.  Here are the numbers. The git source tree:
1951 text files.
1836 unique files.
848 files ignored.

http://cloc.sourceforge.net v 1.51  T=15.0 s (72.3 files/s, 20377.3 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
C                               267          15517          13469         100133
Bourne Shell                    589          15127           5508          84826
Perl                             40           3798           3441          23825
Tcl/Tk                           39           1453            375           9762
C/C++ Header                     99           1977           3557           8301
make                             12            413            434           2673
Bourne Again Shell                1            144            110           2165
Lisp                              2            231            170           1779
Python                           13            465            442           1384
ASP.Net                           8            141              0            931
m4                                2             87             21            858
CSS                               2            154             24            710
Javascript                        2            113            319            477
Assembly                          1             26            100             98
XSLT                              7             15             29             77
DOS Batch                         1              0              0              1
--------------------------------------------------------------------------------
SUM:                           1085          39661          27999         238000
--------------------------------------------------------------------------------

The darcs source tree:
561 text files.
549 unique files.
57 files ignored.

http://cloc.sourceforge.net v 1.51  T=189.0 s (2.7 files/s, 298.0 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
Bourne Shell                    300           2071           2869           8333
C                                 7            325            153           1494
HTML                              5             41              4            316
C/C++ Header                     12             92             83            308
Bourne Again Shell                3             51             95            180
Perl                              2             43             36            130
CSS                               1             21              3             79
make                              1             12              6             53
Lisp                              1              5              6             23
--------------------------------------------------------------------------------
SUM:                            501           7022          10629          38676
--------------------------------------------------------------------------------

Take those categories with a grain of salt. For example, the darcs source does not have any lisp files. It is interesting that git has 200k more lines than darcs. I'm not sure what that says. C is far more verbose than Haskell? Although, that's not really fair because they also have an order of magnitude more shell code. If you're just comparing C to Haskell it's a factor of about 4.

### Syntaxhighlighter on blogspot?

I used to use SyntaxHighlighter but when community.haskell.org went down I lost my brush files. Now I'm going to switch to github gists as detailed here.

What do you think? I quite like it.

## 01 August 2010

### Takusen Tutorial, Part 1: Hello, Takusen

With the recent release of Takusen 0.8.6, several people asked for a tutorial.  Hopefully I can help people get up to speed.

If you haven't heard, Takusen is an industrial strength database library written in Haskell.  Some of the reasons I like Takusen over its competitors:
• Good performance
• Supports the iteratee style so that you can stream your results from the database
• It has a test suite
The BSD license is nice because it means we can use it at Galois, and we do.  A project I worked on recently used Takusen to communicate with the database.  I found Takusen easy to debug with and it worked quite well.

If you like to start by reading lots of background information you might like to read these articles, but I will not assume you have read them in this tutorial:
In this tutorial you will learn:
1. How to install Takusen with the right backend(s) for your database.
2. How to write a "Hello, Takusen!" query.
Let's Get Started!

1. Getting things Setup!
First off, I will describe the environment I used to write this tutorial.  I like to develop using virtual machines whenever I can.  This allows me to start from a clean environment without interfering with any of my other projects.  I chose Virtual Box as my emulation software.  I installed Debian Squeeze, from here.

During the install I requested a web server, ssh server, and SQL database.  My normal account name is 'dagit', as you will see below.

After the install finished, I installed the following packages:
• sudo
• pkg-config
• ghc6
• ghc6-prof
• cabal-install
• libz-dev
• postgresql-server-dev-8.4
• sqlite3
• libsqlite3-dev
• unixodbc-dev
If you are only going to use a specific database then you can safely leave out the other database packages above.  For example, you only need unixodbc-dev if you are going to install the ODBC backend, see below.  This command should install everything above:
$apt-get install sudo pkg-config ghc6 ghc6-prof cabal-install libz-dev postgresql-server-dev-8.4 sqlite3 libsqlite3-dev unixodbc-dev Next, I ran: $ cabal install QuickCheck --constraint="==1.*"
At this point you should have all the dependencies of Takusen.  The next step is to install the backends we might want to use.  For example:
$cabal install Takusen -fpostgres -fsqlite -fodbc The command above will give us the postgres, sqlite, and odbc backends. If you don't need a backend, just omit it during the cabal-install command above. At this point I installed a few other things such as emacs and darcs to make my development experience friendlier. Configuring postgres is beyond the scope of this tutorial, but for reference here are the commands I used to get started with a "hellotakusen" database. $ sudo su postgres # Switching to postgres user
$createuser dagit # same as your unix account name Shall the new role be a superuser? (y/n) y$ createdb dagit # create dagit's default db
I made myself a superuser for convenience, after all this is just a dev machine.   Now switch back to your normal user account.  In my case, that account is 'dagit'.  Now as 'dagit', I run:
$createdb hellotakusen This gives us a demo database separate from the default user database. Let's double check our database: $ psql -d hellotakusen
psql (8.4.4)
Type "help" for help.

hellotakusen=#
Okay, looks good.  Now let's try a simple query:
hellotakusen=# select 'Hello, Takusen!';

?column?
----------------
Hello, Takusen!

(1 row)
2. Time do a query with Takusen!
Now we're ready to do the same thing, but from Haskell.  Let's start with GHCi.  I'll run through the minimum set of commands to get this query working, then I'll explain what we're doing at each step, and why:
\$ ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Prelude> :m + Database.PostgreSQL.Enumerator
Prelude Database.PostgreSQL.Enumerator> let connection = connect [CAdbname "hellotakusen"]

Now we have a connection structure to our Postgres database.  Now we switch to the interface just one level above the database specific one to define an iteratee to fetch our results.
Prelude Database.PostgreSQL.Enumerator> :m + Database.Enumerator
Prelude Database.PostgreSQL.Enumerator Database.Enumerator> let { iter :: Monad m => String -> IterAct m (Maybe String); iter msg accum = result' (Just msg) }
We will use iter with doQuery, to fetch the result of our query,  and immediately print it like this:
Prelude Database.PostgreSQL.Enumerator Database.Enumerator> :m + Control.Monad.Trans
Prelude Database.PostgreSQL.Enumerator Database.Enumerator Control.Monad.Trans> withSession connection (doQuery (sql "select 'Hello, Takusen!'") iter Nothing >>= \(Just r) -> liftIO (putStrLn r))

Hello, Takusen!

Now it's time to start explaining!

We have to give a type signature to iter or else we'll get an incomprehensible error message involving functional dependencies:
Prelude Database.PostgreSQL.Enumerator Database.Enumerator Control.Monad.Trans> withSession connection (doQuery (sql "select 'Hello, Takusen!'") iter Nothing >>= \(Just r) -> liftIO (putStrLn r))

:1:24:
Overlapping instances for Database.Enumerator.QueryIteratee
(DBM mark Session)
Database.PostgreSQL.Enumerator.Query
(a -> t -> m (IterResult (Maybe a)))
(Maybe String)
Database.PostgreSQL.Enumerator.ColumnBuffer
arising from a use of doQuery' at :1:24-75
Matching instances:
instance [overlap ok] (Database.Enumerator.QueryIteratee
m q i' seed b,
DBType a q b) =>
Database.Enumerator.QueryIteratee m q (a -> i') seed b
-- Defined in Database.Enumerator
instance [overlap ok] (DBType a q b, MonadIO m) =>
Database.Enumerator.QueryIteratee
m q (a -> seed -> m (IterResult seed)) seed b
-- Defined in Database.Enumerator
(The choice depends on the instantiation of mark, a, t, m'
To pick the first instance above, use -XIncoherentInstances
when compiling the other instance declarations)
In the first argument of (>>=)', namely
doQuery (sql "select 'Hello, Takusen!'") iter Nothing'
In the second argument of withSession', namely
(doQuery (sql "select 'Hello, Takusen!'") iter Nothing
>>=
\ (Just r) -> liftIO (putStrLn r))'
In the expression:
withSession
connection
(doQuery (sql "select 'Hello, Takusen!'") iter Nothing
>>=
\ (Just r) -> liftIO (putStrLn r))

The problem itself is fairly simple.  If we don't give the explicit signature, then the inferred type of iter is:
iter :: (Monad m) => a -> t -> m (IterResult (Maybe a))

Compared with our supplied type:
iter :: Monad m => String -> IterAct m (Maybe String)

You should be asking yourself, what is the difference between "t -> m (IterResult (Maybe a))" and "IterAct m (Maybe String)".  Checking with the Takusen haddock, we see this definition:
type IterAct m seedType = seedType -> m (IterResult seedType)

Let's expand the IterAct type so we can more clearly compare the inferred type to the correct type:
iter :: Monad m => String -> Maybe String -> m (IterResult (Maybe String))

So there is the problem.  If we don't write the explicit type signature there was nothing in our definition of iter to help ghci infer that the second parameter is a Maybe.  You'll also noticed that in the type synonym, IterAct that it is a function from a seed type to an action containing an IterResult.

The seed here works just like the seed in a left fold.  The type of foldl is:
foldl :: (a -> b -> a) -> a -> [b] -> a

The second parameter is the seed type of foldl and it is of type a, here.  If you expand out a left fold, you will get an expression like this:
foldl f z [a,b,c,d] = f (f (f (f z a) b) c) d


When we pass iter to doQuery it uses iter much like foldl uses f.  In the call to foldl, I called the seed z, and it is passed as the first parameter to f.  First f will combine z and a.  Then the result will be passed to f along with b.  The result keeps getting fed to f in this manner till we hit the end of the list.  So the first parameter to f is an accumulator because f can pass to itself the result of the previous call of f.  The first time f is called it will receive the seed value as the accumulator.

When we defined iter we used the same exact convention, but you didn't see the seed in the definition because the function result' hides this detail for convenience.  You can see it in the type when we expand out the type synonym IterAct.

Where is the seed?  The seed was the last parameter we passed to doQuery.  In the example above, it was Nothing, and iter ignored it.  As this tutorial series expands, I will show you how to use the seed value.

Time to recap.

One simple rule is: Always try to give an explicit type signature to your iteratee function, here we called it iter.  If you get it wrong, the error message will be more forgiving than the error message you get from using it with doQuery.

Another important lesson we learned is that our iteratees work with doQuery in exactly the same way as functions passed to foldl.  Specifically, they take an accumulator that has the same type as the seed we pass to doQuery.  The iteratee is responsible for combining the accumulator and a 'current value' to produce a result.  In a future tutorial in this series, I will show you how to accumulate a list of results in the iteratee.