a-coda · 6 months
Noisy People
"To understand error in judgment, we must understand both bias and noise. Wherever there is judgment, there is noise - and more of it than you think." -- Kahneman, Sibony, and Sunstein
"Noise" by Daniel Kahneman, Olivier Sibony, and Cass R. Sunstein is essentially a follow-on book from the brilliant "Thinking Fast and Slow".
While "Thinking Fast and Slow" deals with biases, "Noise" deals with more random variation. Even so, the authors argue noise represents as significant a cause of poor outcomes as bias. The latter skews results in a particular direction but noise causes an unwanted spread of results.
The authors look at noise in many areas requiring judgement, including law, medicine, forensics, economics, and management. For example, in the US, a 1973 study of variation in legal judgements described two people found guilty of cashing similar counterfeit cheques: one got 30 days and the other got 15 years. These undesirable differences were eventually tackled by the 1984 Sentencing Reform Act, whose guidelines significantly reduced sentencing variation. Unfortunately, the guidelines were made merely advisory in 2005 and noise has crept back into the system.
The authors break down noise into various types:
level noise: variability in the average level of judgements by different judges
stable pattern noise: variability in judges' responses to particular cases
occasion noise: variability due to transient effects
They feel noise is often a silent killer (of good decisions) within organisations. While biases have been highlighted as a source of discrimination and poor outcomes, noise is more pernicious: those in authority are often oblivious to its existence and impact.
To expose the problem, noise audits are recommended where the same task (which can be real or fictitious) is assigned to multiple experts. To remedy the problem, the principles of decision hygiene must be applied, which are:
The goal of judgement is accuracy, not individual expression
Think statistically and take the outside view of the case
Structure judgements into several independent tasks
Resist premature intuitions
Obtain independent judgements from multiple judges, and then consider aggregating those results
Favour relative judgements and relative scales
There is a lot to reflect on in the book, especially for commonplace activities like hiring, which affects all workplaces. Overall, the book is good but perhaps does not have the same depth and density as "Thinking Fast and Slow".
"Noise" also feels a bit like a companion to "The Signal and the Noise" by Nate Silver.
A surprising lesson from the book was that simple mathematical models of judges' decisions (even really bad ones) often out-performed the judges themselves. This seemed to be because such models capture the essence of the judges' rules while eliminating the human variability.
This struck a chord with me as I recently had a bet with a group of a dozen or so acquaintances. We had to guess the order in which that season's teams would finish in the English Premier League (this is football aka soccer). The others had played before but it was my first time.
I decided to use an algorithmic approach. I concocted a simple formula which factored in a team's previous position and their subsequent squad investment (net spend on recruitment). I was the only person who bet that none of the newly promoted teams would be relegated, which started to worry me.
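Something in the spirit of the following sketch, though the weighting here is made up for illustration and is not my actual formula:

def predicted_position(previous_position, net_spend_rank, weight=0.7):
    # blend last season's finish with a ranking of summer net spend
    return weight * previous_position + (1 - weight) * net_spend_rank

Sorting the twenty teams by that blended score then gives a predicted table.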
But amazingly I won the bet! My take-home lesson after reading "Noise" was that the others' judgements were probably much more coloured by their personal opinions of the teams at the time they were asked. I came to the bet with a much more dispassionate and quantitative evaluation of their prospects. As Steven Pinker says:
"Cognitive psychology tells us that the unaided human mind is vulnerable to many fallacies and illusions because of its reliance on its memory for vivid anecdotes rather than systematic statistics."
Finally, the book also looks at the judgement calls involved in insurance underwriting and claims adjustment. I work at an insurance company but ironically I am personally in the business of deliberate noise generation as I work on software which does stochastic modelling.
"Causally, noise is nowhere; statistically, it is everywhere." ― Kahneman, Sibony, and Sunstein
a-coda · 1 year
Mentioned in Dispatches
It is my official last day at Linguamatics (part of IQVIA). It has been fifteen years: "Man and boy", as my ex-colleague Chris would have said. It was a rollercoaster ride working on ground-breaking software with amazing people. But it is time to move on and do something new.
I am stepping away from development management and going back to programming. It feels like I have served my time in the management trenches and I have earned the right to do some of the fun hands-on stuff again.
I am also changing industry from Healthtech to Fintech, and I am going to be programming in C# for the first time, so I have got a lot to learn. That is all part of the attraction.
Fortunately, I was granted some extended leave at the end of my tenure and this gave me time to start looking ahead to C# and .NET. To this end, I have been using C# 10 in a Nutshell by Joseph Albahari, and the Exercism online exercise site. The latter was recommended by QA Hiccupps.
The book is 1000+ pages and I had to set a daily target to get through it. Overall, I thought it was thorough and readable. I only found two or three typos. Some parts of the language and the libraries seemed more deeply treated than others. I guess the author was more interested in those or more concerned about readers getting them right.
The training site was also good. It was easy to develop and submit exercises entirely within the browser-based environment. Debugging is limited to writing to the console. This is mostly fine for these kinds of exercises and test suites. However, the approach was insufficient for algorithms getting lost in large search spaces. A good tactic there was to introduce iteration or recursion limits as a backstop. Of course, that is the kind of thing reliable software needs anyway.
At the time of writing I have done all but 2 of the 164 C# exercises. The remaining ones require you to download the exercise infrastructure locally, which I have not done yet. There was only one other exercise accidentally requiring some local debugging. This used a property-based testing framework which ate all console output.
Disproportionately, the worst exercise was one to calculate ten-pin bowling scores. Determining the end of the game and rejecting illegal throws seemed to be a nightmare.
C# is mostly a sister language to Java, so what is special about it? The differences actually have their own page on Wikipedia so I will not duplicate all of that here. Instead, the C# flourishes that stood out for me while doing the exercises and reading the book were:
a more unified type system - I could make lists of unboxed ints
LINQ - C#'s most famous feature, but mostly like Java Streams the way I was using it
iterators - you can make them using "yield" like in Python - they work really well with LINQ
extension methods - it is sometimes neater to appear to add methods to existing types but there are limitations because it is faked
multi argument dispatch - there to support dynamic languages, and you have to opt in, but makes the Visitor Pattern much easier - you can even meddle with CallSite objects which manage the dynamic dispatch
expression trees - looks like it might be fun to hack some code generation
no checked exceptions - woo hoo!
I slightly missed Java's richer enums but C#'s are aimed at interoperability.
And, of course, C# 11 came out while I was learning C# 10 but twas ever thus ...
"Just because people tell you it can't be done, that doesn't necessarily mean that it can't be done. It just means that they can't do it." -- Anders Hejlsberg
a-coda · 2 years
Not a valid DuckDB database file!
"If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family anatidae on our hands." -- Douglas Adams
I recently learned about DuckDB. It is an embedded file-based database like SQLite. However, DuckDB is column-oriented and designed more for OLAP workloads than SQLite, which is row-oriented. DuckDB can also process CSV and Parquet format files. The latter are compressed big-data files in a column-oriented format. DuckDB sounded like it might be a useful building block.
I decided to experiment with DuckDB in an environment that let me explore a Parquet file dynamically. Clojure felt like it would fit the bill. It was bound to have a DSL for SQL which made it easy. And, what the hell, I could probably also learn Visual Studio Code while doing this too.
Cue lots of pain setting up VS Code, setting up a Clojure project, including an SQL DSL (Korma), and trying to get DuckDB's driver installed. After fighting through all that I then hit a problem trying to read any sample Parquet file: "The file is not a valid DuckDB database file!". That seemed strange given the examples. Google found no hits for that error. Perhaps the hand-installed driver or the abandonware DSL was to blame? I suspended my Clojure experiments.
C# has LINQ. Hell, I bet I could learn the language and the extension and query the Parquet files via DuckDB there!
Cue more pain setting up .NET and VS Code again for that. DuckDB.NET was handy although it wasn't clear how to add the underlying DuckDB DLL to the project other than by dropping it into the "bin" directory. I decided to use an Object-Relational Mapper called Dapper to look at the Parquet files. LINQ looked like it needed too much setup for SQL. But then I ran into the same error message: "The file is not a valid DuckDB database file!". Perhaps I needed to try an environment with a less personally bodged setup? I suspended my C# experiments.
Python is one of the more natively-supported DuckDB target languages. But one pip-install later I get the same error message: "The file is not a valid DuckDB database file!". Perhaps I needed to think more about what the error message actually meant? Then it finally dawned on me that DuckDB did not support directly opening a Parquet file as a database. Instead it only supported reading from a Parquet file after you had already opened some other database - which could be an empty in-memory one.
It was subsequently easy to make it work in Python and I could 'pop tech stacks' to C# and thence to Clojure and make it work in those environments. See below for the examples.
Some lessons along the way:
VS Code has some nice Clojure REPL integration
.NET has a nice package/project manager
Dapper is a nice Object-Relational Mapper
Dynamic languages are nicer for exploring unknown data
Powershell.exe is not quite the same as pwsh.exe
VS Code has lots of annoying windows
But the real moral of the story, apart from not trying to shave too many yaks at once, was:
Read the error message (again)!
Now I just need to remember what I was going to do with DuckDB ...
Python
import duckdb

con = duckdb.connect(database=':memory:')
con.execute("SELECT first_name, last_name FROM read_parquet('userdata1.parquet')")
for p in con.fetchall():
    print(p)
C#
using DuckDB.NET.Data;
using Dapper;

namespace ducksharp
{
    record struct Person(string first_name, string last_name);

    class Program
    {
        static void Main(string[] args)
        {
            using (var duckDBConnection = new DuckDBConnection("Data Source=:memory:"))
            {
                duckDBConnection.Open();
                var people = duckDBConnection.Query<Person>("SELECT first_name, last_name FROM read_parquet('userdata1.parquet')");
                foreach (Person p in people)
                {
                    Console.WriteLine(p);
                }
            }
        }
    }
}
Clojure
(require '[clojure.java.jdbc])

(def db {:classname "org.duckdb.DuckDBDriver"
         :dbtype "duckdb"
         :connection-uri "jdbc:duckdb:"})

(run! println
      (clojure.java.jdbc/query db ["select first_name, last_name from read_parquet('userdata1.parquet')"]))
a-coda · 2 years
There is a Silver Bullet
Fred Brooks said in 1986 that there would be "No Silver Bullet" in software development. He meant that no single technique or technology would deliver the kind of order-of-magnitude gains hardware enjoys from Moore's Law.
One could argue that with emerging components and platforms, a lot can now be achieved by plumbing together off-the-shelf parts. However, what I would really like to claim is the "silver bullet" at all levels of software development is feedback. Fast, high quality, feedback is the most powerful driving force we have.
In NLP search we get domain feedback in terms of an F-score, the harmonic mean of the precision (the fraction of retrieved items that are relevant) and the recall (the fraction of relevant items that are retrieved).
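As a minimal sketch of that calculation (the argument names are just illustrative):

def f_score(relevant_retrieved, retrieved, relevant):
    precision = relevant_retrieved / retrieved   # fraction of retrieved items that are relevant
    recall = relevant_retrieved / relevant       # fraction of relevant items that are retrieved
    return 2 * precision * recall / (precision + recall)

For example, 40 relevant hits out of 50 retrieved, with 80 relevant items in total, gives a precision of 0.8, a recall of 0.5, and an F-score of about 0.62.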
But many software engineering tools create feedback loops:
editing (IDE/editor highlighting)
read-eval-printing
compiling
debugging
profiling
testing
static analysing
code reviewing
model checking
operational monitoring
People, even imaginary or abstract ones, also provide feedback:
"rubber duck" or "pot plant"
peers
testers
stakeholders
managers
prospects
customers
users
"the market"
"open source" (many eyes)
Just recently someone told me they had worked on audio software. The software tended to be in C and C++, but the programmers needed audible feedback much faster than the usual build process would allow. So they developed real-time incremental compilation in which updates could be stitched into the live program. They did this just so the programmers could hear their changes as they made them. [1]
The quicker you can get good feedback from any and every environment affected by the software the better you can make it.
Conversely, every barrier to, gap in, or delay in this feedback is lost ground and is detrimental to the outcome.
Sometimes this means you must ask for feedback in parallel to avoid one kind of feedback blocking another.
Sometimes this means you must curate too much, conflicting, or inconsistent feedback.
Sometimes this means you must incorporate feedback from one source before asking for feedback from another - so you don't get repeat feedback.
Sometimes this means you must stay true to your core values so you are not buffeted by feedback.
Sometimes this means you must request feedback from a known positive source so you are motivated.
Sometimes this means you must request feedback from a known negative source so you are realistic.
Sometimes this means you must code the feedback process itself:
"How will I know if this breaks?" -- Charity Majors (from "I test in prod")
Silicon Valley has justifiably made feedback its religion:
"Fail fast, fail often"
[1] Yes, obviously it would have been easier to use a language designed for this like Common Lisp! But it does demonstrate the need for timely feedback.
a-coda · 2 years
A Wordle Noodle
"Words, words, mere words, no matter from the heart." -- William Shakespeare
The world and her husband have been playing Wordle. It is a word-guessing game that is quick to learn and fits around your morning coffee. What seems to make it popular is that there is only one word per day, everyone gets the same word, and you can publish your attempts without giving away the answer. Wordle is a social media hit.
The success of Wordle has led to imitators in other areas including Globle, Worldle, and Nerdle, and also to a big payday for its author Josh Wardle.
However, programmers cannot help but try and solve these kinds of problems programmatically, and here is my effort. I am not claiming any great performance or insight. Frankly, it was just a bit of fun and an excuse to write some Julia again.
The essence of the approach is to start with a big word list, make guesses, and whittle down the remaining words based on the score given. Then rinse and repeat. This means the program is effectively playing in hard mode (where you must follow the clues given).
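The real code is in Julia (linked below), but the whittling idea looks roughly like this Python sketch, which ignores the duplicate-letter subtleties of real Wordle scoring:

def score(guess, answer):
    # 'g' = right letter, right place; 'y' = present elsewhere; '.' = absent
    return ''.join('g' if g == a else ('y' if g in answer else '.')
                   for g, a in zip(guess, answer))

def whittle(words, guess, feedback):
    # keep only the words that would have produced the same feedback
    return [w for w in words if score(guess, w) == feedback]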
The word guesser tries to find a word incorporating as many common letters for each position as it can from the remaining words. Even early buggy versions still did pretty well, though. I am not sure the guesser's performance is really much better than random choice or always choosing the first word. As the choices get fewer it is effectively doing the latter anyway.
A downside of the strategy is that, like some human players, the program can lose when the word has a common structure with a variety of possible letters for the remaining position.
Anyway, the resulting code is in Github. The style is moderately functional and dynamic, with no type definitions or declarations. I initially wrote the algorithms out longhand (eg for finding a maximum) but then refactored it to use higher-level library functions (after I Googled them). Nevertheless I consider it okay to write FORTRAN in Julia because that is its target market!
There is no real command line interface. Just run "solve" in the Julia REPL. There are some asserts that run on loading into the REPL acting as unit tests.
The dictionary I used is from filtering an internet source file down to include just five-letter words while excluding capitalized words. I cannot remember its provenance so I have not included it. However, it means I am not using the same dictionary as Wordle and so sometimes the program wastes attempts on obscure words.
Some lessons I (re)learnt were:
REPLs are great. You can run elements of the code standalone to understand them better.
REPLs are not so great when you think you are testing your new code but actually you are running old code that has accumulated in the REPL.
Expression languages are great for reducing clutter.
Expression languages are not so great when you assign to a variable in a condition by mistake.
Dynamically-typed languages are great for noodling
If I wanted to do further noodling I might use the right dictionary and teach the program a few good start words like an "opening book" in chess. I might also try to relax the hard-mode style so it can look for possibility-reducing words from the whole dictionary.
Other posts on Julia:
Julia's vector
Searching for syntatic suger, man
A recurring dream
Julia dream
a-coda · 2 years
Feelin' Groovy
Slow down, you move too fast
You got to make the morning last
Just kicking down the cobblestones
Looking for fun and feeling groovy
Ba da-da da-da da-da, feeling groovy
-- Paul Simon
Our future is more Groovy than it was.
We have successfully used SCons for many years as a homogeneous build system covering Java, C++, JavaScript, and sundry other tasks. SCons has been part of a suite of Python-based tools which gave us a uniform development fabric (pun intended) including: Mercurial, Buildbot, Ansible, and Conan.
However, over time we found what SCons bought us in terms of uniformity, and flexibility, was offset by it not being particularly good at anything. Third party library code had to be adapted, IDE projects had to be separately maintained, dependencies were not managed, and builds were not as incremental as they could be.
So we have been shifting our build system away from a single generic tool towards more specialized tools. NPM crept in on the JavaScript side for front-end work, and in the last couple of years we swapped out SCons on the C++ side for CMake. Now it is the turn of the Java build system. We're adopting Gradle for this, partly because it still gives us a flexible dynamic language to work in instead of plain XML.
Gradle's extension language is Groovy (or you can use Kotlin). Groovy is a JVM-based scripting language that's been around since 2003 and it is good for extending Java applications. It is also a plugin language for other Java-based development tools like Jenkins.
Groovy is very dynamic with lots of support for building domain specific languages (DSLs) including metaclasses, annotations, multi methods, and various syntactic tricks. Java is a subset of Groovy but only in the same way that JSON is a subset of YAML: idiomatic Groovy doesn't look too much like Java. With its closure blocks I feel Groovy owes a debt to Ruby as much as any language.
Some of the Java developers at Linguamatics have used Groovy for a while, for prototyping, but I have mostly used Python. So in preparation for Gradle I thought I'd do some exercises in Groovy, and where better than the seasonal Advent of Code. I did a few of the puzzles while reading the language reference. What I learned were really two things. First, Groovy feels slow to start for a scripting language and second, despite all its flexibility:
"The determined Real Programmer can write FORTRAN programs in any language." -- Ed Post
a-coda · 2 years
What If ... ?
In the old days HTML was generated on the server and browser page loads fetched whole new HTML pages. JavaScript was used for some limited interactivity.
Then came Ajax, and Single Page Applications (SPAs). Now large JavaScript frameworks update the HTML DOM in the browser from (usually) JSON payloads returned from HTTP requests. The line can be blurred by Server-Side Rendering, but those are the main options.
But what if HTML realized its potential?
Why should only <a> and <form> be able to make HTTP requests?
Why should only click & submit events trigger them?
Why should only GET & POST methods be available?
Why should you only be able to replace the entire screen?
If we removed these restrictions then the HTML page could selectively update and replace parts of itself in cooperation with a server application purely returning HTML elements.
HTMX is a small JavaScript library allowing exactly this.
You basically add special attributes to any element like the following. These make the element talk to the server and swap in new elements there or elsewhere on the page:
hx-post="/clicked" hx-trigger="click" hx-target="#parent-div" hx-swap="outerHTML"
I have had a brief play: see source code. I made a colourful web page where each time you click you replace the current box with two more randomly coloured boxes in a row or column depending on where you click.
I built the server as simply as possible using Python's BaseHTTPRequestHandler.
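For flavour, here is a minimal sketch of that kind of handler (the returned markup is illustrative rather than the actual project code):

from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read the payload HTMX posts along with the triggering event
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        # reply with an HTML fragment for HTMX to swap into the page
        fragment = b'<div class="box" style="background: teal"></div>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.end_headers()
        self.wfile.write(fragment)

HTTPServer(('localhost', 8000), Handler).serve_forever()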
I wanted to push all the decisions to the server so I had to send the mouse event information there. This in turn meant I had to find and use HTMX's "event-header" extension.
I couldn't find an out-of-the-box way to get information about the clicked-on element (for the bounding box) so I hacked the "event-header" extension to add it. Probably I should have made a new extension. Actually, I should have made the orientation calculation in the browser instead of pushing all the raw data to the server.
In the end I'm still more comfortable with SPAs because they are more like traditional desktop UI frameworks. Even so HTMX is an interesting point in the design space. It brings back the hypertext-oriented approach of the web.
"Everything is deeply intertwingled." - Ted Nelson
a-coda · 3 years
Right on CUE
I love configuration files. They parameterize the operation of programs without the need for those pesky user interfaces.
In the old days, programs frequently evolved their own ad hoc and idiosyncratic configuration languages. SendMail and Autoconf are notorious examples in this regard, although (or because?) they both use the m4 macro language.
There were some de facto standards like INI files or even Java properties files. More recently configuration is often in XML, JSON, or YAML. Sometimes these metaformats are yoked to template substitution systems. For an example of the latter see Ansible.
In truth, in my old days, config files would have been S-expressions whose modern form is EDN.
However, no matter what their form, configuration files for any large long-lived application often struggle to find the right level of expression. They are either too simplistic and repetitive, or they evolve into poorly supported bespoke programming languages. (Just sometimes they go the whole way and metamorphose into perfectly-formed independent languages like Lua.)
Ultimately, Mike Hadlow's "Configuration Complexity Clock" charts this cycle of doom from, and back to, hard-coded configurations.
The latest entry in this space is Configure Unify Execute (CUE). CUE is a nice synthesis of JSON, types, and unification. It can constrain JSON data, and avoid repetition, but it isn't a full programming language. I think this makes it somewhere about five o'clock in Hadlow time.
CUE's particular target is Kubernetes. CUE grew out of experiences with Kubernetes' progenitor: Google's Borg. Borg's configuration language ("BCL", itself derived from a generic "Google Configuration Language") is Turing complete and is said to be difficult to maintain.
There have been other attempts to solve the problems of BCL like Jsonnet. A pithy description of the difference between Jsonnet and CUE is:
'Jsonnet is a direct descendant of BCL, "BCL done right". CUE is an indirect descendant of BCL, you might say "BCL rethought".' [1]
There's lots of ways of using CUE. It is a configuration language in its own right, or it can be used to validate other formats. I'm going to show an example of using it as a schema language for some JSON.
As I've mentioned before, we use Buildbot for our continuous integration. It's a very flexible Python framework that's all the way around the complexity clock, but we've extracted some more declarative configuration files. One of these is a "workers.json" file which defines the host machines running Buildbot workers. These have different capabilities in terms of platform, installed tools, and horsepower.
A sample excerpt:
{ "host-1": { "max_builds": 3, "max_parallel": 4, "owner": "Dev", "platform": "Linux", "workertype": ["Install", "Performance"] }, ... }
The format isn't very complicated so I tried to encode the schema in CUE. Here is my (simplified) attempt:
#Parallelism: uint8 & >=1 & <=10
#Branch: "default" | =~"^B[a-zA-Z0-9_.]"
#Owner: "Dev" | "Test" | "Resources"
#Platform: "Linux" | "Windows"
#Type: "Install" | "WebAPI" | "SSL" | "Performance"

#Worker: {
    max_builds: #Parallelism
    max_parallel: #Parallelism
    rpm_installs: [...#Branch]
    owner: #Owner
    platform: #Platform
    workertype: [...#Type]
    experimental?: bool | false
}

#Workers: [string]: #Worker

#Workers
Without walking through this laboriously, I'm basically building up constrained types. I can limit numerical values, match strings via regular expressions, and create sum types. Using these I can define structs including optional fields like "experimental".
How can I use this? CUE comes with some nice tooling (written in Golang) which can validate the actual config file:
cue vet workers.cue workers.json
It actually doesn't matter which way around I put the command line arguments because CUE is unifying the inputs.
Suppose I tried to turn "max_builds" up to 11. The above command would report something like:
"host-1".max_builds: invalid value 11 (out of bound <=10): .\workers.cue:1:29 .\workers.json:3:23
Anyway, that turned out to be not so hard. Now I wonder if I could use CUE to define a validating schema for our much more complex YAML-based text mining language (EASL)?
"Every configuration file becomes a programming language, so you might as well think that way." -- James Gosling
a-coda · 3 years
Google gets to Gummybear
One Saturday morning I was making some pancakes for breakfast and wanted something to listen to. Instead of talk radio or music, I browsed Software Engineering Radio's recent podcasts and picked Doug Fawley talking about gRPC, which was broadcast last August. gRPC is a technology for making remote procedure calls between different processes. I had just read about Dropbox's adoption of gRPC for their internal services so it stood out as a topic.
Here I must confess to having a professional interest in RPC mechanisms as I have implemented CORBA support in both Java and Dylan, with a little help from my friends. You can still read the Dylan CORBA documentation. CORBA was meant to standardize the RPC space and even provide interoperability across the internet. Sadly that was not to be and RESTful web services swept through the industry. Nevertheless, behind the scenes, in data centres, diverse RPC mechanisms continued to multiply.
In this respect gRPC is just the latest game in town. It was originally developed at Google in 2015 as a next generation, and open source, version of their in-house RPC mechanism: "Stubby". Like all RPC mechanisms there is an interface definition language for defining the messages. These then get compiled into the code for the programming language you are using. In gRPC's case it not unnaturally uses Google protocol buffers for the interface definition language and for the data serialization (by default).
Perhaps the most distinctive thing about gRPC is that it is built on top of HTTP/2 rather than plain HTTP or raw TCP/IP. This means it benefits from high-level support for connection multiplexing while retaining performance via header compression and binary data serialization. Unfortunately, this doesn't mean you can use it directly from JavaScript in a browser because browsers don't expose enough of the HTTP/2 details. Instead a proxy is necessary.
To try gRPC out I converted my recent Go-based NLP web service into an NLP RPC service. See the source code. It was straightforward enough. Most of the user code is assembling the protobuf data on one side and disassembling it on the other. I had the most trouble just creating a multi-file Go project. For casual programming, dealing with package management and compilation units is often the biggest source of pain. In the end I bodged it all into one package to get rid of the frustrating lookup errors.
Other issues I had were really just reminders about what a dumb (in the nicest sense) language Go is. For example, instead of stretchy lists or vectors as you might have in Python, Java, or C++ you have to deal with slices. Another case is this innocent-looking for loop. This is supposed to put a sequence of objects into some other storage.
for i, object := range objects {
    storage[i] = &object
}
Unfortunately, this code puts a pointer to the local variable "object" into the other storage. That variable is reused on each loop iteration, so the storage array ends up pointing to the last object multiple times. Instead you have to write:
for i, _ := range objects {
    storage[i] = &objects[i]
}
(Or make copies.)
So where could Linguamatics use gRPC? Well, we already use protocol buffers over ZeroMQ for some backend services. It might make sense to switch to gRPC if the industry is standardizing around it for microservices. ZeroMQ is fast and supports some advanced communication patterns like pub-sub, but gRPC has streaming which can be used for that too. However, CORBA has shown that it can be difficult to pick a winner in this space. After a period of popularity it eventually fell flat as a pancake.
Finally, Doug Fawley says although gRPC is from Google the "g" doesn't stand for "Google" and instead stands for different things on each release - the latest at the time of writing being "gummybear".
"In the old days, you would chastise people for reinventing the wheel. Now we beg, 'Oh, please, please reinvent the wheel.'" -- Alan Kay
a-coda · 3 years
Advent of Code
As a child, advent calendars always added to the sense of anticipation in the lead up to Christmas. In my day you would be lucky to get a small picture behind each of the doors. These days, children expect chocolates or sweets. My wife has once even had a "Ginvent Calendar", with gin behind each door.
This year I marked Advent by having a go at the "Advent of Code" which has Christmas-themed programming puzzles posted each day. Most days are in two parts, with an easier puzzle followed by a harder one. Traditionally, I've posted a (mostly ignored) programming puzzle to our development team each Christmas. Last year I just recycled one of the Advent of Code puzzles, but this year I suggested we attempt the whole thing. The puzzles are so well thought out, in comparison to my efforts, that it seemed pointless to compete.
In the end, several of the team had a go. Some of the puzzles were harder than others, but I managed to solve them all by Boxing Day. What follows are some personal anecdotes from the various days with some general thoughts at the end. Note that there are some spoilers and the notes won't mean much if you've not done the puzzles. So in this case just skip to the end.
a sum-finder. I implemented the search tree via recursive calls. I drifted into using Python right from the start. It just felt like the easiest way to hack the puzzles quickly. In the past I had thought about using the puzzles to learn a new language. A colleague had done that with Rust in a previous year. Despite these good intentions, expediency took a firm hold. That said, in several puzzles I would have liked immutable collections or at least Lisp-style lists.
a pattern counter. Not that interesting except patterns were emerging in the programs themselves. Regular expressions got used a lot to read in the puzzle data. I learnt about things like match.group(1,2,3) which returns a tuple of the first three match groups, so you don't have to write (m.group(1), m.group(2), m.group(3)).
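For example, something along these lines (the pattern here is illustrative):

import re

m = re.search(r"(\d+)-(\d+) ([a-z]):", "1-3 a: abcde")
low, high, letter = m.group(1, 2, 3)   # ('1', '3', 'a')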
a grid tracer. The first interesting one because it was unfamiliar. Some other patterns started emerging: problem parameters got promoted to command line arguments, and data structure printers got hacked to help debugging. These two were often added between part 1 and part 2 of each problem.
a data validator. This felt like a bit of a slog. It was mostly about capturing the validation rules as code. Even though I made a point of reminding myself at the start that re.search doesn't match the whole string I still forgot it later. Duh.
an indexing problem. I patted myself on the back for realizing that the index was a binary number (or pair of binary numbers as I did it). At this point the solutions were still neat and I would do a little code golfing after the solution to tidy them up a bit and make them more concise.
another pattern counter. Pre-calculating some things during data reading kept the later code simple.
a recursive calculator. This was one of those puzzles where I had to reread the description several times to try and understand what it was asking for. It entailed a slightly tricky recursive sum and product, which was again made easier by creating more supporting data structures while reading the input data.
an interpreter. Probably my favourite individual puzzle because it was so sweet, especially after a bit of refactoring to make the language more data-driven.
another sum-finder. I found I didn't particularly like these.
an order-finder. This was the first one that made me pause for thought. An overly naive search algorithm from part 1 hit a computational complexity wall in part 2. I beat the problem by realizing that the search only had to be done on small islands of the data, but a colleague pointed out there was a better linear solution. The code was starting to get a bit ragged, with commented out debugging statements.
the game of life. The classic simulation but with some out-of-bounds spaces and some line-of-sight rules. It helped to print the board.
a map navigator. I liked this one even though I forgot to convert degrees to radians and that rotation matrices go anti-clockwise. I even introduced an abstract data type (ADT) to see if it would simplify the code (I'm not sure it ever did - I mostly used lists, tuples, strings, and numbers). The second parts of the puzzles were starting to get their own files now (usually bootstrapped by copying and pasting the first part's file).
a prime number theorem. I actually got stalled on this one for a bit. It eventually turned out I had a bug in the code and was missing a modulus. In effect I wasn't accounting for small primes far to the right. I left the puzzle and went on to complete a couple of others before coming back to this one. I checked what I was doing by Googling for hints, but in the end I had to take a long hard look at the data and find my own bug.
some bit twiddling. Part 1 felt like I found the expected bitwise operations, but part 2 felt like I was bashing square pegs into round holes.
a number sequence problem. Another pat on the back, this time for keeping a dictionary of recent occurrences and not searching back down the list of numbers each time. Another recurring pattern is evident: running a sequence of steps over the data. I liked to code the step as its own function.
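The gist of the dictionary trick, as a sketch rather than my original code:

def play(starting, turns):
    last_seen = {n: i + 1 for i, n in enumerate(starting[:-1])}
    current = starting[-1]
    for turn in range(len(starting), turns):
        spoken = turn - last_seen[current] if current in last_seen else 0
        last_seen[current] = turn    # only remember the most recent occurrence
        current = spoken
    return current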
a constraint solver. A nice one about labelling fields that satisfy the known constraints. Half the code was parsing the textual rules into data.
another game of life simulation. This time it was in more dimensions. I generalized from 3 dimensions to N instead of just doing 4. This made it more of a drag. I started naming auxiliary functions with placeholder names (social services should have been called). Also, I tacked on extra space along each dimension to make room at each step. This felt very ugly. I should have used a sparser representation like I did for day 24.
an expression evaluator. I used another actual ADT and wrote a simple but horrible tokenizer. The evaluator was okay but I hacked the precedence by inserting parentheses into the token stream. Don't try this at home kids.
another pattern matcher. Probably my biggest hack. My code compiled the pattern rules into a single regular expression. This was cute but meant the recursive rules in part 2 needed special treatment. One rule just compiled into a repeated pattern with +. Unfortunately, the other rule entailed matching balanced sub-patterns, which every schoolchild knows regular languages can't do. Perhaps some recursive pattern extensions might have worked, but I assumed there would be no more than 10 elements of the sub-patterns and compiled the rule into a large alternative of the possible symmetrical matchers. Yuck.
a map assembler. I did this one the most methodically. It had proper comments and unit tests. Overall it took the most code but perhaps it was just dealing with all the edge cases (ba dum tss). But seriously, it seemed to take a lot of code for rotating and flipping the tiles even after knowing how they must be connected. So probably there was a better approach. It was still satisfying the see the answer come out after all that work. Curiously, this one involved little debugging. I wonder if perhaps there is some connection between preparation and outcome?
a constraint solver. I tried a dumb approach first based on searching all the possible bindings. That didn't look like it was terminating any time soon. So I reverted to a previously successful technique of intersecting the associations and then refining them based on the already unique ones.
a recursive card game. This card game playing puzzle seemed to be going okay, but the real data didn't converge for part 2. Had a quick Google for a hint after battling with it for a while, and the first hit was from someone who said they'd misread the question. Sure enough I had too. My recursive games were on the whole deck instead of the part dictated by the cards played. Duh. The description was clear enough and included a whole worked game. I just hadn't read it properly. It still seemed to need some game state memoization to run tolerably fast.
a circular sequence. Took three attempts. A brute force approach using an array was good enough for part 1, but no way was it going to work on part 2. Even optimizing it to use ranges was still 'non-terminating' for the array-based solution. So I Googled for a little inspiration and found the phrase "linked lists" and slapped my forehead hard. I switched to a dictionary of labels to labels and the solution popped out very easily, without any further optimization. Embarrassing. Was it time to ceremonially hand in my Lisp symbol and fall on a sharpened parenthesis?
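The trick, roughly (again a sketch, not the original code), is a dictionary mapping each cup label to the label that follows it, so one move is just a few re-links:

def move(succ, current, highest):
    a = succ[current]; b = succ[a]; c = succ[b]
    succ[current] = succ[c]              # unlink the three picked-up cups
    dest = current - 1 or highest
    while dest in (a, b, c):
        dest = dest - 1 or highest
    succ[c] = succ[dest]                 # splice them back in after the destination cup
    succ[dest] = a
    return succ[current]                 # the cup after the current one is next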
another game of life. This one sounded neat because it was about a hex grid, but I didn't know how hex grids are usually indexed. So for the first time I did a little bit of general research at the start. Turns out there are a bunch of ways to index a hex grid. I opted for using three axes as that seemed natural despite the redundancy. The map itself was just a dictionary of locations. I should have looked up how to have structured dictionary keys in Python (implement __hash__) but I couldn't be bothered so I (look away now) serialized and deserialized the locations to and from strings. I still had a bug which I couldn't find until I hacked a crude hex board printer and realized I wasn't carrying the unchanged cells over from one iteration to the next.
a cryptographic puzzle. Came out quite short but only after some faffing around. Main trick seemed to be to keep the transformation ticking along instead of recalculating it from scratch each time. There was slight disappointment (tinged with relief) that there was no part 2.
Some general lessons I felt I (re)learned:
Read the questions very carefully, then reread them.
Try and use terms from the questions. Don't invent your own terminology and then have to map back and forth.
Make the trace output exactly like the examples to help comparison.
Next time I'd consider using BDD to turn their examples directly into tests. Next time.
Try the problem for a while by yourself, then think about it offline, and only then Google for hints.
Next time I'd consider using some form of source control from the start, or just a better set of file naming conventions.
Regular expressions go a long way, but can then they can get in the way.
Next time I'll consider doing it using a language I'm learning.
Sometimes when you get stuck you have to start again.
During some low moments it all felt like make-work that I'd inflicted on myself, but in the end it was a nice set of training exercises. I'd encourage others to have a go at their leisure.
"Practice is the best of all instructors." -- Publilius Syrus
a-coda · 3 years
Database Internals
Database Management System [Origin: Data + Latin basus "low, mean, vile, menial, degrading, counterfeit."] A complex set of interrelational data structures allowing data to be lost in many convenient sequences while retaining a complete record of the logical relations between the missing items. -- From "The Devil's DP Dictionary" ― Stan Kelly-Bootle
I've just read Alex Petrov's "Database Internals". Alex is a contributor to the NoSQL database Apache Cassandra and the book is about storage engines and distributed computing algorithms. It has the following chapters:
Storage Engines: Introduction and Overview
B-Tree Basics
File Formats
Implementing B-Trees
Transaction Processing and Recovery
B-Tree Variants
Log-Structured Storage
Distributed Systems: Introduction and Overview
Failure Detection
Leader Election
Replication and Consistency
Anti-Entropy and Dissemination
Distributed Transactions
Consensus
This book will stick in my memory both it was good and also because of when I read it. A big portion was read on the train from London to Exeter. That journey was part of a long day of travel from Cambridge to Cornwall. Wearing a mask all day and dealing with travel delays was a trial. Plus I was on my way to help my parents move house in the middle of the pandemic. Altogether, it felt like a wartime mission and the book was a welcome distraction.
Moving house is conceptually simple but, as everyone knows, it is a complex logistical effort. Similarly, storing data on disk sounds simple but rests on numerous processes. Like the metaphorical swan, there is much vigorous paddling underneath the surface to maintain the appearance of smooth progress.
Fortunately, the book starts gently and talks about the architecture of database systems and the high-level trade-offs. Then it gets into more detail about how to to persist data for efficient retrieval.
For example, the book explains why binary trees, although useful in-memory data structures, are not so useful for persistent storage. The reason is most large-scale storage devices are slow compared to memory so you want to avoid touching them very often. Unfortunately, binary trees are not good for this since they have low 'fanout'. Traversing them and updating them means touching lots of nodes.
What can we do? Well, these storage devices also tend to be block-structured so the idea is to pack as much as you can into an individual block. B-Trees store N sorted keys at each node (which is block-sized). The lookup algorithm does a binary search on the keys at a node and then follows a pointer to the relevant child block.
Inside the block the potentially variable-sized data has to be carefully laid out. Usually the blocks are 'slotted', which means the keys-and-pointers or keys-and-values are serialized into 'cells' and written backwards from the end of the block. Pointers to the cells are then written forwards from the start of the block (along with header information). This indirection allows entries to be added and removed without moving the real data too much. A downside is space inside a block can end up being fragmented so there have to be defragmentation processes.
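As a toy illustration of that slotted layout (purely a sketch; a real engine keeps the slot array inside the block itself rather than in a separate list):

class SlottedPage:
    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []          # (offset, length) entries, conceptually at the front
        self.free_end = size     # cells are written backwards from the end

    def add(self, cell: bytes):
        self.free_end -= len(cell)
        self.data[self.free_end:self.free_end + len(cell)] = cell
        self.slots.append((self.free_end, len(cell)))   # indirection: slot -> cell

    def get(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])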
At this point I started to feel the urge to code this, but I recalled my database lecturer's admonishment from decades ago:
"If anyone asks you to implement a B-Tree, just say no!"
The reason is there are lots of fiddly cases to deal with like block splitting, merging, and overflow. There are also many access patterns you might try to optimize for. In the end, mutable B-Trees are better for fast reading: they trade slower insertion times for faster lookup times.
There are some workloads which need fast writing. These are better suited to Log Structured Merge-Trees. In these, insertions are buffered into an in-memory data structure. This is then written to disk as a fully populated immutable tree. Searches must check multiple places, both the memory buffer and on disk. Meanwhile a process gradually merges these disk trees in the background. I actually used a log structured storage engine, RocksDB, in one of my Rust blog posts.
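A toy sketch of that read/write split, with a mutable memtable in front of immutable sorted runs (real engines add write-ahead logs, bloom filters, compaction, and much more):

class TinyLSM:
    def __init__(self, flush_at=4):
        self.memtable = {}      # recent writes, mutable and in memory
        self.segments = []      # older data as immutable sorted runs ("on disk")
        self.flush_at = flush_at

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest runs first
            for k, v in segment:                  # linear scan; a real engine would binary search
                if k == key:
                    return v
        return None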
There are many storage trade-offs and the book does its best to guide you through them. Then there is the fascinating second half of the book concerning distributing computing algorithms for leadership election (a popular topic right now) and consensus. I can't do justice to it all in a short review blog. However, I personally found the storage section more interesting. It felt like it had more of a narrative.
Returning to the context in which I read this, I can say, after a lot of work, my parent block move was successful. However, my own block is still acting as temporary storage for a large keyed-item.
a-coda · 4 years
Go Easy
I thought I'd follow up my flirtation with Go from last month. I had finished reading another online tutorial text ("Effective Go") and felt I could fit in a small project.
Go is meant to be good for building network services, so what should I build in that vein? The company I work for (Linguamatics) is in the Natural Language Processing (NLP) business. Could I build a simple NLP-as-a-service website in Go?
My idea was to create a web form which would take a piece of English text and respond with an analysis of the contents. In particular, it should list the entities found.
Sadly, this turned out to be so ridiculously easy the project was over almost before it started.
The 82 lines of code are available here. The main components used are:
net/http to set up an HTTP service
html/template to create the web form
Joseph Kato's NLP library
More than half of the code is the HTML template.
The HTTP service setup is borrowed from the example at the end of "Effective Go". The handler for requests is a simple function which is turned into the handler type by an adapter.
The NLP analysis is then triggered by form submission and the response is a new page with the results inline.
Most of the time went in figuring out how to:
install the NLP library
do loops in the HTML template
"When the solution is simple, Go is answering." -- (with apologies to) Albert Einstein.
a-coda · 4 years
Time to Go
"When in doubt, use brute force" -- Ken Thompson
Rich Hickey recently published "A History of Clojure". I'd blogged about it before:
"Getting some Clojure"
"The Road to Clojure"
"Bravely Truthy"
"Too Lazy"
I was inspired by this history lesson to get back to the last of those: my XML munging project.
The current Python version of this process takes a few liberties in order to be fast. It doesn't fully parse the XML and instead crudely slices it up as text blocks. It can do this because it knows the way the files are printed and only needs to transform the data, not interpret it.
I wondered if I could get similar performance in Clojure by processing the XML more naturally. Getting started was easy. Slurping an XML file into memory is embarrassingly easy in Clojure. The parser turns the XML into standard Clojure collections.
Unfortunately, it then became frustrating. I was trying to stream the transformation via the lazy data structures, but I ended up spending way too much time looking at runtime errors in obscure backtraces. Clojure benefits, but also suffers, from being a hosted language. The downside is the errors are in terms of the supporting Java innards. (To be fair, users of Java's own functional streaming support can also suffer from this.)
I suppose I should have used the REPL more for experimenting with smaller units, but the performance also sucked and caused the GC to thrash (until I increased the JVM size a lot). In despair at my clumsy attempts to stream the data elegantly I switched to an asynchronous style with channels.
Unfortunately, I hit other problems with that approach. At one point the compiler and runtime made opposing complaints about the type of a data structure. Eventually I figured out I was using "take" instead of "async/take". After running into another 'impossible' problem with the channel approach I gave up.
The Clojure history paper said Clojure's async support was inspired by the Go programming language. So, I reasoned, why not program the goddam thing in Go instead?! I crammed a tutorial and started converting the not-working-very-well Clojure application.
The Go tutorial is nice. It has some interactive parts where you get to test your understanding. It doesn't cover everything but was enough for my needs.
While writing the Go version I actually spent most of the time Googling for the right XML library. Many Go XML processors assume you want to unmarshal the XML into application-specific structs, but I just wanted to slurp the XML into something vanilla. In the end I found "etree" which mimics Python's ElementTree.
First I coded it in a very linear way, and then again with channels (producer-consumer). The performance was much better than my (virtually non-terminating) Clojure app and only slightly slower than the Python original. This is good considering I'm really parsing the XML into a DOM in the Go version and the Python version is mostly only chopping up strings. The Python code is also invoking C library code for the heavy lifting while the Go code is all in Go. (You can wrap Go around C for XML if you want).
Overall the development experience with Go was very different from working with Clojure:
old fashioned imperative programming
compile-time type checking
fast compile times and much faster execution (on this code)
good level of compilation and runtime error reporting
Go itself felt like a halfway house between Java and C. It has a GC but also has pointers (watch out for copying). It doesn't have classes or inheritance but it does have interfaces and a form of dynamic dispatch. It doesn't have resizable vectors but does have a smart "append" function on slices which does the same job. Go does not have exceptions (and so no try-blocks) but does have a mechanism for accumulating deferred cleanup work in a function. It doesn't have a "public" declaration but does have capital letters!
Go has been described, in a postive way, as a "boring language". I certainly found it simple to learn. I think it might be easy to return to and use in a casual way. Perhaps further evidence (if it were needed) that "worse is better".
If you want other recent musings there are some here:
"Early Impressions of Go from a Rust Programmer"
"Why Go and not Rust?"
In the end:
"Sometimes, the elegant implementation is just a function. Not a method. Not a class. Not a framework. Just a function." -- John Carmack
a-coda · 4 years
Shape Up
“Find a judo solution, one that delivers maximum efficiency with minimum effort. When good enough gets the job done, go for it.” -- Jason Fried, Basecamp founder
A while ago I listened to Ryan Singer talking about Basecamp’s Software Development Process on Software Engineering Radio. It was basically a trailer for Singer's book "Shape Up: Stop Running in Circles and Ship Work that Matters". It sounded sufficiently interesting I eventually got around to reading it.
First I should give some background. Basecamp used to be called 37signals. They renamed the company after their main product, which is a project management application. The Basecamp software also gave rise to the Ruby on Rails web development framework whose convention-over-configuration approach influenced a generation of tools.
Singer's thesis and the approach Basecamp takes overlaps with but diverges from agile and Scrum. I'm no fan of Scrum either so I was interested to read about their alternative and see how much it crossed over with what we do.
One place where "Shape Up" is the same as Scrum (and ourselves) is it is time-boxed. The "iron triangle" dictates budget, scope, and time are locked together. Budget is usually fixed as it represents your team. So either you fix the scope and have an uncertain ship date, or you fix the ship date and cut the scope to fit. The latter has proven more useful in many software development organisations. The rest of the organisation can then plan around a regular release cycle. However, instead of two week sprints Basecamp have a six week cycle, or more accurately a six-plus-two week cycle. (In constrast we have a six-plus-six week quarterly cycle.)
Somewhere "Shape Up" is different from Scrum is it promotes an up-front design process (partly a requirements elicitation process) called "shaping". This happens in parallel with the development cycle and delivers scoped out features to a 'betting table' where projects are chosen for the next cycle. Some aspects of the process include "breadboarding" (action/activity diagrams) and "fat marker sketches" (user interface outlines). The idea is to give the development team a tractable bounded project to do in the six week period. The idea is the "shaped" feature contains enough guidelines, and has fenced off enough rabbit holes, so the development team can be given relative autonomy.
Their projects are similar to ours with one or two developers in a team of other specialists. However, our shaping process involves more members of the team and so is more inlined and blurs into the development process. Sometimes the "shaping" takes a cycle and we don't get to start implementing the feature until the next cycle. One area of similarity is the need for what Singer calls "scope-hammering": cutting or trimming ideas at the edge of a feature to enable it to be done in the allotted time.
There's lots more in Basecamp's approach including scope maps, a status metaphor (uphill vs downhill states), and simple adjustments to terminology (eg what's our "appetite" for this feature?). Overall it was good to see someone offering an alternative to the myopic one-size-fits-all world of Scrum. Could soon be time to read another one of the Basecamp books.
"If you're not working on your best idea right now, you're doing it wrong." -- David Heinemeier Hansson, Basecamp partner and inventor of Ruby on Rails
a-coda · 4 years
The Debit Column
“Back in those less complicated times, there were lots of industries that operated more or less by rote: the old banker’s motto, for instance, was “3-6-3”: take money in at 3 percent, lend it out at 6 percent, and be on the golf course by 3 P.M.” ― Bethany McLean
I'm indebted to a colleague who recommended "The Smartest Guys in the Room", a book about the rise and fall of Enron. I'd meant to watch the documentary or read this book for a while. More recently, "Bad Blood" had given me a taste for business disaster stories.
Anyway, after reading the book, I think Enron went bust because they:
pretended to be a logistics company when really they were a commodities trading company
placed big "new business" bets which failed
borrowed lots of money to stay afloat
hid debt (even from themselves) in ever more complex financial arrangements
They only kept going as long as they did because the stock market of the time allowed them to borrow against an ever rising stock price, like a pyramid scheme. Eventually the business collapsed after the "dot-com bubble" burst and many stocks tumbled. (Enron had also gambled on broadband and video-on-demand and so were associated with the tech industry.)
The above business flaws were further exposed by their reckless and fraudulent management. There was no cost control and senior individuals lined their own pockets at the expense of the company. Several senior figures went to jail.
What has this got to do with software development? Well, there's an established concept of "Technical Debt", coined by Ward Cunningham. (Coincidentally, Cunningham is also the inventor of the wiki, whose descendant syntax I'm using to write this blog entry.) Technical debt was originally meant to mean the ongoing cost of software not matching your current understanding of the domain, but it has come to mean any long-term implementation flaw in your software. You 'pay interest' on this debt every time an improvement takes longer than it should. Are there other software parallels with financial processes?
A straightforward parallel with bad software development organisations is Enron's QA department, which was called "Risk Assessment and Control" or "RAC". Enron publicised their risk management team and boldly claimed it allowed them to do deals other companies could not. In reality RAC was marginalised internally, and its chief either rubber-stamped deals or was overruled. All the while, the risks kept accumulating.
Another parallel comes from Enron's approach to dealing with very large debts becoming due. Enron often 'solved' this by borrowing from another source to pay off the original debt. This had the advantage that it solved the immediate problem, but at the cost of pushing an inflated problem into the future: a real escalation of commitment. A software parallel might be hastily slapping an adapter on the front of some horrible code. You may have fixed the problem in the short term, but in many ways you've added to your problems. A senior colleague likes to call it "digging tomorrow's hole today".
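In code the move might look something like this (an entirely made-up sketch with hypothetical names): a tidy adapter bolted on the front, with the original mess still accruing interest behind it.

class LegacyBillingEngine:
    # imagine years of accumulated special cases in here
    def calc(self, cust_id, flags, magic_mode=7):
        return 9.99 if magic_mode == 7 else 0.0  # stand-in for the horrible logic

class BillingAdapter:
    # a clean face for new callers: the interface is nicer, but the
    # complexity underneath is untouched - the debt has been rolled
    # over, not repaid
    def __init__(self):
        self._legacy = LegacyBillingEngine()

    def monthly_charge(self, customer_id):
        return self._legacy.calc(customer_id, flags=0)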
A financial process for which it is hard to find a software development parallel is securitization, where debts are packaged up and sold on as investments; a collateralized debt obligation is one example. Outsourcing? Answers on a postcard.
"If you want to go fast you take some debt. But the misunderstanding was that people thought that you never have to pay it back. If you are going to take on debt you have to pay it back or you go out of business. It’s called bankruptcy and there have been plenty of bankrupt projects." -- Ward Cunningham
0 notes
a-coda · 4 years
Text
A Twisted Tale
We use Buildbot for our automatic build and test system. Like other such systems, Buildbot can notify you via email and I recently wanted to start filtering the email messages based on likely interest. Fortunately Buildbot is written in Python and is very programmable. You can easily subclass and override any default behaviour. At least in principle.
Buildbot has a master service running on one machine which sends tasks to workers running on many other machines. These workers execute the tasks and report the results back to the master. All this means quite a bit is happening concurrently across multiple processes on multiple machines.
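To make the setup a little more concrete, here is roughly what the relevant corner of a master configuration looks like (a sketch with placeholder names and passwords; the details vary between Buildbot versions). The configuration file is itself Python, which is where all that programmability comes from:

from buildbot.plugins import worker

# master.cfg is ordinary Python executed by the Buildbot master
c = BuildmasterConfig = {}

# the workers that are allowed to connect, each with a name and a password
c['workers'] = [
    worker.Worker("linux-worker", "example-password"),
    worker.Worker("windows-worker", "example-password"),
]

# the port the master listens on for worker connections
c['protocols'] = {'pb': {'port': 9989}}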
Python (or at least CPython, thanks to the Global Interpreter Lock) effectively runs Python code on only one thread at a time, so Buildbot is written on top of Twisted, a Python framework for networking and asynchronous programming. The latter can be trickier than synchronous programming, and languages, Python included, have more recently added special syntax for it in the form of async/await.
However, Twisted was written before Python had asyncio so it was built on top of other features already in the language. In particular, Twisted uses (abuses?) the "yield" statement via decorators. For example:
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import deferLater

def sleep(secs):
    # a Deferred that fires (with None) after the given number of seconds
    return deferLater(reactor, secs, lambda: None)

@inlineCallbacks
def print_numbers(n):
    for i in range(n):
        yield sleep(1)   # suspend until the Deferred fires
        print(i)

print_numbers(10)
reactor.callLater(11, reactor.stop)   # stop the event loop once the numbers are done
reactor.run()
This prints the numbers 0 to 9 one second apart and then exits. But what is it doing?
The function "print_numbers" is a generator because of its "yield" statement. This means than when you call the function it returns a generator object. Each time you call "next" on the generator it will re-enter the function and return the next value. Python's for-loop knows to repeatedly call "next" if given a generator.
But Python generators also support "send": you can send a value back into the generator, and it becomes the result of the "yield" expression inside the function. Twisted exploits this.
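A toy example (nothing to do with Twisted yet) shows both directions of the conversation: "next" pulls a value out of the generator, while "send" pushes a value back in and it becomes the result of the "yield" expression.

def echo():
    while True:
        received = yield "ready"          # pause here until someone calls send
        print("the generator got:", received)

gen = echo()
print(next(gen))       # runs up to the first yield and prints "ready"
gen.send("hello")      # "hello" becomes the value of the yield expression
gen.send("world")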
The Twisted reactor is an event loop which waits for events and processes them. The results of those events are represented by Deferred objects, which are Twisted's version of "promises": they have associated callbacks which are run when the events happen.
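As a minimal sketch of a Deferred on its own (a toy again, not the Buildbot code), deferLater hands back a Deferred, and any callbacks attached to it run inside the reactor loop once it fires:

from twisted.internet import reactor
from twisted.internet.task import deferLater

d = deferLater(reactor, 1, lambda: 42)                      # fires with 42 after one second
d.addCallback(lambda result: print("fired with", result))   # runs when the Deferred fires
d.addCallback(lambda _: reactor.stop())                     # then shut the event loop down
reactor.run()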
The "@inlineCallbacks" decorator is thus an adapter which turns the generator into a Deferred. When the "yield" statement is used on a Deferred the rest of the loop will be re-entered when that Deferred is run. In this case the loop will be re-entered after a one second wait.
The reason I'm having a go at explaining Twisted is that when adding the email filtering to Buildbot I neglected the necessary yield/inlineCallbacks incantations and ended up suppressing all the mail notifications. Although, to be fair, this was not unappreciated.
Anyway, if you want a better explanation of Twisted try Stack Overflow. My colleague Tony did a better job explaining Category Theory from scratch in last week's lunch time talk, and there is a part 2 this week.
If you want a rant about how asynchronous programming often ends up creating two kinds of function, please read "What Color Is Your Function?".
"Though sleep is called our best friend, it is a friend who often keeps us waiting!" -- Jules Verne
0 notes
a-coda · 4 years
Text
Outliers
Ironically, Christmas is a bad time to be born. Your birthday gets subsumed by the general seasonal celebrations. Either no one remembers your birthday, or any gifts or greetings that you get are 'combined ones'.
It turns out it is particularly bad for would-be Canadian hockey stars to be born at Christmas. The eligibility cut-off for age-class hockey in Canada is January. So if you are born at Christmas you will be one of the youngest in the group. This means you are likely to be one of the smallest and up to a year behind in general development.
Conversely, those born in January will be the oldest in the group and so tend to be bigger and smarter. This advantage then gets magnified as the slightly better players get moved on to the top teams and get more training and coaching. The result is that the birth months of elite Canadian hockey players are heavily skewed towards January.
In Malcolm Gladwell's 2008 book "Outliers" this is called the "Matthew Effect":
"For to every one who has will more be given, and he will have abundance; but from him who has not, even what he has will be taken away." — Matthew 25:29, Bible (Revised Standard Version).
Other sports and even whole school systems suffer from the same issue albeit with different cut-off dates. Gladwell uses these examples to show that top performers or "outliers" need not be considered either miraculous or solely the result of very personal efforts. Talent and dedication are prerequisites but you also have to be in the right place at the right time.
Gladwell looks at many other kinds of outlier (both successes and failures) including:
the town of Roseto - a place with a lower-than-average death rate, due to its immigrants bringing with them a sense of community involvement (not genes or diet)
the personal computing founders like Bill Gates (Microsoft), Steve Jobs (Apple), and Bill Joy (Sun) - people all born in the mid-1950s so that they were young adults when the first personal computers arrived in the mid-1970s. Since the examples were American I went to the trouble of looking up Tim Berners-Lee: born in 1955. At work a colleague and I hypothesized that significant software industry founders would be born 20 years before the World Wide Web. Enter Larry Page and Sergey Brin (Google) who were born in 1973, 20 years before the WWW was made free by CERN.
young talent like the Beatles or Bill Gates - people who had special opportunities to put in 10,000 hours of practice very early in their careers. The Beatles played more hours together in Hamburg before they were famous than most bands manage in their lifetimes. Gates and Joy both had unlimited access to early time-sharing computer systems at a time when that was a scarce resource.
New York Jewish lawyers - an ostracised social group who were given cast-off law work to do until those areas (M&A, litigation) happened to become the big money spinners. Then they became the dominant players.
Asian mathematical geniuses - who get a head start because their number systems are more logical. So mathematics doesn't seem so impenetrable in the early years and more students carry it on further.
high IQ individuals - a long-term study showed they had little measurable advantage in terms of career development and were instead more affected by their family environment.
air travel - where cultural reticence about questioning authority, and a habit of using mitigated speech, has led to numerous plane crashes.
The book was a very entertaining read and I'd recommend it. Gladwell finishes with an autobiographical piece and shows that his own family benefited from being in the right time and place. His mother came from Jamaica - a possible disadvantage - but was of mixed race and so enjoyed a privileged education and social position. She also got lucky with a scholarship that was handed on when another person received two.
In my own way I was also the beneficiary of time and place. After finishing my mathematics degree and a conversion course to computer science I applied for a PhD. At the time Japan was investing heavily in its Fifth Generation Computer Systems programme. In response the UK government created the Alvey Programme, which increased investment in things like software engineering and artificial intelligence. Alvey ran from 1984 to 1990, which (completely coincidentally) was exactly when I needed grant money.
I'll leave the last word to Tim Minchin whose song "White Wine in the Sun" pervades the house at Christmas.
"I'm not a good philanthropist yet; I'm not as good as I'd like to be... I believe very hard in luck. It's all chance; therefore, any privilege you have is chaos." -- Tim Minchin
0 notes