Books I read in 2014

(disclaimer: the book cover images in this post are “Amazon Affiliate” links. If you click them and buy a book, I will receive a few cents in the form of an Amazon coupon. If you dislike this idea, you can simply remove “mybookbox-20″ from the end of the URLs.)

BookBox: embed book widget, share book list

Ha-Joon Chang: 23 Things They Don’t Tell You About Capitalism. A series of observations about the world economy, capitalism, free trade, planning, development, inequality in society, etc. – the chapter titles sound contrarian and provocative, but the opinions are mostly common sense and argued with clarity although not equally convincingly – many of the statements are challenging mainstream economics dogma, but the author does construct a few strawmen too. One of the recurring themes is that economic systems are never of a “pure” kind as described in textbooks, instead, they are both messier and more interesting. Readable, pragmatic, non-ideological. Some of the ideas reminded of NN Taleb’s work.

Michael Nielsen: Reinventing Discovery: The New Era of Networked Science. The book describes several recent (and some less recent) developments and projects that have changed how scientific research is done and to some extent make us think about what scientific research even means. It presents a compelling vision of the future of science and it is a “calls to arms” to embrace and popularize these new ideas and techniques, and to invent new tools.
More concretely, the book talks about open access publishing (e.g. arXiv), open source software (e.g. Linux), online collaboration (e-mail, wikis, forums, but more specifically the recent Polymath project), “data-driven intelligence” (predicting influenza, automated translation, Sloan Digital Sky Survey), citizen science (protein folding, amateur astronomy), etc. etc. The book also discusses limitations and obstacles, often due to misaligned incentives in the academic world. The book is excellent, it inspires and motivates — two minor criticisms: selection is somewhat arbitrary – could have included other projects too, and the overall narrative that all these fit into a common thread is slightly forced.

David Gale: Tracking The Automatic Ant: And Other Mathematical Explorations. A book that contains puzzles, interesting mathematical ideas and results, short opinions on various aspects of mathematics. The topics are loosely connected, the format is similar to that of Martin Gardner’s or Douglas Hofstadter’s collections. A common theme throughout the book is the exploration (often computer-assisted) of large mathematical structures, such as the solution space of a puzzle, particularly when classical approaches fail. It is one of the best recreational mathematical books I’ve ever read, covering a rich variety of topics, difficult to summarize in a review.

Donald E. Knuth: Digital Typography. A collection of Donald Knuth’s articles and essays on the topic of digital typography, mostly related to the creation of the TeX and Metafont systems, and the transformation of digital typesetting (esp. of math-y text). Wonderful in its attention to detail (there is a chapter on the letter S). I found the linebreaking algorithm and the design of the AMS Euler font in collaboration with Hermann Zapf especially interesting, as well as the whole idea of parametrized fonts. We get glimpses into the intricacies of typography, where nothing is as simple as it seems from the outside.

BookBox: embed book widget, share book list

Masha Gessen: Perfect Rigor: A Genius and the Mathematical Breakthrough of the Century. I read the German translation of this biography of Grigory Perelman. What I liked:
– the fascinating story of Perelman with interesting facts and details I didn’t know before
– interesting facts about the school system and the academic world in the Soviet Union
What I didn’t like:
– fascination with Clay Prize and other external accolades
– the author somewhat unsympathetic to Perelman and to mathematicians in general – taken out of context any personal quirk can be made to look monstrous (I quite agree with this review here).
– occasional “narrative fallacy”, attempt to make various facts fit together smoothly

William Aspray: John von Neumann and the Origins of Modern Computing. A book about the life of John von Neumann and his contributions to several fields, with special focus on computer science and numerical calculus, as well as about his work in science planning and management. Unusually for a biography, the book gets quite a bit into technical details (which I liked), and tries to trace the evolution of von Neumann’s thinking on various topics. The book focuses on his work and not so much on his personal details, when it does “get personal”, it is almost universally positive, leaving any unflattering aspects to other biographers.

Andrew Hodges: Alan Turing: The Enigma. Less worldly-smart and successful than von Neumann, but his work just as important to the foundations of computer science. Also the intellectual interests of these two men are somewhat parallel, going from mathematical logic to the theoretical foundations of computing, to the engineering task of actually building computers, and finally to biological systems.

This biography is incredibly detailed and well documented – not only discusses technical ideas, but also sets the broader context of the intellectual currents of the day. Describes the life and thoughts of Alan Turing (reconstructed mostly from letters) to an extreme level of details. The parts about the work at Bletchley Park on breaking the Enigma are very interesting. Also the parts about the Colossus and Ace computers – the technical details and decisions are interesting but in some places the administrative back and forth gets bit too tedious. In summary, this is a definitive biography that paints a comprehensive picture of Alan Turing.

Martin Aigner, Günter M. Ziegler: Proofs from THE BOOK. A collection of some of the most beautiful mathematical proofs. I was at first suspicious that such a concept would work as a book, and the selection is of course somewhat subjective, but the book is in fact excellent, and seems accessible for the most part from high school level onwards (requiring significant effort of course, but repaying it well). I think this book is one of the best possible gifts to anyone who loves mathematics.

BookBox: embed book widget, share book list

Andreas M. Hinz, Sandi Klavzar, Uros Milutinovic, Ciril Petr: The Tower of Hanoi – Myths and Maths. I reviewed this book for William Gasarch’s column.

Most people remember the Tower of Hanoi puzzle as something quite simple, interesting mainly as a textbook example of recursion, and as an example of a problem where powers of two appear naturally. In reality, as the book demonstrates, the Tower of Hanoi has a very rich mathematical structure, and as soon as we tweak the parameters we surprisingly quickly find ourselves in the realm of open problems. The rest of the review can be read here.

Günter Ziegler: Darf ich Zahlen? (German). A nice, light-hearted book amout mathematics and the process of doing it, also containing anecdotes about famous mathematicians. Makes a nice present and good for learning German.

Früher war noch viel mehr Lametta (German). A pleasant collection of short-stories loosely related to Christmas. As the stories come from different sources and styles, the difficulty of their German is very uneven – so it is quite good to test and improve language skills.

Michael Kohlmeier: Sagen des klassischen Altertums (German). The myths and stories of classical Greek antiquity retold in an informal style with insight and wit. Based on a popular radio and TV show of the author. The book is quite amazing, and good for practicing German, although the informal style does not necessarily make it very easy to read.

BookBox: embed book widget, share book list

Franz Kafka: The Metamorphosis. A classic about which nothing can be said that hasn’t been said already. I found it rather disturbing in places. Also, of the classics, maybe among the more easily accessible in German. 4/5

Paul Auster: The Invention of Solitude. A novel developing some autobiographical themes and stories – the first half is in a straightforward memoir format, the second part is a fictitious “book of memories”, written in third person and mixing storytelling with essayistic parts of varying depth. Overall, the main theme of the book is fatherhood and the relationship between father and son. As such, I found it insightful and moving in places, with good stories to tell, a bit sentimental and overwritten in others. 3/5

Charles Bukowski: Women. The thinly veiled autobiographical (anti)hero of the book, Henry (Hank) Chinoski describes himself as M Sade without the intellectual depth. Maybe the same applies to this book and Bukowski himself. A more generous critic calls him the great de-Disney-fier of our age. A bit annoying narcissism, good storytelling and turns of phrase. 3/5

Kurt Vonnegut: Cat’s Cradle. I found this book both very clever and very entertaining. On the surface a short and funny Dr Strangelove story – describing as an aside possibly the best fictitious religion ever created, and throwing in concepts that haunt you long after finishing the book: karass, duprass, granfalloon, wampeter – several layers of satire intermixed with insight on human character and society. Like all good magic realism, walks the fine line between cynicism and significance. 5/5

Interesting articles in 2014

Here are a bunch of links to a few nontechnical articles/essays I read and bookmarked this year. I discovered most of them on HN, Reddit and other similar timewasters news sources or via spam e-mail from friends. Not all texts were written this year, indeed, some might be decades (or centuries:) old. Enjoy! (Previously)


The seven rules of nationalism and some more insight on nationalism in a book review.

An unusually bitter critique of Stefan Zweig.

Spelling reform in English and its pitfalls.

Layers of Rome.

Sartre and the Nobel prize.

Keynes on the economic prospects of our grandchildren. No, I don’t want your rewards card.


William Thurston on mathematics, also in a longer essay. In a similar vein, Feynman’s letter on “what problems to solve“.

What is combinatorics? Computers and math. Zeilberger’s Opinion 44 on the same topic. Erdős. Unorthodox proof techniques. Interview with Preda Mihăilescu.

Tech / Science

The funniest person at Microsoft: To wash it all away. This world of ours. A similar sentiment expressed in this essay by a different author.

Bad scientific code is usually not so bad. On the other hand, some over-engineered systems are indistinguishable from parody.

20 questions for Knuth.

The internet in 2014.


Machine learning AMA with Michael Jordan.

The Tanenbaum-Torvalds debate.

Failure in complex systems.

Academic publishing.

Visualizing Cuckoo Hashing

Cuckoo hashing is a simple and elegant method for collision-resolution in hash tables. The resulting dictionary data structure is easy to implement, quite efficient in practice and it is also interesting theoretically (there are several interesting open questions about its properties).

I made a small visualization of how Cuckoo hashing works. The demo uses JavaScript and HTML5 Canvas, so it should work in most browsers. Let me know if you find it useful or if you have any feedback.

The demo is loosely inspired by Jason Davies’ Bloom filter visualization. Here is my old post about Bloom filters.

A recursive memory game

I made a small logic game/puzzle called Recursi. It consists of independent “memory” games with two pairs of cards each. These games are nested recursively, behind each game there is a card of the game at a higher level. If you find two pairs, you get to flip the card at a higher level behind the current game. Sounds like fun? :) Go ahead and try it. Recursi, a fractal memory game.

Interesting posts from 2012-2013

Here are a bunch of articles/essays/comments I read, enjoyed and bookmarked in the past two years – some of it is deep, some shallow, some funny, some sad, some inspiring, some cynical. Let me know if you find something particularly interesting here, or if you particularly dislike some article. Enjoy! (previously)


Mandelbrot, Zhang, Wolfram, Higgs, Erdős, Erdős and Graham, Erdős, Erdős, Frankl, Linus, Alan Kay, Coppola, Bellard, Kolmogorov, Taleb+, Taleb-, Zeilberger


Derek Sivers: It’s all who you know?
Leo Babauta: Tips to simplify
Philip Guo: Productivity tips
James Altucher’s Anti-tips (funny writing)
Legacy of the 60s
Comments/threads: great comeback, too old, burnout, quality


The Hardy-Littlewood axioms of collaboration
Luc Devroye’s authorship commandments
How to read mathematics
How to write mathematics
Emancipatory aspects of mathematics
The prime gap story
Bernard Chazelle: the Algorithm
Timothy Gowers: examples first, trivial mathemathics
Oded Goldreich: advice, stories, awards
Gödel’s letter to Neumann
Erick Carreira’s “work ethic” letter
A critical reading of Hamming’s “you and your research”
Douglas Comer: CS identity-crisis, measuring research
A common statistical error
Dijkstra: teaching formal methods, course organization
Terence Tao: what is good math, partial progress, rigour


James Mickens: The Slow Winter – witty and insightful essay on computing speed
What’s wrong with C++ (in summary)
Programming and personality traits
Early job ads: MS, Amazon
A cold look at startup jobs
Solutions to spam
Real artists ship
Pinboard architecture and interview
Zed Shaw programming advice
37signals writing on bootstrapped business
Comments/threads: social networks and TV shows, acquihire, ERP software, niche-vs-universal

Random Opinion/Linguistics/Humor/Music/Misc.

Contronyms: 1 2 3 4
Robert Dickau’s tweets – an explosive concentration of self-deprecating humor
Raganwald: the internet made me sad today
G.G. Marquez: The solitude of Latin America
Top ten reasons to play Go
Chris Lightfoot: real pirates and copyright “pirates”
Miles Davis blindfold listening
Bernard Chazelle: My 29 favorite pieces by Bach
The worst argument in the world
Language modularity
Moravec’ Paradox – the only lucid explanation I know of why perception is harder than chess
Notch: I love you, dad
If by Whiskey…
Slavoj Zizek: Shostakovich in Casablanca
William Gibson: the Net is a waste of time (1996)
Linds Redding: A lesson in perspective
Charles Bukowski: People empty out
Nassim Taleb: Understanding, or Convexity
Doron Zeilberger’s favorite quotes
Ribbonfarm: Hacking the non-disposable planet

Books I read in 2013

I was fortunate to read quite a few good books in 2013. Here is the list of them except for those that I couldn’t finish or which I thoroughly disliked. Initially I wanted to write just a one-sentence blurb about each, but I couldn’t resist expanding them into short summaries/reviews in some cases. (Here was my 2012 reading list.)

If you have read any of these books and came away with an opinion totally different from mine, I would love to hear about it.

(disclaimer: the book cover images link to Amazon using “Amazon Affiliate” links through my Bookbox project. This project earns me a few dollars a month that (partially) offsets the hosting cost of the website. If you dislike this idea, please remove “mybookbox-20″ from the end of the URLs.)

BookBox: embed book widget, share book list

Steven Callahan: Adrift: Seventy-six Days Lost at Sea. This short book tells the story of Steve Callahan, who suffered a sailing accident a few hundred miles off the coast of the Canary Islands, then floated on an inflatable liferaft for 76 days, until reaching Guadeloupe on the other side of the Atlantic, together with the small ecosystem that developed around the raft. The book is fascinating, both the descriptions of concrete events (the obtaining of water and food for survival, maintenance of the various systems on the boat) and the philosophical commentary about hope and despair, solitude, man’s place in nature, etc. are amazingly well-written. One of the best books I read in 2013.

Eric Weiner: The Geography of Bliss. The admittedly light-hearted question that the book explores is what makes people in different parts of the world happy – the answer could have been simply “it depends on the person” – but what we get instead is a travelogue of sorts through a number of faraway places, where the author meets and interviews people, sharing with us his own observations intermixed with “a new study has found…”-type of pop science and with tired stereotypes about each place. I wanted to like the book, the text has a nice flow and the author does make some amusing and occasionally some insightful observations – but for the most part I found it shallow and didactic, lacking any real curiosity or reflection, the author trying too hard in some parts to sound cute, in other places writing in an annoyingly defensive style – overall, a somewhat entertaining read, but quite pointless, unfortunately.

Albert Hofmann: LSD: My Problem Child. This is a leftover from 2012 which I forgot to write about last time. The book is a short autobiography of the chemist who inadvertently invented LSD – an interesting story well told, and an exceptionally clear and objective look at the effects of the invention, its potential, its dangers, and the ramifications to society. We meet interesting characters, including Timothy Leary, Ernst Jünger, and Aldous Huxley, and we follow the author and others on trips into to the mountains of Central America where they go through various psychedelic experiences.

Richard P. Feynman: The Meaning of It All. The book contains the transcripts of three public lectures given by Feynman in 1963 for a non-specialist audience. In his usual entertaining, anecdotal style, he discusses many loosely related topics, such as the scientific method, the value of science to society, religion, pseudoscience, dictatorships, conspiracy theories, and so on. Overall, a nice and accessible read, but less deep and original than his “Surely you are joking..” book, and apparently published against Feynman’s stated intentions.

BookBox: embed book widget, share book list

Karen Pryor: Don’t Shoot the Dog! The book talks about techniques for getting dogs, cats, dolphins, and other animals to do (or not do) various things. The methods are gentle, they are mostly based on well timed reinforcement signals and rewards, and the research and philosophy behind them seems sound. As the author points out through many examples, some of the techniques are no doubt useful when trying to influence human behaviour too (our own or of others) – for example when teaching children, or when trying to (un)learn some habit. There are also insightful observations about certain behaviors and phenomena, such as e.g. superstitions – overall, however, I found the method slightly oversold – and some of the case-studies involving human situations seemed a bit manipulative.

Maria Tatar: The Hard Facts of the Grimms’ Fairy Tales. After recently re-reading some of Grimm’s fairytales in original, I was curious to read Tatar’s analysis, as she appears to be a leading expert on the work. While it is interesting, the book also disappoints slightly: by “hard facts” she really means just that – we learn a bit (too much) about the biography of the brothers, the author dispels many myths about the way the tales were collected and edited, and she gives a detailed catalog of all the types of characters, heroes, villains, beasts, magical items, as well as the different turns of plot appearing in the tales. Few comments are made on deeper psychological symbolism of the tales, when such interpretations are mentioned, it is usually only to ridicule them. In large parts of the book, the author convincingly proves what the tales aren’t, without giving alternatives as to what they are instead. All in all, I found the book to be a well researched scholarly work, but (a bit) less colourful than expected.

Nelson Mandela: Long Walk to Freedom. The autobiography of Nelson Mandela, his personal life and the anti-Apartheid struggle: detailed, inspiring, written in an understated, no-nonsense tone and with exceptional clarity, radiating with optimism and strength on every page – one of the best non-fiction books I read in 2013.

Antal Szerb: Journey by Moonlight. I cheated here and I read this in the original Hungarian, but according to reviews, the English translation isn’t too shabby either. This is one of just about three or four novels written by Antal Szerb, who died tragically in 1945 and who was also an exceptionally erudite scholar and historian of literature. From such an author one might expect a “scientifically” written novel, but contrary to expectations, the novel is accessible and playful – it has subtle twists of plot, it shows deep understanding of character, and it never seems to take itself too seriously. This was by far the best fiction I have read in 2013.

BookBox: embed book widget, share book list

John Maynard Keynes: The Economic Consequences of the Peace. It is a rare opportunity to have one of the decisive historical moments witnessed and narrated by one of its sharpest and most eloquent contemporaries. In this case, this is exactly what happens, as Keynes gives a detailed account of the Paris Peace Treaty, an opinionated description that has more or less become the canonical interpretation of the historical event. He mixes keen observations on the characters (and character flaws) of the main participants, Clemenceau, Lloyd George, Wilson, and others, with a detailed analysis of the economic contents of the treaty and its projected consequences. The description is interesting for the back-of-the-envelope estimates Keynes makes, and for the detailed snapshot of how the world economy looked around the time of the first world war. The predictions about the untenability of the treaty and the ensuing catastrophes have mostly stood the test of time, further cementing Keynes’ reputation.

Richard Brautigan: Trout Fishing in America. An amusing short novel written with dry humour, containing small vignettes about various characters in rural (and occasionally in urban) America. One source of the humour is the usage of “trout fishing in America” in several ways: in its original meaning referring to the act itself, as the name of a person, as an adjective, etc.. Some reviewers compare Brautigan to Hemingway – which seems accurate – but the sensation one gets from this small book is that of an angrier, less focused but sometimes more imaginative author.

Richard P. Feynman: Perfectly Reasonable Deviations from the Beaten Track. This is a selection of Feynman’s personal correspondence, lovingly compiled by his daughter Michelle. We can follow through Feynman’s career and personal life, learn about his working style, travels, everyday routine, opinions on science, education, politics, etc. Some of the letters contain advice to colleagues or strangers, others are replies to fans, dispelling misunderstandings in physics, with patient and crystal clear explanations, some are diary-like accounts of events, others concern official business. Especially funny are the letters that show Feynman’s ambivalence towards accolades such as honorary degrees, academy membership and the Nobel prize. Also interesting are Feynman’s big picture explanations of theoretical physics and science in general, as well as his reviews of elementary textbooks – particularly his poignant criticism of the “New Math” of the 60s. The text is obviously less polished than his work written for a general audience, but nevertheless, it gives a glimpse into Feynman’s amazing (and amusing) personality, sharp mind and intellectual honesty, and occasional goofiness. The collection is worth a read even just as an example of a no-nonsense, precise style of letter-writing.

E. F. Schumacher: Small Is Beautiful. Schumacher’s 1973 book was ahead of its time in many ways, bringing up topics which today are known as sustainable development, environmentalism, anti-consumerism, simple living, organic movement, “go local” movement, regional development, decentralisation, etc. Some of the predictions in the book proved to be slightly inaccurate, and some of the statements might be overly general, but these detract only marginally from the overall message, which seems as relevant as ever and at a large scale as ignored as ever. The author’s main point is a criticism of global economy which he sees to be on a collision course and of prevalent economic philosophies that can be summed up as a pursuit of growth at all costs and an “idolatry of gigantism”. In Schumacher’s words “the most striking thing about modern industry is that it requires so much and accomplishes so little”. He sees as illusory the desire to create, in Gandhi’s words “systems so perfect that no-one will need to be good”. He exposes many inherent contradictions of mainstream economics and proposes sensible (if sometimes naive) alternatives. Memorable concepts of the book include: small scale operations, meta-economics, Buddhist economics, appropriate technology, intermediate technology. An inspiring and timely short book.

BookBox: embed book widget, share book list

Imre Lakatos: Proofs and Refutations. This acclaimed book contains a series of Socratic dialogues on the nature of mathematical discovery and proof, using Euler’s polyhedron theorem as a running example. The dialogues explore how definitions and theorems co-evolve, the role of formal and informal arguments, and most importantly, of local and global counterexamples. The dialogue format makes the book a pleasant read and also allows the author to present differing historical and modern views of mathematics.

Jean-Dominique Bauby: The Diving Bell and the Butterfly. A short memoir whose text was composed by selecting letters one-by-one, the author only being able to use his eyelids, as he suffered from locked-in-syndrome caused by a massive stroke, and died shortly after finishing the book. Some of the chapters depict small episodes from his earlier life, others explain his situation after the accident and his strategies for dealing with it. Tragic, hopeful, poetic, occasionally humorous, humbling. One of the most powerful books I’ve read.

Keith Johnstone: Impro: Improvisation and the Theatre. Impro seems to be the classic book on improvisation theatre, but it is a very enjoyable read for someone with little knowledge (and perhaps interest) in theatre too. Firstly it offers a glimpse into the mechanisms of a strange but beautiful microcosm (that of improv theatre) – it is always fun to learn about the concerns, the “rules of the game”, the system of values of some subculture totally different from our own. Furthermore, any field of human activity has its own metaphors, it influences thinking in its own way, and it spills over its boundaries, giving a toolset for its practitioner to tackle problems in other domains of life – thus taking a shot at becoming a “theory of everything”. This book is improv’s attempt at doing so, and not an unsuccessful one at that. We find insights about improvisation, acting, status, spontaneity, creativity, storytelling, concepts that are relevant to any human interaction, and the author has interesting things to say about all of them. The last part is a bit different, as the author discusses his experiences about teaching improvisation with masks, and relates it to trance-like states and hallucinations – slightly bizarre, but powerful and interesting nevertheless.

Werner Herzog: Conquest of the Useless. Herzog’s very entertaining diary from the time of production of the movie Fitzcarraldo. Mirroring the plot of the movie itself, the production is an endless ordeal in the Peruvian rainforest, apparently dragging on from failure to failure until the eventual completion. Besides the forces of nature and the bureaucratic obstacles, the erratic behavior of lead actor Kinski sets the tone for the book. Herzog gives a detailed account of the over-the-top everyday events of the production, as well as his own equally over-the-top commentary and reflections.

BookBox: embed book widget, share book list

Douglas R. Hofstadter: Fluid Concepts And Creative Analogies. In this book Hofstadter describes the research projects of his group, on attempting to model human thinking and problem solving. Rather than looking at the full messiness of the real world, the projects concern abstract microworlds, such as arithmetic puzzles of a certain kind or anagrams of words. Some of the studied problems are interesting in their own right – starting from a simple puzzle in recreational mathematics, deeper concepts are reached, others are thought-provoking due to their open-endedness, such as the analogies between strings in the CopyCat project. The approaches appear original and unconventional compared to mainstream AI or machine learning research – although they mostly fall short of efficiently solving the given problems. Hofstadter insists that his main goal was not efficiency (he doesn’t have much praise for world champion chess programs) – but rather to approach the problems in psychologically realistic ways. This might explain why there is no mention of computational complexity or other central concepts of computer science – however, without this connection, the study of heuristics seems a bit naive. This shortcoming is less apparent in the domains where the objective itself is ill-defined – for example when looking for a “reasonable” or “nice” continuation of a sequence of numbers. Here I think that it would have been beneficial to relate to other classical concepts, such as Kolmogorov complexity.

The most interesting parts of the book are those in which the author speculates about the workings of the mind, analyzing the way humans might approach and solve certain puzzles. Hofstadter has a knack for introspection, deconstructing a thought process into small explicit steps that seem entirely plausible on reading but which would be performed mostly unconsciously. These parts have a similar feel to Pólya’s “worked problems” in his books about problem solving. However, when the insights are translated into programs something is lost – while Hofstadter sees intelligence as an emergent phenomenon, the programs still appear to have his high level ideas for the particular domain “baked in” – something he vigorously criticizes in other projects. Taking all this into account, the results produced by the programs are rarely surprising.

The book is accessible and slightly more focused in scope than other books of Hofstadter, although there are some loosely related digressions. Some chapters are devoted to criticism of other research projects of the time – these seem to be mostly of historical value only. The “letter spirit” concept in the last chapter is fascinating on its own, but again, it tells more about the creativity of Hofstadter and his colleagues than about creativity or intelligence in general.

Seymour A. Papert: Mindstorms: Children, Computers, And Powerful Ideas. The book describes Papert’s vision of learning and the profound change that computers can bring about in education, especially at an early age. Inspired by Piaget’s theories and starting from his own experience of “constructively” learning mathematical concepts while playing with gears as a child, Papert presents a set of loosely connected ideas about curriculum-free, explorative learning and creativity, particularly related to mathematics. One of these ideas is to avoid a complete separation of the abstract thinking from physical, sensory-motor experiences. Another idea is that of a participatory, spontaneous, almost unconscious learning. As a concrete embodiment of these concepts, Papert advocates the use of his Logo system (turtle geometry) as an instrument for exploration. Such a use of computers by children is markedly different both from playing videogames and from using educational software that are little more than glorified textbooks: Logo lets children discover geometry in a natural, “local” way, it also effortlessly teaches modular, procedural thinking, and other algorithmic concepts. Papert’s vision is compelling, and a certain aspect of it has become reality: computers have indeed become powerful and widely available to children. However, the sweeping transformation in education that Papert envisioned has not been realized. While there are many options available to individuals, the mainstream trend has hardly been towards experimentation and imaginative tinkering, and more towards passive consumption and endless distractions. The “angry birds” of today seem to have far less potential than the turtle geometry of the 60s. Having already been fond of Logo, I still learned a great amount from the book, and I found it both informative and inspiring.

Marc Kac, Gian-Carlo Rota, Jacob T. Schwartz: Discrete Thoughts: Essays on Mathematics, Science and Philosophy. A beautiful collection of essays on the nature of mathematics and the state of its various subfields. A recurring theme is the interplay between experimental mathematics, applied fields, and the mathematical problems they inspire on one hand, and abstract, internal development of mathematics on the other – unsurprisingly, the debate on the relative merits of the two sides goes back to the ancient Greeks. The book is indeed less indiscreet than its companion, the quality of the writing and the depth of the essays is slightly uneven, but most parts of the book are interesting and stimulating.

Richard Branson: Losing My Virginity. The autobiography of Virgin Group founder Richard Branson – the story of his childhood and youth, his first business ventures, the early days of Virgin Music and Virgin Records, mainstream success after discovering Mike Oldfield: the ups and downs of building his business empire, personal life events, his sailing, ballooning and other stunts, the different domains into which his group expands and his philantropic initiatives. Overall, very entertaining and well written, interesting to learn about his approach to business, and the particularities of different businesses, as well as about his everyday life – a book filled with optimism and energy.

BookBox: embed book widget, share book list

Nassim Nicholas Taleb: Antifragile. After having thoroughly enjoyed Fooled by Randomness and Black Swan, I was eagerly awaiting the latest installment from Taleb – this time on the concept that is the opposite of fragility, which is, as he insists, not just robustness, i.e. resistance to damage, but antifragility, i.e. the capacity to actually improve in the face of mild stress, or of random perturbations. Taleb explores this deceptively simple concept in depth, bringing examples from economics, biology, medicine, ethics, and everything in between.

When it comes to Taleb’s three main books, many (though not all) of his critics admit the sharpness and originality of his ideas. Many are put off, however, by the strongly stated opinions and the aggressive, often self-congratulatory tone. On the contrary, I actually found these traits contributing to the charm of the book, it is enjoyable to read how commonly held beliefs are plain wrong and how everyone is stupid (as long as we imagine a “those present excluded” fineprint while reading, of course) – however, I was slightly annoyed by some of the repetition, and some of the case studies which seem more shaky than the others. The book gets much better towards the second half – Taleb is at his best in his philosophical asides and anecdotes involving Fat Tony, Socrates, and other characters.

One memorable concept in Taleb’s overall work is “domain dependence”, the failure to transfer competence from one domain to another – there is the story of the businessman paying the porter to lift his luggage, only to lift weights in the hotel gym ten minutes later. Taleb doesn’t suffer from domain dependence, in fact he has the opposite, which (in the spirit of his book) is not the ability to see connections between related fields, but the compulsion to draw analogies between any two things, sometimes even when there aren’t any or when they are obvious. Overall, I think the book contains many deep and counterintuitive observations, it gives a powerful perspective of looking at phenomena in terms of convexity or concavity of payoff, as well as excellent and usable practical philosophy. As all of Taleb’s writing, it challenges assumptions and provokes thinking – the fact that some parts are thought through less than others reduces its value only marginally.

Lucian Boia: History and Myth in Romanian Consciousness. This review refers to the Romanian original, “Mit şi istorie în conştiinţa românească”, which has been a bestseller in Romania and is credited with starting a current of “demythologization” of historiography. It has provoked heated debates and prompted adjectives ranging from the “courageous” and “first-of-its-kind” to the “unscientific” and “unpatriotic”. Considering its reputation, the book does relatively little myth-busting. The author finds questions such as “what really happened eight hundred years ago” less interesting than those of the type: “what did people a hundred years ago think about what really happened eight hundred years ago”. Questions of the second kind are easier to answer, and as Boia argues, they often hold more relevance to who we are and how we think today than questions of the first type.

As such, the book is not so much a Romanian history, but rather a “metahistory”, a history of Romanian historiography, which might sound like a dry topic, but the book is anything but – Boia describes in an entertaining, often subtly ironical style the mechanisms of myth-making, and traces how the dominant views of historical events evolved throughout the ages. These views were influenced partly by the intellectual currents of the time, partly by the political realities and necessities of each era. Especially interesting is the study of how the perceived significance of historical figures fluctuated throughout the ages.

The mythologization of history is not unique to Romania of course, but Romanian history does seem to offer a combination of a scarcity of resources for certain periods (resulting in a relative lack of consensus among historians), and a tendency (especially during the later part of the communist era) of considering historiography a tool for “nation-building”, legitimization of the regime, and a means of diverting attention from everyday (economic, social) problems. As Boia vigorously points out (and as some of the reactions to the book illustrate), the legacy of these trends largely lives on today. Boia’s book thus makes the important step of presenting historical myths for what they are, and treating them unemotionally, as prime objects of study. Due to the fine observations, intimate knowledge and sympathy towards the sources, and intellectual honesty displayed, the book is an excellent read, although possibly slightly confusing for someone not immersed in Romanian history.

Tony Hsieh: Delivering Happiness. An autobiography of sorts of successful web entrepreneur, Tony Hsieh. The first part of the book is more personal, telling the story of the founding of his companies, different business decisions, ups and downs of running the businesses and some personal life events. The tone is lighthearted and the stories are funny. The second part focuses more on the approaches in managing, and includes a codified “company culture” document with ample discussion on its margin. While the ideas about fanatical customer support and encouraging group cohesion within the company are interesting and probably of practical value to entrepreneurs, as the book progresses, the tone becomes disturbingly cult-like – despite claiming the opposite, the text steers into a pretentious and generic business mumbo-jumbo.

Avinash K. Dixit, Barry J. Nalebuff: Thinking Strategically. The book sets itself the goal of analysing elements of strategy in the abstract with illustrative examples from history, politics, business, sports, movies, or parenting. This might sound like a lofty approach, but the book is in fact excellent, and the examples are analysed with insight and humour. The book can be read as an introduction into elementary game theory, and it is one of the most engaging attempts at this that I have seen. It explains with clarity concepts such as sequential and simultaneous games, mixed strategies, the basics of voting systems, bargaining theory, and incentive schemes. While the book is readable for a wide audience, it doesn’t shy away from including some mathematical details when describing the structure of certain systems.

Alternatively, the book can be read as a collection of practically useful tactics and tricks one can employ in diverse situations. I find it hard to imagine how one could memorize all such “patterns”, but some simple concepts, such as that of “brinkmanship” are certainly memorable.

BookBox: embed book widget, share book list

Johan Huizinga: The Waning of the Middle Ages. (originally written in 1919)
This beautiful and influential book paints a broad picture of the Late Middle Ages, describing a world of contrasts and extreme passions, in which a feeling of decadence and insecurity, a “sombre melancholy” permeates everything. In Huizinga’s Middle Ages religious excess, solemn rituals, ceremony and pompousness are omnipresent but slowly becoming superficial, turning into parodies of themselves. In Huizinga’s words, the culture of the era is “overripe”, about to be fundamentally transformed (although Huizinga sees the transition to Renaissance differently from many other authors). Behind the carefully maintained idealized facade, we get occasional glimpses into the violent realities of life in the Middle Ages – often irrational, sometimes grotesque or comical.

The main premise of the book is that contrary to popular imagination, people in the Middle Ages had a worldview and a set of ideals and aspirations vastly different from ours. The book explores many aspects of this worldview, pertaining to love, religion, politics, art, life and death. After convincingly arguing and illustrating this thesis, in the end, the book still leaves us with the feeling that at a deeper level, or at least in its fragments, the mentality of the Middle Ages was not so alien afterall. The book is insightful and opinionated, the descriptions are vivid, they rely on ample documentation, and the arguments are compelling. The author is at his best when interpreting works of art. However, given the goal of capturing something as fuzzy as the general mood of an era six hundred years ago, the book is perhaps closer to an extended literary essay than to a precise scholarly work. Many of the passages are thought-provoking, although many of the ideas have long become part of the mainstream narrative about the Middle Ages in the time since the book was written.

Andrew S. Grove: Only the paranoid survive. The author shares his insights about the IT industry and about working in technology, drawn from his successful career as CEO of Intel. Interesting topics include the story of the pentium bug, the decision of Intel to focus on microprocessors instead of memories, the transformation of the computer industry from vertical to horizontal stratification (although, arguably, later the opposite has also been happening). The recurring theme is that of “inflection points” in technology, periods of fundamental change, in which, well, “only the paranoid survive”. The author describes subtle signals that hint at an impending change and describes strategies for dealing with it. Drawbacks of the book include the slight repetitiveness, and somewhat self-congratulatory tone, but overall I found the book insightful and also prescient for a work written almost twenty years ago.

Elizabeth Marshall Thomas: The Harmless People. We get a glimpse into the life of the Bushmen (also known as San people) in the Kalahari Desert during the 50s, when the author spent long periods of time living among them. The book describes in a simple, sympathetic manner the Bushmen’s everyday life, their hunting-gathering activities, their family life and customs, their music and celebrations, as well as their views of life and the world. There’d be little to be unsympathetic, observing the simplicity, gentleness and naivety, as well as the extensive practical knowledge about the surrounding natural world of these people. The book is beautifully written, the author avoids any trace of condescension or stereotypical portrayal – she observes the characters respectfully and lovingly and describes them in a precise and vivid way. In the last chapter the author revisits the places and people thirty years later, to find the world she earlier described in an irrevocably transformed state. She finds this transformation to be mostly for the worse: the lives of most of the characters have taken tragic turns, the traditional way of life has proven powerless against the encroaching modern society, and the implicit social order previously observed has broken down. Despite the sobering analysis, the book ends on a note of optimism about the future.

Apostolos Doxiadis, Christos H. Papadimitriou, Alecos Papadatos, Annie Di Donna: Logicomix: An Epic Search for Truth. A writer, a computer scientist, and two illustrators make a comic about the foundations of mathemathics, telling embellished stories about the lives of Bertrand Russel, Georg Cantor, Kurt Gödel and other great logicians, with cameo appearances by the authors themselves. How can the end result be? The answer is: surprisingly (or predictably) good.


Did you read some of these books and have a different opinion? Let me know in the comments.

What was the best book you read in 2013?

A puzzle of apples and oranges

I first read the following wonderful little puzzle on Mathoverflow, where it was posted by David Futer. I don’t know who invented it or where it appeared first – if you have any information, please let me know. (Note: Toshihiro Shimizu points out the following link: AoPS)

There are 99 bags, each containing some number of oranges and some number of apples (any of these numbers can be zero). Show that we can always select 50 bags out of the 99, such that these 50 bags together will contain at least half of all the apples and at least half of all the oranges.

I like this puzzle because there are many different ways to solve it. Before reading on, do yourself a favor and try solving it yourself – it is not difficult, but very rewarding. If you find a proof that is different from those below, let me know and I’ll add it to the list.


Why look for different solutions, once we already have one? Partly because it’s fun, partly because each solution might give some new insight into the problem, partly because we can showcase different mathematical techniques, and perhaps most importantly, because each new proof lends itself to generalization slightly differently – so a new proof might lead to new results. The proofs below are sketchy in places, in order to keep them short – it can be instructive to try to fill in the gaps, or to try coming up with the proof after reading just the name of the technique.

Sometimes proving a more general statement is easier (at least psychologically – by removing distracting elements). In our case, it will be more “fruitful” to prove the following statement:

Given \(n\) bags that contain oranges and apples, we can select \( \lceil \frac{n+1}{2} \rceil \) bags such that they contain at least half of each type of fruit. (we only really care about the case when \(n\) is odd)

Here are all the proofs that I am aware of.

Proof 1. (sorting) (source: I vaguely recall seeing this proof on a math competition website but I can’t find it anymore. If you have any information, please let me know see AoPS link above).

Sort the \(n\) bags in decreasing order of the number of apples (break ties arbitrarily). Select the first bag, and from each following consecutive pair select the one with more oranges (break ties arbitrarily).
It is easy to see that the selected bags satisfy the required properties.

Proof 2. (induction) (source: idea from David Futer).

First we show that for \( n=3 \) the statement is true. Indeed, we can select the bag with the most apples and the one with the most oranges (if the two are the same, we also pick another bag).
Assume that the statement is true for \(n-2\). We then prove it for \(n\):
Consider a “good split” of the \(n-2\) bags into A and B, such that A has at least half of both fruits in A+B (such a split exists by the induction assumption). Now we try adding the other two bags, x and y.
If A has as many of both fruits as B+x or as B+y, we are done (we add the other bag to A).
Otherwise if B+x or B+y have as many of both fruits as A, we are done (we add both bags to B).
Otherwise if A has less apples than B+x and less apples than B+y we are done (add to A the one of x and y with more apples). Symmetric argument for oranges.
Otherwise if A has less apples than B+x and less oranges than B+y, it means that A has less of both fruits than B+x+y, which we treated already. Symmetric argument after swapping the terms “apples” and “oranges”. There are no other cases.

Proof 3. (contradiction or “canonical counter-example”)

Suppose there is no solution, then consider a set of \( \frac{n+1}{2} \) bags which has at least half of the apples (there is always such a set), and among such sets the one with the most oranges possible.
Let this set be A, and the remaining set B (A has at least half of apples and B has more than half of oranges).
Take the bag from A with the least oranges and move it to B. Now take the bag from B with the most oranges and move it to A. If we just moved the same bag back, it means that all bags originally in B have at most as many oranges as the bag with the least oranges in A. This contradicts the fact that B has more than half of all oranges.
Otherwise there are two cases: 1) the number of apples in A drops below half when we move a bag, which means that B becomes a solution, or 2) the number of apples in A remains at least half throughout, then after the swap we increased the number of oranges in A, which contradicts the statement that the chosen set had the highest such value.

Proof 4. (counting) (source: I read this proof written in probabilistic form by Erez Druk, after being asked by Dror Ozeri on a puzzle forum – as it is a closed forum, I cannot link to it)

I write this and the next proof in the n=99 case only, to avoid excessive notation. The number of 50-sets of bags is the same as the number of 49-sets \({99 \choose 49}\). Denote the number of 50-sets that contain at least half of all oranges by \(N_{50}\), and the number of 49-sets that contain at least half of all oranges by \(N_{49}\). We have \(N_{50}>N_{49}\), unless none of the bags contain any oranges, which is an easy case anyway (there is one-to-one mapping from 50-sets to 49-sets, such that each 50-set contains its mapped 49-set). Clearly, more than half of all 50-sets contain at least half of all oranges (otherwise the previous inequality would be violated: if an 50-set does not contain at least half of all oranges, then its complementing 49-set does).
The same argument holds for apples. As more than half of the 50-sets contain at least half of the oranges, and more than half of the 50-sets contain at least half of the apples, there is at least one 50-set containing at least half of both fruits.

Proof 5. (price function) (source: from David Futer)

Assign price \(x\) to an orange, and \(1-x\) to an apple. Select the 50 most expensive bags.
\(f(x) = (\mbox{# of oranges in top 50}) / (\mbox{total # of oranges})\)
\(g(x) = (\mbox{# of apples in top 50}) / (\mbox{total # of apples})\)

We have \(g(0) \geq 1/2, f(1) \geq 1/2\), and \(f\) is increasing, \(g\) is decreasing. Clearly there is some \(x\) for which both \(f\) and \(g\) are at least 1/2, which denotes a solution (if for some \(x\) both functions would have values below 1/2, then the top 50 would not be the top 50, as its complement would have more apples and more oranges, therefore higher total price).

Proof 6. (geometry – Ham-Sandwich) (source: from David Futer, and a variant explained by Dror Ozeri on the same forum)

We represent the bags as sets in the plane, with the restriction that a straight line can cross at most two of them (this can always be achieved). We somehow encode the number of oranges and apples in each bag. There are more possible variants here: map each fruit to a red or blue point (depending on its type); or map all oranges of a bag to a single point with a weight that tells the number of oranges, and similarly for all apples of a bag; or map the whole bag to a single point for which we define an “apple-measure”, and an “orange-measure”. Now we apply the Ham sandwich theorem to obtain a balanced cut (there are different variants of this theorem, we choose the one suitable for our representation). If any “bags” are cut by the obtained line, we move them to the side which has less bags. This gives a valid solution.


  • Prove the following more general statement. Which proof technique is suitable?
    Given \( n\) bags that contain \( f\) kinds of fruits, we can select \( \lceil \frac{n+f-1}{2} \rceil \) bags that contain at least half of all kinds of fruits.
  • Prove the following variant. Which proof technique is suitable?
  • Given 100 bags that contain apples and oranges, we can select 34 bags that contain at least one third of the apples and one third of the oranges.

  • Find a general statement with \(n\) bags, \(f\) kinds of fruits and finding a subset of the bags that contains at least a \(p\)-fraction of the fruits of each kind, for arbitrary \( 0 \le p \le 1\).
  • Assume that we know the contents of each bag. Which proof leads to an efficient algorithm for finding a good subset of bags in the sense described above? In particular, which proof leads to a linear time \(( O(n) )\) algorithm?
  • Can you come up with new proofs for the puzzle? If so, I would love to hear them. Do any of the proofs admit new generalizations?

Thanks to David Futer, Lavinia Dinu, Manuel Arora, and Tobias Mömke for interesting discussions about this puzzle.

Nonlinear speedometer

As everyone knows, excessive speed is one of the main causes of traffic accidents. One of the reasons for reckless speeding is that when we think of the possible impact of a collision, our intuition fools us. We tend to assume that if we go twice as fast (say, at 80 km/h instead of 40 km/h), then the impact of a collision will be twice as large. In fact, the impact of a collision is proportional not to the speed (V), but to the kinetic energy of the vehicle (m/2 * V²), which is transformed to work as the car decelerates to zero. Here m is the mass of the vehicle, which we cannot do much about, but the other term is the square of the velocity. This means that the “impact” of hitting a wall at 80 km/h is four times as large as that of the same collision at 40 km/h.

To capture this intuition, the idea of this post is a speedometer design, that scales as the square of the velocity, to give the driver a more realistic view of the effects of speeding.

nonlinear speedometer
(larger image)


  • after sketching this drawing, I found that in 1995 Goetz Kluge was thinking along similar lines and produced a similar design. There are however some differences – see next.
  • besides showing the speed in the more familiar, circular layout, I also made the decision to draw the smaller lines at uniform density (i.e. they are spaced proportionally to the kinetic energy, not to the velocity). This design might make it harder to estimate the exact speed, say between 40 km/h and 50 km/h, but it makes the increased kinetic energy easier to grasp – when we speed from 120 km/h to 130 km/h, the hand crosses more “small lines” than from 20 km/h to 30 km/h.
  • while I haven’t seen a similar speedometer in any real car, some of the existing speedometers are in fact nonlinear. Unfortunately, they seem to achieve the exact opposite effect to the design shown above. See this example. On the linked image, and in many other modern speedometers, the manufacturers try to put more resolution in at the lower half of the range, dilating the velocities between 0 and 80. My guess is that car makers do this, because the car accelerates faster at low speeds, so dilating the lower range makes the hand movement seem more uniform at all speeds – this is probably aesthetically more pleasing, but it might come at a cost of an increased (false) sense of security, and in effect a reduced safety.

    (I got this exactly wrong – thanks Tobias for pointing it out – it is actually the opposite, on the proposed speedometer design, the hand would move more evenly – so besides being safer, it would be more aesthetical as well, the only downside seems to be the reduced resolution in the low range)

Map of named colors

In this post I present a visualization of all named colors, ranging from the more common to the more obscure, including green, blue, yellow, orchid, lavender, turquoise, acquamarine, azure, etc. etc.

From various sources [1] I downloaded around 2600 such names, I merged the different lists, and slightly edited the entries to remove duplicates and to make the collection more consistent. Then I trained a SOM (Kohonen map) [2] with 6000 units, using the open source toolbox [3] for Matlab. Finally I exported the results with some custom scripts I wrote. This last part was done similarly to the visualization of languages described in my previous post. Otherwise the data was much more reasonable this time than in the previous post: 2600 data points, 3 features (red, green, and blue values) and no missing entries.

The first plot shows the trained SOM map with each hexagonal unit being colored according to its location in the three-dimensional RGB space. The other three maps show only the red, green, and blue values of the units, respectively.

exp_som.png (~6 MB)
Figure 1. SOM map of named colors.

The second plot contains the actual names of the colors and the corresponding hexadecimal RGB values. The goal was to place similar colors near each other as much as possible.

exp_som_text_scale.png (~12 MB)
exp_som_text.svg (~3 MB) (NEW: svg format)
Figure 2. Map with color names.


Name That Color project (and references therein) by
Chirag Mehta.
Wikipedia: List of colors.

Color Space Dimensionality Reduction project by Aubrey Jaffer and references therein. This project includes two dimensional maps of colors created using different techniques.

In this project SOM is used to visualize colors, but only a handful of colors are shown.

Some of the colors in the final list I used seem to have been submitted by users at the Colourlovers website.

[2] Scholarpedia:

[3] SOM Toolbox:

Visualizing the languages of the world

I recently came across the WALS (World Atlas of Language Structures) data set [1] which contains structural information about 2676 languages and dialects [2] from all around the world. The WALS website contains a detailed explanation of all the features and it also shows on the world map the geographical distribution of the languages and of the different features. As soon as I saw the data I started thinking about how to visualize it differently in order to highlight similarities between languages and language families.

When laypeople (myself included) naively think about similarities between languages, they mostly consider similarities between words (such as between the English “brother” and German “Bruder”). As the WALS data set ignores the vocabulary, and describes instead the deeper, structural aspects of language, I was hoping to obtain some interesting visualizations.

This post describes in detail the steps I took to visualize the data. If you want to see the resulting images only, scroll down to the end.

1. The data set

The WALS data set contains 2676 languages and 192 features, mostly related to phonology and grammar. For example, one feature describes whether a language has one, two or three grammatical genders, another feature refers to the inflection of nouns, yet another refers to the usual position of the verb in the sentence, and so on.

The features seem to have been meticulously collected and curated, however, many values are missing. More precisely, out of roughly half a million possible entries, only 15% are available. It is not clear whether a missing feature means that it has not been observed, or that it does not make sense for a given language. I assume it is mostly the first case: for languages spoken by very few people it must be difficult to collect reliable data and on some languages there are few publications available. However, there are also cases when a feature is not applicable to all languages. For example, one of the features is “Nasal Vowels in West Africa” which is missing for all languages outside of West Africa. It would improve the analysis if we could distinguish between these two types of missing values, but as I had insufficient information to do this labeling myself, I simply considered all missing values of the same type.

Besides the structural description of the languages, the data set contains the following meta-data: the classification of languages into families, subfamilies and genera, the ISO-code of each language and the geographical location where the language is spoken. The latter information is not always useful, as it points to a single location. In case of Russian, for example, this location is Moscow. For Portuguese, there is no separate entry for Brazilian Portuguese, yet the geographic location indicates only Lisbon.

2. Preprocessing

Looking into the data, one thing that can be observed is that a few languages share the same ISO-code – these seem to be closely related dialects. When this is the case, it seems that only one of the languages has the full description, and for the others only the differences are encoded. Therefore, as a first preprocessing step I copied over features between pairs of languages that have the same ISO-code, whenever a feature was present for one, but not for the other language. Overall, this operation affected the proportion of missing values only marginally (about 100 languages of the 2676 have had some feature copied over). However, this again raises the question of the exact meaning of missing data. If there are other cases when some features are missing because they are “too obvious”, that could distort our analysis – however, we have to live with this uncertainty for now.

As a next preprocessing step, I “binarized” the data. All the features in the data set are “categorical”. For example, feature #52, called “Comitatives and Instrumentals” [3] has the following possible values: 1 for “Identity”, 2 for “Differentiation”, and 3 for “Mixed”.
Without getting into the details of what this feature (or other features) really mean, it is clear that the assignment of numerical values is quite arbitrary here. Therefore, to perform numerical operations on the data further on, it made sense to split this feature into three. The new features (called 52.1, 52.2, 52.3) encode whether the value of this feature is 1, 2, or 3. (for example, a value of 2 is transformed into 010 and 3 is transformed into 001).

Some of the features, however are of “ordinal” type. For example, the first feature has the following five values: 1 “Small”, 2 “Moderately small”, 3 “Average”, 4 “Moderately large”, 5 “Large”. In this case (and for some other features of such type) it is more sensible to binarize the data as follows: 1 becomes 10000, 2 becomes 11000, …, 5 becomes 11111. That is, we split the feature into different parts that encode whether the value is larger than some threshold. I binarized all features in one way or the other (overall I used the “ordinal” method for 12 features of the 192).

I performed these preprocessing steps with some short Matlab scripts I wrote, and in the end I obtained a binary data set with 1141 columns and 2676 rows and (still) most of the entries missing.

3. Dimensionality reduction

Besides the missing entries, the biggest obstacle in the way of visualizing the data is that it lives in 1141 dimensions. As a first step, I reduced the number of dimensions to 30, to make the data easier to handle. Typically, this is performed using a linear projection method called principal component analysis (PCA) [4]. When some data is missing, usually a simple “imputation” method is used, such as deleting the columns or rows with missing entries, or replacing the missing data with column averages. However, when such a large portion of the data is missing, these simple tricks are insufficient.

Instead, we can use a nonlinear version of PCA, tuned especially for the case with many missing values. This method, while reducing the number of dimensions, also attempts to “fill in” the missing values. In this case, the conceptual simplicity and efficiency of PCA is mostly lost, but the methods are still quite efficient. The theory and assumptions behind such methods are described in this paper written by A.Ilin and T.Raiko [5] (disclaimer: former colleagues). Implementations are available in an open source toolbox for Matlab [6] developed by the same authors.

Using PCA when the data is binary is not quite the optimal choice, and there exist variants especially tuned for binary data (see for example our 2009 paper [7] for some theory and experiments), however, as my goal here was not prediction or learning, just visualization, I decided to stick to the simpler methods.

After this operation, we are left with only 30 “aggregate” features and no missing data. We are now ready to visualize the data.

4. Visualization

I first decided to look at some very small subset of the data, therefore I selected the 24 rows corresponding to Romance languages [8] according to their “genus” label. To visualize these data points, I trained a SOM (Kohonen map) [9], using the open source toolbox [10] available for Matlab. This method embeds a two-dimensional grid into the higher dimensional (in this case 30 dimensions) space, such that the grid is “close” to the data points. At the same time the neighbors in the grid exert some “pulling” on each other, which serves as a kind of regularization. In the end we can visualize the grid points in the so-called U-Matrix. For visualization I wrote some custom scripts that plot the output that I saved from the SOM toolbox.
Figure 1. Romance languages U-Matrix.

Here the grid is hexagonal and has 40 cells. On the image we can see each of the 24 data points (Romance languages) mapped to the closest grid cell (in the 30 dimensional space). The color of a cell is indicative of the distances from neighboring cells (brighter color means larger distance), in this way if there are clear clusters, they will appear as dark valleys separated by bright “ridges”.

Our hope with the visualization is that pairs of languages that are similar would be mapped nearby and pairs that are different would be mapped far away. Unfortunately, in general, not all of these constraints can be satisfied, so we can’t always draw conclusions from the fact that two languages appear close to each other. This is where the coloring of the U-Matrix should help. In many cases, though, it seems that proximity does indeed correspond to linguistic similarity (here, the dialects of Romansch [13] are mapped together, just as Romanian together with Moldavian, etc.)

To get some intuition on how the languages were laid out on the map, I made a plot of the original features on the same map. This is on the next image. There are 192 copies of the same hexagonal map (one for each feature). The locations of the languages are the same as in the U-Matrix. On these maps the value of the feature is indicated for each language. The descriptions of the features can be found on the WALS website [11]. Looking at these maps we can check for two languages that appear nearby, on which features they agree. Black is for “missing data”.
Figure 2. Romance languages feature map.

Let’s also plot the geographic location of the languages on the same map. Note however, that this information was not used during the training phase. Here again the languages appear in the same positions as before, and their geographic location in degrees is indicated by color. The goal of this plot is to see if there is some correlation between geographic distance and linguistic difference. We can read off some obvious facts from the plots, such that Canary Islands Spanish is both the westernmost and the southernmost, and Moldavian is the second easternmost after Ladino [13]. Otherwise the number of data points is a bit too small here to observe interesting correlations.
Figure 3. Romance languages Lat/Lon.

On the two previous types of visualizations a hexagonal cell is colored with a single color if there is a single language mapped to it, or if all languages in that cell share the same value for a feature. Otherwise a small pie-chart is displayed showing the share of each value.

Now let’s look at a larger subset. I selected all languages from the Indo-European family of languages [12], according to their “family” label. This meant 176 rows in the data set. The visualizations are similar to the earlier ones, but now I also made an extra plot showing the different genera within this family. Here the black color shows whether a language belongs to a given genus. It can be observed that the clustering on the SOM is in many places consistent with the classification information. This means that the usual taxonomy of languages is largely consistent with the structural linguistic data, which is hardly surprising. There are however, some unlikely pairs of languages placed close to each other on the map. One could examine the data to check in each case whether there is really some structural similarity, or we can simply write it off as noise. We can also observe a quite sharp East/West separation between languages, and a bit less pronounced North/South separation (the latter picture is complicated by an outlier – the southernmost language of the family is Afrikaans [13]). For many of the features a nice clustering of the values can be observed.
Figure 4. Indo-European languages U-Matrix. (large file)
Figure 5. Indo-European languages feature map. (very large file)
Figure 6. Indo-European languages Lat/Lon.
Figure 7. Indo-European languages genera.

Finally, let’s visualize all languages from the data set in a similar way. To restrict attention to spoken languages only, I removed the 40 sign languages from the data set. There remain 2636 languages. The resulting maps are very large, but quite informative, again, there is an apparent clustering that closely resembles the known taxonomy of languages, but other similarity relations can also be observed on the map. As the maps became very large, here I only include the U-Matrix.
Figure 8. All languages U-Matrix. (very large file)

Please note that two languages can be placed nearby even if there is very little similarity between them, especially if one of the languages has very few features filled in. This is a constraint of the method used, but also inherent to the problem itself – we lose information by reducing the number of dimensions. So please think twice before using these plots for drawing conclusions about the relatedness of very distant exotic languages or for supporting such theories :)

To (partially) overcome the previously mentioned problem, I also plot the data when restricted to languages having relatively few missing entries. This also makes the maps somewhat more managable in size. Here are the plots for the data sets having less than 160 missing features.
Figure 9. All languages with sufficient data: U-Matrix.
Figure 10. All languages with sufficient data: Lat/Lon.
Figure 11. All languages with sufficient data: families.

All figures: [link to album]


[1] Dryer, Matthew S. & Haspelmath, Martin (eds.). 2011. The World Atlas of Language Structures Online. Munich: Max Planck Digital Library.

[2] “A language is a dialect with an army and a navy” – Max Weinreich
The question of what constitutes a language and the distinction between language and dialect is a difficult, and often emotionally and politically charged question – for example, in this data set Romanian and Moldavian appear as different languages, whereas many consider them to be too similar even to be considered separate dialects.

[3] Stolz, Thomas & Stroh, Cornelia & Urdze, Aina. 2011. Comitatives and Instrumentals.
In: The World Atlas of Language Structures Online. Max Planck Digital Library, chapter 52.


[5] Ilin, Raiko: Practical Approaches to Principal Component Analysis in the Presence of Missing Values, JMLR, 2010.

[6] PCA with Missing Values software:

[7] Kozma, Ilin, Raiko: Binary Principal Component Analysis in the Netflix Collaborative Filtering Task, MLSP, 2009.


[9] Scholarpedia:

[10] SOM Toolbox:

[11] WALS features:


[13] Romansch: