August 14, 2011

Picture from Wikipedia - Distribution of Korean family names. Blue is Kim, Green is Lee, Orange is Park, Red is Choi, Purple is Jung, Gray is all others.

A fascinating statistical study of Korean family names has revealed that “Kim” has been at the top of the popularity charts for the past one and a half millennia. Researchers at Umeå University in Sweden and Sungkyunkwan University in Korea took data from special Korean family books, which traditionally record a family's genealogical tree. They determined that the distribution of Korean family names is well described by the Random Group Formation (RGF) model, which also predicts, to a very good approximation, the word-frequency distribution of novels written by an author.

The RGF model captures the features of the group-size distribution when a large collection of objects is divided into a number of groups. In the present case, the persons are the objects and groups are formed by people sharing the same family name. The current population of Korea is about 48 million and there are about 250 distinct family names currently in use. The model assumes optimal mixing or the maximum entropy condition, which essentially implies that marriage between individuals of any two family names is equally likely and there is no segregation or isolation of any group.

The key prediction of the RGF model is that the group-size distribution depends solely on the total number of objects, or population size. For example, the word-frequency distribution of a novel will depend only on the total number of words it contains. The RGF model is shown to predict the number of new Korean family names added and also to fit the general family name distributions of the past 500 years very well.

Another feature of the RGF model is that the size of the largest group is always proportional to the total size of the entire data set. In the word-frequency case, this means that the number of occurrences of the most common word in an English text, “the”, is proportional to the total number of words in the text. In the case of Korean family names, it implies that the number of people with the most common name, “Kim”, in a randomly selected group of Koreans should always be proportional to the size of the group chosen, irrespective of the historical time or the group/population size. This is verified to remarkable accuracy for group sizes varying over six orders of magnitude and over the time period 1500-2000 AD. The study suggests that about 20% of the population has shared the name “Kim” for the past 1500 years, and that in 500 AD about 10,000 people out of a population of 50,000 had the “Kim” family name.
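To make the proportionality concrete, here is a minimal Python sketch. It is not the RGF model itself, only the arithmetic of a constant share. The 20% figure and the two population sizes come from the numbers quoted above; the function name `expected_kims` is made up for illustration, and the present-day count it prints is a simple extrapolation rather than a figure from the study.

```python
# Assumption: the "Kim" share is a constant 20%, the rough figure quoted above.
KIM_SHARE = 0.20


def expected_kims(population: int, share: float = KIM_SHARE) -> int:
    """Expected number of people named Kim if their share of the group is constant."""
    return round(population * share)


# Population sizes quoted in the article; the labels are descriptive only.
groups = [
    ("500 AD, population 50,000", 50_000),
    ("today, population about 48 million", 48_000_000),
]

for label, population in groups:
    print(f"{label}: roughly {expected_kims(population):,} people named Kim")
```

Changing the population changes the expected count by the same factor, which is all that "proportional to the total size" means here.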

The authors speculate that these findings point to some core stability in Korean culture that has remained intact for more than a thousand years. It is interesting to note that the fractions of people sharing other, less common family names fluctuate quite a bit over time rather than staying constant.

Baek, S., Minnhagen, P., & Kim, B. (2011). The ten thousand Kims. New Journal of Physics, 13 (7). DOI: 10.1088/1367-2630/13/7/073036

