Tuesday, February 14, 2017

Quantitative Measures of Linguistic Diversity and Communication

Of the 7097 languages in the world, twenty-three (including the usual suspects: Mandarin, English, Spanish, various forms of Arabic, Hindi, Bengali, Portuguese) are spoken by half of the world's population. Hundreds of languages have only a handful of speakers and are disappearing quickly; one language dies every four months. Some parts of the world (dark green regions in the map) are linguistically far more diverse than others. Papua New Guinea, Cameroon, and India have a profusion of languages, while in Japan, Iceland, Norway, and Cuba a single language dominates.

Why are languages distributed this way and why such large variations in diversity? These are hard questions to answer and I won't be dealing with them in this column. So many factors – conquest, empire, globalization, migration, trade necessities, privileged access that comes with adopting a dominant language, religion, administrative convenience, geography, the kind of neighbors one has – have had a role to play in determining the course of language history. Each region has its own story and it would be too hard to get into the details.  

I also won't be discussing the merits and demerits of linguistic diversity. Personally, having grown up with five mutually unintelligible Indian languages, I am biased towards diversity – each language encapsulates a unique way of looking at the world and it seems (at least theoretically) that a multiplicity of worldviews is a good thing, worth preserving. But I am sure there are opposing arguments.

Instead, I'll restrict my focus to the following questions. How can the linguistic diversity of a particular region or country be numerically quantified? How do different parts of the world compare? How do we account for the fact that languages may be related to one another, and that individuals may speak multiple languages?

In tackling these questions, my primary source and guide is a short paper published in 1956 by Joseph Greenberg [1]. Greenberg's main goal was to create objective measures that could, in the future, be used "to correlate varying degrees of linguistic diversity with political, economic, geographic, historic, and other non-linguistic factors." His paper proceeds from the assumption that linguistic surveys have been conducted and data on what people consider their mother tongue/first language, the number of speakers of each language, vocabulary etc. are already available. Ethnologue is an example of such a global survey [2].

The Linguistic Diversity Index

The most basic measure Greenberg proposed is the now widely used linguistic diversity index. The index is a value between 0 and 1. The closer the value is to 1, the greater the diversity. The index is based on a simple idea. If I randomly sample two individuals from a population, what is the probability that they do not share the same mother tongue? If the population consisted of 2000 individuals and each individual spoke a different language as their mother tongue, then the linguistic diversity index would be very nearly 1. If they all shared the same mother tongue, then the index would be 0. If 1800 of them spoke language M and 200 of them spoke N, then the index would be:

1 – (1800/2000)² – (200/2000)² = 0.18

In the above, (1800/2000) is the probability that a randomly picked individual speaks M as their first language/mother tongue, and (1800/2000)² is the probability that two randomly picked individuals both speak M. Similarly, (200/2000)² is the probability that both randomly picked individuals speak N as their mother tongue. When we subtract these squared terms from 1, what remains is the probability that the two randomly sampled individuals do not share a mother tongue. In this particular example, the index of 0.18 is low because of the dominance of M.

If there are more than two languages the procedure is the same. You would have one squared term that needs to be subtracted for every language. In a population of 10,000 where 10 languages are spoken and each language is considered a mother tongue by exactly 1000 speakers, the index would be:

1 – 10 x (1000/10,000)² = 0.9.

This high value reflects both the number of languages and how evenly distributed they are in the population. 
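
The two calculations above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea as described here, not Greenberg's own notation: the function name `diversity_index` and the language labels are my assumptions, and the two individuals are sampled independently, as in the formulas above.

```python
from collections.abc import Mapping


def diversity_index(speakers: Mapping[str, float]) -> float:
    """Probability that two randomly sampled people have different mother tongues.

    `speakers` maps each mother tongue to its number of speakers.
    """
    total = sum(speakers.values())
    # One squared term per language: the chance that both people speak it.
    same_tongue = sum((count / total) ** 2 for count in speakers.values())
    return 1.0 - same_tongue


# The two worked examples from the text (labels are placeholders).
two_languages = diversity_index({"M": 1800, "N": 200})
ten_languages = diversity_index({f"L{i}": 1000 for i in range(10)})

print(round(two_languages, 2))  # 0.18
print(round(ten_languages, 2))  # 0.9
```

Passing raw speaker counts rather than proportions keeps the input close to what a census table would provide.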

In fact, there are fifteen countries whose linguistic diversity exceeds 0.9, as the table above shows (based on Ethnologue data [2]). The list is dominated by 11 African countries, with Cameroon at number two. India, whose linguistic diversity I experienced firsthand for twenty years, is at number 13. Two Pacific island nations – Vanuatu and Solomon Islands: small islands these, and yet so many languages! – are in the top 5. First on the list is Papua New Guinea whose 4.1 million people speak a dizzying 840 languages! The country's index of 0.98 means that each language has about 5000 speakers on average and that no language dominates as a mother tongue. 

In his book The World Until Yesterday, Jared Diamond, who did a lot of his fieldwork and research in New Guinea, has this startling anecdote:  
"One evening, while I was spending a week at a mountain forest campsite with 20 New Guinea Highlanders, conversation around the campfire was going in several different local languages plus two lingua francas of Tok Pisin and Motu…. Among those 20 New Guineans, the smallest number of languages that anyone spoke was 5. Several men spoke from 8 to 12 languages, and the champion was a man who spoke 15. Except for English, which New Guineans often learn at school by studying books, everyone had acquired all of his other languages socially without books. Just to anticipate your likely question – yes, those local languages enumerated that evening really were mutually unintelligible languages, not mere dialects. Some were tonal like Chinese, others were non-tonal, and they belonged to several different language families."
How different from what the majority of us are used to! 

While New Guinea's linguistic diversity is widely recognized and not in doubt, its high language count and the rampant multilingualism that Diamond observed nevertheless lead us to two flaws in the linguistic diversity index.

The first flaw is that the index assumes languages are well defined, mutually exclusive units. It ignores the relatedness between languages and the fact that a dialect may be arbitrarily called a language. What of cases where there is close relatedness and even mutual intelligibility, for example between Hindi and Urdu, or between Spanish and Italian? And what to make of those cases where two dialects may well be closely related, but nevertheless are mutually unintelligible when spoken? Further, the language question seems loaded with the question of identity and politics. Apparently there is a running joke among linguists: "A language is a dialect backed by an army and a navy."

To partially address this, Greenberg -- who recognized these problems, and was well aware of the difficulties of distilling complex language realities into quantitative measures -- suggested that the resemblance between languages or dialects could be numerically quantified by a value between 0 and 1. This is what I understood from his paper: take the combined current vocabulary of a pair of languages and calculate the proportion of words that are common to both languages in relation to the total list of words. This proportion gives us an approximate measure of resemblance. A resemblance close to 1 means that the two languages are virtually identical, and a resemblance close to 0 implies an almost total lack of relatedness.
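
Read literally, the proportion described above is the number of shared words divided by the size of the combined word list. Here is a rough sketch of that reading. It is my interpretation rather than Greenberg's procedure: the function name `resemblance` is made up, the word lists are placeholder tokens and not real vocabulary, and words count as "common" only if they match exactly.

```python
def resemblance(vocab_a: set, vocab_b: set) -> float:
    """Share of the combined word list that is common to both languages."""
    combined = vocab_a | vocab_b
    if not combined:
        return 0.0  # assumption: two empty word lists have no measurable resemblance
    return len(vocab_a & vocab_b) / len(combined)


# Placeholder tokens standing in for vocabulary items (purely hypothetical).
language_a = {"w1", "w2", "w3", "w4"}
language_b = {"w3", "w4", "w5"}

score = resemblance(language_a, language_b)
print(score)  # 0.4 (2 shared words out of 5 in the combined list)
```

A real comparison would also need a rule for deciding when two differently spelled or pronounced words count as the same, which this sketch leaves out.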

The resemblance can then be used to adjust the linguistic diversity index. Suppose there are three languages M, N and O spoken by 1/8th, 3/8th and 1/2 of the population and suppose the resemblances between [M, N], [M, O], and [N, O] are 0.85, 0.3 and 0.25. The unadjusted linguistic diversity index is about 0.59. If we adjust for resemblance, this value drops to about 0.38 -- diversity is not as high as it originally seemed. I have explained the calculations at the end of the piece [3].

The second flaw in the index is that, by considering only an individual's mother tongue, it ignores multilingualism. As Diamond's New Guinea anecdote shows, a high linguistic diversity does not necessarily represent a lack of communication. The examples of Indonesia, India and the many countries of Africa show that it is possible to communicate in some common languages, lingua francas that span large parts of the population, while yielding space to local mother tongues. So a different kind of measure is required.  

Index of Communication

To accommodate multilingualism, Greenberg proposed the index of communication. As before, the index is a value between 0 and 1. A value close to 1 indicates high communicability and a value close to 0 indicates the opposite. If I randomly pick two individuals in a population, and each individual speaks one or more languages, then what is the probability that the individuals share at least one language in common? To ensure communicability, only one language has to overlap. (This index too has its problems. One flaw is that it ignores how well an individual speaks a particular language – something that might be hard to elicit in a survey. Another is how to set the threshold of communicability - is knowing a few basic words sufficient?)

Consider the simplest case where a population speaks only two languages, M and N. Using a census, you can calculate the proportion of the population that speaks M only, N only, and is bilingual in M and N. Suppose those proportions are 0.5 (speak M only), 0.3 (speak N only) and 0.2 (speak both M and N). To calculate the index of communication, I simply subtract the cases where the two individuals cannot understand/communicate with each other, which happens when the first individual speaks only M and the other only N, and vice-versa: 

1 – [0.5 x 0.3] – [0.3 x 0.5] = 0.7 

The same idea can be extended to more than two languages. 
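
To show how the extension might work, here is a small sketch that handles any number of languages. The representation is my own assumption, not something from Greenberg's paper: each group in the population is described by the set of languages its members speak, and the function name `communication_index` is hypothetical. As in the example above, the two individuals are sampled independently.

```python
from collections.abc import Mapping


def communication_index(repertoires: Mapping[frozenset, float]) -> float:
    """Probability that two randomly sampled people share at least one language.

    `repertoires` maps a set of languages to the share of people who speak
    exactly that set.
    """
    total = sum(repertoires.values())
    no_common_language = 0.0
    for langs_a, share_a in repertoires.items():
        for langs_b, share_b in repertoires.items():
            # The pair cannot communicate only if their language sets are disjoint.
            if langs_a.isdisjoint(langs_b):
                no_common_language += (share_a / total) * (share_b / total)
    return 1.0 - no_common_language


# The two-language example from the text.
population = {
    frozenset({"M"}): 0.5,       # speak M only
    frozenset({"N"}): 0.3,       # speak N only
    frozenset({"M", "N"}): 0.2,  # bilingual in M and N
}

index = communication_index(population)
print(round(index, 2))  # 0.7
```

Adding a third language only means adding more entries to the dictionary; the double loop already checks every ordered pair of groups.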

I'll try to illustrate the index with a personal example. The engineering college I attended in the south Indian city of Trichy had students from all parts of the country. At the time the college was called Regional Engineering College (REC); it is now called the National Institute of Technology. There was one REC in each major Indian state. The RECs had a unique admission policy. Half of the engineering students admitted each year were from the local state – in the case of Trichy, the home state was Tamil Nadu – and the remaining half were from outside the state. The more populous states, such as Uttar Pradesh and Bihar, got more students, but even far-flung parts, the Northeast and Kashmir, had some representation.

In my first year, all the 400-odd male engineering students were packed into the same hostel (dormitory), with 5 students sharing a room. In what seemed like a deliberate policy of integration, the students were assigned rooms so that 2-3 of the students were from Tamil Nadu and each of the others was from a different state. Since states in India are organized along linguistic lines, you had 3-4 mother tongues in each room. In the corridors you could hear the two dozen major languages of India [4].

Despite all this diversity, communication was never a problem. Among the North Indians almost everyone knew Hindi and so Hindi was the bridge between mother tongues. The local state students – they were colloquially called Tambis by the North Indians – spoke Tamil but did not understand Hindi and were even hostile to it (even today, the Indian prime minister Narendra Modi's emphasis on Hindi annoys my Tamil friends). But all students, whether North Indian or Tamil, had some working knowledge of English – the language of the textbooks, which everyone aspired to speak well if only to get access to good jobs after graduation. So English – however grammatically inaccurate or spotty – was the bridge between the locals and the North Indians.

If I randomly sampled two individuals from that student population of 400, there was a good chance that the two students would have different mother tongues (high linguistic diversity), but due to multilingualism they would have at least one language in common. So the index of communication was essentially 1, if we ignore the question of proficiency.

My own case was somewhat different but by no means unique. Although I was born with Tamil as my mother tongue, I had lived mostly in West and Central India and had picked up Hindi, Gujarati and Marathi socially (the last two have dropped off due to lack of practice). I applied to college as an out-of-state student, but was really returning to my home state. In Trichy, I could communicate in Tamil with all the local students. Indeed, my colloquial command of Tamil – all the bad words included – went up! With everyone who was not from Tamil Nadu, I used mostly Hindi or English. I learned, to my surprise, that my ability in conversational English was poor, because I'd never really spoken it socially.

The college experience I've described applies more generally. Many parts of India are like this: different language communities live together in cities and along borders between states and multilingualism facilitates communication.  

To summarize, Greenberg's two indices capture contrasting aspects of language reality in a population. The diversity index captures the number of mother tongues and how evenly represented they are in relation to each other, while the index of communication captures how connected a population is.

In theory, a population could retain its linguistic diversity while also maintaining the high index of communication essential in a globalized world. In practice, however, a worldwide rise in communication appears to be happening at the expense of linguistic diversity. The numerous but lesser known languages of Australia, North America, Central and South America are losing ground quickly. Africa is the only continent bucking the trend. India's twenty-odd major languages are still doing quite well, but others are not – check out these podcasts (1 and 2) by Padmaparna Ghosh and Samanth Subramanian on the challenges of linguistic surveys and the inevitability of language loss.

Finally, here are brief notes on two different countries: Mexico and the United States. I've had a long-standing interest in both these countries. Drawn to its pre-Columbian indigenous past, I traveled to Mexico six times – from Chiapas to Oaxaca in the south, to Michoacán and Mexico City in the center, to Chihuahua in the north. The United States, meanwhile, has been home for the last 16 years.

Mexico

In the last section of his paper, Greenberg demonstrates how his two measures – linguistic diversity index and the index of communication – stack up when it comes to the 31 states of Mexico, and Mexico as a whole. To do this, he used bilingual data from a census in 1930. 

Mexico's indigenous languages began to decline after the Spanish conquest of Mexico in 1521. In Greenberg's calculation, Mexico's linguistic diversity index (unadjusted for resemblance) was 0.31 in 1930 while its index of communication was 0.83. Among individual states, though, there was a great deal of variation. The federal district (DF – Distrito Federal), which includes the highly populous Mexico City, had a much lower linguistic diversity of 0.12 while its index of communication was 0.99 – virtually 1, which makes sense because Spanish is indispensable in the capital. The state of Oaxaca, which I have visited twice recently and where indigenous groups have a strong presence, had the highest linguistic diversity index of 0.83. In Greenberg's data, Oaxaca's index of communication of 0.47 was the lowest in Mexico.

But this was in 1930; I am sure things have changed in the last 86 years towards greater communicability and lower diversity as Spanish continues to be dominant. According to Ethnologue, Mexico's language count is 290 but its diversity index is down to 0.11. Most likely – this is a guess – its index of communication, which was already 0.83 in 1930, is well over 0.9 now.    

United States

According to Ethnologue, the US has 430 languages, of which 219 are indigenous and 211 are immigrant languages. North America before European settlement was teeming with indigenous languages from different families. California was one of the most linguistically diverse places in the Americas, with around 70-80 languages from 20 language families.

Because of the sustained ethnic cleansing that happened after European arrival, the vast majority of American Indian languages are now teetering on the brink of extinction. English is dominant, which explains the country's relatively low linguistic diversity of 0.34. English is also why the United States' index of communication is likely to be very high – above 0.9 if not close to 1 (this is a guess and is not based on data). Today an American Indian who speaks, say, Navajo or Cherokee, can communicate in English with a recently naturalized Indian-American whose original mother tongue was, say, Telugu.

Despite English's dominance, the United States does have a certain linguistic richness to it, thanks to immigrants (citizens or not) who have come from all other continents to make a living here. By some estimates 800 languages are spoken in New York City!

References and Footnotes

1. Greenberg, Joseph H. "The measurement of linguistic diversity." Language 32.1 (1956): 109-115.

2. Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). 2016. Ethnologue: Languages of the World, Nineteenth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.

3. Greenberg's adjustment for resemblance between languages: Suppose there are three languages M, N and O spoken by 1/8th, 3/8th and 1/2 of the population and suppose the resemblances between [M, N], [M, O], and [N, O] are 0.85, 0.3 and 0.25. Then the linguistic diversity index adjusted for resemblance is:

1 – [(1 x 1/8 x 1/8) + (1 x 3/8 x 3/8) + (1 x 1/2 x 1/2)]
– [(0.85 x 1/8 x 3/8) + (0.85 x 3/8 x 1/8)]
– [(0.3 x 1/8 x 1/2) + (0.3 x 1/2 x 1/8)]
– [(0.25 x 3/8 x 1/2) + (0.25 x 1/2 x 3/8)]
≈ 0.38

The first line is exactly the linguistic diversity index we have already seen, without adjusting for resemblance. There are 3 languages, so there is one squared term for each language. Each term calculates the probability that both randomly picked individuals speak the same language. There is a multiplier of 1 since the resemblance of a language to itself is 1. If we used only the first line, we would get an unadjusted linguistic diversity index of about 0.59.

The next 3 lines take care of relatedness between language pairs. The second line calculates the probability that the first randomly picked individual speaks M and the second speaks N, and vice versa. The multiplier of 0.85 indicates that there is a high resemblance, therefore speaking M and N should be treated (almost) like speaking the same language. Lines 3 and 4 do the same for language pairs [M, O] and [N, O], using the respective resemblance multipliers. In the end the adjusted diversity index gives us a value of about 0.38, significantly lower than the unadjusted value of about 0.59.
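
For anyone who wants to check the arithmetic, the calculation can be sketched in Python with exact fractions. The function name `adjusted_diversity_index` and the dictionary layout are my own assumptions for illustration; only the population shares and resemblance values come from the example above.

```python
from fractions import Fraction


def adjusted_diversity_index(shares, resemblance):
    """Diversity index where each pair of languages is weighted by its resemblance.

    `shares` maps a language to its share of the population; `resemblance`
    maps an unordered pair (a frozenset of two languages) to a value in [0, 1].
    """
    weighted_overlap = Fraction(0)
    for lang_a, share_a in shares.items():
        for lang_b, share_b in shares.items():
            # A language resembles itself completely, hence the multiplier of 1.
            r = 1 if lang_a == lang_b else resemblance[frozenset((lang_a, lang_b))]
            weighted_overlap += r * share_a * share_b
    return 1 - weighted_overlap


shares = {"M": Fraction(1, 8), "N": Fraction(3, 8), "O": Fraction(1, 2)}
resemblance = {
    frozenset(("M", "N")): Fraction(85, 100),
    frozenset(("M", "O")): Fraction(3, 10),
    frozenset(("N", "O")): Fraction(1, 4),
}

# Setting every resemblance to 0 recovers the unadjusted index.
no_resemblance = {pair: Fraction(0) for pair in resemblance}
unadjusted = adjusted_diversity_index(shares, no_resemblance)
adjusted = adjusted_diversity_index(shares, resemblance)

print(float(unadjusted))  # 0.59375
print(float(adjusted))    # 0.3828125
```

Evaluated exactly, the unadjusted index is 19/32 and the adjusted index is 49/128, which round to the values of roughly 0.59 and 0.38 discussed above.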

4. The beautiful Indian language tree illustration is by Minna Sundberg.   

5. This piece was first posted at 3 Quarks Daily.

