In 2021, we’re making an index of every language online

If you’re an archivist, librarian, or programmer and would like to volunteer your time, please sign up.

Around the world, people are reawakening ancestral languages, proof that endangered languages need not be endangered forever. As long as a language is documented and that documentation is readily available, cultural descendants can learn it and raise new generations of native speakers. However, while the vast majority of languages have been documented to a degree, the vast majority of language documentation isn’t readily accessible, a roadblock to revitalization.

And yet. With nearly four billion people online, there has been an explosion of mother-tongue content in the form of memes, YouTube channels, public feeds on WhatsApp and Telegram, and other kinds of accessible media. In addition, there are two centuries of linguistic research gathering dust in university archives. As far as we can tell, there has been no effort to comprehensively index all of this content. How many of the world’s 7,000 languages are already accessible online?

Using programmatic web crawling and social listening systems, we plan to find an answer. In an initial phase, we will index which of the world’s languages have freely available materials on open platforms like the Internet Archive and Wikimedia Commons, as well as at archival institutions with public-facing web portals, like the U.S. Library of Congress. In a second phase, we will comb the Internet for social media accounts and free websites dedicated to individual languages or groups of languages. By building an open index of every language on the Internet, we can make it possible for people from hundreds of cultures to access materials in under-resourced languages, a necessary first step to rebuilding cultural sovereignty.
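For the programmers reading, here is a rough sketch of what such an index might look like once crawlers have collected results. It is only an illustration under our own assumptions, not the project's actual design: the `Record` schema, the `build_index` and `coverage` helpers, the example.org links, and the choice of ISO 639-3 codes as language identifiers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One freely available item found by a crawler (hypothetical schema)."""
    language: str   # language identifier, assumed here to be an ISO 639-3 code
    platform: str   # where the item was found
    url: str        # public link to the item


def build_index(records):
    """Group item URLs by language, then by platform, dropping duplicates."""
    index = defaultdict(lambda: defaultdict(set))
    for rec in records:
        index[rec.language][rec.platform].add(rec.url)
    return index


def coverage(index, all_languages):
    """Return (covered, total): how many languages have at least one item."""
    covered = sum(1 for code in all_languages if index.get(code))
    return covered, len(all_languages)


if __name__ == "__main__":
    # Placeholder records; the URLs are not real items.
    records = [
        Record("haw", "Internet Archive", "https://example.org/item/1"),
        Record("haw", "Internet Archive", "https://example.org/item/1"),
        Record("mri", "Wikimedia Commons", "https://example.org/item/2"),
    ]
    index = build_index(records)
    covered, total = coverage(index, ["haw", "mri", "wym"])
    print(f"{covered} of {total} languages have indexed materials")
```

Keeping a set of links per platform means that the same item found twice is only counted once, and comparing the index against a full list of languages gives a running answer to the question of how many are already accessible online.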