Clean up currently available lexicons

Related to the request to require approval for a new lexicon, I would like to request a cleanup of duplicate/wrong lexicons already in the system. Obviously this only applies to duplicates and wrong lexicons (e.g. “Lexicon 1”). No common names for taxa would be thrown out, just merged into a single consistent lexicon where appropriate.

For an idea of what I’m talking about, here’s the lexicon list I posted to the request for lexicon approval thread:

Australia
Australian
Indigenous Australian
[note that Australia has many indigenous languages]

Creole (English)
English (Creole)

German
Deutsch

Spanish
Spanish (Chile)
Spanish (Perú)
Español (Argentina)
Español (Chile)
Español (Costa Rica)
Español (Ecuador)
Español Chileno

French
Français

Greek
Greek (Modern)
Modern Greek (1453 )

Ju|'hoan
Juǀ’hoan

Oshi Kwanyama
Oshikwanyama

Portuguese
Português

Ru Kwangali
Rukwangali

Setswapong
Setswapng

Slovak
Slovakian

Sotho ( Northern)
Sotho (Northern)

Swahili
Kiswahili

Teenek
Tenek

Tonga
Tongan

Manx
Manx English Dialect

Navaho
Navajo

Zeelandic
Zealandic

Scientific Names
Nombres Científicos
Nomi Scientifici
Noms Scientifiques
Wetenschappelijke Namen
Wissenschaftliche Namen

Lexicon 1
Native Name
Eng
Spa

|language|lexicon|COUNT|
|---|---|---|
|und|Aou 4 Letter Codes|1226|
|und|AOU 4-Letter Codes|1150|
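
Many of the pairs above differ only in spacing, capitalization, or punctuation, so a rough first pass could flag them automatically. The following is a minimal sketch, not the script used on the site; `lexicon_key` and `group_near_duplicates` are hypothetical helper names. It will not catch translations (German/Deutsch) or typos (Setswapong/Setswapng), and it deliberately leaves Tonga and Tongan apart.

```python
import re
import unicodedata
from collections import defaultdict


def lexicon_key(name: str) -> str:
    """Reduce a lexicon name to a comparison key (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Drop everything that is not a letter or digit, then ignore case.
    return re.sub(r"[\W_]+", "", without_marks).casefold()


def group_near_duplicates(names):
    """Return the groups of two or more names that share the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[lexicon_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


sample = [
    "Oshi Kwanyama", "Oshikwanyama",
    "Sotho ( Northern)", "Sotho (Northern)",
    "Aou 4 Letter Codes", "AOU 4-Letter Codes",
    "Tonga", "Tongan",
    "German", "Deutsch",
]
for group in group_near_duplicates(sample):
    print(group)
```

Anything flagged this way would still need a human to decide which spelling is the one to keep.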

Please take great care with Tongan and Tonga!
Tonga is a south-eastern African language.
Perhaps mistakes have been made!


I guess 141 names have an incorrect language ‘und’ and lexicon ‘Русский’. The update below will probably violate unique constraints…

I think that the records with
language=‘und’ and lexicon=‘Русский’
should be
language=‘ru’ and lexicon=‘Russian’
Seems to be 141 records.

UPDATE VernacularNames
SET language = 'ru',
    lexicon = 'Russian'
WHERE language = 'und'
  AND lexicon = 'Русский';
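
On the unique-constraint worry: one way to check before updating is to look for rows that already have a counterpart under the target lexicon. This is a minimal sketch against an in-memory SQLite table. The `taxon_id` and `name` columns, the uniqueness rule on `(taxon_id, name, lexicon)`, and the sample rows are all assumptions for illustration; the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE VernacularNames (
        taxon_id INTEGER,   -- assumed column
        name     TEXT,      -- assumed column
        language TEXT,
        lexicon  TEXT,
        UNIQUE (taxon_id, name, lexicon)  -- assumed constraint
    )
    """
)
conn.executemany(
    "INSERT INTO VernacularNames VALUES (?, ?, ?, ?)",
    [
        (1, "name a", "und", "Русский"),
        (1, "name a", "ru", "Russian"),  # already exists under the target lexicon
        (2, "name b", "und", "Русский"),
    ],
)

# Rows that would collide with an existing 'Russian' row after the update.
collisions = conn.execute(
    """
    SELECT old.taxon_id, old.name
    FROM VernacularNames AS old
    JOIN VernacularNames AS new
      ON new.taxon_id = old.taxon_id AND new.name = old.name
    WHERE old.language = 'und' AND old.lexicon = 'Русский'
      AND new.lexicon = 'Russian'
    """
).fetchall()
print(collisions)

# Update only the rows that have no counterpart yet.
with conn:
    updated = conn.execute(
        """
        UPDATE VernacularNames
        SET language = 'ru', lexicon = 'Russian'
        WHERE language = 'und' AND lexicon = 'Русский'
          AND NOT EXISTS (
              SELECT 1 FROM VernacularNames AS dup
              WHERE dup.taxon_id = VernacularNames.taxon_id
                AND dup.name = VernacularNames.name
                AND dup.lexicon = 'Russian'
          )
        """
    ).rowcount
print(updated)
```

The rows left behind under ‘und’ are then exact duplicates of existing ‘Russian’ entries and could be reviewed or deleted separately.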

They are dated around 2019-04-01T14:50:08Z, so whatever was creating them may have been stopped later in 2019. You can find the records in the file for the lexicon that is not in English: VernacularNames-.csv

Also, the lexicon ‘other’ is remarkable… it is often related to (imports of) VernacularNames

VernacularNames source
http://www.catalogueoflife.org/annual-checklist/details/species/id/10073534/common/100107 and the Vernacular name is often Lampalampa. Maybe delete these entries?

@jwidness Maybe remove your Tonga/Tongan entries in the post above?

There may be a problem when different languages share the same name but are quite different and are spoken in different countries. Luo in Cameroon may be completely unrelated to Luo in Kenya. This problem needs some attention. The same goes for Tonga in Africa and Tonga in the Pacific. Then there is Yao in China and Yao in East Africa, and Ata in the Philippines and Ata in Papua New Guinea. There must be many other examples of duplicate names. Perhaps the country or continent needs to be written after a language, at least for some countries!


Si lozi = Silozi


und
https://en.wikipedia.org/wiki/ISO_639:u


synonyms = {
“Aou 4 Letter Codes” => [“Aou 4 Letter Codes”],

Another (recently added?) duplicate lexicon that needs to be merged into Scientific Names is Scientific Name. Its existence makes it take a little longer than it otherwise would to tab through the new taxon name page when adding scientific name synonyms.


Made a github issue here: https://github.com/inaturalist/inaturalist/issues/3693


Setswapong is correct; Setswapng is a mistake that should be removed.


There are at least two names with the Setswapng lexicon:
https://www.inaturalist.org/taxa/340101-Cissus-cornifolia
https://www.inaturalist.org/taxa/340321-Xanthocercis-zambesiaca

Selete ( Botswana) → Selete?
https://www.inaturalist.org/taxa/588237-Kalanchoe-paniculata


I’m looking into this today. I have a script that does a lot of this stuff, so I’m trying to update that script and to automate it so it runs once a month or something.

However, it’s not at all clear to me what should be considered the canonical lexicon for any given set of synonyms. Is it reasonable to just choose the first name for the language on the English Wikipedia page for that language? For example, that would mean “Kwangali,” “RuKwangali,” and “Ru Kwangali” would all become “Kwangali.”

I’m not going to do anything about “Luo” if there are multiple languages with that name. There are only ~12 names with that Lexicon. It seems like people have been using “Tongan” for the Pacific island language and “Tonga” for the African language, so I don’t think there’s anything to do there.

Lexicons like “Und” and “Indigenous” will just be left alone since they need manual attention. I guess I could synonymize “Creole (English)” and “English (Creole)” but they’re both equally useless, right? There are a bunch of English creoles out there and presumably they could have different names for the same taxa.
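
To make the idea concrete, the synonym handling described above could look roughly like the sketch below. It is not the actual script: `SYNONYMS`, `MANUAL_ONLY`, and `canonical_lexicon` are hypothetical names, and the entries are only examples taken from suggestions in this thread, not a decided list.

```python
# Canonical lexicon -> spellings to fold into it (illustrative entries only).
SYNONYMS = {
    "Kwangali": ["RuKwangali", "Ru Kwangali", "Rukwangali"],
    "German": ["Deutsch"],
    "Scientific Names": ["Scientific Name", "Wissenschaftliche Namen"],
}

# Lexicons that need a human decision and are never rewritten automatically.
MANUAL_ONLY = {"Und", "Indigenous", "Luo", "Tonga", "Tongan"}

# Invert the table once so each variant points at its canonical lexicon.
CANONICAL = {
    variant: canonical
    for canonical, variants in SYNONYMS.items()
    for variant in variants
}


def canonical_lexicon(lexicon: str) -> str:
    """Return the canonical lexicon, or the input unchanged if none applies."""
    if lexicon in MANUAL_ONLY:
        return lexicon
    return CANONICAL.get(lexicon, lexicon)


for lexicon in ["Ru Kwangali", "Deutsch", "Tonga", "Swahili"]:
    print(lexicon, "->", canonical_lexicon(lexicon))
```

Keeping the manual-only set explicit means ambiguous names like “Luo” stay untouched even if someone later adds a synonym entry for them by mistake.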


Had a quick look at the script; I may have missed it.
Occasionally there are names that don’t have a lexicon, and as far as I’m aware there is no way to search for them. Would it be possible to assign them to something like ‘Und’ or ‘Other’ so they can be found in the DWCA export for correction where possible?


A few more for the script
English(india)
Español (Uruguay)
Wissenschaftliche Namen → Scientific Names

український → Ukrainian ?
臺灣閩南語 → Taiwanese Hokkien ?
Swahili (Individual Language) → Swahili ?


It should be impossible to create taxon names without a lexicon, but there are some that exist from before we instituted that rule. I don’t want to give them a lexicon of “Und” or something because that basically violates the rule, but I can include them in the DwC-A under the language und (a real ISO 639 code!). That would meet your need, right?
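
As a rough illustration of that export behavior (a sketch only, not the real export code): names keep an empty lexicon, but the language column falls back to `und`. The `LEXICON_CODES` mapping, the column headers, and the sample rows are assumptions made up for this example.

```python
import csv
import io

# Assumed lexicon -> ISO 639 code mapping; only two entries for illustration.
LEXICON_CODES = {"English": "en", "Russian": "ru"}


def language_code(lexicon):
    """Return the code for a lexicon, or 'und' when it is missing or unknown."""
    if not lexicon:
        return "und"
    return LEXICON_CODES.get(lexicon, "und")


def write_rows(names):
    """Write (name, lexicon) pairs as CSV text with a language column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["vernacularName", "language", "lexicon"])  # assumed headers
    for name, lexicon in names:
        # The lexicon itself stays empty; only the language falls back to 'und'.
        writer.writerow([name, language_code(lexicon), lexicon or ""])
    return buffer.getvalue()


print(write_rows([("name a", "English"), ("name b", None)]))
```

Filtering the export on `language == "und"` together with an empty lexicon would then surface the old names that need correcting.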

The first/top name in the Language Infobox on English Wikipedia pages should be the exonym, though I’m not sure how reliable this is. Another option would be to use the name given in ISO 639 or Glottolog.

Yes, the few I can recall were imports from a while back.

Yes, that would work

I would like that discussion to be on a flag, so local people can say which is the correct name for their language. Or at least a locally based scientist.


To be clear, you are asking for the English name of the language, right?

(As an example, for the language spoken in Germany, you would want “German”, not “Deutsch”, since language names are translated in Crowdin.)


The English name for the lexicon; see also the topic title. Georgian and Macedonian might be exceptions.

Ju Hoan, Ju|'hoan, Juǀ’Hoan → Juǀ’hoan
Swati, Si Swati → Swazi
Oshi Kwanyama, Oshikwanyama, Kwanyama ( Ovamboland) → Kwanyama
Papiamentu → Papiamento
Punjab → Punjabi