Module talk:Lang/data

From Wikipedia, the free encyclopedia

Edit request 24 March 2025

Description of suggested change: Add support for additional proto-languages, under their family's ISO 639-5 codes:

  • Proto-Kartvelian: ccs
  • Proto-Uralic: urj

I ran into the need to tag these languages while performing language cleanup in Laryngeal theory. I'm certain their articles would benefit from proper tagging, as well.

Diff:

["ca-x-old"] = "Old Catalan",
["cel-x-combrit"] = "Common Brittonic", -- cel in IANA is Celtic languages
+
["ca-x-old"] = "Old Catalan",
["ccs-x-proto"] = "Proto-Kartvelian", -- ccs in IANA is Kartvelian languages
["cel-x-combrit"] = "Common Brittonic", -- cel in IANA is Celtic languages

["sla-x-proto"] = "Proto-Slavic", -- sla in IANA is Slavic languages
["yuf-x-hav"] = "Havasupai", -- IANA name for these three is Havasupai-Walapai-Yavapai
+
["sla-x-proto"] = "Proto-Slavic", -- sla in IANA is Slavic languages
["urj-x-proto"] = "Proto-Uralic", -- urj in IANA is Uralic languages
["yuf-x-hav"] = "Havasupai", -- IANA name for these three is Havasupai-Walapai-Yavapai

EnronEvolved (My Talk Page) 22:32, 24 March 2025 (UTC)

{{lang|fn=name_from_tag|link=yes|ccs-x-proto}} → Proto-Kartvelian
{{lang|fn=name_from_tag|link=yes|urj-x-proto}} → Proto-Uralic
Trappist the monk (talk) 22:57, 24 March 2025 (UTC)
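
For readers unfamiliar with how entries like these are consumed, the lookup can be sketched in a few lines of Python. This is not Module:Lang's actual Lua code: the table name `OVERRIDE_NAMES` and the Python function `name_from_tag` are hypothetical stand-ins, and the table holds only the entries quoted in the diff above.

```python
# Hypothetical stand-in for the private-use table in Module:Lang/data.
# Only entries quoted in the diff above are included.
OVERRIDE_NAMES = {
    "ca-x-old": "Old Catalan",
    "ccs-x-proto": "Proto-Kartvelian",    # ccs in IANA is Kartvelian languages
    "cel-x-combrit": "Common Brittonic",  # cel in IANA is Celtic languages
    "sla-x-proto": "Proto-Slavic",        # sla in IANA is Slavic languages
    "urj-x-proto": "Proto-Uralic",        # urj in IANA is Uralic languages
}


def name_from_tag(tag):
    """Return the display name for a private-use tag, or None if unknown.

    The tag is lowercased first because BCP 47 tags are not case-sensitive.
    """
    return OVERRIDE_NAMES.get(tag.lower())


print(name_from_tag("ccs-x-proto"))
print(name_from_tag("URJ-x-proto"))
```

A flat table keyed by the full tag keeps each addition a one-line change, which matches the shape of the requested diff.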

Edit request 24 March 2025

Description of suggested change: Add a language code for a couple more proto-languages, also using their groups' ISO codes:

  • Proto-Finno-Ugric: fiu
  • Proto-Samic: smi

I hear Proto-Finno-Ugric is a debatable proto-language these days, but I'm running into the need to tag it in Laryngeal theory.

Diff:

["egy-x-old"] = "Old Egyptian",
["gem-x-proto"] = "Proto-Germanic", -- gem in IANA is Germanic languages
+
["egy-x-old"] = "Old Egyptian",
["fiu-x-proto"] = "Proto-Finno-Ugric", -- fiu in IANA is Finno-Ugric languages
["gem-x-proto"] = "Proto-Germanic", -- gem in IANA is Germanic languages

["sem-x-taymanit"] = "Taymanitic",
["sla-x-proto"] = "Proto-Slavic", -- sla in IANA is Slavic languages
+
["sem-x-taymanit"] = "Taymanitic",
["smi-x-proto"] = "Proto-Samic", -- smi in IANA is Samic languages
["sla-x-proto"] = "Proto-Slavic", -- sla in IANA is Slavic languages

EnronEvolved (My Talk Page) 23:32, 24 March 2025 (UTC)

{{lang|fn=name_from_tag|link=yes|fiu-x-proto}} → Proto-Finno-Ugric
{{lang|fn=name_from_tag|link=yes|smi-x-proto}} → Proto-Samic
Trappist the monk (talk) 00:15, 25 March 2025 (UTC)

@Trappist the monk: I am curious what you think of the Belarusian Latin alphabet AKA "łacinka". The IANA language-subtag-registry for BCP47 does not seem to say much in this regard. For "be", I could only find variants "be-1959acad" and "be-tarask" and that "Cyrl" script should be suppressed with "be" (but not "Latn"). Since some Belarusian seems to actually be/have been originally written in "łacinka" (vs. transliterated for readers of Latn scripted languages), is this better as a variant via something like "be-łacinka" (I am not sure that technically qualifies due to the "ł") or a romanization via something like "be-Latn-łacinka"? And should "łacinka" be added here as a transliteration addition to translit_title_table? What is the best way to mark up such text: with a {{lang|be-Latn-łacinka|...}} or {{translit|be|łacinka|...}} or something else? Thank you, Uzume (talk) 18:23, 31 March 2025 (UTC)

From the point of view of Module:Lang, latn script is latn script regardless of alphabet, so the general case is {{lang|be-latn|łacinka text}} or {{langx|be-latn|łacinka text}}. When the text is a łacinka-alphabetic romanization of Cyrillic Belarusian, you can use {{transl|be|łacinka}}. So far as I know, łacinka is not a 'romanization standard' so is not supported by {{transl}}.
We do not create variants like 1959acad and tarask because they must first be registered with IANA (there is no external standard from which variant subtags are derived).
If it is important to do so, you might consider creating a separate template like {{lang-sr-Latn}} which hard-codes the language label to link as [[Gaj's Latin alphabet|Serbian]]. I don't think that easter-egging the language label is a good idea so the practice should be discouraged.
Łacinka is a latn script so should be simply marked up as a latn script.
Did I answer your question?
Trappist the monk (talk) 22:33, 31 March 2025 (UTC)
@Trappist the monk: Yes, pretty much. You seem to be advocating for {{lang|be-Latn|łacinka text}} and {{langx|be-Latn|łacinka text}} and perhaps something like be-Latn-latsinka (where latsinka is BGN/PCGN for лацінка or łacinka) if and when such a beast gets registered with IANA in much the same way as zh-Latn-pinyin is, although pinyin also seems to be a romanization here. The only downside I see is that there is no real way to differentiate between {{langx|be|лацінка}} (Belarusian: лацінка) and {{langx|be-Latn|łacinka}} (Belarusian: łacinka) except for the fact that the latter is Latin script and thus gets automatically italicized. Uzume (talk) 03:12, 1 April 2025 (UTC)
More-or-less, though advocating is a bit strong. The purpose of Module:Lang is to provide correct html markup for non-English text in compliance with MOS:FOREIGN. Both {{langx|be|лацінка}} and {{langx|be-Latn|łacinka}} do that. If ever IANA adopts a latsinka variant subtag, Module:Lang will support it.
Trappist the monk (talk) 13:17, 1 April 2025 (UTC)
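
The doubt raised above about the "ł" can be checked mechanically. Here is a minimal Python sketch of the variant-subtag grammar from RFC 5646 (BCP 47), `variant = 5*8alphanum / (DIGIT 3alphanum)`; the function name `is_wellformed_variant` is made up for illustration. Passing this check means a subtag is well-formed only; as noted in the thread, a variant must still be registered with IANA before it can be used.

```python
import re

# Variant subtag syntax from RFC 5646 (BCP 47):
#   variant = 5*8alphanum / (DIGIT 3alphanum)
# Only ASCII letters and digits are allowed.
VARIANT_RE = re.compile(r"(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})")


def is_wellformed_variant(subtag):
    """True if `subtag` matches the BCP 47 variant grammar.

    Well-formed is not the same as valid: registration with IANA
    is a separate requirement that this sketch does not check.
    """
    return VARIANT_RE.fullmatch(subtag) is not None


for candidate in ("łacinka", "latsinka", "tarask", "1959acad"):
    print(candidate, is_wellformed_variant(candidate))
```

Under this grammar "łacinka" is rejected because "ł" is not an ASCII letter, while "latsinka" is syntactically acceptable.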

Edit request 13 April 2025

Description of suggested change:

Diff:

["fr-ca"] = "Quebec French",
+
["fr-ca"] = "Canadian French",

Introduced in this diff. Northern Moonlight 05:56, 13 April 2025 (UTC)

See also Module_talk:Lang/data/Archive_1#Edit_request_8_January_2025. To address that request for consensus, let me propose that it is pretty self-evident that Quebec French is distinct from Canadian French (whether you call it a subset or a variant), as those articles amply describe. And Canadian French is expressible only as fr-CA in the schema used here. Is there an argument against this change based on a principle that eludes me? I have no objection to a separate question of whether fr-quebec (or something like that) ought to also exist, possibly along with other regional variants. But right now we have the problem that, for instance, Canadian French terms are being indicated as being specifically Quebec French, in error. TheFeds 08:13, 13 April 2025 (UTC)
Pinging Trappist the monk. Firefangledfeathers (talk / contribs) 16:26, 17 April 2025 (UTC)
According to this search, there are about 70 articles that use {{lang}} (~60) / {{langx}} (~10) with fr-CA (also, ~6 templates). If we make this change, someone with sufficient language skills (that person is not me) must go through those articles and make sure that all instances of {{lang(x)|fr-CA|...}} correctly identify the labeled dialect. Because Module:Lang does not have a mechanism to distinguish Québécois from generic Canadian French, we must invent one; perhaps fr-x-quebec → Quebec French.
Volunteers to make sure that the existing {{lang(x)|fr-CA|...}} templates are correctly applied or replaced with {{lang(x)|fr-x-quebec|...}}?
Trappist the monk (talk) 17:05, 17 April 2025 (UTC)
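
A volunteer doing this audit might begin by listing the affected template calls in a page's wikitext. The following Python sketch is one possible approach, not an existing tool: the pattern `FR_CA_TEMPLATE`, the function `find_fr_ca_uses`, and the sample sentence are all invented for illustration. It covers only the simple positional form {{lang(x)|fr-CA|...}} and does not handle nested templates.

```python
import re

# Matches {{lang|fr-CA|...}} and {{langx|fr-CA|...}}, ignoring case and
# spaces around the pipes. Nested templates inside the text are not handled.
FR_CA_TEMPLATE = re.compile(
    r"\{\{\s*(langx?)\s*\|\s*fr-ca\s*\|([^{}]*)\}\}",
    re.IGNORECASE,
)


def find_fr_ca_uses(wikitext):
    """Return (template name, remaining arguments) for each fr-CA use."""
    return [
        (m.group(1).lower(), m.group(2).strip())
        for m in FR_CA_TEMPLATE.finditer(wikitext)
    ]


# Made-up sample wikitext, for illustration only.
sample = (
    "A {{lang|fr-CA|dépanneur}} is also {{langx|fr-ca|poutine}}; "
    "cf. {{lang|fr|baguette}}."
)
print(find_fr_ca_uses(sample))
```

Each hit still needs a human decision on whether the text is generic Canadian French or specifically Quebec French; the sketch only narrows down where to look.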
To probe a little further before selecting a tag, the infobox at Quebec French suggests fr-u-sd-caqc as an IETF tag (added in this edit), though it seems it is not one that happens to correlate directly with ISO 639 & ISO 3166-1 alpha-2. Instead it seems to be using the RFC 6067 extension defined fully in Unicode Technical Standard #35, such that u means use the Unicode extensions, sd means use a geographic subdivision, ca is a semi-redundant way to encode the region information (meaning the same as ISO 3166-1 alpha-2 CA), and qc means the subdivision of Quebec.
Conversely, in fr-x-quebec, x is for private use, with quebec being the private use information (i.e. the string that English Wikipedia chooses to use to represent the place where Quebec French is spoken).
For the purposes of this module, how do we feel about either implementing a Unicode extension (u), a private use extension (x), or neither? It looks like Module:Lang/data currently implements a few private use codes and no Unicode codes. TheFeds 19:34, 19 April 2025 (UTC)
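
To make the comparison concrete, the two tag shapes can be taken apart with a short Python sketch. It is a simplified illustration with a hypothetical function name, `describe_tag`; it handles only the `u-sd-…` and `x-…` shapes discussed here and validates nothing against the IANA registry or the Unicode subdivision data.

```python
def describe_tag(tag):
    """Split a tag into its base part and any 'u-sd' or 'x' extension.

    A simplified sketch: it handles only the two shapes discussed here
    and does not validate subtags against any registry.
    """
    subtags = tag.lower().split("-")
    base = []
    subdivision = None
    private_use = None
    i = 0
    while i < len(subtags):
        part = subtags[i]
        if part == "x":
            # Everything after the x singleton is private use.
            private_use = "-".join(subtags[i + 1:])
            break
        if part == "u" and subtags[i + 1:i + 2] == ["sd"] and i + 2 < len(subtags):
            value = subtags[i + 2]
            # First two characters: region code; the rest: subdivision suffix.
            subdivision = {"region": value[:2], "suffix": value[2:]}
            i += 3
            continue
        base.append(part)
        i += 1
    return {
        "base": "-".join(base),
        "subdivision": subdivision,
        "private_use": private_use,
    }


print(describe_tag("fr-u-sd-caqc"))
print(describe_tag("fr-x-quebec"))
```

For fr-u-sd-caqc the sketch reports base fr with region ca and suffix qc; for fr-x-quebec it reports base fr with the private-use string quebec, whose meaning is whatever English Wikipedia decides it is.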
I sometimes think of supporting the unicode locale extension for subdivisions. The necessary reference data are available at github. But, do we really need such precision? There are 5400+ defined subdivisions. I would venture to guess that almost none of them are actually required for en.wiki to provide correct html markup for non-English text and to provide appropriate labeling and tooltips for readers. For those languages that do have specific regional needs, like Québécois, private-use tags (with the x singleton) should be sufficient.
I suppose that we could support a very limited subset of the u-sd-xxxx subdivisions on an as-needed basis if it is deemed sufficiently important to do so.
Trappist the monk (talk) 22:07, 19 April 2025 (UTC)
I'm not really too concerned one way or another about which ought to be preferred (fr-x-quebec vs. fr-u-sd-caqc), but wanted to consider the workflow of an editor attempting to use the {{lang}} and {{langx}} templates, whereby they might consult the mainspace article for guidance as to which tag to use, and find it doesn't work. We could amend the documentation for those templates to indicate that the Unicode extension is not presently supported, and that a private use tag corresponding to the ones at this module page ought to be used instead. Or, we could support some but not all—case-by-case as described. Or we could support them all, but that leads to the question whether a consensus exists to recommend one format or the other when there are now multiple ways of expressing the same concept (e.g. fr-CA = fr-u-sd-ca). Does any one alternative stand out as most elegant and workable? TheFeds 23:05, 20 April 2025 (UTC)
Presently there are 69 private-use tags known to Module:Lang. Most of those appear to refer to archaic (if that's the right word) languages. Some of them don't (lmo-x-berg → Bergamasque, lmo-x-cremish → Cremish, lmo-x-milanese → Milanese; there may be others in that list). Of those three, two have unicode IETF tags in their article infoboxen: Bergamasque: lmo-u-sd-itbg and Milanese: lmo-u-sd-itmi. For Cremish, its unicode tag is likely: lmo-u-sd-itcr.
This search suggests that there are about 140 articles that mention a unicode IETF tag. At a quick glance, most of those are for geographically specific living languages though I did find one (gem-u-sd-ua43 → Crimean Gothic) which is probably not a living language. There may be others; I didn't look closely.
On the other hand, this search finds about 1130 articles that use lang templates with private-use tags which suggests that editors are not too confused. But these are mostly used for dead languages so a unicode IETF tag is less likely to appear in a language article infobox (except for gem-u-sd-ua43 and perhaps others).
I guess all of this suggests to me that if we are to adopt unicode IETF tags (as needed), they should be used for living languages only and only for those that are tied to a specific geographical area within the bounds of the larger area specified by the first two characters of the subdivision subtag (it in itbg). For non-living languages, private-use tags should be used.
Trappist the monk (talk) 13:19, 21 April 2025 (UTC)
Once we make the switch, I can go through the articles manually. Northern Moonlight 01:25, 22 April 2025 (UTC)