Wikipedia:Wikipedia Signpost/Next issue/Recent research
This is a draft of a potential Signpost article, and should not be interpreted as a finished piece. Its content is subject to review by the editorial team and ultimately by JPxG, the editor in chief. Please do not link to this draft as it is unfinished and the URL will change upon publication. If you would like to contribute and are familiar with the requirements of a Signpost article, feel free to be bold in making improvements!
How readers use Wikipedia health content; Scholars generally happy with how their papers are cited on Wikipedia
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
How readers use Wikipedia health content
- Reviewed by Clayoquot
How do readers use health information on Wikipedia? A recent paper[1] explores this question using semi-structured interviews with 21 adults from seven countries. All participants had used Wikipedia for health information at least once in the previous year.
The research was qualitative in design, and all participants happened to have at least some post-secondary education, so the results are not necessarily representative of Wikipedia readers as a whole. Nevertheless, the study offers a fascinating breadth of findings. The whole paper is well worth reading – it's brief, digestible, and probably quite gratifying for Wikipedia volunteers. Some highlights:
- The most common reason for using Wikipedia was simply to "learn more" about a topic. One participant used Wikipedia to understand the relevant anatomy when preparing to have surgery. The participant said, "What all is like wrapped around that gland? That's the kind of information I was looking for and the doctors weren’t really telling me that."
- Several participants reported using Wikipedia for self-advocacy: before or after visiting a health professional, they would read Wikipedia so that they could better explain their symptoms or understand what kinds of questions to ask.
- Three quarters of participants expressed "conditional trust" in Wikipedia content, meaning they scroll down to the list of references and decide whether the cited sources are good. Previous research has found that readers click links in references only 0.29% of the time.[supp 1] This paper doesn't contradict the earlier finding. However, it provides evidence that even when readers don't read a cited source, the fact that it was cited might be meaningful to them.
Briefly
- See the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.
Other recent publications
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
- Compiled by Tilman Bayer
"Research citations building trust in Wikipedia: Results from a survey of published authors"
From the abstract:[2]
"This cross-publisher study (Taylor & Francis and University of Michigan Press [with two other authors hailing from British technology company Digital Science]) aimed to investigate [scholarly] author sentiment towards Wikipedia as a source of trusted information. [...] A short survey was distributed to 40,402 authors of papers cited in Wikipedia (n=21,854 surveys sent, n=750 complete responses received). The survey gathered responses from published authors in relation to their views on Wikipedia’s trustworthiness in relation to the citations to their published works. [...] Overall, authors expressed positive sentiment towards research citation in Wikipedia and researcher engagement practices (mean scores >7/10). Sub-analyses revealed significant differences in sentiment based on publication type (articles vs. books) and discipline (Humanities and Social Sciences vs. Science, Technology, and Medicine), but not access status (open vs. closed access).
From the "Discussion" section:
"Our results suggest there is general trust among researchers in Wikipedia both in terms of representativeness and accuracy. Most would also recommend the Wikipedia page where their work is cited to a colleague or the general public."
"A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia"
From the abstract:[3]
"[...] We quantify the cross-lingual patterns of the perennial sources list, a collection of reliability labels for web domains identified and collaboratively agreed upon by Wikipedia editors. We discover that some sources (or web domains) deemed untrustworthy in one language (i.e., English) continue to appear in articles in other languages. This trend is especially evident with sources tailored for smaller communities. Furthermore, non-authoritative sources found in the English version of a page tend to persist in other language versions of that page. We finally present a case study on the Chinese, Russian, and Swedish Wikipedias to demonstrate a discrepancy in reference reliability across cultures. Our finding highlights future challenges in coordinating global knowledge on source reliability."
From the paper:

"To investigate the spread of English Wikipedia’s perennial sources across multiple language editions, we identify the proportion of articles in each edition that include at least one reference to these sources. Figure 1 shows the percentage of articles referencing reliable and non-authoritative sources in the 40 editions with the largest number of articles. [...] The plot shows outliers in the two directions of the confidence interval represented by the gray area. On the one hand, the English edition is located below the confidence interval, meaning the proportion of articles citing reliable domains is larger. This observation is consistent with recent research [...], as the community of English Wikipedia is more aware of the non-authoritative domains listed in the local perennial sources list. On the other hand, the outliers above the confidence interval appear to have a relatively larger proportion of articles citing deprecated or blacklisted domains. These are Russian (ru), Armenian (hy), Chinese (zh), French (fr), and Bulgarian (bg)."
"Figure 3: Top 15 non-authoritative sources (from the perennial source list of the local Wikipedia edition or the one of English Wikipedia) by the number of citations in Russian, Swedish, and Chinese Wikipedia editions":
"Wikipedia as a Reliable Information Source: A Comparison of Chinese and English Versions"
From this post on the blog of the University of Geneva's Confucius Institute:[4]
"For English Wikipedia, we accessed the “Reliable sources/Perennial sources” page and extracted the list of reliable and controversial sources. Similarly, for Chinese Wikipedia, we accessed the equivalent page containing source reliability information list. [...] in our quantitative analysis, differences in the diversity and number of sources suggest that English Wikipedia may have access to a wider range of sources, whereas Chinese Wikipedia seems to be more selective or restricted in its choice of sources. Due to the existence of [the] “无共识” (no consensus) label, the rating of reliable sources in Chinese Wikipedia is more ambiguous than in the English version."
"ALPET: Active Few-shot Learning for Citation Worthiness Detection in Low-Resource Wikipedia Languages"
From the abstract:[5]
"Citation Worthiness Detection (CWD) consists in determining which sentences, within an article or collection, should be backed up with a citation to validate the information it provides. This study, introduces ALPET, a framework combining Active Learning (AL) and Pattern-Exploiting Training (PET), to enhance CWD for languages with limited data resources. Applied to Catalan, Basque, and Albanian Wikipedia datasets, ALPET outperforms the existing CCW baseline while reducing the amount of labeled data in some cases above 80%. ALPET's performance plateaus after 300 labeled samples, showing it suitability for low-resource scenarios where large, labeled datasets are not common. [...] Overall, ALPET's ability to achieve high performance with fewer labeled samples makes it a promising tool for enhancing the verifiability of online content in low-resource language settings."
"Providing Citations to Support Fact-Checking: Contextualizing Detection of Sentences Needing Citation on Small Wikipedias"
From the abstract:[6]
"To date, research on automating citation worthiness detection has largely focused on the most resourceful language, English Wikipedia, neglecting the applicability to smaller Wikipedias. In addition, previous research proposed models that analyze the content inherent to a sentence to determine its citation worthiness, overlooking the potential of additional context to improve the prediction. Addressing these gaps, our study proposes a transformer-based contextualized approach for smaller Wikipedias, presenting a novel method to compile high-quality datasets for the Albanian, Basque, and Catalan editions. We develop the Contextualized Citation Worthiness (CCW) model, employing sentence representations enriched with adjacent sentences and topic categories for enhanced contextual insight. Empirical experiments on three newly created datasets demonstrate significant performance improvements of our contextualized CCW model, with 6%, 3% and 6% absolute improvements over the baseline for Albanian, Basque and Catalan datasets, respectively. [...] This has implications for supporting Wikipedia projects across low-resource languages, promoting better article validation and fact-checking."
"Wikipedia and indigenous language preservation: analysis of Setswana and Punjabi languages"
From the abstract:[7]
"This study examines Wikipedia’s role in promoting and preserving Setswana and Punjabi. The research is framed by the Ethnolinguistic Vitality Theory (EVT), which suggests that language survival lies in reclamation, revitalization, and reinvigoration. A quanti-qualitative approach is used to investigate the issue, integrating quantitative metrics from Wikipedia’s statistical pages with qualitative content analysis of the articles. Data were collected from May 2022 to May 2024, focusing on article counts, edits, active editors, new pages, top edited pages, and views. [...] The findings show that Punjabi Wikipedia has a much larger content volume and user base, but comparatively lower recent activity and collaborative depth compared to Setswana Wikipedia. (Setswana) Tswana Wikipedia, while smaller in content volume, demonstrates a more engaged and active editing community, reflected by a higher depth score and a larger number of active users."
"A dual-focus analysis of wikipedia traffic and linguistic patterns in public risk awareness Post-Charlie Hebdo"
From the abstract:[8]
"This study investigates the dynamics of public risk awareness in the aftermath of the Charlie Hebdo terrorist attack on January 7, 2015, through a dual-focus analysis of Wikipedia traffic and Google Trends data. Analyzing the temporal patterns of Wikipedia page views in both English and French, sheds light on how significant media events, anniversaries, and related incidents influence public engagement with terrorism-related content over time. [...] Francophone regions, particularly France and its former colonies, exhibit a more sustained and consistent interest in the Charlie Hebdo event compared to Anglophone regions. The heightened engagement in French-speaking areas suggests that cultural and historical ties influence public risk perception and awareness."
"WikiReddit: Tracing Information and Attention Flows Between Online Platforms"
From the abstract:[9]
"[...] we present a comprehensive, multilingual dataset capturing all Wikipedia mentions and links shared in posts and comments on Reddit 2020-2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits."
Dissertation: "Wikipedia can be an uncomfortable space for those who don’t participate in hacker culture"
From the "Conclusion" section:[10]
"Over the course of this dissertation, I have shown how the infrastructure that constitutes Wikipedia, made out of various connected digital artefacts, does more than embedding values. It cocreates them – on one hand, by welcoming or resisting intervention, and by being a site of ideological negotiation; on the other hand, by suggesting, implicitly, what values are important, what constitutes a moral good in the first place. Beyond affording intervention to humans – being a substrate or a tool for ethical and epistemic meaning making – Wikipedia’s platform offers up technical values to be turned into social, epistemic, aesthetic values. This is true, for instance, of forkability: as I have shown, forkability in its original formulation informed design because of its practical advantages, concerning safety and distribution of code. Forkability then, through the community that created Wikipedia, became an epistemic value as well.
Programming practice is a key component of Wikipedia’s culture, [...] the concrete circumstances in which coders have worked define the way Wikipedia produces knowledge. [...]
A side-effect of the partial overlap between programming and creating Wikipedia’s content is that the flavour of Wikipedia’s community matches cultural traits of hacker culture. The effect of this phenomenon is two-fold. First, Wikipedia inherited assumptions found in hacker culture – downplaying the role of the body, faith in machinery, anti-aesthetic leanings, connecting intelligence and skill with the ability to code. Secondly, because of the connection between taste and belonging to specific communities, Wikipedia can be an uncomfortable space for those who don’t participate in hacker culture."
References
[edit]- ^ Smith, Denise A. (2023-08-12). ""I'm comfortable with it": User stories of health information on Wikipedia". First Monday. doi:10.5210/fm.v28i8.12897. ISSN 1396-0466.
- ^ Areia, Carlos; Burton, Kath; Taylor, Mike; Watkinson, Charles (2025-04-16). "Research citations building trust in Wikipedia: Results from a survey of published authors". PLOS ONE. 20 (4): e0320334. doi:10.1371/journal.pone.0320334. ISSN 1932-6203. PMC 12002439. PMID 40238814.
- ^ Baigutanova, Aitolkyn; Saez-Trumper, Diego; Redi, Miriam; Cha, Meeyoung; Aragón, Pablo (2023-10-21). "A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia". Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. CIKM '23. New York, NY, USA: Association for Computing Machinery. pp. 3743–3747. doi:10.1145/3583780.3615254. ISBN 9798400701245. / Preprint version
- ^ Bloch, Marylaure (2024-12-03). "Wikipedia as a Reliable Information Source: A Comparison of Chinese and English Versions". Blog scientifique de l'Institut Confucius de l'Université de Genève.
- ^ Halitaj, Aida; Zubiaga, Arkaitz (2025-02-05), ALPET: Active Few-shot Learning for Citation Worthiness Detection in Low-Resource Wikipedia Languages, arXiv:2502.03292
- ^ Halitaj, Aida; Zubiaga, Arkaitz (2024-09-01). "Providing Citations to Support Fact-Checking: Contextualizing Detection of Sentences Needing Citation on Small Wikipedias". Natural Language Processing Journal. 8: 100093. doi:10.1016/j.nlp.2024.100093. ISSN 2949-7191.
- ^ Minhas, Shahid; Salawu, Abiodun (2025-01-29). "Wikipedia and indigenous language preservation: analysis of Setswana and Punjabi languages". Frontiers in Communication. 10. doi:10.3389/fcomm.2025.1442935. ISSN 2297-900X.
- ^ Elroy, Or; Woo, Gordon; Komendantova, Nadejda; Yosipof, Abraham (2025-03-01). "A dual-focus analysis of wikipedia traffic and linguistic patterns in public risk awareness Post-Charlie Hebdo". Computers in Human Behavior Reports. 17: 100580. doi:10.1016/j.chbr.2024.100580. ISSN 2451-9588.
- ^ Gildersleve, Patrick; Beers, Anna; Ito, Viviane; Orozco, Agustin; Tripodi, Francesca (2025-02-07), WikiReddit: Tracing Information and Attention Flows Between Online Platforms, arXiv:2502.04942 / Dataset: Gildersleve, Patrick; Beers, Anna; Ito, Viviane; Orozco, Agustin; Tripodi, Francesca (2025-01-15), WikiReddit: Tracing Information and Attention Flows Between Online Platforms, doi:10.5281/zenodo.14653265
- ^ Falco, Elena (2024-11-28). A technosocial epistemology of Wikipedia (Thesis). UCL (University College London). (dissertation)
- Supplementary references and notes:
- ^ Piccardi, Tiziano; Redi, Miriam; Colavizza, Giovanni; West, Robert (2020-04-20). "Quantifying Engagement with Citations on Wikipedia". Proceedings of the Web Conference 2020. WWW '20. New York, NY, USA: Association for Computing Machinery. pp. 2365–2376. arXiv:2001.08614. doi:10.1145/3366423.3380300. ISBN 978-1-4503-7023-3.
Discuss this story