Wikipedia:Bots/Requests for approval/SchlurcherBot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard. The result of the discussion was
Approved.
New to bots on Wikipedia? Read these primers!
- Approval process – How this discussion works
- Overview/Policy – What bots are/What they can (or can't) do
- Dictionary – Explains bot-related jargon
Operator: Schlurcher (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 17:54, Sunday, March 2, 2025 (UTC)
Function overview: Convert links from http:// to https://
Automatic, Supervised, or Manual: Automatic
Source code available: Main C# script: commons:User:SchlurcherBot/LinkChecker
Links to relevant discussions (where appropriate): For discussions see:
- WPR: Why we should convert external links to HTTPS wherever possible
- WPR: Should we convert existing Google and Internet Archive links to HTTPS?
Similar tasks were approved for the following bots (please note that these seem to use pre-generated lists):
- Wikipedia:Bots/Requests for approval/Bender the Bot 8
- Wikipedia:Bots/Requests for approval/DemonDays64 Bot
Key difference: My proposed bot will not depend on or use pre-generated lists, but will instead use the heuristic described in the function details below.
Edit period(s): Continuous
Estimated number of pages affected: Based on the SQL dump of external links, there are a total of 21'326'816 http links on 8'570'326 pages on the English Wikipedia. Based on experience from DE Wiki, the success rate was initially estimated at approximately 10%, i.e. approximately 850'000 edits; based on the trial below, the success rate is approximately 33%, i.e. approximately 2'800'000 edits (roughly 0.33 × 8'570'326 pages).
Namespace(s): 1 and 6
Exclusion compliant (Yes/No): Yes, through the DotNetWikiBot framework [1]
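For context, a minimal sketch of the kind of check exclusion compliance involves (honouring {{nobots}} and {{bots|deny=...}}) is shown below. This is purely illustrative: the bot relies on the DotNetWikiBot framework for this, and the class, method, and regexes here are hypothetical rather than the framework's API.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ExclusionCheck
{
    // Returns false if the page opts out of edits by this bot.
    // Handles {{nobots}} and {{bots|deny=...}}; {{bots|allow=...}} is omitted for brevity.
    public static bool BotAllowed(string wikitext, string botName = "SchlurcherBot")
    {
        // {{nobots}} blocks all bots.
        if (Regex.IsMatch(wikitext, @"\{\{\s*nobots\s*\}\}", RegexOptions.IgnoreCase))
            return false;

        // {{bots|deny=all}} or a deny list naming this bot blocks it as well.
        var deny = Regex.Match(wikitext, @"\{\{\s*bots\s*\|[^}]*deny\s*=\s*([^}|]+)", RegexOptions.IgnoreCase);
        if (deny.Success)
        {
            return !deny.Groups[1].Value.Split(',')
                .Select(n => n.Trim())
                .Any(n => n.Equals("all", StringComparison.OrdinalIgnoreCase)
                       || n.Equals(botName, StringComparison.OrdinalIgnoreCase));
        }
        return true;
    }
}
```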
Function details: The algorithm is as follows (see the sketches after this list for an illustration of the main steps):
- The bot extracts all http-links from the parsed html code of a Wikipedia page
- It searches for all href elements and extracts the links
- It does not search the wikitext, and thus does not rely on any Regex
- This is also to avoid any problems with templates that modify links (like archiving templates)
- The bot checks if the identified http-links also occur in the wikitext, otherwise they are skipped
- The bot checks if both the http-link and the corresponding https-link are accessible
- This step also uses a blacklist of domains that were previously identified as not accessible
- If both links redirect to the same page, the http-link will be replaced by the https-link (the link will not be changed to the redirect page, the original link path will be kept)
- If both links are accessible and return a success code (2xx), it will be checked whether the content is identical
- If the content is identical, and the link is directly to the host, then the http-link will be replaced by the https-link
- If the content is identical but the link does not point directly to the host, the content is additionally compared against the content served by the host itself; only if it differs from the host content will the http-link be replaced by the https-link
- This step is added as some hosts return the same content for all their pages (like most domain sellers, some news sites or pages in ongoing maintenance)
- If the content is not identical, it will be checked if the content is at least 99.9% identical (calculated via the Levenshtein distance)
- This step is added because most homepages now use dynamic IDs for certain elements, for example ad containers designed to circumvent ad blockers.
- If the content is at least 99.9% identical, the same host check as before will be performed.
- If any of the checked links fails (like Code 404), then nothing will happen.
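A minimal C# sketch of the extraction and wikitext cross-check steps above. This is an illustration only: it assumes HtmlAgilityPack for HTML parsing, and all class and method names are made up for this example; the actual code is the linked LinkChecker script.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class HttpLinkExtractor
{
    // Extract all http:// hrefs from the parsed HTML of a page, then keep only those
    // that also appear verbatim in the wikitext. Links produced purely by templates
    // (e.g. archiving templates) do not appear in the wikitext and are skipped that way.
    public static List<string> CandidateLinks(string parsedHtml, string wikitext)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(parsedHtml);

        var candidates = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return candidates;

        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            if (href.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
                && wikitext.Contains(href))
            {
                candidates.Add(href);
            }
        }
        return candidates;
    }
}
```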
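A second sketch, covering the accessibility and content-comparison steps. Same caveat: this is a simplification under assumptions, not the bot's implementation. HttpClient follows redirects by default, so comparing the final request URIs stands in for the "both redirect to the same page" check; the domain blacklist and the extra host-content comparison are omitted, and the 99.9% threshold is applied via a plain Levenshtein distance.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class HttpsUpgradeCheck
{
    static readonly HttpClient Client = new HttpClient(); // follows redirects by default

    public static async Task<bool> ShouldUpgradeAsync(string httpUrl)
    {
        var httpsUrl = "https://" + httpUrl.Substring("http://".Length);

        HttpResponseMessage httpResp, httpsResp;
        try
        {
            httpResp = await Client.GetAsync(httpUrl);
            httpsResp = await Client.GetAsync(httpsUrl);
        }
        catch (HttpRequestException)
        {
            return false; // one of the links is not accessible: do nothing
        }

        if (!httpResp.IsSuccessStatusCode || !httpsResp.IsSuccessStatusCode)
            return false; // e.g. 404: do nothing

        // Both requests ended up at the same final URI after redirects: safe to upgrade
        // (the replacement keeps the original link path, not the redirect target).
        if (httpResp.RequestMessage.RequestUri.Equals(httpsResp.RequestMessage.RequestUri))
            return true;

        // Otherwise compare the delivered content and require at least 99.9% similarity,
        // which tolerates dynamic IDs (ad containers etc.) in otherwise identical pages.
        var a = await httpResp.Content.ReadAsStringAsync();
        var b = await httpsResp.Content.ReadAsStringAsync();
        int distance = Levenshtein(a, b);
        double similarity = 1.0 - (double)distance / Math.Max(1, Math.Max(a.Length, b.Length));
        return similarity >= 0.999;
    }

    // Standard dynamic-programming Levenshtein distance (two-row variant).
    static int Levenshtein(string s, string t)
    {
        var prev = new int[t.Length + 1];
        var curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.Length];
    }
}
```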
The bot will work on the list of pages identified through the external links SQL dump. The list was scrambled to ensure that subsequent edits are not clustered in a specific area.
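A small sketch of that scrambling step, assuming the page titles have been exported from the dump to a plain-text file (file path and seed are assumptions): a standard Fisher-Yates shuffle.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class PageListShuffler
{
    // Read one page title per line and return them in random order.
    public static List<string> LoadShuffled(string path, int seed = 42)
    {
        var pages = File.ReadLines(path).ToList();
        var rng = new Random(seed);
        for (int i = pages.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (pages[i], pages[j]) = (pages[j], pages[i]); // swap
        }
        return pages;
    }
}
```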
Please note that the bot is approved for the same task on Commons (with 4.1 million edits on that task) and on DE Wiki.
Discussion
- Approved for trial (100 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. If possible, please indicate how many pages were skipped as well. Primefac (talk) 14:15, 10 March 2025 (UTC)[reply]
- @Primefac: Thanks. I've tried, but soon realized that, as the bot changes external links, it hits CAPTCHAs per Extension:ConfirmEdit (in the investigation, I've made one manual edit to confirm). I've made an additional request at Wikipedia:Requests for permissions/Confirmed to be able to skip CAPTCHAs.
Trial complete.
- Here is a summary of the edits and how many pages were skipped:
Bot trial summary (out of 316 pages checked):
Edit type           | Count | Percent | Comment
Done                | 100   | 31.7%   | Edits
No http link found  | 3     | 1.0%    | No action, as link was removed as compared to the dump used
No https link found | 87    | 27.5%   | No action, no appropriate https link found per the above logic
Incorrect namespace | 119   | 37.7%   | No action, not in namespace 1 or 6
- Hope this helps. I've checked the edits and generally see no issue. The only thing to note is edits like this [2], where the same link appears both on its own and inside another link. In these cases both occurrences get replaced in the final search-and-replace, and both links continue to work. Also, based on these results from a random sample of pages, I've adjusted the expected edit count above. Please let me know your findings. --Schlurcher (talk)
- As mentioned earlier, I did not particularly like this edit [3], where both the https archived and http non-archived link are given on the same page. I've now updated the initial filtering logic to correct for this: I'll now remove all http links that are contained in other links on the same page (checked against all http and https links). The revised edit with the updated logic is here: [4]. --Schlurcher (talk) 17:28, 23 March 2025 (UTC)[reply]
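For illustration, a minimal sketch of the updated filter described in the comment above, assuming the candidate http links and the full set of links found on the page are already available as strings; names are illustrative, not the bot's actual code.

```csharp
using System.Collections.Generic;
using System.Linq;

static class NestedLinkFilter
{
    // Drop every candidate http link that also occurs inside another, longer link on
    // the same page (e.g. inside an archive URL), whether that other link is http or https.
    public static List<string> RemoveNestedLinks(List<string> httpLinks, List<string> allLinks)
    {
        return httpLinks
            .Where(link => !allLinks.Any(other => other != link && other.Contains(link)))
            .ToList();
    }
}
```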
- {{BAG assistance needed}} Any further feedback? --Schlurcher (talk) 20:13, 6 April 2025 (UTC)[reply]
Approved. It looks good to me. As per usual, if amendments to - or clarifications regarding - this approval are needed, please start a discussion on the talk page and ping. --TheSandDoctor Talk 23:52, 6 April 2025 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard.