I'm trying to patrol pubic and certain other easily confused words using poop patrol, and I can see a few phrases that would suit this software better.
OK false positives are theoretically possible though it doesn't exist yet on Wikipedia and there once were many dozens of participants in the Olympic sport of synchronised ventriloquism. I will leave it in Botlaf. However can disolve be added as a typo for dissolve? I went through it manually a year or so back but there are about fifty again. ϢereSpielChequers18:06, 31 August 2010 (UTC)
I'm not convinced it does, I've just fixed one from June and I'd have thought AWB would have fixed it by now if it was in AWB. Can we have a specific rule for Disolv - Dissolv please.ϢereSpielChequers13:01, 18 September 2010 (UTC)
Please correct tv to TV. Although it is commonly used, acronyms should be capitalised, otherwise "tv" might be pronounced as a single syllable. McLerristarr | Mclay117:01, 24 October 2010 (UTC)
Please italicise Latin words and phrases, the most common being et cetera (or etcetera, et caetera or et cætera), de facto, de jure, id est, ad libitum, circa, floruit and exempli gratia. McLerristarr / Mclay107:49, 14 September 2010 (UTC)
I don't agree with the rule for Considered changing "consideres" → "considered", as the proper word could be "considers". (e.g. this edit) I hope you'll reconsider (pun intended) this rule. Speaking of which, adding "(Re)" to the beginning of these rules would be good too. Thanks! GoingBatty (talk) 02:55, 24 September 2010 (UTC)
"Diary products" could be legitimate; I nearly committed this edit to "Dairy products" before I noticed. I was too scared to screw up the code to edit it; could someone who knows what they're doing, please? --John (talk) 06:48, 29 September 2010 (UTC)
What does '"Diary products" could be legitimate' mean? Did you actually find it anywhere? It seems way beyond likely to me.--BillFlis (talk) 03:16, 30 September 2010 (UTC)
My initial instinct too. I found 2 examples of it (searching for the phrase finds the two... I don't remember them now). Frankly the typo seems more likely; I'd be fine with it added back (although I added some others too so don't remove those) Shadowjams (talk) 03:43, 30 September 2010 (UTC)
Little heads up for you. I was poking at AWB doing some profiling, and Regextypofix takes nearly a third of the time whilst processing an article. Most of this is match evaluation.
Is it possible for you to drill down deeper and see which regexes, or what kinds of regexes, take the longest? Are there any ways we can optimize what's here from the rule-writing perspective? Shadowjams (talk) 21:25, 3 October 2010 (UTC)
Not exactly. MaxSem seems to think there was, but we'll have to dig it out. I imagine, there are a lot of rules that won't ever get matched, and are probably just pointless keeping around. I need to do a new TypoScan dump, and if I do it with some extra stats, such as the word/the rule it matched, it might give us a better idea. We have a lot of regexes!! —Reedy21:51, 3 October 2010 (UTC)
Yeah, it's huge. One-third is less than I would have guessed for the typo rules. There was a conversation (I think it's above) about whether using alternation (pipes) or character classes (brackets) was faster, since the latter is significantly faster in some implementations. For AWB it turns out the difference is small, but classes are slightly faster.
While I'm interested in the optimization issues it's mostly academic; I don't personally find the speed right now a serious issue. Even on old hardware I don't have trouble working with anything in AWB. If anything the API for saving changes (gets are quick) is a larger slow-down. If I do large database dump scans that takes a while but even then it's not extraordinarily long, and it's easily batched which is probably a more long-term and cheaper solution (in terms of coding time) than on optimizing everything. That's something I guess you ultimately get to decide, but just my two-cents. Thanks for the info, let me know if I can help speed anything up. Shadowjams (talk) 22:59, 3 October 2010 (UTC)
Re: which typo rules are the slowest. We have the 'profile typos' option, but that only runs against one particular page. We also have to be careful: just because a rule doesn't match any pages in a given database dump doesn't mean the rule is useless. Somebody may have fixed 20 typos using that rule the day before the dump. However, the last time I did profile typos on a page there were certain rules that were much slower than others, so we might achieve a reasonable performance improvement by focusing on a handful of rules. Still, I don't think current performance is a problem; the "1/3 of the time" Reedy mentions depends entirely on the page you run against. Rjwilmsi 11:13, 7 October 2010 (UTC)
I have posted the 50 slowest typo rules, based on profiling Tiger Woods. The number at the start is the time (I think this is probably the time in milliseconds to apply the typo 100 times or something), and then the regex of the rule is given. Note that the quickest typo has a time of 2, a typical value for the majority of the rules is around 50. Therefore some rules are 5 or 10 times slower than average. Rjwilmsi11:31, 7 October 2010 (UTC)
Quick example on the 11th slowest: ($1nally): originally 0.87 seconds using Expresso for 10 iterations on Tiger Woods, using \b([A-Za-z]{2,}[a-mo-z])(?:nalyl|anlly)\b instead is 0.67 seconds. That's about 20% faster with no change to the rule's matching. Rjwilmsi11:53, 7 October 2010 (UTC)
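The rewritten rule quoted above can be tried outside AWB. The following is only a minimal sketch in Python rather than AWB's .NET engine: the `$1nally` replacement is written as `\1nally`, and the function name and sample sentence are made up for illustration. It checks what the rule matches and says nothing about the timings.

```python
import re

# Rule as rewritten in the comment above. The .NET-style "$1nally" replacement
# becomes r"\1nally" in Python's re module (an assumption of this sketch).
NALLY = re.compile(r"\b([A-Za-z]{2,}[a-mo-z])(?:nalyl|anlly)\b")


def fix_nally(text: str) -> str:
    """Apply the '-nally' rule to every whole-word match in text."""
    return NALLY.sub(r"\1nally", text)


if __name__ == "__main__":
    sample = "It was originalyl an origianlly planned event, finally held in May."
    print(fix_nally(sample))
```

Because the group requires at least three letters before the misspelled ending, a short word such as "fianlly" is not matched by this particular rule.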
No, not quite true, we want to match the whole word so the edit summary shows whole words being corrected. Rjwilmsi13:00, 7 October 2010 (UTC)
Converting \w to [A-Za-z] for performance improvement: that reduced typical typo time on Tiger Woods from average 7.7 seconds to average 6.9 seconds on my laptop, ~10% better. [A-Za-z] may be better as [a-z], I'll see about that. Rjwilmsi13:36, 7 October 2010 (UTC)
I think \w covers [A-Za-z0-9_] and maybe (depending on the language) extended Latin/Cyrillic characters. Mitigating that though, in most cases those probably aren't intended. Shadowjams (talk) 16:18, 7 October 2010 (UTC)
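The difference between `\w` and `[A-Za-z]` can be shown with a few sample strings. This sketch uses Python's `re` module, where `\w` is Unicode-aware for text patterns; AWB runs on .NET, whose `\w` is, as far as I know, also Unicode-aware by default, but treat that as an assumption and test in the AWB regex tester before relying on it. The sample words are arbitrary.

```python
import re

SAMPLES = ["typo", "Zürich", "route_66", "Москва"]


def classify(word: str) -> tuple[bool, bool]:
    """Return (matches \\w+ entirely, matches [A-Za-z]+ entirely)."""
    return (
        re.fullmatch(r"\w+", word) is not None,
        re.fullmatch(r"[A-Za-z]+", word) is not None,
    )


results = {word: classify(word) for word in SAMPLES}

if __name__ == "__main__":
    for word, (as_w, as_ascii) in results.items():
        print(f"{word!r}: \\w+ -> {as_w}, [A-Za-z]+ -> {as_ascii}")
```

So swapping `\w` for `[A-Za-z]` changes what a rule matches whenever a word contains digits, underscores or non-ASCII letters, which is the trade-off being weighed against the speed gain.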
John's explanation is correct, though his example uses AWB find & replace rather than typo fixing, but both do the same edit summary condensing he's explained. Rjwilmsi11:05, 14 October 2010 (UTC)
Sorry, but I don't understand your request. Could you please specify the exact misspellings that you want to be identified and fixed? Thanks! GoingBatty (talk) 02:16, 27 October 2010 (UTC)
Possible State capitalization issue
I have had a few pages lately where AWB is trying to capitalize states that are within a web address, and I don't think we want to do that. Here is one example. --Kumioko (talk) 19:52, 22 October 2010 (UTC)
It looks like AWB properly ignored the web address (the part in the brackets that uses the http:// prefix) and only tried to fix the unfortunately worded description of the web address (not in brackets, with no http:// prefix). -- JHunterJ (talk) 20:19, 22 October 2010 (UTC)
Neither of them seem to work for me in the AWB Regex Tester. In particular, although you want to change "kms" and "kgs" (which contain lower case "k"), the regex only has an uppercase "K". GoingBatty (talk) 02:29, 26 October 2010 (UTC)
Good call, thanks. Let me add lower case 'k' as an option:
I tried the AWB Regex Tester again using your Find and Replace on the text "Kgs and Kms and kgs and kms and kg and km", and it didn't find anything to replace. Hopefully one of the experts can give you a hand with this. Good luck! GoingBatty (talk) 02:20, 27 October 2010 (UTC)
That's also going to result in false positives where it tries to fix km and kg. Since we want to fix kms, Kms, Km, kgs, Kgs, Kg - but not km or kg - how about splitting this into two rules:
The one line version should be faster than the two line version. Yes, it does over-write 'km' with 'km' but it has to parse the text anyway and the outcome is unchanged. Lightmouse (talk) 17:44, 27 October 2010 (UTC)
How about one rule: <Typo word="kg/km (kilogram/kilometre)" find="([\d\.]+(?:\s| |-)?)(?:K([gm])s?|[Kk]([gm])s)\b" replace="$1k$2$3" /> Could someone please test this? If two rules are necessary, I'd suggest that one handle the capital "K" error, and the other the terminal "s" error.--BillFlis (talk) 19:09, 27 October 2010 (UTC)
It works for me, Bill. I used the regex tester on:
I've made the change to the rule. Also, modified the watt rule to correct also "kw" and removed the now-redundant kilowatt rule.--BillFlis (talk) 11:42, 28 October 2010 (UTC)
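For anyone who wants to re-check the single kg/km rule outside the AWB Regex Tester, here is a rough sketch in Python. The pattern is copied from the rule as posted above; the `$1k$2$3` replacement is written as a function because only one of groups 2 and 3 takes part in any match. The function name and the test string are assumptions for illustration.

```python
import re

# Pattern copied from the proposed rule; the replacement "$1k$2$3" is built by hand.
KG_KM = re.compile(r"([\d\.]+(?:\s| |-)?)(?:K([gm])s?|[Kk]([gm])s)\b")


def fix_kg_km(text: str) -> str:
    """Normalise Kg/Kgs/kgs and Km/Kms/kms after a number to kg and km."""
    def repl(m: re.Match) -> str:
        # Exactly one of groups 2 and 3 is set; the other is treated as empty.
        return m.group(1) + "k" + (m.group(2) or "") + (m.group(3) or "")
    return KG_KM.sub(repl, text)


if __name__ == "__main__":
    print(fix_kg_km("5 Kgs, 5 Kms, 5 kgs, 5 kms, 5 Kg, 5 Km, 5 kg, 5 km"))
    # Without a leading number the rule does not fire at all.
    print(fix_kg_km("Kgs and Kms and kgs and kms"))
```

Note that the rule only fires when a number comes first, so a test string without digits will show no replacements.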
"Gramme" is rarely used in British English. It's an old spelling. But people must also note that the SI spelling of "meter" is "metre" so just basing spelling on SI is not OK. McLerristarr | Mclay113:53, 26 October 2010 (UTC)
Quite. I'm just referring to the SI unit of mass. wp:engvar says "Wikipedia tries to find words that are common to all varieties of English." There is an occasionally quoted misconception that British spelling requires 'kilogramme'. The spelling 'kilogramme' merely has the status of an old alternative. Since metrication started in the 1970s, the spelling 'kilogram' started to be adopted and is now the default.
The spelling 'kilogram' has been used in legislation for the last 25 years (e.g. Weights and Measures Act 1985). It's the spelling taught by the Department of Education and in style guides:
All of the code in SI unit symbols seems excessive to me. For example, the code that will turn '100 kw' into '100 kW' is:
find="([\d\.]+(?:\s| |-)?)kw\b" replace="$1kW" />
It looks for a digit string. But I think it could be simplified by looking only for the last digit in the string. Thus:
find="(\d(?:\s| |-)?)kw\b" replace="$1kW" />
As far as I can see, that would give the same hit rate and the same false positive rate. The same applies across all 14 SI units. Am I correct? Lightmouse (talk) 14:44, 26 October 2010 (UTC)
Looks OK to me (unless someone writes "25. kw", which is a different error), but I would change the "?" to "*" to catch multiple spaces:
We match the entire number so that the edit summary shows the entire unit to make it easier for editors to understand the change. Rjwilmsi14:53, 26 October 2010 (UTC)
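The point about the edit summary can be illustrated by comparing what each version of the kW rule actually matches. This is a sketch in Python, not AWB code: both patterns are copied from the comments above, the replacement `$1kW` is written as `\1kW`, and the sample sentence and the `compare` helper are invented for the example.

```python
import re

FULL_NUMBER = re.compile(r"([\d\.]+(?:\s| |-)?)kw\b")  # current rule
LAST_DIGIT = re.compile(r"(\d(?:\s| |-)?)kw\b")        # proposed simplification

TEXT = "The plant is rated at 1500 kw."


def compare(rule: re.Pattern, text: str) -> tuple[str, str]:
    """Return (matched text, text after the fix) for the first hit of rule."""
    match = rule.search(text)
    matched = match.group(0) if match else ""
    return matched, rule.sub(r"\1kW", text)


if __name__ == "__main__":
    for name, rule in (("full number", FULL_NUMBER), ("last digit", LAST_DIGIT)):
        matched, fixed = compare(rule, TEXT)
        print(f"{name}: matched {matched!r}, result {fixed!r}")
```

Both rules produce the same corrected text, but the second one would report only "0 kw" as the matched string, which is less helpful in a summary of what was changed.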
Since "It is" has its own entry in the Duplicate words section to fix "it it" and "is is", should the specific Duplicate words entry be tightened so it doesn't also look for "it it" and "is is"? GoingBatty (talk) 03:46, 30 October 2010 (UTC)
For the same reasons people still use HTML &ndash instead of the UTF-8 character (which they can get from the little tool strip below the edit window): tradition, recalcitrance, personal preference, obstinacy, obtuseness, drunkenness.--BillFlis (talk) 07:58, 30 October 2010 (UTC)
I would point out that I have my own personal convert template regex rule, and I think there's a bot going around doing similar things. While both mine and the bot's rules could fix all versions, I currently don't and I don't know what the bot does. It pays to have some standardization... but I'm not hell bent to change the MOS rules for something like this. Shadowjams (talk) 06:26, 4 November 2010 (UTC)
Ha, there's a bit of a disconnect here somewhere. If on the "Insert" pull-down menu below the "Save page" button you select "Symbols", it makes available both "m²" and "m³" (with the Unicode exponents, not the <sup> markup).--BillFlis (talk) 12:04, 4 November 2010 (UTC)
in in
This is a recent addition; I've only seen it produce false positives so far. There are many phrases ending in "in", such as "bring in", "buy in", "carry in" and so on, which can legally be followed by another phrase that starts with "in", such as "in many cases", "in 2007", and so on. -- John of Reading (talk) 08:21, 31 October 2010 (UTC)
Hi John - I'm the one who made the addition based on the typo corrected in this edit. Could you please give an example of a grammatically correct sentence that contains "in in"? Thanks! GoingBatty (talk) 14:56, 31 October 2010 (UTC)
I've just done an AWB Google search for "in in". The rule made no correct changes, and was going to damage these:
Betting (poker) - a player may go all in in exactly the same manner...
Cheating in poker - unless this exceeds the maximum buy-in in which case the player...
A search for "in in early" found a roughly even mixture of correct and incorrect fixes. I didn't save anything, so you can try it yourself. -- John of Reading (talk) 20:50, 31 October 2010 (UTC)
Based on John's feedback, I updated the rule here so it looks for a space before the duplicated word, so it won't catch "buy-in in" or "Drive-in in" anymore. GoingBatty (talk) 02:33, 3 November 2010 (UTC)
"I let the dog in in the morning." Two in's is the same situation as two on's. There's no way of getting around it. The typo fixer cannot possibly correct every typo so copyediting still needs to be done regularly. This is another typo that will have to be found the traditional way. McLerristarr | Mclay106:42, 3 November 2010 (UTC)
I think it's simply too complicated of a grammatical issue to handle with the typo rules. I'd note that there's absolutely nothing stopping anyone from using their own rules in AWB to identify common types of duplicate words (pretty much pronouns and prepositions), or just identifying duplicate words in any case (this should do it \b(\w+)\b\1\b) and using human judgment to fix them. This is probably better used for words that don't have this error. I don't have enough grammar knowledge to be confident about which words those are, but the usual "the the" examples are a good place to start. Shadowjams (talk) 06:24, 4 November 2010 (UTC)
Based on the discussion, I've reverted my change here. However, I disagree that "There's no way of getting around it."
"a player may go all in in exactly the same manner" → "a player may go all in exactly the same way"
"The thaw set in in early March." → "The thaw set in early March"
Thanks for the regex suggestion, Shadowjams, but that didn't work for me. While \b(\w+)\s\1\b did work, I found that \s(\w+)\s\1\s helps to avoid the "buy-in in" examples above. GoingBatty (talk)
As well as avoiding "buy-in in" it could avoid "buy buy-in". I know that's not a good example but I can't think of a real one right now. McLerristarr | Mclay108:15, 5 November 2010 (UTC)
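The three duplicate-word patterns mentioned above behave differently, and the difference is easy to see on the examples from this thread. A small sketch in Python (the constant names and the `hits` helper are made up; AWB itself uses .NET regexes, which treat `\b`, `\w` and `\s` similarly for these ASCII samples):

```python
import re

ADJACENT = re.compile(r"\b(\w+)\b\1\b")         # first suggestion: no whitespace allowed
WORD_BOUNDARY = re.compile(r"\b(\w+)\s\1\b")    # duplicate delimited by word boundaries
SPACE_DELIMITED = re.compile(r"\s(\w+)\s\1\s")  # duplicate delimited by whitespace


def hits(pattern: re.Pattern, text: str) -> list[str]:
    """Return every substring of text that pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    samples = [
        "It was in the the garden.",
        "exceeds the maximum buy-in in which case",
    ]
    for text in samples:
        for name, pattern in (
            ("adjacent", ADJACENT),
            ("word boundary", WORD_BOUNDARY),
            ("space delimited", SPACE_DELIMITED),
        ):
            print(f"{name:15} {text!r} -> {hits(pattern, text)}")
```

The first pattern can never match, because it asks for a word boundary and then immediately for more word characters. The whitespace-delimited version skips "buy-in in", at the cost of also skipping duplicates at the very start or end of a line or directly before punctuation.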
As a postscript I've tackled "in in" using a variety of Google searches ("in in 1857", "born in in", and so on) and a long regexp to skip most of the false positives; 450 fixes from around 2000 candidates. There will be many others that I've missed, I'm sure. -- John of Reading (talk) 19:51, 6 November 2010 (UTC)
Before starting another controversy, does anyone object to expanding the Duplicated words entry to fix "had had" and "that that"? GoingBatty (talk) 00:21, 7 November 2010 (UTC)
Thanks for the example. Sorry for being dense, but what's the difference between "He had the apple" and "He had had the apple" ? GoingBatty (talk) 00:41, 7 November 2010 (UTC)
The second is used to refer to an action that happened before another (had something before another thing happened), as in "He had had a drinking problem, so he attended an AA meeting." The typo fixer shouldn't change something that is completely correct. PleaseStand(talk)01:23, 7 November 2010 (UTC)
I agree that the typo fixer shouldn't change something that is completely correct. So does your example mean "He had a drinking problem, so he attended an AA meeting, and he no longer has a drinking problem." ? Thanks! GoingBatty (talk) 01:40, 7 November 2010 (UTC)
"more more" and "other other" look OK to me, though "more more" will run into some false positives with song and TV program titles. (Comment revised after I saw the error in my test regexp) -- John of Reading (talk) 07:52, 7 November 2010 (UTC)
Thank you for the links. I definitely won't add "had had" or "that that". I hope that the song and TV program titles would be "More More" instead of "more more". GoingBatty (talk) 15:46, 7 November 2010 (UTC)
Does the typo fixer remove duplicate words in different casings (e.g. other Other)? I don't think it should because the capitalised word could be part of a proper name, making the duplication completely correct. McLerristarr | Mclay101:04, 9 November 2010 (UTC)
Is your list across all namespaces? I think the primary concern should be the article namespace. Anyone who wants to type "very very" or "blah blah blah" on a talk page isn't something we should be correcting. GoingBatty (talk) 17:59, 10 November 2010 (UTC)
Cool list! For comparison, the typo rule is currently fixing the following duplicates: a, am, an, as, at, and, are, become, be, by, could, did, do, for, go, has, he, if, is, it, me, more, no, of, or, other, she, should, the, their, them, then, these, they, this, thus, to, was, were, what, where, when, which, who, whom, why, with, would. GoingBatty (talk) 17:56, 10 November 2010 (UTC)
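As a rough illustration of how a word-list rule like this behaves, the sketch below builds one case-sensitive pattern from a subset of the words listed above. It is not the actual AWB rule, which is not quoted in this thread; the list subset, the names and the pattern shape are all assumptions.

```python
import re

# A subset of the words listed above; the real rule's regex is not shown here.
DUPLICATE_WORDS = [
    "a", "am", "an", "as", "at", "and", "are", "be", "by", "for", "he", "if",
    "more", "of", "or", "other", "the", "to", "was", "with",
]

# Longest words first so the alternation tries "and" before "an" before "a".
_ALTERNATION = "|".join(
    re.escape(w) for w in sorted(DUPLICATE_WORDS, key=len, reverse=True)
)
DUPLICATE_RULE = re.compile(r"\b(" + _ALTERNATION + r")\s+\1\b")


def fix_duplicates(text: str) -> str:
    """Collapse 'word word' to 'word' for lower-case words in the list only."""
    return DUPLICATE_RULE.sub(r"\1", text)


if __name__ == "__main__":
    print(fix_duplicates("It was the the first of of many."))
    print(fix_duplicates("More More entered the charts; he had had enough."))
```

Because the match is case-sensitive and limited to the list, titles such as "More More", mixed-case pairs such as "other Other", and legitimate constructions such as "had had" are left alone.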
"What's more" should be followed by a comma. That's a problem with a lot of these rules; they would be correct if they were separated by a comma. McLerristarr | Mclay110:40, 22 November 2010 (UTC)
I think the automatic typo fixes are all turned off inside wikilinks. The only kind of fix that wouldn't break the link is this one, changing the case of the initial letter. -- John of Reading (talk) 07:56, 15 November 2010 (UTC)
You're right - I wouldn't expect AWB to change [[european individualist anarchism]]. However, since AWB changed "And so an european tendency..." to "And so a european tendency...", I expected it to change to "And so a European tendency..." GoingBatty (talk) 13:40, 15 November 2010 (UTC)
I found this in the manual - "If a typo rule is matching a wikilink target, this rule will be ignored on the whole page". So on that page, only, AWB thinks that "european" is allowable. -- John of Reading (talk) 14:12, 15 November 2010 (UTC)
Aha - that explains it! I tried to RTFM before posting this question, but looked in the wrong place. Could this sentence be added to the appropriate place on WP:AWB/T ? Thanks! GoingBatty (talk) 17:25, 15 November 2010 (UTC)
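The behaviour quoted from the manual can be pictured with a short sketch. This is not AWB's implementation; it is a hypothetical Python model of the stated rule ("if a typo rule is matching a wikilink target, this rule will be ignored on the whole page"), and the rule used here, `european` to `European`, is invented for the example.

```python
import re

# Captures the target part of [[target]], [[target|label]] and [[target#section]].
WIKILINK_TARGET = re.compile(r"\[\[([^\]|#]+)")


def apply_rule(page: str, find: str, replace: str) -> str:
    """Apply one typo rule unless it matches any wikilink target on the page."""
    rule = re.compile(find)
    for target in WIKILINK_TARGET.findall(page):
        if rule.search(target):
            return page  # rule is skipped for the whole page
    return rule.sub(replace, page)


if __name__ == "__main__":
    with_link = "See [[european individualist anarchism]]. And so a european tendency arose."
    without_link = "And so a european tendency arose."
    print(apply_rule(with_link, r"\beuropean\b", "European"))
    print(apply_rule(without_link, r"\beuropean\b", "European"))
```

On the page containing the lower-case link target the rule is dropped entirely, so the unlinked "european" in the prose is left as it is.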
Not sure what you're asking for here. There's already a rule set up to change "Pre-Colombian" to "Pre-Columbian". Are you saying this rule isn't working, or are you suggesting this rule be disabled, or something else? GoingBatty (talk) 04:26, 17 November 2010 (UTC)
OK I'm finding a lot of these, in variations; "etc. ..." etc. I will try and fix as many as possible but looks like a candidate for a typo rule. RichFarmbrough, 10:04, 17 November 2010 (UTC).
Do you mean as in a proper etc. and then trailing periods (with or maybe without a space)? The current etc. rule has a kind of complicated negative lookback, so it's probably easier to just make a new rule for properly spaced etc.'s that have that feature. Test this:
Changing etc to etc. is often not helpful. The period can be read as a full stop, so the conversion with AWB cannot be a fully automatic process. How about instead converting etc to the full wording etcetera, or otherwise not converting at all? Regards, SunCreator (talk) 11:08, 17 November 2010 (UTC)
I'm not sure the distinction between a full stop and a period... they're effectively the same thing... and I don't understand the issue with the change unless you prefer "etc" remains instead of becoming "etc." If you have an example of where the rule's making a mistake, please provide the diff. The manual of style, however, has long considered the "etc." version correct, as has every other style guide I've ever seen outside of Wikipedia. Shadowjams (talk) 11:45, 17 November 2010 (UTC)
Period and full stop are the same; I was attempting to show the difference between a dot at the end of "etc." and the dot ending a sentence with "etc.". They look the same, and so it's an issue. Here is a made-up example.
"During the succession of lead singers of Jones, Tomson, Harry, Dickson etc Smith's vocals had always been distinguishable."
Now if you change "etc" to "etc." you end up with two sentences. "During the succession of lead singers of Jones, Tomson, Harry, Dickson etc. Smith's vocals had always been distinguishable"
A better way would be to change "etc" to "etc.," to keep the sentence going. Splitting the sentence into two by "etc." is grammatically messy at best. Regards, SunCreator(talk)23:24, 17 November 2010 (UTC)
We can't possibly account for mistakes. That sentence should be "During the succession of lead singers of Jones, Tomson, Harry, Dickson etc., Smith's vocals had always been distinguishable". If the comma has been omitted, that's not our problem. McLerristarr | Mclay106:23, 18 November 2010 (UTC)
It's not automatic, but it is complicated. I'm currently using 4 rules
<Typo word="<enter a name>" find="etc\s*.\s*…" replace="etc." />
<Typo word="<enter a name>" find="etc\s*\.\.\.\." replace="etc." />
<Typo word="<enter a name>" find="etc\s*\.\.\." replace="etc." />
<Typo word="<enter a name>" find="etc\. +([A-Z])" replace="etc.. $1" />
Plus of course the built in etc => etc.
Rule 1 deals with the actual ellipsis character.
Rule 2 assumes that four dots represent an abbreviation stop and an ellipsis, and removes the ellipsis.
Rule 3 assumes that three dots represent an ellipsis, and removes the ellipsis, replacing it with a stop.
Rule 4 assumes (very shakily) that a new sentence starts on the next word and inserts an end of sentence stop after the abbreviation stop.
This is, of course, only valid outside quotes, and even then only rules 1-3 can be given a very high positive and low negative hit rate. Rule 4 fails positively on succeeding proper nouns and fails negatively on intervening punctuation, breaks, titles, end of page etc. RichFarmbrough, 12:25, 17 November 2010 (UTC).
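The four rules listed above can be tried in sequence on a few strings. The sketch below is in Python and copies the find patterns as posted, including the unescaped dot in rule 1, which matches any single character; `$1` is written as `\1`. The sample strings and the helper name are assumptions, and the built-in etc => etc. rule is not included.

```python
import re

# (find, replace) pairs in the order given above.
ETC_RULES = [
    (r"etc\s*.\s*…", "etc."),            # rule 1: real ellipsis character
    (r"etc\s*\.\.\.\.", "etc."),         # rule 2: four dots
    (r"etc\s*\.\.\.", "etc."),           # rule 3: three dots
    (r"etc\. +([A-Z])", r"etc.. \1"),    # rule 4: next word starts a sentence
]


def apply_etc_rules(text: str) -> str:
    """Run the four rules one after another, as separate typo rules would."""
    for find, replace in ETC_RULES:
        text = re.sub(find, replace, text)
    return text


if __name__ == "__main__":
    for sample in (
        "a, b etc.… c",
        "red, green etc...",
        "cats, dogs, etc.... Next",
        "Jones, Tomson, Harry, Dickson etc. Smith's vocals",
    ):
        print(repr(sample), "->", repr(apply_etc_rules(sample)))
```

The last sample shows the weakness already noted for rule 4: a proper noun after "etc." is treated as the start of a new sentence.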
Sorry, I may be confused; come to think of it, it may have been regarding e.g. or i.e. or something like that. The discussion I'm thinking of had to do with trailing punctuation I think... In any case I think that issue dealt with some peculiarities of the old rule. So your question raises a good point. Shadowjams (talk) 00:52, 19 November 2010 (UTC)
Etc. and etc should be avoided in formal prose, IMO. "Such as ...", and "including ..." are just two subset terms that indicate that a list is incomplete, and avoid the brush-off informality of "etc"
"[number]-fold"
I have just removed the following as not being a typo.
AFAIK, usage of the -fold suffix (i.e. 'three-fold' as opposed to 'threefold') is an accepted/bona fide variant, and does not fall to be treated as a typo. --Ohconfucius¡digame!04:10, 22 November 2010 (UTC)
Oxford Dictionaries Online doesn't list them as variants and I can't find any instances on Google, which thinks it's a typo. Usually hyphenated compound words are British but British usage seems to be no hyphen. McLerristarr | Mclay107:24, 22 November 2010 (UTC)
Saavy --> Savvy
A new user recently requested that this typo be fixed by an AWB user. I found the misspelling in 18 articles when I ran the request. --Andrew Kelly (talk) 03:49, 23 November 2010 (UTC)
The United States National Institute of Standards and Technology prefers "kW·h" but considers kW h acceptable. It acknowledges that the ISO allows dropping the space if there is no risk of confusion, but NIST disagrees with ISO's position.
My position is that the attention human editors give to reviewing AWB edits is often minimal, so a form that can be confusing, "kWh", should be forbidden for AWB purposes.
Also, since there are two acceptable forms, if examination of an article shows it consistently uses a correct form, together with a few errors, the AWB user must follow the established form for that article. Jc3s5h (talk) 18:00, 28 November 2010 (UTC)
The new rule is set up to change "KWh", "Kwh", or "Kph" → "kWh". RegExTypoFix can't suggest to the user to use one of multiple forms. Should it suggest "kW·h" or "kW h"? GoingBatty (talk) 02:41, 29 November 2010 (UTC)
"Kph"? It should be kW·h as that is the correct form. If people want to use the incorrect form, then that's up to MOS:NUM to decide, but a typo corrector should add the most correct form. It shold just not correct kW h. A new rule could be set up to add nbsp between units like that. Although, I'm not sure if that's a typo thing or a general AWB thing. McLerristarr | Mclay104:34, 29 November 2010 (UTC)
"Kph" is probably a typo for km/h and is certainly not a typo for kW·h. This is a clear error in the rule which must be fixed. I think the correction for the other typos should be kW·h. Editors who consistently fail to change this to kW h in articles where that form is appropriate should have their permission to use AWB revoked for failure to properly review their edits. Jc3s5h (talk) 13:57, 29 November 2010 (UTC)
I agree that AWB users should review their edits before saving, but I don't see how you would educate AWB users on the level of consistency you desire for the proper abbreviation, especially when the scientific community can't agree. You'd probably have better luck educating the editors who made the original mistakes, so the AWB users won't have to fix anything. GoingBatty (talk) 17:31, 29 November 2010 (UTC)
I would say AWB users should not use it on articles if they lack subject matter knowledge, or they should turn off any options that would make changes that require subject matter expertise to evaluate. Those who cannot be persuaded to limit AWB use to situations they can properly evaluate should have the privilege of using it removed. Jc3s5h (talk) 18:12, 29 November 2010 (UTC)
Please don't put anything in the typo rules that requires expert knowledge. I use AWB to fix thousands of grammatical errors scattered randomly across hundreds of subject areas. If, as I read here, this typo rule is controversial or requires extra-careful review, I will simply turn off the RegExpTypoFix option on any article where this rule kicks in - and that means that other typos in that article won't be fixed. -- John of Reading (talk) 18:22, 29 November 2010 (UTC)
So John, by the same reasoning, you wouldn't want any typo correction for words that are spelled differently in various varieties of English, such as "colour", right? Jc3s5h (talk) 19:35, 29 November 2010 (UTC)
That's correct, we couldn't add a typo rule for "color > colour" or "colour > color", because they would give the wrong results too often. -- John of Reading (talk) 21:09, 29 November 2010 (UTC)
John, I don't think that's the proper analogy. Your example is a rule that is changing one correct version of the word for another. I think a better analogy is that we don't add a typo rule to fix the incorrect "colur", because the correct word could be either "color" or "colour".
In this case, a typo rule was added to fix the incorrect "KWh" or "Kwh", but since the correct abbreviation could be "kW·h" or "kW h" or maybe even "kWh" (depending on which organization you want to follow), I think the safest thing would be to remove the rule and let those with the expert knowledge identify and fix all future errors.
It was the rule named "Dissi-". I don't know whether it's worth changing the rule, though, since this is such an uncommon typo - an AWB Google search finds just two examples, neither in article space. If I see that RegExpTypoFix has made an incorrect fix, I don't hit "Save"... -- John of Reading (talk) 07:35, 9 December 2010 (UTC)
Perhaps it would be wise to make a temporary rule of "dissiciplinary" to "disciplinary" to fix up the mistakes that the typo finder may have already made? McLerristarr | Mclay108:10, 9 December 2010 (UTC)
The "typo" fix here is invalid. The beach is called "Bicep Beach". (I just watched the short to confirm.) So I added {{typo}} around the word. Does AWB honor that? Is that the appropriate action in cases like this? --Mepolypse (talk) 15:43, 11 December 2010 (UTC)
Does {{not a typo}} have the options that {{sic}} has? One thing {{sic}} has is the ability to hide or display the word "sic"; sometimes you just want to tell spell checkers to leave it alone, and sometimes you want "sic" displayed in the article. --Auntof6 (talk) 03:20, 12 December 2010 (UTC)
I see the difference as {{sic}} is to tag a mistake made by someone outside of Wikipedia, e.g. in a quote, whereas {{Not a typo}} is to tag a deliberate mistake made by a Wikipedia editor or something that seems like a mistake but isn't. McLerristarr | Mclay104:37, 12 December 2010 (UTC)
That's not a typo or even incorrect, it is merely a personal preference. It is completely correct grammar to say someone was "made a Knight of the British Empire". The award is often used to refer to the recipient. Ringo Starr has an MBE = Ringo Starr is an MBE. Whether that is correct or not is not for a typo fixer to decide. McLerristarr | Mclay114:14, 12 December 2010 (UTC)
Done. This update will ensure the correction shows up in the edit summary for all cases except the "John the baptist" fix. Rjwilmsi 01:45, 3 January 2011 (UTC)
<Typo word="Cadillac" find="\b[Cc]ad(dil(l|)|il)ac\b" replace="Cadillac" />
<Typo word="Be unable" find="\bnot\s+be\s+able\b" replace="be unable" />
<Typo word="Aberrant" find="\b([Aa])b(b[ae]rr?|[ae]r|arr?)([ae](nce|nt|tes?|tions?)|)\b" replace="$1berr$3" />
<Typo word="Accelerate" find="\b([Aa])c(cela|[ae]l[ae])rat(e(d|s|)|ing)\b" replace="$1ccelerat$3" />
<Typo word="Accidentally" find="\b([Aa])cc?id[ae]nt([aei]?(ly))\b" replace="$1ccidentally" />
<Typo word="across" find="\bacros\b" replace="across" />
<Typo word="Adaptation" find="\b([Aa])dapt([ae]|io)n(s?)\b" replace="$1daptation$3" />
<Typo word="Adaptive" find="\b([Aa])dapt[aei]tive\b" replace="$1daptive" />
<Typo word="Adultery" find="\b([Aa])d[aeu]lt[au]?ry\b" replace="$1dultery" />
<Typo word="Anesthe(sia/tic)" find="\b([Aa])n[ai]sth[ae](sia|tics?)\b" replace="$1nesthe$2" />
<Typo word="(A/E)ffect" find="\b([AaEe])fect(s|ing|)\b" replace="$1ffect$2" />
<Typo word="affidavit" find="\baf(f[ae]|[aei])(d[ae]v[ie][td](s?))\b" replace="affidavit$3" /> <!--To do: catch if start with aff) -->
<Typo word="Affluen(t/ce/cy|tial)" find="\b([Aa])fluen(c[ey]|t(tial)?)\b" replace="$1ffluen$2" />
<Typo word="(Un)Afflict" find="\b([Uu]na|[Aa])flict(e(d(ly|ness|)|r)|i(ng|ons?(less)?|ve)|less|s|)\b" replace="$1fflict$2" />
<Typo word="Aggravate" find="\b([Aa])g(gr[eo]|r[aoe])vat(ed?|i(on|ve)|or)\b" replace="$1ggravat$3" />
<Typo word="Agrees to" find="\bagress\s+to\b" replace="agrees to" />
<Typo word="Agreement" find="\bagree?[ia]nce\b" replace="agreement" /> <!-- Per http://dictionary.reference.com/browse/agreeance agreeance is "considered obsolete and a bastardization of 'agreement' " -->
<Typo word="Aid" find="\b(to|give|provide)\s+aide\b" replace="$1 aid" /> <!--Aid vs Aide needs more work-->
<Typo word="Album" find="\balbumn(s?)\b" replace="album$1" />
Before I add the above typo suggestions (and commit more time to writing the regexes), I wanted to make sure that the above is correct. Could someone familiar with the typo regexes let me know whether the above formatting and regexes are correct? Smallman12q (talk) 23:01, 2 January 2011 (UTC)
Here's what I changed above:
Added missing left bracket to the "Cadillac" rule.
Changed the name of the "Unable" rule to "Be unable". What is your source for changing this?
Changed the end of the "Abberant" rule to $3.
Changed the beginning of "Aggravate" rule to $1.
Changed the name of the "Agreeance" rule to "Agreement".
Moved the comments immediately after the appropriate rule.
A style suggestion: Set each rule so that the replace field has only $1 and $2, and not jump from $1 to, say, $5. Then if someone later makes a change, they won't have to count whether $5 has to be increased to $6. For example:
Have you checked to see whether all these errors actually occur in wikipedia?
Is "ambivilent" really a word? It's not in my gigantic dead-tree dictionary, and the only two occurrences I find in wikipedia are errors for "ambivalent". Even if it is a real word, I wouldn't create a rule for it, as it is apparently exceedingly rare.--BillFlis (talk) 14:06, 3 January 2011 (UTC)
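The style suggestion about keeping the replace field to consecutive group numbers can be sketched with the "Accelerate" rule from the list above. Groups that are never referenced are turned into non-capturing `(?:...)` groups, so `$3` becomes `$2`. This is an illustration in Python (`$n` written as `\n`), and the compact form is my own rewrite, not a rule anyone has proposed for the list.

```python
import re

# Rule as posted: the replacement skips from group 1 to group 3.
NUMBERED = (r"\b([Aa])c(cela|[ae]l[ae])rat(e(d|s|)|ing)\b", r"\1ccelerat\3")

# Same matching with unused groups made non-capturing, so the groups are 1 and 2.
COMPACT = (r"\b([Aa])c(?:cela|[ae]l[ae])rat(e(?:d|s|)|ing)\b", r"\1ccelerat\2")


def fix(rule: tuple[str, str], text: str) -> str:
    """Apply a (find, replace) pair to text."""
    find, replace = rule
    return re.sub(find, replace, text)


if __name__ == "__main__":
    for word in ("acelerated", "Accelarating", "accelerate"):
        print(word, "->", fix(NUMBERED, word), "/", fix(COMPACT, word))
```

Both forms give the same output on these samples; the compact one just leaves less to recount when a group is added later.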
No, it's not... that's my mistake... it should be "ambivalent". I have checked some: "literaly" returns 8 results, "iliterate" returns 2. I also have a more technical question: is the typo scan plugin multi-threaded? Smallman12q (talk) 16:07, 3 January 2011 (UTC)
Here are some more suggestions...I believe I'm done with most of the A's...
<Typo word="Amidst/Amongst/Whilst" find="\b([Aa]m(ong|id)|[Ww]hil)st\b" replace="$1" /><!--archaic-->
<Typo word="Immoral/Immortal" find="\b([Ii])mor(t?)al(s|ity|l?y|)\b" replace="$1mmor$2al$3" />
<Typo word="Immoral (2)" find="\bammoral" replace="immoral" /><!--could also be amoral-->
<Typo word="Ampersand" find="\b([Aa])mp(?:ers[eiou]|[[aiou]rsa)nd(s|)\b" replace="$1mpersand$2" />
<Typo word="Anecdote" find="\ban[ia]dote(s|)\b" replace="anecdote$1" />
<Typo word="Annoyance" find="\bannoyment(s)\b" replace="annoyance$1" />
<Typo word="Anymore" find="\bany\s+more\b" replace="anymore" /><!--http://dictionary.reference.com/browse/anymore most commonly spelled as one word-->
<Typo word="Anyway" find="\b([Aa])n?nyways\b" replace="$1nyway" /><!--http://dictionary.reference.com/browse/anyways Anyways is not standard-->
<Typo word="Arctic" find="\b([Aa])rtic" replace="$1rctic" />
<Typo word="As usage (since)" find="\b([Aa])s\s+(al(most|)\b" replace="since" /><!--check if first letter caps-->
<Typo word="As usage (because)" find="\bas\s+there\b" replace="because" />
<Typo word="Assertion" find="\b([Aa])ss?ertation(s?)\b" replace="$1ssertion" /><!-- Assertation obsolete-->
<Typo word="Opinion" find="\bopinionation\b" replace="opinion" />
<Typo word="Authentication" find="\b([Aa])u?thentification\b" replace="$1uthentication" /><!-- Authentification is incorrect-->
<Typo word="Backward" find="\bba(ckw[eio]|kw[[aeio])rd(s?)\b" replace="backward$1" />
I don't think we should "fix" amidst, amongst, or whilst. These are not marked as archaic in Merriam-Webster. "Ammoral" -> "Amoral" would be a better fix. "Anidote" is more likely "Antidote" than "Anecdote", IMO. "Any more" is right out -- "are there any more of these?" is perfectly valid. I do not understand "as al" -> "since" or "as almost" -> "since", and the rule is missing a close paren. "As almost" (uppercase) -> "since" (lowercase) would be wrong regardless. -- JHunterJ (talk) 18:15, 3 January 2011 (UTC)
"Immoral" is probably handled by one of the "beginnings" rules. "Artic" without a final \b will damage "Articulated"; with a \b there will still be many false positives - try a search. "Opinionation" occurs only four times, three times as the title as a piece of music and once in the title of an academic work. As a general point, can one of the AWB performance experts please indicate roughly how many hits are needed to make a rule worthwhile? -- John of Reading (talk) 18:22, 3 January 2011 (UTC)
I don't agree with the "Anesthe(sia/tic)" rule. If someone had misspelt it "anasthesia", it could have meant to be "anaesthesia" or "anesthesia". We can't correct that one. McLerristarr | Mclay1 06:54, 4 January 2011 (UTC)
Why not? Picking either valid spelling from two variations (with the same meaning) is an improvement over a misspelling. -- JHunterJ (talk) 15:22, 5 January 2011 (UTC)
But picking the American variation on a British page or vice versa isn't that much of an improvement. The main problem is which one would we pick for the typo finder to use? Either way, it isn't fair on the other. McLerristarr | Mclay1 15:56, 5 January 2011 (UTC)
OTOH, picking a chiefly American variation on a British page (or vice versa) is a big improvement over a misspelling on either type of page (it only "violates" WP style, not English spelling). "Fair" isn't at issue. (Can't speak for all Americans, of course, but I'd rather see the chiefly British variation than the misspelling.) -- JHunterJ (talk) 17:32, 5 January 2011 (UTC)
The only occurrences of "opinionation" I found were 1) in a song title, "My Opinionation", hence a deliberate misspelling or nonce word, and 2) in the title of a journal article, so probably intended as technical jargon. Also, the "Backward" rule has a couple of problems: "[[" and "$1". Also, in the first set, "adaption" could be a typo for "adoption". --BillFlis (talk) 14:24, 6 January 2011 (UTC)
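John of Reading's point about the "Arctic" rule lacking a final `\b` can be checked with a short Python sketch. The find pattern is the one posted above; `POSTED`, `BOUNDED` and `fix` are names invented for this illustration, and Python's `\1` stands in for AWB's `$1`.

```python
import re

POSTED = r"\b([Aa])rtic"      # as posted: no trailing word boundary
BOUNDED = r"\b([Aa])rtic\b"   # with a trailing \b added


def fix(pattern, text):
    """Apply the 'Arctic' replacement using the given find pattern."""
    return re.sub(pattern, r"\1rctic", text)


# Without the trailing \b the rule fires inside longer words.
print(fix(POSTED, "Articulated lorry"))

# With the trailing \b only the standalone word is changed.
print(fix(BOUNDED, "Articulated lorry"))
print(fix(BOUNDED, "the Artic circle"))
```

The bounded form still rewrites every standalone "Artic", so it does nothing for the legitimate uses of that word that a search turns up.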
False positive: "suppose to" → "supposed to"
In Perates the following sentence illustrates the fact that suppose can correctly be followed by to:
We do not read elsewhere of any Euphrates but the Stoic philosopher, who lived in the reign of Hadrian, whom we cannot suppose to have been a teacher of Ophite doctrine.
I am currently working on a "find & replace" run for "suppose to" and have found several other false positives. I am removing the rule. -- John of Reading (talk) 21:33, 9 January 2011 (UTC)
And you've both been working on the list of articles that are next on my list. :-) How about readding it to only find "is/was suppose to"? GoingBatty (talk) 21:53, 9 January 2011 (UTC)
I've just corrected about a hundred more typos, and have more to do. Based on the volume that the three of us have fixed, I feel that this should be added to the list, just more carefully. Sorry I didn't get it right the first time. GoingBatty (talk) 23:50, 9 January 2011 (UTC)
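A rough Python sketch of the narrower rule suggested above, restricted to "is/was suppose to". Both patterns are assumptions written for illustration: `BROAD` is a guess at the shape of the removed rule, and `NARROW` is one possible reading of the suggestion, not the rule that was eventually added.

```python
import re

# Assumed shape of the removed rule: fires on any "suppose to".
BROAD = (r"\bsuppose\s+to\b", "supposed to")

# Hypothetical narrower rule: only after "is" or "was".
NARROW = (r"\b(is|was)\s+suppose\s+to\b", r"\1 supposed to")


def apply_rule(rule, text):
    """Apply a (find, replace) pair to text and return the result."""
    find, replace = rule
    return re.sub(find, replace, text)


legit = "whom we cannot suppose to have been a teacher of Ophite doctrine"
typo = "He was suppose to arrive at noon"

print(apply_rule(BROAD, legit))    # damages the correct sentence
print(apply_rule(NARROW, legit))   # leaves it alone
print(apply_rule(NARROW, typo))    # still fixes the typo
```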
tieing -> tying
Since there is already a rule for "dieing -> dying", there should also be one for "tieing -> tying". AWB did not find that typo here. —bender235 (talk) 23:45, 11 January 2011 (UTC)
This will create way too many false positives. If someone leaves out the comma in "he got a lot worse, then he got better" we'd end up with "he got a lot worse than he got better". McLerristarr | Mclay1 05:27, 20 January 2011 (UTC)
I don't think any of these are suitable: "He earned more/less then than he did now", "It was smaller/larger/bigger/etc then than it is now" -- John of Reading (talk) 08:05, 20 January 2011 (UTC)
Addressed that existing problem with the former rule with a negative lookahead. The other false positive of the missing comma is also an existing problem, not a problem with the expansion. Should this rule be deleted entirely? -- JHunterJ (talk) 15:12, 20 January 2011 (UTC)
Using <Typo word="september" find="\b(?<=[Dd]en \d\d? |[Ii] )September\b" replace="september" />
on sv:WP:AWB/T to change September to september in "Den 18 September 2008..." in sv:Arsonist Lodge on svwp, but the edit summary does not show the expected "typos fixed: September -> september"; it comes out blank as "typos fixed: ". Can anyone confirm this is a bug (and help me report it), or explain what I did wrong that leads to this undesirable result? ~ Dodde (talk) 00:06, 25 January 2011 (UTC)
If the regex does not match the match value ("September" is the match value here, and the regex doesn't match it) then the edit summary can't show it, as I documented. Not a bug. You don't need a lookbehind there: make it a normal group and replace with "$1september", then all will be well. Rjwilmsi 08:45, 25 January 2011 (UTC)
What do you mean? Isn't the full "\b(?<=[Dd]en \d\d? |[Ii] )September\b" referred to as the regex? This regex does match "September" if it is preceded by, e.g., "Den 18 ". I don't want "Den 18 " to be part of the match since I don't want it to be part of the edit summary. I don't understand how this can be done without positive lookbehind. ~ Dodde (talk) 16:08, 25 January 2011 (UTC)
Match value is "September", the regex does not match it, so no edit summary is generated. What you want (to hide part of the match logic in the edit summary) is not supported. Rjwilmsi 17:36, 25 January 2011 (UTC)
I believe \b(?<=[Dd]en \d\d? |[Ii] )September\b will match September in "Den 18 September 2008", but the edit summary didn't show that. -- JHunterJ (talk) 18:36, 25 January 2011 (UTC)
In all other cases, what is matched by the regex is shown in the edit summary as "typos fixed: matched value > replace value" (if those are now the correct terms?). How can a match by a .NET-supported regex suddenly be "not supported (for showing up in the edit summary)" without that being a bug? ~ Dodde (talk) 18:58, 25 January 2011 (UTC)
rev 7571 RETF to generate edit summary as normal for regex using lookarounds whereby regex doesn't match its own match value. I hadn't realized we could support this with a simple change. Rjwilmsi 08:36, 26 January 2011 (UTC)
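The pre-fix behaviour discussed in this thread can be illustrated in Python. This sketch assumes, following Rjwilmsi's description, that the summary was built by re-applying the rule to the matched value on its own. Python's `re` only accepts fixed-width lookbehinds, so `LOOKBEHIND` below is a simplified two-digit version of the posted pattern rather than the pattern itself. `GROUPED` follows the suggested workaround of a normal capturing group.

```python
import re

# Simplified fixed-width stand-in for the posted lookbehind (an assumption:
# the .NET engine used by AWB accepts the original variable-width form).
LOOKBEHIND = r"(?<=[Dd]en \d\d )September\b"

# Workaround from the thread: capture the context and put it back.
GROUPED = r"([Dd]en \d\d? )September\b"

text = "Den 18 September 2008"

match = re.search(LOOKBEHIND, text)
matched_value = match.group(0)  # only the word itself, without "Den 18 "

# Re-applying the rule to the matched value alone finds nothing, because the
# context required by the lookbehind is no longer there.
rematch = re.search(LOOKBEHIND, matched_value)

# The grouped form matches its own match value, at the cost of showing
# "Den 18 " as part of the match.
fixed = re.sub(GROUPED, r"\1september", text)

print(matched_value, rematch, fixed)
```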
Is AWB still ignoring matches in wikilinks? Apparently not: [3][4]
While this is beneficial for this one rule (and what I hoped to happen when I wrote it), it might be harmful for other rules. Can someone confirm this? — Train2104 (talk • contribs • count) 01:57, 25 January 2011 (UTC)
If it was typo fixes, the edit summary would say: "typos fixed: VIA → Via". Since this says "replaced: VIA → Via", that indicates to me that the person is using AWB's find and replace feature without checking the "Ignore external/internal wikilinks" box. Please follow up with the person who made the edits. Good luck! GoingBatty (talk) 02:30, 25 January 2011 (UTC)
-ish
The "-ish" rule tries to be very clever and not damage certain proper names that end in "sih", but it is not yet good enough. There is a false positive "Fasih" at Special Tribunal for Lebanon, and that's not the first I've run into today. How about a simpler rule that only changes "-sih" to "-ish" when the word is entirely lowercase? -- John of Reading (talk) 18:26, 27 January 2011 (UTC)
I know it's still being clever, but can we make it safely clever for proper names by checking the letter before the "sih"?
<Typo word="-ish" find="\b([A-Z][a-z]*[^aers]|[a-z]+)sih(ing(?:ly)?|e[ds]|ers?)?\b" replace="$1ish$2" /><!--Don't match proper names with -asih -esih -rsih -ssih -->
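A quick Python check of the proposed "-ish" rule, as a sketch only: the find pattern is copied from the rule above, `\1`/`\2` stand in for `$1`/`$2`, and `fix_ish` is a name invented here. Python 3.5 or later is assumed, where an unmatched optional group is replaced by an empty string.

```python
import re

ISH_FIND = r"\b([A-Z][a-z]*[^aers]|[a-z]+)sih(ing(?:ly)?|e[ds]|ers?)?\b"
ISH_REPLACE = r"\1ish\2"


def fix_ish(text):
    """Swap a transposed 'sih' to 'ish' using the proposed rule."""
    return re.sub(ISH_FIND, ISH_REPLACE, text)


for word in ["Englsih", "finsih", "publsihed", "Fasih"]:
    print(word, "->", fix_ish(word))
```

The capitalised branch skips "Fasih" because the letter before "sih" is an "a"; an all-lowercase "fasih" would still be changed by the second branch.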
Just to clarify the usage, you replaced "certinaly" with "certinally" there, using AWB and its catch-all rule for replacing -aly with -ally. I will expand the "certain" rule to handle "certin", but remember to check the edits you make using AWB. -- JHunterJ (talk) 17:31, 1 February 2011 (UTC)
Is it possible to expand the contractions rule to allow for badly punctuated examples, such as did'nt, doesnt, etc? I don't think there's many of them, but it would be worth picking them up when they occur. Words like cant should be exempt from the rule. — Tivedshambo (t/c) 20:56, 1 February 2011 (UTC)
How about leaving the "cannot" rule the way it is, removing "we're", adding question marks to find zero or one apostrophe, and ensuring it doesn't change "hell" and "shell" as follows:
<Typo word="will not" find="\bwon[’'`]?t\b" replace="will not" /><!--don't change uppercase titles-->
<Typo word=" not" find="\b(are|(c|sh|w)ould|d(id|o|oes)|ha([ds]|ve)|is|m(igh|us)t|w(as|ere))n[’'`]?t\b" replace="$1 not" /><!--don't change uppercase titles, can't and won't have separate rules-->
<Typo word=" are" find="\b(they|wh(at|o)|you)[’'`]?re\b" replace="$1 are" /><!--don't change uppercase titles-->
<Typo word=" have" find="\b((c|sh|w)ould|they|wh(at|o)|you)[’'`]?ve\b" replace="$1 have" /><!--don't change uppercase titles-->
<Typo word=" will" find="\b(s?he|they|wh(at|o)|you)[’'`]?ll\b(?<!hell)" replace="$1 will" /><!--don't change uppercase titles or "hell" or "shell"-->
Tweaks to avoid matching "wont" and "whore", and more alternatives rather than negative lookbehind (and non-capturing parens where the capture isn't used).
<Typo word="will not" find="\bwon[’'`]t\b" replace="will not" /><!--don't change uppercase titles or "wont"-->
<Typo word=" not" find="\b(are|(?:c|sh)ould|d(?:id|o|oes)|ha(?:d|s|ve)|is|m(?:igh|us)t|w(?:as|ere|ould))n[’'`]?t\b" replace="$1 not" /><!--do not change uppercase titles, can't and won't have separate rules-->
<Typo word=" are" find="\b(who[’'`]|(?:they|what|you)[’'`]?)re\b" replace="$1 are" /><!--do not change uppercase titles or "whore"-->
<Typo word=" have" find="\b((?:c|sh)ould|they|w(?:ould|h(?:at|o))|you)[’'`]?ve\b" replace="$1 have" /><!--do not change uppercase titles-->
<Typo word=" will" find="\b(s?he[’'`]|(?:they|wh(?:at|o)|you)[’'`]?)ll\b" replace="$1 will" /><!--do not change uppercase titles or "hell" or "shell"-->
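The reason for the tweaks can be seen by running the first proposal's patterns against real words. The Python sketch below is illustrative only: the first three patterns are copied from the first set of rules, the fourth is the stricter "will not" pattern from the second set, and the constant names are invented here.

```python
import re

# From the first proposal: the apostrophe is optional.
WILL_NOT_OPTIONAL = r"\bwon[’'`]?t\b"
ARE_OPTIONAL = r"\b(they|wh(at|o)|you)[’'`]?re\b"
WILL_GUARDED = r"\b(s?he|they|wh(at|o)|you)[’'`]?ll\b(?<!hell)"

# From the second proposal: the apostrophe is required for "won't".
WILL_NOT_REQUIRED = r"\bwon[’'`]t\b"

# An optional apostrophe lets the rules fire on ordinary words.
print(re.sub(WILL_NOT_OPTIONAL, "will not", "as was his wont"))
print(re.sub(ARE_OPTIONAL, r"\1 are", "whore"))

# Requiring the apostrophe leaves "wont" alone.
print(re.sub(WILL_NOT_REQUIRED, "will not", "as was his wont"))

# The negative lookbehind protects "hell" and "shell" while still expanding
# "he'll" and the unpunctuated "theyll".
print(re.sub(WILL_GUARDED, r"\1 will", "hell shell he'll theyll"))
```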
And this is an example. Please read WT:MOS for the last week; there have been several instances where non-native speakers of English have simply substituted was not for wasn't, when the sentence needed to be recast - and where simple substitution changes the force of the sentence. WP:CONTRACTIONS says, advisedly, generally; there are occasions when the uncontracted form is a violation of idiom; there are occasions when the contraction is being quoted; and there are occasions when the change needed is much more complex than AWB can provide. Septentrionalis PMAnderson 19:40, 8 February 2011 (UTC)
Deletion of the "harmful" contraction section
I have undone this edit by Pmanderson (talk·contribs). The section containing the "harmful contractions", the ones that have provoked so much debate, was disabled by me 36 hours ago; that section is still there but is commented out. The section removed by Pmanderson is an ordinary typo-fixing section that fixes incorrect contractions, for example changing it's downfall to its downfall and was'nt to wasn't. I have restored that section. -- John of Reading (talk) 19:51, 8 February 2011 (UTC)
That rule has been written to match only a very specific list of words. The rule does not match "it's down" or "it's downfallen", only "it's downfall". There's documentation on the rule syntax at WP:REGEX, but some of the rules have grown pretty complex. -- John of Reading (talk) 20:24, 8 February 2011 (UTC)
Having slipped once, I will ask you to edit this for me. Do you have a set of commands which enforce WP:ENDASH? If so, please comment them out on the same grounds:
They are guidance only; like contractions, following the rules without thinking may change grammatical English to ungrammatical, or to English that means something not originally intended.
I think we have to be careful when changing contractions not to change proper names that may not be in quotes (e.g. Wouldn't It Be Nice) and contractions with more than one meaning (e.g. "'s" could be "is", "does" or "has"; same with "'d"). How about the following?
<Typo word="can not" find="\bcan't\b" replace="can not" /><!-- do not change uppercase titles -->
<Typo word="will not" find="\bwon't\b" replace="will not" /><!-- do not change uppercase titles -->
<Typo word=" not" find="(?!\b(wo|ca)n't\b)\b([a-z]+)n't\b" replace="$2 not" /><!-- do not change uppercase titles, can't and won't have separate rules -->
<Typo word=" have" find="\b([a-z]+)'ve\b" replace="$1 have" /><!-- do not change uppercase titles -->
<Typo word=" will" find="\b([a-z]+)'ll\b" replace="$1 will" /><!-- do not change uppercase titles -->
These are working well for me. Two more comments: I much prefer "cannot" over "can not"; and I think the third rule should be rewritten as an explicit list of the contractions it is willing to fix. This will make it safer, and, as I understand it, will stop it being a performance hit. -- John of Reading (talk) 11:01, 21 January 2011 (UTC)
Suggest:
<Typo word="can not" find="\bcan[’'`]t\b" replace="cannot" /><!-- do not change uppercase titles -->
<Typo word="will not" find="\bwon[’'`]t\b" replace="will not" /><!-- do not change uppercase titles -->
<Typo word=" not" find="\b([a-z]+)n[’'`]t\b(?<!\b(?:wo|ca)n[’'`]t\b)" replace="$1 not" /><!-- do not change uppercase titles, can't and won't have separate rules -->
<Typo word=" have" find="\b([a-z]+)[’'`]ve\b" replace="$1 have" /><!-- do not change uppercase titles -->
<Typo word=" will" find="\b([a-z]+)[’'`]ll\b(?<!ya[’'`]ll)" replace="$1 will" /><!-- do not change uppercase titles, ya'll is more likely y'all than ya will -->
based on the comments, my preference for negative lookbehind (evaluated only after a possible match) over negative lookahead (evaluated whether or not there's a match), and an exclusion for ya'll, which is more likely y'all than ya will. -- JHunterJ (talk) 12:55, 21 January 2011 (UTC)
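The two ways of excluding "can't" and "won't" from the generic " not" rule can be compared side by side. This Python sketch copies the lookahead pattern from the earlier proposal and the lookbehind pattern from the suggestion above. It only shows that the two give the same result on a sample sentence; it does not measure the performance difference described, which would need testing on AWB's own .NET engine. The constant names and `expand` are invented for the sketch.

```python
import re

# Negative lookahead placed first; the real word is in group 2.
LOOKAHEAD = (r"(?!\b(wo|ca)n't\b)\b([a-z]+)n't\b", r"\2 not")

# Negative lookbehind placed last; the real word is in group 1.
LOOKBEHIND = (r"\b([a-z]+)n[’'`]t\b(?<!\b(?:wo|ca)n[’'`]t\b)", r"\1 not")


def expand(rule, text):
    """Apply a (find, replace) pair to text and return the result."""
    find, replace = rule
    return re.sub(find, replace, text)


sentence = "they didn't stay, can't return and won't say why it isn't allowed"

print(expand(LOOKAHEAD, sentence))
print(expand(LOOKBEHIND, sentence))
```

Both forms leave "can't" and "won't" for their dedicated rules; the lookbehind form also keeps the replace field at `$1`.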
dictionary.com confirms that "cannot" is far more common than "can not". Per this list of standard contractions, I've modified the suggestion below. Note that I added an " are" rule and removed the " will" rule, since "he'll" could either be "he will" or "he shall".
<Typo word="cannot" find="\bcan[’'`]t\b" replace="cannot" /><!-- do not change uppercase titles -->
<Typo word="will not" find="\bwon[’'`]t\b" replace="will not" /><!-- do not change uppercase titles -->
<Typo word=" not" find="\b(are|(c|sh|w)ould|d(id|o|oes)|ha([ds]|ve)|is|m(igh|us)t|were)n[’'`]t\b" replace="$1 not" /><!-- do not change uppercase titles, can't and won't have separate rules -->
<Typo word=" are" find="\b(they|w(e|hat|ho)|you)[’'`]re\b" replace="$1 are" /><!-- do not change uppercase titles -->
<Typo word=" have" find="\b(they|w(e|hat|ho)|you)[’'`]ve\b" replace="$1 have" /><!-- do not change uppercase titles -->
<Typo word=" will" find="\b(s?he|they|w(e|hat|ho)|you)[’'`]ll\b" replace="$1 will" /><!-- do not change uppercase titles -->
Changed the " not" rule to find "don't". Please feel free to update the rules (especially to improve performance, if needed). GoingBatty (talk) 02:06, 23 January 2011 (UTC)
I would !vote to include "he'll > he will", and so on. I think that "he shall" is unusual and, if meant, would be spelt out explicitly and not contracted. -- John of Reading (talk) 08:54, 23 January 2011 (UTC)
I realize I am a bit late to the party, but I am not sure that we should be doing this with AWB. For one, the MOS doesn't state that there is a preference, so this is subjective and I believe it will cause some tensions. I also think that in several cases it's too formal and makes reading more difficult, and I have seen this change being applied to the title in cite web and in quotations, which are both inappropriate IMO. Cheers. --Kumioko (talk) 16:17, 29 January 2011 (UTC)
Hi Kumiko - welcome to the party! WP:MOS#Contractions says "The use of contractions—such as don't, can't, won't, they'd, should've, it's—is informal and should generally be avoided." WP:AWB/T#Description says "Typo fixing is automatically prevented on image names, templates, wikilink targets and quotes." I've seen them being applied in quotations, but the root cause each time was missing/incorrect quotation marks. Could you please provide some examples of where the typo fixes were inappropriate? Thanks! GoingBatty (talk) 20:13, 29 January 2011 (UTC)
I see that the " will" rule has been removed again. As I said above, I would prefer to ignore the possibility that "shall" was meant, and expand "'ll" to "will". -- John of Reading (talk) 16:51, 30 January 2011 (UTC)
Mclay1, could you please give examples of articles that contain a contraction ending in "'ll" where "will" would not be correct? GoingBatty (talk) 17:25, 30 January 2011 (UTC)
In modern English, shall and will are interchangeable. I don't believe we should be correcting things to one thing if another thing would also be correct. If we were to correct contractions ending in 'll, we should follow traditional grammar of first person = shall and second and third person = will. McLerristarr | Mclay1 17:28, 30 January 2011 (UTC)
I thought the " will" rule only changed instances of the second and third person. Since the rule doesn't change "I'll", could you please give an example where the " will" rule would incorrectly change a first person contraction into "will"? Thanks! GoingBatty (talk) 17:37, 30 January 2011 (UTC)
It changed "we'll", although as I said that would not necessarily be incorrect in modern English. However, thinking about it, there shouldn't be any first person writing outside of quotes. We should remove "we'll" and the others could be changed to " will". McLerristarr | Mclay1 09:17, 31 January 2011 (UTC)
I re-added the " will" rule but removed changes to "we'll" and "we've". I haven't edited the regex before so I hope I did it right. A problem I've noticed with expanding contractions is that, for example, "couldn't that" should be expanded to "could that not", not "could not that". This could come up in many cases. McLerristarr | Mclay1 10:22, 31 January 2011 (UTC)
See also this edit summary—where'd that come from [sic]—by User:David Fuchs at 19:37, 9 February 2011. The contraction where'd can mean "where had" or "where would" or "where did", because the verb come has the same form as its past participle. Similar contractions are when'd, what'd, who'd, how'd, and why'd. Another contraction is let's for "let us".
"where'd" -> "where did". "where would" and "where had" are not contracted to "where'd", AFAIK (e.g., no "where'd you gone?" or "if you were to leave, where'd you go?" but only "where'd you go yesterday?"). Similarly for the other interrogatives + "'d". -- JHunterJ (talk) 19:47, 11 February 2011 (UTC)
These contractions were not part of the typo rules added recently. Since the contraction rules have been reverted, I don't see adding more any time soon. GoingBatty (talk) 23:31, 11 February 2011 (UTC)
Exception to "colour" rule
AWB just tried to replace "British Coloumbia" with "British Colourmbia". Please add an exception to the "(Dis)Colour" rule. --bender235 (talk) 10:40, 11 February 2011 (UTC)
Why? British Coloumbia isn't a false positive -- it isn't spelled correctly. The AWB user should recognize it and change its fix to the correct fix, British Columbia. Having it changed as a typo was still useful. -- JHunterJ (talk) 12:12, 11 February 2011 (UTC)
Yes, "British Coloumbia" is a misspelling, but AWB applied the wrong rule. That was my point. And by the way, I did change it to the correct fix manually. --bender235 (talk) 12:24, 12 February 2011 (UTC)
New addition - "AUD/CAD/HKD/NZD/USD"
I see that RegExpTypoFix wants to change "USD$587 million" to "USD 587 million". Before we run into another "controversial fix" fiasco, may I point out that MOS:CURRENCY doesn't mention this format? Is there a discussion somewhere?
The actual text (at Men in Black (series)) is "USD$587 million worldwide on a $90 million budget" so the fixed version looks a mess - I won't be saving this edit.
Also, as a technical quibble, the rule ends with a lookahead so the change does not show up in the edit summary. I gather that's been fixed, but I'm using the 5.2.0.0 release. -- John of Reading (talk) 16:46, 12 February 2011 (UTC)
But why not? The article's name is Muammar al-Gaddafi, and that should be the only variant on Wikipedia. Just like we have a rule for "Beijing", or "Beirut", although these also have a number of variants. --bender235 (talk) 19:27, 23 February 2011 (UTC)
I think that while we can try to be consistent with our own use, his name has so many documented spellings (as shown in the article) that I imagine it would be a nightmare to make exceptions for quotations from commentators, older sources, etc. I agree with JHunter: if someone has used a recognized misspelling (i.e. no one uses it anywhere), revert to our form; otherwise leave other forms as is. 66.80.6.163 (talk) 20:01, 23 February 2011 (UTC) (mercurywoodrose)
Color is also at "color", but we can't "correct" usages of "colour", a valid spelling of the same topic. Similarly, we can't correct "Brontosaurus" to "Apatosaurus"; WP might have a preferred spelling, but the other is not a typo or misspelling. Moammar Qaddafi is (I assume) a valid spelling of the same topic as "Muammar al-Gaddafi", not a typo or misspelling, so we don't "correct" it. If an individual AWB user wants to use it to identify non-controversial replacement opportunities and use the (non-typo) replacement tools, that would work. -- JHunterJ (talk) 21:47, 23 February 2011 (UTC)
Will the 'Pre-' regex run acceptably fast now? If not, what metric are you using? (it doesn't hang for me, at least.) PS. Thanks for the enable. – Regregex (talk) 06:33, 4 March 2011 (UTC)
hda/had Error
Hello! Not sure if the regex can be adjusted to actually catch this, but "hda" should not correct to "had" if it's in a path, e.g. "/dev/hda1/foo/" :) Avicennasis @ 03:47, 18 Adar I 5771 / 22 February 2011 (UTC)
Thanks. Which rule triggered that change (found in the Typos tab on AWB)? There might be a way to identify the path in the Wikimarkup as not-English. (Also, additions to Talk pages are not minor edits. WP:MINOR. Editors who are ignoring minor edits won't see your new question.) -- JHunterJ (talk) 11:47, 22 February 2011 (UTC)
That was just an example - I forget the actual link. I encountered this on Wikibooks the other day. Since they don't have any AWB pages, the RegEx is loaded from here. I'll try to rescan and find out exactly what the link was. Avicennasis @ 04:35, 30 Adar I 5771 / 6 March 2011 (UTC)
Found an example on Wikibooks, the page is here, and the "typo" it finds is
Not sure if that helps at all, or if it can be avoided. But there's an example of what I mean. Avicennasis @ 06:41, 30 Adar I 5771 / 6 March 2011 (UTC)
This[5] change from "-" to "–" is obviously incorrect, see WP:HYPHEN. I am not sure that it is a bug of AWB, not some hand-made rule of a particular user. I know little about AWB and therefore ask here to help fix the problem, via technical changes or maybe social interaction (I have some negative bias towards automated editors and experience some troubles communicating with them). Incnis Mrsi (talk) 16:49, 8 March 2011 (UTC)
The word "replaced" in the edit summary shows that this change is a "find and replace" rule set up by the user, not something built into AWB. But I think the change is correct. The edit summary refers to WP:ENDASH, and the change seems a correct example of point 5 there, since "World War" contains a space. -- John of Reading (talk) 18:10, 8 March 2011 (UTC)
It wants to change "an usually" to "a usually". That seems a correct change, although it seems the text may have intended unusually there. -- JHunterJ (talk) 12:09, 13 March 2011 (UTC)
Ah, the text is "an usually"! Once I changed the text to "an unusually long period", then RegexTypoFix doesn't want to change it. Thanks! GoingBatty (talk) 15:17, 13 March 2011 (UTC)
"niger" matching even when part of scientific name
Although the "Niger(ia)" rule is set up to not match scientific names, it tries to change Chlidonias niger to Chlidonias Niger on articles such as List of birds of Oregon. Could someone please see if the rule can be updated? Thanks! GoingBatty (talk) 04:36, 13 March 2011 (UTC)
"To do: Remove rare words. Note that no matches today does not mean a rule is rare, since another user may have used the rule to fix many articles yesterday." Each rule consumes some resources, and the goal is not to have 100% of possible typos included at the expense of being unable to run the tool to fix any of them. -- JHunterJ (talk) 21:43, 26 February 2011 (UTC)
If a word is rare, why not put it into a rolling set for each day of the month? That way you can still run the tool but rare fixes will still be picked up. ϢereSpielChequers 09:57, 2 March 2011 (UTC)
You can do that yourself. The AWB ruleset I run is the basic typo rules listed here, in addition to my own set of more nuanced rules (some of which require more human discretion than is appropriate for AWB). You could do the same. Shadowjams (talk) 07:23, 17 March 2011 (UTC)
"Maintenance" rule does not catch "maintanance"
I fixed about 2 dozen of these by supplying my own Find/Replace. I would like to fix the "Maintenance" rule, but I get dizzy when I look at that one for more than a few seconds. Chris the speller (talk) 17:09, 15 March 2011 (UTC)
I tweaked a rule -- sorry about any confusion that may have followed
I tweaked the "(In)Significant" rule, then reverted it, not because I saw anything wrong with it, but because when I reloaded AWB to retest the whole thing live, I found that no RegEx fixes were working for me. It was as if "Enable RegEx TypoFix" was unchecked. So in a near panic, I reverted the rule change. Well, after the reversion, RegEx fixes are still not working for me. I had saved my AWB settings before shutting it down, and reloaded settings after launching it again, so it's not some setting that I forgot. My own Find/Replace rules work fine, as do General Fixes, so I can work, but it's like sweeping with a smaller broom. Any ideas would be appreciated, as would a report that other editors are successfully using RegEx Typo after reloading AWB. As for the rule tweak, maybe an experienced tweaker can look it over and give an opinion. Chris the speller (talk) 22:05, 17 March 2011 (UTC)
Well, now *some* of the RegEx rules seem to be working for me. I'm going to reboot the whole machine (mine, not WP!), what the heck. After that, I'll retest and evaluate whether the existing two rules for "significant" are actually working better than I thought. Chris the speller (talk) 22:58, 17 March 2011 (UTC)
Only a handful of RegEx Typos work for me now, and only intermittently. I know most of you have more fun things to do, but for a change of pace, if an editor wants to help, try the following:
enable Find and Replace
add a Find and Replace for 'reponsible' to 'responsible'
enable RegEx Typos
put 'User:Chris the speller/Sandbox2' in the page list
Start.
It should fix 'reponsible' if Find and Replace is working. It should also fix about 11 other misspellings on 11 other lines if Regex Typos is working. Don't bother to save it, to allow retesting. When I do it, it doesn't catch the other 11 lines. I'd love to hear how other editors fare with this. Chris the speller (talk) 01:44, 18 March 2011 (UTC)
I am using AWB SVN 7471, just downloaded last week. Am I already 163 versions behind? How often do I need to update it? Since yours also missed all the typos on my sandbox page, it seems that mine is not the only one misfiring. I have found a few typos, but they seem to come in spurts; it finds typos in 2 or 3 pages in a row, then misses them in dozens and dozens. Thanks for giving it a try. You don't sound too worried, but I have a bad feeling about this. Chris the speller (talk) 04:04, 18 March 2011 (UTC)
Thanks, John, for providing the clear and simple answer to my problem. And thanks again, Batty, for taking the time to look into it. A humbling experience; I feel like the sorcerer's apprentice. Maybe I should change the name of this talk section to "The wrong way to set up a regression test for changes to AWB Typos" ;-) Chris the speller (talk) 14:33, 18 March 2011 (UTC)
I added a note on the project page to indicate that typos are not checked in indented paragraphs. Thanks for the info, John! GoingBatty (talk) 01:58, 19 March 2011 (UTC)
Yes, there were 10; I fixed 8, was headed off at the pass on 2 of them. I guess these may pop up at a rate of about 4 or 5 a month. I'm not sure what is a good cutoff number to qualify for adding a new typo rule.
"Lists of common misspellings" is for people, who are expected to show good judgment, brush off false positives and decide on changing "guerrila" to either "guerilla" or "guerrilla" based on the predominant spelling in the article, while "Typos" is for AWB and high speed. Even a fairly low rate of false positives makes for a pretty bumpy ride while using AWB. And "Typos" is tuned so that one rule can fix various suffixes and prefixes. Listing every possible variant in a separate rule would really bog it down, or maybe jam it good. Christhe spelleryack05:39, 29 March 2011 (UTC)
Thanks for your explanation. Also, I was confused by the "View (previous 20 | next 20) (20 | 50 | 100 | 250 | 500)" at the bottom of the search results. I assumed there were as many as 500 occurrences of that typo. Sorry. =) 128.138.43.231 (talk) 05:56, 29 March 2011 (UTC)
Capitalization of "earth"
An editor using the AWB has twice made edits to the article "Drummond Matthews" by which the text "the earth" has been capitalized to "the Earth". My contention is as follows: that "earth" should be capitalized only when it is used specifically as a name (e.g. Earth has only one moon, or Earth, Venus and Mercury are the three innermost planets). When used after "the" the word becomes a common noun analogous with "the world" or "the globe" and should not be capitalized. That's my understanding, anyway. Can the software be tweaked to enable it to make this distinction? Godingo (talk) 22:45, 31 March 2011 (UTC)
Rightly or wrongly, that's my instinctive reaction when someone raises an objection to a "New addition" and not much discussion has happened. Please re-instate the rule if you are sure that it is correct. You might want to fix note 5 in The Earth as well. Reference 80 is someone else's title, so of course that should stay as a lowercase "e". -- John of Reading (talk) 11:06, 1 April 2011 (UTC)
I fixed Ref 80 as well; someone else's automated case fixing of someone else's title was incorrect. I believe the rule is correct, but I'll see if I'm alone in this. Thanks; I didn't realize it was in the "new additions" section. -- JHunterJ (talk) 12:58, 1 April 2011 (UTC)
I've now found the right section in the manual of style, and looked at the first twenty or so potential corrections found by a database scan for \b[Tt]he\s+earth's\b. I'm happy with this rule. -- John of Reading (talk) 13:29, 1 April 2011 (UTC)
Capitalizing "Earth" on Drummond Matthews looks good to me (and I capitalized one more instance inside a wikilink which AWB won't change), as the article specifically refers to our planet. I would have guessed that the only distinction was "Earth" meant the planet and "earth" meant dirt (which seems to coincide with the MOS), but dictionary.com has other examples of where lowercase "earth" is acceptable. GoingBatty (talk) 18:13, 1 April 2011 (UTC)
It changes "imprevious" to "improvious", which is a further step away from the correct "impervious". Anyone want to tweak it to prevent this strange twist? It's probably not worth getting it to actually fix this misspelling, as only Ramalinga Swamigal had an example of it, and it is now extinct in the wild. Christhe spelleryack05:15, 10 April 2011 (UTC)
Performance question
Which runs faster, "(M|m)" or "([Mm])", on the RegEx engine used by AWB? The former is shorter by one character, but if the other construction runs faster, that might be the way to go. The Typos list has the first format at the front of most rules, such as "\b(M|m)imic(ing|ed)\b". There was a discussion in October 2010 Wikipedia talk:AutoWikiBrowser/Typos/Archive 3#Profiling heads up for you guys that said explicit character classes ran faster than shorthand character classes ([A-Za-z] faster than \w), but did not cover explicit character classes versus alternation. A web search of "regex performance character classes alternation" led me to High Performance JavaScript by Nicholas C. Zakas on Google Books, which advises against starting a RegEx with alternation. Would anyone be interested in testing this? Here's the real fun part: if it's worth changing, AWB would be a great way to change the Typo rules! Christhe spelleryack19:27, 11 April 2011 (UTC)
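For anyone who wants to try the comparison, here is a minimal timing sketch in Python. It is only an illustration: Python's `re` module is not the .NET engine that AWB uses, so the numbers will not transfer directly, and the sample text and the replacement string (inserting the missing "k") are assumptions rather than the rule as it appears on the Typos page.

```python
import re
import timeit

# Two spellings of the same rule; only the leading group differs.
ALTERNATION = re.compile(r"\b(M|m)imic(ing|ed)\b")
CHAR_CLASS = re.compile(r"\b([Mm])imic(ing|ed)\b")

# Hypothetical sample text, repeated to give the engine some work.
SAMPLE = "The bird mimiced the call while Mimicing its rival. " * 2000


def fix(pattern, text):
    # Assumed replacement: mimiced -> mimicked, mimicing -> mimicking.
    return pattern.sub(r"\1imick\2", text)


def time_rule(pattern, repeats=20):
    # Total seconds for `repeats` full passes over SAMPLE.
    return timeit.timeit(lambda: fix(pattern, SAMPLE), number=repeats)


if __name__ == "__main__":
    print("alternation:", time_rule(ALTERNATION))
    print("char class: ", time_rule(CHAR_CLASS))
```

Both patterns make identical replacements, so any difference in the printed times comes from how the engine handles the leading group. A real test would need to run on the engine AWB actually uses.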
lineraly → linerally
AWB fixed this one[8], another user said that lineraly is the correct spelling, and cited wikt:linearly (which does not have linerally)... I do not know which one is the correct one. Christian75 (talk) 17:15, 22 April 2011 (UTC)
Someone came along after you and corrected it to "linearly", which I'm sure was intended from the context. From my search, it seems that you found the only occurrence of "lineraly" in wikipedia, so I don't think it deserves a rule here to fix it.--BillFlis (talk) 18:43, 22 April 2011 (UTC)
My point was that someone with experience at WP:AWB/T might be able to prevent incorrect changes and/or advise at WP:WTM. John of Reading has given advice there, for which I am grateful. I have not found a rule here which changes provably to probably.
Today on WT:MOS, an editor requested a script for automatically removing hyphens after -ly adverbs in compound modifiers. I have put about 140 RegEx rules on User:Chris the speller/Adverbs, along with instructions on using an XML editor to splice them into an AWB settings file. This method misses very few standard -ly adverbs, but completely avoids problems with fly, July, Italy, family and the like. Christhe spelleryack23:35, 12 April 2011 (UTC)
To clarify: I'm not pushing this method for addition to the WP:AWB/Typos list, but this seemed to be the place to let a few daring editors know how they can load Find & Replace rules if they want to take this on. Since the list contains only known standard -ly adverbs, there are very few false-positive hits. The differences still need to be examined, and a hyphen removal does not exactly jump off the difference window. The main things to watch out for are changes to quotations, links (I don't usually have the "Ignore ..." boxes checked), and longer compounds (a slowly-but-surely strategy). If this is the wrong forum to bring this up, please move this discussion or ask me to move it. Christhe spelleryack14:31, 13 April 2011 (UTC)
For any software designed for automatically removing hyphens after ly adverbs in compound modifiers, I recommend that the software discriminate at least four different categories of instances.
instances where the hyphen is automatically removed
(including those with quotations from websites which omit the hyphen)
instances where the hyphen is automatically retained
(after fly, July, Italy, and family; in surnames such as Beverly-Smith; in French-language place names such as Romilly-sur-Seine; in English-language place names such as Ashly-on-Avon; in web addresses; in quotations from websites which also use a hyphen; in only-begotten)
instances where the hyphen is automatically retained but the suffix ly (or y in adverbs such as fully) is automatically removed
assembly (in assembly-line), belly, billy, butterfly (in butterfly-bush), curly, deadly, earthly, friendly, googly (in googly-eyed), holly, holy, hurly (in hurly-burly), jelly, jolly, kelly, Kimberly (in Kimberly-Clark), lily (in lily-of-the-valley), lonely, lovely, manly, McNally, mealy (in mealy-mouthed), oily (in oily-grain), pearly, pimply, poly (in poly-Bernoulli number), prickly, reply (in reply-paid), roly (in roly-poly), sally, scaly (in Scaly-throated Honeyguide), shilly (in shilly-shally), sickly (in sickly-sweet), silly (in silly-sider), sly (in sly-boot), steely (in steely-eyed), supply (in supply-side economics), tally (in tally-ho), whirly (in whirly-ball), willy (in willy-nilly), woolly (in Woolly-necked Stork), worldly (in worldly-minded)
Included there are these types of cases: radio and television stations (for example, KVLY-TV), Hungarian names (for example, László Moholy-Nagy).
After chewing on this for a while, my first two conclusions are (1) you have put a lot of effort into this—my hat's off to you. (2) I had incorrectly included "overly" in my list of standard -ly adverbs, and now it has been removed. Christhe spelleryack01:44, 15 April 2011 (UTC)
Why isn't "overly" a standard adverb? "John Smith was overly confident." "Overly" is an adverb and it's not slang; I don't see what's non-standard about it. McLerristarr | Mclay115:30, 16 April 2011 (UTC)
At User:Noetica/Archive4#Complications with ly (which I mentioned above), I posted these consecutive statements at 01:46, 18 June 2009 (UTC): "I try to avoid the word overly but I understand that it has gained some degree of acceptance by dictionaries. I see three options: (a) removing the hyphen, (b) removing the ly and leaving the hyphen, and (c) removing both the ly and the hyphen."
Please see Noetica's first reply, at 09:40, 18 June 2009 (UTC), under "1. Overly".
I've recently encountered some instances of "distint", which AWB attempted to change to "distant", but in each case the correct word was "distinct". The rule replaces
Thanks. I never had any use for that tab before, and wasn't aware of it. It didn't immediately answer my question, because my browser had choked on the Typos page and not loaded all of it, and that explains why my searches had found nothing. After finally getting it all to display, a search showed it was 'word="Dissi-" find="\b(D|d)isi([a-ko-rt-z]|m[a-nq-z]|s[a-km-z])([a-z]+)\b" replace="$1issi$2$3"'. I'm not going to mess with that little stinker. I don't know if it's worth adding "Disith" to the list of false positives to avoid. Thanks for the response. Now I know what to watch out for and why. Cheers! Christhe spelleryack21:10, 22 May 2011 (UTC)
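To see why the quoted "Dissi-" rule catches words it should not, it can be exercised outside AWB. The sketch below ports the find/replace pair to Python's `re` syntax (`$1` becomes `\1`). The test words are illustrative, and Python's engine is only a stand-in for the .NET one.

```python
import re

# The "Dissi-" rule as quoted above, with $n references rewritten as \n.
DISSI = re.compile(r"\b(D|d)isi([a-ko-rt-z]|m[a-nq-z]|s[a-km-z])([a-z]+)\b")


def apply_dissi(text):
    # Double the "s": disi... -> dissi...
    return DISSI.sub(r"\1issi\2\3", text)


if __name__ == "__main__":
    for word in ["disipate", "disimilar", "disillusion", "Disith", "disiciple"]:
        print(word, "->", apply_dissi(word))
```

The first two words are the intended fixes and "disillusion" is excluded by the character classes, but "Disith" and "disiciple" are also rewritten, which matches the false positives reported on this page.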
An occasional misspelling and mispronunciation of 'mischievous'. I couldn't see it on the list, although it's been a while since I last edited it. Mephtalk 04:42, 23 May 2011 (UTC)
Done with this edit. -- JHunterJ (talk) 11:38, 23 May 2011 (UTC)
Its + apostrophe
Chris the speller and I have just fixed about 1200 of these in response to this request, and in over 99% of the cases the fix was to remove the apostrophe. I did a database scan for \b[Ii]ts[’'`] (three kinds of single quote), skipping those that matched \b[Ii]ts'' (italic/bold formatting). I don't know how to encode that in a single rule. If anyone thinks this is worth adding, could they add it? -- John of Reading (talk) 16:24, 1 June 2011 (UTC)
Notice that the offending apostrophe has to be followed by a whitespace character. This rule will skip bold/italics, and though it might be a little restrictive, I think the vast majority of cases will be followed by a space. Christhe spelleryack21:57, 1 June 2011 (UTC)
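The constraints described in this thread (three kinds of single quote, skip the doubled '' used for italic/bold markup, require whitespace after the apostrophe) can be sketched as a single pattern. This is a reconstruction in Python, not the rule as it was actually added. Using a lookahead for the whitespace is an assumption made here so that the space itself is left untouched.

```python
import re

# "its" or "Its" followed by one straight, curly, or backtick apostrophe,
# but only when whitespace comes next (this also skips its''italic'').
ITS_APOSTROPHE = re.compile(r"\b([Ii])ts[’'`](?=\s)")


def fix_its(text):
    # Drop the stray apostrophe and keep the original capitalisation.
    return ITS_APOSTROPHE.sub(r"\1ts", text)


if __name__ == "__main__":
    print(fix_its("The dog wagged its' tail."))
    print(fix_its("Its’ name is unknown."))
    print(fix_its("its''italic'' markup is left alone"))
```

Because the pattern requires the letters "its" directly before the apostrophe, the correctly spelled "it's" is never touched.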
I'm not sure exactly where, but there's some sort of encompassing tag that the wiki engine ignores but AWB respects that's making it not check it. If you strip away all of the text except for that paragraph you'll see that it does fix it. I'll investigate and see if I can find what exactly is doing it. Shadowjams (talk) 00:58, 4 June 2011 (UTC)
Found it. The "’" in U.K.’s is making the AWB engine think it's in quotes, or something, and tripping it out. We incidentally have a MoS about this, and converting "’" to "'" (same with full quote marks) should be common, however it's such a pervasive typo that it's a bore to fix. But this is another example of how the MoS guidance is right on this point. Shadowjams (talk) 01:17, 4 June 2011 (UTC)
Thanks. In a parallel investigation, I found that taking any quotes (straight or curly) from around "live" in the paragraph following the unfixed "conjuction" allowed Typos to nail it. Shows how convoluted this problem is. Now I formally declare war: "Death to curly apostrophes!" and maybe curly double quotes, too. Thanks for the help. Happy editing! Christhe spelleryack02:24, 4 June 2011 (UTC)
Fix for youtube tag
Hi! First of all, thanks for this amazing list! It has been incredibly useful! Though there's a minor error in it. At Wikia, we use <youtube> tags, without capitals. However, AutoWikiBrowser wants to "fix" these tags by changing them into <YouTube>. Could this be fixed? Thanks! 213.93.184.183 (talk) 19:54, 5 June 2011 (UTC)
(EC) The tag's not case sensitive is it? Isn't this just a cosmetic issue?
Nevertheless, this should fix it: (?<!<)\b(?:Yout|you[Tt])ube\s. It should fix all cases of your tags too because it will just avoid all youtube phrases that begin with <. Shadowjams (talk) 22:31, 5 June 2011 (UTC)
Leading with a negative-lookbehind strikes me as more expensive than ending with one. When leading, at every position (not just every occurrence of youtube, but every position) in the text, the parser has to check whether the previous character is not a <, and if not (which is usually), then look for a mal-cased YouTube. With trailing, it only looks back if it has found a mal-cased YouTube, so should be cheaper. The compiler may disagree, but I'd have to see stats. And then I cleaned up the .com check, to look for actual .coms, so it will correctly fix " .... video on Youtube." -- JHunterJ (talk) 02:15, 6 June 2011 (UTC)
Me again, sorry for not reporting this earlier but I completely forgot: <youtube> now works, but </youtube> (note the slash) is still converted to YouTube. Thanks in advance :). 213.93.184.183 (talk) 22:24, 18 June 2011 (UTC)
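A trailing lookbehind along the lines JHunterJ describes can be made to skip both the opening and the closing tag. The sketch below is one hypothetical way to write it in Python, not the rule that went into the Typos list. Python requires fixed-width lookbehinds, so it uses two of them, one per tag form.

```python
import re

# Match a mis-cased "YouTube" first, then look back over the match:
#   (?<!<.{7})   rejects the 7-letter word when "<" sits just before it
#   (?<!</.{7})  rejects it when "</" sits just before it
YOUTUBE = re.compile(r"\b(?:Yout|you[Tt])ube\b(?<!<.{7})(?<!</.{7})")


def fix_youtube(text):
    return YOUTUBE.sub("YouTube", text)


if __name__ == "__main__":
    print(fix_youtube("There is a video on Youtube."))
    print(fix_youtube("<youtube>abc123</youtube>"))
```

The lookbehinds are only evaluated once the word itself has matched, which is the cost argument made above for trailing rather than leading checks.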
The error was reading "too many 's" specifically with the Blu-Ray regexp. AWB was refusing to activate Typo correction for me because of it. Stuart.Jamieson (talk) 18:02, 12 June 2011 (UTC)
Odd (esp. since there are no apostrophes in the change). I'm running AWB 5.3.0.0 SVN 7728 on IE 9.0.8112.16421, .NET 2.0.50727.5444, Windows 6.1. Can you tell me which versions if any are different in yours? I wonder if any of my hyphen changes need to be escaped under some environments. Thanks. -- JHunterJ (talk) 18:10, 12 June 2011 (UTC)
.NET is 2.0.50727.4211 and Windows is 6.0 but there was more to the message, I only realised by recreating it - you had missed a closing bracket. Stuart.Jamieson (talk) 18:37, 12 June 2011 (UTC)
AWB is correcting "inheritence" into "inhernheritance" for some reason--it happened to me on several articles in a row. Sample: [9] Thanks! -- Khazar (talk) 15:37, 13 July 2011 (UTC)
I would like to make a typo-correction suggestion that relates to capitalization, specifically of the CamelCase variety. I'm not sure how often these words occur outside of Nickelodeon- and cartoon-related articles, but it's not uncommon to see the character name "SpongeBob SquarePants" from the show by the same name incorrectly written without the capital "B" or "P" in his name. My suggestion would be to correct "Spongebob" to "SpongeBob" and "Squarepants" to "SquarePants" through AWB. --Sgt. R.K. Blue (talk) 23:02, 17 July 2011 (UTC)
I added a rule to fix his name when both the capital "B" and capital "P" are lowercase. Let's see how this works, and maybe others can expand the rule to fix other cases. Enjoy! GoingBatty (talk) 00:20, 18 July 2011 (UTC)
I hacked away at it for a while. If you take out the Italian inter-wiki line, then the Typo fixes work. Then you can put the inter-wiki line back. Go figure. Sometimes AWB is battier than you are. Christhe spelleryack04:41, 18 July 2011 (UTC)
Thanks for figuring that out, Chris! Guess when the instructions state "Typo fixing is automatically prevented on image names, templates, wikilink targets and quotes", that means interwiki links too. GoingBatty (talk) 04:46, 18 July 2011 (UTC)
Thanks also for all the SpongeBob-related fixes; I ran AWB for a short time a little while ago and didn't pick up any more, though I'm sure there are still many out there yet to be discovered. Meanwhile, another CamelCase correction also struck me that might be worth adding: I've sometimes seen the company DreamWorks incorrectly written without the capital "W" (Dreamworks). --Sgt. R.K. Blue (talk) 08:47, 18 July 2011 (UTC)
Done - I believe all the instances of improper capitalization for "SpongeBob" and "SquarePants" have now been fixed (via typo fixing or find/replace). I'll leave it to you to work with the other wikis to get their SpongeBob-related articles fixed. :-) GoingBatty (talk) 04:38, 20 July 2011 (UTC)
Regex for SI units
I see that the regex for SI units has:
([\d\.]+(?:\s| |-)?)foobar
The following appears to be identical in effect:
([\d\.](?:\s| |-)?)foobar
Furthermore, I think 7. foobar (note trailing decimal) are not worthy targets. Thus the following code might be tighter:
I think that [\d\.]+ is deliberate in these SI rules, so that the edit summary is more informative. I don't feel strongly about the trailing decimal; perhaps wait to see how many false positives get reported. -- John of Reading (talk) 15:02, 18 July 2011 (UTC)
The edit summary doesn't show the regex but can show the string that the regex matched. So if the article said "123.4 foo" and now says "123.4 Foo", then the edit summary will say that instead of just "4 foo -> 4 Foo". -- John of Reading (talk) 18:43, 18 July 2011 (UTC)
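The difference between the two forms is easy to demonstrate. In this Python sketch the `summary` helper and its "old -> new" format are hypothetical stand-ins for whatever AWB builds internally, and the separator group is simplified to whitespace or a hyphen.

```python
import re

# With "+" the group swallows the whole number; without it, only the
# final digit (or dot) before the unit is captured.
GREEDY = re.compile(r"([\d\.]+(?:\s|-)?)foo")
LAST_DIGIT = re.compile(r"([\d\.](?:\s|-)?)foo")


def summary(pattern, text, replacement=r"\1Foo"):
    # Hypothetical summary: the matched string and what it becomes.
    m = pattern.search(text)
    if m is None:
        return None
    return f"{m.group(0)} -> {m.expand(replacement)}"


if __name__ == "__main__":
    text = "It weighs 123.4 foo"
    print(summary(GREEDY, text))       # 123.4 foo -> 123.4 Foo
    print(summary(LAST_DIGIT, text))   # 4 foo -> 4 Foo
```

Both patterns produce the same article text after replacement; only the matched span, and therefore the summary, differs.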
My 2 cents: I think you all are on the right track as to improving the edit summary by showing as much as possible of the phrase that's being corrected. But, while the naked-decimal-point form "7. foobar" is generally deprecated (I haven't checked the style guide used here), I don't think that means we shouldn't fix the foobar part if that's needed. So we can get away with Lightmouse's last suggestion without requiring the decimal-point-covering digit:
I seem to remember some previous discussion about speeding up the code. It would be simpler just to check for the last digit. Can we make the edit summary repeat the unit name without the numeric value? Lightmouse (talk) 21:11, 20 July 2011 (UTC)
No. The strings "kg" or "km" will match the second regex but not the first. The rule mustn't match a correct use such as "3 kg". -- John of Reading (talk) 12:17, 27 July 2011 (UTC)
then it would find "3 km" and replace it with "3 km". This goes against the second bullet point here. Without looking at the source code I don't know for sure, but I imagine this could lead to edit summaries saying "typos fixed: 3 km -> 3 km". -- John of Reading (talk) 12:45, 27 July 2011 (UTC)
Ah. I tend not to worry about zero change matches in my regex. But I don't link the summary to the match. Even with that constraint, is the regex as simple as it can be? The upper case 'K' appears to be duplicated. How about:
I don't know where the problem is but AWB is trying to correct typos on the Latin Wikipedia. It's trying to correct Latin words to English. McLerristarr | Mclay110:40, 1 August 2011 (UTC)
Would removing disabled rules from the list speed up AWB page processing? If so, should they be archived somewhere so editors can refer to them before adding new rules? Thanks! GoingBatty (talk) 13:22, 31 July 2011 (UTC)
using lookbehind, but this is not supported on JavaScript and as such, are not working on WikEd. Would it be possible to change them to something equivalent without using lookbehind? Helder19:07, 1 August 2011 (UTC)
I don't think it's possible to do generally across all rules. It is possible to hack your way around these in some instances. For instance, doing tricks excluding character classes [^a] or writing more complete rules. In the vast majority of cases instances with a look-behind/ahead are trying to exclude some particular version of a word. I suppose you could ignore rules like that altogether. Maybe someone else knows of some code you could plug into those programs that would do the lookahead/behinds for you, but I can't think of a general, across the board fix, as those features do things that more primitive regex's can't. Shadowjams (talk) 16:48, 2 August 2011 (UTC)
Maybe I'm wildly misunderstanding how WikEd works, but is there a library or plugin for grease monkey that would allow this? The javascript only version I guess uses the native regex engine... it does actually support look ahead... so perhaps there is a way to convert look behinds into look aheads in an automated way.
Here's a question for computer science people... can all look aheads be made into look behinds or vice versa? If not then I'd suspect there's no elegant way to do it, but if yes... Shadowjams (talk) 16:59, 2 August 2011 (UTC)
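One common workaround for engines without lookbehind is to capture the preceding character (or the start of the text) and put it back in the replacement. The Python sketch below compares the two forms on a simple single-character exclusion. The function names are made up for illustration, and the trick is not a general conversion.

```python
import re


def with_lookbehind(text):
    # Fix "youtube" unless it is directly preceded by "<".
    return re.sub(r"(?<!<)youtube", "YouTube", text)


def without_lookbehind(text):
    # Emulation: consume the preceding character (or start of text)
    # in group 1 and write it back unchanged.
    return re.sub(r"(^|[^<])youtube", r"\1YouTube", text)


if __name__ == "__main__":
    sample = "<youtube> tag and a youtube link"
    print(with_lookbehind(sample))
    print(without_lookbehind(sample))
    # The emulation consumes a character, so back-to-back hits differ:
    print(with_lookbehind("youtubeyoutube"))     # YouTubeYouTube
    print(without_lookbehind("youtubeyoutube"))  # YouTubeyoutube
```

The last two lines show the main limitation: because the emulated form consumes the preceding character, two adjacent occurrences are not both fixed in one pass. That is one reason a mechanical, rule-by-rule conversion is unlikely to be exact.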
<Typo word="Broadly" find="\b(?:(Broad)yl|(broad)yl?)\b" replace="$1ly" />
is actually a few characters shorter.--BillFlis (talk) 16:57, 2 August 2011 (UTC)
The lookbehinds here are generally more efficient -- the number of characters in the pattern is not a good indicator of efficiency. Lookbehinds look for exceptions after a match has been found. Lookaheads look for exceptions at every character and then look for a match. -- JHunterJ (talk) 17:03, 2 August 2011 (UTC)
There was a long discussion about this a few months back and the conclusion was that, while not critical, the character class is quicker and should be preferred, all else being equal. It shouldn't be dogmatically enforced, but the processing speed is important. Shadowjams (talk) 03:12, 3 August 2011 (UTC)
The debug builds of AWB have a built in typo profiling function. Alternatively you can benchmark regular expressions in general using expresso or JRegexAnalyser or something similar of your choice. Rjwilmsi22:59, 3 August 2011 (UTC)
D-dropping
G-dropping has a long history in the English language, but d-dropping is relatively new, in my experience. Apparently as a result of young people spending more time at being entertained than at being educated, and the ability of a /d/ sound to disappear between an /s/ sound (or a /z/ sound or a /ʒ/ sound or a /ʃ/ sound) and a /t/ sound, I have been seeing the letter omitted from expressions like those listed below. This can happen when they include verbs in the past tense, but especially when they include past participles (as in is/are/was/were supposed to). Of the words which I have listed here, I found most at http://wordover.com/. It seems to me that some people actually do not know the correct spelling. Even when the past-tense verb or past participle is not followed directly by the word to, sometimes the d is dropped from a word that should have it.
supposed to, forced to, advanced to, convinced to, announced to, enticed to, induced to, introduced to, sourced to, reduced to, increased to, decreased to, traced to, ceased to, dispensed to, dispersed to, asked to, passed to (See Note 1)
used to ("was accustomed to; utilized to"), pleased to, advised to, authorized to (authorised to), disclosed to, exposed to, proposed to, opposed to, refused to, generalized to (generalised to), advertised to, paused to, poised to, espoused to, surprised to, televised to, closed to (see Note 2)
changed to, charged to, enlarged to, acknowledged to, judged to, engaged to, divulged to, diverged to, emerged to, managed to, outraged to, pledged to, surged to, urged to
attached to, dispatched to, hitched to, latched to, matched to, pitched to, preached to, reached to, stretched to, switched to, abashed to, banished to, crushed to, dashed to, demolished to, diminished to, embellished to, established to, leashed to, published to, vanished to, wished to
Note 1: The expression passed to ("moved to", transitive or intransitive) should not be confused with the expression past to ("onward to").
Note 2: The expression closed to ("shut to, unopened to; blockaded to") should not be confused with the expression close to ("near to").
This problem can occur also with /l/, /m/, /n/, and /r/: thrilled to, claimed to, inclined to, ordered to. I have not spent time in searching for more examples of these, but I am willing to do so on request.
It seems to me that a search limited to expressions with is/are/was/were would find only true positives with most of the expressions which I listed. Here is an example.
[is, are, was, were] (not) (very) (actually, always, ever, never, often, only, really, sometimes, truly, usually) suppose to
Exactly one of the auxiliary words is present. At most one expression from each set of parentheses is present.
I have not mastered the AWB code for condensing multiple variables into one line.
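The pattern described above (one auxiliary, then at most one word from each optional set, then "suppose to") condenses into a single regular expression. This is a minimal Python sketch of that idea rather than a tested AWB rule. The adverb list is copied from the post, and everything else is an assumption.

```python
import re

ADVERBS = "actually|always|ever|never|often|only|really|sometimes|truly|usually"

# Group 1: the auxiliary. Group 2: the optional " not", " very" and
# one adverb, each appearing at most once and in that order.
SUPPOSE_TO = re.compile(
    r"\b(is|are|was|were)"
    r"((?: not)?(?: very)?(?: (?:" + ADVERBS + r"))?)"
    r" suppose to\b"
)


def fix_suppose_to(text):
    # Restore the dropped "d" and leave the captured words as they were.
    return SUPPOSE_TO.sub(r"\1\2 supposed to", text)


if __name__ == "__main__":
    print(fix_suppose_to("He was suppose to leave at noon."))
    print(fix_suppose_to("They are not really suppose to know."))
    print(fix_suppose_to("I suppose to some extent that is true."))
```

Requiring the auxiliary is what keeps the third sentence untouched, since "suppose to" on its own can be legitimate. The same skeleton could be reused for the other verbs listed by swapping the final word.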
When I said "as in is/are/was/were supposed to" in my first post, I meant that AWB might search for these expressions with is/are/was/were, among other possibilities, but now I am referring to restricting the search to expressions with is/are/was/were.
I suggest we write up a Find&Replace regular expression, and that you run it on some articles to see how accurate it is in finding genuine grammatical errors. Rjwilmsi07:50, 24 August 2011 (UTC)
A search of wikipedia for "was please to" (with quotes) returned 2 hits (both errors). Searches for "is please to", "were please to", "am please to", and "are please to" returned no hits.--BillFlis (talk) 13:41, 24 August 2011 (UTC)
I created a new rule for "(Ex/Op/Pro)posed to", which might have a chance of catching something. If this pans out without problems, it can later be merged with the existing "Supposed to" rule.--BillFlis (talk) 13:54, 24 August 2011 (UTC)
Can we get rid of "well-known" --> "well known" please? The hyphenated version is perfectly acceptable in my dictionary (Chambers 9th ed.), and seems to be purely a matter of taste. I'm fed up of skipping it.—An optimist on the run!19:58, 2 September 2011 (UTC)
Can you provide an example of AWB attempting to make such a change where you thought it should be skipped? That might save us a lot of thrashing around. Christhe spelleryack00:31, 3 September 2011 (UTC)
WP:HYPHEN, subsection 3, point 5, approves of “well-known”.
I've temporarily disabled the rule while we discuss this topic. Does anyone have a source that states when to use "well-known" versus "well known"? GoingBatty (talk) 22:18, 2 September 2011 (UTC)
I just checked in the Oxford English Dictionary (the enormous 20-volume one). It's given as "well(-)known" denoting that hyphenated and unhyphenated forms are both acceptable. The list of example quotations from literature includes both forms. There is no reason for AWB to change it. — Hebrides (talk) 23:00, 2 September 2011 (UTC)
But WP:HYPHEN, sub-subsection 3, point 5 clearly spells out when the hyphen is needed, and implies that the hyphen is not used predicatively. It would not need to mention different treatments of "well" if either form is acceptable anywhere. Christhe spelleryack00:35, 3 September 2011 (UTC)
The only reason to change it is if "well-known" should be used in some contexts and "well known" should be used in other contexts. GoingBatty (talk) 23:10, 2 September 2011 (UTC)
It means change "best-known" or "well-known" to "best known" or "well known" if it's followed by "for", "as", "by", or "in", but not if it's preceded by "the". GoingBatty (talk) 21:07, 4 September 2011 (UTC)
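The behaviour described here can be sketched with one lookbehind and one lookahead. The Python below is a reconstruction from the description only, not the text of the actual Typos rule, and the sample sentences are made up.

```python
import re

# Unhyphenate "best-known"/"well-known" only when followed by
# for/as/by/in, and never when directly preceded by "the".
KNOWN = re.compile(
    r"(?<!\b[Tt]he )\b([Bb]est|[Ww]ell)-known(?= (?:for|as|by|in)\b)"
)


def fix_known(text):
    return KNOWN.sub(r"\1 known", text)


if __name__ == "__main__":
    print(fix_known("She is best-known for her novels."))
    print(fix_known("He is a well-known author."))
    print(fix_known("It is the best-known in the series."))
```

Only the first sentence changes. The attributive use before a noun and the use after "the" are both left hyphenated, which is the distinction the surrounding discussion is about.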
These links might be useful to editors who wish to evaluate this expression, and I suggest that they be added to Wikipedia:AutoWikiBrowser/Typos as external links for editors to use in evaluating other expressions.
It has been 9 days since I asked for an example of this rule making a change where it should not have, and none has been provided. I suggest restoring the rule. Christhe spelleryack22:39, 12 September 2011 (UTC)
If it is used as an adjective before a noun it is hyphenated, see this. So I think it is dangerous and unwise to automatically make that change unless AWB can tell that it is not before a noun. Bubba73You talkin' to me?17:21, 7 November 2011 (UTC)
Creedence
The current rule "Credence" tries to change every occurrence of "Creedence/creedence" that is not followed by "Clearwater". This includes "The Ultimate John Fogerty/Creedence Collection", "Creedence drummer ..." and "Creedence-ish". What this rule catches is 100% false positives. It's hard to think up many sentences that start with the common noun "credence", so changing this rule to change only the lower-case word would be a vast improvement. But there are no occurrences of "creedence" in lower case, so I wonder if the rule is badly needed. Am I missing something? Christhe spelleryack15:17, 17 September 2011 (UTC)
I updated the "Dissi-" rule so it does not try to change "disiciple", and manually fixed all instances of "disiciple". I'll leave it to others to change the "Disciple" rule to also fix "disiciple" (if needed). GoingBatty (talk) 02:45, 18 September 2011 (UTC)
"Advertize" is not preferred by Americans, or any group that I know of. It's an oddball spelling that seems to be tolerated to a fair degree, though most respectable dictionaries do not list it as an alternative to "advertise". I'd rather not see "advertize" used in formal writing, as in WP, but if there were a groundswell of opposition to this Typo rule, I'd be OK with its removal. Christhe spelleryack14:37, 11 September 2011 (UTC)
Funny, I've never thought about that... but I guess that's true. As an American I'll tell you I would never think to write it as advertize (although spell check doesn't seem to mind). I don't mind the rule either way. Shadowjams (talk) 02:54, 16 September 2011 (UTC)
The Typo rule "-ical" should not change "Musial" to "Musical", though the lower-case "musial" to "musical" would be fine. In any event, it should not change "Stan Musial". Christhe spelleryack17:55, 18 September 2011 (UTC)
Done by changing the "-ical" rule so it will change "musial" but not "Musial". (FYI, Stan isn't the only Musial mentioned on Wikipedia.) GoingBatty (talk) 21:42, 18 September 2011 (UTC)
I agree. A sampling showed probably 90% should be changed to "is" instead of "it is". However, this is examining what is left in WP after years of AWB users clicking "Skip" when the rule does not make an appropriate change, so we can't be sure what the ratio is for new edits. Complicating this is the fact that the rule also tries to change "it it" to "it is". Many times this is caused by a missing period, as in "Fred tried to fly it it got caught in a tree". Again, these would be expected to accumulate as AWB users click "Skip". Something should probably be done, but I think more discussion would be helpful before we change the rule. However, if someone wants to be bold, I won't fight it. Christhe spelleryack15:29, 28 September 2011 (UTC)
Done - Since it's been over a week since Chris' response with no further discussion on Mandarax's proposal, I was bold and made the change. GoingBatty (talk) 04:00, 6 October 2011 (UTC)
Not only are almost all of the "is is" I've just corrected supposed to be "is", but most of them are in the first sentence of the lede! Hope others will help fix these. GoingBatty (talk) 04:30, 6 October 2011 (UTC)
Done with these edits. The problem you were likely facing was that the previous regexp required two typos in the word: one before the "ra" and another one after it. I opted for a negative lookbehind to avoid the null op; the other choice would be to make it two separate rules. -- JHunterJ (talk) 19:23, 22 November 2011 (UTC)
"Well known" is hyphenated if it is used as an adjective before a noun, see this. So I think it is dangerous and unwise to automatically make that change unless AWB can tell that it is not before a noun. Bubba73You talkin' to me?17:21, 7 November 2011 (UTC)
Thanks for providing a source. I've disabled the rule pending further discussion. Shouldn't the hyphenation of "well-known" be a well-known rule? :-) GoingBatty (talk) 17:36, 7 November 2011 (UTC)
If it uses the rule only if followed by those words, then it should be OK. Yes, it should be a well-known rule. But is the rule well known? (I think those are correct uses.) Bubba73You talkin' to me?18:04, 7 November 2011 (UTC)
That's a crowd-sourced answer. Actually language is far more flexible, and not only do words refuse to take on the roles assigned by syntactic rules, but writers use language more flexibly. In this case, for example, Google Books shows that about 10-15% of the uses of "well known for" appear as "well-known for", including Harpers Magazine, and "Foo for Dummies". RichFarmbrough, 01:16, 12 December 2011 (UTC).
It is perfectly conventional in normal English writing to mis-use hyphens 10 to 20% of the time. But I think WP style is to try to get it right. Dicklyon (talk) 01:20, 12 December 2011 (UTC)
(with the $2$3 at the end) would serve to combine those two. But perhaps it's best left separate. I'll let someone else opine. -- JHunterJ (talk) 22:36, 11 December 2011 (UTC)
'Publisher', 'work', 'agency' parameters in 'cite' templates
These parameters are frequently wrongly used; for example, "publisher=''The Times''" should be "work=The Times". There are some further complications; the name of the work wrongly declared as 'publisher' is sometimes included in double apostrophes so that it shows italicised (as 'work' automatically does) and sometimes not; the work is sometimes wikilinked and sometimes not; the word 'The' is sometimes part of the name and sometimes not. And quite often organisations that are correctly defined as 'publisher' such as BBC News, will be artificially (and inappropriately) italicised. Also, agencies such as Reuters are sometimes described as 'publisher' where they should be 'agency'. I've developed a set of regexes that (mostly) fix these problems. If there's agreement that this would be a useful RETF task, I'll collect them up and post them here and hopefully people with more regex knowledge than me can simplify them and plug the loopholes. Colonies Chris (talk) 20:46, 23 December 2011 (UTC)
Hi Chris. While I agree with the need for such cleanup, please remember that the Usage section states "When used on AWB, typo-fixing is automatically prevented on image names, templates, wikilink targets and quotes." GoingBatty (talk) 23:02, 23 December 2011 (UTC)
We have 50 articles with the word assitant. I've manually fixed a few but I think it would be easier to put assitant - assistant into AWB. If someone agrees with me can they make the change?
Thanks. If it is already in AWB I'm surprised there were so many, what is the typical lag nowadays between an AWB fixable anomaly going into an article and it being fixed? ϢereSpielChequers15:53, 30 December 2011 (UTC)
The AWB is "fixing" spellings that don't need to be fixed. Kekasih (Indonesian/Malay word for lover) is turning into kekaish (which has no meaning anywhere) (examples: 1, 2, 3) and Qadarsih (Indonesian/Malay name) --> Qadarish (also no meaning) (examples: 1, 2, 3); the latter would be quite embarrassing if either Indra or Titi Qadarsih get their own articles. Could this be fixed please? I've already fixed the issues in the articles. Crisco 1492 (talk) 09:52, 7 January 2012 (UTC)
I get false positives on this rule a lot too, which I'll try to start listing here so they can be logged as exceptions. Today's is "Tung Tish-shia". Thanks! -- Khazar (talk) 15:35, 21 January 2012 (UTC)
Ngugi wa Thiong'o, Ngugi wa Thiongo, Ngugi Wa Thiong'o => Ngũgĩ wa Thiong'o
Because of the complicated (for Westerners) spelling of his name, almost all of our articles for Kenyan novelist Ngũgĩ wa Thiong'o had one of the above misspellings of his name. I've manually fixed the existing offenders, but would it be possible to autocorrect the above variations in the future, or is this too small a fix? It's only 50 or so instances of this error so far, but it was misspelled in more places than it was properly spelled. Thanks! -- Khazar (talk) 07:14, 15 January 2012 (UTC)
Would all instances of "Ngugi" qualify to be converted to "Ngũgĩ" or is there something distinct about it being followed by "wa"? Would it be possible to generalize the rule for Thiongo to Thiong'o too? Shadowjams (talk) 18:02, 22 January 2012 (UTC)
Honestly, I'm not sure; I happen to know a lot about Ngũgĩ from teaching one of his books, rather than an understanding of Gikuyu language. Probably safer to limit it to him for now. -- Khazar (talk) 18:07, 22 January 2012 (UTC)
Alright, I actually may ask in the language desk in a bit to see if we could generalize the rule more. For the time being, here's a regex that should work. I didn't insert it into the typo list but you can do so if you think it's widespread enough. Or you could use it to scan the database for other instances. I have a dump from a few months ago but you've probably fixed all of them I'd find in that.
A double hyphen is converted to an em dash (except between digits, when it becomes an en dash). However, if the word on either side of the hyphens is linked, the conversion is prevented. This means that where the dashes occur in pairs, often only one of them will be fixed.
An example is here,[11] where only the second "--" was autofixed in
included [[Eugene Ormandy]]--who later turned to conducting--and [[Eugene Lehner]].
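For anyone who wants to try the dash behaviour outside AWB, here is a minimal Python sketch of the conversion described above. It is not AWB's code: the function name `convert_double_hyphens` is made up for illustration, and the sketch deliberately ignores wikilinks, so both dashes in the Ormandy example get converted.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"


def convert_double_hyphens(text: str) -> str:
    """Hypothetical helper, not AWB's implementation.

    '--' between two digits becomes an en dash; any other '--'
    becomes an em dash.
    """
    text = re.sub(r"(?<=\d)--(?=\d)", EN_DASH, text)
    # Leave runs of three or more hyphens and HTML comment
    # markers (<!-- and -->) untouched.
    return re.sub(r"(?<![-!])--(?![->])", EM_DASH, text)


sample = ("included [[Eugene Ormandy]]--who later turned to "
          "conducting--and [[Eugene Lehner]].")
print(convert_double_hyphens(sample))
```

Because nothing here treats a neighbouring link as a reason to skip, the pair of dashes is handled consistently. Whether that is safe inside AWB's typo fixer is a separate question.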
Okay. This one should be more straightforward: the template {{ndash}} is not supposed to have a leading space. It includes a nbsp, and adding another messes up the spacing. Should be covered under typos? Deleting the space seems to be okay even when adjacent to a ref, as here. — kwami (talk) 01:26, 25 January 2012 (UTC)
WP:AWB/T#Usage states "When used on AWB, typo-fixing is automatically prevented on image names, templates, wikilink targets and quotes (including indented paragraphs)." Seems like a feature request would be the way to go on this one. Thanks! GoingBatty (talk) 03:13, 25 January 2012 (UTC)
"Populary" isn't a word... is it used correctly anywhere though? Because we could just fix that and leave out the known... although thinking about it all of the popularly .... phrases are going to include past participles right? Off the top of my head that seems right... anybody weigh in on that?
As for the regex: <Typo word="Popularly" find="\b(P|p)opular([ia]l)?y\b" replace="$1opularly" />
Yeah, my only concern is that it might be a typo for popularity as well as popularly. But let me give it a run later tonight or tomorrow and see what I get. Cheers, Khazar (talk) 23:47, 29 January 2012 (UTC)
Ran it and it looked good. Caught about twenty more "popularly"s, 1 "popularity", 3 "popular"s, but no articles where "populary" was in fact the correct spelling. Inserted the text above. -- Khazar (talk) 01:02, 30 January 2012 (UTC)
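A quick way to see what the "Popularly" rule does and does not touch is to port it to Python. This is only a sketch: the pattern is copied from the rule above, `$1` is written as `\1` because that is Python's backreference syntax, and `fix_popularly` is a name invented for the example.

```python
import re

# Pattern copied from the Typo rule; only the replacement syntax differs.
POPULARLY = re.compile(r"\b(P|p)opular([ia]l)?y\b")


def fix_popularly(text: str) -> str:
    """Apply the 'Popularly' rule to a string (illustrative only)."""
    return POPULARLY.sub(r"\1opularly", text)


for word in ["populary", "Popularily", "popularaly", "popularly", "popularity"]:
    print(word, "->", fix_popularly(word))
```

The correct spellings "popularly" and "popularity" pass through unchanged. The rule cannot tell when "populary" was meant to be "popularity" or "popular", which is the concern raised above, so those hits still need a human check.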
Analgous, Analagous => Analogous
How about modifying the rule for "Analogous" to also catch "Analagous" and "Analgous"? I'm having trouble deciphering what's already there for this one--I'm still trying to teach myself how to do this--and I don't want to muck it up. Thanks! -- Khazar (talk) 17:55, 1 February 2012 (UTC)
The rule as it is right now requires there to be either a double n or a double l in the first part of the word, and it allows for other errors, like the two you specify. The biggest thing we try to avoid is rules that will match correct spelling, and so these rules get ever more elaborate, and clever, in trying to ensure that anything that matches the rule is somehow wrong. That's what's going on with this rule.
Two options to incorporate your error. Either we can regroup the rule as it is, making it more complex but keeping it in one rule, or we break out a second rule. The second choice may seem simpler, but it may take longer to run 2, and it may also make long-term maintenance of the rule hard.
Here's my shot at changing the rule that's there now. Maybe someone can weigh in on which is faster (I don't have a good way to benchmark it). I'll insert this for now, but if there's something wrong about it let me know.
So my changes do one of two matches... either the word's got an extra n or l (first version), in which case we correct it, or it's fine but uses an "a" or misses the o, in which case we correct that too. But it's either or. Before it would not correct for double l or n if the o was correct. Again, please let me know if this breaks anything. I tested it some and it seemed fine though. Shadowjams (talk) 22:05, 1 February 2012 (UTC)
Thanks for tackling this one. I'm glad to see I wasn't the only one intimidated by it; I'm trying to teach myself how to write these rules on my own, but this clearly wasn't the place for me to start. Cheers Khazar (talk) 22:26, 1 February 2012 (UTC)
Can someone with regexp skills improve the rule named " ,"? If the text is "foo ,bar ,baz" then the rule changes it to "foo,bar,baz" which I think is worse. The correct fix would be "foo, bar, baz", but if that can't be managed then can the rule disable itself when there is no space after the comma?
Ideally I'd like this to be part of the general fixes, like all the other fiddly rules that shuffle spaces and punctuation, as this rule leads to clutter in the edit summary - I always tick "Add replacements to edit summary". But that's not an issue for this page. -- John of Reading (talk) 14:25, 4 February 2012 (UTC)
I've definitely encountered this problem, too, but my experience has been that for every "foo ,bar" I run into, I encounter 2-3 "foo , bar"s--in other words, that the rule autocorrects correctly more than it's wrong. That's just a subjective impression, though. Khazar (talk) 17:15, 4 February 2012 (UTC)
Nice work, Batty! I was just looking into it. BTW, I am the one who created this "fiddly" rule (I know John didn't mean any offense). I agree that it would be better in the general fixes. Should one of us make a request? Christhe spelleryack20:01, 4 February 2012 (UTC)
Agreed that a feature request would be appropriate. You may want to include more comma rules than this Typo rule fixes. (e.g. adding a space after a comma, except when there's a digit before & after the comma). Thanks! GoingBatty (talk) 20:40, 4 February 2012 (UTC)
Thank you, both, that was a very neat change to the rule. I'll update your feature request if no-one else does, but not tonight (UK time). -- John of Reading (talk) 21:45, 4 February 2012 (UTC)
Don't want the typo rule making a replacement that doesn't change how an article is displayed, so I updated the rule. Thanks! GoingBatty (talk) 04:22, 6 February 2012 (UTC)
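For reference, the behaviour asked for at the top of this thread ("foo ,bar ,baz" becoming "foo, bar, baz") can be sketched in Python as follows. This is not the rule that was actually installed; the function name and both patterns are assumptions made for illustration.

```python
import re


def fix_space_before_comma(text: str) -> str:
    """Hypothetical sketch of a ' ,' fix; not the installed Typo rule."""
    # "foo ,bar" -> "foo, bar": move the space to the other side of the
    # comma. A digit after the comma is left alone (e.g. "1 ,000").
    text = re.sub(r"(?<=\w) +,(?=[^\W\d])", ", ", text)
    # "foo , bar" -> "foo, bar": just drop the space before the comma.
    return re.sub(r"(?<=\w) +,(?= )", ",", text)


print(fix_space_before_comma("foo ,bar ,baz"))
print(fix_space_before_comma("foo , bar"))
```

Splitting the fix into two substitutions keeps each pattern readable. A single combined rule would probably be preferred on the typo list itself, where every extra rule costs run time.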
lack of spaces with punctuation
I don't know if anything much can be done, but I've noticed that a lot of articles written by Indian editors (generally on specifically Indian topics) lack spaces after punctuation, and before it in the case of parentheses. Accounting for all the exceptions may make it impractical to address, though. — kwami (talk) 04:09, 6 February 2012 (UTC)
It's too tricky to attack with AWB, I think, but this RegEx code works pretty well with wikEd (select it on "Gadget" page in your preferences), stepping through an article one (missing) space at a time:
The biggest hangup is URLs, which the code avoids to a degree. In wikEd you can click a button to hide the references, and that gets quite a few of the false positives out of the way. The other things to watch out for are dotted acronyms (e.g. S.P.E.C.T.R.E.) and "e.g."!!!, and unspaced initials (e.g. M.K.Gandhi), especially within links. If you find a paragraph or two that is really plain text and visually lacks these pitfalls, highlight the paragraph(s) and click the replace button to change 'em all at once. Have fun! Christhe spelleryack05:31, 6 February 2012 (UTC)
I don't know whether you were just busy, or if the length of that, or unfamiliarity with wikEd scared you off. Never mind, there is hope. I have F&R rules for AWB that do a pretty good job and produce few false positives.
I would like you and other AWB users to run this through a few hundred articles and give feedback here. I think it can be considered for inclusion as a WP/Typo rule, but leaving out the "(?<!inline,)" lookbehind, which should not be necessary for a Typo rule, since Typo rules leave templates alone. That exception is to prevent changing "display= inline,title" in Coord templates. You may look at two edits of Haripad that I made today, the first with the "mild" AWB F&R rules (fixed 32 periods and 15 commas), and the second with selective use of the rough-edged wikEd RegEx code I posted at the top of this discussion. This gives a pretty good feel of what each can accomplish. Christhe spelleryack17:21, 7 February 2012 (UTC)
Perhaps I should explain what the "mild" rules are trying to fix. The first fixes space-word-period-uppercaseword-space, and the second fixes space-word-comma-word-space (or left square bracket instead of the ending space). Christhe spelleryack17:31, 7 February 2012 (UTC)
I have similar rules in my AWB. I've noticed the problem as well. I agree though that it's too error prone for AWB to have built in. I used the Find and Replace and clean up as I go. There are instances though, chemistry formulas are what I think of offhand, where there shouldn't be a space after a comma, for instance. It's things like that that make it difficult for AWB's defaults. Shadowjams (talk) 21:55, 6 March 2012 (UTC)
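The two "mild" rules are only described in words in this thread, so the Python sketch below is a guess at their shape, not a copy of them. Both patterns and the name `add_missing_spaces` are assumptions based on the description "space-word-period-uppercaseword-space" and "space-word-comma-word-space (or left square bracket)".

```python
import re

# Assumed pattern: space, word, period, capitalised word, space.
MISSING_SPACE_AFTER_PERIOD = re.compile(r"(?<= )([A-Za-z]+)\.([A-Z][a-z]+)(?= )")
# Assumed pattern: space, word, comma, word, then a space or "[".
MISSING_SPACE_AFTER_COMMA = re.compile(r"(?<= )([A-Za-z]+),([A-Za-z]+)(?=[ \[])")


def add_missing_spaces(text: str) -> str:
    """Insert the missing space after a period or comma (illustrative)."""
    text = MISSING_SPACE_AFTER_PERIOD.sub(r"\1. \2", text)
    return MISSING_SPACE_AFTER_COMMA.sub(r"\1, \2", text)


print(add_missing_spaces("It is a small town.The river flows north."))
print(add_missing_spaces("They grow rice,wheat and maize."))
print(add_missing_spaces("a speech by M.K.Gandhi in 1930"))
```

Requiring a space on both sides and letters only is what keeps unspaced initials such as "M.K.Gandhi" and numbers such as "3,000" out of reach. The chemistry-formula case mentioned above would still need checking by hand.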
That was the only case of "ecspecialy" in all of Wikipedia, and it doesn't seem to warrant a change to the typo rules. It was probably the least of the problems that had beset that article, which is now much improved. Christhe spelleryack08:44, 23 February 2012 (UTC)
Violante
The rule word="-en(ce/t)" too often tries to change the given name "Violante" to "Violente". Perhaps someone who has experience with this rule could give it a good tweak. Christhe spelleryack18:50, 2 March 2012 (UTC)
I noticed that AWB autocorrected english => English and french => French in a document that I was just working on, but not swahili => Swahili. Perhaps that could be added as well? Glancing around the Swahili articles (culture, people, language), all appear to use a capital letter. Khazar2 (talk) 18:17, 14 April 2012 (UTC)
There's a soft hyphen U+00AD hiding between the "includ" and the "ing", which the software is treating as a word break. An obvious fix is to delete and retype the word, but I'll leave it for now in case anyone has a more general solution. -- John of Reading (talk) 07:13, 28 March 2012 (UTC)
That's very odd. I may run through a database dump and try to get rid of those soft hyphens if there are others lurking around. Shadowjams (talk) 16:21, 15 April 2012 (UTC)
Whoa, there are a lot of these. I looked through the WP:MoS pages and there doesn't seem to have ever been express guidance about this. I can imagine some areas where this would make sense, but it would seem that most of the time it's being introduced by someone copy-pasting from a word-processor.
I'm going to run through and see if any patterns emerge. I think in most cases though, where this occurs in the middle of a paragraph, it does nothing to help improve the formatting, but only makes AWB and other text searching less effective. Shadowjams (talk) 16:42, 15 April 2012 (UTC)
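To illustrate why a hidden soft hyphen defeats word-based matching, and how a dump scan might locate and strip them, here is a small Python sketch. The helper names are hypothetical, and whether stripping is appropriate in every context is exactly the open question above.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most renderings


def soft_hyphen_positions(text: str) -> list[int]:
    """Return the index of every soft hyphen in the text."""
    return [i for i, ch in enumerate(text) if ch == SOFT_HYPHEN]


def strip_soft_hyphens(text: str) -> str:
    """Remove all soft hyphens (illustrative; review before mass use)."""
    return text.replace(SOFT_HYPHEN, "")


word = "includ" + SOFT_HYPHEN + "ing"
print(re.findall(r"\w+", word))     # the regex sees two separate words
print(soft_hyphen_positions(word))  # where the hidden character sits
print(strip_soft_hyphens(word))     # the word as the reader sees it
```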
I've come across this at least a dozen times and not once was it warranted... it's always the Spanish word. FWIW I don't focus on Hispanospheric articles. I'd like to delete it from the list but am thinking I should discuss it first... has anyone else had this problem? PhnomPenciltalkcontribs20:14, 20 April 2012 (UTC)
While going through an article I found this word, "manoeuverability", which AWB changed to maneuverability, the American version. Then I get a message saying I changed it from the UK version to the American version... which is odd because it doesn't typically trigger on it. So I looked up the matter in MacMillan. [14] MacMillan states it should be 'manoeuvrable' which means 'manoeuvrability' not 'maneuverability'. The spelling version on the original article is wrong, American or British. A lot of articles seem to use 'manoeuVERAbility' rather than 'manoeuVRAbility', and given what I assume, could someone verify and update the typo list to that effect? ChrisGualtieri (talk) 06:56, 25 April 2012 (UTC)
My "Concise Oxford" confirms "manoeuvre", "manoeuvrability" as the British spellings and "maneuver" as the US spelling. I've also seen the entry for "manoeuverability" in Wikipedia:Lists of common misspellings/M, where it is listed with two possible fixes, "maneuverability [American], manoeuvrability". So I'm going to delete this rule from the automatic list and add "manoeuverable" to Wikipedia:Lists of common misspellings/M. This misspelling is too hard to fix automatically, as only a human editor can decide which fix is appropriate. -- John of Reading (talk) 10:15, 25 April 2012 (UTC)
Not sure why AWB keeps trying to go from Bicicleta to Bicycleta. Bicicleta seems to be valid, but I've hit on 10+ false positives so I was wondering what everyone else thinks about this change. As it is not English, I'd opt for its removal if in doubt. ChrisGualtieri (talk) 18:35, 26 April 2012 (UTC)
AWB incorrectly tried to change "audion" to "audition" in the context of the Oscillation article. diff. Turns out "audion" is actually a word. See this search. Maybe this is one of those cases that we just have to check that our fix is proper, but perhaps the regex can be adjusted somehow. Just wanted to point it out anyway. Jesse V. (talk) 19:40, 20 April 2012 (UTC)
This has come up several times as well. Do we capitalize the 'f' in french fries? Seems like AWB wants to mark it because of the nation. Possible link to 'French horn' as well? ChrisGualtieri (talk) 18:37, 26 April 2012 (UTC)
One common misspelling I come across is "price" for "prize". While obviously we can't get all of these with Regex, how about the following fixes to catch at least some of the more common?
Nobel Price => Nobel Prize (40ish results)
Pulitzer Price => Pulitzer Prize (20 or so)
Peace Price => Peace Prize (40 or so)
literary price => literary prize (6)
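The four replacements listed above could be tried out with something like the Python sketch below. The grouping into two patterns and the name `fix_prize` are choices made for this example, not a proposed entry for the typo list.

```python
import re

# The capitalised phrases share one pattern; "literary price" is lower case.
PRIZE_RULES = [
    (re.compile(r"\b(Nobel|Pulitzer|Peace) Price\b"), r"\1 Prize"),
    (re.compile(r"\bliterary price\b"), "literary prize"),
]


def fix_prize(text: str) -> str:
    """Apply the price -> prize replacements from the list above."""
    for pattern, replacement in PRIZE_RULES:
        text = pattern.sub(replacement, text)
    return text


print(fix_prize("She won the Nobel Peace Price and a literary price."))
print(fix_prize("The price of oil rose."))
```

Anchoring on the preceding word is what keeps ordinary uses of "price" out of reach.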
Is having a definition for 'twitter -> Twitter' possible? We currently have Facebook and Myspace, but I've noticed twitter doesn't flag. The word 'twitter' is comparatively rare to 'Twitter'. ChrisGualtieri (talk) 16:03, 30 April 2012 (UTC)
Willing to try "twitter that", but if there are false positives, we may need to have another rule like "(announced|posted)\s+(on|via)\s+(his|her|their)\s+twitter". GoingBatty (talk) 16:26, 2 May 2012 (UTC)
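As a rough illustration of the narrower fallback pattern quoted above, the Python sketch below capitalises "twitter" only in that context. The pattern is the one from the comment, with non-capturing groups and an outer capture added so the lead-in words can be kept. `capitalise_twitter` is a made-up name.

```python
import re

TWITTER_CONTEXT = re.compile(
    r"\b((?:announced|posted)\s+(?:on|via)\s+(?:his|her|their)\s+)twitter\b"
)


def capitalise_twitter(text: str) -> str:
    """Capitalise 'twitter' only after e.g. 'posted on her' (illustrative)."""
    return TWITTER_CONTEXT.sub(r"\1Twitter", text)


print(capitalise_twitter("She posted on her twitter account."))
print(capitalise_twitter("The birds twitter at dawn."))
```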
Could some kind regexpert add this to the 'University' line? It's so complex already that I'm wary of tampering with it. Colonies Chris (talk) 10:19, 7 May 2012 (UTC)
Another odd notice I just discovered. We have womens to women's, but not mens to men's. Kinda strange when viewing sports teams to have one correct itself and the other avoided. ChrisGualtieri (talk) 02:49, 1 May 2012 (UTC)
Ah. I figured something was preventing it. At the risk of not making another section for every issue I'll just rename this and move to my next point. Many articles have this -{{okina}}ie which triggers the 'ie to i.e.' Is it possible to create an exception to the okina matter? I've seen like a dozen of these and the use of the okina is always in words like, "Lāʻie" Or is this a matter of it existing but looking for the character for the okina? Again, I really do not understand the Typoscan database, otherwise I'd try to find my own answer. ChrisGualtieri (talk) 03:54, 1 May 2012 (UTC)
I just added a rule to only fix "mens" if it is a sports phrase: "mens basketball", "mens lacrosse", "mens sports", "mens team", "mens tennis" or "mens and womens". Hope this helps! GoingBatty (talk) 04:26, 1 May 2012 (UTC)
As for -{{okina}}ie, that's a tough one. I noticed it does not try to fix "Lāʻie", so if you create a find and replace rule to change {{okina}} to ʻ (and don't check the After fixes box), that should fix the problem. Maybe someone else can come up with a more elegant solution. GoingBatty (talk) 04:42, 1 May 2012 (UTC)
Some years ago I remember changing an Okina and getting a sound drubbing for doing so. I'll dig in my archives. RichFarmbrough, 19:11, 10 May 2012 (UTC).
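On the "mens" point earlier in this thread: a rule limited to the sports phrases listed there might look roughly like this in Python. The rule that was actually added is not quoted on this page, so both patterns and the name `fix_mens` are assumptions for illustration.

```python
import re

# Sports words taken from the list of phrases given in the thread.
SPORTS = r"(?:basketball|lacrosse|sports|team|tennis)"


def fix_mens(text: str) -> str:
    """Hypothetical sketch: add the apostrophe only in sports phrases."""
    text = re.sub(r"\b([Mm])ens and ([Ww])omens\b", r"\1en's and \2omen's", text)
    return re.sub(rf"\b([Mm])ens(?=\s+{SPORTS}\b)", r"\1en's", text)


print(fix_mens("the mens basketball team"))
print(fix_mens("mens and womens tennis"))
print(fix_mens("the doctrine of mens rea"))
```

Restricting the match to a following sports word is what keeps the Latin "mens rea" from being "corrected".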
"having being"
I have fixed quite a few of these (usually to "having been"), but there are still hundreds left. There are so many articles already afflicted that just fixing them all now probably won't solve the problem, and new cases will be introduced often enough to warrant a Typo rule, but I don't want to do that if there will be many false positives. I have been using a Find & replace rule "\bhaving being (\w+)ed\b" --> "having been $1ed", and it works rather well, because most of the errors to be fixed are of the form "having being relegated" or "having being diagnosed". But this misses "having being sold", "having being built" and "having being previously named". I already fixed a lot of similar cases: "have being" and "had being", but there are definitely false positives for those, such as "who existed when nothing else had being, and who created that which exists after she had come into being". Anyone have any ideas for Typo rules to fix most of these without an unacceptably high rate of false positives? Christhe spelleryack03:11, 19 May 2012 (UTC)
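The Find & replace rule quoted above can be checked against the examples in the same comment with a short Python port. Only the replacement syntax changes (`$1` becomes `\1`), and `fix_having_being` is a name invented for the sketch.

```python
import re

HAVING_BEING = re.compile(r"\bhaving being (\w+)ed\b")


def fix_having_being(text: str) -> str:
    """Port of the quoted F&R rule (illustrative only)."""
    return HAVING_BEING.sub(r"having been \1ed", text)


for phrase in [
    "having being relegated",
    "having being diagnosed",
    "having being sold",
    "having being previously named",
]:
    print(phrase, "->", fix_having_being(phrase))
```

The first two phrases are fixed and the last two are left alone, which matches the gaps described: irregular participles and an intervening adverb both fall outside the "-ed" pattern.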
Hongkonger
typos fixed: Hongkongers → Hong Kongers using AWB
"Hongkonger" is not a typo; just a variant spelling. It should not be autocorrected to Hong Konger. (The accepted terminology would be "Hong Kong people" anyway.) Deryck C.20:11, 26 May 2012 (UTC)
How long should rules be in the New additions section before they get moved into the appropriate section? Thanks! GoingBatty (talk) 15:02, 28 May 2012 (UTC)
merged together => merged
There's about 500 "merged together"s in Wikipedia. This is a commonly listed redundancy (e.g., [16]) that appears to be avoided by major media organizations. (The New York Times used the word "merged" 1000 times this year, with zero uses of the phrase "merged together", for example.) I'd suggest adding a fix replacing merged together with merged; I gave this a trial run and made about 100 replacements without finding any false positives. Khazar2 (talk) 20:17, 28 May 2012 (UTC)
The documentation says "When used on AWB, typo-fixing is automatically prevented on image names, templates, wikilink targets and quotes (including indented paragraphs). If a typo rule matches a wikilink target, this rule will be ignored on the whole page." I understand the reasoning behind this, but it means that many corrections I add to the list frequently have no effect, and I have to duplicate them in my personal AWB find-and-replace list to get them to work for me. Could we have an AWB option to apply all RETF changes everywhere except within image/file names, URLs, and quotations? Colonies Chris (talk) 14:48, 22 May 2012 (UTC)
GB, I've been reading and rereading your contribution without understanding it, but now I see your 'source code', I get it - I've fixed the category link in your comment so that the text appears. Now it makes sense to me. Good solution. To take it to the next level, that option could automatically apply the same correction to all other (non-link) occurrences of that string in the article, and then there would be no need to manually add this type of misspelling to the typo list at all. And there could be similar options for Category:Redirects from other capitalisations and Category:Redirects from titles without diacritics. This would be an excellent way of making use of all the work that other editors have put into creating and categorising redirects. Colonies Chris (talk) 21:41, 23 May 2012 (UTC)
'It it' is currently being auto-corrected to 'it is', but there are cases where it can be correct. For example, 'If you set fire to it it burns with a blue flame' - yes, you might put a comma in (inserting handy x-ref for lots of relevant info on comma styles), but a common journalistic style is to omit commas, so it isn't strictly wrong to leave it out. If the word after the second 'it' is a verb or adverb then you've probably encountered such a case.
For sentences where the comma is definitely incorrect, rather than merely debatable, see cases such as 'Is it him? Is it her? Is it it?', or 'Was it it that did that?'
As I'm not a Wikipedia regular I'm flagging the mistake here rather than diving into the AWB typo regex list and just changing it. --82.69.54.207 (talk) 11:24, 30 May 2012 (UTC)
This strikes me as not worth changing the Typo rule (at least not for the few cases where "it it" is correct). I would just change each case to {{Not a typo|it it}}. There are also cases where "it it" should be changed to "if it" or "it" or "is it", but I think we should also leave the Typo rule alone so that at least it finds these cases, and then we can fix them in the edit box. Christhe spelleryack12:50, 30 May 2012 (UTC)
Derrick Caracter
Derrick Caracter's last name goes to 'Character' automatically. I've hit this twice so far, but I was wondering if there was some way to deal with it. It's minor, but I just put an invis tag next to one. ChrisGualtieri (talk) 18:36, 4 June 2012 (UTC)
Every article that mentions him should now contain a link to his article, which will prevent the incorrect typo fixing. GoingBatty (talk) 01:42, 5 June 2012 (UTC)
Widley/Widely and Enbil/Embil
Both of these definitions alter names, and quite frequently in the case of Enbil. I have found only false positives with both of these definitions, and I do not make that statement lightly as I've done several thousand of them now. What purpose does this definition serve? Widley might just be past its proper use; I never see it as a typo for widely. Nor do I understand what is going on with Enbil, as in the names Jorge Oteiza (Enbil) and Plácido Domingo (Embil). ChrisGualtieri (talk) 20:10, 5 June 2012 (UTC)
In the case of "Widley/Widely", what you see (many false positives) may be an indicator that the rule is working. As editors fix the true hits and skip the false positives, the false positives will begin to predominate. I found 3 articles that need fixing, and about 30 false positives. Why not wrap a "Not a typo" template around the false positives, and then they will trouble you no more? As for Enbil, I have no idea what it's good for; I have observed it a couple of times and just hit the "Skip" button. Christhe spelleryack00:28, 6 June 2012 (UTC)
Another thought — the false positives tend to be capitalized, with true hits tending to be in lower case, so maybe the rule can be adjusted. Christhe spelleryack00:32, 6 June 2012 (UTC)
True... after doing several thousand typos one's sense of 'why is this rule in existence?' starts to peak. I had a string of ones which referred to the place (naturally capitalized) rather than 'Widley used' or some such typo. I'm not particularly ready to go altering every other case, so much so as my question is answered and I understand why such awkward rules come up. A lot of the Spanish and Italian typo words get me simply because I do not know if they are incorrect, leading to many skips. As more typos get removed from Wikipedia the more of these false positives there will be. I will finish the backlog off by the end of June, I assure you of that. ChrisGualtieri (talk) 04:40, 6 June 2012 (UTC)
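The capitalisation idea floated above (true hits tend to be lower case, false positives capitalised) amounts to a case-sensitive rule. A minimal Python sketch, with `fix_widley` as a made-up name and no claim that this is how the existing rule is written:

```python
import re


def fix_widley(text: str) -> str:
    """Fix only lower-case 'widley'; leave the place name 'Widley' alone."""
    return re.sub(r"\bwidley\b", "widely", text)


print(fix_widley("It is widley used in Hampshire."))
print(fix_widley("Widley is a village in Hampshire."))
```

The trade-off is that a sentence-initial typo such as "Widley used..." would no longer be caught.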
'HongKong' vs 'Hong Kong' Egyptian Bank.
HSBC Bank Egypt has the text which cites the original name as 'Hongkong Egyptian Bank'. AWB wants to change it to 'Hong Kong', while I think this may be a typo. A first look at the HSBC site clearly states otherwise. [17] "With HSBC Bank Egypt was established in 1982 as Hong Kong Egyptian Bank." So I think this is a case of a misunderstanding (thought I was wrong at first), but I'd still like a comment from others on this. ChrisGualtieri (talk) 00:23, 6 June 2012 (UTC)
Guess the official website is wrong. Seems to affect Hongkong Bank of Canada, Hongkong Bank of Australia, Hongkong Bank Malaysia Berhad as well. Can anything be done to make sure it doesn't ping these? Or is it best left with invisi-tags? ChrisGualtieri (talk) 03:02, 6 June 2012 (UTC)
André de Toth or Andre de Toth
I've had an editor make a post on my talk page about correcting the name 'Andre de Toth' to 'André de Toth' in accordance with the definition and from what I know of the director, the name and biographical article is André de Toth and 'Andre de Toth' is the redirect. Of his movies, he is credited more often as 'André de Toth' than 'Andre de Toth', 16 to 4 it seems. Of the 4, 3 of them go directly to the redirect and only Play Dirty has 'Andre de Toth', but links directly to André de Toth. Hate to bring up the IMDB argument, but all of those movies list as André de Toth as does our André de Toth biography. The editor insists that where he was billed as 'Andre de Toth' it should stay 'Andre de Toth' even though it is not a pseudonym and the lack of the accent appears to be a technical matter, as the director's name is André de Toth. Just wanting some input on this. ChrisGualtieri (talk) 17:42, 6 June 2012 (UTC)
If it were up to me, and I owned Wikipedia, I think I would show the accent in all cases. But there are other editors involved, and (since IMDB seems to keep track of which movies use an accent on his name) it seems like a valid method to use an accent or no accent in any movie article, according to the way each movie gives credit. The complaint that "you can't go around" doing your thing rubs me the wrong way; it implies that you don't have enough to do, or are not making thoughtful edits. In any case, since there is some resistance, you might want to avoid going at the accents with hammer and tongs in this case. Christhe spelleryack01:30, 7 June 2012 (UTC)
I will try and discuss the matter with him to see if we can't work something out. Stephen King represents a classic case with the Richard Bachman pseudonym, but Andre de Toth seems to be a localization matter that went with the movies. Though he did have variants such as 'Andre De Toth | Andre DeToth | André DeToth | Tóth Endre | Endre Tóth | Andre de Toth' it seems. I'd opt for temporary removal of the spelling rule at this time until we sort this matter out. Seems like much of the work is also absent on Wikipedia as well. I'm not going to force my preference on anyone, after all I care more about fixing errors. ChrisGualtieri (talk) 18:47, 7 June 2012 (UTC)
I commented, I do not believe italics should be used in accordance with the MLA style, but I have seen it listed on some publishers. I think it may just be a hold out of the 'Latin requires italics' even though it is very common and MLA does not require italics on common Latin phrases. Might as well put all the 'etc.' in italics then. ChrisGualtieri (talk) 04:35, 6 June 2012 (UTC)
Thanks for commenting there and here, Chris. I've also created a new topic on the MOS talk page to request that the MOS documents be made consistent. I've also taken the conservative approach and disabled the typo rule. Thanks! GoingBatty (talk) 22:46, 6 June 2012 (UTC)
Until this 2010 edit, the rule didn't change the italicisation, and only concerned itself with getting the dots right - no dot after "et", and a dot after "al". I've re-instated that version of the rule. -- John of Reading (talk) 20:25, 9 June 2012 (UTC)
e.g. and i.e.
Wikipedia:Manual of Style/Abbreviations#Latin abbreviations states: The initialisms "e.g." and "i.e." should not be followed by a comma. However, the rules for "e.g." and "i.e." preserve the comma (and other punctuation). Could someone please update these rules so they follow MOS? (For example, both "e.g," and "e.g.," are replaced with "e.g."). Thanks! GoingBatty (talk) 00:05, 7 June 2012 (UTC)
Once updated properly I'll ask for a new dump to be sorted so I can begin wiping those out. I must admit I've let a few of these go by due to being unfamiliar with the rule, and the natural pause in speech did seem to require a comma: "He made an itemized list for his shopping trip which included eggs, milk, bread, etc., but he forgot the pasta anyways." A lame example I just created, though I wonder if it would be valid. The whole 'etc., but' matter is probably rare, yet still questionable for me. If it is awkward yet valid, could the rule be modified to include it? ChrisGualtieri (talk) 18:51, 7 June 2012 (UTC)
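A possible shape for the requested change, where both "e.g," and "e.g.," end up as "e.g.", is sketched below in Python. The pattern and the name `drop_comma_after_abbrev` are assumptions; the existing "e.g." and "i.e." rules are more elaborate and are not reproduced here.

```python
import re

# Matches "e.g" or "i.e", an optional final dot, then the unwanted comma.
LATIN_ABBREV_COMMA = re.compile(r"\b(e\.g|i\.e)\.?,")


def drop_comma_after_abbrev(text: str) -> str:
    """Replace 'e.g,' / 'e.g.,' with 'e.g.' (same for 'i.e.'); illustrative."""
    return LATIN_ABBREV_COMMA.sub(r"\1.", text)


print(drop_comma_after_abbrev("fruit (e.g., apples) and nuts (i.e, seeds)"))
print(drop_comma_after_abbrev("eggs, milk, bread, etc., but no pasta"))
```

The sketch leaves "etc.," alone, so the "etc., but" case raised above is unaffected either way.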
Looks like AWB is fooled by the unbalanced italic marker in the preceding picture caption. AWB doesn't fix any typos in italicised text, because they might be quotations. If you run AWB on the current version of User:John of Reading/Sandbox, it will fix the two "Higlights" near the top of the page. If you remove the two apostrophes after "Old man and cow" in the AWB edit window and then press F5 to have the text re-processed, the typo fixer fixes the third "Higlights". So this one is a bug. -- John of Reading (talk) 15:49, 11 June 2012 (UTC)
(e/c) I don't know; there seems to be plenty of evidence for both spellings. Just to save anyone else the trouble of looking, the rule was added in December 2007. -- John of Reading (talk) 07:51, 12 June 2012 (UTC)
Newcastle-on-Tyne => Newcastle upon Tyne. Seems to be the old name for the city and doubt that it makes sense to change the spelling. Regards, SunCreator(talk)13:16, 11 June 2012 (UTC)
There would be a great many false positives if every "roman" got replaced, but can't the rule handle it the way many rules do with a suffix, e.g. the "Embarrass" rule with "Embarras River"? Regards, SunCreator(talk)12:31, 16 June 2012 (UTC)
After my trial database scan, I'm now working through a list of 250 articles that have "roman" followed by (amphitheatres?|aqueducts?|archaeology|[Bb]asilica|calendar|candles?|city|coins?|emperor|empire|farmhouses?|forts?|roads?|towns?|villas?). I think this is too messy to be a typo rule. A naive rule would damage articles about French literature (Nouveau roman) and typography (roman type). -- John of Reading (talk) 14:17, 17 June 2012 (UTC)
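For comparison with the naive "replace every roman" idea, here is a Python sketch of a rule restricted to the nouns from the database scan above. The noun list is copied from that comment; the lookahead form and the name `capitalise_roman` are assumptions for illustration, not a proposed rule.

```python
import re

ROMAN_NOUNS = (
    r"(?:amphitheatres?|aqueducts?|archaeology|[Bb]asilica|calendar|candles?"
    r"|city|coins?|emperor|empire|farmhouses?|forts?|roads?|towns?|villas?)"
)
# Capitalise "roman" only when one of the listed nouns follows.
ROMAN = re.compile(rf"\broman(?=\s+{ROMAN_NOUNS}\b)")


def capitalise_roman(text: str) -> str:
    """Illustrative noun-restricted rule; not on the typo list."""
    return ROMAN.sub("Roman", text)


print(capitalise_roman("a roman villa near the roman road"))
print(capitalise_roman("the Nouveau roman was set in roman type"))
```

"Nouveau roman" and "roman type" are left alone because neither is followed by a listed noun. The cost is a long, hand-maintained alternation, which is the messiness objected to above.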
There is a 'False' button if you enable it in the View->Display false positive button. Does anyone use it? It says "Add to false positive file"; is that a local or central file? Regards, SunCreator(talk)19:03, 17 June 2012 (UTC)
The full OED says the same, and covers all forms of English. We could decide that the typo rule is too strict, but as it stands it meets the OED guidance. Rjwilmsi14:28, 18 June 2012 (UTC)
Every American knows what "advertise" means, and some dictionaries of American spellings only recognize that spelling, omitting "advertize" completely, not even acknowledging it as an alternate spelling. I think Wikipedia should use the form that will not offend or surprise users of British English or American English. WP:SPELLING says "In both British English and American English, many words have variant spellings, but most of the time one variant is preferred over the other." Since "advertise" is the preferred spelling in both cases, it's clearly the preferred spelling. Of course, there's no need to flame over an editor who inserts "advertize", but it's OK to quietly change it to the preferred spelling. This will also tend to take care of an article where both variants appear. Christhe spelleryack15:14, 18 June 2012 (UTC)
I'm disinclined to combine them, because "best-received" is generally not a problem, and the prepositions listed in the lookahead are somewhat different. But if you want to merge them, go ahead. Christhe spelleryack14:20, 7 May 2012 (UTC)
Isn't there about 5000 of the 'well-received' typos? I'd love to help but for some reason my AWB cannot connect to the tool server to load up new lists. Even for the CHECKWIKI project. I'm stuck doing assessments until it comes back online. Something with a new version maybe? So no typoscan for me. Otherwise I'd try to do some of them. ChrisGualtieri (talk) 14:26, 7 May 2012 (UTC)
Hi ChrisG - sorry you're having trouble loading new lists - I'm not having that problem with SVN 8062. If you post on the AWB talk page to see if anyone has a solution for you, I hope it's well-received. (Sorry - couldn't resist.)
The "well-received" rule won't be picked up on Wikipedia:WikiProject TypoScan unless there's another typo on the page or until a new database dump is processed.
I'm scanning the March dump for "well received"; there seem to be many thousands. Maybe I'll do some as my swan song. RichFarmbrough, 07:47, 8 May 2012 (UTC).
Why does the well received not rule want a space (or a full stop?) following? What is different about commas and semi-colons? Regards, SunCreator(talk)21:51, 24 June 2012 (UTC)
If you are asking whether the rule could be expanded to fix "well-received" followed by a semicolon, the answer is that it could, but I imagine that there are very few such cases. If you can provide a couple of dozen such cases, I will expand the rule. If you are asking whether the rule could be expanded to fix "well-received" followed by a comma, the answer is that I can imagine false-positive cases, such as "Schulmklopfer's first effort was a well-received, completely sold-out play." I'm not anxious to go that far. Christhe spelleryack22:50, 24 June 2012 (UTC)
The rule is working very well. There is nothing wrong with "well-received" when it precedes the noun that is being modified; this is the case in all three examples. The hyphen is not needed when "well received" is used predicatively, or when an intensifier is used, as in "He wrote several very well received books". Christhe spelleryack03:51, 25 June 2012 (UTC)
If I've understood this correctly, the first half of the complicated bit tries to match <<s'>>, <<s's>>, <<s''>> or <<s's'>>, and the second half tries to match <<;s>>, <<;s'>>, <<s>> or <<s'>>. But if you run it on Food security or on User:John of Reading/Sandbox, you'll see that it actually changes <<Womens'>> to <<Women's'>>. I think the problem is the behaviour of the \b character after an apostrophe.
At least for my test cases, this is a possible fix:
Is this a safe fix? Can anyone find a neater one? (And is it a good idea for a regexp to match a pair of consecutive apostrophes?) -- John of Reading (talk) 09:07, 22 June 2012 (UTC)
Fixed with this edit. Further improvements are probably possible, in particular as you noted with the match possibilities for double apostrophes. I accounted for them in my addition; I'll try to account for them in the existing match as well. Soon. -- JHunterJ (talk) 23:57, 24 June 2012 (UTC)
In most cases, the rule works. In "Pope Benedict XVI is the leader of the Catholic church", the last word needs to be capitalized. There is no parallel for "the Protestant church", because there is no equivalent organization. When I see AWB trying to change "the town's Catholic church was built in 1894", I just nix that change and proceed. You could wrap a "Not a typo" template around "church" to prevent further capitalization attempts, but that's probably heavy-handed. Christhe spelleryack21:29, 22 June 2012 (UTC)
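The `\b` behaviour described in this thread can be reproduced with a much simplified pattern. The sketch below is an assumption for illustration only: `naive` and `safer` are hypothetical patterns, not the actual AWB rule and not the fix that was applied. It shows that `\b` cannot match between an apostrophe and a following space (both are non-word characters), so the regex backtracks, leaves the trailing apostrophe outside the match, and produces `Women's'`.

```python
import re

# Hypothetical, simplified patterns; NOT the real AWB rule or the posted fix.
# naive: optional trailing apostrophe followed by a word boundary.
naive = re.compile(r"\b([Ww]omen)s'?\b")
# safer: same idea, but ends with "not followed by a word character"
# instead of \b, so the trailing apostrophe can be consumed.
safer = re.compile(r"\b([Ww]omen)s'?(?!\w)")


def fix(pattern, text):
    """Replace the matched form with the singular possessive."""
    return pattern.sub(r"\1's", text)


print(fix(naive, "Womens' rights"))  # the stray apostrophe survives
print(fix(safer, "Womens' rights"))  # the apostrophe is consumed
```

Whether a lookahead of this kind is safe against the full rule set would still need checking against real articles.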
According to the search box, you just fixed the only example of "georaphical" in the whole of Wikipedia. So it's not worth adjusting the typo rules to fix it. -- John of Reading (talk) 13:30, 11 June 2012 (UTC)
I've had a look through the archives and can't find any guidance on this. Since the list is already so large, I wouldn't like to see it expanded to cover rare typos. 25, maybe? Opinions, anyone? -- John of Reading (talk) 16:20, 11 June 2012 (UTC)
I think such a high number is inappropriate. It seems to be saying only fix common typos and leave the others. Is there such a downside to adding more rules? Regards, SunCreator(talk)15:32, 15 June 2012 (UTC)
Yes, because each new rule slows down the processing of each page. If a typo does not appear on many pages, it is probably simpler just to fix them. To do this with AWB, use "Wiki search (text)" and a "Find & Replace" rule - or just fix it by hand in the edit box. -- John of Reading (talk) 15:47, 15 June 2012 (UTC)
Is there a list of such words that are typos but rejected from AWB rules that we can go through to correct in the way you describe? Regards, SunCreator(talk)20:52, 15 June 2012 (UTC)
Common misspellings can be listed at WP:LCM; some of those are covered by AWB rules and some are not. I'm not aware of any place to list "uncommon misspellings". I just keep a list of any I find in a file on my computer, and every few weeks I take a break from my other projects and fix those typos instead. -- John of Reading (talk) 08:08, 16 June 2012 (UTC)
Not sure if this discussion is stale... and I'm undecided about if we need to cut out a lot of the typo fixes. I do find the searching to be slower than is ideal, but then again I'm running it on a slow computer. There's a danger in declaring which rules are rare because a lot of rules are getting fixed because they're in the list. Only if we had detailed stats about which rules hit the most could we really know which ones are rare.
And even if they are rare, they are often ones that people don't notice and correct on their own. So at the very minimum, we should put the "deleted" rules into another list, such as a secondary AWB list. That way someone could run through a database dump in batch every so often and correct these orphan typos. Shadowjams (talk) 22:55, 15 July 2012 (UTC)
This thread was about adding a new rule, not deleting an existing rule. But I agree 100% that we shouldn't delete any existing rules without collecting proper statistics on how many times each rule is used. That could be a feature request, perhaps? -- John of Reading (talk) 14:57, 16 July 2012 (UTC)
game winning goal => game-winning goal (at least 300-400 occurrences)
walkoff => walk-off (at least 50; this appears to fix "walk-off" in the sense of a 9th inning baseball win as well as its occasional use for striking workers)
game winning home => game-winning home (40-50)
game winning hit => game-winning hit (30)
I've given each of these substitutions a test run and didn't see any significant false positives. Thanks as always for your efforts Khazar2 (talk) 21:46, 21 July 2012 (UTC)
I reverted the "Overdevelopment" rule to its former state that does not treat "under-development". There are a number of articles that use "under-development" attributively, such as Grupo Alexander Bain. It's an ugly construct, and I would rather see "an under-development campus" changed to "a campus that is under development", but we can't change "an under-development campus" to "an underdevelopment campus", which has a different, and pejorative, meaning. Christhe spelleryack12:53, 22 July 2012 (UTC)
Quran states: "The Quran...also transliterated Qur'an, Koran, Al-Coran, Coran, Kuran, and Al-Qur'an, is the central religious text of Islam". Since there are apparently several acceptable spellings, I wouldn't want the rule to change "Quran" to "Qur'an" or vice versa. Which replacements does the typo rule make that you are concerned about? Thanks! GoingBatty (talk) 01:29, 23 July 2012 (UTC)
It does seem that there are many ways to write it. That specific one seems okay from some websites, the Qu'ran has more than half a dozen 'okay' ways to write it. Koran, Coran, Quran, Qur'an, Qur’ān and al-Qur’ān are some of the most popular ones. Though it seems to be due to a shift in political correctness and accuracy of the religious text for transcription. The evolution of it is still ongoing. ChrisGualtieri (talk) 14:58, 23 July 2012 (UTC)
To give a rough estimate of the number of article pages with an AWB typo, I took a sample of 1000 mainspace articles (using random in AWB). AWB reported that 17 had typos after a pre-parse mode scan. After checking manually, two contained false positives and were dismissed; the remaining 15 were saved (although 8 were cosmetic issues). 15 in 1000 scaled up for the 3,975,490 articles on Wikipedia is 59,632 typo pages to go. Regards, SunCreator(talk)14:20, 17 June 2012 (UTC)
I sampled another thousand with 20 found, of which 2 were false positives. So 18 in 1000 is 71,559. Will try and check again in a month's time. Regards, SunCreator(talk)16:44, 17 June 2012 (UTC)
TypoScan lists at least 80,000 left to go (depending on my own activities), but it should be clear that the first pass has a user error ratio 3x higher than what it should be on the 'skips' so I believe the actual number is 135,000 to 150,000 that WILL be hit upon by the rules contained herein. Also since we are not running 100% detection of typos with the rules, the actual number of typos on articles could be much higher. ChrisGualtieri (talk) 15:10, 24 June 2012 (UTC)
Sampled another two thousand. 42 were typos, 7 were false positives, leaving 35 (btw 12 were white space typos). So a typo rate of 35 in 2000 for 4,011,244 articles works out as 70187. Regards, SunCreator(talk)23:21, 26 July 2012 (UTC)
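The scaling used in these samples is simple proportion: confirmed typo pages divided by sample size, multiplied by the article count. A minimal sketch follows, using the figures quoted in the first two samples. The function names are hypothetical, and the rough 95% interval (a normal approximation to the binomial) is an addition for illustration that was not part of the original discussion; it only indicates how wide the uncertainty is for samples of this size.

```python
import math


def estimate_typo_pages(confirmed, sample_size, total_articles):
    """Scale the confirmed typo rate in a random sample up to the whole wiki."""
    return round(confirmed / sample_size * total_articles)


def rough_interval(confirmed, sample_size, total_articles, z=1.96):
    """Approximate 95% interval using the normal approximation (an assumption)."""
    p = confirmed / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    low = max(0.0, p - half_width) * total_articles
    high = (p + half_width) * total_articles
    return round(low), round(high)


print(estimate_typo_pages(15, 1000, 3_975_490))  # first sample
print(estimate_typo_pages(18, 1000, 3_975_490))  # second sample
print(rough_interval(15, 1000, 3_975_490))
```

With only 15 to 35 confirmed hits per sample, the interval is wide, so differences between successive samples should be read with caution.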
Bill isn't the only Morrisette with a Wikipedia article. I've taken the conservative approach and changed the rule so it will only fix misspellings of Alanis Morissette. As always, other ideas are appreciated. GoingBatty (talk) 01:45, 23 July 2012 (UTC)
Several dictionaries list "Guerilla" as an alternate spelling. Macmillan and oxforddictionaries.com are a couple. I don't like the single-"r" spelling at all, but there it is. Sorry, I'm going to remove the rule. Christhe spelleryack14:52, 25 July 2012 (UTC)
BTW, you might feel better after seeing that someone once tried to go the other way with this (see talk Archive 1). I also commented in talk Archive 3 that research on each article is needed before choosing "r" or "rr". Christhe spelleryack16:48, 25 July 2012 (UTC)
Umayyad entry
Not sure why this is happening, but the entry seems to be going for any loose 'd' and attempting to change it to Umayyad. Even with the shortening for 'd.' for 'died' in articles like Abdullah Al-Refai. I am not disabling it yet, but I've had 6+ false positives in the last 10 minutes. ChrisGualtieri (talk) 12:28, 26 July 2012 (UTC)
Thank you! I was wondering why it was doing that, I've only corrected 2000 typos and had it come up so many times. I don't fully understand the rules and how they operate. I hate to say it, but it had a good detection on contractions like 'they'd' and 'she'd' which bug me. ChrisGualtieri (talk) 16:13, 26 July 2012 (UTC)
However, in testing, I found less than 20 pages in a wikitext search for "servey", "serveyed", and "serveying". Several of those turned out to be false positives, matching on people whose names were actually "Servey".
Might this rule be too risky to add to the RETF list? Maybe it would be better to have it match only forms with a suffix?
If the false positives are people called "Servey" then amending the rule to avoid those starting with an uppercase 'S' would likely improve the rule considerably. Regards, SunCreator(talk)09:27, 29 July 2012 (UTC)
This seems to be a case where there are too few hits to justify a new Typo rule. There should be at least a few dozen errors, with very few false positives, before adding a new rule. Christhe spelleryack12:55, 29 July 2012 (UTC)
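The suggestion to skip forms starting with an uppercase 'S' could look roughly like the sketch below. The pattern is a hypothetical illustration, not a proposed RETF entry, and it is written with Python's `re` module rather than the .NET engine AWB uses. Matching only lower-case `serve...` leaves the surname "Servey" alone, at the cost of missing a genuine typo at the start of a sentence.

```python
import re

# Hypothetical rule: lower-case only, so the surname "Servey" is not touched.
SERVEY = re.compile(r"\bserve(y(?:ed|ing|s)?)\b")


def fix_servey(text):
    """Rewrite servey/serveyed/serveying/serveys to the survey- forms."""
    return SERVEY.sub(r"surve\1", text)


print(fix_servey("The land was serveyed in 1850."))
print(fix_servey("John Servey served as mayor."))
```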
Extra spaces left by one or more rules that move punctuation to before <ref> tags
Sometimes, a period is moved from after a closing </ref> to before the starting <ref>, but instead of being simply moved its old location is filled with a space, resulting in two spaces between sentences. Not the end of the world, but certainly unnecessary.
This also happens at the end of a line sometimes, which seems to bypass the rules that trim trailing spaces. (Obviously that's related to running the rules in a particular order.) Tuvok[T@lk/Improve]08:44, 29 July 2012 (UTC)
Thanks, GoingBatty. Apparently I've taken your username to heart in advance of meeting you, thanks to the complexity of AWB. I'll take this elsewhere, and thanks again for correcting my heading. Cheers, Tuvok[T@lk/Improve]04:37, 30 July 2012 (UTC)
.i.e. rule for Irish websites
<Typo word="i.e." find="\bi(?:\.?e|e\.)(['\s,:;\)&-])(?<!\.ie.|'ie')" replace="i.e.$1" /><!--don't generalize to capital Ie; avoid matching website.ie; avoid matching 'ie' used as syllable -->
This rule was changed to avoid Irish .ie domains. I just noticed it's not working, and still changes .ie. to .i.e. for example on Irish poetry. Can someone with Regex wizardry take a look at correcting the issue? Regards, SunCreator(talk)22:36, 29 July 2012 (UTC)
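One possible reading of the failure: the lookbehind `(?<!\.ie.|'ie')` inspects only the four characters before the end of the match, so it blocks `.ie ` but not the five-character case `.ie. ` where the domain ends a sentence. The sketch below reproduces that with Python's `re` module (an assumption, since AWB uses the .NET engine) and tries one hypothetical adjustment: an extra fixed-width lookbehind for the trailing-period case. `CURRENT` is the pattern quoted above; `ADJUSTED` is an untested suggestion, not an agreed fix.

```python
import re

# The rule as quoted above (replacement "$1" becomes r"\1" in Python).
CURRENT = re.compile(r"\bi(?:\.?e|e\.)(['\s,:;\)&-])(?<!\.ie.|'ie')")

# Hypothetical adjustment: separate fixed-width lookbehinds, adding one
# that also covers ".ie." followed by a single character.
ADJUSTED = re.compile(
    r"\bi(?:\.?e|e\.)(['\s,:;\)&-])(?<!\.ie.)(?<!\.ie\..)(?<!'ie')"
)


def apply(rule, text):
    return rule.sub(r"i.e.\1", text)


for sample in ["See rte.ie. Next", "See rte.ie and more", "ie, the first"]:
    print(repr(apply(CURRENT, sample)), repr(apply(ADJUSTED, sample)))
```

In .NET the same idea could probably be written as a single variable-length lookbehind, but that would need testing in AWB's regex tester before use.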
Etc. => etc.
Only in the exception article Etc. could this start a sentence, so couldn't it be made into lowercase? Like the "i.e." rule it would seem appropriate to use only lowercase "etc." Regards, SunCreator(talk)21:39, 29 July 2012 (UTC)
There are several uppercase examples at the disambiguation page ETC, such as Etc... (a Czech rock band), Etc. (the b-sides and rarities album of the influential punk band Jawbreaker), and Etc. (a bonus disc accompanying the Pet Shop Boys' 2009 release Yes.) GoingBatty (talk) 03:22, 30 July 2012 (UTC)
Thanks. Those possibilities seem to cover a small number of topics that could be individually resolved with {{Not a typo}}. I'm encouraged to think this could be a workable rule. Regards, SunCreator(talk)16:02, 30 July 2012 (UTC)
More French loanwords
I see there are typo rules for some French loanwords. Should we also add rules for bête noire, bourrée, château(x?), passé, and séance? (Potential rules for château and séance should not include capital letters - see their disambiguation pages.) Thanks! GoingBatty (talk) 03:17, 31 July 2012 (UTC)
Not for "chateau" or "seance". "Chateau" is the English word, which allows "château" as an alternate spelling (in some dictionaries). Same goes for "seance/séance". The Château page has been roughly handled by a group of editors suffering from fairly bad cases of hyperforeignism. This is the English Wikipedia, and the standard for spellings is a good English dictionary. Christhe spelleryack04:26, 31 July 2012 (UTC)
I went to a bookshop and checked out some dictionaries but erroneously somehow thought the word to check was anonymous. Darn! Regards, SunCreator(talk)15:19, 1 August 2012 (UTC)
I fixed it (and "white-collar professionals"). "Upper-middle-class individuals", because "upper" modifies "middle-class", not "individuals". Christhe spelleryack00:38, 22 July 2012 (UTC)
I think it's waiting for someone to decide that such a change is reasonable, doable and worthwhile; the rule is already somewhat clunky. Maybe some other editor will comment; it's only been a day since the issue came up. Christhe spelleryack00:29, 23 July 2012 (UTC)
Good question. I wasn't aware of that rule and had to manually correct this. So I guess the answer is the existing rule doesn't cover it. But I'm not sure why. Regards, SunCreator(talk)03:57, 30 July 2012 (UTC)
My guess was there were unbalanced quotation marks in the article causing AWB to skip that section of the article, but I didn't see that. GoingBatty (talk) 04:14, 30 July 2012 (UTC)
No, as it corrected the word after; allready => already. See the previous edit. It appears the problem is with the rule. I will test it later. Regards, SunCreator(talk)04:45, 30 July 2012 (UTC)
The text is in User:John of Reading/Sandbox. A typo rule is disabled if it matches any wikilink in the article. By experiment, I find that this test is fooled by "links" to the File namespace. So, because the article contains [[File:First 3 egyptian pilots.jpg|thumb|upright|left|First three Egyptian pilots]], the "Egypt" rule is turned off. I'll log a bug. -- John of Reading (talk) 07:04, 30 July 2012 (UTC)
Rjwilmsi (talk·contribs) is happy to make the change if we can agree that it will do more good than harm. But, on reflection, I think it will be very difficult to work out if this change would be an improvement. Using the current code, some typos are not getting fixed - but it took a sharp-eyed AWB user to notice one of them and raise it here. Using the proposed new code, these typos would be fixed - but there would probably be some new false positives. I have no idea whether the extra fixes would outnumber the extra false positives, and it would take a serious amount of work to find out. -- John of Reading (talk) 05:43, 1 August 2012 (UTC)
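The behaviour described in this thread can be modelled in a few lines. This is a sketch of the described logic only, not AWB's actual code (which is written in C#): `rule_enabled`, `LINK_TARGET`, the `skip_file_links` flag and the stand-in `egypt` pattern are all hypothetical names introduced for illustration.

```python
import re

# Captures the target part of [[target|label]] or [[target]].
LINK_TARGET = re.compile(r"\[\[([^\]|]+)")


def rule_enabled(rule, wikitext, skip_file_links=False):
    """Return False if the rule matches any wikilink target on the page."""
    for target in LINK_TARGET.findall(wikitext):
        if skip_file_links and target.lower().startswith(("file:", "image:")):
            continue  # proposed change: ignore File-namespace "links"
        if rule.search(target):
            return False
    return True


egypt = re.compile(r"\begypt")  # hypothetical stand-in for the "Egypt" rule
page = ("[[File:First 3 egyptian pilots.jpg|thumb|upright|left|"
        "First three Egyptian pilots]] The egyptian pilots trained abroad.")

print(rule_enabled(egypt, page))                        # rule switched off
print(rule_enabled(egypt, page, skip_file_links=True))  # rule stays on
```

As noted in the discussion, ignoring File links would let more typos through to be fixed but could also introduce new false positives, and the sketch says nothing about which effect is larger.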
There are less than a dozen articles with "stuent", and even fewer cases of the other misspellings. I would say that this is so rare as to be slightly below the threshold for adding a Typo rule. Please read the section above, "georaphical => geographical", for other ways to deal with rare misspellings. Christhe spelleryack03:23, 1 August 2012 (UTC)
I suspect only the capitalisation of 'panjab' would meet the previously discussed 24 or 25 occurrence level. It just goes to highlight that the majority of typos are low volume and therefore the current AWB typo strategy misses them. Regards, SunCreator(talk)14:14, 1 August 2012 (UTC)
There is a District rule that converts 'Distict' => 'District' also, but it seems in practice the Distinct rule gets it first. Regards, SunCreator(talk)14:04, 3 August 2012 (UTC)
Plus "and a way of life long gone", "ended her life long before they reached her", "a mode of life long since defunct" and "of a life long-lived on one side". Regards, SunCreator(talk)02:30, 4 August 2012 (UTC)
No problem; if you find a misspelling that occurs in a couple of dozen articles or more, let us know. With that many to chew on, we'll try to give AWB a rip at them. Christhe spelleryack19:13, 5 August 2012 (UTC)
I'm fairly new to typo correction with AWB. In my testing of regex additions/changes, I find that AWB skips a substantial portion of the typos that would match because they're 1) in references, 2) in text indented with a colon, 3) seemingly many other areas. None of this is well documented. I don't quite understand this: we're expected to review changes anyway, so why have so many areas ignored? Here's an example: my target "origional" did not hit here [18], but an unrelated typo hit (I had edit summary trouble here, ignore that). So I manually and temporarily removed the indentation ":" that I presumed was blocking the typo fix, within AWB in that edit. Then, parsing the article again, AWB corrected the typo I wanted [19], so it was the colon causing the problem (and I manually replaced the colon). It kneecaps the project to have so many textual areas excluded from correction. I wouldn't mention it if I hadn't had about 40% or more of target typos ignored by AWB so far. Riggr Mortis (talk) 02:48, 5 August 2012 (UTC)
Wikipedia:AutoWikiBrowser/Typos#Usage states "When used on AWB, typo-fixing is automatically prevented on image names, templates, wikilink targets and quotes (including indented paragraphs). If a typo rule matches a wikilink target, this rule will be ignored on the whole page." GoingBatty (talk) 03:42, 5 August 2012 (UTC)
I've seen that; I said well documented. "Indented paragraphs": there are many ways to do that. So "Joe's Journal of Psychaitry" doesn't get corrected because it has an asterisk in front of it: pointless. Templates: the template name itself (obviously), or its parameters too? In any case, the substantive point remains. Riggr Mortis (talk) 03:54, 5 August 2012 (UTC)
I think you're taking me rather literally; but in fact, AWB is ignoring no less than seven instances of "psychatric", which relates to a regex I added the other day. Try it. An article with a bullet point and "psychatric" is List of oldest buildings and structures in Toronto (it's also contained with a link, but not the URL, so who cares—all regular text is susceptible to typos, regardless of what wikicode it's wrapped in.) Riggr Mortis (talk) 05:07, 5 August 2012 (UTC)
I think it's good that AWB does not fix "psychatric" in the reference in Manpreet Singh, since the source actually uses "Psychatric". That's an example why AWB is conservative in its corrections, and does not make changes to the other six articles where "psychatric" is in a reference or external link. GoingBatty (talk) 22:45, 5 August 2012 (UTC)
I don't agree to be honest unless people are using AWB sloppily. I won't make such a change unless I could validate it somehow. Either way, Chris's solution of setting level of exclusion is the way to go. Regards, SunCreator(talk)22:53, 5 August 2012 (UTC)
(edit conflict) I agree with Riggr. It would seem appropriate to work towards having less content ignored. It's not ignored in wikEd and I imagine in the future that will be the common editing method. It might take some reading to find out the reasons behind these exclusions in the past, but I'm open to having more content covered even if that means more cleaning up in terms of image renaming, marking more {{not a typo}} etc. Regards, SunCreator(talk)04:37, 5 August 2012 (UTC)
I often put a Typo rule into my Find & Replace rules and then search for that error and run them all to ground. But it might be a better move to add options in AWB to allow Typo fixes in indented paragraphs, Wikilink targets, etc. This would let each AWB user choose his or her own comfort level with how many hits will need to be skipped, how much extra examination will be needed, and how much risk they want to take. Christhe spelleryack12:50, 5 August 2012 (UTC)
The number of typos Regex can get is also very limited. So it won't get them all, or even half of all typos on a page in which they may hide. Aside from loading every page with a built in checker (instead of AWB) we will continue to miss many simply by using AWB loaded with Regex. ChrisGualtieri (talk) 05:39, 6 August 2012 (UTC)
So you'd prefer that we not maximize the value of all the work that's been done here over years, because perfection can't be obtained? Not a strong argument in any context, really. Riggr Mortis (talk) 23:43, 6 August 2012 (UTC)
I'd agree with you if it wasn't for the fact that I've corrected more than 40,000 articles worth of typos with Typoscan. I'm in the boat of 'Regex is good', but I cannot bypass the sheer force of a modern spellchecker that offers options but retains a 97-99% detection rate or higher. Regex is limited for many reasons, but its limitations cover important typos. ChrisGualtieri (talk) 00:34, 7 August 2012 (UTC)
How common are errors involving redundant units of currency, such as "$10 dollars" and "£10 pounds"? Additional units and their symbols are mentioned in the article "Currency sign".
—Wavelength (talk) 22:07, 6 August 2012 (UTC)
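One way to gauge how common these are would be to search article text for a currency symbol, an amount, and the matching unit word. The sketch below is a hypothetical search helper, not an existing AWB rule; the symbol-to-unit table (including the euro) and the name `find_redundant_units` are assumptions for illustration. It deliberately ignores forms such as "$10 million dollars".

```python
import re

# Symbol, optional space, amount, whitespace, unit word.
REDUNDANT = re.compile(
    r"([$£€])\s?(\d[\d,]*(?:\.\d+)?)\s+(dollars?|pounds?|euros?)\b",
    re.IGNORECASE,
)
# Assumed mapping of symbols to the unit word that would be redundant.
UNIT_FOR = {"$": "dollar", "£": "pound", "€": "euro"}


def find_redundant_units(text):
    """Return matches where the unit word repeats the currency symbol."""
    hits = []
    for m in REDUNDANT.finditer(text):
        symbol = m.group(1)
        unit = m.group(3).lower().rstrip("s")
        if UNIT_FOR[symbol] == unit:
            hits.append(m.group(0))
    return hits


print(find_redundant_units("It cost $10 dollars and £10 pounds, not $5."))
```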
née
It seems that there are up to seven ways that people spell their own name when it contains a variation of "nee", and Regex wants to change every single one of them to née. It accounts for maybe 1/5 of the "typos" that Regex picks up in my filtered searches. Is there any way we could change, or even better, eliminate this rule? hajatvrc @ 20:02, 4 August 2012 (UTC)
I was responding to Hajatvrc's statement saying "I've never seen it make a correct change with this rule.". This edit and this edit are two more correct changes. I hope Hajatvrc can provide examples of false positives, per your request. GoingBatty (talk) 20:53, 4 August 2012 (UTC)
They look okay to me. Are you saying the née change is questionable as it may not be her maiden family name? I'm somewhat confused at what the issue is. Regards, SunCreator(talk)21:16, 4 August 2012 (UTC)
I feel like there was one category of people from a certain ethnicity where nearly every woman had that as their actual name, but I'm trying to remember which one it was! hajatvrc @ 21:19, 4 August 2012 (UTC)
Based on exact-phrase Google searches, there appear to be countless women who spell it "neè" and countless women who spell it "née". I had never encountered the former until I started using TypoScan a few days ago. The problem is, I can't find a reputable source that says née is or is not the only way to spell it. Do you know of one? hajatvrc @ 22:07, 4 August 2012 (UTC)
Google News shows no English language result for "neè". "neè" is not in my Collins dictionary or online on the Oxford dictionary. Tell me what you are looking at in Google? All I see is social media and Facebook typos. Regards, SunCreator(talk)22:31, 4 August 2012 (UTC)
<Typo word="Off-" find="\b([Oo])f(?:|ff)(er(?:ed|ings?)|ice(?:r?|holder)s?|icia(l(?:s?|ly|dom|ism)|te[ds]?|ting))\b" replace="$1ff$2" />
Many rules try to avoid 'oficial' because of common foreign language usage.
The above rule does change it although the comment implies otherwise. Can we amend this so oficial is left unchanged. Regards, SunCreator(talk)14:29, 5 August 2012 (UTC)
Please do so! That and differencia or whatever it is. Same with whatever changes Enpippi to Empippi, anything which sets En to Em. These rules constantly hit upon articles with foreign languages, the chances of finding an actual correction seems very low. ChrisGualtieri (talk) 05:42, 6 August 2012 (UTC)
Also, foreign language texts should be flagged with appropriate {{lang}} templates. Then all of the typo rules will ignore the text. -- JHunterJ (talk) 12:52, 6 August 2012 (UTC)
So what language is "Interlingue" or "Sillaba votz es literals" or "La Diferencia"? Sometimes you only get a word, and Wikipedia articles deal with everything, including the most unusual ancient languages. Labelling text is not only time-consuming to research but, if incorrect, misleading to those that later edit the article. So unless it is obvious, I use {{Not a typo}}. Regards, SunCreator(talk)22:11, 6 August 2012 (UTC)
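For the "Off-" rule quoted earlier in this thread, one way to leave "oficial" untouched would be a negative lookahead at the start of the pattern. The sketch below is a hypothetical adjustment tested only with Python's `re` module (AWB uses .NET, where the same syntax should behave alike); `ORIGINAL` reuses the pattern as quoted and `ADJUSTED` adds the exclusion. It is not an agreed change to the rule.

```python
import re

BODY = (r"([Oo])f(?:|ff)(er(?:ed|ings?)|ice(?:r?|holder)s?"
        r"|icia(l(?:s?|ly|dom|ism)|te[ds]?|ting))\b")

ORIGINAL = re.compile(r"\b" + BODY)
# Hypothetical adjustment: skip the exact word "oficial"/"Oficial".
ADJUSTED = re.compile(r"\b(?![Oo]ficial\b)" + BODY)


def apply(rule, text):
    return rule.sub(r"\1ff\2", text)


for sample in ["Diario oficial", "the oficer", "an offficial visit"]:
    print(repr(apply(ORIGINAL, sample)), repr(apply(ADJUSTED, sample)))
```

The trade-off is that a genuine English typo "oficial" would no longer be fixed, which matches the behaviour the other rules are said to aim for.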
Current rule:<Typo word="Département(al)" find="\b([Dd])epartement(ale?)?\b" replace="$1épartement$2" />
I don't understand this rule which changes Departement => Département(the French word for department), but why not go with Departement => Department the English spelling. On the English Wikipedia even Departments of France has the spelling departments. Regards, SunCreator(talk)13:10, 29 July 2012 (UTC)
See false positive here. Maybe the rule can be made more specific, i.e. to change to the French spelling only if preceded by le or des or followed by au or des. Regards, SunCreator(talk)13:22, 29 July 2012 (UTC)
I've already done the ones which specifically mention the french variant and ignore all others for a great many pages, the false positives vastly outnumber the real ones. ChrisGualtieri (talk) 14:40, 29 July 2012 (UTC)
I disagree with that. The correct term for a French department is département. Department and département are not the same. So when referring to the specific département, as in the French département of Côtes-d'Armor I would expect to use the correct term. Seems to be a matter that was previously dealt with back in 2006 and never again. Why use an English word when the French term is there? ChrisGualtieri (talk) 03:34, 1 August 2012 (UTC)
In one way that is correct, because it depends on context. But the rule currently has no context and thus blindly recommends changing every departement typo to the French when the English may be correct. It's the same as the 'distict' typo that could be either 'distinct' or 'district'. Regards, Sun Creator(talk)10:38, 9 August 2012 (UTC)
Qaran → Qur'an
I'm concerned about AWB changing Qaran → Qur'an in edits like [24] and [25]. Qaran clearly is used in these cases as a placename, and searches indicate that such a place exists (see, for example, here). People are using AWB to turn such usage into nonsense. — Hebrides (talk) 12:55, 8 August 2012 (UTC)
Thanks. Also, how do I search for all instances where AWB has changed Qaran → Qur'an so that I can decide whether to change them back? This is vital. — Hebrides (talk) 13:06, 8 August 2012 (UTC)
Not sure, that's difficult. Maybe get a database dump (or someone who has one) from before the rule was added (Feb 28, 2010), find articles with the 'Qaran' spelling, and check they are still okay? Regards, Sun Creator(talk)13:28, 8 August 2012 (UTC)
I wish there was a way, I'll ask around about searching edit summaries. Because an edit summary search tool would bring this one up with the way AWB works, it won't catch 100% if the typo changes are numerous, but I bet it would grab a majority. ChrisGualtieri (talk) 13:43, 8 August 2012 (UTC)
For the new full stop rule, please report any false positives here. I've run it through several thousand of the most difficult articles, domain stuff mainly, but it's conceivable that it has a blind spot; I just don't know where to look. So any reports of false positives would be useful, even one would be great. Regards, Sun Creator(talk)13:25, 8 August 2012 (UTC)
Question. I assume it is meant to fix errors such as this, "public.Among" -> "public. Among" in Kairos Future, right? It is not adding the space to this and other articles. I haven't taken it on a test drive in the 'India section' of Wikipedia where such sentences have higher than normal errors and lack of spacing. ChrisGualtieri (talk) 14:59, 8 August 2012 (UTC)
Still continues. The only reason it hits the page with Regex is because of an actual typo from before, but it is not catching the spacing matter. ChrisGualtieri (talk) 15:14, 8 August 2012 (UTC)
I've disabled this. It's a great rule but many computer articles have valid 'Somevarible.Somefunction' or 'Somesoftware.Someproduct' used in them. I don't feel that adding {{not a typo}} to many articles is productive at this point. Regards, Sun Creator(talk)16:58, 9 August 2012 (UTC)
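The actual full stop rule is not quoted in this thread, so the pattern below is purely a hypothetical stand-in. It illustrates the trade-off described: a rule that inserts a space in "public.Among" will also hit dotted identifiers in computing articles, such as the illustrative string "string.Format".

```python
import re

# Hypothetical stand-in for the disabled rule: a lower-case word, a full
# stop, then a capitalised word with no space in between.
MISSING_SPACE = re.compile(r"\b([a-z]{2,})\.([A-Z][a-z]+)")


def add_space(text):
    return MISSING_SPACE.sub(r"\1. \2", text)


print(add_space("open to the public.Among its clients"))  # intended fix
print(add_space("call string.Format with two arguments"))  # false positive
```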
Rule tuning
Before fine tuning the existing rules I'd like to establish the purpose clearly and ideally get consensus on the general intent of the rules.
Degree of precision
At one end you can have blunt rules with many false positives or you can have precise rules which deal with specific variations of a word that have yet to occur.
Some options on this spectrum may be:
1. Basic word, anything goes, no consideration of variants
2. Check the most common related forms
3. Check variants in several dictionaries including related forms
4. Check variants in several dictionaries including related forms, ignoring stuff not in the wild
5. Check variants in several dictionaries including related forms and related forms of related forms etc
6. Check variants in several dictionaries including related forms and related forms of related forms etc, ignoring stuff not in the wild
7. No false positive is acceptable, disable any rule that produces any false positives
Most rules today appear to be a 2, occasionally some are 4. It's also to be noted that precision is related to length of root letters. I'd like to see rules become more precise, ideally a 6. Regards, Sun Creator(talk)15:07, 9 August 2012 (UTC)
Exceptions
How much should a rule deal with exceptions?
A rule should:
Ignore exceptions
Handle the most obvious exceptions
Handle common exceptions found or reported
Handle common exceptions occurring in Wikipedia
Handle common exceptions occurring on the internet
Handle reoccurring exceptions in Wikipedia
Handle reoccurring exceptions on the internet
Handle all exceptions in the wild (probably technically impossible)
URL options
Regardless of a rule you could add a begin and end part to deal with avoiding website URLs and domain names, but it would result in a longer rule and an occasional miss of a typo. Is this a desired option?
Splitting up existing rules
In some cases splitting a rule into two would result in more precision, especially if a rule doesn't deal with a single typo. If precision is the aim, is it okay to split a rule?
Multiple possibilities
Many typos have multiple possibilities. 'distict' could be corrected to 'district' or 'distinct' or simply ignored.
Maybe in the future a disambiguation option like a spell checker could be available but for now we have a more limited choice.
Many of our current false positives are as a result of a rule picking the incorrect choice out of multiple possibilities.
Should the purpose be to correct with multiple rules, correct to the most likely word with only one rule or leave it alone entirely?
Documentation
In order to tune a rule you first have to work out what you want it to correct and what to avoid and, once a rule is created, to know its pitfalls. It would seem appropriate to leave separate documentation showing the typos fixed along with false-positive information, so that others could check or adjust a rule at a later time. Would individual /Typos/Rulename pages for each rule be welcomed?
(Aside) And if the general fixes added "[[AWB/GF|general fixes]]" if and only if the general fixes did anything, I wouldn't have to pick one of my two edits summaries before saving each edit. -- John of Reading (talk) 20:19, 1 August 2012 (UTC)
Why is AWB changing "C# code" to "C#code"? I haven't tried any tests, but several other programming languages also end with # and might be caught by the same unfortunate rule. – Hebrides (talk) 10:26, 10 August 2012 (UTC)
I think that's because AWB has logic to remove the space after # for the external links sections. Maybe the code needs to be refined a little. Kumioko (talk) 11:08, 10 August 2012 (UTC)
Sorry, I was just AWBing through 500 new articles and when I spotted it wanted to change [[C# code]] to [[C#code]] I just clicked Skip for that article. So I'm sorry I have no idea which of the 500 it was. A few articles later I decided I'd better flag up this problem here. I don't have AWB on the computer I'm using this evening, or I'd test it out by putting [[C# code]] into a sandbox. — Hebrides (talk) 21:12, 10 August 2012 (UTC)
The next time AWB tries to make a questionable change, the first thing to do is hit the "Typos" tab, and it will show you what Typo rule fired on that article. Christhe spelleryack21:16, 10 August 2012 (UTC)
Thanks, Rjwilmsi, but you seem to have included only C# and F# in your exception. Probably worth catering for A# and J# too. Cheers — Hebrides (talk) 11:57, 13 August 2012 (UTC)
Why is Womens always converted to Women's with the "-men's" rule but Mens is not? I don't understand the rule or maybe the exceptions. Regards, Sun Creator(talk)12:06, 10 August 2012 (UTC)
Not compelling. The only 'womens' exception is an organisation without any mention on Wikipedia except the Apostrophe page. Regards, Sun Creator(talk)15:45, 10 August 2012 (UTC)
"long time" hyphenation
An uncertain suggestion for discussion: is it possible or wise to establish a rule hyphenating "long time" before (and only before) a noun? I've been manually cleaning up some by searching phrasing like "his long time" or "her long time", but this won't catch phrases like "Jane Jones, a long-time opponent of birth control," etc. On the other hand, a rule of "long time [noun]" to "long-time [noun]" would create some false positives from "a long time period" or "a long time capsule". Khazar2 (talk) 20:41, 11 August 2012 (UTC)
As an update to this, I've now corrected several hundred instances of "long time friend" to "long-time friend" with AWB. If it's not possible to make a more general rule about this, perhaps one could be crafted simply by looking for common phrases like "long time friend", "rival", "boyfriend", etc. Khazar2 (talk) 23:59, 12 August 2012 (UTC)
What should not be overlooked is that most dictionaries indicate that "longtime" should be closed, not hyphenated. If you prefer the hyphenated form (allowed in some dictionaries), the most proper way to fix these is to make two passes: 1) Skipping pages that contain "longtime", changing "long time" to "long-time"; 2) Skipping pages that do not contain "longtime", changing "long time" to "longtime". This way the changes will conform to the style of each article. My preference is "longtime", but Macmillan (usually the best reference on hyphenation) and Cambridge specify "long-time", so I won't change that to the closed form. Christhe spelleryack14:57, 13 August 2012 (UTC)
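The two-pass idea, combined with the earlier suggestion of listing common following nouns, could be sketched roughly as follows in Python. The noun list and the function name `fix_long_time` are illustrative assumptions; this is not an existing AWB rule.

```python
import re

# Illustrative noun list only; a real rule would need a longer, tested list.
NOUNS = ("friend", "rival", "boyfriend")
LONG_TIME = re.compile(r"\b([Ll])ong time(?= (?:%s)s?\b)" % "|".join(NOUNS))
CLOSED_FORM = re.compile(r"\b[Ll]ongtime\b")

def fix_long_time(article):
    """Follow the article's own style: 'longtime' if it already uses the
    closed form somewhere, otherwise the hyphenated 'long-time'."""
    uses_closed = CLOSED_FORM.search(article) is not None
    replacement = r"\1ongtime" if uses_closed else r"\1ong-time"
    return LONG_TIME.sub(replacement, article)

print(fix_long_time("She met her long time friend."))
print(fix_long_time("A longtime ally and long time rival."))
```

Because the rule only fires before a listed noun, "a long time period" and "a long time capsule" are left untouched, at the price of missing nouns that are not on the list.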
The "-ound-" rule now no longer matches further endings yet still has a $2, what is the rule now supposed to be doing? Rjwilmsi06:20, 13 August 2012 (UTC)
Oops, I've removed the $2; it's not needed and the rule was tested without it. The word's ending (if there is one) is left the same, as this rule deals with the earlier "uond" part, so now both "Gruond"=>"Ground" and "Suondproof"=>"Soundproof" work. Regards, Sun Creator(talk)09:40, 13 August 2012 (UTC)
Though now it won't meet the convention that typo rules match at least a whole word, so that the edit summary shows entire words? Rjwilmsi21:39, 13 August 2012 (UTC)
Wasn't aware of any such convention. Don't see that written anywhere, but I'll go adjust it to give a pretty edit summary. Regards, Sun Creator(talk)21:58, 13 August 2012 (UTC)
The edit summary now shows the middle and end of the word. Is it convention to show the word in the edit summary in full? This rule doesn't look like it has ever shown the word in full. It's possible to do that of course, but it takes a few more cycles to do it that way. Regards, Sun Creator(talk)22:19, 13 August 2012 (UTC)
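To make the edit-summary point concrete, here is a small Python sketch contrasting a partial match with a whole-word match for the "uond" typo. Both patterns and the `summary` helper are assumptions written for this illustration; the actual "-ound-" rule text is not quoted in this thread.

```python
import re

# Partial match: only one letter before "uond" is captured.
PARTIAL = re.compile(r"([A-Za-z])uond")
# Whole-word match: captures everything before and after "uond".
WHOLE = re.compile(r"\b([A-Za-z]+)uond([a-z]*)\b")

def summary(pattern, replacement, text):
    """List 'matched => replaced' pairs, roughly as an edit summary would."""
    return [f"{m.group(0)} => {m.expand(replacement)}"
            for m in pattern.finditer(text)]

print(summary(PARTIAL, r"\1ound", "Suondproof"))
print(summary(WHOLE, r"\1ound\2", "Suondproof Gruond"))
```

Both patterns produce the same corrected text; the difference is only in how much of the word the match (and hence the summary) covers.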
Extra rules with false positives
What do we do with rules that naturally have lots of false positives but are still useful when used with care? I have some in my find and replace. Do we want to throw them in the standard rules? Probably not, but shall we have a separate list for anyone who wants additional find and replaces? Regards, Sun Creator(talk)14:00, 13 August 2012 (UTC)
The American Super Bowl may well always be spelt this way but il Superbowl (the Italian equivalent) is not and this has now twice been corrected on this page and perhaps on others. Please could this error be corrected? mgSH12:12, 18 August 2012 (UTC)
I found three pages where this has happened, and corrected them all and wrapped a "Not a typo" template around them. This should prevent both AWB users and manual editors from changing them. This is the best way to handle such a rare occurrence, rather than monkeying with AWB. Christhe spelleryack15:09, 18 August 2012 (UTC)
Some of the "New" additions have been there for a very long time and some even duplicate typo fixes found further below. What is the procedure if any for moving them down? How long do we leave them there before they are no longer new?
Also, some, such as some names seem unnecessary and relatively low impact. Some such as Sam Elliot would probably be better IMO if we just took a few at a time and ran them as tasks, removed them from the list and add them to a subpage showing they were there and what we did about them. Kumioko (talk) 01:13, 28 August 2012 (UTC)
One thing that might help editors to answer those questions is a mechanism for recording, for each listed item, the date and time of its addition to the list, the date and time of its removal from the list, and the number of true-positive corrections made because of its presence on the list.
Each edit is recorded in the wiki, so you can find out when something is added or deleted. But how would you suggest capturing the number of "true-positive corrections"? GoingBatty (talk) 03:57, 28 August 2012 (UTC)
The revision history does show many of the details that I mentioned, but some searching is required if one wishes to find the date and time of the addition or removal of a particular item. I had in mind a separate list for compiling additions and removals, which now I suggest can be a sortable wikitable with columns for "item", "date and time of addition", and "date and time of removal".
The AutoWikiBrowser might record the number of revisions (supposed "corrections") that it makes for each item listed at Wikipedia:AutoWikiBrowser/Typos. Those numbers might be compiled in one place, possibly in a fourth column in the previously mentioned sortable wikitable. Human editors who revert "false-positive" corrections might record corresponding numbers in a fifth column there. Human editors might also record, in a sixth column, the difference between the numbers in columns 4 and 5. Human editors might also record, in a seventh column, the value of each number in column 5 as a percentage of the corresponding value in column 4. Spreadsheets might help with the calculations.
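The arithmetic for the proposed sixth and seventh columns is simple enough to sketch in Python. The function name `rule_stats` and the counts used below are invented purely for illustration; no such figures are recorded anywhere in this discussion.

```python
def rule_stats(corrections, reverts):
    """Return (net true positives, false-positive percentage) for one rule.

    corrections: column 4, changes made because of the rule
    reverts:     column 5, changes later reverted as false positives
    """
    net = corrections - reverts                      # column 6
    percent = 100 * reverts / corrections if corrections else 0.0  # column 7
    return net, percent

# Made-up example counts: 200 corrections, 5 of them reverted.
print(rule_stats(200, 5))
```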
What is the default scope of a typo rule in AWB? I mean, does it search in: interlanguage links, inside <--- commented out text -->, inside <syntaxhighlight=code> here</syntaxhighlight>, <ref>references</ref> and "quoted text"? Some rules don't apply in some cases; for example, some consider that grammar should not be corrected in quotes but spelling typos can be. Perhaps an option can be added to each rule to define its scope. Regards, Sun Creator(talk)13:46, 29 August 2012 (UTC)
I believe that the typo rules in general skip the following things: Comments, templates, and the area next to sic templates. I'm not sure about Source code or other HTML tags. Kumioko (talk) 14:46, 29 August 2012 (UTC)
Ultra-high-definition television
I think that "Ultra-high-definition television" looks fine and proper. On the other hand, I wouldn't be brokenhearted if the hyphen after "ultra" were dropped, because there is really no chance that a reader would stumble over it by thinking that "ultra" was modifying "definition" or "television", which is the driving reason for using hyphens in compound modifiers; "ultra television" would not be understood. But "ultra high-definition television" looks strange with just the one hyphen, as "high definition" is so pervasive that it does not really need a hyphen even when used adjectivally. So I would prefer two hyphens, or no hyphens, to a single hyphen. As for "Ultra-high definition television", now that could be a stumbling block for readers. Christhe spelleryack12:44, 31 August 2012 (UTC)
Well, now that I vented, I see that there is a lively discussion about a proposed renaming on Ultra-high-definition television, where the choice is between good punctuation and the punctuation chosen by the industry's engineers and advertising folks; not surprisingly, they have chosen the worst option of the four possibilities for hyphenating (or not hyphenating). That talk page is a better venue for this discussion than the AWB/Typos talk page. — Preceding unsigned comment added by Chris the speller (talk • contribs) 18:42, 31 August 2012
I've been playing with searches for "full-time" and "part-time", and these rules seem to generate an unfortunate number of false positives--or perhaps a better way to put it would be unnecessary positives. My understanding is that the phrase "full-time work" must always be hyphenated, but "work full time" may or may not be. Quick searches of the LA Times [26] and NYT [27] show that their style guides allow both usages, so the hyphenated/non-hyphenated distinction appears to be a null issue. Would it be possible to reset this rule to only cases where the words "full time" or "part time" precede the noun? Khazar2 (talk) 14:51, 30 August 2012 (UTC)
Macmillan Dictionary (which I have found to be very specific and very dependable on hyphenation issues) lists the adjective "full-time" with the notation "usually before noun" – "It is hard to combine study with a full-time job." And it lists the adverb "full-time" – "Her youngest child is in daycare full-time." Is there a case where a sentence is better because "full time" is unhyphenated? I can't think of a case where the hyphen could confuse a reader, and it sure is going to make the fixing of the adjective more difficult if the Typo rule has to list all possible nouns that could possibly follow "full-time", or adjective-noun phrases, such as "a full-time, permanent job". WP:HYPHEN says "Consult a good dictionary", but not "Consult a big newspaper". The punctuation in most Wikipedia articles stinks; how is it ever going to get better if more obstacles are placed in front of editors and tools are taken away? Christhe spelleryack02:34, 31 August 2012 (UTC)
I share your concern for Wikipedia spelling and punctuation, of course. But I'm also wary of setting AWB to auto-correct things that appear to be legitimate variation, and this rule generates a tremendous number of neutral edits. An equal case could be made that having tens of thousands of valid sentences like "he worked full time" flagged for review and correction is itself an obstacle, due to the slowdown it creates in other work. (And it does seem to me that newspaper style guides can be considered at least a legitimate variant here; at the very least, if the New York Times is also employing it, this is not a usage that's begging for correction.)
I'm a big fan of your work generally, though, so having said my piece, I'm happy to yield to your judgement if no one else objects. Cheers, and thanks for all your work, Khazar2 (talk) 03:11, 31 August 2012 (UTC)
I prefer the exclusive use of the hyphenated form for the technical reasons explained by Chris the speller, and technical reasons have been invoked at WT:MOS and WP:MOS. To forestall complaints by subsequent editors, the edit summary can mention "technical reasons". Also, I recommend that this be discussed at WT:MOS, but please wait until User:Noetica is again available.
I can't think of a single instance where full time shouldn't be hyphenated, whether before or after the noun it qualifies: she worked full-time; they were resource-constrained. Sure, there's slightly less imperative to hyphenate after the noun than before, but some items have the hyphen ingrained wherever they are. Here's a grammatical twist that would be a false positive: the stadium was full time after time. Probably vanishingly rare. Tony(talk) 03:51, 31 August 2012 (UTC)
I think doing such things automatically is always dangerous, and people who can't think of where it's not right are not being very imaginative. See for example the usage here.
Sorry to return to this one again, but I've encountered another "full time" situation that I wanted to check in on. When one says "full time" to mark the completion of a rugby or association football match, should this be hyphenated? (As in, "a few minutes before full time, ...") I've run into a few dozen of these in football articles so far, and wanted to check before changing any. I note that Wiktionary has this listed at "full time" (unhyphenated), but my American dictionaries don't cover this usage. Khazar2 (talk) 21:55, 1 September 2012 (UTC)
Agree with Chris: no trust in Wiktionary from me. On the football term, half-time seems more likely to demand the hyphen. I'm unsure, but wouldn't be upset if the term weren't hyphenated predicatively (after the noun). But before the noun, like full-time score, it would be needed. Tony(talk) 08:39, 2 September 2012 (UTC)
Double letters
I notice a lot of the words on the typo list have double letters like TT, SS, RR, PP, etc., but we aren't using any logic to catch typos where people misspell them. Mississipi rather than Mississippi for example. I realize that this won't work for every one but there are a lot I think that could. Kumioko (talk) 20:44, 31 August 2012 (UTC)
It's a good idea to check for a single occurrence when a double occurrence is expected. I added this to the format(t) rule a while back to handle formating instead of formatting etc. Regards, Sun Creator(talk)22:43, 31 August 2012 (UTC)
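One way to explore the double-letter idea is to generate, for each correctly spelled word, the variants in which one doubled letter has been reduced to a single letter. The Python sketch below does that; the helper names `single_letter_variants` and `build_fixer` are hypothetical, and this is not how the existing format(t) rule is written.

```python
import re

def single_letter_variants(word):
    """Return the misspellings obtained by dropping one letter of each pair."""
    return {word[:m.start()] + word[m.start() + 1:]
            for m in re.finditer(r"([a-z])\1", word)}

def build_fixer(correct_words):
    """Build a function that restores the dropped letter in known variants."""
    mapping = {variant: word
               for word in correct_words
               for variant in single_letter_variants(word)}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return lambda text: pattern.sub(lambda m: mapping[m.group(1)], text)

fixer = build_fixer(["Mississippi", "formatting"])
print(sorted(single_letter_variants("Mississippi")))
print(fixer("Mississipi needs better formating."))
```

As noted above, this won't work for every word: any generated variant that happens to be a real word would have to be removed from the mapping by hand to avoid false positives.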
I have had many complaints and questions in the past about the difference between "on board" and "onboard", so I will lay it out here and reference this discussion in a comment attached to the rule.
The adjective "onboard" (or "on-board", according to a few dictionaries) is attributive, and is always followed by a noun (or another adjective and noun):
"They brought their own sandwiches, as the onboard food was usually tasteless".
"He hoped there was enough power for the on-board electrical devices."
The prepositional phrase or idiom "on board" indicates that something is located or installed in a train, airplane or vessel:
"Everyone was on board, so he shut the door."
"She was glad to see that there was a toaster on board the lifeboat."
The Typo rule fixes many cases where "onboard" is followed by something other than a noun or adjective (such as punctuation, an article or an adverb), indicating that it is not used attributively, so it knows that "on board" should be substituted. The rule certainly misses many misuses of "onboard", but after much testing it has produced next to zero false positives, and there are a ton of these to be fixed. Christhe spelleryack01:15, 23 July 2012 (UTC)
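The behaviour described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the actual AWB rule is not quoted here, and the short list of following words (punctuation plus the articles "the", "a", "an") is a guess at the kind of context it checks.

```python
import re

# Hypothetical approximation: "onboard" followed by punctuation or an
# article is treated as non-attributive and split into "on board".
ONBOARD = re.compile(r"\b([Oo])nboard(?=[.,;:!?]|\s+(?:the|a|an)\b)")

def fix_onboard(text):
    return ONBOARD.sub(r"\1n board", text)

print(fix_onboard("Everyone was onboard, so he shut the door."))
print(fix_onboard("There was a toaster onboard the lifeboat."))
print(fix_onboard("The onboard food was usually tasteless."))
```

The third sentence is left alone because "onboard" is followed by a noun, which is the attributive case the rule is meant to avoid.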
Sorry, you are making a basic grammatical error here. Atributive Attributive adjectives are not always immediately followed by the noun, although they are usually. The important fact when considering the adjective when it occurs after the noun is whether or not there is a linking verb between the noun and the adjective. If there is no such verb then the adjective is still attributive. - Nick Thornetalk15:01, 31 July 2012 (UTC)
I was trying to keep things simple enough that most AWB editors (and AWB critics) can get a handle on what the rule is trying to accomplish without spending a whole afternoon on a grammar refresher course. The point is that the rule does a good job of avoiding changes where "onboard" is an attributive (that's spelled correctly, BTW) adjective. If you have seen cases where the rule has changed an actual attributive case of "onboard" to "on board", please let us know. I think you'll have a hard time finding even one or two cases of the attributive use of "onboard" that is not followed immediately by a noun or another adjective. If you can't find such cases, what is the point of making this discussion more complicated? Our purpose here is to improve and maintain Wikipedia, not to display our knowledge of the fine points of grammar. Creating AWB Typo rules is largely a game of controlling the odds, and this rule seems to be ahead of the game at this point. Christhe spelleryack19:30, 31 July 2012 (UTC)
Sorry about the spelling mistake in my first use of the word, now corrected. (I always try to keep my spelling correct.) I take your point about trying to keep things simple, but I question whether that is always a good thing when dealing with subtle points of grammar. As for an example, the reason I raise this whole issue was this edit of an article on my watch list. I think that bots are not best suited to making grammatical changes on the less well understood points of grammar, not least because there are always exceptions, usually contextual in nature, that make it hard or impossible to codify every possible situation. - Nick Thornetalk22:45, 31 July 2012 (UTC)
I'm skeptical that the edit you flag here is a false positive. Googling NYT and BBC (to make sure ENGVAR isn't an issue), "people on board" outnumbers "people onboard" by about 150:1. "Personnel onboard" has a smaller sample size but equivalent results. Clearly the former is the preferred usage. Khazar2 (talk) 23:17, 31 July 2012 (UTC)
Khazar2 is right: in the example provided by Nick Thorne, "onboard" was not used attributively, and should be two words; it is a prepositional phrase, the equivalent of "aboard". While writing that last sentence, I suddenly realized that there is a simple test to help decide whether "onboard" or "on board" should be used: if "aboard" could be substituted, then "on board" is correct; otherwise, "onboard" should be used. Using the above example, "Everyone was aboard, so he shut the door." makes as much sense as "Everyone was on board". Another point: AWB is not a bot; editors are looking at each change to verify its correctness. Christhe spelleryack03:10, 1 August 2012 (UTC)
Your example fails because there is a linking verb between the noun and the adjective, an important point. In the Nias article it said the aircraft had 11 people onboard. This could have been written the aircraft had 11 onboard people with no change in meaning, it just seems a little unnatural which is why the adjective follows the nouns in this case. The word onboard in both cases is being used attributively - it is attributing the property of location to the people. As a former Fleet Air Arm officer, I watch many pages related to naval aviation and nautical matters. It was because of this that the subject came to my attention. The word onboard is perhaps not very common in everyday speech, but in aviation and nautical discussions it has a particular meaning which is not quite the same as on board. One of the things that disappoints me about Wikipedia is that sometimes well intentioned people make changes to articles that indicate an incomplete understanding of the particular subject. It is a form of unintentional dumbing down of the encyclopedia. I would have thought that one of the purposes of the encyclopedia is to educate people. If part of that is making sure that obscure points of grammar are attended to then IMO that is no bad thing. This is not a criticism of your work, on the contrary, fixing up spelling and grammar mistakes in the encyclopedia is a great service to the community. In this case however, I think you're missing a subtle shade of meaning. In any case I don't plan to keep on about this. If you decide to change the article back I will of course be happy about that. If not, well let's face it, it's not the most pressing issue on Wikipedia is it? - Nick Thornetalk23:20, 1 August 2012 (UTC)
I know perfectly well what a prepositional phrase is. The most pressing issue on WP is accuracy, but spelling, grammar and punctuation are important. I will continue to correct those aspects as well. Christhe spelleryack02:56, 2 August 2012 (UTC)
No, not different, though there is one editor who claims that it is whenever his personal hyphenation style is at variance with every modern dictionary. Christhe spelleryack01:31, 4 September 2012 (UTC)
Misspelling of "government"
I recently corrected more than 30 misspellings of "government", which I found by searching for "goverment". Along the way, I found many occurrences of that misspelling in web addresses and Wikipedia file names. Is there a practical method for correcting the misspelling in those file names?
—Wavelength (talk) 00:25, 1 September 2012 (UTC)
I have not yet edited on Wikimedia Commons, and I am not yet ready to monitor a watchlist there, but I have started a list at User:Wavelength/About Wikipedia/File namespace#Misspellings in titles. In the future, I might comment at Wikimedia Commons, and provide a link to that list. Meanwhile, other Wikimedians are welcome to monitor that list and to mention it or its contents at Wikimedia Commons.
In Eden Springs Europe, AWB wants to change "securing a EUR 150 million credit facility" to "securing an EUR 150 million credit facility". Seems this change is incorrect whether you would pronounce this "a hundred-fifty million euro", "a one hundred-fifty million euro" or "a euro 150 million". Thoughts? GoingBatty (talk) 20:07, 3 September 2012 (UTC)
Should be "a EUR 150 million" but "an EUR 80 million", I've updated the rule to leave it unchanged, please reload it. I think there will be other currencies with the same thing, USD is ok because capital U is pronounced 'yoo'. Regards, Sun Creator(talk)20:39, 3 September 2012 (UTC)
Writing typo rules says : Avoid having a rule detect a correct spelling
Is the above a rule or a guide? If it's a rule then both the lifetime break it. There may be others.
It seems however that it's better to create a single rule that detects correct spelling than multiple ones, but perhaps I'm missing something. Regards, SunCreator(talk)08:58, 6 August 2012 (UTC)
The point really was not one in particular but the logic behind why to see if it is still required. There are long standing rules that match correct spelling
"New Hampshire", "Rhode Island" and "Uninhabited" rules match correct spelling to name but three
Rules that detect a correct spelling don't seem to cause anyone a problem
The writing of rules to correct more than one problem is compromised by the avoiding of correct spelling. When you have two or more possible things to correct you have to pick the lesser one to ignore in order to avoid a self match.
i.e. the "Cayman Islands" capitalisation rule corrects "Cayman islands" but "cayman Islands" is left untouched; it's a compromise by the rule writer to avoid the correct spelling.
"New Hampshire" and "Rhode Island" are not long standing rules; they were added on August 26, 2012. I've fixed the Rhode Island rule so it doesn't match the correct spelling. I'll wait until we get consensus on my discussion below about the "New England" rule before changing the "New Hampshire" rule.
The "Uninhabited" rule appears to only match "Unihabited" (missing the second "n"). When I run AWB on articles that contain "Uninhabited", it's not identifying a typo fix for me. Do you have an example? Thanks! GoingBatty (talk) 04:39, 5 September 2012 (UTC)
Okay, but the examples are kinda getting away from the point: having a rule detect a correct spelling does no harm (it seems), but writing the rule to avoid it does harm, because the rule is made to avoid a part that it could otherwise correct. Regards, Sun Creator(talk)05:10, 5 September 2012 (UTC)
It seems that the New England (& New Mexico) rules may encounter four capitalization varieties:
"New England" (correct, should not be changed)
"New england" (incorrect, should be changed to "New England")
"new England" (may be correct, should not be changed)
"new england" (incorrect, "England" should be capitalized, but "new" should not based on rule above)
Therefore, if my premise is correct, the only thing this rule should be doing is capitalizing "England", why not just have an "england" --> "England" rule? Thanks! GoingBatty (talk) 04:12, 5 September 2012 (UTC)
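The four cases listed above can be checked with a short Python sketch. The pattern below is an assumption written for this illustration (it only capitalises the second word and leaves "new"/"New" as found); it is not the current AWB rule.

```python
import re

# Hypothetical rule: capitalise "england"/"mexico" after "new" or "New",
# without touching the capitalisation of "new" itself.
NEW_PLACE = re.compile(r"\b([Nn]ew) (england|mexico)\b")

def fix_new_place(text):
    return NEW_PLACE.sub(
        lambda m: f"{m.group(1)} {m.group(2).capitalize()}", text)

for sample in ("New England", "New england", "new England", "new england"):
    print(repr(sample), "->", repr(fix_new_place(sample)))
```

Since the pattern requires a lowercase second word, the correct spelling "New England" is never matched, and nothing like "the new jersey" is affected because Jersey is not in the list.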
ah, this has caused me hours of thinking already. 4 is undetermined and could be "New England" or "new England", but chances are the former, I think; either way the England part is surely capitalized. If you decide the 'new' in 4 should not be capitalized (as doing so could be a FP) then yes, just capitalize England(ers?) and do the same with the Mexico rule. New York rule also? New Jersey could all be false positives, i.e. do no capitalization; I had FPs with this a few times: "The team wore the new jersey." Regards, Sun Creator(talk)05:00, 5 September 2012 (UTC)
Profiling the typos
I've just noticed from profiling.txt that it takes my computer 27 seconds to run RegExTypoFix on List of Doctor Who universe creatures and aliens. On the one hand, we've been adding more and more clever rules; on the other, some users are noticing that the latest versions of the program are slower than the older versions.
I've run AWB's "profile typos" option and have posted the results in User:John of Reading/Sandbox (permanent link). They are sorted by CPU time. Right at the top of the page is the new "a to an" rule, but there are many others not far behind.
Where do you get the profiling.txt? Some of those can be optimised. I already spent some time looking at the speed of the "a to an" rule in regex; it's effectively 5+ rules in one, so it doesn't surprise me. Will look into optimising some of the rules with the profiling.txt once I can enable/find it. Regards, Sun Creator(talk)21:45, 5 September 2012 (UTC)
Of the 267 endings, 31 take over a third of the total processing time. They use [A-Za-z]+ at the beginning every time, with no prechecking, in order to make the edit summary pretty.
Endings can be made fast by removing the beginning check:
i.e. for the '-itely' rule \b([A-Za-z]+[lnst])itly\b (currently 443ms) => ([A-Za-z][lnst])itly\b (73ms), although the edit summary would not be 'pretty' and would say, for example, 'litly => litely' instead of 'impolitly => impolitely'.
Nevertheless, 31 multiplied by a saving of around 370ms per rule is over 10 seconds. So of John's 27 seconds, over 10 seconds can be saved by amending those 31 endings. Regards, Sun Creator(talk)01:48, 6 September 2012 (UTC)
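The comparison between the two forms of the ending rule can be reproduced in outline with Python's `timeit`. This is a sketch only: Python's `re` engine is not the .NET engine AWB uses, so the absolute timings (and possibly the ratio) will differ from the figures quoted above, and the sample text and the `time_rule` helper are assumptions for this example.

```python
import re
import timeit

# The two forms of the ending rule discussed above.
PRETTY = re.compile(r"\b([A-Za-z]+[lnst])itly\b")   # whole word in the match
FAST = re.compile(r"([A-Za-z][lnst])itly\b")        # only the word's tail
REPLACEMENT = r"\1itely"

def time_rule(pattern, text, number=20):
    """Time repeated scans of text that contains no typo (the usual case)."""
    return timeit.timeit(lambda: pattern.search(text), number=number)

article = "The committee politely declined the invitation. " * 2000
pretty_seconds = time_rule(PRETTY, article)
fast_seconds = time_rule(FAST, article)
print(f"pretty: {pretty_seconds:.4f}s  fast: {fast_seconds:.4f}s")

# Both forms produce the same corrected text.
print(PRETTY.sub(REPLACEMENT, "He declined impolitly."))
print(FAST.sub(REPLACEMENT, "He declined impolitly."))
```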
The benefit outweighs the confusion; anyone can see the change in the diff anyway. If it speeds the process up by a considerable percentage, then by all means go for it. Even rules like 'a' to 'an' do not show the word which follows; the summary notes the change and not the reason for the change, because it would be very time consuming to explain it. If anyone DOES have a problem with it, they can see the diff or come here. I'm also planning to build a new list of Regex typos from the database dump; if amending these rules increases speed by 30% then that should be enough of a reason to have just the change highlighted in this case. ChrisGualtieri (talk) 16:26, 6 September 2012 (UTC)
Some great work here. Don't change anything just yet though, I might be able to make a code change to the way the rules are processed to improve the speed of the endings rules without affecting the edit summary or having to change the rules themselves. Rjwilmsi17:21, 6 September 2012 (UTC)
As a followup to a comment I made a while back. I think that there are a number of typo corrections in the existing list that are low numbers so it might be beneficial to move those to another "Inactive typos" list or something that can be run periodically instead of every time. I had checked a few in the past and couldn't find some of them at all and others only had 1 or 2 articles affected so it seems of little value for them to be on the "active" list. Kumioko (talk) 17:51, 6 September 2012 (UTC)
Sun Creator, you're not comparing apples with apples with your 31x, 370 ms analysis. Firstly, the typos.txt output is for 1,000,000/article length iterations (so 3 for the Dr Who test case), so any numbers should be divided by 3. There are 35 rules starting "\b([A-Za-z]", they total 6975 ms out of a total of 113889 ms for the Dr Who test case (on my PC, where AWB typos time is around 13.5 s, so John must have a slower CPU). That's 6.1%, so even if we rewrote them to take zero time, they would only give about 1.6 s improvement for John. Secondly, and you couldn't have known this, the typos profiling does not profile the way in which AWB actually runs the typo rules. The profiling just does IsMatch for each typo rule for n iterations and returns the summed time per rule. The actual typo fixing puts typos into larger typos in groups of 20 (this is faster, the sum profiling time / iterations for me is 113889/3 i.e. around 37 seconds, but around 13 seconds at runtime) and does IsMatch against the grouped one. 
(An example group is \b(([Ss])ea-(board?|foods?|m[ae]n|ports?|planes?|wards?|weeds?|worth(?:y|iness))|um([dntv][a-z]+)|([Uu])(?:n|nnn)(amed|atural[a-z]*|avigable|ecessar(il)?y|eeded|otice[a-z]*|umber[a-z]*)|([Ww])(ere(?:abouts|by)|isker(?:s|ed)|istl(?:er?s?|ed|ing))|([Xx])yph([io][a-z]+)|([IiUu]n)?([Aa]ccept|[Aa]rgu|[Cc]ap|[Cc]onfigur|[Ff]orgiv|[Hh]ospit|[Mm]istak|[Nn]ot|[Oo]ppos|[Ss]cal|[Tt]ranslat|[Uu]s|[Vv]alu|[Vv]ulner)(?:ea?|[eiu]a?)b(l[ey]|ilit(?:y|ies))|((?:[IiUu]n)?[Dd]e)(bat|cid|fin|form|grad|[lt]ect|not|pend|plor|p?riv|sir|spi[cs])(?:ea|i)bl([ey])|((?:[IiUu]n)?[Rr]e)(ad|ason|charge|cogni[sz]|concil|cover|cycl|deem|mark|mov|new|pai?r|pea[lt]|place|put|view|voc)(?:ea?|[eiu]a?)b(l[ey]|ility)|([BbFfHhJjmNnRrSsTtw]?|[Tt]r)aill(ed|ing)|([Mm]is|[Rr]e)?([BbFfMmRrTtWw]|[LlPp]e|[BbCcFfWw]re|[Ss](?:[hlnot]|[np]e|[ct]re))kaing(s)?|([DdQq]u|[Ee]qu|[FfNn]at|[FfNn]orm|[LlRr]eg|[Ll]oc|[Rr]e|[Tt]o[nt]|[Vv]it)all+it(y|ies)|([Ff]il|[Ll]ig|[Tt]est|[Tt]ourn)ia?ment(s?|ary)|((?:[Pp]?[Rr]e)?[Aa]rr|(?:[Ee]x|[Ii]nter|[Ss]hort|[Uu]n)?[Cc]h|[Dd]er|R|r)an(?:gei|egi)?ng|([Bb]ot|[Mm]ech|[Pp]urit|[Ss]at)annical(s?|ly)|([Aa]dam|[Aa](?:bu|tte)nd|(?:[Dd]is|[Rr]e)?[Aa]ppear|(?:[Rr]e)?[Cc]ogni[sz]|(?:[Aa]s|[Cc]on|[Dd]is)son|[Dd]efend|[Ii]gnor|[Mm]erch|[Oo]xid|[Ss]erv|[Vv]ac)(?:en|and)(ts?|tly|ci?es?|cy)|([Aa](?:ccep|cqu(?:ain|it)|dmit)|[Bb]la|(?:[Nn]on)?[Cc]omba|[Ee]xpec|(?:[Ii]n)?[Hh](?:ab|e[rs])i|[Ii]mp[ao]r|[Mm]ili|[Pp]it|[Rr]e(?:luc|mit|pen))t[ei]n((?:c[eiy]|t(?<!\b[Rr]emittent))[a-z]*)|([Aa]ssi|[Cc]on|[Ii]ncon|[Dd]i|[Ii]n|[Rr]esi)st(?:atn|ent)(s?|ly)|([Ee]dw|[Hh]ow|[Rr]ich)rad((son)?s?|ians?)|([Bb]ound|[Dd]iction|[Ll]egend|[Pp]rim|[Ss](?:al|econd)|[Tt]ern)e?r(y|ies)|([Aa]br|[Ee]v|[Ii]nv|[Oo]cc|[Pp]ersu)ation(s?|al(ly)?))\b though every time the AWB rules page changes the groups may change as well). 
Last weekend I looked at the group size number (20, and also the fact that the typos are compiled regexes so add 10 seconds to the typo time on the first run in the AWB session) and could not find a number that gave better performance. However, all may not be lost: 26 of the 205 groups take 6 of the 13 seconds. Rjwilmsi20:12, 6 September 2012 (UTC)
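To illustrate the grouping described above, here is a minimal Python sketch. It is not AWB's code: the function names, the three sample rules and the use of Python's re module in place of .NET's IsMatch are all assumptions made for illustration. It joins rules into one alternation per group and only tests the individual rules when the grouped regex matches.

```python
import re

def group_rules(patterns, group_size=20):
    """Combine rule patterns into alternation groups of `group_size` (hypothetical helper)."""
    groups = []
    for start in range(0, len(patterns), group_size):
        chunk = patterns[start:start + group_size]
        # One big regex per group; each rule is wrapped so the alternation stays well-formed.
        combined = re.compile("|".join(f"(?:{p})" for p in chunk))
        groups.append((combined, [re.compile(p) for p in chunk]))
    return groups

def rules_that_match(text, groups):
    """Return the patterns of the individual rules that match, checking each group first."""
    hits = []
    for combined, members in groups:
        if combined.search(text) is None:
            continue  # the whole group is skipped after a single test
        hits.extend(m.pattern for m in members if m.search(text))
    return hits

# Sample rules invented for this sketch; real rules live on the AWB typos page.
RULES = [r"\b([Dd])isolv", r"\b([Ii])nvovl", r"\b([Mm])assachusetss\b"]
```

On an article with no typos, each group costs one regex test instead of one test per rule, which is the effect the 37 s versus 13 s figures above describe. Rules that use numbered backreferences would need renumbering before being joined this way; the sketch ignores that.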
That does not make any sense at all. You are saying "profiling.txt", which you call "typos.txt", has figures that are three times too big? Well, it doesn't matter because I worked out my own baseline. The List of Doctor Who universe creatures and aliens article takes 27 seconds to process on my PC also, the same as John's. I used the Regex tester repeatedly to get an accurate figure for the rule times both with and without the "[A-Za-z]+" pre-code. How can it take 113889 ms for you in total (that's 113 seconds) when you have a faster PC? Perhaps you have made some miscalculations? Regards, Sun Creator(talk)21:18, 6 September 2012 (UTC)
Grab the 5.4.0.1 snapshot and use Tools->Profile typos. Profiling.txt only has the runtime typos summary (13 s for me, 27 s for you). Typos.txt has the detail (113889 ms total for me), but as I've said it's not detailing exactly what AWB actually runs, so you cannot directly compare one and the other. Rjwilmsi21:38, 6 September 2012 (UTC)
So profile.txt reports a figure in milliseconds about 8 to 9 times the actual time taken; that could be filed as an AWB bug. The Regex tester gives similar figures to profile.txt, so it also has the time reporting bug. Recalculating the '-itely' rule, it currently takes 50 ms and without the pretty edit summary would take around 9 ms. The saving from removing 40 ms from each of the 31 ending rules with pretty edit summaries would total around 1.2 to 1.3 seconds, not as good as first thought. Regards, Sun Creator(talk)23:13, 6 September 2012 (UTC)
There's no bug. Profiling.txt and Typos.txt are measuring different things. With the regex tester you are replicating the Typos.txt method; 1.2 to 1.3 s on the Typos.txt total is probably about the saving that changing the endings rules would achieve, roughly in agreement with what I measured when I changed \b([A-Za-z]+ to ( during the typo load to simulate the rules change. However, when it's applied to the actual method (as measured by Profiling.txt), it's not going to be the same time saving. As the AWB grouping behaviour already makes the typo rules about 3x faster overall (cf. my 37 seconds Typos.txt to 13.5 s Profiling.txt), you could estimate that any time saving will be 1/3 as much. On Dr Who I only measured about a 200 ms improvement on 13.5 s in Profiling.txt. What I think this means is that finding large improvements in performance is going to be hard, if not very hard. On the other hand, the Dr Who list is the most extreme example I've seen; typical large articles (featured articles etc.) run (Profiling.txt) in 2 to 3 seconds for me, which seems reasonable. Rjwilmsi06:14, 7 September 2012 (UTC)
Thank you for an attempt to improve this. An idea(!) to speed this up. Could a new type of replace variable be added to the replace= that is only used in the edit summary but not in the actual article? If so then faster rules can be made that use a post look behind when matching to find cycle consuming pre-text for the edit summary. The find for the 'itely' rule would be "([A-Za-z][lnst])itly\b(?<=([A-Za-z]*)[A-Za-z][lnst]itly)\b" and the replace would be "%2$1itely" where %2 is the contents of $2 but %2 is not applied to the article only the edit summary. Regards, Sun Creator(talk)22:19, 7 September 2012 (UTC)
The combining of 20 or so together appears very clever, especially varying it according to the rules. How does it know what to replace with? You could manually combine many rules if the replace restriction was removed, not that manually combining would necessarily be any better than the automatic way. Regards, Sun Creator(talk)11:12, 7 September 2012 (UTC)
"Involved" with "Revolved" added
I'm not convinced that this is a good idea. There were only 4 cases of "revovled/es/ing" (one of which was in a title that Typos wouldn't fix), and I corrected them. Not worth the extra cycles for so few hits. Same thing for the upper-case "Invovled"; vanishingly rare, except in a couple of titles, where Typo rules won't touch them. Christhe spelleryack20:24, 7 September 2012 (UTC)
In the article 2012 in paleontology is written: The type species is Bicentenaria argentina. The Regex rule is, find="\bargentin(a|e(an)?s?)\b(?!'')" and it avoids matching in the Regex tester, yet when AWB looks at the article it matches and wants to capitalise argentina to Argentina. Can someone look and see what the problem is in this case. Regards, Sun Creator(talk)10:35, 14 September 2012 (UTC)
I tweaked the rule, changing the order of \b and the negative lookahead, just on a hunch, but no help. This looks like a bug, as it works fine as an F&R rule. Christhe spelleryack13:22, 14 September 2012 (UTC)
Italic text should automatically be hidden from typo fixing, so not sure what's happened here. I'll investigate later. Rjwilmsi14:15, 14 September 2012 (UTC)
Thank you. I added backslashes before each single quote, for "\bargentin(a|e(an)?s?)\b(?!\'\')" as that works in Regex and worked in the typo rule also. Worth knowing that '' does not work as it appears. Regards, Sun Creator(talk)14:17, 14 September 2012 (UTC)
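For reference, the behaviour of the original find pattern can be checked outside AWB. The sketch below uses Python's re module, which is an assumption (AWB uses .NET regexes and hides italics before typo fixing, so it shows only what the pattern does on raw text, not why AWB behaved differently); the function name is hypothetical.

```python
import re

# Find pattern quoted in the thread; the lookahead refuses a match followed by '' (italic markup).
RULE = re.compile(r"\bargentin(a|e(an)?s?)\b(?!'')")

def capitalise_argentina(text):
    """Capitalise the first letter of each match (hypothetical helper for this sketch)."""
    return RULE.sub(lambda m: "A" + m.group(0)[1:], text)
```

On raw wikitext the species name in ''Bicentenaria argentina'' is left alone because the closing '' immediately follows the word, while "born in argentina" is capitalised.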
Regex testing
You can use the AWB find and replace to test new typo rules. I just found this out and am feeling like a n00bie, so I'm sharing in case others might not know of this excellent way of testing new rules. Regards, SunCreator(talk)12:55, 8 August 2012 (UTC)
Confirmed with Merriam-Webster as well as NYT, LAT, BBC, and Guardian that the latter is the correct usage. I've run about a hundred of these in AWB with only one false positive so far, the unusual phrasing "he found himself employed by... " Khazar2 (talk) 19:15, 4 September 2012 (UTC)
Both sound logical to me, though the list Wavelength links includes words that are comparatively rare ("self-aligned") along with more common ("self-abuse"). I'm not savvy enough on the programming side to know whether it's worth winnowing that list down to only common errors. Khazar2 (talk) 23:29, 4 September 2012 (UTC)
In my copy of The New Merriam-Webster Pocket Dictionary (1965), the main entry "self-" is followed by this list of 96 derivative entries (which I have divided into nine groups of 10 words each, and one group of 6 words):
I did a Wikipedia search for ~"self complacent", and only hit 7 pages, all of them correctly hyphenated. This exercise, repeated 95 times, will probably indicate which ones are worthwhile candidates for inclusion in a Typo rule. A search of ~"self inflicted" came up with 913 pages; a pre-parse AWB run changing "elf inflicted" to "elf-inflicted" would be a fairly quick way to count how many of the 913 are not hyphenated. If any search finds a couple dozen unhyphenated cases, it should probably be included. If this makes sense, I'll search a group of 10 words, and someone can volunteer for another group or two. Christhe spelleryack18:08, 5 September 2012 (UTC)
Sounds good to me. Since I'm already doing self-employed, I'll start with this row and will check back in in 24-48 hours:
Ok, first set of results. The number in parentheses is the total corrections I made, followed by a plus sign if I didn't finish the list: self-driven (3 corrections), -educated (26), -education (14), -employment (40-50), -employed (200+), -esteem (50+), -evident (30+), -explaining (1), -examination (40-50), -explanatory (50-60), -expression (30+).
Once you've gone 4 letters in with "self" the cycle time usage will be negligible, so you may as well add all the "self-" variants you want. Regards, Sun Creator(talk)01:12, 9 September 2012 (UTC)
In that case, how should we proceed from here? Include the full list above, or check each first to make sure none creates some large number of unforeseen false positives? I'll continue correcting words from the list individually for the time being. Khazar2 (talk) 18:44, 11 September 2012 (UTC)
Worked through another set this week, which I'll just mark yes or no this time for whether or not it is worth including (using a threshold of ~30 results): self-forgetful (no), self-giving (no), self-governing (yes), self-government (no), self-help (yes), self-importance (no), self-important (no), self-imposed (yes), self-improvement (yes), self-induced (yes).
Would anyone watching this thread like to include some or all of this list? I'm also still curious to hear what other editors think of the possibility per Sun Creator that we simply add the full list to AWB. -- Khazar2 (talk) 16:59, 18 September 2012 (UTC)
Each individual one doesn't need to be occurring 25 times, once is fine as the self- rule in total will exceed 25 easily. What is important is that no known false positives happen. Regards, Sun Creator(talk)17:18, 18 September 2012 (UTC)
I posted the words from The New Merriam-Webster Pocket Dictionary (1965) after I posted a link to a list at http://www.onelook.com/?w=self-*&ls=a and after Khazar2 suggested "winnowing that list down to only common errors".
Well, I can check over the Webster's list, but I've noticed that even some from that list only had 20-30 total occurrences in the correct form in WP; I'll probably pass on checking the OED ones, too. Obviously I've no objections if you want to test and include a longer list, though, Sun. Khazar2 (talk) 20:05, 18 September 2012 (UTC)
Adding Oxford(OED) because it is usually considered the most authoritative source. I notice that OED does not have self-explaining or self-giving. Would it be okay to remove those? Regards, Sun Creator(talk)20:28, 18 September 2012 (UTC)
While you might think the above will be slow because it deals with 398 words, it actually executes very fast and is quicker than the (In)significant rule. Regards, Sun Creator(talk)00:38, 19 September 2012 (UTC)
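As a rough illustration of a combined "self-" rule, here is a Python sketch. The word list is only the subset that this thread reported as worth including, not the 398-word list, and the names and the Python re syntax are assumptions rather than the rule that went into AWB.

```python
import re

# Subset of words reported above as having enough unhyphenated cases; illustrative only.
SELF_WORDS = [
    "educated", "education", "employment", "employed", "esteem", "evident",
    "examination", "explanatory", "expression", "governing", "help",
    "imposed", "improvement", "induced", "inflicted",
]

# The leading \b means "himself employed" is not touched.
SELF_RULE = re.compile(r"\b([Ss])elf (" + "|".join(SELF_WORDS) + r")\b")

def hyphenate_self(text):
    """Insert the hyphen after 'self' for the listed words (hypothetical helper)."""
    return SELF_RULE.sub(r"\1elf-\2", text)
```

Because the regex engine has to match "self " before it considers any alternative, text without that prefix is rejected quickly, which fits the point above that the length of the word list matters little for speed.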
The typo combination 'Ihs'(note it's a capital 'i') seems unlikely. Also wondering why the replacement is 'His' and not 'his'. Regards, Sun Creator(talk)21:18, 16 September 2012 (UTC)
Just 11 matches in my current database dump, and all are false positives in web addresses, image names and such like. I'm sure this one can go. -- John of Reading (talk) 07:49, 17 September 2012 (UTC)
Proposal for a fuzzy rule
I have investigated a very "fuzzy" rule to replace the current "Individual" rule. The old rule effectively fixes two misspellings (idividual and indvidual) with two more possible endings (individuals and individually). The new rule fixes about a thousand possible misspellings (though many of these would be double or triple typos, and so would be very unlikely) and any number of endings, such as "individualistically", "individualized" and "individualism". The new rule has actually caught some words with double typos (e.g. "indvidiuals") and doubled syllables (e.g. "individidual"). I have used it to fix almost 100 articles with misspellings that the old rule doesn't find.
It finds any word of the form "i__d__v__l__" that has other letters from "individual" in just about any order.
Are there any problems with this approach? I think it will run about as fast as the old rule. I have not found any false positives in English text, though there may have been one or two in some Romanian text (these should have "Not a typo" templates slapped on them, anyway). I would be pleased if some of you would try it out and comment. I think this technique has promise for a number of other words that have many possible misspellings. Christhe spelleryack03:55, 20 September 2012 (UTC)
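To make the "i__d__v__l__" idea concrete, here is a Python sketch of a fuzzy pattern in that spirit. It is not the rule proposed above (which is not quoted in this thread); the character classes, the length limits and the function name are assumptions chosen for illustration.

```python
import re

# Fixed skeleton i..d..v..l with small sets of "individual" letters allowed in the gaps.
# The lookbehind rejects the correct spelling so only misspellings are rewritten.
FUZZY = re.compile(
    r"\b([Ii])[indv]{0,4}d[indv]{0,4}v[idua]{0,6}l(?<!ndividual)([a-z]*)\b"
)

def fix_individual(text):
    """Rewrite fuzzy matches to 'individual' plus the original ending (hypothetical helper)."""
    return FUZZY.sub(r"\1ndividual\2", text)
```

This sketch catches "idividual", "indvidual", "indvidiuals" and "individidual" and leaves "individual", "individually" and "individualism" alone. It would also turn "Idval" into "Individual", which is the kind of false positive that needs researching before a rule like this goes live.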
The concept is a good one for longer words where fixtures(fixed parts) like the "i__d__v__l" make it unique and excludes false positives, I used it on the 'Wiki(p/m)edia' rule recently.
The drawback is that you have to do extra false positive and foreign spelling research. For false positives in this example, 'Id' is short for Identification and 'val' is short for value, so you could have 'Idval' as a false positive; DVD is a product and Apple has/had an iDVD that is almost a match for this rule; Idvallo and Idaville are places, and those are false positives on the above rule. Useful tools to check are Wikipedia search preview, Google, multilingual. Once a well-researched rule is written you can pick up many more typos than with a normal rule.
Further refinement of the rule is possible. Things that are excluded at the beginning of the rule tend to help its speed; for that reason I would fix the second letter as 'n', but also allow 'm' because it's adjacent on the keyboard, making the structure "i(n/m)d__v__l", which is more robust to false positives. Vowels are often substituted in misspellings, so I'd consider making them interchangeable as well as optional: 'Induvidual' is quite plausible, with typos occurring on both Wikipedia and Google, and the same goes for 'Indavidual', 'Indivadual', 'indevidual', 'indivedual', 'Indivadaul', which Google shows as occurring somewhere. A check on the vowels after the 'l' does not find anything helpful: 'Individual(e/i)' are foreign, 'Individuala' could be also, and 'Individual(o/u)' doesn't occur on Google. End result:
The above still has a minor issue with the foreign(?) endings 'Individuale' and 'Individuali', so it is easiest to add a lookahead excluding them, plus allow for hyphenated words and a pretty edit summary for misspellings of individual's.
Thanks. I met you about halfway, and put it into production. It fixes a ton of misspellings, and finds very few false positives. I have excluded "Individuel", which is a fairly common misspelling, but also a French word. If a few dozen in French phrases get wrapped with "Not a typo" templates, we could remove the exclusion. Christhe spelleryack18:34, 20 September 2012 (UTC)
The recent change brings to end most possibilities with this rule. What was the issue before? Example would be nice. 01:25, 25 September 2012 (UTC) — Preceding unsigned comment added by Sun Creator (talk • contribs)
Consider new rule for "the so called" -> "the so-called"
I think we could have a new typo rule for "the so called" -> "the so-called", but would value input/investigation by others. Thanks Rjwilmsi08:48, 22 September 2012 (UTC)
No, just the former case. See www.macmillandictionary.com, which is head and shoulders above the other dictionaries for hyphenation; it says "adjective [only before noun]". Christhe spelleryack23:56, 23 September 2012 (UTC)
Are you asking whether its function can be accomplished without the lookbehind? Yes, but it would be slower to look for "A|a|by|of|The|the|These|these|Her|her|Their|their|This|this|His|his" first and "so called" second, or so we believe. Or are you asking whether we can skip the lookbehind and just hyphenate all cases of "so called"? The comment after the rule explains one type of sentence where it should not be hyphenated, as in that case it is not an adjective that precedes a noun. Christhe spelleryack17:54, 25 September 2012 (UTC)
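For readers who want to try the lookbehind approach, here is a Python sketch built from the word list quoted above. It is not the AWB rule itself: Python's re requires fixed-width lookbehinds, so the words are split into one lookbehind per length, and the names are hypothetical.

```python
import re

# One fixed-width lookbehind per word length; "so called" is matched first,
# then the preceding word is checked, as the reply above describes.
SO_CALLED = re.compile(
    r"(?:(?<=\b[Aa] )"
    r"|(?<=\b(?:by|of) )"
    r"|(?<=\b(?:[Tt]he|[Hh]er|[Hh]is) )"
    r"|(?<=\b[Tt]his )"
    r"|(?<=\b(?:[Tt]hese|[Tt]heir) ))"
    r"so called\b"
)

def hyphenate_so_called(text):
    """Hyphenate 'so called' only after one of the listed words (hypothetical helper)."""
    return SO_CALLED.sub("so-called", text)
```

A sentence such as "It was so called because of its shape" is left alone, since "was" is not in the list and the phrase there is not an adjective before a noun.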
Nice ideas. Previous discussion has put a requirement of 25 occurrences before it's worth a rule, due to each new rule slowing down the checking. In each of these cases it appears fewer than 25 are available after the false positives of initial are removed. On a related note I've started some work on an improved -ably and -ally rule but they are far from finished; see User:Sun Creator/-ally and User:Sun Creator/-ably. Regards, Sun Creator(talk)04:22, 24 October 2012 (UTC)
I noticed there is already a character rule. That rule could be amended to fix the above typo but in reading of the rule it appears the rule would already fix it. Regards, Sun Creator(talk)04:31, 24 October 2012 (UTC)
"Intially" already has a Typo rule; the cases you see are on pages that have not had AWB Typos run on them lately. I haven't checked the other two suggestions yet. Christhe spelleryack04:34, 24 October 2012 (UTC)
So, in conclusion: more people are required to clean the typos that existing rules already fix. I noted a week or so back that for the 'a to an' rule alone there are 17,000 plus articles with the grammar/typo. Regards, Sun Creator(talk)19:46, 24 October 2012 (UTC)
Interesting! The report would be more useful, though, if it only included unpiped links where the wrong spelling is visible to the reader - like the three articles which link to Tennesseee. Both articles which link to Natural satelite do so through a piped link, so those don't need to be fixed. -- John of Reading (talk) 17:13, 6 December 2012 (UTC)
The change seems fine. Vowel sounds take "an" before them, and 'i' is pronounced 'aɪ'. Why would it not be an i7, an i6, an i8, an iPhone, or an iPad? Regards, Sun Creator(talk)22:24, 16 December 2012 (UTC)
Let's suppose that the word foo is almost always an incorrect spelling of fob, and that someone has added a typo-fixing rule to do the fix.
Current behaviour
If an article contains a link [[foo]] or [[abc foo def]], then the typo-fixer has been coded to turn off that rule for the entire article. It won't change foo to fob in the body text of the article. Sometimes this makes sense, as the existence of a [[foo]] link is a sign that the word has a special meaning when used in the article. Sometimes this merely means that a typo goes unfixed - the link is itself a typo, possibly an unintended red link or a link to something in Category:Redirects from misspellings. As I understand it, AWB cannot efficiently discover whether each link is a red or blue link, an article or a redirect.
It seems very difficult to assess whether this is the best behaviour. It takes a sharp-eyed AWB user to notice that a foo has been left unchanged. Questions, then:
Does anyone have a feel for how many correct fixes are missed because of this behaviour?
Does anyone have a feel for how many incorrect fixes are avoided because of this behaviour?
Proposed change
I propose that a link to an image should never cause a typo-fixing rule to be turned off. Images are frequently uploaded with non-English names, or with typos in their names, or with names that don't conform to our picky hyphenation or capitalisation rules; and since image names aren't displayed in the articles there is no great incentive to rename them. I think, then, that the existence of a link to [[File:abc foo def.jpg]] should not be taken as a sign that the word foo has a special meaning in the article, and that the "foo to fob" rule should be allowed to run normally.
Support - As far as Portuguese is concerned, names in images are misspelled most of the time (usually people do not use diacritics), and the misplacement of diacritics is one of the most common errors in pt.wikipedia. As AWB is to be operated by people (not bots), I can think of two possible solutions:
Keep the feature (external links disable a Typo rule) but giving an Alert similar to 'sic tag/template - "Contains matching external links"
There is a related bug report here which I have verified in my sandbox - a typo-fixing rule is currently disabled if it matches an interwiki link. This looks dubious, as these links are in arbitrary non-English languages most of the time. I propose that an interwiki link should never cause a typo-fixing rule to be turned off. -- John of Reading (talk) 22:43, 7 December 2012 (UTC)
Some doubts - As above, as far as Portuguese is concerned, most names in interwikis to foreign languages do not use diacritics, even if the name is Portuguese. On the other hand, if the article is about a foreign subject, the existence of the interwiki may prevent the "correction" of a false positive. A balance must be struck, but I have no data on which to base an opinion.
rev 8834: the restriction that a typo rule is not applied if it matches a link target will now apply only to wikilinks, not image/interwiki/category links. Rjwilmsi19:17, 21 December 2012 (UTC)
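The behaviour after rev 8834 can be pictured with the following Python sketch. It is an illustration only, not AWB's implementation: the link regex, the prefix test (File, Image, Category, or a two- or three-letter language code) and both function names are assumptions.

```python
import re

# [[target]] or [[target|label]]; nested brackets are not handled in this sketch.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# Crude test for image, category and interwiki links (assumed prefixes).
NON_ARTICLE_PREFIX = re.compile(r"^:?(?:File|Image|Category|[a-z]{2,3}):", re.IGNORECASE)

def wikilink_targets(wikitext):
    """Return targets of ordinary wikilinks, ignoring image/interwiki/category links."""
    targets = (t.strip() for t in LINK.findall(wikitext))
    return [t for t in targets if not NON_ARTICLE_PREFIX.match(t)]

def rule_enabled(rule, wikitext):
    """A rule is switched off for the page only if it matches an ordinary wikilink target."""
    return not any(rule.search(t) for t in wikilink_targets(wikitext))
```

With the hypothetical foo to fob rule from above, [[abc foo def]] still disables the rule for the page, while [[File:abc foo def.jpg]], [[Category:foo]] and [[pt:foo]] no longer do.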
"full time" again
I undid my edit at 1960 Norwegian Football Cup and went to look at the rule to see how it could be tweaked. To my surprise, it's a very simple rule that does not try to distinguish "full time" used as a noun from "full-time" used as an adjective or adverb. My printed Concise Oxford makes this distinction, as does Collins (Onelook.com).
I have received a complaint from an Australian editor that "the Macquarie Dictionary allows for both with nee being listed first". Is this the death knell for our Typo rule? Christhe spelleryack12:13, 20 December 2012 (UTC)
No no! Trust me: the Macquarie is not a good guide in these matters. And even if it were an accurate index of substandard Australianisms, it wouldn't matter. Please, let's continue to prefer née, which is internationally accepted and understood. (What next, phenomena and criteria are accepted as singular forms, because some dictionaries list them as such?)
Major British and dictionaries prefer née. Rationally.
Is there any critical discussion on the Macquarie Dictionary, to back up your comment. It is not reflected in the article on the dictionary. Anyway, the bot does not appear to pick up all instances of nee. Paul foord (talk) 10:28, 30 December 2012 (UTC)
Paul, I don't know if there is any published critical discussion that makes the exact point that I make above. I do know that Wikipedia is for an international readership, and that internationally accepted forms are therefore preferable. Macquarie is based on a third-rate US dictionary, but it bends over backward to distinguish itself and to justify itself as dinky-di Australian. For example, over two decades it gave shockingly ignorant pronunciations of many foreign terms, apparently on the ground that they can indeed be heard here and there. Corrected in recent times, mercifully. NoeticaTea?23:50, 30 December 2012 (UTC)
Sounds like the Macquarie will never be good enough in your view - how is this different from cultural cringe? I also understood that en-au was an internationally accepted variety and accepted on Australian pages. -- Paul foord (talk) 20:59, 1 January 2013 (UTC)
Is there a dictionary for en-au you would recommend? -- 21:05, 1 January 2013 (UTC)
The one I use is the spellchecker for the browser: using English (Australian, American), nee is not flagged while née is flagged as misspelled, but using English (British), neither is flagged as misspelled. Apteva (talk) 19:49, 3 January 2013 (UTC)
There were only two cases of this particular misspelling in all of the millions of pages in en.wikipedia (both now fixed), and it is not a false positive, as it did not try to change a properly spelled word. It seems to me that it is not worth messing with the Typo rules to handle this extremely rare misspelling that might only pop up once every few years. The typo rules are not intended to catch every possible mistake, just somewhat common ones. As the first sentence of the Typo rules says, "These regular expressions find and fix common misspellings". At least the current rule brought the misspelling to your attention. Christhe spelleryack13:57, 29 December 2012 (UTC)
"A " or "a " → "An" or "an" when following vowel preceded by "[["
AWB typo fixing recognises "a" or "A" before a vowel and corrects it to "an" or "An". All good. However, it misses cases where the vowel is separated from the article by link syntax, e.g. "A [[iron...". An enhancement to recognise this would be good.
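One way such an enhancement could look is sketched below in Python. This is a guess at the shape of a rule, not anything AWB does: the regex, the short exception list for vowel letters with consonant sounds, and the function name are all assumptions, and a real rule would need much more false-positive research.

```python
import re

# "a"/"A", a space, then a wikilink with an optional piped label.
ARTICLE_LINK = re.compile(r"\b([Aa]) \[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def fix_article_before_link(wikitext):
    """Change 'a [[vowel...' to 'an [[vowel...' based on the text the reader sees."""
    def repl(m):
        # The piped label is what is displayed; fall back to the link target.
        visible = (m.group(3) or m.group(2)).lower()
        # Leading 'u' is skipped entirely, and a few consonant-sounding starts are excluded.
        if visible[0] in "aeio" and not visible.startswith(("eu", "one", "once")):
            return m.group(1) + "n" + m.group(0)[1:]
        return m.group(0)
    return ARTICLE_LINK.sub(repl, wikitext)
```

Checking the displayed label rather than the link target matters for piped links: "a [[Iron|metal]] bar" reads as "a metal bar" and should stay as it is.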
Instead of rules for "Wade–Giles" and "McCune–Reischauer" and more, should we have a rule that changes all "Foo-Bar" to "Foo–Bar"? Thanks! GoingBatty (talk) 00:39, 9 January 2013 (UTC)
You shouldn't have to allot a lot of time for this one
I've been encountering the typo "alot" a lot. The typo rule for "Allo-" under "Beginnings" changes it to "allot", which is incorrect for every instance I've come across. In general I don't like the unencyclopedic tone of "a lot" in articles, but just for typo fixing, the typo should be changed to "a lot" rather than "allot". MANdARAX•XAЯAbИAM03:49, 12 January 2013 (UTC)
Done - I changed the "Allo-" rule so it won't change "alot" to "allot". Instead, the "A lot" rule will change "alot" to "a lot". Thanks! GoingBatty (talk) 15:40, 12 January 2013 (UTC)
Wilma Doesnt
What she doesn't do I don't know, but she certainly doesn't have an apostrophe in her name. She's listed in 147 articles at present. Could a regex expert make her an exception to the apostrophe rule please. Thanks. An optimist on the run!22:59, 19 January 2013 (UTC)
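An exception of this kind is usually written as a negative lookbehind. The Python sketch below shows the idea on a simplified "doesnt" rule; the actual AWB apostrophe rule is not quoted here, so the pattern and function name are assumptions.

```python
import re

# Simplified stand-in for the apostrophe rule, skipping the surname in "Wilma Doesnt".
DOESNT = re.compile(r"(?<!\bWilma )\b([Dd])oesnt\b")

def fix_doesnt(text):
    """Add the missing apostrophe except in the name Wilma Doesnt (hypothetical helper)."""
    return DOESNT.sub(r"\1oesn't", text)
```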
You're right, Webster's says it can go either way. Anyone else care to comment? We usually allow variant spellings that are mentioned in any decent dictionary. Christhe spelleryack03:45, 22 January 2013 (UTC)
I see that, but it just provides a list of miscapitalized phrases. I don't see how that helps us decide whether "french fries" should be changed. Christhe spelleryack05:28, 22 January 2013 (UTC)
Each phrase listed is linked to a list of links to definitions in various dictionaries. This is in regard to your comment "We usually allow variant spellings that are mentioned in any decent dictionary."
Done, although someone may wish to consolidate the two "Massachusetts" rules. I am also fixing each page that contains "Massachusetss". GoingBatty (talk) 04:18, 23 January 2013 (UTC)
Nbsp
To avoid situations like this one, where the page had to be re-processed to add nbsp between the Kg and the number, the typo rules should be updated to include nbsp in the fix. (This problem just appeared because yesterday we moved typo fixing AFTER general fixes.) -- Magioladitis (talk) 12:35, 12 February 2013 (UTC)
I don't know whether this is the right place but... I noticed two users recently changing proprietorial to "proprietarial" in the same article using AWB. (These edits here and here) Is it something to do with AWB? AFAIK, "proprietarial" doesn't exist in any English variety. I only care because I watch the article in question and don't want to have to keep correcting this if AWB users are going to repeatedly make this edit. DeCausa (talk) 20:45, 18 February 2013 (UTC)
Hi! Even though English is not my native language, I found one misspelling: have/had/has/having bee or have/had/has/having bene → have/has/had/having been. Do you see any false positives? Because English is not my native language, I didn't add it; I'm only suggesting. Thanks. Matt S. (talk | cont. | cs) 14:10, 7 March 2013 (UTC)
Hi Matt - thanks for the suggestions! For the first rule, I'm concerned that there would be false positives, such as "having bee hives" (see Lake Isle of Innisfree). For the second rule, most of the matches are within quotations from hundreds of years ago. I'll manually fix the few instances of these instead of adding rules. Thanks for the suggestions! GoingBatty (talk) 04:32, 8 March 2013 (UTC)
This edit changed 1980's -> 1980s, however it didn't change 1990's -> 1990s. It does change 1990's if it is the only date change in an article. Problem if two date changes on the same line? Bgwhite (talk) 22:28, 10 February 2013 (UTC)
The "Decade apostrophe" typo rule is set up to look for the word "the". So in your example, it changed "the late 1980's" to "the late 1980s", but didn't touch "and early 1990's" because there is no "the". GoingBatty (talk) 23:02, 10 February 2013 (UTC)
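The effect described above can be reproduced with a simplified Python sketch. The real "Decade apostrophe" rule is not quoted in this thread, so the pattern below (requiring "the", optionally followed by early/mid/late) and the function name are assumptions.

```python
import re

# Only decades introduced by "the" are changed, so film-title uses like "In 1990's ..." survive.
DECADE = re.compile(r"\b([Tt]he (?:(?:early|mid|late) )?\d{3}0)'s\b")

def fix_decade_apostrophe(text):
    """Drop the apostrophe from decades that follow 'the' (hypothetical helper)."""
    return DECADE.sub(r"\1s", text)
```

Applied to "the late 1980's and early 1990's", only the first decade is changed, because the second has no "the" in front of it.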
Once again you witnessed why I went into math and computers instead of written/spoken professions. Thank you Batty. Bgwhite (talk) 05:51, 11 February 2013 (UTC)
"1990's" can be correct, such as "In 1990's Die Hard 2...", which is why the rule is designed to look for the word "the". Could you please provide an example where the typo rule is suggesting an invalid change? Thanks! GoingBatty (talk) 19:45, 24 March 2013 (UTC)
Fixed - Since List of Other Backward Classes is the only article in Wikipedia that contains "Polinativelama", I wrapped the word with {{not a typo}} so AWB won't try to change it again. In the future, could you please provide the article in question? It's helpful to run the article through AWB again (without saving the edit) to see which rule is making the change. GoingBatty (talk) 19:41, 24 March 2013 (UTC)
Enborne is not a typo
This. Twice. It's a waste of my time finding my watchlist clogged with unnecessary and inconsequential edits in violation of WP:AWB#Rules of use item 4, but when those go hand in hand with incorrect "typo" fixes, it just annoys the hell out of me. I shouldn't need to clean up bad AWB edits. I suppose that I now need to search for those instances of "Emborne" on pages which I don't have watchlisted, and check that they're not also bad "typo" fixes that should have been left as "Enborne". --Redrose64 (talk) 20:08, 22 March 2013 (UTC)
Done. I have changed the "Emb-" rule to avoid creating "Emborne". If you see other shortcomings in the rules, please bring them to our attention, but the sharp tone is not necessary. We are all volunteers here. BTW, the "Emborne" change does not violate item 4 of the "Rules of use"; it is not inconsequential. If you wish to complain to the person who approved and saved the changes that the Typo rule suggested, I think you are on solid ground, but please be kind. Christhe spelleryack14:57, 23 March 2013 (UTC)
</syntaxhighlight>
This would accomplish that all templates would start with an uppercase letter which although it does nothing to functionality, it is a pet peeve of mine and having the capital letter actually makes me notice the template quicker and more easily than without it. I know the mediawiki core really doesn't care, but there is no reason for templates to not match their respective page name titles. If this is considered to be unnecessary "fluff", which I kind of expect people to say, could AWB's general fixes at least be modified to capitalize templates it injects? Thank you. Technical 13 (talk) 16:35, 22 May 2013 (UTC)
I am actually going to start that discussion. I don't see any reason there can't be a per-user preference to capitalize templates or at least set the capitalization of templates that AWB injects/updates. I was just busy with RL stuff today, starting a new semester in school this week, and I had lots of related errands I needed to run. I'll likely open the discussion tomorrow. I'm personally opposed to using AWB solely to capitalize templates, but don't see it as an issue if people are there fixing multiple other things at the same time. I personally feel that it is the better option considering $wgCapitalLinks forces capital links anyway. Technical 13 (talk) 00:03, 24 May 2013 (UTC)
Humourous
The software says the word "humourous" should be changed to "humorous". On wikt:humourous it says it is "uncommon, nonstandard", and "Nowadays, this spelling is much less common than humorous, even in regions where the spelling humour is overwhelmingly preferred." Even though it is non-standard, it is still an alternate spelling. So should "humourous" be getting changed to "humorous"? Inks.LWC (talk) 03:14, 23 April 2013 (UTC)
Wiktionary entries should always be cross-checked against respected dictionaries. "humourous" isn't listed at all in my Concise Oxford or Shorter Oxford; it only has poor-quality hits at OneLook.com (compare humorous at OneLook.com). So I think the rule is correct; the word is so uncommon that it could distract readers who notice it. -- John of Reading (talk) 10:34, 23 April 2013 (UTC)
I apologize for making the revert, which I made in haste after doing a Google search to confirm that the word existed with that spelling. I think for Canadian spelling, we should go with the Hansard style guide (http://www.hansard.ca/styleguide.pdf), which says "humour, but humorous", although a quick search of actual Hansard entries will find both spellings common. -- Earl Andrew - talk13:26, 23 April 2013 (UTC)
AWB did this strange edit; I think it is based on some issue with the Regex template. I saved the edit as mere proof that AWB does it; the page is Dale and should hit for anyone else wanting to test the page. ChrisGualtieri (talk) 15:07, 18 June 2013 (UTC)
@ChrisGualtieri: - Saving a bad edit makes it harder to duplicate the error. Instead, as long as you report the page with the problem, someone else can run AWB to see which rule is causing the error and then fix the rule. Thanks! GoingBatty (talk) 22:33, 20 June 2013 (UTC)
When typo checking I noticed it was capitalising 'facebook' to 'Facebook'. The phrase was something like "...twitter and facebook..." to begin with but it only auto-corrected the facebook. Could this be added please — Preceding unsigned comment added by Jamesmcmahon0 (talk • contribs) 09:12, 19 July 2013 (UTC)
@Tikuko: - If you don't edit articles about ornithology, then I suggest making a find and replace rule in AWB for "twitter" to "Twitter", and always check your edits before saving. GoingBatty (talk) 18:36, 27 July 2013 (UTC)
mid- (New Zealand spelling)
Grutness (talk·contribs) posted on my talk page after I used the typo fixer to change "mid 1870s" to "mid-1870s". He says that the New Zealand convention is to not use the hyphen. Can anything be done to change AWB's behaviour on this, or should something like {{bots|deny=AWB}} be added to affected pages, although this would obviously exclude them from lots of other useful corrections? Jamesmcmahon0 (talk) 09:53, 19 July 2013 (UTC)
You are correct, {{bots|deny=AWB}} is too drastic; if a rule is good except in a handful of cases, the proper workaround is to mark them with {{Not a typo}}. But I think the best way forward in this case is for the rule to be disabled while we discuss it. I've done that, so if you restart AWB or reload the typo list (Ctrl+R) you won't make any more of these changes. The rule was added by GoingBatty (talk·contribs) [ping!]. -- John of Reading (talk) 10:29, 19 July 2013 (UTC)
Slight correction - it's optional in New Zealand, and you will find both forms. Generally though, the variety without the hyphen is far more common. So if it can't be fixed, it will be just a minor annoyance rather than anything drastic. Grutness...wha?11:19, 19 July 2013 (UTC)
If there are just a handful of cases, John's suggestion is correct, or consider wrapping "As written" templates around them. It's a rarely used alias of "Not a typo", but it seems well suited for a case where neither the presence nor the absence of a hyphen is wrong. Christhe spelleryack13:46, 19 July 2013 (UTC)
GoingBatty, If you look at the examples I posted on James's talk page you'll see they didn't include the Herald. They only included the NZ Government, the national library, the national encyclopedia... Grutness...wha?00:41, 20 July 2013 (UTC)
Well, as I said, it's optional and you'll see both forms. In general, though, you'll find the hyphenless form more often. Grutness...wha?01:21, 21 July 2013 (UTC)
I think many (perhaps most) Americans consider "publically" a marginally acceptable spelling, substandard at best. Not all American dictionaries accept it; I don't know why Merriam-Webster stooped so low in this case. Christhe spelleryack01:15, 10 June 2013 (UTC)
Has this actually been rectified? I've just found an instance of the use of "publically" here. The fact that proof readers at Merriam-Webster fell asleep on the job doesn't justify a blatant misspelling. The fact that the item is little more than blatant self-promotion is an issue unto itself. --Iryna Harpy (talk) 05:57, 22 July 2013 (UTC)
It's fine to correct "publically" in articles that are established as using British English; at this point we can't add a rule to change it in articles that use American English. Christhe spelleryack16:01, 22 July 2013 (UTC)
Since the 'double dollar' rule (by Chris the speller (talk·contribs)), which finds instances of '$100 million dollars' (and similar), was added, I have corrected many instances of it and have yet to see a false positive. Would it be possible to expand it or make a new rule to catch 'double pounds', i.e. instances of '£100 million pounds' etc.? Jamesmcmahon0 (talk) 17:19, 29 July 2013 (UTC)
I've been making my way through the 1500+ 'Double dollars' and 500+ 'Double pounds'. I just committed this edit; notice that in "They also donated $50 million in the $100 million dollar cost for the new 14-story" the rule missed the second 'dollar'. I assume this is because it's not plural. Is it possible to catch these without getting too many false positives? Jamesmcmahon0 (talk) 15:07, 2 August 2013 (UTC)
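The 'double dollar' rule itself is not quoted in this thread, so the following is only a rough Python sketch of the idea, not the actual AWB rule. The pattern, the list of magnitude words, and the name `fix_double_currency` are all assumptions made for illustration. It treats the singular "dollar" the same as the plural, which is the case reported as missed above.

```python
import re

# Hypothetical pattern (not the real AWB rule): an amount with an optional
# magnitude word, used after a currency symbol.
AMOUNT = r"\d[\d,.]*(?: (?:hundred|thousand|million|billion|trillion))?"

# "$... dollar(s)" or "£... pound(s)"; the symbol must agree with the word.
DOUBLE_CURRENCY = re.compile(
    r"(\$" + AMOUNT + r") dollars?\b|(£" + AMOUNT + r") pounds?\b"
)


def fix_double_currency(text: str) -> str:
    """Drop the redundant currency word that follows a symbol-prefixed amount."""
    return DOUBLE_CURRENCY.sub(lambda m: m.group(1) or m.group(2), text)
```

Matching the singular form also rewrites attributive uses such as "the $100 million dollar cost", so anyone trying this would still want to check each change before saving.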
In your examples, it is used substantively. If you visit http://www.onelook.com/?w=at+bat&ls=a, you will find a list of links to definitions. Some entries have the expression hyphenated.
Just to clear up any confusion, Epeefleche is responding to some edits I was making on my own initiative to try to address the awkward mix of "at bats" and "at-bats" that we often have in the same article. This isn't currently in the typo list. So far as I can see, the MLB official rules, MLB.com, the New York Times, Sports Illustrated, ESPN, The Associated Press, LA Times, Chicago Tribune, etc., all use only "at-bats" for the plural (though it can be either "at bat" or "at-bat" in its singular form); I was unable to find any contemporary publication that used "at bats". Epeefleche has stated that MLB publications of the '80s and '90s were less consistent in this, however, so I've agreed not to standardize any more of these. -- Khazar2 (talk) 22:47, 29 July 2013 (UTC)
Apologies -- I thought from the edit summary, which pointed to AWB, that it was listed as a typo in AWB. If not, then no matter. I pointed to the Official Baseball Rules (as codified and adopted by the Professional Baseball Official Playing Rules Committee), published by The Sporting News through at least 2005, and the Official Rules of Major League Baseball, published by the commissioner's office through Triumph Books in the late 1990s, which never used the hyphen, the Macmillan Baseball Encyclopedia (MLB's official encyclopedia through the 1990 edition) which always used "at bats", from the first edition (1969) through the final tenth (1995); Total Baseball (MLB's official encyclopedia beginning with the 1995 edition) used "at-bats" beginning with the first edition (1989), but dropped the hyphen beginning with the sixth edition (1997); the final seventh edition was in 1999; and the official American League Red Book and National League Green Book which did not use the hyphen at all (from the late 1940s) until the AL began using it in 1987; the NL book never used it. There is certainly an inconsistency across sources -- and even within some sources. But I can't see sufficient support for the notion, given the above, for asserting that "at bat" is incorrect ... though there is a discussion, to be sure, as to whether at-bat is "correct" as well.--Epeefleche (talk) 00:02, 31 July 2013 (UTC)
I found only one other similar misspelling, and have fixed it. Considering the rarity and the fact that the rule did not harm a correctly spelled word, I don't think there is much to be gained by messing with the rule. It brought a misspelling to your attention, and that is a point in its favor. Christhe spelleryack13:39, 7 August 2013 (UTC)
This is valid in Canadian English according to Wiktionary. At least, I can confirm that while I was fixing this word back in March, I found it in many Canadian articles, and took care not to change it. Any other opinions out there? -- John of Reading (talk) 20:14, 14 September 2013 (UTC)
The online version is dated June 2012. Print version was last updated in 1989. Both "seafaring" and "sea-faring" are given in the OED. The hyphen isn't a typo. DrKiernan (talk) 20:22, 19 September 2013 (UTC)
I have to take your word for it. Most folks in the US do not have free access to either online or print OED. Even my neighborhood library has only a Compact OED in print. I consider your request to remove the Typo rule satisfied, since you removed it yourself. Happy editing! Christhe spelleryack20:39, 19 September 2013 (UTC)
Unspace em dashes
Per MOS:EMDASH:
some words — some words → some words—some words.
I think it would always be correct to change a spaced em dash to an unspaced em dash. Then if another editor thought the spaced en dash looked better, they could always change it. That would be the end of the matter. Christhe spelleryack02:05, 22 October 2013 (UTC)
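As a minimal sketch of the proposal (in Python rather than as an AWB rule, and with the hypothetical name `unspace_em_dashes`), spaces and non-breaking spaces on either side of an em dash could be removed like this. It makes no attempt to skip attribution lines that begin with a dash, which is an obvious source of false positives.

```python
import re

# An em dash with ordinary or non-breaking spaces on at least one side.
SPACED_EM_DASH = re.compile(r"[ \u00a0]+—[ \u00a0]*|—[ \u00a0]+")


def unspace_em_dashes(text: str) -> str:
    """Replace a spaced em dash with an unspaced one."""
    return SPACED_EM_DASH.sub("—", text)
```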
I see your point! I suspect these words are probably not that frequently used at this time, but may become more frequent in the future. --Danrok (talk) 03:36, 28 October 2013 (UTC)
qualy → qualification
There's this unpleasant habit of abbreviating qualification as qualy, like for instance here or here. AWB tries to correct it to qually, which is obviously false. Could someone please add a rule that replaces it with qualification? --bender235 (talk) 17:30, 31 October 2013 (UTC)
Agree, "qualy" is also used in motorsport. Probably should not be used in text on wp except in quotes as it's slang language, but also not something I think we can/should deal with in typo rules. Rjwilmsi08:24, 1 November 2013 (UTC)
Don't have AWB in this PC, but I wonder if changing the rule from "\b([A-Za-z]+)sih..." to "\b([A-Za-z]+)i?sih..." will fix it. GoingBatty (talk) 17:35, 31 October 2013 (UTC)
Just for info. It's Ballyhealy not Ballyheally as incorrectly changed here. Not sure which rule (not much time to check), but likely the "-ally (2)" rule. Regards, Sun Creator(talk)14:00, 1 November 2013 (UTC)
My search for now days (instead of nowadays) reported 115,885 results. I suggest the inclusion of wikt:nowadays in the edit summaries. That is for a convenient reference, and not because of any special reliability of Wiktionary.
—Wavelength (talk) 17:16, 3 November 2013 (UTC)
However a search for the phrase "now days" (with the quotation marks) reported only 87 results. I expanded the existing "Nowadays" rule to also fix "now days" and "now-days". Feel free to use whatever edit summary you wish when fixing these. GoingBatty (talk) 20:46, 3 November 2013 (UTC)
Thank you for reminding me about the quotation marks, and for expanding the rule.
I've been gradually cleaning out some "of the of the"s lately (e.g., [28], [29]), but there seem to be enough left to justify adding it as a regular expression. I've cleaned up 100 of these or so without encountering any false positives for the rule. Would someone be willing to add it? -- Khazar2 (talk) 12:25, 5 November 2013 (UTC)
Done. Rule "of xxx of xxx" also fixes 'of his of his' and anything else. I happened to hit a bunch like this last week, including about 50 'for the for the'. For now, let's see how this rule goes. Christhe spelleryack17:40, 5 November 2013 (UTC)
Anyone who's brave enough to run a general rule for these cases can use:
find "\b([a-z]+) ([a-z]+) \1 \2\b"
replace "$1 $2"
The only false positive I have found so far is "calling a spade a spade", but there are probably many others, so this is not a good candidate for a Typo rule. Christhe spelleryack17:54, 5 November 2013 (UTC)
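For anyone who wants to experiment outside AWB, the find/replace pair above translates directly to Python, where the replacement uses `\1 \2` instead of `$1 $2`. The function name `collapse_repeated_pair` is made up for this sketch. It reproduces both the intended fix and the "spade" false positive.

```python
import re

# Same pattern as the find rule above: two lowercase words, then the same
# two words again.
REPEATED_PAIR = re.compile(r"\b([a-z]+) ([a-z]+) \1 \2\b")


def collapse_repeated_pair(text: str) -> str:
    """Collapse an immediately repeated two-word sequence into one copy."""
    return REPEATED_PAIR.sub(r"\1 \2", text)
```

Because the pattern has no notion of meaning, idioms built on repetition are damaged in exactly the same way as genuine slips.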
By coincidence I'm currently working on a list from a database scan for a regex very like that one. I'm skipping about 60% of the list and saving only 40%. Some of the 60% are articles where the problem has already been fixed, since my database dump is from September. But consider phrases such as "arm in arm", "side by side", "smaller and smaller", "back to back" - all these are used in contexts where the preposition occurs again just before or after the phrase. Still, if you look at my last few hundred contributions you will see the fixes. -- John of Reading (talk) 18:17, 5 November 2013 (UTC)
The article "Repetition (rhetorical device)" might be helpful. I found that page and others like it by doing a Google search for phrases with repeated words.
I try to keep on top of Loosing - Losing among the ones I search for, but there's one subset that I think could go to AWB as it doesn't get false positives. "loosing on penalties" should always be "losing on penalties". ϢereSpielChequers19:17, 11 November 2013 (UTC)
Hi Bill, since you seem to have found the only instance of "espicialy" in Wikipedia, I don't think there would be a case. Happy editing! GoingBatty (talk) 20:38, 11 November 2013 (UTC)
Yes I noticed it's rare on Wikipedia, but I do see it quite frequently everywhere else, which prompted me to search for it here in the first place. But alright. -- Ϫ01:27, 14 November 2013 (UTC)
Is a proactive approach to typo fixing not worth it? Is there a cost to adding new rules? Just curious. -- Ϫ01:29, 14 November 2013 (UTC)
Do not change "Lachlan Nieboer" to "Lachlan Neighbor". This is a name. Please make an opt-out or add this to a white/blacklist(?) -(t) Josve05a (c) 23:04, 23 November 2013 (UTC)
Films and songs should be inside italics or quotation marks (and wikilinked, if possible), so typo fixing shouldn't change them. GoingBatty (talk) 16:34, 27 November 2013 (UTC)
@GoingBatty: would you be able to add the diacritics rule(s) to change Fiance to Fiancé and Fiancee to Fiancée please. I've got better at RegEx but definitely don't trust myself to add a typo rule yet! Jamesmcmahon0 (talk) 12:06, 2 December 2013 (UTC)
@Jamesmcmahon0: - I'm currently travelling without AWB access, so I'd like to wait a few days until I get back to AWB. I'll be happy to add this rule, but I want to immediately see whether "fiance" should always be changed to "fiancé", or if some should be changed to "finance". However, if another editor wants to do this, please go ahead without me. GoingBatty (talk) 05:19, 3 December 2013 (UTC)
WP:HYPHEN (sub-subsection 3, point 3) says the following.
Many compounds that are hyphenated when used attributively (adjectives before the nouns they qualify: a light-blue handbag, a 34-year-old woman) or substantively (as a noun: she is a 34-year-old), are usually not hyphenated when used predicatively (descriptive phrase separated from the noun: the handbag was light blue, the woman is 34 years old). Where there would otherwise be a loss of clarity, a hyphen may optionally be used in the predicative usage as well (hand-fed turkeys, the turkeys were hand-fed).
When "year" and "old" are modified by the word "one" or the figure "1", then a semantic understanding of the context is necessary for deciding whether hyphens are required: "one year old" or "one-year-old". Otherwise, plural numbers (as words or as figures) with the singular form "year" indicate that hyphens are required: "244-year-old" and "ninety-nine-year-old".
Presumably, the author has not absent-mindedly omitted the plural suffix "s" from places where it should be, and has not followed the pattern of some foreign languages, such as Russian, where numbers ending in the digit "1" are used with singular nouns: "двадцать один год", where "двадцать один" means "twenty-one" and "год" means "year" (singular). (See http://www.russianlessons.net/lessons/lesson11_main.php and http://learnrussian.rt.com/speak-russian/russian-numbers.)
We need to be careful about omitting or adding a space in multiple-digit numbers: "25-year-old" for "2 5-year-old" or vice versa, or "480-year-old" for "4 80-year-old", or vice versa. Also, we need to avoid confusion among "twenty-one year-old" and "twenty-one-year-old" and "twenty one-year-old", or among "five hundred-year-old" and "five-hundred-year-old" and "five hundred year-old".
Please add a rule that would find numbers (except "one" and "1") followed by "year old", and insert the missing hyphens, whether the numbers are expressed as words or as figures. Please include all multiple-digit numbers ending in "1" or "one", for example, "21", "twenty-one", "321", and "eight hundred forty-one". Occurrences of "one year old" and "1 year old" would have to be checked in a different process. Mentioning "WP:HYPHEN (sub-subsection 3, point 3)" in edit summaries would be helpful.
(All of the previous examples are possible for a context about trees in a park, or buildings in a community.)
Also, if the plural suffix "s" is attached to the word "old" ("five-year-olds"), then the expression is used substantively, and even expressions with "one" or "1" have the plural suffix "s" and should have hyphens. Please do include "one" and "1" in hyphenating these expressions.
(These expressions usually refer to people, but could also refer to animals.)
—Wavelength (talk) 03:36, 2 December 2013 (UTC) and 03:42, 2 December 2013 (UTC) and 05:49, 2 December 2013 (UTC) and 06:19, 2 December 2013 (UTC) and 16:21, 2 December 2013 (UTC)
This rule will do much of what you request, and will find very few false positives:
Find: " year old(s?)\b(?<=\b(?:\d+|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en) year olds?)(?<!\b1 year olds?)"
Replace: "-year-old$1"
To fix "a group of one year olds", a second rule would be needed.
This rule will not change "a group of twenty one year olds", and I'm glad it won't.
It will not fix all cases of spelled-out numbers that are higher than 10, but WP:MOSNUM recommends those that require more than two words be expressed as numerals. I change "eight hundred forty-one year old" to "841-year-old" when I run across these. Teens and "-ty"s like "fifteen", "Thirty", etc. could be added to this rule, but not "thirty-one". Well, not easily.
Mentioning "WP:HYPHEN (sub-subsection 3, point 3)" in edit summaries is not doable through the Typo list, and would have to be specified by an AWB user who has selected a list of articles whose main shortcoming is this lack of hyphenation. Not sure how I would do that.
Thank you for your reply. What do you think of this rule (modified from the one presented above)?
Find: " year old(s?)\b(?<=\b(?:\d+|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en|[Hh]undred|[Tt]housand|[Mm]illion||[Bb]illion|[ 0123456789][ 0123456789][0123456789]) year old[ s]?)(?<!\b1 year old[ s]?)"
There is a double vertical bar before "[Bb]illion", and that matches everything, which is not what we want. I'm not sure what you intend "[ 0123456789][ 0123456789][0123456789]" to do; it allows "he was 1 year old when" to be changed to "he was 1-year-old when" (note extra spaces).
The construction with "old[ s]?)" causes it to miss "a six year old, well-fed boy".
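The point about the double vertical bar can be checked in isolation. The sketch below uses Python and a cut-down alternation (both are assumptions made for illustration, not the full rule) to show that an empty branch lets the group match nothing at all, so any word before " year old" satisfies it.

```python
import re

# Cut-down alternations for illustration only. The first has the accidental
# empty branch created by "||"; the second does not.
WITH_EMPTY_BRANCH = re.compile(r"\b(?:[Mm]illion||[Bb]illion) year old")
WITHOUT_EMPTY_BRANCH = re.compile(r"\b(?:[Mm]illion|[Bb]illion) year old")

sample = "he was a year old when"

# The empty branch matches zero characters, so " year old" is accepted
# after any word at all.
matches_with_typo = WITH_EMPTY_BRANCH.search(sample) is not None
matches_when_fixed = WITHOUT_EMPTY_BRANCH.search(sample) is not None
```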
How about this:
Find: " year old(s?)\b(?<=\b(?:\d+|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en|[Ee]leven|[Tt]welve|[A-Za-z][a-z]+teen|[Tt]wenty|[Tt]hirty|[Ff]orty|[Ff]ifty|[Ss]ixty|[Ss]eventy|[Ee]ighty|[Nn]inety|[Hh]undred|[Tt]housand|[MmBb]illion) year olds?)(?<!\b1 year olds?)"
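The rule above depends on a variable-length lookbehind, which Python's standard `re` module does not support. As a rough equivalent for testing the behaviour outside AWB, the number can be captured in front of " year old" instead. This is a sketch under that assumption, and the names `NUMBER`, `YEAR_OLD` and `hyphenate_year_old` are invented here.

```python
import re

# The same alternatives as in the rule above, moved out of the lookbehind.
NUMBER = (
    r"\d+|[Tt]wo|[Tt]hree|[Ff]our|[Ff]ive|[Ss]ix|[Ss]even|[Ee]ight|[Nn]ine|[Tt]en"
    r"|[Ee]leven|[Tt]welve|[A-Za-z][a-z]+teen|[Tt]wenty|[Tt]hirty|[Ff]orty"
    r"|[Ff]ifty|[Ss]ixty|[Ss]eventy|[Ee]ighty|[Nn]inety|[Hh]undred|[Tt]housand"
    r"|[MmBb]illion"
)

# (?!1 ) stands in for the negative lookbehind that excludes a bare "1".
YEAR_OLD = re.compile(r"\b(?!1 )(" + NUMBER + r") year old(s?)\b")


def hyphenate_year_old(text: str) -> str:
    """Insert the hyphens in '<number> year old(s)'."""
    return YEAR_OLD.sub(r"\1-year-old\2", text)
```

As with the original, "one" is deliberately absent from the list, so "a group of twenty one year olds" is left alone.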
That looks good, as far as I can tell. The double vertical bar in my attempted rule was a typographical error. By "[ 0123456789][ 0123456789][0123456789]", I was hoping to accommodate numbers in figures from 1 to 999, but my understanding of the coding is very rudimentary, and I am not sure about how to manage null quantities in leading "hundreds" positions and "tens" positions. Maybe it should be "[ 123456789][ 0123456789][0123456789]". Also, I do not know how to make it exclude "1" itself, in the case of predicative expressions, and also "0" itself. Also, maybe it would be simpler to have one rule for attributive and substantive expressions, where "one" and "1" are included, and another rule for predicative expressions, where "one" and "1" are excluded. Incidentally, where can I best learn the coding?
In retrospect, I realize that perhaps your rules (the first and the second) are intended to apply to both "old" and "olds", for attributive and substantive expressions having numbers greater than "one". Is that the case? In that case, a second rule would be needed for substantive expressions using "one", such as "a group of one year olds", as you indicated in your first reply. (Maybe a diagram would help me to keep my thoughts organized.)
This is more complex than I visualized when I started this discussion. Maybe I will study it more thoroughly in the future, and start another discussion.
"also know as " and "also knows as " can both be added to AWB as typos of "also known as ". I've fixed enough over the last year or so manually to be confident it would be a good test for AWB. ϢereSpielChequers22:34, 11 December 2013 (UTC)
Since the Latin phrase "et al." is often italicized (et al.), RegExTypoFix should fix the incorrectly punctuated, non-italicized version (et al→et al.) or the incorrectly punctuated, italicized version (''et al''→''et al''., which yields "et al.") but should ignore the correctly punctuated, italicized version (''et al''.) so that it's not creating incorrect double punctuation (''et al''.→''et al.''., which yields "et al.."). Ninjatacoshell (talk) 16:17, 13 December 2013 (UTC)
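The logic described here can be sketched in Python as follows. This is not the RegExTypoFix rule, only an illustration with a hypothetical function name. It adds the full stop after plain or italicised "et al" and leaves every form that already has one untouched.

```python
import re

# "et al", optionally followed by the closing '' of wiki italics, as long as
# no full stop follows. The second lookahead stops the engine from matching
# only the part before the '' when a full stop comes after the markup.
ET_AL = re.compile(r"\bet al\b((?:'')?)(?!\.)(?!'')")


def fix_et_al(text: str) -> str:
    """Add the missing full stop after 'et al' without doubling punctuation."""
    return ET_AL.sub(r"et al\1.", text)
```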
@Ninjatacoshell: - You're right about the logic. Could you please give an example of an article where an incorrect fix occurs, and which tool suggests the incorrect fix? (e.g. AWB, WPCleaner, wikEd). Thanks! GoingBatty (talk) 18:24, 13 December 2013 (UTC)
Would it be possible to change 1,2,3,...9 to one, two,... , nine? As per MOS:NUMERAL. Maybe by looking for plurals such as 2 things? There's probably a couple of other tricks to get low false positives...
Also changing pronounceable fractions such as 1⁄4 to written words such as: 1/4 yd to a quarter of a yd Jamesmcmahon0 (talk) 13:08, 18 December 2013 (UTC)
This would be quite challenging - I wonder if there would be false positives such as "January 2 concerts" to "January two concerts". Also MOS:NUMERAL states:
"there are frequent exceptions to these rules."
"Comparable quantities should be all spelled out or all figures: we may write either 5 cats and 32 dogs or five cats and thirty-two dogs, not five cats and 32 dogs."
"Common fractions for which the numerator and denominator can be expressed in one word are usually spelled out, e.g. a two-thirds majority; use figures if they occur with an abbreviated unit, e.g. 1⁄4 yd and not a quarter of a yd."
Please don't change Xi'an University of Architecture and Technology to Xi'a University of Architecture and Technology. (t) Josve05a (c) 18:55, 26 December 2013 (UTC)
On Winner-take-all AWB wants to change * L. Itti, C. Koch and E. Niebur[...] to * L. Itti, C. Koch and E. Neighbour[...]. Maybe add this page to a whitelist or something, since making a change in the RegEx for this case might be difficult. (t) Josve05a (c) 01:19, 27 December 2013 (UTC)
@Josve05a: One option is that you could put the reference inside a citation template, which would prevent AWB from "fixing" the typo. (It might also help to get the references in one consistent style.) GoingBatty (talk) 01:28, 27 December 2013 (UTC)
I just went through the 70 or so articles that contained her/his confident. While most were correct, I did change 12 to her/his confidant and only one to her/his confidence. GoingBatty (talk) 20:06, 27 December 2013 (UTC)
Thanks! I know that I can use {{Not a typo}}, but I feel more confident if it gets added here so that no other page with it in might get changed by mistake.
LanguageTool looks like something interesting to play with, although you need to review your changes carefully before saving to ensure you're not changing text inside quotes. If you want to add it to your Tools menu, you can add this to your Custom JavaScript file:
// Add a LanguageTool launcher to the toolbox on the left
addOnloadHook(function () {
    addPortletLink(
        "p-tb",
        "http://community.languagetool.org/wikiCheck/index?url=" + wgPageName,
        "LanguageTool"
    );
});
After further consideration, I'm going to change the rules to be "several different" → "several" and "many different" → "many". GoingBatty (talk) 17:28, 29 December 2013 (UTC)
Sorry I didn't make it clear that I understood your request, implemented your request, and then changed it. It's all been disabled anyway, based on the conversation below. GoingBatty (talk) 02:32, 30 December 2013 (UTC)
Can I suggest removing "many different"/"several different", which aren't synonyms of "many" and "several". ("The warehouse contains many books" and "the warehouse contains many different books" have very different meanings.) I foresee a lot of bad feeling arising from the false positives this rule will generate. Mogism (talk) 17:44, 29 December 2013 (UTC)
Per my comment on GoingBatty's talk, I would consider the other three rules ("with the exception of", "so as to", "as to whether") all to be correct usage in British English (and probably in derivative versions such as Indian, Australian etc, although I can't say for certain), and in at least a good proportion of cases the suggested alternatives appear inappropriately informal in British English use. As a test I've just dropped five random British-topic FA's (Great Fire of London, Royal Assent, Queen Victoria, Brill Tramway, William Shakespeare) into LanguageTool and in three it's found at least one of these "errors" - while FAs aren't perfect, they've all been through review processes by multiple editors who are normally very picky about spelling and grammar, none of whom have flagged "so as to" etc as an issue. If the grammar changes do go ahead, can there be a way to opt-out of them and just apply the typo-fix list rather than the full list - I'd estimate that with these rules in place my false-positive rate has gone from 5-10% to around 90%. (And all this is aside from the backlash that will ensue from people having their text flagged as a "typo".) Mogism (talk) 18:34, 29 December 2013 (UTC)
New Testamant --> New Testament
Something to add to the code? Changing "New Testamant" --> "New Testament"? I have seen this spelling a few times when reading pages, like this one. (t) Josve05a (c) 20:20, 29 December 2013 (UTC)
Since the currency mark should be put at the beginning of the number, I suggest that this be added to the RegEx. (Change 45£ to £45). (t) Josve05a (c) 12:13, 29 December 2013 (UTC)
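A possible shape for such a rule, sketched in Python, is below. The pattern, the choice to cover only "£" and "$", and the name `move_currency_symbol` are all assumptions. It has not been checked against real articles, where a symbol sitting between two numbers could easily be ambiguous.

```python
import re

# A number followed (with an optional space) by a currency symbol.
# The lookbehind avoids starting inside a longer number or after a symbol;
# the lookahead skips a symbol that already belongs to the next number.
TRAILING_SYMBOL = re.compile(r"(?<![\d.,£$])(\d+(?:[.,]\d+)*) ?([£$])(?!\d)")


def move_currency_symbol(text: str) -> str:
    """Move a trailing currency symbol in front of its number."""
    return TRAILING_SYMBOL.sub(r"\2\1", text)
```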
I've added wikilinks to the seven articles which mentioned Ursula Oppens without linking to her article. That will stop the typo-fixer damaging those articles. -- John of Reading (talk) 21:17, 2 January 2014 (UTC)
@Josve05a: The general fixes says "Removes ordinals from full dates per WP:DATESNO; does not alter on the 3rd November 2008 (i.e. the plus ordinal) to avoid introducing bad grammar", which is why this didn't change. We don't want to introduce any bad grammar via typo rules either. GoingBatty (talk) 15:25, 30 December 2013 (UTC)
@GoingBatty: I can't think of an example of how applying WP:DATE standards to "on the 3rd November 2008" can cause bad grammar; I fix these all the time with AWB. "Wilbur hit the lottery on the 3rd November 2008 and quit his job" should be changed to "Wilbur hit the lottery on 3 November 2008 and quit his job", right? Where's the problem? Christhe spelleryack18:30, 30 December 2013 (UTC)
That correction could fail if there's another noun straight after the date. "Wilbur hit the mainmast on the 3rd November 2008 ferry and sank the boat". Since "the" goes with "ferry" here, it mustn't be removed. A contrived example, of course. -- John of Reading (talk) 21:25, 2 January 2014 (UTC)
AWB changes 5th of july (with lower case 'j') to 5th of July (with upper case 'J'). It then takes a second run to change 5th of July (with upper case 'J') to 5 July (with upper case 'J'). See this edit- (t) Josve05a (c) 04:29, 4 January 2014 (UTC)
A set of rules such as the following can fix the format and the capitalization at the same time; this rule was built with the restriction that the date must be preceded by "on". This rule fixes 3 months; another rule could fix March and May, one could fix April and August, and one rule would be needed for each of February, September, October, November and December. This rule also removes a comma after the month. But it would damage the "ferry" example that John of Reading has provided. Such cases, if found, could be placed inside a "Not a typo" template, but I don't think this kind of rule is ready for general rollout. Many AWB users could manage to be careful enough, but there would be some slip-ups and subsequent complaints.
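The rule referred to is not reproduced on this page, so what follows is only a Python sketch of the same idea and not the rule itself. Unlike a plain find-and-replace rule, a Python replacement function can capitalise the month, so one pattern covers all twelve months here; the names and the exact pattern are assumptions. The second usage example shows the "ferry" damage described above.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|August"
    "|September|October|November|December"
)

# "on the 5th (of) july(,)" followed by a four-digit year. Only the month
# is matched case-insensitively, via the scoped (?i:...) group.
ORDINAL_DATE = re.compile(
    r"\b([Oo]n) the (\d{1,2})(?:st|nd|rd|th)(?: of)? ((?i:" + MONTHS + r")),?(?= \d{4}\b)"
)


def fix_ordinal_date(text: str) -> str:
    """Rewrite 'on the 5th of july 2013' as 'on 5 July 2013'."""
    return ORDINAL_DATE.sub(
        lambda m: f"{m.group(1)} {m.group(2)} {m.group(3).capitalize()}", text
    )


fixed = fix_ordinal_date("See the concert on the 5th of july 2013.")
damaged = fix_ordinal_date(
    "Wilbur hit the mainmast on the 3rd November 2008 ferry and sank the boat"
)
```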
@NicoV: Done. It's unusual to have two rules with the same name. OK by me if someone wants to change one of the names. Remember, "Arose" by any other name would smell as sweet. Christhe spelleryack23:22, 16 January 2014 (UTC)
Mogism, I was 'about' to make that change, but didn't save. I post these kinds of false results here instead of using {{not a typo}} because I feel more secure getting a second opinion and I don't know how many pages have the same string of words. (t) Josve05a (c)02:04, 18 January 2014 (UTC)
@Josve05a: If you search for "Javier Inocente Pérez Torres" (with the quotation marks), you'll see that Javier Valcárcel is the only page with that string of words. However, there are over 400 pages with "Inocente", so maybe someone wants to update the "Inn-" rule to exclude it? GoingBatty (talk) 01:50, 19 January 2014 (UTC)
signed a contact
I've changed a few "signed a contact" to " signed a contract ", you can have the rest and any future ones for AWB. ϢereSpielChequers07:10, 18 January 2014 (UTC)
This allows up to 4 intervening words, and, though I have yet to see a false positive, I think it is better for a few brave (and attentive) souls to run this as their own F&R rule, but not as a Typo rule. Really brave souls may easily expand the rule to fix cases with 5 or 6 intervening words. I was inspired by the possibilities provided by the new CirrusSearch back end for the Special:search page:
The "~2" after the target specifies up to 2 intervening words, and this can be expanded as needed. AWB does not have this capability (yet?), so a text search for AWB has to be very wide, like the words "signed" and "contact" (without the quotation marks, just the two words). If you also put the Find rule in the Skip tab in the "Doesn't contain:" box, checking "Regex" and "Case sensitive" boxes, you will speed up the processing by very quickly skipping pages where the two words don't appear that close to each other. Christhe spelleryack02:42, 20 January 2014 (UTC)
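The find rule under discussion is not shown on this page. Purely as an illustration of the "up to 4 intervening words" idea, a Python sketch might look like the following, where the pattern and the name `fix_signed_contact` are assumptions rather than the rule that was actually proposed.

```python
import re

# "signed", then zero to four intervening words, then "contact".
# " +" tolerates runs of more than one space between words.
SIGNED_CONTACT = re.compile(r"\b(signed(?: +[\w'-]+){0,4} +)contact\b")


def fix_signed_contact(text: str) -> str:
    """Change 'contact' to 'contract' when it closely follows 'signed'."""
    return SIGNED_CONTACT.sub(r"\1contract", text)
```

Raising the `{0,4}` bound widens the net in the same way as increasing the number after "~" in the search.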
@Chris the speller: You might want to tweak the rule to include cases with multiple spaces between words:
Greetings, I was using AWB to fix some typos and formatting over at Wikia military and found a possible error. Just wanted to let you know. here is an example of the one I ran into. Reguyla (talk) 18:58, 22 January 2014 (UTC)
@Reguyla: Thanks for posting here, but I don't think we can do much about this one. There's a soft hyphen hiding inside the word, so the typo fixer thought it was working on two words "respon" and "sible". -- John of Reading (talk) 19:06, 22 January 2014 (UTC)
Oh ok, thanks. I wonder if it would be better to not have a typo check for that word then. It's a pretty common word so I could see this being a common problem. I wonder how many have already been changed. Reguyla (talk) 19:14, 22 January 2014 (UTC)
@Reguyla: Could you please elaborate on how often the hidden soft hyphens are used? I changed the three instances of "responsible" that contained a hidden soft hyphen to plain "responsible" on the English Wikipedia. Thanks! GoingBatty (talk) 23:10, 22 January 2014 (UTC)
I honestly don't know. I've seen it a half dozen times at Military and a couple times in other wikis at Wikia. I'm not sure exactly how many though and I honestly don't even know how to find out. Reguyla (talk) 23:17, 25 January 2014 (UTC)
We could replace all of the regexps' \bs at the ends of words with (?:\b|\u00AD). Or the AWB parser could do it automagically (as a code change). -- JHunterJ (talk) 11:47, 26 January 2014 (UTC)
@JHunterJ: That wouldn't be enough; it would have to be an AWB code change. Otherwise we'd have to adjust every letter of every regex, so that, for example, "Establishement" was still corrected to "Establishment" even though there were soft hyphens at arbitrary points within the word. Very messy. -- John of Reading (talk) 07:29, 27 January 2014 (UTC)
Do we know how many pages (on en-wiki) contain these soft hyphens? And how many pages legitimately need these soft hyphens (suppose a few pages on Unicode characters etc.)? We may be able to do a cleanup / make it a CHECKWIKI error. Rjwilmsi08:13, 27 January 2014 (UTC)
@John of Reading: Well, it wouldn't be enough to make sure that we changed all of the possible misspellings, true. But we're missing possible misspellings now. It would be enough to make sure that we avoid "fixing" things that aren't the misspellings we have regexps for, such as the respon-sible example. And that's one of the precepts of AWB/T, that we don't break correct words, even if that means we aren't able to fix all incorrect words. -- JHunterJ (talk) 11:00, 27 January 2014 (UTC)
Oh, I see. If I've understood you correctly, for that you'd need something like (?!\u00AD)\b [not tested], to peek ahead at the word delimiter and make sure it wasn't a soft hyphen. "Change respon to respond unless there's a soft hyphen coming up" -- John of Reading (talk) 11:08, 27 January 2014 (UTC)
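A minimal sketch of the look-ahead idea above, for anyone who wants to experiment. It uses Python's `re` module purely for illustration (AWB itself runs .NET regular expressions), and the `respon` rule is a hypothetical stand-in rather than an actual AWB/T rule. The assumption being tested is that a soft hyphen (U+00AD) is not a word character, so `\b` fires next to it.

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible in most renderers

# Hypothetical rule: fix the stand-alone word "respon".
NAIVE = re.compile(r"\brespon\b")
# Same rule, but refuse to match when a soft hyphen touches either edge.
GUARDED = re.compile(
    "(?<!" + SOFT_HYPHEN + r")\brespon\b(?!" + SOFT_HYPHEN + ")"
)


def fix(pattern, text):
    """Apply the hypothetical respon -> respond correction."""
    return pattern.sub("respond", text)


hyphenated = "respon" + SOFT_HYPHEN + "sible"
print(repr(fix(NAIVE, hyphenated)))    # the naive rule damages the word
print(repr(fix(GUARDED, hyphenated)))  # the guarded rule leaves it alone
```

As noted in the thread, this only stops a rule from breaking a correct word; it does not help a rule find a misspelling that has a soft hyphen in the middle of it.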
Or Java could fix (change) their regexp parser to recognize that a soft hyphen isn't a word boundary. Probably soft hyphen should be included in the \w word character set. :-) -- JHunterJ (talk) 11:15, 27 January 2014 (UTC)
@Rjwilmsi: A different idea - could "HideMore" be taught to hide any word containing an embedded soft hyphen? Then these words would be exempt from typo-fixing without us having to change any regexps. -- John of Reading (talk) 07:51, 28 January 2014 (UTC)
Yes, should be doable, much better than trying to change every regex. Though that's prevention rather than a cure isn't it: still seems to me that we should clarify at WP:MOS whether soft hyphen is allowed/encouraged/disallowed, as there may still be a need for some cleanup? Rjwilmsi10:44, 28 January 2014 (UTC)
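The "hide" approach could be sketched roughly as below. This is not AWB's HideMore code, only an illustration in Python under the assumption that hiding means swapping each affected word for a placeholder before the rules run and restoring it afterwards. The function name, the placeholder format and the `respon` rule are all hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"
# A run of word characters that contains at least one embedded soft hyphen.
WORD_WITH_SOFT_HYPHEN = re.compile(r"\w+(?:" + SOFT_HYPHEN + r"\w+)+")
# Placeholder format: NUL, index into the hidden list, NUL.
PLACEHOLDER = re.compile("\x00(\\d+)\x00")


def fix_typos(text, rules):
    """Apply (pattern, replacement) rules, skipping soft-hyphenated words."""
    hidden = []

    def hide(match):
        hidden.append(match.group(0))
        return "\x00" + str(len(hidden) - 1) + "\x00"

    masked = WORD_WITH_SOFT_HYPHEN.sub(hide, text)
    for pattern, replacement in rules:
        masked = re.sub(pattern, replacement, masked)
    # Put the hidden words back exactly as they were.
    return PLACEHOLDER.sub(lambda m: hidden[int(m.group(1))], masked)
```

With this shape none of the existing regexes need to change, which is the advantage raised above.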
Just a pointer. This is a little different than what you guys do, but it overlaps. (Also, I've been an admirer of your work here for years, though I don't use AWB myself.) - Dank (push to talk) 19:36, 31 January 2014 (UTC)
Indian rubber
"india rubber" and "india-rubber" should be exceptions to the cap fix india → India. (Cap'n varies, but it's not a typo, and l.c. is found in the OED.) — kwami (talk) 20:17, 4 February 2014 (UTC)
Add two options to AWB to enable new and experimental typos; this would involve creating a new section at AWB/T for the experimental ones. The new typos would include everything under 'New additions' and would be enabled by default. The change would be the addition of a section on that page for experimental typos and an option in AWB, disabled by default, to use them. This experimental section could include typo fixes that are a work in progress, or possibly ones that will never 'graduate' due to the high false positives. It would mean that these types of typo fixes could be worked on more collaboratively and used by anyone who knows what they're getting in for. Jamesmcmahon0 (talk) 11:16, 12 February 2014 (UTC)
I'm in favor, I'd like to suggest some fairly straightforward usage regex rules, but I'd prefer that typo fixers have the option of opting in or out. - Dank (push to talk) 12:35, 12 February 2014 (UTC)
Québécois rule
I have disabled the "Québécois" rule after seeing Oreo Priest revert an edit of mine here. Dictionary.com indicates that "Québecois" and "Quebecois" are acceptable spellings. Is there any need to have a more limited version of this rule? Thanks! GoingBatty (talk) 20:22, 16 February 2014 (UTC)
High-water mark
The fact that the WP article is mishyphenated does not determine our direction. The fact that Collins, education.yahoo.com, Merriam-Webster show it hyphenated is sufficient reason for having the rule. Christhe spelleryack21:11, 16 February 2014 (UTC)
@GoingBatty: Yes, the article should have a hyphen in its name. The exception (and source of much confusion) is that the US Government does not use a hyphen in "ordinary high water mark", which is a term that has legal uses. (You didn't really expect them to use correct punctuation, did you?) Christhe spelleryack04:53, 17 February 2014 (UTC)
Currency
A recent rule-change (not sure what, but I've only just started noticing this so I assume in the last few days) is "correcting" the appearance of the dollar sign in the middle of a string of numbers (so 25$00 becomes $2500 and so on). However, this is the correct format for the pre-Euro Portuguese escudo (and assorted other currencies in the former Portuguese empire), and literally every instance I've found of this is a false positive in an article on a Portuguese (or Brazilian, Macanese etc) topic. If it's not going to break something else, could consideration be given to disabling this one? Mogism (talk) 01:49, 18 January 2014 (UTC)
@Mogism: Fixed! I apologize that I didn't make it clear above that I was asking for one example so I could fix the problem, not because I doubted the problem existed. Thanks for reporting it! GoingBatty (talk) 21:32, 19 January 2014 (UTC)
Thanks for that. I gave a range of articles to show that this was standard practice, as opposed to a couple of obscure articles using an archaic formatting style, or a few edge-cases. For instance, the new "first debuted" rule is wrong in some instances ("NBC were so confident in Friends that they commissioned a second series before the first debuted"), but I wouldn't argue for removing that rule as it's clearly a correct fix in most cases. Mogism (talk) 21:33, 19 January 2014 (UTC)
We could, but should it be changed to −4 °C (−4 °C) instead? (I just submitted a request to have AWB genfixes add the non-breaking space.) GoingBatty (talk) 02:51, 20 February 2014 (UTC)
But wait, is that an accidental double or an intentional one? Perhaps it should change "minus - 4°C" to "+4 °C" (or just "4 °C") instead, since a double negative makes a positive? — {{U|Technical 13}}(t • e • c)03:07, 20 February 2014 (UTC)
@Technical 13: In the case I saw it was definitely a mistake and not a weird use of double negatives... whilst that is a possibility I would think that it's also very unlikely that the editor meant +4 °C but typed minus -4 °C. GoingBatty (talk) 14:39, 21 February 2014 (UTC)
I don't disagree GoingBatty, but since it is a possibility, there is no way for AWB to know if it was an intentional double negative or not, and therefore probably shouldn't try and correct it. — {{U|Technical 13}}(t • e • c)17:27, 21 February 2014 (UTC)
Hm...Maybe. But in this case it has a dash in it making it harder, since sometimes it might be written as -, —, {{ndash}} or –. It could become hard to find every instance of it and wikilink it properly. (t) Josve05a (c)20:43, 22 February 2014 (UTC)
And if it is in a disambig it might say something like:
* [[NAME with existing article]], a place in Sainte-Adresse, France
If there are different ways of spelling it, then there would need to be different solutions to fix the rule. For "Sainte-Adresse" and "Sainte—Adresse", there was one I could fix with a wikilink, and 2009–10 Coupe de France 1st round could use your attention for multiple fixes. Are there lots of other pages where incorrect fixes are being made? GoingBatty (talk) 21:30, 22 February 2014 (UTC)
I suggest that AWB should make sure the typo is not part of a link.
I had a couple of occasions recently where AWB did a typo fix which was perfectly valid but the correct action should have been ignoring the typo. Specifically, if the spelling mistake was within a URL or an image file name. Unfortunately, I have realised this is a bug hours after seeing the problem and the two actual examples are lost somewhere in my edits. Periglio (talk) 07:08, 16 March 2014 (UTC)
Were you using "Find & Replace" expressions that you'd set up yourself, or had you just ticked "Regex typo fixing" on the "Options" tab? Either way, the developers will need to see the diffs before they can comment with any precision. -- John of Reading (talk) 07:50, 16 March 2014 (UTC)
This was with the "Regex typo fixing" box ticked. I just happened to notice a couple during my edits which I did not save so there are no diffs. As I said, I did not think about it until long after the event, so I am unable to find the article where this took place. I have also been unable to recreate it on my user page so I apologise for assuming it was a general oversight! If I see it happen again, I will be back. Periglio (talk) 10:32, 16 March 2014 (UTC)
@Periglio: One of the items in Wikipedia:AutoWikiBrowser/Typos#To do shows one way we can update the typo rules to ignore URLs, but there are others. If you notice this behavior in the future, please post here (don't save a bad edit), so we can see how we can adjust the article and/or the typo rule. Thanks! GoingBatty (talk) 21:27, 16 March 2014 (UTC)
New rule "।"
I've just seen this new rule suggest a change at Kali, removing spaces before each "।" character - that's Unicode \u0964, not an ordinary pipe. Could Wikiuser13 (talk·contribs) or any other editor explain for me why this is a typo that needs fixing, as it's not obvious. -- John of Reading (talk) 10:33, 18 March 2014 (UTC)
Sometimes when I am typo fixing, I will be presented with a page that has no changes automatically applied. However, I have the "skip if no typos are found" setting checked, so surely it should skip these pages automatically? This seems to happen on pages that use diacritics in the title (I haven't fully confirmed that). For an example, try Demographics of São Paulo; you should also notice that the edit summary is given as fixing Sao Paulo to São Paulo. Jamesmcmahon0 (talk) 13:42, 28 March 2014 (UTC)
I can't easily create a sentence where "a another" would be valid. As we have often pointed out in other discussions, the Typo rules are meant to correct common minor mistakes in text that somewhat resembles good English. Typo rules can't fix every possible mistake, and "an another" makes no less sense than "a another", so it's not really doing any harm. When I see a Typo rule tripped up like that, I fix the sentence and continue on my merry way. Christhe spelleryack04:34, 31 March 2014 (UTC)
Many editors gnome and make that change so I thought I would put it in. But I'm fine with it not going in if it's not considered a typo. I'm always suspicious of Oxford, they allow all sorts of funny spelling ;-) -- Ohc ¡digame!07:40, 5 April 2014 (UTC)
Using WPCleaner I get this message in the Java console. I don't know what any of it means, but I think it has something to do with this page...
WARNING: Incorrect pattern syntax for [\b([Dd]is|[IiUu]ndis|[Ee]x)tin?[gq]i?ui?sh?((?:ab[li]|e[drs]|ing|ment)[a-z]*)?\b(?<!tinguish[a-z]*)]: Look-behind group does not have an obvious maximum length near index 97
\b([Dd]is|[IiUu]ndis|[Ee]x)tin?[gq]i?ui?sh?((?:ab[li]|e[drs]|ing|ment)[a-z]*)?\b(?<!tinguish[a-z]*)
^
Apr 19, 2014 6:48:08 PM org.wikipediacleaner.api.data.Suggestion createSuggestion
WARNING: Incorrect pattern syntax for [\b([Pp])rei?v(?:[eious]+(?<=s[eiou]*)|iou)e?l+e?y(?<!reviously)\b]: Look-behind group does not have an obvious maximum length near index 35
\b([Pp])rei?v(?:[eious]+(?<=s[eiou]*)|iou)e?l+e?y(?<!reviously)\b
^
Apr 19, 2014 6:48:08 PM org.wikipediacleaner.api.data.Suggestion createSuggestion
WARNING: Incorrect pattern syntax for [\b([Aa]) ?([Aa](?!nd\b|AA?T?|s\b|ldo|lguien\b|pagar\b|probat\b|rtelor\b|tahualpa\b|ustriei\b|\b|ED|FN|LL|MD|NG|OA|RS|UD|WG|ZN)[A-Za-z0-9]{0,99}|[Ee](?!u|dil\b|mpezar\b|ncore\b|nse[nñ]ar|ntenderse\b|sa\b|spa[nñ]|st(a\b|é|e\b)|vrop|w[abei]|\b|GP|RN|TB|URO?)[A-Za-z0-9]{0,99}|h(?:aut[besu]|eir|our|ones|onou?r|ors\sd)[A-Za-z0-9]{0,99}|[Ii](?![0-9]|[nst]\b|[IiVvXx]\b|[Ii]|greja|nglat|nstitucí|mmagini\b|ts\b|ure\b|\b|DR|LS|NR|QD|RR|SK)[A-Za-z0-9]{0,99}|[Oo](?!ax|bra|cho|d\b|f\b|ggi|kol[íi]e?\b|[Nn][Cc][Ee]|[Nn][Ee](\b|[A-Fa-fHhJ-Qj-qS-Zs-z0-9]|r[a-np-z])|rfu\b|opa|ra?ului|ra[s?]ului|ui|MR)[A-Za-z0-9]{0,99}|u(?!b[aio]|[ef]|ga[ln]|in|itz|k|lu|n(\s|:)|na(\b|n|r)|nes|ni([^m]|mo|\b)|[rst][aeiou]|rl\b|v[aeiru]|\b)[a-z]{0,99})(?<=\b[A-Za-z]{2,99}(?<!:|\btoda|\bpara|\b[Ii]nterpreta|\b[Vv]olta|\bva|\bund|\brecibe|\bde|[Vv]eche|\bque|\b[Rr]oi|\b[Ii]l|\scom|\bllevan|\btren|\b[Vv]olver|\be[nst]|\bnous)(?:\.\s?\s[Aa] |\,?\;?\sa ) ?\2)]: Look-behind group does not have an obvious maximum length near index 927
\b([Aa]) ?([Aa](?!nd\b|AA?T?|s\b|ldo|lguien\b|pagar\b|probat\b|rtelor\b|tahualpa\b|ustriei\b|\b|ED|FN|LL|MD|NG|OA|RS|UD|WG|ZN)[A-Za-z0-9]{0,99}|[Ee](?!u|dil\b|mpezar\b|ncore\b|nse[nñ]ar|ntenderse\b|sa\b|spa[nñ]|st(a\b|é|e\b)|vrop|w[abei]|\b|GP|RN|TB|URO?)[A-Za-z0-9]{0,99}|h(?:aut[besu]|eir|our|ones|onou?r|ors\sd)[A-Za-z0-9]{0,99}|[Ii](?![0-9]|[nst]\b|[IiVvXx]\b|[Ii]|greja|nglat|nstitucí|mmagini\b|ts\b|ure\b|\b|DR|LS|NR|QD|RR|SK)[A-Za-z0-9]{0,99}|[Oo](?!ax|bra|cho|d\b|f\b|ggi|kol[íi]e?\b|[Nn][Cc][Ee]|[Nn][Ee](\b|[A-Fa-fHhJ-Qj-qS-Zs-z0-9]|r[a-np-z])|rfu\b|opa|ra?ului|ra[s?]ului|ui|MR)[A-Za-z0-9]{0,99}|u(?!b[aio]|[ef]|ga[ln]|in|itz|k|lu|n(\s|:)|na(\b|n|r)|nes|ni([^m]|mo|\b)|[rst][aeiou]|rl\b|v[aeiru]|\b)[a-z]{0,99})(?<=\b[A-Za-z]{2,99}(?<!:|\btoda|\bpara|\b[Ii]nterpreta|\b[Vv]olta|\bva|\bund|\brecibe|\bde|[Vv]eche|\bque|\b[Rr]oi|\b[Ii]l|\scom|\bllevan|\btren|\b[Vv]olver|\be[nst]|\bnous)(?:\.\s?\s[Aa] |\,?\;?\sa ) ?\2)
^
Apr 19, 2014 6:48:08 PM org.wikipediacleaner.api.data.Suggestion createSuggestion
WARNING: Incorrect pattern syntax for [\b(?<!{)[Ss][Qq][-.\s]+[Kk][Mm][Ss]?\b]: Illegal repetition near index 5
\b(?<!{)[Ss][Qq][-.\s]+[Kk][Mm][Ss]?\b
^
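For context on the warnings above: Java's regex engine rejects look-behind groups that contain an unbounded quantifier such as `[a-z]*`, and it reads an unescaped `{` as the start of a repetition count, whereas the .NET engine that AWB uses accepts both. One possible workaround, sketched below in Python (whose `re` module is stricter still and only allows fixed-width look-behind), is to move the "already spelled correctly" check into a look-ahead at the start of the word. The pattern is a simplified, hypothetical form of the first rule in the log, not the rule as it stands on AWB/T.

```python
import re

# Simplified form of the rule from the log; the real rule covers more prefixes.
WITH_LOOKBEHIND = (
    r"\b([Dd]is)tin?[gq]i?ui?sh?(e[drs]|ing)?\b(?<!tinguish[a-z]*)"
)
# Intended to have the same effect: the check for a correct spelling is a
# look-ahead anchored at the start of the word, so no look-behind is needed.
WITH_LOOKAHEAD = (
    r"\b(?![Dd]istinguish[a-z]*\b)([Dd]is)tin?[gq]i?ui?sh?(e[drs]|ing)?\b"
)


def compiles(pattern):
    """Return True if this regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def is_flagged(word):
    """Return True if the look-ahead version treats the word as a typo."""
    return re.fullmatch(WITH_LOOKAHEAD, word) is not None
```

For the `(?<!{)` case, escaping the brace as `(?<!\{)` should be accepted by both engines.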
Could a rule be created for typos of the form "at he (end|beginning|corner|side|etc...)" to be changed to "at the ..." Jamesmcmahon0 (talk) 12:34, 16 May 2014 (UTC)
There are only a handful of these. A new Typo rule is usually only considered when it would catch about two dozen or more errors. If you had just fixed 50 of these this week, that would be a different story. But the general problem you have brought to our attention is interesting. When 'he' follows any preposition, it is frequently an error of some kind. I found cases where "with he" should have been "with him", and cases where "with" was just an extraneous word. I'm having fun with this angle, but a new Typo rule is probably not the best way to deal with it. Christhe spelleryack16:33, 16 May 2014 (UTC)
This is the sort of thing that's really useful to run against a database dump. If you have the RAM/diskspace/processing power to do it, get the most recent dump and scan for these typos. Then either fix them then, or make a subpage on your own userpage of them to work through slowly. This ensures a human looks at each edit, and once the bulk of the fixes are done, new errors will build up slowly. Shadowjams (talk) 06:53, 17 May 2014 (UTC)
If you need a couple of database scans, just give me a yell. I've always got the current dump file. If it is a common problem, you can make a request for it to be added at WP:FIX. Bgwhite (talk) 07:20, 17 May 2014 (UTC)
@Shadowjams: IMHO, you don't need a database dump to do this. You could do this in AWB by making a list and using find/replace rules to do your testing. GoingBatty (talk) 00:29, 19 May 2014 (UTC)
Copied from User talk:Mogism (with some parts of the comments removed):
I do not know what the standard advice is in American-specific dictionaries, but the international OED says:
ˈcounter-, prefix
...
In those compounds which we have taken from French or Italian, the consolidation of the word is usually greater than in those formed in English, and they are regularly written as single words, as counterbalance , counterfeit , countermand , countermarch , though sometimes with the hyphen. The stress is normally, in verbs and their derivatives, on the root, in nouns and their derivatives, on the prefix: cf. to underˈgo , ˈundertone . But there are exceptions, esp. where the noun stress is taken by a verb of the same form, as in to ˈcounterfeit . In words formed in English the two elements are in looser union, both accentually and in writing. In verbs the rhetorical or antithetical stress on the prefix may be equal to, or even for the nonce stronger than, that normally on the root, as in to plan and ˈcounter-ˈplan (ˈcounter-ˌplan ), and the two parts are properly hyphened. In nouns, when the counter- word is contrasted explicitly or implicitly with the simple word (as in 2b – 2d), the predominant stress of the prefix is strongly marked, as in ˈcounter-cheer , ˈcounter-aˌnnouncement . These are properly written with the hyphen (now rarely as a single word, but occasionally in two separate words). When such a contrast is not distinctly present (as in 2e, 2h), the predominance of the prefix is less marked, and the root-element may receive an equal or greater stress; in such case there is a growing tendency to write the prefix as a separate qualifying word, and in fact to treat it as an adjective. Thus counter-side , counter-truth , become counter side , counter truth : see counter adj.
All permanent compounds in counter-, with some of the more important of the looser combinations, are given in their alphabetical order; of the casual combinations (many of them nonce-words) of obvious meaning, examples here follow.
...
[2]b.
(a) Done, directed, or acting against, in opposition to, as a rejoinder or reply to another thing of the same kind already made or in existence. (The stress is on the prefix; in long words there is a secondary stress on the accented syllable of the root-word.)
Wikipedia's policy on this is intentionally vague - "There is a clear trend to join both elements in all varieties of English (subsection, nonlinear), particularly in American English. British English tends to hyphenate when the letters brought into contact are the same (non-negotiable, sub-basement) or are vowels (pre-industrial), or where a word is uncommon (co-proposed, re-target) or may be misread (sub-era, not subera). American English reflects the same factors, but is more likely to close up without a hyphen. Consult a good dictionary, and see National varieties of English above.", but definitely leans towards the removal of hyphens.
This is in direct opposition to the OED as regards the "counter" prefix, but as with much of the OED, take their rules with an extreme pinch of salt - they have a famously loose relationship to standard English of the type actually written anywhere in the world, to the extent that Oxford English has its own language tag (en-GB-oed) to differentiate it from British English. British style guides don't agree with each other; the Guardian is firmly in the single non-hyphenated word camp, the Economist firmly in the "hyphenate" camp, and the Times is mute on the matter. All major US guides (as far as I can see) oppose the hyphen, including the CMOS hyphenation guide which the overwhelming majority of North American sources follow.
This is a tightly limited rule that removes the space or hyphen from only counter-attack, counter-part and counter-point. (I wouldn't think the last two are at all controversial.) "Counterattack" corresponds to the spelling used in the WP article of that name. I don't think there was any discussion on the subject - it was accepted as uncontroversial. But a key characteristic of these rules is that they should only be making changes that are considered uncontroversial so by all means raise the question at WT:AWB/T. Colonies Chris (talk) 22:55, 17 May 2014 (UTC)
"counter-part" and "counter-point" are both spelt "counterpart" and "counterpoint" in the OED (and are not shown as spelling mistakes with the British English spelling checker I am using). "counterattack" show as a spelling mistake and is spelt "counter-attack" in the OED (see above), therefore I recommend removing counter-attack from the list and also not automatically "fixing" counter attack unless the type of English used on a page can be ascertained automatically.-- PBS (talk) 09:53, 18 May 2014 (UTC)
I think "counterattack" and "counter-attack" should be left alone by AWB, but "counter attack" is not acceptable in American or British English. Changing the open spelling to the closed spelling produces "counterattack", which is always correct in American English and often correct in British English, and is at least more easily grasped than "counter attack". If, later on, an editor wants to insert a hyphen in an article that uses British English, fine. But it is not being kind to WP's readers to leave it open. Christhe spelleryack04:17, 19 May 2014 (UTC)
The statement by Colonies Chris about "tightly limited rule that removes the space or hyphen from only counter-attack, counter-part and counter-point" is not true of the current rule. A space will be removed from "counter attack", but a hyphen in "counter-attack" will be left alone. Christhe spelleryack04:32, 19 May 2014 (UTC)
A word boundary after "attack" deals with that. It's true that people sometimes write "to counter attack from the sea", and we don't want to change that to "counter-attack", but we do want to catch it and change it to something better (counter attacks, or better yet, defend against attacks). - Dank (push to talk) 11:59, 19 May 2014 (UTC)
@Chris the speller it seems that the OED disagrees with you. They write about the words in their range of 2b to 2d (of which counter-attack is one) "... These are properly written with the hyphen (now rarely as a single word, but occasionally in two separate words)", so why do you think that "counterattack" is more appropriate than "counter attack" and why do you claim that "'counter attack' is not acceptable in ... British English"? -- PBS (talk) 19:53, 19 May 2014 (UTC)
I get the conflict here, PBS. It probably doesn't make sense to make a lot of automated edits to a new article that strike the primary writer as "nitpicky" ... it might give them the impression that we don't value their contributions, or that we're focused on the wrong things. OTOH, most writers, most of the time, actually appreciate good advice, and "counter attack" isn't good advice ... it isn't in SOED, Oxford Dictionaries, Cambridge Dictionaries, or any style guide I've seen. Where to draw the line at WP:RETF is not my call, but I'm working on copyediting software aimed at writers who have actually asked for writing advice, and for them, I can't recommend "counter attack" in any variety of English. - Dank (push to talk) 20:38, 19 May 2014 (UTC)
And Dank's example above, "to counter attack from the sea" (not optimal but valid), where "counter" is a verb and "attack" is a noun, illustrates why we need to close them up or hyphenate when "counterattack" or "counter-attack" is either a verb or a noun. Christhe spelleryack21:22, 19 May 2014 (UTC)
I agree with Dank above - nitpicky fixes, even when correct, can make Wikipedia appear a snooty and nitpicky place to people unfamiliar with it. As I've said before here, I really wish we could find an alternative default edit summary than "typos fixed", since not all the changes made at WP:AWB/T are typos. Regulars know that these summaries are just an artefact of the software, but for a new editor, it must look like an accusation or a questioning of their competence, when someone makes an ultra-trivial change like amending the apostrophe in "Guns N' Roses" from curly to straight, but labels it as a typo. (Oh, I know it's not good practice to make a change this trivial unless you're changing something more substantial as well, but it does happen.)
As regards PBS's original point, as I said on my talk I wouldn't argue if someone removed "counter-attack" from the list. While I personally think we should standardise on the non-hyphenated form, leaving the hyphens in place won't cause any harm - it's not as if they change the meaning of sentences, or render something confusing - and if there's a reasonable argument to be made for keeping them then it's not an uncontroversial change.
I do repeat what I said on my talk about treating the OED as canonical - they make some very weird calls which are out of keeping with the way English is actually used, most famously their refusal to accept the "-ise" suffix and insistence on the serial comma, and even hyper-conservative style guides like The Times now reject them. Mogism (talk) 16:08, 20 May 2014 (UTC)
Issue
I was just wondering, if you come across an article that finds multiple typos but one of the typos is the correct way of spelling it (e.g. a surname), how do you ignore it so it only changes the incorrect typo and doesn't change the other?--Mjs1991 (talk) 08:51, 26 May 2014 (UTC)
I don't even know what "241 00€" is supposed to represent; how can you expect a Typo rule to know what's going on? If the article contained properly formatted numbers instead of garbage, I think the Typo rules would work fine. Christhe spelleryack04:23, 29 March 2014 (UTC)
I agree that the article is poorly written. I assume the 241 00€ is supposed to be 241 000€ but obviously it can't be expected to fix that. I couldn't find any guidelines in MOS:CURRENCY for how to lay out large amounts, so I would think that £1 000 000 is equally acceptable as £1,000,000. Could the rule be changed to look for groups of three numbers separated by either a space or a comma and fix accordingly? Jamesmcmahon0 (talk) 18:38, 29 March 2014 (UTC)
Yep, I would agree with that. Thanks. Since that has come up, is there any way to correct 123 456 etc to 123,456 without hideous amounts of false positives? Jamesmcmahon0 (talk) 22:53, 30 March 2014 (UTC)
I think there would be lots (by our standards) of false positives, such as "the plane dropped 3 500 pound bombs on the target", which is poorly formatted and unhyphenated, but understandable by a human reader. The correction you suggest would change its meaning. I'm not in favor of risking that kind of damage. Christhe spelleryack04:23, 31 March 2014 (UTC)
This is a common problem with text translated from other languages, particularly if it is done by a person translating from their mother tongue into English. Other languages use other delimiters in large amounts (for example German uses points where English uses commas and commas where English uses points), so a German "100.000,99" is "100 000,99" in French and 100,000.99 in English. For those who are unsure (or French ;-) ) missing out the delimiters or not converting them is the simplest solution to translating numbers. Hence this will be more of a problem of incorrectly formatted numbers with certain currency symbols than others, and the Euro is going to be one of them because some (most?) European continental languages place the currency symbols after the number. Also according to this paper different dialects of English do or do not use spaces between the currency symbol and the numeric amount. -- PBS (talk) 10:26, 18 May 2014 (UTC)
I've been running into the same problem, and I think my suggestion would be to use the trailing currency symbol as the litmus for reformatting a number. So "5 500$" -> "$5,500", but "5 500" goes unchanged. I've also run into other separators, like a single quote, so maybe it just needs to match a space or any punctuation. I think the way to deal with a "241 00€" is to go with "€241.00" - ie matching a trailing group of not 3 digits as decimals. I think this could be the most robust, and requires significantly less editor intervention than punting on any non-comma separator. VanIsaacWScont08:40, 13 June 2014 (UTC)
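A rough Python sketch of the suggestion above, to make the proposed behaviour concrete. It is not an AWB rule: the regex, the set of separators and currency symbols, and the function names are all assumptions chosen for illustration. A number is only touched when a currency symbol trails it, a final group that is not three digits long is treated as decimals, and a symbol followed directly by a digit (the escudo style "25$00") is left alone.

```python
import re

# Digit groups split by a space or punctuation, then a trailing currency
# symbol that is not itself followed by a digit (so "25$00" is left alone).
TRAILING_SYMBOL = re.compile(
    r"(?<![\d.,'])(\d{1,3}(?:[ .,']\d{2,3})+)\s?([$€£])(?!\d)"
)


def _reformat(match):
    groups = re.split(r"[ .,']", match.group(1))
    decimals = ""
    if len(groups[-1]) != 3:  # e.g. the "00" in "241 00€"
        decimals = "." + groups.pop()
    return match.group(2) + ",".join(groups) + decimals


def move_trailing_symbol(text):
    """Rewrite '5 500$' as '$5,500'; leave numbers without a symbol alone."""
    return TRAILING_SYMBOL.sub(_reformat, text)
```

Because a bare "3 500" never matches, the "3 500 pound bombs" sentence mentioned earlier in this thread is not affected by this sketch.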
Dependant vs. dependent
Wondering if we could craft a rule to change "dependant" to "dependent" when necessary. My understanding is:
In British English:
"Dependent" means reliant on.
A "dependant" is a person (usually a child or a spouse).
In American English, you can use "dependent" for both. (copied from grammar-monster.com)
I'm thinking we could try either:
change "dependant on" (but not "a dependant on") to "dependent on", OR
change "is/are now/highly/very/mostly dependant on" to "is/are now/highly/very/mostly dependent on" (more adverbs as needed)
But Collins English Dictionary says "casted" is an adjective that means "belonging to a caste". I also found "protect the patient's casted foot". Maybe check for "casted" that is preceded by he, she, they, was, is, be, being, etc.? Christhe spelleryack14:01, 16 June 2014 (UTC)
The right way to protect foreign-language text from English spell checkers is to use the "lang" template. I have employed it in your example page. Christhe spelleryack13:42, 19 June 2014 (UTC)
I'm seeing AWB trying to 'correct' including to includeing, giving the rule includ --> include for a reason. Thank you, BethNaught (talk) 13:33, 24 June 2014 (UTC)
@Bgwhite: I can't reproduce the issue. Since AWB won't make typo changes inside templates, and your edit to the article was to remove some extra braces, my guess is that AWB was confused about where the infobox ended, and didn't realize that the |european= parameter was part of the template. Thanks for fixing the article and not saving a bad edit! GoingBatty (talk) 02:18, 25 June 2014 (UTC)
I've been running across scattered instances of "full time" → "full-time" in soccer articles fairly regularly, where it is used to refer to the end of the game, eg "the score was tied 1-1 at full time". From my memory of context, I think this can be resolved by simply retaining "full time" when preceded by "at". If no one has a counter example, could someone better with regexp program this in? VanIsaacWScont05:57, 3 July 2014 (UTC)
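A minimal Python sketch of the exception proposed above, assuming a deliberately simplified hyphenation rule (the actual AWB/T rule is not shown on this page and is probably narrower). A negative look-behind skips "full time" when the word "at" comes directly before it.

```python
import re

# Hypothetical, simplified hyphenation rule; the look-behind skips the
# football sense ("at full time").
FULL_TIME = re.compile(r"(?<!\b[Aa]t )\bfull time\b")


def hyphenate_full_time(text):
    """Change 'full time' to 'full-time' unless it follows 'at'."""
    return FULL_TIME.sub("full-time", text)
```

This would still need checking against counter-examples where "at full time" is not the end-of-game sense.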
Although the "i.e." rule was designed to skip URLs, AWB wants to change www.currahaparish.ie. to www.currahaparish.i.e. in Curraha. I tried fixing the rule, but my fix didn't work. Could someone else please take a look at this? Thanks! GoingBatty (talk) 14:42, 5 July 2014 (UTC)
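One possible shape for a fix, sketched in Python with hypothetical stand-in patterns (the real "i.e." rule is more involved and is not reproduced here): refuse to match "ie." when it directly follows a word character or a dot, which is what happens at the end of a ".ie" domain name.

```python
import re

# Hypothetical stand-ins for the real rule, which is more involved.
NAIVE = re.compile(r"\bie\.")
# Skip "ie." when it directly follows a word character or a dot,
# as it does at the end of a ".ie" domain name.
GUARDED = re.compile(r"(?<![\w.])ie\.")


def fix_ie(pattern, text):
    """Apply the hypothetical ie. -> i.e. correction."""
    return pattern.sub("i.e.", text)
```

The naive pattern reproduces the bad change reported above, while the guarded one leaves the domain name untouched.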
Name caught in adverb spelling web
The last name "Dealy" keeps getting caught in the adverb suffix ly -> lly rule. It wouldn't get an erroneous skip of "idealy" if you match to the capitalized spelling only. VanIsaacWScont08:17, 6 July 2014 (UTC)
@Poveglia: The rule already exists - see this edit I just did. Do you see any articles where AWB isn't fixing the typo (when the typo isn't in a protected area, such as a URL or template?) Thanks! GoingBatty (talk) 03:29, 12 July 2014 (UTC)
I feel stupid now. I think I must've made a typo myself after pressing Ctrl-F. I fixed the "goverment" typos manually. Thanks, Poveglia (talk) 05:36, 13 July 2014 (UTC)
new jersey
I'm not sure how often this would happen, but AWB/T capitalised thinking it was a place name, but in fact the context was "a new jersey" of a hockey team. [31] --Brenotalk11:32, 6 August 2014 (UTC)
I recently had to correct around 40 of these. While these may be honest typos, I have seen the misspelling used in a pejorative sense, and therefore susceptible for use by vandals. Not a biggie, but it would be nice if I didn't have to have it set up as a find/replace when I do cleanups. Stevie is the man!Talk • Work11:13, 20 August 2014 (UTC)
Hello, please help us with one issue, with AWB in Armenian Wikipedia we cannot use Wikisearchtext and Wikisearchtiltle command. Thanks beforehand. --ERJANIK (talk) 13:54, 6 September 2014 (UTC)