Conrad.Bot
Joined 30 March 2008
A bot using the pywiki framework, run by User:Conrad.Irwin.
Tasks
- (Approved) Link fixing for deleted/deletable redirects.
- (With consensus) Uploading the index files, see User:Conrad.Bot/Indexing.
- (Without explicit approval) Replacing {{see}} with {{also}}. A bad idea... (Will only work on pages that start with {{see| and contain no other occurrences of {{see|, to avoid propagating formatting errors.)
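The guard described for the {{see}} replacement can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the bot's actual code; the function name `replace_see` is hypothetical.

```python
def replace_see(page_text):
    """Swap a leading {{see| for {{also|, but only when it is safe.

    Safe means: the page starts with {{see| and that is the only
    occurrence of {{see| on the page. Otherwise the text is returned
    unchanged.
    """
    marker = "{{see|"
    if page_text.startswith(marker) and page_text.count(marker) == 1:
        return "{{also|" + page_text[len(marker):]
    return page_text
```

Pages with a second {{see| anywhere are left alone, which matches the stated aim of not propagating formatting errors.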
Anagrams
Adding and updating ==Anagrams== sections in:
- English
- French
- request your language here
For both of these languages:
- Ignore anything containing a number, or which looks like a prefix, suffix or interfix '(^-|- -|[0-9]|-$)', or which only has "{{misspelling of}}" definitions.
- The normal-form is the lower-case word with all diacritics and all non-letters removed.
- The base anagram is formed by sorting the normal-form's letters into order; anything that has the same base anagram but a different normal-form is considered an anagram.
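The rules above can be sketched roughly as follows. This is a minimal Python illustration rather than the bot's own code: the function names are hypothetical, the skip pattern is copied from the list above, and the check for entries that only have {{misspelling of}} definitions is left out because it needs the entry text, not just the title.

```python
import re
import unicodedata
from collections import defaultdict

# Pattern quoted from the page: numbers, prefixes, suffixes, interfixes.
SKIP = re.compile(r"(^-|- -|[0-9]|-$)")


def normal_form(word):
    """Lower-case the word, then drop diacritics and all non-letters."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # After NFD, diacritics are separate combining marks, which are not
    # alphabetic, so one filter removes both marks and punctuation.
    return "".join(ch for ch in decomposed if ch.isalpha())


def base_anagram(word):
    """Sort the letters of the normal-form into order."""
    return "".join(sorted(normal_form(word)))


def find_anagrams(words):
    """Map each word to the words sharing its base anagram
    but having a different normal-form."""
    groups = defaultdict(list)
    for word in words:
        if SKIP.search(word):
            continue
        groups[base_anagram(word)].append(word)

    result = {}
    for members in groups.values():
        for word in members:
            others = [w for w in members
                      if normal_form(w) != normal_form(word)]
            if others:
                result[word] = others
    return result
```

Because anagrams are compared by normal-form, "Listen" and "listen" are not treated as anagrams of each other, but both are anagrams of "silent".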
Indexing
- Horribly out of date now...
This page may be out of date, but it should accurately reflect the current status when it was updated.
Languages
- On multiple pages: Hungarian, Irish, Italian, Spanish, Galician, Ancient Greek, English, Lithuanian
- On one page: Mapudungun, Hiligaynon
Overview
create_indices.sh
Downloads the latest XML dump from http://devtionary.info/w/dump/xmlu and then runs the following programs.
1. nicen.dump.awk
Normalizes the XML dump, removing entries I am not interested in and formatting the rest more readably.
2. extract_words.awk
Scans through the dump and adds to a list every entry that contains at least one definition that doesn't look like a "form of" definition. This step also stores any audio files it finds, and notes whether the link will need a #Language because the entry is not the first section on the page.
- Entries whose only definition line consists entirely of a template (except {{SI unit}} and {{given name}}) are excluded.
- Definitions that start with "compound of" are excluded.
- Definitions that contain variations on "X form of", where X is present/perfect/plural/singular/past historic/preterite/compound/ending in ive, are excluded.
- This is of course guesswork; if you notice words that should be in the index but aren't, or words that shouldn't be in the index but are, let me know.
3. get_trans.py
Scans through the dump and adds every translation of words in languages that are being indexed to the lists created in step 2.
4. get_missing.py
(For some languages) scans through the current index for that language and adds all words there to the list as "missing".
5. split_index.<language name>.pl
Splits the list for each language into files for each starting letter, corresponding to the list of entries on each page, and (for newly added languages) sorts them and divides them by second letter.
6. format_index.<language name>.pl
Formats the per-letter lists into wikitext (for the older few languages, the sorting and splitting by second letter happens here).
7. indexupload.py
Uploads each formatted output file.
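As a rough illustration of the "form of" filtering described for extract_words.awk, here is a Python sketch. The real step is an awk script whose patterns are not shown on this page, so the regular expressions and function names below are assumptions, in particular the reading of "variations on X form of" as X immediately followed by "form of".

```python
import re

# A definition that is nothing but one template, e.g. {{plural of|cat}}.
TEMPLATE_ONLY = re.compile(r"^\{\{([^{}|]+)(?:\|[^{}]*)?\}\}\.?$")
ALLOWED_TEMPLATES = {"SI unit", "given name"}

# Assumed reading of "X form of": one of the listed words, or any word
# ending in "ive", directly before "form of".
FORM_OF = re.compile(
    r"\b(?:present|perfect|plural|singular|past historic|preterite"
    r"|compound|\w*ive)\s+form\s+of\b",
    re.IGNORECASE,
)


def is_template_only(definition):
    """True for a bare template other than the two allowed ones."""
    match = TEMPLATE_ONLY.match(definition.strip())
    return bool(match) and match.group(1).strip() not in ALLOWED_TEMPLATES


def looks_like_form_of(definition):
    """True if the definition reads like a 'form of' definition."""
    text = definition.strip()
    return text.lower().startswith("compound of") or bool(FORM_OF.search(text))


def is_indexable(definitions):
    """Decide whether an entry with these definition lines is indexed."""
    if len(definitions) == 1 and is_template_only(definitions[0]):
        return False
    return any(not looks_like_form_of(d) for d in definitions)
```

An entry is kept as soon as one of its definitions passes, so a word with both a "plural form of" sense and an ordinary sense still reaches the index.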
Sorting and splitting
For all languages, the strings are first normalized to lowercase. As I get round to it, I intend to rewrite the old-style ones as new-style ones.
Ancient Greek
- Remove all space and punctuation.
- Treat any remaining non-alphabetic and 𐠀ϝϻϡϙ as 0.
- Remove diacritics.
- Split on first two characters.
- Use el_EL.utf-8 to sort original strings.
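The Ancient Greek normalisation steps can be sketched in Python as below. This is an assumption-laden illustration, not the Perl actually used: the function names are hypothetical, "punctuation" is taken to mean Unicode punctuation categories, and the locale-based sort is omitted because it depends on which locales are installed.

```python
import unicodedata

# Characters the page says to treat as 0, compared after lower-casing.
ARCHAIC = {ch.lower() for ch in "𐠀ϝϻϡϙ"}


def strip_diacritics(text):
    """Decompose the text and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(word):
    """Apply the listed steps in order to a single headword."""
    # Compose first so accented letters still count as alphabetic.
    word = unicodedata.normalize("NFC", word).lower()
    kept = []
    for ch in word:
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue  # remove all space and punctuation
        if ch in ARCHAIC or not ch.isalpha():
            kept.append("0")  # archaic letters and other symbols become 0
        else:
            kept.append(ch)
    return strip_diacritics("".join(kept))


def split_key(word):
    """First two characters of the normalised form: the sub-page bucket."""
    return normalise(word)[:2]
```

For example, `split_key("Ἀθῆναι")` gives `"αθ"`, and a word beginning with digamma lands in a bucket starting with `0`.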
Galician
- (old style)
- Remove all diacritics (except ñ).
- Treat non-alphabetic characters as 0.
- Split on first two characters.
- Sort on normalised form.
Hungarian
- (old style)
- Replace á é í ó ú ő ű with a e i o u ö ü
- Treat non-alphanumeric characters as 0
- Split on first two (cs|gy|ly|ny|sz|ty|zs|[[:alpha:]0])
- Sort on normalised form.
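The Hungarian rules treat digraphs such as cs and gy as single letters when splitting. A minimal Python sketch of that idea follows; the names are hypothetical, and since Python's `re` has no `[[:alpha:]]` class, `[^\W\d_]` is used as an assumed stand-in for "any letter".

```python
import re
import unicodedata

# Long vowels are folded onto their short counterparts.
VOWEL_MAP = str.maketrans("áéíóúőű", "aeiouöü")

# One "letter" for splitting: a digraph first, else a letter or 0.
UNIT = re.compile(r"cs|gy|ly|ny|sz|ty|zs|[^\W\d_]|0")


def normalise(word):
    """Lower-case, fold long vowels, turn non-alphanumerics into 0."""
    lowered = unicodedata.normalize("NFC", word).lower().translate(VOWEL_MAP)
    return "".join(ch if ch.isalnum() else "0" for ch in lowered)


def split_key(word):
    """First two units of the normalised form, digraphs counted as one."""
    return "".join(UNIT.findall(normalise(word))[:2])


def sort_words(words):
    """Sort on the normalised form, keeping the original spellings."""
    return sorted(words, key=normalise)
```

With this, "csésze" is filed under "cse" rather than "cs", because the leading digraph counts as the first letter.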
Irish
- Remove all space and punctuation.
- Treat any remaining non-alphabetic as 0.
- Remove diacritics.
- Remove any leading an.
- Split on first two characters.
- Use gl_GL.utf-8 to sort original string.
Italian
- (old style)
- Remove all diacritics.
- Remove any leading a.
- Treat non-alphabetic characters as 0.
- Split on first two characters.
- Sort on normalised form.
Spanish
- Remove all space and punctuation.
- Treat any remaining non-alphabetic as 0.
- Split on first two (ñ|ll|ch|[[:alpha:]0])
- Use es_ES.utf-8 to sort original string.
Formatting
Currently all languages are treated about the same:
- Strikethrough links that were added as "missing" from the indexes
- Add an {{audio-list}} for an audio file, if one was found.
- Abbreviate PoS and add that in italics.
- Add an * linked to any entries which contained the word.
- Add #<language name> to links that were not the first on the page.
- Put the lists (#-lists) into a <div class="index"></div> separated by ===-headings and a table of contents.
- This means that the lists run horizontally, so they can change width to fill the maximum amount of space available to them, and users can continue scrolling downwards without having to go up to find the next column.
Old Stuff
Wiktionary: cross-namespace redirects
Will replace all links to these except for:
Links to redirects to Help Pages
- Wiktionary:Help > Help:Contents
- Wiktionary:FAQ > Help:FAQ
- Wiktionary:How to edit a page > Help:How to edit a page
- Wiktionary:How to start a page > Help:Starting a new page
- Wiktionary:How to check translations > Help:How to check translations
as they are widely known and linked to.
Hungarian Indexes
Need to move these first.
Links to redirects to Special pages
- Wiktionary:Beer parlour archive/July 06 : Wiktionary:Low water mark > Special:Recentchanges/hidepatrolled
- Wiktionary:Request pages/Recent changes/2007-01-08 : Wiktionary:Low water mark > Special:Recentchanges/hidepatrolled
- Wiktionary:Requests for deletion/Archives/2008/01 : Wiktionary:Low water mark > Special:Recentchanges/hidepatrolled
- Wiktionary talk:Main Page/Redesign 2006 : Wiktionary:Main Page/Redesign > Special:Allpages/Wiktionary:Main Page/Redesign
- Wiktionary:Beer parlour archive/February 06 : Wiktionary:Main Page/Redesign > Special:Allpages/Wiktionary:Main Page/Redesign
- Wiktionary:Request pages/Recent changes/2007-01-08 : Wiktionary:Main Page/Redesign > Special:Allpages/Wiktionary:Main Page/Redesign
- Wiktionary:MediaWiki custom messages : Wiktionary:MediaWiki namespace > Special:Allmessages
- Wiktionary:Tutorial (Namespaces) : Wiktionary:MediaWiki namespace > Special:Allmessages
Won't touch
Totals
Before pruning: 15839
After pruning: 7748 links to fix (on 7343 pages).