Phase 0 (October 27th)
The IMAP server was upgraded to 2006a to allow testing of the new file format on a test account. New Internet standards for IMAP (support for UID List) required that some old mail messages with invalid UIDs be renumbered. Users using the POP protocol, particularly those who had not deleted messages on the server, saw duplicated emails and old messages re-appearing. (The mail program keeps a big list of message UIDs it has already seen.)
Phase 1 (the night of October 3rd/4th)
"mail.triumf.ca" alias created as an outgoing server, points to "trmail"
For each user:
Hopefully, most users will be able to access their mail on "trmail" in the evening, and access it on "trmail" the next morning, with no disruption. (some hope!)
Phase 2 (again, probably at night)
For each user:
Phase 3
Phase 4
Phase 5
Undoubtedly there will be some more, though we will endeavour to provide (nearly) uninterrupted service to all users.
Wednesday May 26 2010
Around 17:00 on Tuesday May 25 trmail logged errors on the sunstore connection,
and the device went offline shortly afterwards. This affected some 70 users.
In order to reset the connection, trmail was rebooted and subsequently
had a boot failure where the system volume became unreadable. An attempt to
recover this volume failed and the system was re-installed from
distribution and backup media.
Around 08:00 on Wednesday the system appeared to be functioning normally, and deferred mail from other sites was being delivered to users' inboxes. Shortly thereafter some mail folders on the sunstore appeared to be corrupted: they were unreadable, and the directory listing was similar to that seen on a failed disk drive. On the assumption that all data on the sunstore was suspect, new mail inboxes were created for affected users on another storage device - the Coraid. This device already contained a copy of user data prior to May 19th. At 12:00 on Wednesday, each affected user had 3 inboxes - a new one on the Coraid collecting incoming mail since 9am, an old one (INBOX-prior) on the Coraid with mail from before May 19, and a suspect one on the sunstore containing mail up till 9am. At this time, it appeared to users that most of their mail had disappeared, but they could send and receive new mail normally.
Once all live email was removed from the sunstore, the connection was reset and it was apparent that the files were not in fact corrupted on disk - the file corruption was an artefact of the network connection. Accordingly a procedure was implemented to merge the 3 inboxes into a single one on the Coraid. At the completion of this, users would have seen a complete inbox, but other folders would have reverted to a state from 10 days previous. An error in setting file permissions also caused folders to be unavailable for a while on Wednesday afternoon, and some incoming mail was delivered to a mail spool file.
After the merging of inboxes, other folders from the sunstore were checked for integrity and, if they were more recent than those already on the Coraid, restored there. The permission problem was fixed and mail from the spool file was merged with the normal inboxes. This was completed about 5pm on Wednesday.
Mon May 31 2010
The above procedure left a number of mailboxes, principally variations of "Sent" and "Drafts", which had been in use on the Coraid since May 26, along with a mailbox on the Sunstore containing messages saved between May 19-25. It was not possible to simply discard one copy, since messages would have been lost, and it was not practical to simply append one mailbox to the other since that would have left thousands of duplicate messages. Mailboxes for some users who had urgent needs were restored manually on Thu-Fri 27-28, while the majority awaited the development of an automated solution, since there were a total of some 500 affected mailboxes - too many for manual restoration.
Wed, Thu Jun 2-3 2010
A script was run to merge "Drafts" and "postponed-messages" folders. Messages with a "unique ID" seen on the Sunstore but not on the Coraid were copied across to the active folder on the Coraid, preserving their date. Depending on the mail client settings, they might appear in date order, or appended at the end of the folder.
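The merge rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script that was actually run: the `Message` class, the `merge_folder` function and the sample IDs are all hypothetical, and a real folder would be read from the mail store rather than built in memory.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Message:
    unique_id: str   # hypothetical stand-in for the "unique ID" the script compared
    date: datetime   # original message date, carried over unchanged
    subject: str


def merge_folder(active, recovered):
    """Return the active folder plus any recovered message not already in it."""
    seen = {m.unique_id for m in active}
    merged = list(active)
    for msg in recovered:
        if msg.unique_id not in seen:
            merged.append(msg)        # appended at the end, date untouched
            seen.add(msg.unique_id)
    return merged


# Folder in use on the Coraid since May 26 (sample data, invented for the sketch)
coraid = [
    Message("c1", datetime(2010, 5, 27), "new draft"),
    Message("s1", datetime(2010, 5, 18), "old draft"),
]
# Folder recovered from the Sunstore
sunstore = [
    Message("s1", datetime(2010, 5, 18), "old draft"),
    Message("s2", datetime(2010, 5, 22), "draft saved May 22"),
]

merged = merge_folder(coraid, sunstore)
by_date = sorted(merged, key=lambda m: m.date)
```

Here `merged` keeps the copied message at the end of the folder, while `by_date` is the order a client sorting by date would show, which matches the two views mentioned above.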
Fri Jun 4 2010
The script was extended to all other folders, other than Inbox and Drafts which had already been processed. Affected folders were backed up to a folder suffixed ".bk4jun", and merged as described. A problem handling folders with spaces in the folder name was quickly found and fixed, and the process run for all affected users.
Mon Jun 7 2010
A few folders that had failed the script on Friday, and some that could not be merged because they did not exist on the Coraid, were
copied.
At this point, all mailboxes should have been restored.
Sun Aug 16 10:50 PDT 2009
The Sun storage device (sunstore), which was being used to hold mail for high-volume
users, failed. Email has been restored, and mail folders recovered from backup.
Some mail will have been delayed. Mail received between the time of last backup and time of
failure will be unavailable until sunstore is repaired.
21 May 2007 - remaining MBX and Unix mailboxes converted to MIX format. All mailfolders may now contain subfolders, not just MIX ones.
4 May 2007 - imap updated to version imap-h production release
27 March 2007 - imap updated to version imap-2006g snapshot to address some issues with Outlook clients
27 Feb 2007 - imap updated to version imap-2006e
Nov 7 - squirrelmail personal addressbooks imported/merged from old server. Some
addresses recovered from IMP addressbooks, marked "from IMP"
Backup files (.mbx) removed for most users.
Oct 29 00:00 - imapd upgraded to version 2006c1
More mailboxes converted; most large boxes. Some .mbx backups kept for now; will be deleted.
Phase 1 completed October 4th; some problems...
| Service | Plaintext | TLS | SSL |
|---|---|---|---|
| Webserver | 80: OK | (443: OK) | 443: OK |
| SMTP | 25: OK | 25: OK | 465: OK |
| SMTP | 587: OK | 587: OK | |
| IMAP | 143: OK | 143: OK | 993: OK |
| POP3 | 110: OK | 110: OK | 995: OK |
| POP2 | 109: disabled | N/A | N/A |
| LDAP | 389: OK | (636: OK) | 636: OK |

| Service | Status |
|---|---|
| SpamAssassin | Running |
| AVP antivirus | Running |
| Whitelist | OK |
| Blacklist | Running |
| Mailman | Running |
Outgoing mail: See Setting an outgoing SMTP server
Reading mail: See Setting an IMAP server
The following addressbook (LDAP) settings may be used:
POP3 is now enabled.
The new recommended outgoing SMTP server "mail.triumf.ca" generates a certificate
warning on some clients if TLS or SSL is used. This can be ignored. In the future, this
may be a separate machine with its own certificate and the problem will go away.
Some users reported problems due to Web cache - their Web browser was
showing the old page. Solution: reload/refresh page.
Some users reported problems due to ARP cache - their
computer was still looking at the old hardware based on its Ethernet address.
Solution: flush ARP cache, or reboot computer.
SpamAssassin was not enabled on the morning of Oct 4th. Users
received some untagged spam in their inbox instead of tagged or in "junk".
Some users were running a spam filter but had no "junk" folder
and the mailer was reporting "unsafe" delivery;
this folder has been recreated.
Many users were found to be requesting authentication "if possible" when sending mail. As the old mailserver did not support authentication this worked. Since authentication is enabled on the new server, users saw a password challenge and in some cases a Symantec antivirus alert. Initially the authentication mechanism did not work properly. Solution: turn off authentication for outgoing mail from within TRIUMF.
Initially POP3 service was configured to require encryption - the default on new servers for security reasons. Only TLS was working, which is not supported in older clients such as Netscape Communicator. Later, POP3 with SSL was enabled, and on Oct 14th, POP3 with plain login was enabled. Note: Encryption is still recommended, especially on laptops using wireless networking. POP3 passwords are trivially visible on wireless.
Due to a change in Internet standards, all email messages stored on the server now need a unique ID number;
this forced renumbering of messages in some mailboxes. For clients using POP3 to read mail,
this caused a mismatch with status stored in the client program (e.g. Thunderbird, Outlook).
The client then downloaded "new" messages which the user thought they had deleted or already read.
This may happen again as we migrate to the new mailbox format.
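A small Python sketch may help show why renumbering makes old mail reappear for POP3 users. It assumes, as described above, that the client simply remembers the IDs it has already fetched; the function name and the sample ID values are invented for the illustration and do not come from any particular mail program.

```python
def messages_to_download(server_uids, seen_uids):
    """Return the server IDs the client has no record of, in server order."""
    return [uid for uid in server_uids if uid not in seen_uids]


# IDs the client recorded before the upgrade (sample values)
seen = {"1001", "1002", "1003"}

# The same three messages as listed by the server before and after renumbering
before_upgrade = ["1001", "1002", "1003"]
after_upgrade = ["2001", "2002", "2003"]

unchanged = messages_to_download(before_upgrade, seen)
redownloaded = messages_to_download(after_upgrade, seen)
```

With the old IDs nothing is fetched again. After renumbering, none of the server's IDs match the client's list, so all three messages look new and are downloaded a second time.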
LDAP (address lookup and website authentication) was not working. Temporary solution: use the LDAP service on the old machine (trmail-old). Note: A new CNAME "ldap.triumf.ca" has been created; please use that instead of "trmail" in addressbook and authentication configuration.
Mailman (mailing list archives) was not working. Temporary solution: use mailing lists on the old server. The mailing list addresses (listxxx@trmail) are redirected and should work.
Some people's incoming mail was going in a mailbox called "broken" due to a missing XXX filter. "broken" mail was merged into normal inboxes. Some users may have seen old lost mail and spam reappear. Some mail may have been duplicated.
Between October 4th - 11th, the spam blacklists were not rejecting mail - the whitelist was misconfigured and was accepting all mail. Users may have seen more spam than usual, though most would have been tagged or filtered by SpamAssassin.
Between October 4th - 10th the virtual host Web pages - ca.triumf.ca, cbl.triumf.ca, rbl2.triumf.ca - were not working.
On October 4th, the Secure Web (https/SSL) was not working, blocking access to Webmail (Squirrelmail). As a temporary workaround users were directed to use the unencrypted service.
Between October 4th - 6th: the SpamAssassin central whitelists (CERN, JLAB etc.) were not configured - some "spammy" looking email from "trusted" institutions may have been flagged as spam, though this is fairly unlikely.
On Oct 16th sending mail with authentication (such as from laptops) was broken for a while following an LDAP rebuild.
Oct 19th: LDAP directories enabled for Squirrelmail
Oct 19th AM: trshare was rebooted when it became unresponsive; users could not
send mail through smtp.triumf.ca. Unrelated to the email upgrade.
Oct 19th: Domain now set correctly for Squirrelmail outgoing mail
Oct 20-22: Some mailboxes (junk, Trash...) converted to Mix format,
starting with larger files where most benefit will be seen.
Oct 25: Most inboxes converted to Mix format. Old format preserved temporarily as INBOX.mbx - you must subscribe to this if you want to check. This will be deleted later.
Oct 24: Mailman switched to new server, now called lists.triumf.ca
Nov 6 - Attn. Pine users - see this page for notes on Mix issues
Nov 2 15:20 - LDAP server "ldap.triumf.ca" moved to point to trmail (new). All authentication and addressbooks should use this CNAME (ldap.triumf.ca) as it may move again. Stop using trmail-old.triumf.ca
Either TLS or SSL should be used on wireless devices (e.g. laptops) whenever a password is sent over the network. This includes reading email with IMAP or POP, using the Squirrelmail Webmail client, logging on to TRIUMF internal pages from offsite, and sending mail via smtp.triumf.ca from offsite. In some cases (e.g. smtp.triumf.ca) this is enforced - the system will not accept a password unless encryption is used.
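For anyone scripting their own mail access, the following Python sketch shows one way to make sure an IMAP password is only ever sent over an encrypted connection. It uses the standard `imaplib` and `ssl` modules with the IMAP ports from the status table (143 for TLS, 993 for SSL). The function name `open_encrypted_imap` is hypothetical and the host name is left as a parameter, since the correct server name should be taken from the IMAP setup instructions.

```python
import imaplib
import ssl

# Ports taken from the service status table
IMAP_PORTS = {"tls": 143, "ssl": 993}


def open_encrypted_imap(host, mode="ssl"):
    """Open an IMAP connection that is encrypted before any login is attempted."""
    if mode not in IMAP_PORTS:
        # Refuse plaintext outright rather than fall back silently
        raise ValueError("mode must be 'tls' or 'ssl'")
    context = ssl.create_default_context()  # verifies the server certificate
    if mode == "ssl":
        return imaplib.IMAP4_SSL(host, IMAP_PORTS["ssl"], ssl_context=context)
    conn = imaplib.IMAP4(host, IMAP_PORTS["tls"])
    conn.starttls(ssl_context=context)      # upgrade the plain connection to TLS
    return conn
```

A caller would then run `conn = open_encrypted_imap(host, "ssl")` followed by `conn.login(user, password)`. Because the default context checks the certificate against the host name, connecting through an alias whose name is not on the server certificate may fail in the same way that some mail clients show a certificate warning.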