commit a0e7a494e49c3ed41df2beac84b370132455100f
Author: Steve Cliff
Date:   Thu Jan 29 17:37:11 2026 +0000

    Initial commit: IMAP email downloader

    Single-file Python script to download emails from IMAP servers:
    - Downloads emails as .eml files preserving folder structure
    - Extracts attachments to zip files
    - Supports SSL and STARTTLS connections
    - Incremental updates using UID tracking (default behavior)
    - Multi-account support with separate folders per email
    - Safety checks to prevent duplicate downloads

    Co-Authored-By: Claude Opus 4.5

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..67c3aff
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,8 @@
+# Python
+__pycache__/
+*.py[cod]
+.venv/
+venv/
+
+# Downloads
+download/
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..977c167
--- /dev/null
+++ b/README.md
@@ -0,0 +1,133 @@
+# IMAP Downloader
+
+A simple Python script to download all emails from an IMAP server into individual EML files, preserving the folder structure.
+
+## Features
+
+- Downloads emails as standard `.eml` files
+- Preserves IMAP folder hierarchy locally
+- Extracts attachments into zip files alongside each email
+- Supports SSL and STARTTLS connections
+- Incremental updates using UID tracking (only download new emails)
+- Multi-account support (separate folders per email address)
+- Configurable download limit for testing/debugging
+
+## Requirements
+
+- Python 3.6+
+- No external dependencies (uses only standard library)
+
+## Installation
+
+```bash
+# Clone or download the script
+git clone
+cd imapdown
+
+# Create virtual environment (optional but recommended)
+python3 -m venv .venv
+source .venv/bin/activate
+```
+
+## Usage
+
+### Basic Usage
+
+By default, the script only downloads new emails since the last run (incremental mode). On first run, it downloads everything.
+
+```bash
+# Download emails using SSL (most common)
+./imapdown.py --server imap.example.com --email me@example.com --user me@example.com --password "secret" --ssl
+
+# Using STARTTLS
+./imapdown.py --server imap.example.com --email me@example.com --user me@example.com --password "secret" --starttls
+
+# Custom port
+./imapdown.py --server imap.example.com --email me@example.com --user me@example.com --password "secret" --ssl --port 12993
+```
+
+### Full Download
+
+To force a complete download of all emails (ignoring previous state):
+
+```bash
+./imapdown.py --server imap.example.com --email me@example.com --user me@example.com --password "secret" --ssl --full
+```
+
+**Note:** As a safety measure, `--full` will refuse to run if the download folder already contains emails. This prevents accidental duplicates. To re-download everything, first delete the folder:
+
+```bash
+rm -rf download/me@example.com/
+./imapdown.py --server imap.example.com --email me@example.com --user me@example.com --password "secret" --ssl --full
+```
+
+### Debugging/Testing
+
+Limit the number of emails downloaded:
+
+```bash
+./imapdown.py --server imap.example.com --email me@example.com --user me@example.com --password "secret" --ssl --limit 10
+```
+
+## Command Line Arguments
+
+| Argument | Required | Description |
+|----------|----------|-------------|
+| `--server` | Yes | IMAP server hostname |
+| `--email` | Yes | Email address (used for folder organization) |
+| `--user` | Yes | Username for authentication |
+| `--password` | Yes | Password for authentication |
+| `--ssl` | No | Use implicit SSL/TLS (default port 993) |
+| `--starttls` | No | Use STARTTLS (default port 143) |
+| `--port` | No | Custom port (overrides defaults) |
+| `--limit` | No | Maximum number of emails to download |
+| `--full` | No | Download all emails (default: only new since last run) |
+
+Note: `--ssl` and `--starttls` are mutually exclusive.
+
+## Output Structure
+
+```
+./download/
+├── user@example.com/
+│   ├── .imapdown_state.json    # Tracks last downloaded UID per folder
+│   ├── INBOX/
+│   │   ├── 123_20240115_Meeting_notes.eml
+│   │   ├── 124_20240116_Report.eml
+│   │   └── 124_20240116_Report.zip    # Attachments (if any)
+│   ├── Sent/
+│   │   └── 456_20240114_RE_Question.eml
+│   └── Archive/
+│       └── 789_20240101_Old_email.eml
+└── another@example.com/
+    └── ...
+```
+
+### File Naming
+
+Email files are named: `{UID}_{date}_{subject}.eml`
+
+- **UID**: Unique identifier from the IMAP server
+- **date**: Message date in `YYYYMMDD_HHMMSS` format
+- **subject**: Sanitized email subject (truncated to 50 characters)
+
+### Attachments
+
+When an email contains attachments, they are extracted and saved in a zip file with the same base name as the `.eml` file but with a `.zip` extension.
+
+## State Tracking
+
+The script maintains a `.imapdown_state.json` file in each email account's folder. This file tracks the highest downloaded UID for each IMAP folder, enabling efficient incremental updates (the default mode when `--full` is not given).
+
+Example state file:
+```json
+{
+  "INBOX": 19334,
+  "INBOX.Archive": 1770,
+  "Sent": 892
+}
+```
+
+## License
+
+MIT
diff --git a/imapdown.py b/imapdown.py
new file mode 100755
index 0000000..c7a33fd
--- /dev/null
+++ b/imapdown.py
@@ -0,0 +1,395 @@
+#!/usr/bin/env python3
+"""Simple IMAP email downloader - downloads all emails to EML files."""
+
+import argparse
+import email
+import email.utils
+import imaplib
+import io
+import json
+import os
+import re
+import sys
+import zipfile
+from datetime import datetime
+
+
+def parse_args():
+    """Parse command line arguments."""
+    parser = argparse.ArgumentParser(
+        description="Download all emails from an IMAP server to EML files"
+    )
+
+    parser.add_argument("--server", required=True, help="IMAP server hostname")
+    parser.add_argument("--email", required=True, help="Email address")
+    parser.add_argument("--user", required=True, help="Username for authentication")
+    parser.add_argument("--password", required=True, help="Password for authentication")
+
+    security = parser.add_mutually_exclusive_group()
+    security.add_argument("--ssl", action="store_true", help="Use implicit SSL/TLS (default port 993)")
+    security.add_argument("--starttls", action="store_true", help="Use STARTTLS (default port 143)")
+
+    parser.add_argument("--port", type=int, help="Custom port (default: 993 for SSL, 143 otherwise)")
+    parser.add_argument("--limit", type=int, help="Limit number of emails to download (for debugging)")
+    parser.add_argument("--full", action="store_true", help="Download all emails (default: only new emails since last run)")
+
+    return parser.parse_args()
+
+
+def decode_modified_utf7(s):
+    """Decode IMAP modified UTF-7 folder names."""
+    result = []
+    i = 0
+    while i < len(s):
+        if s[i] == '&':
+            if i + 1 < len(s) and s[i + 1] == '-':
+                result.append('&')
+                i += 2
+            else:
+                end = s.find('-', i + 1)
+                if end == -1:
+                    result.append(s[i:])
+                    break
+                encoded = s[i + 1:end]
+                if encoded:
+                    encoded = encoded.replace(',', '/')
+                    padding = (4 - len(encoded) % 4) % 4
+                    encoded += '=' * padding
+                    try:
+                        import base64
+                        decoded = base64.b64decode(encoded).decode('utf-16-be')
+                        result.append(decoded)
+                    except Exception:
+                        result.append(s[i:end + 1])
+                i = end + 1
+        else:
+            result.append(s[i])
+            i += 1
+    return ''.join(result)
+
+
+def parse_folder_list(response):
+    """Parse IMAP LIST response to extract folder names."""
+    folders = []
+    pattern = re.compile(r'\((?P<flags>.*?)\) "(?P<delimiter>.*)" (?P<name>.*)')
+
+    for item in response:
+        if isinstance(item, bytes):
+            item = item.decode('utf-8', errors='replace')
+
+        match = pattern.match(item)
+        if match:
+            name = match.group('name')
+            if name.startswith('"') and name.endswith('"'):
+                name = name[1:-1]
+            name = decode_modified_utf7(name)
+            folders.append(name)
+
+    return folders
+
+
+def sanitize_filename(name, max_length=50):
+    """Sanitize a string for use as a filename."""
+    if not name:
+        return "untitled"
+    name = re.sub(r'[<>:"/\\|?*\x00-\x1f]', '_', name)
+    name = name.strip('. ')
+    name = name[:max_length]
+    name = name.strip('. ')
+    return name or "untitled"
+
+
+def sanitize_folder_path(folder_name):
+    """Sanitize folder path for filesystem use."""
+    parts = folder_name.replace('/', os.sep).replace('.', os.sep).split(os.sep)
+    sanitized = [sanitize_filename(p, max_length=100) for p in parts if p]
+    return os.path.join(*sanitized) if sanitized else "INBOX"
+
+
+def get_message_date(msg):
+    """Extract date from email message."""
+    date_str = msg.get('Date')
+    if date_str:
+        try:
+            parsed = email.utils.parsedate_to_datetime(date_str)
+            return parsed.strftime('%Y%m%d_%H%M%S')
+        except Exception:
+            pass
+    return datetime.now().strftime('%Y%m%d_%H%M%S')
+
+
+def get_message_subject(msg):
+    """Extract and decode subject from email message."""
+    subject = msg.get('Subject', '')
+    if not subject:
+        return 'no_subject'
+
+    try:
+        decoded_parts = email.header.decode_header(subject)
+        decoded = []
+        for part, charset in decoded_parts:
+            if isinstance(part, bytes):
+                charset = charset or 'utf-8'
+                try:
+                    decoded.append(part.decode(charset, errors='replace'))
+                except Exception:
+                    decoded.append(part.decode('utf-8', errors='replace'))
+            else:
+                decoded.append(part)
+        return ''.join(decoded)
+    except Exception:
+        return str(subject)
+
+
+def extract_attachments(msg, eml_filepath):
+    """Extract attachments from email and save as zip file."""
+    attachments = []
+
+    for part in msg.walk():
+        content_disposition = part.get('Content-Disposition', '')
+        if 'attachment' in content_disposition or 'inline' in content_disposition:
+            filename = part.get_filename()
+            if filename:
+                try:
+                    decoded_parts = email.header.decode_header(filename)
+                    decoded_filename = []
+                    for data, charset in decoded_parts:
+                        if isinstance(data, bytes):
+                            charset = charset or 'utf-8'
+                            decoded_filename.append(data.decode(charset, errors='replace'))
+                        else:
+                            decoded_filename.append(data)
+                    filename = ''.join(decoded_filename)
+                except Exception:
+                    pass
+
+                payload = part.get_payload(decode=True)
+                if payload:
+                    attachments.append((sanitize_filename(filename, max_length=100), payload))
+
+    if attachments:
+        zip_path = os.path.splitext(eml_filepath)[0] + '.zip'
+        with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
+            seen_names = {}
+            for filename, data in attachments:
+                if filename in seen_names:
+                    seen_names[filename] += 1
+                    name, ext = os.path.splitext(filename)
+                    filename = f"{name}_{seen_names[filename]}{ext}"
+                else:
+                    seen_names[filename] = 0
+                zf.writestr(filename, data)
+        return len(attachments)
+    return 0
+
+
+STATE_FILE = '.imapdown_state.json'
+
+
+def load_state(base_dir):
+    """Load the state file tracking last downloaded emails."""
+    state_path = os.path.join(base_dir, STATE_FILE)
+    if os.path.exists(state_path):
+        try:
+            with open(state_path, 'r') as f:
+                return json.load(f)
+        except Exception:
+            pass
+    return {}
+
+
+def save_state(base_dir, state):
+    """Save the state file."""
+    state_path = os.path.join(base_dir, STATE_FILE)
+    with open(state_path, 'w') as f:
+        json.dump(state, f, indent=2)
+
+
+def connect_imap(server, port, use_ssl, use_starttls):
+    """Connect to IMAP server with appropriate security."""
+    if use_ssl:
+        port = port or 993
+        print(f"Connecting to {server}:{port} with SSL...")
+        return imaplib.IMAP4_SSL(server, port)
+    else:
+        port = port or 143
+        print(f"Connecting to {server}:{port}...")
+        conn = imaplib.IMAP4(server, port)
+        if use_starttls:
+            print("Upgrading to TLS with STARTTLS...")
+            conn.starttls()
+        return conn
+
+
+def download_folder(conn, folder_name, base_dir, limit=None, total_so_far=0, update_mode=False, last_uid=None):
+    """Download all emails from a folder. Returns (downloaded_count, highest_uid)."""
+    local_path = os.path.join(base_dir, sanitize_folder_path(folder_name))
+    os.makedirs(local_path, exist_ok=True)
+
+    try:
+        status, _ = conn.select(f'"{folder_name}"', readonly=True)
+        if status != 'OK':
+            print(f" Could not select folder: {folder_name}")
+            return 0, last_uid
+    except Exception as e:
+        print(f" Error selecting folder {folder_name}: {e}")
+        return 0, last_uid
+
+    if update_mode and last_uid is not None:
+        status, data = conn.uid('SEARCH', None, f'UID {last_uid + 1}:*')
+    else:
+        status, data = conn.uid('SEARCH', None, 'ALL')
+
+    if status != 'OK':
+        print(f" Could not search folder: {folder_name}")
+        return 0, last_uid
+
+    uid_list = data[0].split()
+
+    # Filter out UIDs <= last_uid (some servers return highest UID even when searching for higher)
+    if update_mode and last_uid is not None:
+        uid_list = [uid for uid in uid_list if int(uid) > last_uid]
+
+    if not uid_list:
+        print(f" {folder_name}: no new messages")
+        return 0, last_uid
+
+    if limit is not None:
+        remaining = limit - total_so_far
+        if remaining <= 0:
+            return 0, last_uid
+        uid_list = uid_list[:remaining]
+
+    print(f" {folder_name}: {len(uid_list)} messages to download")
+    downloaded = 0
+    highest_uid = last_uid
+
+    for uid in uid_list:
+        try:
+            uid_int = int(uid)
+            status, data = conn.uid('FETCH', uid, '(RFC822)')
+            if status != 'OK':
+                continue
+
+            raw_email = None
+            for part in data:
+                if isinstance(part, tuple):
+                    raw_email = part[1]
+                    break
+
+            if raw_email is None:
+                continue
+
+            msg = email.message_from_bytes(raw_email)
+            date_str = get_message_date(msg)
+            subject = sanitize_filename(get_message_subject(msg))
+
+            filename = f"{uid_int}_{date_str}_{subject}.eml"
+            filepath = os.path.join(local_path, filename)
+
+            counter = 1
+            base_filepath = filepath
+            while os.path.exists(filepath):
+                name, ext = os.path.splitext(base_filepath)
+                filepath = f"{name}_{counter}{ext}"
+                counter += 1
+
+            with open(filepath, 'wb') as f:
+                f.write(raw_email)
+
+            extract_attachments(msg, filepath)
+            downloaded += 1
+
+            if highest_uid is None or uid_int > highest_uid:
+                highest_uid = uid_int
+
+        except Exception as e:
+            print(f" Error downloading UID {uid}: {e}")
+
+    return downloaded, highest_uid
+
+
+def main():
+    args = parse_args()
+
+    email_folder = sanitize_filename(args.email, max_length=100)
+    base_dir = os.path.join(os.getcwd(), 'download', email_folder)
+    os.makedirs(base_dir, exist_ok=True)
+
+    if args.full:
+        has_emails = False
+        for root, dirs, files in os.walk(base_dir):
+            if any(f.endswith('.eml') for f in files):
+                has_emails = True
+                break
+        if has_emails:
+            print(f"Error: --full specified but {base_dir} already contains emails.", file=sys.stderr)
+            print("Delete the folder first to do a full re-download, or run without --full for incremental update.", file=sys.stderr)
+            sys.exit(1)
+
+    try:
+        conn = connect_imap(args.server, args.port, args.ssl, args.starttls)
+    except Exception as e:
+        print(f"Connection failed: {e}", file=sys.stderr)
+        sys.exit(1)
+
+    try:
+        status, _ = conn.login(args.user, args.password)
+        if status != 'OK':
+            print("Authentication failed", file=sys.stderr)
+            sys.exit(1)
+        print("Logged in successfully")
+    except Exception as e:
+        print(f"Authentication failed: {e}", file=sys.stderr)
+        sys.exit(1)
+
+    try:
+        status, folder_data = conn.list()
+        if status != 'OK':
+            print("Could not list folders", file=sys.stderr)
+            sys.exit(1)
+
+        folders = parse_folder_list(folder_data)
+        print(f"Found {len(folders)} folders")
+
+        update_mode = not args.full
+        state = load_state(base_dir) if update_mode else {}
+        if args.full:
+            print("Full download mode: downloading all emails")
+        else:
+            print("Incremental mode: only downloading new emails (use --full to download all)")
+
+        total_downloaded = 0
+        for folder in folders:
+            last_uid = None
+            if update_mode and folder in state:
+                try:
+                    last_uid = int(state[folder])
+                except (ValueError, TypeError):
+                    pass
+
+            downloaded, highest_uid = download_folder(
+                conn, folder, base_dir, args.limit, total_downloaded,
+                update_mode=update_mode, last_uid=last_uid
+            )
+            total_downloaded += downloaded
+
+            if highest_uid is not None:
+                state[folder] = highest_uid
+
+            if args.limit and total_downloaded >= args.limit:
+                print(f" Reached limit of {args.limit} emails")
+                break
+
+        save_state(base_dir, state)
+        print(f"\nDownloaded {total_downloaded} emails to {base_dir}")
+
+    finally:
+        try:
+            conn.logout()
+        except Exception:
+            pass
+
+
+if __name__ == '__main__':
+    main()
diff --git a/plan.md b/plan.md
new file mode 100644
index 0000000..d82f465
--- /dev/null
+++ b/plan.md
@@ -0,0 +1,97 @@
+# Implementation Plan: IMAP Downloader
+
+## Overview
+
+Create a single-file Python script (`imapdown.py`) that downloads all emails from an IMAP server and saves them as individual EML files in a local folder structure mirroring the IMAP mailbox hierarchy.
+
+## Implementation Steps
+
+### 1. Argument Parsing
+
+Use `argparse` to handle command line arguments:
+
+**Mandatory arguments:**
+- `--server` - IMAP server hostname
+- `--email` - Email address
+- `--user` - Username for authentication
+- `--password` - Password for authentication
+
+**Optional arguments:**
+- `--ssl` - Use implicit SSL/TLS (typically port 993)
+- `--starttls` - Use STARTTLS upgrade (typically port 143)
+- `--port` - Custom port (defaults: 993 for SSL, 143 for STARTTLS/plain)
+
+Add mutual exclusion for `--ssl` and `--starttls`.
+
+### 2. IMAP Connection
+
+- Use Python's built-in `imaplib` module
+- Connection logic:
+  - If `--ssl`: Use `IMAP4_SSL` (default port 993)
+  - If `--starttls`: Use `IMAP4`, then call `starttls()` (default port 143)
+  - If neither: Use plain `IMAP4` (default port 143)
+- Authenticate with provided credentials
+
+### 3. Folder Discovery
+
+- Use `list()` method to get all mailbox folders
+- Parse folder names and hierarchy delimiter
+- Handle folder name encoding (IMAP uses modified UTF-7)
+
+### 4. Email Download
+
+For each folder:
+1. Create corresponding local directory structure
+2. Select the folder with `select()`
+3. Search for all messages with `search(None, 'ALL')`
+4. For each message:
+   - Fetch the complete RFC822 message
+   - Generate a unique filename (using UID or message ID + date)
+   - Save as `.eml` file
+
+### 5. File Naming Strategy
+
+Use a naming scheme that ensures uniqueness and provides useful info:
+- Format: `{UID}_{date}_{subject_snippet}.eml`
+- Sanitize subject for filesystem safety
+- Handle duplicates by appending counter if needed
+
+### 6. Error Handling
+
+- Connection failures
+- Authentication errors
+- Folder access issues
+- Invalid/corrupt messages
+- Filesystem errors (permissions, disk space)
+
+## Dependencies
+
+Only Python standard library:
+- `imaplib` - IMAP protocol
+- `argparse` - Command line parsing
+- `email` - Email message parsing
+- `os` / `pathlib` - Filesystem operations
+- `re` - Regex for sanitization
+- `datetime` - Date handling
+
+## Output Structure
+
+```
+./download/
+├── INBOX/
+│   ├── 1_20240115_Meeting_notes.eml
+│   └── 2_20240116_Project_update.eml
+├── Sent/
+│   └── 1_20240114_RE_Question.eml
+└── Archive/
+    └── 2023/
+        └── 1_20230501_Old_email.eml
+```
+
+## Testing Approach
+
+1. Test argument parsing with various combinations
+2. Test connection with SSL, STARTTLS, and plain
+3. Test with folders containing special characters
+4. Test with empty folders
+5. Verify EML files are valid and openable
diff --git a/project.md b/project.md
new file mode 100644
index 0000000..bb8e3ef
--- /dev/null
+++ b/project.md
@@ -0,0 +1,27 @@
+# Simple IMAP downloader
+
+A single file Python script to download all emails from an IMAP inbox into single EML files, one per email, into a folder structure representing the same folder structure in the IMAP inbox
+
+## Arguments
+
+Mandatory:
+
+--server
+--email
+--user
+--password
+
+Optional (if not supplied, use sensible defaults)
+
+--ssl or --starttls (either allowed but not both)
+--port
+
+## Environment
+
+There is a virtual Python environment set up in .venv - use it
+
+## Additional requirements
+
+- limit the number of returned emails with '--limit xxx' - this is mainly to be used for debugging purposes
+- ensure that file attachments (if available) are downloaded as well - zip these up into a single zip file and name it after the downloaded .eml file but with .zip instead
+- keep track of the latest email downloaded - if `--update` is specified then just pull back emails newer than the last email downloaded
\ No newline at end of file
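
Note on the last requirement in project.md (keeping track of the latest email downloaded): the following is a minimal standalone sketch of per-folder UID tracking, not the code from this commit. The helper names `select_new_uids` and `update_state` are hypothetical; only the state file name is taken from `imapdown.py`. It assumes UIDs arrive as the byte strings that `imaplib` returns from a UID SEARCH. The explicit filter is there because, per RFC 3501, a range such as `UID 8:*` still matches the message with the highest UID in a non-empty mailbox, even when that UID is lower than 8.

```python
import json
import os

STATE_FILE = ".imapdown_state.json"  # same file name as in imapdown.py


def select_new_uids(uid_list, last_uid):
    """Return UIDs as ascending ints, keeping only those above last_uid.

    uid_list holds byte strings (or str/int); last_uid may be None,
    meaning nothing has been downloaded from this folder yet.
    """
    uids = sorted(int(u) for u in uid_list)
    if last_uid is None:
        return uids
    # Drop anything already seen, including the highest existing UID
    # that a "last_uid+1:*" search can still return.
    return [u for u in uids if u > last_uid]


def update_state(base_dir, folder, downloaded_uids):
    """Store the highest downloaded UID for a folder; return the new state."""
    path = os.path.join(base_dir, STATE_FILE)
    state = {}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    if downloaded_uids:
        # Never move the recorded UID backwards.
        state[folder] = max([state.get(folder, 0)] + list(downloaded_uids))
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
    return state
```

For example, `select_new_uids([b"5", b"7", b"9"], 7)` keeps only UID 9, and calling `update_state(base_dir, "INBOX", [9])` afterwards records 9 as the starting point for the next run.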