When and Why to Change Your Database Collation

Changing Database Collation Without Data Loss

Changing a database collation can be necessary for many reasons: you might be standardizing systems after a merger, fixing sorting or comparison bugs, or moving to a Unicode-capable collation to support multilingual data. Done improperly, however, the change can corrupt text, break indexes, or produce unexpected sorting and comparison behavior. This guide explains what collations are, why changes are needed, and what the risks are, then gives step‑by‑step procedures and best practices for changing a database collation without losing or mangling data. Examples focus on MySQL/MariaDB and Microsoft SQL Server; many principles also apply to other systems such as PostgreSQL and Oracle.


What is collation, and why does it matter?

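Collation is the set of rules that determine how strings are compared and sorted. Every collation is defined for a particular character set, so the two are usually chosen together. In practice a collation change involves:

  • Character set/encoding (how characters are stored as bytes). Strictly speaking this is separate from the collation, but a collation is only valid for its own character set.
  • Sorting order (which character comes before another).
  • Comparison rules (case sensitivity and accent sensitivity).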
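To make case and accent sensitivity concrete, here is a rough Python sketch. It is only an approximation built on Unicode normalization and case folding; real database collations use much fuller rule tables, and the function names are made up for this illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters and drop combining marks (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collation_key(text: str, case_sensitive: bool, accent_sensitive: bool) -> str:
    """Rough stand-in for the key a collation compares; not a real collation."""
    if not accent_sensitive:
        text = strip_accents(text)
    if not case_sensitive:
        text = text.casefold()
    return unicodedata.normalize("NFC", text)


def equal_under(a: str, b: str, case_sensitive: bool, accent_sensitive: bool) -> bool:
    """Compare two strings under the chosen sensitivity settings."""
    return (collation_key(a, case_sensitive, accent_sensitive)
            == collation_key(b, case_sensitive, accent_sensitive))


# Case-insensitive, accent-insensitive (similar in spirit to a *_ai_ci collation)
print(equal_under("Résumé", "resume", case_sensitive=False, accent_sensitive=False))  # True
# Case-insensitive, accent-sensitive (similar in spirit to CI_AS)
print(equal_under("Résumé", "resume", case_sensitive=False, accent_sensitive=True))   # False
print(equal_under("Résumé", "résumé", case_sensitive=False, accent_sensitive=True))   # True
```

The same pair of strings can be equal or different depending on the rules in force, which is why a collation change can silently alter the results of WHERE clauses and unique constraints.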

Common issues caused by incorrect collations:

  • Incorrect ORDER BY results.
  • WHERE comparisons failing to match expected rows.
  • JOINs not matching because columns use different collations.
  • Corrupted or incorrectly interpreted characters when moving between encodings (e.g., latin1 to utf8/utf8mb4).

Key risks when changing collation

  • Character data corruption if character set conversion is mishandled (for example, changing from latin1 to utf8 without converting the stored bytes properly).
  • Index rebuilds can be expensive and may lock tables.
  • Application-level assumptions (case-sensitivity, accent handling) may break.
  • Mismatched collations across columns, databases, or servers can lead to errors in queries (SQL Server, for example, raises a collation conflict error instead of picking one side).

Always assume the change may be destructive unless you verify data and take precautions.
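The corruption risk is easiest to see at the byte level. The following Python sketch shows what happens when UTF-8 bytes are read as latin1, and a heuristic for spotting text that has already been damaged this way. Python's "latin-1" codec is used as a stand-in for the database's latin1 (MySQL's latin1 is actually closer to cp1252), and `looks_double_encoded` is a hypothetical helper, not part of any tool.

```python
original = "café"
utf8_bytes = original.encode("utf-8")       # b'caf\xc3\xa9' (5 bytes)

# Mishandled conversion: the bytes are UTF-8 but are interpreted as latin1.
misread = utf8_bytes.decode("latin-1")
print(misread)                              # cafÃ©

# Storing the misread text as UTF-8 "double-encodes" it (7 bytes now).
double_encoded = misread.encode("utf-8")
print(len(utf8_bytes), len(double_encoded))

# The damage is reversible only while the mistaken bytes are still intact.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired)                             # café


def looks_double_encoded(text: str) -> bool:
    """Heuristic: True if re-reading the text's latin1 bytes as UTF-8 gives different, valid text."""
    try:
        candidate = text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not representable in latin1, or not valid UTF-8: probably fine as-is.
        return False
    return candidate != text


print(looks_double_encoded("cafÃ©"))  # True
print(looks_double_encoded("café"))   # False
```

A check like this is only a heuristic, but running it over a sample of rows before and after a test conversion can reveal whether a column's declared character set matches what is really stored in it.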


High-level strategy

  1. Inventory current state (character sets, collations, column-level overrides).
  2. Back up everything (logical + physical where possible).
  3. Test the change on a copy of the database.
  4. Convert character set first if moving to Unicode (e.g., latin1 → utf8mb4).
  5. Change collations at the database, table, and column levels in controlled steps.
  6. Rebuild indexes and update application queries if needed.
  7. Validate thoroughly (data integrity, sorting, searching, performance).
  8. Roll out to production during a maintenance window with a rollback plan.

Pre-change checklist

  • Full logical backup (mysqldump, BACPAC, or equivalent).
  • Physical snapshot if supported (VM snapshot, storage snapshot).
  • List of databases, tables, and columns with current character set/collation.
  • Identify text columns: CHAR, VARCHAR, TEXT, NVARCHAR, NCHAR.
  • Identify stored procedures, views, triggers, computed columns relying on string comparisons.
  • Estimate downtime required for index rebuilds.
  • Test environment mirroring production data and workload.

MySQL / MariaDB: Step‑by‑step

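Assumptions: migrating to utf8mb4 and a utf8mb4 collation such as utf8mb4_unicode_520_ci or utf8mb4_0900_ai_ci. The latter was introduced in MySQL 8.0 and is not offered by older MariaDB releases, so check what your server supports (SHOW COLLATION) and replace collation names per your needs.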

  1. Inventory current collations:

    SELECT table_schema, table_name, column_name,
           character_set_name, collation_name
    FROM information_schema.columns
    WHERE table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
      AND data_type IN ('char', 'varchar', 'text', 'tinytext', 'mediumtext', 'longtext');

  2. Back up:

  • Logical: mysqldump --routines --triggers --events --single-transaction --set-gtid-purged=OFF -u user -p dbname > dump.sql
  • Physical: file-system snapshot or LVM snapshot if possible.

  3. Test the conversion on a copy.

  4. Convert the database default character set and collation:

    ALTER DATABASE dbname CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci;

    This changes defaults for new tables/columns only.

  5. Convert each table and column. Two methods:

  • Table-level conversion (simpler, converts all text columns):

    ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

    This converts data and rebuilds the table. It can be expensive and may lock the table.

  • Column-level conversion (more granular):

    ALTER TABLE tbl_name
      MODIFY column_name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
      MODIFY another_col TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

    MODIFY replaces the whole column definition, so restate NOT NULL, DEFAULT, and any other attributes the column already has.

Notes:

  • CONVERT TO CHARACTER SET converts the stored text from the old character set to the new one, and it assumes the column's declared character set matches what is actually stored. If a latin1 column really contains UTF-8 bytes, a direct conversion will double-encode them, so verify results on a copy first.
  • For big tables, consider pt-online-schema-change (Percona Toolkit) or gh-ost to avoid long locks.
  • Recreate FULLTEXT indexes if necessary; some engines handle them differently under utf8mb4.

  6. Rebuild indexes where necessary:

  • ALTER TABLE … ENGINE=InnoDB; or explicit DROP/CREATE INDEX.

  7. Verify:

  • SELECT COUNT(*) FROM table WHERE column LIKE '%…%' for sample queries.
  • Compare checksums between original and converted tables (mysqldump or checksum tools). A character set conversion changes the stored bytes, so compare the decoded text (for example, dumps taken with the same client character set) rather than raw table checksums.
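When many tables need the table-level conversion described above, generating the statements from the inventory is less error-prone than typing them. The following Python sketch only builds the SQL text for review; it does not connect to a server. The function names and the sample schema and table names are hypothetical.

```python
import re


def quote_identifier(name: str) -> str:
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def check_name(name: str) -> str:
    """Allow only simple charset/collation names such as utf8mb4_0900_ai_ci."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"unexpected charset or collation name: {name!r}")
    return name


def build_convert_statements(tables, charset="utf8mb4", collation="utf8mb4_0900_ai_ci"):
    """Return one ALTER TABLE ... CONVERT TO statement per (schema, table) pair."""
    charset = check_name(charset)
    collation = check_name(collation)
    return [
        f"ALTER TABLE {quote_identifier(schema)}.{quote_identifier(table)} "
        f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for schema, table in tables
    ]


# Hypothetical inventory result: (schema, table) pairs that still need converting.
pending = [("shop", "customers"), ("shop", "orders")]
for statement in build_convert_statements(pending):
    print(statement)
```

Reviewing the generated statements on the test copy first makes it easier to run exactly the same script in production later.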

SQL Server: Step‑by‑step

SQL Server collations can be set at the server, database, and column levels. For non-Unicode types (CHAR, VARCHAR), the collation also determines the code page used to store the data. For Unicode types (NCHAR, NVARCHAR), data is stored as UCS-2/UTF-16, so the collation affects sorting and comparison but not encoding.

  1. Inventory:

    SELECT name, collation_name FROM sys.databases;

    SELECT t.name AS table_name, c.name AS column_name, c.collation_name
    FROM sys.columns c
    JOIN sys.tables t ON c.object_id = t.object_id
    WHERE c.collation_name IS NOT NULL;

  2. Back up (full backup).

  3. To change the database default collation:

    ALTER DATABASE YourDB COLLATE Latin1_General_100_CI_AS_SC;

    This changes the default for new objects only.

  4. To change column collations:

    ALTER TABLE dbo.YourTable ALTER COLUMN YourColumn NVARCHAR(200) COLLATE Latin1_General_100_CI_AS_SC NOT NULL;

    Notes:

  • For large tables, ALTER COLUMN can rebuild the table and lock it.
  • If columns participate in indexes or constraints, you must drop those indexes/constraints first and recreate them afterwards.

  5. Changing the server collation requires rebuilding the system databases (including master) and restarting; it is rarely needed.

  6. Take special care when converting non-Unicode (VARCHAR) data to a different code page: you may need to migrate to NVARCHAR to avoid lossy conversions.

  7. Verify sorting and comparisons, especially for case/accent sensitivity changes.
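The lossy-conversion risk for non-Unicode data can be checked ahead of time. As a rough illustration, this Python sketch lists the characters in a value that a given code page cannot represent. Python's cp1252 and cp1251 codecs stand in for the code pages behind typical Latin1 and Cyrillic collations, and `lossy_characters` is a hypothetical helper.

```python
def lossy_characters(text: str, code_page: str) -> list[str]:
    """Return the characters in text that code_page cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


sample = "Zoë Привет"

# A Western European code page cannot hold the Cyrillic letters.
print(lossy_characters(sample, "cp1252"))

# A Cyrillic code page cannot hold the Latin letter with a diaeresis.
print(lossy_characters(sample, "cp1251"))

# Unrepresentable characters are typically replaced with '?', and the original is lost.
print(sample.encode("cp1252", errors="replace").decode("cp1252"))  # Zoë ??????
```

If a scan like this over exported data reports any characters for the target code page, moving the column to NVARCHAR is the safer route.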


Handling mixed‑collation JOINs and comparisons

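  • MySQL: when comparing values with different collations, MySQL applies coercibility rules to pick one. When the rules cannot decide (for example, two columns with different collations), the query fails with an "Illegal mix of collations" error. An explicit COLLATE in the query resolves this, though it may prevent index use on the affected column: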
    
    SELECT * FROM a JOIN b ON a.name COLLATE utf8mb4_0900_ai_ci = b.name COLLATE utf8mb4_0900_ai_ci; 
  • SQL Server: use COLLATE clause in queries:
    
    SELECT * FROM a JOIN b ON a.name COLLATE Latin1_General_100_CI_AS = b.name COLLATE Latin1_General_100_CI_AS; 
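Rather than discovering mismatches one failing query at a time, the inventory collected earlier can be checked against the joins the application is known to use. This is a minimal Python sketch; the dictionary layout, table names, and collation values are assumptions for illustration, not output from any real tool.

```python
def find_collation_mismatches(column_collations, join_pairs):
    """Return (left, right, left_collation, right_collation) for each mismatched join."""
    mismatches = []
    for left, right in join_pairs:
        left_coll = column_collations[left]
        right_coll = column_collations[right]
        if left_coll != right_coll:
            mismatches.append((left, right, left_coll, right_coll))
    return mismatches


# Hypothetical inventory: (table, column) -> collation
inventory = {
    ("a", "name"): "utf8mb4_0900_ai_ci",
    ("b", "name"): "utf8mb4_general_ci",
    ("a", "code"): "utf8mb4_0900_ai_ci",
    ("c", "code"): "utf8mb4_0900_ai_ci",
}

# Hypothetical list of column pairs that the application joins on.
joins = [
    (("a", "name"), ("b", "name")),
    (("a", "code"), ("c", "code")),
]

for left, right, left_coll, right_coll in find_collation_mismatches(inventory, joins):
    print(f"{left} uses {left_coll} but {right} uses {right_coll}")
```

Fixing the column collations found this way is usually preferable to scattering COLLATE clauses through queries.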

Testing and validation checklist

  • Row counts match before/after.
  • Checksums match (where applicable).
  • Sample text values preserved (especially accents, emojis).
  • ORDER BY results match expected language rules.
  • Application-level searches, LIKE queries, and equality checks behave as expected.
  • Performance benchmarks (index sizes, query times).
  • Backup restoration tested.
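The first few checks can be automated on exported column values. The sketch below, in Python, compares row counts and an order-independent checksum computed over the decoded text rather than the stored bytes, since the bytes are expected to change when the character set changes. The function names and sample values are hypothetical.

```python
import hashlib


def value_digest(value):
    """Digest one column value; None (SQL NULL) gets its own marker."""
    if value is None:
        return "NULL"
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def text_checksum(values) -> str:
    """Order-independent checksum over decoded text values."""
    digests = sorted(value_digest(v) for v in values)
    return hashlib.sha256("\n".join(digests).encode("ascii")).hexdigest()


def compare_columns(before, after) -> dict:
    """Compare a column's values fetched before and after the conversion."""
    return {
        "row_count_match": len(before) == len(after),
        "checksum_match": text_checksum(before) == text_checksum(after),
    }


before = ["café", "naïve", None]
after_ok = [None, "naïve", "café"]      # same values, different row order
after_bad = ["cafÃ©", "naïve", None]    # one value was mangled

print(compare_columns(before, after_ok))
print(compare_columns(before, after_bad))
```

Both sides must be fetched through connections with a correct client character set, otherwise the comparison checks the client's decoding rather than the stored data.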

Rollback strategies

  • Restore from the logical backup/dump if conversion causes data corruption.
  • Keep original physical snapshots until conversion verified.
  • For large systems, perform conversion on a shadow copy and then switch application pointers (DNS, connection strings) to the converted instance.

Common pitfalls and how to avoid them

  • Assuming ALTER DATABASE will convert existing columns — it doesn’t. Use ALTER TABLE/ALTER COLUMN.
  • Not converting client/connection character set — set proper client character set (MySQL: SET NAMES utf8mb4).
  • Forgetting to update stored procedures/views that embed string literals with different collations.
  • Failing to account for index size increase when moving to utf8mb4 (may exceed index key length limits).
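The index key length pitfall comes down to simple arithmetic. The following Python sketch assumes the commonly cited InnoDB limits of 767 bytes per index key for the COMPACT and REDUNDANT row formats and 3072 bytes for DYNAMIC and COMPRESSED; confirm the limits for your server version and row format before relying on them.

```python
# Maximum bytes one character can occupy in each MySQL character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}

# Assumed InnoDB index key limits in bytes, by row format.
KEY_LIMIT_BYTES = {"COMPACT": 767, "REDUNDANT": 767, "DYNAMIC": 3072, "COMPRESSED": 3072}


def index_key_bytes(char_length: int, charset: str) -> int:
    """Worst-case bytes needed to index a column of char_length characters."""
    return char_length * MAX_BYTES_PER_CHAR[charset]


def fits_in_index(char_length: int, charset: str, row_format: str) -> bool:
    """True if a full-column index stays within the assumed key limit."""
    return index_key_bytes(char_length, charset) <= KEY_LIMIT_BYTES[row_format]


def max_indexable_chars(charset: str, row_format: str) -> int:
    """Longest column (in characters) that can be fully indexed."""
    return KEY_LIMIT_BYTES[row_format] // MAX_BYTES_PER_CHAR[charset]


print(fits_in_index(255, "utf8mb3", "COMPACT"))   # True  (765 bytes)
print(fits_in_index(255, "utf8mb4", "COMPACT"))   # False (1020 bytes)
print(max_indexable_chars("utf8mb4", "COMPACT"))  # 191
print(max_indexable_chars("utf8mb4", "DYNAMIC"))  # 768
```

This is why indexed VARCHAR(255) columns are often shortened to 191 characters or given a prefix index when older row formats are involved.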

Example migration plan (concise)

  1. Inventory and backup.
  2. Create test copy.
  3. On test copy: convert database default, then table-by-table convert using pt-online-schema-change for large tables.
  4. Run verification scripts.
  5. Schedule maintenance window.
  6. Repeat on production; monitor closely.
  7. Run post-migration validation and performance tests.

Tools that help

  • mysqldump, mysqlpump (MySQL; mysqlpump is deprecated in recent releases).
  • pt-online-schema-change, gh-ost (online schema changes).
  • SQL Server Management Studio (SSMS) / sqlcmd.
  • checksum tools, data-diff utilities.
  • Backups and snapshots.

Conclusion

Changing database collation is straightforward in concept but sensitive in practice. The safe path is: inventory, backup, test thoroughly, convert character sets carefully (especially when moving to Unicode), convert collations at the correct levels, and validate extensively. With careful planning and the right tools, you can change collation without data loss or user-visible regressions.
