Aurora Mystery: "Found a row not matching the given partition set"
A few weeks ago we started seeing a strange error in one of our Aurora MySQL testing environments:
Found a row not matching the given partition set
At first glance it looked like a classic partitioning problem.
Maybe a partition was missing.
Maybe partition maintenance failed.
The interesting part was that the issue appeared on the 1st day of the month and then disappeared.
Naturally, partitioning became the prime suspect.
First checks
The history tables were partitioned by creation timestamp:
PARTITION BY RANGE (
FLOOR(UNIX_TIMESTAMP(creation_date))
)
The usual checks looked healthy:
- partitions existed
- future partitions existed
- pmax existed
- partition manager completed successfully
- no errors were found in maintenance logs
Nothing looked suspicious.
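The partition checks above can be sketched as a single query against information_schema. This is a minimal sketch, not the exact query we ran: the schema name app_db and the table name orders_history are hypothetical placeholders.

```sql
-- List every partition of one history table with its upper bound.
-- 'app_db' and 'orders_history' are hypothetical names; substitute your own.
SELECT
    PARTITION_ORDINAL_POSITION AS pos,
    PARTITION_NAME,
    PARTITION_DESCRIPTION      AS less_than_raw,
    CASE
        -- pmax has no numeric bound, so do not try to convert it
        WHEN PARTITION_DESCRIPTION = 'MAXVALUE' THEN NULL
        ELSE FROM_UNIXTIME(CAST(PARTITION_DESCRIPTION AS UNSIGNED))
    END                        AS less_than_ts,
    TABLE_ROWS                 AS approx_rows  -- an estimate for InnoDB
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = 'app_db'
  AND TABLE_NAME   = 'orders_history'
ORDER BY PARTITION_ORDINAL_POSITION;
```

FROM_UNIXTIME renders the bound in the session time zone, so the readable timestamp may differ between sessions even though the stored bound does not.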
Looking for broken data
The next theory was that perhaps the application was trying to insert invalid values.
The problem was that we did not have the exact SQL statement that triggered the exception.
Without the failing SQL, troubleshooting becomes much more difficult.
We reviewed triggers, table definitions, partition definitions and application logs, but could not find an obvious explanation.
Manual testing
At this point I started testing the partitioned tables directly.
Manual inserts worked.
Boundary tests around month transitions worked.
The partitioned tables accepted new rows without any issues.
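A boundary test of this kind might look like the sketch below. It uses a scratch table rather than the real history tables, and every name and date in it is hypothetical. It assumes the session time zone stays the same between the CREATE TABLE and the inserts.

```sql
-- Scratch table partitioned the same way as the history tables.
-- All names and dates are hypothetical.
CREATE TABLE boundary_probe (
    id            BIGINT       NOT NULL,
    creation_date TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    -- MySQL requires the partitioning column in every unique key
    PRIMARY KEY (id, creation_date)
)
PARTITION BY RANGE (FLOOR(UNIX_TIMESTAMP(creation_date))) (
    PARTITION p2025_01 VALUES LESS THAN (UNIX_TIMESTAMP('2025-02-01 00:00:00')),
    PARTITION p2025_02 VALUES LESS THAN (UNIX_TIMESTAMP('2025-03-01 00:00:00')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Last microsecond of January and first microsecond of February
INSERT INTO boundary_probe (id, creation_date) VALUES
    (1, '2025-01-31 23:59:59.999999'),
    (2, '2025-02-01 00:00:00.000000');

-- One insert that relies on the column default instead of an explicit value
INSERT INTO boundary_probe (id) VALUES (3);

-- Count rows per partition using explicit partition selection
SELECT 'p2025_01' AS part, COUNT(*) AS row_count FROM boundary_probe PARTITION (p2025_01)
UNION ALL
SELECT 'p2025_02', COUNT(*) FROM boundary_probe PARTITION (p2025_02)
UNION ALL
SELECT 'pmax', COUNT(*) FROM boundary_probe PARTITION (pmax);
```

The two explicit rows should land on opposite sides of the month boundary, one in each monthly partition, and the defaulted row should fall into pmax when run after the last dated partition.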
This was confusing because Aurora was clearly returning:
Found a row not matching the given partition set
yet I could not reproduce it manually.
An interesting observation
The architecture contained three parent tables and three corresponding history tables.
The application wrote into the parent tables.
AFTER INSERT and AFTER UPDATE triggers copied data into history tables.
The history tables were partitioned.
The flow looked like this:
Application → Parent Table → Trigger → Partitioned History Table
At the same time, there were other partitioned tables in the environment.
Those tables were receiving direct inserts and were working perfectly fine.
This narrowed the investigation considerably.
Partitioning itself did not appear to be generally broken.
The problem seemed to exist only when triggers and partitioned destination tables were involved.
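The trigger path described in this section can be sketched as follows. This is an illustration only: the table, column and trigger names are hypothetical, and the fact that the trigger omits creation_date and lets the column default fill it in is an assumption about the setup, not a copy of the real definitions.

```sql
-- Parent table the application writes to (hypothetical)
CREATE TABLE orders (
    id     BIGINT      NOT NULL PRIMARY KEY,
    status VARCHAR(32) NOT NULL
);

-- Partitioned history table fed only by triggers (hypothetical)
CREATE TABLE orders_history (
    history_id    BIGINT       NOT NULL AUTO_INCREMENT,
    order_id      BIGINT       NOT NULL,
    status        VARCHAR(32)  NOT NULL,
    creation_date TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    PRIMARY KEY (history_id, creation_date)
)
PARTITION BY RANGE (FLOOR(UNIX_TIMESTAMP(creation_date))) (
    PARTITION p_before_2026 VALUES LESS THAN (UNIX_TIMESTAMP('2026-01-01 00:00:00')),
    PARTITION pmax          VALUES LESS THAN MAXVALUE
);

-- Copy every insert into history; creation_date comes from the default
CREATE TRIGGER orders_ai AFTER INSERT ON orders
FOR EACH ROW
    INSERT INTO orders_history (order_id, status)
    VALUES (NEW.id, NEW.status);

-- Copy every update into history; creation_date comes from the default
CREATE TRIGGER orders_au AFTER UPDATE ON orders
FOR EACH ROW
    INSERT INTO orders_history (order_id, status)
    VALUES (NEW.id, NEW.status);

-- The application only touches the parent table
INSERT INTO orders (id, status) VALUES (1, 'NEW');
UPDATE orders SET status = 'PAID' WHERE id = 1;
```

With this shape, an error raised by the history insert surfaces to the application as a failure of its own statement against the parent table, which is one reason the failing SQL can be hard to pin down.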
The workaround
To restore stability, we temporarily removed partitioning from the three history tables.
The application remained unchanged.
The triggers remained unchanged.
Only partitioning was removed.
Immediately after that, the problem disappeared.
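For reference, removing partitioning is a single DDL statement per table. The sketch below assumes a partitioned table named orders_history already exists in the current schema; the name is hypothetical.

```sql
-- Convert the table back to a regular, unpartitioned table.
-- 'orders_history' is a hypothetical name.
ALTER TABLE orders_history REMOVE PARTITIONING;

-- Verify: an unpartitioned table shows one row with NULL partition columns
SELECT TABLE_NAME, PARTITION_NAME, PARTITION_METHOD
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME   = 'orders_history';
```

The rows stay in the table; only the partition layout is dropped, so any partition maintenance job that targets the table needs to be disabled as well.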
That was the strongest clue so far.
The AWS investigation
At this point we opened a support case with AWS.
After reviewing the architecture, trigger definitions and table DDLs, AWS confirmed something very interesting.
The issue is a known problem affecting Aurora MySQL versions compatible with MySQL 8.0.42 and later.
Our cluster had recently been upgraded:
Aurora 3.09 -> Aurora 3.12.0
MySQL 8.0.40 -> MySQL 8.0.44
The timing matched perfectly.
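A quick way to confirm which versions a cluster is actually running is sketched below. AURORA_VERSION() is specific to Aurora MySQL, so this query is expected to fail on a non-Aurora server.

```sql
-- Aurora engine version and the MySQL version reported by the server
SELECT AURORA_VERSION() AS aurora_version,
       VERSION()        AS mysql_version;
```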
According to AWS, INSERT statements involving partitioned tables with CURRENT_TIMESTAMP-based partition keys may incorrectly return:
Found a row not matching the given partition set
even when:
- partitions exist
- partition definitions are valid
- maintenance succeeds
- manual inserts work
In other words, all the things we checked were healthy because the issue was not caused by broken partition definitions at all.
The recommended workaround
AWS suggested explicitly passing the timestamp value instead of relying on the column default.
For example:
INSERT INTO history_table (
...,
creation_date
)
VALUES (
...,
CURRENT_TIMESTAMP(6)
);
instead of relying on:
creation_date TIMESTAMP(6)
DEFAULT CURRENT_TIMESTAMP(6)
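In a trigger-fed setup, the same idea would be applied inside the trigger body rather than in application SQL. The sketch below is an assumption about how that could look, not the change AWS or we actually deployed. It assumes a parent table orders (id, status) and a partitioned history table orders_history (order_id, status, creation_date) already exist; all of these names are hypothetical.

```sql
-- Recreate the trigger so it passes the timestamp explicitly.
-- Table, column and trigger names are hypothetical.
DROP TRIGGER IF EXISTS orders_ai;

CREATE TRIGGER orders_ai AFTER INSERT ON orders
FOR EACH ROW
    INSERT INTO orders_history (order_id, status, creation_date)
    VALUES (NEW.id, NEW.status, CURRENT_TIMESTAMP(6));
```

Inside a trigger, CURRENT_TIMESTAMP returns the time at which the triggering statement began to execute, so the stored value should match what the column default would have produced.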
Another valid workaround is exactly what we implemented: removing partitioning.
Lessons learned
The most interesting part of this incident was that every symptom pointed toward partition corruption or missing partitions.
Yet none of those things were actually wrong.
Sometimes the database is behaving exactly as designed.
Sometimes the schema is correct.
Sometimes the data is correct.
And sometimes the bug lives inside the engine itself.
One more lesson
There was another lesson hidden in this incident.
The first occurrence happened in a testing environment a month earlier.
At the time, we investigated the issue, reviewed partitions, triggers and application behavior, but could not identify a definitive root cause. Since the problem disappeared and the environment recovered, it was easy to classify it as an isolated incident.
A month later, the same error appeared again - this time affecting production.
Looking back, the technical problem itself was interesting, but the operational lesson was equally important.
When an unusual database error appears after an engine upgrade, especially one that cannot be fully explained, it is worth treating it as a potential production issue until proven otherwise.
Sometimes the most valuable outcome of an investigation is not finding the answer immediately.
Sometimes it is making sure the unanswered question remains visible until it is resolved.
This incident was a good reminder that unexplained behavior in lower environments deserves the same level of curiosity and persistence as an issue affecting production.