Aurora Mystery: "Found a row not matching the given partition set"

A few weeks ago we started seeing a strange error in one of our Aurora MySQL testing environments:

Found a row not matching the given partition set

At first glance it looked like a classic partitioning problem.

Maybe a partition was missing.

Maybe partition maintenance failed.

The interesting part was that the issue appeared on the 1st day of the month and then disappeared.

Naturally, partitioning became the prime suspect.

First checks

The history tables were partitioned by creation timestamp:

PARTITION BY RANGE (
    FLOOR(UNIX_TIMESTAMP(creation_date))
)
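
For context, a fuller table definition around that clause might look roughly like the sketch below. The table name `order_history`, its columns, and the monthly partition names are hypothetical placeholders rather than our real schema, and the monthly layout is itself an assumption made for illustration.

```sql
-- Hypothetical history table: names, columns and boundaries are
-- illustrative assumptions, not the real schema.
CREATE TABLE order_history (
    history_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    order_id      BIGINT UNSIGNED NOT NULL,
    status        VARCHAR(32)     NOT NULL,
    creation_date TIMESTAMP(6)    NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    -- Every unique key on a partitioned table must include the partitioning column.
    PRIMARY KEY (history_id, creation_date)
)
PARTITION BY RANGE (
    FLOOR(UNIX_TIMESTAMP(creation_date))
) (
    -- Boundary literals are interpreted in the session time zone at DDL time.
    PARTITION p202405 VALUES LESS THAN (UNIX_TIMESTAMP('2024-06-01 00:00:00')),
    PARTITION p202406 VALUES LESS THAN (UNIX_TIMESTAMP('2024-07-01 00:00:00')),
    -- Catch-all partition for anything beyond the last explicit boundary.
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);
```

With a catch-all `pmax` in place, a row should always have somewhere to land, which is part of why the error was so puzzling.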

The usual checks looked healthy:

partitions existed
future partitions existed
pmax existed
partition manager completed successfully
no errors were found in maintenance logs
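
A check of this kind can be run against `information_schema.PARTITIONS`. The query below is a minimal sketch, assuming a hypothetical history table named `order_history` in the current schema; substitute the real table name.

```sql
-- List the partitions of a (hypothetical) history table with readable boundaries.
SELECT PARTITION_NAME,
       PARTITION_ORDINAL_POSITION,
       PARTITION_DESCRIPTION AS upper_bound_unix,
       CASE
           WHEN PARTITION_DESCRIPTION = 'MAXVALUE' THEN NULL
           ELSE FROM_UNIXTIME(PARTITION_DESCRIPTION)
       END AS upper_bound_time,
       TABLE_ROWS            AS approx_rows   -- an estimate for InnoDB tables
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME   = 'order_history'
ORDER BY PARTITION_ORDINAL_POSITION;
```

For a healthy table, the output should show contiguous boundaries, at least one partition ahead of the current date, and the `MAXVALUE` partition last.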

Nothing looked suspicious.

Looking for broken data

The next theory was that perhaps the application was trying to insert invalid values.

The problem was that we did not have the exact SQL statement that triggered the exception.

Without the failing SQL, troubleshooting becomes much more difficult.

We reviewed triggers, table definitions, partition definitions and application logs, but could not find an obvious explanation.

Manual testing

At this point I started testing the partitioned tables directly.

Manual inserts worked.

Boundary tests around month transitions worked.

The partitioned tables accepted new rows without any issues.
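
A boundary test of this kind might look like the sketch below. It assumes a hypothetical table `order_history` with columns `order_id`, `status` and `creation_date`, a partition `p202406` whose upper bound is 2024-07-01 00:00:00, and a catch-all `pmax`. None of these names come from the real schema.

```sql
-- Last microsecond of June: expected to land in p202406.
INSERT INTO order_history (order_id, status, creation_date)
VALUES (1, 'boundary-test', '2024-06-30 23:59:59.999999');

-- First instant of July: equal to the upper bound, so expected to land in pmax.
INSERT INTO order_history (order_id, status, creation_date)
VALUES (2, 'boundary-test', '2024-07-01 00:00:00.000000');

-- Count the test rows per partition using explicit partition selection.
SELECT 'p202406' AS partition_name, COUNT(*) AS row_count
FROM order_history PARTITION (p202406)
WHERE status = 'boundary-test'
UNION ALL
SELECT 'pmax', COUNT(*)
FROM order_history PARTITION (pmax)
WHERE status = 'boundary-test';
```

Note that tests written this way supply `creation_date` explicitly, so they never exercise the column default.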

This was confusing because Aurora was clearly returning:

Found a row not matching the given partition set

yet I could not reproduce it manually.

An interesting observation

The architecture contained three parent tables and three corresponding history tables.

The application wrote into the parent tables.

AFTER INSERT and AFTER UPDATE triggers copied data into history tables.

The history tables were partitioned.

The flow looked like this:

Application → Parent Table → Trigger → Partitioned History Table
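
In SQL terms, the pattern can be sketched roughly as follows. The names `orders`, `order_history` and the trigger names are hypothetical, and the sketch assumes the triggers omit `creation_date` so that the history table's `DEFAULT CURRENT_TIMESTAMP(6)` fills it in.

```sql
-- Hypothetical parent table written to by the application.
CREATE TABLE orders (
    order_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    status   VARCHAR(32)     NOT NULL
);

-- Copy every new parent row into the partitioned history table.
-- creation_date is not listed, so the column default supplies it.
CREATE TRIGGER orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW
    INSERT INTO order_history (order_id, status)
    VALUES (NEW.order_id, NEW.status);

-- Same idea for updates: record the new state of the row.
CREATE TRIGGER orders_after_update
AFTER UPDATE ON orders
FOR EACH ROW
    INSERT INTO order_history (order_id, status)
    VALUES (NEW.order_id, NEW.status);
```

The application only ever sees the parent table, so an error raised by the history insert surfaces as a failure of the application's own statement.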

At the same time, there were other partitioned tables in the environment.

Those tables were receiving direct inserts and were working perfectly fine.

This narrowed the investigation considerably.

Partitioning itself did not appear to be generally broken.

The problem seemed to exist only when triggers and partitioned destination tables were involved.

The workaround

To restore stability, we temporarily removed partitioning from the three history tables.

The application remained unchanged.

The triggers remained unchanged.

Only partitioning was removed.

Immediately after that, the problem disappeared.

That was the strongest clue so far.
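
For reference, removing partitioning from an existing table is a single statement in MySQL. The table name below is a hypothetical placeholder.

```sql
-- Converts the table to a regular, non-partitioned table while keeping its rows.
-- This rebuilds the table, so expect it to take a while on large tables.
ALTER TABLE order_history REMOVE PARTITIONING;

-- Verify: the PARTITION BY clause should no longer appear in the definition.
SHOW CREATE TABLE order_history;
```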

The AWS investigation

At this point we opened a support case with AWS.

After reviewing the architecture, trigger definitions and table DDLs, AWS confirmed something very interesting.

The issue is a known problem affecting Aurora MySQL versions compatible with MySQL 8.0.42 and later.

Our cluster had recently been upgraded:

Aurora MySQL 3.09 → 3.12.0
MySQL compatibility 8.0.40 → 8.0.44

The timing matched perfectly.
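
To see which versions a cluster is actually running, a quick check like the one below can help. `AURORA_VERSION()` is specific to Aurora MySQL and will fail on community MySQL.

```sql
-- Aurora engine version alongside the MySQL version it reports.
SELECT AURORA_VERSION() AS aurora_version,
       VERSION()        AS mysql_version;
```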

According to AWS, INSERT statements involving partitioned tables with CURRENT_TIMESTAMP-based partition keys may incorrectly return:

Found a row not matching the given partition set

even when:

partitions exist
partition definitions are valid
maintenance succeeds
manual inserts work

In other words, all the things we checked were healthy because the issue was not caused by broken partition definitions at all.

The recommended workaround

AWS suggested explicitly passing the timestamp value instead of relying on the column default.

For example:

INSERT INTO history_table (
    ...,
    creation_date
)
VALUES (
    ...,
    CURRENT_TIMESTAMP(6)
);

instead of relying on:

creation_date TIMESTAMP(6)
DEFAULT CURRENT_TIMESTAMP(6)

Another valid workaround is exactly what we implemented: removing partitioning.
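
Since in our setup the history rows are written by triggers, applying the AWS suggestion would mean changing the trigger body. The sketch below assumes a hypothetical parent table `orders`, a history table `order_history`, and a trigger named `orders_after_insert`. It illustrates the idea and is not the change we actually deployed.

```sql
-- MySQL has no CREATE OR REPLACE TRIGGER, so the trigger is dropped and recreated.
-- Rows inserted between the two statements are not copied to history.
DROP TRIGGER IF EXISTS orders_after_insert;

CREATE TRIGGER orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW
    -- Pass the timestamp explicitly instead of relying on the column default.
    INSERT INTO order_history (order_id, status, creation_date)
    VALUES (NEW.order_id, NEW.status, CURRENT_TIMESTAMP(6));
```

The column default can stay in place. The point is only that the trigger's INSERT no longer depends on it.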

Lessons learned

The most interesting part of this incident was that every symptom pointed toward partition corruption or missing partitions.

Yet none of those things were actually wrong.

Sometimes the database is behaving exactly as designed.

Sometimes the schema is correct.

Sometimes the data is correct.

And sometimes the bug lives inside the engine itself.

One more lesson

There was another lesson hidden in this incident.

The first occurrence happened in a testing environment a month ago.

At the time, we investigated the issue, reviewed partitions, triggers and application behavior, but could not identify a definitive root cause. Since the problem disappeared and the environment recovered, it was easy to classify it as an isolated incident.

A month later, the same error appeared again, this time affecting production.

Looking back, the technical problem itself was interesting, but the operational lesson was equally important.

When an unusual database error appears after an engine upgrade, especially one that cannot be fully explained, it is worth treating it as a potential production issue until proven otherwise.

Sometimes the most valuable outcome of an investigation is not finding the answer immediately.

Sometimes it is making sure the unanswered question remains visible until it is resolved.

This incident was a good reminder that unexplained behavior in lower environments deserves the same level of curiosity and persistence as an issue affecting production.

