Find duplicate rows in SQL
Tuesday, November 16, 2010 2:11:57 PM
There are many ways to search for duplicate rows in MySQL. Few great ones are in Xaprb's How to find duplicate rows with SQL blog post and following comments.I couldn't really use INSERT INTO / CREATE TABLE AS SELECT solutions because when GROUP-ing a table with 50 million rows, that takes (lot of) time...
Let's say, I need to find duplicates in column containing MD5 hashes in some table. Here's a solution using tools available in shell:
mysql -e "SELECT md5 FROM some_table;" > md5dump
sort md5dump --output=md5sorted
uniq md5sorted md5unique
diff --left-column --speed-large-files md5sorted md5unique > md5diff
awk '{ if ($2) print $2 }' md5diff > md5cleandiff
Now we have a list of MD5 hashes, that are present more than once in the table. In my case, this was a matter of minutes even with the md5dump file being 1.3GB. I wrote a SELECT/DELETE script (in python) which looked for those duplicate rows and removed all but first. That one ran for cca an hour.
I killed my initial approach "INSERT INTO ... HAVING COUNT(*) > 1" query after about 2 hours, tired of waiting.
Command line tools like mysql, sort, uniq, diff, awk and many others are just amazing.
It's a very unusal solution to this problem, but it's definitely a fun one to use






