How do I see stuff around me...

Subscribe to RSS feed

Find duplicate rows in SQL

, ,

There are many ways to search for duplicate rows in MySQL. Few great ones are in Xaprb's How to find duplicate rows with SQL blog post and following comments.

I couldn't really use INSERT INTO / CREATE TABLE AS SELECT solutions because when GROUP-ing a table with 50 million rows, that takes (lot of) time...

Let's say, I need to find duplicates in column containing MD5 hashes in some table. Here's a solution using tools available in shell:

mysql -e "SELECT md5 FROM some_table;" > md5dump
sort md5dump --output=md5sorted
uniq md5sorted md5unique
diff --left-column --speed-large-files md5sorted md5unique > md5diff
awk '{ if ($2) print $2 }' md5diff > md5cleandiff

Now we have a list of MD5 hashes, that are present more than once in the table. In my case, this was a matter of minutes even with the md5dump file being 1.3GB. I wrote a SELECT/DELETE script (in python) which looked for those duplicate rows and removed all but first. That one ran for cca an hour.

I killed my initial approach "INSERT INTO ... HAVING COUNT(*) > 1" query after about 2 hours, tired of waiting.

Command line tools like mysql, sort, uniq, diff, awk and many others are just amazing.

It's a very unusal solution to this problem, but it's definitely a fun one to use smile

JSLint: The JavaScript Code Quality Tool

, , , ...

It all started watching Javascript: Good Parts presentation by Douglas Crockford. There he describes what JavaScript has wrong about it, and how JSLint can help to fix that (while hurting your feelings).
Here goes everything I could find on JS lint and has been usefull so far...

Web tool:
Command line (using Java):
Other:
Command line (using SpiderMonkey):
Command line (wrapped in Python):
IDE integration:

I had few ideas while looking at all this:
  • Integrate JSLint output to Django's test case runner (I use JsTestDriver, so before actually testing what it does there might be a check for how it's written).
  • Create a SVN pre commit hook that runs JSlint for commited JS files.
  • Use .jslint settings file for a Python wrapper. (Python > PERL) smile


Happy linting...