sql - Delete duplicates based on string length criteria

- February 15, 2015

Background

Remove duplicate city names from a temporary table, based on the length of the name.

Problem

The following query returns 350,000 rows:

    SELECT
      tc.id,
      tc.name_lowercase,
      tc.population,
      tc.latitude_decimal,
      tc.longitude_decimal
    FROM
      climate.temp_city tc
    INNER JOIN (
      SELECT
        tc2.latitude_decimal,
        tc2.longitude_decimal
      FROM
        climate.temp_city tc2
      GROUP BY
        tc2.latitude_decimal,
        tc2.longitude_decimal
      HAVING
        count(*) > 3
    ) s ON
      tc.latitude_decimal = s.latitude_decimal AND
      tc.longitude_decimal = s.longitude_decimal

Sample data:

    940308;"sara"            ;;-53.4333333;-68.1833333
    935665;"estancia la sara";;-53.4333333;-68.1833333
    935697;"estancia sara"   ;;-53.4333333;-68.1833333
    937204;"la sara"         ;;-53.4333333;-68.1833333
    940350;"seccion gap"     ;;-52.1666667;-68.5666667
    941448;"zanja pique"     ;;-52.1666667;-68.5666667
    935941;"gap"             ;;-52.1666667;-68.5666667
    935648;"estancia gap"    ;;-52.1666667;-68.5666667
    939635;"ritchie"         ;;-51.9833333;-70.4
    934948;"d.e. ritchie"    ;;-51.9833333;-70.4
    934992;"diego richtie"   ;;-51.9833333;-70.4
    934993;"diego ritchie"   ;;-51.9833333;-70.4
    934990;"diego e. ritchie";;-51.9833333;-70.4

I would like to remove duplicates, retaining the rows where:

population is not null; and
the name is the longest of the duplicates (max(tc.name_lowercase)); and
if neither of these conditions is met, retain the max(tc.id).

From the given set of data, the remaining rows should be:

    935665;"estancia la sara";;-53.4333333;-68.1833333
    935648;"estancia gap"    ;;-52.1666667;-68.5666667
    934990;"diego e. ritchie";;-51.9833333;-70.4

Question

How would I select only the rows with duplicate lat/long values that meet the problem criteria?

Thank you!

I think you're looking for this:

    SELECT t.id, t.name_lowercase, t.latitude_decimal, t.longitude_decimal
    FROM (SELECT max(length(name_lowercase)) AS len, latitude_decimal, longitude_decimal
          FROM temp_city
          GROUP BY latitude_decimal, longitude_decimal) max_length,
         temp_city t
    WHERE max_length.latitude_decimal  = t.latitude_decimal
      AND max_length.longitude_decimal = t.longitude_decimal
      AND max_length.len               = length(t.name_lowercase);

where temp_city is a table that contains your sample results.

The above will run into problems if temp_city contains a row like this:

1 | xxxancia la sara | -53.4333333 | -68.1833333

You didn't offer a way to choose one row amongst those whose name has the maximum length, so both of these will be returned:

         1 | xxxancia la sara | -53.4333333 | -68.1833333
    935665 | estancia la sara | -53.4333333 | -68.1833333

Update: if max(tc.id) is the criterion for trimming down the above duplicates, then you can wrap another layer on:

    SELECT t.id, t.name_lowercase, t.latitude_decimal, t.longitude_decimal
    FROM (
        SELECT max(t.id) AS id
        FROM (
            SELECT max(length(name_lowercase)) AS len, latitude_decimal, longitude_decimal
            FROM temp_city
            GROUP BY latitude_decimal, longitude_decimal
          ) max_length,
          temp_city t
        WHERE max_length.latitude_decimal  = t.latitude_decimal
          AND max_length.longitude_decimal = t.longitude_decimal
          AND max_length.len               = length(t.name_lowercase)
        GROUP BY t.latitude_decimal, t.longitude_decimal, length(t.name_lowercase)
      ) tt,
      temp_city t
    WHERE t.id = tt.id
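If your database supports window functions, the same selection could probably be written as a single ranking per coordinate pair. This is only a sketch under a few assumptions: PostgreSQL 8.4 or later (or another engine with ROW_NUMBER), id unique in temp_city, and "population is not null" read as the first tie-breaker, which the question does not spell out. The names ranked and rn are made up for illustration.

```sql
-- Rank the rows inside each latitude/longitude group; rn = 1 marks the row to keep.
WITH ranked AS (
  SELECT
    tc.id,
    ROW_NUMBER() OVER (
      PARTITION BY tc.latitude_decimal, tc.longitude_decimal
      ORDER BY
        CASE WHEN tc.population IS NULL THEN 1 ELSE 0 END,  -- rows with a population first (assumed priority)
        length(tc.name_lowercase) DESC,                     -- then the longest name
        tc.id DESC                                          -- then the highest id
    ) AS rn
  FROM temp_city tc
)
SELECT t.id, t.name_lowercase, t.latitude_decimal, t.longitude_decimal
FROM temp_city t
JOIN ranked r ON r.id = t.id
WHERE r.rn = 1;
```

On the sample data this should keep 935665, 935648 and 934990. Because the id tie-breaker is part of the ordering, the "xxxancia la sara" case returns exactly one row per coordinate pair. To actually delete the duplicates, the same ranking can feed a DELETE that targets the ids with rn > 1; it is worth running that inside a transaction first and checking the row count before committing.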
