sql - Delete duplicates based on string length criteria
Background
Remove duplicate city names from a temporary table, based on the length of the name.
Problem
The following query returns 350,000 rows:
SELECT tc.id, tc.name_lowercase, tc.population, tc.latitude_decimal, tc.longitude_decimal
FROM climate.temp_city tc
INNER JOIN (
  SELECT tc2.latitude_decimal, tc2.longitude_decimal
  FROM climate.temp_city tc2
  GROUP BY tc2.latitude_decimal, tc2.longitude_decimal
  HAVING COUNT(*) > 3
) s
  ON tc.latitude_decimal = s.latitude_decimal
  AND tc.longitude_decimal = s.longitude_decimal
Sample data:
940308;"sara"            ;;-53.4333333;-68.1833333
935665;"estancia la sara";;-53.4333333;-68.1833333
935697;"estancia sara"   ;;-53.4333333;-68.1833333
937204;"la sara"         ;;-53.4333333;-68.1833333
940350;"seccion gap"     ;;-52.1666667;-68.5666667
941448;"zanja pique"     ;;-52.1666667;-68.5666667
935941;"gap"             ;;-52.1666667;-68.5666667
935648;"estancia gap"    ;;-52.1666667;-68.5666667
939635;"ritchie"         ;;-51.9833333;-70.4
934948;"d.e. ritchie"    ;;-51.9833333;-70.4
934992;"diego richtie"   ;;-51.9833333;-70.4
934993;"diego ritchie"   ;;-51.9833333;-70.4
934990;"diego e. ritchie";;-51.9833333;-70.4
I'd like to remove the duplicates, retaining the rows where:
- the population is not null; and
- the name is the longest of the duplicates (max(tc.name_lowercase)); and
- if neither of these conditions is met, retain the max(tc.id).
From the given set of data, the remaining rows would be:
935665;"estancia la sara";;-53.4333333;-68.1833333
935648;"estancia gap"    ;;-52.1666667;-68.5666667
934990;"diego e. ritchie";;-51.9833333;-70.4
Question
How would you select only the rows with duplicate lat/long values that meet the problem criteria?
Thank you!
I think you're looking for this:
SELECT t.id, t.name_lowercase, t.latitude_decimal, t.longitude_decimal
FROM (
  -- longest name length per coordinate pair
  SELECT MAX(LENGTH(name_lowercase)) AS len, latitude_decimal, longitude_decimal
  FROM temp_city
  GROUP BY latitude_decimal, longitude_decimal
) max_length, temp_city t
WHERE max_length.latitude_decimal = t.latitude_decimal
  AND max_length.longitude_decimal = t.longitude_decimal
  AND max_length.len = LENGTH(t.name_lowercase);
Here, temp_city is the table that contains your sample results.
The above will run into problems if temp_city also contains this row:
1 | xxxancia la sara | -53.4333333 | -68.1833333
You didn't offer a way to choose a row among names that share the same maximum length, so both of these would be returned:
1      | xxxancia la sara | -53.4333333 | -68.1833333
935665 | estancia la sara | -53.4333333 | -68.1833333
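To see the tie for yourself, here is a minimal sketch that can be run in a scratch session. The table name temp_city_tie_demo and the column types are my assumptions (the question doesn't give the real definitions), and only three rows are loaded. It counts how many rows share the longest name length at each coordinate pair. Any count above 1 is a tie that the query cannot resolve on its own.

```sql
-- Hypothetical scratch table; the name and column types are assumptions.
CREATE TEMPORARY TABLE temp_city_tie_demo (
    id                integer PRIMARY KEY,
    name_lowercase    text,
    population        integer,
    latitude_decimal  numeric,
    longitude_decimal numeric
);

INSERT INTO temp_city_tie_demo
    (id, name_lowercase, population, latitude_decimal, longitude_decimal)
VALUES
    (1,      'xxxancia la sara', NULL, -53.4333333, -68.1833333),
    (935665, 'estancia la sara', NULL, -53.4333333, -68.1833333),
    (937204, 'la sara',          NULL, -53.4333333, -68.1833333);

-- Count the longest-name rows per coordinate pair.
-- A pair only shows up here when two or more names tie for the maximum length.
SELECT t.latitude_decimal, t.longitude_decimal, COUNT(*) AS tied_rows
FROM temp_city_tie_demo t
JOIN (
    SELECT latitude_decimal, longitude_decimal,
           MAX(LENGTH(name_lowercase)) AS len
    FROM temp_city_tie_demo
    GROUP BY latitude_decimal, longitude_decimal
) max_length
  ON  max_length.latitude_decimal  = t.latitude_decimal
  AND max_length.longitude_decimal = t.longitude_decimal
  AND max_length.len               = LENGTH(t.name_lowercase)
GROUP BY t.latitude_decimal, t.longitude_decimal
HAVING COUNT(*) > 1;
```

With these three rows the two 16-character names tie, so the single coordinate pair is reported. Running the same check against the real table shows how many locations need a tie-breaker.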
Update: if the max(tc.id) criterion is meant for trimming down the above duplicates, you can wrap another layer on:
SELECT t.id, t.name_lowercase, t.latitude_decimal, t.longitude_decimal
FROM (
  -- among the longest names at each coordinate pair, keep the highest id
  SELECT MAX(t.id) AS id
  FROM (
    SELECT MAX(LENGTH(name_lowercase)) AS len, latitude_decimal, longitude_decimal
    FROM temp_city
    GROUP BY latitude_decimal, longitude_decimal
  ) max_length, temp_city t
  WHERE max_length.latitude_decimal = t.latitude_decimal
    AND max_length.longitude_decimal = t.longitude_decimal
    AND max_length.len = LENGTH(t.name_lowercase)
  GROUP BY t.latitude_decimal, t.longitude_decimal, LENGTH(t.name_lowercase)
) tt, temp_city t
WHERE t.id = tt.id
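The query above only selects the surviving rows, and it doesn't look at population. If the goal is to actually delete the rest, one possible sketch is to rank the rows within each coordinate pair and delete everything below rank 1. This swaps in a different technique (the ROW_NUMBER() window function, available in PostgreSQL 8.4 and later) and rests on assumptions of mine: the table name temp_city_dedupe_demo and the column types are made up, and I read the three rules as a priority order (non-null population first, then longest name, then highest id).

```sql
-- Hypothetical scratch table; the name and column types are assumptions.
CREATE TEMPORARY TABLE temp_city_dedupe_demo (
    id                integer PRIMARY KEY,
    name_lowercase    text,
    population        integer,
    latitude_decimal  numeric,
    longitude_decimal numeric
);

-- Two of the coordinate groups from the question's sample data.
INSERT INTO temp_city_dedupe_demo
    (id, name_lowercase, population, latitude_decimal, longitude_decimal)
VALUES
    (940308, 'sara',             NULL, -53.4333333, -68.1833333),
    (935665, 'estancia la sara', NULL, -53.4333333, -68.1833333),
    (935697, 'estancia sara',    NULL, -53.4333333, -68.1833333),
    (937204, 'la sara',          NULL, -53.4333333, -68.1833333),
    (939635, 'ritchie',          NULL, -51.9833333, -70.4),
    (934948, 'd.e. ritchie',     NULL, -51.9833333, -70.4),
    (934992, 'diego richtie',    NULL, -51.9833333, -70.4),
    (934993, 'diego ritchie',    NULL, -51.9833333, -70.4),
    (934990, 'diego e. ritchie', NULL, -51.9833333, -70.4);

-- Rank rows within each coordinate pair, then delete all but the top-ranked row.
-- Assumed priority: non-null population, then longest name, then highest id.
DELETE FROM temp_city_dedupe_demo
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY latitude_decimal, longitude_decimal
                   ORDER BY (population IS NOT NULL) DESC,
                            LENGTH(name_lowercase) DESC,
                            id DESC
               ) AS rn
        FROM temp_city_dedupe_demo
    ) ranked
    WHERE rn > 1
);

-- Inspect what is left: one row per coordinate pair.
SELECT id, name_lowercase, latitude_decimal, longitude_decimal
FROM temp_city_dedupe_demo
ORDER BY id;
```

Because id is the last sort key and is unique, every group ends up with exactly one rank-1 row, so ties on name length can't leave duplicates behind. It would be worth running the inner SELECT on its own before the DELETE, to check that the ranking matches what you expect.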