SQL Server Duplicate

Question

I need to find duplicates in a table named lead, which contains around 100k records. The duplicates have similar values in the company column, such as:

enter image description here

The goal is to retain only the latest leadid (95803 in this case). However, there's an issue with leadid 95803, which has some extra characters after a space.

I've attempted to use the following script, but it's not providing the desired results:

select t1.*
FROM [dbo].[LEAD] t1
LEFT JOIN (
    SELECT
        company,
        city,
        MAX(leadid) AS keep_leadid
    FROM [dbo].[LEAD]
    GROUP BY company, city
) t2 ON t1.company = t2.company AND t1.city = t2.city
WHERE t1.leadid <> t2.keep_leadid 
  AND t1.company LIKE '%Uvalde Country%'

Any assistance in refining the script to achieve the intended outcome would be greatly appreciated.

I want to all delete except of this:

enter image description here

There are many companies, with different strings, I want to apply same script for all.

There are rather involved software solutions on the market of guessing if some similar entries are in fact duplicates or not. You are not going to get away with just "refining the script". — GSerg, Commented Nov 13, 2023 at 13:50
Please do not post images of code or data, only paste in text. — Charlieface, Commented Nov 13, 2023 at 13:53
Are you really still using SQL Server 2008 as well? That version ran completely out of support way back in 2019; using such a version now is a severe concern for you and your clients. — Thom A, Commented Nov 13, 2023 at 13:54
This is possible assuming the only issue is ever a string that's longer than the others (but the prefix is the same). — Charlieface, Commented Nov 13, 2023 at 13:54
Perhaps you could give more than one example, which demonstrate, presumbly, the different scenarios you have? I'm sure @Charlieface has a solution in mind for this one problem, and I do too, but it relies on their caveat. if you have a value of "Uvalde CC" or "Country Club, Uvalde" it's not going to work. — Thom A, Commented Nov 13, 2023 at 14:17

Alex · Accepted Answer · 2023-11-13 15:13:44Z

Here's an attempt with the caveat that you are interested in the shortest company names and will be matched with companies that START with the shortest company name, also NOT considering city:

declare @t table([sid] int not null identity(1,1), leadid int, company varchar(80));

insert into @t values(1,'company A');
insert into @t values(30,'company A INC');
insert into @t values(5,'company B');
insert into @t values(9,'company C');
insert into @t values(48,'company C INC');

--query to see join on companies that start with the same string
select *
from 
    @t a
    LEFT join @t b on a.company = left(b.company, len(a.company))

;WITH CTE AS
(
    --get max id per company
    select MAX(case when a.leadid > b.leadid then a.leadid else b.leadid end) max_id, 
    case when len(a.company) < len(b.company) then a.company else b.company end company,
    case when len(a.company) < len(b.company) then len(a.company) else len(b.company) end len_company
    from 
        @t a
        inner join @t b on a.company = left(b.company, len(a.company))
    group by --group by shortest company?
        case when len(a.company) < len(b.company) then a.company else b.company end,
        case when len(a.company) < len(b.company) then len(a.company) else len(b.company) end
), CTE2 AS
(
    --ROW NUMBER ON SHORTEST COMPANIES TO DISCARD ONES WITH MORE CHARACTERS AT THE END
    select max_id, MAX(company) company, ROW_NUMBER() OVER(PARTITION BY max_id ORDER BY MIN(len_company)) rn
    from CTE
    group by max_id, len_company
)
SELECT max_id, company
FROM 
    CTE2 
where
    rn = 1;

Collectives™ on Stack Overflow

SQL Server Duplicate

1 Answer 1

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Related