
Duplicates values in identity column when mixing redshift.copy and redshift.to_sql #2996

@laderjk

Description


Hello everyone! We are seeing duplicated values in an identity column when executing redshift.to_sql() after executing redshift.copy(). When inserting more than 1000 records we use redshift.copy(), and when inserting fewer than 1000 records we use redshift.to_sql(), as recommended in the docs.
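For reference, our dispatch logic is roughly the sketch below (write_df, COPY_THRESHOLD, and the s3_path argument are simplified placeholders, not our production names; conn is assumed to be an open redshift_connector connection):

import awswrangler as wr

COPY_THRESHOLD = 1000  # cutoff between the two write paths

def write_df(df, conn, schema, table, s3_path):
    # Large batches are staged in S3 and loaded with COPY;
    # small batches are written with INSERT statements via to_sql().
    if len(df) >= COPY_THRESHOLD:
        wr.redshift.copy(
            df=df,
            con=conn,
            path=s3_path,
            schema=schema,
            table=table,
            use_column_names=True,
            mode='append',
        )
    else:
        wr.redshift.to_sql(
            df=df,
            con=conn,
            schema=schema,
            table=table,
            use_column_names=True,
            mode='append',
        )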

This is the structure of the table:

create table if not exists dim_testing
(
    id BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    name varchar
);

This is the code with which we can reproduce the issue:

import random
import string

import awswrangler as wr
import pandas as pd

def generate_df(length):
    # Build a single-column frame of random 30-character strings.
    rows = [''.join(random.choices(string.ascii_letters + string.digits, k=30)) for _ in range(length)]
    return pd.DataFrame(rows, columns=['name'])

# redshift.conn is an open redshift_connector connection (e.g. from wr.redshift.connect()).
df = generate_df(1500)
wr.redshift.copy(  # large batch: staged in S3 and loaded via COPY
    df=df,
    con=redshift.conn,
    path='s3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

df = generate_df(500)
wr.redshift.to_sql(  # small batch: written with INSERT statements
    df,
    redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

After executing that code, if we look for duplicates in the identity column id, we find duplicated values:

select id, count(*)
from dim_testing
group by id
having count(*) > 1;
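The same check can also be run from Python over the same connection; a minimal sketch (wr.redshift.read_sql_query returns the query result as a DataFrame):

import awswrangler as wr

dupes = wr.redshift.read_sql_query(
    sql="""
        select id, count(*) as cnt
        from dim_testing
        group by id
        having count(*) > 1
    """,
    con=redshift.conn,
)
print(len(dupes))  # non-zero after running copy() followed by to_sql()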

If we only use to_sql() or only copy(), the issue does not happen; it only appears when executing to_sql() after a copy().
Is this normal? Are we doing something wrong? Should we pick only one write method and not mix the two?

Thanks!
