
I've got a JSON file with 30-ish blocks of "dicts", where every block has an ID, like this:

{
    "ID": "23926695",
    "webpage_url": "https://.com",
    "logo_url": null,
    "headline": "aewafs",
    "application_deadline": "2020-03-31T23:59:59"
}

Since my script pulls information in the same way from an API more than once, I would like to append new "blocks" to the json file only if the ID doesn't already exist in the JSON file.

I've got something like this so far:

import os
import json

# Initialise the file with an empty JSON array if it is empty
check_empty = os.stat('pbdb.json').st_size
if check_empty == 0:
    with open('pbdb.json', 'w') as f:
        f.write('[\n]')    # writes '[', a newline, then ']' -- an empty JSON array

with open('pbdb.json') as f:
    output = json.load(f)

for i in jobs:
    output.append({
        'ID': job_id,
        'Title': jobtitle,
        'Employer': company,
        'Employment type': emptype,
        'Fulltime': tid,
        'Deadline': deadline,
        'Link': webpage
    })

with open('pbdb.json', 'w') as job_data_file:
    json.dump(output, job_data_file)

but I would like to run the output.append part only if the ID doesn't already exist in the JSON file.
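
One way to get that behaviour is to load the file, collect the IDs already present into a set, and append only entries whose ID is missing. A minimal sketch (new_jobs stands in for whatever the API-parsing code produces, and it assumes every stored block has an 'ID' key):

import json
import os

# placeholder for the API output; in the real script these fields come from
# `jobs` / `job_id` / `jobtitle` / etc.
new_jobs = [
    {'ID': '23926695', 'Title': 'aewafs', 'Employer': 'Some Company',
     'Employment type': 'Permanent', 'Fulltime': 'Full-time',
     'Deadline': '2020-03-31T23:59:59', 'Link': 'https://.com'},
]

# make sure the file exists and holds at least an empty JSON array
if not os.path.exists('pbdb.json') or os.stat('pbdb.json').st_size == 0:
    with open('pbdb.json', 'w') as f:
        f.write('[]')

with open('pbdb.json') as f:
    output = json.load(f)

# IDs already stored in the file
existing_ids = {entry['ID'] for entry in output}

for job in new_jobs:
    if job['ID'] not in existing_ids:
        output.append(job)
        existing_ids.add(job['ID'])   # also skips duplicates within the same run

with open('pbdb.json', 'w') as job_data_file:
    json.dump(output, job_data_file, indent=2)

The set lookup keeps the membership check cheap even as the file grows, although the whole file is still read and rewritten on every run.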

  • This isn't valid JSON. It's JSON Lines, and you'll have to scan the file each time to see whether the ID exists; that's O(N), so it gets progressively slower as the dataset grows. Or you could keep something in memory to track seen IDs. Is there a reason you're not using a database?
    – roganjosh
    Commented Mar 22, 2020 at 20:09
  • Actually, it's not even JSON Lines because you're putting the carriage return in a list. I don't know what you're trying to do
    – roganjosh
    Commented Mar 22, 2020 at 20:14
  • It writes the data from output.append to a JSON file, and it looks pretty much like this image: cloud.google.com/bigquery/images/create-schema-array.png. I don't know if that's valid or not, but it's not throwing any errors at the moment. It's just a fun project for me to learn, and since the dataset would never be that big, I thought a JSON or CSV file would work well enough. I'm pretty new to this; would a database make this much easier?
    – Derpa
    Commented Mar 22, 2020 at 20:51
  • That is valid JSON, but you haven't shown the newlines. Yes, a database would make this easier
    – roganjosh
    Commented Mar 22, 2020 at 20:53
  • Any recommendations for a good, lightweight database that works well with Python? (A sqlite3 sketch follows these comments.)
    – Derpa
    Commented Mar 22, 2020 at 21:31
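
A minimal sketch of that database route using Python's built-in sqlite3 module (the pbdb.sqlite3 file name, the jobs table, and its columns are just placeholders): making the ID the primary key lets INSERT OR IGNORE skip any row whose ID is already stored.

import sqlite3

conn = sqlite3.connect('pbdb.sqlite3')  # the file is created on first use
conn.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        id       TEXT PRIMARY KEY,
        title    TEXT,
        employer TEXT,
        deadline TEXT,
        link     TEXT
    )
""")

# INSERT OR IGNORE does nothing when a row with the same primary key already exists
conn.execute(
    "INSERT OR IGNORE INTO jobs (id, title, employer, deadline, link) VALUES (?, ?, ?, ?, ?)",
    ("23926695", "aewafs", "Some Company", "2020-03-31T23:59:59", "https://.com"),
)
conn.commit()
conn.close()

SQLite ships with the standard library, stores everything in a single file, and needs no server, which makes it a reasonable fit for a small hobby project like this.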

2 Answers


I am not able to complete the code you provided, but here is an example showing how you can get a list of jobs with no duplicate IDs (hopefully it helps):

import json

# suppose `data` is your input data, duplicate ids included
data = [{'id': 1, 'name': 'john'}, {'id': 1, 'name': 'mary'}, {'id': 2, 'name': 'george'}]

# a dict comprehension keyed on 'id' keeps one entry per id; calling .values() gives the deduplicated list
noduplicate = list({itm['id']: itm for itm in data}.values())

with open('pbdb.json', 'w') as job_data_file:
    json.dump(noduplicate, job_data_file)

  • Well that doesn't make sense because it just overwrites duplicate data
    – roganjosh
    Commented Mar 22, 2020 at 21:54
  • You are right; however, entries sharing an id are duplicates, and from the question it seems that keeping one copy of each is sufficient, so overwriting should not matter.
    – Bradia
    Commented Mar 22, 2020 at 22:30
  • That's not true. You have no guarantee on the order of records, generally, with JSON. So you have no idea which value of a duplicate key would be saved
    – roganjosh
    Commented Mar 22, 2020 at 22:33
  • Here is the quote from the question: "Since my script pulls information in the same way from an API more than once, I would like to append new 'blocks' to the json file only if the ID doesn't already exist in the JSON file." That means duplicate data are coming in.
    – Bradia
    Commented Mar 22, 2020 at 22:36
  • Great. And your answer doesn't deal with that. The very act of opening the file in write mode just wipes its contents
    – roganjosh
    Commented Mar 22, 2020 at 22:41
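
For reference, the dict-comprehension idea can handle that if the existing file contents are loaded and merged with the new data before writing. A rough sketch (new_jobs again stands in for the freshly fetched API data, and it assumes the stored entries use an 'ID' key):

import json
import os

# load whatever is already in the file (empty list if the file is missing or empty)
existing = []
if os.path.exists('pbdb.json') and os.stat('pbdb.json').st_size > 0:
    with open('pbdb.json') as f:
        existing = json.load(f)

# placeholder for the freshly fetched API data
new_jobs = [{'ID': '23926695', 'Title': 'aewafs'}, {'ID': '99999999', 'Title': 'something new'}]

# keying on 'ID' with the existing entries first means a later duplicate from the API
# overwrites the stored copy; reverse the concatenation if the stored copy should win
merged = list({entry['ID']: entry for entry in existing + new_jobs}.values())

with open('pbdb.json', 'w') as f:
    json.dump(merged, f, indent=2)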

I'll just go with a database, guys. Thank you for your time; we can close this thread now.
