Reports from Amazon's SP-API are generally in UTF-8, except for the ones from Japan, which are in CP932. I cannot figure out how to decode these into usable data.

I am running Ruby 3.1.2 and using the amz_sp_api gem to connect to Amazon.

For CSV reports we are doing:

data = AmzSpApi.inflate_document(content, report_document)
csv_string = CSV.generate do |csv|
  data.gsub("\r", "").split("\n").each do |line|
    csv << line.split("\t")
  end
end
csv_string.force_encoding 'ASCII-8BIT'
csv = CSV.parse(csv_string, headers: true)

This doesn't raise any errors, but the resulting data looks something like:

...
"ship-state"=>"\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C",

If I force the encoding to 'CP932' instead, then I get this when I try to parse the CSV:

3.1.2/lib/ruby/3.1.0/csv/parser.rb:786:in `build_scanner': Invalid byte sequence in Windows-31J in line 2. (CSV::MalformedCSVError)
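
One thing worth checking before picking an encoding is which encodings the raw bytes are actually well-formed in. The sketch below takes the nine bytes from the ship-state value above and tests them as both UTF-8 and CP932 (Ruby's name for it is Windows-31J). The variable names are illustrative, and the check only covers this one sample, not a whole report.

```ruby
# Bytes copied from the "ship-state" value shown above, tagged as binary
# (ASCII-8BIT), the same way the inflated report is.
raw = "\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C".b

# force_encoding only relabels the bytes; it never converts them.
as_utf8  = raw.dup.force_encoding(Encoding::UTF_8)
as_cp932 = raw.dup.force_encoding(Encoding::Windows_31J)

puts "valid as UTF-8: #{as_utf8.valid_encoding?}"
puts "valid as CP932: #{as_cp932.valid_encoding?}"
puts as_utf8 if as_utf8.valid_encoding?
```

For this sample the UTF-8 check passes and the text reads 福岡県, while the CP932 check fails. That would be consistent with the "Invalid byte sequence in Windows-31J" error, and it suggests the payload may already be UTF-8 regardless of how it is labelled.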

For the XML reports we are using Nokogiri and doing something like this:

data = AmzSpApi.inflate_document(content, report_document)
parsed_xml = Nokogiri::XML(data)

The resulting XML contains only part of the first node; parsing seems to fail silently.

In the above example, data has this encoding:

data.encoding
=> #<Encoding:ASCII-8BIT>

You get the idea.

I obviously need to do SOMETHING to get all this to parse out properly but I am unclear what that something is.

I suspect the data is being converted from a byte string to a string somewhere, but if so, that must be happening automatically behind the scenes.
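
For the tab-separated reports described above, one possible approach is to relabel the inflated bytes and hand them straight to the CSV parser with a tab column separator, rather than rebuilding a CSV string first. This is a minimal sketch that assumes the bytes are really UTF-8. The `data` value below is a made-up stand-in for the output of AmzSpApi.inflate_document, and the column names are illustrative.

```ruby
require "csv"

# Hypothetical stand-in for the inflated report body: tab-separated,
# CRLF line endings, tagged as binary (ASCII-8BIT).
data = "order-id\tship-state\r\n123\t\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C\r\n".b

# Relabel the bytes as UTF-8 and fail loudly if that assumption is wrong.
text = data.dup.force_encoding(Encoding::UTF_8)
raise ArgumentError, "payload is not valid UTF-8" unless text.valid_encoding?

# Parse the tab-separated text directly; no intermediate CSV string needed.
rows = CSV.parse(text, col_sep: "\t", headers: true)
rows.each { |row| puts "#{row['order-id']}: #{row['ship-state']}" }
```

If the validity check raises, the bytes are in some other encoding and would need to be transcoded rather than relabelled.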

  • If you know this file is in CP932, likely Shift-JIS, set your encoding to that. Forcing to ASCII seems counter-productive. You'll want to convert any input to UTF-8 as soon as possible to avoid encoding issues internally.
    – tadman
    Commented Apr 11, 2023 at 14:44
  • Thanks @tadman. The ASCII 8-BIT is what rails is giving me. I will try again to force to CP932 as soon as the data stream is read and see what happens and update the question.
    – phil
    Commented Apr 12, 2023 at 0:14
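
The advice in the first comment, converting input to UTF-8 as early as possible, can be sketched as follows for bytes that genuinely are CP932. The sample bytes are generated here from a known string purely for illustration; they are not taken from a real report.

```ruby
# Hypothetical sample: build genuine CP932 bytes from a known string,
# then tag them as binary the way a downloaded body would be.
cp932_bytes = "\u798F\u5CA1\u770C".encode(Encoding::Windows_31J).b

# Step 1: label the raw bytes with their real encoding (no conversion).
# Step 2: transcode to UTF-8 so the rest of the program sees one encoding.
utf8_text = cp932_bytes.force_encoding(Encoding::Windows_31J)
                       .encode(Encoding::UTF_8)

puts utf8_text.encoding
puts utf8_text
```

The distinction matters: force_encoding changes only the label, while encode rewrites the bytes. Labelling bytes with an encoding they are not in is what produces "invalid byte sequence" errors later on.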

1 Answer

What doesn't work (but works for all Amazon reports in other regions that come down as UTF-8):

report_result, _status_code, headers = importer.api.get_report_with_http_info(report_id)
report_document_id = report_result.report_document_id
report_document = importer.api.get_report_document(report_document_id)
url = report_document.url
content = Faraday.get(url).body
p "Content is #{content.encoding}"
data = AmzSpApi.inflate_document(content, report_document)
p "Data is #{data.encoding}"
xml = Nokogiri::XML(data)
p "We found #{xml.xpath("//Order").count} orders"

Output:

"Content is ASCII-8BIT"
"Data is ASCII-8BIT"
"We found 1 orders"

In the above, the XML is treated as malformed and does not parse correctly (hence the 1 order).

What works:

report_result, _status_code, headers = importer.api.get_report_with_http_info(report_id)
report_document_id = report_result.report_document_id
report_document = importer.api.get_report_document(report_document_id)
url = report_document.url
content = Faraday.get(url).body
p "Content is #{content.encoding}"
data = AmzSpApi.inflate_document(content, report_document).gsub("CP932", "UTF-8")
p "Data is #{data.encoding}"
xml = Nokogiri::XML(data)
p "We found #{xml.xpath("//Order").count} orders"

Output:

"Content is ASCII-8BIT"
"Data is ASCII-8BIT"
"We found 151 orders"

The issue seems to be that Nokogiri (and other online parsers I found) cannot handle the XML declaration that says the encoding is CP932.

<?xml version="1.0" encoding="CP932"?>

The above code with gsub also works for UTF-8 files (because there is nothing to replace, so it does nothing).
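
Because gsub replaces every occurrence of "CP932" in the document, it would also alter any order data that happens to contain that text. A narrower variant, sketched below, rewrites only the encoding named in the leading XML declaration. The helper name relabel_xml_as_utf8 is hypothetical, and the sketch assumes the payload bytes are really UTF-8 and that the declaration sits at the very start of the document.

```ruby
# Rewrites only the encoding named in a leading XML declaration.
# Hypothetical helper; assumes the payload bytes are really UTF-8.
def relabel_xml_as_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "payload is not valid UTF-8" unless text.valid_encoding?

  # \A anchors the match to the start, so element content is never touched.
  text.sub(/\A(<\?xml[^>]*?encoding=["'])[^"']+(["'])/) do
    "#{Regexp.last_match(1)}UTF-8#{Regexp.last_match(2)}"
  end
end

# Made-up sample in which "CP932" also appears inside the data.
sample = %(<?xml version="1.0" encoding="CP932"?><Orders><Order note="CP932"/></Orders>).b
puts relabel_xml_as_utf8(sample)
```

The returned string can then be passed to Nokogiri::XML as in the code above. A document without a declaration, or one already declared as UTF-8, comes back with the same content.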

NOTE: If you use HTTParty instead of Faraday the content encoding is UTF-8 instead of ASCII-8BIT but the issue (and solution) remains the same.
