Decoding Amazon Reports in CP932 with Ruby

Reports out of Amazon's SP-API are generally in UTF-8 except for the ones out of Japan, which are in CP932. I cannot seem to figure out how to decode these into usable data.

I am running Ruby 3.1.2 and using the amz_sp_api gem to connect to Amazon.

For CSV reports we are doing:

    data = AmzSpApi.inflate_document(content, report_document)
    csv_string = CSV.generate do |csv|
      data.gsub("\r", "").split("\n").each do |line|
        csv << line.split("\t")
      end
    end
    csv_string.force_encoding 'ASCII-8BIT'
    csv = CSV.parse(csv_string, headers: true)

Which doesn't complain about anything, but the resulting data looks something like:

    ...
    "ship-state"=>"\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C",

If I force the encoding to be 'CP932' then when I try to parse the csv I get:

    3.1.2/lib/ruby/3.1.0/csv/parser.rb:786:in `build_scanner': Invalid byte sequence in Windows-31J in line 2. (CSV::MalformedCSVError)

For the XML reports we are using Nokogiri and doing something like this:

    data = AmzSpApi.inflate_document(content, report_document)
    parsed_xml = Nokogiri::XML(data)

The resulting xml is actually only part of the first node because it seems to silently fail.

In the above example data has:

    data.encoding
    => #<Encoding:ASCII-8BIT>

You get the idea.

I obviously need to do SOMETHING to get all this to parse out properly but I am unclear what that something is.

I suspect the data is being converted from a byte string to a string somewhere, but if so, that must be happening automatically behind the scenes.

Answer

What doesn't work (but works for all Amazon reports in other regions that come down as UTF-8):

```
report_result, _status_code, headers = importer.api.get_report_with_http_info(report_id)
report_document_id = report_result.report_document_id
report_document = importer.api.get_report_document(report_document_id)
url = report_document.url
content = Faraday.get(url).body
p "Content is #{content.encoding}"
data = AmzSpApi.inflate_document(content, report_document)
p "Data is #{data.encoding}"
xml = Nokogiri::XML(data)
p "We found #{xml.xpath("//Order").count} orders"
```
Output:
```
"Content is ASCII-8BIT"
"Data is ASCII-8BIT"
"We found 1 orders"
```

In the above, the parsed XML is malformed and unusable, hence the single order.

What works:
```
report_result, _status_code, headers = importer.api.get_report_with_http_info(report_id)
report_document_id = report_result.report_document_id
report_document = importer.api.get_report_document(report_document_id)
url = report_document.url
content = Faraday.get(url).body
p "Content is #{content.encoding}"
# Replace every "CP932" in the document, including the one in the XML declaration.
data = AmzSpApi.inflate_document(content, report_document).gsub("CP932", "UTF-8")
p "Data is #{data.encoding}"
xml = Nokogiri::XML(data)
p "We found #{xml.xpath("//Order").count} orders"
```

Output:
```
"Content is ASCII-8BIT"
"Data is ASCII-8BIT"
"We found 151 orders"
```

The issue seems to be that Nokogiri (and the online parsers I tried) cannot handle the XML declaration that says the encoding is CP932:

`<?xml version="1.0" encoding="CP932"?>`
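One possible explanation, going by the sample value in the question: the bytes `\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C` are valid UTF-8 (they spell 福岡県) but not valid CP932, which would suggest the report body is UTF-8 and only the declaration is mislabeled. A minimal sketch for checking this on your own data; the variable names are illustrative:

```ruby
# The "ship-state" bytes quoted in the question, held as a binary string.
bytes = "\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C".b

# Relabel (not transcode) the same bytes under each candidate encoding.
as_utf8  = bytes.dup.force_encoding(Encoding::UTF_8)
as_cp932 = bytes.dup.force_encoding(Encoding::Windows_31J)

puts "valid as UTF-8: #{as_utf8.valid_encoding?}"
puts "valid as CP932: #{as_cp932.valid_encoding?}"
puts as_utf8 if as_utf8.valid_encoding?
```

If the UTF-8 check passes and the CP932 check fails on a real report, that would be consistent with the `Invalid byte sequence in Windows-31J` error from the question.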

The `gsub` version also works for UTF-8 reports, because the substitution does nothing in that case.
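One caveat with `gsub("CP932", "UTF-8")` is that it rewrites every occurrence of the string, including any that happen to appear in the report data. Here is a narrower sketch that touches only the declaration at the top of the document. `relabel_xml_declaration` is a hypothetical helper, not part of amz_sp_api or Nokogiri, and the sample document is made up:

```ruby
# Hypothetical helper: swap only the encoding named in the leading XML
# declaration and leave the rest of the document untouched.
def relabel_xml_declaration(data, to: "UTF-8")
  data.sub(/\A(\s*<\?xml[^>]*?encoding=["'])[^"']+(["'])/, "\\1#{to}\\2")
end

# Made-up sample standing in for the inflated report (a binary string).
sample = %(<?xml version="1.0" encoding="CP932"?>\n<Orders><Order><Note>CP932 \xE7\xA6\x8F</Note></Order></Orders>).b

fixed = relabel_xml_declaration(sample)
puts fixed.lines.first
```

The result can then be passed to `Nokogiri::XML` as in the working example. A document with no declaration is returned unchanged.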

NOTE: If you use `HTTParty` instead of `Faraday`, the content encoding is reported as `UTF-8` instead of `ASCII-8BIT`, but the issue (and the solution) remains the same.
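For the tab-delimited reports from the question, the same idea might carry over, assuming those bodies are also UTF-8 bytes that merely arrive tagged as ASCII-8BIT. That is an assumption, not something the XML result above proves. A sketch with made-up sample data standing in for the output of `AmzSpApi.inflate_document`:

```ruby
require "csv"

# Made-up stand-in for AmzSpApi.inflate_document(content, report_document):
# a tab-delimited report held in a binary (ASCII-8BIT) string.
data = "order-id\tship-state\r\n123\t\xE7\xA6\x8F\xE5\xB2\xA1\xE7\x9C\x8C\r\n".b

# Assumption: the bytes are really UTF-8, so relabel them instead of transcoding.
text = data.force_encoding(Encoding::UTF_8)
raise ArgumentError, "report is not valid UTF-8" unless text.valid_encoding?

# Let CSV split on tabs directly rather than rebuilding the rows by hand.
rows = CSV.parse(text, col_sep: "\t", headers: true)
puts rows.first["ship-state"]
```

Passing `col_sep: "\t"` avoids the intermediate `CSV.generate` step from the question, and the `valid_encoding?` guard fails loudly if a report turns out not to be UTF-8 after all.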
