In Hive:
When I was studying how to run Hive on Amazon Elastic MapReduce, I came across an AWS online tutorial that introduced a custom SerDe (serializer/deserializer) for parsing JSON data. According to the tutorial, the jar file for the SerDe can be included in a .q Hive script by placing:
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
at the top of the script.
The code snippet below illustrates a simple example:
![](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC9aqdIk8nt-p-5CcZB8WSASLPhudBAJedMRBkYWz_JnVsyrb2A7yTOcLsew_KqTNRnnbOf3MnVCemRJLUalk_rFeYXQHH9YefHbdPNpNLpSyM1Ynz5OLNwkXadX8tNJzkidD6N4I4MhI/s400/hive.png)
The four names assigned to the ‘paths’ SerDe property are the JSON field names that correspond, in order, to the four columns declared in the Hive table above. Note that the last entry, ‘user.id’, shows how to retrieve a nested field.
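A minimal sketch of such a table declaration is shown below. The table name, column names, and JSON field names here are illustrative assumptions (the original snippet is only available as an image above), but the shape follows the AWS hive-ads sample: the ‘paths’ property lists, in column order, the JSON fields that feed each Hive column, with a dotted name reaching into a nested object.

```sql
-- Hypothetical example: an external table over JSON log records.
-- Column names and JSON field names are illustrative, not the exact
-- code from the screenshot above.
CREATE EXTERNAL TABLE impressions (
  request_begin_time STRING,
  ad_id              STRING,
  impression_id      STRING,
  user_id            STRING
)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
  -- one JSON field per Hive column, in declaration order;
  -- 'user.id' pulls the nested 'id' out of the 'user' object
  'paths' = 'requestBeginTime, adId, impressionId, user.id'
)
LOCATION 's3://example-bucket/path/to/json/data/';
```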
In Java:
An online tutorial posted at http://www.java2blog.com/ contains detailed examples for two of the most widely used Java libraries for JSON processing, JSON.simple and GSON; I followed those instructions during development.
In Python:
I used json, the JSON encoder and decoder in Python.
The following code snippet does exactly the same job as the Hive script example above. The json.loads() method was used; details can be found in the Python json module documentation.
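A sketch of that idea is below. The record layout and field names are hypothetical, chosen to mirror the nested ‘user.id’ lookup from the Hive ‘paths’ property: after json.loads() decodes the string into a dictionary, the dotted path simply becomes chained key access.

```python
import json

def parse_record(raw):
    """Decode one JSON record and return the four fields from the Hive example.

    The field names here are hypothetical; they mirror the nested
    'user.id' entry in the SerDe 'paths' property above.
    """
    record = json.loads(raw)  # decode the JSON string into a Python dict
    # Flat fields are plain key lookups; the nested 'user.id'
    # becomes chained dictionary access.
    return (record["id"], record["text"],
            record["created_at"], record["user"]["id"])

sample = '{"id": 101, "text": "hello", "created_at": "2014-01-01", "user": {"id": 42}}'
print(parse_record(sample))  # -> (101, 'hello', '2014-01-01', 42)
```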