Spark SQL – Inferring the Schema using Reflection

This method uses reflection to generate the schema of an RDD that contains specific types of objects. The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and they become the names of the columns.

Case classes can also be nested or contain complex types such as Seqs or Arrays. Such an RDD can be implicitly converted to a DataFrame and then registered as a table, which can be queried in subsequent SQL statements.
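
For instance, a nested case class with a Seq field can be converted in the same way. The following is only an illustrative sketch: the Address and Person names are made up for this example, and it assumes a running Spark shell in which sqlContext.implicits._ has already been imported, as shown later in this chapter.

scala> case class Address(city: String, zip: String)
scala> case class Person(name: String, addresses: Seq[Address])
scala> val people = sc.parallelize(Seq(Person("alice", Seq(Address("pune", "411001"))))).toDF()
scala> people.printSchema()

Here, printSchema() should show the addresses column as an array of structs containing the city and zip fields.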

Example

Let us consider an example of employee records in a text file named employee.txt. We will create an RDD by reading the data from the text file and convert it into a DataFrame using the implicit conversions provided by Spark SQL.

Given Data: Take a look at the following data in a file named employee.txt, placed in the directory from which the spark shell is being run.

1201, subrat, 25
1202, zafurl, 28
1203, ajit, 39
1204, jay, 23
1205, devi, 23

The following example explains how to generate a schema using reflection.

Start the Spark Shell

Start the Spark Shell using the following command.

$ spark-shell

Create SQLContext

Create an SQLContext using the following command. Here, sc is the SparkContext object.

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Import SQL Functions

Use the following command to import all the SQL functions used to implicitly convert an RDD to a DataFrame.

scala> import sqlContext.implicits._

Create Case Class

Next, we have to define a schema for employee record data using a case class. The following command is used to declare the case class based on the given data (id, name, age).

scala> case class Employee(id: Int, name: String, age: Int)
defined class Employee

Create RDD and Apply Transformations

Use the following command to generate an RDD named empl by reading the data from employee.txt and converting it into a DataFrame using map functions.

Here, two map functions are used. The first splits each text record into fields (.map(_.split(","))), and the second converts the individual fields (id, name, age) into one case class object (.map(e => Employee(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))).

Finally, the toDF() method converts the RDD of case class objects, along with its schema, into a DataFrame.

scala> val empl = sc.textFile("employee.txt")
  .map(_.split(","))
  .map(e => Employee(e(0).trim.toInt, e(1).trim, e(2).trim.toInt))
  .toDF()

Output

empl: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int]
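
If you want to verify the inferred schema, you can call the printSchema() method on the DataFrame. It should print a tree similar to the following.

scala> empl.printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)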

Store the DataFrame Data in a Table

Use the following command to store the DataFrame data into a table named employee. After this command, we can run SQL statements against it.

scala> empl.registerTempTable("employee")

The employee table is ready. Let us now run some SQL queries on the table using the SQLContext.sql() method.

Select Query on DataFrame

Use the following command to select all the records from the employee table. Here, the variable allrecords captures the result. To display those records, call the show() method on it.

scala> val allrecords = sqlContext.sql("SELECT * FROM employee")

To see the result data of allrecords DataFrame, use the following command.

scala> allrecords.show()

Output

+------+---------+----+
|  id  |  name   |age |
+------+---------+----+
| 1201 | subrat  | 25 |
| 1202 | zafurl  | 28 |
| 1203 | ajit    | 39 |
| 1204 | jay     | 23 |
| 1205 | devi    | 23 |
+------+---------+----+

Where Clause SQL Query on DataFrame

Use the following command to apply a WHERE clause on the table. Here, the variable agefilter stores the records of employees whose age is between 20 and 35.

scala> val agefilter = sqlContext.sql("SELECT * FROM employee WHERE age >= 20 AND age <= 35")

To see the result data of agefilter DataFrame, use the following command.

scala> agefilter.show()

Output

+------+---------+----+
|  id  |  name   |age |
+------+---------+----+
| 1201 | subrat  | 25 |
| 1202 | zafurl  | 28 |
| 1204 | jay     | 23 |
| 1205 | devi    | 23 |
+------+---------+----+
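
The same filtering can also be expressed directly on the DataFrame, without writing a SQL string. As a sketch, using the empl DataFrame created above, the filter() method accepts a condition expression and should produce the same result as the WHERE clause query.

scala> empl.filter("age >= 20 AND age <= 35").show()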

The previous two queries were run against the whole table DataFrame. Now let us fetch data from the result DataFrame by applying transformations to it.

Fetch ID values from agefilter DataFrame using column index

The following statement fetches the ID values from the agefilter DataFrame, using the field index.

scala> agefilter.map(t => "ID: " + t(0)).collect().foreach(println)

Output

ID: 1201
ID: 1202
ID: 1204
ID: 1205
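
Fields can also be fetched by column name instead of position. As an illustrative variant, Spark versions that support looking up a Row field by name with getAs should print the same ID values as above.

scala> agefilter.map(t => "ID: " + t.getAs[Int]("id")).collect().foreach(println)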

This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
