What is Spark SQL?

By: narayana reddy
Submitted 2019-03-08 07:24:42
Spark is one of the most successful projects of the Apache Software Foundation. Its Spark SQL module integrates relational processing with Spark's functional programming API.

Spark SQL lets you query data with SQL or the Hive query language (HiveQL), so anyone familiar with an RDBMS will find the syntax easy to relate to. It also makes it easy to locate tables and their metadata. Spark SQL works with structured and semi-structured data. Structured data has a schema, that is, a known set of fields. Semi-structured data has no separation between the schema and the data: each record carries its own field names, as in a JSON document.
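
To make the difference between structured and semi-structured data concrete, here is a minimal sketch in plain Python. It does not use the Spark API. The sample records and the helper names `infer_schema` and `to_rows` are made up for illustration. It takes JSON records that each carry their own field names, derives one shared schema from them, and projects the records onto that schema as fixed-width rows.

```python
import json

# Semi-structured input: each JSON record names its own fields,
# and records need not share the same set of fields.
RAW_LINES = [
    '{"name": "Ana", "age": 31}',
    '{"name": "Bo", "city": "Oslo"}',
    '{"name": "Cy", "age": 45, "city": "Lima"}',
]


def infer_schema(records):
    """Collect every field seen across the records, with the first type observed."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema


def to_rows(records, schema):
    """Project each record onto the full schema, filling absent fields with None."""
    columns = list(schema)
    return [tuple(record.get(col) for col in columns) for record in records]


records = [json.loads(line) for line in RAW_LINES]
schema = infer_schema(records)   # structured view: a known set of fields
rows = to_rows(records, schema)  # every row now has the same columns

print(schema)
for row in rows:
    print(row)
```

In Spark itself this step is handled for you: reading JSON in PySpark with `spark.read.json(path)` infers a schema from the records.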

Spark SQL definition – Put simply, Spark SQL is the Spark module for processing structured and semi-structured data.

Hive limitations

Apache Hive was originally designed to run on top of Apache Hadoop, using MapReduce as its execution engine. It has considerable limitations:
1) To run ad-hoc queries, Hive internally launches MapReduce jobs, and MapReduce lags in performance when processing medium-sized data sets.
2) If processing fails partway through a workflow, Hive cannot resume from the point of failure once the system is back to normal.
3) When trash is enabled, dropping encrypted databases in cascade leads to an execution error.
Spark SQL was created to overcome these limitations.

The architecture of Spark SQL
Spark SQL consists of three main layers:
Language API – Spark SQL can be used from Python, Scala, and Java, and it accepts queries written in HiveQL.
SchemaRDD – The RDD (resilient distributed dataset) is the core data structure of Spark. Because Spark SQL works on schemas, tables, and records, a SchemaRDD (called a DataFrame in later versions) can be used as a temporary table.
Data sources – For Spark core, the data source is usually a text file, an Avro file, and so on. Spark SQL works with additional sources such as JSON documents, Parquet files, Hive tables, and Cassandra databases.

Components of Spark SQL

Spark SQL DataFrames – The DataFrame, introduced in Spark 1.3, overcame two shortcomings of RDDs: RDDs had no provision for handling structured data, and there was no optimization engine for structured data, so the developer had to optimize each RDD by hand based on its attributes. A Spark DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database.
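
The idea of data organized into named columns can be sketched in a few lines of plain Python. This is a toy, single-machine model, not Spark's DataFrame class. The name `MiniFrame`, its methods, and the sample rows are hypothetical. The point is that operations refer to columns by name instead of by position in an opaque record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniFrame:
    """Toy, single-machine stand-in for a table with named columns."""
    columns: tuple
    rows: tuple

    def where(self, column, predicate):
        """Keep the rows whose value in `column` satisfies `predicate`."""
        idx = self.columns.index(column)
        kept = tuple(row for row in self.rows if predicate(row[idx]))
        return MiniFrame(self.columns, kept)

    def select(self, *names):
        """Keep only the named columns, in the order requested."""
        idxs = [self.columns.index(name) for name in names]
        projected = tuple(tuple(row[i] for i in idxs) for row in self.rows)
        return MiniFrame(tuple(names), projected)


people = MiniFrame(
    columns=("name", "age", "city"),
    rows=(("Ana", 31, "Oslo"), ("Bo", 17, "Lima"), ("Cy", 45, "Oslo")),
)

# Filter and project by column name, as one would in a relational table.
adults = people.where("age", lambda age: age >= 18).select("name", "city")
print(adults.columns)
print(adults.rows)
```

A real DataFrame adds what this sketch leaves out: the rows are distributed across a cluster, and the operations are optimized before they run.
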
Spark SQL Datasets – The Dataset interface was added in Spark 1.6. It combines the benefits of RDDs with the benefits of Spark SQL's optimized execution engine. Encoders convert between JVM objects and the tabular representation. A Dataset can be constructed from JVM objects and then manipulated with functional transformations such as map and filter. The Dataset API is available in Scala and Java but is not supported in Python.
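
Since the Dataset API is not available in Python, the following is only a conceptual sketch of what an encoder does, written in plain Python with dataclasses standing in for JVM objects. The names `Person`, `schema_of`, `encode`, and `decode` are hypothetical and not part of Spark. It shows typed objects being converted to tabular rows and back, with filter and map applied to the typed objects.

```python
from dataclasses import astuple, dataclass, fields


@dataclass(frozen=True)
class Person:
    name: str
    age: int


def schema_of(cls):
    """Derive (column name, type name) pairs from the class definition."""
    return [(f.name, getattr(f.type, "__name__", str(f.type))) for f in fields(cls)]


def encode(obj):
    """Typed object -> tabular row."""
    return astuple(obj)


def decode(cls, row):
    """Tabular row -> typed object."""
    return cls(*row)


people = [Person("Ana", 31), Person("Bo", 17)]
rows = [encode(p) for p in people]  # tabular representation

# Functional transformations over typed objects: filter, then map.
adults = [p for p in (decode(Person, row) for row in rows) if p.age >= 18]
names = [p.name.upper() for p in adults]

print(schema_of(Person))
print(rows)
print(names)
```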

Spark Catalyst Optimizer – Catalyst is the optimizer used in Spark SQL. All queries written in Spark SQL or the DataFrame DSL are optimized by it. Because this optimization happens automatically, such queries typically perform better than equivalent hand-written RDD code.
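
As a rough illustration of rule-based optimization, the sketch below applies one rewrite rule, constant folding, to a tiny expression tree in plain Python. It is a toy model and not Catalyst's code or API. The node classes `Lit`, `Col`, and `Add` and the function `fold_constants` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """A literal constant."""
    value: int


@dataclass(frozen=True)
class Col:
    """A reference to a named column."""
    name: str


@dataclass(frozen=True)
class Add:
    """The sum of two sub-expressions."""
    left: object
    right: object


def fold_constants(expr):
    """Rewrite rule: replace Add(Lit, Lit) with a single precomputed Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr  # literals and column references are left unchanged


# x + (1 + 2): the constant part can be computed once, before any row is read.
plan = Add(Col("x"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
print(optimized)
```

The rewritten tree computes `1 + 2` once instead of once per row. To see what the real optimizer produced for a query, PySpark offers `df.explain(True)`, which prints the query plans.
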
Author Resource:- onlineitguru
Article From Articles Promoter Article Directory
