Co-written by Terry McCann & Simon Whiteley.
A few weeks ago we delivered a condensed version of our Azure Databricks course to a sold-out crowd at the UK's largest data platform conference, SQLBits. The course was a condensed version of our 3-day Applied Azure Databricks programme. During the course we were asked a lot of incredible questions. This blog collects all of those questions together with a set of detailed answers. If you are looking to accelerate your journey to Databricks, then take a look at our Databricks services.
There were over 60 questions. Some are a little duplicated, and some require a lot more detail than others. 60 is too many to tackle in one blog, so this will be the first of 6 blogs going into detail on the questions. They are posted in the order they were asked. I have altered some of the questions to give them more context. Thank you to all those who asked questions.
Part one. Questions 1 to 10
Part two. Questions 11 to 20
Part three. Questions 21 to 30
Part four. Questions 31 to 40
Part five. Questions 41 to 50
Part six. Questions 51 to 63
Q1: Do compressed data sources (.csv.gz) also get properly distributed in Apache Spark?
A: It depends. When you read a compressed, serialised dataset such as a gzipped CSV, the initial read will be SINGLE-THREADED.
But once it has been read off disk, it will be held in memory as a distributed dataset, so it is only the initial read that is not distributed. This is because compressed files are hard to break into extents. A file which is readable and chunk-able can be divided into multiple extents in Azure Data Lake or another Hadoop-based file system. If you split the data into lots of compressed files, then you get a thread per file, which, depending on the number of files you have, may still become a bottleneck.
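As a minimal sketch of the pattern (the file path and partition count here are hypothetical): read the gzipped CSV, then repartition so the rest of the pipeline is distributed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# gzip is not splittable, so this initial read runs as a single task.
df = spark.read.csv("/mnt/datalake/raw/sales.csv.gz", header=True, inferSchema=True)

# Repartition after the read so downstream work is spread across the cluster.
df = df.repartition(64)
print(df.count())
```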
Q2: How do PySpark DataFrames relate to other Python DataFrames, e.g. Pandas - are they a totally different implementation, or are there similarities between them?
A: Spark DataFrames emulate the Pandas API, but they are a different implementation. Many developers who know Python well can sometimes rely too heavily on Pandas. Databricks are working on making Pandas work better on Spark, but for now you should use Spark DataFrames over Pandas. If you move between Pandas and Spark DataFrames, then you should look at using Apache Arrow to make the conversion from one to the other more performant. Remember that Spark DataFrames will end up being converted into RDD expressions by the Catalyst engine; Pandas DataFrames will not.
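A rough sketch of the Arrow-accelerated conversion is below. Note the config name shown is the Spark 2.x one; later versions rename it to spark.sql.execution.arrow.pyspark.enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Arrow speeds up the Spark <-> Pandas conversion (Spark 2.x config name).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark_df = spark.range(0, 1000000)

# Spark -> Pandas: the data is collected to the driver, so keep it small.
pandas_df = spark_df.toPandas()

# Pandas -> Spark: back to a distributed DataFrame that Catalyst can optimise.
spark_df_again = spark.createDataFrame(pandas_df)
```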
Q3: Does Databricks work with Azure Notebooks?
A: They're a similar implementation, but sending data to the cluster is something you have to code in manually. There is now Databricks Connect, which should allow this, although they will not operate in the same way. Databricks implements many optimisations on top of Jupyter which are only included in Databricks. https://docs.azuredatabricks.net/user-guide/dev-tools/db-connect.html
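As a sketch, assuming databricks-connect has been installed and pointed at a workspace, code written in a local notebook or IDE can then run against the remote cluster:

```python
# After `pip install databricks-connect` and `databricks-connect configure`,
# a locally built SparkSession is routed to the remote Databricks cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# This count runs on the Databricks cluster, not on the local machine.
print(spark.range(100).count())
```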
Q4: Why is 'from pyspark.sql.types import *' bad for performance if Spark is an optimised, lazy language and it only executes procedures when and if needed?
A: The lazy part is more about the transformations on the DataFrames. Importing libraries brings them into context on the servers as you run them - it's a fairly small memory overhead, but if you're managing a production workload it is better to be exact about the load you're putting on them. That library is quite large, so limit the scope of the import to ensure your code runs as fast as possible.
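For illustration, a narrower import pulls in only the types a schema actually uses (the column names here are hypothetical):

```python
# Instead of `from pyspark.sql.types import *`, import only what the schema needs.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema for a two-column file.
schema = StructType([
    StructField("region", StringType(), True),
    StructField("sales", IntegerType(), True),
])
```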
Q5: Can you force the caching of data without having to force it with a count? i.e. is there some parameter you can use?
A: Using count() is just an example of an action. Spark is lazy and will not execute your command until you trigger an action, so you could use any action you wish. There is little reason to put something in memory unless you need it, and forcing data into memory should be avoided as you can quickly run out of memory.
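A minimal sketch of the pattern (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# cache() only marks the DataFrame as cacheable; nothing is materialised
# until an action runs. count() is just a convenient action to trigger it.
df = spark.read.parquet("/mnt/datalake/curated/sales")  # hypothetical path
df.cache()
df.count()  # any action would do: collect(), a write, foreach(), ...
```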
Q6: Do you always store the results of an action into another variable?
A: Nope, it depends what you want to do with it. Actions can be things like "write it to disk", which wouldn't be a variable.
But the list of transformations you're applying is held as that "DataFrame" variable.
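A quick sketch of the distinction, using hypothetical paths and columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/mnt/datalake/curated/sales")  # hypothetical source

# A chain of transformations is held as a DataFrame variable; nothing runs yet.
filtered = df.filter(df["sales"] > 0).select("region", "sales")

# An action such as a write returns nothing worth keeping, so there is
# usually no variable to assign the result to.
filtered.write.mode("overwrite").parquet("/mnt/datalake/curated/sales_clean")
```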
Q7: Should we ever clean up and delete DataFrames we're no longer using?
A: You can choose to apply it over the same variable, i.e. "df = spark.read...", then "df = df.select(...)", and it'll change the df definition as you go. Or you can give the different stages different names to keep track of the stages of transformation. You don't need to clean up DataFrames, unless you use cache(), which is going to take up a chunk of memory on the cluster. If you're caching a large dataset, you'll probably want to clean it up if it is not being used.
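A sketch of both points, with placeholder paths and columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reusing the same variable name as the definition evolves.
df = spark.read.parquet("/mnt/datalake/raw/orders")  # hypothetical path
df = df.select("order_id", "region", "amount")
df = df.filter(df["amount"] > 0)

# Only cached DataFrames hold cluster memory, so release them when finished.
df.cache()
df.count()      # materialise the cache
# ... use df for several downstream queries ...
df.unpersist()  # free the memory once it is no longer needed
```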
Q8: When using spark.read with a schema, does it require us to select all columns in e.g. a CSV file? As an example, what happens behind the scenes if we were to calculate an average of an integer column across regions in a string column, in a file with 100,000,000 records and 200 columns, and we only need info from 2 of the columns?
A: This depends on the storage type. With CSV you have to read the full row, as there's no way of identifying a vertical slice of data from that file. If you use a columnar format such as Parquet, then each column can be excluded as required.
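For the scenario in the question, assuming a Parquet copy of the data at a hypothetical path, Spark would only scan the two columns involved in the aggregate:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With Parquet, Spark reads only the two columns the query needs; with CSV,
# every row must be read in full before the unwanted columns are dropped.
df = spark.read.parquet("/mnt/datalake/curated/wide_table")  # hypothetical path

avg_by_region = (
    df.select("region", "amount")   # vertical slice: 2 of the 200 columns
      .groupBy("region")
      .avg("amount")
)
avg_by_region.show()
```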
Q9: Could I, or I guess should I, connect to database services such as ADWH or SQLDB to get data from? Or are files (Blob Storage, ADLS) the preferred (or only possible) approach?
A: You can absolutely grab data from SQLDB, SQLDW, Cosmos DB etc. There are connectors for most databases.
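As one example, a JDBC read from an Azure SQL Database looks roughly like the sketch below; the server, database, credentials and table name are all placeholders, and it assumes the SQL Server JDBC driver is available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading from an Azure SQL Database over JDBC; every value below is a placeholder.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.Sales")
    .option("user", "my_user")
    .option("password", "my_password")
    .load()
)
df.show(5)
```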
Q10: To what degree do classic T-SQL coding practices apply (e.g. avoiding functions on the left side of the operator)?
A: I know what you're asking - I wrote a non-sargable function on my data, which in SQL is a big no-no.
But we are not working against a SQL engine; we are working on non-indexed files. Non-sargable relates to indexes, and we do not have any.