Working with big data can initially feel overwhelming — with rows and columns stretching into millions, traditional tools often slow to a crawl. That’s where PySpark shines. It combines Python’s simplicity with Spark’s distributed power, letting you process massive datasets with ease. However, learning PySpark can feel like wandering through a giant toolbox without knowing which tools matter. You don’t need every single function to get real work done. What you need are the essentials — the ones you’ll use daily to clean, transform, and analyze data. This guide walks you through those key PySpark functions, with simple examples.
The select() function is your go-to when you only need certain columns from a DataFrame. Instead of hauling around the whole table, you can keep just what matters.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.select("name").show()
Output:
+-------+
| name|
+-------+
| Alice|
| Bob|
|Charlie|
+-------+
You can also use selectExpr() to write SQL-like expressions when selecting columns.
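For instance, a minimal sketch that selects a column and computes a derived value in one SQL-like expression:
df.selectExpr("name", "age + 10 as age_plus_10").show()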
When you need to create a new column or modify an existing one, use withColumn(). You pass it the column name and the expression that computes it.
Example:
from pyspark.sql.functions import col
df.withColumn("age_plus_10", col("age") + 10).show()
Output:
+-------+---+-----------+
| name|age|age_plus_10|
+-------+---+-----------+
| Alice| 34| 44|
| Bob| 45| 55|
|Charlie| 29| 39|
+-------+---+-----------+
You’ll often need to work with a subset of your data. filter() or where() helps you keep only rows that match a condition.
Example:
df.filter(col("age") > 30).show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 34|
| Bob| 45|
+-----+---+
filter() and where() are interchangeable. Use whichever feels more readable to you.
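As a quick sketch, the same condition written with where() produces the identical result:
df.where(col("age") > 30).show()  # same rows as the filter() example above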
To summarize data, you’ll use groupBy() combined with aggregation functions. You can compute counts, averages, sums, etc.
Example:
from pyspark.sql.functions import avg
data = [("Math", "Alice", 85), ("Math", "Bob", 78),
("English", "Alice", 90), ("English", "Bob", 80)]
df2 = spark.createDataFrame(data, ["subject", "student", "score"])
df2.groupBy("subject").agg(avg("score").alias("avg_score")).show()
Output:
+-------+---------+
|subject|avg_score|
+-------+---------+
| Math| 81.5|
|English| 85.0|
+-------+---------+
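You’re not limited to one aggregate per group. A minimal sketch combining an average with a per-group row count:
from pyspark.sql.functions import avg, count
df2.groupBy("subject").agg(
    avg("score").alias("avg_score"),
    count("score").alias("num_scores")
).show()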
If you want your results in a specific order, use orderBy() or sort(). Both do the same thing.
Example:
df.orderBy(col("age").desc()).show()
Output:
+-------+---+
| name|age|
+-------+---+
| Bob| 45|
| Alice| 34|
|Charlie| 29|
+-------+---+
You can sort by multiple columns if needed.
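For example, a quick sketch that sorts by age descending and breaks ties by name ascending:
df.orderBy(col("age").desc(), col("name").asc()).show()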
Sometimes you want to remove a column you no longer need. drop() does exactly that.
Example:
df.drop("age").show()
Output:
+-------+
| name|
+-------+
| Alice|
| Bob|
|Charlie|
+-------+
To get unique rows from your DataFrame, use distinct().
Example:
data = [("Alice", 34), ("Alice", 34), ("Bob", 45)]
df3 = spark.createDataFrame(data, ["name", "age"])
df3.distinct().show()
Output:
+-----+---+
| name|age|
+-----+---+
| Bob| 45|
|Alice| 34|
+-----+---+
dropDuplicates() is like distinct(), but you can specify which columns to consider when checking for duplicates.
Example:
df3.dropDuplicates(["name"]).show()
Output:
+-----+---+
| name|age|
+-----+---+
| Bob| 45|
|Alice| 34|
+-----+---+
Combining two DataFrames is a common need. Use join() to merge on a common column.
Example:
data1 = [("Alice", "Math"), ("Bob", "English")]
df4 = spark.createDataFrame(data1, ["name", "subject"])
data2 = [("Alice", 85), ("Bob", 78)]
df5 = spark.createDataFrame(data2, ["name", "score"])
df4.join(df5, on="name").show()
Output:
+-----+-------+-----+
| name|subject|score|
+-----+-------+-----+
|Alice| Math| 85|
| Bob|English| 78|
+-----+-------+-----+
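By default, join() performs an inner join. Pass the how argument for other types; a minimal sketch of a left join, which keeps every row from the left DataFrame even without a match:
df4.join(df5, on="name", how="left").show()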
When working with large datasets, recomputing the same DataFrame over and over gets slow. cache() keeps it in memory so repeated access is faster.
Example:
df.cache()
df.count() # This action triggers caching
There’s no visible output here, but future operations on df will run faster.
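When you no longer need the cached data, you can release the memory; a quick sketch:
df.unpersist()  # drops the cached copy of df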
To get your results back to Python as a list of rows, use collect(). Be careful: if your data is huge, this can crash your driver.
Example:
rows = df.collect()
print(rows)
Output:
[Row(name='Alice', age=34), Row(name='Bob', age=45), Row(name='Charlie', age=29)]
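If you only need a few rows, take() is a safer sketch than collecting everything, since it pulls just those rows to the driver:
print(df.take(2))  # returns the first 2 rows as a list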
This one you’ve already seen throughout the examples. show() prints your DataFrame in a readable tabular format.
Example:
df.show()
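show() also accepts arguments for how many rows to print and whether to truncate long values; a quick sketch:
df.show(2, truncate=False)  # print 2 rows without truncating wide columns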
To quickly find out how many rows you have, use count().
Example:
df.count()
Output:
3
To replace specific values in a DataFrame, use replace().
Example:
df.replace("Alice", "Alicia", "name").show()
Output:
+-------+---+
| name|age|
+-------+---+
| Alicia| 34|
| Bob| 45|
|Charlie| 29|
+-------+---+
To fill in missing values with a default, use fillna().
Example:
data = [("Alice", None), ("Bob", 45)]
df6 = spark.createDataFrame(data, ["name", "age"])
df6.fillna(0).show()
Output:
+-----+---+
| name|age|
+-----+---+
|Alice| 0|
| Bob| 45|
+-----+---+
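fillna() also accepts a dictionary so each column gets its own default; a minimal sketch:
df6.fillna({"age": 0, "name": "unknown"}).show()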
When working with columns that contain arrays or lists, you often need to turn each element of the array into its own row. That’s what explode() does: it flattens an array column.
Example:
from pyspark.sql.functions import explode
data = [("Alice", ["Math", "English"]), ("Bob", ["History", "Science"])]
df7 = spark.createDataFrame(data, ["name", "subjects"])
df7.select("name", explode("subjects").alias("subject")).show()
Output:
+-----+--------+
| name| subject|
+-----+--------+
|Alice| Math|
|Alice| English|
| Bob| History|
| Bob| Science|
+-----+--------+
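If an array might be null or empty and you still want to keep the row (with a null in place of a subject), explode_outer() handles that; a quick sketch:
from pyspark.sql.functions import explode_outer
df7.select("name", explode_outer("subjects").alias("subject")).show()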
Knowing which PySpark functions to focus on saves you time and makes your code much cleaner. You don’t need to memorize every single function in the library — the ones we’ve covered here are more than enough to handle most real-world tasks. They cover selecting and transforming columns, filtering and grouping data, joining DataFrames, dealing with duplicates and missing values, and working efficiently with cached data. As you get more practice, you’ll start using these almost without thinking. PySpark is powerful because of how much you can do with just a few well-chosen functions. Start with these, experiment with your datasets, and the rest will come naturally.
For more information on PySpark, you can visit the Apache Spark Documentation.