Someone drops a CSV in your lap. Maybe it's a database export, a log dump, or a spreadsheet that's been passed around for two years with no documentation. Your job is to do something useful with it. Where do you even start?
Most engineers' instinct is to jump straight in: write a query, train a model, build a chart. That instinct is wrong, and it will cost you. You'll spend hours chasing a correlation that turns out to be an artifact of missing data. You'll build a model on a column that's 40% nulls. You'll present a chart to stakeholders that has an outlier so extreme it compresses everything else into a flat line.
Exploratory Data Analysis, or EDA, is the discipline of understanding a dataset before you do anything with it. It's the investigative phase, the forensic pass, the moment where you ask the data what it actually contains rather than assuming it matches what the schema says. Think of it as the difference between a doctor who orders tests before diagnosing, and one who prescribes on gut feel.
EDA has no single definition and no fixed checklist, but it has a clear purpose: eliminate surprises. By the end of an EDA pass, you should know the shape of every column, where the data is missing or corrupt, which variables are related, and where the anomalies live.
The Problem EDA Solves
Data in the real world is almost never clean. It arrives with encoding errors, duplicated rows, columns that were supposed to be numeric but contain strings like "N/A" or "—", timestamps in three different formats, and categorical fields that changed meaning halfway through the dataset's history.
None of that is visible in a schema. A schema tells you what the data is supposed to look like. EDA tells you what it actually looks like.
The cost of skipping EDA compounds. A null value that propagates through a pipeline silently becomes a zero, which biases an average, which drives a wrong business decision. An outlier that nobody noticed inflates a standard deviation, making a model think variance is normal when it isn't. These are not edge cases. They are the default state of real data.
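A tiny worked example of that first failure mode, with made-up numbers:
import pandas as pd
# One missing reading: pandas skips it, but a silent zero drags the average down
readings = pd.Series([10, 20, None, 30])
print(readings.mean())            # 20.0 — the null is ignored
print(readings.fillna(0).mean())  # 15.0 — the "silent zero" biases the result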
EDA surfaces these problems at the cheapest possible moment: before you've built anything that depends on assumptions you haven't verified.
The EDA Process
EDA is iterative, not linear. You make an observation, it raises a question, you investigate, and that investigation raises another question. But the process has four broad phases that most practitioners move through in roughly this order.
Phase 1: First look. Load the data and get oriented. How many rows and columns? What are the data types? Which columns have null values and how many? Are there duplicate rows? This phase takes minutes and rules out entire classes of downstream errors.
Phase 2: Distributions. For each column, understand what the values actually look like. For numeric columns: what is the min, max, mean, median, and standard deviation? Is the distribution roughly normal, heavily skewed, or bimodal? For categorical columns: how many distinct values are there? Is one value dominant? Are there values that look like data entry errors?
Phase 3: Relationships. Look at how variables relate to each other. Are two columns correlated? Does one column predict another? Are there groupings that separate the data meaningfully? This phase often reveals the most analytically interesting findings and the most dangerous data traps.
Phase 4: Document. Write down what you found. This sounds obvious but it almost never happens. The findings from EDA should feed directly into data cleaning decisions, modelling choices, and stakeholder conversations.
The First Look
The first look is about orientation. You're not looking for insights yet. You're building a map of the territory.
In pandas, the first look takes about five lines:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.shape) # (rows, columns)
print(df.dtypes) # data type per column
print(df.isnull().sum()) # null count per column
print(df.duplicated().sum()) # duplicate row count
print(df.head()) # first five rows
These five outputs tell you a huge amount. df.dtypes will reveal columns that should be numeric but were parsed as object, usually because they contain a stray string somewhere. isnull().sum() gives you the null map: columns with high null rates may need to be dropped or imputed before any analysis. duplicated().sum() catches the surprisingly common case of rows that appear twice because of a join gone wrong upstream.
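A quick way to hunt down those stray strings, assuming the offending column is named amount (the name is illustrative; swap in whichever column parsed as object):
# Coerce to numeric; anything that fails becomes NaN, which exposes the stray strings
coerced = pd.to_numeric(df["amount"], errors="coerce")
print(df.loc[coerced.isnull() & df["amount"].notnull(), "amount"].unique())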
df.describe() goes a step further, returning count, mean, standard deviation, min, quartiles, and max for every numeric column in one call. It's worth running before going any further.
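In code, it is a single call:
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column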
Distributions and Data Quality
Once you know the shape of the data, you want to understand the shape of each column's values. Distributions are where most data quality problems reveal themselves visually.
A right-skewed distribution in a numeric column often means you're looking at something like income, response time, or transaction value: most values cluster low, with a long tail of high values. Summary statistics like the mean will be pulled toward the tail and misrepresent the typical case. The median is a better central tendency measure for skewed data.
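A quick numeric check for skew, using the response_time_ms column that appears in the plots later on:
# A mean well above the median is a sign of right skew; skew() quantifies it
print(df["response_time_ms"].mean(), df["response_time_ms"].median())
print(df["response_time_ms"].skew())  # positive values indicate a right-skewed tail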
A bimodal distribution (two distinct humps) usually means your dataset contains two populations that should probably be analyzed separately. You might be looking at a column that measures different things for different user types, or a sensor reading that behaves differently under two operating conditions.
An outlier in a numeric column can have several causes: genuine extreme events, data entry errors, unit mismatches (someone recorded a value in metres when everything else is in centimetres), or upstream pipeline bugs. The key step is to investigate the outlier, not simply remove it. Removing a legitimate extreme event loses real information.
In pandas, histograms and box plots are the fastest way to see this:
import matplotlib.pyplot as plt
# Histogram for a numeric column
df["response_time_ms"].hist(bins=50)
plt.title("Response Time Distribution")
plt.show()
# Box plot to highlight outliers
df[["response_time_ms", "payload_size_kb"]].boxplot()
plt.show()
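When the box plot does flag points, one common way to pull the actual rows for inspection is the 1.5 × IQR rule — a sketch for the same response_time_ms column, meant as an investigation step rather than a removal step:
# Flag values beyond 1.5 * IQR and look at the full rows before deciding anything
q1, q3 = df["response_time_ms"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["response_time_ms"] < q1 - 1.5 * iqr) | (df["response_time_ms"] > q3 + 1.5 * iqr)]
print(outliers)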
For categorical columns, the equivalent is a value count plot. A column with 200 distinct values that was supposed to be a status field with five possible values tells you immediately that something went wrong upstream.
# Top value counts for a categorical column
print(df["status"].value_counts())
# Spot-check: how many unique values?
print(df["status"].nunique())
Relationships Between Variables
Individual column distributions tell you about each variable in isolation. The more analytically interesting findings come from looking at how variables relate to each other.
The standard starting point for numeric relationships is a correlation matrix. Correlation measures how much two variables move together, on a scale from -1 (perfectly inverse) to +1 (perfectly aligned), with 0 meaning no linear relationship.
import seaborn as sns
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
A high correlation between two independent variables in a model is a problem called multicollinearity. It doesn't make the model wrong, but it makes the individual coefficients unreliable and the model harder to interpret. EDA is where you catch this before it bites you.
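One way to surface candidate pairs from the matrix above, reusing the corr frame just computed (the 0.8 cutoff is an arbitrary illustration, and each pair appears twice):
# List variable pairs whose absolute correlation exceeds the cutoff
pairs = corr.abs().stack()
print(pairs[(pairs > 0.8) & (pairs < 1.0)].sort_values(ascending=False))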
Correlation has a famous limitation: it only measures linear relationships. Two variables can have a strong non-linear relationship and a correlation close to zero. Scatter plots fill the gap. A scatter plot of two variables will reveal a U-shaped relationship, a clustering pattern, or a relationship that only holds above a certain threshold — none of which a correlation coefficient would catch.
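A minimal scatter plot, reusing the two columns from the earlier box plot:
# Scatter plot: structure the correlation coefficient can't show
df.plot.scatter(x="payload_size_kb", y="response_time_ms", alpha=0.3)
plt.show()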
For relationships between a categorical variable and a numeric one, grouped summary statistics and box plots by category are the standard tools:
# Mean response time by region
print(df.groupby("region")["response_time_ms"].mean())
# Box plot by category
df.boxplot(column="response_time_ms", by="region")
plt.show()
This kind of grouped analysis often reveals the most actionable findings: latency is 3x higher in one region, conversion rates differ dramatically by user segment, error rates spike on one day of the week.
Handling Missing Data
Missing data is not a single problem. It comes in three distinct forms, and each one requires a different response.
Missing Completely at Random (MCAR) means the probability of a value being missing has nothing to do with any other variable in the dataset. This is the benign case. Dropping those rows costs you sample size but introduces no systematic bias, and filling with the column mean leaves the mean intact (though it does understate the variance).
Missing at Random (MAR) is slightly misleading as a name: the data is not missing randomly, but the reason it's missing can be explained by other columns you have. If age is missing for users under 18 because your app doesn't ask, and you have a separate under-18 flag, the missingness is explained. You can impute using the related columns.
Missing Not at Random (MNAR) is the dangerous case. The value is missing because of what the value would have been. High earners skip the income question. Users with poor performance scores leave the platform before a second measurement. If you impute or ignore MNAR data without acknowledging it, your analysis will be biased in ways that are hard to detect. The correct response is usually to model the missingness itself as a variable, or to be explicit about the limitation in your conclusions.
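A rough sketch of the three responses, using illustrative column names (an income field, an age field, and an under-18 flag like the ones described above):
# MCAR: dropping affected rows costs sample size but adds no bias (column choice is illustrative)
clean = df.dropna(subset=["response_time_ms"])
# MAR: impute within the groups that explain the missingness
df["age"] = df.groupby("is_under_18")["age"].transform(lambda s: s.fillna(s.median()))
# MNAR: keep the fact of missingness as an explicit column instead of hiding it
df["income_missing"] = df["income"].isnull()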
Where You'll Encounter EDA
EDA is most commonly associated with data science and machine learning pipelines, but the underlying discipline appears in many engineering contexts under different names.
In database performance investigation, looking at the distribution of query execution times, identifying the outlier queries that account for most of the load, and checking for null or malformed values in indexed columns is EDA. It's just not called that.
In log analysis, an engineer who loads a week of access logs and immediately checks: how many unique IPs, what's the distribution of response codes, are there any status 500s that cluster at a particular time — that's EDA.
In A/B testing, checking that the two groups have similar distributions before the test runs (a check called pre-experiment balance verification) is a formal application of EDA techniques. Skipping it means you might attribute a difference to your treatment when it actually existed before the experiment started.
In data engineering, profiling a new data source before ingesting it into a warehouse is EDA. Tools like Great Expectations and dbt's schema tests are essentially automated EDA checks codified as assertions.
Summary
EDA is the practice of interrogating a dataset before trusting it with anything consequential. It is not a fixed procedure; it is a mindset applied through a loose sequence of steps: understand the shape of the data, understand the distribution of each variable, understand how variables relate, and document what you find.
The core insight is that data quality problems are the default, not the exception. Raw data almost always contains nulls, outliers, encoding inconsistencies, and structural surprises. EDA finds them at the cheapest moment: before you've built a pipeline, trained a model, or made a recommendation that depends on assumptions you haven't verified.
The Python tools (pandas, matplotlib, seaborn) lower the cost of EDA to the point where there is no good excuse to skip it. Five lines of code for the first look, a histogram per numeric column, a correlation heatmap, and a grouped summary by your key categorical variable will surface the majority of problems in most real-world datasets.
It is also worth noting that EDA is not a one-time gate you pass through before the "real" work. It recurs whenever new data arrives, whenever a model starts behaving unexpectedly, or whenever a metric shifts in a way that doesn't match intuition. The discipline is the habit of looking carefully before concluding.
Part of the Explained series — concepts in tech, clearly.