Allan Butler — Articles

Smarter Grocery Search with Knowledge Graph RAG and DSPy

Mon, 20 Oct 2025 12:00:00 GMT

Problem

In modern grocery retail, customers expect search experiences that are fast, relevant, and personalized. If you search for "nut-free granola under $5", a typical keyword search fails because it doesn't understand "nut-free" as an attribute and it might pull any "granola" regardless of price.

This highlights three core challenges:

Multi-attribute complexity – Each product spans multiple structured fields: brand, category, nutrition, ingredients, dietary tags, and price. A single query can touch all of them.
Free-form natural language – Shoppers don't speak in schemas. They mix attributes ("nut-free"), numeric filters ("under $5"), and categories ("granola") in ways that don't align neatly to database fields.
Explainability and trust – Customers want to know why a product is recommended, and merchandisers need to validate how items surface in search. Without transparency, trust erodes.

Traditional keyword or embedding search struggles to consistently deliver relevance in this context. Traditional vector retrieval methods capture semantic similarity but struggle with constraints like price thresholds or categorical attributes. A Knowledge Graph offers a formal representation (G=(V,E)), where products, brands, and attributes are entities (V), and relationships such as HAS_ATTRIBUTE or IN_CATEGORY are edges (E). Queries like "nut-free granola under $5" can then be interpreted as subgraph patterns with attribute constraints and a numeric inequality, which classical vector spaces cannot enforce. This motivates a hybrid retriever that fuses:

Knowledge Graph → Precision, constraints, explainability.
Vector Embeddings → Semantic recall.

Solution: Knowledge Graph RAG w/ DSPy

This architecture generalizes the conventional Retrieval-Augmented Generation (RAG) paradigm. Rather than treating retrieval as a flat vector similarity operation, we augment it with structured graph-based reasoning. The result is a hybrid retriever that balances semantic flexibility with constraint enforcement.

To tackle these challenges, we combine three complementary pieces:

Vector embeddings – capture semantic similarity, so queries like "granola" and "cereal" don't miss relevant matches.
Knowledge Graphs (KGs) – enforce structured reasoning, letting us filter by attributes (e.g., HAS_ATTRIBUTE = nut-free) and constraints (e.g., PRICE < 5).
DSPy – a framework for declaratively building LLM pipelines, so we can design hybrid retrieval systems that are modular, explainable, and easy to extend.

This approach extends the familiar Retrieval-Augmented Generation (RAG) pattern. Instead of treating retrieval as a flat vector lookup, we enrich it with structured knowledge.

Why Knowledge Graph RAG?

A grocery product is not just a row in a table; it's better understood as a node in a network of relationships. Take something as simple as granola. It isn't defined only by its name: it's linked to a brand like H-E-B or Central Market, placed within a category such as Pantry → Granola, associated with ingredients like oats or almonds, described by attributes like nut-free or gluten-free, and tied to price metadata that could reflect everyday low price, promotions, or coupon eligibility.

This web of connections is what a Knowledge Graph (KG) captures. In a KG, edges describe meaning: a product HAS_ATTRIBUTE Nut-Free, IN_CATEGORY Granola, or MADE_BY H-E-B. That structure gives us more than just labels; it encodes the logic of how grocery items relate to one another.

Compare that to a Classic RAG pipeline:

Search → Embedding → Vector DB → Retrieved Docs → LLM answers.

This flow works well when the goal is retrieving unstructured text — FAQs, policy documents, articles. But it breaks down in retail search. Embeddings can tell us that "granola" is semantically similar to "cereal." What they can't do reliably is enforce constraints like "must be nut-free," "price under $5," or "belongs in the Pantry category." And those are exactly the rules shoppers care about.

Imagine a customer in Texas searching H-E-B Digital for "organic salsa under $4." That query carries intent across multiple structured dimensions at once: a dietary attribute, a category, and a numeric filter. A vector-only search may capture the gist of "salsa," but it often drops the fine-grained conditions that make the result meaningful.

This is why Knowledge Graph RAG matters. It blends the semantic flexibility of embeddings with the structured precision of graph reasoning. In practice, that means a product like H-E-B Nut-Free Crunch Granola ($4.79) is represented not just by text embeddings but by explicit graph links to its attributes, category, brand, and price. When retrieved, the system can explain itself:

"Recommended because it's granola, tagged nut-free, and priced under $5."

The result is a system tuned for how people actually shop for groceries, combining natural language flexibility with structured, constraint-aware precision.

Enter DSPy

Designing hybrid retrieval pipelines with LLMs often turns into a mess of brittle prompt chains and glue code. That's where DSPy comes in: it lets us build the LLM pipeline declaratively. Instead of hand-crafting prompts, you declare what the pipeline should do, and DSPy handles the rest.

The building blocks are simple:

Signatures – define inputs/outputs (e.g., ProductSearchSignature).
Modules – compose retrieval + answer steps.
Programs – orchestrate hybrid retrieval + answer generation.

For example, a product search task can be expressed in just a few lines:

import dspy

class ProductSearchSignature(dspy.Signature):
    """Return product suggestions and key facts based on a grocery search query."""
    query: str = dspy.InputField()
    hybrid_context: list[str] = dspy.InputField()
    suggestions: str = dspy.OutputField()

With this declaration, DSPy automatically generates the right prompts behind the scenes. That means pipelines stay modular, explainable, and easier to maintain. You focus on what needs to happen (semantic + graph retrieval, ranking, explanation), not on how to hack together prompts.

In practice, this makes DSPy a natural fit for Knowledge Graph RAG in grocery search, where transparency and structured reasoning are just as important as semantic recall.

Architecture

The architecture integrates structured reasoning from a Knowledge Graph (KG) with semantic recall from vector retrieval, orchestrated through a declarative DSPy pipeline. Product data from the grocery catalog (brand, category, nutrition, labels, and price) is ingested into the KG, where it is linked to attributes and relationships such as HAS_ATTRIBUTE or IN_CATEGORY. At query time, a customer request is decomposed into both free-text (e.g., product names or descriptions) and structured constraints (e.g., nut-free, price < $5). The KG enforces attribute and numeric filters, while embeddings capture broader semantic matches. Candidate products retrieved from both channels are passed to the LLM layer, where DSPy coordinates hybrid reasoning and explanation. This final stage produces not only ranked recommendations but also explicit justifications (e.g., "recommended because it is granola, tagged nut-free, and priced under $5"), ensuring transparency and trust in the system.
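The decomposition step described above can be sketched with a few regular expressions. This is a minimal illustration, not the article's implementation: the `KNOWN_ATTRIBUTES` vocabulary, the `ParsedQuery` container, and the "under $N" pattern are all assumptions made for the example, and a production system would more likely use an LLM or a trained tagger for this step.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical attribute vocabulary; a real catalog would supply its own tags.
KNOWN_ATTRIBUTES = {
    "nut-free": "nut_free",
    "gluten-free": "gluten_free",
    "organic": "organic",
    "vegan": "vegan",
    "low sugar": "low_sugar",
}

# Matches phrases such as "under $5" or "under 4.50" (assumed phrasing).
PRICE_PATTERN = re.compile(r"under\s*\$?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


@dataclass
class ParsedQuery:
    free_text: str                                   # left over for the vector channel
    attributes: List[str] = field(default_factory=list)
    max_price: Optional[float] = None                # exclusive upper bound, if any


def parse_query(query: str) -> ParsedQuery:
    """Split a shopper query into free text plus structured constraints."""
    text = query.lower()
    max_price = None

    # Pull out the numeric price constraint first.
    match = PRICE_PATTERN.search(text)
    if match:
        max_price = float(match.group(1))
        text = text[:match.start()] + text[match.end():]

    # Then pull out any known attribute phrases.
    attributes = []
    for phrase, tag in KNOWN_ATTRIBUTES.items():
        if phrase in text:
            attributes.append(tag)
            text = text.replace(phrase, " ")

    free_text = " ".join(text.split())
    return ParsedQuery(free_text=free_text, attributes=attributes, max_price=max_price)


parsed = parse_query("nut-free granola under $5")
print(parsed)
```

The structured parts (`attributes`, `max_price`) would go to the KG channel, while `free_text` is what the embedding search sees.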

Sample Grocery Dataset

product_id,name,brand,category,sub_category,price,ingredients,attributes
1,HEB Oats & Honey Granola,H-E-B,Pantry,Cereal & Granola,4.49,"Whole grain oats,honey,almonds","contains_nuts;vegetarian"
2,Central Market Organic Granola Low Sugar,Central Market,Pantry,Cereal & Granola,5.99,"Oats,coconut,chia,monk fruit","organic;low_sugar;vegan"
...

Vector Store (FAISS + Embeddings)

Dense vector representations form the backbone of modern search. They map products and queries into a shared continuous space where similarity is measured by cosine similarity or inner product. Historically, models like Word2Vec and GloVe used 200–300 dimensions; transformers like BERT/SBERT expanded this to ~768; and today's API embeddings often run 1,536–4,096 dimensions. On benchmarks like MTEB, higher-dimensional models often score better on retrieval, but at the cost of speed, memory, and storage.

For grocery search, embeddings help generalize semantically ("granola" ≈ "cereal") and capture brand or description similarity. But dimensionality alone cannot enforce structured rules like HAS_ATTRIBUTE = nut_free or PRICE < 5. This is why we need a hybrid approach: embeddings for semantic recall, knowledge graphs for constraints and explainability.

We use SentenceTransformers + FAISS to encode product text (name, brand, category, attributes, nutrition).

from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")
embs = model.encode(product_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs.astype("float32"))
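The snippet above normalizes the embeddings and then uses an inner-product index. The reason this works is that the inner product of two unit-length vectors equals their cosine similarity. The following pure-Python sketch shows that equivalence with a brute-force search; the three-dimensional vectors are made-up toy values, and `top_k` is a hypothetical stand-in for what a flat inner-product index does conceptually.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def top_k(query_vec, doc_vecs, k=2):
    """Brute-force inner-product search over normalized vectors."""
    q = l2_normalize(query_vec)
    scored = [(inner_product(q, l2_normalize(d)), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return scored[:k]


# Toy "embeddings"; the numbers are invented for illustration only.
docs = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.2, 0.9]]
query = [1.0, 0.2, 0.0]

results = top_k(query, docs, k=2)
for score, idx in results:
    print(idx, round(score, 4))
```

Each returned score matches `cosine_similarity` on the raw vectors, which is why normalizing once at indexing time is enough.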

Knowledge Graph Representation

We also ingest the dataset into a KG for explicit reasoning:

import networkx as nx

G = nx.Graph()
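for _, r in df.iterrows():
    pid = f"product:{r['product_id']}"
    # Use r["name"]: on a row Series, r.name is the index label, not the column
    G.add_node(pid, label="Product", name=r["name"], brand=r["brand"], price=r["price"])
    # Link to category
    cat = f"category:{r['sub_category']}"
    G.add_node(cat, label="Category")
    G.add_edge(pid, cat, type="IN_CATEGORY")
    # Link to each attribute (stored in the CSV as a ';'-separated string)
    for attr in r["attributes"].split(";"):
        G.add_node(f"attr:{attr}", label="Attribute")
        G.add_edge(pid, f"attr:{attr}", type="HAS_ATTRIBUTE")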

This allows us to query structured relationships.
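As a rough sketch of what such a structured query could look like, the example below filters products by attribute, category, and price. It uses plain dictionaries instead of networkx so the logic is easy to follow, and `kg_filter` is a hypothetical helper, not the article's `kg_search`. The first two products come from the sample dataset; the third is the article's H-E-B Nut-Free Crunch Granola example, with its tag and category assumed.

```python
# Relations keyed by product id (a simplified stand-in for graph edges).
HAS_ATTRIBUTE = {
    "product:1": {"contains_nuts", "vegetarian"},
    "product:2": {"organic", "low_sugar", "vegan"},
    "product:3": {"nut_free"},            # assumed tag for the example product
}
IN_CATEGORY = {
    "product:1": "Cereal & Granola",
    "product:2": "Cereal & Granola",
    "product:3": "Cereal & Granola",      # assumed category
}
NODE_PROPS = {
    "product:1": {"name": "HEB Oats & Honey Granola", "price": 4.49},
    "product:2": {"name": "Central Market Organic Granola Low Sugar", "price": 5.99},
    "product:3": {"name": "H-E-B Nut-Free Crunch Granola", "price": 4.79},
}


def kg_filter(required_attrs, category_keyword, max_price):
    """Return products matching every attribute, the category, and the price bound."""
    hits = []
    for pid, props in NODE_PROPS.items():
        # Every required attribute must be linked to the product.
        if not set(required_attrs) <= HAS_ATTRIBUTE.get(pid, set()):
            continue
        # The category keyword must appear in the product's category.
        if category_keyword.lower() not in IN_CATEGORY.get(pid, "").lower():
            continue
        # Numeric constraint: strictly under the price bound.
        if max_price is not None and not props["price"] < max_price:
            continue
        hits.append({"id": pid, "text": f'{props["name"]} (${props["price"]:.2f})'})
    return hits


matches = kg_filter(["nut_free"], "granola", 5.0)
print(matches)
```

Unlike a similarity score, each of these checks either passes or fails, which is what makes the constraints enforceable and the result explainable.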

Hybrid Retrieval

To combine both sources of relevance:

vec_results = vector_search(query, k=6)
kg_results = kg_search(query, k=6)

context_texts = [r["text"] for r in vec_results + kg_results]

The LLM now sees semantic hits + structured facts.
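Simple concatenation can pass the same product to the LLM twice when both channels return it. One possible refinement, sketched below, is to merge the lists and drop duplicates. The `id` key on each result and the choice to list KG hits first are assumptions for illustration; the article's result dictionaries are only known to carry a `text` key.

```python
def merge_results(vec_results, kg_results):
    """Merge two result lists, KG hits first, dropping duplicate product ids."""
    seen = set()
    merged = []
    for source, results in (("kg", kg_results), ("vector", vec_results)):
        for r in results:
            if r["id"] in seen:
                continue
            seen.add(r["id"])
            # Record which channel surfaced the product, useful for explanations.
            merged.append({**r, "source": source})
    return merged


# Hypothetical retrieval outputs for "nut-free granola under $5".
vec_results = [
    {"id": "product:1", "text": "HEB Oats & Honey Granola"},
    {"id": "product:3", "text": "H-E-B Nut-Free Crunch Granola"},
]
kg_results = [
    {"id": "product:3", "text": "H-E-B Nut-Free Crunch Granola"},
]

merged = merge_results(vec_results, kg_results)
context_texts = [r["text"] for r in merged]
print(context_texts)
```

Putting constraint-satisfying KG hits first is a design choice, not a requirement; a weighted score fusion would be another reasonable option.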

DSPy Pipeline

We define a DSPy signature and module for search and answering:

class ProductSearchSignature(dspy.Signature):
    """Return product suggestions and key facts based on a grocery search query."""
    query: str = dspy.InputField()
    hybrid_context: list[str] = dspy.InputField()
    suggestions: str = dspy.OutputField()

class HybridSearchProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.search_llm = dspy.Predict(ProductSearchSignature)

    def forward(self, query: str):
        vec = vector_search(query)
        kg = kg_search(query)
        context = [r["text"] for r in vec + kg]
        pred = self.search_llm(query=query, hybrid_context=context)
        return pred.suggestions

HybridSearchProgram merges vector + KG retrieval.

DSPy generates prompts under the hood, which keeps the pipeline modular and transparent. It builds each prompt from the docstring and the input and output fields declared in your Signature.

Walkthrough

Example: "nut-free granola under $5"

Vector Search finds granola products.
KG filters for attribute = nut-free and price < 5.
Result: H-E-B Nut-Free Crunch Granola ($4.79).

Explainability: Why Did This Product Rank?

One of the biggest pain points in grocery search is that results often feel like a black box. Shoppers see a product surface, but they don't know why. Did it match a keyword? Was it the cheapest? Or was it just similar text in the description? That's not good enough when customers are filtering by dietary needs and health attributes. Grocery catalogs are packed with metadata—organic, nut-free, gluten-free, low sodium, high protein—and customers expect search to honor those signals. If a parent is shopping for a child with a nut allergy, they don't just want "granola." They want to know it's nut-free and still within budget.

This is where Knowledge Graph RAG changes the game. Because products are represented as nodes connected to explicit attributes, the system can explain itself:

"Recommended because it's granola, tagged nut-free, and priced under $5."

That simple explanation builds trust with shoppers who can see their intent and why certain items surfaced.
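One way to produce such an explanation deterministically is to build it from the constraints the product actually satisfied, rather than asking the LLM to invent a rationale. The sketch below assumes a simple product dictionary and hypothetical `explain` and `join_reasons` helpers; the category and tag on the example product are assumed values.

```python
def join_reasons(reasons):
    """Join reasons with commas and a final 'and'."""
    if len(reasons) <= 1:
        return "".join(reasons)
    if len(reasons) == 2:
        return " and ".join(reasons)
    return ", ".join(reasons[:-1]) + ", and " + reasons[-1]


def explain(product, category, required_attrs, max_price):
    """Build an explanation from the constraints a product actually satisfies."""
    reasons = []
    if category.lower() in product["category"].lower():
        reasons.append(f"it's {category}")
    for attr in required_attrs:
        if attr in product["attributes"]:
            reasons.append("tagged " + attr.replace("_", "-"))
    if max_price is not None and product["price"] < max_price:
        reasons.append(f"priced under ${max_price:g}")
    if not reasons:
        return "No requested constraint matched."
    return "Recommended because " + join_reasons(reasons) + "."


product = {
    "name": "H-E-B Nut-Free Crunch Granola",
    "category": "Cereal & Granola",   # assumed category
    "attributes": {"nut_free"},       # assumed tag
    "price": 4.79,
}

message = explain(product, "granola", ["nut_free"], 5)
print(message)
```

Because every clause maps to a check that passed, the same data can serve merchandisers auditing why an item surfaced.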

Conclusion

Integrating vector embeddings, knowledge graphs, and DSPy yields a retrieval architecture that aligns with the complexity of modern grocery search. Embeddings provide semantic recall, knowledge graphs enforce attribute and numeric constraints, and DSPy ensures that the pipeline remains modular and declarative. The result is a system that is:

Constraint-aware – results respect attributes and thresholds rather than relying solely on lexical matches.
Explainable – recommendations are transparent and auditable, enabling both shopper trust and merchandiser validation.
Maintainable – the declarative design simplifies extension and long-term support.

For grocery retail, where discovery often hinges on nuanced attributes like organic, nut-free, or low sodium, this hybrid approach unlocks better discovery. It means customers can find exactly what they need, with confidence, while H-E-B can deliver on the promise of "Here Everything's Better" in the digital space as well. And with DSPy, the pipeline stays clean, modular, and transparent.

How to Try It

Clone the repo (Github link here).
poetry install
poetry shell
poetry run python -m sgs prepare-data → builds KG + FAISS index.
poetry run python -m sgs run-server → starts API.
curl 'http://127.0.0.1:8000/search?q=nut-free%20granola'


What's In A Name?

Sun, 15 May 2022 12:00:00 GMT

What Is In A Baby Name?

Becoming a first-time parent is a daunting task in an individual's life, from the many baby books to all the gadgets (hot take: you don't need all the gadgets) you need to purchase for the individual who will soon become your new roomie. With all the chaos that will come in those 9 short months, one of the most challenging tasks can be coming up with a name. Using the Social Security card application baby names data from 2010 to 2020, I took a data-driven approach to this problem.

We want to pick a name that is not the most popular or a passing trend, is unique enough for our family tree, and is true to our family's culture.

The approach is as follows:

Complete simple counts to examine the overall most and least popular names.
Compute year-over-year differences in popularity.
Find names that spike suddenly and then drop off, as a proxy for trendy names.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

df_m = pd.read_csv("data_b.csv", sep='\t')

Once the data is imported & filtered for male only names we take a quick look at our four columns of interest.

year
name
gender
count

For a quick look at the top 5 names we run a simple aggregate by name and count using pandas.

# Grab top 5 names
df_m_sum = df_m.groupby('name')['count'].agg(['sum', 'max']).reset_index()

df_m_sum.nlargest(5, ['sum'])

name     sum     max
Noah     201245  19650
Liam     193376  20555
William  172238  17347
Jacob    172154  22139
Mason    167681  19518

Next, let's examine the fastest-growing names from 2010 to 2020. We do this by creating two separate dataframes and then using the merge function in pandas to join them and calculate the growth column: (latest_year - first_year) / first_year. The result is a fraction, so 0.80 means 80% growth.

# Fastest Growing Names (2010 - 2020)

df_2010 = df_m[df_m["year"] == 2010]
df_2020 = df_m[df_m["year"] == 2020]

df_yoy_all = pd.merge(df_2010, df_2020, on="name")
# x is 2010, y is 2020

# Filter names with counts over 5000 in 2010
# (.copy() avoids pandas' SettingWithCopyWarning when adding the growth column)
df_yoy = df_yoy_all[df_yoy_all["count_x"] > 5000].copy()

# Create growth metric
# (2020 - 2010) / 2010
df_yoy["growth"] = (df_yoy["count_y"] - df_yoy["count_x"])/(df_yoy["count_x"])

df_yoy.nlargest(10, ['growth'])

year_x  name     gender_x  count_x  year_y  gender_y  count_y  growth
2010    Liam     M         10928    2020    M         19659    0.798957
2010    Henry    M         6399     2020    M         10705    0.672918
2010    Levi     M         6016     2020    M         9005     0.496842
2010    Sebasti  M         6361     2020    M         8927     0.403396
2010    Josiah   M         5206     2020    M         6077     0.167307
2010    Noah     M         16460    2020    M         18252    0.108870
2010    Wyatt    M         7374     2020    M         8135     0.103200
2010    Lucas    M         10379    2020    M         11281    0.086906
2010    Owen     M         8176     2020    M         8623     0.054672
2010    Jack     M         8519     2020    M         8876     0.041906

df_yoy.nsmallest(10, ['growth'])

year_x  name     gender_x  count_x  year_y  gender_y  count_y  growth
2010    Tyler    M         10450    2020    M         2771     -0.734833
2010    Gavin    M         9619     2020    M         2570     -0.732820
2010    Brandon  M         8547     2020    M         2287     -0.732421
2010    Justin   M         7848     2020    M         2277     -0.709862
2010    Kevin    M         7324     2020    M         2359     -0.677908
2010    Evan     M         9730     2020    M         3389     -0.651696
2010    Brayden  M         9113     2020    M         3253     -0.643037
2010    Zachary  M         7180     2020    M         2698     -0.624234
2010    Joshua   M         15448    2020    M         5924     -0.616520
2010    Jayden   M         17189    2020    M         7102     -0.586829

A quick look at the 10 fastest-growing and 10 fastest-shrinking names over the 10-year span tells us that Liam grew the most and Tyler shrank the most. I've filtered the dataset to include only names with more than 5,000 births in 2010.
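To make the growth column concrete, here is the same calculation done by hand for the two extremes, using the counts from the tables above. The `growth` helper is only an illustration of the formula, not part of the original notebook.

```python
def growth(count_first, count_last):
    """Relative change from the first year to the last year, as a fraction."""
    return (count_last - count_first) / count_first


# Counts taken from the 2010 and 2020 columns of the tables above.
liam = growth(10928, 19659)
tyler = growth(10450, 2771)

print(f"Liam:  {liam:.6f}")
print(f"Tyler: {tyler:.6f}")
```

Multiplying either value by 100 gives the percentage change: Liam grew by roughly 80%, while Tyler fell by roughly 73%.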

# Filter specific names of initial interest
df_int = df_yoy_all[df_yoy_all["count_x"] > 1].copy()

df_int["growth"] = (df_int["count_y"] - df_int["count_x"])/(df_int["count_x"])

Creating a function to explore any name of interest will be a valuable reusable asset.

# Create function to look up any name of interest
name_list = ['Paxton', 'Parker', 'Ethan', 'Hayden']

def find_name(search: str):
    return df_int[df_int['name'].str.contains(search)]

def find_list(search: list):
    return df_int[df_int['name'].isin(search)].sort_values("growth", ascending=False)

search = ['Allan', 'Paxton', 'Parker', 'Ethan', 'George', 'Dee', 'Hayden', 'Enzo']

find_list(search)

year_x  name    gender_x  count_x  year_y  gender_y  count_y  growth
2010    Enzo    M         602      2020    M         2201     2.656146
2010    Dee     M         5        2020    M         6        0.200000
2010    Paxton  M         1110     2020    M         1286     0.158559
2010    George  M         2373     2020    M         2746     0.157185
2010    Parker  M         4732     2020    M         3797     -0.197591
2010    Allan   M         403      2020    M         277      -0.312655
2010    Ethan   M         18006    2020    M         9464     -0.474397
2010    Hayden  M         4191     2020    M         2146     -0.487950

find_name("Hayden")

year_x  name    gender_x  count_x  year_y  gender_y  count_y  growth
2010    Hayden  M         4191     2020    M         2146     -0.48795

Plot Most Trendy Names

Plotting the overall growth gives us some insight, but let's break that calculation out by year to get a better sense of the growth trend.

Let's observe how the all-time most popular names have changed over the years instead of just observing the 10-year growth. We can accomplish this by first creating a pivot df.

pivot_df = df_m.pivot_table(index="name", columns="year", values="count", aggfunc="sum").fillna(0)

# Now we calculate the percentage of each name by year.

perc_df = pivot_df / pivot_df.sum() * 100

# Then add a new column with the cumulative percentages sum.
perc_df["total"] = perc_df.sum(axis=1)

sort_df = perc_df.sort_values(by="total", ascending=False).drop("total", axis=1)[0:10]

transpose_df = sort_df.transpose()
transpose_df.head(5)

We sort the dataframe to check which are the top values and slice the data appropriately. Lastly, we drop the total column and flip the axes to make plotting the data easier.

name  Noah      Liam      William   Jacob     Mason     Ethan     Michael   James     Alexander  Elijah
2010  0.858554  0.570005  0.889746  1.154771  0.774524  0.939193  0.905550  0.724346  0.874046   0.725285
2011  0.889028  0.708355  0.914274  1.074023  1.028696  0.879594  0.885813  0.698658  0.827679   0.736711
2012  0.916126  0.886996  0.891535  1.007317  1.001406  0.933172  0.854119  0.709259  0.804302   0.732743
2013  0.966854  0.960396  0.881369  0.962090  0.937371  0.860090  0.821397  0.718762  0.789320   0.730778
2014  1.007213  0.962950  0.877551  0.880575  0.897154  0.820202  0.806594  0.752946  0.804352   0.722134

import plotly.express as px

plot = px.line(transpose_df, x=transpose_df.index, y=transpose_df.columns, title="Top 10 Trendy Names")
plot.show()

Figure 1.1 Trendy Baby Names Over Time.

Liam is still the most 'trendy' & popular name, according to growth, over the last 10 years.

I'm going to create another function to grab the year where the name of interest is the highest.

def when_most_births(name):
    if name in set(df_m["name"]):
        highest = df_m[df_m["name"] == name].groupby("year")["count"].sum().sort_values(ascending=False)[:1]
        in_2020 = df_m[(df_m["name"] == name) & (df_m["year"] == 2020)]["count"].sum()

        print("Name {} was most popular in {} with {} kids given this name.\n".format(name, int(highest.index[0]), highest.iloc[0]))

        print('In 2020 there were {} babies in total who were given the name {}.\n'.format(in_2020, name))

        px.line(df_m[df_m["name"] == name], x="year", y="count", color="name", title=f"Baby Name {name} Over Time").show()
    else:
        print(f"Name {name} is not in the database.")

when_most_births("Enzo")

Name Enzo was most popular in 2020 with 2201 kids given this name.

In 2020 there were 2201 babies in total who were given the name Enzo.

Figure 1.2 Most Popular Over Time.

Using an approach from a Kaggle notebook, we will:

Create a metric that captures names that spike and then drop off.

Divide a name's maximum yearly count by its total count.

Most Sudden Names

df = df_m.groupby(['name', 'gender'])['count'].agg(['sum', 'max'])

df_ = df.reset_index()

df_['spike_fall'] = df_['max']/df_['sum']

popular = df_.sort_values(by='spike_fall',ascending=False)

popular_df = popular[popular["sum"] > 5000]
popular_df.head(5)
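To build some intuition for the spike_fall ratio, the sketch below computes it for two invented yearly series covering the 11 years from 2010 to 2020. The numbers are made up; only the max / sum formula comes from the code above.

```python
def spike_fall(yearly_counts):
    """Share of a name's total births that occurred in its single peak year."""
    return max(yearly_counts) / sum(yearly_counts)


# A perfectly steady name: the same count in each of the 11 years.
steady = [1000] * 11

# A made-up name that spikes in one year and then fades.
spiky = [100, 200, 900, 4500, 2000, 900, 500, 300, 200, 150, 100]

print(round(spike_fall(steady), 4))
print(round(spike_fall(spiky), 4))
```

With 11 years of data the ratio ranges from 1/11 (about 0.09) for a completely flat name up to 1.0 for a name whose births all fall in a single year, so higher values indicate a more concentrated spike.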

Let's use our function when_most_births to plot the names we want to examine for spikes and falls.

when_most_births("Jase")

Figure 1.3 Spike-Fall Over Time.

Name Jase was most popular in 2013 with 4552 kids given this name.

In 2020 there were 624 babies in total who were given the name Jase.

Examining The Spike-Fall Names

Jase is a great example of what the spike/fall metric captures: a name that peaked in 2013 and has since dropped in popularity.
For some highly ranked spike/fall names we do not see the fade because their peak year is the last one in the dataset.

As you might imagine, this is not the end of finding a baby name. Some open questions are:

How do I actually use this data to choose a name and not just use the analysis for avoiding names?
What if a trendy name is something we want?

Further analysis could include names for both genders to create a metric that finds the optimal gender-neutral name.

We solved the initial problem of avoiding specific names but the question of interest is still left open-ended. Luckily we have 6 months remaining to decide on a name.

Forecasting Super Bowl Sales

Thu, 24 Jan 2019 12:00:00 GMT

Time Series Forecasting

EDA & Data Preparation

Time series analysis is a very useful tool businesses can use to assist in their decision-making process. We all know that "No model will be 100% accurate but some models are useful." There are numerous time series methods and techniques that can be used, but for this example we will be using Business Science's collection of open-source packages. After recently attending the RStudio Conference, I would note that the tsibble and fable packages could be used for this analysis as well. The concepts I use when beginning any type of data analysis come heavily from Hadley Wickham and Garrett Grolemund's R4DS. The analysis pipeline that I follow always begins with three questions: what is the business task at hand, what data science tools can help tackle it, and what question do we want answered? The process is straightforward and usually leads to more questions, insights, and steps toward an actionable outcome.

The business problem is to estimate future Super Bowl ticket sales. Using past sales, the data can help improve forecasts and generate models that describe the main factors of influence. We can then use the analysis to develop actionable outcomes based on what we have learned. The first step is loading our packages and reading in the data. Usually I would be reading data from a database, but for clarity and simplicity we will read in a csv file.

library(tidyverse)
library(lubridate)
library(timetk)
library(tidyquant)
library(broom)
library(modelr)
library(caret)
library(gridExtra)

SB <- read_csv("SB.csv") %>%
mutate(Event_Date = mdy(Event_Date), Sale_Date = mdy(Sale_Date), days_to_event = (Event_Date - Sale_Date))

The best way to get an understanding of your data is to create different visualizations. Let's start with yearly sales.

Sales over time

To begin our exploratory analysis we will take a look at sales over time.

# Create a sales by year data frame
salesByYear <- SB %>%
group_by(Year) %>%
summarize(total_sales = sum(Sale_Price))

# Use ggplot to plot sales by year
ggplot(salesByYear, aes(Year, total_sales)) +
geom_bar(stat = "identity") +
geom_smooth(method = "lm", se = FALSE) +
labs(title="Super Bowl Sales Over Time", x="Year", y="Sales") +
scale_y_continuous(labels = scales::dollar) +
geom_text(aes(y=total_sales, label=scales::dollar(total_sales)),
vjust=1.5,
color="white",
size=4) +
theme_bw()

Figure 1.1 Revenue Over Time.

Secondary market Super Bowl sales have a linear growth trend, with 2018 having the highest gross sales. Note that these numbers are not adjusted for inflation but still provide insight into market trends over the years.

Next we can examine quantity sold and total sales over the last 3 weeks until the Super Bowl.

SB %>%
mutate(days_to_event = as.numeric(days_to_event)) %>%
group_by(days_to_event, Year) %>%
summarise(Qty = sum(Qty)) %>%
filter(days_to_event <= 21) %>%
ggplot(aes(x = days_to_event, y = Qty, color = Year)) +
geom_line(aes(y = Qty), color = palette_light()[[1]]) +
facet_grid(Year ~ ., scales = "free") +
theme_tq() +
guides(color = FALSE) +
labs(title = "Quantity Sold Over Last 3 Weeks",
x = "",
y = "Quantity Sold")

Figure 2.1 Quantity Sold.

SB %>%
mutate(days_to_event = as.numeric(days_to_event)) %>%
group_by(days_to_event, Year) %>%
summarise(Sale_Price = sum(Sale_Price)) %>%
filter(days_to_event <= 21) %>%
ggplot(aes(x = days_to_event, y = Sale_Price, color = Year)) +
geom_line(aes(y = Sale_Price), color = palette_light()[[1]]) +
facet_grid(Year ~ ., scales = "free") +
theme_tq() +
guides(color = FALSE) +
labs(title = "Total Sales Over Last 3 Weeks",
x = "",
y = "Sale Price")

Figure 2.2 Revenue Sold

There is a strong uptick at the two-weeks-out mark for both metrics, which intuitively makes sense: as more tickets are sold, revenue increases. This is usually when both teams are officially decided. Let's further examine a heat map comparing the month and day of the month of transactions.

SB %>%
mutate(day = day(Sale_Date), month = month(Sale_Date)) %>%
group_by(month, day) %>%
summarise(total_sales = sum(Sale_Price)) %>%
ggplot(aes(x = month, y = day, fill = total_sales)) +
geom_tile(alpha = 0.8, color = "white") +
scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
theme_tq() +
theme(legend.position = "right") +
labs(title = "Sales per Month and Day",
y = "Day of the Month",
fill = "Total Sales")

Figure 3 Heat Map of Sales by month and day

The heat map tells us that sales are lower during October through December and heat up in late January and early February, closer to the event. There are no sales in March through August. Now we can examine sales by specific sections and zones.

Top 10 Zones

Let's explore some stadium zones to get an idea of top selling zones.

# Plot top 10 zones

# Create top 10 zones data frame
zoneSales <- SB %>%
group_by(Zone = Section) %>%
summarize(total_sales = sum(Sale_Price),
qty_total = sum(Qty)) %>%
mutate(pct_total = total_sales / sum(total_sales)) %>%
arrange(desc(total_sales))
top10.ordered <- head(zoneSales, 10)
top10.ordered$Zone <- factor(top10.ordered$Zone, levels = arrange(top10.ordered, total_sales)$Zone)

# Use ggplot to plot the top zones
ggplot(top10.ordered, aes(Zone, total_sales)) +
geom_bar(stat="identity") +
geom_text(aes(ymax=pct_total, label=scales::percent(pct_total)),
hjust= -0.25,
vjust= 0.5,
color="black",
size=4) +
geom_text(aes(ymax=qty_total, label=paste("Qty:", qty_total)),
hjust= 1.25,
vjust= 0.5,
color="white",
size=4) +
coord_flip() +
labs(title="Top 10 Zones",
x="",
y="Sales")+
scale_y_continuous(labels = scales::dollar, limits = c(0,2500000)) +
theme_bw()

Figure 4 Bar chart for sales by top 10 zones

Unexpectedly, Upper Corner is the top-selling zone with $1,463,821 and 2.17% of total ticket sales. There could be biases in the data due to different stadium layouts. Further analysis would need to group each individual section into specific titled zones.

Geographic Trends

Let's map the sales using leaflet to try to expose sales trends by city.

# Plot sales by geographic location

# Create sales by location, grouping by stadium
# and its longitude and latitude
salesByLocation <- SB %>%
group_by(Stadium, LNG, LAT) %>%
summarise(total_sales = sum(Sale_Price)) %>%
mutate(popup = paste0(Stadium, ": ", scales::dollar(total_sales)))

# Use Leaflet package to create map visualizing sales by stadium location
library(leaflet)
leaflet(salesByLocation) %>%
addProviderTiles("CartoDB.Positron") %>%
addMarkers(lng = ~LNG,
lat = ~LAT,
popup = ~popup) %>%
addCircles(lng = ~LNG,
lat = ~LAT,
weight = 2,
radius = ~(total_sales)^0.775)

Figure 5 Leaflet map of past Super Bowl sales.

Larger circles correspond to higher sales, and smaller circles to lower sales. leaflet provides interactivity: you can click on a marker to see that stadium's total sales. The geographic trends are consistent with the sales-over-time charts.

Now that we have done our exploratory data analysis we can attempt a time series forecast.

Based on our EDA, we have features relevant to forecasting demand or future revenue. We can split the data into a training and a test set and begin forecasting future revenue. We will use all sales before the 2018 Super Bowl cycle (sale dates before 2017-09-10) as the training data and all later sales as the test samples.

SB_forecast <- SB %>%
group_by(Sale_Date) %>%
summarise(Qty = sum(Qty), Sales = sum(Sale_Price)) %>%
mutate(model = ifelse(Sale_Date < "2017-09-10", "train", "test"))

SB_qty <- SB_forecast %>%
ggplot(aes(Sale_Date, Sales, color = model)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()

SB_days_until <- SB %>%
group_by(Sale_Date, days_to_event) %>%
summarise(Sales = sum(Sale_Price)) %>%
mutate(model = ifelse(Sale_Date < "2017-09-10", "train", "test")) %>%
ggplot(aes(days_to_event, Sales, color = model)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()

grid.arrange(SB_qty, SB_days_until)

Figure 6 Time Series of Quantity & Sales

Notice the gaps in the time series on dates with no sales data. We will have to account for the missing dates when creating our future index.
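The idea of accounting for missing dates can be illustrated independently of the R pipeline. The following is a small Python sketch, not part of the original analysis, that expands a sparse daily series into a complete calendar and fills the gaps with a chosen value; the `fill_missing_dates` helper and the sales figures are made up for illustration.

```python
from datetime import date, timedelta


def fill_missing_dates(daily_sales, start, end, fill_value=0.0):
    """Return a list of (date, sales) pairs covering every day from start to end."""
    filled = []
    day = start
    while day <= end:
        # Days without a recorded sale receive the fill value.
        filled.append((day, daily_sales.get(day, fill_value)))
        day += timedelta(days=1)
    return filled


# Invented sales totals keyed by sale date.
sales = {date(2018, 1, 29): 1200.0, date(2018, 2, 1): 3400.0}

series = fill_missing_dates(sales, date(2018, 1, 29), date(2018, 2, 1))
for day, amount in series:
    print(day.isoformat(), amount)
```

Whether the fill value should be 0 or something like the series mean depends on what a missing date means; for ticket sales, a date with no transactions plausibly means zero revenue.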

Using timetk we can add the time series signature to our corresponding response variable.

SB_forecast_aug <- SB_forecast %>%
select(model, Sale_Date, Sales) %>%
tk_augment_timeseries_signature()

SB_forecast_aug <- SB_forecast_aug[complete.cases(SB_forecast_aug), ]

After adding the features generated by tk_augment_timeseries_signature(), we then remove missing values from the data frame. Since we have to account for the missing sales dates, we need to ask ourselves whether to replace those values with the mean or set them to 0. Since there are large gaps in purchases between Super Bowls, for this situation we should set them to 0 and also remove features with a variance of 0.

library(matrixStats)

(var <- data.frame(colnames = colnames(SB_forecast_aug[, sapply(SB_forecast_aug, is.numeric)]),
colvars = colVars(as.matrix(SB_forecast_aug[, sapply(SB_forecast_aug, is.numeric)]))) %>%
filter(colvars == 0))

SB_forecast_aug <- select(SB_forecast_aug, -one_of(as.character(var$colnames)))

The sales data is aggregated by day so the hour, minute, second, am/pm features are removed. Next we will remove the highly correlated values in the data set.

library(ggcorrplot)

cor <- cor(SB_forecast_aug[, sapply(SB_forecast_aug, is.numeric)])
p.cor <- cor_pmat(SB_forecast_aug[, sapply(SB_forecast_aug, is.numeric)])

ggcorrplot(cor, type = "upper", outline.col = "white", hc.order = TRUE, p.mat = p.cor,
colors = c(palette_light()[1], "white", palette_light()[2]))

Figure 7 Correlation plot

After examining the correlation plot and data frame, I chose a cutoff of 0.95: for each pair of features whose absolute correlation exceeds it, one of the two is removed.

cor_cut <- findCorrelation(cor, cutoff = 0.95)
SB_forecast_aug <- select(SB_forecast_aug, -one_of(colnames(cor)[cor_cut]))

After removing the highly correlated features we can split the data into our training and test sets.

train <- filter(SB_forecast_aug, model == "train") %>%
select(-model)
test <- filter(SB_forecast_aug, model == "test")

Modeling

The response variable Sales will be modeled using a generalized linear model. We could test numerous statistical learning models to determine the best choice, but for this situation Occam was probably right.

fit_lm <- glm(Sales ~ ., data = train)
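Because glm() is called without a family argument, it falls back to the gaussian family with an identity link, which is ordinary linear regression. The sketch below checks this on simulated data (the toy data frame and its coefficients are made up for illustration).

```r
# Simulated data: a straight line plus noise (values are made up)
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20)

# glm() without a family argument uses gaussian() with an identity link
fit_glm <- glm(y ~ x, data = toy)

# Ordinary least squares on the same data
fit_ols <- lm(y ~ x, data = toy)

# The two sets of coefficients agree up to numerical tolerance
all.equal(coef(fit_glm), coef(fit_ols))
```

So the "generalized" model here is the simplest member of the family, in keeping with the Occam remark.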

Visualize the model features using broom and ggplot2

tidy(fit_lm) %>%
gather(x, y, estimate:p.value) %>%
ggplot(aes(x = term, y = y, color = x, fill = x)) +
facet_wrap(~ x, scales = "free", ncol = 4) +
geom_bar(stat = "identity", alpha = 0.8) +
scale_color_manual(values = palette_light()) +
scale_fill_manual(values = palette_light()) +
theme_tq() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Figure 8 Model features

augment(fit_lm) %>%
ggplot(aes(x = Sale_Date, y = .resid)) +
geom_hline(yintercept = 0, color = "red") +
geom_point(alpha = 0.5, color = palette_light()[[1]]) +
geom_smooth() +
theme_tq()

Figure 9

After plotting we can now add predictions and residuals for the test data and visualize the residuals.

pred_test <- test %>%
add_predictions(fit_lm, "pred_lm") %>%
add_residuals(fit_lm, "resid_lm")

pred_test %>%
ggplot(aes(x = Sale_Date, y = resid_lm)) +
geom_hline(yintercept = 0, color = "red") +
geom_point(alpha = 0.5, color = palette_light()[[1]]) +
geom_smooth() +
theme_tq()

Figure 10

After examining the residuals we would probably want to transform the response variable, or add interaction and polynomial terms to the independent variables, but we can leave that for another time.

Now we compare the predicted against the actual data in the test set.

pred_test %>%
gather(x, y, Sales, pred_lm) %>%
ggplot(aes(x = Sale_Date, y = y, color = x)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()

Figure 11

Our model appears to miss the uptick in sales in late January but appears consistent nonetheless.

Forecasting

Now that our feature selection is out of the way we can forecast next year's total Super Bowl ticket sales. First we extract the index using the tk_index function.

# Extract index
idx <- SB_forecast %>%
tk_index()

idx_future <- idx %>%
tk_get_timeseries_summary()
idx_future

## # A tibble: 1 x 12
## n.obs start end units scale tzone diff.minimum diff.q1
##
## 1 524 2012-09-30 2018-02-04 days day UTC 86400 86400
## # ... with 4 more variables: diff.median , diff.mean ,
## # diff.q3 , diff.maximum

We need to account for the irregular data: dates with no past sales are missing, so the mean difference does not equal 86400 seconds (1 day).

We need to be aware that we never have data for days with no sales, and that there are a few random missing values in between, as can be seen in the diff column of SB_forecast_aug.

SB_forecast_aug %>%
ggplot(aes(x = Sale_Date, y = diff)) +
geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
geom_line(alpha = 0.5) +
theme_tq()

Figure 12
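To see why a mean difference above 86400 signals gaps, here is a small base R sketch on four made-up dates (hypothetical, not from the ticket data). Consecutive days differ by exactly 86400 seconds, so any missing day pulls the mean upward.

```r
# Hypothetical sale dates with a two-day gap before the last one
dates <- as.Date(c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-06"))

# Differences between consecutive dates, expressed in seconds
gaps_sec <- as.numeric(diff(dates), units = "secs")
gaps_sec

# A perfectly regular daily series would have a mean of 86400
mean(gaps_sec)

# Positions where more than one day elapsed since the previous sale
which(gaps_sec > 86400)
```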

Create the future index and rename index to Sale_Date to match the original data. We account for days that are missing on a monthly, quarterly, or yearly schedule using the inspect_months argument.

idx_future <- idx %>%
tk_make_future_timeseries(n_future = 365, inspect_months = TRUE)

data_future <- idx_future %>%
tk_get_timeseries_signature() %>%
rename(Sale_Date = index)

Predict the future values and build the future data frame.

pred_future <- predict(fit_lm, newdata = data_future)

sales_future <- data_future %>%
select(Sale_Date) %>%
add_column(Sales = pred_future)

SB_forecast %>%
ggplot(aes(x = Sale_Date, y = Sales)) +
geom_rect(xmin = as.numeric(ymd("2017-09-10")),
xmax = as.numeric(ymd("2018-02-04")),
ymin = 0, ymax = 2000000,
fill = palette_light()[[4]], alpha = 0.01) +
geom_rect(xmin = as.numeric(ymd("2018-02-05")),
xmax = as.numeric(ymd("2019-02-04")),
ymin = 0, ymax = 2000000,
fill = palette_light()[[3]], alpha = 0.01) +
annotate("text", x = ymd("2013-11-03"), y = 1500000,
color = palette_light()[[1]], label = "Train Region") +
annotate("text", x = ymd("2017-08-01"), y = 550000,
color = palette_light()[[1]], label = "Test Region") +
annotate("text", x = ymd("2018-10-01"), y = 550000,
color = palette_light()[[1]], label = "Forecast Region") +
geom_point(alpha = 0.5, color = palette_light()[[1]]) +
geom_point(aes(x = Sale_Date, y = Sales), data = sales_future,
alpha = 0.5, color = palette_light()[[2]]) +
geom_smooth(aes(x = Sale_Date, y = Sales), data = sales_future,
method = 'loess') +
labs(title = "Secondary Market Super Bowl Ticket Sales: 2019 Forecast", x = "") +
theme_tq()

Figure 13

Notice the negative values. This is not only impossible but might tell us something about the error rate in our model. We can visualize this by plotting the standard deviation of the test residuals.

test_residuals <- pred_test$resid_lm
test_resid_sd <- sd(test_residuals, na.rm = TRUE)

sales_future <- sales_future %>%
mutate(
lo.95 = Sales - 1.96 * test_resid_sd,
lo.80 = Sales - 1.28 * test_resid_sd,
hi.80 = Sales + 1.28 * test_resid_sd,
hi.95 = Sales + 1.96 * test_resid_sd
)
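The multipliers 1.28 and 1.96 are the standard normal quantiles for two-sided 80% and 95% intervals. The sketch below recovers them with qnorm() and shows how a lower bound can go negative when the residual standard deviation is large relative to the forecast. The forecast and standard deviation values are made up, and truncating at zero is an assumption added here for illustration, not a step in the analysis above.

```r
# Two-sided 80% and 95% intervals leave 10% and 2.5% in each tail
z_80 <- qnorm(0.90)
z_95 <- qnorm(0.975)
round(c(z_80 = z_80, z_95 = z_95), 2)

# Hypothetical point forecast and residual standard deviation
point_forecast <- 50000
resid_sd       <- 40000

# Lower 95% bound: forecast minus z times the residual sd
lower_95 <- point_forecast - z_95 * resid_sd
lower_95

# Sales cannot be negative, so one option is to truncate the bound at zero
pmax(lower_95, 0)
```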

SB_forecast %>%
ggplot(aes(x = Sale_Date, y = Sales)) +
geom_point(alpha = 0.5, color = palette_light()[[1]]) +
geom_ribbon(aes(ymin = lo.95, ymax = hi.95), data = sales_future,
fill = "#D5DBFF", color = NA, size = 0) +
geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key), data = sales_future,
fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
geom_point(aes(x = Sale_Date, y = Sales), data = sales_future,
alpha = 0.5, color = palette_light()[[2]]) +
geom_smooth(aes(x = Sale_Date, y = Sales), data = sales_future,
method = 'loess', color = "white") +
labs(title = "Secondary Market Super Bowl Ticket Sales: 2019 Forecast with Prediction Intervals", x = "") +
theme_tq()

Figure 14

Our model predicts that 2019 Super Bowl sales will not be as prosperous as 2018. The secondary ticket market is notable for high variance, so its future is highly uncertain. Although the revenue forecast follows a curve similar to past years, summarising a total will provide a better view.

combine1 <- SB %>%
select(Year, Sale_Price) %>%
group_by(Year) %>%
summarise(total_sales = sum(Sale_Price))

combine2 <- sales_future %>%
mutate(Year = 2019) %>%
na.omit() %>%
group_by(Year) %>%
summarise(total_sales = sum(Sales))
All <- bind_rows(combine1, combine2)

ggplot(All, aes(Year, total_sales)) +
geom_bar(stat = "identity") +
geom_smooth(method = "lm", se = FALSE) +
labs(title="Super Bowl Sales Over Time", x="Year", y="Sales") +
scale_y_continuous(labels = scales::dollar) +
geom_text(aes(y=total_sales, label=scales::dollar(total_sales)),
vjust=1.5,
color="white",
size=3.5) +
theme_tq()

Figure 15 Total Revenue Including Forecast

A little data manipulation helps us plot the total forecasted revenue alongside the previous years for a clear comparison snapshot and a view of the linear trend. The upward linear trend in sales is a testament to the growing secondary market, along with Super Bowl prices outpacing inflation. A further interesting analysis would be comparing wage growth and overall inflation rates against Super Bowl prices. Forecasting using the timetk approach is a great machine learning application for our data set. However, a prediction is only as good as the data used, and major omitted variables in our analysis are the teams playing and the location. These features can be added to the regression, but our example was kept as simple as possible to reach a result. In a real business case, different features could be tested to find the best model and result.

Optimizing Wedding Reception Seating Charts

Wed, 21 Nov 2018 12:00:00 GMT

Recently my wife and I were married. We were so fortunate that many of our close friends and family members attended our wedding in California (we live in Texas). My beautiful wife was the ultimate planner and tackled almost every task of wedding planning with her mom. I definitely lucked out with my responsibilities being minimal. However, when she asked for my help with the seating chart for the 90 guests I knew this was a problem that data science could help solve. Luckily, while reading Algorithms to Live By, written by Christian and Griffiths, I came across Meghan Bellows's story of planning her wedding while also doing her PhD research in chemical engineering. Using specific scores for each guest relationship and specifying a few constraints, I was able to set up a similar 'travelling salesman'-style optimization problem. Along the way I also found a GitHub repo by Megan Stiles. She tackled the optimization problem of seating her guests, so big shoutout to her for the R code help.

Figure 1. Final tables at the reception.

Building the Guest Relational Matrix

Based on the assumption that people want to sit at a table with the people they are most closely related to, we built a guest relational matrix of the 90 guests for the wedding reception: 9 tables of 10.

Key: 2000 = Spouse/Date, 900 = Sibling, 700 = Parent/Child, 600 = Grandparent, 500 = Cousin, 300 = Aunt/Niece, 100 = Friend, 0 = Strangers, 5000 = Bride/Groom
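As a minimal sketch of how such a matrix is laid out, here is a four-guest version in base R using scores from the key. The guest names are hypothetical, and reading 5000 as the score between the bride and groom is an assumption. The helper table_closeness() is also hypothetical; it sums every pairwise score among the guests at one table, counting each pair in both directions.

```r
# Hypothetical guests; the real matrix has 90 rows and columns
guests <- c("Bride", "Groom", "Bride_Sister", "Groom_Friend")
rel <- matrix(0, nrow = 4, ncol = 4, dimnames = list(guests, guests))

# Fill the upper triangle with scores from the key
rel["Bride", "Groom"]        <- 5000  # Bride/Groom
rel["Bride", "Bride_Sister"] <- 900   # Sibling
rel["Groom", "Groom_Friend"] <- 100   # Friend

# Mirror the upper triangle so the relationship is symmetric
rel[lower.tri(rel)] <- t(rel)[lower.tri(rel)]
isSymmetric(unname(rel))

# Closeness of one table: sum of all pairwise scores among the seated guests
table_closeness <- function(seated) sum(rel[seated, seated])

table_closeness(c("Bride", "Groom", "Bride_Sister"))
table_closeness(c("Bride", "Groom_Friend"))
```

Entering only one triangle and mirroring it halves the manual data entry, since every relationship is symmetric.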

Unfortunately there was no other way to tackle this problem than to manually enter the matrix data into Excel. Feel free to reach out if you can think of a better approach.

The Genetic Algorithm Solution in R

library(tidyverse)
library(genalg)

wedding_matrix <- read_csv("wedding_seating_chart.csv")

# x is a 0/1 vector: 1s indicate the guest is at the current table and 0s indicate they are not.
# The model seats one table at a time and iterates until all the tables are filled.

### Define Fitness Function

evalFunc <- function(x) {
  # Row indices of the guests at the table
  # (these correspond to rows of the closeness matrix)
  Table_1_POS <- which(x == 1)

  # Restrict the number of guests at each table:
  # an over-full table gets a score of 0, the worst achievable
  if (length(Table_1_POS) > 10) return(0)

  # Total table closeness: sum the scores between every pair of seated guests.
  # The + 1 skips the first column, which holds the guest names.
  closeness <- sum(as.matrix(wedding_matrix[Table_1_POS, Table_1_POS + 1]))

  # rbga.bin minimizes, so return the negative closeness
  return(-closeness)
}

### Iteratively Seat Tables ###

# Initialize iterations to 300
iters = 300
i = 0

# Initialize chromosome size to 90 (one bit per guest)
size = 90

# Initialize the vector that stores the seating order
Seating_Order <- vector()
for (i in 1:8) {

# Increase generations for the final two tables seated by the GA
if (i > 6) {
iters = 1000
}

#Run GA
ga.model <- rbga.bin(size = size, popSize = 200, evalFunc = evalFunc, iters = iters, elitism = TRUE)

#Best Solution
solution <- ga.model$population[which.min(ga.model$evaluations),]

# Print Which Table we are on, The closeness, and how many people are at each table to keep track
print(i)
print(sum(solution == 1))
closeness <- min(ga.model$evaluations)
print(closeness)

#Append Seated Guests to Seating_Order Vector
# (guest names are in the first column of the matrix)
seated <- wedding_matrix[solution == 1,]
Seating_Order <- append(Seating_Order, as.character(seated[[1]]))

#Remove seated guests from the df before rerunning the model for the next table
seated.index <- which(solution == 1)

# Drop the seated guests' rows and their matching columns
# (+ 1 because the first column holds the guest names)
wedding_matrix = wedding_matrix[-seated.index, -(seated.index + 1)]

#Reduce size of chromosome by the number of guests just seated (10 for a full table)
size = size - length(seated.index)

}

#Separate Tables
One = Seating_Order[1:10]
Two = Seating_Order[11:20]
Three = Seating_Order[21:30]
Four = Seating_Order[31:40]
Five = Seating_Order[41:50]
Six = Seating_Order[51:60]
Seven = Seating_Order[61:70]
Eight = Seating_Order[71:80]
# The ninth table is whoever is left; names are in the first column
Nine = as.character(wedding_matrix[[1]])

Combining Tables into the Final Seating Chart

# One column per table (assumes every table seats exactly 10 guests)
seating_chart <- tibble(One, Two, Three, Four, Five, Six, Seven, Eight, Nine)

#Save Completed Seating Chart in csv
write_csv(seating_chart, "Wedding_Seating_Chart.csv")

The Results

The final seating chart needed only a few minor tweaks from my bride. It saved me from the strenuous process of deciding where each individual should sit, and I also found a way to include R. Also, the wedding was a blast!

Figure 2. My Beautiful Wife and I

Puppy Training with Machine Learning

Sat, 28 Apr 2018 12:00:00 GMT

A Data Driven Approach to Housebreaking My Puppy

Figure 1.1 Don't let the cuteness fool you.

Housetraining a puppy is work. Don't let the cuteness of your pup fool you into thinking housetraining will be a breeze, although the right training up front will save you agony down the road. After reading Rover's post on house breaking your dog I decided to take a data approach to housetraining by documenting eating and bathroom breaks. After a month of recording data I was not only extremely grateful for automation of data warehouses but also able to determine if my pup was on the right track with her potty and eating behaviors. For this post I will only use her bathroom dataset.

First we will load the data into a data frame for exploratory analysis along with the correct R packages. Exploratory analysis is about asking a series of data questions and trying to gain useful insights to influence our decision making.

library(tidyverse)
library(lubridate)
library(ggthemes)
library(modelr)
library(broom)
library(caret)
library(tidytext)
library(lime)
library(ggridges)
library(viridis)

potty_records <- read_csv("Aimee/potty_records.csv") %>%
mutate(Date = mdy(Date), day_of_week = wday(Date, label = TRUE))
potty_records$hour <- as.POSIXlt(potty_records$Time, format="%H:%M")$hour

Visual Exploration

Now that we have the data loaded with the appropriate packages we can start the EDA process by drawing some plots. Let's start with some plots to get to know the data and see whether there are any trends that help us understand the relationship between the Potty break or in-house accident? variable and the other variables. But first we need to find where the missing values are and whether they will cause a problem in the EDA phase.

# List of NAs
potty_records %>%
purrr::map_df(~sum(is.na(.)))

## # A tibble: 1 x 10
## `Trial No.` Date Time `Potty break or in-ho~ `U(rination), D(efecatio~
##
## 1 0 0 0 2 0
## # ... with 5 more variables: `What was the dog doing pre-elimination?
## # (nap, meal, walk, play, sniffing, pacing, etc.)` , `Consequences
## # for the dog (play, treat, walk, scolding, clean up/no response?)`
## # , Notes , day_of_week , hour

We see that there are 359 NA values in the Notes, 2 in the Potty break, and 2 in the Pre-elimination column. Since this is manually logged I know that the Pre-elimination NAs were because of only finding the accident and not seeing any behaviors beforehand or from taking the dog out and no action occurred. It is important to know your data and troubleshoot any data integrity issues that you find.

Let's now plot the Potty break or in-house accident? column over time to look for a trend. Plotting the daily share of Success and Accident shows whether the improvement has been steady or started all of a sudden.

potty_records %>%
rename(type = `Potty break or in-house accident?`) %>%
group_by(Date, type) %>%
summarise(n = n()) %>%
mutate(freq = n/sum(n)) %>%
ggplot(aes(Date, freq, color = type)) +
geom_point(size = 1) +
geom_smooth(method = "lm") +
scale_color_fivethirtyeight("type") +
labs(title = "Time Series of Bathroom Type",
subtitle = "by % of Success or Accident") +
theme_fivethirtyeight()

Figure 1.2 Time Series of Success or Accident by Percent.

Great, it appears Success has a linear trend upward over time despite some minor setbacks. She appears to be a quick learner and Accidents have definitely decreased.

The first granular look is at bathroom trips across the days of the week by hour.

potty_records %>%
ggplot(aes(x = hour, y = day_of_week, fill = ..x..)) +
geom_density_ridges_gradient(scale = 3) +
scale_x_continuous(expand = c(0.01, 0)) +
scale_y_discrete(expand = c(0.01, 0)) +
scale_fill_viridis(name = "Hour", option = "C") +
labs(title = "Number of Potty Breaks By Day of the Week & Hour",
subtitle = "Source: Aimee's housebreaking",
x = "Hour") +
theme_ridges(font_size = 13, grid = TRUE) + theme(axis.title.y = element_blank())

Figure 1.2 Joy Plot of Potty Breaks by Day & Hour.

Here we can see that Aimee definitely goes to the bathroom more often later in the day. I would assume this is because I am home from work and she is out more. The variance on Thursday is also a little unusual.

Next we look further into the hours and types of Accident vs Success and search for patterns.

success <- potty_records %>%
filter(`Potty break or in-house accident?` == 'Success')

success_hour <- ggplot(aes(x = hour), data = success) + geom_histogram(bins = 24, color = 'black', fill = 'blue') +
ggtitle('Histogram of Success Potty Times by Type') +
facet_wrap(~ `U(rination), D(efecation), N(either), B(oth)`) +
theme_minimal()

accident <- potty_records %>%
filter(`Potty break or in-house accident?` == 'Accident')

accident_hour <- ggplot(aes(x = hour), data = accident) + geom_histogram(bins = 24, color = 'black', fill = '#CE1141') +
ggtitle('Histogram of Accident Times by Type') +
facet_wrap(~ `U(rination), D(efecation), N(either), B(oth)`) +
theme_minimal()

accident_hour

Figure 1.3 Histogram of Accident Times by Type.

success_hour

Figure 1.4 Histogram of Success Times by Type.

Again, the afternoon seems to be her most active time for bathroom trips, as well as when the most accidents occur. This is probably due to Aimee being out of her crate and having more free range.

Let's also examine actions before potty times and compare successful trips with in-house accidents.

a <- potty_records %>%
filter(`Potty break or in-house accident?` == 'Success') %>%
group_by(`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`) %>%
summarise(n = n()) %>%
mutate(freq = n/sum(n))

action_success <- ggplot(aes(x = `What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`, y = freq), data = a) +
geom_bar(stat = "identity", fill = "blue") +
geom_text(aes(label = paste0(round(freq*100, 0), "%")), position = position_stack(vjust = 0.5), size = 3.5) +
theme_fivethirtyeight() +
labs(x = "",
y = "Frequency",
title = 'Action Before Successful Potty Times')

b <- potty_records %>%
filter(`Potty break or in-house accident?` == 'Accident') %>%
group_by(`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`) %>%
summarise(n = n()) %>%
mutate(freq = n/sum(n))

action_accident <- ggplot(aes(x = `What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`, y = freq), data = b) +
geom_bar(stat = "identity", fill = "#E31837") +
geom_text(aes(label = paste0(round(freq*100, 0), "%")), position = position_stack(vjust = 0.5), size = 3.5) +
theme_fivethirtyeight() +
labs(x = "",
y = "Frequency",
title = 'Action Before Accident Potty Times')

action_success

Figure 1.5 Bar Chart of Success by Before Action.

action_accident

Figure 1.5 Bar Chart of Accident by Before Action.

Examining the action-before-accident bar chart shows a clear trend of sniffing before the accident happens. This is a common and intuitive tell from any dog that they are searching for a relief spot, but it is nice to have the data to support the claim.

Lastly, let's plot the consequences for Success and Accident using the Consequences for the dog (play, treat, walk, scolding, clean up/no response?) column.

c <- potty_records %>%
filter(`Potty break or in-house accident?` == 'Success') %>%
group_by(`Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`) %>%
summarise(n = n()) %>%
mutate(freq = n/sum(n))

ggplot(aes(x = `Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`, y = freq), data = c) +
geom_bar(stat = "identity", fill = "blue") +
geom_text(aes(label = paste0(round(freq*100, 0), "%")), position = position_stack(vjust = 0.5), size = 3.5) +
theme_fivethirtyeight() +
labs(x = "",
y = "Frequency",
title = 'Consequences after Successful Relief')

Figure 1.6 Bar Chart of Success by Consequence.

d <- potty_records %>%
filter(`Potty break or in-house accident?` == 'Accident') %>%
group_by(`Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`) %>%
summarise(n = n()) %>%
mutate(freq = n/sum(n))

ggplot(aes(x = `Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`, y = freq), data = d) +
geom_bar(stat = "identity", fill = "#E31837") +
geom_text(aes(label = paste0(round(freq*100, 0), "%")), position = position_stack(vjust = 0.5), size = 3.5) +
theme_fivethirtyeight() +
labs(x = "",
y = "Frequency",
title = 'Consequences for Accident in House')

Figure 1.6 Bar Chart of Accident by Consequence.

When training Aimee we follow Karen Pryor's positive reinforcement method and it definitely appears in the data, but 33% of the time my partner and I could not hold back the scolding. After all, we are only human.

Formulate hypothesis around EDA

The available data is limited to the bathroom data. Using potty_records we know whether she had a Success or an Accident. Based upon the data my hypotheses are:

Based upon what she was doing pre-elimination we can try to determine whether or not we will have a Success or an Accident. This may or may not be enough to build a sufficient prediction model but we can gain some insights from building a machine learning model for variable importance. A better question may be "What might make the Accident column tally less and more?" For instance, is there any difference between action before pre-elimination or between consequences. Or, if time of meals has anything to do with whether the pup will have a Success or Accident.
Consequences for the dog seem to be making a big difference for Success rate improving.
Based upon hour and type of potty there doesn't seem to be a difference between whether an elimination will be Success or Accident.

Now let's evaluate these hypotheses by building some models and a few more plots.

potty_records %>%
group_by(Date, `Potty break or in-house accident?`) %>%
summarise(n = n()) %>%
na.omit() %>%
ggplot(aes(`Potty break or in-house accident?`, n)) +
geom_boxplot(color = "black", aes(fill = factor(`Potty break or in-house accident?`))) +
theme_bw() +
scale_fill_brewer(palette = "Blues") +
labs(title = "Potty break or in-house accident?",
x = "",
y = "") +
guides(fill = guide_legend(title = "Type"))

Figure 1.7 Box Plot.

Examining the box plot we see that the daily Accident count appears to have a wider variance, while Success occurs more often but has one outlier. Since the data is grouped by day, I can remember that unsuccessful day of housebreaking. Let's dig deeper and build some models.

Correlation is different from causation.

By building a classification model we can understand the relationship between the variables better. We can also understand and perhaps explain changes in Success and Accident. But the relationship is correlation, meaning that changes in the Success rate are associated with certain metrics, not necessarily caused by them.

Model Building

Since our response is a binary outcome we will use a classification model to predict Success or Accident. I will also use plots and variable importance, via the caret and lime packages, to gain insight into which variables carry information.

Let's build and evaluate a model to help us determine the important variables for Success and/or Accident. We first remove time stamps and dates from the data. We also remove Trial No. and day_of_week because they do not drive whether Aimee will have a Success, and we do not want to overfit the model.

potty_records_model <- potty_records %>%
select(-Notes, -`Time`, -Date, -`Trial No.`, -day_of_week) %>%
mutate(`Potty break or in-house accident?` = as.factor(`Potty break or in-house accident?`),
`U(rination), D(efecation), N(either), B(oth)` = as.factor(`U(rination), D(efecation), N(either), B(oth)`),
`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)` = as.factor(`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`), `Consequences for the dog (play, treat, walk, scolding, clean up/no response?)` = as.factor(`Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`)) %>%
na.omit()

potty_records_model <- potty_records_model %>%
rename(type = `U(rination), D(efecation), N(either), B(oth)`, action_before = `What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`, Consequences = `Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`)

# Replace NAs w/ 0s in numeric columns (funs() is deprecated in current dplyr)
potty_records_model <- potty_records_model %>%
mutate_if(is.numeric, ~ replace(., is.na(.), 0))

Now we split the data into training and test sets, stratified on the outcome, with 60% of the rows used for training. In this situation Success is the class we are interested in. We then fit a random forest using 10-fold cross-validation repeated 5 times.

# training and test set
set.seed(42)
index <- createDataPartition(potty_records_model$`Potty break or in-house accident?`, p = 0.6, list = FALSE)
train_data <- potty_records_model[index, ]
test_data <- potty_records_model[-index, ]

# modeling
model_rf <- caret::train(`Potty break or in-house accident?` ~ .,
data = train_data,
method = "rf", # random forest
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
verboseIter = FALSE))

model_rf

## Random Forest
##
## 219 samples
## 4 predictor
## 2 classes: 'Accident', 'Success'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 197, 197, 197, 198, 197, 197, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9663919 0.9235532
## 7 0.9826802 0.9631164
## 12 0.9782138 0.9531246
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.

The cross-validated accuracy of the selected model (mtry = 7) is 98.27%. Our goal is not to perfect a prediction of whether she will have an accident or a successful bathroom trip, but it is good to know our dependent variable is explained well by the independent variables in our dataset. Since we have good prediction accuracy we can now extract insights.

pred <- data.frame(sample_id = 1:nrow(test_data), predict(model_rf, test_data, type = "prob"), actual = test_data$`Potty break or in-house accident?`) %>%
mutate(prediction = colnames(.)[2:3][apply(.[, 2:3], 1, which.max)], correct = ifelse(actual == prediction, "correct", "wrong"))

confusionMatrix(pred$actual, pred$prediction, positive = "Success")

## Confusion Matrix and Statistics
##
## Reference
## Prediction Accident Success
## Accident 51 0
## Success 2 91
##
## Accuracy : 0.9861
## 95% CI : (0.9507, 0.9983)
## No Information Rate : 0.6319
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9699
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9623
## Pos Pred Value : 0.9785
## Neg Pred Value : 1.0000
## Prevalence : 0.6319
## Detection Rate : 0.6319
## Detection Prevalence : 0.6458
## Balanced Accuracy : 0.9811
##
## 'Positive' Class : Success
##
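The summary statistics in this output follow directly from the four cell counts. As a check, the base R sketch below rebuilds the table from the printed numbers and recomputes accuracy, sensitivity, specificity, and positive predictive value with Success as the positive class. The object names are made up for illustration.

```r
# Cell counts copied from the printed confusion matrix
# (rows are labelled Prediction, columns Reference, as in the output)
cm <- matrix(c(51, 2, 0, 91), nrow = 2,
             dimnames = list(Prediction = c("Accident", "Success"),
                             Reference  = c("Accident", "Success")))

# Share of all cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Of the reference Success cases, the share labelled Success
sensitivity <- cm["Success", "Success"] / sum(cm[, "Success"])

# Of the reference Accident cases, the share labelled Accident
specificity <- cm["Accident", "Accident"] / sum(cm[, "Accident"])

# Of the cases labelled Success, the share that are reference Success
pos_pred_value <- cm["Success", "Success"] / sum(cm["Success", ])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, pos_pred_value = pos_pred_value), 4)
```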

LIME needs data without response variable

train_x <- dplyr::select(train_data, -`Potty break or in-house accident?`)
test_x <- dplyr::select(test_data, -`Potty break or in-house accident?`)

train_y <- dplyr::select(train_data, `Potty break or in-house accident?`)
test_y <- dplyr::select(test_data, `Potty break or in-house accident?`)

Build explainer, the key function in lime that explains the model's predictions.

explainer <- lime(train_x, model_rf, n_bins = 5, quantile_bins = TRUE)

Run the explain() function. We set n_features = 8, which caps the number of features used in each explanation and keeps the output from becoming confusing. We also set feature_select to "forward_selection" explicitly rather than leaving it at the lime package's "auto" default.

explanation_df <- lime::explain(test_x, explainer, n_labels = 2, n_features = 8, n_permutations = 1000, feature_select = "forward_selection")

The feature importance plot is the reason LIME is so useful. This allows us to visualize each of the first 3 cases (observations) from the test data. The top four features for each case are shown. Note that they are not the same for each case. The green bars mean that the feature supports the model conclusion, and the red bars contradict.

plot_features(explanation_df[1:24, ], ncol = 2) +
labs(title = "LIME Feature Importance Visualization")

Figure 1.9 Lime Feature Importance.

Lime provides an easy-to-read plot, but what does the data tell us? Let's examine case 1:

pred %>%
filter(sample_id == 1)

## sample_id Accident Success actual prediction correct
## 1 1 0.008 0.992 Success Success correct

Case 1 was correctly predicted to come from the Success group because it

Has play as a consequence for action after potty break
The hour the action occurred was <= 8
The action before was sniffing
The type was labeled U

For each feature, the explanatory plot tells us the range of values in which a representative data point would fall. If the case falls in that range, this is counted as support for the prediction; if it does not, it is scored as contradictory. For instance, examining case 3 on the plot, scolding contradicts a Success prediction.
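The ranges such as "hour <= 8" come from binning the numeric features, which is what n_bins = 5 and quantile_bins = TRUE request in the lime() call. The base R sketch below illustrates the general idea of quantile binning on made-up hours; it is an approximation for intuition, not a reproduction of lime's internal code.

```r
# Made-up hours of the day (not the real potty records)
set.seed(7)
hour <- sample(0:23, size = 200, replace = TRUE)

# Five quantile bins need six break points; unique() guards against ties
n_bins <- 5
breaks <- unique(quantile(hour, probs = seq(0, 1, length.out = n_bins + 1)))

# Assign every observation to its bin; include.lowest keeps the minimum
hour_bin <- cut(hour, breaks = breaks, include.lowest = TRUE)

# Quantile bins hold roughly equal numbers of observations
table(hour_bin)
```

Each explanation row then refers to the bin a case falls into rather than to its exact value.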

plot_explanations() is another great visualization that can be utilized with LIME. The function produces a faceted heatmap of all feature combinations.

df <- explanation_df %>%
mutate(case = as.numeric(case)) %>%
filter(case < 31)

plot_explanations(df) +
labs(title = "LIME Feature Importance Heatmap",
subtitle = "Hold Out (Test) Set, First 30 Cases Shown")

Figure 1.10 Lime Feature Importance Heatmap.

Power Test and Difference in Means

Since we do not have a randomized controlled experiment we will control for type and see where we are achieving Success in the housebreaking. First we examine the overall Success rate.

test <- potty_records %>%
mutate(Success = case_when(`Potty break or in-house accident?` == 'Success' ~ 1,
`Potty break or in-house accident?` == 'Accident' ~ 0))

test_mean <- test %>%
summarise(n = n(),
mean_success = mean(Success, na.rm = TRUE),
std_error = sd(Success, na.rm = TRUE) / sqrt(n),
sd = sd(Success, na.rm = TRUE),
lower.ci = mean_success - qt(1 - (0.05/2), n - 1) * std_error,
upper.ci = mean_success + qt(1 - (0.05/2), n - 1) * std_error)
test_mean

## # A tibble: 1 x 6
## n mean_success std_error sd lower.ci upper.ci
##
## 1 365 0.645 0.0251 0.479 0.595 0.694

We have an overall Success rate of 64.5%. Let's now examine where we are achieving the most Success.
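The interval in the summary can be rebuilt by hand from n, the mean, and the standard deviation. The sketch below plugs in the rounded values printed above, so its results can differ from the table in the third decimal place.

```r
# Values copied from the printed summary (already rounded)
n            <- 365
mean_success <- 0.645
sd_success   <- 0.479

# Standard error of the mean
std_error <- sd_success / sqrt(n)

# Critical t value for a two-sided 95% interval with n - 1 degrees of freedom
t_crit <- qt(1 - 0.05 / 2, df = n - 1)

lower_ci <- mean_success - t_crit * std_error
upper_ci <- mean_success + t_crit * std_error

round(c(std_error = std_error, lower = lower_ci, upper = upper_ci), 3)
```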

We can control for U(rination), D(efecation), N(either), B(oth) to see whether the Success rate holds within each type. Without a randomized experiment this remains descriptive rather than causal.

test_type <- test %>%
group_by(`U(rination), D(efecation), N(either), B(oth)`) %>%
summarise(n = n(),
mean_success = mean(Success, na.rm = TRUE),
std_error = sd(Success, na.rm = TRUE) / sqrt(n),
sd = sd(Success, na.rm = TRUE),
lower.ci = mean_success - qt(1 - (0.05/2), n - 1) * std_error,
upper.ci = mean_success + qt(1 - (0.05/2), n - 1) * std_error) %>%
filter(n > 2) %>%
arrange(desc(mean_success))
test_type

## # A tibble: 3 x 7
## `U(rination), D(ef~ n mean_success std_error sd lower.ci upper.ci
##
## 1 B 53 0.755 0.0597 0.434 0.635 0.874
## 2 D 41 0.659 0.0750 0.480 0.507 0.810
## 3 U 269 0.621 0.0296 0.486 0.562 0.679

Even though it can feel like I have been making progress, the least progress is with U. This could be because of how often she goes U, and because Aimee is taken outside immediately when a larger accident is taking place.

Let's now visualize the statistics.

test_type %>%
rename(Type = `U(rination), D(efecation), N(either), B(oth)`) %>%
ggplot(aes(mean_success, n, color = Type)) +
geom_point() +
geom_errorbarh(aes(xmin = lower.ci, xmax = upper.ci)) +
labs(x = "Success Rate",
y = "n",
title = 'Success Rate by Type') +
theme_bw()

Figure 2 Success Rate by Type.

The snapshot of the data tells us that D has a higher rate of Success than U, but D's confidence interval is much wider (0.507 to 0.810 versus 0.562 to 0.679) and the two intervals overlap.

Let's also control for What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.) and see if our results change.

test_elimination <- test %>%
group_by(`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`) %>%
summarise(n = n(),
mean_success = mean(Success, na.rm = TRUE),
std_error = sd(Success, na.rm = TRUE) / sqrt(n),
sd = sd(Success, na.rm = TRUE),
lower.ci = mean_success - qt(1 - (0.05/2), n - 1) * std_error,
upper.ci = mean_success + qt(1 - (0.05/2), n - 1) * std_error) %>%
filter(n > 2) %>%
arrange(desc(mean_success))
test_elimination

## # A tibble: 6 x 7
## `What was the dog ~ n mean_success std_error sd lower.ci upper.ci
##
## 1 crate 71 0.972 0.0198 0.167 0.932 1.01
## 2 nap 27 0.889 0.0616 0.320 0.762 1.02
## 3 signal 15 0.600 0.131 0.507 0.319 0.881
## 4 sniffing 215 0.553 0.0340 0.498 0.487 0.620
## 5 pacing 14 0.429 0.137 0.514 0.132 0.725
## 6 play 21 0.333 0.105 0.483 0.113 0.553

When Aimee is in her crate before going out she has the highest success rate.
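
The t-based intervals above run past 1 for crate and nap (upper limits of 1.01 and 1.02), which a proportion cannot do. As a hedged alternative, not the method used in this post, a score-based interval from prop.test() stays inside [0, 1]. The success counts below are back-calculated from the rounded table (0.972 × 71 ≈ 69 and 0.889 × 27 ≈ 24), so treat them as assumptions.

```r
# Success counts are back-calculated from the rounded table above (assumption).
counts <- data.frame(
  activity  = c("crate", "nap"),
  successes = c(69, 24),
  n         = c(71, 27)
)

# prop.test() returns a Wilson score interval (with continuity correction),
# which cannot extend below 0 or above 1.
wilson <- t(mapply(function(x, n) prop.test(x, n)$conf.int,
                   counts$successes, counts$n))
colnames(wilson) <- c("lower.ci", "upper.ci")
cbind(counts, wilson)
```

The point estimates are unchanged; only the interval construction differs, so the ranking of activities stays the same.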

Now we run a t.test for statistical significance between Success and Accident, but before the test we remove the rows with missing values (when Aimee was taken outside but had no action).

test <- test[c(-56, -15), ]

hypothesis <- with(test, t.test(Success == 1, Success == 0))
hypothesis

##
## Welch Two Sample t-test
##
## data: Success == 1 and Success == 0
## t = 8.1307, df = 724, p-value = 1.85e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2194118 0.3591006
## sample estimates:
## mean of x mean of y
## 0.6446281 0.3553719

obs_diff <- hypothesis[["estimate"]][["mean of x"]] - hypothesis[["estimate"]][["mean of y"]]
obs_diff

## [1] 0.2892562

Successful housebreaking trips occur at a rate of 0.6446281, while accidents occur at 0.3553719. That's a 0.2892562 difference, which would be great if it were true. The most likely reason for the odd difference-in-means result is that we didn't collect enough data.
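
Part of the oddness may come from the test itself: `Success == 1` and `Success == 0` are the same observations viewed from two sides, so they are not two independent samples. One alternative framing, offered here as a sketch rather than as the analysis used in this post, is a one-sample test of whether the Success rate differs from 0.5. The counts (234 successes out of 363 records) are inferred from the t-test output above and are an assumption.

```r
# Counts implied by the t-test output: 363 records, 64.46% Success (assumption).
n_trips   <- 363
n_success <- 234

# Exact binomial test of H0: the Success probability equals 0.5.
success_test <- binom.test(n_success, n_trips, p = 0.5)
success_test$estimate   # observed Success rate
success_test$conf.int   # 95% Clopper-Pearson interval
success_test$p.value
```

This keeps the same point estimate but attaches the uncertainty to a single proportion instead of to a difference between an indicator and its complement.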

Let's plot the p-value by date.

test_by_day <- test %>%
group_by(Date) %>%
summarise(p_value = t.test(Success == 1, Success == 0)$p.value,
Success = t.test(Success == 1, Success == 0)$estimate[1])

test_by_day %>%
ggplot(aes(Date, p_value)) +
geom_line(size = 1) +
geom_hline(yintercept = 0.05, linetype="dashed", color = "red") +
labs(title = "P-Value of Success by Day",
subtitle = "With 0.05 Threshold") +
theme_fivethirtyeight()

Figure 2.1 P-Value of Success by Day.

The difference in means is statistically significant at the conventional levels of confidence. As the p-value is smaller than our 0.05 significance level, we can reject the null hypothesis that there is no statistical difference in Success vs Accident for housebreaking Aimee. This type of statistical test is useful for me to determine whether housebreaking Aimee resulted in a statistical difference in Success.

Lastly we can calculate the effect of success over time and the total effect of success.

test_by_acc <- test %>%
group_by(Date) %>%
summarise(Accident = t.test(Success == 1, Success == 0)$estimate[2])

effect <- inner_join(test_by_day, test_by_acc, by = "Date") %>%
mutate(effect = (Success - Accident))

effect %>%
summarise(mean_effect = mean(effect), total_effect = sum(effect))

## # A tibble: 1 x 2
## mean_effect total_effect
##
## 1 0.315 10.1

Let's plot the effect over time for visual ease.

effect %>%
ggplot(aes(Date, effect)) +
geom_line(size = 1, color = "blue") +
labs(title = "Percent Change of Success by Day") +
theme_fivethirtyeight()

Figure 2.2 Percent Change of Success by Day.

Final hypothesis

My final hypothesis is that Aimee is more accident prone later in the day.

ggplot(data = test, aes(`Potty break or in-house accident?`, hour)) +
geom_boxplot(color = "#007DC5", alpha = 0.8) +
geom_jitter(size = 0.5) +
theme_bw() +
labs(x = "",
y = "",
title = "",
subtitle = "Box Plot of Potty break or in-house accident? by Hour") +
coord_flip()

Figure 2.3 Box Plot of Potty break or in-house accident? by Hour

qplot(fill = `Potty break or in-house accident?`, x = hour, data = test, geom = "density",
alpha = I(0.5),
adjust = 1,
xlim = c(-5, 30)) +
theme_bw()

Figure 2.4 Density Plot by Hour

hour_t.test <- with(test, t.test(hour ~ `Potty break or in-house accident?`))
hour_t.test

##
## Welch Two Sample t-test
##
## data: hour by Potty break or in-house accident?
## t = 2.1031, df = 296.87, p-value = 0.0363
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.07694935 2.31919455
## sample estimates:
## mean in group Accident mean in group Success
## 14.86047 13.66239

As the p-value is smaller than our 0.05 significance level, we reject the null hypothesis that there is no difference in mean hour between Potty break and in-house accident. This type of statistical test is useful to determine whether the hour of the day differs between successes and accidents. If data continued to be collected with the same techniques, about 95% of the intervals constructed this way would contain the true difference in mean hour. The box plot above gives an easy visual comparison of the hour distributions in the two groups.

hour_diff <- round(hour_t.test$estimate[1] - hour_t.test$estimate[2], 1)

Our study finds that the hour of day is, on average, 1.2 hours later in the Accident group than in the Success group (t-statistic 2.1, p = 0.036, CI [0.1, 2.3] hours).

Conclusion

To clarify, I am not a professional trainer, but using data to measure whether my pup was progressing in the right direction seemed worthwhile. Also, I used no form of punishment and strongly suggest positive reinforcement with a clicker. Punishment does not work because dogs don't remember the act of going to the bathroom in the house, which is the key reason to use only positive reinforcement. If you scare your animal while catching them in the act, it will only make them afraid of you when they have to potty and will lead to hidden accidents.

Now for the data conclusions: using a schedule and rewarding good behavior was key to the quick learning results while housebreaking.

Remember that correlation is not causation. A later hour of the day is not itself causing Aimee to have more or less success with housebreaking. It is more likely due to whether my partner and I were home and present, and how much attention we could pay to her behavior.

In the future we could also use the food and water data I collected to help with determining variables in housebreaking. Animals that eat/drink on a set schedule tend to use the bathroom on a schedule. Another useful variable may have been to group by Date and calculate the average time between potty trips to gather a general pattern. A good data analysis always generates insights but also helps generate more questions.
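
The "average time between potty trips" idea could be sketched as follows. This is a minimal illustration on made-up records; the `timestamp` column and the `trips` data frame are hypothetical and stand in for whatever date and time fields the real data has.

```r
library(dplyr)

# Toy records; `timestamp` is a hypothetical column, not one from potty_records.
trips <- tibble(
  timestamp = as.POSIXct(c("2018-01-01 07:00", "2018-01-01 09:30",
                           "2018-01-01 13:00", "2018-01-02 06:45",
                           "2018-01-02 10:45"), tz = "UTC")
)

gaps <- trips %>%
  mutate(Date = as.Date(timestamp)) %>%
  group_by(Date) %>%
  arrange(timestamp, .by_group = TRUE) %>%
  # Hours elapsed since the previous trip on the same day (NA for the first trip).
  mutate(hours_since_last = as.numeric(difftime(timestamp, lag(timestamp),
                                                units = "hours"))) %>%
  summarise(mean_gap_hours = mean(hours_since_last, na.rm = TRUE))

gaps
```

Grouping by Date before taking the lag keeps the overnight gap out of the average, which seems closer to the "general pattern" described above.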

Figure 2.5 Aimee

Sentiment Analysis of Red Hot Chili Peppers

Thu, 07 Sep 2017 12:00:00 GMT

Last week, I finished reading 'Scar Tissue', Anthony Kiedis' autobiography. The book details his life and the many years he has been involved with the RHCP. Kiedis has lived a life worth telling, and his recollections fill a 400+ page memoir that does not disappoint. After completing the tell-all story, I decided to take a data perspective on the RHCP. After their successful album, 'Blood Sugar Sex Magik', lead guitarist John Frusciante left the band in 1992 due to its overwhelming popularity, among other issues. With Dave Navarro replacing Frusciante, the RHCP created 'One Hot Minute'. Although the album went platinum, it was not as successful as the earlier title. Frusciante rejoined the RHCP in 1998, and they went on to release 'Californication'. Compared with 'Blood Sugar Sex Magik', 'One Hot Minute' is said to contain darker subject matter, which is credited to the addition of Navarro. Using sentiment analysis, we will compare the two albums' lyrics.

Getting the Data by scraping RHCP lyrics

To gather the lyrical data we will need to scrape the lyrics using rvest.

library(knitr)
library(rvest)
library(tidyr)
library(tidytext)
library(wordcloud)
library(XML)
library(tidyverse)

poe <- ('https://genius.com/Red-hot-chili-peppers-the-power-of-equality-lyrics')
poe_html <- read_html(poe)
poe_lyrics <- poe_html %>%
html_nodes("p") %>%
html_text()
poe_lyric_df <- data.frame(line = 1:1, text = poe_lyrics)
poe_lyric_df$text <- as.character(poe_lyric_df$text)

poe <- poe_lyric_df %>%
unnest_tokens(word, text)

blood_sugar <- blood_sugar %>%
anti_join(stop_words) %>%
filter(!grepl('[0-9]', word), word != 'verse', word != 'hook', word != 'song', word != 'album', word != 'anthony')

To extract the lyrics we can repeat the format above for each lyric URL, or write a function and use purrr (for example map_chr, which returns one character string per URL) to scrape a whole album at once. The latter is by far the most efficient route.
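
A minimal sketch of the purrr route is below. The helper names `extract_lyrics` and `scrape_album` are hypothetical, and the "p" selector is simply carried over from the single-song example above; it may need adjusting to the page layout.

```r
library(rvest)
library(purrr)

# Pull the text of every <p> node from a parsed page and collapse it to one string.
extract_lyrics <- function(page) {
  page %>%
    html_nodes("p") %>%
    html_text() %>%
    paste(collapse = " ")
}

# Hypothetical helper: download each URL and return one row per song.
# map_chr() yields a character vector with one lyric string per URL.
scrape_album <- function(urls, songs, album) {
  lyrics <- map_chr(urls, ~ extract_lyrics(read_html(.x)))
  data.frame(album = album, song = songs, text = lyrics,
             stringsAsFactors = FALSE)
}
```

A call such as `scrape_album(urls, songs, "Blood Sugar Sex Magik")`, with `urls` and `songs` as equal-length character vectors, would then return a data frame ready for unnest_tokens(word, text).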

Now we have the lyrics for 'Blood Sugar Sex Magik' and can transform them into the tidy text format.

Now we can do the same for 'One Hot Minute'.

Once we have the lyrics for 'One Hot Minute', we transform them into the same tidy text format.

# Combine the per-song data frames first, then tokenize into one word per row.
one_minute <- bind_rows(mutate(warped, album = "One Hot Minute", song = "Warped"),
mutate(aeroplane, album = "One Hot Minute", song = "Aeroplane"),
mutate(deep_kick, album = "One Hot Minute", song = "Deep Kick"),
mutate(my_friends, album = "One Hot Minute", song = "My Friends"),
mutate(coffee_shop, album = "One Hot Minute", song = "Coffee Shop"),
mutate(pea, album = "One Hot Minute", song = "Pea"),
mutate(one_big_mob, album = "One Hot Minute", song = "One Big Mob"),
mutate(walkabout, album = "One Hot Minute", song = "Walkabout"),
mutate(tearjerker, album = "One Hot Minute", song = "Tearjerker"),
mutate(one_hot_m, album = "One Hot Minute", song = "One Hot Minute"),
mutate(falling_into_grace, album = "One Hot Minute", song = "Falling Into Grace"),
mutate(shallow, album = "One Hot Minute", song = "Shallow"),
mutate(transcending, album = "One Hot Minute", song = "Transcending")) %>%
unnest_tokens(word, text)

# Then drop stop words, numbers, and scraping residue.
one_minute <- one_minute %>%
anti_join(stop_words) %>%
filter(!grepl('[0-9]', word), word != 'verse', word != 'chorus', word != 'song', word != 'album', word != 'red', word != 'hot',
word != 'peppers', word != 'chili', word != 'https', word != 'lyrics', word != 'genius.com')

Frequency of Lyrics between albums

library(stringr)

frequency <- bind_rows(mutate(one_minute, album = "One Hot Minute"),
mutate(blood_sugar, album = "Blood Sugar Sex Magik")) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
count(album, word) %>%
group_by(album) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
spread(album, proportion) %>%
gather(album, proportion, `One Hot Minute`) %>%
na.omit()

library(scales)

ggplot(frequency, aes(x = proportion, y = `Blood Sugar Sex Magik`, color = abs(`Blood Sugar Sex Magik` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
facet_wrap(~album, ncol = 2) +
theme(legend.position="none") +
labs(y = "Blood Sugar Sex Magik", x = NULL)

Figure 1.1 Comparing word frequencies between RHCP albums 'One Hot Minute' & 'Blood Sugar Sex Magik'.

Words that are close to the line have similar frequencies in both albums. Some words landed here unintentionally. For instance, rick and kiedis are most likely not lyrics but appear from scraping the web page (Rick Rubin was the producer on both albums, while Anthony Kiedis is the lead singer). It is interesting to see 'funky' appear near the middle of the line while 'love' appears at the high end of the frequency range.

We can now quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between 'One Hot Minute' & 'Blood Sugar Sex Magik'?

cor.test(data = frequency[frequency$album == "One Hot Minute",],
~ proportion + `Blood Sugar Sex Magik`)
##
## Pearson's product-moment correlation
##
## data: proportion and Blood Sugar Sex Magik
## t = 5.2814, df = 246, p-value = 2.82e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2026112 0.4267278
## sample estimates:
## cor
## 0.3191241

The correlation between word frequencies in 'One Hot Minute' & 'Blood Sugar Sex Magik' is .32, which is not a strong indication of similar lyrics. This could be due to the addition of Navarro or to the RHCP trying a different lyrical tone.

Combine both albums and add sentiment analysis

Using an inner_join we can get a good grasp of the sentiment by matching positive and negative words. Let's find the net sentiment of each album.

tidy <- bind_rows(blood_sugar, one_minute)

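afinn <- tidy %>%
inner_join(get_sentiments("afinn")) %>%
group_by(album) %>%
# The AFINN column was named `score` when this was written; newer tidytext releases call it `value`.
summarise(sentiment = sum(score)) %>%
mutate(method = "AFINN")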

ggplot(afinn, aes(album, sentiment, fill = album)) +
geom_col(show.legend = FALSE) +
facet_wrap(~album, ncol = 2, scales = "free_x")

Figure 1.2 'One Hot Minute' has the more negative sentiment of the two albums' lyrics.

We can now examine the most common words in each album.

tidy %>%
group_by(album) %>%
count(word, sort = TRUE) %>%
filter(n > 13) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = album)) +
facet_wrap(~ album, scales = "free_y") +
geom_col(show.legend = FALSE) +
labs(y = "Most Common Used Words") +
coord_flip()

Figure 1.3 The most common words used in each album.

We can now reshape this chart into a word cloud.

Most Common Word Clouds

tidy %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))

Let's distinguish between positive and negative words.

library(reshape2)

tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#F8766D", "#00BFC4"),
max.words = 100)

Summary

The data appear to support the notion that when John Frusciante temporarily left the RHCP their lyrics were darker and more 'negative' according to the word lexicons. The analysis was motivated by the tidytext and rvest packages. Further analysis could look into the later RHCP albums with Frusciante and possibly the lyrics of Navarro's other bands. The Red Hot Chili Peppers revolutionized funk rock in America and anyone interested in their journey should read 'Scar Tissue'.

Altuve or Biggio? Using Bayesian A/B Testing

Tue, 23 May 2017 12:00:00 GMT

Using Bayesian A/B Testing

Altuve vs Biggio with Bayesian A/B Testing.

Who is a better batter?: Craig Biggio or Jose Altuve?

Inspiration for this post comes after reading David Robinson's post comparing Mike Piazza vs Hank Aaron using Bayesian A/B testing here.

At the end of 2014, Jose Altuve had a higher career batting average (630 hits / 2083 at-bats = .302) than Craig Biggio (3060 hits / 10876 at-bats = .281).

Can we say that Altuve's batting skill is actually better than Biggio's or could it be that Altuve has not played long enough to regress towards the mean?

In this post we will compare two batters using an empirical Bayesian approach to batting statistics to determine who is the better batter and by how much?

Understanding the difference between the two proportions is important in A/B testing. One of the most common examples of A/B testing is comparing clickthrough rates ("out of X impressions, there have been Y clicks")- which on the surface is similar to our batting average estimation problem ("out of X at-bats, there have been Y hits").

Let's define the problem in terms of the difference between each player's posterior distribution, and look at three mathematical and computational strategies for solving it. The example is about baseball statistics, but many A/B tests can apply the same principles.

Setup

library(dplyr)
library(tidyr)
library(Lahman)
library(knitr)
library(ggplot2)
theme_set(theme_bw())

pitchers <- Pitching %>%
group_by(playerID) %>%
summarize(gamesPitched = sum(G)) %>%
filter(gamesPitched > 3)

career <- Batting %>%
filter(AB > 0) %>%
anti_join(pitchers, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
mutate(average = H / AB)

# Newer Lahman releases rename `Master` to `People`.
career <- Master %>%
tbl_df() %>%
select(playerID, nameFirst, nameLast) %>%
unite(name, nameFirst, nameLast, sep = " ") %>%
inner_join(career, by = "playerID")

career_filtered <- career %>% filter(AB >= 500)
m <- MASS::fitdistr(career_filtered$average, dbeta,
start = list(shape1 = 1, shape2 = 10))

alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]

career_eb <- career %>%
mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0)) %>%
mutate(alpha1 = H + alpha0,
beta1 = AB - H + beta0) %>%
arrange(desc(eb_estimate))

So let's take a look at the two batters in question, Craig Biggio and Jose Altuve

# Save them as separate objects too for later:
biggio <- career_eb %>% filter(name == "Craig Biggio")
altuve <- career_eb %>% filter(name == "Jose Altuve")
bagwell <- career_eb %>% filter(name == "Jeff Bagwell")
two_players <- bind_rows(biggio, altuve)

kable(head(two_players))

playerID — name — H — AB — average — eb_estimate — alpha1 — beta1

biggicr01 — Craig Biggio — 3060 — 10876 — 0.281 — 0.281 — 3137 — 8035

altuvjo01 — Jose Altuve — 1046 — 3361 — 0.311 — 0.307 — 1123 — 2534

We see that Altuve has a slightly higher batting average, and a higher shrunken empirical Bayes estimate. But is Altuve's true probability of getting a hit higher than Biggio's? Or is the difference due to chance?

The answer lies in considering the range of plausible values for their "true" batting averages after we have taken their batting average (record) into account, or the "actual posterior distributions".

These posterior distributions are modeled as beta distributions with the parameters Beta(α0 + H, β0 + AB − H).
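
As a worked check of the update rule used for alpha1 and beta1 in the setup code, the sketch below recomputes Biggio's posterior. The prior values α0 = 77 and β0 = 219 are not printed in the post; they are back-calculated from the rounded alpha1 and beta1 columns of the table, so treat them as approximate assumptions. `posterior_params` is a hypothetical helper name.

```r
# Prior parameters back-calculated from the table (assumption, rounded).
alpha0 <- 77
beta0  <- 219

# Conjugate beta-binomial update: hits add to alpha, misses add to beta.
posterior_params <- function(H, AB, alpha0, beta0) {
  alpha1 <- alpha0 + H
  beta1  <- beta0 + AB - H
  c(alpha1 = alpha1, beta1 = beta1, mean = alpha1 / (alpha1 + beta1))
}

biggio_post <- posterior_params(H = 3060, AB = 10876, alpha0, beta0)
biggio_post
```

The posterior mean alpha1 / (alpha1 + beta1) is the same quantity as eb_estimate, which is why the two columns agree.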

library(broom)
library(ggplot2)
theme_set(theme_bw())

# inflate() was deprecated in later broom releases in favor of tidyr::crossing().
two_players %>%
inflate(x = seq(.26, .33, .00025)) %>%
mutate(density = dbeta(x, alpha1, beta1)) %>%
ggplot(aes(x, density, color = name)) +
geom_line() +
labs(x = "Batting average", color = "")

These posteriors are probabilistic representations of our uncertainty in each estimate. When we ask what the probability is that Altuve is better, we are asking "if I took a random draw from Altuve's posterior and a random draw from Biggio's, what is the probability that Altuve's is higher?"

Notice how Biggio's and Altuve's distributions overlap only slightly, near the .290 range. Examining the distributions, there is not enough uncertainty in either estimate to make it plausible that Biggio is the better hitter on the current statistics. If we took a random draw from Biggio's distribution and one from Altuve's, it's very unlikely Biggio's would be higher.

career_eb %>%
filter(name %in% c("Craig Biggio", "Jose Altuve", "Jeff Bagwell")) %>%
inflate(x = seq(.26, .33, .00025)) %>%
mutate(density = dbeta(x, alpha1, beta1)) %>%
ggplot(aes(x, density, color = name)) +
geom_line() +
labs(x = "Batting average", color = "")

Jeff Bagwell won a Silver Slugger Award in 1994 and had an excellent batting record. Notice the vast amount of overlap in Bagwell and Altuve's distributions. This means there is enough uncertainty in the estimates that Bagwell could easily be a better batter than Altuve.

Posterior Probability

We may be interested in the probability that Altuve is a stronger hitter than Biggio within our model. From the graph we can already tell that it's greater than 50%, but how can we quantify this?

We need to know the probability that one beta distribution is greater than another.

I'm going to illustrate three common routes in solving a Bayesian problem: 1) Simulation of posterior draws 2) Numerical integration 3) Closed-form approximation

Simulation of posterior draws

Simulation is the quickest way to avoid doing any math. Using each player's α1 and β1 parameters, draw a million values from each posterior using rbeta, and compare the results:

altuve_simulation <- rbeta(1e6, altuve$alpha1, altuve$beta1)
biggio_simulation <- rbeta(1e6, biggio$alpha1, biggio$beta1)
bagwell_simulation <- rbeta(1e6, bagwell$alpha1, bagwell$beta1)
sim <- mean(altuve_simulation > biggio_simulation)
head(sim)

## [1] 0.999

A 99.9% probability that Altuve is a better batter than Biggio.

For fun, let's compare Altuve to Bagwell.

sim2 <- mean(bagwell_simulation > altuve_simulation )
sim2

## [1] 0.103

A much lower probability of 10% that Bagwell is a better batter than Altuve.

You could turn the number of draws up or down depending on how much you value speed vs precision. We didn't have to do any mathematical derivation or proofs. Even if we had a more complicated model, the process for simulating from it would still be straightforward. This is one of the reasons Bayesian simulation approaches have become popular: computational power has gotten cheap, while doing math is as expensive as ever.
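
To make the speed vs precision trade-off concrete, the sketch below repeats the simulation at three sample sizes and reports the Monte Carlo standard error of each estimate. The posterior parameters are copied from the alpha1 and beta1 columns of the table; the seed and the helper name `prob_altuve_better` are arbitrary choices made for this illustration.

```r
set.seed(2017)  # arbitrary seed, only for reproducibility

# Posterior parameters copied from the alpha1 / beta1 columns of the table.
altuve_post <- c(alpha1 = 1123, beta1 = 2534)
biggio_post <- c(alpha1 = 3137, beta1 = 8035)

prob_altuve_better <- function(n_draws) {
  a <- rbeta(n_draws, altuve_post[["alpha1"]], altuve_post[["beta1"]])
  b <- rbeta(n_draws, biggio_post[["alpha1"]], biggio_post[["beta1"]])
  p <- mean(a > b)
  # Monte Carlo standard error of a simulated proportion
  c(draws = n_draws, estimate = p, mc_se = sqrt(p * (1 - p) / n_draws))
}

sim_precision <- t(sapply(c(1e3, 1e4, 1e5), prob_altuve_better))
sim_precision
```

The standard error shrinks with the square root of the number of draws, so each extra decimal place of precision costs roughly a hundred times more simulation.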

Integration

Each of the two posteriors is its own independent distribution, and together they form a joint distribution: a density over particular pairs of x and y. The joint distribution can be imagined as a density cloud:

library(tidyr)

x <- seq(.270, .312, .0002)
crossing(altuve_x = x, biggio_x = x) %>%
mutate(altuve_density = dbeta(altuve_x, altuve$alpha1, altuve$beta1),
biggio_density = dbeta(biggio_x, biggio$alpha1, biggio$beta1),
joint = altuve_density * biggio_density) %>%
ggplot(aes(altuve_x, biggio_x, fill = joint)) +
geom_tile() +
geom_abline() +
scale_fill_gradient2(low = "white", high = "red") +
labs(x = "Altuve batting average",
y = "Biggio batting average",
fill = "Joint density") +
theme(legend.position = "none")

Here we are asking what fraction of the joint probability density lies below the black line, where Altuve's average is greater than Biggio's. Clearly more lies below than above, in line with the simulated posterior probability of about 99.9% that Altuve is the better hitter.

Using numerical integration to calculate this quantitatively would look like this in R:

d <- .00002
limits <- seq(.26, .33, d)
sum(outer(limits, limits, function(x, y) {
(x > y) *
dbeta(x, altuve$alpha1, altuve$beta1) *
dbeta(y, biggio$alpha1, biggio$beta1) *
d ^ 2
}))

## [1] 0.997

The approach becomes harder to control in problems that have many dimensions, because the size of the grid grows exponentially with the number of parameters.
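
For two independent posteriors the double sum can also be collapsed into a single integral, P(Altuve > Biggio) = ∫ f_Altuve(x) · F_Biggio(x) dx, which base R's integrate() can handle. This is a sketch of that alternative, not the calculation used in the post; the parameters are copied from the table, and the integration range (0.2, 0.4) is an assumption chosen because both posteriors have essentially no mass outside it.

```r
altuve_post <- c(alpha1 = 1123, beta1 = 2534)
biggio_post <- c(alpha1 = 3137, beta1 = 8035)

# Integrand: Altuve's density at x times P(Biggio's average < x).
integrand <- function(x) {
  dbeta(x, altuve_post[["alpha1"]], altuve_post[["beta1"]]) *
    pbeta(x, biggio_post[["alpha1"]], biggio_post[["beta1"]])
}

# A narrow range helps integrate() locate the sharp peak of the integrand.
p_one_dim <- integrate(integrand, lower = 0.2, upper = 0.4)$value
p_one_dim
```

The one-dimensional form avoids building the multi-million-cell grid, though it only works this neatly when comparing exactly two independent distributions.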

Closed-form approximation

Closed-form approximation is a much faster approach. When α and β are both fairly large, the beta distribution starts looking like a normal distribution, so much so that it can be closely approximated by one.

If you draw the normal approximations to the Altuve and Biggio posteriors, they are visually indistinguishable from the betas:

two_players %>%
mutate(mu = alpha1 / (alpha1 + beta1),
var = alpha1 * beta1 / ((alpha1 + beta1) ^ 2 * (alpha1 + beta1 + 1))) %>%
inflate(x = seq(.26, .33, .00025)) %>%
mutate(density = dbeta(x, alpha1, beta1),
normal = dnorm(x, mu, sqrt(var))) %>%
ggplot(aes(x, density, group = name)) +
geom_line(aes(color = name)) +
geom_line(lty = 2)

The probability one normal is greater than another is very easy to calculate mathematically:

h_approx <- function(alpha_a, beta_a,
alpha_b, beta_b) {
u1 <- alpha_a / (alpha_a + beta_a)
u2 <- alpha_b / (alpha_b + beta_b)
var1 <- alpha_a * beta_a / ((alpha_a + beta_a) ^ 2 * (alpha_a + beta_a + 1))
var2 <- alpha_b * beta_b / ((alpha_b + beta_b) ^ 2 * (alpha_b + beta_b + 1))
pnorm(0, u2 - u1, sqrt(var1 + var2))
}

h_approx(altuve$alpha1, altuve$beta1, biggio$alpha1, biggio$beta1)

## [1] 0.999
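
The single pnorm call inside h_approx follows from a standard property of independent normals. Writing A for the first player's approximating normal and B for the second's (notation introduced here only for illustration):

```latex
A \sim \mathcal{N}(\mu_A, \sigma_A^2), \quad
B \sim \mathcal{N}(\mu_B, \sigma_B^2), \quad A \perp B
\;\Longrightarrow\;
B - A \sim \mathcal{N}\!\left(\mu_B - \mu_A,\; \sigma_A^2 + \sigma_B^2\right)

P(A > B) = P(B - A < 0)
         = \Phi\!\left(\frac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}\right)
```

In the code, u1 and var1 play the roles of μ_A and σ_A², u2 and var2 those of μ_B and σ_B², and pnorm(0, u2 - u1, sqrt(var1 + var2)) evaluates P(B − A < 0).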

The calculation is vectorizable in R. The downside is that for low α or low β, the normal approximation to the beta fits rather poorly, and the closed-form approximation is systematically biased: in some problems it gives too high an answer and in others too low. Because our prior already supplies sizeable α and β, we are safe using the closed-form approximation here.

Confidence and credible intervals

In frequentist statistics, this problem is framed as a contingency table comparing two proportions, such as:

Player — Hits — Misses

Craig Biggio — 3060 — 7816

Jose Altuve — 1046 — 2315

A common classical way to approach contingency table problems is with Pearson's chi-squared test, implemented in R as prop.test:

prop.test(two_players$H, two_players$AB)

##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: two_players$H out of two_players$AB
## X-squared = 10, df = 1, p-value = 9e-04
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0478 -0.0119
## sample estimates:
## prop 1 prop 2
## 0.281 0.311

The p-value is well below .05, which agrees with the posterior probability we computed.

prop.test also gives you a confidence interval for the difference between the two players.

Now we will use empirical Bayes to compute the credible interval for the difference between Altuve and Biggio. We could do this with simulation or integration, but we will use our normal approximation approach:

credible_interval_approx <- function(a, b, c, d) {
u1 <- a / (a + b)
u2 <- c / (c + d)
var1 <- a * b / ((a + b) ^ 2 * (a + b + 1))
var2 <- c * d / ((c + d) ^ 2 * (c + d + 1))

mu_diff <- u2 - u1
sd_diff <- sqrt(var1 + var2)

data_frame(posterior = pnorm(0, mu_diff, sd_diff),
estimate = mu_diff,
conf.low = qnorm(.025, mu_diff, sd_diff),
conf.high = qnorm(.975, mu_diff, sd_diff))
}
credible_interval_approx(altuve$alpha1, altuve$beta1, biggio$alpha1, biggio$beta1)

## # A tibble: 1 x 4
## posterior estimate conf.low conf.high
##
## 1 0.999 -0.0262 -0.0433 -0.00911
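
The simulation route mentioned above can be sketched like this, as a cross-check on the normal approximation rather than a replacement for it. The posterior parameters are copied from the table, the seed is arbitrary, and the sign convention (second player minus first) mirrors credible_interval_approx().

```r
set.seed(2017)  # arbitrary seed, only for reproducibility

n_draws <- 1e5
altuve_draws <- rbeta(n_draws, 1123, 2534)
biggio_draws <- rbeta(n_draws, 3137, 8035)

# Same sign convention as credible_interval_approx(): second player minus first.
diff_draws <- biggio_draws - altuve_draws

sim_interval <- c(
  posterior = mean(diff_draws < 0),
  estimate  = mean(diff_draws),
  conf.low  = unname(quantile(diff_draws, 0.025)),
  conf.high = unname(quantile(diff_draws, 0.975))
)
sim_interval
```

With posteriors this concentrated, the simulated quantiles should land very close to the normal-approximation interval printed above.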

set.seed(188)

intervals <- career_eb %>%
filter(AB > 10) %>%
sample_n(20) %>%
group_by(name, H, AB) %>%
do(credible_interval_approx(altuve$alpha1, altuve$beta1, .$alpha1, .$beta1)) %>%
ungroup() %>%
mutate(name = reorder(paste0(name, " (", H, " / ", AB, ")"), -estimate))

f <- function(H, AB) broom::tidy(prop.test(c(H, altuve$H), c(AB, altuve$AB)))
prop_tests <- purrr::map2_df(intervals$H, intervals$AB, f) %>%
mutate(estimate = estimate1 - estimate2,
name = intervals$name)

all_intervals <- bind_rows(
mutate(intervals, type = "Credible"),
mutate(prop_tests, type = "Confidence")
)

ggplot(all_intervals, aes(x = estimate, y = name, color = type)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
xlab("Altuve average - Player average") +
ylab("Player")

Because there is not a lot of information on certain players, their credible intervals end up narrower than their confidence intervals. This is because we are able to use the prior to adjust our expectations (Esix Snead may have ended up with a higher batting average than Altuve, but we are sure it was not .25 higher). When there is a lot of information, the confidence and credible intervals agree almost perfectly. Empirical Bayes A/B credible intervals are therefore a way to "shrink" frequentist confidence intervals by sharing power across players.
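
A small single-player illustration of that shrinkage, under stated assumptions: the prior values are back-calculated from the rounded table (α0 = 77, β0 = 219), and the player with 1 hit in 5 at-bats is hypothetical. It compares the width of a credible interval for one batting average with the width of the matching prop.test interval, rather than the difference-from-Altuve intervals plotted above.

```r
# Prior parameters back-calculated from the table (assumption, rounded).
alpha0 <- 77
beta0  <- 219

# Hypothetical player with very little data: 1 hit in 5 at-bats.
H  <- 1
AB <- 5

# 95% credible interval from the beta posterior.
credible <- qbeta(c(0.025, 0.975), alpha0 + H, beta0 + AB - H)

# suppressWarnings(): prop.test() warns that the chi-squared approximation
# may be poor for such a small sample.
confidence <- suppressWarnings(prop.test(H, AB)$conf.int)

widths <- c(credible = diff(credible), confidence = diff(confidence))
widths
```

The prior acts like a few hundred extra at-bats of shared evidence, which is why the credible interval is so much narrower when a player's own record is thin.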

Conclusion:

We are acting as if baseball players make up one homogeneous pool. This is mathematically convenient, but it ignores a lot of information about players: pitchers faced, stadiums played in, length of career. For instance, we ignore how short Altuve's career is compared to Biggio's 20-year career. This leads to bias, where empirical Bayes tends to overestimate players with very few at-bats.

Also, this post is ONLY comparing Altuve's BATTING AVERAGE to Biggio's and not taking into account how valuable Biggio was to the Astros over the years, starting at catcher, then moving to second base, and even dabbling in center field.

For a moving piece on Biggio, read Bill James's analysis of Craig Biggio here. Despite a little negativity, there is one thing James got spot on: "Biggio was the guy who would do whatever needed to be done."

Headshots of Allan Butler

Thu, 01 Jan 2015 04:35:00 GMT

These images may be used as headshots of Allan Butler for speaking and media appearances.

Click each image for high-quality, print-ready file.