<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Allan Butler — Articles</title>
    <link>https://allanbutler.com/</link>
    <description>I design and deploy AI systems inside large organizations.</description>
    <atom:link href="https://allanbutler.com/rss.xml" rel="self" type="application/rss+xml" />
    <language>en-US</language>
    <lastBuildDate>Thu, 04 Jun 2026 16:32:12 GMT</lastBuildDate>
    <item>
      <title>Smarter Grocery Search with Knowledge Graph RAG and DSPy</title>
      <link>https://allanbutler.com/smarter-grocery-search-knowledge-graph-rag-dspy/</link>
      <guid isPermaLink="true">https://allanbutler.com/smarter-grocery-search-knowledge-graph-rag-dspy/</guid>
      <pubDate>Mon, 20 Oct 2025 12:00:00 GMT</pubDate>
      <description>Problem In modern grocery retail, customers expect search experiences that are fast, relevant, and personalized. If you search for &quot;nut-free granola…</description>
      <content:encoded><![CDATA[<h2>Problem</h2>
<p>In modern grocery retail, customers expect search experiences that are fast, relevant, and personalized. If you search for &quot;nut-free granola under $5&quot;, a typical keyword search fails because it doesn&#39;t understand &quot;nut-free&quot; as an attribute and it might pull any &quot;granola&quot; regardless of price.</p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/56_1.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/56_1.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/56_1.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/56_1.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="56_1" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>This highlights three core challenges:</p>
<ol><li><strong>Multi-attribute complexity</strong> – Each product spans multiple structured fields: brand, category, nutrition, ingredients, dietary tags, and price. A single query can touch all of them.</li><li><strong>Free-form natural language</strong> – Shoppers don&#39;t speak in schemas. They mix attributes (&quot;nut-free&quot;), numeric filters (&quot;under $5&quot;), and categories (&quot;granola&quot;) in ways that don&#39;t align neatly to database fields.</li><li><strong>Explainability and trust</strong> – Customers want to know why a product is recommended, and merchandisers need to validate how items surface in search. Without transparency, trust erodes.</li></ol>
<p>Traditional keyword or embedding search struggles to consistently deliver relevance in this context. Traditional vector retrieval methods capture semantic similarity but struggle with constraints like price thresholds or categorical attributes. A Knowledge Graph offers a formal representation (G=(V,E)), where products, brands, and attributes are entities (V), and relationships such as <strong>HAS_ATTRIBUTE</strong> or <strong>IN_CATEGORY</strong> are edges (E). Queries like &quot;nut-free granola under $5&quot; can then be interpreted as subgraph patterns with attribute constraints and a numeric inequality, which classical vector spaces cannot enforce. This motivates a hybrid retriever that fuses:</p>
<ul><li><strong>Knowledge Graph</strong> → Precision, constraints, explainability.</li><li><strong>Vector Embeddings</strong> → Semantic recall.</li></ul>
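<p>To make the idea of &quot;attribute constraints plus a numeric inequality&quot; concrete, here is a minimal sketch in plain Python. The <code>Product</code> class, the <code>matches</code> helper, and the three catalog rows are all hypothetical and exist only for illustration; a real system would evaluate the same predicate against the graph rather than a list.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    # Hypothetical product record; a KG would store these as nodes and edges.
    name: str
    category: str
    price: float
    attributes: set = field(default_factory=set)


def matches(product, category, required_attrs, max_price):
    """Return True only when every structured constraint holds."""
    return (
        product.category == category                        # IN_CATEGORY edge
        and required_attrs.issubset(product.attributes)     # HAS_ATTRIBUTE edges
        and product.price < max_price                       # numeric inequality
    )


# Illustrative catalog (made-up rows, not real products or prices).
catalog = [
    Product("Granola A", "granola", 4.79, {"nut_free"}),
    Product("Granola B", "granola", 4.49, {"contains_nuts"}),
    Product("Granola C", "granola", 5.99, {"nut_free", "organic"}),
]

hits = [p.name for p in catalog if matches(p, "granola", {"nut_free"}, 5.0)]
print(hits)
```

<p>Only the first row passes all three checks. The second fails the attribute constraint and the third fails the price threshold, which is exactly the kind of hard rule a similarity score alone does not enforce.</p>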
<h2>Solution: Knowledge Graph RAG w/ DSPy</h2>
<p>This architecture generalizes the conventional Retrieval-Augmented Generation (RAG) paradigm. Rather than treating retrieval as a flat vector similarity operation, we augment it with structured graph-based reasoning. The result is a hybrid retriever that balances semantic flexibility with constraint enforcement.</p>
<p>To tackle these challenges, we combine three complementary pieces:</p>
<ul><li><strong>Vector embeddings</strong> – capture semantic similarity, so queries like &quot;granola&quot; and &quot;cereal&quot; don&#39;t miss relevant matches.</li><li><strong>Knowledge Graphs (KGs)</strong> – enforce structured reasoning, letting us filter by attributes (e.g., <code>HAS_ATTRIBUTE = nut-free</code>) and constraints (e.g., <code>PRICE &lt; 5</code>).</li><li><strong>DSPy</strong> – a framework for declaratively building LLM pipelines, so we can design hybrid retrieval systems that are modular, explainable, and easy to extend.</li></ul>
<p>This approach extends the familiar Retrieval-Augmented Generation (RAG) pattern. Instead of treating retrieval as a flat vector lookup, we enrich it with structured knowledge.</p>
<h2>Why Knowledge Graph RAG?</h2>
<p>A grocery product is not just a row in a table; it&#39;s better understood as a node in a network of relationships. Take something as simple as granola. It isn&#39;t defined only by its name. It&#39;s linked to a brand like H-E-B or Central Market, placed within a category such as Pantry → Granola, associated with ingredients like oats or almonds, described by attributes like nut-free or gluten-free, and tied to price metadata that could reflect everyday low price, promotions, or coupon eligibility.</p>
<p>This web of connections is what a Knowledge Graph (KG) captures. In a KG, edges describe meaning: a product <strong>HAS_ATTRIBUTE</strong> Nut-Free, <strong>IN_CATEGORY</strong> Granola, or <strong>MADE_BY</strong> H-E-B. That structure gives us more than just labels; it encodes the logic of how grocery items relate to one another.</p>
<p>Compare that to a Classic RAG pipeline:</p>
<blockquote>Search → Embedding → Vector DB → Retrieved Docs → LLM answers.</blockquote>
<p>This flow works well when the goal is retrieving unstructured text — FAQs, policy documents, articles. But it breaks down in retail search. Embeddings can tell us that &quot;granola&quot; is semantically similar to &quot;cereal.&quot; What they can&#39;t do reliably is enforce constraints like &quot;must be nut-free,&quot; &quot;price under $5,&quot; or &quot;belongs in the Pantry category.&quot; And those are exactly the rules shoppers care about.</p>
<p>Imagine a customer in Texas searching H-E-B Digital for &quot;organic salsa under $4.&quot; That query carries intent across multiple structured dimensions at once: a dietary attribute, a category, and a numeric filter. A vector-only search may capture the gist of &quot;salsa,&quot; but it often drops the fine-grained conditions that make the result meaningful.</p>
<p>This is why Knowledge Graph RAG matters. It blends the semantic flexibility of embeddings with the structured precision of graph reasoning. In practice, that means a product like H-E-B Nut-Free Crunch Granola ($4.79) is represented not just by text embeddings but by explicit graph links to its attributes, category, brand, and price. When retrieved, the system can explain itself:</p>
<blockquote>&quot;Recommended because it&#39;s granola, tagged nut-free, and priced under $5.&quot;</blockquote>
<p>The results create a system tuned for how people actually shop for groceries—combining natural language flexibility with structured, constraint-aware precision.</p>
<h2>Enter DSPy</h2>
<p>Designing hybrid retrieval pipelines with LLMs often turns into a mess of brittle prompt chains and glue code. That&#39;s where DSPy comes in: it lets us build the pipeline declaratively. Instead of hand-crafting prompts, you declare what the pipeline should do, and DSPy builds the prompts from that declaration.</p>
<p>The building blocks are simple:</p>
<ul><li><strong>Signatures</strong> – define inputs/outputs (e.g., <code>ProductSearchSignature</code>).</li><li><strong>Modules</strong> – compose retrieval + answer steps.</li><li><strong>Programs</strong> – orchestrate hybrid retrieval + answer generation.</li></ul>
<p>For example, a product search task can be expressed in just a few lines:</p>
<p><code>import dspy</code><br /><br /><code>class ProductSearchSignature(dspy.Signature):</code><br /><code>    &quot;&quot;&quot;Return product suggestions and key facts based on a grocery search query.&quot;&quot;&quot;</code><br /><code>    # Fields must be marked as inputs or outputs so DSPy can tell them apart</code><br /><code>    query: str = dspy.InputField()</code><br /><code>    hybrid_context: list[str] = dspy.InputField()</code><br /><code>    suggestions: str = dspy.OutputField()</code></p>
<p>With this declaration, DSPy automatically generates the right prompts behind the scenes. That means pipelines stay modular, explainable, and easier to maintain. You focus on what needs to happen (semantic + graph retrieval, ranking, explanation), not on how to hack together prompts.</p>
<p>In practice, this makes DSPy a natural fit for Knowledge Graph RAG in grocery search, where transparency and structured reasoning are just as important as semantic recall.</p>
<h2>Architecture</h2>
<p>The architecture integrates structured reasoning from a Knowledge Graph (KG) with semantic recall from vector retrieval, orchestrated through a declarative DSPy pipeline. Product data from the grocery catalog (brand, category, nutrition, labels, and price) is ingested into the KG, where it is linked to attributes and relationships such as <strong>HAS_ATTRIBUTE</strong> or <strong>IN_CATEGORY</strong>. At query time, a customer request is decomposed into both free-text (e.g., product names or descriptions) and structured constraints (e.g., nut-free, price &lt; $5). The KG enforces attribute and numeric filters, while embeddings capture broader semantic matches. Candidate products retrieved from both channels are passed to the LLM layer, where DSPy coordinates hybrid reasoning and explanation. This final stage produces not only ranked recommendations but also explicit justifications (e.g., &quot;recommended because it is granola, tagged nut-free, and priced under $5&quot;), ensuring transparency and trust in the system.</p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/56_2.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/56_2.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/56_2.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/56_2.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="56_2" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
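<p>The query-time decomposition described above can be sketched with a few rules. This is only a stand-in for illustration: the <code>decompose</code> function, the <code>KNOWN_ATTRIBUTES</code> vocabulary, and the &quot;under $N&quot; pattern are assumptions, and a production system might well use an LLM or a trained parser for this step instead.</p>

```python
import re

# Hypothetical attribute vocabulary mapping query phrases to KG attribute tags.
KNOWN_ATTRIBUTES = {
    "nut-free": "nut_free",
    "gluten-free": "gluten_free",
    "organic": "organic",
}


def decompose(query):
    """Split a query into free text and structured constraints (rule-based sketch)."""
    text = query.lower()
    constraints = {"attributes": [], "max_price": None}

    # Numeric filter: only the "under $N" phrasing is handled here.
    price = re.search(r"under \$(\d+(?:\.\d+)?)", text)
    if price:
        constraints["max_price"] = float(price.group(1))
        text = text.replace(price.group(0), " ")

    # Attribute filters: exact phrase lookup against the vocabulary.
    for phrase, tag in KNOWN_ATTRIBUTES.items():
        if phrase in text:
            constraints["attributes"].append(tag)
            text = text.replace(phrase, " ")

    # Whatever remains is treated as free text for the vector channel.
    free_text = " ".join(text.split())
    return free_text, constraints


free_text, constraints = decompose("nut-free granola under $5")
print(free_text, constraints)
```

<p>The free text goes to the embedding search, while the constraints dictionary is what the KG channel would enforce.</p>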
<h2>Sample Grocery Dataset</h2>
<p><code>product_id,name,brand,category,sub_category,price,ingredients,attributes</code><br /><code>1,HEB Oats &amp; Honey Granola,H-E-B,Pantry,Cereal &amp; Granola,4.49,&quot;Whole grain oats,honey,almonds&quot;,&quot;contains_nuts;vegetarian&quot;</code><br /><code>2,Central Market Organic Granola Low Sugar,Central Market,Pantry,Cereal &amp; Granola,5.99,&quot;Oats,coconut,chia,monk fruit&quot;,&quot;organic;low_sugar;vegan&quot;</code><br /><code>...</code></p>
<h2>Vector Store (FAISS + Embeddings)</h2>
<p>Dense vector representations form the backbone of modern search. They map products and queries into a shared continuous space where similarity is measured by cosine similarity or inner product. Historically, models like Word2Vec and GloVe used 200–300 dimensions; transformers like BERT/SBERT expanded this to ~768; and today&#39;s API embeddings often run 1,536–4,096 dimensions. Results on benchmarks like MTEB suggest that higher-dimensional embeddings tend to improve recall and coverage, but at the cost of speed, memory, and storage.</p>
<p>For grocery search, embeddings help generalize semantically (&quot;granola&quot; ≈ &quot;cereal&quot;) and capture brand or description similarity. But dimensionality alone cannot enforce structured rules like <code>HAS_ATTRIBUTE = nut_free</code> or <code>PRICE &lt; 5</code>. This is why we need a hybrid approach: embeddings for semantic recall, knowledge graphs for constraints and explainability.</p>
<p>We use SentenceTransformers + FAISS to encode product text (name, brand, category, attributes, nutrition).</p>
<p><code>from sentence_transformers import SentenceTransformer</code><br /><code>import faiss</code><br /><br /><code>model = SentenceTransformer(&quot;all-MiniLM-L6-v2&quot;)</code><br /><code>embs = model.encode(product_texts, normalize_embeddings=True)</code><br /><code>index = faiss.IndexFlatIP(embs.shape[1])</code><br /><code>index.add(embs.astype(&quot;float32&quot;))</code></p>
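<p>One detail worth spelling out: the embeddings above are encoded with <code>normalize_embeddings=True</code>, so the inner product that <code>IndexFlatIP</code> computes is the same number as cosine similarity. The small check below uses plain Python and two made-up 2-d vectors (not real embeddings) to show the equivalence.</p>

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors chosen for easy arithmetic; real embeddings have hundreds of dimensions.
a, b = [3.0, 4.0], [4.0, 3.0]

ip_after_norm = dot(normalize(a), normalize(b))
cos = cosine(a, b)
print(ip_after_norm, cos)
```

<p>Both values agree up to floating-point error, which is why normalizing once at indexing time lets a flat inner-product index act as a cosine index.</p>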
<h2>Knowledge Graph Representation</h2>
<p>We also ingest the dataset into a KG for explicit reasoning:</p>
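<p><code>import networkx as nx</code><br /><br /><code>G = nx.Graph()</code><br /><code>for _, r in df.iterrows():</code><br /><code>    pid = f&quot;product:{r.product_id}&quot;</code><br /><code>    # Use r[&quot;name&quot;]: r.name would return the row&#39;s index label, not the column</code><br /><code>    G.add_node(pid, label=&quot;Product&quot;, name=r[&quot;name&quot;], brand=r.brand, price=r.price)</code><br /><code>    # Link to category</code><br /><code>    G.add_node(f&quot;cat:{r.category}&quot;, label=&quot;Category&quot;)</code><br /><code>    G.add_edge(pid, f&quot;cat:{r.category}&quot;, type=&quot;IN_CATEGORY&quot;)</code><br /><code>    # Link to each attribute (the column is semicolon-delimited)</code><br /><code>    for attr in str(r.attributes).split(&quot;;&quot;):</code><br /><code>        G.add_node(f&quot;attr:{attr}&quot;, label=&quot;Attribute&quot;)</code><br /><code>        G.add_edge(pid, f&quot;attr:{attr}&quot;, type=&quot;HAS_ATTRIBUTE&quot;)</code></p>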
<p>This allows us to query structured relationships.</p>
<h2>Hybrid Retrieval</h2>
<p>To combine both sources of relevance:</p>
<p><code>vec_results = vector_search(query, k=6)</code><br /><code>kg_results = kg_search(query, k=6)</code><br /><br /><code>context_texts = [r[&quot;text&quot;] for r in vec_results + kg_results]</code></p>
<p>The LLM now sees semantic hits + structured facts.</p>
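<p>Simple concatenation can pass the same product to the LLM twice when both channels return it. A minimal sketch of an order-preserving merge is below; the <code>merge_results</code> helper and the <code>id</code> key on each result are assumptions for illustration, since the snippet above only shows a <code>text</code> key.</p>

```python
def merge_results(vec_results, kg_results):
    """Concatenate both result lists, keeping the first occurrence of each product id."""
    seen = set()
    merged = []
    for r in vec_results + kg_results:
        if r["id"] not in seen:
            seen.add(r["id"])
            merged.append(r)
    return merged


# Made-up results: product 2 is returned by both channels.
vec_results = [
    {"id": 1, "text": "Granola A, pantry, 4.49"},
    {"id": 2, "text": "Granola B, pantry, nut_free, 4.79"},
]
kg_results = [
    {"id": 2, "text": "Granola B, pantry, nut_free, 4.79"},
    {"id": 3, "text": "Granola C, pantry, nut_free, 3.99"},
]

context_texts = [r["text"] for r in merge_results(vec_results, kg_results)]
print(len(context_texts))
```

<p>Deduplicating keeps the context shorter without changing which products the LLM can see.</p>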
<h2>DSPy Pipeline</h2>
<p>We define a DSPy signature and module for search &amp; answering:</p>
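<p><code>import dspy</code><br /><br /><code>class ProductSearchSignature(dspy.Signature):</code><br /><code>    &quot;&quot;&quot;Return product suggestions and key facts based on a grocery search query.&quot;&quot;&quot;</code><br /><code>    query: str = dspy.InputField()</code><br /><code>    hybrid_context: list[str] = dspy.InputField()</code><br /><code>    suggestions: str = dspy.OutputField()</code><br /><br /><code>class HybridSearchProgram(dspy.Module):</code><br /><code>    def __init__(self):</code><br /><code>        super().__init__()</code><br /><code>        self.search_llm = dspy.Predict(ProductSearchSignature)</code><br /><br /><code>    def forward(self, query: str):</code><br /><code>        # Retrieve from both channels, then hand the combined context to the LLM</code><br /><code>        vec = vector_search(query)</code><br /><code>        kg = kg_search(query)</code><br /><code>        context = [r[&quot;text&quot;] for r in vec + kg]</code><br /><code>        pred = self.search_llm(query=query, hybrid_context=context)</code><br /><code>        return pred.suggestions</code></p>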
<p><code>HybridSearchProgram</code> merges vector + KG retrieval.</p>
<p>DSPy generates prompts under the hood, which keeps the pipeline modular and transparent. It uses the docstring and field names you define in your <code>Signature</code> to build the instructions in the prompt.</p>
<h2>Walkthrough</h2>
<p><strong>Example:</strong> &quot;nut-free granola under $5&quot;</p>
<ol><li>Vector Search finds granola products.</li><li>KG filters for attribute = nut-free and price &lt; 5.</li><li>Result: H-E-B Nut-Free Crunch Granola ($4.79).</li></ol>
<h2>Explainability: Why Did This Product Rank?</h2>
<p>One of the biggest pain points in grocery search is that results often feel like a black box. Shoppers see a product surface, but they don&#39;t know why. Did it match a keyword? Was it the cheapest? Or was it just similar text in the description? That&#39;s not good enough when customers are filtering by dietary needs and health attributes. Grocery catalogs are packed with metadata—organic, nut-free, gluten-free, low sodium, high protein—and customers expect search to honor those signals. If a parent is shopping for a child with a nut allergy, they don&#39;t just want &quot;granola.&quot; They want to know it&#39;s nut-free and still within budget.</p>
<p>This is where Knowledge Graph RAG changes the game. Because products are represented as nodes connected to explicit attributes, the system can explain itself:</p>
<blockquote>&quot;Recommended because it&#39;s granola, tagged nut-free, and priced under $5.&quot;</blockquote>
<p>That simple explanation builds trust with shoppers who can see their intent and why certain items surfaced.</p>
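<p>Because the matched constraints are explicit, an explanation like the one above can be assembled directly from them. The sketch below is one hypothetical way to do it: the <code>explain</code> function and the dictionary layout are assumptions, while the product name and price come from the example earlier in this article.</p>

```python
def explain(product, category, required_attrs, max_price):
    """Build a human-readable justification from the constraints a product satisfies."""
    reasons = []
    if product["category"] == category:
        reasons.append(f"it's {category}")
    for attr in required_attrs:
        if attr in product["attributes"]:
            reasons.append(f"tagged {attr}")
    if product["price"] < max_price:
        reasons.append(f"priced under ${max_price:g}")

    if not reasons:
        return "No matching constraints."
    if len(reasons) == 1:
        body = reasons[0]
    elif len(reasons) == 2:
        body = " and ".join(reasons)
    else:
        body = ", ".join(reasons[:-1]) + ", and " + reasons[-1]
    return f"Recommended because {body}."


product = {
    "name": "H-E-B Nut-Free Crunch Granola",
    "category": "granola",
    "price": 4.79,
    "attributes": ["nut-free"],
}

message = explain(product, "granola", ["nut-free"], 5)
print(message)
```

<p>Each clause in the message maps back to one edge or property in the graph, which is what makes the result auditable for merchandisers as well as shoppers.</p>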
<h2>Conclusion</h2>
<p>Integrating vector embeddings, knowledge graphs, and DSPy yields a retrieval architecture that aligns with the complexity of modern grocery search. Embeddings provide semantic recall, knowledge graphs enforce attribute and numeric constraints, and DSPy ensures that the pipeline remains modular and declarative. The result is a system that is:</p>
<ul><li><strong>Constraint-aware</strong> – results respect attributes and thresholds rather than relying solely on lexical matches.</li><li><strong>Explainable</strong> – recommendations are transparent and auditable, enabling both shopper trust and merchandiser validation.</li><li><strong>Maintainable</strong> – the declarative design simplifies extension and long-term support.</li></ul>
<p>For grocery retail, where discovery often hinges on nuanced attributes like organic, nut-free, or low sodium, this hybrid approach unlocks better discovery. It means customers can find exactly what they need, with confidence, while H-E-B can deliver on the promise of &quot;Here Everything&#39;s Better&quot; in the digital space as well. And with DSPy, the pipeline stays clean, modular, and transparent.</p>
<h2>How to Try It</h2>
<ol><li>Clone the repo (<a href="https://github.com/allanbutler/kg-rag-grocery">GitHub link here</a>).</li><li><code>poetry install</code></li><li><code>poetry shell</code></li><li><code>poetry run python -m sgs prepare-data</code> → builds KG + FAISS index.</li><li><code>poetry run python -m sgs run-server</code> → starts API.</li><li><code>curl &#39;http://127.0.0.1:8000/search?q=nut-free%20granola&#39;</code></li></ol>
<h2>Sources</h2>
<ul><li><a href="https://neo4j.com/blog/developer/rag-tutorial/">RAG</a></li><li><a href="https://github.com/stanfordnlp/dspy">DSPy</a></li><li><a href="https://pedramnavid.com/blog/dspy-part-one/">DSPy Case Study</a></li><li><a href="https://www.youtube.com/watch?v=JEMYuzrKLUw">DSPy Lecture</a></li><li><a href="https://gitlab.com/butler.allan-heb/smart-grocery-search">Repo</a></li></ul>]]></content:encoded>
    </item>
    <item>
      <title>What&apos;s In A Name?</title>
      <link>https://allanbutler.com/whats-in-a-name/</link>
      <guid isPermaLink="true">https://allanbutler.com/whats-in-a-name/</guid>
      <pubDate>Sun, 15 May 2022 12:00:00 GMT</pubDate>
      <description>What Is In A Baby Name? Becoming a first-time parent is a daunting task in an individual&apos;s life. From the many baby books to all the gadgets (hot take:…</description>
      <content:encoded><![CDATA[<h4>What Is In A Baby Name?</h4>
<p>Becoming a first-time parent is a daunting task in an individual&#39;s life. There are the many <a href="https://www.amazon.com/Baby-Book/s?k=Baby+Book">baby books</a> to read and all the gadgets (hot take: you don&#39;t need all the gadgets) to purchase for the individual who will soon become your new roomie. With all the chaos that comes in those 9 short months, one of the most challenging tasks can be coming up with a name. Using the Social Security card application baby names data from 2010 to 2020, I took a data-driven approach to this problem.</p>
<p>We want to pick a name that is not the most popular or a passing trend, that is unique enough for our family tree, and that is true to our family&#39;s culture.</p>
<h4>The approach is as follows:</h4>
<ul><li>Run simple counts to find the most and least popular names overall.</li><li>Compute year-over-year differences in popularity.</li><li>Find names that spike suddenly &amp; then drop off, as a proxy for trendy names.</li></ul>
<p><code>import matplotlib.pyplot as plt</code><br /><code>import numpy as np</code><br /><code>import pandas as pd</code><br /><code>import seaborn as sns</code><br /><br /><code>df_m = pd.read_csv(&quot;data_b.csv&quot;, sep=&#39;\t&#39;)</code></p>
<p>Once the data is imported &amp; filtered for male-only names, we take a quick look at our four columns of interest.</p>
<ul><li>year</li><li>name</li><li>gender</li><li>count</li></ul>
<p>For a quick look at the top 5 names we run a simple aggregate by name and count using pandas.</p>
<p><code># Grab top 5 names</code><br /><code># as_index is a groupby option, not an agg option; reset_index() turns name back into a column</code><br /><code>df_m_sum = df_m.groupby(&#39;name&#39;)[&#39;count&#39;].agg([&#39;sum&#39;, &#39;max&#39;]).reset_index()</code><br /><br /><code>df_m_sum.nlargest(5, [&#39;sum&#39;])</code></p>
<p><strong>name  —  sum  —  max</strong></p>
<p>Noah  —  201245  —  19650</p>
<p>Liam  —  193376  —  20555</p>
<p>William  —  172238  —  17347</p>
<p>Jacob  —  172154  —  22139</p>
<p>Mason  —  167681  —  19518</p>
<p>Next, let&#39;s examine the fastest growing names from 2010 to 2020. We do this by creating two separate dataframes, one per year, and then using the <code>merge</code> function in pandas to join them and calculate the growth column: <code>(latest_year - first_year) / first_year</code>. The result is a fraction, so multiply by 100 for a percentage.</p>
<p><code># Fastest Growing Names (2010 - 2020)</code><br /><br /><code>df_2010 = df_m[df_m[&quot;year&quot;] == 2010]</code><br /><code>df_2020 = df_m[df_m[&quot;year&quot;] == 2020]</code><br /><br /><code>df_yoy_all = pd.merge(df_2010, df_2020, on=&quot;name&quot;)</code><br /><code># x is 2010, y is 2020</code><br /><br /><code># Keep names with more than 5000 births in 2010</code><br /><code># .copy() avoids pandas&#39; SettingWithCopyWarning when adding a column</code><br /><code>df_yoy = df_yoy_all[df_yoy_all[&quot;count_x&quot;] &gt; 5000].copy()</code><br /><br /><code># Create growth metric as a fraction (multiply by 100 for a percentage)</code><br /><code># (2020 - 2010) / 2010</code><br /><code>df_yoy[&quot;growth&quot;] = (df_yoy[&quot;count_y&quot;] - df_yoy[&quot;count_x&quot;]) / df_yoy[&quot;count_x&quot;]</code></p>
<p><code>df_yoy.nlargest(10, [&#39;growth&#39;])</code></p>
<p><strong>year_x  —  name  —  gender_x  —  count_x  —  year_y  —  gender_y  —  count_y  —  growth</strong></p>
<p>2010  —  Liam  —  M  —  10928  —  2020  —  M  —  19659  —  0.798957</p>
<p>2010  —  Henry  —  M  —  6399  —  2020  —  M  —  10705  —  0.672918</p>
<p>2010  —  Levi  —  M  —  6016  —  2020  —  M  —  9005  —  0.496842</p>
<p>2010  —  Sebasti  —  M  —  6361  —  2020  —  M  —  8927  —  0.403396</p>
<p>2010  —  Josiah  —  M  —  5206  —  2020  —  M  —  6077  —  0.167307</p>
<p>2010  —  Noah  —  M  —  16460  —  2020  —  M  —  18252  —  0.108870</p>
<p>2010  —  Wyatt  —  M  —  7374  —  2020  —  M  —  8135  —  0.103200</p>
<p>2010  —  Lucas  —  M  —  10379  —  2020  —  M  —  11281  —  0.086906</p>
<p>2010  —  Owen  —  M  —  8176  —  2020  —  M  —  8623  —  0.054672</p>
<p>2010  —  Jack  —  M  —  8519  —  2020  —  M  —  8876  —  0.041906</p>
<p><code>df_yoy.nsmallest(10, [&#39;growth&#39;])</code></p>
<p><strong>year_x  —  name  —  gender_x  —  count_x  —  year_y  —  gender_y  —  count_y  —  growth</strong></p>
<p>2010  —  Tyler  —  M  —  10450  —  2020  —  M  —  2771  —  -0.734833</p>
<p>2010  —  Gavin  —  M  —  9619  —  2020  —  M  —  2570  —  -0.732820</p>
<p>2010  —  Brandon  —  M  —  8547  —  2020  —  M  —  2287  —  -0.732421</p>
<p>2010  —  Justin  —  M  —  7848  —  2020  —  M  —  2277  —  -0.709862</p>
<p>2010  —  Kevin  —  M  —  7324  —  2020  —  M  —  2359  —  -0.677908</p>
<p>2010  —  Evan  —  M  —  9730  —  2020  —  M  —  3389  —  -0.651696</p>
<p>2010  —  Brayden  —  M  —  9113  —  2020  —  M  —  3253  —  -0.643037</p>
<p>2010  —  Zachary  —  M  —  7180  —  2020  —  M  —  2698  —  -0.624234</p>
<p>2010  —  Joshua  —  M  —  15448  —  2020  —  M  —  5924  —  -0.616520</p>
<p>2010  —  Jayden  —  M  —  17189  —  2020  —  M  —  7102  —  -0.586829</p>
<p>A quick look at the top 10 largest &amp; smallest growing names over the 10 year span tells us that Liam is the fastest growing and Tyler is the name that is shrinking the most. I&#39;ve filtered the dataset to include only names with over 5000 counts beginning in the year 2010.</p>
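<p>As a sanity check, the growth values in the tables can be reproduced by hand from the 2010 and 2020 counts. The <code>growth</code> helper below is a hypothetical standalone version of the column calculation; the counts are taken from the Liam and Tyler rows above.</p>

```python
def growth(first_year_count, latest_year_count):
    """Fractional change from the first year to the latest year."""
    return (latest_year_count - first_year_count) / first_year_count


# Counts copied from the result tables: (2010 count, 2020 count).
liam = growth(10928, 19659)
tyler = growth(10450, 2771)

print(f"Liam: {liam:.6f}")
print(f"Tyler: {tyler:.6f}")
```

<p>Liam comes out at roughly +80% and Tyler at roughly -73%, matching the <code>growth</code> column.</p>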
<p><code># Keep every name with more than 1 birth in 2010 so names of interest can be looked up</code><br /><code># .copy() avoids pandas&#39; SettingWithCopyWarning when adding a column</code><br /><code>df_int = df_yoy_all[df_yoy_all[&quot;count_x&quot;] &gt; 1].copy()</code><br /><br /><code>df_int[&quot;growth&quot;] = (df_int[&quot;count_y&quot;] - df_int[&quot;count_x&quot;])/(df_int[&quot;count_x&quot;])</code></p>
<p>Creating a function to explore any name of interest will be a valuable reusable asset.</p>
<p><code># Create function to look up any name of interest</code><br /><code>name_list = [&#39;Paxton&#39;, &#39;Parker&#39;, &#39;Ethan&#39;, &#39;Hayden&#39;]</code><br /><br /><code>def find_name(search: str):</code><br /><code>    return (df_int[df_int[&#39;name&#39;].str.contains(search)])</code><br /><br /><code>def find_list(search: list):</code><br /><code>    return df_int[df_int[&#39;name&#39;].isin(search)].sort_values(&quot;growth&quot;, ascending=False)</code></p>
<p><code>search = [&#39;Allan&#39;, &#39;Paxton&#39;, &#39;Parker&#39;, &#39;Ethan&#39;, &#39;George&#39;, &#39;Dee&#39;, &#39;Hayden&#39;, &#39;Enzo&#39;]</code><br /><br /><code>find_list(search)</code></p>
<p><strong>year_x  —  name  —  gender_x  —  count_x  —  year_y  —  gender_y  —  count_y  —  growth</strong></p>
<p>2010  —  Enzo  —  M  —  602  —  2020  —  M  —  2201  —  2.656146</p>
<p>2010  —  Dee  —  M  —  5  —  2020  —  M  —  6  —  0.200000</p>
<p>2010  —  Paxton  —  M  —  1110  —  2020  —  M  —  1286  —  0.158559</p>
<p>2010  —  George  —  M  —  2373  —  2020  —  M  —  2746  —  0.157185</p>
<p>2010  —  Parker  —  M  —  4732  —  2020  —  M  —  3797  —  -0.197591</p>
<p>2010  —  Allan  —  M  —  403  —  2020  —  M  —  277  —  -0.312655</p>
<p>2010  —  Ethan  —  M  —  18006  —  2020  —  M  —  9464  —  -0.474397</p>
<p>2010  —  Hayden  —  M  —  4191  —  2020  —  M  —  2146  —  -0.487950</p>
<p><code>find_name(&quot;Hayden&quot;)</code></p>
<p><strong>year_x  —  name  —  gender_x  —  count_x  —  year_y  —  gender_y  —  count_y  —  growth</strong></p>
<p>2010  —  Hayden  —  M  —  4191  —  2020  —  M  —  2146  —  -0.48795</p>
<h4>Plot Most Trendy Names</h4>
<p>Plotting the overall growth gives us some insights, but let&#39;s break that calculation out by year to get a better sense of the growth trend.</p>
<p>Let&#39;s observe how the all-time most popular names have grown over the years instead of just observing the 10-year growth. We can accomplish this by first creating a pivot df.</p>
<p><code>pivot_df = df_m.pivot_table(index=&quot;name&quot;, columns=&quot;year&quot;, values=&quot;count&quot;, aggfunc=&quot;sum&quot;).fillna(0)</code><br /><br /><code># Now we calculate the percentage of each name by year.</code><br /><br /><code>perc_df = pivot_df / pivot_df.sum() * 100</code><br /><br /><code># Then add a new column with the cumulative percentages sum.</code><br /><code>perc_df[&quot;total&quot;] = perc_df.sum(axis=1)</code><br /><br /><code>sort_df = perc_df.sort_values(by=&quot;total&quot;, ascending=False).drop(&quot;total&quot;, axis=1)[0:10]</code><br /><br /><code>transpose_df = sort_df.transpose()</code><br /><code>transpose_df.head(5)</code></p>
<p>We sort the dataframe to check which are the top values and slice the data appropriately. Lastly, we drop the <code>total</code> column and flip the axes to make plotting the data easier.</p>
<p><strong>name  —  Noah  —  Liam  —  William  —  Jacob  —  Mason  —  Ethan  —  Michael  —  James  —  Alexander  —  Elijah</strong></p>
<p>2010  —  0.858554  —  0.570005  —  0.889746  —  1.154771  —  0.774524  —  0.939193  —  0.905550  —  0.724346  —  0.874046  —  0.725285</p>
<p>2011  —  0.889028  —  0.708355  —  0.914274  —  1.074023  —  1.028696  —  0.879594  —  0.885813  —  0.698658  —  0.827679  —  0.736711</p>
<p>2012  —  0.916126  —  0.886996  —  0.891535  —  1.007317  —  1.001406  —  0.933172  —  0.854119  —  0.709259  —  0.804302  —  0.732743</p>
<p>2013  —  0.966854  —  0.960396  —  0.881369  —  0.962090  —  0.937371  —  0.860090  —  0.821397  —  0.718762  —  0.789320  —  0.730778</p>
<p>2014  —  1.007213  —  0.962950  —  0.877551  —  0.880575  —  0.897154  —  0.820202  —  0.806594  —  0.752946  —  0.804352  —  0.722134</p>
<p><code>import plotly.express as px</code><br /><br /><code>plot = px.line(transpose_df, x=transpose_df.index, y=transpose_df.columns, title=&quot;Top 10 Trendy Names&quot;)</code><br /><code>plot.show()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/54_1.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/54_1.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/54_1.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/54_1.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="54_1" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.1 Trendy Baby Names Over Time.</p>
<p>Liam is still the most &#39;trendy&#39; &amp; popular name, according to growth, over the last 10 years.</p>
<p>I&#39;m going to create another function to grab the year where the name of interest is the highest.</p>
<p><code>def when_most_births(name):</code><br /><br /><code>    if name in set(df_m[&quot;name&quot;]):</code><br /><br /><code>        highest = df_m[df_m[&quot;name&quot;] == name].groupby(&quot;year&quot;)[&quot;count&quot;].sum().sort_values(ascending = False)[:1]</code><br /><code>        in_2020 = df_m[(df_m[&quot;name&quot;] == name) &amp; (df_m[&quot;year&quot;] == 2020)][&quot;count&quot;].sum()</code><br /><br /><code>        print(&quot;Name {} was most popular in {} with {} kids given this name.\n&quot;.format(name, int(highest.index[0]), highest.iloc[0]))</code><br /><br /><code>        print(&#39;In 2020 there were {} babies in total who were given the name {}.\n&#39;.format(in_2020, name))</code><br /><br /><code>        px.line(df_m[df_m[&quot;name&quot;] == name], x=&quot;year&quot;, y=&quot;count&quot;, color = &quot;name&quot;, title=f&quot;Baby Name {name} Over Time&quot;).show()</code><br /><br /><code>    else:</code><br /><code>        print(f&quot;Name {name} is not in the database.&quot;)</code></p>
<p><code>when_most_births(&quot;Enzo&quot;)</code></p>
<p>Name Enzo was most popular in 2020 with 2201 kids given this name.</p>
<p>In 2020 there were 2201 babies in total who were given the name Enzo.</p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/54_2.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/54_2.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/54_2.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/54_2.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="54_2" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.2 Most Popular Over Time.</p>
<p>Using a function from a Kaggle notebook, we will:</p>
<h4>Create a metric that measures a spike followed by a drop-off.</h4>
<ul><li>Divide a name&#39;s maximum yearly count by its total count.</li></ul>
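<p>To see why this ratio picks out spikes, here is a small sketch on two made-up series of yearly counts. The <code>spike_fall</code> function and both series are hypothetical and are not drawn from the dataset.</p>

```python
def spike_fall(yearly_counts):
    """Share of a name's total births that fall in its single best year."""
    return max(yearly_counts) / sum(yearly_counts)


# Invented yearly counts for two imaginary names.
steady = [100, 100, 100, 100]   # same popularity every year
spiky = [10, 400, 30, 10]       # one big year, then a fade

steady_score = spike_fall(steady)
spiky_score = spike_fall(spiky)
print(steady_score, spiky_score)
```

<p>A perfectly steady name scores 1 divided by the number of years, the lowest possible value, while a name whose births are concentrated in one year scores close to 1.</p>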
<h4>Most Sudden Names</h4>
<p><code>df = df_m.groupby([&#39;name&#39;, &#39;gender&#39;])[&#39;count&#39;].agg([&#39;sum&#39;, &#39;max&#39;])</code><br /><br /><code>df_ = df.reset_index()</code><br /><br /><code>df_[&#39;spike_fall&#39;] = df_[&#39;max&#39;]/df_[&#39;sum&#39;]</code><br /><br /><code>popular = df_.sort_values(by=&#39;spike_fall&#39;,ascending=False)</code><br /><br /><code>popular_df = popular[popular[&quot;sum&quot;] &gt; 5000]</code><br /><code>popular_df.head(5)</code></p>
<p>Let&#39;s use our function <code>when_most_births</code> to plot the names we want to examine for spikes/falls.</p>
<p><code>when_most_births(&quot;Jase&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/54_3.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/54_3.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/54_3.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/54_3.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="54_3" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.3 Spike-Fall Over Time.</p>
<p>Name Jase was most popular in 2013 with 4552 kids given this name.</p>
<p>In 2020 there were 624 babies in total who were given the name Jase.</p>
<h4>Examining The Spike-Fall Names</h4>
<ul><li>Jase is a great example of what the spike/fall metric captures: a name that peaked in 2013 and has dropped in popularity since.</li><li>For some highly ranked spike/fall names we do not see the fade because their peak year is the last one in the dataset.</li></ul>
<p>As you might imagine, this is not the end of finding a baby name. Some open questions are:</p>
<ul><li>How do I actually use this data to choose a name and not just use the analysis for avoiding names?</li><li>What if a trendy name is something we want?</li></ul>
<p>Further analysis could look at names given to both genders to create a metric that finds the optimal gender-neutral name.</p>
<p>We solved the initial problem of avoiding specific names, but the question of interest is still open. Luckily, we have 6 months remaining to decide on a name.</p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/54_4.jpg" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/54_4.jpg 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/54_4.jpg 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/54_4.jpg 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="54_4" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>]]></content:encoded>
    </item>
    <item>
      <title>Forecasting Super Bowl Sales</title>
      <link>https://allanbutler.com/forecasting-super-bowl-sales/</link>
      <guid isPermaLink="true">https://allanbutler.com/forecasting-super-bowl-sales/</guid>
      <pubDate>Thu, 24 Jan 2019 12:00:00 GMT</pubDate>
      <description>Time Series Forecasting EDA &amp; Data Preparation Time series analysis is a very useful tool businesses can use to assist in their decision-making process.…</description>
      <content:encoded><![CDATA[<h3>Time Series Forecasting</h3>
<h4>EDA &amp; Data Preparation</h4>
<p>Time series analysis is a very useful tool businesses can use to assist in their decision-making process. We all know that &quot;No model will be 100% accurate but some models are useful.&quot; There are numerous time series methods and techniques that can be used, but for this example we will be using the <a href="http://www.business-science.io/r-packages.html">Business Science</a> collection of open-source software packages. After recently attending the <a href="https://www.rstudio.com/conference/">RStudio Conference</a>, I should add that the <a href="https://github.com/cran/tsibble">tsibble</a> and <a href="https://github.com/tidyverts/fable">fable</a> packages could be used for this analysis as well. The concepts I use when beginning any type of data analysis come heavily from Hadley Wickham and Garrett Grolemund&#39;s <a href="http://r4ds.had.co.nz/">R4DS</a>. The analysis pipeline that I follow always begins with three questions: what is the business task at hand, which data science tools can help tackle it, and what question do we want answered? The process is straightforward and usually leads to more questions, insights, and steps towards an actionable outcome.</p>
<p>The business problem is to estimate future Super Bowl ticket sales. Past sales data can help improve forecasts and generate models that describe the main factors of influence. We can then use the analysis to develop actionable outcomes based on what we have learned. The first step is loading our packages and reading in the data. Usually I would read the data from a database, but for clarity and simplicity we will read in a CSV file.</p>
<p><code>library(tidyverse)</code><br /><code>library(lubridate)</code><br /><code>library(timetk)</code><br /><code>library(tidyquant)</code><br /><code>library(broom)</code><br /><code>library(modelr)</code><br /><code>library(caret)</code><br /><code>library(gridExtra)</code><br /><br /><code>SB &lt;- read_csv(&quot;SB.csv&quot;) %&gt;%</code><br /><code>  mutate(Event_Date = mdy(Event_Date), Sale_Date = mdy(Sale_Date), days_to_event = (Event_Date - Sale_Date))</code></p>
<p>The best way to get an understanding of your data is to create different visualizations. Let&#39;s start with yearly sales.</p>
<h4>Sales over time</h4>
<p>To begin our exploratory analysis we will take a look at sales over time.</p>
<p><code># Create a sales by year data frame</code><br /><code>salesByYear &lt;- SB %&gt;%</code><br /><code>  group_by(Year) %&gt;%</code><br /><code>  summarize(total_sales = sum(Sale_Price))</code><br /><br /><code># Use ggplot to plot sales by year</code><br /><code>ggplot(salesByYear, aes(Year, total_sales)) +</code><br /><code>  geom_bar(stat = &quot;identity&quot;) +</code><br /><code>  geom_smooth(method = &quot;lm&quot;, se = FALSE) +</code><br /><code>  labs(title=&quot;Super Bowl Sales Over Time&quot;, x=&quot;Year&quot;, y=&quot;Sales&quot;) +</code><br /><code>  scale_y_continuous(labels = scales::dollar) +</code><br /><code>  geom_text(aes(y=total_sales, label=scales::dollar(total_sales)),</code><br /><code>                        vjust=1.5,</code><br /><code>                        color=&quot;white&quot;,</code><br /><code>                        size=4) +</code><br /><code>  theme_bw()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_01.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_01.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_01.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_01.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_01" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.1 Revenue Over Time.</p>
<p>Secondary-market Super Bowl sales show a linear growth trend, with 2018 having the highest gross sales. Note that these numbers are not adjusted for inflation but still provide insight into market trends over the years.</p>
<p>Next we can examine quantity sold and total sales over the final 3 weeks before the Super Bowl.</p>
<p><code>SB %&gt;%</code><br /><code>  mutate(days_to_event  = as.numeric(days_to_event)) %&gt;%</code><br /><code>  group_by(days_to_event, Year) %&gt;%</code><br /><code>  summarise(Qty = sum(Qty)) %&gt;%</code><br /><code>  filter(days_to_event &lt;= 21) %&gt;%</code><br /><code>  ggplot(aes(x = days_to_event, y = Qty, color = Year)) +</code><br /><code>  geom_line(aes(y = Qty), color = palette_light()[[1]]) +</code><br /><code>  facet_grid(Year ~ ., scales = &quot;free&quot;) +</code><br /><code>  theme_tq() +</code><br /><code>  guides(color = FALSE) +</code><br /><code>  labs(title = &quot;Quantity Sold Over Last 3 Weeks&quot;,</code><br /><code>       x = &quot;&quot;,</code><br /><code>       y = &quot;Quantity Sold&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_02.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_02.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_02.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_02.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_02" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 2.1 Quantity Sold.</p>
<p><code>SB %&gt;%</code><br /><code>  mutate(days_to_event  = as.numeric(days_to_event)) %&gt;%</code><br /><code>  group_by(days_to_event, Year) %&gt;%</code><br /><code>  summarise(Sale_Price = sum(Sale_Price)) %&gt;%</code><br /><code>  filter(days_to_event &lt;= 21) %&gt;%</code><br /><code>  ggplot(aes(x = days_to_event, y = Sale_Price, color = Year)) +</code><br /><code>  geom_line(aes(y = Sale_Price), color = palette_light()[[1]]) +</code><br /><code>  facet_grid(Year ~ ., scales = &quot;free&quot;) +</code><br /><code>  theme_tq() +</code><br /><code>  guides(color = FALSE) +</code><br /><code>  labs(title = &quot;Total Sales Over Last 3 Weeks&quot;,</code><br /><code>       x = &quot;&quot;,</code><br /><code>       y = &quot;Sale Price&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_03.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_03.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_03.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_03.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_03" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 2.2 Revenue Sold</p>
<p>Both metrics show a strong uptick about two weeks before the game, which makes intuitive sense: as more tickets are sold, revenue increases. This is usually when both teams are officially decided. Let&#39;s further examine a heat map comparing the month and day of the month of transactions.</p>
<p><code>SB %&gt;%</code><br /><code>  mutate(day = day(Sale_Date), month = month(Sale_Date)) %&gt;%</code><br /><code>  group_by(month, day) %&gt;%</code><br /><code>  summarise(total_sales = sum(Sale_Price)) %&gt;%</code><br /><code>  ggplot(aes(x = month, y = day, fill = total_sales)) +</code><br /><code>    geom_tile(alpha = 0.8, color = &quot;white&quot;) +</code><br /><code>    scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +</code><br /><code>    theme_tq() +</code><br /><code>    theme(legend.position = &quot;right&quot;) +</code><br /><code>    labs(title = &quot;Sales per Month and Day&quot;,</code><br /><code>         y = &quot;Day of the Month&quot;,</code><br /><code>         fill = &quot;Total Sales&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_04.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_04.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_04.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_04.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_04" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 3 Heat Map of Sales by month and day</p>
<p>The heat map tells us that sales are slower during Oct - Dec and heat up in late January and early February, closer to the event. There are no sales in March - August. Now we can examine sales by specific sections and zones.</p>
<h4>Top 10 Zones</h4>
<p>Let&#39;s explore some stadium zones to get an idea of top selling zones.</p>
<p><code># Plot top 10 zones</code><br /><br /><code># Create top 10 zones data frame</code><br /><code>zoneSales &lt;- SB %&gt;%</code><br /><code>  group_by(Zone = Section) %&gt;%</code><br /><code>  summarize(total_sales = sum(Sale_Price),</code><br /><code>            qty_total = sum(Qty)) %&gt;%</code><br /><code>  mutate(pct_total = total_sales / sum(total_sales)) %&gt;%</code><br /><code>  arrange(desc(total_sales))</code><br /><code>top10.ordered &lt;- head(zoneSales, 10)</code><br /><code>top10.ordered$Zone &lt;- factor(top10.ordered$Zone, levels = arrange(top10.ordered, total_sales)$Zone)</code><br /><br /><code># Use ggplot to plot the top zones</code><br /><code>ggplot(top10.ordered, aes(Zone, total_sales)) +</code><br /><code>  geom_bar(stat=&quot;identity&quot;) +</code><br /><code>  geom_text(aes(ymax=pct_total, label=scales::percent(pct_total)),</code><br /><code>      hjust= -0.25,</code><br /><code>      vjust= 0.5,</code><br /><code>      color=&quot;black&quot;,</code><br /><code>      size=4) +</code><br /><code>  geom_text(aes(ymax=qty_total, label=paste(&quot;Qty:&quot;, qty_total)),</code><br /><code>      hjust= 1.25,</code><br /><code>      vjust= 0.5,</code><br /><code>      color=&quot;white&quot;,</code><br /><code>      size=4) +</code><br /><code>  coord_flip() +</code><br /><code>  labs(title=&quot;Top 10 Zones&quot;,</code><br /><code>       x=&quot;&quot;,</code><br /><code>       y=&quot;Sales&quot;)+</code><br /><code>  scale_y_continuous(labels = scales::dollar, limits = c(0,2500000)) +</code><br /><code>  theme_bw()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_05.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_05.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_05.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_05.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_05" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 4 Bar chart for sales by top 10 zones</p>
<p>Unexpectedly, Upper Corner is the top-selling zone with $1,463,821 and 2.17% of total ticket sales. There could be biases in the data due to different stadium layouts. Further analysis would need to group each individual section into named zones.</p>
<h4>Geographic Trends</h4>
<p>Let&#39;s map the sales using <code>leaflet</code> to try to expose sales trends by city.</p>
<p><code># Plot sales by geographic location</code><br /><br /><code># Aggregate sales by stadium, keeping each stadium&#39;s</code><br /><code># longitude and latitude</code><br /><code>salesByLocation &lt;- SB %&gt;%</code><br /><code>  group_by(Stadium, LNG, LAT) %&gt;%</code><br /><code>  summarise(total_sales = sum(Sale_Price)) %&gt;%</code><br /><code>  mutate(popup = paste0(Stadium, &quot;: &quot;, scales::dollar(total_sales)))</code><br /><br /><code># Use the leaflet package to map sales by stadium location</code><br /><code>library(leaflet)</code><br /><code>leaflet(salesByLocation) %&gt;%</code><br /><code>  addProviderTiles(&quot;CartoDB.Positron&quot;) %&gt;%</code><br /><code>  addMarkers(lng = ~LNG,</code><br /><code>             lat = ~LAT,</code><br /><code>             popup = ~popup) %&gt;%</code><br /><code>  addCircles(lng = ~LNG,</code><br /><code>             lat = ~LAT,</code><br /><code>             weight = 2,</code><br /><code>             radius = ~(total_sales)^0.775)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_06.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_06.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_06.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_06.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_06" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 5 Leaflet Map of past Super Bowl sales.</p>
<p>Larger circles correspond to higher sales, and smaller circles to lower sales. <code>leaflet</code> provides interactivity by letting you click on the markers. The geographic trends are consistent with the sales over time charts.</p>
<p>Now that we have done our exploratory data analysis we can attempt a time series forecast.</p>
<p>Based on our EDA, we have features relevant to forecasting demand or future revenue. We can split the data into a training and a test set and begin forecasting future revenue. We will use all sales dated before 2017-09-10 (that is, before the run-up to the 2018 Super Bowl) as the training data and all later sales as the test data.</p>
<p><code>SB_forecast &lt;- SB %&gt;%</code><br /><code>  group_by(Sale_Date) %&gt;%</code><br /><code>  summarise(Qty = sum(Qty), Sales = sum(Sale_Price)) %&gt;%</code><br /><code>  mutate(model = ifelse(Sale_Date &lt; &quot;2017-09-10&quot;, &quot;train&quot;, &quot;test&quot;))</code><br /><br /><code>SB_qty &lt;- SB_forecast %&gt;%</code><br /><code>  ggplot(aes(Sale_Date, Sales, color = model)) +</code><br /><code>  geom_point(alpha = 0.5) +</code><br /><code>    geom_line(alpha = 0.5) +</code><br /><code>    scale_color_manual(values = palette_light()) +</code><br /><code>    theme_tq()</code><br /><br /><code>SB_days_until &lt;- SB %&gt;%</code><br /><code>  group_by(Sale_Date, days_to_event) %&gt;%</code><br /><code>  summarise(Sales = sum(Sale_Price)) %&gt;%</code><br /><code>  mutate(model = ifelse(Sale_Date &lt; &quot;2017-09-10&quot;, &quot;train&quot;, &quot;test&quot;)) %&gt;%</code><br /><code>  ggplot(aes(days_to_event, Sales, color = model)) +</code><br /><code>  geom_point(alpha = 0.5) +</code><br /><code>    geom_line(alpha = 0.5) +</code><br /><code>    scale_color_manual(values = palette_light()) +</code><br /><code>    theme_tq()</code><br /><br /><code>grid.arrange(SB_qty, SB_days_until)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_07.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_07.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_07.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_07.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_07" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 6 Time Series of Quantity &amp; Sales</p>
<p>Notice the gaps in the time series on dates with no sales data. We will have to account for the missing dates when creating our future index.</p>
<p>Using <code>timetk</code> we can add the time series signature to our corresponding response variable.</p>
<p><code>SB_forecast_aug &lt;- SB_forecast %&gt;%</code><br /><code>  select(model, Sale_Date, Sales) %&gt;%</code><br /><code>  tk_augment_timeseries_signature()</code><br /><br /><code>SB_forecast_aug &lt;- SB_forecast_aug[complete.cases(SB_forecast_aug), ]</code></p>
<p>After adding the features generated by the <code>tk_augment_timeseries_signature()</code> function, we then remove rows with missing values from the data frame. Since we have to account for the missing sales dates, we need to ask ourselves whether to replace those values with the mean or set them to 0. Since there are large gaps in purchases between Super Bowls, for this situation we should set them to 0 and also remove features with a variance of 0.</p>
<p><code>library(matrixStats)</code><br /><br /><code>(var &lt;- data.frame(colnames = colnames(SB_forecast_aug[, sapply(SB_forecast_aug, is.numeric)]),</code><br /><code>           colvars = colVars(as.matrix(SB_forecast_aug[, sapply(SB_forecast_aug, is.numeric)]))) %&gt;%</code><br /><code>  filter(colvars == 0))</code><br /><br /><code>SB_forecast_aug &lt;- select(SB_forecast_aug, -one_of(as.character(var$colnames)))</code></p>
<p>The sales data is aggregated by day, so the hour, minute, second, and am/pm features are removed. Next we will remove the highly correlated features in the data set.</p>
<p><code>library(ggcorrplot)</code><br /><br /><code>cor &lt;- cor(SB_forecast_aug[, sapply(SB_forecast_aug, is.numeric)])</code><br /><code>p.cor &lt;- cor_pmat(SB_forecast_aug[, sapply(SB_forecast_aug, is.numeric)])</code><br /><br /><code>ggcorrplot(cor,  type = &quot;upper&quot;, outline.col = &quot;white&quot;, hc.order = TRUE, p.mat = p.cor,</code><br /><code>           colors = c(palette_light()[1], &quot;white&quot;, palette_light()[2]))</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_08.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_08.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_08.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_08.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_08" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 7 Correlation plot</p>
<p>Examining the correlation plot and data frame, I am going to remove highly correlated features, using 0.95 as the cutoff.</p>
<p><code>cor_cut &lt;- findCorrelation(cor, cutoff = 0.95)</code><br /><code>SB_forecast_aug &lt;- select(SB_forecast_aug, -one_of(colnames(cor)[cor_cut]))</code></p>
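<p>The idea behind a correlation cutoff can be sketched in a few lines of plain Python. This is a simplified greedy filter on made-up calendar features, not the algorithm that <code>findCorrelation()</code> implements; the helper names <code>pearson</code> and <code>drop_correlated</code> are hypothetical.</p>

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, cutoff=0.95):
    """Greedy filter: for each pair above the cutoff, drop the later column.

    A simplification for illustration, not caret's findCorrelation().
    """
    dropped = set()
    for a, b in combinations(features, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) > cutoff:
            dropped.add(b)
    return [name for name in features if name not in dropped]


# Made-up calendar features: "day_of_year" and "week" move together exactly.
features = {
    "day_of_year": [1, 8, 15, 22, 29, 36],
    "week":        [1, 2, 3, 4, 5, 6],
    "weekday":     [2, 5, 1, 7, 3, 4],
}
kept = drop_correlated(features)
```

<p>Here <code>week</code> is dropped because it carries the same information as <code>day_of_year</code>, while <code>weekday</code> stays because it is only weakly correlated with either.</p>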
<p>After removing the highly correlated features we can split the data into our training and test sets.</p>
<p><code>train &lt;- filter(SB_forecast_aug, model == &quot;train&quot;) %&gt;%</code><br /><code>  select(-model)</code><br /><code>test &lt;- filter(SB_forecast_aug, model == &quot;test&quot;)</code></p>
<h4>Modeling</h4>
<p>The response variable <code>Sales</code> will be modeled using a generalized linear model. We could test numerous statistical learning models to determine the best model choice, but for this situation <a href="https://en.wikipedia.org/wiki/Occam%27s_razor">Occam</a> probably was right.</p>
<p><code>fit_lm &lt;- glm(Sales ~ ., data = train)</code></p>
<p>Visualize the model features using <code>broom</code> and <code>ggplot2</code>.</p>
<p><code>tidy(fit_lm) %&gt;%</code><br /><code>  gather(x, y, estimate:p.value) %&gt;%</code><br /><code>  ggplot(aes(x = term, y = y, color = x, fill = x)) +</code><br /><code>    facet_wrap(~ x, scales = &quot;free&quot;, ncol = 4) +</code><br /><code>    geom_bar(stat = &quot;identity&quot;, alpha = 0.8) +</code><br /><code>    scale_color_manual(values = palette_light()) +</code><br /><code>    scale_fill_manual(values = palette_light()) +</code><br /><code>    theme_tq() +</code><br /><code>    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_09.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_09.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_09.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_09.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_09" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 8 Model features</p>
<p><code>augment(fit_lm) %&gt;%</code><br /><code>  ggplot(aes(x = Sale_Date, y = .resid)) +</code><br /><code>    geom_hline(yintercept = 0, color = &quot;red&quot;) +</code><br /><code>    geom_point(alpha = 0.5, color = palette_light()[[1]]) +</code><br /><code>    geom_smooth() +</code><br /><code>    theme_tq()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_10.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_10.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_10.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_10.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_10" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 9</p>
<p>After plotting we can now add predictions and residuals for the test data and visualize the residuals.</p>
<p><code>pred_test &lt;- test %&gt;%</code><br /><code>  add_predictions(fit_lm, &quot;pred_lm&quot;) %&gt;%</code><br /><code>  add_residuals(fit_lm, &quot;resid_lm&quot;)</code><br /><br /><code>pred_test %&gt;%</code><br /><code>    ggplot(aes(x = Sale_Date, y = resid_lm)) +</code><br /><code>    geom_hline(yintercept = 0, color = &quot;red&quot;) +</code><br /><code>    geom_point(alpha = 0.5, color = palette_light()[[1]]) +</code><br /><code>    geom_smooth() +</code><br /><code>    theme_tq()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_11.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_11.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_11.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_11.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_11" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 10</p>
<p>After examining the residuals, we would probably want to transform the response variable, or add interaction or polynomial terms to the independent variables, but we can leave that explanation for another time.</p>
<p>Now we compare the predicted against the actual data in the test set.</p>
<p><code>pred_test %&gt;%</code><br /><code>  gather(x, y, Sales, pred_lm) %&gt;%</code><br /><code>  ggplot(aes(x = Sale_Date, y = y, color = x)) +</code><br /><code>    geom_point(alpha = 0.5) +</code><br /><code>    geom_line(alpha = 0.5) +</code><br /><code>    scale_color_manual(values = palette_light()) +</code><br /><code>    theme_tq()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_12.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_12.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_12.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_12.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_12" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 11</p>
<p>Our model appears to miss the uptick in sales in late January but appears consistent nonetheless.</p>
<h4>Forecasting</h4>
<p>Now that our feature selection is out of the way, we can forecast next year&#39;s total Super Bowl ticket sales. First we extract an index using the <code>tk_index</code> function.</p>
<p><code># Extract index</code><br /><code>idx &lt;- SB_forecast %&gt;%</code><br /><code>    tk_index()</code><br /><br /><code>idx_future &lt;- idx %&gt;%</code><br /><code>  tk_get_timeseries_summary()</code><br /><code>idx_future</code></p>
<p><code>    ## # A tibble: 1 x 12</code><br /><code>    ##   n.obs start      end        units scale tzone diff.minimum diff.q1</code><br /><code>    ##                           </code><br /><code>    ## 1   524 2012-09-30 2018-02-04 days  day   UTC          86400   86400</code><br /><code>    ## # ... with 4 more variables: diff.median , diff.mean ,</code><br /><code>    ## #   diff.q3 , diff.maximum </code></p>
<p>We need to account for the irregular data: dates with no past sales are missing, so the mean difference does not equal 86400 seconds (1 day).</p>
<p>We need to be aware that we never have data for days with no sales, and that there are a few random missing values in between, as can be seen in the <code>diff</code> column of <code>SB_forecast_aug</code> (a 1 day difference is 86400 seconds).</p>
<p><code>SB_forecast_aug %&gt;%</code><br /><code>  ggplot(aes(x = Sale_Date, y = diff)) +</code><br /><code>    geom_point(alpha = 0.5, aes(color = as.factor(diff))) +</code><br /><code>    geom_line(alpha = 0.5) +</code><br /><code>    theme_tq()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_13.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_13.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_13.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_13.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_13" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 12</p>
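<p>The meaning of the 86400-second difference can be illustrated with a small plain-Python sketch. The sale dates below are made up, and <code>diffs_in_seconds</code> is a hypothetical helper that mimics the idea of the <code>diff</code> column rather than reproducing how <code>timetk</code> computes it.</p>

```python
from datetime import date

SECONDS_PER_DAY = 86400

# Hypothetical sale dates with two gaps (made up for illustration).
sale_dates = [
    date(2018, 1, 29),
    date(2018, 1, 30),
    date(2018, 2, 1),  # 2018-01-31 is missing
    date(2018, 2, 4),  # 2018-02-02 and 2018-02-03 are missing
]


def diffs_in_seconds(dates):
    """Seconds between each date and the one before it."""
    return [
        int((later - earlier).total_seconds())
        for earlier, later in zip(dates, dates[1:])
    ]


diffs = diffs_in_seconds(sale_dates)
# Positions in sale_dates that do not follow the previous date by exactly one day.
gap_positions = [i + 1 for i, d in enumerate(diffs) if d != SECONDS_PER_DAY]
mean_diff = sum(diffs) / len(diffs)
```

<p>With any gap in the dates, the mean difference rises above 86400, which is the signal that the series is irregular.</p>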
<p>Create the future index and rename <code>index</code> to <code>Sale_Date</code> to match the original data. We account for days that are missing on a monthly, quarterly, or yearly schedule using the <code>inspect_months</code> argument.</p>
<p><code>idx_future &lt;- idx %&gt;%</code><br /><code>  tk_make_future_timeseries(n_future = 365, inspect_months = TRUE)</code><br /><br /><code>data_future &lt;- idx_future %&gt;%</code><br /><code>    tk_get_timeseries_signature() %&gt;%</code><br /><code>    rename(Sale_Date = index)</code></p>
<p>Predict the future values and build the future data frame.</p>
<p><code>pred_future &lt;- predict(fit_lm, newdata = data_future)</code><br /><br /><code>sales_future &lt;- data_future %&gt;%</code><br /><code>    select(Sale_Date) %&gt;%</code><br /><code>    add_column(Sales = pred_future)</code><br /><br /><code>SB_forecast %&gt;%</code><br /><code>    ggplot(aes(x = Sale_Date, y = Sales)) +</code><br /><code>    geom_rect(xmin = as.numeric(ymd(&quot;2017-09-10&quot;)),</code><br /><code>              xmax = as.numeric(ymd(&quot;2018-02-04&quot;)),</code><br /><code>              ymin = 0, ymax = 2000000,</code><br /><code>              fill = palette_light()[[4]], alpha = 0.01) +</code><br /><code>    geom_rect(xmin = as.numeric(ymd(&quot;2018-02-05&quot;)),</code><br /><code>              xmax = as.numeric(ymd(&quot;2019-02-04&quot;)),</code><br /><code>              ymin = 0, ymax = 2000000,</code><br /><code>              fill = palette_light()[[3]], alpha = 0.01) +</code><br /><code>    annotate(&quot;text&quot;, x = ymd(&quot;2013-11-03&quot;), y = 1500000,</code><br /><code>             color = palette_light()[[1]], label = &quot;Train Region&quot;) +</code><br /><code>    annotate(&quot;text&quot;, x = ymd(&quot;2017-08-01&quot;), y = 550000,</code><br /><code>             color = palette_light()[[1]], label = &quot;Test Region&quot;) +</code><br /><code>    annotate(&quot;text&quot;, x = ymd(&quot;2018-10-01&quot;), y = 550000,</code><br /><code>             color = palette_light()[[1]], label = &quot;Forecast Region&quot;) +</code><br /><code>    geom_point(alpha = 0.5, color = palette_light()[[1]]) +</code><br /><code>    geom_point(aes(x = Sale_Date, y = Sales), data = sales_future,</code><br /><code>               alpha = 0.5, color = palette_light()[[2]]) +</code><br /><code>    geom_smooth(aes(x = Sale_Date, y = Sales), data = sales_future,</code><br /><code>                method = &#39;loess&#39;) +</code><br /><code>    labs(title = &quot;Secondary Market Super Bowl Ticket Sales: 2019 Forecast&quot;, x = &quot;&quot;) +</code><br /><code>    theme_tq()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_14.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_14.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_14.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_14.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_14" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 13</p>
<p>Notice the negative values. Negative sales are not only impossible but also tell us something about the error rate in our model. We can visualize this by plotting prediction intervals built from the standard deviation of the test residuals.</p>
<p><code>test_residuals &lt;- pred_test$resid_lm</code><br /><code>test_resid_sd &lt;- sd(test_residuals, na.rm = TRUE)</code><br /><br /><code>sales_future &lt;- sales_future %&gt;%</code><br /><code>    mutate(</code><br /><code>        lo.95 = Sales - 1.96 * test_resid_sd,</code><br /><code>        lo.80 = Sales - 1.28 * test_resid_sd,</code><br /><code>        hi.80 = Sales + 1.28 * test_resid_sd,</code><br /><code>        hi.95 = Sales + 1.96 * test_resid_sd</code><br /><code>        )</code><br /><br /><code>SB_forecast %&gt;%</code><br /><code>    ggplot(aes(x = Sale_Date, y = Sales)) +</code><br /><code>    geom_point(alpha = 0.5, color = palette_light()[[1]]) +</code><br /><code>    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), data = sales_future,</code><br /><code>                fill = &quot;#D5DBFF&quot;, color = NA, size = 0) +</code><br /><code>    geom_ribbon(aes(ymin = lo.80, ymax = hi.80), data = sales_future,</code><br /><code>                fill = &quot;#596DD5&quot;, color = NA, size = 0, alpha = 0.8) +</code><br /><code>    geom_point(aes(x = Sale_Date, y = Sales), data = sales_future,</code><br /><code>               alpha = 0.5, color = palette_light()[[2]]) +</code><br /><code>    geom_smooth(aes(x = Sale_Date, y = Sales), data = sales_future,</code><br /><code>                method = &#39;loess&#39;, color = &quot;white&quot;) +</code><br /><code>    labs(title = &quot;Secondary Market Super Bowl Ticket Sales: 2019 Forecast with Prediction Intervals&quot;, x = &quot;&quot;) +</code><br /><code>    theme_tq()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_15.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_15.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_15.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_15.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_15" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 14</p>
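<p>The 1.96 and 1.28 multipliers in the code above are the standard normal quantiles for two-sided 95% and 80% intervals. As a minimal sketch, assuming roughly normal residuals and using made-up numbers (<code>toy_forecast</code> and <code>toy_resid_sd</code> are hypothetical, not values from our model), here is where they come from and one way to floor the lower bound at zero so the interval never shows negative sales:</p>
<p><code># Multipliers for two-sided normal intervals</code><br /><code>z_95 &lt;- qnorm(0.975)  # about 1.96</code><br /><code>z_80 &lt;- qnorm(0.90)   # about 1.28</code><br /><br /><code># Hypothetical forecast values and residual standard deviation</code><br /><code>toy_forecast &lt;- c(500, 1500, 4000)</code><br /><code>toy_resid_sd &lt;- 1000</code><br /><br /><code>toy_intervals &lt;- data.frame(</code><br /><code>  Sales = toy_forecast,</code><br /><code>  lo.95 = pmax(toy_forecast - z_95 * toy_resid_sd, 0),  # floor at zero</code><br /><code>  hi.95 = toy_forecast + z_95 * toy_resid_sd</code><br /><code>)</code><br /><code>toy_intervals</code></p>
<p>Flooring with <code>pmax()</code> only tidies the plot; it does not fix the underlying model, which still assumes symmetric errors around each forecast.</p>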
<p>Our model predicts that 2019 Super Bowl sales will not be as prosperous as 2018. The secondary ticket market is notable for its high variance, so its future is highly uncertain. Although the revenue forecast follows a curve similar to past years, summarising a total will provide a better view.</p>
<p><code>combine1 &lt;- SB %&gt;%</code><br /><code>  select(Year, Sale_Price) %&gt;%</code><br /><code>  group_by(Year) %&gt;%</code><br /><code>  summarise(total_sales = sum(Sale_Price))</code><br /><br /><code>combine2 &lt;- sales_future %&gt;%</code><br /><code>  mutate(Year = 2019) %&gt;%</code><br /><code>  na.omit() %&gt;%</code><br /><code>  group_by(Year) %&gt;%</code><br /><code>  summarise(total_sales = sum(Sales))</code><br /><code>All &lt;- bind_rows(combine1, combine2)</code><br /><br /><code>ggplot(All, aes(Year, total_sales)) +</code><br /><code>  geom_bar(stat = &quot;identity&quot;) +</code><br /><code>  geom_smooth(method = &quot;lm&quot;, se = FALSE) +</code><br /><code>  labs(title=&quot;Super Bowl Sales Over Time&quot;, x=&quot;Year&quot;, y=&quot;Sales&quot;) +</code><br /><code>  scale_y_continuous(labels = scales::dollar) +</code><br /><code>  geom_text(aes(y=total_sales, label=scales::dollar(total_sales)),</code><br /><code>                        vjust=1.5,</code><br /><code>                        color=&quot;white&quot;,</code><br /><code>                        size=3.5) +</code><br /><code>  theme_tq()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_16.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/52_16.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/52_16.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/52_16.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="52_16" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 15 Total Revenue Including Forecast</p>
<p>A little data manipulation lets us plot the total forecasted revenue alongside the previous years for a clear comparison snapshot and a view of the linear trend. The upward linear trend in sales is a testament to the growing secondary market, along with Super Bowl prices outpacing inflation. A further interesting analysis would be to compare Super Bowl prices against wage growth and overall inflation rates. Forecasting using the <code>timetk</code> approach is a great machine learning application for our data set. However, a prediction is only as good as the data used, and major omitted variables in our analysis are the teams playing and the location. These features could be added to the regression, but our example was kept as simple as possible to get to a result. In a real business case, different features could be tested to find the best model and result.</p>]]></content:encoded>
    </item>
    <item>
      <title>Optimizing Wedding Reception Seating Charts</title>
      <link>https://allanbutler.com/optimizing-wedding-reception-seating-charts/</link>
      <guid isPermaLink="true">https://allanbutler.com/optimizing-wedding-reception-seating-charts/</guid>
      <pubDate>Wed, 21 Nov 2018 12:00:00 GMT</pubDate>
      <description>Recently my wife and I were married. We were so fortunate that many of our close friends and family members attended our wedding in California (we live…</description>
      <content:encoded><![CDATA[<p>Recently my wife and I were married. We were so fortunate that many of our close friends and family members attended our wedding in California (we live in Texas). My beautiful wife was the ultimate planner and tackled almost every task of wedding planning with her mom. I definitely lucked out with my responsibilities being minimal. However, when she asked for my help with the seating chart for the 90 guests, I knew this was a problem that data science could help solve. Luckily, in Algorithms to Live By, written by Christian and Griffiths, I had come across Meghan Bellows&#39;s story of planning her wedding while also doing her PhD research in chemical engineering. Using specific scores for each guest relationship and specifying a few constraints, I was able to replicate a similar &#39;travelling salesman problem&#39; style optimization. Along the way I also found a GitHub repo by Megan Stiles, who tackled the same optimization problem of seating her guests, so a big shoutout to her for the R code help.</p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/50_1-scaled.jpg" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/50_1-scaled.jpg 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/50_1-scaled.jpg 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/50_1-scaled.jpg 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="50_1" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p><em>Figure 1. Final tables at the reception.</em></p>
<h2>Building the Guest Relational Matrix</h2>
<p>Based on the assumption that people want to sit at a table with the people they are most closely related to, we built a guest relational matrix of the 90 guests for the wedding reception: 9 tables of 10.</p>
<p>Key: 2000 = Spouse/Date, 900 = Sibling, 700 = Parent/Child, 600 = Grandparent, 500 = Cousin, 300 = Aunt/Niece, 100 = Friend, 0 = Strangers, 5000 = Bride/Groom</p>
<p>Unfortunately I found no other way to tackle this problem than to manually enter the matrix data into Excel. Feel free to reach out if you can think of any better suggestions.</p>
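<p>One possible alternative to typing every cell, sketched here under the assumption that only the non-zero relationships need to be listed: record each pair once in a long table and let R fill in a symmetric matrix. The guest names and the <code>pairs</code> data frame below are hypothetical, and the scores follow the key above.</p>
<p><code># Hypothetical guests and their known relationships</code><br /><code>guests &lt;- c(&quot;Ann&quot;, &quot;Bob&quot;, &quot;Cara&quot;, &quot;Dan&quot;)</code><br /><code>pairs &lt;- data.frame(</code><br /><code>  a = c(&quot;Ann&quot;, &quot;Ann&quot;, &quot;Cara&quot;),</code><br /><code>  b = c(&quot;Bob&quot;, &quot;Cara&quot;, &quot;Dan&quot;),</code><br /><code>  score = c(2000, 900, 100),</code><br /><code>  stringsAsFactors = FALSE</code><br /><code>)</code><br /><br /><code># Start with strangers everywhere (0), then fill both triangles</code><br /><code>m &lt;- matrix(0, nrow = length(guests), ncol = length(guests),</code><br /><code>            dimnames = list(guests, guests))</code><br /><code>m[cbind(pairs$a, pairs$b)] &lt;- pairs$score</code><br /><code>m[cbind(pairs$b, pairs$a)] &lt;- pairs$score</code><br /><code>m</code></p>
<p>Each relationship is entered once, so the two halves of the matrix cannot drift out of sync the way hand-typed cells can.</p>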
<h2>The Genetic Algorithm Solution in R</h2>
<p><code>library(tidyverse)</code><br /><code>library(genalg)</code><br /><br /><code>wedding_matrix &lt;- read_csv(&quot;wedding_seating_chart.csv&quot;)</code><br /><br /><br /><code># 1s indicate the guest is at the current table and 0s indicate they are not. The model will seat one table at a time and iterate until all the tables are filled</code><br /><br /><code>### Define Fitness Function</code><br /><br /><code>evalFunc &lt;- function(x) {</code><br /><code>  # Total Table Closeness, initialize to 0</code><br /><code>  closeness = 0</code><br /><br /><code>  # Number of people at the table</code><br /><code>  current_table_1 = sum(x == 1)</code><br /><br /><code>  # Calculate Index of each person at the table (This corresponds to the closeness matrix)</code><br /><code>  i = 0</code><br /><code>  Table_1_POS&lt;- vector()</code><br /><br /><code>  for (i in 1:length(x)) {</code><br /><code>    if (x[i] == 1) {</code><br /><code>      Table_1_POS&lt;-append(Table_1_POS,i)</code><br /><code>    }</code><br /><code>  }</code><br /><code>  i = 0</code><br /><br /><code>  #Calculates the closeness for the table</code><br /><br /><code>  Table_1 = 0</code><br /><code>  i=0</code><br /><code>  for (i in 1: length(x)) {</code><br /><code>    if (x[i] == 1) {</code><br /><code>      j =0</code><br /><code>      for (j in 1: length(Table_1_POS)) {</code><br /><code>        Table_1 = Table_1 + wedding_matrix[i, Table_1_POS[[j]] + 1]</code><br /><code>      }</code><br /><code>    }</code><br /><code>  }</code><br /><code>  #Total Closeness</code><br /><code>  closeness = Table_1</code><br /><br /><code>  #Restrict Number of guests at each table</code><br /><code>  if (current_table_1 &gt; 10)</code><br /><code>    return(0) else return(-closeness)</code><br /><br /><code>}</code><br /><br /><code>### Iteratively Seat Tables###</code><br /><br /><code>#Initialize iterations to 300</code><br /><code>iters = 300</code><br /><code>i = 0</code><br /><br 
/><code>#Initialize chromosome size to 90 (one bit per guest)</code><br /><code>size = 90</code><br /><br /><code>#Initialize vector to store the seating order</code><br /><code>Seating_Order &lt;- vector()</code><br /><code>for (i in 1:8) {</code><br /><br /><code>  #Increase Generations for final two tables</code><br /><code>  if ( i &gt; 8) {</code><br /><code>    iters = 1000</code><br /><code>  }</code><br /><br /><code>  #Run GA</code><br /><code>  ga.model &lt;- rbga.bin(size = size, popSize = 200, evalFunc = evalFunc, iters = iters, elitism = TRUE)</code><br /><br /><code>  #Best Solution</code><br /><code>  solution &lt;- ga.model$population[which.min(ga.model$evaluations),]</code><br /><br /><code>  # Print Which Table we are on, The closeness, and how many people are at each table to keep track</code><br /><code>  print(i)</code><br /><code>  print(sum(solution == 1))</code><br /><code>  closeness &lt;- min(ga.model$evaluations)</code><br /><code>  print(closeness)</code><br /><br /><code>  #Append Seated Guests to Seating_Order Vector</code><br /><code>  seated &lt;- wedding_matrix[solution == 1,]</code><br /><code>  Seating_Order &lt;- append(Seating_Order, as.character(seated$X))</code><br /><br /><br /><code>  #Remove seated guests from the df before rerunning the model for the next table</code><br /><code>  seated.index = vector()</code><br /><br /><code>  for (j in 1:(length(solution))) {</code><br /><code>    if (solution[j] == 1) {</code><br /><code>      seated.index&lt;- append(seated.index, j)</code><br /><code>    }</code><br /><code>  }</code><br /><code>  wedding_matrix = wedding_matrix[-c(seated.index[[1]],seated.index[[2]], seated.index[[3]], seated.index[[4]], seated.index[[5]], seated.index[[6]], seated.index[[7]], seated.index[[8]], seated.index[[9]], seated.index[[10]]),</code><br /><code>                        -c((seated.index[[1]]+1),(seated.index[[2]]+1), (seated.index[[3]]+1), (seated.index[[4]]+1), (seated.index[[5]]+1), (seated.index[[6]]+1), 
(seated.index[[7]]+1), (seated.index[[8]]+1), (seated.index[[9]]+1), (seated.index[[10]]+1))]</code><br /><br /><code>  #Reduce size of chromosome by 10 for next run</code><br /><code>  size = size -10</code><br /><br /><code>}</code><br /><br /><code>#Separate Tables</code><br /><code>One = Seating_Order[1:10]</code><br /><code>Two = Seating_Order[11:20]</code><br /><code>Three = Seating_Order[21:30]</code><br /><code>Four = Seating_Order[31:40]</code><br /><code>Five = Seating_Order[41:50]</code><br /><code>Six = Seating_Order[51:60]</code><br /><code>Seven = Seating_Order[61:70]</code><br /><code>Eight = Seating_Order[71:80]</code><br /><code>Nine = as.character(wedding_matrix$X)</code></p>
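<p>The nested loops in <code>evalFunc</code> add up every pairwise score among the guests flagged with a 1. Here is a compact sketch of the same idea, assuming the scores live in a plain numeric matrix (the hypothetical <code>toy_scores</code> below) rather than a data frame with a leading name column. The <code>table_closeness</code> helper and its <code>max_seats</code> argument are illustrative names, not part of <code>genalg</code>.</p>
<p><code># Hypothetical 4-guest score matrix (symmetric, zero diagonal)</code><br /><code>toy_scores &lt;- matrix(c(   0, 2000, 100,   0,</code><br /><code>                       2000,    0,   0, 100,</code><br /><code>                        100,    0,   0, 900,</code><br /><code>                          0,  100, 900,   0),</code><br /><code>                     nrow = 4, byrow = TRUE)</code><br /><br /><code>table_closeness &lt;- function(x, scores, max_seats = 10) {</code><br /><code>  seated &lt;- which(x == 1)</code><br /><code>  # Over capacity: return 0 so the table earns no reward</code><br /><code>  if (length(seated) &gt; max_seats) return(0)</code><br /><code>  # Sum all pairwise scores; negate because the GA minimizes the fitness value</code><br /><code>  -sum(scores[seated, seated])</code><br /><code>}</code><br /><br /><code>table_closeness(c(1, 1, 0, 0), toy_scores)</code></p>
<p>As in the loop version, each pair is counted twice (once per direction), which scales every candidate table equally and so does not change which one wins.</p>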
<h2>Combining Tables into the Final Seating Chart</h2>
<p><code># One column per table; data.frame() is used because bind_rows() rejects unnamed vectors</code><br /><code>seating_chart &lt;- data.frame(One, Two, Three, Four, Five, Six, Seven, Eight, Nine)</code><br /><br /><code>#Save Completed Seating Chart in csv</code><br /><code>write_csv(seating_chart, &quot;Wedding_Seating_Chart.csv&quot;)</code></p>
<h2>The Results</h2>
<p>The final seating chart needed only a few minor tweaks from my bride, saved me from the strenuous process of deciding where each individual should sit, and gave me a way to bring R into the wedding. Also, the wedding was a blast!</p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/50_2-scaled.jpg" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/50_2-scaled.jpg 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/50_2-scaled.jpg 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/50_2-scaled.jpg 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="50_2" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p><em>Figure 2. My Beautiful Wife and I</em></p>]]></content:encoded>
    </item>
    <item>
      <title>Puppy Training with Machine Learning</title>
      <link>https://allanbutler.com/puppy-training-machine-learning/</link>
      <guid isPermaLink="true">https://allanbutler.com/puppy-training-machine-learning/</guid>
      <pubDate>Sat, 28 Apr 2018 12:00:00 GMT</pubDate>
      <description>A Data Driven Approach to Housebreaking My Puppy Figure 1.1 Don&apos;t let the cuteness fool you. Housetraining a puppy is work. Don&apos;t let the cuteness of…</description>
      <content:encoded><![CDATA[<h3>A Data Driven Approach to Housebreaking My Puppy</h3>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_01-scaled.jpg" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_01-scaled.jpg 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_01-scaled.jpg 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_01-scaled.jpg 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_01" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.1 Don&#39;t let the cuteness fool you.</p>
<p>Housetraining a puppy is work. Don&#39;t let the cuteness of your pup fool you into thinking housetraining will be a breeze, although the right training up front will save you agony down the road. After reading <a href="https://www.rover.com/blog/complete-guide-puppy-potty-training/">Rover&#39;s</a> post on housebreaking your dog, I decided to take a data-driven approach to housetraining by documenting eating and bathroom breaks. After a month of recording data I was not only extremely grateful for the automation of data warehouses but also able to determine whether my pup was on the right track with her potty and eating behaviors. For this post I will only use her bathroom dataset.</p>
<p>First we will load the data into a data frame for exploratory analysis along with the correct R packages. Exploratory analysis is about asking a series of data questions and trying to gain useful insights to influence our decision making.</p>
<p><code>library(tidyverse)</code><br /><code>library(lubridate)</code><br /><code>library(ggthemes)</code><br /><code>library(modelr)</code><br /><code>library(broom)</code><br /><code>library(caret)</code><br /><code>library(tidytext)</code><br /><code>library(lime)</code><br /><code>library(ggridges)</code><br /><code>library(viridis)</code><br /><br /><code>potty_records &lt;- read_csv(&quot;Aimee/potty_records.csv&quot;) %&gt;%</code><br /><code>  mutate(Date = mdy(Date), day_of_week = wday(Date, label = TRUE))</code><br /><code>potty_records$hour &lt;- as.POSIXlt(potty_records$Time, format=&quot;%H:%M&quot;)$hour</code></p>
<h4>Visual Exploration</h4>
<p>Now that we have the data loaded with the appropriate packages we can start the EDA process by drawing some plots. Let&#39;s start with some plots to get to know the data and see whether any trends help explain the relationship between the <code>Potty break or in-house accident?</code> variable and the other variables. But first we need to clarify where the missing values are and whether they will cause a problem in the EDA phase.</p>
<p><code># List of NAs</code><br /><code>potty_records %&gt;%</code><br /><code>  purrr::map_df(~sum(is.na(.)))</code></p>
<p><code>## # A tibble: 1 x 10</code><br /><code>##   `Trial No.`  Date  Time `Potty break or in-ho~ `U(rination), D(efecatio~</code><br /><code>##                                                  </code><br /><code>## 1           0     0     0                      2                         0</code><br /><code>## # ... with 5 more variables: `What was the dog doing pre-elimination?</code><br /><code>## #   (nap, meal, walk, play, sniffing, pacing, etc.)` , `Consequences</code><br /><code>## #   for the dog (play, treat, walk, scolding, clean up/no response?)`</code><br /><code>## #   , Notes , day_of_week , hour </code></p>
<p>We see that there are 359 <code>NA</code> values in the Notes, 2 in the <code>Potty break</code>, and 2 in the <code>Pre-elimination</code> column. Since this data is manually logged, I know that the <code>Pre-elimination</code> NAs come either from finding an accident without seeing any behavior beforehand or from taking the dog out and no action occurring. It is important to know your data and troubleshoot any data integrity issues that you find.</p>
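<p>For a quick cross-check of counts like these, base R can report missing values per column without any extra packages. A small sketch on a hypothetical <code>toy_log</code> data frame (not the real <code>potty_records</code>):</p>
<p><code># Hypothetical log with a few missing entries</code><br /><code>toy_log &lt;- data.frame(</code><br /><code>  type  = c(&quot;Success&quot;, NA, &quot;Accident&quot;, &quot;Success&quot;),</code><br /><code>  notes = c(NA, NA, &quot;near door&quot;, NA),</code><br /><code>  stringsAsFactors = FALSE</code><br /><code>)</code><br /><br /><code># Count of NA values in each column</code><br /><code>na_counts &lt;- colSums(is.na(toy_log))</code><br /><code>na_counts</code></p>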
<p>Let&#39;s now visualize the <code>Potty break or in-house accident?</code> column over time to get a trend. Plotting the daily share of <code>Success</code> over time shows whether results have been improving steadily or just started happening all of a sudden.</p>
<p><code>potty_records %&gt;%</code><br /><code>  rename(type = `Potty break or in-house accident?`) %&gt;%</code><br /><code>  group_by(Date, type) %&gt;%</code><br /><code>  summarise(n = n()) %&gt;%</code><br /><code>  mutate(freq = n/sum(n)) %&gt;%</code><br /><code>  ggplot(aes(Date, freq, color = type)) +</code><br /><code>  geom_point(size = 1) +</code><br /><code>  geom_smooth(method = &quot;lm&quot;) +</code><br /><code>  scale_color_fivethirtyeight(&quot;type&quot;) +</code><br /><code>  labs(title = &quot;Time Series of Bathroom Type&quot;,</code><br /><code>          subtitle = &quot;by % of Success or Accident&quot;) +</code><br /><code>  theme_fivethirtyeight() </code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_02.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_02.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_02.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_02.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_02" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.2 Time Series of Success or Accident by Percent.</p>
<p>Great, it appears <code>Success</code> has a linear trend upward over time despite some minor setbacks. She appears to be a quick learner and <code>Accidents</code> have definitely decreased.</p>
<p>For a first granular look, we can examine bathroom trips across the different days of the week by hour.</p>
<p><code>potty_records %&gt;%</code><br /><code>  ggplot(aes(x = hour, y = day_of_week, fill = ..x..)) +</code><br /><code>  geom_density_ridges_gradient(scale = 3) +</code><br /><code>  scale_x_continuous(expand = c(0.01, 0)) +</code><br /><code>  scale_y_discrete(expand = c(0.01, 0)) +</code><br /><code>  scale_fill_viridis(name = &quot;Hour&quot;, option = &quot;C&quot;) +</code><br /><code>  labs(title = &quot;Number of Potty Breaks By Day of the Week &amp; Hour&quot;,</code><br /><code>       subtitle = &quot;Source: Aimee&#39;s housebreaking&quot;,</code><br /><code>       x = &quot;Hour&quot;) +</code><br /><code>  theme_ridges(font_size = 13, grid = TRUE) + theme(axis.title.y = element_blank())</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_03.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_03.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_03.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_03.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_03" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.2 Joy Plot of Potty Breaks by Day &amp; Hour.</p>
<p>Here we can see that Aimee definitely goes to the bathroom more often later in the day. I would assume this is because I am home from work and she is out more. The variance on Thursday is also a little unusual.</p>
<p>Next, let&#39;s look further into the hours and types of <code>Accidents</code> vs <code>Success</code> and search for patterns.</p>
<p><code>success &lt;- potty_records %&gt;%</code><br /><code>  filter(`Potty break or in-house accident?` == &#39;Success&#39;)</code><br /><br /><code>success_hour &lt;- ggplot(aes(x = hour), data = success) + geom_histogram(bins = 24, color = &#39;black&#39;, fill = &#39;blue&#39;) +</code><br /><code>  ggtitle(&#39;Histogram of Success Potty Times by Type&#39;) +</code><br /><code>  facet_wrap(~ `U(rination), D(efecation), N(either), B(oth)`) +</code><br /><code>  theme_minimal()</code><br /><br /><code>accident &lt;- potty_records %&gt;%</code><br /><code>  filter(`Potty break or in-house accident?` == &#39;Accident&#39;)</code><br /><br /><code>accident_hour &lt;- ggplot(aes(x = hour), data = accident) + geom_histogram(bins = 24, color = &#39;black&#39;, fill = &#39;#CE1141&#39;) +</code><br /><code>  ggtitle(&#39;Histogram of Accident Times by Type&#39;) +</code><br /><code>  facet_wrap(~ `U(rination), D(efecation), N(either), B(oth)`) +</code><br /><code>  theme_minimal()</code><br /><br /><code>accident_hour</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_04.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_04.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_04.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_04.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_04" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.3 Histogram of Accident Times by Type.</p>
<p><code>success_hour</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_05.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_05.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_05.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_05.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_05" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.4 Histogram of Success Times by Type.</p>
<p>Again, the afternoon seems to be her most active time for bathroom breaks, as well as when the most accidents occur. This is probably due to Aimee being out of her crate and having more free range.</p>
<p>Let&#39;s also examine the actions before <code>potty times</code> and compare successes with in-house accidents.</p>
<p><code>a &lt;- potty_records %&gt;%</code><br /><code>  filter(`Potty break or in-house accident?` == &#39;Success&#39;) %&gt;%</code><br /><code>  group_by(`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`) %&gt;%</code><br /><code>  summarise(n = n()) %&gt;%</code><br /><code>  mutate(freq = n/sum(n))</code><br /><br /><code>action_success &lt;- ggplot(aes(x = `What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`, y = freq), data = a) +</code><br /><code>  geom_bar(stat = &quot;identity&quot;, fill = &quot;blue&quot;) +</code><br /><code>   geom_text(aes(label = paste0(round(freq*100, 0), &quot;%&quot;)), position = position_stack(vjust = 0.5), size = 3.5) +</code><br /><code>  theme_fivethirtyeight() +</code><br /><code>      labs(x = &quot;&quot;,</code><br /><code>       y = &quot;Frequency&quot;,</code><br /><code>       title = &#39;Action Before Successful Potty Times&#39;)</code><br /><br /><code>b &lt;- potty_records %&gt;%</code><br /><code>  filter(`Potty break or in-house accident?` == &#39;Accident&#39;) %&gt;%</code><br /><code>  group_by(`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`) %&gt;%</code><br /><code>  summarise(n = n()) %&gt;%</code><br /><code>  mutate(freq = n/sum(n))</code><br /><br /><code>action_accident &lt;- ggplot(aes(x = `What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`, y = freq), data = b) +</code><br /><code>  geom_bar(stat = &quot;identity&quot;, fill = &quot;#E31837&quot;) +</code><br /><code>   geom_text(aes(label = paste0(round(freq*100, 0), &quot;%&quot;)), position = position_stack(vjust = 0.5), size = 3.5) +</code><br /><code>  theme_fivethirtyeight() +</code><br /><code>      labs(x = &quot;&quot;,</code><br /><code>       y = &quot;Frequency&quot;,</code><br /><code>       title = &#39;Action Before Accident Potty Times&#39;)</code><br /><br /><code>action_success</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_06.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_06.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_06.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_06.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_06" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.5 Bar Chart of Success by Before Action.</p>
<p><code>action_accident</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_07.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_07.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_07.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_07.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_07" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.5 Bar Chart of Accident by Before Action.</p>
<p>Examining the action-before-accident bar chart shows a clear trend of sniffing before the accident happens. This is a common and intuitive tell from any dog that they are searching for a relief spot, but it is nice to have the data to support the claim.</p>
<p>Lastly, let&#39;s plot the consequences for <code>Success</code> and <code>Accident</code> using <code>Consequences for the dog (play, treat, walk, scolding, clean up/no response?)</code>.</p>
<p><code>c &lt;- potty_records %&gt;%</code><br /><code>  filter(`Potty break or in-house accident?` == &#39;Success&#39;) %&gt;%</code><br /><code>  group_by(`Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`) %&gt;%</code><br /><code>  summarise(n = n()) %&gt;%</code><br /><code>  mutate(freq = n/sum(n))</code><br /><br /><code>ggplot(aes(x = `Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`, y = freq), data = c) +</code><br /><code>  geom_bar(stat = &quot;identity&quot;, fill = &quot;blue&quot;) +</code><br /><code>   geom_text(aes(label = paste0(round(freq*100, 0), &quot;%&quot;)), position = position_stack(vjust = 0.5), size = 3.5) +</code><br /><code>  theme_fivethirtyeight() +</code><br /><code>      labs(x = &quot;&quot;,</code><br /><code>       y = &quot;Frequency&quot;,</code><br /><code>       title = &#39;Consequences after Successful Relief&#39;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_08.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_08.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_08.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_08.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_08" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.6 Bar Chart of Success by Consequence.</p>
<p><code>d &lt;- potty_records %&gt;%</code><br /><code>  filter(`Potty break or in-house accident?` == &#39;Accident&#39;) %&gt;%</code><br /><code>  group_by(`Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`) %&gt;%</code><br /><code>  summarise(n = n()) %&gt;%</code><br /><code>  mutate(freq = n/sum(n))</code><br /><br /><code>ggplot(aes(x = `Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`, y = freq), data = d) +</code><br /><code>  geom_bar(stat = &quot;identity&quot;, fill = &quot;#E31837&quot;) +</code><br /><code>   geom_text(aes(label = paste0(round(freq*100, 0), &quot;%&quot;)), position = position_stack(vjust = 0.5), size = 3.5) +</code><br /><code>  theme_fivethirtyeight() +</code><br /><code>      labs(x = &quot;&quot;,</code><br /><code>       y = &quot;Frequency&quot;,</code><br /><code>       title = &#39;Consequences for Accident in House&#39;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_09.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_09.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_09.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_09.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_09" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.6 Bar Chart of Accident by Consequence.</p>
<p>When training Aimee we are following Karen Pryor&#39;s positive reinforcement method, and it definitely appears in the data, but 33% of the time my partner and I could not hold back the scolding. After all, we are only human.</p>
<h3>Formulate hypothesis around EDA</h3>
<p>The available data is limited to the bathroom data. Using the <code>potty_records</code> we know whether she has a <code>Success</code> or an <code>Accident</code>. Based upon the data my hypotheses are:</p>
<ul><li>Based upon what she was doing pre-elimination we can try to determine whether we will have a <code>Success</code> or an <code>Accident</code>. This may or may not be enough to build a sufficient prediction model, but we can gain some insights from the variable importance of a machine learning model. A better question may be &quot;What makes the <code>Accident</code> tally go down or up?&quot; For instance, is there any difference between actions before elimination, or between consequences? Or does the time of meals have anything to do with whether the pup will have a <code>Success</code> or an <code>Accident</code>?</li><li>Consequences for the dog seem to be making a big difference in improving the <code>Success</code> rate.</li><li>Based upon hour and type of potty there doesn&#39;t seem to be a difference in whether an elimination will be a <code>Success</code> or an <code>Accident</code>.</li></ul>
<p>Now let&#39;s evaluate these hypotheses by building some models and a few more plots.</p>
<p><code>potty_records %&gt;%</code><br /><code>  group_by(Date, `Potty break or in-house accident?`) %&gt;%</code><br /><code>  summarise(n = n()) %&gt;%</code><br /><code>  na.omit() %&gt;%</code><br /><code>  ggplot(aes(`Potty break or in-house accident?`, n)) +</code><br /><code>  geom_boxplot(color = &quot;black&quot;, aes(fill = factor(`Potty break or in-house accident?`))) +</code><br /><code>  theme_bw() +</code><br /><code>  scale_fill_brewer(palette = &quot;Blues&quot;) +</code><br /><code>  labs(title = &quot;Potty break or in-house accident?&quot;,</code><br /><code>       x = &quot;&quot;,</code><br /><code>       y = &quot;&quot;) +</code><br /><code>  guides(fill = guide_legend(title = &quot;Type&quot;))</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_10.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_10.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_10.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_10.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_10" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.7 Box Plot.</p>
<p>Examining the box plot, we see that the daily <code>Accident</code> count appears to have a wider variance, while <code>Success</code> occurs more often but has one outlier. Since the counts are grouped by day with <code>group_by</code>, I can remember the unsuccessful day of housebreaking. Let&#39;s dig deeper and build some models.</p>
<h3>Correlation is different from causation.</h3>
<p>By building a classification model we can better understand the relationships between the variables, and perhaps explain changes in <code>Success</code> and <code>Accident</code>. But those relationships are correlations: changes in the <code>Success</code> rate are associated with certain variables, not necessarily caused by them.</p>
<h3>Model Building</h3>
<p>Since our outcome is binary, we will use a classification model to predict <code>Success</code> or <code>Accident</code>. I will also use some plotting and variable importance, via the <code>caret</code> and <code>lime</code> packages, to see what information the variables carry.</p>
<p>Let&#39;s build and evaluate a model to help us determine the important variables for <code>Success</code> and <code>Accident</code>, after removing time stamps and dates from the data. We will also remove <code>Trial No.</code> and <code>day_of_week</code> because they do not drive whether Aimee will have a <code>Success</code>, and we do not want to overfit the model.</p>
<p><code>potty_records_model &lt;- potty_records %&gt;%</code><br /><code>  select(-Notes, -`Time`, -Date, -`Trial No.`, -day_of_week) %&gt;%</code><br /><code>  mutate(`Potty break or in-house accident?` = as.factor(`Potty break or in-house accident?`),</code><br /><code>         `U(rination), D(efecation), N(either), B(oth)` = as.factor(`U(rination), D(efecation), N(either), B(oth)`),</code><br /><code>         `What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)` = as.factor(`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`),</code><br /><code>         `Consequences for the dog (play, treat, walk, scolding, clean up/no response?)` = as.factor(`Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`)) %&gt;%</code><br /><code>  na.omit()</code><br /><br /><code>potty_records_model &lt;- potty_records_model %&gt;%</code><br /><code>  rename(type = `U(rination), D(efecation), N(either), B(oth)`, action_before = `What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`, Consequences = `Consequences for the dog (play, treat, walk, scolding, clean up/no response?)`)</code><br /><br /><code># Replace NAs w/ 0s</code><br /><code>potty_records_model &lt;- potty_records_model %&gt;%</code><br /><code>  mutate_if(is.numeric, ~ replace(., is.na(.), 0))</code></p>
<p>Now we split the data into training and test sets, with 60% of the rows used for training. In this situation, we are interested in <code>Success</code> potty trips. Then we can fit a random forest using repeated cross-validation.</p>
<p><code># training and test set</code><br /><code>set.seed(42)</code><br /><code>index &lt;- createDataPartition(potty_records_model$`Potty break or in-house accident?`, p = 0.6, list = FALSE)</code><br /><code>train_data &lt;- potty_records_model[index, ]</code><br /><code>test_data  &lt;- potty_records_model[-index, ]</code><br /><br /><code># modeling</code><br /><code>model_rf &lt;- caret::train(`Potty break or in-house accident?` ~ .,</code><br /><code>  data = train_data,</code><br /><code>  method = &quot;rf&quot;, # random forest</code><br /><code>  trControl = trainControl(method = &quot;repeatedcv&quot;,</code><br /><code>       number = 10,</code><br /><code>       repeats = 5,</code><br /><code>       verboseIter = FALSE))</code><br /><br /><code>model_rf</code></p>
<p><code>## Random Forest</code><br /><code>##</code><br /><code>## 219 samples</code><br /><code>##   4 predictor</code><br /><code>##   2 classes: &#39;Accident&#39;, &#39;Success&#39;</code><br /><code>##</code><br /><code>## No pre-processing</code><br /><code>## Resampling: Cross-Validated (10 fold, repeated 5 times)</code><br /><code>## Summary of sample sizes: 197, 197, 197, 198, 197, 197, ...</code><br /><code>## Resampling results across tuning parameters:</code><br /><code>##</code><br /><code>##   mtry  Accuracy   Kappa</code><br /><code>##    2    0.9663919  0.9235532</code><br /><code>##    7    0.9826802  0.9631164</code><br /><code>##   12    0.9782138  0.9531246</code><br /><code>##</code><br /><code>## Accuracy was used to select the optimal model using the largest value.</code><br /><code>## The final value used for the model was mtry = 7.</code></p>
<p>The cross-validated accuracy of the model is 98.27%. Our goal is not to perfect a prediction of whether she will have an accident or a successful bathroom trip, but it is good to know that our dependent variable is predicted well by the independent variables in our dataset. Since we have good prediction accuracy, we can now extract insights.</p>
<p><code>pred &lt;- data.frame(sample_id = 1:nrow(test_data), predict(model_rf, test_data, type = &quot;prob&quot;), actual = test_data$`Potty break or in-house accident?`) %&gt;%</code><br /><code>  mutate(prediction = colnames(.)[2:3][apply(.[, 2:3], 1, which.max)], correct = ifelse(actual == prediction, &quot;correct&quot;, &quot;wrong&quot;))</code><br /><br /><code>confusionMatrix(pred$actual, pred$prediction, positive = &quot;Success&quot;)</code></p>
<p><code>## Confusion Matrix and Statistics</code><br /><code>##</code><br /><code>##           Reference</code><br /><code>## Prediction Accident Success</code><br /><code>##   Accident       51       0</code><br /><code>##   Success         2      91</code><br /><code>##</code><br /><code>##                Accuracy : 0.9861</code><br /><code>##                  95% CI : (0.9507, 0.9983)</code><br /><code>##     No Information Rate : 0.6319</code><br /><code>##     P-Value [Acc &gt; NIR] : &lt;2e-16</code><br /><code>##</code><br /><code>##                   Kappa : 0.9699</code><br /><code>##  Mcnemar&#39;s Test P-Value : 0.4795</code><br /><code>##</code><br /><code>##             Sensitivity : 1.0000</code><br /><code>##             Specificity : 0.9623</code><br /><code>##          Pos Pred Value : 0.9785</code><br /><code>##          Neg Pred Value : 1.0000</code><br /><code>##              Prevalence : 0.6319</code><br /><code>##          Detection Rate : 0.6319</code><br /><code>##    Detection Prevalence : 0.6458</code><br /><code>##       Balanced Accuracy : 0.9811</code><br /><code>##</code><br /><code>##        &#39;Positive&#39; Class : Success</code><br /><code>## </code></p>
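<p>One detail worth checking when reading this table: <code>caret::confusionMatrix()</code> takes the predicted classes as its first argument (<code>data</code>) and the observed classes as <code>reference</code>, both as factors with the same levels. As I read it, the call above passes them in the opposite order, so the <code>Prediction</code> and <code>Reference</code> labels (and with them sensitivity and specificity) are swapped. A minimal sketch of the conventional order, using small made-up vectors rather than the real <code>pred</code> data frame:</p>
<p><code>library(caret)</code><br /><br /><code># Hypothetical toy labels, not the housebreaking data</code><br /><code>lv &lt;- c(&quot;Accident&quot;, &quot;Success&quot;)</code><br /><code>actual &lt;- factor(c(&quot;Success&quot;, &quot;Success&quot;, &quot;Accident&quot;, &quot;Accident&quot;, &quot;Success&quot;), levels = lv)</code><br /><code>predicted &lt;- factor(c(&quot;Success&quot;, &quot;Accident&quot;, &quot;Accident&quot;, &quot;Accident&quot;, &quot;Success&quot;), levels = lv)</code><br /><br /><code># data = predicted classes, reference = observed classes</code><br /><code>cm &lt;- confusionMatrix(data = predicted, reference = actual, positive = &quot;Success&quot;)</code><br /><code>cm$table</code><br /><code>cm$byClass[&quot;Sensitivity&quot;]</code></p>
<p>With this order, the rows of <code>cm$table</code> are the predictions and the columns are the observed outcomes, which matches the labels that <code>caret</code> prints.</p>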
<p>LIME needs the data without the response variable.</p>
<p><code>train_x &lt;- dplyr::select(train_data, -`Potty break or in-house accident?`)</code><br /><code>test_x &lt;- dplyr::select(test_data, -`Potty break or in-house accident?`)</code><br /><br /><code>train_y &lt;- dplyr::select(train_data, `Potty break or in-house accident?`)</code><br /><code>test_y &lt;- dplyr::select(test_data, `Potty break or in-house accident?`)</code></p>
<p>Build the explainer with <code>lime()</code>, the key function in <code>lime</code> for explaining the model&#39;s predictions.</p>
<p><code>explainer &lt;- lime(train_x, model_rf, n_bins = 5, quantile_bins = TRUE)</code></p>
<p>Run the <code>explain()</code> function. We set <code>n_features</code> = 8, which limits each explanation to the most influential features; trying to read every feature at once can lead to more confusion. Next we set the <code>feature_select</code> argument to &quot;forward_selection&quot;. The package default is &quot;auto&quot;, which picks forward selection only when <code>n_features</code> is 6 or fewer, so we request it explicitly.</p>
<p><code>explanation_df &lt;- lime::explain(test_x, explainer, n_labels = 2, n_features = 8, n_permutations = 1000, feature_select = &quot;forward_selection&quot;)</code></p>
<p>The feature importance plot is the reason LIME is so useful. This allows us to visualize each of the first 3 cases (observations) from the test data. The top four features for each case are shown. Note that they are not the same for each case. The green bars mean that the feature supports the model conclusion, and the red bars contradict.</p>
<p><code>plot_features(explanation_df[1:24, ], ncol = 2) +</code><br /><code>  labs(title = &quot;LIME Feature Importance Visualization&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_11.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_11.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_11.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_11.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_11" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.9 Lime Feature Importance.</p>
<p>Lime provides an easy-to-read plot, but what does the data tell us? Let&#39;s examine case 1:</p>
<p><code>pred %&gt;%</code><br /><code>  filter(sample_id == 1)</code></p>
<p><code>##   sample_id Accident Success  actual prediction correct</code><br /><code>## 1         1    0.008   0.992 Success    Success correct</code></p>
<p>Case 1 was correctly predicted to come from the <code>Success</code> group because:</p>
<ul><li>Play was the consequence after the potty break</li><li>The hour the action occurred was &lt;= 8</li><li>The action before was sniffing</li><li>The type was labeled U</li></ul>
<p>For each feature, the explanatory plot shows the range of values (bin) that the data point falls into. If being in that bin pushes the model toward the predicted label, the feature counts as support for the prediction; if it pushes away, it is scored as contradictory. For instance, examining case 3 on the plot, scolding contradicts the support for a <code>Success</code>.</p>
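<p>In the data frame returned by <code>explain()</code>, the support/contradict split corresponds to the sign of the <code>feature_weight</code> column. A small sketch with made-up rows: the weights and bin descriptions below are hypothetical, and only the column names come from <code>lime</code>.</p>
<p><code>library(dplyr)</code><br /><br /><code># Hypothetical rows shaped like the output of lime::explain()</code><br /><code>toy_explanation &lt;- tibble(</code><br /><code>  case = c(&quot;3&quot;, &quot;3&quot;, &quot;3&quot;),</code><br /><code>  label = &quot;Success&quot;,</code><br /><code>  feature_desc = c(&quot;Consequences = scolding&quot;, &quot;action_before = crate&quot;, &quot;hour &lt;= 8&quot;),</code><br /><code>  feature_weight = c(-0.21, 0.35, 0.04)</code><br /><code>)</code><br /><br /><code># Positive weights support the label, negative weights contradict it</code><br /><code>toy_direction &lt;- toy_explanation %&gt;%</code><br /><code>  mutate(direction = ifelse(feature_weight &gt; 0, &quot;Supports&quot;, &quot;Contradicts&quot;)) %&gt;%</code><br /><code>  arrange(desc(abs(feature_weight)))</code><br /><code>toy_direction</code></p>
<p>Sorting by the absolute weight puts the most influential features first, which is roughly the order the bars take in the plot.</p>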
<p><code>plot_explanations()</code> is another great visualization that can be utilized with LIME. The function produces a faceted heatmap of all case, label, and feature combinations.</p>
<p><code>df &lt;- explanation_df %&gt;%</code><br /><code>  mutate(case = as.numeric(case)) %&gt;%</code><br /><code>  filter(case &lt; 31)</code><br /><br /><code>plot_explanations(df) +</code><br /><code>  labs(title = &quot;LIME Feature Importance Heatmap&quot;,</code><br /><code>   subtitle = &quot;Hold Out (Test) Set, First 30 Cases Shown&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_12.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_12.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_12.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_12.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_12" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.10 Lime Feature Importance Heatmap.</p>
<h3>Power Test and Difference in Means</h3>
<p>Since we do not have a randomized controlled experiment, we will control for type and see where we are achieving <code>Success</code> in the housebreaking. First, examine the overall <code>Success</code> rate.</p>
<p><code>test &lt;- potty_records %&gt;%</code><br /><code>  mutate(Success = case_when(`Potty break or in-house accident?` == &#39;Success&#39; ~ 1,</code><br /><code>                             `Potty break or in-house accident?` == &#39;Accident&#39; ~ 0))</code><br /><br /><code>test_mean &lt;- test %&gt;%</code><br /><code>  summarise(n = n(),</code><br /><code>            mean_success = mean(Success, na.rm = TRUE),</code><br /><code>            std_error = sd(Success, na.rm = TRUE) / sqrt(n),</code><br /><code>            sd = sd(Success, na.rm = TRUE),</code><br /><code>            lower.ci = mean_success - qt(1 - (0.05/2), n - 1) * std_error,</code><br /><code>            upper.ci = mean_success + qt(1 - (0.05/2), n - 1) * std_error)</code><br /><code>test_mean</code></p>
<p><code>## # A tibble: 1 x 6</code><br /><code>##       n mean_success std_error    sd lower.ci upper.ci</code><br /><code>##   &lt;int&gt;        &lt;dbl&gt;     &lt;dbl&gt; &lt;dbl&gt;    &lt;dbl&gt;    &lt;dbl&gt;</code><br /><code>## 1   365        0.645    0.0251 0.479    0.595    0.694</code></p>
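<p>We have an overall <code>Success</code> rate of about 64%. Let&#39;s now examine where we are achieving the most <code>Success</code>.</p>
<p>We can control for <code>U(rination), D(efecation), N(either), B(oth)</code> to see whether the <code>Success</code> rate differs by type. Stratifying like this removes one possible confounder, but on its own it does not make the results causal.</p>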
<p><code>test_type &lt;- test %&gt;%</code><br /><code>  group_by(`U(rination), D(efecation), N(either), B(oth)`) %&gt;%</code><br /><code>  summarise(n = n(),</code><br /><code>            mean_success = mean(Success, na.rm = TRUE),</code><br /><code>            std_error = sd(Success, na.rm = TRUE) / sqrt(n),</code><br /><code>            sd = sd(Success, na.rm = TRUE),</code><br /><code>            lower.ci = mean_success - qt(1 - (0.05/2), n - 1) * std_error,</code><br /><code>            upper.ci = mean_success + qt(1 - (0.05/2), n - 1) * std_error) %&gt;%</code><br /><code>  filter(n &gt; 2) %&gt;%</code><br /><code>  arrange(desc(mean_success))</code><br /><code>test_type</code></p>
<p><code>## # A tibble: 3 x 7</code><br /><code>##   `U(rination), D(ef~     n mean_success std_error    sd lower.ci upper.ci</code><br /><code>##                                        </code><br /><code>## 1 B                      53        0.755    0.0597 0.434    0.635    0.874</code><br /><code>## 2 D                      41        0.659    0.0750 0.480    0.507    0.810</code><br /><code>## 3 U                     269        0.621    0.0296 0.486    0.562    0.679</code></p>
<p>Even though it can feel like I have been making progress, the least progress is with <code>U</code>. This could be because of how often she goes <code>U</code>, and because Aimee is immediately taken outside if a larger accident is taking place.</p>
<p>Let&#39;s now visualize the statistics.</p>
<p><code>test_type %&gt;%</code><br /><code>  rename(Type = `U(rination), D(efecation), N(either), B(oth)`) %&gt;%</code><br /><code>ggplot(aes(mean_success, n, color = Type)) +</code><br /><code>    geom_point() +</code><br /><code>    geom_errorbarh(aes(xmin = lower.ci, xmax = upper.ci)) +</code><br /><code>  labs(x = &quot;Success Rate&quot;,</code><br /><code>       y = &quot;n&quot;,</code><br /><code>       title = &#39;Success Rate by Type&#39;) +</code><br /><code>  theme_bw()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_13.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_13.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_13.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_13.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_13" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 2 Success Rate by Type.</p>
<p>This snapshot of the data tells us that <code>D</code> has a higher rate of <code>Success</code> than <code>U</code>, but with far fewer observations its confidence interval is much wider and overlaps the one for <code>U</code>.</p>
<p>Let&#39;s also control for <code>What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)</code> and see if our results change.</p>
<p><code>test_elimination &lt;- test %&gt;%</code><br /><code>  group_by(`What was the dog doing pre-elimination? (nap, meal, walk, play, sniffing, pacing, etc.)`) %&gt;%</code><br /><code>  summarise(n = n(),</code><br /><code>            mean_success = mean(Success, na.rm = TRUE),</code><br /><code>            std_error = sd(Success, na.rm = TRUE) / sqrt(n),</code><br /><code>            sd = sd(Success, na.rm = TRUE),</code><br /><code>            lower.ci = mean_success - qt(1 - (0.05/2), n - 1) * std_error,</code><br /><code>            upper.ci = mean_success + qt(1 - (0.05/2), n - 1) * std_error) %&gt;%</code><br /><code>  filter(n &gt; 2) %&gt;%</code><br /><code>  arrange(desc(mean_success))</code><br /><code>test_elimination</code></p>
<p><code>## # A tibble: 6 x 7</code><br /><code>##   `What was the dog ~     n mean_success std_error    sd lower.ci upper.ci</code><br /><code>##                                        </code><br /><code>## 1 crate                  71        0.972    0.0198 0.167    0.932    1.01</code><br /><code>## 2 nap                    27        0.889    0.0616 0.320    0.762    1.02</code><br /><code>## 3 signal                 15        0.600    0.131  0.507    0.319    0.881</code><br /><code>## 4 sniffing              215        0.553    0.0340 0.498    0.487    0.620</code><br /><code>## 5 pacing                 14        0.429    0.137  0.514    0.132    0.725</code><br /><code>## 6 play                   21        0.333    0.105  0.483    0.113    0.553</code></p>
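<p>When Aimee is in her crate before going out she has the highest success rate. Note that the upper confidence limits for <code>crate</code> and <code>nap</code> exceed 1, which is an artifact of applying a t-based interval to a proportion close to 1.</p>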
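<p>An interval built for proportions avoids the overshoot past 1 seen in the table above. One option, not used in the original analysis, is the Wilson score interval from base R&#39;s <code>prop.test()</code>, or the exact interval from <code>binom.test()</code>. The counts below (69 successes out of 71 crate trips) are inferred from the rounded table, so treat them as an assumption:</p>
<p><code># Assumed counts: 0.972 * 71 is roughly 69 successes</code><br /><code>successes &lt;- 69</code><br /><code>trials &lt;- 71</code><br /><br /><code># Wilson score interval (no continuity correction) and exact binomial interval</code><br /><code>wilson &lt;- prop.test(successes, trials, correct = FALSE)$conf.int</code><br /><code>exact &lt;- binom.test(successes, trials)$conf.int</code><br /><br /><code>round(rbind(wilson = as.numeric(wilson), exact = as.numeric(exact)), 3)</code></p>
<p>Both intervals are bounded by 0 and 1 by construction, so they stay interpretable even when nearly every trip is a success.</p>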
<p>Now we run a t.test for statistical significance between <code>Success</code> and <code>Accident</code> by date but before the test we will remove missing values (when Aimee had no action but was taken outside).</p>
<p><code>test &lt;- test[c(-56, -15), ]</code><br /><br /><code>hypothesis &lt;- with(test, t.test(Success == 1, Success == 0))</code><br /><code>hypothesis</code></p>
<p><code>##</code><br /><code>##  Welch Two Sample t-test</code><br /><code>##</code><br /><code>## data:  Success == 1 and Success == 0</code><br /><code>## t = 8.1307, df = 724, p-value = 1.85e-15</code><br /><code>## alternative hypothesis: true difference in means is not equal to 0</code><br /><code>## 95 percent confidence interval:</code><br /><code>##  0.2194118 0.3591006</code><br /><code>## sample estimates:</code><br /><code>## mean of x mean of y</code><br /><code>## 0.6446281 0.3553719</code></p>
<p><code>obs_diff &lt;- hypothesis[[&quot;estimate&quot;]][[&quot;mean of x&quot;]] - hypothesis[[&quot;estimate&quot;]][[&quot;mean of y&quot;]]</code><br /><code>obs_diff</code></p>
<p><code>## [1] 0.2892562</code></p>
<p>Successful housebreaking trips occur at a rate of 0.6446281, while accidents occur at a rate of 0.3553719. That&#39;s a difference of 0.2892562, which would be great if it were true. The most likely reason for the odd difference-in-means result is that we didn&#39;t collect enough data.</p>
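<p>One thing to keep in mind when reading this result: <code>Success == 0</code> is the exact complement of <code>Success == 1</code>, so the second mean is always one minus the first, and the difference is fully determined by the success rate. A quick check of that arithmetic, using the estimate reported in the output above:</p>
<p><code>p_success &lt;- 0.6446281                      # &quot;mean of x&quot; from the t-test output</code><br /><code>p_accident &lt;- 1 - p_success                 # complement, matches &quot;mean of y&quot;</code><br /><code>obs_diff_check &lt;- p_success - p_accident    # same as 2 * p_success - 1</code><br /><br /><code>c(p_accident = p_accident, obs_diff = obs_diff_check)</code></p>
<p>Because the two samples are not independent, a one-sample test of the proportion (for example <code>binom.test()</code> against 0.5) may be a more natural fit. That is a suggestion, not what the original analysis ran.</p>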
<p>Let&#39;s plot the p-value by date.</p>
<p><code>test_by_day &lt;- test %&gt;%</code><br /><code> group_by(Date) %&gt;%</code><br /><code> summarise(p_value = t.test(Success == 1, Success == 0)$p.value,</code><br /><code>    Success = t.test(Success == 1, Success == 0)$estimate[1])</code><br /><br /><code>test_by_day %&gt;%</code><br /><code>  ggplot(aes(Date, p_value)) +</code><br /><code>  geom_line(size = 1) +</code><br /><code>  geom_hline(yintercept = 0.05, linetype=&quot;dashed&quot;, color = &quot;red&quot;) +</code><br /><code>  labs(title = &quot;P-Value of Success by Day&quot;,</code><br /><code>          subtitle = &quot;With 0.05 Threshold&quot;) +</code><br /><code>  theme_fivethirtyeight() </code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_14.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_14.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_14.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_14.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_14" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 2.1 P-Value of Success by Day.</p>
<p>The difference in means is statistically significant at the conventional levels of confidence. As the p-value is smaller than our 0.05 significance level, we can reject the null hypothesis that there is no statistical difference in <code>Success</code> vs <code>Accident</code> for housebreaking Aimee. This type of statistical test is useful for me to determine whether housebreaking Aimee resulted in a statistical difference of <code>Success</code>.</p>
<p>Lastly we can calculate the effect of success over time and the total effect of success.</p>
<p><code>test_by_acc &lt;- test %&gt;%</code><br /><code> group_by(Date) %&gt;%</code><br /><code> summarise(Accident = t.test(Success == 1, Success == 0)$estimate[2])</code><br /><br /><code>effect &lt;- inner_join(test_by_day, test_by_acc, by = &quot;Date&quot;) %&gt;%</code><br /><code>  mutate(effect = (Success - Accident))</code><br /><br /><code>effect %&gt;%</code><br /><code>  summarise(mean_effect = mean(effect), total_effect = sum(effect))</code></p>
<p><code>## # A tibble: 1 x 2</code><br /><code>##   mean_effect total_effect</code><br /><code>##         &lt;dbl&gt;        &lt;dbl&gt;</code><br /><code>## 1       0.315         10.1</code></p>
<p>Let&#39;s plot the effect over time for visual ease.</p>
<p><code>effect %&gt;%</code><br /><code>  ggplot(aes(Date, effect)) +</code><br /><code>  geom_line(size = 1, color = &quot;blue&quot;) +</code><br /><code>  labs(title = &quot;Percent Change of Success by Day&quot;) +</code><br /><code>  theme_fivethirtyeight() </code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_15.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_15.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_15.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_15.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_15" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 2.2 Percent Change of Success by Day.</p>
<h3>Final hypothesis</h3>
<p>My final hypothesis is that Aimee is more accident prone later in the day.</p>
<p><code>ggplot(data = test, aes(`Potty break or in-house accident?`, hour)) +</code><br /><code>  geom_boxplot(color = &quot;#007DC5&quot;, alpha = 0.8) +</code><br /><code>  geom_jitter(size = 0.5) +</code><br /><code>  theme_bw() +</code><br /><code>  labs(x = &quot;&quot;,</code><br /><code>    y = &quot;&quot;,</code><br /><code>    title = &quot;&quot;,</code><br /><code>    subtitle = &quot;Box Plot of Potty break or in-house accident? by Hour&quot;) +</code><br /><code>  coord_flip()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_16.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_16.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_16.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_16.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_16" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 2.3 Box Plot of Potty break or in-house accident? by Hour</p>
<p><code>qplot(fill = `Potty break or in-house accident?`, x = hour, data = test, geom = &quot;density&quot;,</code><br /><code>      alpha = I(0.5),</code><br /><code>      adjust = 1,</code><br /><code>      xlim = c(-5, 30)) +</code><br /><code>  theme_bw()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_17.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_17.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_17.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_17.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_17" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 2.4 Density Plot by Hour</p>
<p><code>hour_t.test &lt;- with(test, t.test(hour ~ `Potty break or in-house accident?`))</code><br /><code>hour_t.test</code></p>
<p><code>##</code><br /><code>##  Welch Two Sample t-test</code><br /><code>##</code><br /><code>## data:  hour by Potty break or in-house accident?</code><br /><code>## t = 2.1031, df = 296.87, p-value = 0.0363</code><br /><code>## alternative hypothesis: true difference in means is not equal to 0</code><br /><code>## 95 percent confidence interval:</code><br /><code>##  0.07694935 2.31919455</code><br /><code>## sample estimates:</code><br /><code>## mean in group Accident  mean in group Success</code><br /><code>##               14.86047               13.66239</code></p>
<p>As the p-value is smaller than our 0.05 significance level, we reject the null hypothesis that there is no difference in the mean hour between the two groups of <code>Potty break or in-house accident?</code>. This type of statistical test is useful to determine whether the hour of the day is associated with a statistical difference in success. The 95% confidence interval means that if data continued to be collected using the same techniques, 95% of the intervals constructed this way would contain the true difference in mean hour. The box plot above gives an easy visualization of how the hour distributions of the two groups compare.</p>
<p><code>hour_diff &lt;- round(hour_t.test$estimate[1] - hour_t.test$estimate[2], 1)</code></p>
<p>Our study finds that the hour of day is, on average, 1.2 hours later in the <code>Accident</code> group than in the <code>Success</code> group (t-statistic 2.1, p = 0.036, 95% CI [0.1, 2.3] hours).</p>
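<p>The numbers in that summary can be pulled straight from the object returned by <code>t.test()</code>. A sketch on simulated hours: the data below is randomly generated for illustration and is not Aimee&#39;s records, and the names <code>sim</code> and <code>outcome</code> are made up.</p>
<p><code>set.seed(1)</code><br /><code># Simulated stand-in for the test data frame</code><br /><code>sim &lt;- data.frame(</code><br /><code>  outcome = rep(c(&quot;Accident&quot;, &quot;Success&quot;), times = c(40, 60)),</code><br /><code>  hour = c(rnorm(40, mean = 15, sd = 5), rnorm(60, mean = 13.5, sd = 5))</code><br /><code>)</code><br /><br /><code>res &lt;- t.test(hour ~ outcome, data = sim)</code><br /><br /><code># Collect the pieces that go into the reporting sentence</code><br /><code>summary_row &lt;- data.frame(</code><br /><code>  diff = unname(res$estimate[1] - res$estimate[2]),  # Accident minus Success</code><br /><code>  t = unname(res$statistic),</code><br /><code>  p = res$p.value,</code><br /><code>  ci_low = res$conf.int[1],</code><br /><code>  ci_high = res$conf.int[2]</code><br /><code>)</code><br /><code>round(summary_row, 3)</code></p>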
<h3>Conclusion</h3>
<p>To clarify, I am not a professional trainer, but I thought using data to measure whether my pup was progressing in the right direction seemed worthwhile. Also, I used no form of punishment and strongly suggest positive reinforcement with a clicker. Punishment does not work because dogs don&#39;t remember the act of going to the bathroom in the house, which is why positive reinforcement is key. If you scare your animal while catching them in the act, it will only cause them to be afraid of you when they have to potty and will lead to finding hidden accidents.</p>
<p>Now for the data conclusions: using a schedule and rewarding good behavior were key to the quick learning results while housebreaking.</p>
<p>Remember that correlation is not causation. The lateness of the hour is not causing Aimee to have more or less success with housebreaking. It is more likely due to both my partner and me being home and present, and being able to pay more or less attention to her behavior.</p>
<p>In the future we could also use the food and water data I collected to help with determining variables in housebreaking. Animals that eat/drink on a set schedule tend to use the bathroom on a schedule. Another useful variable may have been to group by Date and calculate the average time between potty trips to gather a general pattern. A good data analysis always generates insights but also helps generate more questions.</p>
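<p>As a sketch of that last idea, the gap between consecutive trips can be computed per day with <code>dplyr</code>. The column name <code>time</code> and the rows below are made up for illustration:</p>
<p><code>library(dplyr)</code><br /><br /><code># Hypothetical log: one row per potty trip</code><br /><code>trips &lt;- tibble(</code><br /><code>  Date = as.Date(c(&quot;2018-01-01&quot;, &quot;2018-01-01&quot;, &quot;2018-01-01&quot;, &quot;2018-01-02&quot;, &quot;2018-01-02&quot;)),</code><br /><code>  time = as.POSIXct(c(&quot;2018-01-01 07:00&quot;, &quot;2018-01-01 09:30&quot;, &quot;2018-01-01 13:00&quot;,</code><br /><code>                      &quot;2018-01-02 06:45&quot;, &quot;2018-01-02 10:15&quot;), tz = &quot;UTC&quot;)</code><br /><code>)</code><br /><br /><code># Hours since the previous trip on the same day, then the daily average</code><br /><code>gaps &lt;- trips %&gt;%</code><br /><code>  arrange(Date, time) %&gt;%</code><br /><code>  group_by(Date) %&gt;%</code><br /><code>  mutate(gap_hours = as.numeric(difftime(time, lag(time), units = &quot;hours&quot;))) %&gt;%</code><br /><code>  summarise(mean_gap_hours = mean(gap_hours, na.rm = TRUE))</code><br /><code>gaps</code></p>
<p>The first trip of each day has no previous trip, so its gap is <code>NA</code> and is dropped by <code>na.rm = TRUE</code>.</p>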
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_18-scaled.jpg" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/48_18-scaled.jpg 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/48_18-scaled.jpg 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/48_18-scaled.jpg 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="48_18" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 2.5 Aimee</p>]]></content:encoded>
    </item>
    <item>
      <title>Sentiment Analysis of Red Hot Chili Peppers</title>
      <link>https://allanbutler.com/sentiment-analysis-red-hot-chili-peppers/</link>
      <guid isPermaLink="true">https://allanbutler.com/sentiment-analysis-red-hot-chili-peppers/</guid>
      <pubDate>Thu, 07 Sep 2017 12:00:00 GMT</pubDate>
      <description>Last week, I finished reading &apos;Scar Tissue&apos;, Anthony Kiedis&apos; autobiography. The book details his life and the many years he has been involved with…</description>
      <content:encoded><![CDATA[<p>Last week, I finished reading &#39;Scar Tissue&#39;, Anthony Kiedis&#39; autobiography. The book details his life and the many years he has been involved with the RHCP. Kiedis has lived a life worth telling in a memoir. Constant recollections of his life journeys are spilled into a 400+ page book that does not disappoint. After completing the tell-all story I decided to take a data perspective on the RHCP. After their successful album, &#39;Blood Sugar Sex Magik&#39;, lead guitarist John Frusciante left the band due to the overwhelming popularity, among other issues. After Frusciante left in 1992, the RHCP replaced him with Dave Navarro and created &#39;One Hot Minute&#39;. Although the album went platinum it was not as successful as the earlier title. Frusciante re-joined the RHCP in 1998 and they released &#39;Californication&#39;. The RHCP style in &#39;One Hot Minute&#39; vs &#39;Blood Sugar Sex Magik&#39; is said to contain darker subject matter, which is credited to the addition of Navarro. Using sentiment analysis, we will compare the albums&#39; lyrics.</p>
<h2>Getting the Data by scraping RHCP lyrics</h2>
<p>To gather the lyrical data we will need to scrape the lyrics using <code>rvest</code>.</p>
<p><code>library(knitr)</code><br /><code>library(rvest)</code><br /><code>library(tidyr)</code><br /><code>library(tidytext)</code><br /><code>library(wordcloud)</code><br /><code>library(XML)</code><br /><code>library(tidyverse)</code><br /><br /><code># Scrape the paragraph text of one song page</code><br /><code>poe_url &lt;- &#39;https://genius.com/Red-hot-chili-peppers-the-power-of-equality-lyrics&#39;</code><br /><code>poe_html &lt;- read_html(poe_url)</code><br /><code>poe_lyrics &lt;- poe_html %&gt;%</code><br /><code>  html_nodes(&quot;p&quot;) %&gt;%</code><br /><code>  html_text()</code><br /><code>poe_lyric_df &lt;- data.frame(line = seq_along(poe_lyrics), text = poe_lyrics)</code><br /><code>poe_lyric_df$text &lt;- as.character(poe_lyric_df$text)</code><br /><br /><code># One row per word (tidy text format)</code><br /><code>poe &lt;- poe_lyric_df %&gt;%</code><br /><code>  unnest_tokens(word, text)</code><br /><br /><code># blood_sugar is assumed to hold the tokenized lyrics of every album track,</code><br /><code># combined the same way as one_minute further down</code><br /><code>blood_sugar &lt;- blood_sugar %&gt;%</code><br /><code>  anti_join(stop_words, by = &quot;word&quot;) %&gt;%</code><br /><code>  filter(!grepl(&#39;[0-9]&#39;, word), word != &#39;verse&#39;, word != &#39;hook&#39;, word != &#39;song&#39;, word != &#39;album&#39;, word != &#39;anthony&#39;)</code></p>
<p>To extract the lyrics we can repeat the format above for each lyric URL, or use <code>purrr</code> to write one function and apply it across an album&#39;s URLs. <code>map_chr</code> returns a character vector with one element per URL, which can then be placed in a data frame (this is by far the most efficient route).</p>
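<p>A minimal sketch of the <code>purrr</code> route. The helper name <code>get_lyrics</code> is hypothetical, and short literal HTML strings stand in for the real song URLs so the example runs offline (<code>read_html()</code> accepts a URL, a path, or literal markup):</p>
<p><code>library(rvest)</code><br /><code>library(purrr)</code><br /><br /><code># Hypothetical helper: read one page and return its paragraph text as one string</code><br /><code>get_lyrics &lt;- function(page) {</code><br /><code>  page %&gt;%</code><br /><code>    read_html() %&gt;%</code><br /><code>    html_nodes(&quot;p&quot;) %&gt;%</code><br /><code>    html_text() %&gt;%</code><br /><code>    paste(collapse = &quot; &quot;)</code><br /><code>}</code><br /><br /><code># Stand-ins for the song pages; in practice these would be the lyric URLs</code><br /><code>pages &lt;- c(warped = &quot;&lt;p&gt;first line&lt;/p&gt;&lt;p&gt;second line&lt;/p&gt;&quot;,</code><br /><code>           aeroplane = &quot;&lt;p&gt;another song&lt;/p&gt;&quot;)</code><br /><br /><code># One character string per song, then one data frame row per song</code><br /><code>lyrics &lt;- map_chr(pages, get_lyrics)</code><br /><code>lyric_df &lt;- data.frame(song = names(lyrics), text = unname(lyrics), stringsAsFactors = FALSE)</code><br /><code>lyric_df</code></p>
<p>From there, <code>unnest_tokens(word, text)</code> can be applied to the whole album at once instead of song by song.</p>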
<p>Now we have the lyrics for &#39;Blood Sugar Sex Magik&#39; and can transform them into the tidy text format.</p>
<p>Now we can do the same for &#39;One Hot Minute&#39;.</p>
<p>Once we have the lyrics for &#39;One Hot Minute&#39; we transform them into the same tidy text format.</p>
<p><code># Combine the per-song data frames, then split the text into one word per row</code><br /><code>one_minute &lt;- bind_rows(mutate(warped, album = &quot;One Hot Minute&quot;, song = &quot;Warped&quot;),</code><br /><code>                       mutate(aeroplane, album = &quot;One Hot Minute&quot;, song = &quot;Aeroplane&quot;),</code><br /><code>                       mutate(deep_kick, album = &quot;One Hot Minute&quot;, song = &quot;Deep Kick&quot;),</code><br /><code>                       mutate(my_friends, album = &quot;One Hot Minute&quot;, song = &quot;My Friends&quot;),</code><br /><code>                       mutate(coffee_shop, album = &quot;One Hot Minute&quot;, song = &quot;Coffee Shop&quot;),</code><br /><code>                       mutate(pea, album = &quot;One Hot Minute&quot;, song = &quot;Pea&quot;),</code><br /><code>                       mutate(one_big_mob, album = &quot;One Hot Minute&quot;, song = &quot;One Big Mob&quot;),</code><br /><code>                       mutate(walkabout, album = &quot;One Hot Minute&quot;, song = &quot;Walkabout&quot;),</code><br /><code>                       mutate(tearjerker, album = &quot;One Hot Minute&quot;, song = &quot;Tearjerker&quot;),</code><br /><code>                       mutate(one_hot_m, album = &quot;One Hot Minute&quot;, song = &quot;One Hot Minute&quot;),</code><br /><code>                       mutate(falling_into_grace, album = &quot;One Hot Minute&quot;, song = &quot;Falling Into Grace&quot;),</code><br /><code>                       mutate(shallow, album = &quot;One Hot Minute&quot;, song = &quot;Shallow&quot;),</code><br /><code>                       mutate(transcending, album = &quot;One Hot Minute&quot;, song = &quot;Transcending&quot;)) %&gt;%</code><br /><code>                       unnest_tokens(word, text)</code><br /><br /><code># Drop stop words, numbers, and scraping residue</code><br /><code>one_minute &lt;- one_minute %&gt;%</code><br /><code>  anti_join(stop_words, by = &quot;word&quot;) %&gt;%</code><br /><code>  filter(!grepl(&#39;[0-9]&#39;, word), word != &#39;verse&#39;, word != &#39;chorus&#39;, word != &#39;song&#39;, word != &#39;album&#39;, word != &#39;red&#39;, word != &#39;hot&#39;,</code><br /><code>  word != &#39;peppers&#39;, word != &#39;chili&#39;, word != &#39;https&#39;, word != &#39;lyrics&#39;, word != &#39;genius.com&#39;)</code></p>
<h2>Frequency of Lyrics between albums</h2>
<p><code>library(stringr)</code><br /><br /><code>frequency &lt;- bind_rows(mutate(one_minute, album = &quot;One Hot Minute&quot;),</code><br /><code>                       mutate(blood_sugar, album = &quot;Blood Sugar Sex Magik&quot;)) %&gt;%</code><br /><code>  mutate(word = str_extract(word, &quot;[a-z&#39;]+&quot;)) %&gt;%</code><br /><code>  count(album, word) %&gt;%</code><br /><code>  group_by(album) %&gt;%</code><br /><code>  mutate(proportion = n / sum(n)) %&gt;%</code><br /><code>  select(-n) %&gt;%</code><br /><code>  spread(album, proportion) %&gt;%</code><br /><code>  gather(album, proportion, `One Hot Minute`) %&gt;%</code><br /><code>  na.omit()</code><br /><br /><code>library(scales)</code><br /><br /><code>ggplot(frequency, aes(x = proportion, y = `Blood Sugar Sex Magik`, color = abs(`Blood Sugar Sex Magik` - proportion))) +</code><br /><code>  geom_abline(color = &quot;gray40&quot;, lty = 2) +</code><br /><code>  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +</code><br /><code>  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +</code><br /><code>  scale_x_log10(labels = percent_format()) +</code><br /><code>  scale_y_log10(labels = percent_format()) +</code><br /><code>  scale_color_gradient(limits = c(0, 0.001), low = &quot;darkslategray4&quot;, high = &quot;gray75&quot;) +</code><br /><code>  facet_wrap(~album, ncol = 2) +</code><br /><code>  theme(legend.position=&quot;none&quot;) +</code><br /><code>  labs(y = &quot;Blood Sugar Sex Magik&quot;, x = NULL)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_1.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/46_1.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_1.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/46_1.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="46_1" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.1 Comparing word frequencies between the RHCP albums &#39;One Hot Minute&#39; &amp; &#39;Blood Sugar Sex Magik&#39;.</p>
<p>Words that are close to the line have similar frequencies in both albums. Some words landed here unintentionally. For instance, &#39;rick&#39; and &#39;kiedis&#39; are most likely not lyrics but residue from scraping the web page (Rick Rubin was the producer on both albums, while Anthony Kiedis is the lead singer). It is interesting to see &#39;funky&#39; appear near the middle of the line while &#39;love&#39; appears at the high end of the frequency range.</p>
<p>We can now quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between &#39;One Hot Minute&#39; &amp; &#39;Blood Sugar Sex Magik&#39;?</p>
<p><code>cor.test(data = frequency[frequency$album == &quot;One Hot Minute&quot;,],</code><br /><code>         ~ proportion + `Blood Sugar Sex Magik`)</code><br /><code>##</code><br /><code>##  Pearson&#39;s product-moment correlation</code><br /><code>##</code><br /><code>## data:  proportion and Blood Sugar Sex Magik</code><br /><code>## t = 5.2814, df = 246, p-value = 2.82e-07</code><br /><code>## alternative hypothesis: true correlation is not equal to 0</code><br /><code>## 95 percent confidence interval:</code><br /><code>##  0.2026112 0.4267278</code><br /><code>## sample estimates:</code><br /><code>##       cor</code><br /><code>## 0.3191241</code></p>
<p>The correlation between word frequencies in &#39;One Hot Minute&#39; &amp; &#39;Blood Sugar Sex Magik&#39; is about .32, which is not a strong indication of similar lyrics. This could be due to the addition of Navarro or to the RHCP trying a different lyrical tone.</p>
<h2>Combine both albums and add sentiment analysis</h2>
<p>Using an <code>inner_join</code> against a sentiment lexicon, we keep only the words that carry a positive or negative score, which gives a good grasp of the sentiment. Let&#39;s find the net sentiment of each of the two albums.</p>
<p><code>tidy &lt;- bind_rows(blood_sugar, one_minute)</code><br /><br /><code>afinn &lt;- tidy %&gt;%</code><br /><code>  inner_join(get_sentiments(&quot;afinn&quot;)) %&gt;%</code><br /><code>  group_by(album) %&gt;%</code><br /><code>  summarise(sentiment = sum(score)) %&gt;%</code><br /><code>  mutate(method = &quot;AFINN&quot;)</code><br /><br /><code>ggplot(afinn, aes(album, sentiment, fill = album)) +</code><br /><code>  geom_col(show.legend = FALSE) +</code><br /><code>  facet_wrap(~album, ncol = 2, scales = &quot;free_x&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_2.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/46_2.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_2.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/46_2.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="46_2" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.2 Net sentiment by album: &#39;One Hot Minute&#39; has the more negative sentiment of the two albums.</p>
<p>We can now examine the top words used in each album.</p>
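<p>To see what the <code>inner_join</code> step is doing, here is a minimal sketch on made-up data. The mini-lexicon, its scores, and the album labels &quot;A&quot; and &quot;B&quot; are all hypothetical stand-ins, not values from AFINN or from the lyrics above.</p>
<p><code>library(dplyr)</code><br /><br /><code># Hypothetical mini-lexicon standing in for AFINN (scores are made up)</code><br /><code>toy_lexicon &lt;- tibble(word = c(&quot;love&quot;, &quot;happy&quot;, &quot;pain&quot;, &quot;dead&quot;),</code><br /><code>                      score = c(3, 3, -2, -3))</code><br /><br /><code># Hypothetical tokenised lyrics, one word per row</code><br /><code>toy_lyrics &lt;- tibble(album = c(&quot;A&quot;, &quot;A&quot;, &quot;A&quot;, &quot;B&quot;, &quot;B&quot;, &quot;B&quot;),</code><br /><code>                     word = c(&quot;love&quot;, &quot;funky&quot;, &quot;pain&quot;, &quot;dead&quot;, &quot;pain&quot;, &quot;happy&quot;))</code><br /><br /><code># Keep only words found in the lexicon, then sum the scores per album</code><br /><code>toy_net &lt;- toy_lyrics %&gt;%</code><br /><code>  inner_join(toy_lexicon, by = &quot;word&quot;) %&gt;%</code><br /><code>  group_by(album) %&gt;%</code><br /><code>  summarise(sentiment = sum(score))</code><br /><br /><code>toy_net</code></p>
<p>Words missing from the lexicon (here &#39;funky&#39;) simply drop out of the join, so the net score only reflects words the lexicon knows about.</p>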
<p><code>tidy %&gt;%</code><br /><code>  group_by(album) %&gt;%</code><br /><code>  count(word, sort = TRUE) %&gt;%</code><br /><code>  filter(n &gt; 13) %&gt;%</code><br /><code>  mutate(word = reorder(word, n)) %&gt;%</code><br /><code>  ggplot(aes(word, n, fill = album)) +</code><br /><code>  facet_wrap(~ album, scales = &quot;free_y&quot;) +</code><br /><code>  geom_col(show.legend = FALSE) +</code><br /><code>  labs(y = &quot;Most Common Used Words&quot;) +</code><br /><code>  coord_flip()</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_3.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/46_3.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_3.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/46_3.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="46_3" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Figure 1.3 The most common words used in each album.</p>
<p>We can now reshape this chart into a word cloud.</p>
<h2>Most Common Word Clouds</h2>
<p><code>tidy %&gt;%</code><br /><code>  anti_join(stop_words) %&gt;%</code><br /><code>  count(word) %&gt;%</code><br /><code>  with(wordcloud(word, n, max.words = 100))</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_4.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/46_4.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_4.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/46_4.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="46_4" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Let&#39;s distinguish between positive and negative words.</p>
<p><code>library(reshape2)</code><br /><br /><code>tidy %&gt;%</code><br /><code>  inner_join(get_sentiments(&quot;bing&quot;)) %&gt;%</code><br /><code>  count(word, sentiment, sort = TRUE) %&gt;%</code><br /><code>  acast(word ~ sentiment, value.var = &quot;n&quot;, fill = 0) %&gt;%</code><br /><code>  comparison.cloud(colors = c(&quot;#F8766D&quot;, &quot;#00BFC4&quot;),</code><br /><code>                   max.words = 100)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_5.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/46_5.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/46_5.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/46_5.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="46_5" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<h2>Summary</h2>
<p>The data appear to support the notion that when John Frusciante temporarily left the RHCP, their lyrics became darker and more &#39;negative&#39; according to the sentiment lexicons. The analysis was motivated by the tidytext and rvest packages. Further analysis could look into the RHCP&#39;s later albums with Frusciante and possibly the lyrics of Navarro&#39;s other bands. The Red Hot Chili Peppers revolutionized funk rock in America, and anyone interested in their journey should read &#39;Scar Tissue&#39;.</p>]]></content:encoded>
    </item>
    <item>
      <title>Altuve or Biggio? Using Bayesian A/B Testing</title>
      <link>https://allanbutler.com/altuve-or-biggio-bayesian-ab-testing/</link>
      <guid isPermaLink="true">https://allanbutler.com/altuve-or-biggio-bayesian-ab-testing/</guid>
      <pubDate>Tue, 23 May 2017 12:00:00 GMT</pubDate>
      <description>Using Bayesian A/B Testing Altuve vs Biggio with Bayesian A/B Testing. Who is a better batter?: Craig Biggio or Jose Altuve? Inspiration for this post…</description>
      <content:encoded><![CDATA[<h2>Using Bayesian A/B Testing</h2>
<h3>Altuve vs Biggio with Bayesian A/B Testing.</h3>
<p>Who is the better batter: Craig Biggio or Jose Altuve?</p>
<p>Inspiration for this post comes after reading David Robinson&#39;s post comparing Mike Piazza vs Hank Aaron using Bayesian A/B testing <a href="http://varianceexplained.org/r/bayesian_ab_baseball/">here</a>.</p>
<p>At the end of 2014, Jose Altuve had a higher career batting average (630 hits / 2083 at-bats = .302) than Craig Biggio (3060 hits / 10876 at-bats = .281).</p>
<p>Can we say that Altuve&#39;s batting skill is actually better than Biggio&#39;s or could it be that Altuve has not played long enough to regress towards the mean?</p>
<p>In this post we will compare two batters using an empirical Bayesian approach to batting statistics to determine who is the better batter and by how much.</p>
<p>Understanding the difference between two proportions is important in A/B testing. One of the most common examples of A/B testing is comparing clickthrough rates (&quot;out of X impressions, there have been Y clicks&quot;), which on the surface is similar to our batting average estimation problem (&quot;out of X at-bats, there have been Y hits&quot;).</p>
<p>Let&#39;s define the problem in terms of the difference between each player&#39;s posterior distribution, and look at three mathematical and computational strategies we can use to solve it. The example is about baseball statistics, but many A/B tests can apply the same principles.</p>
<h2>Setup</h2>
<p><code>library(dplyr)</code><br /><code>library(tidyr)</code><br /><code>library(Lahman)</code><br /><code>library(knitr)</code><br /><code>library(ggplot2)</code><br /><code>theme_set(theme_bw())</code><br /><br /><code>pitchers &lt;- Pitching %&gt;%</code><br /><code>  group_by(playerID) %&gt;%</code><br /><code>  summarize(gamesPitched = sum(G)) %&gt;%</code><br /><code>  filter(gamesPitched &gt; 3)</code><br /><br /><code>career &lt;- Batting %&gt;%</code><br /><code>  filter(AB &gt; 0) %&gt;%</code><br /><code>  anti_join(pitchers, by = &quot;playerID&quot;) %&gt;%</code><br /><code>  group_by(playerID) %&gt;%</code><br /><code>  summarize(H = sum(H), AB = sum(AB)) %&gt;%</code><br /><code>  mutate(average = H / AB)</code></p>
<p><code>career &lt;- Master %&gt;%</code><br /><code>  tbl_df() %&gt;%</code><br /><code>  select(playerID, nameFirst, nameLast) %&gt;%</code><br /><code>  unite(name, nameFirst, nameLast, sep = &quot; &quot;) %&gt;%</code><br /><code>  inner_join(career, by = &quot;playerID&quot;)</code></p>
<p><code>career_filtered &lt;- career %&gt;% filter(AB &gt;= 500)</code><br /><code>m &lt;- MASS::fitdistr(career_filtered$average, dbeta,</code><br /><code>                    start = list(shape1 = 1, shape2 = 10))</code><br /><br /><code>alpha0 &lt;- m$estimate[1]</code><br /><code>beta0 &lt;- m$estimate[2]</code></p>
<p><code>career_eb &lt;- career %&gt;%</code><br /><code>  mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0)) %&gt;%</code><br /><code>  mutate(alpha1 = H + alpha0,</code><br /><code>         beta1 = AB - H + beta0) %&gt;%</code><br /><code>  arrange(desc(eb_estimate))</code></p>
<h2>So let&#39;s take a look at the two batters in question, Craig Biggio and Jose Altuve</h2>
<p><code># Save them as separate objects too for later:</code><br /><code>biggio &lt;- career_eb %&gt;% filter(name == &quot;Craig Biggio&quot;)</code><br /><code>altuve &lt;- career_eb %&gt;% filter(name == &quot;Jose Altuve&quot;)</code><br /><code>bagwell &lt;- career_eb %&gt;% filter(name == &quot;Jeff Bagwell&quot;)</code><br /><code>two_players &lt;- bind_rows(biggio, altuve)</code><br /><br /><code>kable(head(two_players))</code></p>
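<table>
<thead><tr><th>playerID</th><th>name</th><th>H</th><th>AB</th><th>average</th><th>eb_estimate</th><th>alpha1</th><th>beta1</th></tr></thead>
<tbody>
<tr><td>biggicr01</td><td>Craig Biggio</td><td>3060</td><td>10876</td><td>0.281</td><td>0.281</td><td>3137</td><td>8035</td></tr>
<tr><td>altuvjo01</td><td>Jose Altuve</td><td>1046</td><td>3361</td><td>0.311</td><td>0.307</td><td>1123</td><td>2534</td></tr>
</tbody>
</table>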
<p>We see that Altuve has a slightly higher batting average and a higher shrunken empirical Bayes estimate. But is Altuve&#39;s true probability of getting a hit higher than Biggio&#39;s? Or is the difference due to chance?</p>
<p>The answer lies in considering the range of plausible values for their &quot;true&quot; batting averages after we have taken their batting records into account, that is, their posterior distributions.</p>
<p>These posterior distributions are modeled as beta distributions with the parameters Beta(α0 + H, β0 + AB - H), which are the <code>alpha1</code> and <code>beta1</code> columns computed above.</p>
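<p>As a quick worked example of that update, the sketch below recomputes Altuve&#39;s posterior parameters by hand. The prior values are not printed in this post; 77 and 219 are an assumption, backed out from the rounded table above as <code>alpha1 - H</code> and <code>beta1 - (AB - H)</code>.</p>
<p><code># Prior values backed out from the rounded table (assumption, not the fitted values)</code><br /><code>alpha0 &lt;- 77</code><br /><code>beta0 &lt;- 219</code><br /><br /><code># Altuve&#39;s record as used in this post: 1046 hits and 2315 outs</code><br /><code>H &lt;- 1046</code><br /><code>AB &lt;- 1046 + 2315</code><br /><br /><code># Add hits to alpha and outs to beta</code><br /><code>alpha1 &lt;- alpha0 + H</code><br /><code>beta1 &lt;- beta0 + AB - H</code><br /><br /><code># Posterior mean, which is the shrunken estimate</code><br /><code>posterior_mean &lt;- alpha1 / (alpha1 + beta1)</code><br /><code>posterior_mean</code></p>
<p>With these assumed priors the result lands close to the <code>eb_estimate</code> of 0.307 in the table, slightly below the raw average of 0.311.</p>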
<p><code>library(broom)</code><br /><code>library(ggplot2)</code><br /><code>theme_set(theme_bw())</code><br /><br /><code>two_players %&gt;%</code><br /><code>  inflate(x = seq(.26, .33, .00025)) %&gt;%</code><br /><code>  mutate(density = dbeta(x, alpha1, beta1)) %&gt;%</code><br /><code>  ggplot(aes(x, density, color = name)) +</code><br /><code>  geom_line() +</code><br /><code>  labs(x = &quot;Batting average&quot;, color = &quot;&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_1.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/44_1.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_1.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/44_1.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="44_1" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Each posterior is a probabilistic representation of our uncertainty in that player&#39;s estimate. When we ask what the probability is that Altuve is better, we are asking &quot;if I took a random draw from Altuve&#39;s posterior and a random draw from Biggio&#39;s, what is the probability Altuve&#39;s is higher?&quot;</p>
<p>Notice how Biggio&#39;s and Altuve&#39;s distributions overlap only slightly, near the .290 range. Examining the distributions, there is NOT enough uncertainty in either estimate to suggest that Biggio could be the better hitter given the career statistics used here. If we took a random draw from Biggio&#39;s distribution and one from Altuve&#39;s, it&#39;s very unlikely that Biggio&#39;s would be higher.</p>
<p><code>career_eb %&gt;%</code><br /><code>  filter(name %in% c(&quot;Craig Biggio&quot;, &quot;Jose Altuve&quot;, &quot;Jeff Bagwell&quot;)) %&gt;%</code><br /><code>  inflate(x = seq(.26, .33, .00025)) %&gt;%</code><br /><code>  mutate(density = dbeta(x, alpha1, beta1)) %&gt;%</code><br /><code>  ggplot(aes(x, density, color = name)) +</code><br /><code>  geom_line() +</code><br /><code>  labs(x = &quot;Batting average&quot;, color = &quot;&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_2.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/44_2.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_2.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/44_2.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="44_2" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Jeff Bagwell won a Silver Slugger Award in 1994 and had an excellent batting record. Notice the vast amount of overlap in Bagwell and Altuve&#39;s distributions. This means there is enough uncertainty in the estimates that Bagwell could easily be a better batter than Altuve.</p>
<h2>Posterior Probability</h2>
<p>We may be interested in the probability that Altuve is a stronger hitter than Biggio within our model. From the graph we can already tell that it&#39;s greater than 50%, but how can we quantify it?</p>
<p>We need to know the probability that one beta distribution is greater than another.</p>
<p>I&#39;m going to illustrate three common routes in solving a Bayesian problem: 1) Simulation of posterior draws 2) Numerical integration 3) Closed-form approximation</p>
<h3>Simulation of posterior draws</h3>
<p>Simulation is the quickest way to avoid doing any math. Using each player&#39;s α1 and β1 parameters, draw a million values from each posterior using <code>rbeta</code>, and compare the results:</p>
<p><code>altuve_simulation &lt;- rbeta(1e6, altuve$alpha1, altuve$beta1)</code><br /><code>biggio_simulation &lt;- rbeta(1e6, biggio$alpha1, biggio$beta1)</code><br /><code>bagwell_simulation &lt;- rbeta(1e6, bagwell$alpha1, bagwell$beta1)</code><br /><code>sim &lt;- mean(altuve_simulation &gt; biggio_simulation)</code><br /><code>head(sim)</code><br /><br /><code>## [1] 0.999</code></p>
<p>That is roughly a 99.9% probability that Altuve is a better batter than Biggio.</p>
<p>For fun, let&#39;s compare Altuve to Bagwell.</p>
<p><code>sim2 &lt;- mean(bagwell_simulation &gt; altuve_simulation )</code><br /><code>sim2</code><br /><br /><code>## [1] 0.103</code></p>
<p>A much lower probability of 10% that Bagwell is a better batter than Altuve.</p>
<p>You could turn the number of draws up or down depending on how much you value speed versus precision. We didn&#39;t have to do any mathematical derivation or proofs. Even if we had a more complicated model, the process for simulating from it would still be straightforward. This is one of the reasons Bayesian simulation approaches have become popular: computational power has gotten cheap, while doing math is as expensive as ever.</p>
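<p>To make the speed versus precision trade-off concrete, the sketch below repeats the simulation at a few sample sizes and reports a rough Monte Carlo standard error, sqrt(p(1 - p) / n). The posterior parameters are copied from the rounded table earlier in the post, and <code>sim_prob</code> is a hypothetical helper, so treat this as an illustration rather than a reproduction of the figures above.</p>
<p><code>set.seed(2017)</code><br /><br /><code># Posterior parameters copied from the rounded table (assumption)</code><br /><code>altuve_alpha1 &lt;- 1123; altuve_beta1 &lt;- 2534</code><br /><code>biggio_alpha1 &lt;- 3137; biggio_beta1 &lt;- 8035</code><br /><br /><code># Estimate P(Altuve &gt; Biggio) from n draws, with its Monte Carlo standard error</code><br /><code>sim_prob &lt;- function(n) {</code><br /><code>  a &lt;- rbeta(n, altuve_alpha1, altuve_beta1)</code><br /><code>  b &lt;- rbeta(n, biggio_alpha1, biggio_beta1)</code><br /><code>  p &lt;- mean(a &gt; b)</code><br /><code>  c(draws = n, estimate = p, std_error = sqrt(p * (1 - p) / n))</code><br /><code>}</code><br /><br /><code># One row per sample size</code><br /><code>sim_results &lt;- t(sapply(c(1e3, 1e4, 1e5), sim_prob))</code><br /><code>sim_results</code></p>
<p>The standard error shrinks with the square root of the number of draws, so each extra digit of precision costs about a hundred times more simulation.</p>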
<h3>Integration</h3>
<p>These two posteriors each have their own independent distribution, and together they form a joint distribution: a density over particular pairs of x and y. The joint distribution can be imagined as a density cloud:</p>
<p><code>library(tidyr)</code><br /><br /><code>x &lt;- seq(.270, .312, .0002)</code><br /><code>crossing(altuve_x = x, biggio_x = x) %&gt;%</code><br /><code>  mutate(altuve_density = dbeta(altuve_x, altuve$alpha1, altuve$beta1),</code><br /><code>  biggio_density = dbeta(biggio_x, biggio$alpha1, biggio$beta1),</code><br /><code>  joint = altuve_density * biggio_density) %&gt;%</code><br /><code>  ggplot(aes(altuve_x, biggio_x, fill = joint)) +</code><br /><code>  geom_tile() +</code><br /><code>  geom_abline() +</code><br /><code>  scale_fill_gradient2(low = &quot;white&quot;, high = &quot;red&quot;) +</code><br /><code>  labs(x = &quot;Altuve batting average&quot;,</code><br /><code>  y = &quot;Biggio batting average&quot;,</code><br /><code>  fill = &quot;Joint density&quot;) +</code><br /><code>  theme(legend.position = &quot;none&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_3.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/44_3.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_3.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/44_3.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="44_3" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Here we are asking what fraction of the joint probability density lies below the black line, where Altuve&#39;s average is greater than Biggio&#39;s. Clearly more lies below than above, which is consistent with the roughly 99.9% posterior probability from the simulation that Altuve is the better hitter.</p>
<p>Using numerical integration to calculate this quantitatively would look like this in R:</p>
<p><code>d &lt;- .00002</code><br /><code>limits &lt;- seq(.26, .33, d)</code><br /><code>sum(outer(limits, limits, function(x, y) {</code><br /><code>  (x &gt; y) *</code><br /><code>  dbeta(x, altuve$alpha1, altuve$beta1) *</code><br /><code>  dbeta(y, biggio$alpha1, biggio$beta1) *</code><br /><code>  d ^ 2</code><br /><code>}))</code><br /><br /><code>## [1] 0.997</code></p>
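<p>The grid-based estimate differs slightly from the simulation because it depends on the grid spacing and limits. This approach also becomes impractical in problems that have many dimensions.</p>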
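<p>For two independent posteriors the double integral can also be collapsed into a single one, since P(A &gt; B) is the integral of A&#39;s density times B&#39;s cumulative distribution. The sketch below does this with base R&#39;s <code>integrate</code>. The posterior parameters are copied from the rounded table earlier in the post, and the integration limits of 0.2 and 0.4 are an assumption chosen to cover essentially all of the posterior mass.</p>
<p><code># Posterior parameters copied from the rounded table (assumption)</code><br /><code>altuve_alpha1 &lt;- 1123; altuve_beta1 &lt;- 2534</code><br /><code>biggio_alpha1 &lt;- 3137; biggio_beta1 &lt;- 8035</code><br /><br /><code># Density of Altuve&#39;s average at x times the probability Biggio&#39;s is below x</code><br /><code>integrand &lt;- function(x) {</code><br /><code>  dbeta(x, altuve_alpha1, altuve_beta1) * pbeta(x, biggio_alpha1, biggio_beta1)</code><br /><code>}</code><br /><br /><code># One-dimensional adaptive quadrature over the region holding the mass</code><br /><code>p_altuve_better &lt;- integrate(integrand, lower = 0.2, upper = 0.4)$value</code><br /><code>p_altuve_better</code></p>
<p>Keeping the limits tight around the posteriors matters here: the densities are narrow, and an adaptive integrator started on the full 0 to 1 range can step over the peak.</p>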
<h3>Closed-form approximation</h3>
<p>Closed-form approximation is a much faster approximation approach. When α and β are both fairly large, the beta starts looking similar to a normal distribution, so much so that it can be closely approximated.</p>
<p>If you draw the normal approximations to the Altuve and Biggio posteriors (dashed lines), they are visually indistinguishable from the beta distributions:</p>
<p><code>two_players %&gt;%</code><br /><code>  mutate(mu = alpha1 / (alpha1 + beta1),</code><br /><code>  var = alpha1 * beta1 / ((alpha1 + beta1) ^ 2 * (alpha1 + beta1 + 1))) %&gt;%</code><br /><code>  inflate(x = seq(.26, .33, .00025)) %&gt;%</code><br /><code>  mutate(density = dbeta(x, alpha1, beta1),</code><br /><code>  normal = dnorm(x, mu, sqrt(var))) %&gt;%</code><br /><code>  ggplot(aes(x, density, group = name)) +</code><br /><code>  geom_line(aes(color = name)) +</code><br /><code>  geom_line(lty = 2)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_4.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/44_4.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_4.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/44_4.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="44_4" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>The probability one normal is greater than another is very easy to calculate mathematically:</p>
<p><code>h_approx &lt;- function(alpha_a, beta_a,</code><br /><code>  alpha_b, beta_b) {</code><br /><code>  u1 &lt;- alpha_a / (alpha_a + beta_a)</code><br /><code>  u2 &lt;- alpha_b / (alpha_b + beta_b)</code><br /><code>  var1 &lt;- alpha_a * beta_a / ((alpha_a + beta_a) ^ 2 * (alpha_a + beta_a + 1))</code><br /><code>  var2 &lt;- alpha_b * beta_b / ((alpha_b + beta_b) ^ 2 * (alpha_b + beta_b + 1))</code><br /><code>  pnorm(0, u2 - u1, sqrt(var1 + var2))</code><br /><code>}</code><br /><br /><code>h_approx(altuve$alpha1, altuve$beta1, biggio$alpha1, biggio$beta1)</code><br /><br /><code>## [1] 0.999</code></p>
<p>The calculation is vectorizable in R. The downside is that for low α or low β, the normal approximation to the beta is going to fit rather poorly. The closed-form approximation is also systematically biased: in some problems it will give too high an answer and in others too low. Because the prior adds α0 and β0 to every player&#39;s counts, our posterior parameters are never very small, so we are safe using the closed-form approximation here.</p>
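<p>To illustrate how the approximation degrades for small parameters, the sketch below uses a case with a known answer: for A ~ Beta(1, a) and B ~ Beta(1, b), P(A &gt; B) equals b / (a + b) exactly. The parameter values (a = 2, b = 50) are made up for illustration, and <code>normal_approx</code> is a hypothetical helper that repeats the same formula as <code>h_approx</code> so the block runs on its own.</p>
<p><code># Normal approximation to P(A &gt; B) for two beta distributions</code><br /><code>normal_approx &lt;- function(alpha_a, beta_a, alpha_b, beta_b) {</code><br /><code>  u1 &lt;- alpha_a / (alpha_a + beta_a)</code><br /><code>  u2 &lt;- alpha_b / (alpha_b + beta_b)</code><br /><code>  var1 &lt;- alpha_a * beta_a / ((alpha_a + beta_a) ^ 2 * (alpha_a + beta_a + 1))</code><br /><code>  var2 &lt;- alpha_b * beta_b / ((alpha_b + beta_b) ^ 2 * (alpha_b + beta_b + 1))</code><br /><code>  pnorm(0, u2 - u1, sqrt(var1 + var2))</code><br /><code>}</code><br /><br /><code># Exact answer for A ~ Beta(1, 2) and B ~ Beta(1, 50): b / (a + b)</code><br /><code>exact_p &lt;- 50 / (2 + 50)</code><br /><br /><code># Normal approximation for the same pair of distributions</code><br /><code>approx_p &lt;- normal_approx(1, 2, 1, 50)</code><br /><br /><code>c(exact = exact_p, approx = approx_p)</code></p>
<p>With parameters this small both distributions are heavily skewed, and the approximation misses the exact value by several percentage points.</p>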
<h2>Confidence and credible intervals</h2>
<p>In frequentist statistics, this problem is usually framed as a contingency table comparing two proportions, such as:</p>
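<table>
<thead><tr><th>Player</th><th>Hits</th><th>Misses</th></tr></thead>
<tbody>
<tr><td>Craig Biggio</td><td>3060</td><td>7816</td></tr>
<tr><td>Jose Altuve</td><td>1046</td><td>2315</td></tr>
</tbody>
</table>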
<p>A common classical way to approach contingency table problems is with Pearson&#39;s chi-squared test, implemented in R as <code>prop.test</code>:</p>
<p><code>prop.test(two_players$H, two_players$AB)</code><br /><br /><code>##</code><br /><code>##  2-sample test for equality of proportions with continuity</code><br /><code>##  correction</code><br /><code>##</code><br /><code>## data:  two_players$H out of two_players$AB</code><br /><code>## X-squared = 10, df = 1, p-value = 9e-04</code><br /><code>## alternative hypothesis: two.sided</code><br /><code>## 95 percent confidence interval:</code><br /><code>##  -0.0478 -0.0119</code><br /><code>## sample estimates:</code><br /><code>## prop 1 prop 2</code><br /><code>##  0.281  0.311</code></p>
<p>The p-value is well below .05, so the difference is statistically significant, which agrees with our posterior probability.</p>
<p><code>prop.test</code> also gives you a confidence interval for the difference between the two players.</p>
<p>Now we will use empirical Bayes to compute the credible interval for the difference between Altuve and Biggio. We could do this with simulation or integration, but we will use our normal approximation approach:</p>
<p><code>credible_interval_approx &lt;- function(a, b, c, d) {</code><br /><code>  u1 &lt;- a / (a + b)</code><br /><code>  u2 &lt;- c / (c + d)</code><br /><code>  var1 &lt;- a * b / ((a + b) ^ 2 * (a + b + 1))</code><br /><code>  var2 &lt;- c * d / ((c + d) ^ 2 * (c + d + 1))</code><br /><br /><code>  mu_diff &lt;- u2 - u1</code><br /><code>  sd_diff &lt;- sqrt(var1 + var2)</code><br /><br /><code>  data_frame(posterior = pnorm(0, mu_diff, sd_diff),</code><br /><code>    estimate = mu_diff,</code><br /><code>    conf.low = qnorm(.025, mu_diff, sd_diff),</code><br /><code>    conf.high = qnorm(.975, mu_diff, sd_diff))</code><br /><code>}</code><br /><code>credible_interval_approx(altuve$alpha1, altuve$beta1, biggio$alpha1, biggio$beta1)</code><br /><br /><code>## # A tibble: 1 x 4</code><br /><code>##   posterior estimate conf.low conf.high</code><br /><code>##       &lt;dbl&gt;    &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;</code><br /><code>## 1     0.999  -0.0262  -0.0433  -0.00911</code></p>
<p><code>set.seed(188)</code><br /><br /><code>intervals &lt;- career_eb %&gt;%</code><br /><code>  filter(AB &gt; 10) %&gt;%</code><br /><code>  sample_n(20) %&gt;%</code><br /><code>  group_by(name, H, AB) %&gt;%</code><br /><code>  do(credible_interval_approx(altuve$alpha1, altuve$beta1, .$alpha1, .$beta1)) %&gt;%</code><br /><code>  ungroup() %&gt;%</code><br /><code>  mutate(name = reorder(paste0(name, &quot; (&quot;, H, &quot; / &quot;, AB, &quot;)&quot;), -estimate))</code></p>
<p><code>f &lt;- function(H, AB) broom::tidy(prop.test(c(H, altuve$H), c(AB, altuve$AB)))</code><br /><code>prop_tests &lt;- purrr::map2_df(intervals$H, intervals$AB, f) %&gt;%</code><br /><code>  mutate(estimate = estimate1 - estimate2,</code><br /><code>    name = intervals$name)</code><br /><br /><code>all_intervals &lt;- bind_rows(</code><br /><code>  mutate(intervals, type = &quot;Credible&quot;),</code><br /><code>  mutate(prop_tests, type = &quot;Confidence&quot;)</code><br /><code>)</code></p>
<p><code># estimate is the player&#39;s average minus Altuve&#39;s in both interval types</code><br /><code>ggplot(all_intervals, aes(x = estimate, y = name, color = type)) +</code><br /><code>  geom_point() +</code><br /><code>  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +</code><br /><code>  xlab(&quot;Player average - Altuve average&quot;) +</code><br /><code>  ylab(&quot;Player&quot;)</code></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_5.png" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/44_5.png 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/44_5.png 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/44_5.png 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="44_5" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>
<p>Because there is not much information on certain players, their credible intervals end up narrower than their confidence intervals. This is because the prior lets us adjust our expectations (Esix Snead may have ended up with a higher batting average than Altuve, but we are sure it was not .25 higher). When a player has a lot of data, the confidence and credible intervals match almost perfectly. Therefore, empirical Bayes A/B credible intervals are a way to &quot;shrink&quot; frequentist confidence intervals by sharing power across players.</p>
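<p>The shrinking effect is easiest to see on a single player with very little data. The sketch below compares a 95% credible interval with a 95% exact binomial confidence interval for a made-up player with 4 hits in 10 at-bats. It looks at one batting average rather than a difference from Altuve, and the prior values 77 and 219 are an assumption backed out from the rounded tables in this post.</p>
<p><code># Hypothetical player: 4 hits in 10 at-bats</code><br /><code>hits &lt;- 4</code><br /><code>at_bats &lt;- 10</code><br /><br /><code># Prior values backed out from the rounded tables (assumption)</code><br /><code>alpha0 &lt;- 77</code><br /><code>beta0 &lt;- 219</code><br /><br /><code># 95% credible interval from the beta posterior</code><br /><code>credible &lt;- qbeta(c(.025, .975), alpha0 + hits, beta0 + at_bats - hits)</code><br /><br /><code># 95% exact (Clopper-Pearson) confidence interval, which ignores the prior</code><br /><code>confidence &lt;- as.numeric(binom.test(hits, at_bats)$conf.int)</code><br /><br /><code>rbind(credible, confidence)</code></p>
<p>The confidence interval spans a large share of the possible batting averages, while the credible interval stays close to the range where real players actually hit.</p>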
<h2>Conclusion:</h2>
<p>We are acting as if baseball players make up one homogeneous pool. This is mathematically convenient, but it ignores a lot of information about players: pitchers faced, stadiums played in, and length of career. For instance, we ignore how short Altuve&#39;s career is compared to Biggio&#39;s 20-year career. This leads to bias, because empirical Bayes tends to overestimate players with very few at-bats.</p>
<p>Also, this post is ONLY comparing Altuve&#39;s BATTING AVERAGE to Biggio&#39;s and does not take into account how valuable Biggio was to the Astros over the years, starting at catcher, then moving to second base, and even dabbling in center field.</p>
<p>For a moving piece on Biggio, read Bill James&#39;s analysis of Craig Biggio <a href="http://www.slate.com/articles/sports/sports_nut/2008/02/the_epic_of_craig_biggio.html">here</a>. Despite a little negativity, there is one thing James got exactly right: &quot;Biggio was the guy who would do whatever needed to be done.&quot;</p>]]></content:encoded>
    </item>
    <item>
      <title>Headshots of Allan Butler</title>
      <link>https://allanbutler.com/headshots/</link>
      <guid isPermaLink="true">https://allanbutler.com/headshots/</guid>
      <pubDate>Thu, 01 Jan 2015 04:35:00 GMT</pubDate>
      <description>These images may be used as headshots of Allan Butler for speaking and media appearances. Click each image for high-quality, print-ready file.</description>
      <content:encoded><![CDATA[<p>These images may be used as headshots of Allan Butler for speaking and media appearances.</p>
<p><em>Click each image for a high-quality, print-ready file.</em></p>
<figure><img src="/cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/headshot_2025-684x1024.jpg" srcset="/cdn-cgi/image/width=400,quality=80,fit=scale-down,format=auto/_media/headshot_2025-684x1024.jpg 400w, /cdn-cgi/image/width=800,quality=80,fit=scale-down,format=auto/_media/headshot_2025-684x1024.jpg 800w, /cdn-cgi/image/width=1200,quality=80,fit=scale-down,format=auto/_media/headshot_2025-684x1024.jpg 1200w" sizes="(max-width: 700px) 100vw, 700px" alt="headshot_2025-684x1024.jpg" loading="lazy" decoding="async" style="max-width:100%;height:auto;display:block;" /></figure>]]></content:encoded>
    </item>
  </channel>
</rss>
