Big Data: AI Solutions to Intractable Problems (Part II)

Classic Big Data Analytics: Exploding Manholes and Fires in Illegal Conversions

Exploding Manholes

Consolidated Edison, “Con Ed,” an investor-owned utility in New York City, has about 250,000 manholes in the city, about 50,000 of which are in Manhattan. The manhole covers are cast iron, a few inches thick, and 24 inches in diameter; each weighs up to 300 pounds. Every year a few hundred catch fire, and some explode up into the air. In 2007, Con Ed went to the statisticians at Columbia University and asked them to find patterns in the data and identify which manholes were most likely to catch fire or explode, so the problem could be managed.

Fires in Illegal Conversions

An illegal conversion is a one-family house or apartment subdivided to house three or four families, or a community of 12 to 20 people. As noted above, New York City firefighters are 15 times more likely to die in fires in illegal conversions than in other types of residences. The FDNY knew it could save lives by finding illegal conversions. But how? Relying on gut feeling and instinct, inspectors had a 13% success rate in finding them.

Finding the Needles in the Haystacks

Cynthia Rudin, a statistician turned data scientist, led the team working with Con Ed. They looked at “trouble tickets” and found a mess. The term “service box” appeared in 38 different variants, including S, SB, S/B, S.B., SBX, S Bx, S Box, Serv Box, SERV/BOX, and Service Box. Database application designers should enforce quality control when and where data are entered into the database; with 38 different terms for the same data point, Con Ed’s data had no integrity.
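A minimal sketch of the kind of cleanup this implies, assuming a hand-built lookup table of the known variants (the mapping below is illustrative, not Con Ed’s actual list):

```python
import re

# Illustrative set of raw variants; a real cleanup pass would cover all 38
# spellings found in the trouble tickets.
VARIANTS = {"S", "SB", "S/B", "S.B.", "SBX", "S BX", "S BOX",
            "SERV BOX", "SERV/BOX", "SERVICE BOX"}

def normalize_structure(raw: str) -> str:
    """Map any known variant of 'Service Box' to one canonical label."""
    token = re.sub(r"\s+", " ", raw.strip().upper())
    return "SERVICE BOX" if token in VARIANTS else token

print(normalize_structure("Serv/Box"))   # SERVICE BOX
print(normalize_structure(" s.b. "))     # SERVICE BOX
```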

But Rudin and her team identified 106 data points that they believed were reasonable predictors of a serious manhole incident, then condensed the list to a smaller set of the strongest signals. By 2009 they were ready to predict problem spots. The top 10% of manholes on their ranked list accounted for 44% of the manholes that went on to have severe incidents.
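One way to read that result is as a “capture rate” on a ranked list: score every manhole, sort by score, and ask what share of the actual incidents fall in the top 10%. A toy version, with invented scores and outcomes rather than Rudin’s model or data:

```python
def capture_rate(scores, had_incident, top_fraction=0.10):
    """Share of all incidents that fall in the top-ranked fraction of manholes."""
    ranked = sorted(zip(scores, had_incident), key=lambda p: p[0], reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    top_hits = sum(hit for _, hit in ranked[:cutoff])
    total_hits = sum(had_incident)
    return top_hits / total_hits if total_hits else 0.0

# Toy data: 10 manholes, model scores, and whether each had a severe incident.
scores       = [0.9, 0.8, 0.3, 0.7, 0.1, 0.6, 0.2, 0.5, 0.4, 0.05]
had_incident = [1,   0,   0,   1,   0,   0,   0,   0,   1,   0]
print(capture_rate(scores, had_incident, top_fraction=0.10))
```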

Mike Flowers, hired by Mayor Bloomberg as New York City’s first head of data analytics, tackled the problem of tracking down illegal conversions. He and his team pulled records from the Department of Buildings, Housing Preservation and Development, the tax rolls, and the NYPD. They looked at 911 calls, hospitalization data, and reports of rodent infestations. They looked at construction and renovation permits and at building-code violations, because permits suggest careful, diligent owners while violations suggest problems. It took two years, but by 2011 they were ready. The methods were inexact, but the amount of data – on every residential property in New York City – compensated for the imperfections. The success rate in finding illegal conversions went from 13% to 70%.
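The data-fusion step might look something like the sketch below: merge records from several agencies on a common property identifier and compute a rough risk score. The field names, weights, and toy figures here are assumptions for illustration, not the city’s actual model:

```python
import pandas as pd

# Hypothetical extracts from different agencies, keyed on a property ID.
buildings  = pd.DataFrame({"property_id": [1, 2, 3], "violations": [4, 0, 7]})
complaints = pd.DataFrame({"property_id": [1, 2, 3], "calls_311": [2, 1, 9]})
permits    = pd.DataFrame({"property_id": [1, 2, 3], "recent_permits": [0, 3, 0]})

merged = buildings.merge(complaints, on="property_id").merge(permits, on="property_id")

# Toy score: violations and complaints raise risk, recent permits lower it.
merged["risk"] = (merged["violations"] + 2 * merged["calls_311"]
                  - merged["recent_permits"])
print(merged.sort_values("risk", ascending=False))
```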

A House of Cards

In its early days, Netflix subscribers built a queue of films they wanted to watch, and Netflix would mail each subscriber up to three DVDs or Blu-ray discs at a time. After watching the movies, subscribers mailed the discs back, and Netflix mailed the next titles in the queue.

Given the state of the art back then, we can assume that Netflix used a relational database such as DB2, MS SQL Server, or Oracle, containing information about subscribers and about its stock of discs. The subscriber records likely held name, address, phone number, email, films watched, films out for watching, and films in the queue. The film records may have held title, stars, director, genre, year released, copies on hand, and copies out with subscribers. After a while Netflix probably started mining the database for the most popular films and for trends.
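A guess at what such a schema might have looked like, sketched with Python’s built-in sqlite3 rather than DB2, SQL Server, or Oracle; the table and column names are assumptions based on the description above, not Netflix’s actual design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriber (
    subscriber_id INTEGER PRIMARY KEY,
    name TEXT, address TEXT, phone TEXT, email TEXT
);
CREATE TABLE film (
    film_id INTEGER PRIMARY KEY,
    title TEXT, stars TEXT, director TEXT, genre TEXT, year INTEGER,
    copies_on_hand INTEGER, copies_with_subscribers INTEGER
);
CREATE TABLE rental (  -- films out for watching / already watched
    subscriber_id INTEGER REFERENCES subscriber(subscriber_id),
    film_id INTEGER REFERENCES film(film_id),
    shipped_on TEXT, returned_on TEXT
);
""")
```

Mining for “most popular films” would then be a simple GROUP BY over the rental table.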

In 2007 Netflix started streaming with its “Watch Now” service. To optimize this service it needed to store multiple copies of each movie in different data centers near its subscribers. Today Netflix is estimated to store 1,100 to 1,200 replicas [1] of each film in order to stream to different screens – TVs, computers, tablets, and phones – in different formats and resolutions, such as 4K and 1080p, across the world.

In 2011, Netflix took a big leap forward, leveraging information about the films and television shows subscribers had watched to suggest what they might want to watch next, asking questions like the following (a toy version of this kind of co-viewing analysis is sketched after the list):

If Joe likes “Terminator” and “Rambo,” would he like “Rocky”?

If Bob likes “Spy Game” and “Lara Croft,” would he like “Mr. and Mrs. Smith”?

And how can we make money producing our own films?
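Questions like these come down to measuring how often titles are watched by the same people. A toy item-to-item overlap calculation (the titles’ audiences below are invented, not Netflix data; real recommenders use far richer signals, but the co-viewing intuition is the same):

```python
# Toy viewing histories: which subscribers watched which titles.
watched = {
    "Terminator": {"joe", "ann", "raj"},
    "Rambo":      {"joe", "ann"},
    "Rocky":      {"joe", "ann", "kim"},
    "Spy Game":   {"bob", "kim"},
}

def jaccard(a, b):
    """Overlap between two titles' audiences: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# How strongly does liking "Rambo" predict liking "Rocky"?
print(jaccard(watched["Rambo"], watched["Rocky"]))   # ≈ 0.67
```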

Based on the correlated viewing habits of its subscribers, Netflix determined that ninety percent of the 12 million people who liked the film “The American President” and the television series “The West Wing” would watch one episode of “House of Cards” and, if it was compelling, most would watch additional episodes. On February 1, 2013, Netflix premiered “House of Cards.” The analysis was correct.

Search, Retail, and Social Media

Unlike AltaVista, Ask Jeeves, Yahoo, and its other competitors, Google approached Internet search from a data scientist’s perspective. Google’s engineers reasoned that just as the value of a scientific paper can be measured by the number of other papers that cite it, the popularity of a website can be measured by the number of other websites that link to it. This worked; however, an unintended consequence paved the way for exaggerated, fictitious, or sensationalized stories designed to influence elections and political referenda. There are likely to be more searches for “Clinton and Monica” than for “Clinton leads NATO against ethnic cleansing in Yugoslavia.”
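In its simplest form, that reasoning is just counting inbound links; Google’s actual PageRank goes further and weights each link by the importance of the page it comes from. A toy version of the simple count (the web graph below is invented):

```python
# Toy web graph: each page lists the pages it links to.
links = {
    "a.com": ["c.com", "b.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": ["c.com"],
}

# Popularity as inbound-link count -- the citation-style measure described above.
inbound = {page: 0 for page in links}
for targets in links.values():
    for target in targets:
        inbound[target] = inbound.get(target, 0) + 1

print(sorted(inbound.items(), key=lambda kv: kv[1], reverse=True))
# [('c.com', 3), ('a.com', 1), ('b.com', 1), ('d.com', 0)]
```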

Other examples of Big Data include Amazon using information about what people are searching for, purchasing, or streaming to recommend things they might want to buy or stream. Facebook and LinkedIn use information such as where users live, work, and studied, and, of course, whom they know, to suggest other people they might know.
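A “people you may know” suggestion can be approximated by counting mutual connections. A minimal sketch, with an invented social graph rather than either company’s actual method:

```python
# Toy social graph: who is directly connected to whom.
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave":  {"bob", "carol"},
}

def suggestions(user):
    """Rank non-connections by how many friends they share with the user."""
    counts = {}
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                counts[candidate] = counts.get(candidate, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(suggestions("alice"))   # [('dave', 2)] -- two mutual friends
```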

Correlating All These Data

While American Express probably evaluates petabytes [2] of data each year, Cynthia Rudin at Columbia and Mike Flowers at the City of New York probably worked with much less data. However, they worked with multiple collections of data from different sources – data that was not neatly organized into the rows and columns of relational databases managed by DB2, Oracle, or SQL Server.

Nor did Rudin and her team run a classical statistical analysis on a sample. They didn’t analyze 1% or 2% of the 51,000 manholes in Manhattan or the 250,000 manholes in the five boroughs. They looked at the entire dataset. The sample size was 100% – in statistical terms, the sample was the whole population, “N = all.”

Similarly, Flowers didn’t look at 1% or 2% of the roughly 4.4 million single-family and two-family homes [3] in NYC. He and his team looked at data on 900,000 units – a sample of roughly 20%.

This is a fundamental change in technique, made possible by high-capacity, inexpensive disk drives, high-speed, low-cost memory, and graphics processing units (GPUs), which are designed for simultaneous, or parallel, execution of simple operations on large amounts of data.
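The “simple operation applied to a lot of data at once” pattern looks like the sketch below; it uses NumPy on a CPU as a stand-in, while a GPU applies the same idea across thousands of cores in parallel:

```python
import numpy as np

# One simple operation applied to millions of values at once, rather than
# looping over them one at a time -- the data-parallel style GPUs are built for.
readings = np.random.rand(10_000_000)      # e.g. ten million sensor readings
flagged = readings > 0.999                 # elementwise comparison in one call
print(int(flagged.sum()), "values above the threshold")
```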

As Viktor Mayer-Schönberger and Kenneth Cukier hammer home, science has been built on the search for causation, using statistical samples of events to understand how and why they happen. Statistics are built on representative samples. But things are different when your sample size is 100%. At that point you can look for correlations.

That’s what Cynthia Rudin and Mike Flowers did. They didn’t look for causation; they looked for correlation. And they found it. Netflix had a high degree of confidence that 11 million people would watch one or more episodes of “House of Cards.” Amazon and Apple realized that people who bought songs or CDs by certain artists would buy songs or stream music by other artists in the same genre. American Express looks for purchases that make sense and watches for transactions that are out of sync: you cannot use your credit card in stores in Houston or San Francisco when you are in L.A. Correlation, not causation.
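The American Express example is easy to picture as a rule over correlated attributes: a charge whose location is inconsistent with the cardholder’s other recent charges stands out. A toy check (the cities, times, and two-hour window below are invented for illustration, not any issuer’s actual rules):

```python
from datetime import datetime, timedelta

# Toy transaction history for one card: (timestamp, city).
history = [
    (datetime(2023, 9, 15, 9, 0),  "Los Angeles"),
    (datetime(2023, 9, 15, 9, 40), "Los Angeles"),
    (datetime(2023, 9, 15, 10, 5), "Houston"),      # out of sync with the others
]

def out_of_sync(transactions, window=timedelta(hours=2)):
    """Flag a charge whose city differs from another charge close in time."""
    flags = []
    for i, (t1, city1) in enumerate(transactions):
        for t2, city2 in transactions[:i]:
            if abs(t1 - t2) < window and city1 != city2:
                flags.append((t1, city1))
                break
    return flags

print(out_of_sync(history))   # the Houston charge is flagged
```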

NOTES:

  1. Netflix Architecture, GeeksforGeeks. https://www.geeksforgeeks.org/system-design-netflix-a-complete-architecture/
  2. A petabyte is one thousand terabytes, one million gigabytes, one billion megabytes, or one trillion kilobytes. For reference, one page of single-spaced typed text may be about four kilobytes, and an image taken with a two-to-four-megapixel digital camera, circa 2010, will be roughly one megabyte. A petabyte could therefore hold about one billion such images or 250 billion pages of text.
  3. New York Housing Statistics, Infoplease, viewed Sept. 15, 2023. https://www.infoplease.com/us/census/new-york/housing-statistics