Introduction: Fortune 500 Companies

7 minute read

R Counterpart

Overview

The purpose of this introductory notebook is to get our feet wet with Python for data analysis, starting off with a simple, but certainly not boring, dataset. We will be using the Fortune 500 companies list from 1955-2005.

We'll do some simple plotting with pandas and matplotlib to explore the data in greater detail and address issues with the data as we come across them.

Imports

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The holy trinity of Python data analysis: if ever in doubt, these imports are always a good place to start.

Data

The external data used in this notebook can be obtained from: https://www.dataquest.io/blog/large_files/fortune500.csv

The CSV file contains information about Fortune 500 companies spanning the years 1955-2005.
There are 25,500 rows, representing 51 years of 500 companies each.

The provided columns include:

  • Year - year that the Fortune 500 list was published
  • Rank - company's position on the fortune 500 list
  • Company - name of the company
  • Revenue (in millions) - company's revenue for that year
  • Profit (in millions) - company's profit for that year
In [1]:
### Uncomment this line to pull in the dataset ###
#!wget https://www.dataquest.io/blog/large_files/fortune500.csv
--2018-09-10 23:24:21--  https://www.dataquest.io/blog/large_files/fortune500.csv
Resolving www.dataquest.io (www.dataquest.io)... 104.196.190.180
Connecting to www.dataquest.io (www.dataquest.io)|104.196.190.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 882629 (862K) [text/csv]
Saving to: 'fortune500.csv'

2018-09-10 23:24:23 (527 KB/s) - 'fortune500.csv' saved [882629/882629]

In [4]:
f500_df = pd.read_csv('data/fortune500.csv')
f500_df.head()
Out[4]:
Year Rank Company Revenue (in millions) Profit (in millions)
0 1955 1 General Motors 9823.5 806
1 1955 2 Exxon Mobil 5661.4 584.8
2 1955 3 U.S. Steel 3250.4 195.4
3 1955 4 General Electric 2959.1 212.6
4 1955 5 Esmark 2510.8 19.1
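
As a quick sanity check on the advertised shape, the row count should be exactly 51 years times 500 companies; a minimal check of that claim:

In [ ]:
# Expect (25500, 5): 51 years x 500 companies, with 5 columns
f500_df.shape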

Exploratory Data Analysis

Overall, the data is relatively clean. However, there are a few potential issues that should be noted:

  • The Profit column had some entries with a value of 'N.A.', which caused the dtype of the entire column to be cast to object instead of float, preventing many numpy operations. This was resolved by simply replacing the 'N.A.' values with np.nan (0 could also have been used).
  • Two companies (CBS and OfficeMax) appear to have been added to the list twice between the years 1972-1981 and 1995-2003. No attempt was made to address these duplications, just in case there was a particular reason for them (such as pre-merger representations), but it could be done by dropping the duplicate rows or replacing the company name with 'unknown'.
In [3]:
f500_df.dtypes
Out[3]:
Year                       int64
Rank                       int64
Company                   object
Revenue (in millions)    float64
Profit (in millions)      object
dtype: object
In [5]:
# replace string 'N.A.' with numpy's nan value to correct dtype of Profit column
f500_df['Profit (in millions)'] = f500_df['Profit (in millions)'].replace('N.A.', np.nan).astype(float)
f500_df.dtypes
Out[5]:
Year                       int64
Rank                       int64
Company                   object
Revenue (in millions)    float64
Profit (in millions)     float64
dtype: object

The dtype of Profit hinted at a potential issue with the data. Replacing the 'N.A.' values with numpy's np.nan allows us to cast the column to float instead of object, opening the data up to far more manipulations.
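
As an aside, pd.to_numeric with errors='coerce' achieves the same result without needing to know the placeholder string in advance; a sketch of this alternative:

In [ ]:
# Alternative: coerce anything non-numeric (e.g. 'N.A.') straight to NaN
f500_df['Profit (in millions)'] = pd.to_numeric(
    f500_df['Profit (in millions)'], errors='coerce')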

In [38]:
f500_df.Company.nunique()
Out[38]:
1887

1887 unique companies have been included on the Fortune 500 list over the years.

In [8]:
(f500_df.Company.value_counts() > 50).sum()
Out[8]:
74

74 companies have more than 50 entries; given the duplicate rows noted above, this is a close proxy for (though not exactly) the set of companies that have been on the list every year since 1955.
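
A duplicate-robust version of the same check counts distinct years per company instead of raw rows; a sketch (not run in this session):

In [ ]:
# Companies appearing in all 51 distinct years, unaffected by duplicate rows
(f500_df.groupby('Company').Year.nunique() == 51).sum()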

In [7]:
f500_df.Company.value_counts()[:10]
Out[7]:
CBS                       57
OfficeMax                 55
Dow Chemical              51
Northrop Grumman          51
Paccar                    51
McGraw-Hill               51
General Mills             51
Unisys                    51
Bristol-Myers Squibb      51
Navistar International    51
Name: Company, dtype: int64

It appears that CBS and OfficeMax have duplicate entries in certain years; let's see if we can track down the offending years.

In [3]:
# CBS is duplicated in 1972-1981 and 1995; OfficeMax in 1996-2003
(f500_df[f500_df.Company.isin(['CBS','OfficeMax'])]
 .groupby(['Year','Company'])[['Company','Year']]
 .count()).query('Company > 1')
Out[3]:
Company Year
Year Company
1972 CBS 2 2
1973 CBS 2 2
1974 CBS 2 2
1975 CBS 2 2
1976 CBS 2 2
1977 CBS 2 2
1978 CBS 2 2
1979 CBS 2 2
1980 CBS 2 2
1981 CBS 2 2
1995 CBS 2 2
1996 OfficeMax 2 2
1997 OfficeMax 2 2
1998 OfficeMax 2 2
1999 OfficeMax 2 2
2000 OfficeMax 2 2
2001 OfficeMax 2 2
2002 OfficeMax 2 2
2003 OfficeMax 2 2
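
If we did want to collapse these duplicates, a single drop_duplicates call would do it; a sketch (keep='first' arbitrarily keeps the first entry of each pair):

In [ ]:
# One way to collapse the duplicate (Year, Company) pairs, if desired
f500_dedup = f500_df.drop_duplicates(subset=['Year', 'Company'], keep='first')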
In [6]:
# Choose 3 companies for plotting
comp3_df = f500_df[f500_df.Company.isin(['Kellogg','3M','Pfizer'])].copy()
In [4]:
# Rename the columns for easier pandas access
comp3_df.rename(columns={'Revenue (in millions)':'Revenue_m', 'Profit (in millions)':'Profit_m'}, inplace=True)
In [5]:
comp3_df.sort_values(['Year','Rank'])[:9]
Out[5]:
Year Rank Company Revenue_m Profit_m
130 1955 131 3M 230.9 24.6
185 1955 186 Kellogg 169.5 12.7
214 1955 215 Pfizer 145.2 15.2
624 1956 125 3M 281.9 34.7
689 1956 190 Kellogg 182.1 13.8
706 1956 207 Pfizer 163.8 15.3
1115 1957 116 3M 330.8 38.7
1195 1957 196 Kellogg 201.7 15.1
1217 1957 218 Pfizer 178.4 18.3
In [6]:
plt.style.use('seaborn')
fig, ax = plt.subplots(figsize=(10,6))
ax.plot('Year','Rank',data=comp3_df[comp3_df.Company=='3M'],c='red', linestyle='solid',label='3M')
ax.plot('Year','Rank',data=comp3_df[comp3_df.Company=='Kellogg'],c='orange', linestyle='dotted',label='Kellogg')
ax.plot('Year','Rank',data=comp3_df[comp3_df.Company=='Pfizer'],c='blue', linestyle='dashed',label='Pfizer')
ax.axvline(1994)
ax.set_xlabel('Year', fontdict={'fontsize':'14'})
ax.set_xticks(np.arange(1955,2006,5))

ax.set_ylabel('Rank',fontdict={'fontsize':'14'})
ax.set_yticks(np.arange(0,300,30))
ax.invert_yaxis()

ax.legend(fontsize='large',loc='best')
fig.suptitle('Fortune 500 Rank')
ax.set_title('3M, Kellogg, and Pfizer (1955-2005)');

Clearly, there was an upset in 1994 that led to a rank decline for at least three powerful companies. In the absence of other information, one might guess that the proliferation of the internet was at play in these sharp declines.

In [235]:
f93_96 = f500_df[f500_df.Year.isin(range(1993,1997))][['Year','Company','Rank']]

To get a closer look, we'll zoom in on the range between 1993 and 1996.

In [257]:
f36_piv = f93_96.pivot_table(values=['Rank'],index=['Company'],columns=['Year']).Rank

Using a pivot table, we can see which companies joined the Fortune 500 list between these years.

In [258]:
f36_piv.fillna(501).diff(axis=1)[[1994,1995]].sort_values([1995])[:10]
Out[258]:
Year 1994 1995
Company
Wal-Mart Stores 0.0 -497.0
AT&T 0.0 -496.0
Sears Roebuck 0.0 -492.0
State Farm Insurance Cos 0.0 -489.0
Prudential Ins. Co. of America 0.0 -488.0
Kmart Holding 0.0 -486.0
Citicorp 0.0 -484.0
ITT Industries 0.0 -478.0
Kroger 0.0 -476.0
American Intl. Group 0.0 -475.0

Here, the large negative numbers show how far each company jumped in rank between 1994 and 1995: absent companies were filled in with rank 501, so Wal-Mart's -497.0 means it entered the list at rank 4. A 0.0 in the 1994 column means the company was absent in both 1993 and 1994.
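
To make the fillna/diff trick concrete, here is a toy illustration with a hypothetical pair of companies (NewCo joins the list in 1995, OldCo just moves around):

In [ ]:
toy = pd.DataFrame({1993: [np.nan, 10], 1994: [np.nan, 12], 1995: [4, 11]},
                   index=['NewCo', 'OldCo'])
toy.fillna(501).diff(axis=1)
# NewCo -> 1994: 0.0 (absent both years), 1995: -497.0 (entered at rank 4)
# OldCo -> 1994: 2.0 and 1995: -1.0 (ordinary movement)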

In [269]:
f36_piv[f36_piv[1994].isna() & ~f36_piv[1995].isna()].shape
Out[269]:
(288, 4)

In total, 288 companies joined the Fortune 500 list that were not on it the previous year. So, our original hypothesis that the internet was the largest contributor is likely inaccurate. Given the sheer number of new companies on the list, the more likely scenario is that in 1995 there was a change in how the rankings were calculated, or more companies were being considered.

In [343]:
f5s = f500_df.pivot_table(values=['Rank'],index=['Year'],columns=['Company']).Rank
In [420]:
# For each consecutive pair of years, count companies on the list in year i
# but absent in year i+1 (keyed by the later year)
year_diff = pd.Series({i+1: len(set(f5s.loc[i].dropna().index) - set(f5s.loc[i+1].dropna().index))
                       for i in f5s.index if i < 2005})
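
The same counts can be had without set arithmetic by comparing presence masks directly; a more vectorized sketch (unlike the dict comprehension, it also emits a 0 for 1955):

In [ ]:
# A company 'drops off' when present in year i but absent in year i+1
present = f5s.notna()
dropped = (present.shift(1, fill_value=False) & ~present).sum(axis=1)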
In [434]:
ax = year_diff.plot(xticks=np.arange(1955,2006,5),
                    title="Difference in companies compared to previous year")
ax.set_xlabel('Year')
ax.set_ylabel(r'$\Delta$ Companies');

Looking at a chart of the year-to-year change in listed companies, we can see that 1994 to 1995 is an enormous outlier compared to all other year pairs. Assuming that there isn't an issue with the dataset itself, we can almost certainly attribute this change to a revision in how the Fortune 500 list was chosen.

Conclusions

This notebook took a brief look at the Fortune 500 companies dataset (1955-2005). Our analysis led to several interesting discoveries: we found that CBS and OfficeMax are included twice in certain years, and that from 1994 to 1995 there was a dramatic shift in the included companies. Despite the chosen dataset being relatively simple, we still managed to extract some useful information from it. We did not even use the revenue or profit columns, so there is still a lot of room for exploration.

Future Work

We leave this notebook with some unanswered questions:

  • What actually caused the '94-95 shift?
  • Why were some companies included multiple times?

A fully fleshed out analysis would seek external sources of information and bring them into the notebook to answer these questions. Additionally, we could certainly investigate profits and revenue to see how they correlate with ranking from year to year, or even do some simple linear modeling to really start to understand this data. These sorts of analyses, and many others, will be performed in the following notebooks, so have a look around if you'd like to dig a little deeper.
