Hotelling Society - Invited Symposium
Psychology 6140

Analysis of Baseball Salaries

Hotelling Society Symposium
Dataset Description and Call for Papers

The Statistical Education Section of the Society for Exploration of Multivariate Data (commonly known as the Hotelling Society) is sponsoring a special symposium and poster session titled Why They Make What They Make - An Analysis of Major League Baseball Salaries. The session will be held in February, at the next Society meetings in Toronto, and is intended to allow members to compare techniques for analyzing and displaying data. The session provides a forum for both old and new statistical and graphical techniques to describe and summarize the data. Your participation is invited.

The results of each analysis will be discussed during the session. You have the option of presenting a 15-minute oral discussion of your results or preparing a poster. Each participant will be given the opportunity to display his/her results and other conference attendees will be encouraged to discuss them with you.

Since this session is treated as an organized contributed-paper session, you must submit a title and abstract to the Society office, chaired by Prof. Friendly. The abstract must be postmarked no later than January 15, 2017. The abstract can be very general. A sample abstract is provided below as a model. Variations might include descriptions of the analyses you carried out, or what you found. You have until the conference to work on the analysis. Please send a (possibly revised) copy of your abstract by email to the conference chair no later than two weeks before the conference. After the meetings you will be given an opportunity to include a short paper describing your analysis in the 2017 Proceedings of the Statistical Education Section of the Hotelling Society.

Sample Abstract

Please submit your abstract single-spaced, in the body of an email message (no Word attachments, please-- copy/paste instead) in a format like that shown below. You can use this template, and send it by email to friendly(AT)yorku(DOT)ca with a Subject like "6140 Hotelling Abstract".

\Title{Analysis of Baseball Salary Data}
\Author{Crunch D. Numbers}{Babe Ruth University}
This paper describes and summarizes the relationships between the 1987 salaries of major league baseball players and the players' performance. We use regression methods to relate a player's salary to his 1986 and career performance statistics and to the player's team. Some surprising findings are discussed.

Description of the Baseball Data Sets

The data consist of three files: the regular and leading substitute hitters in 1986, the regular pitchers in 1986, and the team statistics. You have the option of focussing your analysis on any one or more of the hitter data, the pitcher data, or the team data (or something else you find; see Other sources below).

The data are stored in various formats (SAS input files, SAS stored datasets, SPSS system (.sav) files) on the Hebb server.

These files may also be downloaded from the links below. (Use [Shift-Click] to save them in a local file on your computer.)

File             SAS input file   SAS dataset in psy6140\lib\   View it        SPSS dataset   CSV dataset
                 (raw data)                                                                   (comma delimited)
Hitters                           (V9: baseball.sas7bdat)       hitters.htm    baseball.sav   baseball.csv
Pitchers                          psy614.pitcher                pitcher.htm    pitcher.sav    pitcher.csv
                                  (V9: pitcher.sas7bdat)
Teams                             (V9: team.sas7bdat)           teams.htm      team.sav       team.csv
Hitters 92 [1]                                                  bball92.html                  bball92.csv
Attendance [2]                                                                                MLBattend.csv

[1] Contains the Major League Baseball players who played at least one game in both the 1991 and 1992 seasons, excluding pitchers, with 1992 salaries.
[2] Contains home game attendance, wins, losses, etc. for all teams from 1969-2000.

Data sources

The salary data were taken from Sports Illustrated, April 20, 1987. The salary of any player not included in that article is listed as a missing value. The 1986 and career statistics were taken from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York. The team attendance figures were obtained from the Elias Sports Bureau (personal communication).

Note: The data was entered by the Society's clerical staff from coding forms provided by the symposium organizers. While the staff were generally careful and skilled at data entry, and printouts were checked for obvious errors, no checks against original sources were carried out, and the possibility that errors are contained in the data cannot be ruled out. Standard statistical practices of data screening should therefore be included in your analyses.
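
As one possible starting point for the data screening mentioned above, the sketch below prints ranges and missing-value counts and then flags records that are internally inconsistent. The data set HITTERS_DEMO and its three rows are hypothetical values made up for illustration, not taken from the symposium files, and the checks assume that career totals include the 1986 season.

```sas
/* Hypothetical rows for illustration only; not from the symposium files */
data hitters_demo;
   input name :$12. atbat hits homer atbatc hitsc salary;
   datalines;
PlayerA 400 100 10 2000 520 350
PlayerB 300 320 5 1500 400 .
PlayerC 250 60 8 200 55 120
;
run;

/* Counts, missing-value counts and ranges for every numeric variable */
proc means data=hitters_demo n nmiss min max;
run;

/* Keep only records with an internal inconsistency
   (assumes career totals include the 1986 season) */
data suspect;
   set hitters_demo;
   length problem $40;
   if hits > atbat then problem = 'HITS exceeds ATBAT';
   else if atbatc < atbat then problem = 'Career ATBAT below season ATBAT';
   else if hitsc < hits then problem = 'Career HITS below season HITS';
   if problem ne ' ';
run;

proc print data=suspect;
run;
```

With these made-up rows, PlayerB (more hits than at-bats) and PlayerC (career at-bats below season at-bats) are flagged; the same checks could be pointed at the real hitter data set by changing the SET statement.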


One obvious goal is to use statistical and/or graphical methods to explain differences in the salaries of major league baseball players and to answer the question: "Are players paid according to their performance?"

If, on examination of the data, you deem some other question(s) of greater interest, feel free to make that the basis of your contribution. In addition you are free to construct other variables from those given (such as batting average, which is calculated as a ratio of the number of hits to times at bat).

Finally, you are also free to gather and use other data related to your chosen question.
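
Constructing other variables from those given can be done in a DATA step. The sketch below is only an illustration: the rows are hypothetical, and the new variable names (HR_RATE, ATBAT_YR, LOGSAL) and the choice of these particular transformations are assumptions, not part of the symposium files.

```sas
/* Hypothetical rows for illustration only; not from the symposium files */
data hitters_new;
   input name :$12. atbat hits homer years atbatc hitsc homerc salary;
   /* Batting average: ratio of hits to times at bat, scaled by 1000 */
   if atbat > 0 then batavg = 1000 * hits / atbat;
   /* Home runs per at-bat in 1986 (hypothetical new variable) */
   if atbat > 0 then hr_rate = homer / atbat;
   /* Career at-bats per year played (hypothetical new variable) */
   if years > 0 then atbat_yr = atbatc / years;
   /* Log salary; left missing when salary is missing */
   if salary > 0 then logsal = log(salary);
   datalines;
PlayerA 400 100 10 5 2000 520 60 350
PlayerB 300 75 5 3 900 230 12 .
;
run;

proc print data=hitters_new;
run;
```

The guards on ATBAT, YEARS and SALARY leave the derived value missing instead of dividing by zero or taking the log of a missing salary.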

Other sources

You might want to examine some of the following internet sources:
Sean Lahman's Baseball Archive has data sets with batting statistics for every player from 1871-2014, as well as various team and pitcher statistics. See also his Stats Archive page.
Chris Green's Baseball Resources lists a more comprehensive and recent collection of Baseball sites with databases, statistics, links to baseball research, etc.
Articles dealing with baseball statistics from Chance. In addition, Scott Berry writes a column, "A statistician reads the sports pages", in each issue of Chance, which often contains baseball-related data.
USA Today - Statistics for individuals from 1992 - 2005.
CBS Sportsline - Sortable statistics for players and teams (current season)
The Society for American Baseball Research's compilation of baseball resources on the Internet, and their data archive of baseball statistics (including 1995 team rosters and salaries).
A related compilation of baseball statistics on the internet (Historical baseball performance stats, recent salary information, etc.)
The World Wide Web of Sports at MIT also has an extensive baseball section.

The Hitter File

You can easily view these datasets from SAS. From the Explorer window, choose Libraries, then select PSY614 and double-click the name of one of the three data sets. Some of these measures are defined below; see the Baseball Glossary or Baseball jargon for definitions of most baseball terms and statistics not described below.
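
The same inspection can be done in code. This is a minimal sketch that assumes the libref PSY614 is already assigned in your SAS session on the course server; PSY614.PITCHER is the data set name given in the download table.

```sas
/* Assumes the libref PSY614 is already assigned in this session */

/* List the variables, their types and labels */
proc contents data=psy614.pitcher;
run;

/* Print the first 10 observations */
proc print data=psy614.pitcher(obs=10);
run;
```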

There is one observation per hitter in the file. Unless otherwise noted, all performance statistics refer to the 1986 baseball season. Career statistics count all years that a player actually played in the major leagues (some players are sent back down to the minors and called up again). The variables are:

NAME Hitter's name
LEAGUE Player's league
TEAM Player's team
ATBAT Times at Bat: Number of official plate appearances by a hitter. It counts as an official at-bat as long as the batter does not walk, sacrifice, get hit by a pitch or reach base due to catcher's interference.
HITS Hits
HOMER Home Runs
RUNS Runs: The number of runs scored by a player. A run is scored by an offensive player who advances from batter to runner and touches first, second, third and home base in that order without being put out.
RBI Runs Batted In: A hitter earns a run batted in when he drives in a run via a hit, walk, sacrifice (bunt or fly), fielder's choice, hit batsman, or on an error (when the official scorer rules that the run would have scored anyway).
WALKS Walks: A "walk" (or "base on balls") is an award of first base granted to a batter who receives four pitches outside the strike zone.
YEARS Years in the Major Leagues. As far as we can tell, this counts all years a player has actually played in the Major Leagues, not necessarily consecutive.
ATBATC Career Times at Bat
HITSC Career Hits
HOMERC Career Home Runs
RUNSC Career Runs Scored
RBIC Career Runs Batted In
POSITION Player's position(s). See list of codes used below under Coding for some of the variables. (You are free to recode these as you see fit.)
PUTOUTS Put Outs. A put out is credited when a fielder causes a batter or runner to be, well, put out; e.g., catches the batter's fly ball, tags a base runner out before he reaches the base, etc.
ASSISTS Assists. An assist is credited when a fielder assists in a play causing a player to be put out.
SALARY 1987 annual salary on opening day (in $1000s)
BATAVG Batting Average, calculated as 1000*(HITS/ATBAT)
BATAVGC Career Batting Average, calculated as 1000*(HITSC/ATBATC)

The Pitcher File

There is one observation per pitcher in the file. Unless otherwise noted, all performance statistics refer to the 1986 baseball season. The variables are:

NAME Pitcher's name
TEAM Team at the end of 1986
LEAGUE League at the end of 1986
WINS Number of Wins
LOSSES Number of Losses
ERA Earned Run Average
GAMES Number of Games
INNINGS Number of Innings pitched
SAVES Number of Saves
YEARS Years in the major leagues
WINSC Number of Wins during his career
LOSSESC Number of Losses during his career
ERAC Earned Run Average during his career
GAMESC Number of Games during his career
INNINGC Number of Innings pitched during his career
SAVESC Number of Saves during his career
SALARY 1987 annual salary ($1000s)
LEAGUE7 League at the beginning of 1987
TEAM7 Team at the beginning of 1987

The Team File

There is one observation per team in the file. The variables are:
RANK Position in final league standings 1986
WINS Number of wins in 1986
LOSSES Number of losses in 1986
ATTHOME Attendance for home games in 1986
ATTAWAY Attendance for away games in 1986
SALARY 1987 average salary ($1000s)
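
If you want to relate a player's salary to his team's statistics, the team-level variables can be attached to each hitter by merging on the team code. The sketch below uses hypothetical rows and assumes that both files carry a TEAM variable with the same 3-character codes. Because both files contain a variable named SALARY, the team average is renamed (to the hypothetical name TEAMSAL) so that it does not overwrite the player's salary.

```sas
/* Hypothetical rows for illustration only; not from the symposium files */
data hitters_demo;
   input name :$12. team :$3. salary;
   datalines;
PlayerA BOS 350
PlayerB TOR 120
PlayerC BOS 90
;
run;

data teams_demo;
   input team :$3. wins losses salary;
   datalines;
BOS 90 70 400
TOR 85 75 380
;
run;

/* Both data sets must be sorted by the merge key */
proc sort data=hitters_demo; by team; run;
proc sort data=teams_demo;   by team; run;

data hitters_team;
   /* Rename team-level variables that clash with player-level ones */
   merge hitters_demo(in=inhit)
         teams_demo(rename=(wins=teamwins losses=teamloss salary=teamsal));
   by team;
   if inhit;   /* keep one row per hitter */
run;

proc print data=hitters_team;
run;
```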

Coding for some of the Variables

Team Names

Team names are recorded in all data sets as the 3-character codes given below. The hitter's file contains a SAS format, $team., which can be used to print or display the team names in more readable form. (Note that team codes uniquely distinguish American and National League teams in the same city.)
   value $team
     'ATL'='Atlanta      '
     'BAL'='Baltimore    '
     'BOS'='Boston       '
     'CAL'='California   '
     'CHA'='Chicago A    '   /* Sox */
     'CHN'='Chicago N    '   /* Cubs */
     'CIN'='Cincinnati   '
     'CLE'='Cleveland    '
     'DET'='Detroit      '
     'HOU'='Houston      '
     'KC '='Kansas City  '
     'LA '='Los Angeles  '
     'MIL'='Milwaukee    '
     'MIN'='Minnesota    '
     'MON'='Montreal     '
     'NYA'='New York A   '   /* Yankees */
     'NYN'='New York N   '   /* Mets */
     'OAK'='Oakland      '
     'PHI'='Philadelphia '
     'PIT'='Pittsburgh   '
     'SD '='San Diego    '
     'SEA'='Seattle      '
     'SF '='San Francisco'
     'STL'='St. Louis    '
     'TEX'='Texas        '
     'TOR'='Toronto      ';


The data files use the following 1-character codes for the LEAGUE variable(s):
     N    National
     A    American


The team file uses the following 1-character codes for the DIVISION variable:
     W    West
     E    East

Player's position(s):

If a substitute played 70% of his games at one position, that is the only position listed for him in the hitter's data set. If he did not play 70% of his games at one position, but played 90% of his games at two positions, he is listed with a combination position, such as "S2" for shortstop and second base, or "CO" for catcher and outfield. If a player failed to meet either the 70% or 90% requirement listed above, he is listed as a utility player ("UT").
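
The 70%/90% rule can be sketched as a DATA step. Everything in the input here is hypothetical: the data files do not contain games-by-position counts, so the variables SHARE1 and SHARE2 (proportion of games at the most-played and second most-played positions), CODE1 (2-character code of the main position) and ABBR1/ABBR2 (1-character abbreviations used in combination codes) are made up for illustration. The sketch also assumes that "played 70%" means "at least 70%".

```sas
/* Hypothetical rows for illustration only; not from the symposium files */
data poscode;
   length code1 position $2 abbr1 abbr2 $1;
   input name :$12. code1 $ abbr1 $ share1 abbr2 $ share2;
   /* At least 70% of games at one position: list that position only */
   if share1 >= 0.70 then position = code1;
   /* Otherwise, at least 90% at two positions: combination code */
   else if share1 + share2 >= 0.90 then position = cats(abbr1, abbr2);
   /* Neither requirement met: utility player */
   else position = 'UT';
   datalines;
PlayerA SS S 0.85 2 0.10
PlayerB 2B 2 0.55 S 0.40
PlayerC 3B 3 0.40 O 0.30
;
run;

proc print data=poscode;
run;
```

With these made-up rows, PlayerA is coded SS, PlayerB is coded 2S, and PlayerC is coded UT.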

The list below shows the complete set of 2-character codes used for player's position in the hitter's file. These values define a SAS format, $posfmt., which can be used to print the positions in the form shown on the right of the = sign.

   value $posfmt
     '1B' = 'First Base'
     '2B' = 'Second Base'
     'SS' = 'Short Stop'
     '3B' = 'Third Base'
     'RF' = 'Right Field'
     'CF' = 'Center Field'
     'LF' = 'Left Field'
     'C ' = 'Catcher'
     'DH' = 'Designated Hitter'

     'OF' = 'Outfield'
     'UT' = 'Utility'
     'OS' = 'Outfield & Short Stop'
     '3S' = 'Third Base & Short Stop'
     '13' = 'First & Third Base'
     '3O' = 'Third Base & Outfield'
     'O1' = 'Outfield & First Base'
     'S3' = 'Short Stop & Third Base'
     '32' = 'Third & Second Base'
     'DO' = 'Designated Hitter & Outfield'
     'OD' = 'Outfield & Designated Hitter'
     'CD' = 'Catcher & Designated Hitter'
     'CS' = 'Catcher & Short Stop'
     '23' = 'Second & Third Base'
     '1O' = 'First Base and Outfield'
     '2S' = 'Second Base and Short Stop';
In addition, the BASEBALL SAS file defines another format $pos., which can be used to collapse positions into a shorter list, based on a player's primary fielding position:
   /* Recode position to short list */
   value $pos
     'CS','CD'       ='C '
     'OS','O1','OD'  ='OF'
     'CF','RF','LF'  ='OF'
     '1O','13'       ='1B'
     '2S','23'       ='2B'
     'DO'            ='DH'
     'S3'            ='SS'
     '32','3S','3O'  ='3B' ;
For example, to find the average salary of players according to the collapsed position codes, use format position $pos.; with the MEANS procedure:
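proc means;                  /* uses the most recently created data set unless DATA= is given */
   class position;           /* one group per formatted position */
   format position $pos.;    /* collapse positions to the short list */
   var salary;
run;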