
Assignment - Module 2

27 July 2021

Group 23: Aman Jindal | Yuhang Jiang | Daniel Gabriel Tan | Qining Liu



# Importing Libraries

import pandas as pd
import numpy as np
import datetime as dt
import os
import re
from pprint import pprint
from functools import reduce


Q1. Pandas Mul Function - Use the dataframe created in the code chunk below. We have three columns with random float values. The goal is to multiply columns 1 and 2 by column 3. First, notice how evaluating the expression (df[['col1','col2']]*df['col3']) returns a matrix of NaN values. Now, use the .mul() function by passing df[['col1','col2']].mul(df['col3'], axis=0). Notice how we get the same NaN matrix when we use axis=1 instead.

# Creating a dataframe for problem 1
np.random.seed(123)
df = pd.DataFrame() # Creating a blank data-frame
df['col1'] = np.random.rand(10)
df['col2'] = np.random.rand(10)
df['col3'] = np.random.rand(10)
df
       col1      col2      col3
0  0.696469  0.343178  0.634401
1  0.286139  0.729050  0.849432
2  0.226851  0.438572  0.724455
3  0.551315  0.059678  0.611024
4  0.719469  0.398044  0.722443
5  0.423106  0.737995  0.322959
6  0.980764  0.182492  0.361789
7  0.684830  0.175452  0.228263
8  0.480932  0.531551  0.293714
9  0.392118  0.531828  0.630976

Answer 1:

df[['col1','col2']]*df['col3']
   col1  col2   0   1   2   3   4   5   6   7   8   9
0   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
df[['col1','col2']].mul(df['col3'],axis=0)
       col1      col2
0  0.441841  0.217712
1  0.243056  0.619278
2  0.164344  0.317726
3  0.336866  0.036465
4  0.519776  0.287564
5  0.136646  0.238342
6  0.354829  0.066023
7  0.156321  0.040049
8  0.141256  0.156124
9  0.247417  0.335571
df[['col1','col2']].mul(df['col3'],axis=1)
   col1  col2   0   1   2   3   4   5   6   7   8   9
0   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9   NaN   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
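The NaN matrix is an index-alignment artifact: with the plain * operator (and with axis=1), pandas aligns the Series index (0 through 9) against the DataFrame columns ('col1', 'col2'); the two label sets do not overlap, so every cell in their union is NaN. With axis=0 the Series is aligned against the row index, which does match, and the element-wise product is returned. A minimal check of this reading (not part of the original answer) is to bypass label alignment with the raw NumPy values, which should print True:

# Sanity-check sketch: multiplying by the underlying array skips label
# alignment and reproduces the axis=0 result exactly
print(df[['col1','col2']].mul(df['col3'].values, axis=0)
        .equals(df[['col1','col2']].mul(df['col3'], axis=0)))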


Q2. Load CSV data (stock_data.csv) - parse dates using the pd.to_datetime command, read the stock returns as numeric values using the pd.to_numeric command, drop all NaN values using dropna(), and reset the index. We will be working with this dataframe (unless stated otherwise) for the remainder of the questions below.

Answer 2:

input_file_folder = 'Data' # Folder Name within current directory where the file is stored
input_file_name = 'stock_data.csv' # Name of the File
cwd = os.getcwd() 
input_file_path = os.path.join(cwd, input_file_folder, input_file_name)
# Reading the Data

df = pd.read_csv(filepath_or_buffer=input_file_path, sep=',',header=0)

df.dtypes
permno                int64
company_name         object
date                 object
total_returns        object
price                object
share_outstanding     int64
dtype: object
# Converting `total_returns` to numeric, `price` to numeric, `date` to datetime  

df['total_returns'] = pd.to_numeric(df['total_returns'], errors='coerce')
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df.dtypes
permno                        int64
company_name                 object
date                 datetime64[ns]
total_returns               float64
price                       float64
share_outstanding             int64
dtype: object
# Dropping NaN rows and resetting the index

df.dropna(axis=0, how='any', inplace=True)
df.reset_index(drop=True, inplace=True)
print('DataFrame Shape: {}'.format(df.shape))
DataFrame Shape: (2913, 6)
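The same cleaning can be expressed as a single chained call; this is just an equivalent sketch of the steps above (same file path and column names assumed), not a different procedure:

# Equivalent one-pass version of the load-and-clean steps above (sketch)
df_alt = (pd.read_csv(input_file_path, sep=',', header=0)
            .assign(date=lambda d: pd.to_datetime(d['date'], errors='coerce'),
                    total_returns=lambda d: pd.to_numeric(d['total_returns'], errors='coerce'),
                    price=lambda d: pd.to_numeric(d['price'], errors='coerce'))
            .dropna(axis=0, how='any')
            .reset_index(drop=True))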


Q3. Aggregating - For each PERMNO, calculate the following statistics for the total_returns - sum, mean, std dev, and median. Hint - use .groupby('permno') and then aggregate by the required stats (the .agg() function might be useful).

Answer 3:

cols_to_group = ['total_returns']               # Columns to Group
cols_to_groupby = ['permno']                    # Columns to Groupby 

std_pop_lambda = lambda x: x.std(ddof=0)        # Custom population std dev as a lambda
std_pop_lambda.__name__ = 'Std_Pop_Lambda'      # Naming the lambda function
def std_pop(x): return x.std(ddof=0)            # Equivalent named function; no separate naming needed

col_stats_reqd = ['sum', 'mean', std_pop, 'median']      # Statistics required per permno for total_returns

df_returns_groups = df.groupby(by=cols_to_groupby)[cols_to_group]
df_returns_groups_stats = df_returns_groups.agg(col_stats_reqd).round(4)
display(df_returns_groups_stats.head())
       total_returns                        
                 sum    mean std_pop  median
permno                                      
10001         1.0360  0.0113  0.0870  0.0050
10006         1.8390  0.0163  0.0882  0.0037
10014         1.8046  0.0668  0.1621  0.0714
10028         4.1703  0.0184  0.1986  0.0000
10029         0.4969  0.0155  0.1406 -0.0187
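On pandas 0.25 or later the same statistics can be requested with named aggregation, which avoids the two-level column header; a sketch in which the output column names (ret_sum, ret_mean, ret_std_pop, ret_median) are illustrative:

# Named-aggregation variant of the groupby statistics above (sketch)
df_stats_named = df.groupby('permno')['total_returns'].agg(
    ret_sum='sum',
    ret_mean='mean',
    ret_std_pop=std_pop,      # population std dev function defined above
    ret_median='median').round(4)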


Q4. Filtering Text - You can parse the company name column to search for companies that match your keywords. A basic filter would be to check for "OIL", "GAS", and "ENERGY" companies. Write a program that returns the number of unique PERMNOs that match the above three keywords. Hint: Try using df['company_name'].str.contains(). If you are interested in text parsing in Python, regular expressions (the re library) are a powerful tool you may want to try out!

Answer 4:

col_reqd = 'permno'                         # Column name whose unique values are required
col_to_search = 'company_name'              # Column name in which the keywords need to be searched
info_to_collate = ['Count', 'permnos']      # Values that will be collated for each keyword
regex_dict = {'OIL': r'\boil\b',           # Regex patterns for keywords viz. oil, gas & energy
              'GAS': r'\bgas\b',
              'ENERGY': r'\benergy\b'}

answer_dict = {}
for key, pattern in regex_dict.items():
    mask = df[col_to_search].str.contains(pattern, flags=re.IGNORECASE, regex=True)
    unique_values_list = df.loc[mask, col_reqd].unique().tolist()
    answer_dict[key] = dict(zip(info_to_collate,
                                [len(unique_values_list),unique_values_list]))    
    
    
print('Dictionary of Permnos for specified Keywords\n\n')
pprint(answer_dict, sort_dicts=False)
Dictionary of Permnos for specified Keywords


{'OIL': {'Count': 0, 'permnos': []},
 'GAS': {'Count': 1, 'permnos': [10001]},
 'ENERGY': {'Count': 2, 'permnos': [10001, 10137]}}
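As a compact cross-check (a sketch, not part of the answer above), a single non-capturing pattern finds every PERMNO that matches any of the three keywords in one pass; the non-capturing group avoids the match-group warning that str.contains raises for capturing patterns:

# Combined search across all three keywords (sketch)
combined_pattern = r'\b(?:oil|gas|energy)\b'
mask_any = df[col_to_search].str.contains(combined_pattern, flags=re.IGNORECASE, regex=True)
permnos_any = sorted(df.loc[mask_any, col_reqd].unique().tolist())
print('PERMNOs matching any keyword:', permnos_any)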

Q5. Merging - Create a new dataframe with the last available date and price for each PERMNO. For example, the only entry for PERMNO 10001 will be 31-Jul-2017 (date) and 12.95 (price). Merge this new dataframe onto the original dataframe. Fill the missing/NaN values using the pandas fillna function with a method of your choice.

Answer 5:

col_to_groupby = 'permno'                       # Column to group by
col_max = 'date'                                # Column whose per-group maximum is required
cols_to_retain = ['permno','date','price']      # Columns which are required post filtering

df_filtered = df.loc[df.groupby(col_to_groupby)[col_max].agg(pd.Series.idxmax),
                     cols_to_retain]

df_merged = pd.merge(left=df, right=df_filtered, how='outer',
                     left_index=True, right_index=True,
                     suffixes=['_original','_filtered'], sort=False)

df_merged['permno_filtered'].fillna(df_merged['permno_original'], inplace=True)
df_merged['date_filtered'].fillna(df_merged['date_original'], inplace=True)
df_merged['price_filtered'].fillna(df_merged['price_original'], inplace=True)

display(df_merged.head())
print('\nDataFrame Shape: {}'.format(df_merged.shape))
   permno_original company_name date_original  total_returns  price_original  \
0            10001   ENERGY INC    2009-12-31       0.162621         10.3000   
1            10001   ENERGY INC    2010-01-31      -0.018932         10.0600   
2            10001   ENERGY INC    2010-02-28      -0.000656         10.0084   
3            10001   ENERGY INC    2010-03-31       0.020643         10.1700   
4            10001   ENERGY INC    2010-04-30       0.124385         11.3900   

   share_outstanding  permno_filtered date_filtered  price_filtered  
0               4361          10001.0    2009-12-31         10.3000  
1               4361          10001.0    2010-01-31         10.0600  
2               4361          10001.0    2010-02-28         10.0084  
3               4361          10001.0    2010-03-31         10.1700  
4               6070          10001.0    2010-04-30         11.3900  
DataFrame Shape: (2913, 9)
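The same "last available date and price" frame can also be built without idxmax, by sorting and keeping the last row per PERMNO; a sketch on the cleaned df from Answer 2:

# Alternative construction of the last-observation frame (sketch)
df_last = (df.sort_values(['permno', 'date'])
             .drop_duplicates(subset='permno', keep='last')
             .loc[:, ['permno', 'date', 'price']]
             .reset_index(drop=True))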


Q6. Lagged Market Cap - Create three new columns that contain the 6-month, 1-year, and 5-year lagged market cap. For example: PERMNO 10001 on 30-Jun-2016 will have the market cap as of 31-Dec-2015 (7.45 ⋅ 10505 = 78262.25) as its 6-month lag market cap. Perform a similar exercise to generate the 1-year and the 5-year lag market caps.

Answer 6:

# Method 1 of Solving Question 6

lag_freqs = [6, 12, 60]                                 # List of lag frequencies in months
df['market_cap'] = df['price']*df['share_outstanding']

dfs_to_add = [df.set_index('date').groupby('permno')['market_cap']
                .shift(periods=lag, freq='M')
                .rename('market_cap_lag'+str(lag)+'months')
                .reset_index() for lag in lag_freqs]

df_ans_m1 = reduce(lambda left,right: pd.merge(left, right, how='left',
                                               on=['permno','date']), [df] + dfs_to_add)

df_ans_m1.fillna(np.nan,inplace=True)
display(df_ans_m1.tail())
print('\nDataFrame Shape: {}'.format(df_ans_m1.shape))
      permno          company_name       date  total_returns  price  \
2908   10137  ALLEGHENY ENERGY INC 2010-07-31       0.102514  22.80   
2909   10137  ALLEGHENY ENERGY INC 2010-08-31      -0.010965  22.55   
2910   10137  ALLEGHENY ENERGY INC 2010-09-30       0.094013  24.52   
2911   10137  ALLEGHENY ENERGY INC 2010-10-31      -0.053834  23.20   
2912   10137  ALLEGHENY ENERGY INC 2011-01-31       0.063531  25.78   

      share_outstanding  market_cap  market_cap_lag6months  \
2908             169579  3866401.20             3552344.85   
2909             169615  3824818.25             3840760.50   
2910             169615  4158959.80             3900110.00   
2911             169615  3935068.00             3693234.60   
2912             169939  4381027.42             3866401.20   

      market_cap_lag12months  market_cap_lag60months  
2908              4270548.79              4632333.00  
2909              4476046.03              4908238.40  
2910              4494689.16              5000048.64  
2911              3867602.06              4599654.12  
2912              3552344.85              5665655.87  
DataFrame Shape: (2913, 10)
# Method 2 of Solving Question 6

lag_freqs = [6, 12, 60]                                 # List of lag frequencies in months
col_to_groupby = 'permno'
time_freq = 'M'                                         # Time Frequency of Data is Monthly

df['market_cap'] = df['price']*df['share_outstanding']
df.index.name = 'original_index'                        # Setting an index name for ease of reference
df_permno_groups = df.groupby(col_to_groupby)
df_ans_m2 = pd.DataFrame()                              # DataFrame to collate answers

for name, df_group in df_permno_groups:
    df_group.reset_index(inplace=True)
    df_group.set_index('date', drop=False, inplace=True)

    for lag in lag_freqs:
        df_group['market_cap_lag'+str(lag)+'months'] = df_group['market_cap'].shift(periods=lag, freq=time_freq)

    df_group.fillna(np.nan, inplace=True) 
    df_group.set_index(df.index.name, inplace=True)
    df_ans_m2 = pd.concat([df_ans_m2, df_group]) 

df_ans_m2.index.name = None
df.index.name = None
display(df_ans_m2.tail())
print('\nDataFrame Shape: {}'.format(df_ans_m2.shape))
      permno          company_name       date  total_returns  price  \
2908   10137  ALLEGHENY ENERGY INC 2010-07-31       0.102514  22.80   
2909   10137  ALLEGHENY ENERGY INC 2010-08-31      -0.010965  22.55   
2910   10137  ALLEGHENY ENERGY INC 2010-09-30       0.094013  24.52   
2911   10137  ALLEGHENY ENERGY INC 2010-10-31      -0.053834  23.20   
2912   10137  ALLEGHENY ENERGY INC 2011-01-31       0.063531  25.78   

      share_outstanding  market_cap  market_cap_lag6months  \
2908             169579  3866401.20             3552344.85   
2909             169615  3824818.25             3840760.50   
2910             169615  4158959.80             3900110.00   
2911             169615  3935068.00             3693234.60   
2912             169939  4381027.42             3866401.20   

      market_cap_lag12months  market_cap_lag60months  
2908              4270548.79              4632333.00  
2909              4476046.03              4908238.40  
2910              4494689.16              5000048.64  
2911              3867602.06              4599654.12  
2912              3552344.85              5665655.87  
DataFrame Shape: (2913, 10)
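A quick sanity check against the example quoted in the question (PERMNO 10001 on 30-Jun-2016 should carry the 31-Dec-2015 market cap of 7.45 ⋅ 10505 = 78262.25); this is only a sketch on the Method 1 output and assumes that row survived the NaN-dropping in Answer 2:

# Verify the 6-month lag for the example quoted in the question (sketch)
check = df_ans_m1.loc[(df_ans_m1['permno'] == 10001) &
                      (df_ans_m1['date'] == '2016-06-30'),
                      ['permno', 'date', 'market_cap_lag6months']]
print(check)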


Q7. Resampling Frequency - Convert the monthly dataframe that you loaded above into quarterly data and annual data. Hint: First, load and clean the data as per the procedure mentioned above. Second, create a new column containing the quarter corresponding to the given month (e.g., 30-Nov-2010 is in quarter 4). Lastly, create a new dataframe "df_quarter" by aggregating data using the groupby command. For the quarterly dataframe your columns will be {permno, date, quarter, total_returns, market_cap}. Note that you will need to add the monthly returns to generate quarterly returns. Also note that you will need the quarter-end market cap for each PERMNO each quarter. For example: In the quarterly dataframe, PERMNO 10001 as of 31-Mar-2010 will have total returns of 0.001055 (-0.018932 - 0.000656 + 0.020643) and a market cap of 44351.37 (10.17 ⋅ 4361). Perform the same procedure for creating data at an annual frequency.

Answer 7:

# Quarterly Data Conversion

df_quarterly = pd.merge(left=df.drop(['total_returns'], axis=1), 
                        right=df.groupby('permno')
                                .resample('Q', on='date')['total_returns']
                                .sum().reset_index(), 
                        how='right', on=['permno','date']).reset_index(drop=True)

display(df_quarterly.head())
print('\nDataFrame Shape: {}'.format(df_quarterly.shape))
   permno     company_name       date  price  share_outstanding  market_cap  \
0   10001       ENERGY INC 2009-12-31  10.30             4361.0    44918.30   
1   10001       ENERGY INC 2010-03-31  10.17             4361.0    44351.37   
2   10001       ENERGY INC 2010-06-30  10.86             6080.0    66028.80   
3   10001  GAS NATURAL INC 2010-09-30  11.12             6073.0    67531.76   
4   10001  GAS NATURAL INC 2010-12-31  10.52             7834.0    82413.68   

   total_returns  
0       0.162621  
1       0.001055  
2       0.085793  
3       0.048630  
4      -0.033330  
DataFrame Shape: (1037, 7)
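The question also asks for an explicit quarter column in the quarterly frame; it can be derived directly from the resampled dates (a one-line sketch on the df_quarterly built above):

# Add the quarter number implied by each quarter-end date (sketch)
df_quarterly['quarter'] = df_quarterly['date'].dt.quarter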
# Yearly Data Conversion

df_yearly = pd.merge(left=df.drop(['total_returns'], axis=1), 
                     right=df.groupby('permno')
                             .resample('Y', on='date')['total_returns']
                             .sum().reset_index(), 
                     how='right', on=['permno','date']).reset_index(drop=True)

display(df_yearly.head())
print('\nDataFrame Shape: {}'.format(df_yearly.shape))
   permno     company_name       date  price  share_outstanding  market_cap  \
0   10001       ENERGY INC 2009-12-31  10.30             4361.0    44918.30   
1   10001  GAS NATURAL INC 2010-12-31  10.52             7834.0    82413.68   
2   10001  GAS NATURAL INC 2011-12-31  11.42             8154.0    93118.68   
3   10001  GAS NATURAL INC 2012-12-31   9.33             8157.0    76104.81   
4   10001  GAS NATURAL INC 2013-12-31   8.03            10452.0    83929.56   

   total_returns  
0       0.162621  
1       0.102148  
2       0.136491  
3      -0.143090  
4      -0.080232  
DataFrame Shape: (277, 7)




The End.