Data Preprocessing
Step 1: Observe the source data - Load the source file Hotel_Reviews.csv into a pandas DataFrame and inspect it
- The data contains 515,738 rows × 17 columns
- The 17 columns and their corresponding data types were reviewed
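A minimal sketch of this loading/inspection step, assuming the CSV sits in the working directory:
# load the source file into a DataFrame
import pandas as pd
hotel_df = pd.read_csv('Hotel_Reviews.csv')
# inspect the shape and the column data types
print(hotel_df.shape)
print(hotel_df.dtypes)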


Step 2: Since the table had no unique column, we used the pandas DataFrame index to create a unique column 'Review_Id' for working with the data in PostgreSQL
hotel_df.index.name='Review_Id'
Step 3: Remove null values
Step 3.1: Check for null/na values
hotel_df.isnull().sum()
Step 3.2: Drop rows that contain na/null values and reset the index to start from 0
hotel_df = hotel_df.dropna().reset_index(drop=True)
Step 3.3: The operation above removes the index name, so we set it to 'Review_Id' again
hotel_df.index.name='Review_Id'
Step 3.4: Check for null again
hotel_df.isnull().sum()
Step 4: We found that, for better project efficiency, the 17 columns could be divided into two groups: a 'review group' with information pertaining to reviews only, and a 'location group' with information pertaining to location only.
Breaking the big table into 2 tables: the 1st with hotel reviews, the 2nd with hotel locations
Step 4.1: Create Table 1: Hotel reviews only - no location data (drop the columns 'lat' and 'lng')
# hotel reviews without lat and lng
hotel_reviews = hotel_df.drop(columns=['lat','lng'])
#view the table
hotel_reviews

Step 4.2: Create Table 2: DF with locations (no reviews)
# creating a subset of the dataframe with unique values in the column 'Hotel_Address', then resetting the index to start from 0
unique_address_df = hotel_df.drop_duplicates(subset=['Hotel_Address']).reset_index(drop=True)
# view the table
unique_address_df

Step 4.3: Select only the 4 columns we need - "Hotel_Address", "Hotel_Name", "lat", "lng"
# selecting only the 4 required columns
hotel_location = unique_address_df[["Hotel_Address", "Hotel_Name", "lat", "lng"]]
#view the table
hotel_location

Step 5: Create sample tables with 1000 rows
Step 5.1: Table 2: location table - selecting the first 1000 rows
# creating a sample of table 2 (locations), with 1000 rows
sample_of_hotel_location = hotel_location.loc[0:999]
# view the table
sample_of_hotel_location

Step 5.2: Table 1: review table - selecting the first 1000 rows
# creating a sample of table 1 (reviews), with 1000 rows
sample_of_hotel_reviews = hotel_reviews.loc[0:999]
# view the table
sample_of_hotel_reviews

Step 6: Extract city and country from lat and lng with geopy
Step 6.1: Create a working copy of the location table and add 2 new columns for City and Country
# working copy (here assumed to be the 1000-row location sample from Step 5)
hotel_location_city_country = sample_of_hotel_location.copy()
# creating 2 new columns
hotel_location_city_country["City"] = ""
hotel_location_city_country["Country"] = ""
# view the df
hotel_location_city_country

Step 6.2: Install GeoPy
# install the module geopy
! pip install geopy
# import module
from geopy.geocoders import Nominatim
# initialize Nominatim API
geolocator = Nominatim(user_agent="geoapiExercises")
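Before looping over every row, it can help to sanity-check the geocoder on a single coordinate pair (the values below are illustrative, not taken from the table):
# quick check on one coordinate pair
location = geolocator.reverse("52.3606, 4.9159")
print(location.raw['address'].get('city', ''), location.raw['address'].get('country', ''))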
Step 6.3: Create a for loop to iterate through the rows, use the coordinates to pull the city and country, and store them in the 'City' and 'Country' columns
# declare a counter to address rows via .at
i = 0
# loop for extracting city and country from the data using GeoPy
for row in hotel_location_city_country.itertuples():
    latitude = str(row.lat)
    longitude = str(row.lng)
    location = geolocator.reverse(latitude + "," + longitude)
    address = location.raw['address']
    # traverse the returned address fields
    city = address.get('city', '')
    state = address.get('state', '')
    country = address.get('country', '')
    code = address.get('country_code', '')
    zipcode = address.get('postcode', '')
    print(f"Hotel {i+1}: Co-ordinates ({latitude}, {longitude}) | City: {city}, {country}")
    hotel_location_city_country.at[i, 'City'] = city
    hotel_location_city_country.at[i, 'Country'] = country
    i += 1
# view the df
hotel_location_city_country
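Nominatim's public service expects at most about one request per second, so long loops can be throttled or blocked. geopy ships a RateLimiter wrapper that spaces out the calls; a minimal sketch of the same extraction using it (same DataFrame and geolocator as above):
# wrap the reverse geocoder so calls are spaced at least 1 second apart
from geopy.extra.rate_limiter import RateLimiter
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)
# build "lat,lng" query strings and reverse-geocode them row by row
coords = hotel_location_city_country['lat'].astype(str) + "," + hotel_location_city_country['lng'].astype(str)
locations = coords.apply(reverse)
hotel_location_city_country['City'] = locations.apply(lambda loc: loc.raw['address'].get('city', ''))
hotel_location_city_country['Country'] = locations.apply(lambda loc: loc.raw['address'].get('country', ''))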

Next Step
At the end of each .ipynb file, the new tables were stored in Zip files.
The next step is storing the tables in the database.
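A minimal sketch of both steps, assuming pandas with zip-compression support and a local PostgreSQL database (file names, table names, and connection details below are placeholders):
# save each table as a zipped CSV
hotel_reviews.to_csv('hotel_reviews.zip', compression={'method': 'zip', 'archive_name': 'hotel_reviews.csv'})
hotel_location.to_csv('hotel_location.zip', compression={'method': 'zip', 'archive_name': 'hotel_location.csv'})
# load the tables into PostgreSQL via SQLAlchemy
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:password@localhost:5432/hotel_db')
hotel_reviews.to_sql('hotel_reviews', engine, if_exists='replace')
hotel_location.to_sql('hotel_location', engine, if_exists='replace')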