Saturday, November 19, 2022

Data Cleaning using OpenRefine and SQLite - NYPL Menus Data

Data quality is a key determinant of the reliability of decision making: insights and analysis are only as good as the underlying data, and garbage data leads to garbage analysis. Data cleansing is one of the most important steps in ensuring data quality. It repairs or removes incorrect, corrupt, malformed, duplicate, or incomplete records in a dataset, a need that grows when multiple data sources are combined, since combining sources increases the chances of records being duplicated or mislabeled. This project performs end-to-end data cleansing using various tools and techniques, and the suitability of the cleaned data is validated against concrete use cases.
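As a minimal sketch of the kinds of repairs described above, the steps below remove incomplete rows, normalize malformed values, and drop exact duplicates with Python's sqlite3. The `dish` table and its sample rows are hypothetical, not the actual NYPL schema:

```python
import sqlite3

# Build a tiny in-memory table with typical quality problems.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dish (id INTEGER, name TEXT, price TEXT)")
cur.executemany(
    "INSERT INTO dish VALUES (?, ?, ?)",
    [
        (1, "Coffee", "0.10"),
        (1, "Coffee", "0.10"),        # exact duplicate
        (2, "  Roast Beef ", "0.75"), # untrimmed whitespace
        (3, None, "0.25"),            # incomplete: missing name
    ],
)

# Remove incomplete rows (missing or blank names).
cur.execute("DELETE FROM dish WHERE name IS NULL OR TRIM(name) = ''")
# Repair malformed values by trimming stray whitespace.
cur.execute("UPDATE dish SET name = TRIM(name)")
# Drop exact duplicates by rebuilding the table from distinct rows.
cur.execute("CREATE TABLE dish_clean AS SELECT DISTINCT * FROM dish")

rows = cur.execute(
    "SELECT id, name, price FROM dish_clean ORDER BY id"
).fetchall()
print(rows)  # → [(1, 'Coffee', '0.10'), (2, 'Roast Beef', '0.75')]
```

The same trim/deduplicate operations are available interactively in OpenRefine (common transforms and clustering); the SQL form is handy when the cleaning must be scripted and repeatable.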

This project uses data from the NYPL's crowdsourcing project "What's on the Menu?", in which the New York Public Library is digitizing and transcribing its collection of historical menus. The collection includes about 45,000 menus from the 1840s to the present, and the goal of the digitization project is to transcribe each page of each menu, creating an enormous database of dishes, prices, locations, and so on. Several fields are empty or null for certain items, and much of the information for dishes and menus is still being normalized and cleaned. This dataset therefore presents many data-cleansing opportunities and challenges.


  1. Dataset: NYPL Menu
  2. OpenRefine install: OpenRefine
  3. Python SQLite: sqlite3

