Thursday, November 23, 2023

Simple Linear Regression - Python Code

To illustrate simple linear regression in Python, we can use synthetic data. Let's create a small dataset that simulates the relationship between engine size (in liters) and fuel efficiency (in miles per gallon) for a set of cars. We'll use the scikit-learn library for the regression analysis and matplotlib for plotting.

Here's a step-by-step guide along with the Python code:

  1. Generate Synthetic Data: Create a dataset of engine sizes and corresponding fuel efficiencies.
  2. Create a Linear Regression Model: Use scikit-learn to fit a linear regression model.
  3. Predict and Plot: Predict fuel efficiency for a range of engine sizes and plot the results.
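
A minimal sketch of these three steps might look like the following; the engine sizes and fuel-efficiency values are synthetic, and the coefficients used to generate them are purely illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# 1. Generate synthetic data: engine size (liters) vs. fuel efficiency (mpg).
#    Larger engines tend to get fewer miles per gallon, plus some noise.
rng = np.random.default_rng(42)
engine_size = rng.uniform(1.0, 6.0, size=30).reshape(-1, 1)                    # liters
fuel_efficiency = 40 - 4.5 * engine_size.ravel() + rng.normal(0, 2, size=30)   # mpg

# 2. Create and fit a linear regression model.
model = LinearRegression()
model.fit(engine_size, fuel_efficiency)
print(f"Intercept: {model.intercept_:.2f}, Slope: {model.coef_[0]:.2f}")

# 3. Predict fuel efficiency over a range of engine sizes and plot the results.
x_range = np.linspace(1.0, 6.0, 100).reshape(-1, 1)
y_pred = model.predict(x_range)

plt.scatter(engine_size, fuel_efficiency, label="Synthetic data")
plt.plot(x_range, y_pred, color="red", label="Fitted line")
plt.xlabel("Engine size (liters)")
plt.ylabel("Fuel efficiency (mpg)")
plt.legend()
plt.show()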

Wednesday, November 22, 2023

Logistic Regression Concepts

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible outcomes). Here are some key concepts and methodologies involved in logistic regression:
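
As a concrete illustration, fitting a logistic regression with scikit-learn on a small dataset with a dichotomous outcome might look like the sketch below; the hours-studied values and pass/fail labels are synthetic, illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic example: hours studied (independent variable) vs. pass/fail (dichotomous outcome).
hours_studied = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])  # 0 = fail, 1 = pass

model = LogisticRegression()
model.fit(hours_studied, passed)

# Predicted probability of passing for a student who studied 2.75 hours.
prob = model.predict_proba([[2.75]])[0, 1]
print(f"P(pass | 2.75 hours) = {prob:.2f}")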


Monday, November 20, 2023

Linear Regression Concepts

Linear regression is a foundational technique in statistics and machine learning used to model the relationship between a dependent variable and one or more independent variables. Here's a breakdown of its key concepts:
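
One of those key concepts, the ordinary least-squares fit of a straight line y = b0 + b1*x, can be illustrated with a short NumPy sketch; the data points below are made-up assumptions.

import numpy as np

# Illustrative data: independent variable x and dependent variable y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Ordinary least squares estimates of slope (b1) and intercept (b0):
#   b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"Fitted line: y = {b0:.2f} + {b1:.2f} * x")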


Neural Networks and Deep Learning Concepts

Neural networks and deep learning are key concepts in the field of artificial intelligence and machine learning. Here's a brief overview:
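
As a minimal illustration of the core idea of layered computation, the sketch below runs a single forward pass through a tiny two-layer network in NumPy; the weights are random and purely illustrative (a real network would learn them from data).

import numpy as np

def relu(z):
    # Rectified linear unit: a common hidden-layer activation.
    return np.maximum(0, z)

def sigmoid(z):
    # Squashes the output into (0, 1), useful for binary classification.
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))           # one input example with 4 features
W1 = rng.normal(size=(3, 4))        # hidden layer: 3 neurons, each seeing 4 inputs
b1 = np.zeros(3)
W2 = rng.normal(size=(1, 3))        # output layer: 1 neuron
b2 = np.zeros(1)

hidden = relu(W1 @ x + b1)          # first layer of learned features
output = sigmoid(W2 @ hidden + b2)  # the network's prediction
print(output)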


Machine Learning Primer

Machine Learning (ML) is a field of artificial intelligence that focuses on building systems that can learn from and make decisions based on data. Here's a brief overview of some basics:


Saturday, November 19, 2022

Data Cleaning using OpenRefine and SQLite - NYPL menus data


Data quality is an important characteristic that determines the reliability of decision making: insights and analysis are only as good as the data behind them, and garbage data leads to garbage analysis. Data cleansing is one of the most important steps in ensuring data quality. It repairs or removes incorrect, corrupt, malformed, duplicate, or incomplete records in a dataset; combining multiple data sources in particular increases the chances of data being duplicated or mislabeled. This project aims to perform end-to-end data cleansing using various tools and techniques, and the suitability of the cleaned data is validated using use cases.

The data comes from NYPL's crowdsourcing project "What's on the menu," in which the New York Public Library is digitizing and transcribing its collection of historical menus. The collection includes about 45,000 menus from the 1840s to the present, and the goal of the digitization project is to transcribe each page of each menu, creating an enormous database of dishes, prices, locations, and so on. Several fields are empty or null for certain items, and much of the information for dishes and menus is still being normalized and cleaned. Hence this dataset presents plenty of data-cleansing opportunities and challenges.
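
To give a sense of what that cleansing can involve, here is a small pandas sketch of typical checks and fixes; the file name Dish.csv and the columns name, lowest_price, and highest_price are assumptions about the public data dump and would be adapted to the actual files.

import pandas as pd

# Assumed file and column names from the "What's on the menu" data dump.
dishes = pd.read_csv("Dish.csv")

# Profile the data quality issues first: empty/null fields and duplicates.
print(dishes.isnull().sum())
print("Duplicate dish names:", dishes["name"].duplicated().sum())

# Basic cleansing steps:
dishes["name"] = dishes["name"].str.strip().str.lower()   # normalize whitespace and case
dishes = dishes.dropna(subset=["name"])                   # drop rows with no dish name
dishes = dishes.drop_duplicates(subset=["name"])          # collapse duplicate dishes

# Keep only rows whose price range is well formed (lowest <= highest).
dishes = dishes[dishes["lowest_price"] <= dishes["highest_price"]]

dishes.to_csv("Dish_cleaned.csv", index=False)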

Predicting the house sale price using Multiple Linear Regression

Home buyers always worry about overpaying for the house of their choice. It is natural for buyers to seek a second opinion on the price or to make sure that it falls within a reasonable range for the locality. Potential sellers, on the other hand, want to understand which features (or amenities) influence the house price the most. This study takes historic house sale prices in King County from 2014 to 2015 and builds a regression model that helps users understand which features affect the price the most and predicts the price with reasonable accuracy and confidence.

Our project will incorporate the following topics:

  • Variable manipulation
  • Multiple linear regression
  • Residual diagnostics
  • Data analysis and interpretation
  • Model building
  • Model evaluation
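
As a rough sketch of the modeling step with scikit-learn, the example below assumes a file kc_house_data.csv and feature columns such as sqft_living, bedrooms, bathrooms, floors, condition, and grade; the actual file and column names depend on the King County dataset used.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Assumed file and column names for the King County house sales data.
houses = pd.read_csv("kc_house_data.csv")
features = ["sqft_living", "bedrooms", "bathrooms", "floors", "condition", "grade"]
X = houses[features]
y = houses["price"]

# Hold out part of the data to evaluate the model on unseen sales.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Which features influence the predicted price the most (per-unit effect)?
print(dict(zip(features, model.coef_.round(2))))

# How well does the model predict prices it has not seen?
y_pred = model.predict(X_test)
print("R-squared on test set:", round(r2_score(y_test, y_pred), 3))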