Automatidata Data Analysis Project Proposal
- sherry salek
- Apr 10, 2023
Updated: Apr 10, 2023
Google Advanced Data Analytics Professional Certificate, End of Course 1 Project
Scenario
About Automatidata:
Automatidata is a data consulting firm that specializes in helping businesses make the most of their data. They work with clients to transform raw data into actionable insights and solutions, such as performance dashboards, customer-facing tools, and strategic business recommendations. Their approach involves identifying the client's business needs and using data analytics to meet those needs, ultimately helping their clients make better decisions and achieve their goals.
Overall, Automatidata's mission is to help businesses turn their data into a strategic asset, driving growth and success in today's digital age.
Automatidata hired me as the newest member of their data analytics team.
Scenario Background:
As a data consulting firm, Automatidata has been tasked with developing a regression model for the New York City Taxi and Limousine Commission (TLC) to predict ride duration based on location and time of day data. The purpose of this project is to help the TLC better understand and manage taxi and limousine services in New York City.
The TLC data comes from over 200,000 taxi and limousine licensees, making approximately one million combined trips per day.
Note: This project's dataset was created for pedagogical purposes and may not be indicative of New York City taxi cab riders' behavior.
Project Tasks:
The project team will need to complete several tasks to achieve this goal, including:
Developing a global-level project document: The data team will need to work with the Senior Project Manager to develop a project document that outlines the goals and milestones of the project.
Inspecting the TLC dataset: The Senior Data Analyst pointed out that the TLC dataset needs to be inspected before any analysis can begin. The data team will need to perform exploratory data analysis (EDA) to determine what information the dataset provides.
Developing a regression model: The main focus of the project is to develop a regression model that can provide insights to the TLC. The Director of Data Analysis mentioned that the team will need to determine whether or not the model meets the project requirements before presenting any insights to the TLC.
Developing visuals: The Operations Manager at the TLC requested visuals to share with the TLC's executives. The data team will need to develop these visuals.
Establishing relationships between variables: The Director of Data Analysis suggested that the data team consider A/B testing to establish the relationship between variables within the TLC dataset.
Presenting insights to the TLC: Once the final model is developed, the data team will need to determine the main talking points going into the presentation with the TLC.
Based on these tasks, the data team will need to develop a PACE Strategy Document that outlines the project's overall goal, key milestones, and specific tasks that need to be completed. The proposal should also include a timeline for completing each task, as well as the tools and techniques the data team will use to complete them. Additionally, the proposal should identify potential risks and mitigation strategies for each task to ensure that the project stays on track.
As for the audience, the data team will need to communicate with several key stakeholders. Each stakeholder has their own concerns and priorities, so the data team will need to tailor its communications to each stakeholder's needs.
Business Problem:
The TLC regulates and licenses taxi cabs and for-hire vehicles in New York City. To effectively manage and regulate these services, the TLC needs to understand how long rides are taking and identify areas where ride duration can be improved. The current process of manually collecting ride duration data is time-consuming and expensive. Therefore, the TLC has partnered with Automatidata to develop a regression model that can predict ride duration based on location and time of day data.
Questions and Considerations
1. Who is your audience for this project?
The audience for this project is the New York City Taxi and Limousine Commission (TLC), specifically the head of the finance and administration department and the operations manager.
2. What are you trying to solve or accomplish, and what do you anticipate the impact of this work will be on the larger needs of the client?
The primary goal of this project is to develop a regression model that accurately predicts ride duration based on location and time of day data. By doing so, we aim to help the New York City Taxi and Limousine Commission (TLC) to better understand and manage taxi and limousine services in the city. The manual collection of ride duration data is time-consuming and expensive, so the regression model will provide an efficient and cost-effective solution for the TLC.
The impact of this work is significant, as it will enable the TLC to identify areas where ride duration can be improved and optimize their services accordingly. This can lead to increased customer satisfaction, reduced wait times for rides, and more efficient use of resources for the TLC. Additionally, the model can be used for other related purposes, such as estimating trip fares and improving traffic flow. Overall, our goal is to provide the TLC with a valuable tool that can enhance their ability to regulate and manage taxi and limousine services in New York City.
3. What questions need to be asked or answered?
What is the condition of the provided dataset?
What steps can I take to reduce the impact of bias?
What variables should be included in the regression model?
How should the variables be transformed, if needed, to improve model performance?
What type of regression model should be used? (e.g., linear regression, polynomial regression, etc.)
How should the model be evaluated and validated?
What is the expected level of prediction accuracy, and how can this be improved?
How can the model be optimized for real-time prediction in a production environment?
What other insights or patterns can be identified from the data that may be useful for the TLC's decision-making process?
How can the model be integrated into the TLC's existing systems and processes?
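Several of the questions above (the condition of the dataset, how variables might be transformed, and what a simple regression looks like) can be explored with a few lines of Python. The following is only a minimal sketch under stated assumptions: the records are invented for illustration and are not taken from the TLC dataset, the field names pickup and dropoff are hypothetical, and the rush-hour windows are an assumed way of turning time of day into a feature. It counts missing values, derives ride duration in minutes, and fits a one-variable least-squares line.

```python
from datetime import datetime

# Invented sample records for illustration only; they are NOT from the TLC
# dataset, and the field names "pickup" and "dropoff" are assumptions.
trips = [
    {"pickup": "2017-03-25 08:00:00", "dropoff": "2017-03-25 08:16:00"},
    {"pickup": "2017-03-25 12:30:00", "dropoff": "2017-03-25 12:40:00"},
    {"pickup": "2017-03-25 17:15:00", "dropoff": "2017-03-25 17:33:00"},
    {"pickup": "2017-03-25 22:00:00", "dropoff": "2017-03-25 22:08:00"},
    {"pickup": "2017-03-25 23:05:00", "dropoff": None},  # incomplete record
]

FMT = "%Y-%m-%d %H:%M:%S"


def count_missing(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts


def duration_minutes(row):
    """Derive the target variable: ride duration in minutes."""
    start = datetime.strptime(row["pickup"], FMT)
    end = datetime.strptime(row["dropoff"], FMT)
    return (end - start).total_seconds() / 60


def is_rush_hour(row):
    """Time-of-day feature: 1 if pickup falls in an assumed rush-hour window."""
    hour = datetime.strptime(row["pickup"], FMT).hour
    return 1 if 7 <= hour <= 9 or 16 <= hour <= 18 else 0


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


missing = count_missing(trips)

# Keep only complete records before deriving features.
complete = [r for r in trips if all(v is not None for v in r.values())]
durations = [duration_minutes(r) for r in complete]
rush_flags = [is_rush_hour(r) for r in complete]

slope, intercept = fit_line(rush_flags, durations)
print("Missing values per field:", missing)
print(f"Estimated duration = {intercept:.1f} + {slope:.1f} * is_rush_hour (minutes)")
```

With a single binary predictor, the fitted slope is simply the difference between the average rush-hour and non-rush-hour durations in the sample. A real model for this project would likely need more predictors, including location, and a proper train/test split.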
4. What resources are required to complete this project?
Data: A large dataset of taxi and limousine rides in New York City, including location, time, and ride duration.
Python Notebook
Time and budget: Enough time and budget to complete each stage of the project.
Input from stakeholders
5. What are the deliverables that will need to be created over the course of this project?
Data Exploration Report: This report will detail the data cleaning and preprocessing steps taken to prepare the dataset for analysis. It will also provide a detailed analysis of the dataset to gain insights into the data distribution, identify missing values, and identify any correlations among the variables.
Model Selection Report: This report will detail the various machine learning models that were evaluated. It will explain the strengths and weaknesses of each model and describe the process for selecting the most appropriate model for the task.
Regression Model Report: This report will detail the development of the regression model, including the feature engineering and selection process.
Prediction Results Report: This report will summarize the final regression model's predictions on the test data, including performance metrics, and provide an interpretation of the model's results.
Presentation of Findings: A presentation of findings will be created to summarize the key insights and results of the project. This presentation will include visualizations and graphs to illustrate the model's performance and any other significant findings.
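The performance metrics mentioned for the Prediction Results Report are not specified in this proposal. As one hedged possibility, the sketch below computes three metrics commonly reported for regression models: mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). The function name regression_metrics and the sample values are hypothetical and chosen only for illustration; they are not project results.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, and R^2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n          # average absolute error
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # penalizes large errors more

    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot                       # share of variance explained

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Invented ride durations in minutes, for illustration only.
actual_minutes = [10.0, 20.0, 30.0, 40.0]
predicted_minutes = [12.0, 18.0, 33.0, 37.0]

metrics = regression_metrics(actual_minutes, predicted_minutes)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MAE and RMSE are expressed in the same unit as the target (minutes here), which tends to make them easier to explain to non-technical stakeholders than R².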