UNIT-1(Overview of Python & Data Structure)
PREPARED BY:
Rachita Mohanty
Assistant Professor
Computer Engineering Department
SAFFRONY INSTITUTE OF TECHNOLOGY
Topics to be Covered
 Core Competencies of a Data Scientist
 Linking Data Science, Big Data, and AI
 Role of Programming
 Data Science Pipeline
 Python's Role in Data Science
 Shifting Profile of Data Scientists
 Working with Python
 Loading Data, Training a Model, Viewing Results
 Using the Python Ecosystem for Data Science
Core Competencies of a Data Scientist:
A data scientist needs a combination of skills from various
domains:
 Statistical Analysis: Proficiency in statistical methods to
analyze and interpret data.
 Programming: Strong coding skills, often in languages like
Python or R.
 Machine Learning: Understanding of machine learning
algorithms and techniques.
 Data Manipulation: Ability to clean, pre-process, and
transform data.
 Domain Knowledge: Familiarity with the specific industry or
field the data pertains to.
 Data Visualization: Skill in creating meaningful
visualizations to communicate insights.
 Communication: Ability to explain complex findings to both
technical and non-technical stakeholders.
Linking Data Science, Big Data, and AI:
 Data Science: The overall field that involves
extracting insights and knowledge from data.
 Big Data: Handling and analyzing large volumes of
data that traditional methods cannot manage.
 Artificial Intelligence (AI): Enabling machines to
simulate human intelligence, often using data and
algorithms.
Differences between Big Data and Data Science:

Data Science
•Data Science is an area.
•It is about the collection, processing, analyzing, and utilizing of data in various operations. It is more conceptual.
•It is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics.
•The goal is to build data-dominant products for a venture.
•Tools mainly used in Data Science include SAS, R, Python, etc.
•It is a superset of Big Data, as data science consists of data scraping, cleaning, visualization, statistics, and many more techniques.
•It is mainly used for scientific purposes.
•It broadly focuses on the science of the data.

Big Data
•Big Data is a technique to collect, maintain, and process huge amounts of information.
•It is about extracting vital and valuable information from a huge amount of data.
•It is a technique for tracking and discovering trends in complex data sets.
•The goal is to make data more vital and usable, i.e., by extracting only the important information from the huge data within existing traditional aspects.
•Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
•It is a subset of Data Science, as mining activities form part of the data science pipeline.
•It is mainly used for business purposes and customer satisfaction.
•It is more involved with the processes of handling voluminous data.
Role of Programming:
 Programming is crucial for a data scientist as it enables:
 Data manipulation and cleaning.
 Implementing machine learning algorithms.
 Building and deploying models.
 Automation of tasks.
 Creating data visualizations.
Data Science Pipeline:
 Preparing the Data: Cleaning, transforming, and pre-processing raw data.
 Exploratory Data Analysis (EDA): Understanding data characteristics through visualization and summary statistics.
 Learning from Data: Applying machine learning algorithms to train models.
 Visualizing and Obtaining Insights: Creating visualizations to interpret and communicate findings.
 Data Products: Developing tools, dashboards, or applications that provide insights to end-users.
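
The first two stages can be sketched with pandas (the file name sales.csv and its columns are hypothetical; the later stages are illustrated under "Loading Data, Training a Model, Viewing Results"):

import matplotlib.pyplot as plt
import pandas as pd

# Preparing the Data: load a hypothetical raw CSV and clean it
df = pd.read_csv('sales.csv')       # 'sales.csv' is an assumed file for illustration
df = df.drop_duplicates().dropna()  # remove duplicate and missing rows

# Exploratory Data Analysis: summary statistics and quick plots
print(df.describe())     # count, mean, std, min, quartiles, max per numeric column
df.hist(figsize=(8, 6))  # histogram of every numeric column
plt.show()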
Python's Role in Data Science:
 Python is widely used in data science due to its
simplicity, versatility, and extensive libraries like
NumPy, pandas, scikit-learn, and Matplotlib. It
provides tools for data manipulation, analysis,
visualization, and machine learning.
Shifting Profile of Data Scientists:
 As the field evolves, data scientists are expected to
have knowledge of more advanced techniques, such as
deep learning, natural language processing, and AI
ethics. They also need to collaborate with domain
experts and effectively communicate results.
Working with Python:
 Python's simplicity and libraries make it suitable for
data science tasks. You can quickly learn Python using
online tutorials, courses, and resources like the official
Python documentation.
Loading Data, Training a Model,
Viewing Results:
 Loading Data: Use libraries like pandas to read and
manipulate data from various sources.
 Training a Model: Employ libraries like scikit-learn to
train machine learning models on the data.
 Viewing Results: Visualize model performance and
insights using Matplotlib or libraries tailored to
specific types of visualizations.
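
A minimal end-to-end sketch of these three steps, using the Iris dataset bundled with scikit-learn (so no external file is needed; the plotted columns are an arbitrary choice):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Loading data: Iris ships with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training a model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Viewing results: accuracy plus a simple scatter plot of the predictions
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Predicted classes')
plt.show()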
Libraries to perform specific data science tasks in Python
 The following are the libraries we are going to use in this subject:
 Performing fundamental scientific computing using NumPy
 Performing data analysis using pandas
 Plotting the data using matplotlib
 Accessing scientific tools using SciPy
 Implementing machine learning using Scikit-learn
 Going for deep learning with Keras and TensorFlow
 Creating graphs with NetworkX
 Parsing HTML documents using Beautiful Soup
Key Features of NumPy:
 Multidimensional Arrays:
 NumPy introduces the ndarray, which is a multi-dimensional
array object.
 These arrays can be 1-dimensional (vectors), 2-dimensional
(matrices), or even higher-dimensional.
 Efficient Numerical Operations:
 NumPy arrays are more memory-efficient and faster for
numerical computations compared to regular Python lists.
 This is due to NumPy's underlying implementation in C and
its optimization for numerical tasks.
 Broadcasting:
 NumPy allows element-wise operations on arrays of different
shapes and dimensions through broadcasting.
 This simplifies operations and avoids explicit loops.
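
For instance, broadcasting lets a column vector and a row vector combine into a full matrix without an explicit loop (a minimal sketch):

import numpy as np

col = np.array([[1], [2], [3]])  # shape (3, 1)
row = np.array([10, 20, 30])     # shape (3,)

# Both operands are stretched to shape (3, 3) before the addition
print(col + row)
# [[11 21 31]
#  [12 22 32]
#  [13 23 33]]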
Key Features of NumPy:
 Mathematical Functions:
 NumPy provides a wide range of mathematical functions for basic
arithmetic, linear algebra, trigonometry, statistics, and more.
 Array Indexing and Slicing:
 NumPy supports advanced indexing and slicing operations on
arrays, making it easy to access and manipulate specific parts of the
data.
 Universal Functions (ufuncs):
 These are functions that operate element-wise on arrays and
support broadcasting.
 Examples include addition, subtraction, exponentiation, etc.
 Integration with other Libraries:
 NumPy integrates well with other libraries used in data science,
such as pandas (for data manipulation) and Matplotlib (for
visualization).
Mathematical Functions:

import numpy as np

# Create NumPy arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise operations
result = a + b
print(result)  # [5 7 9]
Array Indexing and Slicing using NumPy

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
element = arr[1, 2]  # Accesses the element at row 1, column 2 (value: 6)
import numpy as np

# Creating an array
arr = np.array([1, 2, 3, 4, 5])

# Performing operations
result = arr * 2
print(result)  # [ 2  4  6  8 10]

# Calculating mean and standard deviation
mean_value = np.mean(arr)
std_dev = np.std(arr)
print("Mean:", mean_value)
print("Standard Deviation:", std_dev)

import numpy as np

# Creating a 1-dimensional array
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)  # [1 2 3 4 5]

# Creating a 2-dimensional array (matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# Basic arithmetic operations
result = arr1 + 10
print(result)  # [11 12 13 14 15]

# Matrix multiplication
mat_product = np.dot(arr2, np.array([2, 2, 2]))
print(mat_product)  # [12 30]

# Broadcasting
broadcasted = arr2 * 2
print(broadcasted)  # [[ 2  4  6]
                    #  [ 8 10 12]]

# Statistical operations
mean_value = np.mean(arr1)
print(mean_value)  # 3.0

# Slicing
subset = arr1[1:4]
print(subset)  # [2 3 4]

Output:
[1 2 3 4 5]
[11 12 13 14 15]
[12 30]
[[ 2  4  6]
 [ 8 10 12]]
3.0
[2 3 4]
Integration with other Libraries using NumPy in Python

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)  # 100 evenly spaced points from 0 to 10
y = np.sin(x)

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sine Wave')
plt.show()
Universal Functions (ufuncs) in NumPy

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Basic arithmetic operations (element-wise)
add_result = np.add(a, b)            # [5 7 9]
subtract_result = np.subtract(a, b)  # [-3 -3 -3]
multiply_result = np.multiply(a, b)  # [ 4 10 18]
divide_result = np.divide(a, b)      # [0.25 0.4  0.5]
Key Features of Pandas:
 DataFrame: The central data structure in pandas is the DataFrame, a two-dimensional
table-like structure with rows and columns. It allows you to store and manipulate data in
a tabular format, similar to a spreadsheet or SQL table.
 Series: A Series is a one-dimensional labeled array, similar to a column in a DataFrame. It
can hold data of any type and is useful for working with single columns or as
index/column labels.
 Data Manipulation: Pandas provides a wide range of functions for data manipulation,
including filtering, sorting, grouping, reshaping, merging, and joining datasets.
 Handling Missing Data: Pandas offers tools to handle missing or null values, allowing
you to fill, replace, or drop missing data points (a short sketch follows the DataFrame example below).
 Data I/O: Pandas supports reading and writing data from various file formats, such as
CSV, Excel, SQL databases, and more.
 Indexing and Selection: You can select, slice, and filter data using various indexing
techniques, such as label-based indexing, positional indexing, and Boolean indexing.
 Data Aggregation: Pandas simplifies the process of summarizing and aggregating data
using functions like groupby().
 Time Series: Pandas provides functionalities for working with time series data,
including date and time parsing, resampling, and time-based operations.
 Data Visualization: While not a primary visualization library, pandas integrates well
with visualization libraries like Matplotlib and Seaborn to create informative plots and
charts.

import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28]}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Filtering data
young_people = df[df['Age'] < 30]
print(young_people)

# Grouping and aggregation
age_group = df.groupby('Age').count()
print(age_group)

# Reading and writing data
df.to_csv('people.csv', index=False)
new_df = pd.read_csv('people.csv')

# Displaying summary statistics
summary_stats = df.describe()
print(summary_stats)
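
Continuing with the same library, the missing-data tools mentioned in the feature list can be sketched like this (the scores are made up for illustration):

import numpy as np
import pandas as pd

# A small DataFrame with one missing value (np.nan)
scores = pd.DataFrame({'Score': [85.0, np.nan, 92.0, 78.0]})

print(scores.isna().sum())                    # count of missing values per column
print(scores.fillna(scores['Score'].mean()))  # fill the gap with the column mean
print(scores.dropna())                        # or drop rows containing missing values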
matplotlib
 The matplotlib library gives a MATLAB-like interface for creating data presentations of the analysis.
 The library was initially limited to 2-D output, but it still provides the means to express an analysis graphically.
 Without this library, we could not create output that people outside the data science community can easily understand.
Examples

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sine Wave')
plt.show()
SciPy
 The SciPy stack contains a host of other libraries that we can also download separately.
 These libraries provide support for mathematics, science, and engineering.
 When we obtain SciPy, we get a set of libraries designed to work together to create applications of various sorts. These libraries are:
 NumPy
 Pandas
 matplotlib
 Jupyter
 SymPy, etc.
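
As a small taste of these scientific tools, SciPy can numerically integrate a function (a minimal sketch using scipy.integrate; the exact value of this integral is 2):

import numpy as np
from scipy import integrate

# Numerically integrate sin(x) from 0 to pi
value, abs_error = integrate.quad(np.sin, 0, np.pi)
print(value)  # approximately 2.0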
Scikit-learn
 The Scikit-learn library is one of many Scikit libraries that build on the capabilities provided by NumPy and SciPy to allow Python developers to perform domain-specific tasks.
 The Scikit-learn library focuses on data mining and data analysis; it provides access to the following sorts of functionality:
 Classification
 Regression
 Clustering
 Dimensionality reduction
 Model selection
 Pre-processing
 Scikit-learn is the most important library we are going to learn in this subject.
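
A minimal sketch of one functionality from the list above, clustering, on made-up 2-D points:

import numpy as np
from sklearn.cluster import KMeans

# Six made-up points forming two obvious groups around x=1 and x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers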
Keras and TensorFlow
 Keras is an application programming interface (API) that is used to train deep learning models.
 An API often specifies a model for doing something, but it doesn't provide an implementation.
 TensorFlow is an implementation for Keras; there are other implementations for Keras as well, such as Microsoft's Cognitive Toolkit (CNTK) and Theano.
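
A minimal sketch of defining and compiling a small Keras model through TensorFlow (untrained; the layer sizes here are arbitrary choices):

from tensorflow import keras

# A tiny feed-forward network: 4 inputs -> 8 hidden units -> 3 classes
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()  # prints the layer structure and parameter counts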
NetworkX
 NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (for example, a GPS setup to discover routes through city streets).
 NetworkX also provides the means to output the resulting analysis in a form that humans understand.
 The main advantage of using NetworkX is that nodes can be anything (including images) and edges can hold arbitrary data.
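
A small sketch echoing the street-routing example, using weighted edges (the place names and travel times are made up):

import networkx as nx

G = nx.Graph()
# Edges can hold arbitrary data; here, a travel time in minutes
G.add_edge('Home', 'Market', weight=5)
G.add_edge('Market', 'Office', weight=10)
G.add_edge('Home', 'Office', weight=20)

# Cheapest route by total weight: Home -> Market -> Office (15 minutes)
print(nx.shortest_path(G, 'Home', 'Office', weight='weight'))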
Beautiful Soup
 Beautiful Soup is a Python package for parsing HTML and XML documents.
 It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
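
A minimal parsing sketch (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

html = "<h1>News</h1><a href='/a'>First</a><a href='/b'>Second</a>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)                 # News
for link in soup.find_all('a'):
    print(link['href'], link.text)  # /a First, then /b Second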
Important Questions
 Justify why python is most suitable language for Data
Science.
 Explain Core competencies of a data scientist.
 Explain steps of Data Science Pipeline.
 Explain different programming styles (programming
paradigms) in python.
 Explain Factors affecting Speed of Execution.
 Linking Data Science, Big Data, and AI
 Write down key features of NumPy and Pandas
 Write down the difference between Data science and Big
Data.
 List out the different types of libraries used in Python.
Thank You