The Introduction to Data Programming aims to equip you with the basics of analysing data using Python, a popular programming language with rich data processing capabilities. The key focus is to get you coding immediately: simple examples reinforce your understanding, and typing out and running those examples validates the concepts.
You will learn to set up the software tools required to enable you to start writing Python routines and programs.
You will learn how to read data from various sources, starting with text files. To understand what the content is and make use of it, you will explore the various text processing functions within the Python language and learn about the basic formats in which text is stored in files.
In this blog we will cover some important data programming topics:
Data Types and Structure
What is the difference between data and information?
For many organisations, data is created when a business process occurs, and the inputs and outputs of these processes are stored as figures, statistics and numbers. However, for this data to be used effectively, it has to be properly categorised, structured, interpreted and analysed, so that meaningful information or knowledge can be derived from it and actions taken that bring about improvements for the organisation.
In this section we learn how data can be organised and managed using data management software known as a database. Data in a database can be retrieved and processed programmatically from Python through the relevant database command languages. Since data comes in many different forms and formats, there are different kinds of databases designed to manage both structured and unstructured data.
Through programming techniques, we can treat these databases as data sources and retrieve data from them using the specific query languages supported by each database vendor.
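As a minimal sketch of this idea, the example below uses Python's built-in sqlite3 module to create an in-memory database and retrieve data with an SQL query. The table and column names (sales, item, amount) are made up purely for illustration:

```python
import sqlite3

# An in-memory SQLite database, used here as a stand-in for a real data source
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE sales (item TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("pen", 1.5), ("book", 12.0), ("pen", 2.0)])

# Retrieve data with an SQL query, just as we would from a vendor's database
cur.execute("SELECT item, SUM(amount) FROM sales GROUP BY item ORDER BY item")
rows = cur.fetchall()
print(rows)  # [('book', 12.0), ('pen', 3.5)]
conn.close()
```

The same pattern (connect, execute a query, fetch the results) applies to other database engines, with only the connection details and SQL dialect changing.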
Data Types and Structure
In this section we will be introduced to Pandas, an open-source Python library that lets us work with data and perform data analysis operations such as loading data from various formats and sources, cleansing and transforming data, and presenting the analysis by plotting graphs.
Using Pandas, we can perform more complex analysis on larger datasets, collate data from multiple sources and identify patterns and trends in data - the fundamental operations for business intelligence activities and big data analysis.
Using the concept of a dataframe in Pandas, we can perform operations on a dataset, such as picking out the specific columns or rows of data we need for further calculations - very much like working in desktop spreadsheet software. We can also query the data in the dataframe, operate on it (e.g. perform calculations), and pick out specific rows (i.e. a Series) that fulfil certain criteria (e.g. value > 50%).
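The column selection and row filtering described above can be sketched as follows, using a small hypothetical dataset of exam scores (the names and values are made up for illustration):

```python
import pandas as pd

# A hypothetical dataset of exam scores
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal", "Dee"],
    "score": [45, 72, 38, 90],
})

# Pick a specific column (a Series), much like selecting a spreadsheet column
scores = df["score"]

# Pick the rows that fulfil a criterion, e.g. score above 50
passed = df[df["score"] > 50]
print(passed["name"].tolist())  # ['Ben', 'Dee']
```

The expression inside the brackets, `df["score"] > 50`, produces a boolean Series that Pandas uses as a row mask.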
In this section, we will learn ways to combine different sets of data into a single dataset and analyse them correctly. Why do we need to do that, you may wonder?
In many business scenarios and working environments, the data you need for analysis does not necessarily come in one complete package or repository. More often than not, you will need to collect or sample data from several places and combine it together.
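A minimal sketch of combining data from several places, using hypothetical sales extracts from two branches and a separate customer lookup table (all names and figures invented for illustration):

```python
import pandas as pd

# Two hypothetical extracts of the same sales data from different branches
north = pd.DataFrame({"order_id": [1, 2], "amount": [100, 250]})
south = pd.DataFrame({"order_id": [3], "amount": [80]})

# Stack the rows from both sources into one dataset
sales = pd.concat([north, south], ignore_index=True)

# Merge in a separate lookup table keyed on order_id
customers = pd.DataFrame({"order_id": [1, 2, 3],
                          "customer": ["Ann", "Ben", "Cal"]})
combined = sales.merge(customers, on="order_id")
print(len(combined))  # 3 rows, now with a customer column
```

`concat` stacks datasets with the same columns, while `merge` joins datasets on a shared key, analogous to an SQL join.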
How do we ensure that an analysis performed on the combined sources will be correct?
As the saying goes - garbage in, garbage out - one of the first steps in data analysis work is to ensure that the source data is factually relevant and correct. To do that, the input data usually goes through a process of tidying or “cleaning”. For example, if the data is to be calculated on (e.g. summed or multiplied), it has to be of a numeric data type, not a textual one.
The process of combining different datasets may introduce missing or invalid data, so we must decide how to treat it: either eliminate the affected rows, or default the missing values to a specific figure. The latter is useful when we need to make assumptions about the missing data and assign values to it for further calculations.
We will also learn the basic concepts of pivoting, i.e. “changing” column header values into row values, using the melt and pivot_table functions in Pandas. This allows us to reorganise or “tidy” the data into a format that illustrates and explains it better.
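A minimal sketch of melt and pivot_table, using an invented table with one column per year:

```python
import pandas as pd

# Hypothetical "wide" data: one column per year
wide = pd.DataFrame({"product": ["pen", "book"],
                     "2020": [10, 5], "2021": [12, 7]})

# melt turns the year column headers into row values: one row per
# product/year combination
tidy = wide.melt(id_vars="product", var_name="year", value_name="units")

# pivot_table reverses the reshaping, summarising the values back into columns
back = tidy.pivot_table(index="product", columns="year", values="units")
print(len(tidy))  # 4
```

After melting, the dataset has 4 rows (2 products x 2 years), which is the "tidy" shape most Pandas operations expect.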
With a combined dataset, and after performing the various methods to tidy the data, we may notice that certain data values are repeated across rows. Recalling the concept of normalisation from database design, which eliminates redundancy and improves data integrity, we can apply the same idea to the combined dataset: split it into subsets containing the columns with repeated data, and eliminate the redundant or non-meaningful rows.
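The splitting described above can be sketched as follows, with a hypothetical orders table in which supplier details repeat on every row (all names invented for illustration):

```python
import pandas as pd

# A hypothetical combined dataset where supplier details repeat on every row
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "supplier": ["Acme", "Acme", "Best"],
    "supplier_city": ["Oslo", "Oslo", "Riga"],
})

# Split the repeated supplier columns into their own subset,
# eliminating the redundant rows
suppliers = orders[["supplier", "supplier_city"]].drop_duplicates()

# ...leaving a slimmer orders table that references suppliers by name
orders_slim = orders[["order_id", "supplier"]]
print(len(suppliers))  # 2
```

This mirrors normalisation in database design: each supplier's details are now stored once, and the two subsets can be re-joined on the supplier column when needed.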
We have learnt how to create new datasets for analysis by concatenating or merging separate datasets, and how to improve the quality of the data by improving its integrity and reducing its redundancy.
By pivoting and unpivoting the data in a dataset, you can view it in different presentation formats, summarised into rows or columns, which can help you understand or interpret the data better. The data can be summarised or aggregated by grouping similar values together, summing certain values, or counting the occurrences of a particular value.
In this section we will explore basic techniques for processing the values in a dataset. For example, we may want to convert the string “01”, representing a number, into an integer value so that we can sum such values together. Or we may convert the string “2018-01-01” into a date object so that we can calculate the number of days between two strings representing dates.
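Both conversions mentioned above can be done with the standard library alone; a minimal sketch using the dates from the text:

```python
from datetime import date

# Convert the string "01" into an integer so it can be summed
total = int("01") + int("02")

# Convert date strings into date objects to calculate a day difference
start = date.fromisoformat("2018-01-01")
end = date.fromisoformat("2018-03-01")
days = (end - start).days
print(total, days)  # 3 59
```

Subtracting two date objects yields a timedelta, whose `days` attribute gives the difference as a plain integer.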
These techniques allow us to convert, calculate or transform values into a meaningful analysis. We can label certain values as text or apply a formula to derive other values from them.
Using grouping techniques, we can categorise common (row) values together to find the average, maximum or minimum for a given aggregation criterion. This allows us to analyse information common in business scenarios, such as the average sales value of each purchase, or the top-performing cashier based on the transactional retail data collected.
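Both business questions above can be sketched with groupby on a hypothetical set of retail transactions (cashier names and amounts invented for illustration):

```python
import pandas as pd

# Hypothetical retail transactions, one row per purchase
txns = pd.DataFrame({
    "cashier": ["Ann", "Ben", "Ann", "Ben", "Ann"],
    "amount": [10.0, 25.0, 15.0, 5.0, 20.0],
})

# Average sales value per purchase, grouped by cashier
avg_sale = txns.groupby("cashier")["amount"].mean()

# Top-performing cashier by total amount sold
totals = txns.groupby("cashier")["amount"].sum()
top = totals.idxmax()
print(top)  # Ann
```

`groupby` collects rows sharing a value, and the aggregation function (mean, sum, max, count, and so on) is then applied within each group.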
In this section, we learnt the different ways to retrieve data from text files, from databases, or over the internet using HTTP requests and APIs. And to understand what kind of data we have, we introduced the concept of plotting: the process of creating charts and graphs to visualise a set of data (i.e. a dataset).
We also covered how structured data retrieved from databases can be used programmatically in Python through an ORM, enabling us to perform the basic operations of reading, writing, inserting and updating small sets of data.
For handling larger datasets, we introduced Pandas, a comprehensive Python library that organises data into dataframes and provides manipulation and processing functions that help us combine datasets, filter data records, and transform or aggregate data values, whether through Pandas' statistical functions or self-written Python functions.
Using Python for Data Processing
Installing Python and Related Analytics Libraries and Tools
Python is a popular general-purpose, high-level programming language. Its fairly simple syntax rules help keep a code base readable and maintainable. It comes with a robust standard library and many open-source frameworks (e.g. Flask) and tools. For data science and analytical work, it facilitates data analysis and visualisation, supported by an enthusiastic community that continuously contributes libraries and modules.
For the purpose of this course, we recommend installing Python and some of the related analytics Python libraries (e.g. NumPy) using Anaconda, an open-source distribution platform for programming languages and applications related to data science and machine learning.
To install Anaconda, navigate your browser to https://www.anaconda.com/download/, choose the Python 3 version for your operating system (Python 2.7 has reached end of life) and follow the installation instructions accordingly. On Mac OS and Windows, you should get a graphical installer: just download the installation program and double-click it. We recommend accepting Anaconda's default options.
To check that the Anaconda and Python installation went well, check your Python version by typing the following at the command line:
For the Windows command prompt:

> python --version

For the Mac OS terminal:

$ python --version
You can also check for a listing of all the Anaconda packages that have been installed on your system with the following command:
$ conda list
Jupyter Notebook comes installed with Anaconda. The app produces notebook documents that contain both Python code and rich text elements (e.g. paragraphs, figures, equations). This allows you to read a document containing analysis and results and, at the same time, execute its code to run the data analysis.
You can launch the Jupyter Notebook app from the Anaconda Navigator application which is installed on your system. Figure 1.1 shows the Jupyter Notebook app listed as one of the menu items in Anaconda Navigator main screen. Figure 1.2 shows Jupyter Notebook after being launched.
Figure 1.1 Anaconda Navigator
After clicking the Launch button, the notebook editor opens, ready for you to write code:
Extensive documentation on how to use Jupyter Notebook is available on its website http://jupyter.org/documentation.
If you have any project related to data programming, you can send us your requirement details at: