What is Data?
Data can take the form of numbers, words, images, sounds, videos, and so on.
But is a random collection of numbers or words or images data?
1. Is 17, 67, 98, 12 data?
2. Is a photograph you come across on the street data?
3. Context about what the numbers are (where they come from, what they signify, etc.) makes them count as data.
Data versus Metadata
-> Metadata is data about data: it provides information about the data.
Examples: the column names and data types of the columns in a relational database table
The number of rows in a table
Information about when and where a piece of audio/video was recorded
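The first two examples can be made concrete with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database holding a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, rainfall_mm REAL, day TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("A", 12.5, "2024-01-01"), ("A", 0.0, "2024-01-02"), ("B", 3.2, "2024-01-01")],
)

# Metadata: column names and declared types, read from the schema.
# PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(readings)")]

# Metadata: the number of rows in the table.
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]

print(columns)
print(row_count)
```

Neither value is a rainfall reading. Both describe the table that holds the readings, which is what makes them Metadata.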
-> Thesis (to consider): For something to count as data, it should come with some Metadata or context.
-> Do you agree or disagree with this thesis?
-> For Metadata, do we necessarily need Meta-Metadata for it to count as Metadata, and so on to absurdity?
Data Varieties & Operations
What is Managing Data or Data Management?
Variety of operations on various types of data
Maintaining integrity, quality, and security of data
Infrastructure needed to do all these things
Structured vs Unstructured vs Semi-Structured
We can distinguish between them by how much information the associated Metadata provides about the content of the data (or of its components).
Some people talk as if structured data is numbers and unstructured data is words.
1. This is complete nonsense: the distinction depends on the Metadata, not on whether the content is numeric or textual.
Examples of unstructured data: documents, videos, images, soundtracks, etc.
The Metadata associated with unstructured data provides almost no information about the content of the data or its components (e.g., occurrences of the persons, locations, organizations, events, etc. in the document).
Separate processes are required to extract and tag the components of interest in unstructured data.
1. The tags are Metadata
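As a rough illustration of such a separate extraction process, the sketch below tags dates and locations in a sentence. The function tag_text, the KNOWN_LOCATIONS lookup, and the sample sentence are all hypothetical. Real systems would more likely use a trained named-entity recognizer than a regular expression and a word list.

```python
import re

# Hypothetical lookup list standing in for a real entity recognizer.
KNOWN_LOCATIONS = {"Paris", "London"}

def tag_text(text):
    """Return a list of tags (Metadata) describing components found in text."""
    tags = []
    # Dates written as YYYY-MM-DD.
    for m in re.finditer(r"\b\d{4}-\d{2}-\d{2}\b", text):
        tags.append({"type": "DATE", "value": m.group(),
                     "start": m.start(), "end": m.end()})
    # Capitalized words that appear in the location list.
    for m in re.finditer(r"\b[A-Z][a-z]+\b", text):
        if m.group() in KNOWN_LOCATIONS:
            tags.append({"type": "LOCATION", "value": m.group(),
                         "start": m.start(), "end": m.end()})
    # Order the tags by where they occur in the text.
    return sorted(tags, key=lambda t: t["start"])

text = "Alice met Bob in Paris on 2024-03-05."
for tag in tag_text(text):
    print(tag["type"], tag["value"])
```

The text itself is unchanged. The tags are new Metadata produced by the extraction step, each pointing back to a span of the original content.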
Examples of semi-structured data: emails, XML documents
Metadata associated with Semi-Structured data provides:
1. ‘sufficient’ information about only some of the components of the data/document (e.g., emails) AND/OR
2. ‘insufficient’ information about some of the components of the data/document (e.g., XML documents)
Emails contain Metadata that tags fields such as sender, receiver, and time, but they provide no Metadata about the components of the email body.
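The email case can be sketched with Python's standard email package. The addresses and message text below are made up for illustration.

```python
from email import message_from_string

# A hypothetical raw email: tagged header fields, a blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Tue, 05 Mar 2024 10:00:00 +0000\n"
    "Subject: Lunch\n"
    "\n"
    "Let's meet at the cafe on Main Street at noon.\n"
)

msg = message_from_string(raw)

# Header fields are tagged, so each one can be read by name.
sender = msg["From"]
receiver = msg["To"]
sent = msg["Date"]

# The body comes back as one untagged block of text; nothing marks
# the place ("Main Street") or the time ("noon") inside it.
body = msg.get_payload()

print(sender, receiver, sent)
print(body)
```

Getting the meeting place or time out of the body would need the same kind of extraction step that unstructured data requires.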
Examples of structured data: rows in a relational database table, information entered in a typical web form, JSON objects, etc.
The Metadata associated with structured data gives ‘sufficient’ information about the components of the data.
1. Whether the Metadata provides ‘sufficient’ information depends on one’s needs.
2. Example: If the address field in a database table contains both city and street names and you need to extract them separately, you may not regard the ‘address’ tag as sufficient information.
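The address example might look like this in practice. The row contents, the split_address helper, and the assumption that addresses are written as "street, city" are all hypothetical. Real addresses are far less regular.

```python
# A hypothetical row from a table with 'name' and 'address' columns.
row = {"name": "Ada Lovelace", "address": "12 Market Street, Springfield"}

# The 'address' tag is enough to retrieve the whole field...
address = row["address"]

def split_address(address):
    """Split an address into street and city.

    Assumes the format "street, city"; this assumption is not part of
    the table's Metadata, so the split can easily go wrong.
    """
    street, _, city = address.rpartition(",")
    return {"street": street.strip(), "city": city.strip()}

# ...but the city and street must be recovered by parsing the value.
print(split_address(address))
```

If the table had separate street and city columns, the Metadata would be sufficient for this need and no parsing would be required.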
What are Data Models?
Should NOT be confused with:
1. Data formats (16 bytes vs. 32 bytes, etc.)
2. Data types (integer vs. string, etc.)
A Data Model consists of:
1. Data representation (tuples vs. objects vs. graphs, etc.)
2. Query language (e.g., SQL, CQL)
3. Constraints that provide semantics for data (e.g., integrity constraints)
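One way to see the three components together is the relational model, sketched here with Python's built-in sqlite3 module. The employee table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Representation: data is held as tuples (rows) in a relation (table).
# Constraints: NOT NULL and CHECK state which tuples are meaningful.
conn.execute(
    "CREATE TABLE employee ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " salary REAL CHECK (salary >= 0))"
)
conn.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Ada", 5000.0))

# Query language: SQL selects tuples that satisfy a condition.
rows = conn.execute(
    "SELECT name, salary FROM employee WHERE salary > 1000"
).fetchall()

# A tuple that violates the CHECK constraint is rejected by the database.
try:
    conn.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Bob", -1.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows)
print(rejected)
```

A document or graph data model would differ on all three points: a different representation, a different query language, and different kinds of constraints.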
Not a precise term
What is the opposite of “Big Data”?
Big data is generally characterized by three V’s.
1. Volume—humongous amount of data (petabytes and more)
- But that is not enough to make it big data
2. Velocity—speed at which data needs to be ingested and retrieved
- Real-time applications need real-time processing
3. Variety—structured, unstructured, semi-structured, textual, graphic, video, etc., may all be leveraged by a single application
Some people add more V’s, such as Veracity (Accuracy).
Data vs. Information vs. Knowledge
• Data can be thought of as the recording of some event or state of the world (e.g., water collected in a rain gauge at a certain location at a certain time)
• Information is conceptualization of data at a higher level (e.g., amount of rainfall at that location over a period of time)
• Knowledge is conceptualization and integration of information at a yet higher level (e.g., that region is undergoing a drought)
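The rain-gauge example can be followed through all three levels in a small sketch. The station names, readings, and drought threshold are invented for illustration. An actual drought assessment would rest on much more than a single rainfall total.

```python
# Data: individual gauge readings as (station, day, millimetres collected).
readings = [
    ("station_1", "2024-06-01", 0.0),
    ("station_1", "2024-06-02", 1.5),
    ("station_1", "2024-06-03", 0.5),
    ("station_2", "2024-06-01", 0.0),
    ("station_2", "2024-06-02", 0.0),
    ("station_2", "2024-06-03", 1.0),
]

def total_rainfall(readings):
    """Information: total rainfall per station over the recorded period."""
    totals = {}
    for station, _day, mm in readings:
        totals[station] = totals.get(station, 0.0) + mm
    return totals

# Hypothetical threshold, chosen only to make the example work.
DROUGHT_THRESHOLD_MM = 5.0

def region_in_drought(totals, threshold=DROUGHT_THRESHOLD_MM):
    """Knowledge: integrate the per-station totals into a regional judgement."""
    return all(total < threshold for total in totals.values())

totals = total_rainfall(readings)
drought = region_in_drought(totals)

print(totals)
print(drought)
```

Each step discards detail and adds interpretation: readings become totals, and totals become a claim about the region.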