Dataset Glossary
-
Dataset
A structured collection of related data points or facts. Datasets are fundamental to data analysis, machine learning, and information management. They can be stored in various forms, such as database tables, spreadsheets, or flat files.
Key Characteristics: A tabular dataset typically consists of observations (rows) and variables (columns).
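Example: as a rough illustration of the row and column structure, the sketch below holds a tiny tabular dataset as a list of dictionaries. The values and the helper names get_row and get_column are made up for this example and are not part of any standard.

```python
# A tiny tabular dataset: each dict is one observation (row),
# and each key is one variable (column). All values are made up.
dataset = [
    {"name": "Ana", "age": 34, "city": "Lisbon"},
    {"name": "Ben", "age": 27, "city": "Oslo"},
    {"name": "Chen", "age": 45, "city": "Taipei"},
]


def get_row(data, index):
    """Return one observation (row) by its position."""
    return data[index]


def get_column(data, variable):
    """Return every value of one variable (column)."""
    return [row[variable] for row in data]


n_rows = len(dataset)        # number of observations
n_columns = len(dataset[0])  # number of variables
```

Here get_row(dataset, 1) picks out a single data point, while get_column(dataset, "age") collects one variable across all data points.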
-
Data Point (Observation)
A single record or instance within a dataset. In a tabular dataset, a data point usually corresponds to a single row.
-
Variable (Feature)
A characteristic or attribute that can be measured or observed for each data point. In a tabular dataset, a variable typically corresponds to a column.
Types: Can be numerical (e.g., age, price) or categorical (e.g., gender, product type).
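Example: the distinction between numerical and categorical variables can be sketched with a simple heuristic. The function name infer_variable_type and the sample values are assumptions for illustration. The rule is deliberately simplified: codes stored as numbers (such as postal codes) are really categorical, which this check would not detect.

```python
from numbers import Number


def infer_variable_type(values):
    """Classify a column as 'numerical' or 'categorical' (simplified heuristic)."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


# Made-up sample columns.
ages = [34, 27, 45]
prices = [9.99, 24.50, 3.75]
product_types = ["book", "toy", "book"]
```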
-
Metadata
Data that describes other data. Metadata for a dataset can include information about its origin, format, variables, units of measurement, and any pre-processing applied.
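Example: one possible way to record metadata next to a dataset is a plain dictionary. The field names and values below are hypothetical and do not follow any particular metadata standard.

```python
# Hypothetical metadata record describing a dataset (all values are made up).
dataset_metadata = {
    "origin": "example customer survey",
    "format": "CSV",
    "variables": {
        "age": {"type": "integer", "unit": "years"},
        "price": {"type": "float", "unit": "EUR"},
    },
    "preprocessing": ["removed duplicate rows"],
}


def unit_of(metadata, variable):
    """Look up the unit of measurement recorded for a variable, if any."""
    return metadata["variables"][variable].get("unit")
```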
-
Schema
The structural definition of a dataset, outlining the names, types, and relationships of its variables. It acts as a blueprint for the data.
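Example: a minimal sketch of a schema used as a blueprint, assuming the schema is simply a mapping from variable names to expected Python types. SCHEMA and validate_row are illustrative names, and relationships between variables are not covered here.

```python
# Hypothetical schema: variable name -> expected Python type.
SCHEMA = {"name": str, "age": int, "price": float}


def validate_row(row, schema):
    """Return a list of problems; an empty list means the row conforms."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"missing variable: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(
                f"{column}: expected {expected_type.__name__}, got {actual}"
            )
    # Report variables the schema does not know about.
    for column in row:
        if column not in schema:
            problems.append(f"unexpected variable: {column}")
    return problems
```

A row that matches the schema yields an empty list, while a row with a missing or mistyped variable yields one message per problem.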
-
CSV (Comma-Separated Values)
A common plain-text file format for storing tabular data. Each line in a CSV file typically represents a data point, values within a line are separated by commas, and the first line often holds the variable names as a header.
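Example: the sketch below parses a small piece of CSV text with Python's standard csv module. The sample text is made up and assumes the first line is a header. It also shows that a value containing a comma is wrapped in double quotes so it is not split.

```python
import csv
import io

# Made-up CSV text; the first line is assumed to be a header.
csv_text = 'name,age,city\nAna,34,Lisbon\nBen,27,"Oslo, Norway"\n'


def read_csv_rows(text):
    """Parse CSV text into a list of dicts, one per data point."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


rows = read_csv_rows(csv_text)
```

Note that every value comes back as a string (for example "34"), so numerical variables need an explicit conversion after reading.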
-
Database
An organized collection of data, typically stored electronically and accessed through a computer system. Databases are usually managed by a database management system (DBMS), which handles storage, querying, and access control.
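Example: as a small sketch of working with a database through a DBMS, the code below uses SQLite via Python's standard sqlite3 module. The table name people and its rows are made up, and the database lives only in memory.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory SQLite database with one small, made-up table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO people (name, age) VALUES (?, ?)",
        [("Ana", 34), ("Ben", 27), ("Chen", 45)],
    )
    conn.commit()
    return conn


conn = build_demo_database()
# Parameterized query: the DBMS filters and sorts the rows for us.
older_than_30 = conn.execute(
    "SELECT name FROM people WHERE age > ? ORDER BY name", (30,)
).fetchall()
conn.close()
```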
-
Data Warehouse
A large, centralized repository used for storing integrated data from one or more disparate sources. Data warehouses are primarily used for reporting and data analysis, rather than for transactional processing.
-
API (Application Programming Interface)
A set of rules and protocols that allows different software applications to communicate with each other. Datasets can often be accessed and queried through APIs.
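Example: a rough sketch of querying a dataset through a web API, without making a real network call. The endpoint https://api.example.com/v1/records, the query parameters, and the response shape (a JSON object with a "records" list) are all hypothetical; a real API defines its own.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the URL documented by a real API.
BASE_URL = "https://api.example.com/v1/records"


def build_query_url(base_url, **filters):
    """Append query parameters to the base URL."""
    return f"{base_url}?{urlencode(filters)}"


def parse_records(response_body):
    """Extract the list of records from a JSON response body (assumed shape)."""
    payload = json.loads(response_body)
    return payload["records"]


url = build_query_url(BASE_URL, city="Oslo", limit=10)

# Stand-in for the text a server might return for the request above.
sample_response = '{"records": [{"name": "Ben", "age": 27, "city": "Oslo"}]}'
records = parse_records(sample_response)
```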
-
Data Lake
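A centralized repository that stores structured, semi-structured, and unstructured data at large scale. Unlike a data warehouse, which holds data that has already been integrated and structured for analysis, a data lake keeps data in its raw format and applies structure only when the data is read.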