Terminology


I will try to avoid too much jargon in my posts along the way, but the use of some terminology is helpful in understanding databases and mastering their use.

A

Ad-hoc Query – a query created by a user as and when it is needed, in contrast to a predefined or stored query.

Aggregate Function – a function that returns a single result computed from a group of multiple input rows. Common aggregate functions include AVG, COUNT, MAX, and others. Also known as a Set Function.
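
For illustration, here is a minimal sketch of aggregate functions applied to a hypothetical Orders table with an Amount column (both names are invented for this example):

    SELECT COUNT(*)    AS order_count,    -- number of rows in the group
           AVG(Amount) AS average_amount, -- mean of the Amount column
           MAX(Amount) AS largest_amount  -- largest single value
    FROM   Orders;

Each function collapses the whole set of input rows into one value in the single-row result.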

Algorithm – A defined sequence of steps, operations, or procedures that, when followed, produces a desired outcome for the analyst.

Alias – A temporary nickname for a table name, used within a query.

ALTER DATABASE – A SQL statement that changes the size or settings of a database.

ALTER USER – A SQL statement used to change a user’s system settings such as a password.

Analytics – Using software-based algorithms and statistics to derive meaning from data.

Anomaly – an undesirable inconsistency caused by a modification to a data table; normalization is the process used to remove the potential for anomalies.

Applet – a Java bytecode program that is run by the Java virtual machine embedded in a browser.

Application – a business computer system that processes a database and consists of menus, forms, reports, queries, and web pages.

Application Program – a custom program for processing a database, which may be written in a standard programming language such as Java.

Association Rule – a data mining technique that describes relationships among the items found in a database.

Attribute – a single data item that describes an entity in a database; a component of a structured type, also known as a column.

AVG – a SQL aggregate function that returns the average of all values within a column.

B

Back end – a section of a Database Management System that interacts directly with the database.

Behavioral Analytics – Using data about people’s behavior to understand intent and predict future actions.

Big Data – The definition is ongoing as many experts argue about how best to define this term. Originally it was defined as Volume, Velocity, and Variety;

Volume originally referred to the sheer size of the datasets; later analysts expanded the definition to mean datasets whose size is beyond the ability of typical database software tools to store and analyze. Velocity refers to the speed at which the data is acquired and used. Not only are companies and organizations collecting more and more data at a faster rate, they want to derive meaning from that data as soon as possible, often in real time. Variety refers to the different types of data that are available to collect and analyze in addition to the structured data found in a typical database. Currently there are four different categories of information that constitute big data:

Machine-generated data – including RFID data, geolocation data from mobile devices, and data from monitoring devices such as utility meters.

Computer log data such as clickstreams from websites.

Textual social media information from sources such as Twitter and Facebook.

Multimedia social and other information from Flickr, YouTube, and other similar sites.

Boyce-Codd normal form (BCNF) – a normal form in which every determinant of a relation is a candidate key.

Business Intelligence (BI) System – an information system that assists users in analyzing business activities and helps predict future events. The two categories of BI systems are reporting systems and data mining systems.

Business Rule – a statement or rule, defined during the database design process, that informs both the user and the application programmer of how data is used to support the business.

C

Candidate Key – an attribute or group of attributes that uniquely identifies a row (data record) in a relational table; one of the candidate keys can be chosen to be the primary key.

Cascading Deletion – a referential integrity action specifying that when a parent row is deleted, the related child rows are deleted as well.

Cascading Update – a referential integrity action specifying that when the key of a parent row is updated, the foreign keys of the related child rows are updated accordingly.
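
To make the two cascading actions concrete, here is a sketch of a foreign key declared with both options; the Customer and Orders tables and their columns are hypothetical, and the exact syntax can vary slightly between DBMS products:

    CREATE TABLE Orders (
        OrderID    INT PRIMARY KEY,
        CustomerID INT,
        FOREIGN KEY (CustomerID) REFERENCES Customer (CustomerID)
            ON DELETE CASCADE   -- deleting a Customer row also deletes its Orders rows
            ON UPDATE CASCADE   -- changing a CustomerID updates the matching Orders rows
    );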

Catalog – A named collection of schemas.

Categorical Dependent Variable – A variable measured on a nominal scale that identifies class or group membership.

CHECK Constraint – In SQL, a constraint that states what data values are allowed in a particular column (attribute).
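
For example, a CHECK constraint might restrict a hypothetical Amount column to non-negative values (the table and column names are invented for illustration):

    CREATE TABLE Orders (
        OrderID INT PRIMARY KEY,
        Amount  DECIMAL(10,2) CHECK (Amount >= 0)  -- reject rows with negative amounts
    );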

Classification Trees – A data mining technique used to predict membership in the classes of a categorical dependent variable from the predictor variables. The technique typically divides the sample into groups according to simple rules and then resplits each group into subsamples.

Client – A user workstation representing the front end of a database management system; the part that displays information on the screen and responds to user input.

Client/server System – a multiuser system in which a central processor, or server, is connected to multiple user workstations, the clients.

Clustering – A type of analysis that divides data into groups so that the members of each group are as similar as possible to each other while the groups themselves are as different from each other as possible.

Column – a table component that corresponds to a single attribute.

Composite Key – a key that is made up of two or more columns in a table.

COMMIT – a command issued to the DBMS to make database modifications permanent, so that the changes survive system crashes and other failures. It can be contrasted with the ROLLBACK command.
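
A minimal sketch of a transaction against a hypothetical Account table; if both updates succeed, COMMIT makes the changes permanent, while ROLLBACK would discard them (the statement that starts a transaction varies by DBMS):

    START TRANSACTION;  -- BEGIN TRANSACTION in some products

    UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 2;

    COMMIT;  -- make both changes permanent; ROLLBACK would undo them instead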

Concurrent Access – Two or more users operating on the same rows in a table in a database at the same time.

Constraint – a restriction placed on a column or table in a database to enforce data integrity.

COUNT – a SQL aggregate function that returns the number of rows in a table (or the number of rows matching a query).

Customer Relationship Management (CRM) – The processes organizations or businesses use to keep records of customer contacts, including customers’ addresses, quotes, sales, and future needs, to ensure that customers are served to their satisfaction. These processes are typically supported by software.

D

Dashboard – A graphical reporting of static or real-time data on a desktop or mobile device. The data represented is typically high-level to give managers a quick report on status or performance.

Data – a collection of statements or facts that is gathered, stored, and manipulated into a context that is useful to the user.

Data Administration – an enterprise-wide function responsible for the effective use and control of an organization’s data assets.

Data Analytics – The application of software to derive information or meaning from data. The end result may be a report, an indication of status, or an action taken automatically based on the information received.

Data Analyst – A person who is responsible for the tasks of modeling, preparing, and cleaning data for the purpose of deriving actionable information from the data.

Data Dictionary – a catalog of data and application metadata that can be accessed by a user.

Data Feed – The means for a person to receive a stream of data. Examples: RSS or Twitter.

Data Governance – A set of processes or rules that ensure the integrity of the data and that data management best practices are met.

Data Integrity – the state of a database in which all constraints are fulfilled.

Data Control Language (DCL) – the component of SQL consisting of the commands used to control access to data and protect it from corruption or misuse.

Data Definition Language (DDL) – the component of SQL used to define, modify, and remove database structures.

Data Manipulation Language (DML) – the component of SQL used to enter, manipulate, and remove the data inside a database.

Data Mining – the use of automated data analysis techniques to discover relationships within data. A process with a minimum of four stages: (1) data preparation, which includes data cleaning, (2) initial exploration, (3) model building, and (4) deployment.

Data Mining Application – a business intelligence system that uses statistical and mathematical techniques to perform analyses in order to make predictions and support business decisions.

Data Redundancy – the same data stored in more than one place within a database.

Data Science – A discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.

Data Scientist – A person who practices data science.

Data Type – a predefined form of data, such as numbers or dates.

Database – a self-describing collection of files containing data and metadata.

Database Administrator (DBA) – a person responsible for the functionality, integrity, and safety of a database.

Database Design – a diagram or plan representing how a database will be implemented in a database management system product.

Database Engine – a part of the database management system that interacts directly with the database, serving as part of the back end.

Database Management System (DBMS) – a set of programs used to define, administer, and process a database and its applications. Examples of DBMS products include Microsoft Access, Oracle, and MySQL.

Data Warehouse – a centralized database that captures information from business processes so it can be analyzed, often with data mining techniques, to determine predictive relationships.

Deadlock – A situation in which two or more connections each hold resources (i.e., locks) that the other needs, so that each is stuck in an indefinite wait for the other while two or more users try to access the same data at the same time.

Decision Support System (DSS) – one or more applications designed to help middle management make business decisions.

Decision Tree Analysis – a form of supervised data mining that classifies entities of interest into two or more groups according to the values of attributes in historical data.

DELETE – A SQL command used to delete record(s) from a table in the database.

Deletion Anomaly – an inconsistency in a multi-table database that occurs when a row is deleted from one of the tables.

Diagnostics Area – A data structure that is managed by the database management system containing detailed information about a previous SQL statement executed and any errors that occurred during its execution.

Domain – the set of allowable values that an attribute can assume.

E

Entity Integrity – the property of a database table that keeps it consistent with the real-world object it models, typically enforced by requiring every row to have a unique, non-null primary key.

Equijoin – the process of joining relation A, containing attribute A1, with relation B, containing attribute B1, to form relation C; every row of C satisfies the condition A1 = B1.

Extract – a portion of an operational database downloaded to a personal computer for processing, in order to reduce the communication cost and time of querying and creating reports against the transaction-processing system.

Extract, Transform, and Load (ETL) system – the portion of a data warehouse system that converts operational data into data warehouse data.

F

Foreign Key – one or more columns in a database table used to reference a primary key in another table within the same database.

Front End – The part of a database management system that interacts with the user.

Functional Dependency – a relationship between or among attributes of a relational table in which one attribute (or set of attributes) determines the value of another attribute.

G

GRANT – a SQL command, part of DCL, used by a privileged user such as the administrator to give another user access rights to an object or database.
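
For example, an administrator might allow a user to read and insert rows in a hypothetical Orders table (the user and table names are invented):

    GRANT SELECT, INSERT ON Orders TO report_user;   -- give read and insert rights
    REVOKE INSERT ON Orders FROM report_user;        -- the companion DCL command removes them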

Graphical User Interface (GUI) – a user interface that uses graphical elements for interaction.

H

Hadoop – An open source software library project administered by the Apache Software Foundation. According to Apache, Hadoop is “a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.”

Hierarchical Database Model – a database model that organizes data in a tree structure.

HTML (HyperText Markup Language) – a standard formatting language used for web documents.

Hypertext Transfer Protocol (HTTP) – a standardized protocol that uses TCP/IP to communicate over the Internet.

I

Identifier – an attribute that names or identifies an entity.

Implicit Lock – a lock that is automatically placed by the DBMS.

Index – a table of pointers used to locate rows quickly in a large database, saving time and improving efficiency by avoiding full scans of the tables.

Information Management – The practice of collecting, managing, and distributing information of all types: digital, paper-based, structured, and unstructured.

INSERT – a SQL command used to add a new record to a table within a database.

Insertion Anomaly – an inconsistency that occurs when a new row is inserted into one of the tables in a multi-table database.

International Organization for Standardization (ISO) – the international standards organization that maintains the SQL standard.

Internet – the worldwide network of computers.

Intranet – A network that uses World Wide Web hardware and software that restricts access to users within an organization.

J

Java – an object-oriented programming language with automatic memory management, unlike C++. It is primarily used for Internet applications, but it can also be used as a general-purpose programming language.

Join – a relational operator that allows the user to combine data from multiple tables to form a result table, for example to create a report.
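
A sketch of a join between two hypothetical tables, Customer and Orders, matching rows on their shared CustomerID column:

    SELECT c.Name, o.OrderID, o.Amount
    FROM   Customer c
    JOIN   Orders   o ON o.CustomerID = c.CustomerID;  -- only matching rows appear in the result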

K

Knowledge Discovery – a term used interchangeably with data mining.

L

LEFT OUTER JOIN – a join that includes all the rows of the first (left) table named in the SQL statement, together with the matching rows, where they exist, from the second table.
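
Using the same hypothetical Customer and Orders tables, a LEFT OUTER JOIN also keeps customers that have no orders, filling the missing order columns with NULL:

    SELECT c.Name, o.OrderID
    FROM   Customer c
    LEFT OUTER JOIN Orders o ON o.CustomerID = c.CustomerID;
    -- customers without orders still appear, with OrderID returned as NULL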

Locking – A method for safely protecting data from being changed by two or more users (processes/threads) at the same time.

LOWER – a SQL function that converts a character string to all lowercase.

M

Mapping – the translation of data from one format to another.

Market Basket Analysis – a type of data mining that estimates the correlations of items that are purchased together.

MAX – a SQL aggregate function that returns the maximum value in a column.

Metadata – data that describes the structure of the data within a database.

Microsoft Access – a personal or individual database product that offers a flexible environment for database developers and users. It uses the Microsoft Office interface and allows for integration with larger-scale enterprise databases.

MIN – a SQL aggregate function that returns the minimum value in a column.

Modification Anomaly – an inconsistency that occurs when a modification such as an insertion, deletion, or update is made to a table in a multi-table database.

N

Nested Query – a SQL query statement that includes one or more subqueries.
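
A sketch against the hypothetical Customer and Orders tables used above: the inner (nested) query finds the customers who have placed orders, and the outer query returns their names:

    SELECT Name
    FROM   Customer
    WHERE  CustomerID IN (SELECT CustomerID FROM Orders);  -- the nested subquery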

Normalization – The process of organizing data to minimize redundancy and remove ambiguity. Normalization involves separating a database into tables and defining relationships between the tables. There are three main stages of normalization called normal forms as follows:

First Normal Form (1NF): Each field in a table must contain only atomic (indivisible) values, with no repeating groups of information.

Second Normal Form (2NF): Every attribute that is not part of the primary key must depend on the entire primary key; attributes that depend on only part of the key are moved to another table.

Third Normal Form (3NF): Non-key attributes must depend only on the key, not on other non-key attributes, so no duplicate information is permitted. For example, if two tables both require a common set of fields, that common information should be separated into its own table.

There are two more normal forms, fourth normal form (4NF) and fifth normal form (5NF), but they are rarely used. Normalization makes databases more efficient and easier to maintain; a brief example of splitting an unnormalized table appears below.
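
As a rough sketch (all names invented), normalization replaces a single wide table that repeats customer details on every order row with two related tables:

    -- Before: one table repeating customer details on every order
    --   OrderID | CustomerName | CustomerCity | Amount

    CREATE TABLE Customer (
        CustomerID INT PRIMARY KEY,
        Name       VARCHAR(100),
        City       VARCHAR(100)
    );

    CREATE TABLE Orders (
        OrderID    INT PRIMARY KEY,
        CustomerID INT REFERENCES Customer (CustomerID),  -- each customer stored once, referenced here
        Amount     DECIMAL(10,2)
    );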

NoSQL – A database management system that does not use the relational model. NoSQL systems are designed to handle large data volumes that do not follow a fixed schema, and they are ideally suited to very large datasets that do not require the relational model.

NULL Constraint – In SQL, a constraint that specifies whether a column is allowed to contain NULL (empty) values in some of its rows.

O

Object – a basic building block in object-oriented programming.

Online Analytical Processing (OLAP) – allows for the real-time analysis of data stored in a database. The OLAP server contains specialized algorithms and indexing tools to process analytical and data mining tasks efficiently.

Online Transaction Processing (OLTP) – a database workload that allows the real-time processing of SQL transactions to support e-commerce and other time-critical applications.

Open Source Software (OSS) – Software that is released under a license that permits each recipient of the software to copy and modify the software; allows each recipient to distribute the software in modified or unmodified form; and does not charge fees or require a royalty for the permission to copy, modify, or distribute the software.

Optimizer – A component of the DBMS that estimates the optimum (i.e., fastest) method of accessing the data requested by a particular SQL SELECT, UPDATE, or DELETE statement.

Oracle – a relational database management system marketed by Oracle Corporation.

Outer Join – A type of join formed between two tables. In addition to including the rows from the two tables with matching join column values, it also includes rows from one table that have no matching rows in the other.

P

Parameter – a variable within an application written in SQL module language.

Personal Database System – A DBMS product intended for use by an individual or small business, typically to build forms and reports. An example of a personal database system is Microsoft Access.

Predictive Analytics – Using statistical functions on one or more datasets in order to predict trends or future events.

Predictive Modeling – The process used to develop a model that will predict a trend or outcome.

Primary Key – a column (or set of columns) in a table that uniquely identifies each record in the table.

Programmer – a person who uses a programming language to create an application program.

Q

Query – a question or request issued by the user to extract the needed data from the database, typically for use in analysis or reporting.

Query Analysis – The process of analyzing a search query in order to produce the best possible result.

R

Record – The representation of a physical or conceptual object. Example: in a Customer table, each individual customer is represented by one record (row).

Referential Integrity – a relational database concept requiring that every foreign key value matches an existing primary key value, so that all the tables in the database remain consistent with one another and the integrity of the data is ensured.

Regression – A data analysis technique used in statistics for building predictive models; it determines a mathematical equation that minimizes some measure of the error between the prediction from the regression model and the actual data.

Relation – a two-dimensional array of rows and columns containing single-valued entries and no duplicate rows.

Repository – a collection of metadata about database structure, applications, web pages, users and other application components.

Risk Analysis – The application of statistical methods on one or more datasets used to determine the likely risk of a project, action, or decision.

ROLLBACK – the process of recovering a database by applying before-images to return it to a checkpoint at which the database was consistent.

Root-Cause Analysis – An analysis process to determine the main cause of an event or problem.

S

Schema – a collection of metadata describing the relations in a database. It is considered the “layout” of a database, or the blueprint of how data is organized into tables.

Segmentation – The process that creates mutually exclusive collections of records where the records share similar attributes either in unsupervised learning (such as clustering) or in supervised learning for a particular prediction field.

SELECT – The primary SQL command used to retrieve data from a relational database system.
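
A minimal SELECT against a hypothetical Customer table, retrieving only the requested rows and columns:

    SELECT Name, City
    FROM   Customer
    WHERE  City = 'Boston'  -- filter the rows
    ORDER  BY Name;         -- sort the result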

Sensitivity Analysis – A process used to determine how sensitive a predictive model is to changes in a predictor value. With this analysis technique, end users can gauge the effect of changing conditions on the accuracy of the model.

Set Function – a function that produces a single result based on a set of table rows; also known as an aggregate function.

SQL – an industry-standard data sublanguage designed to create, manipulate, and control relational databases.

Stored Procedure – A named and, optionally, parameterized compiled set of SQL database access statements that can be executed as a unit through a call to the stored procedure.
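
Stored procedure syntax differs noticeably between DBMS products; the sketch below uses MySQL-style syntax with invented table and procedure names, purely for illustration:

    DELIMITER //
    CREATE PROCEDURE GetCustomerOrders(IN p_customer_id INT)
    BEGIN
        -- return all orders for the requested customer
        SELECT OrderID, Amount
        FROM   Orders
        WHERE  CustomerID = p_customer_id;
    END //
    DELIMITER ;

    -- executed as a unit with: CALL GetCustomerOrders(42);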

Subquery – a query found within a query.

SUBSTRING – a SQL function that extracts a substring from a source string.
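
For instance, against a hypothetical Customer table (the FROM ... FOR argument style follows the SQL standard; some products also accept a comma-separated form):

    SELECT SUBSTRING(Name FROM 1 FOR 3) AS short_name  -- first three characters of Name
    FROM   Customer;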

SUM – a SQL aggregate function that adds up the values in a column.

Supervised learning – A class of data mining learning applications and techniques in which the system builds a model to predict a well-defined prediction field.

T

Table – a predefined format of rows and columns that defines an entity.

Text Analytics – A set of linguistic, pattern recognition, extraction, visualization, and predictive techniques applied to text in order to extract facts, business rules, and relationships and use them to solve business problems.

Text Mining – A process of extracting meaning from a collection of documents; its steps include statistically analyzing the words, terms, and structure of the documents.

Time-series forecasting – a process of using a data mining tool, such as a neural network, to learn to predict temporal sequences of patterns, so that, given a set of past patterns, it can predict a future value.

Transaction – a sequence of SQL statements whose effects are not visible to other transactions until all of the statements have been executed.

Trigger – Code that tells a database management system what other actions to perform after certain SQL statements have been executed.
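
Trigger syntax also varies by product; a MySQL-style sketch with invented table names might log every new order into an audit table:

    CREATE TRIGGER order_audit
    AFTER INSERT ON Orders
    FOR EACH ROW
        INSERT INTO OrderLog (OrderID, LoggedAt)
        VALUES (NEW.OrderID, NOW());  -- runs automatically after each INSERT on Orders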

U

Unsupervised learning – A data analysis technique in which a model is built without a well-defined goal or prediction field. It is used for exploration and general data organization. Example: clustering.

UPDATE – A SQL command used to edit existing record(s) in a table within a database.

Update Anomaly – an anomaly (problem) that occurs in a database when a table row is updated.

UPPER – a SQL function that converts a character string to all uppercase.

V

Value Function – a function that performs an operation on a value of a given data type, such as a character string, number, or date/time.

View – a “virtual table” defined by a stored SQL statement. A view is stored on the database server as a SQL statement that pulls data from one or more tables; users query the view just as they would any real database table. Views are often used for security purposes.
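
A sketch of a view over the hypothetical Customer and Orders tables used above; because the view omits the Amount column, users who are granted access only to the view never see order amounts:

    CREATE VIEW CustomerOrders AS
        SELECT c.Name, o.OrderID
        FROM   Customer c
        JOIN   Orders   o ON o.CustomerID = c.CustomerID;

    -- queried like an ordinary table:
    SELECT * FROM CustomerOrders WHERE Name = 'Acme Ltd';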

Visualization – The graphical display of data and models to help the user understand the structure and meaning of the information they contain.

W

World Wide Web – an aspect of the Internet that provides a user interface, accessed through web browsers, for applications to reach data on the network.

X

XML (Extensible Markup Language) – a standard markup language that provides a clear separation between structure, content, and materialization; it is widely accepted for exchanging data between dissimilar systems.