Describe responsibilities for data analysts – Describe core data concept

Describe responsibilities for data analysts

As a data analyst, your role is crucial in uncovering valuable insights and patterns within data to support informed decision-making within your organization. You play a pivotal role in analyzing, interpreting, and visualizing the data to derive meaningful information that drives business strategies. Let’s explore your comprehensive role as a data analyst from a business perspective.

Your primary responsibility is to explore and analyze data to uncover trends, correlations, and patterns that provide valuable insights. You work with various data sources, ranging from structured databases to unstructured text files, and use statistical techniques and analytical tools to extract meaningful information.

Data exploration is an essential part of your role. You dive deep into the data, examining its structure, quality, and relationships. You identify relevant variables and metrics to analyze, ensuring that the data is appropriate for the questions or problems at hand.

Once you have gathered and cleaned the data, you apply statistical analysis techniques to identify patterns and relationships. You may perform tasks such as descriptive statistics, hypothesis testing, regression analysis, or clustering to extract insights from the data. These analyses help you uncover trends, anomalies, and relationships that can guide decision-making.
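
As a small illustration of descriptive statistics and anomaly detection, the following Python sketch summarizes a week of order counts and flags unusual days. The data values and the two-standard-deviation threshold are assumptions chosen for the example, not figures from a real dataset or a prescribed method.

```python
import statistics

# Hypothetical daily order counts for one week (illustrative data only).
daily_orders = [120, 135, 128, 140, 500, 132, 125]

# Descriptive statistics: central tendency and spread.
mean_orders = statistics.mean(daily_orders)
median_orders = statistics.median(daily_orders)
spread = statistics.pstdev(daily_orders)  # population standard deviation

# Assumed rule of thumb: a value more than two standard deviations
# away from the mean is treated as an anomaly worth investigating.
anomalies = [x for x in daily_orders if abs(x - mean_orders) > 2 * spread]

print(f"mean={mean_orders:.1f} median={median_orders} anomalies={anomalies}")
```

Notice that the median stays close to a typical day while the mean is pulled upward by the single unusual value, which is one reason analysts usually look at both.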

Data visualization is another crucial aspect of your role. You use visual tools and techniques to present data in a clear and concise manner. By creating charts, graphs, and dashboards, you transform complex datasets into easily understandable visual representations. These visualizations help stakeholders to grasp insights quickly and make informed decisions.

An example of your role as a data analyst is analyzing customer behavior data for an e-commerce company. You would examine the data to understand customer preferences, identify purchasing patterns, and segment customers based on their buying behaviors. These insights would then inform marketing strategies, product recommendations, and customer retention efforts.

Your expertise as a data analyst contributes to evidence-based decision-making within your organization. By analyzing and interpreting data, you provide insights that support strategic planning, optimize operations, and drive business growth.

EXAM TIP

When preparing for the exam, you should focus on understanding the core data concepts and their practical application. Familiarize yourself with different data representation formats, storage options, and common data workloads. Pay attention to the roles and responsibilities of database administrators, data engineers, and data analysts. Additionally, practice relating these concepts to real-world scenarios to reinforce your understanding. Being able to apply your knowledge to practical situations will help you excel on the exam and in real-world data management scenarios.

Describe relational concepts – Identify considerations for relational data on Azure

Skill 2.1: Describe relational concepts

When exploring relational concepts, you will find yourself thinking about data management within the cloud-centric world of Microsoft Azure. Sound data management is indispensable for leveraging Azure’s robust data services, which require a keen understanding of relational database intricacies.

Envision organizing data in a structured manner, akin to tables with rows signifying unique records and columns delineating specific attributes. This visualization encapsulates the essence of relational data, which serves as the bedrock for building, querying, and manipulating databases in a coherent and systematic fashion. Mastery of relational data concepts is imperative, as it fosters data integrity and efficiency.

Delving deeper, you encounter normalization, a methodical strategy aimed at minimizing redundancy and fortifying data integrity. Normalization is about dissecting your data, distributing it across related tables, and ensuring each piece of information resides in its rightful place. Grasping normalization’s significance is paramount, as it directly influences your database’s structural optimization, enhancing maintainability and expediting query performance.

As you navigate through this data-centric realm, SQL emerges as your primary tool of communication with relational databases. Learning SQL means learning the lexicon of databases, where you can retrieve, insert, update, and delete data with precision and efficiency. A standardized language across relational databases, SQL is vital for effective data management and manipulation.

Lastly, your journey brings you to database objects: diverse elements within a database that you can create, modify, and interact with. Objects range from tables and views to indexes and stored procedures, each playing a unique and crucial role in how data is structured, stored, and accessed. Understanding these components is like solving a complex puzzle; recognizing how each piece connects allows you to forge a complete, efficient, and effective database solution.

This skill covers how to:

  • Identify features of relational data
  • Describe normalization and why it is used
  • Identify common structured query language (SQL) statements
  • Identify common database objects

Identify features of relational data

Relational data refers to the type of data that can be organized into a table structure with related values connected to each other. This table structure consists of rows and columns, where each row represents a record, and each column represents a particular type of data called a field or attribute. Here are some of the key features of relational data:

  • Tables: In a relational database, all data is stored in a table, which is structured in rows and columns. Each column represents a type of data (an attribute), and each row represents an instance of that data (a record). For example, in a Student table, columns could be StudentID, Name, and Major, while each row would represent an individual student’s data.

  • Keys: Keys are a crucial feature of relational data. A primary key is a unique identifier for a row in the table. Foreign keys establish relationships between tables. For instance, a CourseID might be a foreign key in an Enrollments table, linking it to the Courses table.
  • Normalization: This is the process of organizing data to minimize redundancy and avoid data anomalies, such as update errors. It involves decomposing a table into less redundant tables without losing information.
  • Relationships: The relational model allows for relations among tables, typically through foreign keys. Relationships can be one-to-one, one-to-many, or many-to-many. For example, one student (in the Student table) can have many enrollment records (in the Enrollments table); this is a one-to-many relationship. Because each course can also have many students, students and courses together form a many-to-many relationship, which is typically resolved through a linking table such as Enrollments.
  • Data integrity: Relational databases enforce data integrity rules, ensuring the accuracy and consistency of data. These include entity integrity (each row is uniquely identifiable, with no duplicate rows), referential integrity (consistency across relationships), and domain integrity (data types and value consistency).
  • SQL: SQL is used for querying and manipulating the data stored in a relational database.

In Table 2-1, each row represents a student with a unique Student_id, the student’s name in Student_Name, and the ID of the course in which the student is enrolled in Course_id. The Course_id in Table 2-1 is a foreign key that connects Table 2-1 to Table 2-2.

In Table 2-2, each row represents a course with a unique Course_id and its name in Course_Name. The Course_id in this table is the primary key.
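
The primary key and foreign key relationship described above can be sketched with Python’s built-in sqlite3 module. SQLite is used here only as a convenient stand-in for a relational engine; the table and column names follow the text, while the sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Course (
    Course_id   INTEGER PRIMARY KEY,   -- primary key of the Course table
    Course_Name TEXT NOT NULL
);
CREATE TABLE Student (
    Student_id   INTEGER PRIMARY KEY,  -- primary key of the Student table
    Student_Name TEXT NOT NULL,
    Course_id    INTEGER REFERENCES Course(Course_id)  -- foreign key
);
-- Hypothetical sample rows.
INSERT INTO Course VALUES (101, 'Databases'), (102, 'Statistics');
INSERT INTO Student VALUES (1, 'Ana', 101), (2, 'Ben', 102), (3, 'Chen', 101);
""")

# Follow the foreign key to combine each student with the course name.
rows = conn.execute("""
    SELECT s.Student_Name, c.Course_Name
    FROM Student AS s
    JOIN Course AS c ON c.Course_id = s.Course_id
    ORDER BY s.Student_id
""").fetchall()

# Referential integrity: a student cannot point at a course that does not exist.
try:
    conn.execute("INSERT INTO Student VALUES (4, 'Dee', 999)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(rows, fk_rejected)
```

The rejected insert shows referential integrity at work: the database refuses a Course_id that has no matching row in the Course table.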

TABLE 2-1  Student table

TABLE 2-2  Course table

In the context of Azure, there are several services that are built on the relational model, such as Azure SQL Database and SQL Server on Azure VMs. These services allow you to manage and manipulate relational data using SQL.

The relational concept is at the heart of most traditional database systems, and understanding it is key to becoming proficient in any database-related work, especially when working with Azure’s data services.

Describe normalization and how it is used

When you’re designing a relational database, one of the key approaches you’ll use is normalization. This process is all about organizing your data in a way that reduces redundancy and helps avoid any inconsistencies that could lead to confusion or errors.

You’ll achieve normalization by dividing your data into multiple related tables, each one focused on specific data. These divisions are guided by rules known as normal forms. There are several normal forms, each with a specific set of rules. The main ones are first normal form, second normal form, third normal form, Boyce–Codd normal form, fourth normal form, and fifth normal form. Each successive normal form comes with stricter rules.

These are the benefits of normalization:

  • Reduces data redundancy: By breaking down tables and eliminating duplicate data, normalization helps minimize redundancy.
  • Improves data consistency and integrity: Normalization enforces rules that enhance data accuracy and reliability, reducing the likelihood of anomalies.
  • Simplifies query handling: Well-organized data simplifies the process of querying, updating, and managing data.
  • Efficiently uses storage: By reducing redundancy, normalization ensures efficient use of storage.

Normalization is conducted through a series of steps, or what we call normal forms:

  • First normal form (1NF): Ensures all columns in a table are atomic and each cell can contain only a single value. No repeating groups are allowed.

Example: A Students table with a Subjects column storing comma-separated values violates 1NF. It should be divided into two tables: Students and Subjects.

  • Second normal form (2NF): Achieved when the table is in 1NF and all nonkey attributes are fully functionally dependent on the whole primary key, not on just part of it.

Example: A table with the columns StudentID, Subject, and Teacher, whose primary key is the combination of StudentID and Subject, violates 2NF if Teacher depends on Subject alone and not on StudentID. To achieve 2NF, divide the table to separate subjects and their teachers.

  • Third normal form (3NF): Achieved when the table is in 2NF and every nonkey attribute depends directly on the primary key rather than on another nonkey attribute (that is, there are no transitive dependencies).

Example: If a table contains StudentID, StudentZipCode, and StudentCity, and StudentCity is determined by StudentZipCode rather than directly by StudentID, the zip code and city should be moved to a separate table to satisfy 3NF.

  • Boyce–Codd normal form (BCNF): A stronger version of 3NF. A table is in BCNF if it is in 3NF and, for every nontrivial functional dependency X → Y, X is a superkey.

Example: In a table with the columns StudentID, CourseID, and Instructor, if Instructor is dependent on CourseID and not on StudentID, the table should be split to satisfy BCNF.

  • Fourth normal form (4NF): Achieved when a table is in BCNF and does not have multi-valued dependencies.

Example: A table with StudentID, Courses, and Hobbies, where Courses and Hobbies are independent multivalued facts about a student, violates 4NF. This should be divided into separate tables.

  • Fifth normal form (5NF): A table is in 5NF if it is in 4NF and there are no join dependencies that do not follow from the key constraints.

Example: A table that captures a three-way relationship is in 5NF only if it cannot be broken down further into smaller tables that, when joined back together, reproduce the original data without losing or adding information.
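
To make the first of these steps concrete, here is a minimal Python sketch of the 1NF example: a Subjects column holding comma-separated values is split so that every cell holds a single value. The sample rows and the name student_subjects for the second table are assumptions for illustration only.

```python
# Hypothetical rows that violate 1NF: the third field packs several
# subjects into one comma-separated cell.
unnormalized = [
    (1, "Ana", "Math,Physics"),
    (2, "Ben", "Chemistry"),
]

# Students table: one row per student, with atomic values only.
students = [(student_id, name) for student_id, name, _ in unnormalized]

# Subjects table: one row per (student, subject) pair.
student_subjects = [
    (student_id, subject.strip())
    for student_id, _, subjects in unnormalized
    for subject in subjects.split(",")
]

print(students)
print(student_subjects)
```

After the split, a question such as "which students take Physics?" becomes a simple equality filter instead of a text search inside a packed column.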

Let’s take a look at an example of normalizing the data in Table 2-3.

TABLE 2-3  Unnormalized table

After normalization, this data could be divided into three tables: Student, Course, and Enrollment (see Tables 2-4, 2-5, and 2-6).

TABLE 2-4  Normalized Student table

TABLE 2-5  Normalized Course table

TABLE 2-6 Normalized Enrollment table

In this example, the Enrollment table establishes the relationship between the Student and Course tables. Now, if the course name changes, you need to update it in only one place, enhancing data consistency.
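
The benefit of updating a course name in only one place can be sketched as follows, again using Python’s sqlite3 module as a stand-in engine. The column names and sample rows are assumptions, because they are invented here purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed column names; sample rows are hypothetical.
CREATE TABLE Student (StudentID INTEGER PRIMARY KEY, StudentName TEXT);
CREATE TABLE Course  (CourseID  INTEGER PRIMARY KEY, CourseName  TEXT);
CREATE TABLE Enrollment (
    StudentID INTEGER REFERENCES Student(StudentID),
    CourseID  INTEGER REFERENCES Course(CourseID),
    PRIMARY KEY (StudentID, CourseID)
);
INSERT INTO Student VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO Course VALUES (10, 'Intro to Databases');
INSERT INTO Enrollment VALUES (1, 10), (2, 10);
""")

# The course name is stored once, so a rename touches exactly one row.
conn.execute(
    "UPDATE Course SET CourseName = 'Database Fundamentals' WHERE CourseID = 10"
)

# Every enrollment sees the new name through the join.
report = conn.execute("""
    SELECT s.StudentName, c.CourseName
    FROM Enrollment AS e
    JOIN Student AS s ON s.StudentID = e.StudentID
    JOIN Course  AS c ON c.CourseID  = e.CourseID
    ORDER BY s.StudentID
""").fetchall()

print(report)
```

In an unnormalized design the same rename would have to be repeated on every student row that mentions the course, which is where update anomalies come from.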

Challenges of normalization

While database normalization is a fundamental practice for optimizing database design, it brings with it a set of challenges that can impact the performance and manageability of your database systems. Understanding these challenges is crucial for database designers and administrators:

  • Complex queries: Sometimes normalized tables lead to complex queries that might degrade performance.
  • Design complexity: Achieving higher normal forms may result in a complex database design.

When to use normalization

Normalization should be used during the design phase of the database to ensure that the data is stored efficiently and the database can be maintained easily.

As you learn to work with databases in Azure and other platforms, understanding the principles and benefits of normalization will be invaluable. It’s a tool that can help you create efficient, reliable, and manageable databases.

Identify common structured query language (SQL) statements

SQL is a standardized programming language that is used to manage and manipulate relational databases, as pictured in Figure 2-1. Developed in the 1970s at IBM by Donald D. Chamberlin and Raymond F. Boyce, SQL has been the de facto standard for interacting with relational databases for decades. Despite being roughly a half-century old, SQL’s power and ubiquity have ensured its relevance even in today’s big data age.

FIGURE 2-1  Showing a SQL relational database

At its core, SQL allows you to create, manipulate, and query relational databases, which store data in structured tables. But beyond these basic capabilities, SQL is used in a variety of real-world scenarios. It powers business intelligence and drives decision-making in companies, and organizations can use it to gain insights from massive datasets. Whether you’re a data analyst, a data scientist, a back-end developer, or even a machine learning engineer, understanding SQL is an invaluable skill.

However, you’ll notice that different database systems support various SQL dialects. These dialects, while fundamentally the same, often come with their own set of proprietary extensions that are specific to a particular database system. These are some popular SQL dialects:

  • Transact-SQL (T-SQL): This is Microsoft’s proprietary extension of SQL, which includes a set of programmable functions and procedural programming. T-SQL is primarily used with Microsoft SQL Server.
  • PL/SQL: This stands for Procedural Language/SQL, a dialect developed by Oracle. PL/SQL includes procedural language constructs, and it supports structured programming.
  • MySQL: MySQL uses a dialect of SQL that is rich in functions, including string processing, date and time processing, and advanced features such as replication and partitioning.
  • PostgreSQL: PostgreSQL supports a rich version of SQL with advanced features such as window functions, common table expressions, and user-defined types.

Understanding these different dialects is valuable, especially when migrating from one database system to another or working with multiple database systems.

SQL statement types

SQL statements can be broadly classified into several types:

  • Data Definition Language (DDL)
  • Data Manipulation Language (DML)
  • Data Control Language (DCL)
  • Transaction Control Language (TCL)

DATA DEFINITION LANGUAGE

You’ll use DDL statements to create, modify, or delete the structure of database objects. The common statements are CREATE, ALTER, DROP, TRUNCATE, and RENAME, as shown in Table 2-7.

TABLE 2-7  DDL statements

Figure 2-2 shows an example of the DDL CREATE statement used to create a new table named Students with two columns: StudentID with the data type INT (an integer, or whole number) and FirstName with the data type VARCHAR (variable-length character data).

FIGURE 2-2  DDL CREATE statement

Figure 2-3 shows an ALTER statement that has been used to alter the Students table by adding a new column named Email with a data type of VARCHAR(255).

FIGURE 2-3  DDL ALTER statement

Figure 2-4 shows the DROP statement used to delete the Students table from the database.

FIGURE 2-4  DDL DROP statement

Figure 2-5 shows the TRUNCATE statement used to delete all the records from the Students table while keeping the table structure for future use.

FIGURE 2-5 DDL TRUNCATE statement

Figure 2-6 shows the RENAME statement used to rename the Students table to Learners.

FIGURE 2-6 DDL RENAME statement

These DDL statements form the basis of structuring your data in a relational database. You’ll frequently use these commands as you build and manage your databases, so be sure to get comfortable with their syntax and uses. Practice is key!
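
One way to practice the DDL statements described above is to run them against an in-memory SQLite database from Python. This is a sketch under assumptions rather than a copy of the book’s figures: SQLite’s syntax differs in places from T-SQL (for example, SQLite has no TRUNCATE statement, so an unfiltered DELETE stands in for it, and SQL Server renames tables with the sp_rename system procedure rather than ALTER TABLE ... RENAME TO). The helper column_names and the VARCHAR(50) length are hypothetical choices for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def column_names(table):
    """Return the column names of a table (hypothetical helper for this sketch)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


# CREATE: define a new table and its columns.
conn.execute("CREATE TABLE Students (StudentID INT, FirstName VARCHAR(50))")

# ALTER: add a column to the existing table.
conn.execute("ALTER TABLE Students ADD COLUMN Email VARCHAR(255)")
columns_after_alter = column_names("Students")

# TRUNCATE stand-in: remove every row but keep the table structure.
conn.execute("INSERT INTO Students VALUES (1, 'John', NULL)")
conn.execute("DELETE FROM Students")
rows_after_truncate = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]

# RENAME: change the table name (SQLite syntax).
conn.execute("ALTER TABLE Students RENAME TO Learners")

# DROP: remove the table and its structure entirely.
conn.execute("DROP TABLE Learners")
remaining_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(columns_after_alter, rows_after_truncate, remaining_tables)
```

The contrast between the last two steps is the point to remember: emptying a table leaves its definition in place, while dropping it removes the definition as well.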

DATA MANIPULATION LANGUAGE

You’ll use DML statements to insert, update, and delete data in a database. The common statements include INSERT, UPDATE, and DELETE, as shown in Table 2-8.

TABLE 2-8 DML statements

Figure 2-7 shows an example of a DML INSERT statement used to add a new record into the Students table. The new student has a StudentID of 1, a FirstName of John, and a LastName of Doe.

FIGURE 2-7  DML INSERT statement

Figure 2-8 shows the UPDATE statement used to update the Students table by changing the FirstName to Jane for the student with a StudentID of 1.

FIGURE 2-8  DML UPDATE statement

Figure 2-9 shows the DELETE statement used to remove records from the Students table where the StudentID is 1.

FIGURE 2-9  DML DELETE statement

Although technically part of the Data Query Language, SELECT is often considered part of DML as it is fundamental to manipulating data. The SELECT statement is used to fetch data from a database.

Figure 2-10 shows the SELECT statement fetching all the data from the Students table for the record where StudentID is 1.

FIGURE 2-10  DML SELECT statement

Figure 2-11 shows how to use the MERGE statement to update rows in Students that match rows in UpdatedStudents (based on StudentID) and to insert rows from UpdatedStudents that have no match in Students.

FIGURE 2-11  DML MERGE statement

Mastering the DML statements is crucial for effective database operations and data management. Make sure to practice these commands with different tables and databases to thoroughly understand how they work.
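
The DML statements above can be tried out with Python’s sqlite3 module, as in the following sketch. The sample rows are assumptions. SQLite has no MERGE statement, so the merge step is approximated here with an UPDATE for matching rows followed by an INSERT for unmatched rows; that mirrors the matched and not-matched branches of MERGE but is not the T-SQL syntax itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (
    StudentID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT
);
CREATE TABLE UpdatedStudents (
    StudentID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT
);
""")

# INSERT a new record, then UPDATE it, then SELECT it back.
conn.execute(
    "INSERT INTO Students (StudentID, FirstName, LastName) VALUES (1, 'John', 'Doe')"
)
conn.execute("UPDATE Students SET FirstName = 'Jane' WHERE StudentID = 1")
after_update = conn.execute("SELECT * FROM Students WHERE StudentID = 1").fetchall()

# Hypothetical incoming data: one existing student and one new student.
conn.execute("INSERT INTO UpdatedStudents VALUES (1, 'Janet', 'Doe'), (2, 'Sam', 'Lee')")

# MERGE stand-in, part 1: update rows that match on StudentID.
conn.execute("""
    UPDATE Students
    SET FirstName = (SELECT u.FirstName FROM UpdatedStudents AS u
                     WHERE u.StudentID = Students.StudentID),
        LastName  = (SELECT u.LastName FROM UpdatedStudents AS u
                     WHERE u.StudentID = Students.StudentID)
    WHERE StudentID IN (SELECT StudentID FROM UpdatedStudents)
""")
# MERGE stand-in, part 2: insert rows that have no match.
conn.execute("""
    INSERT INTO Students (StudentID, FirstName, LastName)
    SELECT StudentID, FirstName, LastName FROM UpdatedStudents
    WHERE StudentID NOT IN (SELECT StudentID FROM Students)
""")
after_merge = conn.execute("SELECT * FROM Students ORDER BY StudentID").fetchall()

# DELETE the record whose StudentID is 1.
conn.execute("DELETE FROM Students WHERE StudentID = 1")
after_delete = conn.execute("SELECT * FROM Students ORDER BY StudentID").fetchall()

print(after_update, after_merge, after_delete)
```

Always pair UPDATE and DELETE with a WHERE clause unless you really intend to change every row in the table.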

DATA CONTROL LANGUAGE

You will use DCL to manage permissions on database objects for users and roles; it is used to control access to databases by securing them. The common statements are GRANT, REVOKE, and DENY, as shown in Table 2-9.

TABLE 2-9  DCL statements

Figure 2-12 shows how to grant User1 permission to execute SELECT and INSERT operations on the Students table.

FIGURE 2-12  DCL GRANT statement

Figure 2-13 shows how to revoke User1’s permission to perform INSERT operations on the Students table.

FIGURE 2-13  DCL REVOKE statement

Figure 2-14 shows how to deny User1 the permission to perform DELETE operations on the Students table. Even if another rule grants User1 the DELETE permission, this DENY rule will override it.

FIGURE 2-14  DCL DENY statement

Understanding these DCL statements is important for managing access control and ensuring data security in your database. Practice using these statements to gain a clear understanding of their impact on database operations. Make sure to always carefully consider the implications of granting, revoking, or denying permissions to maintain the integrity and security of your data.
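
The way GRANT, REVOKE, and DENY interact can be pictured with a small toy model in Python. This is only a conceptual sketch: the TablePermissions class and its methods are hypothetical names invented for the example, and a real database engine evaluates permissions across users, roles, and object hierarchies in far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class TablePermissions:
    """Toy model of permissions on one table (hypothetical, for illustration)."""

    granted: set = field(default_factory=set)
    denied: set = field(default_factory=set)

    def grant(self, user, action):
        # GRANT: give the user permission to perform the action.
        self.granted.add((user, action))

    def revoke(self, user, action):
        # REVOKE: remove a previously issued GRANT or DENY.
        self.granted.discard((user, action))
        self.denied.discard((user, action))

    def deny(self, user, action):
        # DENY: explicitly block the action.
        self.denied.add((user, action))

    def is_allowed(self, user, action):
        # A DENY overrides any GRANT for the same user and action.
        key = (user, action)
        return key in self.granted and key not in self.denied


students = TablePermissions()
students.grant("User1", "SELECT")
students.grant("User1", "INSERT")
students.revoke("User1", "INSERT")   # INSERT is no longer granted
students.grant("User1", "DELETE")
students.deny("User1", "DELETE")     # the DENY wins over the GRANT

print(students.is_allowed("User1", "SELECT"))
print(students.is_allowed("User1", "INSERT"))
print(students.is_allowed("User1", "DELETE"))
```

The key design point the model captures is precedence: an explicit deny wins over a grant, while a revoke simply removes whatever was previously recorded.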

TRANSACTION CONTROL LANGUAGE

TCL commands are used to manage transactions in the database. They include COMMIT, ROLLBACK, and SAVEPOINT, as shown in Table 2-10.

TABLE 2-10  TCL statements

Figure 2-15 shows how you initialize a new transaction. Any SQL statements following this will be part of this transaction.

FIGURE 2-15  TCL BEGIN TRANSACTION statement

Figure 2-16 shows how to start a transaction, make an update to the Students table, and then commit the transaction. This means the update is saved and cannot be rolled back.

FIGURE 2-16  TCL COMMIT statement

Figure 2-17 shows a transaction that starts and makes an update to the Students table. You then decide to roll back the transaction, meaning the update to the Students table will not be saved and the data remains as it was before the transaction began.

FIGURE 2-17  TCL ROLLBACK statement

Figure 2-18 shows a transaction that makes an update to the Students table, creates a savepoint, and then makes another update. You then decide to roll back the transaction to the savepoint. The first update is kept and will be saved when the transaction is committed, but the second update is undone.

FIGURE 2-18  TCL SAVEPOINT statement

These TCL commands are crucial for maintaining the integrity of your data, especially in situations where you need to perform multiple related operations as a single atomic unit. If any of these operations fails, the TCL commands allow you to roll back the entire transaction, preventing your data from ending up in an inconsistent state.
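
The transaction behavior described above can be observed with Python’s sqlite3 module. This sketch assumes SQLite’s syntax (BEGIN, SAVEPOINT, ROLLBACK TO SAVEPOINT); T-SQL expresses the same ideas with BEGIN TRANSACTION and SAVE TRANSACTION. The savepoint name and the sample rows are hypothetical.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the explicit
# BEGIN / COMMIT / ROLLBACK statements below control each transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, FirstName TEXT)")
conn.execute("INSERT INTO Students VALUES (1, 'John'), (2, 'Mary')")


def first_names():
    query = "SELECT FirstName FROM Students ORDER BY StudentID"
    return [row[0] for row in conn.execute(query)]


# Transaction 1: two updates with a savepoint in between.
conn.execute("BEGIN")
conn.execute("UPDATE Students SET FirstName = 'Jane' WHERE StudentID = 1")
conn.execute("SAVEPOINT after_first_update")
conn.execute("UPDATE Students SET FirstName = 'Maria' WHERE StudentID = 2")
conn.execute("ROLLBACK TO SAVEPOINT after_first_update")  # undo second update only
conn.execute("COMMIT")                                    # first update is saved
names_after_commit = first_names()

# Transaction 2: an update that is rolled back entirely.
conn.execute("BEGIN")
conn.execute("UPDATE Students SET FirstName = 'Changed' WHERE StudentID = 1")
conn.execute("ROLLBACK")
names_after_rollback = first_names()

print(names_after_commit, names_after_rollback)
```

Rolling back to a savepoint does not end the transaction; the remaining work still has to be committed or rolled back as a whole.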

Identify common database objects

Database objects are essential components that you will interact with while working with databases. These objects are utilized to store, manipulate, and retrieve the data housed within your database.

  • Tables: At the heart of any database are tables. They are the structure that holds the data. A table consists of rows (or records) and columns (or fields). Each column represents a category of data, and each row contains the actual data values, as shown in Table 2-11.

TABLE 2-11 Table object

  • Views: A view is a virtual table based on the result set of a SQL statement. Views allow you to present the data from a different perspective than that of the base table, and they can hide complexity or secure sensitive data. Figure 2-19 shows how you would create a view called TopStudent that shows only those students with a GPA greater than 3.5.

FIGURE 2-19 Simplifying data queries with a view to display top students

  • Indexes: As you delve deeper into your interaction with databases, you’ll find that indexes are fundamental to optimizing performance and enhancing the speed of data retrieval.

An index, in the context of a database, is similar to an index in a book. In a book, an index allows you to find information quickly without having to read every page. Similarly, a database index provides a quick lookup pathway to the data in a table.

Creating an index involves the CREATE INDEX statement. When you create an index on a table column, the database management system (DBMS) generates a data structure that allows it to find the data associated with the indexed columns more swiftly.

Let’s look at an example. Say you frequently execute a query as shown in Figure 2-20 that searches students based on their last names in the Students table.

FIGURE 2-20  The search query

To speed up this query, you can create an index on the LastName column, as shown in Figure 2-21.

FIGURE 2-21  Enhancing query efficiency with an index on the LastName column in the Students table

Now, the DBMS will use the idx_lastname index to find Smith rapidly, which can lead to significantly faster query performance.

However, while indexes are powerful, they should be used judiciously. Creating an index is not a trivial task for the DBMS. It takes time and consumes storage space. Moreover, every time a table’s data changes (via an INSERT, UPDATE, or DELETE statement), the associated indexes also need to be updated. Thus, having too many indexes, especially on a table that frequently changes, can actually degrade the performance.

Therefore, a balanced approach is required when dealing with indexes. They should be created on columns that are frequently searched or used in the WHERE clause, JOIN operations, or sorting data (ORDER BY).

By implementing indexes in your database, you’re well on your way to creating efficient, performance-tuned data retrieval options. This knowledge will greatly enhance your ability to effectively manage data in a relational database.

  • Stored procedures: These are a group of SQL statements that form a logical unit and perform a particular task. Stored procedures can be used to encapsulate logic, enforce security, and increase performance.

Figure 2-22 shows how you could create a stored procedure to add a new student to the Students table.

FIGURE 2-22 Stored procedure

  • Triggers: Triggers are special types of stored procedures that automatically execute when an event occurs in the database server. Triggers can help maintain the integrity of the data. You can create a trigger that automatically updates a LastModified field in the Students table whenever a record is updated, as shown in Figure 2-23.

FIGURE 2-23 Trigger implementation to auto-update the LastModified timestamp in the Students table

By understanding and utilizing these objects, you can effectively manage your data and optimize your interactions with the database.
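
Several of the objects described above (a view, an index, and a trigger) can be explored together in a short Python sketch using sqlite3. The object names TopStudent and idx_lastname come from the text; the trigger name, the column layout, and the sample rows are assumptions. SQLite does not support stored procedures, so that object type is left out, and the trigger syntax shown is SQLite’s rather than T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (
    StudentID    INTEGER PRIMARY KEY,
    FirstName    TEXT,
    LastName     TEXT,
    GPA          REAL,
    LastModified TEXT
);
-- Hypothetical sample rows.
INSERT INTO Students (StudentID, FirstName, LastName, GPA) VALUES
    (1, 'John', 'Smith', 3.8),
    (2, 'Jane', 'Doe',   3.2);

-- View: a virtual table showing only students with a GPA above 3.5.
CREATE VIEW TopStudent AS
    SELECT StudentID, FirstName, LastName, GPA FROM Students WHERE GPA > 3.5;

-- Index: speeds up lookups that filter on LastName.
CREATE INDEX idx_lastname ON Students (LastName);

-- Trigger: stamp LastModified whenever a student's details change.
CREATE TRIGGER trg_students_last_modified
AFTER UPDATE OF FirstName, LastName, GPA ON Students
BEGIN
    UPDATE Students SET LastModified = datetime('now')
    WHERE StudentID = NEW.StudentID;
END;
""")

# Query the view exactly as if it were a table.
top_students = conn.execute("SELECT FirstName FROM TopStudent").fetchall()

# Ask SQLite how it would run the search; the plan should mention the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Students WHERE LastName = 'Smith'"
).fetchall()
uses_index = any("idx_lastname" in str(row) for row in plan)

# Updating a row fires the trigger, which fills in LastModified.
conn.execute("UPDATE Students SET GPA = 3.6 WHERE StudentID = 2")
last_modified = conn.execute(
    "SELECT LastModified FROM Students WHERE StudentID = 2"
).fetchone()[0]

print(top_students, uses_index, last_modified is not None)
```

Checking the query plan is a useful habit: it confirms whether the engine actually uses an index you created instead of assuming that it does.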

Skill 2.2: Describe relational Azure data services

Having gained some familiarity with relational concepts and SQL, your next step is to explore relational data services within Azure. The Microsoft Azure cloud platform offers a suite of comprehensive and robust data services, catering to a wide array of relational database needs.

In this section, you will uncover the specifics of Azure’s relational data offerings. You’ll delve into the Azure SQL family of products, learning how each product fits into different scenarios based on the specific requirements. You’ll explore the capabilities and uses of SQL Server on Azure Virtual Machines and how it offers flexibility in the migration process.

Furthermore, you will get acquainted with Azure database services for open-source database systems. Open-source databases have their unique features and community support, and Azure provides first-class support for these systems, allowing you to leverage their capabilities in a cloud environment.

Upon completion of this section, you will have a good understanding of Azure’s relational data services, empowering you to make informed decisions about which service best suits your data requirements. Let’s begin this exciting journey into the world of Azure relational data services.

This skill covers how to:

Describe the Azure SQL family of products, including Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure Virtual Machines

Microsoft Azure provides a robust suite of SQL offerings, known collectively as the Azure SQL family. This portfolio of SQL-based services in Azure caters to a variety of needs, ranging from managing relational data to intelligent, cloud-native relational databases.

  • Azure SQL Database: Azure SQL Database is a fully managed relational database service built on the SQL Server engine. It’s an intelligent, scalable service that offers seamless integration with Azure services such as Azure Active Directory (now Microsoft Entra ID) and Power BI. Azure SQL Database supports built-in intelligence that learns your unique database patterns and automatically tunes the database for improved performance and protection. You can use it to build data-driven applications and websites in your programming language of choice, without needing to manage any infrastructure.
  • Azure SQL Managed Instance: Azure SQL Managed Instance provides the broadest SQL Server engine compatibility among the fully managed options, along with native virtual network (VNet) support. It’s a fully managed service that lets you migrate your SQL Server workloads to Azure with minimal or no code changes. SQL Managed Instance is best for most migrations to the cloud as it provides a high degree of compatibility with the SQL Server programming surface area.
  • SQL Server on Azure Virtual Machines: SQL Server on Azure Virtual Machines lets you run SQL Server inside a virtual machine in the cloud. This service is your best choice when you require full control over the SQL Server engine and the VM it runs on. It is suitable for migrating existing applications to the cloud with minimal changes or when you need SQL Server features that are not supported in Azure SQL Database or Azure SQL Managed Instance.

  • Azure Synapse Analytics: Formerly Azure SQL Data Warehouse, Azure Synapse Analytics is an analytics service that brings together big data and data warehousing. It gives you the freedom to query data on your terms, using serverless on-demand or provisioned resources.
  • Azure SQL Edge: Azure SQL Edge is an optimized relational database engine geared toward Internet of Things (IoT) and edge computing scenarios. It offers a small footprint that can run on a variety of devices, from low-power edge devices to high-performance edge servers.

Azure SQL Edge provides the same SQL engine that powers SQL Server and Azure SQL, making it fully compatible with your existing SQL tools and skills. It supports data streaming through built-in stream analytics, machine learning scoring, and storage tiering.

What’s even more exciting is that it allows data to be processed close to the source, minimizing latency and bandwidth usage. It can operate in connected, disconnected, and hybrid environments, synchronizing data with Azure SQL Database or Azure SQL Managed Instance whenever connectivity is available.

Figure 2-24 illustrates how Azure SQL Edge can fit into an IoT solution architecture.

FIGURE 2-24  Azure SQL Edge