Approximate COUNT DISTINCT

Published On: 2019-01-03

We have all written queries that use COUNT DISTINCT to get the number of unique, non-NULL values in a table. This can cause a noticeable performance hit, especially on larger tables with millions of rows, and many times there is no way around it. To help mitigate this overhead, SQL Server 2019 introduces the new APPROX_COUNT_DISTINCT function, which approximates the distinct count to within about 2% of the actual answer in a fraction of the time.

Let’s see this in action.

In this example, I am using the AdventureworksDW2016CTP3 sample database, which you can download here.

SET STATISTICS TIME ON
SELECT COUNT(DISTINCT([SalesOrderNumber])) as DISTINCTCOUNT
FROM [dbo].[FactResellerSalesXL_PageCompressed]

SQL Server Execution Times:  CPU time = 3828 ms,  elapsed time = 14281 ms.

SELECT APPROX_COUNT_DISTINCT([SalesOrderNumber]) as APPROX_DISTINCTCOUNT
FROM [dbo].[FactResellerSalesXL_PageCompressed]

SQL Server Execution Times: CPU time = 7390 ms,  elapsed time = 4071 ms.

You can see the elapsed time is significantly lower! A great improvement from this new function.

The first time I did this, I did it wrong. A silly typo with a major result difference. So take a moment and learn from my mistake.

Note that I use COUNT(DISTINCT(SalesOrderNumber)), not DISTINCT COUNT(SalesOrderNumber). This makes all the difference. If you get it wrong, the numbers will be way off, as you can see from the result set below. You'll also find that the APPROX_COUNT_DISTINCT query returns much slower than the DISTINCT COUNT one, which is not what you would expect.

Remember, COUNT(DISTINCT expression) evaluates the expression for each row in a group and returns the number of unique, non-NULL values, which is what APPROX_COUNT_DISTINCT approximates. DISTINCT COUNT(expression) just returns a row count of the expression; there is nothing DISTINCT about it, because COUNT produces a single row and the DISTINCT is applied to that one value.
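To make the difference concrete, here is a minimal sketch against the same table (the aliases are mine). In the first query DISTINCT is applied inside the aggregate; in the second, COUNT runs first and returns a single row, so the DISTINCT has nothing left to do:

-- Correct: number of unique, non-NULL order numbers
SELECT COUNT(DISTINCT([SalesOrderNumber])) as DISTINCTCOUNT
FROM [dbo].[FactResellerSalesXL_PageCompressed]

-- Wrong: DISTINCT applies to the single row COUNT returns, so this is just a row count
SELECT DISTINCT COUNT([SalesOrderNumber]) as NOTDISTINCTCOUNT
FROM [dbo].[FactResellerSalesXL_PageCompressed]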

Always fun tinkering with something new!


Static Data Masking (SSMS 18.0 Preview)

Published On: 2018-12-19

Ever need a test database on hand that lets others query real-looking data without handing them actual production values? In the SQL Server Management Studio (SSMS) 18.0 preview, Microsoft introduces Static Data Masking. Static Data Masking is a new feature that allows you to create a cloned copy of your database and replace sensitive data with new, fake data (referred to as masked). You can use this for things like developing business reports and analytics, troubleshooting, database development, and even sharing data with outside teams or third parties. Unlike Dynamic Data Masking, added in SQL Server 2016, this feature does not hide the data with characters; rather, it replaces the entire value. For example, with Dynamic Data Masking the name Peter becomes Pxxxx, whereas Static Data Masking changes Peter to Paul. This makes it very easy to use in place of production. Let's see it in action. If you are not on a newer version of SSMS, don't worry, you can download it here.
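For contrast, here is a minimal sketch of the Dynamic Data Masking approach from SQL Server 2016, which obscures values at query time for non-privileged users instead of rewriting the stored data. The table and column are just illustrative, borrowed from the sample database used below:

-- Dynamic Data Masking: keep the first character, pad the rest with x's
-- Privileged users still see the real value; the stored data never changes
ALTER TABLE dbo.DimEmployee
ALTER COLUMN FirstName ADD MASKED WITH (FUNCTION = 'partial(1,"xxxx",0)');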

To get started, right-click the database you want to clone and mask and go to Tasks. Down towards the bottom you will see Mask Database (Preview). Click on that.

In this example I am using the AdventureworksDW2016CTP3 sample database. Exploring the window that pops up, you can see there are three main steps to complete. First, choose which tables and fields you want to mask and apply the masking type you want. You'll note it gives you the ability to filter for specific columns, like Social Security Number. Using this will narrow down the list of fields you need to configure, or you can simply mask all columns in the database. For my example we will be using just one table, DimEmployee. Second, choose the location of the backup file. Static Data Masking uses a backup (.bak) file to create the cloned, masked database; we will use the default location for step two. The third step is to name your masked database. Note, highlighted in yellow, that the process is fully logged, and that log is linked in this screen.

To configure masking on the DimEmployee table, I clicked the arrow next to the table name to expand the column list. It gives me the option to mask all columns in the table, but to keep it simple we will stick to three: BaseRate, DepartmentName, and EmailAddress.

You'll see several options to choose from, described below per MSDN. We chose Shuffle for BaseRate and DepartmentName, and used a supplied replacement value, Single Value, for EmailAddress.

NULL masking: Replaces all the values in the column with NULL. If the column does not allow NULL values, the Static Data Masking tool will return an error.

Single-value masking: Replaces all the values in the column with a single fixed value, specified by the user.

Shuffle masking: All the values in the column are shuffled to new rows. No new data is generated. Shuffle masking provides the option to maintain NULL entries in the column.

Group Shuffle masking: Binds several columns together in a shuffle group. The columns in a shuffle group are shuffled together.

String Composite masking: Generates random strings along a pattern. It is designed for strings that must follow a predefined pattern to be a valid entry. For example, American social security numbers have the format 123-45-6789. The syntax for String Composite masking is specified in the dialog box where the user enters the pattern.

When choosing the Single Value or String Composite masking options, click Configure to enter the value you want. Check out MSDN for other value options you can use, such as string patterns. After clicking Configure, a window will appear for you to enter the new value.

Lastly, in step three, I named the database AdventureworksDW2016CTP3_masked and hit OK. The process took a little longer than I thought it would, but given it is replacing a lot of data, the wait wasn't too bad.

Once complete you will get a "Masking was successful" message. Note that to create the new cloned database the system took a backup. Here it reminds you where that backup is stored so you can remove it; it's the same location provided in step two above.
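If you lose track of where that backup landed, the standard msdb backup history is one place to look. This is a sketch, and it assumes the masking tool records its backup there the way an ordinary BACKUP DATABASE would:

-- Most recent backups recorded on this instance, with their file paths
SELECT TOP (5)
       bs.database_name,
       bmf.physical_device_name,
       bs.backup_finish_date
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
  ON bs.media_set_id = bmf.media_set_id
ORDER BY bs.backup_finish_date DESC;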

If you refresh the databases in Object Explorer you will see a new cloned database was created based on the name we supplied.

Now let's see exactly what this accomplished for us by querying the original DimEmployee table and the new, masked clone.

SELECT TOP (1000)
       'Original' as MaskTest
      ,[FirstName]
      ,[LastName]
      ,[MiddleName]
      ,[BirthDate]
      ,[BaseRate]
      ,[DepartmentName]
      ,[EmailAddress]
FROM [AdventureworksDW2016CTP3].[dbo].[DimEmployee]

SELECT TOP (1000)
       'Masked' as MaskTest
      ,[FirstName]
      ,[LastName]
      ,[MiddleName]
      ,[BirthDate]
      ,[BaseRate]
      ,[DepartmentName]
      ,[EmailAddress]
FROM [AdventureworksDW2016CTP3_masked].[dbo].[DimEmployee]

It's very easy to see what it has done. It simply shuffled the data around, replacing values with random but real data values in the BaseRate and DepartmentName columns. For EmailAddress it changed every row to the same value I supplied. I can see this being extremely useful in report development and analytics, and it was very easy to accomplish.

A couple of things I ran into while playing with this new feature: identity columns are not supported, and you will see a red X next to those fields. Additionally, if you choose Group Shuffle you must choose it for more than one field in the table. Beyond these, there are only a few more limitations worth mentioning: it cannot be used with temporal tables, memory-optimized tables, computed columns, or geography data types.

All in all, this process was really easy to set up. I caution you, though: before doing this, make sure you have enough disk space for the backup and the new database. While testing this I made several copies, which consumed quite a bit of space, and even when the process failed the files were created. Be sure to clean up those .bak files after you're done.
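As a rough pre-flight check (a sketch using nothing beyond the standard catalog views), you can size up the source database first; the clone plus its .bak file will need roughly this much free space again:

-- Approximate size of the database about to be cloned and backed up
SELECT DB_NAME(database_id) AS database_name,
       SUM(size) * 8 / 1024  AS size_mb   -- size is counted in 8 KB pages
FROM sys.master_files
WHERE database_id = DB_ID('AdventureworksDW2016CTP3')
GROUP BY database_id;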

SQL Server Vulnerability Assessment

Published On: 2018-12-12

SQL Server Vulnerability Assessment (VA) in SQL Server Management Studio 17.4 or later lets SQL Server scan your databases for potential security vulnerabilities, and it can be run against SQL Server 2012 or higher. If you are not on a newer version of SSMS, don't worry, you can download it here.

Running any kind of scan against data always concerns me, as performance impacts can really ruin your day. Luckily, VA is lightweight and runs without performance impact while still giving you an in-depth view of where you could improve your SQL Server's security. The process is designed to meet data privacy standards and compliance using knowledge base rules that look for deviations from Microsoft's best practices.

To run an assessment, simply choose a database, right-click it, and choose Tasks. Here you will see Vulnerability Assessment; choose that, then Scan for Vulnerabilities. If you have run one previously, you can access it here by choosing Open Existing Scan.

It will pop up a window to choose where you want the results saved. Once you click OK, the process will run.

Here you can see my results from the AdventureworksDW2016CTP3 database. It has 6 failed items and 47 passed. It lists each item and assigns an appropriate risk level.

Clicking on one of the listed items under failed gives you more details and remediation steps with scripts to fix it. Let’s look.

In this example, I chose an easy one: this database is not auditing for successful and failed login attempts. You can see below that it gives us a description of the best practice rule not followed and provides a query we can run to see the results. I really like this feature; it's a handy script to keep for later use when evaluating another server's health. It even gives us a little copy button to copy out the script, and the option to open it in a query window.

If you scroll down further, you will get to the recommended remediation steps and script. If there is no script provided, it will give you a link to the proper documentation on how to fix the issue. In my opinion, from what I have seen, VA does a good job explaining what's needed to fix the issue. Always keep in mind, although these scripts are created by Microsoft, I suggest running them in test before production and taking the time to fully understand what they are doing.
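For the login-auditing failure above, the remediation is along these lines. This is a representative sketch, not the exact script VA generates; the audit name and file path are mine:

USE master;
GO
-- Server audit writing to disk (path is illustrative)
CREATE SERVER AUDIT LoginAudit
TO FILE (FILEPATH = N'C:\SQLAudit\');
GO
-- Capture both successful and failed login attempts
CREATE SERVER AUDIT SPECIFICATION LoginAuditSpec
FOR SERVER AUDIT LoginAudit
ADD (SUCCESSFUL_LOGIN_GROUP),
ADD (FAILED_LOGIN_GROUP)
WITH (STATE = ON);
GO
-- Turn the audit on
ALTER SERVER AUDIT LoginAudit WITH (STATE = ON);
GO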

You may have noticed in the two screenshots above that I have drawn a box around BASELINE and Approve as Baseline. A baseline allows you to customize how the results are reported, which helps reduce clutter on future scans.

By marking a result as a BASELINE you are telling VA that it is acceptable in your environment, even though it may not meet best practices or regulatory standards. Anything that matches the baseline is marked as passed in subsequent scans, with the reason for passing noted as Per Custom Baseline.

We can see this when I run another scan. You'll note the report now shows only 5 failed items (without me fixing the issue), and the additional information column shows the baseline as the reason.

SQL Server Vulnerability Assessment is a great non-third-party starting place for evaluating data privacy, security, and compliance standards, and it is very easy to use. Give it a try and see where your servers stand.

SQL Data Discovery and Classification in SSMS 18.0

Published On: 2018-12-05

Data exploration and documentation can be cumbersome, and classifying data is one of those tedious but necessary tasks. With the introduction of regulations such as GDPR, there is an ever-greater need to classify and label your sensitive data. Microsoft attempts to help us out with the new SQL Data Discovery and Classification feature in SQL Server Management Studio (SSMS). This new feature is available in Azure SQL Database and SSMS 18.0 and higher, and it can be used against databases on SQL Server 2008 (compatibility mode 100) and greater.

This feature scans through your database tables and identifies columns that contain potentially sensitive data. It then allows you to categorize that data, and it provides a detailed report for auditing and compliance purposes.

Let’s see how it is done.

In SSMS, right-click on the database you want to use, go to Tasks, then choose Data Discovery and Classification and Classify Data. In this example, I am using the AdventureworksDW2016CTP3 database.

A results window will pop up showing how many field recommendations it has found. Click on it to view them.

When you view the data, you will see it broken down by Schema, Table, and Column. SSMS then attempts to categorize the information (information types) and estimate a sensitivity level (sensitivity label). It allows you to accept a recommendation by checking the box on the left-hand side. If it's not quite what you wanted, you can adjust the Information Type and Sensitivity. Once you are satisfied with the category assignments, click the blue "Accept selected recommendations" button located at the top and choose Save.

As shown below, I have classified 10 columns and have 64 left unclassified.

It also gives me the ability to manually add a field and classification by choosing the Add Classification button at the top.

As I mentioned previously, this feature provides a way to report on the data classification.

You can retrieve the report by clicking View Report, located adjacent to the Add Classification button at the top.

*Note: if your report is blank, you forgot to hit Save while classifying.

The report breaks all your data down nicely by schema, information type, and sensitivity. You can see how this can be very useful and insightful when reviewing data for compliance. It does take a little time to go through and validate the results of the classification, but while the process might be lengthy, in the long run it is well worth the time.
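If you would rather get at the classifications in T-SQL than through the report, SSMS stores them as extended properties on the columns (on versions before SQL Server 2019, at least). A sketch, assuming the sys_information_type_name and sys_sensitivity_label_name property names SSMS uses:

-- Columns classified by SQL Data Discovery and Classification
SELECT SCHEMA_NAME(t.schema_id) AS schema_name,
       t.name   AS table_name,
       c.name   AS column_name,
       ep.name  AS property_name,
       ep.value AS property_value
FROM sys.extended_properties AS ep
JOIN sys.tables  AS t ON ep.major_id = t.object_id
JOIN sys.columns AS c ON ep.major_id = c.object_id
                     AND ep.minor_id = c.column_id
WHERE ep.name IN (N'sys_information_type_name', N'sys_sensitivity_label_name');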
