A

B

C

NULL

D

… and the following T-SQL command;

SELECT COUNT(DISTINCT(<value>))

FROM <table>

In SQL Server, running the above command on the record set will return 4, because the NULL value is ignored. What about the equivalent command in Analysis Services? More on that shortly …

I’m currently implementing a BI solution for a client in which the fact table contains a “Purchase Order” column. The granularity of the fact table is the invoice line item, with dimensions for products, clients, date etc …. Two of the questions this solution needs to answer are;

1. How many invoices did we process?, and

2. How many purchase orders did we process?

The answer to the first question is easy. Every invoice in the fact table contains an invoice number so we can use a simple DistinctCount aggregate function to return the number of invoices as a measure. Note we use *distinct* count instead of count because the granularity of the fact table is the line item, so many rows will have the same invoice number.

Answering the second question (how many purchase orders) is a little bit more difficult. Not every invoice has a corresponding purchase order. “Direct invoices” are those that are paid without needing to have a pre-existing purchase order. In these cases, the Purchase Order column is NULL.

Back to the example at the start of the post; a distinct count in Analysis Services works differently than T-SQL. Where T-SQL will return 4, Analysis Services returns 5, considering NULL as something different than the other values. As a result, using the same DistinctCount aggregate function on the purchase order column will return one more than it should, assuming there are some invoices with a NULL purchase order value.

There’s a few ways around this problem, but the method I used was to create a new named calculation column in the data source view called NULL_PO_Exists with the following expression;

CASE WHEN PurchaseOrder IS NULL THEN 1 ELSE 0 END

Next, I created a new hidden measure called [Maximum NULL PO Exists] which used the MAX aggregate function over the NULL_PO_Exists column.

Next up, I set the visible property to false for the existing Purchase Order count measure called [Total Purchase Orders], and created a new calculation [Purchase Order Total] which used the following expression;

[Measures].[Total Purchase Orders] - [Measures].[Maximum NULL PO Exists]

In summary, the new measure is using the old measure as a base, and then subtracting 1 (if NULLS exist) or 0 (if no NULLS exist). A bit of fiddling around, but it works :-)

Finally, there’s a number of performance considerations when using distinct count. This white paper from Denny Lee does a great job of explaining the problem and a number of best practices for optimizing its performance.

Cheers,

Rod.