What is the right choice between Direct Lake and Import mode in Power BI?
At SQLBI, we do not publish content about new features until we have had enough time to experiment with them and collect data. We usually do not break the news unless we have been able to test a feature in preview and consider the released service solid enough. This did not happen with Direct Lake, and we have not published any Direct Lake content yet; it did not seem urgent, for reasons we will explain shortly. However, the feedback collected from many attendees of SQLBits 2024 and the first Microsoft Fabric Conference raised the alarm: too many people have an incorrect perception of Direct Lake, which should be addressed as soon as possible to avoid architectural mistakes and unnecessary additional implementation costs.
UPDATE 2024-09-24: Microsoft published important updates to the Direct Lake documentation that address many points described in the following article, clarifying many of the misconceptions that required writing this blog post in April. Hopefully, this update will provide a better perception, suggesting Direct Lake only when it is really needed despite its limitations, assuming that readers have the patience to read all the details provided!
Kurt Buhler and I were so concerned that I wrote this blog post just after the conference; Kurt helped me review the content and provided more ideas. We know that every single topic deserves an entire article, and we will write them once we have enough data and experience. But the damage of a wrong decision based on the enthusiasm for a new feature would have been too high if we had waited longer.
Thus, for those of you who do not have time to read further, here is a summary:
- Queries in Import mode are faster than in Direct Lake mode: Only in the best possible scenario is query performance similar, which means that Direct Lake is usually slower, even though the difference can be negligible.
- Direct Lake is not a replacement for Import mode: Fabric provides Direct Lake as an additional option, but Import and DirectQuery modes are also available.
- Direct Lake is not a replacement for DirectQuery mode: Direct Lake suits near-real-time scenarios similar to Import mode, but it does not handle real-time scenarios like DirectQuery.
- Direct Lake is not the “best of both worlds” for Import and DirectQuery modes: Direct Lake is unique and different from Import and DirectQuery; it suits specific scenarios, is not faster than Import, and does not replace DirectQuery.
- Import and Direct Lake consume the same amount of memory at query time: When you use the large semantic model format with an Import model, only the columns used in queries are loaded in memory. This is also a feature of Direct Lake, but it is not unique to Direct Lake.
- Direct Lake has modeling limitations compared to Import: You cannot use calculated columns, calculated tables, and MDX user hierarchies in Direct Lake. The latter impacts Excel’s user experience when consuming semantic models published on Power BI.
- Direct Lake requires physical tables for the semantic model: If you use a view, the engine falls back to DirectQuery, losing the performance level close to Import. Creating physical tables that match the semantic model requirements 100% often requires creating an additional copy of the data.
- Direct Lake models require more technical skills to achieve optimal compression: The compression in Import mode is managed by VertiPaq, whereas in Direct Lake, it depends on how the data is loaded into the Delta format.
Here are some longer considerations if you have time. We will publish much longer and more detailed articles on these topics in the future, but the following considerations do not require benchmarks and tests; they are simply a direct consequence of the existing architecture.
Queries in Import mode are faster than in Direct Lake mode
When you use Direct Lake, the columns in Delta format are loaded in memory by the same VertiPaq engine used in Import mode. When the data is not yet in memory, the first time a query runs, the data must be transcoded into the VertiPaq format. In Import mode, the binary data is loaded directly in memory, whereas in Direct Lake, additional processing is required to adapt the data to the VertiPaq format. Thus, the initial loading cannot be faster than Import, and the following queries cannot be faster either, because the in-memory format is the same. Moreover, once the data is in memory, query performance strongly depends on the compression level of the Parquet files used in the Delta format, as described later.
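As an illustration, here is a minimal sketch, assuming a Fabric notebook with the semantic-link (sempy) library, that makes the cold-versus-warm effect visible. The model name and the DAX query are hypothetical placeholders:

```python
# Minimal sketch: run right after a refresh of the model, so that no column
# is resident in memory yet. The first execution pays the transcoding cost
# from Delta/Parquet to VertiPaq; the second runs on data already in memory.
import time
import sempy.fabric as fabric

DATASET = "Sales Model"  # hypothetical Direct Lake semantic model
DAX = 'EVALUATE ROW("Total Amount", SUM(Sales[Amount]))'  # hypothetical names

for attempt in ("cold", "warm"):
    start = time.perf_counter()
    fabric.evaluate_dax(dataset=DATASET, dax_string=DAX)
    print(f"{attempt} run: {time.perf_counter() - start:.2f} s")
```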
Direct Lake is not a replacement for Import mode
The fact that Direct Lake is new does not imply that it is better for every use case. Indeed, it is an additional option for those scenarios where a lakehouse already contains tables that can be loaded in memory directly, without additional transformations.
Direct Lake is not a replacement for DirectQuery mode
To see data changes in Direct Lake, a fast but necessary refresh of the semantic model is required; this refresh (also known as framing) clears the cache and reloads data from the lakehouse into memory, and it can be automatic or manual. In DirectQuery, this is never necessary, and there is never latency in real-time scenarios (at the expense of query performance). Direct Lake can suit near-real-time scenarios similar to Import mode, providing different options that allow more control over the updates to the lakehouse.
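For reference, the manual refresh can be triggered programmatically. Here is a minimal sketch using the standard Power BI REST API dataset-refresh endpoint; the IDs and the token acquisition are placeholders to fill in:

```python
# Minimal sketch: queue a refresh (framing) of a semantic model through the
# Power BI REST API. Acquire the Azure AD token with MSAL or a similar
# library; the GUIDs below are placeholders.
import requests

WORKSPACE_ID = "<workspace-guid>"  # hypothetical
DATASET_ID = "<dataset-guid>"      # hypothetical
TOKEN = "<aad-access-token>"

url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
    f"/datasets/{DATASET_ID}/refreshes"
)
resp = requests.post(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()  # 202 Accepted means the refresh was queued
```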
Direct Lake is not the “best of both worlds” for Import and DirectQuery
Direct Lake is neither Import nor DirectQuery. Direct Lake is better than DirectQuery at managing scenarios where the data volume does not fit in memory but only a fraction of the columns are queried, so that the probability of a fallback to DirectQuery is minimal. Indeed, Direct Lake uses a mechanism similar to Import (though slower) when the data can be loaded in memory, and it falls back to DirectQuery when the memory is not available or when the semantic model is not compatible with Direct Lake (for example, when the data is loaded from a view).
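If the fallback is a concern, it can be controlled at the model level. Here is a minimal sketch, assuming the directLakeBehavior model property documented for TMSL (with values such as Automatic, DirectLakeOnly, and DirectQueryOnly), edited in a model.bim file before deployment:

```python
# Minimal sketch: disable the DirectQuery fallback by setting the
# directLakeBehavior property in a model.bim file. The property name and
# values reflect the TMSL documentation at the time of writing; verify them
# against your model's compatibility level.
import json

with open("model.bim", encoding="utf-8") as f:
    bim = json.load(f)

# "DirectLakeOnly" raises an error instead of silently falling back;
# "Automatic" (the default) allows the fallback to DirectQuery.
bim["model"]["directLakeBehavior"] = "DirectLakeOnly"

with open("model.bim", "w", encoding="utf-8") as f:
    json.dump(bim, f, indent=2)
```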
Import and Direct Lake consume the same amount of memory
The on-demand loading of columns in a semantic model is a feature that was released in December 2021. It is not unique to Direct Lake; it is available by default to all semantic models that use the large semantic model format.
Moving to Direct Lake for this feature alone usually does not make sense: Direct Lake and Import are almost identical in this regard! The only difference is that when the memory is not large enough to load the required columns, Direct Lake switches to DirectQuery (with a performance difference of orders of magnitude), whereas in Import mode, the users get an out-of-memory error. While this difference could be important for models larger than 200-400 GB, it is definitely not relevant for models below 10-20 GB, and there is a big “it depends” in the area in the middle.
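As a reminder that this behavior is available to Import models too, here is a minimal sketch that enables the large semantic model format on an existing Import model, assuming the targetStorageMode property of the Power BI REST API dataset update endpoint; the IDs and token are placeholders:

```python
# Minimal sketch: switch an Import model to the large semantic model format,
# which enables the on-demand loading of columns described above.
import requests

WORKSPACE_ID = "<workspace-guid>"  # hypothetical
DATASET_ID = "<dataset-guid>"      # hypothetical
TOKEN = "<aad-access-token>"

url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
    f"/datasets/{DATASET_ID}"
)
resp = requests.patch(
    url,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"targetStorageMode": "PremiumFiles"},  # large semantic model format
)
resp.raise_for_status()
```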
Direct Lake has modeling limitations compared to Import
The lack of calculated columns and calculated tables affects the ability to solve certain modeling issues within the semantic model. Because Direct Lake forces the model author to create these tables in the lakehouse, a data modeling need could turn into a request for change in the data preparation process. For example, not being able to use the VertiPaq engine to create aggregated calculated tables (like the snapshots required for non-visual totals in row-level security) could increase the overall processing time to refresh a semantic model: the calculation in the lakehouse could be slower than in VertiPaq. As usual, it depends, but calculated tables (and columns) guarantee data consistency with minimal development effort, and the development cost to implement similar model requirements in Direct Lake will probably be higher than in Import. At the moment, calculated tables that do not reference tables in the lakehouse are supported, so tables created for parameters should work, but the limitation is still strong for many use cases, such as a dynamic Date table with a range based on existing data, or a snapshot/aggregation table.
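As an example of the lakehouse-side workaround, here is a minimal PySpark sketch (for a Fabric notebook) that materializes a dynamic Date table in the lakehouse, deriving its range from the data; the Sales table and the OrderDate column are hypothetical:

```python
# Minimal sketch: build in the lakehouse what a calculated table would have
# produced in the semantic model - a Date dimension whose range follows the
# existing fact data.
from pyspark.sql import functions as F

sales = spark.read.table("Sales")  # hypothetical fact table
bounds = sales.agg(
    F.min("OrderDate").alias("min_d"), F.max("OrderDate").alias("max_d")
).first()

# sequence() over dates generates one row per day in the detected range
dates = (
    spark.sql(
        f"SELECT explode(sequence(DATE'{bounds.min_d}', DATE'{bounds.max_d}')) AS Date"
    )
    .withColumn("Year", F.year("Date"))
    .withColumn("Month", F.month("Date"))
)
dates.write.mode("overwrite").format("delta").saveAsTable("DimDate")
```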
Direct Lake models also do not support mixed storage modes, although this is likely a temporary limitation: at the moment, all the tables of a Direct Lake model must be stored in the same lakehouse or warehouse, and they must all use the same Direct Lake storage mode. Moreover, by default, Direct Lake models created in Fabric have a case-sensitive collation. This means that if you move your Import or DirectQuery model to Direct Lake, you may get different results from the two models despite having identical data and DAX. Make sure to use the right collation by using Tabular Editor, TMSL, or by manually editing and deploying the BIM file.
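For example, here is a minimal sketch that sets a case-insensitive collation by editing the model.bim file before deployment; the collation name is only an example, and keep in mind that the collation generally must be set before the model is populated with data:

```python
# Minimal sketch: set a case-insensitive collation on the model in a
# model.bim file, then redeploy. The collation name below is an example;
# pick the one matching your existing Import model.
import json

with open("model.bim", encoding="utf-8") as f:
    bim = json.load(f)

bim["model"]["collation"] = "Latin1_General_100_CI_AS"  # example value

with open("model.bim", "w", encoding="utf-8") as f:
    json.dump(bim, f, indent=2)
```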
Direct Lake requires physical tables for the semantic model
In order to use the VertiPaq engine, you must connect the semantic model to a physical table of the lakehouse in Direct Lake mode. If anything in the physical table requires even a small change to be represented in the semantic model, the simple view that could satisfy the requirement forces the engine to query the table in SQL using DirectQuery. A view is never loaded in memory when using Direct Lake mode; if you switch to Import, the view can be loaded in memory with the regular VertiPaq compression and performance.
For example, if a column with a group name should follow a particular sort order that is not alphabetical (like Small, Medium, Large), the additional numeric column required for sorting could be provided by a view. Views work well in Import mode, and using views to present data for a semantic model is a common best practice. However, this is not possible with Direct Lake, because the presence of a view switches all requests to DirectQuery, removing the performance advantage of Direct Lake over DirectQuery. Therefore, the additional sort-order column should be added to the existing Delta table (if possible) or could lead to creating a copy of the original table to add the required column.
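Here is a minimal PySpark sketch of that workaround, writing a copy of the table with the additional sort-order column; table and column names are hypothetical:

```python
# Minimal sketch: instead of a view, materialize a Delta table that carries
# the numeric sort-order column needed by the semantic model.
from pyspark.sql import functions as F

product = spark.read.table("Product")  # hypothetical source table
product = product.withColumn(
    "GroupSortOrder",
    F.when(F.col("GroupName") == "Small", 1)
     .when(F.col("GroupName") == "Medium", 2)
     .otherwise(3),  # Large
)
# Write a copy for the model, leaving the original table untouched
product.write.mode("overwrite").format("delta").saveAsTable("Product_Model")
```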
However, a reason for choosing Direct Lake is to keep a single copy of the data. If a physical table must be processed in order to satisfy specific modeling requirements, this often results in an additional copy of the data in the lakehouse, with an additional development effort to prepare that table. Unless the destination table is also used by other reporting tools, this is not very different from the additional “copy” created for VertiPaq in Import mode.
Direct Lake models require more technical skills to achieve optimal compression
The compression achieved by default by VertiPaq in Import mode can hardly be improved, even by fine-tuning partitions and data distribution. To achieve compression results in Direct Lake similar to the VertiPaq default settings, the developer must understand the internal format of Parquet, plan to compress data with specific algorithms, and plan a partitioning strategy to control how the files are stored in the lakehouse. Additional skills and time are required just to get close to VertiPaq. An increased cost for the same performance is not usually a good return on investment, so there should be some other very good reason to justify it.
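To give an idea of the knobs involved, here is a minimal sketch using pyarrow to write a Parquet file with an explicit codec, dictionary encoding, and row-group size; the values are illustrative, not recommendations. Note also that Fabric engines can apply the proprietary V-Order optimization at write time, which a plain pyarrow writer does not produce:

```python
# Minimal sketch: the Parquet-level choices that influence how well a
# Direct Lake model compresses and scans.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"CustomerKey": [1, 2, 3], "Amount": [10.0, 20.0, 30.0]})

pq.write_table(
    table,
    "sales.parquet",
    compression="zstd",        # codec choice trades file size vs. scan speed
    use_dictionary=True,       # dictionary encoding, similar in spirit to VertiPaq hash encoding
    row_group_size=1_000_000,  # row groups map to the segments read into memory
)
```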
Conclusions
We are very excited about Direct Lake because it solves problems for the 2-3% of models that were hard to manage in Import mode (those above 200-400 GB). However, if your model works well in Import mode, there is no reason to move to Direct Lake: if you prefer to store data in a lakehouse, you can continue to use Import mode on top of it.