
Intel® Advisor 2017 Beta Update 1 - What’s new


We’re pleased to announce a new version of the Vectorization Assistant tool: Intel® Advisor 2017 Beta Update 1.

Below are highlights of the new functionality in Intel Advisor 2017 Beta Update 1.

Full support for all analysis types on the second generation Intel® Xeon Phi processor (code named Knights Landing)

FLOPS and mask utilization

Tech Preview feature! An accurate, hardware-independent, mask-aware FLOPS measurement tool (AVX-512 only), with the unique capability to correlate FLOPS with performance data.

The improved MPI workflow allows you to create snapshots of MPI results, so you can collect data with the command-line interface (CLI) and transfer a self-contained packed result to a workstation with a GUI for analysis. We also fixed some GUI and CLI interoperability issues.
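
For example, here is a hedged sketch of that flow (hypothetical project and application names; option spellings may vary between Advisor versions, so check advixe-cl -help):

mpirun -n 4 advixe-cl -collect survey -project-dir ./advi_proj -- ./my_mpi_app
advixe-cl -snapshot -project-dir ./advi_proj -pack -cache-sources -- ./my_snapshot

The packed snapshot can then be copied to a workstation and opened in the Advisor GUI.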

The project properties dialog was extended with command-line configuration for the MPI launcher. It was also extended to allow enabling FLOPS analysis.

 

Memory Access Patterns

MAP analysis now detects gather instruction usage, unveiling more complex access patterns. A SIMD loop with gather instructions will work faster than a scalar one, but slower than a SIMD loop without gather operations. If a loop has the “Gather stride” category, check the new “Details” tab in the Refinement report for information about strides and mask shape for the gather operation. One possible solution is to inform the compiler about your data access patterns via OpenMP 4.x options, for cases where the gather instructions are not actually necessary; see the sketch below.
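
As a minimal sketch (hypothetical kernels, assuming a compiler with OpenMP 4.x support), the first loop below reads through an index array and can vectorize only with gather instructions, while the second reads with unit stride:

#include <cstddef>

// Indirect read a[idx[i]] produces a "Gather stride" access pattern:
// the loop vectorizes, but only with gather instructions.
void sum_indirect( const float * a , const int * idx , float * out , std::size_t n )
{
    #pragma omp simd
    for( std::size_t i = 0 ; i < n ; ++ i )
        out[ i ] += a[ idx[ i ] ] ;
}

// A unit-stride read needs no gather instructions, so the compiler can
// emit plain contiguous vector loads.
void sum_unit_stride( const float * a , float * out , std::size_t n )
{
    #pragma omp simd
    for( std::size_t i = 0 ; i < n ; ++ i )
        out[ i ] += a[ i ] ;
}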

In addition to gather profiling, for AVX-512 the MAP analysis also detects scatter instruction usage. Gather and scatter instructions allow more code to vectorize, but you can obtain greater performance by avoiding these types of instructions.

 

SMART MODE

Survey results now offer a new “Smart mode”, an effective way to simplify the data representation and automatically select the loops that are likely to be most impactful and/or suitable from a SIMD vector performance perspective.

 

NEW RECOMMENDATIONS

  1. Consider outer loop vectorization in appropriate cases 
  2. Consider refactoring the code to avoid using single Gather or Scatter instructions if the pattern is regular

 

Get Intel Advisor and more information

Visit the product site, where you can find videos and tutorials.


Getting Started with Intel® Active Management Technology (AMT)


Introduction

This document contains information on how to get started with Intel® Active Management Technology (Intel® AMT). It provides an overview of the features, as well as information on minimum system requirements, configuration of an Intel AMT client, and the developer tools available to help create applications for Intel AMT.

Intel AMT supports remote applications running on Microsoft Windows* or Linux*. Intel AMT Release 2.0 and higher supports only Windows-based local applications. For a complete list of system requirements see the Intel AMT Implementation and Reference Guide.

Getting Started

In order to manage an Intel AMT client or run the samples from the SDK, use a separate system to remotely manage your Intel AMT device. Refer to the Intel AMT Implementation and Reference Guide located in the Docs folder of the Intel AMT SDK for more details.

What is Intel® Active Management Technology?

Intel AMT is part of the Intel® vPro™ technology offering. Platforms equipped with Intel AMT can be managed remotely, regardless of their power state or whether they have a functioning OS.

The Converged Security and Manageability Engine (CSME) powers the Intel AMT system. As a component of the Intel® vPro™ platform, Intel AMT uses a number of elements in the Intel vPro platform architecture. Figure 1 shows the relationship between these elements.


Figure 1: Intel® Active Management Technology 11 architecture.

Pay attention to the network connection associated with the Intel® Management Engine (Intel® ME). The NIC changes according to which Intel AMT release you are using.

  • The CSME firmware contains the Intel AMT functionality.
  • Flash memory stores the firmware image.
  • Enable the Intel AMT capability by using the CSME as implemented by an OEM platform provider. A remote application performs the enterprise setup and configuration.
  • On power-up, the firmware image is copied into the Double Data Rate (DDR) RAM.
  • The firmware executes on the Intel® processor with Intel ME and uses a small portion of the DDR RAM (Slot 0) for storage during execution. RAM slot 0 must be populated and powered on for the firmware to run.

Intel AMT stores the following information in flash (Intel ME data):

  • OEM-configurable parameters:
    • Setup and configuration parameters such as passwords, network configuration, certificates, and access control lists (ACLs)
    • Other configuration information, such as lists of alerts and Intel AMT System Defense policies
    • The hardware configuration captured by the BIOS at startup
  • Details for the 2016 platforms with Intel vPro technology (Release 11.x) are:
    • 14nm process
    • Platform (mobile and desktop): 6th generation Intel® Core™ processor
    • CPU: Skylake
    • PCH: Sunrise Point

What’s New with the Intel® Active Management Technology SDK Release 11.0

  • CSME is the new architecture for Intel AMT 11. Prior to Intel AMT 11, CSME was called the Intel® Management Engine BIOS Extension (Intel® MEBx).
  • MOFs and XSL files: The MOFs and XSL files in the \DOCS\WS-Management directory and the class reference in the documentation are at version 11.0.0.1139.
  • New WS-Eventing and PET table argument fields: Additional arguments added to the CILA alerts provide a reason code for the UI connection and the hostname of the device which generates the alert.
  • Updated OpenSSL* version: The OpenSSL version is at v1.0. The redirection library has also been updated.
  • Updated Xerces version: Both Windows and Linux have v3.1.2 of the Xerces library.
  • HTTPS support for WS events: Secure subscription to WS Events is enabled.
  • Remote Secure Erase through Intel AMT boot options: The Intel AMT boot options now include an option to securely erase the primary data storage device.
  • DLL signing with strong name: The following DLLs are now signed with a strong name: CIMFramework.dll, CIMFrameworkUntyped.dll, DotNetWSManClient.dll, IWSManClient.dll, and Intel.Wsman.Scripting.dll
  • Automatic platform reboot triggered by HECI and Agent Presence watchdogs: An option to automatically trigger a reboot whenever a HECI or Agent Presence watchdog reports that its agent has entered an expired state.
  • Replacement of the IDE-R storage redirection protocol: Storage redirection works over the USB-R protocol rather than the IDE-R protocol.
  • Updated SHA: The SHA1 certificates are deprecated, with a series of implemented SHA256 certificates.

Configuring an Intel® Active Management Technology Client

Preparing your Intel® Active Management Technology Client for Use

Figure 2 shows the modes, or stages, that an Intel AMT device passes through before it becomes operational.


Figure 2: Configuration flow.

Before configuring an Intel AMT device from the Setup and Configuration Application (SCA), it must be prepared with initial setup information and placed into Setup Mode. The initial information will be different depending on the available options in the Intel AMT release, and the settings performed by the platform OEM. Table 1 summarizes the methods to perform setup and configuration on the different releases of Intel AMT.

Setup Method | Applicable to Intel® Active Management Technology (Intel® AMT) Releases | For More Information
Legacy | 1.0; Releases 2.x and 3.x in legacy mode | Setup and Configuration in Legacy Mode
SMB | 2.x, 3.x, 4.x, 5.x | Setup and Configuration in SMB Mode
PSK | 2.0 through Intel AMT 10; deprecated in Intel AMT 11 | Setup and Configuration Using PSK
PKI | 2.2, 2.6, 3.0 and later | Setup and Configuration Using PKI (Remote Configuration)
Manual | 6.0 and later | Manual Setup and Configuration (from Release 6.0)
CCM, ACM | 7.0 and later | Client Control Mode and Admin Control Mode; Manually Configuring Clients for Intel AMT 7.0 and Later

Table 1: Setup methods according to Intel® Active Management Technology version.

Intel® Setup and Configuration Software (Intel® SCS) 11 can provision systems back to Intel AMT 2.x. For more information about Intel SCS and provisioning methods as they pertain to the various Intel AMT releases, visit Download the latest version of Intel® Setup and Configuration Service (Intel® SCS)

Manual Configuration Tips

There are no feature limitations when configuring a platform manually since the 6.0 release, but there are some system behaviors to be noted:

  • API methods will not return a PT_STATUS_INVALID_MODE status because there is only one mode.
  • TLS is disabled by default and must be enabled during configuration. This will always be the case with manual configuration as you cannot set TLS parameters locally.
  • The local platform clock will be used until the network time is set remotely. Automatic configuration will not succeed unless the network time has been set, which can only be done after configuring TLS or Kerberos*; enabling TLS or Kerberos after configuration will not work if the network time was not set.
  • The system enables WEB UI by default.
  • The system enables SOL and IDE-R by default.
  • The system disables Redirection listener by default starting in Intel AMT 10.
  • If KVM is enabled locally via the CSME, it still will not be enabled until an administrator activates it remotely.

Starting with Intel AMT 10, some devices are shipped without a physical LAN adapter. These devices cannot be configured using the current USB Key solutions provided by Intel SCS 11.

Manual Setup

During power up, the Intel AMT platform displays the BIOS startup screen, and then it processes the MEBx. During this process you can access the Intel MEBx; however, the method is BIOS vendor-dependent. Some methods are:

  • Most BIOS vendors add entry into the CSME via the one-time boot menu. Select the appropriate key (Ctrl+P is typical) and follow the prompts.
  • Some OEM platforms prompt you to press <Ctrl+P> after POST. When you press <Ctrl+P>, control passes to the Intel MEBx (CSME) main menu.
  • Some OEMs integrate the CSME configuration inside the BIOS (uncommon).
  • Some OEMs have an option in the BIOS to show/hide the <Ctrl+P> prompt, so if the prompt is not available in the one-time boot menu, check the BIOS for an option to activate it.

Client Control Mode and Admin Control Mode

At setup completion, Intel AMT 7.0 and later devices go into one of two control modes:

  • Client Control Mode. Intel AMT enters this mode after performing a basic host-based setup (see Host-Based (Local) Setup). This mode limits some Intel AMT functionality, reflecting the lower level of trust required to complete a host-based setup.
  • Admin Control Mode. After performing remote configuration, USB configuration, or a manual setup via the CSME, Intel AMT enters Admin Control Mode.

There is also a configuration method that performs an Upgrade Client to Admin procedure. This procedure presumes the Intel AMT device is in Client Control Mode, but moves the Intel AMT device to Admin Control mode.

In Admin Control Mode there are no limitations to Intel AMT functionality. This reflects the higher level of trust associated with these setup methods.

Client Control Mode Limitations

When a simple host-based configuration completes, the platform enters Client Control Mode and imposes the following limitations:

  • The System Defense feature is not available.
  • Redirection (IDE-R and KVM) actions (except initiation of an SOL session) and changes in boot options (including boot to SOL) require advance consent. This still allows remote IT support to resolve end-user problems using Intel AMT.
  • With a defined Auditor, the Auditor’s permission is not required to perform un-provisioning.
  • A number of functions are blocked to prevent an untrusted user from taking control of the platform.

Manually Configuring an Intel Active Management Technology 11.0 Client

The Intel AMT platform displays the BIOS startup screen during power up, then processes the BIOS Extensions. Entry into the Intel AMT BIOS Extension is BIOS vendor-dependent.

If you are using an Intel AMT reference platform (SDS or SDP), the display screen prompts you to press <Ctrl+P>. Then the control passes to the CSME main menu.

On an OEM system, entry into the CSME is usually an included option in the one-time boot menu. The exact key sequence varies by OEM, BIOS, and model.

Manual Configuration for Intel® AMT 11.0 Clients with Wi-Fi*-Only Connection

Many systems no longer have a wired LAN connector. You can configure and activate the Intel ME, and then push the wireless settings via the WebUI or some alternate method.

  1. Change the default password to a new value (required to proceed). The new value must be a strong password. It should contain at least one uppercase letter, one lowercase letter, one digit, and one special character, and be at least eight characters.
    1. Enter CSME during startup.
    2. Enter the Default Password of “admin”.
    3. Enter and confirm New Password.
  2. Select Intel AMT Configuration.
  3. Select/Verify Manageability Feature Selection is Enabled.
  4. Select Activate Network Access.
  5. Select “Y” to confirm Activating the interface.
  6. Select Network Setup.
  7. Select Intel® ME network Name Settings.
    1. Enter Host Name.
    2. Enter Domain Name.
  8. Select User Consent.
    1. By default, this is set for KVM only; can be changed to none or all.
  9. Exit CSME.
  10. Configure Wireless via ProSet Wireless Driver synching, WebUI, or an alternative method.

Manual Configuration for Intel AMT 11.0 Clients with LAN Connection

Enter the CSME default password (“admin”).

Change the default password (required to proceed). The new value must be a strong password. It should contain at least one uppercase letter, one lowercase letter, one digit, and one special character, and be at least eight characters. A management console application can change the Intel AMT password without modifying the CSME password.

  1. Select Intel AMT Configuration.
  2. Select/Verify Manageability Feature Selection is Enabled.
  3. Select Activate Network Access.
  4. Select “Y” to confirm Activating the interface.
  5. Select Network Setup.
  6. Select Intel ME network Name Settings.
    1. Enter Host Name.
    2. Enter Domain Name.
  7. Select User Consent.
    1. By default, this is set for KVM only; can be changed to none or all.
  8. Exit CSME.

Accessing Intel® Active Management Technology via the WebUI Interface

An administrator with user rights can remotely connect to the Intel AMT device via the Web UI by entering the URL of the device. Depending on whether TLS has been activated, the URL will change:

  • Non-TLS - http:// <IP_or_FQDN>:16992
  • TLS - https:// <FQDN_only>:16993

You can also connect locally using the host’s browser for a non-TLS connection. Use either localhost or 127.0.0.1 as the IP address. Example: http://127.0.0.1:16992
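
As a quick connectivity check from the host, a hedged example using curl (the WebUI uses HTTP digest authentication with the admin credentials set during configuration; exact pages vary by release):

curl --digest -u admin http://127.0.0.1:16992/

A successful response confirms that the Intel AMT web server is listening on the local interface.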

Intel Active Management Technology Support Requirements

In addition to having the BIOS and CSME configured correctly, the Wireless NIC needs to be Intel AMT Compliant. Specific drivers and services must be present and running in order to use the Intel AMT to manage the host OS.

To verify that the Intel AMT drivers and services are loaded correctly, look for them in Device Manager and Services in the host OS. Frequently check the OEM’s download site for upgraded versions of the BIOS, firmware, and drivers.

Here are the drivers and services that should appear in the host OS:

  • Intel® Ethernet Network Connection i218-LM*
  • Intel® Dual Band Wireless-AC 8260 or similar*
  • Intel® Management Engine Interface (Intel® MEI) driver
  • Serial-Over-LAN (SOL) driver
  • Intel® Management and Security Status (Intel® MSS) Application Local Management Service**
  • Intel® AMT Management and Security Status application**
  • HID mouse and keyboard drivers***

* Network controller and wireless interface versions will vary depending on the generation of the Intel vPro platform.

** Part of the complete Intel MEI (Chipset) Driver package

*** HID Drivers are needed when connecting via Intel AMT KVM. These default drivers are not normally an issue; however, we have seen issues on stripped-down custom OS installs. If a connection is made to a device without the HID drivers, the OS tries to auto-download the drivers. Once the install is done, reconnect the KVM connection.

Note: The version level of the drivers must match the version level of the firmware and BIOS. If non-compatible versions are installed, Intel AMT will not work with the features that require those interfaces.

Physical Device - Wireless Ethernet Connection

By default, any wireless Intel vPro platform will have an Intel AMT enabled wireless card installed, such as an Intel Dual Band Wireless-AC 8260. Any wireless card other than one from Intel will not have wireless Intel AMT capabilities. If you have a card other than the Intel Dual Band Wireless-AC 8260 you can use ark.intel.com to verify whether the wireless card is Intel AMT compliant.

Windows OS Required Software

Device drivers are not necessary for remote management; however, they are essential for local communication to the firmware. Functions like discovery or configuration via the OS will require the Intel MEI driver, SOL driver, LMS service and Intel® Management and Security Status (Intel® MSS).

Device Drivers - Intel® Management Engine Interface

Intel MEI is required to communicate to the firmware. The Windows automatic update installs the Intel MEI driver by default. The Intel MEI driver should stay in version step with the Intel MEBX version.

The Intel MEI driver is in the Device Manager under “System devices” as “Intel® Management Engine Interface.”

Device Drivers - Serial-Over-LAN Driver

The SOL driver is used during redirection operations, such as when a remote CD drive is mounted during an IDE redirection operation.

The SOL driver is in the Device Manager under “Ports” as “Intel® Active Management Technology – SOL (COM3).”


Figure 3: Serial-Over-LAN driver.

Service - Intel Active Management Technology LMS Service

The Local Manageability Service (LMS) runs locally in an Intel AMT device and enables local management applications to send requests and receive responses. The LMS responds to the requests directed at the Intel AMT local host and routes them to the Intel® ME via the Intel® MEI driver. This service installer is packaged with the Intel MEI drivers on the OEM websites.

Please note that when installing the Windows OS, the Windows Automatic Update service installs the Intel MEI driver only. IMSS and the LMS Service are not installed. The LMS service communicates from an OS application to the Intel MEI driver. If the LMS service is not installed, go to the OEM website and download the Intel MEI driver, which is usually under the Chipset Driver category.


Figure 4: Intel® Management Engine Interface driver.

The LMS is a Windows service installed on host platforms with Intel AMT Release 9.0 or greater. From Intel AMT Release 2.5 through 8.1, it was known as the User Notification Service (UNS).

The LMS receives a set of alerts from the Intel AMT device. LMS logs the alert in the Windows Application event log. To view the alerts, right-click My Computer, and then select Manage>System Tools>Event Viewer>Application.

Tool - Intel® Management and Security Status Tool

The Intel MSS tool can be accessed by the blue-key icon in the Windows tray.


Figure 5: Sys Tray Intel® Management and Security Status icon.

General tab

The General tab of the Intel MSS tool shows the status of Intel vPro features available on the platform and an event history. Each tab has additional details.


Figure 6: Intel® Management and Security Status General tab.

Intel AMT Tab

This interface allows the local user to terminate KVM and Media Redirection operations, perform a Fast Call for Help, and see the System Defense state.


Figure 7: Intel® Management and Security Status Intel AMT tab.

Advanced Tab

The Advanced tab of the Intel MSS tool shows more detailed information on the configuration of Intel AMT and its features. The screenshot in Figure 8 verifies that Intel AMT has been configured on this system.


Figure 8:Intel® Management and Security Status Advanced tab.

Intel Active Management Technology Software Development Kit (SDK)

The Intel AMT Software Development Kit (SDK) provides low-level programming capabilities so developers can build manageability applications that take full advantage of Intel AMT.

The Intel AMT SDK provides sample code and a set of APIs that let developers easily and quickly incorporate Intel AMT support into their applications. The SDK also includes a full set of HTML documentation.

The SDK supports C++ and C# on Microsoft Windows and Linux operating systems. Refer to the user guide and the Readme files in each directory for important information on building the samples.

The SDK, as delivered, is a set of directories that can be copied to any location. The directory structure should be copied in its entirety due to the interdependencies between components. There are three folders at the top level: DOCS (contains SDK documentation), and one each for Linux and Windows (sample code). For more information on how to get started and how to use the SDK, see the Intel® AMT Implementation and Reference Guide.

As illustrated by the screenshot in Figure 9 of the Intel® AMT Implementation and Reference Guide, you can get more information on system requirements and how to build the sample code by reading the Using the Intel® AMT SDK section. The documentation is available on the Intel® Software Network here: Intel® AMT SDK (Latest Release)


Figure 9: Intel AMT Implementation and Reference Guide

Other Intel Active Management Technology Resources

Intel AMT Implementation and Reference Guide
Intel AMT SDK Download
High-level API Article and download
Intel® Platform Solutions Manager Article and Download
Power Shell Module download
KVM Application Developer’s Guide
Redirection Library
C++ CIM Framework API
C# CIM Framework API
Intel® ME WMI Provider
System Health Validation (NAP)
Use Case Reference Designs

Appendix

The following table provides a snapshot of features supported by Intel AMT Releases 8 through 11.

Read about all the features in the Intel AMT SDK Implementation and Reference Guide (“Intel AMT Features” section.)

Feature | Intel® AMT 8 | Intel AMT 9 | Intel AMT 10 | Intel AMT 11
Hardware Inventory | X | X | X | X
Persistent ID | X | X | X | X
Remote Power On/Off | X | X | X | X
SOL/IDER | X | X | X | X
Event Management | X | X | X | X
Third-Party Data Storage | X | X | X | X
Built-in Web Server | X | X | X | X
Flash Protection | X | X | X | X
Firmware Update | X | X | X | X
HTTP Digest/TLS | X | X | X | X
Static and Dynamic IP | X | X | X | X
System Defense | X | X | X | X
Agent Presence | X | X | X | X
Power Policies | X | X | X | X
Mutual Authentication | X | X | X | X
Kerberos* | X | X | X | X
TLS-PSK | X | X | X | Deprecated
Privacy Icon | X | X | X | X
Intel® Management Engine Wake-on-LAN | X | X | X | X
Remote Configuration | X | X | X | X
Wireless Configuration | X | X | X | X
Endpoint Access Control (EAC) 802.1X | X | X | X |
Power Packages | X | X | X | X
Environment Detection | X | X | X | X
Event Log Reader Realm | X | X | X | X
System Defense Heuristics | X | X | X | X
WS-MAN Interface | X | X | X | X
VLAN Settings for Intel AMT | X | X | X | X
Network Interfaces | X | X | X | X
Fast Call For Help (CIRA) | X | X | X | X
Access Monitor | X | X | X | X
Microsoft NAP* Support | X | X | X | X
Virtualization Support for Agent Presence | X | X | X | X
PC Alarm Clock | X | X | X | X
KVM Remote Control | X | X | X | X
Wireless Profile Synchronization | X | X | X | X
Support for Internet Protocol Version 6 | X | X | X | X
Host-Based Provisioning | X | X | X | X
Graceful Shutdown | X | X | X | X
WS-Management API | X | X | X | X
SOAP Commands | X | Deprecated | Deprecated | Deprecated
InstantGo Support |  |  |  | X
Remote Secure Erase |  |  |  | X

How to Analyze Intel® Media SDK-optimized applications with Intel® Graphics Performance Analyzers


When developing a media application, you often wonder, “Am I getting the performance I should be? Am I using fixed function logic or my EU array?” This article will show how to set up Intel® Graphics Performance Analyzers (Intel® GPA) to analyze the real time performance of your Intel® Media SDK-optimized application. 

First, let’s start with Intel® GPA. Intel GPA is a very useful tool for identifying media pipeline inefficiencies and targeting application optimization opportunities. Second, the Intel® Media SDK is a software development library that exposes the hardware media acceleration capabilities of Intel® platforms for decoding, encoding, and video processing (see hardware requirements for applicable processors). To get started, we’ll use Intel GPA to analyze some of the Intel® Media SDK sample use cases as examples. Please refer to each sample description for details.

For this article, you will need both Intel GPA and the Intel Media SDK. You can get the free downloads of Intel® GPA and either Intel Media SDK (for clients) or Intel® Media Server Studio Community Edition (where Intel Media SDK is a component).

Throughout this article, we use Intel® GPA 2016 R1 with the latest Intel Media SDK 2016. Note that future revisions for Intel GPA and Intel Media SDK will introduce new features that may deviate from some parts of this article.

Setting up Intel GPA

Run the Intel GPA installer to install tools. With Intel GPA you can get in-depth traces of media workloads, capturing for instance operations such as fixed function hardware processing (denoted as MFX from now on) or execution unit (EU) processing. Intel GPA also has the capability to display real-time Intel GPU load via the Media Performance dialog.

The Intel GPA tool is started by launching the “Intel® Graphics Monitor” application from the Start window or from the task bar. Right-clicking the task bar icon brings up a menu with options.

For the purpose of media workload analysis, you will create a media analysis profile in the Intel GPA profiles menu:

1) Select the “Profiles” menu item, then click the “HUD Metrics” tab. Clear the existing metrics in “Metrics to display in the HUD” by selecting all the metrics and clicking “Remove”. Then select the following Media metrics from “Available metrics” and click “Add” to add them to “Metrics to display in the HUD”:

  • MFX Decode Usage
  • GPU Overall Usage
  • MFX Engine Usage
  • MFX Encode Usage
Choose a group name and click “Add Group” to save your settings. The profile should look like the one below. Click Apply to save changes.

2) Select the “Preferences” menu item. Make sure the following check boxes are cleared:

  • Auto-detect launched applications
  • Disable Tracing

Press “OK” to exit the “Preferences” dialog window.

Analyzing the Application

Intel GPA uses an injection method to collect metrics during the runtime of the application. Injection takes place at application start time and, for this tutorial, will be launched from the Analyze Application menu within Intel GPA.

  1. To capture detailed workload metrics, select “Analyze Application” from the Intel GPA task bar menu.
  2. Enter the executable path, name, and arguments in the “Command line” dialog item. Make sure that “Working Folder” is set correctly and that there are no spaces in the path.
  3. To start capturing metrics for the specified workload, press the “Run” button.

Real-time graphs of the metrics can also be enabled by pressing “Ctrl+F1” during rendering. Note: Metrics can be changed during runtime from within the profiles menu from the setup step.

Before analyzing any of the other Intel Media SDK samples workloads, let’s examine what information is presented in metrics captured by Intel® GPA, why they are important, and what are possible next steps for improvement. Each metric is displayed in real time as a percentage value over time. The red number is the minimum value, the green number is the maximum value, and the white number is the current value. 

Below is a table of metric descriptions for each media metric offered within Intel GPA. For a full list of media metrics supported in Intel GPA, please refer to our documentation.

Metric Name | Metric Description
MFX Engine Usage | Percentage of time the Multi-Format Codec (MFX) engine is active
MFX Decode Usage | Percentage of time that MFX-based decode operations are active
GPU Overall Usage | Percentage of time that either the execution unit (EU) array or the MFX (media fixed function) engine is active
EU Engine Usage | Percentage of time the Execution Unit (EU) engine is active
MFX Encode Usage | Percentage of time that MFX-based encode operations are active

 

Interpreting the Data

If you observe that GPU overall usage is high, check the IOPattern settings in your application: a mismatched IOPattern in a Media SDK pipeline consumes large amounts of extra buffers and internal copies, and it is recommended to avoid such scenarios. Refer to the technical articles here, which explain common Media SDK use-case scenario settings and how to further optimize your media pipelines to achieve the best performance.
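
Here is a minimal sketch of what consistent IOPattern settings look like (assuming the standard Media SDK headers; surface allocation and error handling omitted). Keeping adjacent pipeline stages on the same memory type avoids the extra internal copies described above:

#include "mfxvideo.h"

// Decoder output and VPP input both stay in video memory, so no
// system-memory staging copies are needed between the two stages.
void InitDecodeVppPair( mfxSession session , mfxVideoParam & decPar , mfxVideoParam & vppPar )
{
    decPar.IOPattern = MFX_IOPATTERN_OUT_VIDEO_MEMORY ;
    vppPar.IOPattern = MFX_IOPATTERN_IN_VIDEO_MEMORY | MFX_IOPATTERN_OUT_VIDEO_MEMORY ;
    MFXVideoDECODE_Init( session , & decPar ) ;
    MFXVideoVPP_Init( session , & vppPar ) ;
}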

Questions, comments, or feature requests regarding analyzing your media application? Connect with other developers and Intel technical experts on the Intel Media SDK forum and the Intel GPA forum!

SPIR-V is a better SPIR with Intel® OpenCL™ Code Builder


Download the pdf version of the article


Introduction

In this short tutorial we are going to give you a brief introduction to Khronos SPIR-V, touch on the differences between a SPIR-V binary and a SPIR binary, and then demonstrate a couple of ways of creating SPIR-V binaries using tools shipped with the latest Intel® SDK for OpenCL™ Applications, as well as a way of consuming SPIR-V binaries in your OpenCL program. Please read Using SPIR for Fun and Profit with Intel® OpenCL™ Code Builder first, since you can consider this article the second in a series. You will need Microsoft Visual Studio 2015 or later (note that the Community edition should work just fine) and the Intel® SDK for OpenCL Applications for Windows.

What is SPIR-V and how is it different from SPIR?

SPIR-V is an evolution of the SPIR standard; you can read more about it at the Khronos website: https://www.khronos.org/spir. While SPIR was designed as an intermediate binary representation for OpenCL programs only, SPIR-V acts as a cross-API intermediate language for both OpenCL 2.1 and 2.2 and the new Khronos Vulkan graphics and compute API.

How to produce SPIR-V binary with an Intel® OpenCL* command line compiler?

Run the following from the command line to produce a 64-bit SPIR-V binary:

ioc64 -cmd=build -input=SobelKernels.cl -spirv64=SobelKernels_x64.spirv

To produce a 32-bit SPIR-V binary, use the following command line:

ioc64 -cmd=build -input=SobelKernels.cl -spirv32=SobelKernels_x86.spirv

How to produce SPIR-V binary with Intel® Code Builder?

Start Microsoft Visual Studio and create a new Code Builder Session:

and give the location of the sample code and the SobelKernels.cl file. Make sure to name the session SobelKernels:

Click on the resulting Session in the Code Builder Session Explorer and select Build:

Once you build the session, you should see the following:

Note that both x86 and x64 SPIR-V binaries and their textual representations are produced by default. The textual representation is useful for informational purposes only; it should match the representation produced by the Khronos SPIR-V tools. SPIR-V binaries can be distributed with applications and replace SPIR binaries and/or OpenCL source code.

How to produce SPIR-V binary right in your Microsoft Visual Studio Project?

Make sure to set the following flags in your MS Visual Studio Project:

Resulting SPIR-V binaries (both 32-bit and 64-bit versions) will be generated in the project build directory.

How to consume SPIR-V binary in your OpenCL program?

Currently, only the Experimental OpenCL 2.1 CPU Only Platform supports SPIR-V binaries on Intel® processors. To make sure that the device supports SPIR-V, use a clGetDeviceInfo call and query CL_DEVICE_IL_VERSION; it should return “SPIR-V_1.0”. Note that the device version should be “OpenCL 2.1” or higher.
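
Here is a minimal sketch of that query (assuming an OpenCL 2.1 header; error handling omitted):

#include <CL/cl.h>
#include <string.h>

int DeviceSupportsSpirv( cl_device_id device )
{
    char ilVersion[ 64 ] = { 0 } ;
    // CL_DEVICE_IL_VERSION is available on OpenCL 2.1 devices.
    clGetDeviceInfo( device , CL_DEVICE_IL_VERSION , sizeof( ilVersion ) , ilVersion , NULL ) ;
    return strstr( ilVersion , "SPIR-V" ) != NULL ;
}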

You will need to read the binary file (in our case, SobelKernels_x64.spirv) into a character array with regular C or C++ APIs. Then create a program with a clCreateProgramWithIL call, and build it with clBuildProgram, providing regular optimization flags such as “-cl-mad-enable”. For a description of how to build a regular SPIR or a native device binary, please see Using SPIR for Fun and Profit with Intel® OpenCL™ Code Builder.
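
A minimal sketch of that sequence (the file name is from this article’s example; error handling omitted):

#include <CL/cl.h>
#include <fstream>
#include <vector>

cl_program LoadSpirvProgram( cl_context context , cl_device_id device )
{
    // Read the SPIR-V module into memory.
    std::ifstream file( "SobelKernels_x64.spirv" , std::ios::binary ) ;
    std::vector< char > il( ( std::istreambuf_iterator< char >( file ) ) ,
                            std::istreambuf_iterator< char >() ) ;

    // Create the program from the intermediate language, then build it
    // with the usual optimization flags.
    cl_int err = CL_SUCCESS ;
    cl_program program = clCreateProgramWithIL( context , il.data() , il.size() , & err ) ;
    clBuildProgram( program , 1 , & device , "-cl-mad-enable" , NULL , NULL ) ;
    return program ;
}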

Advantages of a SPIR-V binary

You should be able to compile kernels written for OpenCL 1.2, 2.0, 2.1, and 2.2 in both OpenCL C and OpenCL C++ (coming soon). Other advantages include increased compatibility between compiler frontends and across vendors; some degree of security, since it is fairly hard to disassemble a SPIR-V binary and understand the resulting code; and marginally faster compile times.

Disadvantages of a SPIR-V binary

A SPIR-V binary is larger than the corresponding SPIR binary. Since SPIR-V is a relatively new feature, it might not be supported by as many vendors yet. As with SPIR binaries, SPIR-V binaries need to ship in both 32-bit and 64-bit form, and it is the responsibility of the programmer to write the logic to select between them: 32-bit SPIR-V binaries are needed for 32-bit devices and 64-bit SPIR-V binaries for 64-bit devices (both CPUs and GPUs can be either 32-bit or 64-bit devices; to figure out which one you have, use clGetDeviceInfo to query the CL_DEVICE_ADDRESS_BITS property as shown in the sample code). Another issue is that extensions, or anything else that is discoverable at compile time, can be problematic.
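
A short sketch of that selection logic (the file names are from this article’s example):

#include <CL/cl.h>

const char * PickSpirvBinary( cl_device_id device )
{
    // Query the device address width to choose the matching binary.
    cl_uint addressBits = 0 ;
    clGetDeviceInfo( device , CL_DEVICE_ADDRESS_BITS , sizeof( addressBits ) , & addressBits , NULL ) ;
    return ( addressBits == 64 ) ? "SobelKernels_x64.spirv" : "SobelKernels_x86.spirv" ;
}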

Building and Running a SPIR-V Sample

Building the SPIR-V sample is very similar to building the SPIR sample, but when running it, specify the following on the command line:

.\x64\Release\Sobel_OCL.exe 100 cpu Experimental 2048 2048 no_show_CL spirv

After it runs successfully, notice that four *_validation.ppm files appeared in your directory. When you examine them, they should all contain the following picture of a nicely Sobelized dog:

Conclusion

In this short tutorial we gave a brief overview of SPIR-V and its differences with SPIR. For an article on how to use SPIR, please see Using SPIR for Fun and Profit with Intel® OpenCL™ Code Builder.

Acknowledgements

Thank you to Uri Levy, Oleg Maslov, Alexey Sotkin, Vladimir Lazarev and Ben Ashbaugh for reviewing the sample code and this article!

References

  1. Khronos SPIR website: https://www.khronos.org/spir
  2. Khronos SPIR FAQ: https://www.khronos.org/faq/spir
  3. Khronos SPIR-V Tools: https://github.com/KhronosGroup/SPIRV-Tools
  4. Khronos LLVM framework with SPIR-V support: https://github.com/KhronosGroup/SPIRV-LLVM
  5. Intel® SDK for OpenCL Applications: https://software.intel.com/en-us/intel-opencl
  6. Intel® Media Server Studio: https://software.intel.com/en-us/intel-media-server-studio
  7. Intel®  Code Builder for OpenCL™ API for Linux*: https://software.intel.com/en-us/articles/intel-code-builder-for-opencl-api
  8. User Manual for OpenCL™ Code Builder: https://software.intel.com/en-us/code-builder-user-manual
  9. Using SPIR for Fun and Profit with Intel® OpenCL™ Code Builder by Robert Ioffe

About the Author

Robert Ioffe is a Technical Consulting Engineer at Intel’s Software and Solutions Group. He is an expert in OpenCL programming and OpenCL workload optimization on Intel® Iris™ and Intel® Iris Pro™ Graphics with deep knowledge of Intel Graphics Hardware. He was heavily involved in Khronos standards work, focusing on prototyping the latest features and making sure they can run well on Intel architecture. Most recently he has been working on prototyping Nested Parallelism (enqueue_kernel functions) feature of OpenCL 2.0 and wrote a number of samples that demonstrate Nested Parallelism functionality, including GPU-Quicksort for OpenCL 2.0. He also recorded and released two Optimizing Simple OpenCL Kernels videos and a video on Nested Parallelism.

You might also be interested in the following:

Optimizing Simple OpenCL Kernels: Modulate Kernel Optimization

Optimizing Simple OpenCL Kernels: Sobel Kernel Optimization

GPU-Quicksort in OpenCL 2.0: Nested Parallelism and Work-Group Scan Functions

Sierpiński Carpet in OpenCL 2.0

Legal Information

* Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
Copyright © 2016, Intel Corporation. All rights reserved.

Download the Sample

Introducing the Intel® Software Guard Extensions Tutorial Series


Today we are launching a multi-part tutorial series aimed at software developers who want to learn how to integrate Intel® Software Guard Extensions (Intel® SGX) into their applications. The intent of the series is to cover every aspect of the software development cycle when building an Intel SGX application, beginning at application design and running through development, testing, packaging, and deployment. While isolated code samples and individual articles are valuable, this in-depth look at enabling Intel SGX in a single application provides developers with a hands-on and holistic view of the technology as it is woven into a real-world application.

This tutorial will consist of several parts—currently 12 articles are planned, though the exact number may change—each covering a specific topic. While a precise schedule has not been set, a new part should be published every two to three weeks, following these broad phases:

  1. Concepts and design
  2. Application development and Intel SGX integration
  3. Validation and testing
  4. Packaging and deployment
  5. Disposition

Source code will accompany relevant sections of the series and will be distributed under the Intel Sample Source Code license. Don’t expect to start seeing source code for a few weeks, however. The first phase of the tutorial will cover the early fundamentals of Intel SGX application development.

Goals

At the end of the series, the developer will know how to:

  • Identify an application’s secrets
  • Apply the principles of enclave design
  • Use trusted libraries in an enclave
  • Build support for dual code paths in an application (to provide legacy support for platforms without Intel SGX capabilities)
  • Use the Intel SGX debugger
  • Create an Intel SGX application installer package

The sample application

Throughout the series we will be developing a basic password manager. The final product is not meant to be a commercially viable application, but rather one with sufficient functionality to make it a reasonable performer that follows smart security practice. This application is simple enough to be reasonably covered in the tutorial without being so simple that it’s not a useful example.

What you’ll need

Developers who want to work with the source code as it is released will require the following:

Hardware requirements

Hardware | Hard Requirement | Comments
Intel® processor with Intel® Secure Key technology | Yes | The password manager will make extensive use of the digital random number generator provided by Intel Secure Key technology. See http://ark.intel.com to find specific processor models with Intel Secure Key technology support.
6th generation Intel® Core™ processor with Intel® Software Guard Extensions (Intel® SGX) enabled BIOS | No | To get the most out of the tutorial, a processor that supports Intel SGX is necessary, but the application development can take place on a lesser system, and Intel SGX applications can be run in the simulator provided with the SDK.

 

Software requirements

These software requirements are based on the current, public release of the Intel SGX Software Developer’s Kit (SDK). As newer versions of the SDK are released, the requirements may change.

Software | Hard Requirement | Comments
Intel® Software Guard Extensions (Intel® SGX) SDK v1.1 | Yes | Required for developing Intel SGX applications.
Microsoft Visual Studio* 2012 Professional Edition | Yes | Required for the SDK. Each SDK release is tied to specific versions of Visual Studio in order to enable the wizards, developer tools, and various integration components.
Intel® Parallel Studio XE 2013 Professional Edition for Windows* | No | A soft requirement for the SDK, but not strictly necessary for Intel SGX SDK development. The compiler included in Visual Studio 2012 does not provide intrinsics for the RDSEED instruction, but a fallback code path exists.

Coming soon

Part 1 of the series, Intel SGX Foundations, will post in the next couple of weeks. In it, we will provide an overview of the technology and lay the groundwork for the rest of the tutorial.

Stay tuned

This series will cover every aspect of the software development cycle when building an Intel SGX application, beginning at application design, and running through development, testing, packaging, and deployment. The tutorials will cover concepts and design, application development and Intel SGX integration, validation and testing, packaging and deployment, and disposition.

We’re excited to be launching this series and are looking forward to having you join us!

Intel Tamper Protection Toolkit - Beta Closed


The Intel® Tamper Protection Toolkit team thanks you for your interest in the Beta product.

The Intel® Tamper Protection Toolkit Beta is now closed. Intel has stopped support of this product.

Regards,
Intel Tamper Protection Toolkit Team

An Introduction to Intel® Active Management Technology Wireless Connections


Introduction to Intel® Active Management Technology Wireless

With the introduction of wireless-only platforms starting with Intel Active Management Technology (Intel® AMT) 10, it is even more important for an ISV to integrate support for wireless management of AMT devices.

The wireless feature of Intel AMT behaves like any other wireless connection, in that the initial connection is not automatic. However, there are several major differences between wired and wireless Intel AMT communication, including the following:

  • Wireless Intel AMT interfaces are disabled by default and must be enabled and configured with a wireless profile (friendly name, SSID, passwords, encryption, and authentication at a minimum) which is pushed to the client using one of several methods.
  • Where a wired interface is shared by the host OS and Intel AMT (two different IP addresses), the wireless interface is assigned a single DHCP IP address shared with the host OS, and it is controlled by the OS unless the host is deemed unavailable, in which case control passes to the Intel AMT firmware.

This article will address the Intel AMT wireless configuration and describe how to handle the various aspects that are important for a clean integration.

Intel AMT Wireless Support Progression for Intel AMT 2.5 through 11

  • Intel AMT 2.5 and 2.6: Wireless is supported only when the OS is in a powered-on state (S0).
  • Intel AMT 4.0: Wireless is supported in all sleep states (Sx) but depends on configuration settings (Note: Intel AMT 5.0 did not support wireless).
  • Intel AMT 6: Syncs Intel AMT and host OS wireless profiles.
  • Intel AMT 7.0: Wireless is supported and host-based configuration is available; however remote configuration is not available over wireless.
  • Intel AMT 9.5: First release to support wireless-only platforms. USB provisioning is not supported on these devices.

Understanding Intel AMT Wireless Connection Requirements

The connection parameters for an Intel AMT wireless device closely resemble those required for the Host OS connection. The firmware requires information including SSID, the authentication method, encryption type, and passphrase at a minimum. In more advanced connections, 802.1x profile information is also required.

All of these settings are wrapped into a profile, which is considered either an Admin or a User profile and is saved within the Intel AMT firmware. Admin (IT) profiles are added to the firmware using Intel AMT APIs; see the list of configuration methods below. User profiles cannot be added to the MEBx via an Intel AMT API; they are created using the Intel AMT WebUI or with profile syncing using the Intel® PROSet wireless drivers.

The Intel AMT firmware holds a maximum of 16 profiles in total, Admin and User combined, of which a maximum of 8 can be user profiles. When a ninth user profile is added, the oldest user profile is overwritten.

Key Differences between Wired and Wireless Intel AMT Support

  • Default state. The wireless management interface is initially disabled and must be enabled in addition to creating and deploying the wireless profile. In contrast, wired connections are on by default.
  • Network type. Only infrastructure network types are supported by Intel AMT, not ad hoc or peer-to-peer networks.
  • DHCP dependence. While wired Intel AMT connections support either DHCP or static IP assignment, wireless AMT connection requires DHCP, and it will share its IP with the host OS.
  • Power state limitations. Wireless Intel AMT is only available when the system is plugged into AC power and in the S0 or S5 state.
  • Microsoft Active Directory* integration. 802.1x wireless authentication requires Active Directory integration with the Intel® Setup and Configuration Software (Intel® SCS.)
  • OS control of packets. On the wireless connection, all traffic goes directly to the OS (which can then forward it to Intel AMT) unless the OS is off, failed, or in a sleep state. In those cases manageability traffic goes directly to Intel AMT, which means that when the host returns to S0 or the driver is restarted, Intel AMT must return control to the host, or the host will not have wireless connectivity. This affects remote connections to Intel AMT, including IDE-R and KVM. See the Link Preference details below (added in 6.0 and automated in 8.1).
  • Wired-only Intel AMT features are not supported on wireless-only platforms: Heuristic Policies, Auto-Sync of IP Addresses, Local Setup Directly to Admin Control Mode, and 802.1x Profile Configuration.

Configuration Methods

Basic configuration of wireless for Intel AMT is covered in the article: Intel® vPro™ Setup and Configuration Integration, but here is additional information specific to wireless setup.

Wireless profiles can be placed in the Intel AMT firmware several ways. However, any system that is wireless only (no RJ45 connector) cannot be provisioned by a USB key.

  • Initial Intel AMT configuration
    • Profile type: Admin or Client, basic or advanced 802.1x
    • Tools available: Acuconfig, ACUWizard or Intel SCS
  • Intel AMT WebUI
    • Profile type: User, basic only.
    • Tool used: For web browser, use http://<IPorFQDNofDevice>:16992, or for TLS use https://<FQDNofDevice>:16993


  • Delta configuration
    • Profile type: Admin for reconfiguring specific settings only
    • Tools available: Acuconfig, ACU Wizard, or Intel SCS
  • Wi-Fi profile syncing  (Intel AMT 6.0 and later)
    • Profile type: User
    • Requires Intel® PROSet wireless drivers and the Intel® AMT Local Manageability Service (LMS)
    • Enables or disables synced OS and AMT wireless profiles (during configuration).
  • WS-Management
    • Profile type: Admin
    • Tools: Intel® vPro PowerShell module, WirelessConfiguration.exe, WS-Man custom using CIM_elements

Connection Types - Authentication/Encryption

Intel AMT supports several authentication and encryption types for wireless connections.

  • User profiles can be configured with WEP or no encryption.
  • Admin profiles must be TKIP or CCMP with WPA or higher security.
  • 802.1x profiles are not automatically synchronized by the Intel PROSet wireless driver

Table 1 shows the possible security settings for Intel AMT wireless profiles.

Authentication | None | WEP | TKIP | CCMP
Open System | X | X |  |
Shared Key | X | X |  |
Wi-Fi* Protected Access (WPA) Pre-Shared Key (PSK) |  |  | X | X
WPA IEEE 802.1X |  |  | X | X
WPA2 PSK |  |  | X | X
WPA2 IEEE 802.1X |  |  | X | X

Table 1: Security settings for Intel® Active Management Technology wireless profiles.

Settings to Ensure Connectivity during Remote Connection

Link Control and Preference

In a typical Intel AMT remote power management command, the Intel AMT system is immediately rebooted. With a wireless KVM session, the session is dropped because control of the wireless interface does not get passed to the firmware; it can then take up to two minutes for the Intel AMT wireless connection to be reestablished.

To prevent this connectivity loss, the preferred method is to programmatically perform the change of link control prior to making the power control request.

For additional information, see my blogs: KVM and Link Control, and the more general Link Preference and Control.

TCP Time Outs

During changes to link control and power transitions, wireless connectivity will temporarily be down. If that interruption lasts too long, sessions created using the redirection library will be terminated because the HB setting within the redirection library is exceeded (see Table 2).

Time Out | Default Value | Suggested Value
Hb (client heartbeat interval) | 5 seconds | 7.5 seconds
RX (client receive) | 2 x Hb | 3 x Hb

Table 2: TCP default and suggested changes.

Currently the default session time-out setting works most of the time. However, we now recommend changing the HB interval and the client receive interval by adding parameters during calls to the redirection library. These time-out values need to affect both the IDE-R TCP and SOL TCP sessions. For additional information, see IMR_IDEROpenTCPSession or IMR_SOLOpenTCPSessionEx.

Wireless Link Policy

Another aspect is the wireless power policy of the firmware. This policy governs power control in different sleep states. The allowable values are Disable, EnableS0, and EnableS0AndSxAC. These settings are usually set during configuration. However, identifying whether an Intel AMT client will be able to maintain connectivity after a reboot or power down helps set technician expectations of client behavior.

To query the Wi-Fi link policy, use the HLAPI.Wireless.WiFiLinkPolicy enumeration.

To set the Wi-Fi link policy, use the HLAPI.Wireless.IWireless.SetWiFiLinkPolicy method of the Intel AMT HLAPI.

Summary

Intel AMT wireless functionality may be called a feature, but it should be a cornerstone of any integration of Intel AMT functionality into a console application. Without this integration, many devices will not be manageable, given the introduction of wireless-only platforms with Intel AMT version 10.

A successful basic integration is composed of several factors: Intel AMT wireless configuration, connection verification for wired or wireless, and wireless link control operations.

Resource Lists

About the Author

Joe Oster has been working with Intel® vPro™ technology and Intel AMT technology since 2006. When not working, he spends time working on his family’s farm or flying drones and RC aircraft.

Fluid Simulation for Video Games (Part 20)


Download PDF of Fluid Simulation for Video Games (Part 20) [PDF 1.1MB]


Figure 1: Comparison of fluid simulations with the same interior vorticity but different solvers and boundary values. The top row uses a direct integral method, where the domain boundary has no explicit influence. The middle row uses a Poisson solver, with boundary conditions of vector potential calculated from a treecode integral method. The bottom row uses a Poisson solver, with boundary conditions of vector potential set to zero.

Assigning Vector Potential at Boundaries

Fluid simulation entails computing flow velocity everywhere in the fluid domain. This article places the capstone on a collection of techniques that combine to provide a fast way to compute velocity from vorticity. This is the recipe:

  1. Within some domain, use vortex particles to track where fluid rotates.
  2. At domain boundaries, integrate vorticity to compute the vector potential.
  3. Throughout the domain interior, solve a vector Poisson equation to compute vector potential everywhere else.
  4. Everywhere in the domain, compute the curl of vector potential to obtain the velocity field.
  5. Advect particles according to the velocity field (see the sketch after this list).
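
Here is a minimal sketch of one simulation step implementing this recipe, using hypothetical names (the accompanying sample’s actual classes and signatures differ):

void FluidSim::Update( float timeStep )
{
    // Step 2: integrate vorticity with the treecode to obtain vector
    // potential at the domain boundaries only.
    ComputeVectorPotentialAtBoundaries( mVortons , mVectorPotentialGrid ) ;
    // Step 3: solve the vector Poisson equation to fill in the interior.
    SolveVectorPoisson( mVectorPotentialGrid , mVorticityGrid ) ;
    // Step 4: velocity is the curl of the vector potential.
    ComputeCurl( mVectorPotentialGrid , mVelocityGrid ) ;
    // Step 5: advect vortons (the vorticity carriers of step 1) and tracers.
    AdvectParticles( mVortons , mVelocityGrid , timeStep ) ;
    AdvectParticles( mTracers , mVelocityGrid , timeStep ) ;
}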

This system has some nice properties:

  • Most game engines already support particle systems. This technique builds on such systems.
  • Accelerate the integration of vorticity by using the treecode, which runs in O(N log N) time: pretty fast.
  • Solving a vector Poisson equation can be even faster: O(N).
  • Computing curl is mathematically simple and fast: O(N).
  • Most particle systems already support advecting particles according to a velocity field.
  • The same velocity field can advect both vortex and tracer particles.
  • All the computation routines have fairly simple mathematical formulae.
  • The algorithms are numerically stable.
  • You can parallelize all the computation routines above by using Intel® Threading Building Blocks (Intel® TBB).
  • Using particles to represent vorticity means that only the most interesting aspects of the flow cost resources.

The scheme is well-suited to fire and smoke simulations, but it has at least one drawback: It’s not well suited to simulating liquid–air interfaces. If you want to simulate pouring, splashes, or waves, other techniques are better. Smoothed-particle hydrodynamics (SPH) and shallow-water wave equations will give you better results.

This article complements part 19, describing how to improve fidelity while reducing computational cost by combining techniques described in earlier articles in this series. In particular, this article describes the following steps:

  1. Compute vector potential at boundaries only by integrating vorticity with a treecode.
  2. Enable the multigrid Poisson solver described in part 6.
  3. Modify UpSample to skip overwriting values at boundaries.
  4. Use Intel TBB to parallelize UpSample and DownSample.

The code accompanying this article provides a complete fluid simulation using the vortex particle method. You can switch between various techniques, including integral and differential techniques, to compare their performance and visual aesthetics. At the end of the article, I offer some performance profiles that demonstrate that the method this article describes runs the fastest of all those presented in this series. To my eye, it also offers the most visually pleasing motion.

Part 1 and part 2 summarized fluid dynamics and simulation techniques. Part 3 and part 4 presented a vortex-particle fluid simulation with two-way fluid–body interactions that runs in real time. Part 5 profiled and optimized that simulation code. Part 6 described a differential method for computing velocity from vorticity. Figure 1 shows the relationships between the various techniques and articles. Part 7 showed how to integrate a fluid simulation into a typical particle system. Part 8, part 9, part 10, and part 11 explained how to simulate density, buoyancy, heat, and combustion in a vortex-based fluid simulation. Part 12 explained how improper sampling caused unwanted jerky motion and described how to mitigate it. Part 13 added convex polytopes and lift-like forces. Part 14, part 15, part 16, part 17, and part 18 added containers, SPH, liquids, and fluid surfaces.

Integrate Vorticity to Compute Vector Potential at Boundaries

Part 19 provided details of using a treecode algorithm to integrate vorticity and compute vector potential. Use the treecode algorithm to compute vector potential only at boundaries, as shown in Figure 2. (Later, the vector Poisson solver will “fill in” vector potentials through the domain interior.)


Figure 2: Vector potential computed at domain boundaries only for a vortex ring

The treecode algorithm has an asymptotic time complexity of O(N log N), where N is the number of points where the computation occurs. At first glance, that seems more expensive than the O(N) Poisson solver, but you can confine the treecode computation to the boundaries (a two-dimensional manifold) that have Nb points, so the treecode algorithm costs O(Nb log Nb). In contrast, the Poisson algorithm runs in the three-dimensional interior, which has Ni points. (Figure 3 shows how the ratio of the numbers of boundary-to-interior points diminishes as the problem grows.) For a domain that has Ni∝Np3 points, Nb∝Np2= Ni2/3, so the overall cost of this algorithm is:

O(Ni^(2/3) log Ni^(2/3)) + O(Ni)

The first term grows more slowly than the second, so asymptotically, the algorithm overall has asymptotic time complexity O(Ni).
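For completeness, here is the limit argument in LaTeX notation (a restatement of the claim above, not new analysis):

\[
\frac{N_b \log N_b}{N_i}
  = \frac{N_i^{2/3} \log N_i^{2/3}}{N_i}
  = \frac{2 \log N_i}{3\, N_i^{1/3}}
  \longrightarrow 0
  \quad \text{as } N_i \to \infty ,
\qquad \text{so} \qquad
O\!\left(N_i^{2/3} \log N_i^{2/3}\right) + O(N_i) = O(N_i) .
\]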

Figure 3: Ratio of face to volume points on a cubic grid. As the number of grid points increases, the relative cost of computing boundary values diminishes compared to computing interior values.

Retain two code paths to compute the integral either throughout the entire domain or just at boundaries. You can accommodate that by adding logic to conditionally skip the domain interior grid points. The incrementXForInterior logic in the code below shows that modification.

The treecode algorithm takes many conditional branches, and its memory access pattern has poor spatial locality: It jumps around a lot, which makes the algorithm run slowly. Fortunately, vector potential values at boundaries have properties you can exploit to reduce the cost of computing them: They don’t vary much spatially, and they’re far from most of the "action" in the domain interior. You can compute boundary values with lower spatial granularity to save compute time. To do so, compute vector potential on boundaries at every other point, then copy those values to their neighbors. That cuts the cost roughly in half. The even/odd index branch in the code below shows that modification.

(If you’re curious about how much this decimated computation affects the final result, try computing with and without decimation. See if you can tell the difference.)

void VortonSim::ComputeVectorPotentialAtGridpoints_Slice( size_t izStart , size_t izEnd , bool boundariesOnly
                     , const UniformGrid< VECTOR< unsigned > > & vortonIndicesGrid , const NestedGrid< Vorton > & influenceTree )
{
    const size_t            numLayers               = influenceTree.GetDepth() ;
    UniformGrid< Vec3 > &   vectorPotentialGrid     = mVectorPotentialMultiGrid[ 0 ] ;
    const Vec3 &            vMinCorner              = mVelGrid.GetMinCorner() ;
    static const float      nudge                   = 1.0f - 2.0f * FLT_EPSILON ;
    const Vec3              vSpacing                = mVelGrid.GetCellSpacing() * nudge ;
    const unsigned          dims[3]                 =   { mVelGrid.GetNumPoints( 0 )
                                                        , mVelGrid.GetNumPoints( 1 )
                                                        , mVelGrid.GetNumPoints( 2 ) } ;
    const unsigned          numXY                   = dims[0] * dims[1] ;
    unsigned                idx[ 3 ] ;
    const unsigned          incrementXForInterior   = boundariesOnly ? ( dims[0] - 1 ) : 1 ;

    // Compute fluid flow vector potential at each boundary grid point, due to all vortons.
    for( idx[2] = static_cast< unsigned >( izStart ) ; idx[2] < izEnd ; ++ idx[2] )
    {   // For subset of z index values...
        Vec3 vPosition ;
        vPosition.z = vMinCorner.z + float( idx[2] ) * vSpacing.z ;
        const unsigned  offsetZ     = idx[2] * numXY ;
        const bool      topOrBottom = ( 0 == idx[2] ) || ( dims[2]-1 == idx[2] ) ;
        for( idx[1] = 0 ; idx[1] < dims[1] ; ++ idx[1] )
        {   // For every grid point along the y-axis...
            vPosition.y = vMinCorner.y + float( idx[1] ) * vSpacing.y ;
            const unsigned  offsetYZ    = idx[1] * dims[0] + offsetZ ;
            const bool      frontOrBack = ( 0 == idx[1] ) || ( dims[1]-1 == idx[1] ) ;
            const unsigned  incX        = ( topOrBottom || frontOrBack ) ? 1 : incrementXForInterior ;
            for( idx[0] = 0 ; idx[0] < dims[0] ; idx[0] += incX )
            {   // For every grid point along the x-axis...
                vPosition.x = vMinCorner.x + float( idx[0] ) * vSpacing.x ;
                const unsigned offsetXYZ = idx[0] + offsetYZ ;

                if( 0 == ( idx[0] & 1 ) )
                {   // Even x indices.  Compute value.
                    static const unsigned zeros[3] = { 0 , 0 , 0 } ; /* Starter indices for recursive algorithm */
                    if( numLayers > 1 )
                    {
                        vectorPotentialGrid[ offsetXYZ ] = ComputeVectorPotential_Tree( vPosition , zeros , numLayers - 1
                                                                                      , vortonIndicesGrid , influenceTree ) ;
                    }
                    else
                    {
                        vectorPotentialGrid[ offsetXYZ ] = ComputeVectorPotential_Direct( vPosition ) ;
                    }
                }
                else
                {   // Odd x indices. Copy value from preceding grid point.
                    vectorPotentialGrid[ offsetXYZ ] = vectorPotentialGrid[ offsetXYZ - 1 ] ;
                }
            }
        }
    }
}

Retain Boundary Values When Up-Sampling

Interleaved between each solver step, multigrid algorithms down-sample values from finer to coarser grids, then up-sample values from coarser to finer grids (as shown in Figure 4 and explained in part 6). This resampling creates a problem: Lower-fidelity information up-sampled from coarser grids replaces boundary values originally computed on finer grids.

Figure 4: A fine grid, a medium grid, and a coarse grid in a multigrid solver

Because using the treecode to compute vector potential values is expensive, you want to avoid recomputing them. So, modify UniformGrid::UpSample to avoid overwriting values at boundaries. Use a flag in that routine to indicate whether to omit or include boundary points in the destination grid. To omit writing at boundaries, change the for loop begin and end values to cover only the interior (see the sketch after the snippet below). Then, in the multigrid algorithm, during the up-sampling phase, pass the flag to omit up-sampling at boundaries. This code snippet is a modification of the version of VortonSim::ComputeVectorPotential originally presented in part 6. The INTERIOR_ONLY argument is the modification:

// Coarse-to-fine stage of V-cycle: Up-sample from coarse to fine, running iterations of Poisson solver for each up-sampled grid.
for( unsigned iLayer = maxValidDepth ; iLayer >= 1 ; -- iLayer )
{
  // Retain boundary values as they were computed initially (above) in finer grids.
  vectorPotentialMultiGrid.UpSampleFrom( iLayer , UniformGridGeometry::INTERIOR_ONLY ) ;
  SolveVectorPoisson( vectorPotentialMultiGrid[ iLayer - 1 ] , negativeVorticityMultiGrid[ iLayer - 1 ]
					, numSolverSteps , boundaryCondition , mPoissonResidualStats ) ;
}
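
For reference, here is a minimal sketch of the loop-bounds change inside UniformGrid::UpSample. The parameter type RegionE and the name coarserSrc are illustrative assumptions; see the archive for the actual routine.

// Sketch only: skip writing boundary values when the caller passes INTERIOR_ONLY.
void UpSample( const UniformGrid< ItemT > & coarserSrc , UniformGridGeometry::RegionE region )
{
    // An offset of 1 skips the first and last grid point along each axis,
    // so boundary values computed by the treecode survive up-sampling.
    const unsigned begin = ( UniformGridGeometry::INTERIOR_ONLY == region ) ? 1 : 0 ;
    unsigned idx[3] ;
    for( idx[2] = begin ; idx[2] < GetNumPoints( 2 ) - begin ; ++ idx[2] )
        for( idx[1] = begin ; idx[1] < GetNumPoints( 1 ) - begin ; ++ idx[1] )
            for( idx[0] = begin ; idx[0] < GetNumPoints( 0 ) - begin ; ++ idx[0] )
            {   // Interpolate values from coarserSrc into this (finer) grid at idx.
            }
}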

Avoid Superfluous and Expensive Intermediate Fidelity When Down-Sampling

The down-sampling routine provided in part 6 accumulates values from multiple grid points in the finer source grid to compute values in the coarser destination grid. That provides higher-fidelity results, but because the Poisson solver overwrites those values with refinements, the additional fidelity is somewhat superfluous. It’s computationally cheaper to down-sample using nearest values (instead of accumulating), then run more iterations of the Poisson solver (if you want the additional fidelity in the solution). So, you can also modify DownSample to use a faster but less accurate down-sampling technique. This code snippet is a modification of the version of VortonSim::ComputeVectorPotential originally presented in part 6. The FASTER_LESS_ACCURATE argument is the modification:

// Fine-to-coarse stage of V-cycle: down-sample from fine to coarse, running some iterations of the Poisson solver for each down-sampled grid.
  for( unsigned iLayer = 1 ; iLayer < negativeVorticityMultiGrid.GetDepth() ; ++ iLayer )
  {
      const unsigned minDim = MIN3( negativeVorticityMultiGrid[ iLayer ].GetNumPoints( 0 )
          , negativeVorticityMultiGrid[ iLayer ].GetNumPoints( 1 ) , negativeVorticityMultiGrid[ iLayer ].GetNumPoints( 2 ) ) ;
      if( minDim > 2 )
      {
          negativeVorticityMultiGrid.DownSampleInto( iLayer , UniformGridGeometry::FASTER_LESS_ACCURATE ) ;
          vectorPotentialMultiGrid.DownSampleInto( iLayer , UniformGridGeometry::FASTER_LESS_ACCURATE ) ;
          SolveVectorPoisson( vectorPotentialMultiGrid[ iLayer ] , negativeVorticityMultiGrid[ iLayer ] , numSolverSteps
                            , boundaryCondition , mPoissonResidualStats ) ;
      }
      else
      {
          maxValidDepth = iLayer - 1 ;
          break ;
      }
  }

Parallelize Resampling Algorithms with Intel® Threading Building Blocks

Even with the above changes, the resampling routines cost significant time. You can use Intel TBB to parallelize the resampling algorithms. The approach follows the familiar recipe:

  • Write a worker routine that operates on a slice of the problem.
  • Write a functor class that wraps the worker routine.
  • Write a wrapper routine that directs Intel TBB to call a functor.

The worker, functor, and wrapper routines for DownSample and UpSample are sufficiently similar that I only include DownSample in this article. You can see the entire code in the archive that accompanies this article.

This excerpt from the worker routine for DownSample shows the slicing logic and modifications made to implement nearest sampling, described above:

void DownSampleSlice( const UniformGrid< ItemT > & hiRes , AccuracyVersusSpeedE accuracyVsSpeed , size_t izStart , size_t izEnd )
        {
            UniformGrid< ItemT > &  loRes        = * this ;
            const unsigned  &       numXhiRes           = hiRes.GetNumPoints( 0 ) ;
            const unsigned          numXYhiRes          = numXhiRes * hiRes.GetNumPoints( 1 ) ;
            static const float      fMultiplierTable[]  = { 8.0f , 4.0f , 2.0f , 1.0f } ;

            // number of cells in each grid cluster
            const unsigned pClusterDims[] = {   hiRes.GetNumCells( 0 ) / loRes.GetNumCells( 0 )
                                            ,   hiRes.GetNumCells( 1 ) / loRes.GetNumCells( 1 )
                                            ,   hiRes.GetNumCells( 2 ) / loRes.GetNumCells( 2 ) } ;

            const unsigned  numPointsLoRes[3]   = { loRes.GetNumPoints( 0 ) , loRes.GetNumPoints( 1 ) , loRes.GetNumPoints( 2 ) };
            const unsigned  numXYLoRes          = loRes.GetNumPoints( 0 ) * loRes.GetNumPoints( 1 ) ;
            const unsigned  numPointsHiRes[3]   = { hiRes.GetNumPoints( 0 ) , hiRes.GetNumPoints( 1 ) , hiRes.GetNumPoints( 2 ) };
            const unsigned  idxShifts[3]        = { pClusterDims[0] / 2 , pClusterDims[1] / 2 , pClusterDims[2] / 2 } ;

            // Since this loop iterates over each destination cell, it parallelizes without contention.
            unsigned idxLoRes[3] ;
            for( idxLoRes[2] = unsigned( izStart ) ; idxLoRes[2] < unsigned( izEnd ) ; ++ idxLoRes[2] )
            {
                const unsigned offsetLoZ = idxLoRes[2] * numXYLoRes ;
                for( idxLoRes[1] = 0 ; idxLoRes[1] < numPointsLoRes[1] ; ++ idxLoRes[1] )
                {
                    const unsigned offsetLoYZ = idxLoRes[1] * loRes.GetNumPoints( 0 ) + offsetLoZ ;
                    for( idxLoRes[0] = 0 ; idxLoRes[0] < numPointsLoRes[0] ; ++ idxLoRes[0] )
                    {   // For each cell in the loRes layer...
                        const unsigned  offsetLoXYZ   = idxLoRes[0] + offsetLoYZ ;
                        ItemT        &  rValLoRes  = loRes[ offsetLoXYZ ] ;
                        unsigned clusterMinIndices[ 3 ] ;
                        unsigned idxHiRes[3] ;

                        if( UniformGridGeometry::FASTER_LESS_ACCURATE == accuracyVsSpeed )
                        {
                            memset( & rValLoRes , 0 , sizeof( rValLoRes ) ) ;
                            NestedGrid::GetChildClusterMinCornerIndex( clusterMinIndices , pClusterDims , idxLoRes ) ;
                            idxHiRes[2] = clusterMinIndices[2] ;
                            idxHiRes[1] = clusterMinIndices[1] ;
                            idxHiRes[0] = clusterMinIndices[0] ;
                            const unsigned offsetZ      = idxHiRes[2] * numXYhiRes ;
                            const unsigned offsetYZ     = idxHiRes[1] * numXhiRes + offsetZ ;
                            const unsigned offsetXYZ    = idxHiRes[0] + offsetYZ ;
                            const ItemT &  rValHiRes    = hiRes[ offsetXYZ ] ;
                            rValLoRes = rValHiRes ;
                        }
                        else
                        { ... see archive for full code listing...
                        }
                    }
                }
            }
        }

These routines have an interesting twist compared to others in this series: They are methods of a templated class. That means that the functor class must also be templated. The syntax is much easier when the functor class is nested within the UniformGrid class. Then, the fact that it is templated is implicit: The syntax is formally identical to a nontemplated class.

Here is the functor class for DownSample. Note that it is defined inside the UniformGrid templated class:

class UniformGrid_DownSample_TBB
{
			  UniformGrid &                         mLoResDst               ;
		const UniformGrid &                         mHiResSrc               ;
		UniformGridGeometry::AccuracyVersusSpeedE   mAccuracyVersusSpeed    ;
	public:
		void operator() ( const tbb::blocked_range<size_t> & r ) const
		{   // Perform subset of down-sampling
			SetFloatingPointControlWord( mMasterThreadFloatingPointControlWord ) ;
			SetMmxControlStatusRegister( mMasterThreadMmxControlStatusRegister ) ;
			mLoResDst.DownSampleSlice( mHiResSrc , mAccuracyVersusSpeed , r.begin() , r.end() ) ;
		}
		UniformGrid_DownSample_TBB( UniformGrid & loResDst , const UniformGrid & hiResSrc
								  , UniformGridGeometry::AccuracyVersusSpeedE accuracyVsSpeed )
			: mLoResDst( loResDst )
			, mHiResSrc( hiResSrc )
			, mAccuracyVersusSpeed( accuracyVsSpeed )
		{
			mMasterThreadFloatingPointControlWord = GetFloatingPointControlWord() ;
			mMasterThreadMmxControlStatusRegister = GetMmxControlStatusRegister() ;
		}
	private:
		WORD        mMasterThreadFloatingPointControlWord   ;
		unsigned    mMasterThreadMmxControlStatusRegister   ;
} ;

Here is the wrapper routine for DownSample:

void DownSample( const UniformGrid & hiResSrc , AccuracyVersusSpeedE accuracyVsSpeed )
  {
      const size_t numZ = GetNumPoints( 2 ) ;
# if USE_TBB
      {
          // Estimate grain size based on size of problem and number of processors.
          const size_t grainSize =  Max2( size_t( 1 ) , numZ / gNumberOfProcessors ) ;
          parallel_for( tbb::blocked_range<size_t>( 0 , numZ , grainSize )
                      , UniformGrid_DownSample_TBB( * this , hiResSrc , accuracyVsSpeed ) ) ;
      }
# else
      DownSampleSlice( hiResSrc , accuracyVsSpeed , 0 , numZ ) ;
# endif
  }

Performance

Table 1 shows the duration (in milliseconds per frame) of various routines run on a computer with a 3.50 GHz Intel® Core™ i7-3770K processor with four physical cores and two logical cores per physical core.

No. of Threads | Frame | Vorton Sim | Vector Potential | Up-Sample | Poisson | Render
       1       | 29.8  |    6.64    |       2.37       |  0.0554   |  0.222  |  13.8
       2       | 17.9  |    5.11    |       1.36       |  0.0205   |  0.148  |  7.53
       3       | 13.1  |    4.69    |       1.28       |  0.0196   |  0.153  |  4.99
       4       | 13.0  |    4.55    |       1.22       |  0.0116   |  0.148  |  5.04
       8       | 11.1  |    4.44    |       1.13       |  0.0023   |  0.141  |  3.97

Table 1: Duration of Routines Run on an Intel® Core™ i7 Processor

Notice that Vorton Sim does not speed up linearly with the number of cores. Perhaps the algorithms have reached the point where data access (not instructions) is the bottleneck.

Summary and Options

This article presented a fluid simulation that combines integral and differential numerical techniques to achieve an algorithm that takes time linear in the number of grid points or particles. The overall simulation can’t be faster than that because each particle has to be accessed to be rendered. It also provides better results than the treecode alone because the treecode uses approximations everywhere in the computational domain, whereas the Poisson solver produces an inherently smoother and more globally accurate solution.

More work could be done to improve this algorithm. Currently, the numerical routines are broken up logically so that they’re easier to understand, but this causes the computer to revisit the same data repeatedly. After data-parallelizing the routines, their run times become bound by memory access instead of instructions. So, if instead all the fluid simulation operations were consolidated into a single monolithic routine that accessed the data only once and that super-routine were parallelized, it might lead to even greater speed.

About the Author

Dr. Michael J. Gourlay works at Microsoft as a principal development lead on HoloLens in the Environment Understanding group. He previously worked at Electronic Arts (EA Sports) as the software architect for the Football Sports Business Unit, as a senior lead engineer on Madden NFL*, and as an original architect of FranTk* (the engine behind Connected Careers mode). He worked on character physics and ANT* (the procedural animation system that EA Sports uses), on Mixed Martial Arts*, and as a lead programmer on NASCAR. He wrote Lynx* (the visual effects system used in EA games worldwide) and patented algorithms for interactive, high-bandwidth online applications.

He also developed curricula for and taught at the University of Central Florida, Florida Interactive Entertainment Academy, an interdisciplinary graduate program that teaches programmers, producers, and artists how to make video games and training simulations.

Prior to joining EA, he performed scientific research using computational fluid dynamics and the world’s largest massively parallel supercomputers. His previous research also includes nonlinear dynamics in quantum mechanical systems and atomic, molecular, and optical physics. Michael received his degrees in physics and philosophy from Georgia Tech and the University of Colorado at Boulder.

Follow Michael on Twitter: @MiJaGourlay.


Troubleshooting on FLEXlm License Manager Error: "Vendor daemon can't talk to lmgrd"


Problem

Starting the FLEXlm license manager by clicking the "Start" button several times still leaves the prompt "License ... cannot start." After closing and reopening the license manager, it appears to be running, and the "Stop" button appears.

In Task Manager, the process LMGRD.exe (the license manager service) is running. After a moment, INTEL.exe starts and stops (appears and disappears) rapidly. A little later, LMGRD.exe disappears as well.

The license server log shows the following error messages after the license manager starts:

11:35:36 (lmgrd) Starting vendor daemons ...
11:35:36 (lmgrd) License server manager (lmgrd) startup failed:
11:35:36 (lmgrd) File not found, 28000
11:35:36 (lmgrd) Started INTEL (pid 11396)
11:35:36 (INTEL) FlexNet Licensing version v11.12.0.0 build 136775 x64_n6
...
11:35:39 (INTEL) Vendor daemon can't talk to lmgrd (Cannot connect to license server system. (-15,10:10061 "WinSock: Connection refused"))
11:35:39 (INTEL) EXITING DUE TO SIGNAL 28 Exit reason 5
11:35:41 (lmgrd) INTEL exited with status 28 (Communications error)

The firewall is already turned off.

Version

Windows 7

FlexNet Licensing v11.12.0.0

Solution

1. Stop the license service using LMTOOLS.EXE or in Windows Services.
2. In Task Manager, end any remaining license manager processes, such as lmgrd.exe and the vendor daemon (INTEL.exe in this case; adskflex.exe is an example from another vendor).
3. Open the Data Execution Prevention (DEP) settings from Control Panel > System > Advanced > Performance Settings > Data Execution Prevention.
4. Add exceptions for LMGRD.exe and INTEL.exe (and possibly also LMUTIL.EXE and LMTOOLS.EXE) from the folder where the license manager is installed, or select the option to turn off DEP for all programs.

Intel® Software Guard Extensions Tutorial Series: Part 1, Intel® SGX Foundation


The first part in the Intel® Software Guard Extensions (Intel® SGX) tutorial series is a brief overview of the technology. For more detailed information, see the documentation provided in the Intel Software Guard Extensions SDK. Find the list of all the tutorials in this series in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

Understanding Intel® Software Guard Extensions Technology

Software applications frequently need to work with private information such as passwords, account numbers, financial information, encryption keys, and health records. This sensitive data is intended to be accessed only by the designated recipient. In Intel SGX terminology, this private information is referred to as an application’s secrets.

The operating system’s job is to enforce security policy on the computer system so that these secrets are not unintentionally exposed to other users and applications. The OS will prevent a user from accessing another user’s files (unless permission to do so has been explicitly granted), one application from accessing another application’s memory, and an unprivileged user from accessing OS resources except through tightly controlled interfaces. Applications often employ additional safeguards, such as data encryption, to ensure that data sent to storage or over a network connection cannot be accessed by third parties even if the OS and hardware are compromised.

Despite these protections, there is still a significant vulnerability present in most computer systems: while there are numerous guards in place that protect one application from another, and the OS from an unprivileged user, an application has virtually no protection from processes running with higher privileges, including the OS itself. Malware that obtains administrative privileges has unrestricted access to all system resources and all applications running on the system. Sophisticated malware can target an application’s protection schemes to extract encryption keys and even the secret data itself directly from memory.

To enable the high-level protection of secrets and help defend against these software attacks, Intel designed Intel SGX. Intel SGX is a set of CPU instructions that enable applications to create enclaves: protected areas in the application’s address space that provide confidentiality and integrity even in the presence of privileged malware. Enclave code is enabled by using special instructions, and it is built and loaded as a Windows* Dynamic Link Library (DLL) file.
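
To make this concrete, here is a minimal sketch (not the tutorial’s final code; the enclave file name is a hypothetical placeholder) of the untrusted side loading a signed enclave DLL with the Intel SGX SDK’s untrusted runtime library:

#include <sgx_urts.h>
#include <stdio.h>

int main()
{
    sgx_launch_token_t token = { 0 };
    int token_updated = 0;
    sgx_enclave_id_t eid = 0;

    // Load and initialize the signed enclave image. SGX_DEBUG_FLAG requests the
    // debug attribute in debug builds so the Intel SGX debugger can inspect it.
    sgx_status_t status = sgx_create_enclave( "Enclave.signed.dll", SGX_DEBUG_FLAG,
                                              &token, &token_updated, &eid, NULL );
    if ( SGX_SUCCESS != status ) {
        printf( "sgx_create_enclave failed: 0x%x\n", status );
        return 1;
    }

    // ... call into the enclave through generated ECALL proxy functions ...

    sgx_destroy_enclave( eid );
    return 0;
}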

Intel SGX can reduce the attack surface of an application. Figure 1 demonstrates the dramatic difference between attack surfaces with and without the help of Intel SGX enclaves.

Figure 1: Attack-surface areas with and without Intel® Software Guard Extensions enclaves.

How Intel Software Guard Extensions Technology Helps Secure Data

Intel SGX offers the following protections from known hardware and software attacks:

  • Enclave memory cannot be read or written from outside the enclave regardless of the current privilege level and CPU mode.
  • Production enclaves cannot be debugged by software or hardware debuggers. (An enclave can be created with a debug attribute that allows a special debugger—the Intel SGX debugger—to view its content like a standard debugger. This is intended to aid the software development cycle.)
  • The enclave environment cannot be entered through classic function calls, jumps, register manipulation, or stack manipulation. The only way to call an enclave function is through a new instruction that performs several protection checks.
  • Enclave memory is encrypted using industry-standard encryption algorithms with replay protection. Tapping the memory or connecting the DRAM modules to another system will yield only encrypted data (see Figure 2).
  • The memory encryption key randomly changes every power cycle (for example, at boot time, and when resuming from sleep and hibernation states). The key is stored within the CPU and is not accessible.
  • Data isolated within enclaves can only be accessed by code that shares the enclave.

There is a hard limit on the size of the protected memory, set by the system BIOS, and typical values are 64 MB and 128 MB. Some system providers may make this limit a configurable option within their BIOS setup. Depending on the footprint of each enclave, you can expect that between 5 and 20 enclaves can simultaneously reside in memory.

Figure 2: How Intel® Software Guard Extensions helps secure enclave data in protected applications.

Design Considerations

Application design with Intel SGX requires that the application be divided into two components (see Figure 3):

  • Trusted component. This is the enclave. The code that resides in the enclave is the code that accesses an application’s secrets. An application can have more than one trusted component/enclave.
  • Untrusted component. This is the rest of the application and any of its modules. It is important to note that, from the standpoint of an enclave, the OS and the VMM are considered untrusted components.

The trusted component should be as small as possible, limited to the data that needs the most protection and those operations that must act directly on it. A large enclave with a complex interface doesn’t just consume more protected memory: it also creates a larger attack surface.

Enclaves should also have minimal trusted-untrusted component interaction. While enclaves can leave the protected memory region and call functions in the untrusted component (through the use of a special instruction), limiting these dependencies will strengthen the enclave against attack.

Figure 3: Intel® Software Guard Extensions application execution flow.

Attestation

In the Intel SGX architecture, attestation refers to the process of demonstrating that a specific enclave was established on a platform. There are two attestation mechanisms:

  • Local attestation occurs when two enclaves on the same platform authenticate to each other.
  • Remote attestation occurs when an enclave gains the trust of a remote provider.

Local Attestation

Local attestation is useful when applications have more than one enclave that need to work together to accomplish a task or when two separate applications must communicate data between enclaves. Each enclave must verify the other in order to confirm that they are both trustworthy. Once that is done, they establish a protected session and use an ECDH Key Exchange to share a session key. That session key can be used to encrypt the data that must be shared between the two enclaves.

Because one enclave cannot access another enclave’s protected memory space, even when running under the same application, all pointers must be dereferenced to their values and copied, and the complete data set must be marshaled from one enclave to the other.

Remote Attestation

With remote attestation, a combination of Intel SGX software and platform hardware is used to generate a quote that is sent to a third-party server to establish trust. The software includes the application’s enclave, and the Quoting Enclave (QE) and Provisioning Enclave (PvE), both of which are provided by Intel. The attestation hardware is the Intel SGX-enabled CPU. A digest of the software information is combined with a platform-unique asymmetric key from the hardware to generate the quote, which is sent to a remote server over an authenticated channel. If the remote server determines that the enclave was properly instantiated and is running on a genuine Intel SGX-capable processor, it can now trust the enclave and choose to provision secrets to it over the authenticated channel.

Sealing Data

Sealing data is the process of encrypting it so that it can be written to untrusted memory or storage without revealing its contents. The data can be read back in by the enclave at a later date and unsealed (decrypted). The encryption keys are derived internally on demand and are not exposed to the enclave.
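
As an illustration, here is a minimal sketch using the SDK’s trusted sealing functions. The helper name and buffer handling are assumptions; by default, sgx_seal_data derives its key from the Sealing Identity described below. Code running inside the enclave might seal a secret like this:

#include <sgx_tseal.h>
#include <stdint.h>
#include <stdlib.h>

sgx_status_t seal_secret( const uint8_t * secret , uint32_t secret_len ,
                          uint8_t ** sealed_blob , uint32_t * sealed_size )
{
    // Compute the size of the sealed blob (metadata + MAC + ciphertext).
    uint32_t size = sgx_calc_sealed_data_size( 0 , secret_len );
    if ( UINT32_MAX == size ) return SGX_ERROR_UNEXPECTED;

    uint8_t * blob = (uint8_t *) malloc( size );
    if ( NULL == blob ) return SGX_ERROR_OUT_OF_MEMORY;

    // Encrypt and MAC the secret with a key derived on demand inside the CPU;
    // the key itself is never exposed to the enclave code.
    sgx_status_t status = sgx_seal_data( 0 , NULL , secret_len , secret ,
                                         size , (sgx_sealed_data_t *) blob );
    if ( SGX_SUCCESS == status ) { *sealed_blob = blob ; *sealed_size = size ; }
    else                         { free( blob ) ; }
    return status;
}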

There are two methods of sealing data:

  • Enclave Identity. This method produces a key that is unique to this exact enclave.
  • Sealing Identity. This method produces a key that is based on the identity of the enclave’s Sealing Authority. Multiple enclaves from the same signing authority can derive the same key.

Sealing to the Enclave Identity

When sealing to the Enclave Identity, the key is unique to the particular enclave that sealed the data and any change to the enclave that impacts its signature will result in a new key. With this method, data sealed by one version of an enclave is inaccessible by other versions of the enclave, so a side effect of this approach is that sealed data cannot be migrated to newer versions of the application and its enclave. This is intended for applications where old, sealed data should not be used by newer versions of the application.

Sealing to the Sealing Identity

When sealing to the sealing identity, multiple enclaves from the same authority can transparently seal and unseal each other’s data. This allows data from one version of an enclave to be migrated to another, or to be shared among applications from the same software vendor.

If older versions of the software and enclave need to be prevented from accessing data that is sealed by newer application versions, the authority can choose to include a Software Version Number (SVN) when signing the enclave. Enclave versions older than the specified SVN will not be able to derive the sealing key and thus will be prevented from unsealing the data.

How We’ll Use Intel Software Guard Extensions Technology in the Tutorial

We’ve described the three key components of Intel SGX: enclaves, attestation, and sealing. For this tutorial, we’ll focus on implementing enclaves since they are at the core of Intel SGX. You can’t do attestation or sealing without establishing an enclave in the first place. This will also keep the tutorial to a manageable size.

Coming Up Next

Part 2 of the tutorial will focus on the password manager application that we’ll be building and enabling for Intel SGX. We’ll cover the design requirements, constraints, and the user interface. Stay tuned!

 

Implementing User Experience Guidelines in Intel® RealSense™ Applications


Download sample application ›

Introduction

User experience (UX) guidelines exist for the implementation of Intel® RealSense™ technology in applications. However, these guidelines are hard to visualize for four main reasons: (a) You have to interpret end-user interaction in a non-tactile environment during the application design phase where you don’t yet have a prototype for end-user testing, (b) the application could be used on different form factors like laptops and All-In-Ones where the Field-of-View (FOV) and user placement for interaction are different, (c) you have to work with the different fidelities and FOVs of a color and depth camera, and (d) different Intel® RealSense™ SDK modalities have different UX requirements. Having a real-time feedback mechanism to gauge this impact is therefore critical. In this article, we cover an application that is developed for the use of Intel® RealSense™ application developers to help visualize the UX requirements and implement these guidelines in code. The source code for the application is available for download through this article.

The Application

The application works for the user-facing cameras only. Both the F200 and SR300 cameras are covered in the scope of the application. Provision is made to seamlessly switch between the two cameras within the application. If using the F200 camera, the application works on Windows* 8 or Windows® 10. However, if using the SR300 camera, the application requires Windows 10.

There are two windows within the application. One window provides the real-time camera feed where the user can interact. This section also provides visual indicators, which are analogous to visual feedback you will provide in your application. In each of the scenarios below, we call out the visual feedback that has been implemented. The other window provides the code snippets that are required to implement a specific UX scenario. In the sections below, I will walk you through the scenarios covered. WPF is the framework used for development.

Version of the Intel RealSense SDK: Build 8.0.24.6528

Version of Intel® RealSense™ Depth Camera Manager (DCM) (F200): Version 1.4.27.52404

Version of the Intel RealSense Depth Camera Manager (SR300): Version 3.1.25.2599

The application is expected to work on more recent versions of the SDK but has been validated on the version above.

Scenarios

General scenarios

Depth and RGB resolution

The RGB and the depth cameras support different resolutions and have different aspect ratios. Different modalities also have different resolution requirements for each of these cameras.

Below is a snapshot of the stream resolution and frame rate as indicated in the SDK documentation on working with multiple modalities:

The UX problem:

How do I know which areas of the screen real estate should be used for 3D interactions and which ones for UI placement? How can I indicate to the end user visually or through auditory feedback when they have moved out of the interaction zone?

Implementation:

The application uses the SDK API (mentioned below) to obtain the color and depth resolution data for each modality and weaves the depth map over the color map to show superimposing areas. Within the camera feed window, look for the yellow boundary that indicates the space that overlaps the color and depth map. This is your visual feedback. From a UX perspective, you can now visually identify areas of the screen that have to be used for FOV 3D interactions as opposed to UI element placements. Experiment by selecting the different modalities in the first column and choosing from available color and depth resolutions to understand the implications of RGB to depth mapping for your desired usage. The snapshots below show some of the examples of how this overlap changes with the change in inputs.

Example using both depth and color:

Experiment with how the mapping changes as the user switches between different color and depth resolutions. Also choose other modalities that use both depth and RGB to see how the supported color and depth resolution lists change.

Example using only depth:

An example where this is handy is when you are using the hand skeletal tracking. You do not need the color camera for this use case; however, you can switch between the available depth resolutions to see how the screen mapping changes.

Example using only color:

If your application is restricted to using only facial detection, 2D capability will suffice as all you need is the bounding box for the faces. However, if you need the 78 landmarks, you will need to switch to using the 3D example.

The sample application available for download from this article walks through the code required to implement this in your application. As a highlight, the two APIs you will need to create the iterative lists of depth and color resolutions for each modality are PXCMDevice.QueryCaptureProfile(int i) and PXCMVideoModule.QueryCaptureProfile(). However, for the visual representation of how the two maps overlap, you will have to use the Projection interface. We know that each pixel has a color and a depth value associated with it. To apply the overlap of the depth map on the color map, this example chooses only one depth value. To implement this, the application uses the blob module as a workaround: it takes the closest blob to the camera (say, your hand) and maps the center of this blob (observable as a cyan dot on the screen). The depth value of this pixel is then used as the single depth value to map the depth map onto the color map.

Optimal Lighting

The Intel RealSense SDK does not provide any direct API to identify the lighting situation in the environment where the camera is operating. Bad lighting can result in a lot of noise within the color data.

The UX problem:

From within an application, it would be nice to provide visual feedback asking the user to move to an environment with better lighting. Within the application, watch how the camera feed displays the current luminance value on the screen.

Implementation:

The application uses the RGB values and applies the log-average luminance to identify the lighting conditions. More information on the use of log-average luminance can be found here.

The formula used to identify the log average luminance value for each pixel is:

L = 0.27R + 0.67G + 0.06B;

The values range from 0 for pitch black to 1 for very bright light. We do not define a threshold in this sample because that is something developers will have to experiment with. Some factors that can affect luminance values are backlighting, black clothing (many pixels rate close to 0, bringing down the average), outdoor versus indoor lighting conditions, and so on.

Since we have to perform this calculation per pixel in each frame of data, this is a compute-intensive operation. The application shows how to implement this computation using the GPU for optimal performance.
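
For reference, a CPU-side sketch of the same computation might look like the following. The function name and the tightly packed 8-bit RGB layout are assumptions; the sample itself performs this per frame on the GPU.

#include <cmath>
#include <cstddef>
#include <cstdint>

// Log-average luminance of a tightly packed 8-bit RGB image.
float LogAverageLuminance( const uint8_t * rgb , size_t numPixels )
{
    if ( 0 == numPixels ) return 0.0f ;
    const double delta = 1.0e-4 ;   // Small bias avoids log(0) on pure black pixels.
    double sumLogLum = 0.0 ;
    for ( size_t i = 0 ; i < numPixels ; ++ i )
    {
        const double r = rgb[ 3 * i + 0 ] / 255.0 ;
        const double g = rgb[ 3 * i + 1 ] / 255.0 ;
        const double b = rgb[ 3 * i + 2 ] / 255.0 ;
        sumLogLum += std::log( delta + 0.27 * r + 0.67 * g + 0.06 * b ) ;  // Formula from above.
    }
    // The exponential of the mean of the logs is the log-average (geometric mean).
    return (float) std::exp( sumLogLum / (double) numPixels ) ;
}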

Raw Streams

The Intel RealSense SDK provides APIs for capturing the color and depth streams. However, in some cases it may be necessary to capture the raw streams to perform low-level computation. The Intel RealSense SDK provides a C++ API with .NET wrappers. This means that the memory containing the images lives in unmanaged memory, which is non-optimal when displaying images in WPF.

One way to work through this is to use the PXCMImage.ToBitmap() API to create an unmanaged HBITMAP wrapped around the image data, then use System.Windows.Interop.Imaging.CreateBitmapSourceFromHBitmap() to copy the data into the managed heap and wrap a WPF BitmapSource object around it.

The UX problem:

The problem with the above approach is that the YUY2-to-RGB conversion is done on the CPU, after which we must copy the image data from unmanaged to managed memory. This slows down the process considerably and can result in lost data and jittery displays.

The Implementation:

The application shows an alternate implementation using the Direct3D* image source (D3DImage) introduced in the Microsoft .NET Framework 3.5 Service Pack 1, which allows arbitrary DirectX* 9 surfaces to be included in WPF. We implement an unmanaged DirectX library to do the color conversion for display on the GPU. This approach also allows GPU-accelerated image processing via pixel shaders for any custom manipulation needed (for example, processing depth image data). The snapshot below shows the raw color, IR, and depth streams, and the depth image rendered by the custom shader.

Facial Recognition

One of the most commonly used modalities within the Intel RealSense SDK is the face module. This module allows recognizing up to four people in the FOV while also providing 78 landmark points for each face. Using these data points, it is possible to integrate a facial recognition implementation within applications. Windows Hello* in the Windows 10 OS uses these landmarks to identify templates that can be used to identify people at login. More information on how Windows Hello works can be found here. In this application, we focus on some of the UX issues around this module and how to provide visual feedback to correct end-user interaction for better UX.

The UX problem:

The most prominent UX challenge comes from the fact that your end users may not understand where the FOV of the camera is. They may be completely outside this frustum or too far away from the computer, and thus out of range. The Intel RealSense SDK provides many alerts to capture these scenarios. However, implementing these alerts to provide visual feedback when the end user is out of the FOV is critical. In the application, when the end user is in the FOV and in the allowed range, a green bounding box indicates that you are within the interaction zone. Experiment with moving your head toward the edges of your computer or by moving farther away; you will notice a red bounding box appear as soon as the camera loses face data.

The implementation:

The Intel RealSense SDK provides the following alerts for effectively handling user errors: ALERT_FACE_OUT_OF_FOV, ALERT_FACE_OCCLUDED, ALERT_FACE_LOST. For more information on alerts, refer to the PXCMFaceModule. The application uses a simple ViewModel architecture to capture the errors and act on them in the XAML code.

Immersive Collaboration

Imagine a photo booth setup where you are trying to obtain a background segmented image of yourself. As mentioned in the Depth and RGB scenario above, the range for each of the Intel RealSense modalities is different. So how do we indicate to the end user what the optimal range for the 3D camera is, so they can position themselves accordingly within the FOV?

The UX problem:

As with the facial detection scenario, providing a visual indicator to the end user when they move in and out of range is important. In this application, note that the slider is set to the optimal range for the camera FOV for 3D segmentation (indicated in green). To identify the minimal range, move the left slider toward the end with the picture of the camera; note how the pixels turn white. To identify the maximum optimal range, move the right slider toward the right; beyond the optimal point, the pixels are tinted red. The range between the two sliders then provides the optimal range for segmentation.

Look at the last image again: it reveals another UX issue when using background segmentation (BGS). As I move closer to the background, in this case the chair, the 3D segmentation module creates one blob from the foreground and the background object. You will also notice this in cases where you have a black background and are wearing a black shirt: identifying depth with uniform pixels is hard. We do not address that scenario in this application, but we mention it as a UX challenge to be aware of.

The implementation:

The 3D segmentation module provides alerts to handle UX scenarios. Some of the important alerts we implement here are: ALERT_USER_IN_RANGE, ALERT_USER_TOO_CLOSE, and ALERT_USER_TOO_FAR. The application implements these alerts to provide the pigmentation as well as textual feedback to indicate when the user is too close or too far.

3D Scanning

The 3D scanning module for front-facing cameras provides for scanning faces and small objects. In this application, we use the face-scanning example to demonstrate some of the UX challenges and how to provide an implementation in code to add visual and auditory feedback.

The UX problem:

One of the key challenges in getting a good scan involves detecting the scan area. Usually the scan area is locked a few seconds after the scan begins. Here is a snapshot of the region the camera needs to detect for a good scan:

If the user cannot determine the correct scan area, the scan module fails. An example scenario of how things could go wrong: While scanning a face, the user is required to face the camera until the camera detects the face, then turn “slowly” to the left and through the center to the right. Providing visual feedback in the form of a bounding box for the face when the user is within the camera FOV is therefore important when looking at the screen. Note that this is feedback that is required before we can even start the scan. Once the scan begins, when the user turns to the left or to the right, the user cannot see the screen and hence visual feedback is useless. In the sample application, we build both visual and audio feedback to assist with this scenario.

The implementation:

The PXCM3DScan module incorporates the following alerts: ALERT_IN_RANGE, ALERT_TOO_CLOSE, ALERT_TOO_FAR, ALERT_TRACKING, and ALERT_TRACKING_LOST. Within the application, we capture these alerts asynchronously and provide both the visual and audio feedback as necessary. Here is a snapshot of the application capturing the alerts and providing feedback.

Visual feedback before starting the scan and while the scan is in progress:

Note that in this example, we are not demonstrating how you can save the mesh and render it. You can learn more about the specifics of implementing the 3D scan module in your apps through the SDK API documentation.

Summary

The use of Intel RealSense technology in applications poses many UX challenges, both from the perspective of understanding non-tactile feedback and how end users use and interpret the technology. Through a real-time demonstration of some of the UX challenges and code snippets showing potential ways to address those challenges, we hope this application will help developers and UI designers gain a better understanding of Intel RealSense technology.

Additional Resources

Designing Apps for Intel® RealSense™ Technology – User Experience Guidelines with Examples for Windows*

UX Best Practices for Intel® RealSense™ Camera (User Facing) - Technical Tips

API without Secrets: Introduction to Vulkan* Part 4


Download  [PDF 890KB]

Link to Github Sample Code


Go to: API without Secrets: Introduction to Vulkan* Part 3: First Triangle


Tutorial 4: Vertex Attributes – Buffers, Images, and Fences

In previous tutorials we learned the basics. The tutorials themselves were long and (I hope) detailed enough. This is because the learning curve of the Vulkan* API is quite steep and, as you can see, a considerable amount of knowledge is necessary to prepare even the simplest application.

But now we can build on these foundations. So the tutorials will be shorter and focus on smaller topics related to a Vulkan API. In this part I present the recommended way of drawing arbitrary geometry by providing vertex attributes through vertex buffers. As the code of this lesson is similar to the code from the “03 – First Triangle” tutorial, I focus on and describe only the parts that are different.

I also show a different way of organizing the rendering code. Previously we recorded command buffers before the main rendering loop. But in real-life situations, every frame of animation is different, so we can’t prerecord all the rendering commands. We should record and submit command buffers as late as possible to minimize input lag and to use input data that is as recent as possible. We will record a command buffer just before it is submitted to the queue. But a single command buffer isn’t enough: after a command buffer is submitted, we must not re-record it until the graphics card has finished processing it. That moment is signaled through a fence. But waiting on a fence every frame is a waste of time, so we need more command buffers used interchangeably. With more command buffers, more fences are also needed, and the situation gets more complicated. This tutorial shows how to organize the code so it is easily maintained, flexible, and as fast as possible.
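
To make this concrete, here is a sketch of the per-frame resource set such an organization implies (member names are illustrative, not necessarily those used in the accompanying source):

struct RenderingResources {
  VkFramebuffer   Framebuffer;                // Created during recording for the acquired swapchain image.
  VkCommandBuffer CommandBuffer;              // Re-recorded just before each submission.
  VkSemaphore     ImageAvailableSemaphore;    // Signaled when the swapchain image is ready for rendering.
  VkSemaphore     FinishedRenderingSemaphore; // Signaled when rendering completes; gates presentation.
  VkFence         Fence;                      // Signaled when the GPU finishes this command buffer;
                                              // wait on it before re-recording CommandBuffer.
};
// Several such sets are used round-robin, so the CPU rarely blocks on a fence.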

Specifying Render Pass Dependencies

We start by creating a render pass, in the same way as in the previous tutorial, but this time we will provide additional information. A render pass describes the internal organization of rendering resources (images/attachments): how they are used and how they change during the rendering process. Image layout changes can be performed explicitly, by creating image memory barriers, but they can also be performed implicitly, when a proper render pass description is specified (initial, subpass, and final image layouts). Implicit transitions are preferred, as drivers can perform them more optimally.

In this part of the tutorial, as in the previous part, we specify “transfer src” for the initial and final image layouts and “color attachment optimal” for the subpass layout of our render pass. But the previous tutorial lacked important additional information: how the image was used (that is, what types of operations occurred in connection with it) and when it was used (which parts of the rendering pipeline were using it). This information can be specified both in an image memory barrier and in the render pass description. When we create an image memory barrier, we specify the types of operations that concern the given image (memory access types before and after the barrier), and we also specify when this barrier should be placed (pipeline stages in which the image was used before and after the barrier).

When we create a render pass and provide a description for it, the same information is specified through subpass dependencies. This additional data is crucial for a driver to optimally prepare an implicit barrier. Below is the source code that creates a render pass and prepares subpass dependencies.

std::vector<VkSubpassDependency> dependencies = {
  {
    VK_SUBPASS_EXTERNAL,                            // uint32_t                       srcSubpass
    0,                                              // uint32_t                       dstSubpass
    VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,           // VkPipelineStageFlags           srcStageMask
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,  // VkPipelineStageFlags           dstStageMask
    VK_ACCESS_MEMORY_READ_BIT,                      // VkAccessFlags                  srcAccessMask
    VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,           // VkAccessFlags                  dstAccessMask
    VK_DEPENDENCY_BY_REGION_BIT                     // VkDependencyFlags              dependencyFlags
  },
  {
    0,                                              // uint32_t                       srcSubpass
    VK_SUBPASS_EXTERNAL,                            // uint32_t                       dstSubpass
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,  // VkPipelineStageFlags           srcStageMask
    VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,           // VkPipelineStageFlags           dstStageMask
    VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,           // VkAccessFlags                  srcAccessMask
    VK_ACCESS_MEMORY_READ_BIT,                      // VkAccessFlags                  dstAccessMask
    VK_DEPENDENCY_BY_REGION_BIT                     // VkDependencyFlags              dependencyFlags
  }
};

VkRenderPassCreateInfo render_pass_create_info = {
  VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO,        // VkStructureType                sType
  nullptr,                                          // const void                    *pNext
  0,                                                // VkRenderPassCreateFlags        flags
  1,                                                // uint32_t                       attachmentCount
  attachment_descriptions,                          // const VkAttachmentDescription *pAttachments
  1,                                                // uint32_t                       subpassCount
  subpass_descriptions,                             // const VkSubpassDescription    *pSubpasses
  static_cast<uint32_t>(dependencies.size()),       // uint32_t                       dependencyCount
  &dependencies[0]                                  // const VkSubpassDependency     *pDependencies
};

if( vkCreateRenderPass( GetDevice(), &render_pass_create_info, nullptr, &Vulkan.RenderPass ) != VK_SUCCESS ) {
  std::cout << "Could not create render pass!"<< std::endl;
  return false;
}

1. Tutorial04.cpp, function CreateRenderPass()

Subpass dependencies describe dependencies between different subpasses. When an attachment is used in one specific way in a given subpass (for example, rendering into it), but in another way in another subpass (sampling from it), we can create a memory barrier or we can provide a subpass dependency that describes the intended usage of an attachment in these two subpasses. Of course, the latter option is recommended, as the driver can (usually) prepare the barriers in a more optimal way. And the code itself is improved—everything required to understand the code is gathered in one place, one object.

In our simple example, we have only one subpass, but we specify two dependencies. This is because we can (and should) specify dependencies between a subpass (by providing its index) and operations occurring outside of the render pass (by providing the VK_SUBPASS_EXTERNAL value). Here we provide one dependency for the color attachment, between operations occurring before the render pass and its only subpass. The second dependency is defined for operations occurring inside the subpass and after the render pass.

What operations are we talking about? We are using only one attachment, which is an image acquired from the presentation engine (swapchain). The presentation engine uses the image as a source of presentable data; it only displays it. So the only operation that involves this image is “memory read” on an image with the “present src” layout. This operation doesn’t occur in any normal pipeline stage, but it can be represented in the “bottom of pipeline” stage.

Inside our render pass, in its only subpass (with index 0), we are rendering into an image used as a color attachment. So the operation that occurs on this image is “color attachment write”, which is performed in the “color attachment output” pipeline stage (after a fragment shader). After that the image is presented and returned to a presentation engine, which again uses this image as a source of data. So, in our example, the operation after the render pass is the same as before it: “memory read”.

We specify this data through an array of VkSubpassDependency members. When we create a render pass with a VkRenderPassCreateInfo structure, we specify the number of elements in the dependencies array (through the dependencyCount member) and provide the address of its first element (through pDependencies). In the previous part of the tutorial we provided 0 and nullptr for these two fields. The VkSubpassDependency structure contains the following fields:

  • srcSubpass – Index of a first (previous) subpass or VK_SUBPASS_EXTERNAL if we want to indicate dependency between subpass and operations outside of a render pass.
  • dstSubpass – Index of a second (later) subpass (or VK_SUBPASS_EXTERNAL).
  • srcStageMask – Pipeline stage during which a given attachment was used before (in a src subpass).
  • dstStageMask – Pipeline stage during which a given attachment will be used later (in a dst subpass).
  • srcAccessMask – Types of memory operations that occurred in a src subpass or before a render pass.
  • dstAccessMask – Types of memory operations that occurred in a dst subpass or after a render pass.
  • dependencyFlags – Flag describing the type (region) of dependency.

Graphics Pipeline Creation

Now we will create a graphics pipeline object. (We should create framebuffers for our swapchain images, but we will do that during command buffer recording.) We don’t want to render geometry that is hardcoded into a shader. We want to draw any number of vertices, and we also want to provide additional attributes, not only vertex positions. What should we do first?

Writing Shaders

First have a look at the vertex shader written in GLSL code:

#version 450

layout(location = 0) in vec4 i_Position;
layout(location = 1) in vec4 i_Color;

out gl_PerVertex
{
  vec4 gl_Position;
};

layout(location = 0) out vec4 v_Color;

void main() {
    gl_Position = i_Position;
    v_Color = i_Color;
}

2. shader.vert

This shader is quite simple, though more complicated than the one from Tutorial 03.

We specify two input attributes (named i_Position and i_Color). In Vulkan, all attributes must have a location layout qualifier. When we specify a description of the vertex attributes in Vulkan API, the names of these attributes don’t matter, only their indices/locations. In OpenGL* we could ask for a location of an attribute with a given name. In Vulkan we can’t do this. Location layout qualifiers are the only way to go.

Next, we redeclare the gl_PerVertex block in the shader. Vulkan uses shader I/O blocks, and we should redeclare a gl_PerVertex block to specify exactly what members of this block to use. When we don’t, the default definition is used. But we must remember that the default definition contains gl_ClipDistance[], which requires us to enable a feature named shaderClipDistance (and in Vulkan we can’t use features that are not enabled during device creation or our application may not work correctly). Here we are using only a gl_Position member so the feature is not required.

We then specify an additional output variable called v_Color, in which we store the vertices’ colors. Inside the main function we copy the values provided by the application to the proper output variables: position to gl_Position and color to v_Color.

Now look at a fragment shader to see how attributes are consumed.

#version 450

layout(location = 0) in vec4 v_Color;

layout(location = 0) out vec4 o_Color;

void main() {
  o_Color = v_Color;
}

3.shader.frag

In a fragment shader, the input varying variable v_Color is copied to the only output variable called o_Color. Both variables have location layout specifiers. The v_Color variable has the same location as the output variable in the vertex shader, so it will contain color values interpolated between vertices.

These shaders can be compiled to SPIR-V the same way as previously. The following commands do this (the -V flag produces the binary modules, while -H additionally prints a human-readable disassembly, redirected here to the text files):

glslangValidator.exe -V -H shader.vert > vert.spv.txt

glslangValidator.exe -V -H shader.frag > frag.spv.txt

Now that we know what attributes we want to use in our shaders, we can create the appropriate graphics pipeline.

Vertex Attributes Specification

The most important change in this tutorial is in the vertex input state creation, for which we fill a variable of type VkPipelineVertexInputStateCreateInfo. In this variable we provide pointers to structures that define the number and layout of our vertex attributes and the bindings from which their data is read.

We want to use two attributes: vertex positions, composed of four float components, and vertex colors, also composed of four float values. We will lay out all of our vertex data in one buffer using an interleaved attributes layout. This means that the position of the first vertex is placed first, then the color of the same vertex, then the position of the second vertex, then its color, and so on for the third and fourth vertices.
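
This interleaved layout corresponds to a simple structure. Here is a sketch of the VertexData type assumed by the code in this section (the actual definition lives elsewhere in the tutorial’s source, but it must match the strides and offsets used below):

struct VertexData {
  float x, y, z, w;   // position, consumed at location 0
  float r, g, b, a;   // color, consumed at location 1
};

The vertex input specification itself is performed with the following code: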

std::vector<VkVertexInputBindingDescription> vertex_binding_descriptions = {
  {
    0,                                                          // uint32_t                                       binding
    sizeof(VertexData),                                         // uint32_t                                       stride
    VK_VERTEX_INPUT_RATE_VERTEX                                 // VkVertexInputRate                              inputRate
  }
};

std::vector<VkVertexInputAttributeDescription> vertex_attribute_descriptions = {
  {
    0,                                                          // uint32_t                                       location
    vertex_binding_descriptions[0].binding,                     // uint32_t                                       binding
    VK_FORMAT_R32G32B32A32_SFLOAT,                              // VkFormat                                       format
    offsetof(struct VertexData, x)                              // uint32_t                                       offset
  },
  {
    1,                                                          // uint32_t                                       location
    vertex_binding_descriptions[0].binding,                     // uint32_t                                       binding
    VK_FORMAT_R32G32B32A32_SFLOAT,                              // VkFormat                                       format
    offsetof( struct VertexData, r )                            // uint32_t                                       offset
  }
};

VkPipelineVertexInputStateCreateInfo vertex_input_state_create_info = {
  VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO,    // VkStructureType                                sType
  nullptr,                                                      // const void                                    *pNext
  0,                                                            // VkPipelineVertexInputStateCreateFlags          flags;
  static_cast<uint32_t>(vertex_binding_descriptions.size()),    // uint32_t                                       vertexBindingDescriptionCount
  &vertex_binding_descriptions[0],                              // const VkVertexInputBindingDescription         *pVertexBindingDescriptions
  static_cast<uint32_t>(vertex_attribute_descriptions.size()),  // uint32_t                                       vertexAttributeDescriptionCount
  &vertex_attribute_descriptions[0]                             // const VkVertexInputAttributeDescription       *pVertexAttributeDescriptions
};

4.Tutorial04.cpp, function CreatePipeline()

First specify the binding (general memory information) of vertex data through VkVertexInputBindingDescription. It contains the following fields:

  • binding – Index of a binding with which vertex data will be associated.
  • stride – The distance in bytes between two consecutive elements (the same attribute for two neighbor vertices).
  • inputRate – Defines how data should be consumed, per vertex or per instance.

The stride and inputRate fields are quite self-explanatory; the binding member may require some additional information. When we create a vertex buffer, we bind it to a chosen slot before issuing rendering operations. The slot number (an index) is this binding, and here we describe how data in this slot is laid out in memory and how it should be consumed (per vertex or per instance). Different vertex buffers can be bound to different bindings, and each binding may be positioned differently in memory, as sketched below.
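
For illustration, here is a hypothetical sketch of binding two separate (non-interleaved) vertex buffers to bindings 0 and 1 in a single vkCmdBindVertexBuffers() call; in this tutorial we use only one interleaved buffer bound to binding 0:

VkBuffer     buffers[] = { position_buffer, color_buffer }; // hypothetical buffer handles
VkDeviceSize offsets[] = { 0, 0 };
vkCmdBindVertexBuffers( command_buffer, 0, 2, buffers, offsets );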

The next step is to define all vertex attributes. For each attribute we must specify a location (index, the same as in the shader source code’s location layout qualifier), a source of data (the binding from which data will be read), a format (data type and number of components), and an offset at which data for this specific attribute can be found (offset from the beginning of the data for a given vertex, not from the beginning of all vertex data). The situation here is similar to OpenGL, where we created Vertex Buffer Objects (a VBO can be thought of as an equivalent of a binding) and defined attributes with the glVertexAttribPointer() function, through which we specified an attribute’s index (location), size and type (number of components and format), stride, and offset. This information is provided through the VkVertexInputAttributeDescription structure. It contains these fields:

  • location – Index of an attribute, the same as defined by the location layout specifier in a shader source code.
  • binding – The number of the slot from which data should be read (source of data like VBO in OpenGL), the same binding as in a VkVertexInputBindingDescription structure and vkCmdBindVertexBuffers() function (described later).
  • format – Data type and number of components per attribute.
  • offset – Beginning of data for a given attribute.

When we are ready, we can prepare vertex input state description by filling a variable of type VkPipelineVertexInputStateCreateInfo which consist of the following fields:

  • sType – Type of structure, here it should be equal to VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO.
  • pNext – Pointer reserved for extensions. Right now set this value to null.
  • flags – Parameter reserved for future use.
  • vertexBindingDescriptionCount – Number of elements in the pVertexBindingDescriptions array.
  • pVertexBindingDescriptions – Array describing all bindings defined for a given pipeline (buffers from which values of all attributes are read).
  • vertexAttributeDescriptionCount – Number of elements in the pVertexAttributeDescriptions array.
  • pVertexAttributeDescriptions – Array with elements specifying all vertex attributes.

This concludes the vertex attribute specification at pipeline creation. But to use the attributes, we must create a vertex buffer and bind it to the command buffer before we issue a rendering command.

Input Assembly State Specification

Previously we drew a single triangle using a triangle list topology. Now we will draw a quad, which is more convenient to define with just four vertices in a triangle strip than with two triangles and six vertices in a list. We define the topology through the VkPipelineInputAssemblyStateCreateInfo structure, which has the following members:

  • sType – Structure type, here equal to VK_STRUCTURE_TYPE_PIPELINE_INPUT_ASSEMBLY_STATE_CREATE_INFO.
  • pNext – Pointer reserved for extensions.
  • flags – Parameter reserved for future use.
  • topology – Topology used for drawing vertices (like triangle fan, strip, list).
  • primitiveRestartEnable – Parameter defining whether we want to restart assembling a primitive by using a special value of vertex index.

Here is the code sample used to define triangle strip topology:

VkPipelineInputAssemblyStateCreateInfo input_assembly_state_create_info = {
  VK_STRUCTURE_TYPE_PIPELINE_INPUT_ASSEMBLY_STATE_CREATE_INFO,  // VkStructureType                                sType
  nullptr,                                                      // const void                                    *pNext
  0,                                                            // VkPipelineInputAssemblyStateCreateFlags        flags
  VK_PRIMITIVE_TOPOLOGY_TRIANGLE_STRIP,                         // VkPrimitiveTopology                            topology
  VK_FALSE                                                      // VkBool32                                       primitiveRestartEnable
};

5.Tutorial04.cpp, function CreatePipeline()

Viewport State Specification

In this tutorial we introduce another change. Previously, for the sake of simplicity, we hardcoded the viewport and scissor test parameters, which unfortunately caused our image to always have the same size, no matter how big the application window was. This time we won’t specify these values statically; we will use a dynamic state for them instead. The viewport state structure must still be filled, though. Here is the code that does it:

VkPipelineViewportStateCreateInfo viewport_state_create_info = {
  VK_STRUCTURE_TYPE_PIPELINE_VIEWPORT_STATE_CREATE_INFO,        // VkStructureType                                sType
  nullptr,                                                      // const void                                    *pNext
  0,                                                            // VkPipelineViewportStateCreateFlags             flags
  1,                                                            // uint32_t                                       viewportCount
  nullptr,                                                      // const VkViewport                              *pViewports
  1,                                                            // uint32_t                                       scissorCount
  nullptr                                                       // const VkRect2D                                *pScissors
};

6.Tutorial04.cpp, function CreatePipeline()

The VkPipelineViewportStateCreateInfo structure has the following members:

  • sType – Type of the structure, VK_STRUCTURE_TYPE_PIPELINE_VIEWPORT_STATE_CREATE_INFO here.
  • pNext – Pointer reserved for extension-specific parameters.
  • flags – Parameter reserved for future use.
  • viewportCount – Number of viewports.
  • pViewports – Pointer to a structure defining static viewport parameters.
  • scissorCount – Number of scissor rectangles (must have the same value as viewportCount parameter).
  • pScissors – Pointer to an array of 2D rectangles defining static scissor test parameters for each viewport.

When we define the viewport and scissor parameters through a dynamic state, we don’t have to fill the pViewports and pScissors members; that’s why they are set to null in the example above. But we always have to define the number of viewports and scissor test rectangles. These values are always specified through the VkPipelineViewportStateCreateInfo structure, no matter whether the viewport and scissor state is dynamic or static.

Dynamic State Specification

When we create a pipeline, we can specify which parts of it are always static (defined through structures at pipeline creation) and which are dynamic (specified by proper function calls during command buffer recording). This allows us to lower the number of pipeline objects that differ only in small details like line width, blend constants, stencil parameters, or the mentioned viewport size. Here is the code used to define the parts of the pipeline that should be dynamic:

std::vector<VkDynamicState> dynamic_states = {
  VK_DYNAMIC_STATE_VIEWPORT,
  VK_DYNAMIC_STATE_SCISSOR,
};

VkPipelineDynamicStateCreateInfo dynamic_state_create_info = {
  VK_STRUCTURE_TYPE_PIPELINE_DYNAMIC_STATE_CREATE_INFO,         // VkStructureType                                sType
  nullptr,                                                      // const void                                    *pNext
  0,                                                            // VkPipelineDynamicStateCreateFlags              flags
  static_cast<uint32_t>(dynamic_states.size()),                 // uint32_t                                       dynamicStateCount
  &dynamic_states[0]                                            // const VkDynamicState                          *pDynamicStates
};

7.Tutorial04.cpp, function CreatePipeline()

It is done by using a structure of type VkPipelineDynamicStateCreateInfo, which contains the following fields:

  • sType – Parameter defining the type of a given structure, here equal to VK_STRUCTURE_TYPE_PIPELINE_DYNAMIC_STATE_CREATE_INFO.
  • pNext – Parameter reserved for extensions.
  • flags – Parameter reserved for future use.
  • dynamicStateCount – Number of elements in pDynamicStates array.
  • pDynamicStates – Array containing enums, specifying which parts of a pipeline should be marked as dynamic. Each element of this array is of type VkDynamicState.

Pipeline Object Creation

We now have defined all the necessary parameters of a graphics pipeline, so we can create a pipeline object. Here is the code that does it:

VkGraphicsPipelineCreateInfo pipeline_create_info = {
  VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO,              // VkStructureType                                sType
  nullptr,                                                      // const void                                    *pNext
  0,                                                            // VkPipelineCreateFlags                          flags
  static_cast<uint32_t>(shader_stage_create_infos.size()),      // uint32_t                                       stageCount
  &shader_stage_create_infos[0],                                // const VkPipelineShaderStageCreateInfo         *pStages
  &vertex_input_state_create_info,                              // const VkPipelineVertexInputStateCreateInfo    *pVertexInputState
  &input_assembly_state_create_info,                            // const VkPipelineInputAssemblyStateCreateInfo  *pInputAssemblyState
  nullptr,                                                      // const VkPipelineTessellationStateCreateInfo   *pTessellationState
  &viewport_state_create_info,                                  // const VkPipelineViewportStateCreateInfo       *pViewportState
  &rasterization_state_create_info,                             // const VkPipelineRasterizationStateCreateInfo  *pRasterizationState
  &multisample_state_create_info,                               // const VkPipelineMultisampleStateCreateInfo    *pMultisampleState
  nullptr,                                                      // const VkPipelineDepthStencilStateCreateInfo   *pDepthStencilState
  &color_blend_state_create_info,                               // const VkPipelineColorBlendStateCreateInfo     *pColorBlendState
  &dynamic_state_create_info,                                   // const VkPipelineDynamicStateCreateInfo        *pDynamicState
  pipeline_layout.Get(),                                        // VkPipelineLayout                               layout
  Vulkan.RenderPass,                                            // VkRenderPass                                   renderPass
  0,                                                            // uint32_t                                       subpass
  VK_NULL_HANDLE,                                               // VkPipeline                                     basePipelineHandle
  -1                                                            // int32_t                                        basePipelineIndex
};

if( vkCreateGraphicsPipelines( GetDevice(), VK_NULL_HANDLE, 1, &pipeline_create_info, nullptr, &Vulkan.GraphicsPipeline ) != VK_SUCCESS ) {
  std::cout << "Could not create graphics pipeline!"<< std::endl;
  return false;
}
return true;

8.Tutorial04.cpp, function CreatePipeline()

The most important variable, which contains references to all pipeline parameters, is of type VkGraphicsPipelineCreateInfo. The only change from the previous tutorial is the addition of the pDynamicState parameter, which points to a structure of type VkPipelineDynamicStateCreateInfo, described above. Every pipeline state that is specified as dynamic must be set through a proper function call during command buffer recording, as shown later in this tutorial.

The pipeline object itself is created by calling the vkCreateGraphicsPipelines() function. Its second parameter is an optional pipeline cache (VK_NULL_HANDLE here), which can speed up the creation of many similar pipeline objects.

Vertex Buffer Creation

To use vertex attributes, apart from specifying them during pipeline creation, we need to prepare a buffer that will contain all the data for these attributes. From this buffer, the values for attributes will be read and provided to the vertex shader.

In Vulkan, buffer and image creation consists of at least two stages. First, we create the object itself. Next, we create a memory object, which is then bound to the buffer (or image) and from which the buffer takes its storage space. This approach allows us to specify additional parameters for the memory and control it in more detail.

To create a (general) buffer object we call vkCreateBuffer(). It accepts, among other parameters, a pointer to a variable of type VkBufferCreateInfo, which defines the parameters of the created buffer. Here is the code responsible for creating a buffer used as a source of data for vertex attributes:

VertexData vertex_data[] = {
  {
    -0.7f, -0.7f, 0.0f, 1.0f,
    1.0f, 0.0f, 0.0f, 0.0f
  },
  {
    -0.7f, 0.7f, 0.0f, 1.0f,
    0.0f, 1.0f, 0.0f, 0.0f
  },
  {
    0.7f, -0.7f, 0.0f, 1.0f,
    0.0f, 0.0f, 1.0f, 0.0f
  },
  {
    0.7f, 0.7f, 0.0f, 1.0f,
    0.3f, 0.3f, 0.3f, 0.0f
  }
};

Vulkan.VertexBuffer.Size = sizeof(vertex_data);

VkBufferCreateInfo buffer_create_info = {
  VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,             // VkStructureType        sType
  nullptr,                                          // const void            *pNext
  0,                                                // VkBufferCreateFlags    flags
  Vulkan.VertexBuffer.Size,                         // VkDeviceSize           size
  VK_BUFFER_USAGE_VERTEX_BUFFER_BIT,                // VkBufferUsageFlags     usage
  VK_SHARING_MODE_EXCLUSIVE,                        // VkSharingMode          sharingMode
  0,                                                // uint32_t               queueFamilyIndexCount
  nullptr                                           // const uint32_t        *pQueueFamilyIndices
};

if( vkCreateBuffer( GetDevice(), &buffer_create_info, nullptr, &Vulkan.VertexBuffer.Handle ) != VK_SUCCESS ) {
  std::cout << "Could not create a vertex buffer!"<< std::endl;
  return false;
}

9.Tutorial04.cpp, function CreateVertexBuffer()

At the beginning of the CreateVertexBuffer() function we define a set of values for the position and color attributes. The four position components of the first vertex come first, then the four color components of the same vertex, then the position and color of the second vertex, and so on for the third and fourth vertices. The size of this array is used to define the size of the buffer. Remember, though, that internally the graphics driver may require more storage for the buffer than the size requested by the application.

Next we define a variable of VkBufferCreateInfo type. It is a structure with the following fields:

  • sType – Type of the structure, which should be set to VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO value.
  • pNext – Parameter reserved for extensions.
  • flags – Parameter defining additional creation parameters. Right now it allows creation of a buffer backed by a sparse memory (something similar to a mega texture). As we don’t want to use sparse memory, we can set this parameter to zero.
  • size – Size, in bytes, of a buffer.
  • usage – This parameter defines how we intend to use this buffer in the future. We can specify that we want to use the buffer as a uniform buffer, an index buffer, a source of data for transfer (copy) operations, and so on. Here we intend to use this buffer as a vertex buffer. Remember that we can’t use a buffer for a purpose that was not declared during buffer creation.
  • sharingMode – Sharing mode, similarly to swapchain images, defines whether a given buffer can be accessed by multiple queues at the same time (concurrent sharing mode) or by just a single queue (exclusive sharing mode). If a concurrent sharing mode is specified, we must provide indices of all queues that will have access to a buffer. If we want to define an exclusive sharing mode, we can still reference this buffer in different queues, but only in one at a time. If we want to use a buffer in a different queue (submit commands that reference this buffer to another queue), we need to specify buffer memory barrier that transitions buffer’s ownership from one queue to another.
  • queueFamilyIndexCount – Number of queue indices in pQueueFamilyIndices array (only when concurrent sharing mode is specified).
  • pQueueFamilyIndices – Array with indices of all queues that will reference buffer (only when concurrent sharing mode is specified).

Finally, to create the buffer, we call the vkCreateBuffer() function.

Buffer Memory Allocation

We next create a memory object that will back the buffer’s storage.

VkMemoryRequirements buffer_memory_requirements;
vkGetBufferMemoryRequirements( GetDevice(), buffer, &buffer_memory_requirements );

VkPhysicalDeviceMemoryProperties memory_properties;
vkGetPhysicalDeviceMemoryProperties( GetPhysicalDevice(), &memory_properties );

for( uint32_t i = 0; i < memory_properties.memoryTypeCount; ++i ) {
  if( (buffer_memory_requirements.memoryTypeBits & (1 << i)) &&
    (memory_properties.memoryTypes[i].propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) ) {

    VkMemoryAllocateInfo memory_allocate_info = {
      VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,     // VkStructureType                        sType
      nullptr,                                    // const void                            *pNext
      buffer_memory_requirements.size,            // VkDeviceSize                           allocationSize
      i                                           // uint32_t                               memoryTypeIndex
    };

    if( vkAllocateMemory( GetDevice(), &memory_allocate_info, nullptr, memory ) == VK_SUCCESS ) {
      return true;
    }
  }
}
return false;

10.Tutorial04.cpp, function AllocateBufferMemory()

First we must check what the memory requirements for the created buffer are. We do this by calling the vkGetBufferMemoryRequirements() function. It stores the parameters for memory creation in a variable whose address we provide in the last parameter. This variable must be of type VkMemoryRequirements, and it contains information about the required size, memory alignment, and supported memory types. What are memory types?

Each device may have and expose different memory types: heaps of various sizes that have different properties. One memory type may be the device’s local memory, located on GDDR chips (and thus very fast). Another may be shared memory that is visible to both the graphics card and the CPU. Both the graphics card and the application may access this memory, but such a memory type is slower than device-local-only memory (which is accessible only to the graphics card).

To check what memory heaps and types are available, we need to call the vkGetPhysicalDeviceMemoryProperties() function, which stores information about memory in a variable of type VkPhysicalDeviceMemoryProperties. It contains the following information:

  • memoryHeapCount – Number of memory heaps exposed by a given device.
  • memoryHeaps – An array of memory heaps. Each heap represents a memory of different size and properties.
  • memoryTypeCount – Number of different memory types exposed by a given device.
  • memoryTypes – An array of memory types. Each element describes specific memory properties and contains an index of a heap that has these particular properties.

Before we can allocate memory for a given buffer, we need to check which memory type fulfills the buffer’s memory requirements. If we have additional, specific needs, we can check for those too. For all of this, we iterate over all available memory types. Buffer memory requirements have a field called memoryTypeBits; if the bit at a given index is set in this field, we can allocate memory of the type represented by that index for the buffer. But remember that while there must always be a memory type that fulfills the buffer’s memory requirements, it may not support some of our other, specific needs. In that case we need to look for another memory type or change our additional requirements.

Here, our additional requirement is that the memory needs to be host-visible. This means that the application can map this memory and access it: read it or write data to it. Such memory is usually slower than device-local-only memory, but this way we can easily upload data for our vertex attributes. The next tutorial will show how to use device-local-only memory for better performance.

Fortunately, the host-visible requirement is common, and it should be easy to find a memory type that supports both the buffer’s memory requirements and the host-visible property. We then prepare a variable of type VkMemoryAllocateInfo and fill all its fields:

  • sType – Type of the structure, here set to VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO.
  • pNext – Pointer reserved for extensions.
  • allocationSize – Minimum required memory size that should be allocated.
  • memoryTypeIndex – Index of the memory type we want to use for the created memory object. It is the index of one of the bits that are set (have a value of one) in the buffer’s memory requirements.

After we fill such a structure we call vkAllocateMemory() and check whether the memory object allocation succeeded.

Binding a Buffer’s Memory

When we are done creating a memory object, we must bind it to our buffer. Without this, the buffer has no storage space, and we won’t be able to store any data in it.

if( !AllocateBufferMemory( Vulkan.VertexBuffer.Handle, &Vulkan.VertexBuffer.Memory ) ) {
  std::cout << "Could not allocate memory for a vertex buffer!"<< std::endl;
  return false;
}

if( vkBindBufferMemory( GetDevice(), Vulkan.VertexBuffer.Handle, Vulkan.VertexBuffer.Memory, 0 ) != VK_SUCCESS ) {
  std::cout << "Could not bind memory for a vertex buffer!"<< std::endl;
  return false;
}

11.Tutorial04.cpp, function CreateVertexBuffer()

AllocateBufferMemory() is the function, presented earlier, that allocates a memory object. When the memory object is created, we bind it to the buffer by calling the vkBindBufferMemory() function. During the call we must specify the handle of the buffer, the handle of the memory object, and an offset. The offset is very important and requires some additional explanation.

When we queried the buffer’s memory requirements, we acquired information about the required size, memory types, and alignment. Different buffer usages may require different memory alignments. The beginning of a memory object (an offset of 0) satisfies all alignments: all memory objects are created at addresses that fulfill the requirements of all the different usages. So when we specify an offset of zero, we don’t have to worry about anything.

But we can create a larger memory object and use it as storage space for multiple buffers (or images). This, in fact, is the recommended approach. Creating larger memory objects means creating fewer of them, which allows the driver to track fewer objects in general. Memory objects must be tracked by the driver because of OS requirements and security measures. Larger memory objects also don’t cause big problems with memory fragmentation. Finally, allocating larger amounts of memory and keeping similar objects together in them increases cache hits and thus improves the performance of our application.

But when we allocate a larger memory object and bind it to multiple buffers (or images), not all of them can be bound at offset zero; only one can. All other buffers bound to the same memory object must be placed further away, after the space used by the first buffer (or image), and their offsets must meet the alignment requirements reported by the query. That’s why the alignment member is important; see the sketch below.
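
Here is a minimal sketch of this suballocation technique, assuming two hypothetical buffers (buffer_a and buffer_b) that both accept the same memory type (memory_type_index, chosen as shown earlier):

VkMemoryRequirements reqs_a, reqs_b;
vkGetBufferMemoryRequirements( GetDevice(), buffer_a, &reqs_a );
vkGetBufferMemoryRequirements( GetDevice(), buffer_b, &reqs_b );

// The first buffer can always be placed at offset 0.
VkDeviceSize offset_a = 0;
// Round the second buffer's offset up to the next multiple of its required
// alignment (alignments reported by Vulkan are powers of two, so masking works).
VkDeviceSize offset_b = (reqs_a.size + reqs_b.alignment - 1) & ~(reqs_b.alignment - 1);

VkMemoryAllocateInfo memory_allocate_info = {
  VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,           // VkStructureType        sType
  nullptr,                                          // const void            *pNext
  offset_b + reqs_b.size,                           // VkDeviceSize           allocationSize
  memory_type_index                                 // uint32_t               memoryTypeIndex
};

VkDeviceMemory memory;
if( vkAllocateMemory( GetDevice(), &memory_allocate_info, nullptr, &memory ) == VK_SUCCESS ) {
  vkBindBufferMemory( GetDevice(), buffer_a, memory, offset_a );
  vkBindBufferMemory( GetDevice(), buffer_b, memory, offset_b );
}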

When our buffer is created and memory for it is allocated and bound, we can fill the buffer with data for vertex attributes.

Uploading Vertex Data

We have created a buffer, and we have bound to it a memory that is host-visible. This means we can map this memory, acquire a pointer to it, and use this pointer to copy data from our application to the buffer itself (similar in effect to OpenGL’s glBufferData() function):

void *vertex_buffer_memory_pointer;
if( vkMapMemory( GetDevice(), Vulkan.VertexBuffer.Memory, 0, Vulkan.VertexBuffer.Size, 0, &vertex_buffer_memory_pointer ) != VK_SUCCESS ) {
  std::cout << "Could not map memory and upload data to a vertex buffer!"<< std::endl;
  return false;
}

memcpy( vertex_buffer_memory_pointer, vertex_data, Vulkan.VertexBuffer.Size );

VkMappedMemoryRange flush_range = {
  VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,            // VkStructureType        sType
  nullptr,                                          // const void            *pNext
  Vulkan.VertexBuffer.Memory,                       // VkDeviceMemory         memory
  0,                                                // VkDeviceSize           offset
  VK_WHOLE_SIZE                                     // VkDeviceSize           size
};
vkFlushMappedMemoryRanges( GetDevice(), 1, &flush_range );

vkUnmapMemory( GetDevice(), Vulkan.VertexBuffer.Memory );

return true;

12.Tutorial04.cpp, function CreateVertexBuffer()

To map the memory, we call the vkMapMemory() function. In the call we must specify which memory object we want to map and the region to access; the region is defined by an offset from the beginning of the memory object’s storage and a size. After a successful call we acquire a pointer and can use it to copy data from our application to the mapped memory. Here we copy vertex data from the array with vertex positions and colors.

After the memory copy operation, and before we unmap the memory (we don’t have to unmap it; we can keep the pointer, and this shouldn’t impact performance), we need to tell the driver which parts of the memory were modified by our operations. This operation is called flushing. Through it we specify all the memory ranges our application copied data to. Ranges don’t have to be contiguous. They are defined by an array of VkMappedMemoryRange elements, which contain these fields:

  • sType – Structure type, here equal to VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE.
  • pNext – Pointer reserved for extensions.
  • memory – Handle of a mapped and modified memory object.
  • offset – Offset (from the beginning of a given memory object’s storage) at which a given range starts.
  • size – Size, in bytes, of an affected region. If the whole memory, from an offset to the end, was modified, we can use the special value of VK_WHOLE_SIZE.

When we have defined all the memory ranges that should be flushed, we call the vkFlushMappedMemoryRanges() function. After that, the driver knows which parts were modified and will reload them (that is, refresh its caches). Reloading usually occurs on barriers. After modifying a buffer, we should set a buffer memory barrier, which tells the driver that some operations influenced the buffer and that it should be refreshed. But, fortunately, in this case such a barrier is placed implicitly by the driver on submission of a command buffer that references the given buffer, and no additional operations are required. Note that explicit flushing is needed only for memory types that lack the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT property; for coherent memory it is unnecessary, though harmless.
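
As a side note, here is a sketch of how we could detect whether flushing is needed at all, assuming access to the memory_properties variable and the chosen type index i from the AllocateBufferMemory() function shown earlier:

// True when writes through the mapped pointer are visible to the device
// without an explicit vkFlushMappedMemoryRanges() call.
bool memory_is_coherent =
  (memory_properties.memoryTypes[i].propertyFlags &
   VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) != 0;

Now we can use this buffer when recording rendering commands.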

Rendering Resources Creation

We now must prepare the resources required for command buffer recording. In previous tutorials we recorded one static command buffer for each swapchain image. Here we will reorganize the rendering code. We will still display a simple, static scene, but the approach presented here is useful in real-life scenarios, where displayed scenes are dynamic.

To record command buffers and submit them to a queue in an efficient way, we need four types of resources: command buffers, semaphores, fences, and framebuffers. Semaphores, as we already discussed, are used for internal queue synchronization. Fences, on the other hand, allow the application to check whether some specific situation occurred, e.g., whether a command buffer has finished executing after being submitted to a queue. If necessary, the application can wait on a fence until it is signaled. In general, semaphores are used to synchronize queues (GPU), and fences are used to synchronize the application (CPU).

To render a single frame of animation we need (at least) one command buffer, a fence, a framebuffer, and two semaphores: one for swapchain image acquisition (the image available semaphore) and one to signal that presentation may occur (the rendering finished semaphore). The fence is used later to check whether we can rerecord a given command buffer. We will keep several sets of these rendering resources, each of which we can call a virtual frame. The number of virtual frames (each consisting of a command buffer, two semaphores, a fence, and a framebuffer) should be independent of the number of swapchain images; a sketch of such a resource group follows.
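
For reference, here is a sketch of a per-virtual-frame resource group. The field names follow the RenderingResourcesData type used by the code later in this tutorial, though the real definition may differ in detail:

struct RenderingResourcesData {
  VkFramebuffer   Framebuffer;
  VkCommandBuffer CommandBuffer;
  VkSemaphore     ImageAvailableSemaphore;
  VkSemaphore     FinishedRenderingSemaphore;
  VkFence         Fence;
};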

The rendering algorithm progresses like this: we record rendering commands to the first virtual frame’s command buffer and submit it to a queue, then we record and submit the second frame, and so on, until we run out of virtual frames. At that point we start reusing frames, taking the oldest (least recently submitted) command buffer and rerecording it, then the next one, and so on.

This is where the fences come in. We are not allowed to record a command buffer that has been submitted to a queue until its execution in the queue has finished. During command buffer recording we could use the “simultaneous use” flag, which allows us to resubmit a command buffer that has already been submitted and is still pending execution, but this may impact performance. A better way is to use fences and check whether a command buffer is no longer in use. If the graphics card is still processing it, we can wait on the fence associated with it, or use the additional time for other purposes, like improved AI calculations, and check again after some time whether the fence is signaled.

How many virtual frames should we have? One is not enough. When we record and submit a single command buffer, we immediately wait until we can rerecord it. This is a waste of time for both the CPU and the GPU. The GPU is usually faster, so waiting on the CPU causes more waiting on the GPU; we should keep the GPU as busy as possible. That is why thin APIs like Vulkan were created. Using two virtual frames gives a huge performance gain, as there is much less waiting on both the CPU and the GPU. Adding a third virtual frame gives an additional, though smaller, gain. Using four or more groups of rendering resources rarely makes sense, as the performance gain is negligible (though this may depend on the complexity of the rendered scene and on the calculations performed by the CPU, like physics or AI). When we increase the number of virtual frames, we also increase input lag, as we present a frame that is one to three frames behind the CPU. So two or three virtual frames seem to be the most reasonable compromise between performance, memory usage, and input lag.

You may wonder why the number of virtual frames shouldn’t be tied to the number of swapchain images. That approach may influence the behavior of our application. When we create a swapchain, we ask for the minimal required number of images, but the driver is allowed to create more. So different hardware vendors may implement drivers that offer different numbers of swapchain images, even for the same requirements (present mode and minimal number of images). When we tie the number of virtual frames to the number of swapchain images, our application may use only two virtual frames on one graphics card but four on another. This may influence both performance and the input lag mentioned above, and it’s not desired behavior. By keeping the number of virtual frames fixed, we can control our rendering algorithm and fine-tune it to our needs, that is, balance the time spent on rendering against AI or physics calculations.

Command Pool Creation

Before we can allocate a command buffer, we first need to create a command pool.

VkCommandPoolCreateInfo cmd_pool_create_info = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,       // VkStructureType                sType
  nullptr,                                          // const void                    *pNext
  VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT | // VkCommandPoolCreateFlags       flags
  VK_COMMAND_POOL_CREATE_TRANSIENT_BIT,
  queue_family_index                                // uint32_t                       queueFamilyIndex
};

if( vkCreateCommandPool( GetDevice(), &cmd_pool_create_info, nullptr, pool ) != VK_SUCCESS ) {
  return false;
}
return true;

13.Tutorial04.cpp, function CreateCommandPool()

The command pool is created by calling vkCreateCommandPool(), which requires us to provide a pointer to a variable of type VkCommandPoolCreateInfo. The code remains mostly unchanged, compared to previous tutorials. But this time, two additional flags are added for command pool creation:

  • VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT – Indicates that command buffers allocated from this pool may be reset individually. Normally, without this flag, we can’t rerecord the same command buffer multiple times; it must be reset first, and command buffers allocated from one pool may be reset only all at once. Specifying this flag allows us to reset command buffers individually, and (even better) the reset is performed implicitly by calling the vkBeginCommandBuffer() function.
  • VK_COMMAND_POOL_CREATE_TRANSIENT_BIT – Tells the driver that command buffers allocated from this pool will live only for a short amount of time and will often be recorded and reset (rerecorded). This information helps the driver optimize command buffer allocation.

Command Buffer Allocation

Allocating command buffers remains the same as previously.

for( size_t i = 0; i < Vulkan.RenderingResources.size(); ++i ) {
  if( !AllocateCommandBuffers( Vulkan.CommandPool, 1, &Vulkan.RenderingResources[i].CommandBuffer ) ) {
    std::cout << "Could not allocate command buffer!"<< std::endl;
    return false;
  }
}
return true;

14.Tutorial04.cpp, function CreateCommandBuffers()

The only change is that the command buffers are gathered into a vector of rendering resources. Each rendering resource structure contains a command buffer, an image available semaphore, a rendering finished semaphore, a fence, and a framebuffer. Command buffers are allocated in a loop. The number of elements in the rendering resources vector is chosen arbitrarily; for this tutorial it is equal to three. A sketch of the AllocateCommandBuffers() helper is shown below.
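
The AllocateCommandBuffers() helper used above is not listed in this part of the tutorial; here is a minimal sketch of what it could look like, assuming it simply wraps vkAllocateCommandBuffers():

bool AllocateCommandBuffers( VkCommandPool pool, uint32_t count, VkCommandBuffer *command_buffers ) {
  VkCommandBufferAllocateInfo command_buffer_allocate_info = {
    VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO, // VkStructureType         sType
    nullptr,                                        // const void             *pNext
    pool,                                           // VkCommandPool           commandPool
    VK_COMMAND_BUFFER_LEVEL_PRIMARY,                // VkCommandBufferLevel    level
    count                                           // uint32_t                commandBufferCount
  };
  return vkAllocateCommandBuffers( GetDevice(), &command_buffer_allocate_info, command_buffers ) == VK_SUCCESS;
}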

Semaphore Creation

The code responsible for creating a semaphore is simple and the same as previously shown:

VkSemaphoreCreateInfo semaphore_create_info = {
  VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,      // VkStructureType          sType
  nullptr,                                      // const void*              pNext
  0                                             // VkSemaphoreCreateFlags   flags
};

for( size_t i = 0; i < Vulkan.RenderingResources.size(); ++i ) {
  if( (vkCreateSemaphore( GetDevice(), &semaphore_create_info, nullptr, &Vulkan.RenderingResources[i].ImageAvailableSemaphore ) != VK_SUCCESS) ||
    (vkCreateSemaphore( GetDevice(), &semaphore_create_info, nullptr, &Vulkan.RenderingResources[i].FinishedRenderingSemaphore ) != VK_SUCCESS) ) {
    std::cout << "Could not create semaphores!"<< std::endl;
    return false;
  }
}
return true;

15.Tutorial04.cpp, function CreateSemaphores()

Fence Creation

Here is the code responsible for creating fence objects:

VkFenceCreateInfo fence_create_info = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO,              // VkStructureType                sType
  nullptr,                                          // const void                    *pNext
  VK_FENCE_CREATE_SIGNALED_BIT                      // VkFenceCreateFlags             flags
};

for( size_t i = 0; i < Vulkan.RenderingResources.size(); ++i ) {
  if( vkCreateFence( GetDevice(), &fence_create_info, nullptr, &Vulkan.RenderingResources[i].Fence ) != VK_SUCCESS ) {
    std::cout << "Could not create a fence!"<< std::endl;
    return false;
  }
}
return true;

16.Tutorial04.cpp, function CreateFences()

To create a fence object we call the vkCreateFence() function. It accepts, among other parameters, a pointer to a variable of type VkFenceCreateInfo, which has the following members:

  • sType – Type of the structure. Here it should be set to VK_STRUCTURE_TYPE_FENCE_CREATE_INFO.
  • pNext – Pointer reserved for extensions.
  • flags – Right now this parameter allows for creating a fence that is already signaled.

A fence can be in one of two states: signaled and unsignaled. The application can check whether a given fence is in the signaled state, or it can wait on the fence until it becomes signaled. Signaling is done by the GPU after all operations submitted to the queue are processed. When we submit command buffers, we can provide a fence that will be signaled when the queue has finished executing all commands issued in that one submit operation. After the fence is signaled, it is the application’s responsibility to reset it to the unsignaled state.

Why create a fence that is already signaled? Our rendering algorithm records commands into the first command buffer, then the second, then the third, and then once again the first (after its execution in the queue has ended). We use fences to check whether we can record a given command buffer once again. But what about the first recording? We don’t want to keep separate code paths for the first recording of a command buffer and for all subsequent recordings. So when we issue a command buffer recording for the first time, we also check whether its fence is already signaled. But because we haven’t yet submitted the command buffer, its fence can’t become signaled as a result of finished execution. That’s why the fence needs to be created in an already signaled state: the first time around we won’t have to wait for it (as it is already signaled); we just reset it and immediately go to the recording code. After that we submit the command buffer and provide the same fence, which gets signaled by the queue when the operations are done. The next time we want to rerecord rendering commands into the same command buffer, we perform the same operations: wait on the fence, reset it, and then start recording.

Drawing

Now we are nearly ready to record rendering operations. We record each command buffer just before it is submitted to the queue: we record one command buffer and submit it, then the next one, then yet another. After that we take the first command buffer again, check whether we can use it, and then rerecord and submit it.

static size_t           resource_index = 0;
RenderingResourcesData &current_rendering_resource = Vulkan.RenderingResources[resource_index];
VkSwapchainKHR          swap_chain = GetSwapChain().Handle;
uint32_t                image_index;

resource_index = (resource_index + 1) % VulkanTutorial04Parameters::ResourcesCount;

if( vkWaitForFences( GetDevice(), 1, &current_rendering_resource.Fence, VK_FALSE, 1000000000 ) != VK_SUCCESS ) {
  std::cout << "Waiting for fence takes too long!" << std::endl;
  return false;
}
vkResetFences( GetDevice(), 1, &current_rendering_resource.Fence );

VkResult result = vkAcquireNextImageKHR( GetDevice(), swap_chain, UINT64_MAX, current_rendering_resource.ImageAvailableSemaphore, VK_NULL_HANDLE, &image_index );
switch( result ) {
  case VK_SUCCESS:
  case VK_SUBOPTIMAL_KHR:
    break;
  case VK_ERROR_OUT_OF_DATE_KHR:
    return OnWindowSizeChanged();
  default:
    std::cout << "Problem occurred during swap chain image acquisition!"<< std::endl;
    return false;
}

if( !PrepareFrame( current_rendering_resource.CommandBuffer, GetSwapChain().Images[image_index], current_rendering_resource.Framebuffer ) ) {
  return false;
}

VkPipelineStageFlags wait_dst_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,                          // VkStructureType              sType
  nullptr,                                                // const void                  *pNext
  1,                                                      // uint32_t                     waitSemaphoreCount
  &current_rendering_resource.ImageAvailableSemaphore,    // const VkSemaphore           *pWaitSemaphores
  &wait_dst_stage_mask,                                   // const VkPipelineStageFlags  *pWaitDstStageMask
  1,                                                      // uint32_t                     commandBufferCount
  &current_rendering_resource.CommandBuffer,              // const VkCommandBuffer       *pCommandBuffers
  1,                                                      // uint32_t                     signalSemaphoreCount
  &current_rendering_resource.FinishedRenderingSemaphore  // const VkSemaphore           *pSignalSemaphores
};

if( vkQueueSubmit( GetGraphicsQueue().Handle, 1, &submit_info, current_rendering_resource.Fence ) != VK_SUCCESS ) {
  return false;
}

VkPresentInfoKHR present_info = {
  VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,                     // VkStructureType              sType
  nullptr,                                                // const void                  *pNext
  1,                                                      // uint32_t                     waitSemaphoreCount
  &current_rendering_resource.FinishedRenderingSemaphore, // const VkSemaphore           *pWaitSemaphores
  1,                                                      // uint32_t                     swapchainCount
  &swap_chain,                                            // const VkSwapchainKHR        *pSwapchains
  &image_index,                                           // const uint32_t              *pImageIndices
  nullptr                                                 // VkResult                    *pResults
};
result = vkQueuePresentKHR( GetPresentQueue().Handle, &present_info );

switch( result ) {
  case VK_SUCCESS:
    break;
  case VK_ERROR_OUT_OF_DATE_KHR:
  case VK_SUBOPTIMAL_KHR:
    return OnWindowSizeChanged();
  default:
    std::cout << "Problem occurred during image presentation!"<< std::endl;
    return false;
}

return true;

17.Tutorial04.cpp, function Draw()

So first we take the least recently used rendering resource. Then we wait until the fence associated with this group is signaled. If it is, we can safely take the command buffer and rerecord it. It also means that we can reuse the semaphores used to acquire and present the image referenced in that command buffer. We shouldn’t use the same semaphore for different purposes or in two different submit operations until the previous submission is finished. The fence prevents us from altering both the command buffer and the semaphores while they are still in use. And, as you will soon see, the framebuffer too.

Once the fence is signaled, we reset it and perform the normal drawing-related operations: we acquire an image, record operations rendering into the acquired image, submit the command buffer, and present the image.

After that we take the next set of rendering resources and perform the same operations. Thanks to keeping three groups of rendering resources (three virtual frames), we reduce the time wasted on waiting for a fence to be signaled.

Recording a Command Buffer

A function responsible for recording a command buffer is quite long. This time it is even longer, because we use a vertex buffer and a dynamic viewport and scissor test. And we also create temporary framebuffers!

Framebuffer creation is simple and fast. Keeping framebuffer objects tied to the swapchain means that we need to recreate them whenever the swapchain is recreated. If our rendering algorithm is complicated, we may have multiple images and framebuffers associated with them. If those images need to have the same size as the swapchain images, we need to recreate all of them to account for a potential size change. So it is better and more convenient to create framebuffers on demand; this way, they always have the desired size. Framebuffers operate on image views, which are created for a given, specific image. When the swapchain is recreated, the old images no longer exist, so we must recreate the image views and the framebuffers as well.

In the “03 – First Triangle” tutorial, framebuffers had a fixed size and had to be recreated along with the swapchain. Now we have a framebuffer object in each of our virtual frame resource groups. Before we record a command buffer, we create a framebuffer for the image we will render into, with the same size as that image. This way, when the swapchain is recreated, the size of the next frame is immediately adjusted, and the handle of the new swapchain image and its image view are used to create the framebuffer.

When we record a command buffer that uses a render pass and a framebuffer object, the framebuffer must remain valid for the whole time the command buffer is processed by the queue. So when we create a new framebuffer, we can’t destroy the old one until the commands submitted to the queue have finished. But since we use fences, and we have already waited on the fence associated with the given command buffer, we are sure that the framebuffer can be safely destroyed. We then create a new framebuffer to account for potential size and image handle changes.

if( framebuffer != VK_NULL_HANDLE ) {
  vkDestroyFramebuffer( GetDevice(), framebuffer, nullptr );
}

VkFramebufferCreateInfo framebuffer_create_info = {
  VK_STRUCTURE_TYPE_FRAMEBUFFER_CREATE_INFO,      // VkStructureType                sType
  nullptr,                                        // const void                    *pNext
  0,                                              // VkFramebufferCreateFlags       flags
  Vulkan.RenderPass,                              // VkRenderPass                   renderPass
  1,                                              // uint32_t                       attachmentCount
  &image_view,                                    // const VkImageView             *pAttachments
  GetSwapChain().Extent.width,                    // uint32_t                       width
  GetSwapChain().Extent.height,                   // uint32_t                       height
  1                                               // uint32_t                       layers
};

if( vkCreateFramebuffer( GetDevice(), &framebuffer_create_info, nullptr, &framebuffer ) != VK_SUCCESS ) {
  std::cout << "Could not create a framebuffer!"<< std::endl;
  return false;
}

return true;

18.Tutorial04.cpp, function CreateFramebuffer()

When we create a framebuffer, we take current swapchain extents and image view for an acquired swapchain image.

Next we start recording a command buffer:

if( !CreateFramebuffer( framebuffer, image_parameters.View ) ) {
  return false;
}

VkCommandBufferBeginInfo command_buffer_begin_info = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,        // VkStructureType                        sType
  nullptr,                                            // const void                            *pNext
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,        // VkCommandBufferUsageFlags              flags
  nullptr                                             // const VkCommandBufferInheritanceInfo  *pInheritanceInfo
};

vkBeginCommandBuffer( command_buffer, &command_buffer_begin_info );

VkImageSubresourceRange image_subresource_range = {
  VK_IMAGE_ASPECT_COLOR_BIT,                          // VkImageAspectFlags                     aspectMask
  0,                                                  // uint32_t                               baseMipLevel
  1,                                                  // uint32_t                               levelCount
  0,                                                  // uint32_t                               baseArrayLayer
  1                                                   // uint32_t                               layerCount
};

if( GetPresentQueue().Handle != GetGraphicsQueue().Handle ) {
  VkImageMemoryBarrier barrier_from_present_to_draw = {
    VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,           // VkStructureType                        sType
    nullptr,                                          // const void                            *pNext
    VK_ACCESS_MEMORY_READ_BIT,                        // VkAccessFlags                          srcAccessMask
    VK_ACCESS_MEMORY_READ_BIT,                        // VkAccessFlags                          dstAccessMask
    VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,                  // VkImageLayout                          oldLayout
    VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,                  // VkImageLayout                          newLayout
    GetPresentQueue().FamilyIndex,                    // uint32_t                               srcQueueFamilyIndex
    GetGraphicsQueue().FamilyIndex,                   // uint32_t                               dstQueueFamilyIndex
    image_parameters.Handle,                          // VkImage                                image
    image_subresource_range                           // VkImageSubresourceRange                subresourceRange
  };
  vkCmdPipelineBarrier( command_buffer, VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, 0, 0, nullptr, 0, nullptr, 1, &barrier_from_present_to_draw );
}

VkClearValue clear_value = {
  { 1.0f, 0.8f, 0.4f, 0.0f },                         // VkClearColorValue                      color
};

VkRenderPassBeginInfo render_pass_begin_info = {
  VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO,           // VkStructureType                        sType
  nullptr,                                            // const void                            *pNext
  Vulkan.RenderPass,                                  // VkRenderPass                           renderPass
  framebuffer,                                        // VkFramebuffer                          framebuffer
  {                                                   // VkRect2D                               renderArea
    {                                                 // VkOffset2D                             offset
      0,                                                // int32_t                                x
      0                                                 // int32_t                                y
    },
    GetSwapChain().Extent,                            // VkExtent2D                             extent;
  },
  1,                                                  // uint32_t                               clearValueCount
  &clear_value                                        // const VkClearValue                    *pClearValues
};

vkCmdBeginRenderPass( command_buffer, &render_pass_begin_info, VK_SUBPASS_CONTENTS_INLINE );

19.Tutorial04.cpp, function PrepareFrame()

First we define a variable of type VkCommandBufferBeginInfo and specify that the command buffer will be submitted only once. When we specify the VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT flag, we can’t submit the given command buffer more than once; after each submission it must be reset. The recording operation, however, resets it implicitly, thanks to the VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT flag used during command pool creation.

Next we define subresource ranges for image memory barriers. The layout transitions of the swapchain images are performed implicitly inside a render pass, but if the graphics queue and the presentation queue are different, the queue family ownership transfer must be performed manually.

After that we begin a render pass with the temporary framebuffer object.

vkCmdBindPipeline( command_buffer, VK_PIPELINE_BIND_POINT_GRAPHICS, Vulkan.GraphicsPipeline );

VkViewport viewport = {
  0.0f,                                               // float                                  x
  0.0f,                                               // float                                  y
  static_cast<float>(GetSwapChain().Extent.width),    // float                                  width
  static_cast<float>(GetSwapChain().Extent.height),   // float                                  height
  0.0f,                                               // float                                  minDepth
  1.0f                                                // float                                  maxDepth
};

VkRect2D scissor = {
  {                                                   // VkOffset2D                             offset
    0,                                                  // int32_t                                x
    0                                                   // int32_t                                y
  },
  {                                                   // VkExtent2D                             extent
    GetSwapChain().Extent.width,                        // uint32_t                               width
    GetSwapChain().Extent.height                        // uint32_t                               height
  }
};

vkCmdSetViewport( command_buffer, 0, 1, &viewport );
vkCmdSetScissor( command_buffer, 0, 1, &scissor );

VkDeviceSize offset = 0;
vkCmdBindVertexBuffers( command_buffer, 0, 1, &Vulkan.VertexBuffer.Handle, &offset );

vkCmdDraw( command_buffer, 4, 1, 0, 0 );

vkCmdEndRenderPass( command_buffer );

if( GetGraphicsQueue().Handle != GetPresentQueue().Handle ) {
  VkImageMemoryBarrier barrier_from_draw_to_present = {
    VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,           // VkStructureType                        sType
    nullptr,                                          // const void                            *pNext
    VK_ACCESS_MEMORY_READ_BIT,                        // VkAccessFlags                          srcAccessMask
    VK_ACCESS_MEMORY_READ_BIT,                        // VkAccessFlags                          dstAccessMask
    VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,                  // VkImageLayout                          oldLayout
    VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,                  // VkImageLayout                          newLayout
    GetGraphicsQueue().FamilyIndex,                   // uint32_t                               srcQueueFamilyIndex
    GetPresentQueue().FamilyIndex,                    // uint32_t                               dstQueueFamilyIndex
    image_parameters.Handle,                          // VkImage                                image
    image_subresource_range                           // VkImageSubresourceRange                subresourceRange
  };
  vkCmdPipelineBarrier( command_buffer, VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, 0, 0, nullptr, 0, nullptr, 1, &barrier_from_draw_to_present );
}

if( vkEndCommandBuffer( command_buffer ) != VK_SUCCESS ) {
  std::cout << "Could not record command buffer!" << std::endl;
  return false;
}
return true;

20.Tutorial04.cpp, function PrepareFrame()

Next we bind a graphics pipeline. It has two states marked as dynamic: viewport and scissor test. So we prepare structures that define viewport and scissor test parameters. The dynamic viewport state is set by calling the vkCmdSetViewport() function. The dynamic scissor test is set by calling the vkCmdSetScissor() function. This way, our graphics pipeline can be used for rendering into images of different sizes.

One last thing before we can draw anything is to bind the appropriate vertex buffer, providing the data for vertex attributes. We do this through the vkCmdBindVertexBuffers() function call. We specify a binding number (which set of vertex attributes should take data from this buffer), a pointer to a buffer handle (or more handles if we want to bind buffers for multiple bindings), and an offset. The offset indicates that data for vertex attributes should be taken from a further part of the buffer, and it cannot be larger than the size of the corresponding buffer (the buffer itself, not the memory object bound to it).

Now we have specified all the required elements: framebuffer, viewport and scissor test, and a vertex buffer. We can draw the geometry, finish the render pass, and end the command buffer.

Tutorial04 Execution

Here is the result of rendering operations:

We are rendering a quad that has different colors in each corner. Try resizing the window; previously, the triangle was always the same size, only the black frame on the right and bottom sides of an application window grew larger or smaller. Now, thanks to the dynamic viewport state, the quad is growing or shrinking along with the window.

Cleaning Up

After rendering and before closing the application, we should destroy all resources. Here is the code responsible for this operation:

if( GetDevice() != VK_NULL_HANDLE ) {
  vkDeviceWaitIdle( GetDevice() );

  for( size_t i = 0; i < Vulkan.RenderingResources.size(); ++i ) {
    if( Vulkan.RenderingResources[i].Framebuffer != VK_NULL_HANDLE ) {
      vkDestroyFramebuffer( GetDevice(), Vulkan.RenderingResources[i].Framebuffer, nullptr );
    }
    if( Vulkan.RenderingResources[i].CommandBuffer != VK_NULL_HANDLE ) {
      vkFreeCommandBuffers( GetDevice(), Vulkan.CommandPool, 1, &Vulkan.RenderingResources[i].CommandBuffer );
    }
    if( Vulkan.RenderingResources[i].ImageAvailableSemaphore != VK_NULL_HANDLE ) {
      vkDestroySemaphore( GetDevice(), Vulkan.RenderingResources[i].ImageAvailableSemaphore, nullptr );
    }
    if( Vulkan.RenderingResources[i].FinishedRenderingSemaphore != VK_NULL_HANDLE ) {
      vkDestroySemaphore( GetDevice(), Vulkan.RenderingResources[i].FinishedRenderingSemaphore, nullptr );
    }
    if( Vulkan.RenderingResources[i].Fence != VK_NULL_HANDLE ) {
      vkDestroyFence( GetDevice(), Vulkan.RenderingResources[i].Fence, nullptr );
    }
  }

  if( Vulkan.CommandPool != VK_NULL_HANDLE ) {
    vkDestroyCommandPool( GetDevice(), Vulkan.CommandPool, nullptr );
    Vulkan.CommandPool = VK_NULL_HANDLE;
  }

  if( Vulkan.VertexBuffer.Handle != VK_NULL_HANDLE ) {
    vkDestroyBuffer( GetDevice(), Vulkan.VertexBuffer.Handle, nullptr );
    Vulkan.VertexBuffer.Handle = VK_NULL_HANDLE;
  }

  if( Vulkan.VertexBuffer.Memory != VK_NULL_HANDLE ) {
    vkFreeMemory( GetDevice(), Vulkan.VertexBuffer.Memory, nullptr );
    Vulkan.VertexBuffer.Memory = VK_NULL_HANDLE;
  }

  if( Vulkan.GraphicsPipeline != VK_NULL_HANDLE ) {
    vkDestroyPipeline( GetDevice(), Vulkan.GraphicsPipeline, nullptr );
    Vulkan.GraphicsPipeline = VK_NULL_HANDLE;
  }

  if( Vulkan.RenderPass != VK_NULL_HANDLE ) {
    vkDestroyRenderPass( GetDevice(), Vulkan.RenderPass, nullptr );
    Vulkan.RenderPass = VK_NULL_HANDLE;
  }
}

21.Tutorial04.cpp, destructor

We destroy all resources only after the device has finished processing all commands submitted to all its queues. Resources are destroyed in the reverse order of their creation. First we destroy all rendering resources: framebuffers, command buffers, semaphores, and fences. Fences are destroyed by calling the vkDestroyFence() function. Then the command pool is destroyed. After that we destroy the buffer by calling the vkDestroyBuffer() function and free the memory object by calling the vkFreeMemory() function. Finally, the pipeline object and the render pass are destroyed.

Conclusion

This tutorial is based on the “03 – First Triangle” tutorial. We improved rendering by using vertex attributes in a graphics pipeline and vertex buffers bound during command buffer recording. We described the number and layout of vertex attributes. We introduced dynamic pipeline states for the viewport and scissor test. We learned how to create buffers and memory objects and how to bind one to another. We also mapped memory and uploaded data from the CPU to the GPU.

We have created a set of rendering resources that allow us to efficiently record and issue rendering commands. These resources consist of command buffers, semaphores, fences, and framebuffers. We learned how to use fences, how to set the values of dynamic pipeline states, and how to bind vertex buffers (the source of vertex attribute data) during command buffer recording.

The next tutorial will present staging resources. These are intermediate buffers used to copy data between the CPU and GPU. This way, buffers (or images) used for rendering don’t have to be mapped by an application and can be bound to a device’s local (very fast) memory.


Go to: API without Secrets: Introduction to Vulkan* Part 5: Staging Resources (To Be Continued...)


Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800- 548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

Using the Intel® RealSense™ Camera with TouchDesigner*: Part 3


The Intel® RealSense™ camera (R200) is a vital tool for creating VR and AR projects and real-time performance interactivity. I found TouchDesigner*, created by Derivative*, to be an excellent program for utilizing the information provided by the Intel RealSense cameras.

This third article is written from the standpoint of creating real-time interactivity in performances using the Intel RealSense camera in combination with TouchDesigner. Interactivity in a performance always adds a magical element. This article includes example photos and videos from an in-progress interactive dance piece I am directing and creating visuals for, along with demos showing how you can create different interactive effects using the Intel RealSense camera (R200). The interactive performance dance demo takes place in the Vortex Immersion dome in Los Angeles, where I am a resident artist. The dancer and choreographer is Stevie Rae Gibbs; Tim Hicks assisted me with cinematography and VR live-action shooting. The music was created by Winter Lazerus, and the movies embedded in this article were shot by Chris Casady and Winter Lazerus.

Things to Consider When Creating an Interactive Immersive Project

Just as in any performance, there needs to be a theme. The theme of this short interactive demo is simple: liberation from what is trapping the dancer, the box weighing her down. The interactivity contributed to this theme. The effects were linked to the skeletal movements of the dancer, and some were linked to the color and depth information provided by the Intel RealSense camera. The obvious linking of the effects to the dancer contributed to a sense of magic, and the choreography and dancing had to work with the effects. Beyond the theatrical lighting, care had to be taken to keep enough light on the subject so that the Intel RealSense camera could register properly. The camera's distance from the dancer also had to be considered, taking into account the range of the camera and the effect wanted, and the dancer had to be careful to stay within the effective camera range.

The demo dance project is an immersive full dome performance piece so it had to be mapped to the dome. Having the effects mapped to the dome also influenced their look. For the Vortex Immersion dome, Jeff Smith of Eye Vapor has created a TouchDesigner interface for dome mapping. I used this interface as the base layer within which to put my TouchDesigner programming of the interactive effects.

Jeff Smith on Mapping the Dome Using TouchDesigner:

“There were several challenges in creating a real time mapping solution for a dome using TouchDesigner. One of the first things we had to work through was getting a perspective corrected image through each projector. The solution, which is well known now, is to place virtual projectors inside a mapped virtual dome and render out an image for each projector. Another challenge was to develop a set of alignment and blending tools to be able to perfectly calibrate and blend the projected image. And finally, we had to develop custom GLSL shaders to render real time fisheye imagery”.

Tim Hicks on Technical Aspects of Working with the RealSense Camera

“Working with the Intel RealSense camera was extremely efficient in creating a simple and stable workflow to connect our performer’s gestures through TouchDesigner, and then rendered out as interactive animations. Setup is quick and performance is reliable, even in low light, which is always an issue when working inside an immersive digital projection dome.”

Notes for Utilizing TouchDesigner and the Intel RealSense Camera

Like Part 1 and Part 2, Part 3 is aimed at those familiar with using TouchDesigner and its interface. If you are unfamiliar with TouchDesigner, before you follow the demos I recommend that you review some of the documentation and videos available here: Learning TouchDesigner. The Part 1 and Part 2 articles walk you through use of the TouchDesigner nodes described in this article, and provide sample .toe files to get you started.

A free non-commercial copy of TouchDesigner is available and is fully functional, except that the highest resolution is limited to 1280 x 1280.

Note: When using the Intel RealSense camera, it is important to pay attention to its range for best results.

Demo #1: Using the Depth Mapping of the R200 and SR300 Camera

This is a simple and effective way to create interactive colored lines that respond to the movement of the performer. In the case of this performance, the lines wrapped and animated around the entire dome in response to the movement of the dancer.

  1. Create the nodes you will need, arrange, and connect/wire them in a horizontal row in this order:
    • RealSense TOP node
    • Level TOP node
    • Luma Level TOP node
  2. Open the Setup parameters page of the RealSense TOP node and set the Image parameter to Depth.
  3. Set the parameters of the Level TOP and the Luma Level TOP to offset the brightness and contrast. Judge this by looking at the result you are getting in the effect.
    Figure 1. You are using the Depth setting in the RealSense TOP node for the R200 camera.
  4. Create a Blur TOP and a Displace TOP.
  5. Wire the Luma Level TOP to the Blur TOP and the top connector on the Displace TOP.
  6. Connect the Blur TOP to the bottom connector of the Displace TOP (Note: the filter size of the blur should be based on what you want your final effect to look like).
    Figure 2. Set the Filter for the Blur TOP at 100 as a starting point.
  7. Create a Ramp TOP and a Composite TOP.
  8. Choose the colors you want your line to be in the Ramp TOP.
  9. Connect the Displace TOP to the top connector in the Composite TOP and the Ramp TOP to the bottom connector in the Composite TOP.
    Figure 3. You are using the Depth setting in the RealSense TOP node for the R200 camera.
     
    Figure 4. The complete network for the effect.
  10. Watch how the line reacts to the performer's motions.
    Figure 5. Video from the demo performance of the colored lines created from the depth mapping of the performer by the RealSense camera.

Demo #2: RealSense TOP Depth Mapping Second Effect

In this demo, we use TouchDesigner with the depth feature of the Intel RealSense R200 camera to project duplicates of the performer and offset them in time. I used it to project several images of the dancer moving at different times, creating the illusion of more than one dancer. Note that this effect was not part of the final dance performance, but it is well worth using.

  1. Add a RealSense TOP node to your scene.
  2. On the Setup parameters page for the RealSense TOP node, for the Image parameter select Depth.

    Create two Level TOP nodes and connect the RealSense TOP node to each of them.
    Figure 6. You are using the Depth setting in the RealSense TOP node for the R200 camera.
  3. Adjust the level node parameters to give you the amount of contrast and brightness you want for your effect. You may want to come back after seeing the effect and readjust them. As a starting point for both Level TOPs, in the Pre parameters page, set the Brightness parameter to 2 and the Gamma parameter to 1.75.
  4. Create a Transform TOP and wire it to level2 TOP.
  5. In the Transform TOP parameters, on the Transform page, set the Translate x parameter to 0.2. Note that a Translate x value of 1 would move the image fully off screen.
  6. Create two Cache TOP nodes and wire one to the Transform TOP and one to level1 TOP.
  7. On the cache1 TOP's Cache parameter page, set Cache Size to 32 and Output Index to -20.
  8. On the cache2 TOP's Cache parameter page, set Cache Size to 32 and Output Index to -40. I am using the Cache TOPs to save and offset the timing of the images. Note that once you see how the effect works with your performance, you will want to go back and readjust these settings.

    Notes on the Cache TOP: The Cache TOP can be used to freeze images by turning its Active parameter to Off (you can set the cache size to 1). It acts as a delay if you set Output Index to a negative number and leave the Active parameter On. Once a sequence of images has been captured by toggling the Active parameter on and off, it can be looped by animating the Output Index parameter.

    For more info on the Cache TOP, click here.

    Figure 7. You could add in more Level TOPs to create more duplicates.
  9. Wire both Cache TOPS to a Difference TOP.
    Figure 8. The Cache TOPs are wired to the Diff TOP so that both images of the performer will be seen.
     
    Figure 9. The entire network for the effect. Look at the effect when projected in your performance, go back, and adjust the node parameters as necessary.

Demo #3: RealSense TOP Color Mapping For Texture Mapping

This demo uses the RealSense TOP node to texture map geometries, in this case texturing the boxes with the dancer's moving image.

  1. Create a Geometry COMP and go inside it, down one level (/project1/geo1) and create an In SOP.
  2. Go back up to project1 and create a Box SOP.
  3. In the Box SOP parameters, set the Texture Coordinates to Face Outside. This will ensure that each face gets the full texture (0 to 1).
  4. Wire the Box SOP to the Geometry COMPs input.
  5. Create a RealSense TOP Node and in the Parameters Setup page, set the Model to R200 and the Image to Color.
  6. Create a Phong MAT and in the Parameters RGB page set the Color Map to realsense1 or alternatively drag the RealSense TOP node into the Color Map parameter.
  7. In the Geo COMP's Render parameter page, set Material to phong1.
  8. Create a Render TOP, a Camera COMP, and a Light COMP.
  9. Create a Reorder TOP and in the Reorder parameter page, set the Output Alpha, Input 1 to One using the drop down.
    Figure 10. The entire network to show how the Intel RealSense R200 Color mode can be used to texture all sides of a Box Geo.
     
    Figure 11. The dancer appears to be holding up the box, which is textured with her image.
     
    Figure 12. Multiple boxes with the image of the dancer animate around the dancer once she has lifted the box off herself.

Demo #4: RealSense CHOP Movement Control Over Large Particle Sphere

For this effect, I wanted the dancer to be able to interact playfully with a large particle ball. She moves towards the sphere and it moves away from her.

  1. Create a RealSense CHOP node. In the parameters Setup page, set the Model to R200 and the Mode to Finger/FaceTracking. Turn on the Person Center-Mass World Position and the Person Center-Mass Color Position parameters.
  2. Connect the RealSense CHOP node to a Select CHOP node.
  3. In the Select CHOP's Select page, set the Channel Names to person1_center_mass:tx.
  4. Create a Math CHOP node and leave the defaults for now (you can adjust them later as needed in your performance), then wire the Select CHOP node to the Math CHOP node.
  5. Create a Lag CHOP node and wire the Math CHOP node to that.
  6. Connect the Lag CHOP node to a Null CHOP node and connect the Null CHOP node to a Trail CHOP node.
    Figure 13. The entire network to show how the RealSense R200 CHOP can be hooked up. The Trail CHOP node is very useful for seeing if and how much the RealSense camera is working.
  7. Create a Torus SOP, connect it to a Transform SOP, and then connect the Transform SOP to a Material SOP.
  8. Create a Point Sprite MAT.
  9. In the Point Sprite MAT, Point Sprite parameters page, choose a yellow color.
  10. In the Material SOP parameters page, set the Material to pointsprite1.
  11. Create a Copy SOP, keep its default parameter settings, and wire the Material SOP to the bottom connection on it.
  12. Create a Sphere SOP, wire it to a Particle SOP.
  13. Wire the Particle SOP to the top connector in the Copy SOP.
  14. In the Particle SOP's State parameter page, set the Particle Type to Render as Point Sprites.
  15. Connect the Copy SOP to a Geo COMP. Go one level down to project1/geo1, delete the Torus SOP, and create an In SOP.
    Figure 14. For the more advanced, a Point Sprite MAT can be used to change the look of the particles.
  16. Export the person1_center_mass:tx channel from the Null CHOP to the Transform SOP parameters, Transform page, Translate tx.
    Figure 15. Exporting the channel.
     
    Figure 16. The large particle ball assumes a personality as the dancer plays with it, trying to catch it.

Demo #5: Buttons to Control Effects

Turning interactive effects on and off is important. In this demo, I will show the simplest way to do this using a button.

  1. Create a Button COMP.
  2. Connect it to a Null CHOP.
  3. Activate and export the channel from the Null CHOP to the Parameters, Render Page of the Geo COMP from the previous Demo 4. Pressing the button will turn the render of the Geo COMP on and off.
    Figure 17. An elementary button setup.

Summary

This article is designed to give the reader some basic starting points, techniques, and ideas for using the RealSense camera to create interactivity in a performance. There are many more sophisticated effects to be explored using the RealSense camera in combination with TouchDesigner.

Related Applications

Many apps that people have created for the RealSense camera are very useful.

https://appshowcase.intel.com/en-us/realsense/node/9167?cam=all-cam - drummer app for Intel RealSense Cameras.

https://appshowcase.intel.com/en-us/realsense?cam=all-cam - apps for all Intel RealSense cameras.

About the Author

Audri Phillips is a visualist/3d animator based out of Los Angeles, with a wide range of experience that includes over 25 years working in the visual effects/entertainment industry in studios such as Sony*, Rhythm and Hues*, Digital Domain*, Disney*, and Dreamworks* feature animation. Starting out as a painter she was quickly drawn to time based art. Always interested in using new tools she has been a pioneer of using computer animation/art in experimental film work including immersive performances. Now she has taken her talents into the creation of VR. Samsung* recently curated her work into their new Gear Indie Milk VR channel.

Her latest immersive work/animations include: Multi Media Animations for "Implosion a Dance Festival" 2015 at the Los Angeles Theater Center, 4 Full dome Concerts in the Vortex Immersion dome, one with the well-known composer/musician Steve Roach. The most recent being the fulldome concert, "Relentless Universe”. She also created animated content for the dome show for the TV series, “Constantine*” shown at the 2014 Comic-Con convention. Several of her Fulldome pieces, “Migrations” and “Relentless Beauty”, have been juried into "Currents", The Santa Fe International New Media Festival, and Jena FullDome Festival in Germany. She exhibits in the Young Projects gallery in Los Angeles.

She writes online content and a blog for Intel®. Audri is an Adjunct professor at Woodbury University, a founding member and leader of the Los Angeles Abstract Film Group, founder of the Hybrid Reality Studio (dedicated to creating VR content), a board member of the Iota Center, and an exhibiting member of the LA Art Lab. In 2011 Audri became a resident artist of Vortex Immersion Media and the c3: CreateLAB. A selection of her works are available on Vimeo , on creativeVJ and on Vrideo.

Intel® Software Guard Extensions Tutorial Series: Part 2, Application Design


The second part in the Intel® Software Guard Extensions (Intel® SGX) tutorial series is a high-level specification for the application we’ll be developing: a simple password manager. Since we’re building this application from the ground up, we have the luxury of designing for Intel SGX from the start. That means that in addition to laying out our application’s requirements, we’ll examine how Intel SGX design decisions and the overall application architecture influence one another.

Read the first tutorial in the series or find the list of all of the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

Password Managers At-A-Glance

Most people are probably familiar with password managers and what they do, but it’s a good idea to review the fundamentals before we get into the details of the application design itself.

The primary goals of a password manager are to:

  • Reduce the number of passwords that end users need to remember.
  • Enable end users to create stronger passwords than they would normally choose on their own.
  • Make it practical to use a different password for every account.

Password management is a growing problem for Internet users, and numerous studies have tried to quantify it over the years. A Microsoft study published in 2007—nearly a decade ago as of this writing—estimated that the average person had 25 accounts that required passwords. More recently, in 2014 Dashlane estimated that their US users had an average of 130 accounts, while the number of accounts for their worldwide users averaged in the 90s. And the problems don't end there: people are notoriously bad at picking “good” passwords and frequently reuse the same password on multiple sites, which has led to some spectacular attacks. These problems boil down to two basic issues: passwords that are hard for hacking tools to guess are often hard for people to remember, and the more passwords you have, the harder it is to remember which password goes with which account.

With a password manager, you only need to remember one very strong passphrase in order to gain access to your password database or vault. Once you have authenticated to your password manager, you can look up any passwords you have stored, and copy and paste them into authentication fields as needed. Of course, the key vulnerability of the password manager is the password database itself: since it contains all of the user’s passwords it is an attractive target for attackers. For this reason, the password database is encrypted with strong encryption techniques, and the user’s master passphrase becomes the means for decrypting the data inside of it.

Our goal in this tutorial is to build a simple password manager that provides the same core functions as a commercial product while following good security practices and use that as a learning vehicle for designing for Intel SGX. The tutorial password manager, which we’ll name the “Tutorial Password Manager with Intel® Software Guard Extensions” (yes, that’s a mouthful, but it’s descriptive), is not intended to function as a commercial product and certainly won’t contain all the safeguards found in one, but that level of detail is not necessary.

Basic Application Requirements

Some basic application requirements will help narrow down the scope of the application so that we can focus on the Intel SGX integration rather than the minutiae of application design and development. Again, the goal is not to create a commercial product: the Tutorial Password Manager with Intel SGX does not need to run on multiple operating systems or on all possible CPU architectures. All we require is a reasonable starting point.

To that end, our basic application requirements are:

The first requirement may seem strange given that this tutorial series is about Intel SGX application development, but real-world applications need to consider the legacy installation base. For some applications it may be appropriate to restrict execution only to Intel SGX-capable platforms, but for the Tutorial Password Manager we’ll use a less rigid approach. An Intel SGX-capable platform will receive a hardened execution environment, but non-capable platforms will still function. This usage is appropriate for a password manager, where the user may need to synchronize his or her password database with other, older systems. It is also a learning opportunity for implementing dual code paths.

The second requirement gives us access to certain cryptographic algorithms in the non-Intel SGX code path and to some libraries that we’ll need. The 64-bit requirement simplifies application development by ensuring access to native 64-bit types and also provides a performance boost for certain cryptographic algorithms that have been optimized for 64-bit code.

The third requirement gives us access to the RDRAND instruction in the non-Intel SGX code path. This greatly simplifies random number generation and ensures access to a high-quality entropy source. Systems that support the RDSEED instruction will make use of that as well. (For information on the RDRAND and RDSEED instructions, see the Intel® Digital Random Number Generator Software Implementation Guide.)

The fourth requirement keeps the list of software required by the developer (and the end user) as short as possible. No third-party libraries, frameworks, applications, or utilities need to be downloaded and installed. However, this requirement has an unfortunate side effect: without third-party frameworks, there are only four options available to us for creating the user interface. Those options are:

  • Win32 APIs
  • Microsoft Foundation Classes (MFC)
  • Windows Presentation Foundation (WPF)
  • Windows Forms

The first two are implemented in native/unmanaged code while the latter two require .NET*.

The User Interface Framework

For the Tutorial Password Manager, we’re going to be developing the GUI using Windows Presentation Foundation in C#. This design decision impacts our requirements as follows:

Why use WPF? Mostly because it simplifies the UI design while introducing complexity that we actually want. Specifically, by relying on the .NET Framework, we have the opportunity to discuss mixing managed code, and specifically high-level languages, with enclave code. Note, though, that choosing WPF over Windows Forms was arbitrary: either environment would work.

As you might recall, enclaves must be written in native C or C++ code, and the bridge functions that interact with the enclave must be native C (not C++) functions. While both Win32 APIs and MFC provide an opportunity to develop the password manager with 100-percent native C/C++ code, the burden imposed by these two methods does nothing for those who want to learn Intel SGX application development. With a GUI based in managed code, we not only reap the benefits of the integrated design tools but also have the opportunity to discuss something that is of potential value to Intel SGX application developers. In short, you aren’t here to learn MFC or raw Win32, but you might want to know how to glue .NET to enclaves.

To bridge the managed and unmanaged code we’ll be using C++/CLI (C++ modified for Common Language Infrastructure). This greatly simplifies the data marshaling and is so convenient and easy to use that many developers refer to it as IJW (“It Just Works”).

Figure 1: Minimum component structures for native and C# Intel® Software Guard Extensions applications.

Figure 1 shows the impact to an Intel SGX application’s minimum component makeup when it is moved from native code to C#. In the fully native application, the application layer can interact directly with the enclave DLL since the enclave bridge functions can be incorporated into the application’s executable. In a mixed-mode application, however, the enclave bridge functions need to be isolated from the managed code block because they are required to be 100-percent native code. The C# application, on the other hand, can’t interact with the bridge functions directly, and in the C++/CLI model that means creating another intermediary: a DLL that marshals data between the managed C# application and the native, enclave bridge DLL.

Password Vault Requirements

At the core of the password manager is the password database, or what we’ll be referring to as the password vault. This is the encrypted file that will hold the end user’s account information and passwords. The basic requirements for our tutorial application are:

The requirement that the vault be portable means that we should be able to copy the vault file to another computer and still be able to access its contents, whether or not that system supports Intel SGX. In other words, the user experience should be the same: the password manager should work seamlessly (as long as the system meets the base hardware and OS requirements, of course).

Encrypting the vault at rest means that the vault file should be encrypted when it is not actively in use. At a minimum, the vault must be encrypted on disk (without the portability requirement, we could potentially solve the encryption requirements by using the sealing feature of Intel SGX) and should not sit decrypted in memory longer than is necessary.

Authenticated encryption provides assurances that the encrypted vault has not been modified after the encryption has taken place. It also gives us a convenient means of validating the user’s passphrase: if the decryption key is incorrect, the decryption will fail when validating the authentication tag. That way, we don’t have to examine the decrypted data to see if it is correct.

Passwords

Any account information is sensitive information for a variety of reasons, not the least of which is that it tells an attacker exactly which logins and sites to target, but the passwords are arguably the most critical piece of the vault. Knowing what account to attack is not nearly as attractive as not needing to attack it at all. For this reason, we’ll introduce additional requirements on the passwords stored in the vault:

This nests the encryption. The passwords for each of the user's accounts are encrypted when stored in the vault, and the entire vault is encrypted when written to disk. This approach allows us to limit the exposure of the passwords once the vault has been decrypted. It is reasonable to decrypt the vault as a whole so that the user can browse their account details, but displaying all of their passwords in clear text in this manner would be inappropriate.

An account password is only decrypted when a user asks to see it. This limits its exposure both in memory and on the user’s display.

Cryptographic Algorithms

With the encryption needs identified, it is time to settle on the specific cryptographic algorithms, and it's here that our existing application requirements impose some significant limits on our options. The Tutorial Password Manager must provide a seamless user experience on both Intel SGX and non-Intel SGX platforms, and it isn't allowed to depend on third-party libraries. That means we have to choose an algorithm, with supported key and authentication tag sizes, that is common to both the Windows CNG API and the Intel SGX trusted crypto library. Practically speaking, this leaves us with just one option: Advanced Encryption Standard-Galois Counter Mode (AES-GCM) with a 128-bit key. This is arguably not the best encryption mode to use in this application, especially since the effective authentication tag strength of 128-bit GCM is less than 128 bits, but it is sufficient for our purposes. Remember: the goal here is not to create a commercial product, but rather a useful learning vehicle for Intel SGX development.
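On the Intel SGX code path, authenticated decryption is a single call into the trusted crypto library. Here is a minimal sketch (the buffer names and the decision to use no additional authenticated data are illustrative, not taken from the tutorial code):

#include <sgx_tcrypto.h>

// Decrypt a vault blob inside the enclave. If the key is wrong or the
// ciphertext was modified, the tag check fails and the call returns
// SGX_ERROR_MAC_MISMATCH instead of SGX_SUCCESS.
sgx_status_t decrypt_vault(const sgx_aes_gcm_128bit_key_t *key,
    const uint8_t *ciphertext, uint32_t len, const uint8_t iv[12],
    const sgx_aes_gcm_128bit_tag_t *tag, uint8_t *plaintext)
{
    return sgx_rijndael128GCM_decrypt(key, ciphertext, len, plaintext,
        iv, 12, NULL, 0, tag);
}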

With GCM come some other design decisions, namely the IV length (12 bytes is most efficient for the algorithm) and the authentication tag length.

Encryption Keys and User Authentication

With the encryption method chosen, we can turn our attention to the encryption key and user authentication. How will the user authenticate to the password manager in order to unlock their vault?

The simple approach would be to derive the encryption key directly from the user’s passphrase or password using a key derivation function (KDF). But while the simple approach is a valid one, it does have one significant drawback: if the user changes his or her password, the encryption key changes along with it. Instead, we’ll follow the more common practice of encrypting the encryption key.

In this method, the primary encryption key is randomly generated using a high-quality entropy source and it never changes. The user’s passphrase or password is used to derive a secondary encryption key, and the secondary key is used to encrypt the primary key. This approach has some key advantages:

  • The data does not have to be re-encrypted when the user’s password or passphrase changes
  • The encryption key never changes, so it could theoretically be written down in, say, hexadecimal notation and locked in a physically secure location. The data could thus still be decrypted even if the user forgot his or her password. Since the key never changes, it would only have to be written down once.
  • More than one user could, in theory, be granted access to the data. Each would encrypt a copy of the primary key with their own passphrase.

Not all of these are necessarily critical or relevant to the Tutorial Password Manager, but it’s a good security practice nonetheless.

Here the primary key is called the vault key, and the secondary key that is derived from the user’s passphrase is called the master key. The user authenticates by entering their passphrase, and the password manager derives a master key from it. If the master key successfully decrypts the vault key, the user is authenticated and the vault can be decrypted. If the passphrase is incorrect, the decryption of the vault key fails and that prevents the vault from being decrypted.

The final requirement, building the KDF around SHA-256, comes from the constraint that we find a hashing algorithm common to both the Windows CNG API and the Intel SGX trusted crypto library.
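Putting these pieces together, the unlock flow looks roughly like the sketch below. This is illustrative only: the single sgx_sha256_msg() call is a stand-in for the real KDF (which would add a salt and iterations), and decrypt_with_tag_check() is a hypothetical wrapper around the AES-GCM decryption shown earlier:

#include <sgx_tcrypto.h>
#include <string.h>

// Hypothetical helper: AES-128-GCM decrypt; returns false on tag mismatch.
bool decrypt_with_tag_check(const uint8_t key[16], const uint8_t *ct,
    uint32_t len, const uint8_t iv[12], const uint8_t tag[16], uint8_t *pt);

// Returns true and fills vault_key if the passphrase is correct.
bool unlock_vault(const char *passphrase, const uint8_t wrapped_key[16],
    const uint8_t iv[12], const uint8_t tag[16], uint8_t vault_key[16])
{
    // Stand-in KDF: SHA-256 of the passphrase, truncated to a 128-bit key.
    sgx_sha256_hash_t hash;
    if (sgx_sha256_msg((const uint8_t *)passphrase,
        (uint32_t)strlen(passphrase), &hash) != SGX_SUCCESS) return false;

    uint8_t master_key[16];
    memcpy(master_key, hash, 16);

    // If the master key decrypts the vault key, the tag verifies and the
    // user is authenticated; an incorrect passphrase fails here.
    return decrypt_with_tag_check(master_key, wrapped_key, 16, iv, tag,
        vault_key);
}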

Account Details

The last of the high-level requirements is what actually gets stored in the vault. For this tutorial, we are going to keep things simple. Figure 2 shows an early mockup of the main UI screen.

Figure 2: Early mockup of the Tutorial Password Manager main screen.

The last requirement is all about simplifying the code. By fixing the number of accounts stored in the vault, we can more easily put an upper bound on how large the vault can be. This will be important when we start designing our enclave. Real-world password managers do not, of course, have this luxury, but it is one that can be afforded for the purposes of this tutorial.

Coming Up Next

In part 3 of the tutorial we’ll take a closer look at designing our Tutorial Password Manager for Intel SGX. We’ll identify our secrets, which portions of the application should be contained inside the enclave, how the enclave will interact with the core application, and how the enclave impacts the object model. Stay tuned!

Read the first tutorial in the series, Intel® Software Guard Extensions Tutorial Series: Part 1, Intel® SGX Foundation or find the list of all the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

 

Using Enclaves from .NET: Making ECALLS with Callbacks via OCALLS


One question about Intel® Software Guard Extensions (Intel® SGX) that comes up frequently is how to mix enclaves with managed code on Microsoft Windows* platforms, particularly with the C# language. While enclaves themselves must be 100 percent native code and the enclave bridge functions must be 100 percent native code with C (and not C++) linkages, it is possible, indirectly, to make an ECALL into an enclave from .NET and to make an OCALL from an enclave into a .NET object. There are multiple solutions for accomplishing these tasks, and this article and its accompanying code sample demonstrate one approach.

Mixing Managed Code and Native Code with C++/CLI

Microsoft Visual Studio* 2005 and later offers three options for calling unmanaged code from managed code:

  • Platform Invocation Services, commonly referred to by developers as P/Invoke
  • COM
  • C++/CLI

P/Invoke is good for calling simple C functions in a DLL, which makes it a reasonable choice for interfacing with enclaves, but writing P/Invoke wrappers and marshaling data can be difficult and error-prone. COM is more flexible than P/Invoke, but it is also more complicated; that additional complexity is unnecessary for interfacing with the C bridge functions required by enclaves. This code sample uses the C++/CLI approach.

C++/CLI offers significant convenience by allowing the developer to mix managed and unmanaged code in the same module, creating a mixed-mode assembly which can in turn be linked to modules comprised entirely of either managed or native code. Data marshaling in C++/CLI is also fairly easy: for simple data types it is done automatically through direct assignment, and helper methods are provided for more complex types such as arrays and strings. Data marshaling is, in fact, so painless in C++/CLI that developers often refer to the programming model as IJW (an acronym for “it just works”).
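As a small illustration (this snippet is not from the sample code): simple value types cross the boundary by direct assignment, while the msclr marshaling helpers convert more complex types such as strings in a single call:

#include <msclr/marshal_cppstd.h>
#include <string>

// C++/CLI (compiled with /clr): a managed String^ converts to a native
// std::wstring with one marshaling call.
std::wstring to_native(System::String^ managed)
{
    return msclr::interop::marshal_as<std::wstring>(managed);
}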

The trade-off for this convenience is that there can be a small performance penalty due to the extra layer of functions, and it does require that you produce an additional DLL when interfacing with Intel SGX enclaves.

Figure 1: Minimum component makeup of an Intel® Software Guard Extensions application written in C# and C++/CLI.

Figure 1 illustrates the component makeup of a C# application when using the C++/CLI model. The managed application consists of, at minimum, a C# executable, a C++/CLI DLL, the native enclave bridge DLL, and the enclave DLL itself.

The Sample Application

The sample application provides two functions that execute inside of an enclave: one calls CPUID, and the other generates random data in 1KB blocks and XORs them together to produce a final 1KB block of random bytes. This is a multithreaded application, and you can run all three tasks simultaneously. The user interface is shown in Figure 2.

Figure 2: Sample application user interface.

To build the application you will need the Intel SGX SDK. This sample was created using the 1.6 Intel SGX SDK and built with Microsoft Visual Studio 2013. It targets the .NET framework 4.5.1.

The CPUID Tab

On the CPUID panel, you enter a value for EAX to pass to the CPUID instruction. When you click Query, the program executes an ECALL on the current thread and runs the sgx_cpuid() function inside the enclave. Note that sgx_cpuid() does, in turn, make an OCALL to execute the CPUID instruction, since CPUID is not a legal instruction inside an enclave. This OCALL is automatically generated for you by the edger8r tool when you build your enclave. See the Intel SGX SDK Developer Guide for more information on the sgx_cpuid() function.
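Inside the enclave, the ECALL is little more than a wrapper around sgx_cpuid(). A sketch of what e_cpuid() might look like (the sample's actual body may differ; returning 1 for success matches how the C# layer checks the result):

#include <sgx_cpuid.h>

int e_cpuid(int leaf, uint32_t flags[4])
{
    int info[4];

    // sgx_cpuid() transparently performs the OCALL that runs CPUID.
    if (sgx_cpuid(info, leaf) != SGX_SUCCESS) return 0;

    for (int i = 0; i < 4; ++i) flags[i] = (uint32_t)info[i];
    return 1;
}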

The RDRAND Tab

On the RDRAND panel you can generate up to two simultaneous background threads. Each thread performs the same task: it makes an ECALL to enter the enclave and generates the target amount of random data using the sgx_read_rand() function in 1 KB blocks. Each 1 KB block is XORed with the previous block to produce a final 1 KB block of random data that is returned to the application (the first block is XORed with a block of 0s).

For every 1 MB of random data that is generated, the function also executes an OCALL to send the progress back up to the main application via a callback. The callback function then runs a thread in the UI context to update the progress bar.
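A sketch of the enclave-side loop makes this flow concrete. This is not the sample's exact code: error handling is trimmed, and the OCALL is shown through the proxy that the edger8r tool generates for o_genrand_progress() (the generated header name is assumed here):

#include <string.h>
#include <sgx_trts.h>
#include "Enclave_t.h"   // edger8r-generated OCALL proxies (name assumed)

int e_genrand(int kb, void *callback, size_t sz, unsigned char *block)
{
    unsigned char chunk[1024];
    int keep_going = 1;

    memset(block, 0, 1024);                 // first block XORs against zeroes

    for (int i = 0; i < kb; ++i) {
        if (sgx_read_rand(chunk, 1024) != SGX_SUCCESS) return 0;
        for (int j = 0; j < 1024; ++j) block[j] ^= chunk[j];

        if ((i % 1024) == 0) {
            // OCALL proxy: the first argument receives the untrusted
            // function's return value (0 means the user canceled).
            o_genrand_progress(&keep_going, callback, sz, i / 1024, kb / 1024);
            if (!keep_going) return 0;
        }
    }
    return 1;
}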

Because this function runs asynchronously, you can have both threads in the UI active at once and even switch to the CPUID tab to execute that ECALL while the RDRAND ECALLs are still active.

Overall Structure

The application is made up of the following components, three of which we’ll examine in detail:

  • C# application. A Windows Forms*-based application that implements the user interface.
  • EnclaveLink.dll. A mixed-mode DLL responsible for marshaling data between .NET and native code. This assembly contains two classes: EnclaveLinkManaged and EnclaveLinkNative.
  • EnclaveBridge.dll. A native DLL containing the enclave bridge functions. These are pure C functions.
  • Enclave.dll (Enclave.signed.dll). The Intel SGX enclave.

There is also a fifth component, sgx_support_detect.dll, which is responsible for the runtime check of Intel SGX capability. It ensures that the application exits gracefully when run on a system that does not support Intel SGX. We won’t be discussing this component here, but for more information on how it works and why it’s necessary, see the article Properly Detecting Intel® Software Guard Extensions in Your Applications.

The general application flow is as follows: the enclave is not created immediately when the application launches. Instead, the application initializes some global variables for referencing the enclave and creates a mutex. When a UI event occurs, the first thread that needs to run an enclave function checks to see whether the enclave has already been created, and if not, it launches the enclave. All subsequent threads and events reuse that same enclave. In order to keep the sample application architecture relatively simple, the enclave is not destroyed until the program exits.

The C# Application

The main executable is written in C#. It requires a reference to the EnclaveLink DLL in order to execute the C/C++ methods that eventually call into the enclave.

On startup, the application calls static methods to prepare the application for the enclave, and then closes it on exit:

        public FormMain()
        {
            InitializeComponent();
            // This doesn't create the enclave, it just initializes what we need
            // to do so in an multithreaded environment.
            EnclaveLinkManaged.init_enclave();
        }

        ~FormMain()
        {
            // Destroy the enclave (if we created it).
            EnclaveLinkManaged.close_enclave();
        }

These two functions are simple wrappers around functions in EnclaveLinkNative and are discussed in more detail below.

When either the CPUID or RDRAND function is executed via the GUI, the application creates an instance of class EnclaveLinkManaged and executes the appropriate method. The CPUID execution flow is shown below:

      private void buttonCPUID_Click(object sender, EventArgs e)
        {
            int rv;
            UInt32[] flags = new UInt32[4];
            EnclaveLinkManaged enclave = new EnclaveLinkManaged();

            // Query CPUID and get back an array of 4 32-bit unsigned integers

            rv = enclave.cpuid(Convert.ToInt32(textBoxLeaf.Text), flags);
            if (rv == 1)
            {
                textBoxEAX.Text = String.Format("{0:X8}", flags[0]);
                textBoxEBX.Text = String.Format("{0:X8}", flags[1]);
                textBoxECX.Text = String.Format("{0:X8}", flags[2]);
                textBoxEDX.Text = String.Format("{0:X8}", flags[3]);
            }
            else
            {
                MessageBox.Show("CPUID query failed");
            }
        }

The callbacks for the progress bar in the RDRAND execution flow are implemented using a delegate, which creates a task in the UI context to update the display. The callback methodology is described in more detail later.

        Boolean cancel = false;
        progress_callback callback;
        TaskScheduler uicontext;

        public ProgressRandom(int mb_in, int num_in)
        {
            enclave = new EnclaveLinkManaged();
            mb = mb_in;
            num = num_in;
            uicontext = TaskScheduler.FromCurrentSynchronizationContext();
            callback = new progress_callback(UpdateProgress);

            InitializeComponent();

            labelTask.Text = String.Format("Generating {0} MB of random data", mb);
        }

        private int UpdateProgress(int received, int target)
        {
            Task.Factory.StartNew(() =>
            {
                progressBarRand.Value = 100 * received / target;
                this.Text = String.Format("Thread {0}: {1}% complete", num, progressBarRand.Value);
            }, CancellationToken.None, TaskCreationOptions.None, uicontext);

            return (cancel) ? 0 : 1;
        }

The EnclaveLink DLL

The primary purpose of the EnclaveLink DLL is to marshal data between .NET and unmanaged code. It is a mixed-mode assembly that contains two objects:

  • EnclaveLinkManaged, a managed class that is visible to the C# layer
  • EnclaveLinkNative, a native C++ class

EnclaveLinkManaged contains all of the data marshaling functions, and its methods have variables in both managed and unmanaged memory. It ensures that only unmanaged pointers and data get passed to EnclaveLinkNative. Each instance of EnclaveLinkManaged contains an instance of EnclaveLinkNative, and the methods in EnclaveLinkManaged are essentially wrappers around the methods in the native class.
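For example, a wrapper such as the cpuid() method might look like the following sketch (the sample's actual body may differ): the integer argument converts by direct assignment, and pin_ptr<> pins the managed array so the native class can write results through a raw pointer:

int EnclaveLinkManaged::cpuid(int leaf, array<UINT32>^ flags)
{
    // Pin the managed array so the GC can't move it during the native call.
    pin_ptr<UINT32> pflags = &flags[0];

    // The native object performs the actual ECALL via the bridge DLL.
    return native->cpuid(leaf, (uint32_t *)pflags);
}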

EnclaveLinkNative is responsible for interfacing with the enclave bridge functions in the EnclaveBridge DLL. It is also responsible for initializing the global enclave variables and handling the locking.

#define MUTEX L"Enclave"

static sgx_enclave_id_t eid = 0;
static sgx_launch_token_t token = { 0 };
static HANDLE hmutex;
int launched = 0;

void EnclaveLinkNative::init_enclave()
{
	hmutex = CreateMutex(NULL, FALSE, MUTEX);
}

void EnclaveLinkNative::close_enclave()
{
	if (WaitForSingleObject(hmutex, INFINITE) != WAIT_OBJECT_0) return;

	if (launched) en_destroy_enclave(eid);
	eid = 0;
	launched = 0;

	ReleaseMutex(hmutex);
}

int EnclaveLinkNative::get_enclave(sgx_enclave_id_t *id)
{
	int rv = 1;
	int updated = 0;

	if (WaitForSingleObject(hmutex, INFINITE) != WAIT_OBJECT_0) return 0;

	if (launched) *id = eid;
	else {
		sgx_status_t status;

		status= en_create_enclave(&token, &eid, &updated);
		if (status == SGX_SUCCESS) {
			*id = eid;
			rv = 1;
			launched = 1;
		} else {
			rv= 0;
			launched = 0;
		}
	}
	ReleaseMutex(hmutex);

	return rv;
}

The EnclaveBridge DLL

As the name suggests, this DLL holds the enclave bridge functions. This is a 100 percent native assembly with C linkages, and the methods from EnclaveLinkNative call into these functions. Essentially, they wrap the ECALLs, marshaling data between the mixed-mode assembly and the enclave.

The OCALL and the Callback Sequence

The most complicated piece of the sample application is the callback sequence used by the RDRAND operation. The OCALL must propagate from the enclave all the way up to the C# layer of the application. The task is to pass a reference to a managed class instance method (a delegate) down to the enclave so that it can be invoked via the OCALL. The challenge is to do that within the following restrictions:

  1. The enclave is in its own DLL, which cannot depend on other DLLs.
  2. The enclave only supports a limited set of data types.
  3. The enclave can only link against 100 percent native functions with C linkages.
  4. There cannot be any circular DLL dependencies.
  5. The methodology must be thread-safe.
  6. The user must be able to cancel the operation.

The Delegate

The delegate is prototyped inside of EnclaveLinkManaged.h along with the EnclaveLinkManaged class definition:

public delegate int progress_callback(int, int);

public ref class EnclaveLinkManaged
{
	array<BYTE> ^rand;
	EnclaveLinkNative *native;

public:
	progress_callback ^callback;

	EnclaveLinkManaged();
	~EnclaveLinkManaged();

	static void init_enclave();
	static void close_enclave();

	int cpuid(int leaf, array<UINT32>^ flags);
	String ^genrand(int mb, progress_callback ^cb);

	// C++/CLI doesn't support friend classes, so this is exposed publicly even though
	// it's only intended to be used by the EnclaveLinkNative class.

	int genrand_update(int generated, int target);
};

When each ProgressRandom object is instantiated, a delegate is assigned in the variable callback, pointing to the UpdateProgress instance method:

    public partial class ProgressRandom : Form
    {
        EnclaveLinkManaged enclave;
        int mb;
        Boolean cancel = false;
        progress_callback callback;
        TaskScheduler uicontext;
        int num;

        public ProgressRandom(int mb_in, int num_in)
        {
            enclave = new EnclaveLinkManaged();
            mb = mb_in;
            num = num_in;
            uicontext = TaskScheduler.FromCurrentSynchronizationContext();
            callback = new progress_callback(UpdateProgress);

            InitializeComponent();

            labelTask.Text = String.Format("Generating {0} MB of random data", mb);
        }

This variable is passed as an argument to the EnclaveLinkManaged object when the RDRAND operation is requested:

        public Task<String> RunAsync()
        {
            this.Refresh();

            // Create a thread using Task.Run

            return Task.Run<String>(() =>
            {
                String data;

                data= enclave.genrand(mb, callback);

                return data;
            });
        }

The genrand() method inside of EnclaveLinkManaged saves this delegate to the property “callback”. It also creates a GCHandle that refers to the object itself, which keeps the object alive (the garbage collector cannot free it while the handle exists) and provides a value that can be converted to an IntPtr and handed to unmanaged code. This handle is passed as a pointer to the native object.

This is necessary because we cannot directly store a handle to a managed object as a member of an unmanaged class.

String ^EnclaveLinkManaged::genrand(int mb, progress_callback ^cb)
{
	UInt32 rv;
	int kb= 1024*mb;
	String ^mshex = gcnew String("");
	unsigned char *block;
	// Marshal a handle to the managed object to a system pointer that
	// the native layer can use.
	GCHandle handle= GCHandle::Alloc(this);
	IntPtr pointer= GCHandle::ToIntPtr(handle);

	callback = cb;
	// Use nothrow new (from <new>) so the NULL check below is meaningful;
	// plain new would throw std::bad_alloc instead of returning NULL.
	block = new (std::nothrow) unsigned char[1024];
	if (block == NULL) return mshex;

	// Call into the native layer. This will make the ECALL, which executes
	// callbacks via the OCALL.

	rv= (UInt32) native->genrand(kb, pointer.ToPointer(), block);

In the native object, we now have a pointer to the managed object, which we save in the member variable managed.

Next, we use a feature of C++11 to create a std::function reference that is bound to a class method. Unlike standard C function pointers, this std::function reference points to the class method in our instantiated object, not to a static or global function.

DWORD EnclaveLinkNative::genrand (int mkb, void *obj, unsigned char rbuffer[1024])
{
	using namespace std::placeholders;
	auto callback= std::bind(&EnclaveLinkNative::genrand_progress, this, _1, _2);
	sgx_status_t status;
	int rv;
	sgx_enclave_id_t thiseid;

	if (!get_enclave(&thiseid)) return 0;

	// Store the pointer to our managed object as a (void *). We'll Marshall this later.

	managed = obj;

	// Retry if we lose the enclave due to a power transition
again:
	status= en_genrand(thiseid, &rv, mkb, callback, rbuffer);

Why do we need this layer of indirection? Because the next layer down, EnclaveBridge.dll, cannot have a linkage dependency on EnclaveLink.dll as this would create a circular reference (where A depends on B, and B depends on A). EnclaveBridge.dll needs an anonymous means of pointing to our instantiated class method.

Inside en_genrand() in EnclaveBridge.cpp, this std::function is converted to a void pointer. Enclaves only support a subset of data types, and they don’t support any of the C++11 extensions regardless. We need to convert the std::function pointer to something the enclave will accept. In this case, that means passing the pointer address in a generic data buffer. Why use a void pointer instead of an integer type? Because the size of a std::function object varies by architecture.

typedef std::function<int(int, int)> progress_callback_t;

ENCLAVENATIVE_API sgx_status_t en_genrand(sgx_enclave_id_t eid, int *rv, int kb, progress_callback_t callback, unsigned char *rbuffer)
{
	sgx_status_t status;
	size_t cbsize = sizeof(progress_callback_t);

	// Pass the callback object to the enclave as an opaque, sized buffer.
	status = e_genrand(eid, rv, kb, (void *)&callback, cbsize, rbuffer);

	return status;
}

Note that we not only must allocate this data buffer, but also tell the edger8r tool how large the buffer is. That means we need to pass the size of the buffer in as an argument, even though it is never explicitly used.

Inside the enclave, the callback parameter literally just gets passed through and out the OCALL. The definition in the EDL file looks like this:

enclave {
	from "sgx_tstdc.edl" import *;

    trusted {
        /* define ECALLs here. */

		public int e_cpuid(int leaf, [out] uint32_t flags[4]);
		public int e_genrand(int kb, [in, size=sz] void *callback, size_t sz, [out, size=1024] unsigned char *block);
    };

    untrusted {
        /* define OCALLs here. */

		int o_genrand_progress ([in, size=sz] void *callback, size_t sz, int progress, int target);
    };
};

The callback starts unwinding in the OCALL, o_genrand_progress:

typedef std::function<int(int, int)> progress_callback_t;

int o_genrand_progress(void *cbref, size_t sz, int progress, int target)
{
	progress_callback_t *callback = (progress_callback_t *) cbref;

	// Recast as a pointer to our callback function.

	if (callback == NULL) return 1;

	// Propagate the cancellation condition back up the stack.
	return (*callback)(progress, target);
}

The callback parameter, cbref, is recast as a std::function binding and then executed with our two arguments: progress and target. This points back to the genrand_progress() method inside of the EnclaveLinkNative object, where the GCHandle is recast to a managed object reference and then executed.

int __cdecl EnclaveLinkNative::genrand_progress (int generated, int target)
{
	// Marshal a pointer to a managed object to native code and convert it to an object pointer we can use
	// from CLI code

	EnclaveLinkManaged ^mobj;
	IntPtr pointer(managed);
	GCHandle mhandle;

	mhandle= GCHandle::FromIntPtr(pointer);
	mobj= (EnclaveLinkManaged ^)mhandle.Target;

	// Call the progress update function in the Managed version of the object. A retval of 0 means
	// we should cancel our operation.

	return mobj->genrand_update(generated, target);
}

The next stop is the managed object. Here, the delegate that was saved in the callback class member is used to call up to the C# method.

int EnclaveLinkManaged::genrand_update(int generated, int target)
{
	return callback(generated, target);
}

This executes the UpdateProgress() method, which updates the UI. This delegate returns an int value of either 0 or 1, which represents the status of the cancellation button: 

        private int UpdateProgress(int received, int target)
        {
            Task.Factory.StartNew(() =>
            {
                progressBarRand.Value = 100 * received / target;
                this.Text = String.Format("Thread {0}: {1}% complete", num, progressBarRand.Value);
            }, CancellationToken.None, TaskCreationOptions.None, uicontext);

            return (cancel) ? 0 : 1;
        }

A return value of 0 means the user has asked to cancel the operation. This return code propagates back down the application layers into the enclave. The enclave code looks at the return value of the OCALL to determine whether or not to cancel:

        // Make our callback. Be polite and only do this every MB.
        // (Assuming 1 KB = 1024 bytes, 1MB = 1024 KB)
        if (!(i % 1024)) {
            status = o_genrand_progress(&rv, callback, sz, i + 1, kb);
            // rv == 0 means we got a cancellation request
            if (status != SGX_SUCCESS || rv == 0) return i;
         } 

Enclave Configuration

The default configuration for an enclave is to allow a single thread. As the sample application can run up to three threads in the enclave at one time—the CPUID function on the UI thread and the two RDRAND operations in background threads—the enclave configuration needed to be changed. This is done by setting the TCSNum parameter to 3 in Enclave.config.xml. If this parameter is left at its default of 1 only one thread can enter the enclave at a time, and simultaneous ECALLs will fail with the error code SGX_ERROR_OUT_OF_TCS.

<EnclaveConfiguration>
  <ProdID>0</ProdID>
  <ISVSVN>0</ISVSVN>
  <StackMaxSize>0x40000</StackMaxSize>
  <HeapMaxSize>0x100000</HeapMaxSize>
  <TCSNum>3</TCSNum>
  <TCSPolicy>1</TCSPolicy>
  <DisableDebug>0</DisableDebug>
  <MiscSelect>0</MiscSelect>
  <MiscMask>0xFFFFFFFF</MiscMask>
</EnclaveConfiguration>

Summary

Mixing Intel SGX with managed code is not difficult, but it can involve a number of intermediate steps. The sample C# application presented in this article represents one of the more complicated cases: multiple DLLs, multiple threads originating from .NET, locking in native space, OCALLs, and UI updates based on enclave operations. It is intended to demonstrate the flexibility that application developers really have when working with Intel SGX, in spite of the restrictions that enclaves impose.


Improve Video Quality, Build Extremely Efficient Encoders & Decoders with Intel® VPA & Intel® SBE


Video Codec Developers: This could be your magic encoder/decoder ring. We're excited to announce the new Intel® Video Pro Analyzer 2017 (Intel® VPA) and Intel® Stress Bitstreams and Encoder 2017 (Intel® SBE), which you can use to enhance the brilliance of your video quality and build extremely efficient, robust encoders and decoders. Get the scoop and more technical details on these advanced Intel video analysis tools below.

Learn more: Intel® VPA  |  Intel® SBE

 

Enhance Video Quality & Streaming for AVC, HEVC, VP9 & MPEG-2 with Intel VPA

Improving your encoder's video quality and compliance becomes faster and easier. Intel® VPA, a comprehensive video analysis toolset to inspect, debug and optimize the encode/decode process for AVC, HEVC, VP9 and MPEG-2, brings efficiency and multiple UI enhancements in its 2017 edition. A few of the top new features include:

Optimized Performance & Efficiency

  • HEVC file indexing makes Intel VPA faster and easier to use, with better performance and responsiveness when loading streams, executing debug optimizations, and switching between frames.
  • MPEG-2 error resilience improvements (previously delivered for HEVC and VP9 analysis) 
  • Significant improvements in decode processing time: 30% for HEVC and 60% for AVC, along with AVC playback optimization. This includes skipping some intermediate processing when the user clicks on frames to decode in the GUI.
  • Video Quality Caliper provides more stream information, and has faster playback speed. 

Enhanced Playback & Navigation

New performance enhancements in the 2017 release include decoder performance optimization with good gains for linear playback and indexing (for HEVC) to facilitate very fast navigation within the stream. Playback for HEVC improved 1.4x, and for AVC improved 2.2x.1

  • Performance analysis for HEVC and AVC playback (blue bars) consists of the ratio of average times to load one Time Lapse Footage American Cities sequence, 2160p @ 100 frames.
  • Performance analysis for HEVC Random Navigation (orange bar) improved by 12x and consists of the ratio of latency differences to randomly access the previous frame from the current frame, measured on 2160p HEVC/AVC video.

 

UI Enhancements​

  •  Filtering of error messages and new settings to save fixes
  • Additional options for save/load, including display/decode order, fields/frame, yuv mode, and more.
  • Improved GUI picture caching.

And don't forget: with this advanced video analysis tool, you can innovate for UHD with BT.2020 support. To see the full list of Intel VPA features, visit the product site for more details. Versions for Linux*, Windows* and Mac OS X* are available.

For current users, Download the Intel VPA 2017 Edition Now

If you are not already using Intel VPA -  Get a Free Trial2

More Resources - Get Started Optimizing Faster


Build Compliant HEVC & AVS2 Decoders with new Intel SBE 2017  

Intel SBE is a set of streams and tools for VP9, HEVC, and AVS2 used for extensive validation of decoders, transcoders, players, and streaming solutions. You can also create custom bitstreams for testing and optimize the stream base for coverage and usage efficiency. The new 2017 release delivers:
  • Improved HEVC coverage, including syntax coverage to ensure that decoders are in full conformance with the standard, plus long-term reference generation support for video conferencing.
     
  • Random Encoder for AVS2 Main and Main 10 (this can be shipped only to members of the AVS2 committee; the AVS2 format is broadly used in the People's Republic of China).
     
  • Compliance with the recently finalized AVS2 standard.

Learn more by visiting the Intel SBE site

Take a test drive of Intel SBE -  Download a Free Evaluation2


 

1Baseline configuration: Intel® VPA 2017 vs. 2016 running on Microsoft Windows* 8.1. Intel Customer Reference Platform with Intel® Core™ i7-5775C (3.4 GHz, 32 GB DDR3 DRAM). Gigabyte Z97-HD3 Desktop board, 32GB (4x8GB DDR3 PC3-12800 (800MHz) DIMM), 500GB Intel SSD, Turbo Boost Enabled, and HT Enabled. Source: Intel internal measurements as of August 2016.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/performance. Features and benefits may require an enabled system and third party hardware, software or services. Consult your system provider.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804.

2Note that in evaluation package, the number of streams and/or other capabilities may be limited. Contact Intel Sales if you need a review without these limitations. 


Intel® Memory Protection Extensions on Windows® 10: A Tutorial


Introduction

Beginning with the 6th generation Intel® Core™ processor, Intel has introduced Intel® Memory Protection Extensions (Intel® MPX), a new extension to the instruction set architecture that aims to enhance software security by helping to protect against buffer overflow attacks. In this article, we discuss buffer overflow, and then give step-by-step details on how application developers can prevent their apps from suffering from buffer overflow attacks on Windows® 10. Intel MPX works for both traditional desktop apps and Universal Windows Platform* apps.

Prerequisites

To run the samples discussed in this article, you’ll need the following hardware and software:

  • A computer (desktop, laptop, or any other form factor) with a 6th generation Intel® Core™ processor and Microsoft Windows 10 (November 2015 update or later; Windows 10 version 1607 is preferred)
  • Intel MPX enabled in UEFI (if the option is available)
  • Intel MPX driver properly installed
  • Microsoft Visual Studio* 2015 IDE (Update 1 or later; Visual Studio 2015 Update 3 is preferred)

Buffer Overflow

C/C++ code is by nature more susceptible to buffer overflows. For example, in the following code the string operation function “strcpy” in main() will put the program at risk for a buffer overflow attack.

#include "stdafx.h"
#include <iostream>
#include <time.h>
#include <stdlib.h>

using namespace std;

void GenRandomUname(char* uname_string, const int uname_len)
{
	srand(time(NULL));
	for (int i = 0; i < uname_len; i++)
	{
		uname_string[i] = (rand() % ('9' - '0' + 1)) + '0';
	}
	uname_string[uname_len] = '\0';
}

int main(int argnum, char** args)
{
	char user_name[16];
	GenRandomUname(user_name, 15);
	cout << "randomly generated user name: "<< user_name << endl;

	// Guard against a missing argument; the overflow demo needs args[1].
	if (argnum < 2) return -1;

	char config[10] = { '\0' };
	strcpy(config, args[1]);	// deliberately unchecked copy: this is the overflow

	cout << "config mem addr: "<< &config << endl;
	cout << "user_name mem addr: "<< &user_name << endl;

	if (0 == strcmp("ROOT", user_name))
	{
		cout << "Buffer Overflow Attacked!"<< endl;
		cout << "Uname changed to: "<< user_name << endl;
	}
	else
	{
		cout << "Uname OK: "<< user_name << endl;
	}
	return 0;
}

To be more concrete: if we compile and run the above sample as a C++ console application, passing CUSM_CFG as an argument, the program runs normally and the console shows the following output:

Figure 1 Buffer Overflow

But if we rerun the program passing CUSTOM_CONFIGUREROOT as an argument, the output will be “unexpected” and the console will show a message like this:

Figure 2 Buffer Overflow

This simple example shows how a buffer overflow attack works. The reason there can be unexpected output is that the call to strcpy does not check the bounds of the destination array. Although compilers usually give arrays several extra bytes for memory alignment purposes, a buffer overflow can still happen if the source array is long enough. In this case, a piece of the runtime memory layout of the program looks like this (results may vary with different compilers or compile options):

Figure 3

Intel Memory Protection Extensions

With the help of Intel MPX, we can avoid the buffer overflow security issue simply by adding the compile option /d2MPX to the Visual Studio C++ compiler.
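In the IDE, this goes under Project Properties > C/C++ > Command Line > Additional Options. From a developer command prompt, an equivalent invocation might look like this (the source file name here is ours, not part of the sample):

cl /d2MPX /Zi /EHsc overflow_demo.cpp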

Figure 4

After recompiling with the Intel MPX option, the program is able to defend against buffer overflow attacks. If we try running the recompiled program with CUSTOM_CONFIGUREROOT argument, a runtime exception will arise and cause the program to exit.

Figure 5

Let’s dig into the generated assembly code to see what Intel MPX has done with the program. From the results, we can see that many of the instructions related to Intel MPX have been inserted into the original instructions to detect buffer overflows at runtime.

Figure 6

Now let’s look in more detail at the instructions related to Intel MPX:

bndmk: Creates LowerBound (LB) and UpperBound (UB) in the bounds register (%bnd0) in the code snapshot above.
bndmov: Fetches the bounds information (upper and lower) out of memory and puts it in a bounds register.
bndcl: Checks the lower bounds against an argument (%rax) in the code snapshot above.
bndcu: Checks the upper bounds against an argument (%rax) in the code snapshot above.

Troubleshooting

If Intel MPX is not working properly:

  1. Double-check the versions of your CPU, OS, and Visual Studio 2015. Boot the PC into the UEFI settings to check if there is any Intel MPX switch; turn on the switch if needed.
  2. Confirm that the Intel MPX driver is installed and functioning properly in the Windows* Device Manager (Figure 7).
  3. Check that the compiled executable contains instructions related to Intel MPX. Insert a breakpoint, and then run the program. When the breakpoint is hit, right-click with the mouse, and then click Go To Disassembly. A new window will open for viewing the assembly code.

Figure 8

Conclusion

Intel MPX is a new hardware solution that helps defend against buffer overflow attacks. Compared with software solutions such as AddressSanitizer (https://code.google.com/p/address-sanitizer/), from an application developer’s point of view, Intel MPX has many advantages, including the following:

  • Detects when a pointer points out of the object but still points to valid memory.
  • Intel MPX is more flexible: it can be used in certain modules without affecting any other modules.
  • Compatibility with legacy code is much higher for code instrumented with Intel MPX.
  • A single binary version can still be released because of the particular instruction encoding: the instructions related to Intel MPX are executed as NOPs (no operations) on unsupported hardware or operating systems.

On 6th generation Intel® Core™ processors and Windows 10, benefiting from Intel MPX is as simple as adding a compiler option, which can help enhance application security without hurting the application’s backward compatibility.

Related Articles

Intel® Memory Protection Extensions Enabling Guide:

https://software.intel.com/en-us/articles/intel-memory-protection-extensions-enabling-guide

References

[1] AddressSanitizer: https://code.google.com/p/address-sanitizer/

About the Author

Fanjiang Pei is an application engineer in the Client Computing Enabling Team, Developer Relations Division, Software and Solutions Group (SSG). He is responsible for enabling security technologies of Intel such as Intel MPX, Intel® Software Guard Extensions, and more.

Intel® Software Guard Extensions Tutorial Series: Part 3, Designing for Intel® SGX


In Part 3 of the Intel® Software Guard Extensions (Intel® SGX) tutorial series we’ll talk about how to design an application with Intel SGX in mind. We’ll take the concepts that we reviewed in Part 1, and apply them to the high-level design of our sample application, the Tutorial Password Manager, laid out in Part 2. We’ll look at the overall structure of the application and how it is impacted by Intel SGX and create a class model that will prepare us for the enclave design and integration.

You can find the list of all of the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

While we won’t be coding up enclaves or enclave interfaces just yet, there is source code provided with this installment. The non-Intel SGX version of the application core, without its user interface, is available for download. It comes with a small test program, a console application written in C#, and a sample password vault file.

Designing for Enclaves

This is the general approach we’ll follow for designing the Tutorial Password Manager for Intel SGX:

  1. Identify the application’s secrets.
  2. Identify the providers and consumers of those secrets.
  3. Determine the enclave boundary.
  4. Tailor the application components for the enclave.

Identify the Application’s Secrets

The first step in designing an application for Intel SGX is to identify the application’s secrets.

A secret is anything that is not meant to be known or seen by others. Only the user or the application for which it is intended should have access to a secret, and it should not be exposed to others users or applications regardless of their privilege level. Potential secrets can include financial information, medical records, personally identifiable information, identity data, licensed media content, passwords, and encryption keys.

In the Tutorial Password Manager, there are several items that are immediately identifiable as secrets, shown in Table 1.


Secret

  • The user’s account passwords
  • The user’s account logins
  • The user’s master password or passphrase
  • The master key for the password vault
  • The encryption key for the account database

Table 1: Preliminary list of application secrets.


These are the obvious choices, but we’re going to expand this list by including all of the user’s account information and not just their logins. The revised list is shown in Table 2.


Secret

  • The user’s account passwords
  • The user’s account information
  • The user’s master password or passphrase
  • The master key for the password vault
  • The encryption key for the account database

Table 2: Revised list of application secrets.

Even without revealing the passwords, the account information is valuable to attackers. Exposing this data in the password manager leaks valuable clues to those with malicious intent. With this data, they can choose to launch attacks against the services themselves, perhaps using social engineering or password reset attacks, to obtain access to the owner’s account because they know exactly what to target.

Identify the Providers and Consumers of the Application’s Secrets

Once the application’s secrets have been identified, the next step is to determine their origins and destinations.

In the current version of Intel SGX, the enclave code is not encrypted, which means that anyone with access to the application files can disassemble and inspect it. By definition, something cannot be a secret if it is open to inspection, and that means that secrets should never be statically compiled into enclave code. An application’s secrets must originate from outside its enclaves and be loaded into them at runtime. In Intel SGX terminology, this is referred to as provisioning secrets into the enclave.

When a secret originates from a component outside of the Trusted Compute Base (TCB), it is important to minimize its exposure to untrusted code. (One of the main reasons why remote attestation is such a valuable component of Intel SGX is that it allows a service provider to establish a trusted relationship with an Intel SGX application, and then derive an encryption key that can be used to provision encrypted secrets to the application that only the trusted enclave on that client system can decrypt.) Similar care must be taken when a secret is exported out of an enclave. As a general rule, an application’s secrets should not be sent to untrusted code without first being encrypted inside of the enclave.

Unfortunately for the Tutorial Password Manager application, we do need to send secrets into and out of the enclave, and those secrets will have to exist in clear text at some point. The end user will be entering his or her account information and password via a keyboard or touchscreen, and recalling it at a future time as needed. Their account passwords will need to be shown on the screen, and even copied to the Windows* clipboard on request. These are core requirements for a password manager application to be useful.

What that means for us is that we can’t completely eliminate the attack surface: we can only minimize it, and we’ll need some mitigation strategy for dealing with secrets when they exist outside the enclave in plain text.

The user’s account passwords
  Sources: User input*; Password vault file
  Destinations: User interface*; Clipboard*; Password vault file

The user’s account information
  Sources: User input*; Password vault file
  Destinations: User interface*; Password vault file

The user’s master password or passphrase
  Source: User input
  Destination: Key derivation function

The master key for the password vault
  Source: Key derivation function
  Destination: Database key crypto

The encryption key for the password database
  Sources: Random generation; Password vault file
  Destinations: Password vault crypto; Password vault file

Table 3: Application secrets, their sources, and their destinations. Potential security risks are denoted with an asterisk (*).

Table 3 adds the sources and destinations for the Tutorial Password Manager’s secrets. Potential problems—areas where secrets may be exposed to untrusted code—are denoted with an asterisk (*).

Determine the Enclave Boundary

Once the secrets have been identified, it’s time to determine the boundary for the enclave. Start by looking at the data flow of secrets through the application’s core components. The enclave boundary should:

  • Encompass the minimum set of critical components that act on your application’s secrets.
  • Completely contain as many secrets as is feasible.
  • Minimize the interactions with, and dependencies on, untrusted code.

The data flows and chosen enclave boundary for the Tutorial Password Manager application are shown in Figure 1.

Figure 1: Data flow for secrets in the Tutorial Password Manager.

Here, the application secrets are depicted as circles, with blue circles representing secrets that will exist in plain text (unencrypted) at some point during the application’s execution and green circles representing secrets that are encrypted by the application. The enclave boundary has been drawn around the encryption and decryption routines, the key derivation function (KDF) and the random number generator. This does several things for us:

  1. The database/vault key, which is used to encrypt some of our application’s secrets (account information and passwords), is generated within the enclave and is never sent outside of it in clear text.
  2. The master key is derived from the user’s passphrase inside the enclave, and used to encrypt and decrypt the database/vault key. The master key is ephemeral and is never sent outside the enclave in any form.
  3. The database/vault key, account information, and account passwords are encrypted inside the enclave using encryption keys that are not visible to untrusted code (see #1 and #2).

Unfortunately, we have issues with unencrypted secrets crossing the enclave boundary that we simply can’t avoid. At some point during the Tutorial Password Manager’s execution, a user will have to enter a password on the keyboard or copy a password to the Windows clipboard. These are insecure channels that can’t be placed inside the enclave, and the operations are absolutely necessary for the functioning of the application. This is potentially a huge problem, which is compounded by the decision to build the application on top of a managed code base.

Protecting Secrets Outside the Enclave

There are no complete solutions for securing unencrypted secrets outside the enclave, only mitigation strategies that reduce the attack surface. The best we can do is minimize the amount of time that this information exists in a form that is easily compromised.

Here is some general advice for handling sensitive data in untrusted code:

  • Zero-fill your data buffers when you are done with them. Be sure to use functions such as SecureZeroMemory (Windows) and memzero_explicit (Linux) that are guaranteed to not be optimized out by the compiler.
  • Do not use the C++ standard template library (STL) containers to store sensitive data. The STL containers have their own memory management, which makes it difficult to ensure that the memory allocated to an object is securely wiped when the object is deleted. (By using custom allocators you can address this issue for some containers.)
  • When working with managed code such as .NET, or languages that feature automatic memory management, use storage types that are specifically designed for holding secure data. Other storage types are at the mercy of the garbage collector and just-in-time compilation, and may not be cleared or freed on demand (if at all).
  • If you must place data on the clipboard be sure to clear it after a short length of time. In particular, don’t allow it to remain there after the application has exited.

For the Tutorial Password Manager project, we have to work with both native and managed code. In native code, we’ll allocate wchar_t and char buffers, and use SecureZeroMemory to wipe them clean before freeing them. In the managed code space, we’ll employ .NET’s SecureString class.
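As an illustration, here is a minimal sketch of that wipe-before-free pattern (the helper name and buffer handling are ours, not the tutorial's actual source):

#include <windows.h>

// Hypothetical helper: wipe a native password buffer before freeing it
// so the plain-text secret does not linger on the heap.
static void wipe_and_free(wchar_t *buffer, size_t nchars)
{
	if (buffer == NULL) return;

	// SecureZeroMemory is guaranteed not to be optimized away by the
	// compiler, unlike a plain memset followed by a free.
	SecureZeroMemory(buffer, nchars * sizeof(wchar_t));
	delete[] buffer;
}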

When sending a SecureString to unmanaged code, we’ll use the helper functions from System::Runtime::InteropServices to marshal the data. 

using namespace System::Runtime::InteropServices;

LPWSTR PasswordManagerCore::M_SecureString_to_LPWSTR(SecureString ^ss)
{
	IntPtr wsp= IntPtr::Zero;

	if (!ss) return NULL;

	// The caller is responsible for releasing this buffer with
	// Marshal::ZeroFreeGlobalAllocUnicode so the plain text is wiped.
	wsp = Marshal::SecureStringToGlobalAllocUnicode(ss);
	return (wchar_t *) wsp.ToPointer();
}

When marshaling data in the other direction, from native code to managed code, we have two methods. If the SecureString object already exists, we’ll use the Clear and AppendChar methods to set the new value from the wchar_t string.

password->Clear();
for (int i = 0; i < wpass_len; ++i) password->AppendChar(wpass[i]);

When creating a new SecureString object, we’ll use the constructor form that creates a SecureString from an existing wchar_t string.

try {
	name = gcnew SecureString(wname, (int) wcslen(wname));
	login = gcnew SecureString(wlogin, (int) wcslen(wlogin));
	url = gcnew SecureString(wurl, (int) wcslen(wurl));
}
catch (...) {
	rv = NL_STATUS_ALLOC;
}

Our password manager also supports transferring passwords to the Windows clipboard. The clipboard is an insecure storage space that can potentially be accessed by other users, and for this reason Microsoft recommends that sensitive data never be placed there. The point of a password manager, though, is to make it possible for users to create strong passwords that they do not have to remember. It also makes it possible to create lengthy passwords of randomly generated characters that would be difficult to type by hand. The clipboard provides much-needed convenience in exchange for some measure of risk.

To mitigate this risk, we need to take some extra precautions. The first is to ensure that the clipboard is emptied when the application exits. This is accomplished in the destructor in one of our native objects.

PasswordManagerCoreNative::~PasswordManagerCoreNative(void)
{
	if (!OpenClipboard(NULL)) return;
	EmptyClipboard();
	CloseClipboard();
}

We’ll also set up a clipboard timer. When a password is copied to the clipboard, we set a timer for 15 seconds and execute a function to clear the clipboard when it fires. If a timer is already running, meaning a new password was placed on the clipboard before the old one expired, that timer is cancelled and the new one takes its place.

// Forward declaration for the timer callback defined below.
static void CALLBACK clear_clipboard_proc(PVOID param, BOOLEAN fired);

void PasswordManagerCoreNative::start_clipboard_timer()
{
	// Use the default Timer Queue

	// Stop any existing timer
	if (timer != NULL) DeleteTimerQueueTimer(NULL, timer, NULL);

	// Start a new timer
	if (!CreateTimerQueueTimer(&timer, NULL, (WAITORTIMERCALLBACK)clear_clipboard_proc,
		NULL, CLIPBOARD_CLEAR_SECS * 1000, 0, 0)) return;
}

static void CALLBACK clear_clipboard_proc(PVOID param, BOOLEAN fired)
{
	if (!OpenClipboard(NULL)) return;
	EmptyClipboard();
	CloseClipboard();
}

Tailor the Application Components for the Enclave

With the secrets identified and the enclave boundary drawn, it’s time to structure the application while taking the enclave into account. There are significant restrictions on what can be done inside of an enclave, and these restrictions will mandate which components live inside the enclave, which live outside of it, and, when porting an existing application, which ones may need to be split in two.

The biggest restriction that impacts the Tutorial Password Manager is that enclaves cannot perform any I/O operations. The enclave can’t read from the keyboard or write to the display so all of our secrets—passwords and account information—must be marshaled into and out of the enclave. It also can’t read from or write to the vault file: the components that parse the vault file must be separated from components that perform the physical I/O. That means we are going to have to marshal more than just our secrets across the enclave boundary: we have to marshal the file contents as well.


Figure 2: Class diagram for the Tutorial Password Manager.

Figure 2 shows the basic class diagram for the application core (excluding the user interface), including which classes serve as the sources and destinations for our secrets. For simplicity’s sake, the PasswordManagerCore class is treated in this diagram as the source and destination for secrets that must interact with the GUI. Table 4 briefly describes each class and its purpose.

PasswordManagerCore (Managed): Interacts with the C# graphical user interface (GUI) and marshals data to the native layer.

PasswordManagerCoreNative (Native, Untrusted): Interacts with the managed PasswordManagerCore class. Also responsible for converting between Unicode and multibyte character data (this will be discussed in more detail in Part 4).

VaultFile (Managed): Reads and writes from the vault file.

Vault (Native, Enclave): Stores the password vault data in AccountRecord members. Deserializes the vault file on reads, and reserializes it for writing.

AccountRecord (Native, Enclave): Stores the account information and password for each account in the user’s password vault.

Crypto (Native, Enclave): Performs cryptographic functions.

DRNG (Native, Enclave): Interface to the random number generator.

Table 4: Class descriptions.

Note that we had to split the handling of the vault file into two pieces: one that does the physical I/O, and one that stores its contents once they are read and parsed. We also had to add serialization and deserialization methods to the Vault object as intermediate sources and destinations for our secrets. All of this is necessary because the VaultFile class can’t know anything about the structure of the vault file itself, since that would require access to cryptographic functions that are located inside the enclave.

We’ve also drawn the connection between the PasswordManagerCoreNative class and the Vault class with a dotted line. As you might recall from Part 2, enclaves can only link to C functions. These two C++ classes cannot directly communicate with one another: they must use an intermediary, which is denoted by the Bridge Functions box.

The Non-Intel® Software Guard Extensions Code Path

The diagram in Figure 2 is for the Intel SGX code path. The PasswordManagerCoreNative class cannot link directly to the Vault class because the latter is inside the enclave. In the non-Intel SGX code path, however, there is no such restriction: PasswordManagerCoreNative can directly contain a member of class Vault. This is the only shortcut we’ll take in the application design for the non-Intel SGX code path. To simplify the enclave integration, the non-enclave code path will still separate the vault processing into the Vault and VaultFile classes.

Another key difference between the two code paths is that the cryptographic functions in the Intel SGX path will come from the Intel SGX SDK. The non-Intel SGX code path can’t use these functions, so they will draw upon Microsoft’s Cryptography API: Next Generation* (CNG). That means we have to maintain two distinct copies of the Crypto class: one for use in enclaves and one for use in untrusted space. We’ll have to do the same with the DRNG class, too, since the Intel SGX code path will call sgx_read_rand instead of using the RDRAND intrinsic.
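To illustrate the kind of split this implies, here is a minimal, hypothetical sketch of how the two DRNG code paths might diverge (the SGX_ENCLAVE macro and the function name are our own illustration, not the tutorial's actual source):

#include <stdint.h>

#ifdef SGX_ENCLAVE
/* Trusted build: use the SGX runtime's RNG wrapper. */
#include <sgx_trts.h>

int drng_rand32(uint32_t *value)
{
	return (sgx_read_rand((unsigned char *) value, sizeof(uint32_t)) == SGX_SUCCESS) ? 1 : 0;
}
#else
/* Untrusted build: call the RDRAND intrinsic directly, retrying a few
   times in case the DRNG is temporarily drained. */
#include <immintrin.h>

int drng_rand32(uint32_t *value)
{
	for (int i = 0; i < 10; ++i) {
		if (_rdrand32_step(value)) return 1;
	}
	return 0;
}
#endif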

Sample Code

As mentioned in the introduction, there is sample code provided with this part for you to download. The attached archive includes the source code for the Tutorial Password Manager core DLL, prior to enclave integration. In other words, this is the non-Intel SGX version of the application core. There is no user interface provided, but we have included a rudimentary test application written in C# that runs through a series of test operations. It executes two test suites: one that creates a new vault file and performs various operations on it, and one that acts on a reference vault file that is included with the source distribution. As written, the test application expects the test vault to be located in your Documents folder, though you can change this in the TestSetup class if needed.

This source code was developed in Microsoft Visual Studio* Professional 2013 per the requirements stated in the introduction to the tutorial series. It does not require the Intel SGX SDK at this point, though you will need a system that supports Intel® Data Protection Technology with Secure Key.

Coming Up Next

In part 4 of the tutorial we’ll develop the enclave and the bridge functions. Stay tuned!

Find the list of all the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

Deploying applications with Intel® IPP DLLs


Introduction

The Intel® Integrated Performance Primitives (Intel® IPP) is a cross-architecture software library that provides a broad range of library functions for image processing, signal processing, data compression, cryptography, and computer vision, as well as math support routines for such processing capabilities. Intel® IPP is optimized for a wide range of Intel microprocessors.

One of the key advantages of Intel® IPP is performance. The performance advantage comes from functions that are individually optimized for each processor architecture and compiled into one single library. Intel® IPP functions are “dispatched” at run time: the “dispatcher” chooses which of these processor-specific optimized code paths to use when the application makes a call into the IPP library. This is done to maximize each function’s use of the underlying vector instructions and other architecture-specific features.

This paper covers application deployment with Intel® IPP dynamic-link libraries (DLLs). It is important to understand processor detection and library dispatching so that software redistribution is problem-free. Additionally, you want to consider two key factors when it comes to DLLs:

  1. Selection of an appropriate DLL linking model.
  2. The location for the DLLs on the target system.

This document explains how the Intel® IPP dynamic libraries work and discusses these important considerations. For information on all Intel® IPP linking models, please refer to the document Intel® IPP Linkage Models – Quick Reference Guide. Further documentation on Intel® IPP can be found at Intel® Integrated Performance Primitives – Documentation.

Version Information
This document applies to Intel® IPP 2017.xx.xxx for Windows* running 32-bit and 64-bit applications but concepts can also be applied to other operating systems supported by Intel® IPP.

Library Location
Intel® IPP is also a key component of Intel® Parallel Studio XE and Intel® System Studio. The IPP libraries of Parallel Studio can be found in redist directory. For default installation on Windows*, the path to the libraries is set to ’C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_x.xx.xxx\<target_os>’ where ‘x.xx.xxx’ designates the version installed (on certain systems, instead of ‘Program Files (x86)’, the directory name is ‘Program Files’). For convenience <ipp directory> will be used instead throughout this paper.


Note: Please verify that your license permits redistribution before distributing the Intel® IPP DLLs. Any software source code included with this product is furnished under a software license and may only be used or copied in accordance with the terms of that license. Please see the Intel® Software Products End User License Agreement for license definitions and restrictions on the library.
 


Key Concepts

Library Dispatcher
Every Intel® IPP function has many binary implementations, each performance-optimized for a specific target CPU. These processor-specific functions are contained in separate DLLs. The name of each DLL has a prefix identification code that denotes its target processor. For example, a 32-bit Intel processor with SSE4.2 support requires the image processing library named ippip8.dll, where ‘p8’ is the CPU identification code for 32-bit SSE4.2.

IA-32 Intel® architecture    Intel® 64 architecture    Meaning

px    mx    Generic code optimized for processors with Intel® Streaming SIMD Extensions (Intel® SSE)
w7    my    Optimized for processors with Intel SSE2
s8    n8    Optimized for processors with Supplemental Streaming SIMD Extensions 3 (SSSE3)
-     m7    Optimized for processors with Intel SSE3
p8    y8    Optimized for processors with Intel SSE4.2
g9    e9    Optimized for processors with Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI)
h9    l9    Optimized for processors with Intel® Advanced Vector Extensions 2 (Intel® AVX2)
-     k0    Optimized for processors with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
-     n0    Optimized for processors with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) for Intel® Many Integrated Core Architecture (Intel® MIC Architecture)

Table 1: CPU Identification Codes Associated with Processor-Specific Libraries

When the first Intel® IPP function call occurs in the application, the application searches the system path for an Intel® IPP dispatcher library. The dispatcher library identifies the system processor and invokes the function version that has the best performance on the target CPU. This process does not add overhead because the dispatcher connects to an entry point of an optimized function only once during application initialization. This allows your code to call optimized functions without worrying about the processor on which the code will execute.
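For example, a small program like the following (our own sketch; any Intel® IPP call would do) can confirm which optimized code path the dispatcher selected by printing the library version information:

#include <stdio.h>
#include <ipp.h>

int main(void)
{
	/* The first call into the signal processing domain goes through the
	   dispatcher, which selects the processor-specific code path. */
	const IppLibraryVersion *ver = ippsGetLibVersion();

	/* targetCpu reports the CPU identification code from Table 1,
	   for example "p8" or "y8" on an SSE4.2 system. */
	printf("%s %s (targetCpu: %s)\n", ver->Name, ver->Version, ver->targetCpu);
	return 0;
}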

Dynamic Linking
Dynamic-link libraries are loaded when an application runs. Simply link the application to the Intel® IPP stub libraries located in the <ipp directory>\ipp\lib\ia32 or <ipp directory>\ipp\lib\intel64 folder; these load the dispatcher libraries, which link to the correct entry points. Ensure that the dispatcher DLLs and the processor-specific DLLs are on the system path. In the diagram below, the application links to ipps.lib, and ipps.dll automatically loads ippsv8.dll at runtime.


Figure 1: Processor-Specific Dispatching

Dynamic linking is useful if many Intel® IPP functions are called in the application. Most applications are good candidates for this model.

Building a Custom DLL
In addition to dynamic linking, the Intel® IPP provides a tool called Intel® IPP Custom Library Tool for developers to create their own DLL. This tool can be found under <ipp directory>\ipp\tools\custom_library_tool and links selected Intel® IPP functions into a new separate DLL and generates an import library to which the application can link. A custom DLL is useful if the application uses a limited set of functions. The custom DLL must be distributed with the application. Intel® IPP supports two dynamic linking options. Refer to Table 2 below to choose which dynamic linking model best suits the application.

Feature              Dynamic Linking                  Custom DLL

Processor Updates    Automatic                        Recompile and redistribute
Optimization         All processors                   All processors
Build                Link to stub static libraries    Build and link to a separate import library which dispatches a separate DLL
Function Naming      Regular names                    Regular names
Total Binary Size    Large                            Small
Executable Size      Smallest                         Smallest
Kernel Mode          No                               No

Table 2: Dynamic Linking Models

For detailed information on how to build and link to a custom DLL, unzip the example package files under <ipp directory>\ipp\examples and look at the core examples under components\examples_core\ipp_custom_dll.

Threading and Multi-core Support
Intel continues the deprecation of internal threading that began in Intel® IPP 7.1. Internal (inside a primitive) threading is significantly less effective than external (at the application level) threading. External threading of Intel® IPP functions is therefore recommended, and it gives significant performance gains on multi-processor and multi-core systems. A good starting point on how to develop code for external threading can be found here.
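As a simple illustration of external threading (our own sketch, not taken from the Intel® IPP examples), an application might split one large array operation into per-thread chunks with OpenMP:

#include <ipp.h>
#include <omp.h>

/* Thread a single ippsAdd_32f call externally by giving each
   OpenMP thread its own contiguous chunk of the arrays. */
void threaded_add(const Ipp32f *src1, const Ipp32f *src2, Ipp32f *dst, int len)
{
	#pragma omp parallel
	{
		int nthreads = omp_get_num_threads();
		int tid = omp_get_thread_num();
		int chunk = len / nthreads;
		int start = tid * chunk;
		/* The last thread picks up the remainder. */
		int mylen = (tid == nthreads - 1) ? (len - start) : chunk;

		ippsAdd_32f(src1 + start, src2 + start, dst + start, mylen);
	}
}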


Linking the Application

The Intel® IPP can be compiled with Microsoft Visual Studio* and Intel® C++ Compiler. Instructions for configuring Microsoft Visual Studio to link to the Intel® IPP libraries can be found in the Getting Started with Intel® Integrated Performance Primitives document.


Deploying the Application

The Intel® IPP dispatcher and processor-specific DLLs, located in <ipp directory>\redist\ia32\ipp or <ipp directory>\redist\intel64\ipp, or a custom DLL, must be distributed with the application software. The Intel® IPP core functions library, ippcore.dll, must also be distributed.

When distributing a custom DLL, it is best to create a distinct naming scheme to avoid conflicts and for tracking purposes. This is also important because custom DLLs must be recompiled and redistributed to include new processor optimizations not available in previous Intel® IPP versions.

On Microsoft Windows*, the system PATH variable holds a list of folder locations that is searched for executable files. When the application is invoked, the Intel® IPP DLLs need to be located in a folder that is listed in the PATH variable. Choose a location for the Intel® IPP DLLs and custom DLLs on the target system so that the application can easily find them. Possible distribution locations include %SystemDrive%\WINDOWS\system32, the application folder, or any other folder on the target system. Table 3 below compares these options.

%SystemDrive%\WINDOWS\system32
  System PATH: This folder is listed on the system PATH by default.
  Permissions: Administrator permissions may be required to copy files to this folder.

Application Folder or Subfolder
  System PATH: Windows will first check the application folder for the DLLs.
  Permissions: Special permissions may be required.

Other Folder
  System PATH: Add this directory to the system PATH.
  Permissions: Special permissions may be required.

Table 3: Intel® IPP DLL Location

In all cases, the application must be able to find the location of the Intel® IPP DLLs and custom DLLs in order to run properly.

Intel® IPP provides a convenient way to optimize the performance of a 32-bit or Intel 64 application for the latest processors. Application and DLL distribution requires developers to do the following:

  1. Choose the appropriate DLL linking model.
    • Dynamic linking: The application is linked to stub libraries. At runtime, dispatcher DLLs detect the target processor and dispatch the processor-specific DLLs. The dispatcher and processor-specific DLLs are distributed with the application.
    • Custom DLL: The application is linked to a custom import library, and the custom DLL is invoked at runtime. The custom DLL is distributed with the application.
  2. Determine the best location for the Intel® IPP DLLs on the end-user system.
    • %SystemDrive%\WINDOWS\system32
    • Application folder or subfolder
    • Other folder

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Copyright © 2002-2016, Intel Corporation. All rights reserved.

Introducing the new Packed APIs for GEMM


1      Introducing Packed APIs for GEMM

Matrix-matrix multiplication (GEMM) is a fundamental operation in many scientific, engineering, and machine learning applications. There is a continuing demand to optimize this operation, and Intel® Math Kernel Library (Intel® MKL) offers parallel high-performing GEMM implementations. To provide optimal performance, the Intel MKL implementation of GEMM typically transforms the original input matrices into an internal data format best suited for the targeted platform. This data transformation (also called packing) can be costly, especially for input matrices with one or more small dimensions.

Intel MKL 2017 introduces [S,D]GEMM packed application program interfaces (APIs) that allow users to explicitly transform the matrices into an internal packed format and pass the packed matrix (or matrices) to multiple GEMM calls. With this approach, the packing costs can be amortized over multiple GEMM calls if the input matrices (A or B) are reused between these calls.

2      Example

Three GEMM calls shown below use the same A matrix, while B/C matrices differ for each call:

#include <mkl.h>

float *A, *B1, *B2, *B3, *C1, *C2, *C3, alpha, beta;
MKL_INT m, n, k, lda, ldb, ldc;

// initialize the pointers and matrix dimensions (skipped for brevity)

sgemm("T", "N", &m, &n, &k, &alpha, A, &lda, B1, &ldb, &beta, C1, &ldc);
sgemm("T", "N", &m, &n, &k, &alpha, A, &lda, B2, &ldb, &beta, C2, &ldc);
sgemm("T", "N", &m, &n, &k, &alpha, A, &lda, B3, &ldb, &beta, C3, &ldc);

Here the A matrix is transformed into internal packed data format within each sgemm call. The relative cost of packing matrix A three times can be high if n is small (number of columns for B/C). This cost can be minimized by packing the A matrix once and using its packed equivalent for the three consecutive GEMM calls as shown below:

// allocate memory for packed data format
float *Ap;
Ap = sgemm_alloc("A", &m, &n, &k);

// transform A into packed format
sgemm_pack("A", "T", &m, &n, &k, &alpha, A, &lda, Ap);

// SGEMM computations are performed using the packed A matrix: Ap
sgemm_compute("P", "N", &m, &n, &k, Ap, &lda, B1, &ldb, &beta, C1, &ldc);
sgemm_compute("P", "N", &m, &n, &k, Ap, &lda, B2, &ldb, &beta, C2, &ldc);
sgemm_compute("P", "N", &m, &n, &k, Ap, &lda, B3, &ldb, &beta, C3, &ldc);

// release the memory for Ap
sgemm_free(Ap);

The code sample above uses four new functions introduced to support packed APIs for GEMM: sgemm_alloc, sgemm_pack, sgemm_compute, and sgemm_free. First, the memory required for packed format is allocated using sgemm_alloc, which accepts a character argument identifying the packed matrix (A in this example) and three integer arguments for the matrix dimensions. Then, sgemm_pack transforms the original A matrix into the packed format Ap and performs the alpha scaling. The original A matrix remains unchanged. The three sgemm calls are replaced with three sgemm_compute calls that work with packed matrices and assume that alpha=1.0. The first two character arguments to sgemm_compute indicate that the A matrix is in packed format (“P”), and the B matrix is in non-transposed column major format (“N”). Finally, the memory allocated for Ap is released by calling sgemm_free.

GEMM packed APIs eliminate the cost of packing the matrix A twice for the three matrix-matrix multiplication operations shown in this example. These packed APIs can be used to eliminate the data transformation costs for A and/or B input matrices if A and/or B are re-used between GEMM calls.

3      Performance

The chart below shows the performance gains with the packed APIs on Intel® Xeon Phi™ Processor 7250. It is assumed that the packing cost can be completely amortized by a large number of SGEMM calls that use the same A matrix. The performance of regular SGEMM call is also provided for comparison.

4      Implementation Notes

It is recommended to call gemm_pack and gemm_compute with the same number of threads to get the best performance. Note that if there are only a small number of GEMM calls that share the same A or B matrix, the packed APIs may provide little performance benefit.
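For instance, here is a sketch of that recommendation, reusing the declarations from the example in Section 2 (the thread count of 4 is illustrative only):

// Pin the MKL thread count so the pack and compute phases use the
// same number of threads.
mkl_set_num_threads(4);

sgemm_pack("A", "T", &m, &n, &k, &alpha, A, &lda, Ap);
sgemm_compute("P", "N", &m, &n, &k, Ap, &lda, B1, &ldb, &beta, C1, &ldc);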

The gemm_alloc routine allocates memory approximately as large as the original input matrix. This means that the memory requirement of the application may increase significantly for a large input matrix.

GEMM packed APIs are only implemented for SGEMM and DGEMM in Intel MKL 2017. They are functional for all Intel architectures, but they are only optimized for 64-bit Intel® AVX2 and above.

5      Summary

[S,D]GEMM packed APIs can be used to minimize the data packing costs for multiple GEMM calls that use the same input matrix. As shown in the performance chart, calling them can improve the performance significantly if there is sufficient matrix reuse across multiple GEMM calls. These packed APIs are available in Intel MKL 2017, and both FORTRAN 77 and CBLAS interfaces are supported. Please see the Intel MKL Developer Reference for additional documentation.
