
A Comparison of Intel® RealSense™ Front Facing Camera SR300 and F200


Introduction

The SR300 is the second-generation front-facing Intel® RealSense™ camera and supports Microsoft Windows* 10. Like the F200, the SR300 uses coded light depth technology to create a high-quality 3D depth video stream at close range. The camera combines an infrared (IR) laser projector system, a Fast VGA IR camera, and a 2MP color camera with an integrated ISP. The SR300 uses a Fast VGA depth mode instead of the native VGA depth mode used by the F200; this new depth mode reduces exposure time and allows dynamic motion up to 2 m/s. The camera enables new platform usages by providing synchronized color, depth, and IR video stream data to the client system. The effective range of the depth solution is optimized for indoor use from 0.2 to 1.2 m.


Figure 1: SR300 camera model.

The SR300 camera works with the Intel® RealSense™ SDK for Windows; support for the SR300 was added in SDK 2015 R5. During 2016 the SR300 will become available both as a standalone camera and built into form factors including PCs, all-in-ones, notebooks, and 2-in-1s. The SR300 adds new features and a number of improvements over the F200:

  • Support for the new Hand Tracking Cursor Mode
  • Support for the new Person Tracking Mode
  • Increased Range and Lateral Speed
  • Improved Color Quality under Low-light Capture and Improved RGB Texture for 3D Scan
  • Improved Color and Depth Stream Synchronization
  • Decreased Power Consumption

Product Highlights    SR300                              F200
Orientation           Front facing                       Front facing
Technology            Coded light; Fast VGA 60 fps       Coded light; native VGA 60 fps
Color Camera          Up to 1080p 30 fps, 720p 60 fps    Up to 1080p 30 fps
SDK                   SDK 2015 R5 or later               SDK R2 or later
DCM Version           DCM 3.0.24.51819*                  DCM 1.4.27.41994*
Operating System      Windows 10 64-bit RTM              Windows 10 64-bit RTM, Windows 8 64-bit
Range                 Indoors; 20-120 cm                 Indoors; 20-120 cm

* As of Feb 19th, 2016.

New Features only Supported by SR300

Cursor Mode

The standout feature of the SR300 camera is Cursor Mode. This tracking mode returns a single point on the hand, allowing accurate and responsive 3D cursor tracking and basic gestures. Cursor Mode reduces power consumption and improves performance by more than 50% compared to Full Hand mode, with no added latency and no calibration required. It also increases the range to 85 cm and tracks hand motion at speeds up to 2 m/s. Cursor Mode includes the Click gesture, which simulates a mouse click using the index finger.


Figure 2: Click gesture.

Person Tracking

Another new feature provided with the SR300 is Person Tracking. Person Tracking is also supported on the rear-facing R200 camera, but is not available for the F200. Person Tracking provides real-time 3D body motion tracking and has three main tracking modes: body movement, skeleton joints, and facial recognition.

  • Body movement: Locates the body, head and body contour.
  • Skeleton joints: Returns the positions of the body’s joints in 2D and 3D.
  • Facial recognition: Compares the current face against the database of registered users to determine the user’s identification.

Person Tracking    SR300        F200
Detection          50-250 cm    NA
Tracking           50-550 cm    NA
Skeleton           50-200 cm    NA

Increased Range and Lateral Speed

The SR300 camera model introduces a new depth mode called Fast VGA. It captures frames at HVGA and interpolates them to VGA before transmitting to the client. This new depth mode reduces exposure time and allows hand motion speeds up to 2 m/s, while the F200's native VGA mode supports hand motion only up to 0.75 m/s. The SR300 also provides a significant range improvement over the F200: using hand tracking, the SR300 reaches up to 85 cm while the F200 only reached 60 cm. Hand segmentation range increases to 110 cm for the SR300, up from 100 cm for the F200.

Hand Tracking Mode       SR300                  F200
Cursor Mode - general    20-120 cm (2 m/s)      NA
Cursor Mode - kids       20-80 cm (1-2 m/s)     NA
Tracking                 20-85 cm (1.5 m/s)     20-60 cm (0.75 m/s)
Gesture                  20-85 cm (1.5 m/s)     20-60 cm (0.75 m/s)
Segmentation             20-120 cm (1 m/s)      20-100 cm (1 m/s)

The range for face recognition increases from 80 cm for the F200 up to 150 cm for the SR300 model.

Face Tracking Mode    SR300        F200
Detection             30-100 cm    25-100 cm
Landmark              30-100 cm    30-100 cm
Recognition           30-150 cm    30-80 cm
Expression            30-100 cm    30-100 cm
Pulse                 30-60 cm     30-60 cm
Pose                  30-100 cm    30-100 cm

The SR300 model improves RGB texture mapping and achieves a more detailed 3D scan. The 3D scan range increases to 70 cm while capturing more detail. Blob tracking speed increases to 2 m/s and its range increases to 150 cm in the SR300 model.

Other Tracking Modes    SR300                F200
3D Scanning             25-70 cm             25-54 cm
Blob Tracking           20-150 cm (2 m/s)    30-85 cm (1.5 m/s)
Object Tracking         30-180 cm            30-180 cm

The depth range of the SR300 model is improved by 50%-60%. At 80 cm, both the SR300 and F200 cameras detect the hand clearly. Beyond 120 cm, the SR300 can still detect the hand while the F200 cannot detect it at all.


Figure 3: SR300 vs F200 depth range.

Improved Color Quality Under Low-light Capture and Improved RGB Texture for 3D Scan

The new auto exposure feature is available only on the SR300 model. The exposure compensation feature allows images taken in low-light or high-contrast conditions to achieve better color quality. Note that the color stream frame rate may be lower in low-light conditions when color stream auto exposure is enabled.

Function                         SR300    F200
Color EV Compensation Control    Yes      No

Improved Color and Depth Stream Synchronization

The F200 model only supports multiple depth and color applications running at the same frame rate. The SR300 supports multiple depth and color applications running at different frame rates, within an integer interval, while maintaining temporal synchronization. This allows software to switch between different frame rates without having to start or stop the video stream.

Camera Temporal Synchronization                        SR300    F200
Sync different stream types of same frame rate         Yes      Yes
Sync different stream types of different frame rate    Yes      No

Decreased Power Consumption

The SR300 camera model enables additional power gear modes that operate at lower frame rates. These modes allow the imaging system to reduce the camera's power consumption while still maintaining awareness. With power gear modes, the SR300 can process the scene autonomously while the system is in standby.

Backward Compatibility with F200 Applications

The Intel RealSense Depth Camera Manager (DCM) 3.x enables the SR300 camera to function as an F200 camera, providing backward compatibility for applications developed for the F200 camera model. The DCM emulates the capabilities of the F200 so that existing SDK applications work seamlessly on the SR300. SR300-specific features are supported in SDK 2015 R5 or later.

When a streaming request comes from an SDK application compiled with a version earlier than SDK 2015 R5, the DCM automatically activates compatibility mode and sends calls through the F200 pipe instead of the SR300 pipe. Most applications should work on the new SR300 model without any configuration changes.

Infrared Compatibility

The SR300 supports a 10-bit native infrared data format while the F200 supports an 8-bit native infrared data format. The DCM driver provides compatibility by removing or padding 2 bits of data to fit the requested infrared data size.
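The conversion itself is simple bit arithmetic. The following C# fragment is purely illustrative of what dropping or padding those 2 bits means; the DCM performs this internally, so no application code is required, and the sample value is hypothetical:

// Illustration only: how a 10-bit IR sample maps to the 8-bit F200 format and back.
ushort tenBitSample = 0x3A7;                      // native SR300 IR sample, range 0-1023
byte   asEightBit   = (byte)(tenBitSample >> 2);  // drop the 2 least significant bits
ushort backToTenBit = (ushort)(asEightBit << 2);  // pad 2 zero bits; fine detail is lost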

Physical Connector

The motherboard and cable design for F200 and SR300 are identical. The F200 cable plug fits into an SR300 receptacle. Therefore, an F200 cable can be used for an SR300 camera model. Both models require fully powered USB 3.0.

SDK APIs

Most SDK APIs are shared between the SR300, the F200, and in some cases even the R200, and the SDK modules provide the proper interface for the camera found at runtime. Similarly, simple color and depth streaming that does not request specific resolutions or pixel formats will run without change.

Likewise, when using the SenseManager to read raw streams, no code change is needed as long as stream resolutions, frame rates, and pixel formats are not hardcoded.

Because this adaptation depends on which camera is present, it is important for every application to check the camera model and configuration at runtime. See Installer Options in the SDK documentation.
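As a rough sketch of what such a runtime check might look like using the SDK's C# wrapper (libpxcclr.cs.dll), assuming SDK 2015 R5 or later; exact enum and member names can vary between SDK releases:

// Hedged sketch: detect the connected camera model before choosing stream profiles.
PXCMSenseManager sm = PXCMSenseManager.CreateInstance();
sm.EnableStream(PXCMCapture.StreamType.STREAM_TYPE_COLOR, 640, 480, 30);
if (sm.Init() >= pxcmStatus.PXCM_STATUS_NO_ERROR)
{
    PXCMCapture.DeviceInfo dinfo;
    sm.QueryCaptureManager().QueryDevice().QueryDeviceInfo(out dinfo);
    bool isSR300 = (dinfo.model == PXCMCapture.DeviceModel.DEVICE_MODEL_SR300);
    // Select resolutions and features appropriate to the detected model here.
}
sm.Dispose();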

DCM

As of this writing, the gold DCM version for the SR300 is DCM 3.0.24.59748, and updates will be provided through Windows Update. Visit https://software.intel.com/en-us/intel-realsense-sdk/download to download the latest DCM. For more information on the DCM, see Intel® RealSense™ Cameras and DCM Overview.

Camera Type              SR300    F200    R200
DCM Installer Version    3.x      1.x     2.x

Hardware Requirements

To support the bandwidth needed by the Intel RealSense camera, a USB 3 port is required in the client system. For details on system requirements and supported operating systems for SR300 and F200, see https://software.intel.com/en-us/RealSense/Devkit/

Summary

This document summarizes the new features and enhancements available with the front-facing Intel RealSense 3D camera SR300 beyond those available with the F200. These new features are supported in SDK 2015 R5 and DCM 3.0.24.51819 or later. This new camera is available to order at http://click.intel.com/realsense.html.

Helpful References

Here is a collection of useful references for the Intel® RealSense™ DCM and SDK, including release notes and how to download and update the software.

About the Author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group working on Intel® Atom™ processor scale-enabling projects.


DIY Pan-Tilt Person Tracking with Intel® RealSense™ Camera


Download Code Sample

Introduction

In the spirit of the Maker Movement and “America’s Greatest Makers” TV show coming this spring, this article describes a project I constructed and programmed: a pan-tilt person-tracking camera rig using an Intel® RealSense™ camera (R200) and a few inexpensive electronic components. The goal of this project was to devise a mechanism that extends the viewing range of the camera for tracking a person in real time.

The camera rig (Figure 1) consists of two hobby-grade servo motors that are directly coupled using tie wraps and double-sided tape, and a low-cost control board.


Figure 1. DIY pan-tilt camera rig.

The servos are driven by a control board connected to the computer’s USB port. A Windows* C# app running on a PC or laptop controls the camera rig. The app uses the Face Tracking and Person Tracking APIs contained in the Intel® RealSense™ SDK for Windows*.

The software, which you can download using the link on this page, drives the two servo motors in real time to physically move the rig nearly 180 degrees on two axes to keep the tracked person centered in the field of view of the R200 camera. You can see a video of the camera rig in action here: https://youtu.be/v2b8CA7oHPw

Why?

The motivation to build a device like this is twofold: first, it presents an interesting control systems problem wherein the camera that’s used to track a moving person is also moving at the same time. Second, a device like this can be employed in interesting use cases such as:

  • Enhanced surveillance – monitoring areas over a wider range than is possible with a fixed camera.
  • Elderly monitoring – tracking a person from a standing position to lying on the floor.
  • Robotic videography – controlling a pan-tilt system like this for recording presentations, seminars, and similar events using a mounted SLR or video camera.
  • Companion robotics – controlling a mobility platform and making your robot follow you around a room.

Scope (and Disclaimer)

This article is not intended to serve as a step-by-step “how-to” guide, nor is the accompanying source code guaranteed to work with your particular rig if you decide to build something similar. The purpose of this article is to chronicle one approach to building an automated person-tracking camera rig.

From the description and pictures provided in this document, it should be fairly evident how to fasten two servo motors together in a pan-tilt arrangement using tie wraps and double-sided tape. Alternatively, you can use a kit like this to simplify the construction of a pan-tilt rig.

Note: This is not a typical (or recommended) usage of the R200 peripheral camera. If you decide to build your own rig, make certain you securely fasten the camera and limit the speed and range of the servo motors to prevent damaging it. If you are not completely confident in your maker skills, you may want to pass on building something like this.

Software Development Environment

The software developed for this project runs on Windows 10 and was developed with Microsoft Visual Studio* 2015. The code is compatible with the Intel® RealSense™ SDK version 2016 R1.

This software also requires installation of the Pololu USB Software Development Kit, which can be downloaded here. The Pololu SDK contains the drivers, Control Center app, and samples that are useful for controlling servo motors over a computer’s USB port. (Note: this third-party software is not part of the code sample that can be downloaded from this page.)

Computer System Requirements

The basic hardware requirements for running the person-tracking app are:

  • 4th generation (or later) Intel® Core™ processor
  • 150 MB free hard disk space
  • 4GB RAM
  • Intel® RealSense™ camera (R200)
  • Available USB3 port for the R200 camera
  • Additional USB port for the servo controller board

Code Sample

The software developed for this project was written in C#/WPF using Microsoft Visual Studio 2015. The user interface (Figure 2) provides the color camera stream from the R200 camera, along with real-time updates of the face and person tracking parameters.


Figure 2. Custom software user interface.

The software attempts to track the face and torso of a single person using both the Face Tracking and Person Tracking APIs. Face tracking alone is performed by default, as it currently provides more accurate and stable tracking. If the tracked person’s face goes out of view of the camera, the software will resort to tracking the whole person. (Note that the person tracking algorithm is under development and will be improved in future releases of the RSSDK.)

To keep the code sample as simple as possible, it attempts tracking only if a single instance of a face or person is detected. The displacement of a bounding rectangle’s center to the middle of the image plane is used to drive the servos. The movements of the servos will attempt to center the tracked person in the image plane.
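In sketch form, the per-frame choice of tracking target looks roughly like this (faceRectangle and personBox mirror names used later in this article; the counts and early return are illustrative, not verbatim from the sample):

// Hypothetical sketch of selecting the tracking target for the current frame.
// faceRectangle and personBox are the bounding rectangles described above;
// faceCount and personCount hold the number of detected instances this frame.
if (faceCount == 1)
    trackingTarget = faceRectangle;   // prefer face tracking: currently more stable
else if (personCount == 1)
    trackingTarget = personBox;       // fall back to whole-person tracking
else
    return;                           // zero or multiple candidates: skip this frame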

Servo Control Algorithm

The first cut at controlling the servos in software was to derive linear equations that effectively scale the servo target positions to the coordinate system shared by the face rectangle and image, as shown in the following code snippet.

Servo.cs
public class Servo
{
   public const int Up = 1152;
   public const int Down = 2256;
   public const int Left = 752;
   public const int Right = 2256;
   .
   .
   .
}

MainWindow.xaml.cs
private const int ImageWidthMin = 0;
private const int ImageWidthMax = 640;
private const int ImageHeightMin = 0;
private const int ImageHeightMax = 480;
.
.
.
ushort panScaled = Convert.ToUInt16((Servo.Right - Servo.Left) * (faceX -
    ImageWidthMin) / (ImageWidthMax - ImageWidthMin) + Servo.Left);

ushort tiltScaled = Convert.ToUInt16((Servo.Down - Servo.Up) * (faceY -
    ImageHeightMin) / (ImageHeightMax - ImageHeightMin) + Servo.Up);

MoveCamera(panScaled, tiltScaled);

Although this approach came close to accomplishing the goal of centering the tracked person in the image plane, it resulted in oscillations as the servo target position and face rectangle converged. These oscillations could be dampened by reducing the speed of the servos, but that made the camera movements too slow to keep up with the person being tracked. A PID algorithm or similar solution could have been employed to tune out the oscillations, or inverse kinematics could have been used to determine the camera position parameters, but I decided to use a simpler approach instead.

The chosen solution simply compares the center of the face (faceRectangle) or person (personBox) to the center of the image plane in a continuous thread and then increments or decrements the camera position in both x and y axes to find a location that roughly centers the person in the image plane. Deadband regions (Figure 3) are defined in both axes to help ensure the servos stop “hunting” for the center position when the camera is approximately centered on the person.
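A minimal sketch of that incremental approach follows. MoveCamera matches the helper used earlier; the deadband sizes, step, and sign conventions are illustrative tuning values, not the sample's actual numbers, and panTarget/tiltTarget are assumed to be int fields that persist across frames:

// Hedged sketch of the incremental centering loop with deadbands (illustrative values).
const int DeadbandX = 40;   // no pan correction while the face is within +/-40 px of center
const int DeadbandY = 30;   // no tilt correction while within +/-30 px of center
const int Step = 4;         // servo target increment per pass

int dx = faceCenterX - ImageWidthMax / 2;    // > 0 means the face is right of center
int dy = faceCenterY - ImageHeightMax / 2;   // > 0 means the face is below center

if (Math.Abs(dx) > DeadbandX)
    panTarget += (dx > 0) ? Step : -Step;    // sign depends on how the servo is mounted
if (Math.Abs(dy) > DeadbandY)
    tiltTarget += (dy > 0) ? Step : -Step;

// Clamp to the safe range configured in the Maestro Control Center, then move.
panTarget  = Math.Min(Math.Max(panTarget,  Servo.Left), Servo.Right);
tiltTarget = Math.Min(Math.Max(tiltTarget, Servo.Up),   Servo.Down);
MoveCamera((ushort)panTarget, (ushort)tiltTarget);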


Figure 3. Incremental tracking method.

Building the Code Sample

The code sample has two dependencies that are not redistributed in the downloadable zip file, but are contained in the Pololu USB Software Development Kit:

  • UsbWrapper.dll (located in pololu-usb-sdk\UsbWrapper_Windows\)
  • Usc.dll (located in pololu-usb-sdk\Maestro\Usc\precompiled_obj\)

These files should be copied to the ServoInterface project folder (C:\PersonTrackingCodeSample\ServoInterface\), and then added as references as shown in Figure 4.


Figure 4. Third-party dependencies referenced in Solution Explorer.

Note that this project uses an explicit path to libpxcclr.cs.dll (the managed RealSense DLL): C:\Program Files (x86)\Intel\RSSDK\bin\win32. This reference will need to be changed if your installation path is different. If you have problems building the code samples, try removing and then re-adding this library reference.

Control Electronics

This project incorporates a Pololu Micro Maestro* 12-channel USB servo controller (Figure 5) to control the two servo motors. This device includes a fairly comprehensive SDK for developing control applications targeting different platforms and programming languages. To see how a similar model of this board was used, refer to the robotic hand control experiment article.


Figure 5. Pololu Micro Maestro* servo controller.

I used Parallax Standard Servo motors in this project; however, similar devices are available that should work equally well for this application. The servos are connected to channels 0 and 1 of the control board as shown in Figure 5.

Servo Controller Settings

I configured the servo controller board settings before starting construction of the camera rig. The Pololu Micro Maestro SDK includes a Control Center app (Figure 6) that allows you to configure firmware-level parameters and save them to flash memory on the control board.


Figure 6. Control Center channel settings.

Typically, you should set the Min and Max settings in Control Center to match the control pulse width of the servos under control. According to the Parallax Standard Servo data sheet, these devices are controlled using “pulse-width modulation, 0.75–2.25 ms high pulse, 20 ms intervals.” The Control Center app specifies units in microseconds, so Min would be set to 750 and Max set to 2250.

However, the construction of this particular device resulted in some hard stops (i.e., positions that cause physical binding of the servo horn and can potentially damage the component). The safe operating range of each servo was determined experimentally, and those values were entered for channels 0 and 1 to help prevent the servos from inadvertently being driven into a binding position.

Summary

This article gives an overview of one approach to building an automated camera rig capable of tracking a person’s movements around a wide area. Beyond presenting an interesting control systems programming challenge, practical applications for a device like this include enhanced surveillance, elderly monitoring, etc. Hopefully, this project will inspire other makers to create interesting things with the Intel RealSense cameras and SDK for Windows.

Watch the Video

To see the pan-tilt camera rig in action, check out the YouTube video here: https://youtu.be/v2b8CA7oHPw

Check Out the Code

Follow the Download link to get the sample code for this project.

About Intel® RealSense™ Technology

To learn more about the Intel RealSense SDK for Windows, go to https://software.intel.com/en-us/intel-realsense-sdk.

About the Author

Bryan Brown is a software applications engineer at Intel Corporation in the Software and Services Group. 

Peel the onion (optimization techniques)


This paper is a more formal response to an IDZ Forum posting. See: (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/590710).

The issue, as expressed by the original poster, was that the code did not scale well using OpenMP on an 8-core E5-2650 V2 processor with 16 hardware threads. I took some time on the forum to aid the poster by giving him some pointers, but did not take sufficient time to fully optimize the code. This article addresses additional optimizations beyond those laid out in the forum thread.

I have to say it is unclear what the experience level of the original poster is; I am going to assume he recently graduated from an institution that taught parallel programming with an emphasis on scaling. In the outside world, the practicalities are: systems have a limited amount of processing resources (threads), and the emphasis should be on efficiency as well as scaling. The original sample code on that forum posting provides us with the foundation of a learning tool for how to address efficiency in the greater sense, and scaling in the lesser sense.

In order to present code for this paper, I took the liberty of reworking the sample code while keeping the overall design and spirit of the original. This means I kept the fundamental algorithm intact, since the example code was taken from an application that may have had additional functionality requiring the given algorithm. The provided code sample used an array of LOGICALs (a mask) for flow control. While the sample code could have been written without the logical arrays, the sample may be an abbreviated excerpt of a larger application in which these mask arrays are required for reasons not obvious in the excerpt. Therefore the masks were kept.

Upon inspection of the code and the poster's first attempt at parallelization, it was determined that the place chosen to create the parallel region (parallel DO) had too short a run. The original code can be sketched like this:

bid = 1 ! { not stated in original posting, but would appeared to be in a DO bid=1,65 }
do k=1,km-1  ! km = 60
    do kk=1,2
        !$OMP PARALLEL PRIVATE(I) DEFAULT(SHARED)
        !$omp do 
        do j=1,ny_block     ! ny_block = 100
            do i=1,nx_block ! nx_block = 81
... {code}
            enddo
        enddo
        !$omp end do
        !$OMP END PARALLEL
    enddo
enddo

For the user's first attempt at parallelization, he placed the parallel do on the do j= loop. While this is the “hottest” loop level, it is not the appropriate loop level for this problem on this platform.

The number of threads involved was 16. With 16 threads and the inner two loops performing a combined 8100 iterations, each thread would handle about 506 iterations. However, the parallel region would be entered 120 times (60*2). The work performed in the innermost loop, while not insignificant, was not large either. This resulted in the cost of entering the parallel region being a significant portion of the run time. With 16 threads and an outer loop count of 60 iterations (120 if the loops are fused), a better choice is to raise the parallel region to the do k loop.

The code was modified to execute the do k loop many times and compute the average time to execute the entire do k loop. As optimization techniques are applied, we can then use the ratio of the original code's average time to the revised code's average time as a measurement of improvement. While I did not have an 8-core E5-2650 v2 processor available for testing, I did have a 6-core E5-2620 v2 processor. The slightly reworked code produced the following results:

OriginalSerialCode
Average time 0.8267E-02
Version1_ParallelAtInnerTwoLoops
Average time 0.1746E-02,  x Serial  4.74

Perfect scaling on a 6-core E5-2620 v2 processor would have been somewhere between 6x and 12x (about 7x if you assume an additional 15% for Hyper-Threading). A scaling of 4.74x is significantly less than the expected 7x.

The following sections of this paper walk you through four additional optimization techniques.

OriginalSerialCode
Average time 0.8395E-02
ParallelAtInnerTwoLoops
Average time 0.1699E-02,  x Serial  4.94
ParallelAtkmLoop
Average time 0.6905E-03,  x Serial 12.16,  x Prior  2.46
ParallelAtkmLoopDynamic
Average time 0.5509E-03,  x Serial 15.24,  x Prior  1.25
ParallelNestedRank1
Average time 0.3630E-03,  x Serial 23.13,  x Prior  1.52

Note that the ParallelAtInnerTwoLoops result in the second run shows a different multiplier than in the first run. The principal cause of this is fortuitous code placement, or the lack thereof. The code did not change between runs; the only difference was the addition of the extra code and the call statements to run the new subroutines. It is important to bear in mind that the placement of tight loops can significantly affect their performance. Even adding or removing a single statement can significantly change some code's run time.

To facilitate ease of reading the code changes, the body of the inner three loops was encapsulated into a subroutine. This makes the code easier to study as well as easier to diagnose with a program profiler (VTune). Example from the ParallelAtkmLoop subroutine:

bid = 1
!$OMP PARALLEL DEFAULT(SHARED)
!$omp do 
do k=1,km-1 ! km = 60
    call ParallelAtkmLoop_sub(bid, k)
end do
!$omp end do
!$OMP END PARALLEL
endtime = omp_get_wtime()
...
subroutine ParallelAtkmLoop_sub(bid, k)
     ...
    do kk=1,2
        do j=1,ny_block     ! ny_block = 100
            do i=1,nx_block ! nx_block = 81
...
            enddo
        enddo
    enddo
end subroutine ParallelAtkmLoop_sub               

The first optimization I performed was to make two changes:

1) Move the parallelization up two loop levels to the do k loop level, thus reducing the number of entries into the parallel region by a factor of 120. And,

2) The application used an array of LOGICALs as a mask for code selection. I reworked the code that generates those values to reduce unnecessary manipulation of the mask arrays.

These two changes resulted in an improvement of 2.46x over the initial parallelization attempt. While this improvement is great, is this as good as you can get?

In looking at the code of the inner most loop we find:

  ... {construct masks}
  if ( LMASK1(i,j) ) then
     ... {code}
  endif

  if ( LMASK2(i,j) ) then
     ... {code}
  endif

  if( LMASK3(i,j) ) then
     ... {code}
  endif

This means the filter masks result in an unequal workload per iteration. Under this circumstance, it is often better to use dynamic scheduling. The next optimization is performed in ParallelAtkmLoopDynamic, which is the same code as ParallelAtkmLoop but with schedule(dynamic) added to the !$omp do.

This simple change added an additional 1.25x. Note that dynamic scheduling is not the only scheduling option; there are others that might be worth exploring, and the schedule clause often takes a modifier (chunk size).

The next level of optimization, which provides an additional 1.52x boost in performance, is what one would consider aggressive optimization. The extra 52% does require significant (but manageable) programming effort. The opportunity for this optimization comes from an observation that can be made by looking at the assembly code, which you can view using VTune.

I would like to stress that you do not have to understand the assembly code when you look at it. In general you can assume:

more assembly code == slower performance

What you can do is infer, from the complexity (volume) of the assembly code, potential missed optimization opportunities by the compiler, and, when such missed opportunities are detected, use a simple technique to aid the compiler with code optimization.

When looking at the body of main work we find:

subroutine ParallelAtkmLoopDynamic_sub(bid, k)
  use omp_lib
  use mod_globals
  implicit none
!-----------------------------------------------------------------------
!
!     dummy variables
!
!-----------------------------------------------------------------------
  integer :: bid,k

!-----------------------------------------------------------------------
!
!     local variables
!
!-----------------------------------------------------------------------
  real , dimension(nx_block,ny_block,2) :: &
        WORK1, WORK2, WORK3, WORK4   ! work arrays

  real , dimension(nx_block,ny_block) :: &
        WORK2_NEXT, WORK4_NEXT       ! WORK2 or WORK4 at next level

  logical , dimension(nx_block,ny_block) :: &
        LMASK1, LMASK2, LMASK3       ! flags
   
  integer  :: kk, j, i    ! loop indices
   
!-----------------------------------------------------------------------
!
!     code
!
!-----------------------------------------------------------------------
  do kk=1,2
    do j=1,ny_block
      do i=1,nx_block
        if(TLT%K_LEVEL(i,j,bid) == k) then
          if(TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid)) then
            LMASK1(i,j) = TLT%ZTW(i,j,bid) == 1
            LMASK2(i,j) = TLT%ZTW(i,j,bid) == 2
            if(LMASK2(i,j)) then
              LMASK3(i,j) = TLT%K_LEVEL(i,j,bid) + 1 < KMT(i,j,bid)
            else
              LMASK3(i,j) = .false.
            endif
          else
            LMASK1(i,j) = .false.
            LMASK2(i,j) = .false.
            LMASK3(i,j) = .false.
          endif
        else
          LMASK1(i,j) = .false.
          LMASK2(i,j) = .false.
          LMASK3(i,j) = .false.
        endif
        if ( LMASK1(i,j) ) then
          WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
            * SLX(i,j,kk,kbt,k,bid) * dz(k)
                           
          WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk)            &
            - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &
            * dz(k+1) )

          WORK2_NEXT(i,j) = c2 * ( &
            KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &
            KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )

          WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
            * SLY(i,j,kk,kbt,k,bid) * dz(k)

          WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk)            &
            - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &
            * dz(k+1) )

          WORK4_NEXT(i,j) = c2 * ( &
            KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )

          if( abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) then
            WORK2(i,j,kk) = WORK2_NEXT(i,j)
          endif

          if ( abs( WORK4_NEXT(i,j) ) < abs( WORK4(i,j,kk ) ) ) then
            WORK4(i,j,kk) = WORK4_NEXT(i,j)
          endif
        endif

        if ( LMASK2(i,j) ) then
          WORK1(i,j,kk) =  KAPPA_THIC(i,j,ktp,k+1,bid)     &
            * SLX(i,j,kk,ktp,k+1,bid)

          WORK2(i,j,kk) =  c2 * ( WORK1(i,j,kk)                 &
            - ( KAPPA_THIC(i,j,kbt,k+1,bid)        &
            * SLX(i,j,kk,kbt,k+1,bid) ) )

          WORK1(i,j,kk) = WORK1(i,j,kk) * dz(k+1)

          WORK3(i,j,kk) =  KAPPA_THIC(i,j,ktp,k+1,bid)     &
            * SLY(i,j,kk,ktp,k+1,bid)

          WORK4(i,j,kk) =  c2 * ( WORK3(i,j,kk)                 &
            - ( KAPPA_THIC(i,j,kbt,k+1,bid)        &
            * SLY(i,j,kk,kbt,k+1,bid) ) )

          WORK3(i,j,kk) = WORK3(i,j,kk) * dz(k+1)
        endif
 
        if( LMASK3(i,j) ) then
          if (k.lt.km-1) then ! added to avoid out of bounds access
            WORK2_NEXT(i,j) = c2 * dzwr(k+1) * ( &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
              KAPPA_THIC(i,j,ktp,k+2,bid) * SLX(i,j,kk,ktp,k+2,bid) * dz(k+2))

            WORK4_NEXT(i,j) = c2 * dzwr(k+1) * ( &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
              KAPPA_THIC(i,j,ktp,k+2,bid) * SLY(i,j,kk,ktp,k+2,bid) * dz(k+2))
          end if
          if( abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) &
            WORK2(i,j,kk) = WORK2_NEXT(i,j)
          if( abs(WORK4_NEXT(i,j)) < abs(WORK4(i,j,kk)) ) &
            WORK4(i,j,kk) = WORK4_NEXT(i,j)
          endif  
        enddo
      enddo
  enddo
end subroutine ParallelAtkmLoopDynamic_sub

Making an Intel VTune Amplifier run and looking at source line 540 as an example, we have part of a statement that performs the product of two numbers. For this partial statement you would expect:

                Load value at some index of SLX
                Multiply by value at some index of dz

Clicking the Assembly button in Amplifier, sorting by source line number, and locating source line 540, we find a total of 46 assembler instructions used to multiply two numbers.

Now comes the inference part.

The two numbers are cells of two arrays. The array SLX has six subscripts; the other has one. You can also observe that the last two assembly instructions are vmovss from memory and vmulss from memory, which is close to what fully optimized code should look like. The remaining 44 of the 46 assembly instructions are associated with computing the array indexes of these two variables. Granted, we might expect a few instructions to obtain the indexes into the arrays, but not 44. Can we do something to reduce this complexity?

In looking at the source code (most recent above), you will note that the last four subscripts of SLX, and the one subscript of dz, are loop invariant for the innermost two loops. In the case of SLX, the two leftmost indices, the innermost two loop control variables, represent a contiguous array section. The compiler failed to recognize the unchanging (rightmost) array indices as loop-invariant code that could be lifted out of the loop. Additionally, the compiler failed to identify the two leftmost indexes as candidates for collapsing into a single index.

This is a good example of what future compiler optimization efforts could address under these circumstances. In this case, the next optimization, which performs a lifting of loop invariant subscripting, illustrates a 1.52x performance boost.

Now that we know that a goodly portion of the “do work” code involves contiguous array sections with several subscripts, can we somehow reduce the number of subscripts without rewriting the application?

The answer to this is yes, if we encapsulate smaller array slices represented by fewer array subscripts. How do we do this for this example code?

The choice made was for two nest levels:

  1. at the outer most bid level (the module data indicates the actual code uses 65 bid values)

  2. at the next to outer most level, the do k loop level. In addition to this, we consolidate the first two indexes into one.

The outermost level passes bid level array sections:

        bid = 1 ! in real application bid may iterate
        ! peel off the bid
        call ParallelNestedRank1_bid( &
            TLT%K_LEVEL(:,:,bid), &
            KMT(:,:,bid), &
            TLT%ZTW(:,:,bid), &
            KAPPA_THIC(:,:,:,:,bid),  &
            SLX(:,:,:,:,:,bid), &
            SLY(:,:,:,:,:,bid))
…
subroutine ParallelNestedRank1_bid(K_LEVEL_bid, KMT_bid, ZTW_bid, KAPPA_THIC_bid, SLX_bid, SLY_bid)
    use omp_lib
    use mod_globals
    implicit none
    integer, dimension(nx_block , ny_block) :: K_LEVEL_bid, KMT_bid, ZTW_bid
    real, dimension(nx_block,ny_block,2,km) :: KAPPA_THIC_bid
    real, dimension(nx_block,ny_block,2,2,km) :: SLX_bid, SLY_bid
…

Note, for non-pointer (allocatable or fixed-dimension) arrays, the arrays are contiguous. This provides you with the opportunity to peel off the rightmost indexes and pass on a contiguous array section, merely by computing the offset to the subsection of the larger array. Peeling indexes other than the rightmost would require creating a temporary array and should generally be avoided, though there may be cases where doing so is beneficial.

And the second nested level peeled off an additional array index of the do k loop, as well as compressed the first two indexes into one:

    !$OMP PARALLEL DEFAULT(SHARED)
    !$omp do 
    do k=1,km-1
        call ParallelNestedRank1_bid_k( &
            k, K_LEVEL_bid, KMT_bid, ZTW_bid, &
            KAPPA_THIC_bid(:,:,:,k), &
            KAPPA_THIC_bid(:,:,:,k+1),  KAPPA_THIC_bid(:,:,:,k+2),&
            SLX_bid(:,:,:,:,k), SLY_bid(:,:,:,:,k), &
            SLX_bid(:,:,:,:,k+1), SLY_bid(:,:,:,:,k+1), &
            SLX_bid(:,:,:,:,k+2), SLY_bid(:,:,:,:,k+2), &
            dz(k),dz(k+1),dz(k+2),dzwr(k),dzwr(k+1))
    end do
    !$omp end do
    !$OMP END PARALLEL
end subroutine ParallelNestedRank1_bid   

subroutine ParallelNestedRank1_bid_k( &
    k, K_LEVEL_bid, KMT_bid, ZTW_bid, &
    KAPPA_THIC_bid_k, KAPPA_THIC_bid_kp1, KAPPA_THIC_bid_kp2, &
    SLX_bid_k, SLY_bid_k, &
    SLX_bid_kp1, SLY_bid_kp1, &
    SLX_bid_kp2, SLY_bid_kp2, &
    dz_k,dz_kp1,dz_kp2,dzwr_k,dzwr_kp1)
    use mod_globals
    implicit none
    !-----------------------------------------------------------------------
    !
    !     dummy variables
    !
    !-----------------------------------------------------------------------
    integer :: k
    integer, dimension(nx_block*ny_block) :: K_LEVEL_bid, KMT_bid, ZTW_bid
    real, dimension(nx_block*ny_block,2) :: KAPPA_THIC_bid_k, KAPPA_THIC_bid_kp1
    real, dimension(nx_block*ny_block,2) :: KAPPA_THIC_bid_kp2
    real, dimension(nx_block*ny_block,2,2) :: SLX_bid_k, SLY_bid_k
    real, dimension(nx_block*ny_block,2,2) :: SLX_bid_kp1, SLY_bid_kp1
    real, dimension(nx_block*ny_block,2,2) :: SLX_bid_kp2, SLY_bid_kp2
    real :: dz_k,dz_kp1,dz_kp2,dzwr_k,dzwr_kp1
... ! next note index (i,j) compression to (ij)
    do kk=1,2
        do ij=1,ny_block*nx_block
            if ( LMASK1(ij) ) then

Note that at the point of the call, a contiguous array section (reference) is passed. The dummy arguments of the called routine specify a same sized contiguous chunk of memory with a different number of indexes.  As long as you are careful in Fortran, you can do this.

The coding effort was mostly a copy and paste, then a find and replace operation. Other than this, there was no code flow changes. A meticulous junior programmer could have done this with proper instructions.

While future versions of compiler optimization may make this unnecessary, a little bit of “unnecessary” programming effort now can, at times, yield substantial performance gains (52% in this case).

With the reworked code, the equivalent source statement now compiles down from 46 instructions to 6, a 7.66x reduction. This illustrates that by reducing the number of array subscripts, the compiler can reduce the instruction count.

Introducing a two-level nest with index peeling yielded a 1.52x performance boost. Whether a 52% boost in performance is worth the additional effort is a subjective measure for you to decide. I anticipate that future compiler optimizations will perform the loop-invariant array subscript lifting done manually above, but until then you can use the index peel and compress technique.

I hope that I have provided you with some useful tips.

Jim Dempsey
Quickthread Programming, LLC
A software consulting company.

Chat Heads with Intel® RealSense™ SDK Background Segmentation Boosts e-Sport Experience


Intel® RealSense™ Technology can be used to improve the e-sports experience for both players and spectators by allowing them to see each other on-screen during the game. Using background segmented video (BGS), players’ “floating heads” can be overlaid on top of a game, using less screen real estate than full widescreen video, while the game's graphics remain visible behind them (like a meteorologist on television). Giving players the ability to see one another while they play, in addition to speaking to one another, enhances their overall in-game communication experience. And spectators get a chance to see their favorite e-sports competitors in the middle of all the action.

In this article, we will discuss how this technique is made possible by the Intel® RealSense™ SDK. This sample will help you understand the various pieces in the implementation (using the Intel RealSense SDK for background segmentation, networking, video compression and decompression), the social interaction, and the performance of this use case. The code in this sample is written in C++ and uses DirectX*.

Figure 1: Screenshot of the sample with two players and a League of Legends* video clip playing in the background.

Figure 2: Screenshot of the sample with two players and a Hearthstone* video clip playing in the background.

Installing, Building, and Running the Sample

Download the sample at: https://github.com/GameTechDev/ChatHeads

The sample uses the following third-party libraries: 
(i) RakNet for networking 
(ii) Theora Playback Library to play back ogg videos 
(iii) ImGui for the UI 
(iv) Windows Media Foundation* (WMF) for encoding and decoding the BGS video streams

(i) and (ii) are dynamically linked (the required DLLs are present in the source repo), while (iii) is statically linked with its source included.
(iv) is dynamically linked and is part of the WMF runtime, which should be installed by default on Windows* 8 or later. If it is not already present, install the Windows SDK.

Install the Intel RealSense SDK (2015 R5 or higher) prior to building the sample. The header and library include paths in the Visual Studio project use the RSSDK_DIR environment variable, which is set during the RSSDK installation.

The solution file is at ChatheadsNativePOC\ChatheadsNativePOC and should build successfully with VS2013 and VS2015.

Install the Intel® RealSense™ Depth Camera Manager, which includes the camera driver, before running the sample. The sample has been tested on Windows® 8.1 and Windows® 10 using both the external and embedded Intel® RealSense™ cameras.

When you start the sample, the option panel shown in Figure 3 displays:

Figure 3: Option panel at startup.

  • Scene selection: Select between League of Legends* video, Hearthstone* video and a CPUT (3D) scene. Click the Load Scene button to render the selection. This does not start the Intel RealSense software; that happens in a later step.
  • Resolutions: The Intel RealSense SDK background segmentation module supports multiple resolutions. Setting a new resolution results in a shutdown of the current Intel RealSense SDK session and initializes a new one.
  • Is Server / IP Address: If you are running as the server, check the box labeled Is Server. 
    If you are running as a client, leave the box unchecked and enter the IP address you want to connect to.

Hitting Start initializes the network and the Intel RealSense SDK and plays the selected scene. The maximum number of connected machines (server plus clients) is hardcoded to 4 in the file NetworkLayer.h.

Note: While a server and client can be started on the same system, they cannot use different color stream resolutions. Attempting to do so will crash the Intel RealSense SDK runtime, since two different resolutions can’t run simultaneously on the same camera.

After the network and Intel RealSense SDK initialize successfully, the panels shown in Figure 4 display:

Figure 4: Chat Heads option panels.

The Option panel has multiple sections, each with their own control settings. The sections and their fields are:

  • BGS/Media controls
    • Show BGS Image – If enabled, the background segmented image (i.e., color stream without the background) is shown. If disabled, the color stream is simply used (even though BGS processing still happens). This affects the remote chat heads as well (that is, if both sides have the option disabled, you’ll see the remote players’ background in the video stream).

      Figure 5: BGS on (left) and off (right). The former blends into Hearthstone*, while the latter sticks out.

    • Pause BGS - Pause the Intel RealSense SDK BGS module, suspending segmentation processing on the CPU
    • BGS frame skip interval - The frequency at which the BGS algorithm runs. Enter 0 to run every frame, 1 to run once in two frames, and so on. The limit exposed by the Intel RealSense SDK is 4.
    • Encoding threshold – This is relevant only for multiplayer scenarios. See the Implementation section for details.
    • Decoding threshold - This is relevant only for multiplayer scenarios. See the Implementation section for details.
  • Size/Pos controls
    • Size - Click/drag within the boxes to resize the sprite. Use it with different resolutions to compare quality.
    • Pos - Click/drag within the boxes to reposition the sprite.
  • Network control/information (This section is shown only when multiple players are connected)
    • Network send interval (ms) - how often video update data is sent.
    • Sent - Graph of data sent by a client or server.
    • Rcvd - Graph of data received by a client or server. Clients send their updates to the server, which then broadcasts them to the other clients. For reference, streaming 1080p Netflix* video requires a recommended bandwidth of 5 Mbps (625 KB/s).
  • Metrics
    • Process metrics
      • CPU Used - The BGS algorithm runs on several Intel® Threading Building Blocks threads and in the context of a game, can use more CPU resources than desired. Play with the Pause BGS and BGS frame skip interval options and change the Chat Head resolution to see how it affects the CPU usage.

Implementation

Internally, the Intel RealSense SDK does its processing on each new frame of data it receives from the Intel RealSense camera. The calls used to retrieve that data are blocking, making it costly to execute this processing on the main application thread. Therefore, in this sample, all of the Intel RealSense SDK processing happens on its own dedicated thread. This thread and the application thread never attempt to write to the same objects, making synchronization trivial.

There is also a dedicated networking thread that handles incoming messages and is controlled by the main application thread using signals. The networking thread receives video update packets and updates a shared buffer for the remote chat heads with the decoded data.

The application thread takes care of copying the updated image data to the DirectX* texture resource. When a remote player changes the camera resolution, the networking thread sets a bool for recreation, and the application thread takes care of resizing the buffer, recreating the DirectX* graphics resources (Texture2D and ShaderResourceView) and reinitializing the decoder.

Figure 6 shows the post-initialization interaction and data flow between these systems (threads).

Figure 6: Interaction flow between local and remote Chat Heads.

Color Conversion

The Intel RealSense SDK uses 32-bit BGRA (8 bits per channel) to store the segmented image, with the alpha channel set to 0 for background pixels. This maps directly to the DirectX texture format DXGI_FORMAT_B8G8R8A8_UNORM_SRGB for rendering the chat heads. In this sample, we convert the BGRA image to YUYV, wherein every pair of BGRA pixels is combined into one YUYV pixel. However, YUYV does not have an alpha channel, so to preserve the alpha from the original image, we set the Y, U, and V channels all to 0 in order to represent background segmented pixels.

The YUYV bit stream is then encoded using WMF’s H.264 encoder. This also ensures better compression, since more than half the image is generally comprised of background pixels.

When decoded, the YUYV values meant to represent background pixels can be non-zero due to the lossy nature of the compression. Our workaround is to use 8 bit encoding and decoding thresholds, exposed in the UI. On the encoding side, if the alpha of a given BGRA pixel is less than the encoding threshold, then the YUYV pixel will be set to 0. Then again, on the decoding side, if the decoded Y, U, and V channels are all less than the decoding threshold, then the resulting BGRA pixel will be assigned an alpha of 0.
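The threshold logic on both ends can be sketched as follows. The sample itself implements this in C++ around the WMF H.264 encoder and decoder; this C# fragment only illustrates the decision being made:

// Encode side: a BGRA pixel whose alpha falls below the encoding threshold is written
// into the YUYV stream as Y = U = V = 0 instead of its converted color.
static bool IsBackground(byte alpha, byte encodeThreshold)
{
    return alpha < encodeThreshold;
}

// Decode side: a decoded sample whose Y, U, and V are all below the decoding threshold
// is treated as background and given alpha 0 in the reconstructed BGRA image.
static byte RestoredAlpha(byte y, byte u, byte v, byte decodeThreshold)
{
    return (y < decodeThreshold && u < decodeThreshold && v < decodeThreshold)
        ? (byte)0
        : (byte)255;
}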

When the decoding threshold is set to 0, you may notice green pixels (shown below) highlighting the background segmented image(s). This is because in YUYV, 0 corresponds to the color green and not black as in BGRA (with non-zero alpha).

Figure 7: Green silhouette edges around the remote player when a 0 decoding threshold is used

Bandwidth

The amount of data sent over the network depends on the network send interval and the local camera resolution. The maximum send rate is limited by the 30 fps camera frame rate, giving a minimum interval of 33.33 ms. At this send rate, a 320x240 video feed consumes 60-75 KBps (kilobytes per second) with minimal motion and 90-120 KBps with more motion. Note that the bandwidth figures depend on the number of pixels covered by the player. Increasing the resolution to 1280x720 doesn’t impact the bandwidth cost all that much; the net increase is around 10-20 KBps, since a sizable chunk of the image is background (YUYV set to 0), which compresses very well. Increasing the send interval to 70 ms reduces bandwidth consumption to roughly 20-30 KBps.

Performance

The sample uses Intel® Instrumentation and Tracing Technology (Intel® ITT) markers and Intel® VTune™ Amplifier XE to help measure and analyze performance. To enable them, uncomment the following line in ChatheadsNativePOC\itt\include\VTuneScopedTask.h and rebuild:

//#define ENABLE_VTUNE_PROFILING // uncomment to enable marker code

With the instrumentation code enabled, an Intel® VTune concurrency analysis of the sample can help understand the application’s thread profile. The platform view tab shows a colored box (whose length is based on execution time) for every instrumented section, and can help locate bottlenecks. The following capture was taken on an Intel® Core™ i7-4770R processor (8 logical cores) with varying BGS work. The “CPU Usage” row on the bottom shows the cost of executing the BGS algorithm every frame, every alternate frame, once in three frames and when suspended. As expected, the TBB threads doing the BGS work have lower CPU utilization when frames are skipped.

Figure 8: VTune concurrency analysis platform view with varying BGS work

A closer look at the RealSense thread shows the RSSDK AcquireFrame() call taking ~29-35ms on average, which is a result of the configured frame capture rate of 30 fps.

Figure 9: Closer look at the RealSense thread. The thread does not spin, and is blocked while trying to acquire the frame data

The CPU usage info can be seen via the metrics panel of the sample as well, and is shown in the table below:

BGS frequency            Chat Heads CPU Usage (approx.)
Every frame              23%
Every alternate frame    19%
Once in three frames     16%
Once in four frames      13%
Suspended                9%

Doing the BGS work every alternate frame, or once in three frames, results in a fairly good experience when the subject is a gamer because of minimal motion. The sample currently doesn’t update the image for the skipped frames – it would be interesting to use the updated color stream with the previous frame’s segmentation mask instead.

Conclusion

The Chat Heads usage enabled by Intel RealSense technology can make a game truly social and improve both the in-game and e-sports experience without sacrificing the look, feel, and performance of the game. Current e-sports broadcasts generally show full-video (i.e., with the background) overlays of the professional player and/or team in empty areas of the bottom UI. Using the Intel RealSense SDK's background segmentation, each player's segmented video feed can be overlaid near the player's character without obstructing the game view. Combined with Intel RealSense SDK face tracking, this allows for powerful and fun social experiences in games.

Acknowledgements

A huge thanks to Jeff Laflam for pair-programming the sample and reviewing this article. 
Thanks also to Brian Mackenzie for the WMF based encoder/decoder implementation, Doug McNabb for CPUT clarifications and Geoffrey Douglas for reviewing this article.

 

Good Performance: Three Developers’ Behaviors that Prevent It!


Download [PDF 1.2MB]

Introduction

Performance is regarded as one of the most valuable non-functional requirements of an application. If you are reading this, you are probably using an application like a browser or document reader, and understand how important performance is. In this article, I will talk about applications’ good performance and three developers’ behaviors that prevent it.

Behavior #1: Lack of understanding of the development technologies

It doesn’t matter whether you just graduated from school or have years of experience; when you have to develop something, you will probably look for something that has already been developed, hopefully in the same programming language.

This is not a bad thing. In fact, it often speeds up development. But, on the other hand, it can also prevent you from learning something, because only rarely does this approach involve taking the time to inspect the code and understand not only the algorithm but also the inner workings of each line.

That is one example of us, as developers, falling into behavior number one. But there are other ways too. For example, when I was younger and just starting my journey in software development, my boss at the time was my role model, and whatever he did was the best someone could do. Whenever I had to do something, I looked at how he did it and replicated it as closely as possible. Many times, I did not understand why his approach worked, but who cares, right? It worked!

There is a kind of developer that I call a “4x4”: someone who, when asked to do something, works as hard as possible to complete it. They usually look for building blocks, or pieces of things already done, put them all together, and voilà! The thing is done! Rarely does this kind of developer spend any time understanding all the pieces he or she found, or consider scalability, maintainability, or performance.

There is one more situation that leads to developers not understanding how things actually work: never running into problems! When you use a technology for the first time and you run into problems, you dig into the details of the technology, and you end up understanding how it works.

At this point, let’s look at some examples that will help us understand the difference between understanding the technology and simply using it. Since I am, for the most part, a .NET* web developer, I will focus on that.

JavaScript* and the Document Object Module (DOM)

Consider a plain snippet that updates the style of an element in the DOM. The problem (less of an issue with modern browsers, but it illustrates the point) is that the snippet traverses the DOM tree three times. If this code is repeated and the document is large and complex, the application takes a performance hit.

Fixing such a problem is easy: hold a direct reference to the element in a variable (myField) before working on the object. The resulting code is less verbose, quicker to read and understand, and performs better, since there is only one access to the DOM tree.

Let’s look at another example. This example was taken from: http://code.tutsplus.com/tutorials/10-ways-to-instantly-increase-your-jquery-performance--net-5551

In the following figure, there are two equivalent code snippets. Each code creates a thousand list item li elements. The code on the right adds an id attribute to each li element, whereas the code on the left adds a class attribute to each li element.

As you can see, the second part of each code snippet simply accesses each of the thousand li elements that were created. In my benchmarking in Internet Explorer* 10 and Chrome* 48, the average time taken was 57 ms for the code on the left and 9 ms for the code on the right—significantly less. The difference is huge when all that changes is how the elements are accessed.

This example has to be taken very carefully! There are so many additional things to understand that might make this example look wrong, like the order in which the selectors are evaluated, which is from right to left. If you are using jQuery*, read about the DOM context as well. For general CSS Selectors’ performance concepts, see the following article: https://smacss.com/book/selectors

Let’s provide a final example in JavaScript code. This example is more related to memory but will help you understand how things really work. High memory consumption in browsers will cause a performance problem as well.

The next image shows two different ways of creating an object with two properties and one method. On the left, the class’s constructor adds the two properties to the object and the additional method is added through the class’s prototype. On the right, the constructor adds the properties and the method at once.

A thousand objects are then created using each technique. If you compare the memory used by the objects, you’ll see differences in the Shallow Size and Retained Size for the two approaches in Chrome. The prototype approach uses about 20 percent less memory in Shallow Size (20 Kbytes versus 24 Kbytes) and references up to 66 percent less Retained Memory (20 Kbytes versus 60 Kbytes).

For a better understanding of how Shallow Size and Retained Size memory work, see:

https://developers.google.com/web/tools/chrome-devtools/profile/memory-problems/memory-101?hl=en

You can create objects by knowing how to use the technology. But understanding how the technology actually works gives you tools to improve the application in areas like memory management and performance.

LINQ

When I was preparing my conference presentation on this topic, I wanted to provide an example with server-side code. I decided to use LINQ*, since LINQ has become a go-to tool in the .NET world for new development and is one of the most promising areas in which to look for performance improvements.

Consider this common scenario. In the following image there are two functionally equivalent sections of code. The purpose of the code is to list all departments and all courses for each department in a school. In the code titled Select N+1, we list all the departments and for each department list its courses. This means that if there are 100 departments, we will make 1+100 calls to the database.

There are many ways to solve this. One simple approach is shown in the code on the right side of the image. By using the Include method (in this case with a hardcoded string for ease of understanding), there will be a single database call in which all the departments and their courses are brought back at once. When the second foreach loop is executed, all the Courses collections for each department will already be in memory.
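
To make the two patterns concrete, here is a minimal sketch assuming a hypothetical Entity Framework context with Department and Course entities; the names are illustrative and are not taken from the original sample.

using System;
using System.Collections.Generic;
using System.Data.Entity;   // Entity Framework 6
using System.Linq;

public class Course { public int Id { get; set; } public string Title { get; set; } }

public class Department
{
    public int Id { get; set; }
    public string Name { get; set; }
    public virtual ICollection<Course> Courses { get; set; }   // virtual enables lazy loading
}

public class SchoolContext : DbContext
{
    public DbSet<Department> Departments { get; set; }
    public DbSet<Course> Courses { get; set; }
}

public static class DepartmentLister
{
    // Select N+1: one query for the departments, plus one query per department
    // the first time its lazy-loaded Courses collection is touched.
    public static void ListLazily(SchoolContext db)
    {
        foreach (var dept in db.Departments.ToList())
        {
            Console.WriteLine(dept.Name);
            foreach (var course in dept.Courses)        // triggers a database call
                Console.WriteLine("  " + course.Title);
        }
    }

    // Eager loading: a single query brings all departments and their courses at once.
    public static void ListEagerly(SchoolContext db)
    {
        foreach (var dept in db.Departments.Include("Courses").ToList())
        {
            Console.WriteLine(dept.Name);
            foreach (var course in dept.Courses)        // already in memory
                Console.WriteLine("  " + course.Title);
        }
    }
}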

Improvements in performance on the order of hundreds of times faster are possible simply by avoiding the Select N+1 problem.

Let’s consider a less obvious example.

In the image below, there is only one difference between the two code snippets: the data type of the target list in the second line. You might ask, what difference does the target type make? When you understand how the technology works, you realize that the target data type defines the exact moment when the query is executed against the database. That, in turn, defines when the filters of each query are applied.

In the case of the Code #1 sample, where an IEnumerable is expected, Take<Employee>(10) is applied in memory after the database query runs. This means that if there are 1,000 employees, all of them will be retrieved from the database and then only 10 will be taken.

In the case of the Code #2 sample, the query is not executed until Take<Employee>(10) is called, so the Take is composed into the database query. That is, only 10 records are retrieved from the database.
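
Here is a hedged sketch of the difference, using a hypothetical Employee entity and Entity Framework context; only the declared type of the intermediate variable changes between the two methods.

using System.Collections.Generic;
using System.Data.Entity;
using System.Linq;

public class Employee { public int Id { get; set; } public bool IsActive { get; set; } }

public class CompanyContext : DbContext
{
    public DbSet<Employee> Employees { get; set; }
}

public static class EmployeeQueries
{
    // Code #1 style: the variable is typed as IEnumerable, so Take(10) runs as
    // LINQ to Objects. The SQL sent to the server has no TOP clause; all matching
    // rows are produced and only ten are kept on the client.
    public static List<Employee> FirstTenInMemory(CompanyContext db)
    {
        IEnumerable<Employee> employees = db.Employees.Where(e => e.IsActive);
        return employees.Take(10).ToList();
    }

    // Code #2 style: the variable stays an IQueryable, so Take(10) is composed
    // into the database query (TOP 10) and only ten rows are retrieved.
    public static List<Employee> FirstTenFromDatabase(CompanyContext db)
    {
        IQueryable<Employee> employees = db.Employees.Where(e => e.IsActive);
        return employees.Take(10).ToList();
    }
}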

The following article has an in-depth explanation of the differences in using multiple types of collections.

http://www.codeproject.com/Articles/832189/List-vs-IEnumerable-vs-IQueryable-vs-ICollection-v

SQL Server*

In SQL, there are many concepts to understand in order to get the best performance possible out of your database. SQL Server is complex because it requires an understanding of how the data is being used, and what tables are queried the most and by which fields.

Nevertheless, you can still apply some general concepts to improve performance, such as:

  • Clustered versus non-clustered indexes
  • Properly ordered JOINs
  • Understanding when to use #temp tables and variable tables
  • Use of views versus indexed views
  • Use of pre-compiled statements

For the sake of brevity, I won’t walk through a full use case for each, but these are the types of concepts that you can use, understand, and make the most of. As a small taste of the last item, a pre-compiled (prepared), parameterized statement is sketched below.
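
A prepared, parameterized command in ADO.NET might look roughly like this; the table and column names are invented for the example.

using System;
using System.Data;
using System.Data.SqlClient;

public static class PreparedStatementExample
{
    public static void PrintCourseTitles(string connectionString, int departmentId)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT Title FROM Courses WHERE DepartmentId = @deptId", connection))
        {
            // Parameter types must be declared explicitly for Prepare() to work
            command.Parameters.Add("@deptId", SqlDbType.Int).Value = departmentId;

            connection.Open();
            command.Prepare();   // asks SQL Server to compile and cache the execution plan

            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader.GetString(0));
            }
        }
    }
}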

Mindset Change

So, what are the mindset changes we, as developers, must have in order to avoid behavior #1?

  • Stop thinking “I am a front-end or back-end developer!” You probably are an engineer and you may become an expert in one area, but don’t use that as a shield to avoid learning more about other areas.
  • Stop thinking “Let’s let the expert do it because he’s faster!” In the current world where agile is all over the place, we must be fungible resources, and we must learn about the areas we are weak in.
  • Stop telling yourself “I don’t get it!” Of course! If it was easy then we all would be experts! Spend your time reading, asking, and understanding. It’s not easy, but it pays off by itself.
  • Stop saying “I don’t have time!” OK, I get this one. It does happen. But once an Intel fellow told me “when you are passionate about something, your bandwidth is infinite.” And here I am, writing this article at 12:00 a.m. on a Saturday!

Behavior #2: Bias on specific technologies

I have developed in .NET since version 1.0. I knew every single little detail of how Web Forms worked as well as a lot of the .NET client-side libraries (I customized some of them). When I saw that Model View Controller (MVC) was coming out, I was reluctant to use it because “we didn’t need it.”

I won’t continue with the list of things that I didn’t like at the beginning but now use extensively. But it makes my point: people’s bias against specific technologies prevents them from getting better performance.

One of the discussions I often hear is either about LINQ-to-Entities in Entity Framework, or about SQL Stored Procedures when querying data. People are so used to one or the other that they try to continue using them.

Another aspect that makes people biased toward a particular technology is whether they are open source lovers or haters. This makes people not think about what is best for their current situation but rather what best aligns to their philosophy.

Sometimes external factors (for instance, deadlines) push us to make decisions. In order to choose the best technology for our applications, we require time to read, play, compare, and conclude. When we start developing a new product or version of an existing product, it’s not uncommon that we are already late. Two ways come to mind on how to solve this situation: stand up and ask for that time or work extra hours to educate ourselves.

Mindset Change

So what are the mindset changes we, as developers, must have in order to avoid behavior #2:

  • Stop saying “This has always worked,” “This is what we have always used,” and so on. We need to identify and use other options, especially if there is data that supports those options.
  • Stop fixing the solution! There are times when people want to use a specific technology that doesn’t provide the expected results. Then they spend hours and hours trying to tweak that technology. What they are doing in this case is “fixing the solution” instead of focusing on the problem and maybe finding a quicker, more elegant solution somewhere else.
  • “I don’t have time!” Of course, we don’t have time to learn or try new stuff. Again, I get this one.

Behavior #3: Not understanding the application’s infrastructure

After we have put a lot of effort into creating the best application, it is time to deploy it! We tested everything. Everything worked beautifully in our machines. All the 10 testers were so happy with it and its performance. So, what could go wrong after all?

Well, everything could go wrong!

Did you ask yourself any of the following questions?

  • Was the application expected to work in a load-balanced environment?
  • Is the application going to be hosted in the cloud with many instances of it?
  • How many other applications are running in my target production machine?
  • What else is running on that server? SQL Server? Reporting Services? Some SharePoint* extensions?
  • Where are my end users located? Are they all over the world?
  • How many final users will my application have in the next five years?

I understand that not all of these questions refer to the infrastructure but bear with me here. More often than not, the final conditions under which our application will run are not the same as our staging servers.

Let’s pick some of the possible situations that could affect the performance of our application. We will start with users around the world. Maybe our application is very fast and we hear no complaints from our customers in America, but our customers in Malaysia don’t have the same speedy experience.

There are many options to solve this situation. For one, we could use Content Delivery Networks (CDNs) to host static files so that pages load faster from different locations. The following image shows what I am talking about.

As another potential situation, let’s consider a machine where SQL Server and the web server run together. In this case we have two CPU-intensive servers on the same machine. So, how can we solve this? Still assuming you are running a .NET application on an Internet Information Services (IIS) server, we could take advantage of CPU affinity, which ties one process to one or more specific cores in the machine.
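
IIS and SQL Server expose their own affinity settings (described below), but the general idea can be sketched with the .NET Process API; this is a conceptual illustration only, not how you would configure the servers in production.

using System;
using System.Diagnostics;

class AffinityDemo
{
    static void Main()
    {
        Process current = Process.GetCurrentProcess();

        // ProcessorAffinity is a bitmask of the logical processors the process
        // may run on: 0x3 == binary 0011 == CPU 0 and CPU 1 only.
        current.ProcessorAffinity = (IntPtr)0x3;

        Console.WriteLine("Affinity mask is now 0x{0:X}",
                          (long)current.ProcessorAffinity);
    }
}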

For example, let’s say that we have SQL Server and Web Server (IIS) in a machine with four CPUs.

If we leave it to the operating system to decide which CPUs IIS or SQL Server run on, there could be various setups. We could have two CPUs assigned to each server.

Or we could have all processors assigned to only one server!

In this case, we could fall into a deadlock-like situation: IIS might be serving so many requests that it takes all four processors, and some of those requests will need SQL Server, which will never get CPU time. This is admittedly an unlikely scenario, but it illustrates my point.

There is one additional problem: a process will not run on the same CPUs all the time, so there will be a lot of context switching. This context switching causes performance degradation in the server and, in turn, in the applications running on it.

One way to minimize this is by using processor affinity for IIS and SQL. In this way, we can determine how many processors we need for the SQL Server and for the IIS. This is done by changing the Processor Affinity settings in the CPU category in the IIS and the “affinity mask” in the SQL server database. Both cases are shown in the following images.


I could continue with other options at the infrastructure level to improve applications’ performance, like the use of Web Gardens and Web Farms.

Mindset Change

What are the mindset changes we, as developers, must have in order to avoid behavior #3?

  • Stop thinking “That is not my job!” We, as engineers, must broaden our knowledge as much as possible in order to provide the best integral solution to our customers.
  • “I don’t have time!” Of course, we never have time. This is the commonality in the mindset change. Making time is what differentiates a professional who succeeds, exceeds, and stands out!

Don’t feel guilty!

But, do not feel guilty! It is not all on you! Really, we don’t have time! We have family, we have hobbies, and we have to rest!

The important thing here is to realize that sometimes there is more to performance than just writing good code. We all have shown and will show some or all of these behaviors during our lifetime.

Let me give you some tips to avoid these behaviors.

  1. Make time. When asked for estimates for your projects, make sure you estimate for researching, testing, concluding, and making decisions.
  2. Try to create a personal test application along the way. It will save you from trying things out in the application under development, a mistake we all make at some point.
  3. Look for people that already know and do pair-programming. Work with your infrastructure person when he or she is deploying the application. This is time well spent.
  4. Stack Overflow is evil!!! Actually, I do help there and a big percentage of my problems are already solved there. But, if you use it for “copy and paste” answers, you will end up with incomplete solutions.
  5. Stop being the front-end person. Stop being the back-end person too. Become a subject matter expert if you will, but make sure you can hold a smart discussion when talking about the areas where you are not an expert.
  6. Help out! This is probably the best way to learn. When you spend time helping people with their problems, you are in the long run saving yourself time by not encountering the same or similar situations later.

See Also

Funny and interesting blog from Scott Hanselman about being a googler or a developer.

http://www.hanselman.com/blog/AmIReallyADeveloperOrJustAGoodGoogler.aspx

More about objects and prototypes:

http://thecodeship.com/web-development/methods-within-constructor-vs-prototype-in-javascript/

About the Author

Alexander García is a computer science engineer from Intel Costa Rica. Alex has over 14 years of professional experience in software engineering. His professional interests range from software engineering practices, software security, and performance to data mining and related fields. Alex is currently pursuing a Master’s degree in Computer Science.

 

Visual Studio IntelliSense stopped recognizing many of the AVX, AVX2 intrinsics


Problem description:

After installing Intel® Parallel Studio XE 2016 Update 1 on a Windows* operating system, with Visual Studio* 2013 or 2015 used for integration with the Intel compiler, Visual Studio IntelliSense stopped recognizing many of the AVX and AVX2 intrinsics. The list of affected intrinsics is long; examples include _mm256_load_ps, _mm256_add_ps, _mm256_sub_ps, _mm256_mul_ps, _mm256_store_ps, and so on. These are not recognized by IntelliSense.

Reason:

The reason for this Visual Studio IntelliSense behavior is that there are no declarations for these intrinsics in the compiler's header files.

Workaround:

Add the following definition before including the intrinsics header:

#define __INTEL_COMPILER_USE_INTRINSIC_PROTOTYPES 1
#include <mmintrin.h>

This adds all the intrinsic prototypes back into the header files.

Advantages of header-less recognition:

This is an Intel compiler 15.0 to 16.0 regression. Because the intrinsic header files for AVX-512 were taking a long time to compile, it was decided that the compiler should simply recognize the intrinsics without header files or prototypes. In 16.0, the compiler does not require prototypes for intrinsic functions. However, when the compiler recognizes functions without prototypes, it uses the old C rules, which treat enums and integers as the same. We do not see an acceptable fix for this situation in the Visual Studio IDE.

By default, we do not want to include every AVX-512 intrinsic prototype, as there are so many of them that parsing user code slows down noticeably. This is why we changed to header-less recognition in 16.0.

Note on future header files:

We have added a comment in mmintrin.h for 17.0/18.0 to explain the compiler's use of these header files. Users will need to manually enable the function prototypes as per the instructions.

Here is an excerpt from the latest mmintrin.h file.
/*
 * Many of these function declarations are not visible to the
 * compiler; they are for reference only. The compiler recognizes the
 * function names as "builtins", without requiring the
 * declarations. This improves compile-time.  If user code requires
 * the actual declarations, they can be made visible like
 * this:
 * #define __INTEL_COMPILER_USE_INTRINSIC_PROTOTYPES 1
 * #include <mmintrin.h>
 */
 

How to Migrate Intel® RealSense™ SDK R4 (v6.0) Hand Mode Functionality to Intel RealSense SDK 2016 R1 Cursor Mode


Abstract

With the dual arrival of the Intel® RealSense™ camera (SR300) and the Intel® RealSense™ SDK 2016 R1 comes a new mode of gesture interaction called Cursor Mode, for use with the SR300 only. This tutorial highlights the code changes developers will have to make in order to exploit the new capability.

Introduction

Prior to the release of Intel RealSense SDK 2016 R1, applications wanting to effect cursor movement and detect click actions relied on Hand Mode and detected the “click” action through gesture recognition. That functionality has now been decoupled from Hand Mode into a new feature called Cursor Mode. As such, applications that relied on the previous functionality can now change their code to take advantage of the refinements and upgrades to cursor control in the new Cursor Mode.

Please note that Cursor Mode is available only to devices and peripherals that use the Intel RealSense camera (SR300). As a developer of Intel® RealSense™ applications looking to use the SR300, you must upgrade to Windows* 10, and we require that you use version 2016 R1 of the Intel RealSense SDK.

Tutorial

More than likely you already have an application that is written for the F200 camera with the Intel RealSense SDK R4 (v6.0) and would like to know how to move forward and use the new Cursor Mode functionality. This tutorial presents the following:

Part 1

Initialization of the processing pipeline must occur in a manner similar to the previous version of the Intel RealSense SDK. Thus, you must instantiate the Sense Manager and check that no errors occurred in the process.

PXCSenseManager *pSenseMgr = PXCSenseManager::CreateInstance();
if( pSenseMgr ) {
    // < continue on to creating the modes >
}

Part 2

Previously for the F200 Hand Mode, to get anything resembling cursor actions you had to rely on the Hand Module and track the hand set to various configurations. Your code might have looked like this (note that the following code is for reference purposes and will not compile directly as written below):

PXCHandModule *pHandModule;
PXCHandData *pHandData;
int confidence;
. . . <additional library and variables setup> . . .
pxcStatus status;
if( pSenseMgr ) {
    // Enable hand tracking
    status = pSenseMgr->EnableHand();
    if( status == pxcStatus::PXC_STATUS_NO_ERROR ) {
        // Get an instance of PXCHandModule
        pHandModule = pSenseMgr->QueryHand();
        // Get an instance of PXCHandConfiguration
        PXCHandConfiguration *pHandConfig = pHandModule->CreateActiveConfiguration();
        // Enable the "click" gesture and apply the changes
        pHandConfig->EnableGesture("cursor_click");
        pHandConfig->ApplyChanges();
        . . . <additional configuration options> . . .
    }
}

Part 3

Beginning with the Intel RealSense SDK 2016 R1 a new Cursor Mode has been implemented, and cursor actions have been decoupled from the Hand Mode. This means that previous code paths that queried the Hand Mode in the Sense Manager must change. The new code will take the following form:

PXCHandCursorModule *pCursorModule;
PXCCursorData::BodySideType bodySide;
// please note that the Confidence values no longer exist
. . . <additional library and variables setup> . . .
pxcStatus status;
if( pSenseMgr ) {
    // Enable hand cursor tracking
    status = pSenseMgr->EnableHandCursor();
    if( status == pxcStatus::PXC_STATUS_NO_ERROR ) {
        // Get an instance of PXCHandCursorModule
        pCursorModule = pSenseMgr->QueryHandCursor();
        // Get an instance of the cursor configuration
        PXCCursorConfiguration *pCursorConfig = pCursorModule->CreateActiveConfiguration();

        // Make configuration changes and apply them
        pCursorConfig->EnableEngagement(true);
        pCursorConfig->EnableAllGestures();
        pCursorConfig->ApplyChanges();
        . . . <additional configuration options> . . .
    }
}

Part 4

Implementation examples of the main processing loops for synchronous and asynchronous functions can be found in the Intel RealSense™ SDK 2016 R1 Documentation in the Implementing the Main Processing Loop subsection of the Cursor Module [SR300] section.

A summary of the asynchronous—and preferred—approach is as follows:

class MyHandler: public PXCSenseManager::Handler {
public:
    virtual pxcStatus PXCAPI OnModuleProcessedFrame(pxcUID mid, PXCBase *module, PXCCapture::Sample *sample) {
       // check if the callback is from the hand cursor tracking module
       if (mid==PXCHandCursorModule::CUID) {
           PXCHandCursorModule *cursorModule = module->QueryInstance<PXCHandCursorModule>();
           PXCCursorData *cursorData = cursorModule->CreateOutput();
           // process cursor tracking data
       }

       // return NO_ERROR to continue, or any error to abort
       return PXC_STATUS_NO_ERROR;
    }
};
. . . <SenseManager declaration> . . .
// Initialize and stream data
MyHandler handler; // Instantiate the handler object

// Register the handler object
pSenseMgr->Init(&handler);

// Initiate SenseManager’s processing loop in blocking mode
// (function exits only when processing ends)
pSenseMgr->StreamFrames(true);

// Release SenseManager resources
pSenseMgr->Release();

Conclusion

Though the Intel RealSense SDK 2016 R1 has changed the implementation and access to the hand cursor, it is worth noting that the changes have a consistency that allows for an easy migration of your code. The sample code above demonstrates that ease by showing that your general program structure during initialization, setup, and per-frame execution can remain unchanged while still harnessing the improved capabilities of the new Cursor Mode.

It is worth repeating that the new Cursor Mode is only available to systems that are enabled with the SR300 camera, either integrated or as a peripheral, and using RealSense™ SDK 2016 R1. The ability to detect and branch your code to support dual F200 and SR300 cameras, with either one as peripherals, during development will be discussed in other tutorials.

New, Exciting Intel Media Software Capabilities to Showcase at NAB Show 2016


Intel Media Tools at NAB

Visit with Intel at NAB Show (the National Association of Broadcasters Show) in Las Vegas, Apr. 18-21. Intel is a world leader in computing, and Intel® architecture, media accelerators, and software are at the heart of innovative, advanced media solutions for video service providers and broadcasters. See how you can stay competitive and deliver the next generation of brilliant media experiences with Intel: innovate media experiences; get fast performance and efficiency for your media solutions; and reduce infrastructure and development costs.

Attend NAB with a free passcode LV6579 for entry to the Exhibit Hall. Visit Intel at SU621 (South Upper Hall).
 

Technology Showcase: Media Transcoding Software Preview

At NAB, we’re showing some exciting demos on new media software capabilities that bring tremendous performance and productivity, along with high-quality across AVC, HEVC, VP9, MPEG-2 and AVS 2.0 formats. Come check them out. Below is only a partial list—some are so special, that we can’t write about them yet!

  • Accelerate fast, dense, high-quality video transcoding, visualize CPU/GPU and memory usage with Intel® Media Server Studio. See also Intel® Visual Compute Accelerator in action.
  • Accelerate transitions to HEVC, 4K, or even 8K
  • Debug decode and encode, ensure encoder compliances with Intel® Video Pro Analyzer, a complete toolset for advanced video analysis
  • Develop robust decoders, accelerate media validation and debug
  • Innovative media demos with Ateme and Sharp
  • And more...

Learn how Intel® Media SDK, Intel® Media Server Studio, Intel® Video Pro Analyzer, and Intel® Stress Bitstreams and Encoder can help video solution providers and broadcasters innovate, speed media processing, and improve quality - not to mention save time and costs.

Feel free to also contact us for a private meeting on how to optimize your media solutions.

Other activities in the Intel booth include demonstrations of the latest architecture platforms, devices and technologies for awesome media and broadcasting, and cloud delivery, along with a suite of industry-leading customers showing how they are using Intel media acceleration technologies to innovate today.

 


Create a Virtual Joystick Using the Intel® RealSense™ SDK Hand Cursor Module


Abstract

This article describes a code walkthrough for creating a virtual joystick app (see Figure 1) that incorporates the new Hand Cursor Module in the Intel® RealSense™ SDK. This project is developed in C#/XAML and can be built using Microsoft Visual Studio* 2015.


Figure 1: RS Joystick app controlling Google Earth* flight simulator.

Introduction

Support for the new Intel® RealSense™ camera, model SR300, was introduced in R5 of the Intel RealSense SDK. The SR300 is the successor to the F200 model and provides a set of improvements along with a new feature known as the Hand Cursor Module.

As described in the SDK documentation, the Hand Cursor Module returns a single point on the hand that allows accurate and responsive tracking. Its purpose is to facilitate the hand-based UI control use case, along with supporting a limited set of gestures.

RS Joystick, the joystick emulator app described in this article, maps 3D hand data provided by the SDK to virtual joystick controls, resulting in a hands-free way to interact with software applications that work with joystick controllers.

The RS Joystick app leverages the following Hand Cursor Module features:

  • Body Side Type – The app notifies the user which hand is controlling the virtual joystick, based on a near-to-far access order.
  • Cursor-Click Gesture – The user can toggle the ON-OFF state of button 1 on the virtual joystick controller by making a finger-click gesture.
  • Adaptive Point Tracking – The app displays the normalized 3D point inside the imaginary “bounding box” defined by the Hand Cursor Module and uses this data to control the x-, y-, and z-axes of the virtual joystick.
  • Alert Data – The app uses Cursor Not Detected, Cursor Disengaged, and Cursor Out Of Border alerts to change the joystick border from green to red when the user’s hand is out of range of the SR300 camera.

(For more information on the Hand Cursor Module check out “What could you do with Intel RealSense Cursor Mode?”)

Prerequisites

You should have some knowledge of C# and understand basic operations in Visual Studio like building an executable. Previous experience with adding third-party libraries to a custom software project is helpful, but this walkthrough provides detailed steps, if this is new to you. Your system needs a front-facing SR300 camera, the latest versions of the SDK and Intel® RealSense™ Depth Camera Manager (DCM) installed, and must meet the hardware requirements listed here. Finally, your system must be running Microsoft Windows* 10 Threshold 2.

Third-Party Software

In addition to the Intel RealSense SDK, this project incorporates a third-party virtual joystick device driver called vJoy* along with some dynamic-link libraries (DLLs). These software components are not part of any distributed code associated with this custom project, so details on downloading and installing the device driver are provided below.

Install the Intel RealSense SDK

Download and install the required DCM and SDK at https://software.intel.com/en-us/intel-realsense-sdk/download. At the time of this writing the current versions of these components are:

  • Intel RealSense Depth Camera Manager (SR300) v3.1.25.1077
  • Intel RealSense SDK v8.0.24.6528

Install the vJoy Device Driver and SDK

Download and install the vJoy device driver: http://vjoystick.sourceforge.net/site/index.php/download-a-install/72-download. Reboot the computer when instructed to complete the installation.

Once installed, the vJoy device driver appears under Human Interface Devices in Device Manager (see Figure 2).


Figure 2: Device Manager.

Next, open the Windows 10 Start menu and select All apps. You will find several installed vJoy components, as shown in Figure 3.


Figure 3: Windows Start menu.

Click the vJoy SDK button to open your default browser and go to the download page.

Once downloaded, copy the .zip file to a temporary folder, unzip it, and then locate the C# DLLs in \SDK\c#\x86.

We will be adding these DLLs to our Visual Studio project once it is created, as described in the next step.

Create a New Visual Studio Project

  • Launch Visual Studio 2015.
  • From the menu bar, select File, New, Project….
  • In the New Project screen, expand Templates and select Visual C#, Windows.
  • Select WPF Application.
  • Specify the location for the new project and its name. For this project, our location is C:\ and the name of the application is RsJoystick.

Figure 4 shows the New Project settings used for this project.


Figure 4: Visual Studio* New Project settings.

Click OK to create the project.

Copy Libraries into the Project

Two DLLs are required for creating Intel® RealSense™ apps in C#:

  • libpxcclr.cs.dll – the managed C# interface DLL
  • libpxccpp2c.dll – the unmanaged C++ P/Invoke DLL

Similarly, there are two DLLs required to allow the app to communicate with the vJoy device driver:

  • vJoyInterface.dll – the C-language API library
  • vJoyInterfaceWrap.dll – the C# wrapper around the C-language API library

To simplify the overall structure of our project, we’re going to copy all four of these files directly into the project folder:

  • Right-click the RsJoystick project and select Add, Existing Item…
  • Navigate to the location of the vJoy DLLs (that is, \SDK\c#\x86) and select both vJoyInterface.dll and vJoyInterfaceWrap.dll. Note: you may need to specify All Files (*.*) for the file type in order for the DLLs to become visible.
  • Click the Add button.

Similarly, copy the Intel RealSense SDK DLLs into the project:

  • Right-click the RsJoystick project and then select Add, Existing Item…
  • Navigate to the location where the x86 libraries reside, which is C:\Program Files (x86)\Intel\RSSDK\bin\win32 in a default SDK installation.
  • Select both libpxcclr.cs.dll and libpxccpp2c.dll.
  • Click the Add button.

All four files should now be visible in Solution Explorer under the RsJoystick project.

Create References to the Libraries

Now that the required library files have been physically copied to the Visual Studio project, you must create references to the managed (.NET) DLLs so they can be used by your app. Right-click References (which is located under the RsJoystick project) and select Add Reference… In the Reference Manager window, click the Browse button and navigate to the project folder (c:\RsJoystick\RsJoystick). Select both the libpxcclr.cs.dll and vJoyInterfaceWrap.dll files, and then click the Add button. Click the OK button in Reference Manager.

In order for the managed wrapper DLLs to work properly, you need to ensure the unmanaged DLLs get copied into the project’s output folder before the app runs. In Solution Explorer, click libpxccpp2c.dll to select it. The Properties screen shows the file properties for libpxccpp2c.dll. Locate the Copy to Output Directory field and use the drop-down list to select Copy Always. Repeat this step for vJoyInterface.dll. This ensures that the unmanaged DLLs get copied to the project output folder when you build the application.

At this point you may see a warning about a mismatch between the processor architecture of the project being built and the processor architecture of the referenced libraries. Clear this warning by doing the following:

  • Locate the link to Configuration Manager in the drop-down list in the menu bar (see Figure 5).
  • Select Configuration Manager.
  • In the Configuration Manager screen, expand the drop-down list in the Platform column, and then select New.
  • Select x86 as the new platform and then click OK.
  • Close the Configuration Manager screen.


Figure 5: Configuration Manager.

At this point the project should build and run without any errors or warnings. Also, if you examine the contents of the output folder (c:\RsJoystick\RsJoystick\bin\x86\Debug) you should find that all four of the DLLs got copied there as well.

The User Interface

The user interface (see Figure 6) displays the following information:

  • The user’s hand that is controlling the virtual joystick, based on a near-to-far access order (that is, the hand that is closest to the camera is the controlling hand).
  • The ON-OFF state of Button 1 on the virtual joystick controller, which is controlled by making a finger-click gesture.
  • An ellipse that tracks the relative position of the user’s hand in the x- and y-axes, and changes diameter based on the z-axis to indicate the hand’s distance from the camera.
  • The x-, y-, and z-axis Adaptive Point data from the SDK, which is presented as normalized values in the range of zero to one.
  • A colored border that changes from green to red when the user’s hand is out of range of the SR300 camera.
  • Slider controls that allow the sensitivity to be adjusted for each axis.


Figure 6: User Interface.

The complete XAML source listing is presented in Table 1. This can be copied and pasted directly over the MainWindow.xaml code that was automatically generated when the project was created.

Table 1: XAML Source Code Listing: MainWindow.xaml.

<Window x:Class="RsJoystick.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:local="clr-namespace:RsJoystick"
        mc:Ignorable="d"
        Title="RSJoystick" Height="420" Width="420" Background="#FF222222" Closing="Window_Closing">
    <Window.Resources>
        <Style x:Key="TextStyle" TargetType="TextBlock">
            <Setter Property="Foreground" Value="White"/>
            <Setter Property="FontSize" Value="14"/>
            <Setter Property="Text" Value="-"/>
            <Setter Property="Margin" Value="4"/>
            <Setter Property="HorizontalAlignment" Value="Center"/>
        </Style>
    </Window.Resources>
    <StackPanel VerticalAlignment="Center" HorizontalAlignment="Center" Width="320">
        <TextBlock x:Name="uiBodySide" Style="{StaticResource TextStyle}"/>
        <TextBlock x:Name="uiButtonState" Style="{StaticResource TextStyle}"/>
        <Border x:Name="uiBorder" BorderThickness="2" Width="200" Height="200" BorderBrush="Red" Margin="4">
            <Canvas x:Name="uiCanvas" ClipToBounds="True">
                <Ellipse x:Name="uiCursor" Height="10" Width="10" Fill="Yellow"/>
                <Ellipse Height="50" Width="50" Stroke="Gray" Canvas.Top="75" Canvas.Left="75"/>
                <Rectangle Height="1" Width="196" Stroke="Gray" Canvas.Top="100"/>
                <Rectangle Height="196" Width="1" Stroke="Gray" Canvas.Left="100"/>
            </Canvas>
        </Border>
        <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
            <TextBlock x:Name="uiX" Style="{StaticResource TextStyle}" Width="80"/>
            <Slider x:Name="uiSliderX" Width="150" ValueChanged="sldSensitivity_ValueChanged" Margin="4"/>
        </StackPanel>
        <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
            <TextBlock x:Name="uiY" Style="{StaticResource TextStyle}" Width="80"/>
            <Slider x:Name="uiSliderY" Width="150" ValueChanged="sldSensitivity_ValueChanged" Margin="4"/>
        </StackPanel>
        <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
            <TextBlock x:Name="uiZ" Style="{StaticResource TextStyle}" Width="80"/>
            <Slider x:Name="uiSliderZ" Width="150" ValueChanged="sldSensitivity_ValueChanged" Margin="4"/>
        </StackPanel>
    </StackPanel>
</Window>

Program Source Code

The complete C# source listing for the RSJoystick app is presented in Table 2. This can be copied and pasted directly over the MainWindow.xaml.cs code that was automatically generated when the project was created.

Table 2: C# Source Code Listing: MainWindow.xaml.cs

//--------------------------------------------------------------------------------------
// Copyright 2016 Intel Corporation
// All Rights Reserved
//
// Permission is granted to use, copy, distribute and prepare derivative works of this
// software for any purpose and without fee, provided, that the above copyright notice
// and this statement appear in all copies.  Intel makes no representations about the
// suitability of this software for any purpose.  THIS SOFTWARE IS PROVIDED "AS IS."
// INTEL SPECIFICALLY DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, AND ALL LIABILITY,
// INCLUDING CONSEQUENTIAL AND OTHER INDIRECT DAMAGES, FOR THE USE OF THIS SOFTWARE,
// INCLUDING LIABILITY FOR INFRINGEMENT OF ANY PROPRIETARY RIGHTS, AND INCLUDING THE
// WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  Intel does not
// assume any responsibility for any errors which may appear in this software nor any
// responsibility to update it.
//--------------------------------------------------------------------------------------
using System;
using System.Windows;
using System.Windows.Controls;
using System.Windows.Media;
using vJoyInterfaceWrap;
using System.Threading;
using System.Windows.Shapes;

namespace RsJoystick
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        private PXCMSenseManager sm;
        private PXCMHandCursorModule cursorModule;
        private PXCMCursorConfiguration cursorConfig;
        private vJoy joystick;
        private Thread update;
        private double joySensitivityX;
        private double joySensitivityY;
        private double joySensitivityZ;
        private const uint joyID = 1;
        private const uint MaxSensitivity = 16384;

        public MainWindow()
        {
            InitializeComponent();

            // Configure the sensitivity controls
            uiSliderX.Maximum = MaxSensitivity;
            uiSliderY.Maximum = MaxSensitivity;
            uiSliderZ.Maximum = MaxSensitivity;
            joySensitivityX = uiSliderX.Value = MaxSensitivity / 2;
            joySensitivityY = uiSliderY.Value = MaxSensitivity / 2;
            joySensitivityZ = uiSliderZ.Value = MaxSensitivity / 2;

            // Create an instance of the joystick
            joystick = new vJoy();
            joystick.AcquireVJD(joyID);

            // Configure the cursor mode module
            ConfigureRealSense();

            // Start the Update thread
            update = new Thread(new ThreadStart(Update));
            update.Start();
        }

        public void ConfigureRealSense()
        {
            // Create an instance of the SenseManager
            sm = PXCMSenseManager.CreateInstance();

            // Enable cursor tracking
            sm.EnableHandCursor();

            // Get an instance of the hand cursor module
            cursorModule = sm.QueryHandCursor();

            // Get an instance of the cursor configuration
            cursorConfig = cursorModule.CreateActiveConfiguration();

            // Make configuration changes and apply them
            cursorConfig.EnableEngagement(true);
            cursorConfig.EnableAllGestures();
            cursorConfig.EnableAllAlerts();
            cursorConfig.ApplyChanges();

            // Initialize the SenseManager pipeline
            sm.Init();
        }

        private void Update()
        {
            bool handInRange = false;
            bool joyButton = false;

            // Start AcquireFrame-ReleaseFrame loop
            while (sm.AcquireFrame(true).IsSuccessful())
            {
                PXCMCursorData cursorData = cursorModule.CreateOutput();
                PXCMPoint3DF32 adaptivePoints = new PXCMPoint3DF32();
                PXCMCursorData.BodySideType bodySide;

                // Retrieve the current cursor data
                cursorData.Update();

                // Check if alert data has fired
                for (int i = 0; i < cursorData.QueryFiredAlertsNumber(); i++)
                {
                    PXCMCursorData.AlertData alertData;
                    cursorData.QueryFiredAlertData(i, out alertData);

                    if ((alertData.label == PXCMCursorData.AlertType.CURSOR_NOT_DETECTED) ||
                        (alertData.label == PXCMCursorData.AlertType.CURSOR_DISENGAGED) ||
                        (alertData.label == PXCMCursorData.AlertType.CURSOR_OUT_OF_BORDERS))
                    {
                        handInRange = false;
                    }
                    else
                    {
                        handInRange = true;
                    }
                }

                // Check if click gesture has fired
                PXCMCursorData.GestureData gestureData;

                if (cursorData.IsGestureFired(PXCMCursorData.GestureType.CURSOR_CLICK, out gestureData))
                {
                    joyButton = !joyButton;
                }

                // Track hand cursor if it's within range
                int detectedHands = cursorData.QueryNumberOfCursors();

                if (detectedHands > 0)
                {
                    // Retrieve the cursor data by order-based index
                    PXCMCursorData.ICursor iCursor;
                    cursorData.QueryCursorData(PXCMCursorData.AccessOrderType.ACCESS_ORDER_NEAR_TO_FAR,
                                               0,
                                               out iCursor);

                    adaptivePoints = iCursor.QueryAdaptivePoint();

                    // Retrieve controlling body side (i.e., left or right hand)
                    bodySide = iCursor.QueryBodySide();

                    // Control the virtual joystick
                    ControlJoystick(adaptivePoints, joyButton);
                }
                else
                {
                    bodySide = PXCMCursorData.BodySideType.BODY_SIDE_UNKNOWN;
                }

                // Update the user interface
                Render(adaptivePoints, bodySide, handInRange, joyButton);

                // Resume next frame processing
                cursorData.Dispose();
                sm.ReleaseFrame();
            }
        }

        private void ControlJoystick(PXCMPoint3DF32 points, bool buttonState)
        {
            double joyMin;
            double joyMax;

            // Scale x-axis data
            joyMin = MaxSensitivity - joySensitivityX;
            joyMax = MaxSensitivity + joySensitivityX;
            int xScaled = Convert.ToInt32((joyMax - joyMin) * points.x + joyMin);

            // Scale y-axis data
            joyMin = MaxSensitivity - joySensitivityY;
            joyMax = MaxSensitivity + joySensitivityY;
            int yScaled = Convert.ToInt32((joyMax - joyMin) * points.y + joyMin);

            // Scale z-axis data
            joyMin = MaxSensitivity - joySensitivityZ;
            joyMax = MaxSensitivity + joySensitivityZ;
            int zScaled = Convert.ToInt32((joyMax - joyMin) * points.z + joyMin);

            // Update joystick settings
            joystick.SetAxis(xScaled, joyID, HID_USAGES.HID_USAGE_X);
            joystick.SetAxis(yScaled, joyID, HID_USAGES.HID_USAGE_Y);
            joystick.SetAxis(zScaled, joyID, HID_USAGES.HID_USAGE_Z);
            joystick.SetBtn(buttonState, joyID, 1);
        }

        private void Render(PXCMPoint3DF32 points,
                            PXCMCursorData.BodySideType bodySide,
                            bool handInRange,
                            bool buttonState)
        {
            Dispatcher.Invoke(delegate
            {
                // Change drawing border to indicate if the hand is within range
                uiBorder.BorderBrush = (handInRange) ? Brushes.Green : Brushes.Red;

                // Scale cursor data for drawing
                double xScaled = uiCanvas.ActualWidth * points.x;
                double yScaled = uiCanvas.ActualHeight * points.y;
                uiCursor.Height = uiCursor.Width = points.z * 100;

                // Move the screen cursor
                Canvas.SetRight(uiCursor, (xScaled - uiCursor.Width / 2));
                Canvas.SetTop(uiCursor, (yScaled - uiCursor.Height / 2));

                // Update displayed data values
                uiX.Text = string.Format("X Axis: {0:0.###}", points.x);
                uiY.Text = string.Format("Y Axis: {0:0.###}", points.y);
                uiZ.Text = string.Format("Z Axis: {0:0.###}", points.z);
                uiBodySide.Text = string.Format("Controlling Hand: {0}", bodySide);
                uiButtonState.Text = string.Format("Button State (use 'Click' gesture to toggle): {0}",
                                                    buttonState);
            });
        }

        private void Window_Closing(object sender, System.ComponentModel.CancelEventArgs e)
        {
            update.Abort();
            cursorConfig.Dispose();
            cursorModule.Dispose();
            sm.Dispose();
            joystick.ResetVJD(joyID);
            joystick.RelinquishVJD(joyID);
        }

        private void sldSensitivity_ValueChanged(object sender,
                                                 RoutedPropertyChangedEventArgs<double> e)
        {
            var sliderControl = sender as Slider;

            switch (sliderControl.Name)
            {
                case "uiSliderX":
                    joySensitivityX = sliderControl.Value;
                    break;
                case "uiSliderY":
                    joySensitivityY = sliderControl.Value;
                    break;
                case "uiSliderZ":
                    joySensitivityZ = sliderControl.Value;
                    break;
            }
        }
    }
}

Code Details

To keep this code sample as simple as possible, all methods are contained in a single class. As shown in the source code presented in Table 2, the MainWindow class is composed of the following methods:

  • MainWindow() – Several private objects and member variables are declared at the beginning of the MainWindow class. These objects are instantiated and variables initialized in the MainWindow constructor.
  • ConfigureRealSense()– This method handles the details of creating the SenseManager object and hand cursor module, and configuring the cursor module.
  • Update() – As described in the Intel RealSense SDK Reference Manual, the SenseManager interface can be used either by procedural calls or by event callbacks. In the RSJoystick app we are using procedural calls as the chosen interfacing technique. The acquire/release frame loop runs in the Update() thread, independent of the main UI thread. This thread runs continuously and is where hand cursor data, gestures, and alert data are acquired.
  • ControlJoystick()– This method is called from the Update() thread when the user’s hand is detected. Adaptive Point data is passed to this method, along with the state of the virtual joystick button (toggled by the CURSOR_CLICK gesture). The Adaptive Point data is scaled using values from the sensitivity slider controls. The slider controls and scaling calculations allow the user to select the full-scale range of values that are sent to the vJoy SetAxis() method, which expects values in the range of 0 to 32768. With a sensitivity slider set to its maximum setting, the corresponding cursor data point will be converted to a value in the range of 0 to 32768. Lower sensitivity settings will narrow this range for the same hand trajectory. For example: 8192 to 24576.
  • Render()– This method is called from the Update() thread and uses the Dispatcher.Invoke() method to perform operations that will be executed on the UI thread. This includes updating the position of the ellipse on the canvas control and data values shown in the TextBlock controls.
  •  sldSensitivity_ValueChanged() – This event handler is raised whenever any of the slider controls are adjusted.

Using the Application

You can test the app by running vJoy Monitor from the Windows 10 Start menu (see Figure 3). As shown in Figure 7, you can monitor the effects of moving your hand in three axes and performing the click gesture to toggle button 1.


Figure 7: Testing the app with vJoy Monitor.

For a more fun and practical usage, you can run the flight simulator featured in Google Earth* (see Figure 1). According to their website, “Google Earth lets you fly anywhere on Earth to view satellite imagery, maps, terrain, 3D buildings, from galaxies in outer space to the canyons of the ocean.” (https://www.google.com/earth).

After downloading and installing Google Earth, refer to the instructions located here to run the flight simulator. Start by reducing the x- and y-axis sensitivity controls in RSJoystick to minimize the effects of hand motions on the airplane, and set the z-axis slider to its maximum position. After some experimentation you should be able to control the airplane using subtle hand motions.

Summary

This article provided a simple walkthrough describing how to create an Intel RealSense SDK-enabled joystick emulator app from scratch, and how to use the Hand Cursor Module supported by the SR300 camera.

About Intel RealSense Technology

To learn more about the Intel RealSense SDK for Windows, go to https://software.intel.com/en-us/intel-realsense-sdk.

About the Author

Bryan Brown is a software applications engineer at Intel Corporation in the Software and Services Group. 

Introducing the Intel® RealSense™ Camera SR300


Introduction

The Intel® RealSense™ camera SR300 is the latest front-facing camera in our product lineup. Intel has added a number of new features and significant improvements to the SR300 over the first-generation Intel® RealSense™ camera F200. The SR300 improves the depth range of the camera to 1.5 meters and provides dynamic motion capture with higher-quality depth data, decreased power consumption, and increased middleware quality and robustness. With 1080p full HD video image quality at up to 30 frames per second (FPS), or 720p HD video image quality at up to 60 FPS, the SR300 model provides improved Skype* support. The SR300 supports legacy Intel® RealSense™ camera F200 applications and RGB usage. The Intel® RealSense™ SDK has added a new 3D Cursor mode, improved background segmentation, and 3D object scanning for the SR300 camera. The article A Comparison of Intel® RealSense™ Front-Facing Camera SR300 and F200 shows the differences between the SR300 and F200 models,  and motivations to move to SR300.


Figure 1: SR300 camera model.

The dimensions of the SR300 camera are approximately 110 mm x 12.6 mm x 3.8–4.1 mm, and its weight is 9.4 grams. Its size and weight allow it to be clipped onto a mobile platform lid or desktop monitor and provide stable video output. The SR300 will be built into multiple form factors in 2016, including PCs, all-in-ones, notebooks, and 2-in-1s. The SR300 camera can use the Intel RealSense SDK for Windows* or librealsense software. The SDK version that added support for SR300 is 2016 R1 or later.

New Features and Improvements

New Features
  • Cursor mode
  • Person tracking
Improvements
  • Increased range and lateral speed
  • Improved color quality under low-light capture and improved RGB texture for 3D scan
  • Improved color and depth stream synchronization
  • Decreased power consumption

Visit A Comparison of Intel® RealSense™ Front-Facing Camera SR300 and F200 to learn more about fast VGA and more new features and improvements.

Additional Intel® RealSense™ SDK Features Planned for the SR300 Camera

Future releases of the Intel® RealSense™ SDK will include great new updates and features: Auto-range, High Dynamic Range (HDR) mode, and Confidence Map.

Planned for Second Half of 2016

Auto-Range

Auto-range improves the image quality, especially at close range. It controls laser gain at close range and exposure at long range.

High Dynamic Range (HDR) Mode

High Dynamic Range (HDR) is a technique used to add more dynamic range to the image; the dynamic range is the ratio of light to dark in the image. With HDR mode enabled, images are reproduced with more detail. HDR mode is useful in low-light or backlit conditions and suits applications that can tolerate some frame rate variation.

With HDR mode enabled, images reveal more details of regular and highlighted hair:


Figure 2: Reveal more hair details.


Figure 3: Improve highlighted hair.

HDR mode will resolve confusing scenarios such as a black foreground over a black background, providing a significant improvement in background segmentation (BGS). HDR will be available only for BGS and initially may not be used at the same time as any other middleware. More information will be available in a future Intel RealSense SDK release.


Figure 4: Black hair over a black background.

Confidence Map

The Confidence map feature will provide a confidence value associated with the depth map in the range of 0–15. The low range of 0–4 will provide more depth accuracy while the full range will be helpful for blob segmentation, edge detection, and edge gap-fills.

SR300 Targeted Usages

  • Full-hand skeletal tracking and gesture recognition
  • Cursor mode
  • Head tracking
  • 3D segmentation and background removal
  • Depth-enhanced augmented reality
  • Voice command and control
  • 3D scanning: face and small object
  • Facial recognition

Camera Applications

Dynamic BGS

The User Segmentation module masks out the background when a user is in front of the camera so you can place the user’s face in a new background. This module is being integrated into video conferencing applications. With HDR mode enabled, the SR300 model provides high-quality masking and significantly improved color quality in low-light conditions.

3D Scanning

The SR300 model provides significantly improved color quality in low-light conditions, resulting in improved RGB texture that can be applied to the mesh to create a more appealing visualization than the F200 model. Either front-facing camera can scan the user’s face or a small object. However, with the SR300, the range for the capture increases to 70 cm at 50 FPS while achieving more details compared to the F200 model. You can use the Intel RealSense SDK to create a 3D scan and then use Sketchfab* to share it on Facebook*. For more information on Sketchfab, visit Implementing Sketchfab Login in your app and Sketchfab Integration. The 3D scan module is being integrated into AAA games in order to capture and use end-user face scans on in-game characters.

Hand Gesture Recognition – Cursor Mode Only Available in SR300

There are three main tracking modes supported by the hand module: cursor mode, extremities and full-hand mode. Cursor mode is a new feature that is only available with the SR300 camera. This mode returns a single point on the hand, allowing accurate and responsive tracking and gestures. Cursor mode is used when faster, lighter weight, more accurate hand tracking combined with a few highly robust gestures are sufficient. Cursor mode includes hand tracking movement and a click gesture. It provides twice the range and speed with no latency and low power consumption compared with full-hand mode.


Figure 5: Cursor Mode.

Dual-Array Microphones

The Intel RealSense camera SR300 has a microphone array consisting of two microphones that provide audio input to the client system. Using the two microphones improves the voice module robustness in noisy environments.

Intel® RealSense™ Camera SR300 Details

SR300 Camera

Range**: 0.2 meters to 1.2 meters, indoors and indirect sunlight
Depth/IR: 640x480 resolution at 60 FPS
Color camera**: Up to 1080p at 30 FPS, 720p at 60 FPS
Depth Camera**: Up to 640x480 at 60 FPS (Fast VGA, VGA), HVGA at 110 FPS
IR Camera**: Up to 640x480 at 200 FPS
Motherboard interfaces: USB 3.0, 5V, GND
Developer Kit Dimensions**: 110 mm x 12.6 mm x 3.8–4.1 mm
Weight**: 9.4 grams
Required OS: Microsoft Windows* 10 64-bit RTM
Language: C++, C#, Visual Basic*, Java*, JavaScript*

DCM Driver

The Intel® RealSense™ Depth Camera Manager (DCM) 3.x is required for the SR300 camera. As of this writing, the gold DCM version for the SR300 is DCM 3.0.24.59748, and updates will be provided in the Windows 10 Update. Visit the Intel RealSense SDK download page to download the latest DCM. For more information on the DCM, go to Intel RealSense Cameras and DCM Overview.

Firmware Updates

The Intel RealSense camera supports firmware updates provided by the DCM driver. If a firmware update is required, the DCM driver will prompt the user and the user must accept before proceeding.

Hardware Requirements

To support the bandwidth needed by the Intel RealSense camera, a powered USB 3.0 port is required on the client system. The SR300 camera requires Windows 10 and a 6th generation Intel® Core™ processor or later. For details on system requirements and supported operating systems for SR300 and F200, visit the page Buy a Dev Kit.

Summary

This document summarized the new features of the front-facing Intel RealSense camera SR300 as provided by current and future versions of the Intel RealSense SDK. Go here to download the latest Intel RealSense SDK. You can order the new camera at http://click.intel.com/intel-realsense-developer-kit.html

Helpful References

Here is a collection of useful references for the Intel RealSense DCM and SDK, including release notes and instructions for how to download and update the software.

About the Author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group working on Intel® Atom™ processor scale-enabling projects.

**Distances are approximate.

Middleware in Game Development


Download PDF [PDF: 251KB]


Middleware can have a number of different meanings in software development. In game development, middleware can be thought of in two ways: as the software between the kernel and the UX, and, more importantly here, as software that adds services, features, and functionality to improve the game as well as make game development easier. Whether you are looking for an entire game engine to develop your idea into a game, or an efficient, easy-to-use video codec to deploy full-motion video, this list will guide you to the best middleware to use while developing your game for Intel® architecture.

Game Engines

A game engine typically encapsulates the rendering, physics, sound, input, networking, and artificial intelligence. If you are not building your own engine, then you will need to use a commercial version. The game engines below have been heavily optimized for Intel® hardware, ensuring that your game runs great no matter which Intel® platform you choose to develop for.

Engine / Description / Intel Resources

Unreal* Engine 4

Unreal Engine 4 powers some of the most visually stunning games in existence while being easy to learn. Blueprints visual scripting lets you jump in with no programming experience, or you can go the traditional route and use C++. Unreal supports cross-platform game development on your Intel® processor-based PC and devices.

https://software.intel.com/en-us/articles/Unreal-Engine-4-with-x86-Support

Unity* 5

Unity 5 is extremely easy to learn and supports programming in both UnityScript and C#. Unity supports cross-platform game development on your Intel processor-based PC and devices.

https://software.intel.com/en-us/articles/unity

Cocos2d-x

Cocos2d-x is an open source game engine that supports cross-platform 2D game development on your Intel processor-based PC and devices. Cocos2d-x supports C++, JavaScript*, and Lua and allows developers to use the same code across all platforms.

https://software.intel.com/en-us/articles/creating-multi-platform-games-with-cocos2d-x-version-30-or-later

Marmalade

Marmalade is designed as a write once, execute anywhere engine. Developers can access low-level platform features for memory management and file access, while using C++ or Objective-C* for game scripting. Marmalade supports cross-platform game development on your Intel processor-based PC and devices.

https://software.intel.com/en-us/android/articles/marmalade-c-and-shiva3d-a-game-engine-guide-for-android-x86

libGDX

libGDX is an open-source, cross-platform game development framework for Windows*, Linux*, OS X*, iOS*, Android*, and Blackberry* platforms and WebGL-enabled browsers. It supports multiple Java* Virtual Machine languages.

https://software.intel.com/en-us/android/articles/preparing-libgdx-to-natively-support-intel-x86-cpus-running-android

Optimization Tools

Intel provides a number of tools for analyzing and optimizing your game. Does a particular section of your game cause long frame draw times? Do you want to optimize your code for multicore performance? Intel’s optimization tools can help you unleash the full performance of Intel hardware.

Intel Optimization Tools / Description / Intel Resources

Graphics Performance Analyzers (GPA)

GPA is a set of powerful, agile tools enabling game developers to get full performance out of their gaming platform, including (but not limited to) Intel® Core™ processors and Intel® HD graphics, as well as Intel processor-based tablets running Android.

https://software.intel.com/en-us/gpa/faq

Intel® VTune™ Amplifier

Intel VTune Amplifier gives insight into threading performance and scalability, bandwidth, caching, and much more. Analysis is faster and easier because VTune Amplifier understands common threading models and presents information at a higher, easily understood level.

https://software.intel.com/en-us/get-started-with-vtune

Intel® Compiler Tools

Intel Compiler tools generate code that unlocks the full horsepower of Intel processors.

https://software.intel.com/en-us/compiler_15.0_ug_c

Intel® Threading Building Blocks (Intel® TBB)

Intel TBB lets you easily write parallel C++ programs. These parallel programs take full advantage of multicore performance, are portable and composable, and have future-proof scalability. (A minimal usage sketch follows this table.)

https://software.intel.com/en-us/android/articles/android-tutorial-writing-a-multithreaded-application-using-intel-threading-building-blocks
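As a minimal illustration of the loop-level parallelism Intel TBB provides (a generic sketch, not taken from an Intel sample), the function below brightens an image buffer using all available cores:

#include <tbb/parallel_for.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Brighten every pixel in parallel; tbb::parallel_for splits the index range
// across worker threads automatically.
void BrightenParallel(std::vector<uint8_t>& pixels, uint8_t amount)
{
    tbb::parallel_for(std::size_t(0), pixels.size(), [&](std::size_t i) {
        const int v = pixels[i] + amount;
        pixels[i] = static_cast<uint8_t>(v > 255 ? 255 : v); // saturate at 255
    });
}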

Other tools to consider

Using these additional tools can further specialize your game. Generate realistic looking vegetation with efficient levels of detail (LODs), compose your Mozart-like audio masterpiece, or improve your global illumination with lifelike shadows and lighting. If you’re looking to push the limits of what is possible in game technology, consider the tools below.

Audio / Description

Wwise*

Multithreaded high-quality audio that integrates easily into multiple game engines and is easily deployed to multiple platforms.

FMOD*

FMOD is a suite of tools for both game development and sound deployment. FMOD studio is an audio creation tool for authoring sounds for your game, while FMOD Ex is a playback engine for sound, with cross-platform compatibility and support for a variety of engines including Unity, Unreal, Cocos2d, and Havok*.

Lighting / Description

Beast*

Autodesk’s Beast provides high-quality global illumination, simulating physically correct real-time lighting.

GUI / Description

Scaleform*

Autodesk’s Scaleform creates menu systems that are both lightweight and feature-rich. Scaleform supports multithreaded rendering, is easy to implement, and supports DirectX* 12.

Misc. / Description

Bink* 2

Bink is a video codec with a self-contained library that does not require software installation. Bink supports multicore CPUs, such as 6th generation Intel processors, for smooth video playback of your game.

SpeedTree*

SpeedTree generates realistic trees with LODs for your game. SpeedTree supports per-instance and per-vertex hue generation to reduce the number of assets for your game, as well as shader optimizations for Intel HD graphics.

Umbra

Umbra is multicore-optimized occlusion-culling middleware with integration support for the Unity and Unreal engines.

Simplygon*

Simplygon automatically generates new LODs by intelligently reducing the number of polygons in a model to the level of detail each LOD requires.

Feedback

We value your input! Feel free to comment if you have middleware you’d like to see added to this list. And share screenshots of what you’re working on with middleware in the comments section below. 

API without Secrets: Introduction to Vulkan* Preface


Download PDF (456K)

Link to Github Sample Code

About the Author

I have been a software developer for over 9 years. My main area of interest is graphics programming, and most of my professional career has been involved in 3D graphics. I have a lot of experience in OpenGL* and shading languages (mainly GLSL and Cg), and for about 3 years I also worked with Unity* software. I have also had opportunities to work on some VR projects that involved working with head-mounted displays like Oculus Rift* or even CAVE-like systems.

Recently, with our team here at Intel, I was involved in preparing validation tools for our graphics driver’s support for the emerging API called Vulkan. This graphics programming interface and the approach it represents is new to me. The idea came to me that while I’m learning about it I can, at the same time, prepare a tutorial for writing applications using Vulkan. I can share my thoughts and experiences as someone who knows OpenGL and would like to “migrate” to its successor.

About Vulkan

Vulkan is seen as OpenGL’s successor. It is a multiplatform API that allows developers to prepare high-performance graphics applications like games, CAD tools, benchmarks, and so forth. It can be used on different operating systems like Windows*, Linux*, or Android*. The Khronos consortium created and maintains Vulkan. Vulkan also shares some other similarities with OpenGL, including graphics pipeline stages, GLSL shaders (sort of), and nomenclature.

But there are many differences that confirm the need for the new API. OpenGL has been changing for over 20 years. Many things have changed in the computer industry since the early 90s, especially in graphics card architecture. OpenGL is a good library, but not everything can be done by only adding new functionality that matches the abilities of new graphics cards. Sometimes a huge redesign has to be made. And that’s why Vulkan was created.

Vulkan was based on Mantle*—the first in a series of new low-level graphics APIs. Mantle was developed by AMD and designed only for the architecture of Radeon cards. And despite it being the first publicly available API, games and benchmarks that used Mantle saw some impressive performance gains. Then other low-level APIs started appearing, such as Microsoft’s DirectX* 12, Apple’s Metal* and now Vulkan.

What is the difference between traditional graphics APIs and new low-level APIs? High-level APIs like OpenGL are quite easy to use. The developer declares what they want to do and how they want to do it, and the driver handles the rest. The driver checks whether the developer uses API calls in the proper way, whether the correct parameters are passed, and whether the state is adequately prepared. If problems occur, feedback is provided. For ease of use, many tasks have to be done “behind the scenes” by the driver.

In low-level APIs the developer is the one who must take care of most things. They are required to adhere to strict programming and usage rules and also must write much more code. But this approach is reasonable. The developer knows what they want to do and what they want to achieve. The driver does not, so with traditional APIs the driver has to make additional effort for the program to work properly. With APIs like Vulkan this additional effort can be avoided. That’s why DirectX 12, Metal, or Vulkan are called thin-drivers/thin-APIs. Mostly they only communicate user requests to the hardware, providing only a thin abstraction layer of the hardware itself. The driver does as little as possible for the sake of much higher performance.

Low-level APIs require additional work on the application side. But this work can’t be avoided. Someone or something has to do it. So it is much more reasonable for the developer to do it, as they know how to divide work into separate threads, when the image would be a render target (color attachment) or used as a texture/sampler, and so on. The developer knows what pipeline state or what vertex attributes changes more often. All that leads to far more effective use of the graphics card hardware. And the best part is that it works. An impressive performance boost can be observed.

But the word “can” is important. It requires additional effort but also a proper approach. There are scenarios in which no difference in performance between OpenGL and Vulkan will be observed. If someone doesn’t need multithreading or if the application isn’t CPU bound (rendered scenes aren’t too complex), OpenGL is enough and using Vulkan will not give any performance boost (but it may lower power consumption, which is important on mobile devices). But if we want to squeeze every last bit from our graphics hardware, Vulkan is the way to go.

Sooner or later all major graphics engines will support some, if not all, of the new low-level APIs. So if we want to use Vulkan or other APIs, we won’t have to write everything from scratch. But it is always good to know what is going on “under the hood”, and that’s the reason I have prepared this tutorial.

A Note about the Source Code

I’m a Windows developer. When given a choice I write applications for Windows. That’s because I don’t have experience with other operating systems. But Vulkan is a multiplatform API and I want to show that it can be used on different operating systems. That’s why I’ve prepared a sample project that can be compiled and executed both on Windows and Linux.

Source code for this tutorial can be found here:

https://github.com/GameTechDev/IntroductionToVulkan

I have tried to write code samples that are as simple as possible and to not clutter the code with unnecessary “#ifdefs”. Sometimes this can’t be avoided (like in window creation and management) so I decided to divide the code into small parts:

  • Tutorial files are the most important here. They are the ones where all the exciting Vulkan-related code is placed. Each lesson is placed in one header/source pair.
  • OperatingSystem header and source files contain OS-dependent parts of code like window creation, message processing, and rendering loops. These files contain code for both Linux and Windows, but I tried to unify them as much as possible.
  • main.cpp file is a starting point for each lesson. As it uses my custom Window class it doesn’t contain any OS-specific code.
  • VulkanCommon header/source files contain the base class for all tutorials starting from tutorial 3. This class basically replicates tutorials 1 and 2—creation of a Vulkan instance and all other resources necessary for the rendered image to appear on the screen. I’ve extracted this preparation code so the code of all the other chapters could focus on only the presented topics.
  • Tools contain some additional utility functions and classes, like a function that reads the contents of a binary file or a wrapper class for automatic object destruction (a sketch of this idea follows this list).
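For readers who have not seen the pattern, here is a rough sketch of what such an automatic-destruction wrapper can look like; the names are illustrative, and the tutorial's own Tools classes may differ:

#include <functional>

// Holds an object plus the function that destroys it; the deleter runs when the
// wrapper goes out of scope, so every early return still releases the resource.
template <typename T>
class AutoDeleter {
public:
    AutoDeleter(T object, std::function<void(T)> deleter)
        : Object(object), Deleter(deleter) {}
    ~AutoDeleter() { if (Deleter) Deleter(Object); }

    AutoDeleter(const AutoDeleter&) = delete;
    AutoDeleter& operator=(const AutoDeleter&) = delete;

    T Get() const { return Object; }

private:
    T Object;
    std::function<void(T)> Deleter;
};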

The code for each chapter is placed in a separate folder. Sometimes it may contain an additional Data directory in which resources like shaders or textures for a given chapter are placed. This Data folder should be copied to the same directory in which executables will be held. By default executables are compiled into a build folder.

Right. Compilation and build folder. As the sample project should be easily maintained both on Windows and Linux, I’ve decided to use a CMakeLists.txt file and the CMake tool. On Windows there is a build.bat file that creates a Visual Studio* solution—Microsoft Visual Studio 2013 is required to compile the code on Windows (by default). On Linux I’ve provided a build.sh script that compiles the code using make, but CMakeLists.txt can also be easily opened with tools like Qt. CMake is of course also required.

Solution and project files are generated and executables are compiled into the build folder. This folder is also the default working directory, so the Data folders should be copied into it for the lessons to work properly. During execution, in case of any problems, additional information is “printed” in cmd/terminal. So if there is something wrong, run the lesson from the command line/terminal or look into the console/terminal window to see if any messages are displayed.

I hope these notes will help you understand and follow my Vulkan tutorial. Now let’s focus on learning Vulkan itself!

 

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800- 548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

Tutorial: Using Intel® RealSense™ Technology in the Unreal Engine* 3 - Part 2


Download PDF 854 KB

Part 1

Setting Up Visual Studio 2010 for the Example Game

The steps below set your map file as the default map for the Example game by modifying the .ini file.

  1. Go to <UE3 Source>\ExampleGame\Config.

  2. Open DefaultEngine.ini and change as shown below.

    [URL]

    MapExt=umap

    Map=test.umap

    LocalMap=BLrealsense_Map.umap

    TransitionMap=BLrealsense_Map.umap

    EXEName=ExampleGame.exe

    DebugEXEName=DEBUG-ExampleGame.exe

    GameName=Example Game

    GameNameShort=EG

  3. Open ExampleEngine.ini and change as listed.

    [URL]

    Protocol=unreal

    Name=Player

    Map=test.umap

    LocalMap=BLrealsense_Map.umap

    LocalOptions=

    TransitionMap=BLrealsense_Map.umap

    MapExt=umap

    EXEName=ExampleGame.exe

    DebugEXEName=DEBUG-ExampleGame.exe

    SaveExt=usa

    Port=7777

    PeerPort=7778

    GameName=Example Game

    GameNameShort=EG

  4. Open the UE3 solution file UE3.sln, located in <UE3 source>\Development\Src, in Visual Studio.

    Figure 1: Microsoft Visual Studio* 2010.

  5. Build and run as in the previous steps. You will see the Unreal initial window and your game.

Using the Coordinate System in Unreal Engine

Before linking with the Intel® RealSense™ SDK, it is important to understand the coordinate system in Unreal.

Position is tracked along the X-Y-Z axes (refer to the “Origin” and “RotOrigin” classes in the UE3 source code), and rotation is expressed as Euler angles (P-Y-R) or as a quaternion (refer to https://en.wikipedia.org/wiki/Quaternion for more detail).


Figure 2: Coordinate system

A quaternion has one scalar component and three vector components. To convert from X-Y-Z Euler angles to a quaternion:
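The exact formula depends on the rotation-order convention, and inside UE3 the FQuat::MakeFromEuler helper used later in this tutorial performs the conversion for you. For reference only, here is a minimal plain C++ sketch of one standard form of the conversion:

#include <cmath>

struct Quat { float w, x, y, z; }; // one scalar part (w) and three vector parts

// Sketch (not UE3 code): build a quaternion from X (roll), Y (pitch), Z (yaw)
// angles given in radians. Other engines may compose the rotations in a
// different order, which changes the signs below.
Quat QuatFromEulerXYZ(float rollX, float pitchY, float yawZ)
{
    const float cr = std::cos(rollX  * 0.5f), sr = std::sin(rollX  * 0.5f);
    const float cp = std::cos(pitchY * 0.5f), sp = std::sin(pitchY * 0.5f);
    const float cy = std::cos(yawZ   * 0.5f), sy = std::sin(yawZ   * 0.5f);

    return Quat{
        cr * cp * cy + sr * sp * sy,  // w
        sr * cp * cy - cr * sp * sy,  // x
        cr * sp * cy + sr * cp * sy,  // y
        cr * cp * sy - sr * sp * cy   // z
    };
}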

Autoexpand Setup for a Debugger in Visual Studio 2010 (Optional)

By default, the debugging symbols for the bone structure, position, and rotation arrays are displayed in an unreadable form in Visual Studio. To see these debugging symbols, follow the steps below.

  1. Find your Autoexp.dat
     

    For Visual Studio and Windows 7 64-bit, it is located at C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\Packages\Debugger

  2. Find the debugging script and open it.
     

    <UE3 source>\Development\External\Visual Studio Debugging\AUTOEXP.DAT_addons.txt

  3. Copy each [AutoExpand] and [Visualizer] section into your Autoexp.dat.

Intel® RealSense™ SDK Enabling on Unreal Engine 3

This section describes Intel RealSense SDK-related changes in Unreal Engine 3 after installing the Intel RealSense SDK and Depth Camera Manager. Face landmark and head-pose tracking APIs in Intel RealSense SDK are used to manipulate facial expression and head movement of the example character. Head-pose tracking is intuitive since the roll, yaw, and pitch values can be used in Unreal Engine 3 as is, but face landmark tracking is more complicated.


Figure 3: Roll-Yaw-Pitch.

There are 76 traceable points for the face provided by the Intel RealSense SDK. Each expression, like blink or mouth open, has a value range with relevant points. For example, when the eye is closed, the distance between point 12 and point 16 will be around 0, and when the eye is open, the distance will be greater than 0 and varies for each individual.

Based on this, the current implementation performs a relative mapping between the user’s measured minimum/maximum values and the character’s minimum/maximum values. For example, for blinking, calculate and apply how far apart the game character’s eyelids should be for eyes open and closed according to the user.
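The mapping itself is a simple linear remap; the sketch below shows the idea in isolation (the engine code later in this article uses precomputed ratio variables such as innerEyeRatio and mouthRatio, which presumably hold the same quotient):

// Sketch: remap a distance measured on the user's face (rsMin..rsMax) onto the
// range the corresponding character bone can take (chMin..chMax).
float MapUserToCharacter(float userValue, float rsMin, float rsMax,
                         float chMin, float chMax)
{
    const float ratio = (chMax - chMin) / (rsMax - rsMin);
    return chMin + (userValue - rsMin) * ratio;
}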


Figure 4: Face landmarks and numbers of the Intel® RealSense™ SDK.

<UE3> is the home folder where UE3 is installed. The four files below are to be modified.

  • <UE3>\Development\Src\UnrealBuildTool\Configuration\UE3BuildConfiguration.cs
  • <UE3>\Development\Src\UnrealBuildTool\Configuration\UE3BuildWin32.cs
  • <UE3>\Development\Src\Engine\Inc\UnSkeletalMesh.h
  • <UE3>\Development\Src\Engine\Src\UnSkeletalComponent.cpp

UE3BuildConfiguration.cs (Optional)

public static bool bRealSense = true;

RealSense-related code is enclosed in “#if USE_REALSENSE” blocks. This flag controls whether the “USE_REALSENSE” symbol is defined in the UE3BuildWin32.cs file. If you set it to false, the RealSense-related code is not referenced or compiled. This step is optional.

UE3BuildWin32.cs

if (UE3BuildConfiguration.bRealSense)
{
    SetupRealSenseEnvironment();
}

void SetupRealSenseEnvironment()
{
    GlobalCPPEnvironment.Definitions.Add("USE_REALSENSE=1");
    String platform = (Platform == UnrealTargetPlatform.Win64 ? "x64" : "Win32");

    GlobalCPPEnvironment.SystemIncludePaths.Add("$(RSSDK_DIR)/include");
    FinalLinkEnvironment.LibraryPaths.Add("$(RSSDK_DIR)/lib/" + platform);

    if (Configuration == UnrealTargetConfiguration.Debug) {
        FinalLinkEnvironment.AdditionalLibraries.Add("libpxc_d.lib");
    } else {
        FinalLinkEnvironment.AdditionalLibraries.Add("libpxc.lib");
    }
}

This defines “USE_REALSENSE”, which is used to enable or disable the Intel RealSense SDK-related code in the source files (optional).

Since Unreal Engine 3 is a makefile-based solution, the Intel RealSense SDK header and library paths must be added to the project’s include and library paths.

UnSkeletalMesh.h

#if USE_REALSENSE
	PXCFaceData* faceOutput;
	PXCFaceConfiguration *faceConfig;
	PXCSenseManager *senseManager;

	void InitRealSense();
	void ReleaseRealSense();
#endif

This is the declaration part of the Intel RealSense SDK classes and functions. The bone structure manipulation is done in UpdateSkelPose() of UnSkeletalComponent.cpp.

UnSkeletalComponent.cpp

#if USE_REALSENSE
	#include "pxcfacedata.h"
	#include "pxcfacemodule.h"
	#include "pxcfaceconfiguration.h"
	#include "pxcsensemanager.h"

	FLOAT rsEyeMin = 6;
	FLOAT rsEyeMax = 25;

	FLOAT rsMouthMin = 5;
	FLOAT rsMouthMax = 50;

	FLOAT rsMouthWMin = 40;
	FLOAT rsMouthWMax = 70;

	FLOAT chMouthMin = -105;
	FLOAT chMouthMax = -75;
……
#endif

This includes the Intel RealSense SDK header files and defines minimum/maximum values for the user and the game character. Variables prefixed with “rs” hold the user’s values, and those prefixed with “ch” hold the game character’s values (these should be adjusted to the user’s and the game character’s appearance). For example, for blinking, these values define how far apart the game character’s eyelids should be for eyes open and closed according to the user.

void USkeletalMeshComponent::Attach()
{
……
#if USE_REALSENSE
	senseManager = NULL;
	InitRealSense();
#endif

The Attach() function calls the InitRealSense() function to initialize the Intel RealSense SDK’s relevant classes and configure the camera. 

#if USE_REALSENSE
void USkeletalMeshComponent::InitRealSense() {
	if (senseManager != NULL) return;

	faceOutput = NULL;

	senseManager = PXCSenseManager::CreateInstance();
	if (senseManager == NULL)
	{
 // error found
	}

	PXCSession *session = senseManager->QuerySession();
	PXCCaptureManager* captureManager = senseManager->QueryCaptureManager();

The InitRealSense() function configures which camera will be used, and creates the face-relevant class instances.

void USkeletalMeshComponent::UpdateSkelPose( FLOAT DeltaTime, UBOOL bTickFaceFX )
{
……
#if USE_REALSENSE
if (senseManager->AcquireFrame(false) >= PXC_STATUS_NO_ERROR) {
	faceOutput->Update();
	int totalNumFaces = faceOutput->QueryNumberOfDetectedFaces();
	if (totalNumFaces > 0) {

The UpdateSkelPose() function is used for head pose and face landmark tracking.

// Head
FVector v(yaw, roll, pitch);

LocalAtoms(6).SetRotation(FQuat::MakeFromEuler(v));
LocalAtoms(6).NormalizeRotation();

Head-pose tracking is intuitive because roll, yaw, and pitch values from the Intel RealSense SDK can be used as is.


Figure 5: Face landmarks and numbers that are used for eyes and mouth expression.

To express blinking, landmark points 12, 16 and 20, 24 are used, and points 47, 51, 33, and 39 are used for mouth expression (the detailed implementation depends on the developer’s preference).

// Mouth
FLOAT mouthOpen = points[51].image.y - points[47].image.y;
mouth = chMouthMax - (mouthOpen - rsMouthMin) * mouthRatio;

mouthOpen = points[47].image.x - points[33].image.x;
rMouthWOpen = chMouthWMin + (mouthOpen - rsMouthWMin) * mouthWRatio;

mouthOpen = points[39].image.x - points[47].image.x;
lMouthWOpen = chMouthWMin + (mouthOpen - rsMouthWMin) * mouthWRatio;

cMouth = chMouthCMax - (mouthOpen - rsMouthWMin) * mouthCRatio;
// Left Eye
FLOAT eyeOpen = points[24].image.y - points[20].image.y;
lEyeInner = chEyeInnerMin + (eyeOpen - rsEyeMin) * innerEyeRatio;
lEyeOuter = chEyeOuterMin + (eyeOpen - rsEyeMin) * outerEyeRatio;
lEyeUpper = chEyeUpperMin + (eyeOpen - rsEyeMin) * upperEyeRatio;
// Right Eye
eyeOpen = points[16].image.y - points[12].image.y;
rEyeInner = chEyeInnerMin + (eyeOpen - rsEyeMin) * innerEyeRatio;
rEyeOuter = chEyeOuterMin + (eyeOpen - rsEyeMin) * outerEyeRatio;
rEyeUpper = chEyeUpperMin + (eyeOpen - rsEyeMin) * upperEyeRatio;
rEyeLower = chEyeLowerMin + (eyeOpen - rsEyeMin) * lowerEyeRatio;

BN_Lips_Corner_R, BN_Lips_Corner_L, and BN_Jaw_Dum are used for mouth expression, and BN_Blink_UpAdd, BN_Blink_Lower, BN_Blink_Inner, and BN_Blink_Outer are used to express eye blinking. (Refer to the “Facial Bone Structure in Example Characters” section for each bone number.)

// Mouth
FVector m(90, 0, mouth);
LocalAtoms(59).SetRotation(FQuat::MakeFromEuler(m));

LocalAtoms(57).SetTranslation(FVector(mouthWXZ[2], rMouthWOpen, mouthWXZ[3])); // Right side
LocalAtoms(58).SetTranslation(FVector(mouthWXZ[4], lMouthWOpen * -1, mouthWXZ[5])); // Left side

// Left Eye
LocalAtoms(40).SetTranslation(FVector(eyeXY[0], eyeXY[1], lEyeUpper)); // Upper
LocalAtoms(41).SetTranslation(FVector(eyeXY[2], eyeXY[3], lEyeLower)); // Lower
LocalAtoms(42).SetTranslation(FVector(eyeXY[4], eyeXY[5], lEyeInner)); // Inner
LocalAtoms(43).SetTranslation(FVector(eyeXY[6], eyeXY[7], lEyeOuter)); // Outer

// Right Eye
LocalAtoms(47).SetTranslation(FVector(eyeXY[8], eyeXY[9], rEyeLower)); // Lower
LocalAtoms(48).SetTranslation(FVector(eyeXY[10], eyeXY[11], rEyeOuter)); // Outer
LocalAtoms(49).SetTranslation(FVector(eyeXY[12], eyeXY[13], rEyeInner)); // Inner
LocalAtoms(50).SetTranslation(FVector(eyeXY[14], eyeXY[15], rEyeUpper)); // Upper
void USkeletalMeshComponent::ReleaseRealSense() {
	if (faceOutput)
		faceOutput->Release();

	faceConfig->Release();
	senseManager->Close();
	senseManager->Release();
}

Close and release all of the Intel RealSense SDK relevant class instances.

Facial Bone Structure in Example Characters

In the example, the face is designed with 58 bones. In the image, each box represents a bone. A complete list of bones follows.


Figure 6: Names of bones.

Conclusion

To make an avatar that mirrors the user’s facial movements and expressions and enriches the gaming experience in UE3 with the Intel RealSense SDK, modifying the UE3 source code is the only option, and developers must know which source files to change. We hope this document helps you when making an avatar in UE3 with the Intel RealSense SDK.

About the Authors

Chunghyun Kim is an application engineer in the Intel Software and Services Group. He focuses on game and graphic optimization on Intel® architecture.

Peter Hong is an application engineer at the Intel Software and Services Group. He focuses on enabling the Intel RealSense SDK for face, hand tracking, 3D scanning, and more.

For More Information

Epic Unreal Engine
https://www.unrealengine.com

Intel RealSense SDK
http://software.intel.com/realsense

Part 1

Tutorial: Using Intel® RealSense™ Technology in the Unreal Engine* 3 - Part 1


Download PDF 1.38 MB

Part 2

Introduction

Epic Games (https://epicgames.com/) Unreal Engine 3 (UE3) is a popular PC game engine. Intel® RealSense™ Technology is used for face and hand movement tracking to enrich the gaming experience. In the UE3 environment, using an Unreal script in the Unreal Development Kit (UDK) is the only recommended approach, and custom functions should be added into the UDK as a plug-in, but the Intel® RealSense™ SDK doesn’t provide a UDK plug-in.

This article describes how to add Intel RealSense SDK features to a game character in a massively multiplayer online role-playing game (MMORPG) on UE3 by using C++, not Unreal script. The common term for determining and modifying a character’s facial structure is “face-rigging.” There are several ways to handle face-rigging in a game, but we focus on using the Intel RealSense SDK to manipulate the characters’ facial bone structure, for performance and workload reasons.

Key points covered in the article include the following:

  • A description of the face-rigging method
  • How to use Autodesk 3ds Max* as part of the Unreal Engine
  • A description of the coordinate system in Unreal: X-Y-Z, Euler (Pitch-Yaw-Roll) and Quaternion
  • How to enable the Intel RealSense SDK on the Unreal Engine
  • Mapping the algorithm of the Intel RealSense SDK to the game engine

Face-Rigging within the Game

There are several ways to modify the underlying bone structure of characters, which we call face-rigging, in the game.

  • Animation with script. Pre-defined animation is the normal method in games, but it is hard to implement for real-time face-rigging. If you want to make a simple emotion expressed on a face, this method would be the best and easiest way. You can control animation using an Unreal script or Matinee.
  • Commercial face-rigging tool – FaceFX*. FaceFX is Unreal’s commercial face-rigging tool from OC3 Entertainment (https://www.facefx.com). It is prelicensed on Unreal Engine 3. FaceFX incorporates full body and face changes for characters.
  • Morph targeting. The previous Intel RealSense SDK face-rigging sample for Unity (named Angie) used morph targets. It is a simple way to integrate the Intel RealSense SDK within a game, but a morph target has to be made for each character. In the case of this MMORPG, there are from three to six tribes, and the player can modify a character’s face and body, so there are several thousand combinations. That requires several thousand morph face resources and performs worse than bone manipulation.
  • Bone manipulation. If the Intel RealSense SDK can drive bone manipulation, it is a good method for a real game. Even with several thousand character combinations, there are comparatively few face structures (tribes x males/females). Also, this method does not impact rendering and has minimal impact on gaming performance.

For example, the MMORPG game Bless* (http://bless-source.com/, http://bless.pmang.com/) has 10 tribes, but there are only eight face bone types for Elf (male/female), Lupus (male/female), Human (male/female), and Mascu (male/female). A full list of bone names is available in the Face Bone Structure section at the end of the document.


Figure 1: Game characters in MMORPG.

Environment

  • Tested machine: Intel® Core™ i7-5775C processor, 16 GB DDR, 256 GB solid-state drive (SSD). Recommended machine: a faster CPU such as a 6th generation Intel® Core™ i7 processor (code-named Skylake), a high-performance discrete graphics card, and SSDs with ample storage (more than 50 GB); SSDs rather than hard-disk drives are recommended for I/O bandwidth. Intel® RealSense™ camera (F200) or a 2D web camera.
  • Microsoft Windows* 7 64 bit
  • Autodesk 3ds Max 2014
  • Epic Games Unreal Engine 3 source code (required license)
  • Microsoft Visual Studio* 2010
  • Intel RealSense SDK 6.0.21.6598 and Intel® RealSense™ Depth Camera Manager 1.4.27.41944

Setup procedure

  1. Clean install Windows 7 64 bit on the machine and update Windows and drivers for each device
  2. Copy the UE3 source code to the local drive.
  3. Install Microsoft Visual Studio 2010 and update it. A debugging script must be included if you need debugging in Visual Studio 2010; refer to the Autoexpand setup for the debugger.
  4. (Optional) Install Autodesk 3ds Max 2014 if you need to export the FBX file from the MAX file.

Export MAX File to FBX File for Importing UE3

Most common 3D modeling tools, like Autodesk 3ds Max or Maya*, can export their 3D models to the Unreal Engine or Unity* through the FBX file format.

These steps are based on Autodesk 3ds Max 2014.

  1. Open the MAX file that contains the bone structure. You can see the bone positions and outlook as well.


    Figure 2: Open the MAX file.

  2. Export the file to FBX. Set the “by-object” mode to export correctly if you encounter a warning about the unsupported “by-layer” mode.


    Figure 3: Export to FBX.

  3. Select all objects, then right-click and select “Object Properties”.


    Figure 4: Export option.

  4. Click the “By Object” button to change to “by-layer” mode.


    Figure 5: Export option – the by-layer mode.

  5. Select Menu, and then select Export. Enter the export name in the FBX export option. You can select animation and bake animation to test the animation in UE3.


    Figure 6: Export with animation.

Import the FBX File into the UE3 Editor

If you are using a standalone UE3 build with the standard options (DirectX* 9, 32-bit), you can run the Unreal Editor.

  1. Run the UE3 Editor with the following commands:

    Run CMD and go to your UE3 folder (in my case, C:\UE3)
    Go to \Binaries\Win32\
    Run examplegame.exe editor -NoGADWarning


    Figure 7: Unreal Editor startup.

  2. In Content Browser, click Import, and then select the FBX file. Click OK to import. Once imported, you can see Animset, SkeletonMesh, and others.


    Figure 8: Unreal Editor - Content Browser.

  3. To check your imported FBX, right-click Animset and then select Edit Using AnimSet Viewer.


    Figure 9: Unreal Editor - AnimSet Viewer.

  4. You can adjust the scale and position of the face using the mouse buttons (left: rotation, middle: position, right: zoom). You can see the bone names on the left side and skeletons on the right side. If you play the animation, the time frame and delta of position and rotation are also visible.


    Figure 10: AnimSet Viewer.


    Figure 11: AnimSet Viewer - Adjust scale.

  5. Select the bone you want (these images use BN_Ear_R 53) and the X-Y-Z coordinate system. To move, drag each X-, Y-, or Z-axis arrow.


    Figure 12: AnimSet Viewer - Check Bone.

  6. To test rotation with Euler (pitch-yaw-roll), press the space bar. Changing the coordinate system displays the Euler coordinate system on the right ear. You can adjust the rotation as you drag each P-Y-R circle.


    Figure 13: Change the coordinate system.

Map and Level Creation in UE3 Editor

You can skip this section if you plan to use another existing map file or level. The steps in this section create a simple cube plus a light, a camera, and an actor (face bone).

  1. Run the UE3 Editor using the following commands:

    Run CMD and go to your UE3 folder (in my case: C:\UE3)
    Go to \Binaries\Win32\
    Run examplegame.exe editor -NoGADWarning

  2. Use one of the template levels, or make yourself a super basic level. Right-click the BSP Cube button. In the pop-up, enter 1024 for X, Y, Z, and enable “Hollow?” Then on the left toolbar, click Add.


    Figure 14: Unreal Editor - make layer.

  3. Fly into the cube using the WASD/arrow keys and the mouse, or alternatively drag around while holding the left/right/both mouse buttons to move the camera.


    Figure 15: Unreal Editor – start location.

  4. To add the game start location, right-click the floor, and then select Add Actor, then select Add Playerstart.

  5. To add light, right-click the wall, and then select Add Actor, then select Add Light(Point).


    Figure 16: Unreal Editor - Add light.

  6. To add an actor (face bone), press the Tab key to move to the Content Browser and drag the skeleton mesh into the UE Editor.


    Figure 17: Unreal Editor - Add Actor.

  7. To adjust scaling, enter a scaling number on the bottom right.

  8. To adjust position, select the Translation mode icon on the upper-left side; move your character with X-Y-Z.

  9. To adjust rotation, select the Rotation mode icon on the upper-left side; move your character with P-Y-R.


    Figure 18: Unreal Editor - Adjust rotation.

  10. Save the level with a name. In this instance, I used “test.umap” in <UE source>\ExampleGame\Content\Maps


    Figure 19: Unreal Editor – save.

  11. Finally, build it all. From the menu, select Build, and then select Build All.

  12. To check your map, click Play or press Alt+F8.


    Figure 20: Unreal Editor – Build.

  13. Save and exit the UE Editor.

About the Authors

Chunghyun Kim is an application engineer in the Intel Software and Services Group. He focuses on game and graphic optimization on Intel® architecture.

Peter Hong is an application engineer at the Intel Software and Services Group. He focuses on enabling the Intel RealSense SDK for face, hand tracking, 3D scanning, and more.

Part 2

Intel® Parallel Studio XE 2017 Beta


Contents

How to enroll in the Beta program

Complete the pre-beta survey at the registration link.

  • Information collected from the pre-beta survey will be used to evaluate beta testing coverage. Here is a link to the Intel Privacy Policy.
  • Keep the beta product serial number provided for future reference
  • After registration, you will be taken to the Intel Registration Center to download the product
  • After registration, you will be able to download all available beta products at any time by returning to the Intel Registration Center

Note: At the end of the beta program you should uninstall the beta product software.

What's New in the 2017 Beta

A detailed description of the new features in the 2017 Beta products is available in the Intel® Parallel Studio XE 2017 Beta What's New document. You can also view the Release Notes for the suite or individual components. Download and try sample programs for the Beta products for the OS X* platform. Linux* and Windows* versions of the samples will be available shortly as well.

Frequently Asked Questions

A complete list of FAQs regarding the 2017 Beta can be found in the Intel® Parallel Studio XE 2017 Beta Program: Frequently Asked Questions document.

Beta duration and schedule

The beta program officially ends June 28th, 2016. The beta license provided will expire October 7th, 2016. Starting June 6th, 2016, you will be able to complete a survey regarding your experience with the beta software.

During the Beta feedback period, we will provide one periodic update. Here is a rough schedule of those milestones:

  • May 25th: Intel® Parallel Studio XE 2017 Beta Update 1
  • June 6th: Beta completion surveys are made available early
  • June 28th: Beta closes and post-beta surveys are sent

Note that once you register for the Beta, you will be notified automatically when updates are made available.

Support

Technical support will be provided via Intel® Premier Customer Support. The Intel® Registration Center will be used to provide updates to the component products during this beta period.

Beta Webinars

Want to know more about the 2017 Beta features in the Intel® Parallel Studio XE? Register for the webinars and view the webinar archives here at the Intel® Software Tools Technical Webinar Series page.

Beta Release Notes

Check out the Release Notes for the Intel® Parallel Studio XE 2017 Beta and the various constituent components.

Known Issues

This section contains information on known issues (plus associated fixes) of the 2017 Beta versions of the Intel® Parallel Studio XE tools. Check back often for updates.

Compiler Fixes page

Visit the Compiler Fixes page for a list of defect fixes and feature requests that have been incorporated into Intel® C++ and Fortran Compilers 17.0 Beta component in Intel® Parallel Studio XE 2017 Beta. Defects and feature requests described within represent specific issues with specific test cases.

Next Steps

Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
* Other names and brands may be claimed as the property of others.
Copyright © 2016, Intel Corporation. All rights reserved.


How Intel® Media SDK Screen Capture Plug-in Plays with Video-Streaming Cloud Gaming


Introduction

Nowadays, cloud gaming is becoming more and more popular. The benefits are obvious: instead of downloading and installing large files, you can just connect to the server from a variety of devices.

How Video-Streaming Cloud Games Play

Let’s take a look at how it works in the case of video-streaming cloud games. First, a user runs a game on their device. The application then sends a signal to the servers, the best available server in the cloud is chosen, and the user is connected to it. The game itself is stored and runs on the server. To let the user see the video stream, the server application captures the image, encodes it, and sends it over the Internet to the client app, which decodes the received pictures. Next, the user performs actions such as pressing buttons on a keyboard or game controller, and this input is sent back to the server. This cycle continues as long as the user plays the game.

 

As we can see, screen capturing is one of the essential parts of cloud gaming. Screen capture speed is extremely important here, because we want to play in real time without freezes, interruptions, or latency. Beyond cloud gaming, the screen capture feature is also useful for remote desktop and other content-capture scenarios.

Optimize Cloud Game-Play with Intel® Media SDK Screen Capture Plug-in

Intel has a solution, especially for cloud gaming developers, to help quickly manage these challenges: an optimized Screen Capture plug-in as part of the Intel® Media SDK and Intel® Media Server Studio. This plug-in is a development library that enables the media acceleration capabilities of Intel® platforms for Windows* desktop front-frame buffer capturing. The Intel Media SDK Screen Capture package includes a hardware-accelerated plug-in library that exposes graphics acceleration capabilities and is implemented as an Intel Media SDK Decode plug-in. The plug-in can only be loaded into and used with the Intel Media SDK Hardware Library, and it combines the Intel-powered encoder, decoder, and screen capturing to make your game server work faster.

The Screen Capture procedure can use the decode plug-in together with other SDK components. Captured display frames are available in NV12 or RGB4 formats. Moreover, the following screen capturing features are supported:

  • Display Selection - The ability to choose which display to capture on systems with virtual displays enabled. The display selection feature is available only on systems with virtual displays (no physical display connected) and with the RGB4 output FourCC format.
  • Dirty Rectangles Detection - This function makes it possible to detect only the changed regions of the captured display image (a conceptual sketch follows this list).
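To give a feel for what dirty-region detection means (a conceptual sketch only, not how the plug-in is implemented), consecutive frames can be compared tile by tile and only the changed tiles reported:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct DirtyBlock { int x; int y; }; // top-left corner of a changed tile

// Compare two frames with identical layout in blockSize x blockSize tiles and
// report the tiles whose pixels differ. Real implementations typically merge
// neighboring tiles into larger rectangles before encoding.
std::vector<DirtyBlock> FindDirtyBlocks(const uint8_t* prev, const uint8_t* curr,
                                        int width, int height, int bytesPerPixel,
                                        int blockSize = 16)
{
    std::vector<DirtyBlock> dirty;
    const std::size_t stride = static_cast<std::size_t>(width) * bytesPerPixel;
    for (int by = 0; by < height; by += blockSize) {
        for (int bx = 0; bx < width; bx += blockSize) {
            const int rowBytes = std::min(blockSize, width - bx) * bytesPerPixel;
            bool changed = false;
            for (int y = by; y < std::min(by + blockSize, height) && !changed; ++y) {
                const std::size_t offset = y * stride + static_cast<std::size_t>(bx) * bytesPerPixel;
                changed = (std::memcmp(prev + offset, curr + offset, rowBytes) != 0);
            }
            if (changed) dirty.push_back({bx, by});
        }
    }
    return dirty;
}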

Set-up Steps for the Intel Media SDK Screen Capture Plug-in

  1. Download and install Intel® Media SDK or Intel® Media Server Studio.
  2. The screen capture plug-in is available at the following path: <Installed_Directory>/Intel_Media_SDK_XX/bin/x64(win32)/22d62c07e672408fbb4cc20ed7a053e4.
  3. Download and install latest code samples package

Launch Screen Capture Plug-in

Following is an example command line to run the screen capture plug-in with sample_decode:

sample_decode.exe capture -w [Width] -h [Height] -d3d11 -p 22d62c07e672408fbb4cc20ed7a053e4 -o [output] -n [number of frames] -hw

(If plug-in is installed at a different directory, provide the complete path to the plug-in or copy the plug-in to the same folder directory before running sample_decode.)

On hybrid graphics (Intel Graphics + discrete graphics), screen capturing also supports a software implementation (replace the -hw parameter with -sw in the command line above). Note that the software implementation is not optimized; expect a performance drop compared with what you can achieve via the hardware implementation.

For a further understanding of the screen capture feature and ease of use, refer to the tutorial_screen_capture package attached below.

Limitations and Hardware Requirements

Refer to the Screen Capture Manual and Intel® Media SDK Release Notes.

Questions, Comments or Feedback? Connect with other developers and Intel media technical experts at the Intel Developer Zone Media Support Forum.

An overview of the 6th generation Intel® Core™ processor (code-named Skylake)


Introduction

The 6th generation Intel® Core™ processor (code-named Skylake) was launched in 2015. Based on improvements in the core, system-on-a-chip, and platform levels and new capabilities over the previous-generation 14nm processor (code-named Broadwell), Skylake is the processor-of-choice for productivity, creativity, and gaming applications across various form factors. This article provides an overview of the key capabilities and improvements in Skylake, along with exciting new usages like wake on voice and biometric login using Windows* 10.

Skylake architecture

The 6th generation Intel Core microarchitecture is built on 14nm technology that takes into consideration reduced processor and platform size for use in multiple form factors, Intel® architecture and graphics performance improvements, power reduction, and enhanced security features. Figure 1 illustrates these new capabilities and improvements. Actual configuration in OEM devices may vary.

Figure 1: Skylake architecture and improvement summary [1].

Core processor vectors

Performance

Performance improvement is a direct result of providing more instructions to the execution unit—more instructions executed per clock. This was accomplished through four categories of improvements [Ibid]:

  • Improved front-end. Smarter branch prediction with higher capacity creates wider instruction decoding, and faster and more efficient prefetch.
  • Enhanced instruction parallelism. With more instructions per clock, the parallelism of instruction execution is improved through deeper out-of-order buffers.
  • Improved execution units (EUs). The EUs are enhanced compared to the previous generations through:
    • Shortening latencies
    • Increased number of EUs
    • Improved power efficiency of turning off units not in use
    • Increased security algorithms execution speed.
  • Improved memory subsystem. With improvements to the front-end, instruction parallelism, and EUs, the memory subsystem is also improved to scale to the bandwidth and performance requirements of the above. This has been accomplished through:
    • Higher load/store bandwidth
    • Prefetcher improvements
    • Deeper storage
    • Fill and write-back buffers
    • Improved page miss handling
    • Improvements to L2 cache miss bandwidth
    • New instructions for cache management

Figure 2: Skylake core uArchitecture at a glance.

Figure 3 shows the resulting increase in parallelism in Skylake compared to previous generations of processors (Sandy Bridge is the second generation and Haswell the 4th generation of Intel® Core™ processors).

Figure 3: Increased parallelism over past generations of processors.

The improvements shown in Figure 3 and more resulted in up to a 60-percent increase in performance compared to a five-year-old PC, with up to 6 times faster video transcoding and up to 11 times the graphics performance.

Figure 4: Performance of 6th generation Intel® Core™ processor compared to a five-year-old PC.

  1. Source: Intel Corporation. Based on estimated SYSmark* 2014 scores comparing Intel® Core™ i5-6500 and Intel® Core™ i5-650 processors.
  2. Source: Intel Corporation. Based on estimated Handbrake w/ QSV scores comparing Intel® Core™ i5-6500 and Intel® Core™ i5-650 processors.
  3. Source: Intel Corporation. Based on estimated 3DMark* “Cloud Gate” scores comparing Intel® Core™ i5-6500 and Intel® Core™ i5-650 processors.

Detailed performance benchmarks for desktop and laptop systems can be found at the following links:

Desktop performance benchmark: http://www.intel.com/content/www/us/en/benchmarks/desktop/6th-gen-core-i5-6500.html

Laptop performance benchmark: http://www.intel.com/content/www/us/en/benchmarks/laptop/6th-gen-core-i5-6200u.html

Energy efficiency

Resource configuration based on dynamic consumption:

Legacy systems use the Intel® SpeedStep® technology for balancing performance with energy efficiency through a demand-based algorithm controlled by the OS. While this works well for steady workloads, it is not optimal for bursty workloads. In Skylake, Intel® Speed Shift Technology shifts control from the OS to the hardware and allows the processor to go to a maximum clock speed in ~1 ms, providing for finer-grained power management [3].

Figure 5: Comparison of Intel® Speed Shift Technology with Intel® SpeedStep® technology.

On the Intel® Core™ i5-6200U processor, the chart below gives the response time of Intel Speed Shift Technology compared to Intel SpeedStep technology:

  • Up to 45-percent improved responsiveness
  • Photo enhancement up to 45 percent
  • Sales graphs up to 31 percent
  • Local notes up to 22 percent
  • Overall responsiveness up to 20 percent

[Measured by WebXPRT* 2015, a benchmark from Principled Technologies* that measures the performance of web applications using overall and subtests for photo enhancements, local notes, and sales graphs. Find out more at www.principledtechnologies.com.]

Additional power optimization is also achieved by configuring resources based on dynamic consumption, be it through downscaling of resources that are underutilized or power gating of Intel® Advanced Vector Extensions 2 when not in use, as well as through idle power reduction.

Media and graphics

Intel® HD Graphics capabilities have come a long way in terms of 3D graphics, media and display capabilities, performance, power envelopes and configurability/scalability since processor graphics (the core processor and graphics on the same die) was first introduced in the 2nd generation Intel® Core™ processors. Figure 6 compares some of these improvements that provide a >100X improvement in graphics performance [2].

Figure 6: Generational features in processor graphics (peak shader FLOPS @ 1 GHz).

Figure 7: Generational improvement in graphics and media.

Gen9 uArchitecture

The Generation 9 (Gen9) graphics architecture is similar to the Gen8 microarchitecture in the Intel® 5th generation Core™ processor code named Broadwell but has been enhanced for performance and scalability. Figure 8 shows a block diagram of the Gen9 uArchitecture [8], which has three main components.

  • Display. On the far left side.
  • Unslice. The L-shaped piece in the center; handles the command streamer, global thread dispatcher, and the Graphics Technology Interface (GTI).
  • Slice. Comprises the EUs.

Compared to Gen8, the Gen9 uArchitecture enables maximum performance per watt, throughput improvements, and a separate power/clock domain for the unslice component. This capability makes it more intelligent in terms of power management for uses like media playback. The slice component is configurable. For example, while GT3 can support up to 2 slices (each slice with 24 EUs), GT4 (Halo) can scale up to 3 slice units (GTx stands for the number of EUs based on use: GT1 supports 12, GT2 supports 24, GT3 supports 48, and GT4 supports 72). This architecture is configurable enough to allow for scaling down the number of EUs for low-power scenarios, thus allowing for usages that range from less than 4 W to more than 65 W. API support in Gen9 is available for DirectX* 12, OpenCL™ 2.x, OpenGL* 5.x, and Vulkan*.

Figure 8: Gen9 processor graphics architecture.

You can read more about these components in detail at (IDF link, https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf)

Some of the capabilities and improvements for media include the following [2]:

  • < 1 W consumption and 1 W videoconferencing
  • Camera RAW acceleration with new VQE functions to enable 4K60 RAW video on mobile platforms
  • New Intel® Quick Sync Video Fixed-Function (FF) Mode
  • Rich codec support with fixed function and GPU accelerated decode

Figure 9 gives a snapshot of Gen9 codecs.

Note: Media codec and processing support may not be available on all operating systems and applications.

Figure 9: Codec support in Skylake.

Some of the capabilities and improvements on the display include the following:

  • Panel Blend, Scale, Rotate, Compress
  • High PPI support (4K+)
  • Wireless support up to 4K30
  • Self Refresh (PSR2)
  • CUI X.X – New capabilities, improved performance

For gaming enthusiasts, the Intel® Core™ i7-6700K processor comes with all these rich features and improvements (see Figure 10). It also includes Intel® Turbo Boost Technology 2.0, Intel® Hyper-Threading Technology, and overclocking. Performance is 80 percent better compared to a 5-year-old PC. Additional information can be obtained here: http://www.intel.com/content/www/us/en/processors/core/core-i7ee-processor.html

  1. Source: Intel Corporation. Based on estimated SPECint*_rate_base2006 (8 copy rate) scores comparing Intel® Core™ i7-6700K and Intel® Core™ i7-875K processors.
  2. Source: Intel Corporation. Based on estimated SPECint*_rate_base2006 (8 copy rate) scores comparing Intel® Core™ i7-6700K and Intel® Core™ i7-3770K processors.
  3. Features are present with select chipsets and processor combinations. Warning: Altering clock frequency and/or voltage may (i) reduce system stability and useful life of the system and processor; (ii) cause the processor and other system components to fail; (iii) cause reductions in system performance; (iv) cause additional heat or other damage; and (v) affect system data integrity. Intel has not tested, and does not warranty, the operation of the processor beyond its specification.

Figure 10: Features in the Intel® Core™ i7-6700K processor.

Scalability

Skylake microarchitecture provides for a configurable core—a single design with two derivatives, one for the client space and another for servers—without compromising the power and performance requirements of each segment. Figure 11 shows the various SKUs and their power efficiencies for use in form factors that range from a compute stick on the low end to Intel® Xeon® processor-based workstations on the high end.

Figure 11: Intel® Core™ processor availability across various form factors.

Enhanced security features

Intel® Software Guard Extensions (Intel® SGX): Intel SGX is a set of new instructions provided in Skylake that allows application developers to protect sensitive data from unauthorized modification and access from rogue software running at higher privilege levels. This allows applications to preserve the confidentiality and integrity of sensitive information [1],[3]. Skylake provides instructions and flows to create secure enclaves and enables usage of trusted memory regions. More information about Intel SGX can be obtained here: https://software.intel.com/en-us/blogs/2013/09/26/protecting-application-secrets-with-intel-sgx

Intel® Memory Protection Extensions (Intel® MPX): Intel MPX is a new set of instructions that enables runtime buffer overflow checks. These instructions allow both stack and heap buffer boundaries to be tested before memory access, ensuring that the calling process only accesses memory that is allocated to it. Intel MPX support is enabled in Windows* 10, with support for Intel MPX intrinsics in Microsoft Visual Studio* 2015. Most C/C++ applications can use Intel MPX simply by recompiling, without source code changes, and can interoperate with legacy libraries. Running an Intel MPX-enabled binary on legacy systems without Intel MPX support (5th generation Intel® Core™ processors and earlier) provides no benefit, but also has no negative impact. It is also possible to enable/disable Intel MPX support dynamically [1], [3].
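To make the idea concrete, here is a minimal sketch (not from the cited sources) of the kind of off-by-one defect that MPX bounds checking is designed to catch at run time. The build switches are assumptions and depend on the toolchain; GCC releases of that era used -mmpx -fcheck-pointer-bounds, while Visual Studio 2015 exposed an MPX compile option of its own.

// Minimal sketch of a buffer overflow that Intel MPX bounds checking targets.
// Compile flags are toolchain-dependent assumptions (e.g., GCC:
// -mmpx -fcheck-pointer-bounds). Without MPX the store silently corrupts
// adjacent memory; with MPX enabled the out-of-bounds access is reported.
#include <cstdio>

static void fill(int *buffer, int count) {
  // Off-by-one: writes count + 1 elements into a count-element buffer.
  for (int i = 0; i <= count; ++i) {
    buffer[i] = i;
  }
}

int main() {
  int data[8];
  fill(data, 8);            // overflows 'data' by one element
  printf("%d\n", data[0]);
  return 0;
}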

We’ve covered the architectural improvements and advancements in Skylake. In the next section, we’ll look at some of the Windows 10 features that are optimized to take advantage of the Intel® Core™ processor architecture.

New experiences with Windows 10

The capabilities in the 6th generation Intel Core processor are accentuated by the capabilities within Windows 10, creating an optimal experience. Below are some of the key hardware capabilities from Intel and Windows 10 capabilities that make the Intel® platforms running on Windows 10 more energy efficient, secure, responsive, and scalable [3].

Ϯ Intel and Microsoft active collaboration under way for future Windows support.

Figure 12: Skylake and Windows* 10 capabilities.

Cortana

The Microsoft Cortana* Voice Assistant available with the Windows* 10 RTM allows for a hands-free experience using the “Hey Cortana” keyword spotter. While the wake-on-voice capability uses a CPU-based audio processing pipeline for a high Correct Accept and low False Accept rate, the capability can also be offloaded to a hardware audio DSP, which has built-in support on Windows 10 [3].

Windows Hello*

Using biometric hardware and Microsoft Passport*, Windows Hello supports various types of logins using the face, fingerprint, and iris for a password-free, out-of-the-box login experience. The user-facing Intel® RealSense™ camera (F200/SR300) supports biometric authentication using facial login [3].

Figure 13: Windows* Hello with Intel® RealSense™ Technology.

The photos in Figure 13 show how the facial landmarks provided by the F200 camera are used for the enrollment and login scenarios. The 78 landmark points on the face are used to create a facial template the first time a user tries to log in with face recognition. The next time the user tries to log in, the landmarks from the camera are verified against the template to obtain a match. Together with the Microsoft Passport security features and the camera features, the login capability provides an accuracy of a 1/100,000 False Acceptance Rate with a False Rejection Rate of 2 to 4 percent.

References

  1. Intel’s next generation microarchitecture code-named Skylake by Julius Mandelblat: http://intelstudios.edgesuite.net/idf/2015/sf/ti/150818_spcs001/index.html
  2. Next-generation Intel® processor graphics architecture, code-named Skylake, by David Blythe: http://intelstudios.edgesuite.net/idf/2015/sf/ti/150818_spcs003/index.html
  3. Intel® architecture code-named Skylake and Windows* 10 better together, by Shiv Koushik: http://intelstudios.edgesuite.net/idf/2015/sf/ti/150819_spcs009/index.html
  4. Skylake for gamers: http://www.intel.com/content/www/us/en/processors/core/core-i7ee-processor.html
  5. Intel’s best processor ever: http://www.intel.com/content/www/us/en/processors/core/core-processor-family.html
  6. Skylake Desktop Performance Benchmark: http://www.intel.com/content/www/us/en/benchmarks/desktop/6th-gen-core-i5-6500.html
  7. Skylake Laptop Performance Benchmark: http://www.intel.com/content/www/us/en/benchmarks/laptop/6th-gen-core-i5-6200u.html
  8. The compute architecture of Intel® processor graphics Gen9: https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf

Intel® IPP Functions Optimized for Intel® Advanced Vector Extensions 2 (Intel® AVX2)


Here is a list of Intel® Integrated Performance Primitives (Intel® IPP) functions that are optimized for Intel® Advanced Vector Extensions 2 (Intel® AVX2) on Haswell and on Intel® microarchitecture code name Skylake. These functions include Convert, CrossCorr, Max/Min, PolarToCart, Sort, and other arithmetic functions. The functions listed here are all hand-tuned for Intel® architecture; Intel IPP functions that are not listed here still benefit from optimizations performed by the Intel® compilers.
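As a quick illustration, the short sketch below (not part of the original list) calls one of the listed functions, ippsSortAscend_32f_I, from C++. No special code is needed to reach the AVX2 path; the IPP dispatcher selects the best implementation for the running CPU automatically.

#include <cstdio>
#include "ipps.h"

int main() {
  Ipp32f values[] = { 3.5f, -1.0f, 7.25f, 0.0f, 2.5f };
  const int len = sizeof(values) / sizeof(values[0]);

  // In-place ascending sort; returns ippStsNoErr on success. On Haswell and
  // Skylake CPUs the hand-tuned AVX2 code path is used under the hood.
  IppStatus status = ippsSortAscend_32f_I( values, len );
  if( status != ippStsNoErr ) {
    printf( "ippsSortAscend_32f_I failed: %d\n", (int)status );
    return 1;
  }

  for( int i = 0; i < len; ++i ) {
    printf( "%f\n", values[i] );
  }
  return 0;
}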

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

ippiConvert_16s16u_C1Rs
ippiConvert_16s32f_C1R
ippiConvert_16s32s_C1R
ippiConvert_16s8u_C1R
ippiConvert_16u32f_C1R
ippiConvert_16u32s_C1R
ippiConvert_16u8u_C1R
ippiConvert_16s8s_C1RSfs
ippiConvert_16u16s_C1RSfs
ippiConvert_16u8s_C1RSfs
ippiConvert_32f16s_C1RSfs
ippiConvert_32f16u_C1RSfs
ippiConvert_32f32s_C1RSfs
ippiConvert_32f8s_C1RSfs
ippiConvert_32f8u_C1RSfs
ippiCopy_16u_C1MR
ippiCopy_16u_C3MR
ippiCopy_32s_C1MR
ippiCopy_32s_C3MR
ippiCopy_32s_C4MR
ippiCopy_8u_C1MR
ippiCopy_8u_C1R
ippiCopy_8u_C3MR
ippiCopy_8u_C3P3R
ippiCopy_8u_C4P4R
ippiCopyConstBorder_16s_C3R
ippiCopyConstBorder_16s_C4R
ippiCopyConstBorder_16u_C1R
ippiCopyConstBorder_16u_C3R
ippiCopyConstBorder_16u_C4R
ippiCopyConstBorder_32f_C3R
ippiCopyConstBorder_32f_C4R
ippiCopyConstBorder_32s_C3R
ippiCopyConstBorder_32s_C4R
ippiCopyConstBorder_8u_C3R
ippiCopyConstBorder_8u_C4R
ippiCopyReplicateBorder_16s_C1IR
ippiCopyReplicateBorder_16s_C1R
ippiCopyReplicateBorder_16s_C3IR
ippiCopyReplicateBorder_16s_C3R
ippiCopyReplicateBorder_16s_C4IR
ippiCopyReplicateBorder_16s_C4R
ippiCopyReplicateBorder_16u_C1IR
ippiCopyReplicateBorder_16u_C1R
ippiCopyReplicateBorder_16u_C3IR
ippiCopyReplicateBorder_16u_C3R
ippiCopyReplicateBorder_16u_C4IR
ippiCopyReplicateBorder_16u_C4R
ippiCopyReplicateBorder_32f_C1IR
ippiCopyReplicateBorder_32f_C1R
ippiCopyReplicateBorder_32f_C3IR
ippiCopyReplicateBorder_32f_C3R
ippiCopyReplicateBorder_32f_C4IR
ippiCopyReplicateBorder_32f_C4R
ippiCopyReplicateBorder_32s_C1IR
ippiCopyReplicateBorder_32s_C1R
ippiCopyReplicateBorder_32s_C3IR
ippiCopyReplicateBorder_32s_C3R
ippiCopyReplicateBorder_32s_C4IR
ippiCopyReplicateBorder_32s_C4R
ippiCopyReplicateBorder_8u_C1IR
ippiCopyReplicateBorder_8u_C1R
ippiCopyReplicateBorder_8u_C3IR
ippiCopyReplicateBorder_8u_C3R
ippiCopyReplicateBorder_8u_C4IR
ippiCopyReplicateBorder_8u_C4R
ippiCopyMirrorBorder_16s_C1IR
ippiCopyMirrorBorder_16s_C1R
ippiCopyMirrorBorder_16s_C3IR
ippiCopyMirrorBorder_16s_C3R
ippiCopyMirrorBorder_16s_C4IR
ippiCopyMirrorBorder_16s_C4R
ippiCopyMirrorBorder_16u_C1IR
ippiCopyMirrorBorder_16u_C1R
ippiCopyMirrorBorder_16u_C3IR
ippiCopyMirrorBorder_16u_C3R
ippiCopyMirrorBorder_16u_C4IR
ippiCopyMirrorBorder_16u_C4R
ippiCopyMirrorBorder_32f_C1IR
ippiCopyMirrorBorder_32f_C1R
ippiCopyMirrorBorder_32f_C3IR
ippiCopyMirrorBorder_32f_C3R
ippiCopyMirrorBorder_32f_C4IR
ippiCopyMirrorBorder_32f_C4R
ippiCopyMirrorBorder_32s_C1IR
ippiCopyMirrorBorder_32s_C1R
ippiCopyMirrorBorder_32s_C3IR
ippiCopyMirrorBorder_32s_C3R
ippiCopyMirrorBorder_32s_C4IR
ippiCopyMirrorBorder_32s_C4R
ippiCopyMirrorBorder_8u_C1IR
ippiCopyMirrorBorder_8u_C1R
ippiCopyMirrorBorder_8u_C3IR
ippiCopyMirrorBorder_8u_C3R
ippiCopyMirrorBorder_8u_C4IR
ippiCopyMirrorBorder_8u_C4R
ippiCrossCorrNorm_32f_C1R
ippiCrossCorrNorm_16u32f_C1R
ippiCrossCorrNorm_8u32f_C1R
ippiCrossCorrNorm_8u_C1RSfs
ippiDilateBorder_32f_C1R
ippiDilateBorder_32f_C3R
ippiDilateBorder_32f_C4R
ippiDilateBorder_8u_C1R
ippiDilateBorder_8u_C3R
ippiDilateBorder_8u_C4R
ippiDistanceTransform_3x3_8u_C1R
ippiDistanceTransform_3x3_8u32f_C1R
ippiErodeBorder_32f_C1R
ippiErodeBorder_32f_C3R
ippiErodeBorder_32f_C4R
ippiErodeBorder_8u_C1R
ippiErodeBorder_8u_C3R
ippiErodeBorder_8u_C4R
ippiFilterBoxBorder_16s_C1R
ippiFilterBoxBorder_16s_C3R
ippiFilterBoxBorder_16s_C4R
ippiFilterBoxBorder_16u_C1R
ippiFilterBoxBorder_16u_C3R
ippiFilterBoxBorder_16u_C4R
ippiFilterBoxBorder_32f_C1R
ippiFilterBoxBorder_32f_C3R
ippiFilterBoxBorder_32f_C4R
ippiFilterBoxBorder_8u_C1R
ippiFilterBoxBorder_8u_C3R
ippiFilterBoxBorder_8u_C4R
ippiFilterLaplacianBorder_32f_C1R
ippiFilterLaplacianBorder_8u16s_C1R
ippiFilterMaxBorder_32f_C1R
ippiFilterMaxBorder_32f_C3R
ippiFilterMaxBorder_32f_C4R
ippiFilterMaxBorder_8u_C1R
ippiFilterMaxBorder_8u_C3R
ippiFilterMaxBorder_8u_C4R
ippiFilterMedianBorder_16s_C1R
ippiFilterMedianBorder_16u_C1R
ippiFilterMedianBorder_32f_C1R
ippiFilterMedianBorder_8u_C1R
ippiFilterMinBorder_32f_C1R
ippiFilterMinBorder_32f_C3R
ippiFilterMinBorder_32f_C4R
ippiFilterMinBorder_8u_C1R
ippiFilterMinBorder_8u_C3R
ippiFilterMinBorder_8u_C4R
ippiFilterScharrHorizMaskBorder_16s_C1R
ippiFilterScharrHorizMaskBorder_32f_C1R
ippiFilterScharrHorizMaskBorder_8u16s_C1R
ippiFilterScharrVertMaskBorder_16s_C1R
ippiFilterScharrVertMaskBorder_8u16s_C1R
ippiGetCentralMoment_64f
ippiGetNormalizedCentralMoment_64f
ippiGetSpatialMoment_64f
ippiHarrisCorner_32f_C1R
ippiHarrisCorner_8u32f_C1R
ippiHistogramEven_8u_C1R
ippiHoughLine_Region_8u32f_C1R
ippiLUTPalette_8u_C3R
ippiLUTPalette_8u_C4R
ippiMax_16s_C1R
ippiMax_16u_C1R
ippiMax_32f_C1R
ippiMax_8u_C1R
ippiMin_16s_C1R
ippiMin_16u_C1R
ippiMin_32f_C1R
ippiMin_8u_C1R
ippiMinEigenVal_32f_C1R
ippiMinEigenVal_8u32f_C1R
ippiMirror_16s_C1IR
ippiMirror_16s_C1R
ippiMirror_16s_C3IR
ippiMirror_16s_C3R
ippiMirror_16s_C4IR
ippiMirror_16s_C4R
ippiMirror_16u_C1IR
ippiMirror_16u_C1R
ippiMirror_16u_C3IR
ippiMirror_16u_C3R
ippiMirror_16u_C4IR
ippiMirror_16u_C4R
ippiMirror_32f_C1IR
ippiMirror_32f_C1R
ippiMirror_32f_C3IR
ippiMirror_32f_C3R
ippiMirror_32f_C4IR
ippiMirror_32f_C4R
ippiMirror_32s_C1IR
ippiMirror_32s_C1R
ippiMirror_32s_C3IR
ippiMirror_32s_C3R
ippiMirror_32s_C4IR
ippiMirror_32s_C4R
ippiMirror_8u_C1IR
ippiMirror_8u_C1R
ippiMirror_8u_C3IR
ippiMirror_8u_C3R
ippiMirror_8u_C4IR
ippiMirror_8u_C4R
ippiMoments64f_16u_C1R
ippiMoments64f_32f_C1R
ippiMoments64f_8u_C1R
ippiMul_16s_C1RSfs
ippiMul_16u_C1RSfs
ippiMul_32f_C1R
ippiMul_8u_C1RSfs
ippiMulC_16s_C1IRSfs
ippiMulC_32f_C1R
ippiSet_16s_C1MR
ippiSet_16s_C3MR
ippiSet_16s_C4MR
ippiSet_16u_C1MR
ippiSet_16u_C3MR
ippiSet_16u_C4MR
ippiSet_32f_C1MR
ippiSet_32f_C3MR
ippiSet_32f_C4MR
ippiSet_32s_C1MR
ippiSet_32s_C3MR
ippiSet_32s_C4MR
ippiSet_8u_C1MR
ippiSet_8u_C3MR
ippiSet_8u_C4MR
ippiSqr_32f_C1R
ippiSqrDistanceNorm_32f_C1R
ippiSqrDistanceNorm_8u32f_C1R
ippiSwapChannels_16u_C4R
ippiSwapChannels_32f_C4R
ippiSwapChannels_8u_C4R
ippiThreshold_GT_16s_C1R
ippiThreshold_GT_32f_C1R
ippiThreshold_GT_8u_C1R
ippiThreshold_GTVal_16s_C1R
ippiThreshold_GTVal_32f_C1R
ippiThreshold_GTVal_8u_C1R
ippiThreshold_LTVal_16s_C1R
ippiThreshold_LTVal_32f_C1R
ippiThreshold_LTVal_8u_C1R
ippiTranspose_16s_C1IR
ippiTranspose_16s_C1R
ippiTranspose_16s_C3IR
ippiTranspose_16s_C3R
ippiTranspose_16s_C4IR
ippiTranspose_16s_C4R
ippiTranspose_16u_C1IR
ippiTranspose_16u_C1R
ippiTranspose_16u_C3IR
ippiTranspose_16u_C3R
ippiTranspose_16u_C4IR
ippiTranspose_16u_C4R
ippiTranspose_32f_C1IR
ippiTranspose_32f_C1R
ippiTranspose_32f_C3IR
ippiTranspose_32f_C3R
ippiTranspose_32f_C4IR
ippiTranspose_32f_C4R
ippiTranspose_32s_C1IR
ippiTranspose_32s_C1R
ippiTranspose_32s_C3IR
ippiTranspose_32s_C3R
ippiTranspose_32s_C4IR
ippiTranspose_32s_C4R
ippiTranspose_8u_C1IR
ippiTranspose_8u_C1R
ippiTranspose_8u_C3IR
ippiTranspose_8u_C3R
ippiTranspose_8u_C4IR
ippiTranspose_8u_C4R
ippsDotProd_32f64f
ippsDotProd_64f
ippsFlip_16u_I
ippsFlip_32f_I
ippsFlip_64f_I
ippsFlip_8u_I
ippsMagnitude_32f
ippsMagnitude_64f
ippsMaxEvery_16u
ippsMaxEvery_32f
ippsMaxEvery_64f
ippsMaxEvery_8u
ippsMinEvery_16u
ippsMinEvery_32f
ippsMinEvery_64f
ippsMinEvery_8u
ippsPolarToCart_32f
ippsPolarToCart_64f
ippsSortAscend_16s_I
ippsSortAscend_16u_I
ippsSortAscend_32f_I
ippsSortAscend_32s_I
ippsSortAscend_64f_I
ippsSortAscend_8u_I
ippsSortDescend_16s_I
ippsSortDescend_16u_I
ippsSortDescend_32f_I
ippsSortDescend_32s_I
ippsSortDescend_64f_I
ippsSortDescend_8u_I
ippsSortIndexAscend_16s_I
ippsSortIndexAscend_16u_I
ippsSortIndexAscend_32f_I
ippsSortIndexAscend_32s_I
ippsSortIndexAscend_64f_I
ippsSortIndexAscend_8u_I
ippsSortIndexDescend_16s_I
ippsSortIndexDescend_16u_I
ippsSortIndexDescend_32f_I
ippsSortIndexDescend_32s_I
ippsSortIndexDescend_64f_I
ippsSortIndexDescend_8u_I
 
ippiAdd_8u_C1RSfs
ippiAdd_16u_C1RSfs
ippiAdd_16s_C1RSfs
ippiAdd_32f_C1R
ippiSub_8u_C1RSfs
ippiSub_16u_C1RSfs
ippiSub_16s_C1RSfs
ippiSub_32f_C1R
ippiMaxEvery_8u_C1R
ippiMaxEvery_16u_C1R
ippiMaxEvery_32f_C1R
ippiMinEvery_8u_C1R
ippiMinEvery_16u_C1R
ippiMinEvery_32f_C1R
ippiAnd_8u_C1R
ippiOr_8u_C1R
ippiXor_8u_C1R
ippiNot_8u_C1R
ippiCompare_8u_C1R
ippiCompare_16u_C1R
ippiCompare_16s_C1R
ippiCompare_32f_C1R
ippiSum_8u_C1R 
ippiSum_8u_C3R 
ippiSum_8u_C4R 
ippiSum_16u_C1R
ippiSum_16u_C3R
ippiSum_16u_C4R
ippiSum_16s_C1R
ippiSum_16s_C3R
ippiSum_16s_C4R
ippiSum_32f_C1R
ippiSum_32f_C3R
ippiSum_32f_C4R
ippiMean_8u_C1R 
ippiMean_8u_C3R 
ippiMean_8u_C4R 
ippiMean_16u_C1R
ippiMean_16u_C3R
ippiMean_16u_C4R
ippiMean_16s_C1R
ippiMean_16s_C3R
ippiMean_16s_C4R
ippiMean_32f_C1R
ippiMean_32f_C3R
ippiMean_32f_C4R
ippiNorm_Inf_8u_C1R
ippiNorm_Inf_8u_C3R 
ippiNorm_Inf_8u_C4R 
ippiNorm_Inf_16u_C1R
ippiNorm_Inf_16u_C3R
ippiNorm_Inf_16u_C4R
ippiNorm_Inf_16s_C1R
ippiNorm_Inf_16s_C3R
ippiNorm_Inf_16s_C4R
ippiNorm_Inf_32f_C1R
ippiNorm_Inf_32f_C3R
ippiNorm_Inf_32f_C4R
ippiNorm_L1_8u_C1R
ippiNorm_L1_8u_C3R 
ippiNorm_L1_8u_C4R 
ippiNorm_L1_16u_C1R
ippiNorm_L1_16u_C3R
ippiNorm_L1_16u_C4R
ippiNorm_L1_16s_C1R
ippiNorm_L1_16s_C3R
ippiNorm_L1_16s_C4R
ippiNorm_L1_32f_C1R
ippiNorm_L1_32f_C3R
ippiNorm_L1_32f_C4R
ippiNorm_L2_8u_C1R
ippiNorm_L2_8u_C3R 
ippiNorm_L2_8u_C4R 
ippiNorm_L2_16u_C1R
ippiNorm_L2_16u_C3R
ippiNorm_L2_16u_C4R
ippiNorm_L2_16s_C1R
ippiNorm_L2_16s_C3R
ippiNorm_L2_16s_C4R
ippiNorm_L2_32f_C1R
ippiNorm_L2_32f_C3R
ippiNorm_L2_32f_C4R
ippiNormRel_Inf_8u_C1R
ippiNormRel_Inf_16u_C1R
ippiNormRel_Inf_16s_C1R
ippiNormRel_Inf_32f_C1R
ippiNormRel_L1_8u_C1R
ippiNormRel_L1_16u_C1R
ippiNormRel_L1_16s_C1R
ippiNormRel_L1_32f_C1R
ippiNormRel_L2_8u_C1R
ippiNormRel_L2_16u_C1R
ippiNormRel_L2_16s_C1R
ippiNormRel_L2_32f_C1R
ippiNormDiff_Inf_8u_C1R
ippiNormDiff_Inf_8u_C3R 
ippiNormDiff_Inf_8u_C4R 
ippiNormDiff_Inf_16u_C1R
ippiNormDiff_Inf_16u_C3R
ippiNormDiff_Inf_16u_C4R
ippiNormDiff_Inf_16s_C1R
ippiNormDiff_Inf_16s_C3R
ippiNormDiff_Inf_16s_C4R
ippiNormDiff_Inf_32f_C1R
ippiNormDiff_Inf_32f_C3R
ippiNormDiff_Inf_32f_C4R
ippiNormDiff_L1_8u_C1R
ippiNormDiff_L1_8u_C3R 
ippiNormDiff_L1_8u_C4R 
ippiNormDiff_L1_16u_C1R
ippiNormDiff_L1_16u_C3R
ippiNormDiff_L1_16u_C4R
ippiNormDiff_L1_16s_C1R
ippiNormDiff_L1_16s_C3R
ippiNormDiff_L1_16s_C4R
ippiNormDiff_L1_32f_C1R
ippiNormDiff_L1_32f_C3R
ippiNormDiff_L1_32f_C4R
ippiNormDiff_L2_8u_C1R
ippiNormDiff_L2_8u_C3R 
ippiNormDiff_L2_8u_C4R 
ippiNormDiff_L2_16u_C1R
ippiNormDiff_L2_16u_C3R
ippiNormDiff_L2_16u_C4R
ippiNormDiff_L2_16s_C1R
ippiNormDiff_L2_16s_C3R
ippiNormDiff_L2_16s_C4R
ippiNormDiff_L2_32f_C1R
ippiNormDiff_L2_32f_C3R
ippiNormDiff_L2_32f_C4R
ippiSwapChannels_8u_C3C4R
ippiSwapChannels_16u_C3C4R
ippiSwapChannels_32f_C3C4R
ippiSwapChannels_8u_C4C3R
ippiSwapChannels_16u_C4C3R
ippiSwapChannels_32f_C4C3R
ippiSwapChannels_8u_C3R
ippiSwapChannels_16u_C3R
ippiSwapChannels_32f_C3R
ippiSwapChannels_8u_AC4R
ippiSwapChannels_16u_AC4R
ippiSwapChannels_32f_AC4R
ippiCopy_8u_AC4C3R
ippiCopy_16u_AC4C3R
ippiCopy_32f_AC4C3R
ippiCopy_8u_P3C3R
ippiCopy_16u_P3C3R
ippiCopy_32f_P3C3R
ippiMulC_32f_C1IR
ippiSet_8u_C1R
ippiSet_16u_C1R
ippiSet_32f_C1R
ippiSet_8u_C3R
ippiSet_16u_C3R
ippiSet_32f_C3R
ippiSet_8u_C4R
ippiSet_16u_C4R
ippiWarpAffineBack_8u_C1R 
ippiWarpAffineBack_8u_C3R 
ippiWarpAffineBack_8u_C4R 
ippiWarpAffineBack_16u_C1R
ippiWarpAffineBack_16u_C3R
ippiWarpAffineBack_16u_C4R
ippiWarpAffineBack_32f_C1R
ippiWarpAffineBack_32f_C3R
ippiWarpAffineBack_32f_C4R
ippiWarpPerspectiveBack_8u_C1R 
ippiWarpPerspectiveBack_8u_C3R 
ippiWarpPerspectiveBack_8u_C4R 
ippiWarpPerspectiveBack_16u_C1R
ippiWarpPerspectiveBack_16u_C3R
ippiWarpPerspectiveBack_16u_C4R
ippiWarpPerspectiveBack_32f_C1R
ippiWarpPerspectiveBack_32f_C3R
ippiWarpPerspectiveBack_32f_C4R
ippiCopySubpixIntersect_8u_C1R
ippiCopySubpixIntersect_8u32f_C1R
ippiCopySubpixIntersect_32f_C1R
ippiSqrIntegral_8u32f64f_C1R
ippiIntegral_8u32f_C1R
ippiSqrIntegral_8u32s64f_C1R
ippiIntegral_8u32s_C1R
ippiHaarClassifierFree_32f
ippiHaarClassifierInitAlloc_32f
ippiHaarClassifierFree_32f
ippiRectStdDev_32f_C1R
ippiApplyHaarClassifier_32f_C1R
ippiAbsDiff_8u_C1R
ippiAbsDiff_16u_C1R
ippiAbsDiff_32f_C1R
ippiMean_8u_C1MR 
ippiMean_16u_C1MR
ippiMean_32f_C1MR
ippiMean_8u_C3CMR 
ippiMean_16u_C3CMR
ippiMean_32f_C3CMR
ippiMean_StdDev_8u_C1MR 
ippiMean_StdDev_16u_C1MR
ippiMean_StdDev_32f_C1MR
ippiMean_StdDev_8u_C3CMR 
ippiMean_StdDev_16u_C3CMR
ippiMean_StdDev_32f_C3CMR
ippiMean_StdDev_8u_C1R 
ippiMean_StdDev_16u_C1R
ippiMean_StdDev_32f_C1R
ippiMean_StdDev_8u_C3CR 
ippiMean_StdDev_16u_C3CR
ippiMean_StdDev_32f_C3CR
ippiMinMaxIndx_8u_C1MR 
ippiMinMaxIndx_16u_C1MR
ippiMinMaxIndx_32f_C1MR
ippiMinMaxIndx_8u_C1R 
ippiMinMaxIndx_16u_C1R
ippiMinMaxIndx_32f_C1R
ippiNorm_Inf_8u_C1MR
ippiNorm_Inf_8s_C1MR 
ippiNorm_Inf_16u_C1MR
ippiNorm_Inf_32f_C1MR
ippiNorm_L1_8u_C1MR
ippiNorm_L1_8s_C1MR 
ippiNorm_L1_16u_C1MR
ippiNorm_L1_32f_C1MR
ippiNorm_L2_8u_C1MR
ippiNorm_L2_8s_C1MR 
ippiNorm_L2_16u_C1MR
ippiNorm_L2_32f_C1MR
ippiNorm_Inf_8u_C3CMR
ippiNorm_Inf_8s_C3CMR 
ippiNorm_Inf_16u_C3CMR
ippiNorm_Inf_32f_C3CMR
ippiNorm_L1_8u_C3CMR
ippiNorm_L1_8s_C3CMR 
ippiNorm_L1_16u_C3CMR
ippiNorm_L1_32f_C3CMR
ippiNorm_L2_8u_C3CMR
ippiNorm_L2_8s_C3CMR 
ippiNorm_L2_16u_C3CMR
ippiNorm_L2_32f_C3CMR
ippiNormRel_Inf_8u_C1MR
ippiNormRel_Inf_8s_C1MR 
ippiNormRel_Inf_16u_C1MR
ippiNormRel_Inf_32f_C1MR
ippiNormRel_L1_8u_C1MR
ippiNormRel_L1_8s_C1MR 
ippiNormRel_L1_16u_C1MR
ippiNormRel_L1_32f_C1MR
ippiNormRel_L2_8u_C1MR
ippiNormRel_L2_8s_C1MR 
ippiNormRel_L2_16u_C1MR
ippiNormRel_L2_32f_C1MR
ippiNormDiff_Inf_8u_C1MR
ippiNormDiff_Inf_8s_C1MR 
ippiNormDiff_Inf_16u_C1MR
ippiNormDiff_Inf_32f_C1MR
ippiNormDiff_L1_8u_C1MR
ippiNormDiff_L1_8s_C1MR 
ippiNormDiff_L1_16u_C1MR
ippiNormDiff_L1_32f_C1MR
ippiNormDiff_L2_8u_C1MR
ippiNormDiff_L2_8s_C1MR 
ippiNormDiff_L2_16u_C1MR
ippiNormDiff_L2_32f_C1MR
ippiNormDiff_Inf_8u_C3CMR
ippiNormDiff_Inf_8s_C3CMR 
ippiNormDiff_Inf_16u_C3CMR
ippiNormDiff_Inf_32f_C3CMR
ippiNormDiff_L1_8u_C3CMR
ippiNormDiff_L1_8s_C3CMR 
ippiNormDiff_L1_16u_C3CMR
ippiNormDiff_L1_32f_C3CMR
ippiNormDiff_L2_8u_C3CMR 
ippiNormDiff_L2_8s_C3CMR 
ippiNormDiff_L2_16u_C3CMR
ippiNormDiff_L2_32f_C3CMR
ippiFilterRowBorderPipelineGetBufferSize_32f_C1R
ippiFilterRowBorderPipelineGetBufferSize_32f_C3R
ippiFilterRowBorderPipeline_32f_C1R
ippiFilterRowBorderPipeline_32f_C3R
ippiDistanceTransform_5x5_8u32f_C1R
ippiTrueDistanceTransform_8u32f_C1R
ippiTrueDistanceTransformGetBufferSize_8u32f_C1R
ippiFilterScharrVertGetBufferSize_32f_C1R
 ippiFilterScharrVertMaskBorderGetBufferSize
ippiFilterScharrVertBorder_32f_C1R
 ippiFilterScharrVertMaskBorder_32f_C1R
ippiFilterScharrHorizGetBufferSize_32f_C1R
 ippiFilterScharrHorizMaskBorderGetBufferSize
ippiFilterScharrHorizBorder_32f_C1R
ippiFilterSobelNegVertGetBufferSize_8u16s_C1R
ippiFilterSobelNegVertBorder_8u16s_C1R
ippiFilterSobelHorizBorder_8u16s_C1R
ippiFilterSobelVertSecondGetBufferSize_8u16s_C1R
ippiFilterSobelVertSecondBorder_8u16s_C1R
ippiFilterSobelHorizSecondGetBufferSize_8u16s_C1R
ippiFilterSobelHorizSecondBorder_8u16s_C1R
ippiFilterSobelNegVertGetBufferSize_32f_C1R
ippiFilterSobelNegVertBorder_32f_C1R
ippiFilterSobelHorizGetBufferSize_32f_C1R
ippiFilterSobelHorizBorder_32f_C1R
ippiFilterSobelVertSecondGetBufferSize_32f_C1R
ippiFilterSobelVertSecondBorder_32f_C1R
ippiFilterSobelHorizSecondGetBufferSize_32f_C1R
ippiFilterSobelHorizSecondBorder_32f_C1R
ippiColorToGray_8u_C3C1R
ippiColorToGray_16u_C3C1R
ippiColorToGray_32f_C3C1R
ippiColorToGray_8u_AC4C1R
ippiColorToGray_16u_AC4C1R
ippiColorToGray_32f_AC4C1R
ippiRGBToGray_8u_C3C1R
ippiRGBToGray_16u_C3C1R
ippiRGBToGray_32f_C3C1R
ippiRGBToGray_8u_AC4C1R
ippiRGBToGray_16u_AC4C1R
ippiRGBToGray_32f_AC4C1R
ippiRGBToXYZ_8u_C3R
ippiRGBToXYZ_16u_C3R
ippiRGBToXYZ_32f_C3R
ippiXYZToRGB_8u_C3R
ippiXYZToRGB_16u_C3R
ippiXYZToRGB_32f_C3R
ippiRGBToHSV_8u_C3R
ippiRGBToHSV_16u_C3R
ippiHSVToRGB_8u_C3R
ippiHSVToRGB_16u_C3R
ippiRGBToHLS_8u_C3R
ippiRGBToHLS_16u_C3R
ippiRGBToHLS_32f_C3R
ippiHLSToRGB_8u_C3R
ippiHLSToRGB_16u_C3R
ippiHLSToRGB_32f_C3R
 ippiDotProd_8u64f_C1R
 ippiDotProd_16u64f_C1R
 ippiDotProd_16s64f_C1R
 ippiDotProd_32u64f_C1R
 ippiDotProd_32s64f_C1R
 ippiDotProd_32f64f_C1R
 ippiDotProd_8u64f_C3R
 ippiDotProd_16u64f_C3R
 ippiDotProd_16s64f_C3R
 ippiDotProd_32u64f_C3R
 ippiDotProd_32s64f_C3R
 ippiDotProd_32f64f_C3R
 ippiDotProd_8u64f_C4R
 ippiDotProd_16u64f_C4R
 ippiDotProd_16s64f_C4R
 ippiDotProd_32u64f_C4R
 ippiDotProd_32s64f_C4R
 ippiDotProd_32f64f_C4R

API without Secrets: Introduction to Vulkan* Part 1: The Beginning


Download [PDF 736 KB]

Link to Github Sample Code


Go to: API without Secrets: Introduction to Vulkan* Part 0: Preface


Table of Contents

Tutorial 1: Vulkan* – The Beginning

We start with a simple application that unfortunately doesn’t display anything. I won’t present the full source code (with windowing, rendering loop, and so on) here in the text as the tutorial would be too long. The entire sample project with full source code is available in a provided example that can be found at https://github.com/GameTechDev/IntroductionToVulkan. Here I show only the parts of the code that are relevant to Vulkan itself. There are several ways to use the Vulkan API in our application:

  1. We can dynamically load the driver’s library that provides Vulkan API implementation and acquire function pointers by ourselves from it.
  2. We can use the Vulkan SDK and link with the provided Vulkan Runtime (Vulkan Loader) static library.
  3. We can use the Vulkan SDK, dynamically load Vulkan Loader library at runtime, and load function pointers from it.

The first approach is not recommended. Hardware vendors can modify their drivers in any way, which may affect compatibility with a given application. It may even break the application and require developers of a Vulkan-enabled application to rewrite some parts of the code. That’s why it’s better to use some level of abstraction.

The recommended solution is to use the Vulkan Loader from the Vulkan SDK. It provides more configuration abilities and more flexibility without the need to modify Vulkan application source code. One example of this flexibility is Layers. The Vulkan API requires developers to create applications that strictly follow API usage rules. In case of errors, the driver gives us little feedback; only some severe and important errors are reported (for example, out of memory). This approach is used so the API itself can be as small (thin) and as fast as possible. But if we want more information about what we are doing wrong, we have to enable debug/validation layers. There are different layers for different purposes, such as memory usage, proper parameter passing, object lifetime checking, and so on. These layers all slow down the application’s performance but provide us with much more information.
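As a hedged illustration of how layers are turned on (the instance creation code later in this tutorial leaves them disabled), the fragment below requests one instance-level validation layer by name. The layer name is an assumption; it corresponds to the SDK’s standard validation meta-layer available at the time of writing.

const char *enabled_layers[] = {
  "VK_LAYER_LUNARG_standard_validation"    // assumed SDK validation meta-layer
};

VkInstanceCreateInfo instance_create_info = {};
instance_create_info.sType               = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
instance_create_info.enabledLayerCount   = 1;
instance_create_info.ppEnabledLayerNames = enabled_layers;
// Fill in pApplicationInfo and any extensions as shown later in this tutorial,
// then call vkCreateInstance() as usual.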

We also need to choose whether we want to statically link with a Vulkan Loader or whether we will load it dynamically and acquire function pointers by ourselves at runtime. This choice is just a matter of personal preference. This paper focuses on the third way of using Vulkan: dynamically loading function pointers from the Vulkan Runtime library. This approach is similar to what we had to do when we wanted to use OpenGL* on a Windows* system in which only some basic functions were provided by the default implementation. The remaining functions had to be loaded dynamically using wglGetProcAddress() or standard windows GetProcAddress() functions. This is what wrangler libraries such as GLEW or GL3W were created for.

Loading Vulkan Runtime Library and Acquiring Pointer to an Exported Function

In this tutorial we go through the process of acquiring Vulkan functions pointers by ourselves. We load them from the Vulkan Runtime library (Vulkan Loader) which should be installed along with the graphics driver that supports Vulkan. The dynamic library for Vulkan (Vulkan Loader) is named vulkan-1.dll on Windows* and libvulkan.so on Linux*.

From now on, I refer to the first tutorial’s source code, focusing on the Tutorial01.cpp file. So in the initialization code of our application we have to load the Vulkan library with something like this:

#if defined(VK_USE_PLATFORM_WIN32_KHR)
VulkanLibrary = LoadLibrary( "vulkan-1.dll" );
#elif defined(VK_USE_PLATFORM_XCB_KHR) || defined(VK_USE_PLATFORM_XLIB_KHR)
VulkanLibrary = dlopen( "libvulkan.so", RTLD_NOW );
#endif

if( VulkanLibrary == nullptr ) {
  printf( "Could not load Vulkan library!\n" );
  return false;
}
return true;

1.Tutorial01.cpp, function LoadVulkanLibrary()

VulkanLibrary is a variable of type HMODULE on Windows or just void* on Linux. If the value returned by the library loading function is not 0, we can load all exported functions. The Vulkan library, as well as Vulkan implementations (every driver from every vendor), is required to expose only one function that can be loaded with the standard techniques our OS provides (like the previously mentioned GetProcAddress() on Windows or dlsym() on Linux). Other functions from the Vulkan API may also be acquirable with this method, but it is not guaranteed (and not even recommended). The only function that must be exported is vkGetInstanceProcAddr().

This function is used to load all other Vulkan functions. To ease our work of obtaining addresses of all Vulkan API functions it is very convenient to place their names inside a macro. This way we won’t have to duplicate function names in multiple places (like definition, declaration, or loading) and can keep them in only one header file. This single file will be used later for different purposes with an #include directive. We can declare our exported function like this:

#if !defined(VK_EXPORTED_FUNCTION)
#define VK_EXPORTED_FUNCTION( fun )
#endif

VK_EXPORTED_FUNCTION( vkGetInstanceProcAddr )

#undef VK_EXPORTED_FUNCTION

2.ListOfFunctions.inl

Now we define the variables that will represent functions from the Vulkan API. This can be done with something like this:

#include "vulkan.h"

#define VK_EXPORTED_FUNCTION( fun ) PFN_##fun fun;
#define VK_GLOBAL_LEVEL_FUNCTION( fun ) PFN_##fun fun;
#define VK_INSTANCE_LEVEL_FUNCTION( fun ) PFN_##fun fun;
#define VK_DEVICE_LEVEL_FUNCTION( fun ) PFN_##fun fun;

#include "ListOfFunctions.inl"

3.VulkanFunctions.cpp

Here we first include the vulkan.h file, which is officially provided for developers that want to use Vulkan API in their applications. This file is similar to the gl.h file in the OpenGL library. It defines all enumerations, structures, types, and function types that are necessary for Vulkan application development. Next we define the macros for functions from each “level” (I will describe these levels soon). The function definition requires providing function type and a function name. Fortunately, function types in Vulkan can be easily derived from function names. For example, the definition of vkGetInstanceProcAddr() function’s type looks like this:

typedef PFN_vkVoidFunction (VKAPI_PTR *PFN_vkGetInstanceProcAddr)(VkInstance instance, const char* pName);

4.Vulkan.h

The definition of a variable that represents this function would then look like this:

PFN_vkGetInstanceProcAddr vkGetInstanceProcAddr;

This is what the macros from VulkanFunctions.cpp file expand to. They take the function name (hidden in a “fun” parameter) and add “PFN_” at the beginning. Then the macro places a space after the type, and adds a function name and a semicolon after that. Functions are “pasted” into the file in the line with the #include “ListOfFunctions.inl” directive.

But we must remember that when we want to define Vulkan functions’ prototypes by ourselves we must define the VK_NO_PROTOTYPES preprocessor directive. By default the vulkan.h header file contains definitions of all functions. This is useful when we are statically linking with Vulkan Runtime. So when we add our own definitions, there will be a compilation error claiming that the given variables (for function pointers) are defined more than once (since we would break the One Definition rule). We can disable definitions from vulkan.h file using the mentioned preprocessor macro.

Similarly we need to declare variables defined in the VulkanFunctions.cpp file so they would be seen in all other parts of our code. This is done in the same way, but the word “extern” is placed before each function. Compare to the VulkanFunctions.h file.
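For reference, a header matching this description could look like the sketch below. This is only a sketch consistent with the text above, not necessarily identical to the repository’s VulkanFunctions.h; it also shows the VK_NO_PROTOTYPES define mentioned earlier being set before vulkan.h is included.

#if !defined(VULKAN_FUNCTIONS_HEADER)
#define VULKAN_FUNCTIONS_HEADER

// Disable the function prototypes declared in vulkan.h so that our own
// pointer variables are the only definitions (see the One Definition Rule
// note above).
#define VK_NO_PROTOTYPES
#include "vulkan.h"

#define VK_EXPORTED_FUNCTION( fun )       extern PFN_##fun fun;
#define VK_GLOBAL_LEVEL_FUNCTION( fun )   extern PFN_##fun fun;
#define VK_INSTANCE_LEVEL_FUNCTION( fun ) extern PFN_##fun fun;
#define VK_DEVICE_LEVEL_FUNCTION( fun )   extern PFN_##fun fun;

#include "ListOfFunctions.inl"

#endif // VULKAN_FUNCTIONS_HEADER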

Now we have variables in which we can store addresses of functions acquired from the Vulkan library. To load the only one exported function, we can use the following code:

#if defined(VK_USE_PLATFORM_WIN32_KHR)
#define LoadProcAddress GetProcAddress
#elif defined(VK_USE_PLATFORM_XCB_KHR) || defined(VK_USE_PLATFORM_XLIB_KHR)
#define LoadProcAddress dlsym
#endif

#define VK_EXPORTED_FUNCTION( fun )                                 \
if( !(fun = (PFN_##fun)LoadProcAddress( VulkanLibrary, #fun )) ) {  \
  printf( "Could not load exported function: " #fun "!\n" );        \
  return false;                                                     \
}

#include "ListOfFunctions.inl"

return true;

5.Tutorial01.cpp, function LoadExportedEntryPoints()

This macro takes the function name from the “fun” parameter, converts it into a string (with #) and obtains its address from VulkanLibrary. The address is acquired using the GetProcAddress() (on Windows) or dlsym() (on Linux) function and is stored in the variable represented by fun. If this operation fails and the function is not exposed from the library, we report this problem by printing the proper information and returning false. The macro operates on lines included from ListOfFunctions.inl. This way we don’t have to write the names of functions multiple times.

Now that we have our main function-loading procedure, we can load the rest of the Vulkan API procedures. These can be divided into three types:

  • Global-level functions. Allow us to create a Vulkan instance.
  • Instance-level functions. Check what Vulkan-capable hardware is available and what Vulkan features are exposed.
  • Device-level functions. Responsible for performing jobs typically done in a 3D application (like drawing).

We will start with acquiring instance creation functions from the global level.

Acquiring Pointers to Global-Level Functions

Before we can create a Vulkan instance we must acquire the addresses of functions that will allow us to do it. Here is a list of these functions:

  • vkCreateInstance
  • vkEnumerateInstanceExtensionProperties
  • vkEnumerateInstanceLayerProperties

The most important function is vkCreateInstance(), which allows us to create a “Vulkan instance.” From the application’s point of view, a Vulkan instance can be thought of as an equivalent of OpenGL’s rendering context. It stores per-application state (there is no global state in Vulkan), such as enabled instance-level layers and extensions. The other two functions allow us to check which instance layers and which instance extensions are available. Validation layers are divided into instance and device levels depending on what functionality they debug. Extensions in Vulkan are similar to OpenGL’s extensions: they expose additional functionality that is not required by the core specification, and not all hardware vendors may implement them. Extensions, like layers, are also divided into instance and device levels, and extensions from different levels must be enabled separately. In OpenGL, all extensions are (usually) available in created contexts; in Vulkan we have to enable them before the functionality they expose can be used.
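For example, a hedged sketch of checking which instance-level extensions are available with vkEnumerateInstanceExtensionProperties() (this fragment is not part of Tutorial01.cpp) uses the same two-call pattern that appears throughout the Vulkan API:

uint32_t extensions_count = 0;
if( vkEnumerateInstanceExtensionProperties( nullptr, &extensions_count, nullptr ) != VK_SUCCESS ) {
  printf( "Could not enumerate instance extensions!\n" );
  return false;
}

std::vector<VkExtensionProperties> available_extensions( extensions_count );
if( extensions_count > 0 ) {
  vkEnumerateInstanceExtensionProperties( nullptr, &extensions_count, &available_extensions[0] );
}

for( uint32_t i = 0; i < extensions_count; ++i ) {
  // Each entry carries the extension's name and specification version.
  printf( "Instance extension: %s\n", available_extensions[i].extensionName );
}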

We call the vkGetInstanceProcAddr() function to acquire addresses of instance-level procedures. It takes two parameters: an instance and a function name. We don’t have an instance yet, so we provide “null” for the first parameter. That’s why these functions are sometimes called null-instance or no-instance level functions. The second parameter of vkGetInstanceProcAddr() is the name of the procedure whose address we want to acquire. Only global-level functions can be loaded without an instance; it is not possible to load any other function without an instance handle provided in the first parameter.

The code that loads global-level functions may look like this:

#define VK_GLOBAL_LEVEL_FUNCTION( fun )                             \
if( !(fun = (PFN_##fun)vkGetInstanceProcAddr( nullptr, #fun )) ) {  \
  printf( "Could not load global level function: " #fun "!\n" );    \
  return false;                                                     \
}

#include "ListOfFunctions.inl"

return true;

6.Tutorial01.cpp, function LoadGlobalLevelEntryPoints()

The only difference between this code and the code used for loading the exported function (vkGetInstanceProcAddr() exposed by the library) is that we don’t use function provided by the OS, like GetProcAddress(), but we call vkGetInstanceProcAddr() where the first parameter is set to null.

If you follow this tutorial and write the code yourself, make sure you add global-level functions wrapped in a properly named macro to ListOfFunctions.inl header file:

#if !defined(VK_GLOBAL_LEVEL_FUNCTION)
#define VK_GLOBAL_LEVEL_FUNCTION( fun )
#endif

VK_GLOBAL_LEVEL_FUNCTION( vkCreateInstance )
VK_GLOBAL_LEVEL_FUNCTION( vkEnumerateInstanceExtensionProperties )
VK_GLOBAL_LEVEL_FUNCTION( vkEnumerateInstanceLayerProperties )

#undef VK_GLOBAL_LEVEL_FUNCTION

7.ListOfFunctions.inl

Creating a Vulkan Instance

Now that we have loaded global-level functions, we can create a Vulkan instance. This is done by calling the vkCreateInstance() function, which takes three parameters.

  • The first parameter has information about our application, the requested Vulkan version, and the instance level layers and extensions we want to enable. This all is done with structures (structures are very common in Vulkan).
  • The second parameter provides a pointer to a structure with list of different functions related to memory allocation. They can be used for debugging purposes but this feature is optional and we can rely on built-in memory allocation methods.
  • The third parameter is an address of a variable in which we want to store the Vulkan instance handle. In the Vulkan API it is common for results of operations to be stored in variables whose addresses we provide; return values are used only for pass/fail notifications. Here is the full source code for instance creation:
VkApplicationInfo application_info = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO,             // VkStructureType            sType
  nullptr,                                        // const void                *pNext
  "API without Secrets: Introduction to Vulkan",  // const char                *pApplicationName
  VK_MAKE_VERSION( 1, 0, 0 ),                     // uint32_t                   applicationVersion
  "Vulkan Tutorial by Intel",                     // const char                *pEngineName
  VK_MAKE_VERSION( 1, 0, 0 ),                     // uint32_t                   engineVersion
  VK_API_VERSION                                  // uint32_t                   apiVersion
};

VkInstanceCreateInfo instance_create_info = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,         // VkStructureType            sType
  nullptr,                                        // const void*                pNext
  0,                                              // VkInstanceCreateFlags      flags
  &application_info,                              // const VkApplicationInfo   *pApplicationInfo
  0,                                              // uint32_t                   enabledLayerCount
  nullptr,                                        // const char * const        *ppEnabledLayerNames
  0,                                              // uint32_t                   enabledExtensionCount
  nullptr                                         // const char * const        *ppEnabledExtensionNames
};

if( vkCreateInstance( &instance_create_info, nullptr, &Vulkan.Instance ) != VK_SUCCESS ) {
  printf( "Could not create Vulkan instance!\n" );
  return false;
}
return true;

8.Tutorial01.cpp, function CreateInstance()

Most of the Vulkan structures begin with a field describing the type of the structure. Parameters are provided to functions through pointers to avoid copying big memory chunks. Sometimes pointers to other structures are also provided inside structures. For the driver to know how many bytes it should read and how members are aligned, the type of the structure is always provided. So what exactly do all these parameters mean?

  • sType – Type of the structure. In this case it informs the driver that we are providing information for instance creation by providing a value of VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO.
  • pNext – Additional information for instance creation may be provided in future versions of Vulkan API and this parameter will be used for that purpose. For now, it is reserved for future use.
  • flags – Another parameter reserved for future use; for now it must be set to 0.
  • pApplicationInfo – An address of another structure with information about our application (like name, version, required Vulkan API version, and so on).
  • enabledLayerCount – Defines the number of instance-level validation layers we want to enable.
  • ppEnabledLayerNames – This is an array of enabledLayerCount elements with the names of the layers we would like to enable.
  • enabledExtensionCount – The number of instance-level extensions we want to enable.
  • ppEnabledExtensionNames – As with layers, this parameter should point to an array of at least enabledExtensionCount elements containing names of instance-level extensions we want to use.

Most of the parameters can be nulls or zeros. The most important one (apart from the structure type information) is a parameter pointing to a variable of type VkApplicationInfo. So before specifying instance creation information, we also have to specify an additional variable describing our application. This variable contains the name of our application, the name of the engine we are using, and the Vulkan API version we require (which is similar to the OpenGL version; if the driver doesn’t support this version, the instance will not be created). This information may be very useful for the driver. Remember that some graphics card vendors provide drivers that can be specialized for a specific title, such as a specific game. If a graphics card vendor knows what graphics engine a game uses, it can optimize the driver’s behavior so the game performs faster. This application information structure can be used for this purpose. The parameters of the VkApplicationInfo structure include:

  • sType – Type of structure. Here VK_STRUCTURE_TYPE_APPLICATION_INFO, information about the application.
  • pNext – Reserved for future use.
  • pApplicationName – Name of our application.
  • applicationVersion – Version of our application; it is quite convenient to use Vulkan macro for version creation. It packs major, minor, and patch numbers into one 32-bit value.
  • pEngineName – Name of the engine our application uses.
  • engineVersion – Version of the engine we are using in our application.
  • apiVersion – Version of the Vulkan API we want to use. It is best to provide the version defined in the Vulkan header we are including, which is why we use VK_API_VERSION found in the vulkan.h header file.

So now that we have defined these two structures, we can call the vkCreateInstance() function and check whether an instance was created. If successful, the instance handle is stored in the variable whose address we provided, and VK_SUCCESS (which is zero!) is returned.

Acquiring Pointers to Instance-Level Functions

We have created a Vulkan instance. Next we can acquire pointers to functions that allow us to create a logical device, which can be seen as a user view on a physical device. There may be many different devices installed on a computer that support Vulkan. Each of these devices may have different features and capabilities and different performance, or may support different functionalities. When we want to use Vulkan, we must specify which device to perform the operations on. We may use many devices for different purposes (such as one for rendering 3D graphics, one for physics calculations, and one for media decoding). We must check what devices and how many of them are available, what their capabilities are, and what operations they support. This is all done with instance-level functions. We get the addresses of these functions using the vkGetInstanceProcAddr() function used earlier. But this time we will provide handle to a created Vulkan instance.

Loading every Vulkan procedure through the vkGetInstanceProcAddr() function and a Vulkan instance handle comes with a trade-off. When we use Vulkan for data processing, we must create a logical device and acquire device-level functions. But the computer running our application may contain several devices that support Vulkan, and which implementation should be called is determined by the logical device. vkGetInstanceProcAddr(), however, doesn’t take a logical device parameter. When we acquire device-level procedures through it, we in fact acquire addresses of simple “jump” functions. These functions take the handle of a logical device and dispatch to the proper implementation (the function implemented for a specific device). The overhead of this jump can be avoided: the recommended approach is to load procedures for each device separately using another function, vkGetDeviceProcAddr(). But we still have to use vkGetInstanceProcAddr() to load the functions that allow us to create such a logical device.
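As a hedged sketch of that recommended approach (the full device-level loading step belongs to a later stage of the tutorial, so the exact shape here is an assumption), once a logical device exists we can load device-level functions directly, skipping the loader’s jump table:

#define VK_DEVICE_LEVEL_FUNCTION( fun )                                   \
if( !(fun = (PFN_##fun)vkGetDeviceProcAddr( Vulkan.Device, #fun )) ) {    \
  printf( "Could not load device level function: " #fun "!\n" );          \
  return false;                                                           \
}

#include "ListOfFunctions.inl"

return true;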

Some of the instance level functions include:

  • vkEnumeratePhysicalDevices
  • vkGetPhysicalDeviceProperties
  • vkGetPhysicalDeviceFeatures
  • vkGetPhysicalDeviceQueueFamilyProperties
  • vkCreateDevice
  • vkGetDeviceProcAddr
  • vkDestroyInstance

These are the functions that are required and used in this tutorial to create a logical device. There are also other instance-level functions, for example those introduced by extensions, so the list in the header file from the example solution’s source code will grow. The source code used to load all these functions is:

#define VK_INSTANCE_LEVEL_FUNCTION( fun )                                   \
if( !(fun = (PFN_##fun)vkGetInstanceProcAddr( Vulkan.Instance, #fun )) ) {  \
  printf( "Could not load instance level function: " #fun "\n" );           \
  return false;                                                             \
}

#include "ListOfFunctions.inl"

return true;

9.Tutorial01.cpp, function LoadInstanceLevelEntryPoints()

The code for loading instance-level functions is almost identical to the code loading global-level functions. We just change the first parameter of the vkGetInstanceProcAddr() function from null to the created Vulkan instance handle. Of course we now operate on instance-level functions, so we redefine the VK_INSTANCE_LEVEL_FUNCTION() macro instead of the VK_GLOBAL_LEVEL_FUNCTION() macro. We also need to define functions from the instance level. As before, this is best done with a list of macro-wrapped names collected in a shared header, for example:

#if !defined(VK_INSTANCE_LEVEL_FUNCTION)
#define VK_INSTANCE_LEVEL_FUNCTION( fun )
#endif

VK_INSTANCE_LEVEL_FUNCTION( vkDestroyInstance )
VK_INSTANCE_LEVEL_FUNCTION( vkEnumeratePhysicalDevices )
VK_INSTANCE_LEVEL_FUNCTION( vkGetPhysicalDeviceProperties )
VK_INSTANCE_LEVEL_FUNCTION( vkGetPhysicalDeviceFeatures )
VK_INSTANCE_LEVEL_FUNCTION( vkGetPhysicalDeviceQueueFamilyProperties )
VK_INSTANCE_LEVEL_FUNCTION( vkCreateDevice )
VK_INSTANCE_LEVEL_FUNCTION( vkGetDeviceProcAddr )
VK_INSTANCE_LEVEL_FUNCTION( vkEnumerateDeviceExtensionProperties )

#undef VK_INSTANCE_LEVEL_FUNCTION

10.ListOfFunctions.inl

Instance-level functions operate on physical devices. In Vulkan there are “physical devices” and “logical devices” (simply called devices). As the name suggests, a physical device refers to any piece of hardware installed in a computer running a Vulkan-enabled application that is capable of executing Vulkan commands, typically a graphics card. As mentioned earlier, such a device may expose and implement different (optional) Vulkan features, may have different capabilities (like total memory or the ability to work on buffer objects of different sizes), or may provide different extensions. Such hardware may be a dedicated (discrete) graphics card or an additional chip built (integrated) into a main processor. It may even be the CPU itself. Instance-level functions allow us to check all these parameters. After we check them, we must decide (based on our findings and our needs) which physical device we want to use. Maybe we even want to use more than one device, which is also possible, but this scenario is too advanced for now. So if we want to harness the power of any physical device, we must create a logical device that represents our choice in the application (along with enabled layers, extensions, features, and so on). After creating a device (and acquiring queues), we are prepared to use Vulkan, the same way as we are prepared to use OpenGL after creating a rendering context.

Creating a Logical Device

Before we can create a logical device, we must first check to see how many physical devices are available in the system we execute our application on. Next we can get handles to all available physical devices:

uint32_t num_devices = 0;
if( (vkEnumeratePhysicalDevices( Vulkan.Instance, &num_devices, nullptr ) != VK_SUCCESS) ||
    (num_devices == 0) ) {
  printf( "Error occurred during physical devices enumeration!\n" );
  return false;
}

std::vector<VkPhysicalDevice> physical_devices( num_devices );
if( vkEnumeratePhysicalDevices( Vulkan.Instance, &num_devices, &physical_devices[0] ) != VK_SUCCESS ) {
  printf( "Error occurred during physical devices enumeration!\n" );
  return false;
}

11.Tutorial01.cpp, function CreateDevice()

To check how many devices are available, we call the vkEnumeratePhysicalDevices() function. We call it twice, first with the last parameter set to null. This way the driver knows that we are asking only for the number of available physical devices. This number will be stored in the variable we provided the address of in the second parameter.

Now that we know how many physical devices are available, we can prepare storage for their handles. I use a vector so I don’t need to worry about memory allocation and deallocation. When we call vkEnumeratePhysicalDevices() again, this time with all the parameters not equal to null, we acquire handles of the physical devices in the array we provided in the last parameter. This array doesn’t have to match the number returned after the first call, but it must be able to hold at least as many elements as specified in the second parameter.

Example: suppose four physical devices are available but we are interested in only the first one. After the first call, the value four is stored in num_devices; this tells us there is at least one Vulkan-compatible device and that we can proceed. We then overwrite this value with one because we want to use only one such device, no matter which, and after the second call we will get only one physical device handle.

The number of devices we provided will be replaced by the actual number of enumerated physical devices (which of course will not be greater than the value we provided). Example: we don’t want to call this function twice. Our application supports up to 10 devices and we provide this value along with a pointer to a static, 10-element array. The driver always returns the number of actually enumerated devices. If there is none, zero is stored in the variable address we provided. If there is any such device, we will also know that. We will not be able to tell if there are more than 10 devices.

Now that we have handles of all the Vulkan compatible physical devices we can check the properties of each device. In the sample code, this is done inside a loop:

VkPhysicalDevice selected_physical_device = VK_NULL_HANDLE;
uint32_t selected_queue_family_index = UINT32_MAX;
for( uint32_t i = 0; i < num_devices; ++i ) {
  if( CheckPhysicalDeviceProperties( physical_devices[i], selected_queue_family_index ) ) {
    selected_physical_device = physical_devices[i];
  }
}

12.Tutorial01.cpp, function CreateDevice()

Device Properties

I created the CheckPhysicalDeviceProperties() function. It takes the handle of a physical device and checks whether the capabilities of a given device are enough for our application to work properly. If so, it returns true and stores the queue family index in the variable provided in the second parameter. Queues and queue families are discussed in a later section.

Here is the first half of a CheckPhysicalDeviceProperties() function:

VkPhysicalDeviceProperties device_properties;
VkPhysicalDeviceFeatures   device_features;

vkGetPhysicalDeviceProperties( physical_device, &device_properties );
vkGetPhysicalDeviceFeatures( physical_device, &device_features );

uint32_t major_version = VK_VERSION_MAJOR( device_properties.apiVersion );
uint32_t minor_version = VK_VERSION_MINOR( device_properties.apiVersion );
uint32_t patch_version = VK_VERSION_PATCH( device_properties.apiVersion );

if( (major_version < 1) ||
    (device_properties.limits.maxImageDimension2D < 4096) ) {
  printf( "Physical device %p doesn't support required parameters!\n", physical_device );
  return false;
}

13.Tutorial01.cpp, function CheckPhysicalDeviceProperties()

At the beginning of this function, the physical device is queried for its properties and features. Properties contain fields such as the supported Vulkan API version, device name and type (integrated or dedicated/discrete GPU), vendor ID, and limits. Limits describe how large textures can be, how many anti-aliasing samples are supported, and how many buffers can be used in a given shader stage.
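As a small illustrative addition (not part of the original listing), a few of the queried fields could be printed like this, reusing the variables from the code above:

printf( "Device name:        %s\n", device_properties.deviceName );
printf( "Vulkan API version: %u.%u.%u\n", major_version, minor_version, patch_version );
printf( "Max 2D dimension:   %u\n", device_properties.limits.maxImageDimension2D );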

Device Features

Features are additional hardware capabilities that are similar to extensions. They may not necessarily be supported by the driver and by default are not enabled. Features include items such as geometry and tessellation shaders, multiple viewports, logical operations, and additional texture compression formats. If a given physical device supports a feature, we can enable it during logical device creation. Features are not enabled by default in Vulkan, and the Vulkan spec points out that some features may have a performance impact (like robustness).

After querying for hardware info and capabilities, I have provided a small example of how these queries can be used. I “reversed” the VK_MAKE_VERSION macro and retrieved the major, minor, and patch versions from the apiVersion field of the device properties. I check whether the version is above the one I want to use, and also check whether I can create 2D textures of a given size. In this example I’m not using features at all, but if we want to use any feature (for example, geometry shaders) we must check whether it is supported and we must (explicitly) enable it later, during logical device creation. This is the reason why we need to create a logical device rather than use a physical device directly: a logical device represents a physical device and all the features and extensions we enabled for it.
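As a hedged sketch of what enabling a single feature could look like (the tutorial itself enables no features and passes null for pEnabledFeatures), we might check the queried device_features and request only what was verified:

VkPhysicalDeviceFeatures enabled_features = {};   // all features disabled by default

if( device_features.geometryShader ) {
  enabled_features.geometryShader = VK_TRUE;      // request only what we verified
}

// Later, in VkDeviceCreateInfo, point pEnabledFeatures at &enabled_features
// instead of passing nullptr.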

The next part of checking physical device’s capabilities—queues—requires additional explanation.

Queues, Queue Families, and Command Buffers

When we want to process any data (for example, draw a 3D scene from vertex data and vertex attributes), we call Vulkan functions that are passed to the driver. These functions are not passed directly, as sending each request separately down through a communication bus is inefficient. It is better to aggregate them and pass them in groups. In OpenGL this was done automatically by the driver and was hidden from the user. OpenGL API calls were queued in a buffer, and if this buffer was full (or if we requested a flush), the whole buffer was passed to the hardware for processing. In Vulkan this mechanism is directly visible to the user and, more importantly, the user must explicitly create and manage the buffers for commands. These are called (conveniently) command buffers.

Command buffers (as whole objects) are passed to the hardware for execution through queues. However, these buffers may contain different types of operations, such as graphics commands (used for generating and displaying images like in typical 3D games) or compute commands (used for processing data). Specific types of commands may be processed by dedicated hardware, and that’s why queues are also divided into different types. In Vulkan these queue types are called families. Each queue family may support different types of operations. That’s why we also have to check if a given physical device supports the type of operations we want to perform. We can also perform one type of operation on one device and another type of operation on another device, but we have to check if we can. This check is done in the second half of CheckPhysicalDeviceProperties() function:

uint32_t queue_families_count = 0;
vkGetPhysicalDeviceQueueFamilyProperties( physical_device, &queue_families_count, nullptr );
if( queue_families_count == 0 ) {
  printf( "Physical device %p doesn't have any queue families!\n", physical_device );
  return false;
}

std::vector<VkQueueFamilyProperties> queue_family_properties( queue_families_count );
vkGetPhysicalDeviceQueueFamilyProperties( physical_device, &queue_families_count, &queue_family_properties[0] );
for( uint32_t i = 0; i < queue_families_count; ++i ) {
  if( (queue_family_properties[i].queueCount > 0) &&
      (queue_family_properties[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) ) {
    queue_family_index = i;
    return true;
  }
}

printf( "Could not find queue family with required properties on physical device %p!\n", physical_device );
return false;

14.Tutorial01.cpp, function CheckPhysicalDeviceProperties()

We must first check how many different queue families are available in a given physical device. This is done in a similar way to enumerating physical devices. First we call vkGetPhysicalDeviceQueueFamilyProperties() with the last parameter set to null; this way the number of available queue families is stored in the “queue_families_count” variable. Next we can prepare storage for that many queue families’ properties (if we want to; the mechanism is similar to enumerating physical devices). Then we call the function again, and the properties of each queue family are stored in the provided array.

The properties of each queue family contain queue flags, the number of available queues in this family, time stamp support, and image transfer granularity. Right now, the most important part is the number of queues in the family and flags. Flags (which is a bitfield) define which types of operations are supported by a given queue family (more than one may be supported). It can be graphics, compute, transfer (memory operations like copying), and sparse binding (for sparse resources like mega-textures) operations. Other types may appear in the future.
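To make the bitfield nature of queueFlags more concrete, here is an illustrative variant (not the tutorial’s selection logic) that looks for a single family exposing both graphics and compute capabilities:

for( uint32_t i = 0; i < queue_families_count; ++i ) {
  VkQueueFlags required_flags = VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT;
  if( (queue_family_properties[i].queueCount > 0) &&
      ((queue_family_properties[i].queueFlags & required_flags) == required_flags) ) {
    // This family can record both drawing and compute work.
  }
}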

In our example we check for graphics operations support, and if we find it we can use the given physical device. We also have to remember the selected family index. After we have chosen the physical device, we can create a logical device that will represent it in the rest of our application, as shown in the example:

if( selected_physical_device == VK_NULL_HANDLE ) {
  printf( "Could not select physical device based on the chosen properties!\n" );
  return false;
}

std::vector<float> queue_priorities = { 1.0f };

VkDeviceQueueCreateInfo queue_create_info = {
  VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,     // VkStructureType              sType
  nullptr,                                        // const void                  *pNext
  0,                                              // VkDeviceQueueCreateFlags     flags
  selected_queue_family_index,                    // uint32_t                     queueFamilyIndex
  static_cast<uint32_t>(queue_priorities.size()), // uint32_t                     queueCount
  &queue_priorities[0]                            // const float                 *pQueuePriorities
};

VkDeviceCreateInfo device_create_info = {
  VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,           // VkStructureType                    sType
  nullptr,                                        // const void                        *pNext
  0,                                              // VkDeviceCreateFlags                flags
  1,                                              // uint32_t                           queueCreateInfoCount
  &queue_create_info,                             // const VkDeviceQueueCreateInfo     *pQueueCreateInfos
  0,                                              // uint32_t                           enabledLayerCount
  nullptr,                                        // const char * const                *ppEnabledLayerNames
  0,                                              // uint32_t                           enabledExtensionCount
  nullptr,                                        // const char * const                *ppEnabledExtensionNames
  nullptr                                         // const VkPhysicalDeviceFeatures    *pEnabledFeatures
};

if( vkCreateDevice( selected_physical_device, &device_create_info, nullptr, &Vulkan.Device ) != VK_SUCCESS ) {
  printf( "Could not create Vulkan device!\n" );
  return false;
}

Vulkan.QueueFamilyIndex = selected_queue_family_index;
return true;

15.Tutorial01.cpp, function CreateDevice()

First we make sure that after we exited the device features loop, we have found the device that supports our needs. Next we can create a logical device, which is done by calling vkCreateDevice(). It takes the handle to a physical device and an address of a structure that contains the information necessary for device creation. This structure is of type VkDeviceCreateInfo and contains the following fields:

  • sType – Standard type of the provided structure; here it must be VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO, which means we are providing parameters for device creation.
  • pNext – Parameter pointing to an extension specific structure; here we set it to null.
  • flags – Another parameter reserved for future use which must be zero.
  • queueCreateInfoCount – Number of different queue families from which we create queues along with the device.
  • pQueueCreateInfos – Pointer to an array of queueCreateInfoCount elements specifying queues we want to create.
  • enabledLayerCount – Number of device-level validation layers to enable.
  • ppEnabledLayerNames – Pointer to an array with enabledLayerCount names of device-level layers to enable.
  • enabledExtensionCount – Number of extensions to enable for the device.
  • ppEnabledExtensionNames – Pointer to an array with enabledExtensionCount elements; each element must contain the name of an extension that should be enabled.
  • pEnabledFeatures – Pointer to a structure indicating additional features to enable for this device (see the “Device” section).

Features (as I have described earlier) are additional hardware capabilities that are disabled by default. If we want to enable all available features, we can’t simply fill this structure with ones. If some feature is not supported, the device creation will fail. Instead, we should pass a structure that was filled when we called vkGetPhysicalDeviceFeatures(). This is the easiest way to enable all supported features. If we are interested only in some specific features, we query the driver for available features and clear all unwanted fields. If we don’t want any of the additional features we can clear this structure (fill it with zeros) or pass a null pointer for this parameter (like in this example).
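As a rough sketch (not part of the tutorial’s source code), enabling only one specific supported feature could look like this, assuming the same physical device handle used earlier in CheckPhysicalDeviceProperties():

VkPhysicalDeviceFeatures supported_features;
vkGetPhysicalDeviceFeatures( physical_device, &supported_features );

// Start from a zeroed structure so nothing is enabled by accident,
// then copy over only the fields we actually need (if supported)
VkPhysicalDeviceFeatures enabled_features = {};
enabled_features.geometryShader = supported_features.geometryShader;

// pEnabledFeatures in VkDeviceCreateInfo would then point to &enabled_features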

Queues are created automatically along with the device. To specify what types of queues we want to enable, we provide an array of additional VkDeviceQueueCreateInfo structures. This array must contain queueCreateInfoCount elements. Each element in this array must refer to a different queue family; we refer to a specific queue family only once.

The VkDeviceQueueCreateInfo structure contains the following fields:

  • sType – Type of structure, here VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO, indicating it’s queue creation information.
  • pNext – Pointer reserved for extensions.
  • flags – Value reserved for future use.
  • queueFamilyIndex – Index of a queue family from which queues should be created.
  • queueCount – Number of queues we want to enable in this specific queue family (that is, how many queues we want to use from this family); it also defines the number of elements in the pQueuePriorities array.
  • pQueuePriorities – Array with floating point values describing priorities of operations performed in each queue from this family.

As I mentioned previously, each element in the array of VkDeviceQueueCreateInfo elements must describe a different queue family. Its index must be smaller than the number of queue families reported by the vkGetPhysicalDeviceQueueFamilyProperties() function. In our example we are only interested in one queue from one queue family, and that’s why we must remember the queue family index. It is used right here. If we want to prepare a more complicated scenario, we should also remember the number of queues in each family, as each family may support a different number of queues. And we can’t create more queues than are available in a given family!

It is also worth noting that different queue families may have similar (or even identical) properties, meaning they may support similar types of operations; that is, there may be more than one queue family that supports graphics operations. And each family may contain a different number of queues.

We must also assign a floating point value (from 0.0 to 1.0, both inclusive) to each queue. The higher the value we provide for a given queue (relative to the values assigned to other queues), the more processing time that queue may get relative to the other queues. But this relation is not guaranteed. Priorities also don’t influence execution order. They are just a hint.

Priorities are relative only within a single device. If operations are performed on multiple devices, priorities may impact processing time on each of these devices but not between them. A queue with a given value may be more important only than queues with lower priorities on the same device. Queues from different devices are treated independently. Once we fill these structures and call vkCreateDevice(), upon success the created logical device is stored in the variable we provided the address of (in our example it is stored in Vulkan.Device). If this function fails, it returns a value other than VK_SUCCESS.
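For illustration only (and assuming the selected family exposes at least two queues), requesting two queues with different priorities from a single family could look like this:

std::vector<float> two_priorities = { 1.0f, 0.5f };

VkDeviceQueueCreateInfo two_queues_info = {
  VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,     // VkStructureType              sType
  nullptr,                                        // const void                  *pNext
  0,                                              // VkDeviceQueueCreateFlags     flags
  selected_queue_family_index,                    // uint32_t                     queueFamilyIndex
  static_cast<uint32_t>(two_priorities.size()),   // uint32_t                     queueCount
  &two_priorities[0]                              // const float                 *pQueuePriorities
};

The first queue is hinted to be more important than the second, but as noted above the driver is free to ignore this hint.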

Acquiring Pointers to Device-Level Functions

We have created a logical device. We can now use it to load functions from the device level. As I have mentioned earlier, in real-life scenarios there will be situations where more than one hardware vendor on a single computer provides us with a Vulkan implementation. With OpenGL this is already happening. Many computers have a dedicated/discrete graphics card used mainly for gaming, but they also have Intel’s graphics built into the processor (which of course can also be used for games). So in the future there will be more than one device supporting Vulkan. And with Vulkan we can divide processing across whatever hardware we want. Remember when there were extension cards dedicated to physics processing? Or, going farther into the past, a normal “2D” card with an additional graphics “accelerator” (do you remember Voodoo cards)? Vulkan is ready for any such scenario.

So what should we do with device-level functions if there can be so many devices? We can load universal procedures. This is done with the vkGetInstanceProcAddr() function. It returns the addresses of dispatch functions that perform jumps to the proper implementation based on a provided logical device handle. But we can avoid this overhead by loading functions for each logical device separately. With this method, we must remember that we can call a given function only with the device we loaded this function from. So if we use multiple devices in our application, we must load functions from each of these devices. It’s not that difficult. And despite this leading to storing more functions (and grouping them based on the device they were loaded from), we can avoid one level of abstraction and save some processor time. We can load functions similarly to how we loaded exported, global-, and instance-level functions:

#define VK_DEVICE_LEVEL_FUNCTION( fun )                                 \
if( !(fun = (PFN_##fun)vkGetDeviceProcAddr( Vulkan.Device, #fun )) ) {  \
  printf( "Could not load device level function: " #fun "!\n" );        \
  return false;                                                         \
}

#include "ListOfFunctions.inl"

return true;

16.Tutorial01.cpp, function LoadDeviceLevelEntryPoints()

This time we used the vkGetDeviceProcAddr() function along with a logical device handle. Functions from device level are placed in a shared header. This time they are wrapped in a VK_DEVICE_LEVEL_FUNCTION() macro like this:

#if !defined(VK_DEVICE_LEVEL_FUNCTION)
#define VK_DEVICE_LEVEL_FUNCTION( fun )
#endif

VK_DEVICE_LEVEL_FUNCTION( vkGetDeviceQueue )
VK_DEVICE_LEVEL_FUNCTION( vkDestroyDevice )
VK_DEVICE_LEVEL_FUNCTION( vkDeviceWaitIdle )

#undef VK_DEVICE_LEVEL_FUNCTION

17.ListOfFunctions.inl

All functions that are not from the exported, global, or instance levels are from the device level. Another distinction can be made based on the first parameter: for device-level functions, the first parameter in the list may only be of type VkDevice, VkQueue, or VkCommandBuffer. In the rest of the tutorial, any new function that appears must be added to ListOfFunctions.inl, wrapped in the VK_DEVICE_LEVEL_FUNCTION() macro (with a few noted exceptions, such as extension functions).
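For example, if a later lesson needed command pool functions, they would simply be appended to the same file in the same way (the two lines below are hypothetical here and are not used in this tutorial):

VK_DEVICE_LEVEL_FUNCTION( vkCreateCommandPool )
VK_DEVICE_LEVEL_FUNCTION( vkAllocateCommandBuffers )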

Retrieving Queues

Now that we have created a device, we need a queue that we can submit some commands to for processing. Queues are automatically created with a logical device, but in order to use them we must specifically ask for a queue handle. This is done with vkGetDeviceQueue() like this:

vkGetDeviceQueue( Vulkan.Device, Vulkan.QueueFamilyIndex, 0, &Vulkan.Queue );

18.Tutorial01.cpp, function GetDeviceQueue()

To retrieve the queue handle we must provide the logical device we want to get the queue from. The queue family index is also needed, and it must be one of the indices we provided during logical device creation (we cannot create additional queues or use queues from families we didn’t request). The last parameter is a queue index within the given family; it must be smaller than the total number of queues we requested from that family. For example, if the device supports five queues in family number 3 and we want two queues from that family, the queue index must be smaller than two. For each queue we want to retrieve, we have to call this function and make a separate query. If the function call succeeds, it will store a handle to the requested queue in the variable we provided the address of in the final parameter. From now on, all the work we want to perform (using command buffers) can be submitted for processing to the acquired queue.
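To make the five-queues example concrete, here is a sketch (not part of the tutorial code): assuming two queues were requested from family number 3 during device creation, the only valid queue indices are 0 and 1, and each handle is retrieved with a separate call.

VkQueue first_queue  = VK_NULL_HANDLE;
VkQueue second_queue = VK_NULL_HANDLE;

vkGetDeviceQueue( Vulkan.Device, 3, 0, &first_queue );  // queue index 0 from family 3
vkGetDeviceQueue( Vulkan.Device, 3, 1, &second_queue ); // queue index 1 from family 3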

Tutorial01 Execution

As I have mentioned, the example provided with this tutorial doesn’t display anything. But we have learned enough information for one lesson. So how do we know if everything went fine? If the normal application window appears and nothing is printed in the console/terminal, this means the Vulkan setup was successful. Starting with the next tutorial, the results of our operations will be displayed on the screen.

Cleaning Up

There is one more thing we need to remember: cleaning up and freeing resources. Cleanup must be done in a specific order that is (in general) a reversal of the order of creation.

After the application is closed, the OS should release memory and all other resources associated with it. This should include Vulkan; the driver usually cleans up unreferenced resources. Unfortunately, this cleaning may not be performed in a proper order, which might lead to application crash during the closing process. It is always good practice to do the cleaning ourselves. Here is the sample code required to release resources we have created during this first tutorial:

if( Vulkan.Device != VK_NULL_HANDLE ) {
  vkDeviceWaitIdle( Vulkan.Device );
  vkDestroyDevice( Vulkan.Device, nullptr );
}

if( Vulkan.Instance != VK_NULL_HANDLE ) {
  vkDestroyInstance( Vulkan.Instance, nullptr );
}

if( VulkanLibrary ) {
#if defined(VK_USE_PLATFORM_WIN32_KHR)
  FreeLibrary( VulkanLibrary );
#elif defined(VK_USE_PLATFORM_XCB_KHR) || defined(VK_USE_PLATFORM_XLIB_KHR)
  dlclose( VulkanLibrary );
#endif
}

19.Tutorial01.cpp, destructor

We should always check to see whether any given resource was created. Without a logical device there are no device-level function pointers so we are unable to call even proper resource cleaning functions. Similarly, without an instance we are unable to acquire pointer to a vkDestroyInstance() function. In general we should not release resources that weren’t created.

We must ensure that before deleting any object, it is not being used by a device. That’s why there is a wait function, which will block until all processing on all queues of a given device is finished. Next, we destroy the logical device using the vkDestroyDevice() function. All queues associated with it are destroyed automatically, then the instance is destroyed. After that we can free (unload or release) a Vulkan library from which all these functions were acquired.

Conclusion

This tutorial explained how to prepare to use Vulkan in our application. First we “connect” with the Vulkan Runtime library and load global level functions from it. Then we create a Vulkan instance and load instance-level functions. After that we can check what physical devices are available and what are their features, properties, and capabilities. Next we create a logical device and describe what and how many queues must be created along with the device. After that we can retrieve device-level functions using the newly created logical device handle. One additional thing to do is to retrieve queues through which we can submit work for execution.


Go to: API without Secrets: Introduction to Vulkan* Part 2: Swap chain


Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

API without Secrets: Introduction to Vulkan* Part 2: Swap Chain


Download [PDF 1 MB]

Link to Github Sample Code


Go to: API without Secrets: Introduction to Vulkan* Part 1: The Beginning


Table of Contents

Tutorial 2: Swap Chain – Integrating Vulkan with the OS

Welcome to the second Vulkan tutorial. In the first tutorial, I discussed basic Vulkan setup: function loading, instance creation, choosing a physical device and queues, and logical device creation. I’m sure you now want to draw something! Unfortunately we must wait until the next part. Why? Because if we draw something we’ll want to see it. Similar to OpenGL*, we must integrate the Vulkan pipeline with the application and API that the OS provides. However, with Vulkan, this task unfortunately isn’t simple and obvious. And as with all other thin APIs, it is done this way on purpose—for  the sake of high performance and flexibility.

So how do you integrate Vulkan with the application’s window? What are the differences compared to OpenGL? In OpenGL (on Microsoft Windows*) we acquire a Device Context that is associated with the application’s window. Using it we then have to define “how” to present images on the screen, “what” format the window surface we will be drawing on has, and what capabilities it should support. This is done through the pixel format. Most of the time we create a 32-bit color surface with a 24-bit depth buffer and support for double buffering (this way we can draw something to a “hidden” back buffer, and after we’re finished we can present it on the screen—swap the front and back buffers). Only after these preparations can we create a Rendering Context and activate it. In OpenGL, all rendering is directed to the default back buffer.

In Vulkan there is no default frame buffer. We can create an application that displays nothing at all. This is a valid approach. But if we want to display something we can create a set of buffers to which we can render. These buffers along with their properties, similar to Direct3D*, are called a swap chain. A swap chain can contain many images. To display any of them we don’t “swap” them—as the name suggests—but we present them, which means that we give them back to a presentation engine. So in OpenGL we first have to define the surface format and associate it with a window (at least on Windows) and after that we create Rendering Context. In Vulkan, we first create an instance, a device, and then we create a swap chain. But, what’s interesting is that there will be situations where we will have to destroy this swap chain and recreate it. In the middle of a working application. From scratch!

Asking for a Swap Chain Extension

In Vulkan, a swap chain is an extension. Why? Isn’t it obvious we want to display an image on the screen in our application’s window?

Well, it’s not so obvious. Vulkan can be used for many different purposes, including performing mathematical operations, boosting physics calculations, and processing a video stream. The results of these actions may not necessarily be displayed on a typical monitor, which is why the core API is OS-agnostic, similar to OpenGL.

If you want to create a game and display rendered images on a monitor, you can (and should) use a swap chain. But here is the second reason why a swap chain is an extension. Every OS displays images in a different way. The  surface on which you can render may be implemented differently, can have a different format, and can be differently represented in the OS—there is no one universal way to do it. So in Vulkan a swap chain must also depend on the OS your application is written for.

These are the reasons a swap chain in Vulkan is treated as an extension: it provides render targets (buffers or images, like FBOs in OpenGL) that integrate with OS-specific code. It’s something that core Vulkan (which is platform independent) can’t do. So if swap chain creation and usage is an extension, we have to ask for the extension during both instance and device creation. The ability to create and use a swap chain requires us to enable extensions at two levels (at least on most operating systems, with Windows and Linux* among them). This means that we have to go back to the first tutorial and change it to request the proper swap-chain-related extensions. If a given instance and device doesn’t support these extensions, the instance and/or device creation will fail. There are of course other ways to display an image, like acquiring the pointer to a buffer’s (texture’s) memory (mapping it) and copying data from it to the OS-acquired window’s surface pointer. This process is out of the scope of this tutorial (though not really that hard). But fortunately it seems that swap chain extensions will be similar to OpenGL’s core extensions: something that’s not in the core spec and not required to be implemented, but also something that every hardware vendor will implement anyway. I think all hardware vendors would like to show that they support Vulkan and that it gives an impressive performance boost in games displayed on screen. And, supporting this theory, the swap-chain extensions are integrated into the main, “core” vulkan.h header.

In the case of swap-chain support, there are actually three extensions involved: two from an instance level and one from a device level. These extensions logically separate different functionalities. The first is the VK_KHR_surface extension defined at the instance level. It describes a “surface” object, which is a logical representation of an application’s window. This extension allows us to check different parameters (that is,  capabilities, supported formats, size) of a surface and to query whether the given physical device supports a swap chain (more precisely, whether the given queue family supports presenting an image to a given surface). This is useful information because we don’t want to choose a physical device and try to create a logical device from it only to find out that it doesn’t support swap chains. This extension also defines methods to destroy any such surface.

The second instance-level extension is OS-dependent: in the Windows OS family it is called VK_KHR_win32_surface and in Linux it is called VK_KHR_xlib_surface or VK_KHR_xcb_surface. This extension allows us to create a surface that represents the application’s window in a given OS (and uses OS-specific parameters).

Checking Whether an Instance Extension Is Supported

Before we can enable the two instance-level extensions, we need to check whether they are available or supported. We are talking about instance extensions and we haven’t created any instance yet. To determine whether our Vulkan instance supports these extensions, we use a global-level function called vkEnumerateInstanceExtensionProperties(). It enumerates all available instance general extensions, if its first parameter is null, or instance layer extensions (it seems that layers can also have extensions), if we set the first parameter to the name of any given layer. We aren’t interested in layers so we leave the first parameter set to null. Again we call this function twice. For the first call, we want to acquire the total number of supported extensions so we leave the third argument nulled. Next we prepare storage for all these extensions and we call this function once again with the third parameter pointing to the allocated storage.

uint32_t extensions_count = 0;
if( (vkEnumerateInstanceExtensionProperties( nullptr, &extensions_count, nullptr ) != VK_SUCCESS) ||
    (extensions_count == 0) ) {
  printf( "Error occurred during instance extensions enumeration!\n" );
  return false;
}

std::vector<VkExtensionProperties> available_extensions( extensions_count );
if( vkEnumerateInstanceExtensionProperties( nullptr, &extensions_count, &available_extensions[0] ) != VK_SUCCESS ) {
  printf( "Error occurred during instance extensions enumeration!\n" );
  return false;
}

std::vector<const char*> extensions = {
  VK_KHR_SURFACE_EXTENSION_NAME,
#if defined(VK_USE_PLATFORM_WIN32_KHR)
  VK_KHR_WIN32_SURFACE_EXTENSION_NAME
#elif defined(VK_USE_PLATFORM_XCB_KHR)
  VK_KHR_XCB_SURFACE_EXTENSION_NAME
#elif defined(VK_USE_PLATFORM_XLIB_KHR)
  VK_KHR_XLIB_SURFACE_EXTENSION_NAME
#endif
};

for( size_t i = 0; i < extensions.size(); ++i ) {
  if( !CheckExtensionAvailability( extensions[i], available_extensions ) ) {
    printf( "Could not find instance extension named \"%s\"!\n", extensions[i] );
    return false;
  }
}

1. Tutorial02.cpp, function CreateInstance()

We can prepare a place for a smaller number of extensions, but then vkEnumerateInstanceExtensionProperties() will return VK_INCOMPLETE to let us know we didn’t acquire all the extensions.
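A small illustration of this behavior (hypothetical; the tutorial always asks for the full count):

uint32_t small_count = 1;
VkExtensionProperties single_extension;

// If more than one extension is available, only one is written and
// VK_INCOMPLETE is returned instead of VK_SUCCESS
VkResult result = vkEnumerateInstanceExtensionProperties( nullptr, &small_count, &single_extension );
if( result == VK_INCOMPLETE ) {
  // our array was too small to hold all available extensions
}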

Our array is now filled with all available (supported) instance-level extensions. Each element of our allocated space contains the name of the extension and its version. The second parameter probably won’t be used too often, but it may be useful to check whether the hardware supports the given version of the extension. For example, we might be  interested in some specific extension, and we downloaded an SDK for it that contains a set of header files. Each header file has its own version corresponding to the value returned by this query. If the hardware our application is executed on supports an older version of the extension (not the one we downloaded the SDK for) it may not support all the functions we are using from this specific extension. So sometimes it may be useful to also verify the version, but for a swap chain it doesn’t matter—at least for now.

We can now search through all of the returned extensions and see whether the list contains the extensions we are looking for. Here I’m using two convenient definitions named VK_KHR_SURFACE_EXTENSION_NAME and VK_KHR_????_SURFACE_EXTENSION_NAME. They are defined inside a Vulkan header file and contain the names of the extensions, so we don’t have to copy or remember them. We can just use the definitions in our code, and if we make a mistake the compiler will tell us. I hope all extensions will come with a similar definition.

With the second definition comes a small trap. These two mentioned defines are placed in the vulkan.h header file. But isn’t the second define specific to a given OS, and isn’t the vulkan.h header OS-independent? Both statements are true. The vulkan.h file is OS-independent, yet it contains the definitions of OS-specific extensions. These are enclosed inside #ifdef … #endif preprocessor directives. If we want to “enable” them we need to add a proper preprocessor definition somewhere in our project. For a Windows system, we need to define VK_USE_PLATFORM_WIN32_KHR. On Linux, we need to define VK_USE_PLATFORM_XCB_KHR or VK_USE_PLATFORM_XLIB_KHR, depending on whether we want to use the XCB or Xlib (X11) libraries. In the provided example project, these definitions are added by default through the CMakeLists.txt file.
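If we didn’t want to rely on the build system, the same effect could be achieved directly in code. This is just a sketch: the define must appear before the Vulkan header is included anywhere in the project, and the exact include path may differ between SDK versions.

#if defined(_WIN32)
#define VK_USE_PLATFORM_WIN32_KHR
#else
#define VK_USE_PLATFORM_XCB_KHR
#endif

#include <vulkan/vulkan.h>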

But back to our source code. What does the CheckExtensionAvailability() function do? It loops over all available extensions and compares their names with the name of the provided extension. If a match is found, it just returns true.

for( size_t i = 0; i < available_extensions.size(); ++i ) {
  if( strcmp( available_extensions[i].extensionName, extension_name ) == 0 ) {
    return true;
  }
}
return false;

2.Tutorial02.cpp, function CheckExtensionAvailability()

Enabling an Instance-Level Extension

Let’s say we have verified that both extensions are supported. Instance-level extensions are requested (enabled) during instance creation—we create an instance with a list of extensions that should be enabled. Here’s the code responsible for doing it:

VkApplicationInfo application_info = {
  VK_STRUCTURE_TYPE_APPLICATION_INFO,             // VkStructureType            sType
  nullptr,                                        // const void                *pNext
  "API without Secrets: Introduction to Vulkan",  // const char                *pApplicationName
  VK_MAKE_VERSION( 1, 0, 0 ),                     // uint32_t                   applicationVersion
  "Vulkan Tutorial by Intel",                     // const char                *pEngineName
  VK_MAKE_VERSION( 1, 0, 0 ),                     // uint32_t                   engineVersion
  VK_API_VERSION                                  // uint32_t                   apiVersion
};

VkInstanceCreateInfo instance_create_info = {
  VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,         // VkStructureType            sType
  nullptr,                                        // const void                *pNext
  0,                                              // VkInstanceCreateFlags      flags
  &application_info,                              // const VkApplicationInfo   *pApplicationInfo
  0,                                              // uint32_t                   enabledLayerCount
  nullptr,                                        // const char * const        *ppEnabledLayerNames
  static_cast<uint32_t>(extensions.size()),       // uint32_t                   enabledExtensionCount
  &extensions[0]                                  // const char * const        *ppEnabledExtensionNames
};

if( vkCreateInstance( &instance_create_info, nullptr, &Vulkan.Instance ) != VK_SUCCESS ) {
  printf( "Could not create Vulkan instance!\n" );
  return false;
}
return true;

3.Tutorial02.cpp, function CreateInstance()

This code is similar to the CreateInstance() function in the Tutorial01.cpp file. To request instance-level extensions we have to prepare an array with the names of all extensions we want to enable. Here I have used a standard vector with “const char*” elements and mentioned extension names in forms of defines.

In Tutorial 1 we declared zero extensions and placed a nullptr for the address of an array in a VkInstanceCreateInfo structure. This time we must provide an address of the first element of an array filled with the names of the requested extensions. And we must also specify how many elements the array contains (that’s why I chose a vector: if I add or remove extensions in future tutorials, the vector’s size will also change accordingly). Next we call the vkCreateInstance() function. If it doesn’t return VK_SUCCESS it means that (in the case of this tutorial) extensions are not supported. If it does return successfully, we can load instance-level functions as previously, but this time also with some additional, extension-specific functions.

With these extensions come additional functions. And, as these are instance-level extensions, we must add their functions to our set of instance-level functions (so they will also be loaded at the proper moment and with the proper loading function). In this case we must add the following functions to the ListOfFunctions.inl file, wrapped in the VK_INSTANCE_LEVEL_FUNCTION() macro like this:

// From extensions
#if defined(USE_SWAPCHAIN_EXTENSIONS)
VK_INSTANCE_LEVEL_FUNCTION( vkDestroySurfaceKHR )
VK_INSTANCE_LEVEL_FUNCTION( vkGetPhysicalDeviceSurfaceSupportKHR )
VK_INSTANCE_LEVEL_FUNCTION( vkGetPhysicalDeviceSurfaceCapabilitiesKHR )
VK_INSTANCE_LEVEL_FUNCTION( vkGetPhysicalDeviceSurfaceFormatsKHR )
VK_INSTANCE_LEVEL_FUNCTION( vkGetPhysicalDeviceSurfacePresentModesKHR )
#if defined(VK_USE_PLATFORM_WIN32_KHR)
VK_INSTANCE_LEVEL_FUNCTION( vkCreateWin32SurfaceKHR )
#elif defined(VK_USE_PLATFORM_XCB_KHR)
VK_INSTANCE_LEVEL_FUNCTION( vkCreateXcbSurfaceKHR )
#elif defined(VK_USE_PLATFORM_XLIB_KHR)
VK_INSTANCE_LEVEL_FUNCTION( vkCreateXlibSurfaceKHR )
#endif
#endif

4.ListOfFunctions.inl

One more thing: I’ve wrapped all these swap-chain-related functions inside another #ifdef … #endif pair, which requires a USE_SWAPCHAIN_EXTENSIONS preprocessor directive to be defined. I’ve done this so Tutorial 1 would properly work. Without it, our first application (as it uses the same header files) would try to load all these functions. But we don’t enable swap chain extensions in the first tutorial, so this operation would fail and the application would close without fully initializing Vulkan. If a given extension isn’t enabled, functions from it may not be available.

Creating a Presentation Surface

We have created a Vulkan instance with two extensions enabled. We have loaded instance-level functions from a core Vulkan spec and from enabled extensions (this is done automatically thanks to our macros). To create a surface, we write code similar to the following:

#if defined(VK_USE_PLATFORM_WIN32_KHR)
VkWin32SurfaceCreateInfoKHR surface_create_info = {
  VK_STRUCTURE_TYPE_WIN32_SURFACE_CREATE_INFO_KHR,  // VkStructureType                  sType
  nullptr,                                          // const void                      *pNext
  0,                                                // VkWin32SurfaceCreateFlagsKHR     flags
  Window.Instance,                                  // HINSTANCE                        hinstance
  Window.Handle                                     // HWND                             hwnd
};

if( vkCreateWin32SurfaceKHR( Vulkan.Instance, &surface_create_info, nullptr, &Vulkan.PresentationSurface ) == VK_SUCCESS ) {
  return true;
}

#elif defined(VK_USE_PLATFORM_XCB_KHR)
VkXcbSurfaceCreateInfoKHR surface_create_info = {
  VK_STRUCTURE_TYPE_XCB_SURFACE_CREATE_INFO_KHR,    // VkStructureType                  sType
  nullptr,                                          // const void                      *pNext
  0,                                                // VkXcbSurfaceCreateFlagsKHR       flags
  Window.Connection,                                // xcb_connection_t*                connection
  Window.Handle                                     // xcb_window_t                     window
};

if( vkCreateXcbSurfaceKHR( Vulkan.Instance, &surface_create_info, nullptr, &Vulkan.PresentationSurface ) == VK_SUCCESS ) {
  return true;
}

#elif defined(VK_USE_PLATFORM_XLIB_KHR)
VkXlibSurfaceCreateInfoKHR surface_create_info = {
  VK_STRUCTURE_TYPE_XLIB_SURFACE_CREATE_INFO_KHR,   // VkStructureType                sType
  nullptr,                                          // const void                    *pNext
  0,                                                // VkXlibSurfaceCreateFlagsKHR    flags
  Window.DisplayPtr,                                // Display                       *dpy
  Window.Handle                                     // Window                         window
};

if( vkCreateXlibSurfaceKHR( Vulkan.Instance,&surface_create_info, nullptr, &Vulkan.PresentationSurface ) == VK_SUCCESS ) {
  return true;
}

#endif

printf( "Could not create presentation surface!\n" );
return false;

5.Tutorial02.cpp, function CreatePresentationSurface()

To create a presentation surface, we call the vkCreate????SurfaceKHR() function, which accepts the Vulkan instance (with surface extensions enabled), a pointer to an OS-specific structure, a pointer to optional memory allocation handling functions, and a pointer to a variable in which a handle to the created surface will be stored.

This OS-specific structure is called Vk????SurfaceCreateInfoKHR and it contains the following fields:

  • sType – Standard type of structure that here should be equal to VK_STRUCTURE_TYPE_????_SURFACE_CREATE_INFO_KHR (where ???? can be WIN32, XCB, XLIB, or other)
  • pNext – Standard pointer to some other structure
  • flags – Parameter reserved for future use
  • hinstance/connection/dpy – First OS-specific parameter
  • hwnd/window – Handle to our application’s window (also OS specific)

Checking Whether a Device Extension is Supported

We have created an instance and a surface. The next step is to create a logical device. But we want to create a device that supports a swap chain. So we also need to check whether a given physical device supports a swap chain extension, a device-level extension. This extension is called VK_KHR_swapchain, and it defines the actual support, implementation, and usage of a swap chain.

To check what extensions a given physical device supports, we must create code similar to the code prepared for instance-level extensions. This time we use the vkEnumerateDeviceExtensionProperties() function. It behaves identically to the function querying instance extensions. The only difference is that it takes an additional physical device handle as the first argument. The code may look like the example below. It is part of the CheckPhysicalDeviceProperties() function in our example source code.

uint32_t extensions_count = 0;
if( (vkEnumerateDeviceExtensionProperties( physical_device, nullptr, &extensions_count, nullptr ) != VK_SUCCESS) ||
    (extensions_count == 0) ) {
  printf( "Error occurred during physical device %p extensions enumeration!\n", physical_device );
  return false;
}

std::vector<VkExtensionProperties> available_extensions( extensions_count );
if( vkEnumerateDeviceExtensionProperties( physical_device, nullptr, &extensions_count, &available_extensions[0] ) != VK_SUCCESS ) {
  printf( "Error occurred during physical device %p extensions enumeration!\n", physical_device );
  return false;
}

std::vector<const char*> device_extensions = {
  VK_KHR_SWAPCHAIN_EXTENSION_NAME
};

for( size_t i = 0; i < device_extensions.size(); ++i ) {
  if( !CheckExtensionAvailability( device_extensions[i], available_extensions ) ) {
    printf( "Physical device %p doesn't support extension named \"%s\"!\n", physical_device, device_extensions[i] );
    return false;
  }
}

6.Tutorial02.cpp, function CheckPhysicalDeviceProperties()

We first ask for the number of all extensions available on a given physical device. Next we get their names and look for the device-level swap-chain extension. If it is not there, there is no point in further checking the device’s properties, features, and queue family properties, as the given device doesn’t support a swap chain at all.

Checking Whether Presentation to a Given Surface Is Supported

Let’s go back to the CreateDevice() function. After creating an instance, in the first tutorial we looped through all available physical devices and queried their properties. Based on these properties we selected which device we want to use and which queue families we want to request. This query is done in a loop over all available physical devices. Now that we want to use a swap chain, I have to modify the CheckPhysicalDeviceProperties() function, which is called inside the mentioned loop in the CreateDevice() function, like this:

uint32_t selected_graphics_queue_family_index = UINT32_MAX;
uint32_t selected_present_queue_family_index = UINT32_MAX;

for( uint32_t i = 0; i < num_devices; ++i ) {
  if( CheckPhysicalDeviceProperties( physical_devices[i], selected_graphics_queue_family_index, selected_present_queue_family_index ) ) {
    Vulkan.PhysicalDevice = physical_devices[i];
  }
}

7.Tutorial02.cpp, function CreateDevice()

The only change is that I’ve added another variable that will contain the index of a queue family that supports a swap chain (more precisely, image presentation). Unfortunately, just checking whether the swap-chain extension is supported is not enough, because presentation support is a queue family property. A physical device may support swap chains, but that doesn’t mean that all its queue families also support it. And do we really need another queue or queue family for displaying images? Can’t we just use the graphics queue we selected in the first tutorial? Most of the time one queue family will probably be enough for our needs. This means that the selected queue family will support both graphics operations and presentation. But, unfortunately, it is also possible that there will be devices that won’t support graphics and presenting within a single queue family. In Vulkan we have to be flexible and prepared for any situation.

The vkGetPhysicalDeviceSurfaceSupportKHR() function is used to check whether a given queue family from a given physical device supports a swap chain or, to be more precise, whether it supports presenting images to a given surface. That’s why we needed to create a surface earlier.

So assume we have already checked whether a given physical device exposes a swap-chain extension and that we have already queried for a number of different queue families supported by a given physical device. We have also requested the properties of all queue families. Now we can check whether a given queue family supports presentation to our surface (window).

uint32_t graphics_queue_family_index = UINT32_MAX;
uint32_t present_queue_family_index = UINT32_MAX;

// Presentation support of each queue family, filled in the loop below
std::vector<VkBool32> queue_present_support( queue_families_count );

for( uint32_t i = 0; i < queue_families_count; ++i ) {
  vkGetPhysicalDeviceSurfaceSupportKHR( physical_device, i, Vulkan.PresentationSurface, &queue_present_support[i] );

  if( (queue_family_properties[i].queueCount > 0) &&
      (queue_family_properties[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) ) {
    // Select first queue that supports graphics
    if( graphics_queue_family_index == UINT32_MAX ) {
      graphics_queue_family_index = i;
    }

    // If there is queue that supports both graphics and present - prefer it
    if( queue_present_support[i] ) {
      selected_graphics_queue_family_index = i;
      selected_present_queue_family_index = i;
      return true;
    }
  }
}

// We don't have queue that supports both graphics and present so we have to use separate queues
for( uint32_t i = 0; i < queue_families_count; ++i ) {
  if( queue_present_support[i] ) {
    present_queue_family_index = i;
    break;
  }
}

// If this device doesn't support queues with graphics and present capabilities don't use it
if( (graphics_queue_family_index == UINT32_MAX) ||
    (present_queue_family_index == UINT32_MAX) ) {
  printf( "Could not find queue families with required properties on physical device %p!\n", physical_device );
  return false;
}

selected_graphics_queue_family_index = graphics_queue_family_index;
selected_present_queue_family_index = present_queue_family_index;
return true;

8.Tutorial02.cpp, function CheckPhysicalDeviceProperties()

Here we are iterating over all available queue families. In each loop iteration, we call a function responsible for checking whether a given queue family supports presentation. The vkGetPhysicalDeviceSurfaceSupportKHR() function requires us to provide a physical device handle, the queue family index we want to check, and the surface handle we want to render to (present an image on). If support is available, VK_TRUE will be stored at the given address; otherwise VK_FALSE is stored.

Now we have the properties of all available queue families. We know which queue families support graphics operations and which support presentation. In our tutorial example I prefer families that support both. If I find one, I store the family index and exit immediately from the CheckPhysicalDeviceProperties() function. If there is no such queue family, I use the first queue family that supports graphics and the first family that supports presenting. Only then can I leave the function with a “success” return code.

A more advanced scenario may search through all available devices and try to find one with a queue family that supports both graphics and presentation operations. But I can also imagine situations when there will be no single device that supports both. Then we are forced to use one device for graphics calculations (maybe like the old “graphics accelerator”) and another device for presenting results on the screen (connected to the “accelerator” and a monitor). Unfortunately, in such a case we must use “general” Vulkan functions from the Vulkan Runtime, or we need to store device-level functions for each device used (each device may have a different implementation of Vulkan functions). But, hopefully, such situations will be uncommon.

Creating a Device with a Swap Chain Extension Enabled

Now we can return to the CreateDevice() function. We have found the physical device that supports both graphics and presenting but not necessarily in a single queue family. We now need to create a logical device.

std::vector<VkDeviceQueueCreateInfo> queue_create_infos;
std::vector<float> queue_priorities = { 1.0f };

queue_create_infos.push_back( {
    VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,     // VkStructureType              sType
    nullptr,                                        // const void                  *pNext
    0,                                              // VkDeviceQueueCreateFlags     flags
    selected_graphics_queue_family_index,           // uint32_t                     queueFamilyIndex
    static_cast<uint32_t>(queue_priorities.size()), // uint32_t                     queueCount
    &queue_priorities[0]                            // const float                 *pQueuePriorities
} );

if( selected_graphics_queue_family_index != selected_present_queue_family_index ) {
  queue_create_infos.push_back( {
    VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,     // VkStructureType              sType
    nullptr,                                        // const void                  *pNext
    0,                                              // VkDeviceQueueCreateFlags     flags
    selected_present_queue_family_index,            // uint32_t                     queueFamilyIndex
    static_cast<uint32_t>(queue_priorities.size()), // uint32_t                     queueCount
    &queue_priorities[0]                            // const float                 *pQueuePriorities
  } );
}

std::vector<const char*> extensions = {
  VK_KHR_SWAPCHAIN_EXTENSION_NAME
};

VkDeviceCreateInfo device_create_info = {
  VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,             // VkStructureType                    sType
  nullptr,                                          // const void                        *pNext
  0,                                                // VkDeviceCreateFlags                flags
  static_cast<uint32_t>(queue_create_infos.size()), // uint32_t                           queueCreateInfoCount
  &queue_create_infos[0],                           // const VkDeviceQueueCreateInfo     *pQueueCreateInfos
  0,                                                // uint32_t                           enabledLayerCount
  nullptr,                                          // const char * const                *ppEnabledLayerNames
  static_cast<uint32_t>(extensions.size()),         // uint32_t                           enabledExtensionCount
  &extensions[0],                                   // const char * const                *ppEnabledExtensionNames
  nullptr                                           // const VkPhysicalDeviceFeatures    *pEnabledFeatures
};

if( vkCreateDevice( Vulkan.PhysicalDevice, &device_create_info, nullptr, &Vulkan.Device ) != VK_SUCCESS ) {
  printf( "Could not create Vulkan device!\n" );
  return false;
}

Vulkan.GraphicsQueueFamilyIndex = selected_graphics_queue_family_index;
Vulkan.PresentQueueFamilyIndex = selected_present_queue_family_index;
return true;

9.Tutorial02.cpp, function CreateDevice()

As before, we need to fill a variable of type VkDeviceCreateInfo. To do this, we need to declare which queue families we want to use and how many queues to enable from each. We do this through a pointer to a separate array with VkDeviceQueueCreateInfo elements. Here I declare a vector and add one element, which defines one queue from the queue family that supports graphics operations. We use a vector because, if graphics and presenting aren’t supported by a single family, we will need to define two separate families. If a single family supports both, we just define one member and declare that only one family is needed. If the indices of the graphics and presentation families are different, we need to declare additional members for our vector of VkDeviceQueueCreateInfo elements. In this case the VkDeviceCreateInfo structure must provide info about two different families. That’s why a vector once again comes in handy (with its size() member function).

But we are not finished with device creation yet. We have to ask for the third extension related to a swap chain—a device-level “VK_KHR_swapchain” extension. As mentioned earlier, this extension defines the actual support, implementation, and usage of a swap chain.

To ask for this extension, similarly to the instance level, we define an array (or a vector) that contains the names of all the device-level extensions we want to enable. We provide the address of the first element of this array and the number of extensions we want to use. This extension also contains a definition of its name in the form of the VK_KHR_SWAPCHAIN_EXTENSION_NAME define. We can use it inside our array (vector), and we don’t have to worry about any typos.

This third extension introduces additional functions used to actually create, destroy, or in general manage swap chains. Before we can use them, we of course need to load pointers to these functions. They are from the device level so we will place them in a ListOfFunctions.inl file using VK_DEVICE_LEVEL_FUNCTION() macro:

// From extensions
#if defined(USE_SWAPCHAIN_EXTENSIONS)
VK_DEVICE_LEVEL_FUNCTION( vkCreateSwapchainKHR )
VK_DEVICE_LEVEL_FUNCTION( vkDestroySwapchainKHR )
VK_DEVICE_LEVEL_FUNCTION( vkGetSwapchainImagesKHR )
VK_DEVICE_LEVEL_FUNCTION( vkAcquireNextImageKHR )
VK_DEVICE_LEVEL_FUNCTION( vkQueuePresentKHR )
#endif

10.ListOfFunctions.inl

You can once again see that I’m checking whether a USE_SWAPCHAIN_EXTENSIONS preprocessor directive is defined. I define it only in projects that enable swap-chain extensions.

Now that we have created a logical device, we need to retrieve the handles of a graphics queue and (if separate) a presentation queue. I’m using two separate queue variables for convenience, but they both may contain the same handle.

After loading the device-level functions we can read requested queue handles. Here’s the code for it:

vkGetDeviceQueue( Vulkan.Device, Vulkan.GraphicsQueueFamilyIndex, 0, &Vulkan.GraphicsQueue );
vkGetDeviceQueue( Vulkan.Device, Vulkan.PresentQueueFamilyIndex, 0, &Vulkan.PresentQueue );
return true;

11.Tutorial02.cpp, function GetDeviceQueue()

Creating a Semaphore

One last step before we can move to swap chain creation and usage is to create a semaphore. Semaphores are objects used for queue synchronization. They may be signaled or unsignaled. One queue may signal a semaphore (change its state from unsignaled to signaled) when some operations are finished, and another queue may wait on the semaphore until it becomes signaled. After that, the queue resumes performing operations submitted through command buffers.

VkSemaphoreCreateInfo semaphore_create_info = {
  VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,      // VkStructureType          sType
  nullptr,                                      // const void*              pNext
  0                                             // VkSemaphoreCreateFlags   flags
};

if( (vkCreateSemaphore( Vulkan.Device, &semaphore_create_info, nullptr, &Vulkan.ImageAvailableSemaphore ) != VK_SUCCESS) ||
    (vkCreateSemaphore( Vulkan.Device, &semaphore_create_info, nullptr, &Vulkan.RenderingFinishedSemaphore ) != VK_SUCCESS) ) {
  printf( "Could not create semaphores!\n" );
  return false;
}
return true;

12.Tutorial02.cpp, function CreateSemaphores()

To create a semaphore we call the vkCreateSemaphore() function. It requires us to provide create information with three fields:

  • sType – Standard structure type that must be set to VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO in this example.
  • pNext – Standard parameter reserved for future use.
  • flags – Another parameter that is reserved for future use and must equal zero.

Semaphores are used during drawing (or during presentation if we want to be more precise). I will describe the details later.

Creating a Swap Chain

We have enabled support for a swap chain, but before we can render anything on screen we must first create a swap chain from which we can acquire images on which we can render (or to which we can copy anything if we have rendered something into another image).

To create a swap chain, we call the vkCreateSwapchainKHR() function. It requires us to provide an address of a variable of type VkSwapchainCreateInfoKHR, which informs the driver about the properties of a swap chain that is being created. To fill this structure with the proper values, we must determine what is possible on a given hardware and platform. To do this we query the platform’s or window’s properties about the availability of and compatibility with several different features, that is, supported image formats or present modes (how images are presented on screen). So before we can create a swap chain we must check what is possible with a given platform and how we can create a swap chain.

Acquiring Surface Capabilities

First we must query for surface capabilities. To do this, we call the vkGetPhysicalDeviceSurfaceCapabilitiesKHR() function like this:

VkSurfaceCapabilitiesKHR surface_capabilities;
if( vkGetPhysicalDeviceSurfaceCapabilitiesKHR( Vulkan.PhysicalDevice, Vulkan.PresentationSurface, &surface_capabilities ) != VK_SUCCESS ) {
  printf( "Could not check presentation surface capabilities!\n" );
  return false;
}

13.Tutorial02.cpp, function CreateSwapChain()

The acquired capabilities contain important information about the ranges (limits) supported by the swap chain, that is, the minimal and maximal number of images, the minimal and maximal dimensions of images, and the supported transforms (some platforms may require transformations applied to images before these images may be presented).
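To give an idea of what the structure holds, here is a short, illustrative look at a few of its fields (the field names come from VkSurfaceCapabilitiesKHR; the printf calls are only for demonstration):

printf( "Swap chain image count limits: min %u, max %u\n",
        surface_capabilities.minImageCount,
        surface_capabilities.maxImageCount );      // maxImageCount == 0 means "no upper limit"
printf( "Current surface extent: %u x %u\n",
        surface_capabilities.currentExtent.width,
        surface_capabilities.currentExtent.height );

if( surface_capabilities.supportedTransforms & VK_SURFACE_TRANSFORM_IDENTITY_BIT_KHR ) {
  // images don't have to be rotated or mirrored before presentation
}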

Acquiring Supported Surface Formats

Next, we need to query for supported surface formats. Not all platforms are compatible with typical image formats like non-linear 32-bit RGBA. Some platforms don’t have any preferences, but others may only support a small range of formats. We can only select one of the available formats for a swap chain, or its creation will fail.

To query for surface formats, we must call the vkGetPhysicalDeviceSurfaceFormatsKHR() function. We can do it, as usual, twice: the first time to acquire the number of supported formats and a second time to acquire supported formats in an array prepared for this purpose. It can be done like this:

uint32_t formats_count;
if( (vkGetPhysicalDeviceSurfaceFormatsKHR( Vulkan.PhysicalDevice, Vulkan.PresentationSurface, &formats_count, nullptr ) != VK_SUCCESS) ||
    (formats_count == 0) ) {
  printf( "Error occurred during presentation surface formats enumeration!\n" );
  return false;
}

std::vector<VkSurfaceFormatKHR> surface_formats( formats_count );
if( vkGetPhysicalDeviceSurfaceFormatsKHR( Vulkan.PhysicalDevice, Vulkan.PresentationSurface, &formats_count, &surface_formats[0] ) != VK_SUCCESS ) {
  printf( "Error occurred during presentation surface formats enumeration!\n" );
  return false;
}

14.Tutorial02.cpp, function CreateSwapChain()

Acquiring Supported Present Modes

We should also ask for the available present modes, which tell us how images are presented (displayed) on the screen. The present mode defines whether an application will wait for v-sync or whether it will display an image immediately when it is available (which will probably lead to image tearing). I will describe the different present modes later.

To query for present modes that are supported on a given platform, we call the vkGetPhysicalDeviceSurfacePresentModesKHR() function. We can create code similar to this one:

uint32_t present_modes_count;
if( (vkGetPhysicalDeviceSurfacePresentModesKHR( Vulkan.PhysicalDevice, Vulkan.PresentationSurface, &present_modes_count, nullptr ) != VK_SUCCESS) ||
    (present_modes_count == 0) ) {
  printf( "Error occurred during presentation surface present modes enumeration!\n" );
  return false;
}

std::vector<VkPresentModeKHR> present_modes( present_modes_count );
if( vkGetPhysicalDeviceSurfacePresentModesKHR( Vulkan.PhysicalDevice, Vulkan.PresentationSurface, &present_modes_count, &present_modes[0] ) != VK_SUCCESS ) {
  printf( "Error occurred during presentation surface present modes enumeration!\n" );
  return false;
}

15.Tutorial02.cpp, function CreateSwapChain()

We now have acquired all the data that will help us prepare the proper values for a swap chain creation.

Selecting the Number of Swap Chain Images

A swap chain consists of multiple images. Several images (typically more than one) are required for the presentation engine to work properly, that is, one image is presented on the screen, another image waits in a queue for the next v-sync, and a third image is available for the application to render into.

An application may request more images. If it wants to use multiple images at once it may do so, for example, when encoding a video stream where every fourth image is a key frame and the application needs it to prepare the remaining three frames. Such usage will determine the number of images that will be automatically created in a swap chain: how many images the application requires at once for processing and how many images the presentation engine requires to function properly.

But we must ensure that the requested number of swap chain images is not smaller than the minimal required number of images and not greater than the maximal supported number of images (if there is such a limitation). Too many images require much more memory; on the other hand, too small a number of images may cause stalls in the application (more about this later).

The number of images that are required for a swap chain to work properly and for an application to be able to render to is defined in the surface capabilities. Here is some code that checks whether the number of images is between the allowable min and max values:

// Set of images defined in a swap chain may not always be available for application to render to:
// One may be displayed and one may wait in a queue to be presented
// If application wants to use more images at the same time it must ask for more images
uint32_t image_count = surface_capabilities.minImageCount + 1;
if( (surface_capabilities.maxImageCount > 0) &&
    (image_count > surface_capabilities.maxImageCount) ) {
  image_count = surface_capabilities.maxImageCount;
}
return image_count;

16.Tutorial02.cpp, function GetSwapChainNumImages()

The minImageCount value in the surface capabilities structure gives the required minimum number of images for the swap chain to work properly. Here I'm selecting one more image than is required, and I also check whether I'm asking for too much. One more image may be useful for a triple-buffering-like presentation mode (if it is available). In more advanced scenarios we would also need to track how many images we want to use at the same time (at once). Let's say we want to encode the video stream mentioned above and we need a key frame (every fourth image frame) plus the other three images. But the swap chain doesn't allow the application to operate on four images at once, only on three. We need to know that because we can only prepare two frames from a key frame, then we need to release them (give them back to the presentation engine), and only then can we acquire the last, third, non-key frame. This will become clearer later.
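
If an application really needs to own several swap chain images at the same time, the same clamping logic can take that requirement into account. Here is a hypothetical variant of GetSwapChainNumImages(); the images_needed_at_once parameter is my own addition for illustration and does not appear in Tutorial02.cpp:

// Hypothetical helper: request enough images so the application can hold
// "images_needed_at_once" images while the presentation engine keeps working
uint32_t GetSwapChainNumImages( VkSurfaceCapabilitiesKHR const &surface_capabilities,
                                uint32_t images_needed_at_once ) {
  uint32_t image_count = surface_capabilities.minImageCount + images_needed_at_once - 1;
  if( (surface_capabilities.maxImageCount > 0) &&
      (image_count > surface_capabilities.maxImageCount) ) {
    // The platform caps the number of images; we may get fewer than we wanted
    image_count = surface_capabilities.maxImageCount;
  }
  return image_count;
}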

Selecting a Format for Swap Chain Images

Choosing a format for the images depends on the type of processing/rendering we want to do; for example, if we want to blend the application window with the desktop contents, an alpha value may be required. We must also know what color space is available and whether we operate in a linear or sRGB color space.

Each platform may support a different number of format-colorspace pairs. If we want to use specific ones we must make sure that they are available.

// If the list contains only one entry with undefined format
// it means that there are no preferred surface formats and any can be chosen
if( (surface_formats.size() == 1) &&
    (surface_formats[0].format == VK_FORMAT_UNDEFINED) ) {
  return{ VK_FORMAT_R8G8B8A8_UNORM, VK_COLORSPACE_SRGB_NONLINEAR_KHR };
}

// Check if list contains most widely used R8 G8 B8 A8 format
// with nonlinear color space
for( VkSurfaceFormatKHR &surface_format : surface_formats ) {
  if( surface_format.format == VK_FORMAT_R8G8B8A8_UNORM ) {
    return surface_format;
  }
}

// Return the first format from the list
return surface_formats[0];

17.Tutorial02.cpp, function GetSwapChainFormat()

Earlier we requested the supported formats, which were placed in an array (a vector in our case). If this array contains only one value with an undefined format, the platform doesn't have any preferences. We can use any image format we want.

In other cases, we can use only one of the available formats. Here I'm looking for any (linear or not) 32-bit RGBA format. If it is available I choose it. If there is no such format I just take the first format from the list (hoping that it is also the best and contains the format with the most precision).

Selecting the Size of the Swap Chain Images

Typically the size of swap chain images will be identical to the window size. We can choose other sizes, but we must fit into image size constraints. The size of an image that would fit into the current application window’s size is available in the surface capabilities structure, in “currentExtent” member.

One thing worth noting is that a special value of “-1” indicates that the application’s window size will be determined by the swap chain size, so we can choose whatever dimension we want. But we must still make sure that the selected size is not smaller and not greater than the defined minimum and maximum constraints.

Selecting the swap chain size may (and probably usually will) look like this:

// Special value of surface extent is width == height == -1
// If this is so we define the size by ourselves but it must fit within defined confines
if( surface_capabilities.currentExtent.width == -1 ) {
  VkExtent2D swap_chain_extent = { 640, 480 };
  if( swap_chain_extent.width < surface_capabilities.minImageExtent.width ) {
    swap_chain_extent.width = surface_capabilities.minImageExtent.width;
  }
  if( swap_chain_extent.height < surface_capabilities.minImageExtent.height ) {
    swap_chain_extent.height = surface_capabilities.minImageExtent.height;
  }
  if( swap_chain_extent.width > surface_capabilities.maxImageExtent.width ) {
    swap_chain_extent.width = surface_capabilities.maxImageExtent.width;
  }
  if( swap_chain_extent.height > surface_capabilities.maxImageExtent.height ) {
    swap_chain_extent.height = surface_capabilities.maxImageExtent.height;
  }
  return swap_chain_extent;
}

// In most cases the size of the swap chain images will be equal to the current window's size
return surface_capabilities.currentExtent;

18.Tutorial02.cpp, function GetSwapChainExtent()

Selecting Swap Chain Usage Flags

Usage flags define how a given image may be used in Vulkan. If we want an image to be sampled (used inside shaders) it must be created with “sampled” usage. If the image should be used as a depth render target, it must be created with “depth and stencil” usage. An image without proper usage “enabled” cannot be used for a given purpose or the results of such operations will be undefined.

For a swap chain we want to render (in most cases) into the image (use it as a render target), so we must specify "color attachment" usage with the VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT flag. In Vulkan this usage is always available for swap chains, so we can always set it without any additional checking. But for any other usage we must ensure it is supported – we can do this through the "supportedUsageFlags" member of the surface capabilities structure.

// Color attachment flag must always be supported
// We can define other usage flags but we always need to check if they are supported
if( surface_capabilities.supportedUsageFlags & VK_IMAGE_USAGE_TRANSFER_DST_BIT ) {
  return VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT | VK_IMAGE_USAGE_TRANSFER_DST_BIT;
}
return 0;

19.Tutorial02.cpp, function GetSwapChainUsageFlags()

In this example we define additional “transfer destination” usage which is required for image clear operation.

Selecting Pre-Transformations

On some platforms we may want our image to be transformed. This is usually the case on tablets when they are oriented in a way other than their default orientation. During swap chain creation we must specify what transformations should be applied to images prior to presenting. We can, of course, use only the supported transforms, which can be found in a “supportedTransforms” member of acquired surface capabilities.

If the selected pre-transform is other than the current transformation (also found in surface capabilities) the presentation engine will apply the selected transformation. On some platforms this may cause performance degradation (probably not noticeable but worth mentioning). In the sample code, I don’t want any transformations but, of course, I must check whether it is supported. If not, I’m just using the same transformation that is currently used.

// Sometimes images must be transformed before they are presented (i.e. due to device's orientation
// being other than default orientation)
// If the specified transform is other than current transform, presentation engine will transform image
// during presentation operation; this operation may hit performance on some platforms
// Here we don't want any transformations to occur so if the identity transform is supported use it
// otherwise just use the same transform as current transform
if( surface_capabilities.supportedTransforms & VK_SURFACE_TRANSFORM_IDENTITY_BIT_KHR ) {
  return VK_SURFACE_TRANSFORM_IDENTITY_BIT_KHR;
} else {
  return surface_capabilities.currentTransform;
}

20.Tutorial02.cpp, function GetSwapChainTransform()

Selecting Presentation Mode

Present modes determine the way images will be processed internally by the presentation engine and displayed on the screen. In the past, there was just a single buffer that was displayed all the time. If we were drawing anything on it the draw operations (whole process of image creation) were visible.

Double buffering was introduced to prevent the visibility of drawing operations: one image was displayed and the second was used to render into. During presentation, the contents of the second image were copied into the first image (in early implementations) or, later, the images were swapped (remember the SwapBuffers() function used in OpenGL applications?), which means that their pointers were exchanged.

Tearing was another issue with displaying images, so the ability to wait for the vertical blank signal was introduced if we wanted to avoid it. But waiting introduced another problem: input lag. So double buffering was changed into triple buffering in which we were drawing into two back buffers interchangeably and during v-sync the most recent one was used for presentation.

This is exactly what presentation modes are for: how to deal with all these issues, how to present images on the screen and whether we want to use v-sync.

Currently there are four presentation modes:

  • IMMEDIATE. Present requests are applied immediately and tearing may be observed (depending on the frames per second). Internally the presentation engine doesn’t use any queue for holding swap chain images.

  • FIFO. This mode is the most similar to OpenGL’s buffer swapping with a swap interval set to 1. The image is displayed (replaces the currently displayed image) only during vertical blanking periods, so no tearing should be visible. Internally, the presentation engine uses a FIFO queue with “numSwapchainImages – 1” elements. Present requests are appended to the end of this queue. During blanking periods, the image from the beginning of the queue replaces the currently displayed image, which may then become available to the application. If all images are in the queue, the application has to wait until v-sync releases the currently displayed image. Only after that does it become available to the application and the program may render into it. This mode must always be available in all Vulkan implementations supporting the swap chain extension.

  • FIFO RELAXED. This mode is similar to FIFO, but when an image has been displayed for longer than one blanking period it may be released immediately, without waiting for another v-sync signal (so if we are rendering frames with a lower frequency than the screen’s refresh rate, tearing may be visible).
     
  • MAILBOX. In my opinion, this mode is the most similar to the mentioned triple buffering. The image is displayed only during vertical blanking periods and no tearing should be visible. But internally, the presentation engine uses a queue with only a single element. One image is displayed and one waits in the queue. If the application wants to present another image, it is not appended to the end of the queue but instead replaces the one that waits. So the queue always holds the most recently generated image. This behavior is available if there are more than two images. With only two images MAILBOX mode behaves similarly to FIFO (as we have to wait for the displayed image to be released, we don’t have a “spare” image that can be exchanged with the one waiting in the queue).

Deciding on which presentation mode to use depends on the type of operations we want to do. If we want to decode and display movies we want all frames to be displayed in a proper order. So the FIFO mode is in my opinion the best choice. But if we are creating a game, we usually want to display the most recently generated frame. In this case I suggest using MAILBOX because there is no tearing and input lag is minimized. The most recently generated image is displayed and the application doesn’t need to wait for v-sync. But to achieve this behavior, at least three images must be created and this mode may not always be supported.

FIFO mode is always available and requires at least two images but causes the application to wait for v-sync (no matter how many swap chain images were requested). Immediate mode is the fastest. As I understand it, it also requires two images, but it doesn’t make the application wait for the monitor’s refresh rate. On the downside it may cause image tearing. The choice is yours but, as always, we must make sure that the chosen presentation mode is supported.

Earlier we queried for available present modes, so now we must look for the one that best suits our needs. Here is the code in which I’m looking for MAILBOX mode:

// FIFO present mode is always available
// MAILBOX is the lowest latency V-Sync enabled mode (something like triple-buffering) so use it if available
for( VkPresentModeKHR &present_mode : present_modes ) {
  if( present_mode == VK_PRESENT_MODE_MAILBOX_KHR ) {
    return present_mode;
  }
}
return VK_PRESENT_MODE_FIFO_KHR;

21.Tutorial02.cpp, function GetSwapChainPresentMode()

Creating a Swap Chain

Now we have all the data necessary to create a swap chain. We have defined all the required values, and we are sure they fit into the given platform’s constraints.

uint32_t                      desired_number_of_images = GetSwapChainNumImages( surface_capabilities );
VkSurfaceFormatKHR            desired_format = GetSwapChainFormat( surface_formats );
VkExtent2D                    desired_extent = GetSwapChainExtent( surface_capabilities );
VkImageUsageFlags             desired_usage = GetSwapChainUsageFlags( surface_capabilities );
VkSurfaceTransformFlagBitsKHR desired_transform = GetSwapChainTransform( surface_capabilities );
VkPresentModeKHR              desired_present_mode = GetSwapChainPresentMode( present_modes );
VkSwapchainKHR                old_swap_chain = Vulkan.SwapChain;

if( static_cast<int>(desired_usage) == 0 ) {
  printf( "TRANSFER_DST image usage is not supported by the swap chain!" );
  return false;
}

VkSwapchainCreateInfoKHR swap_chain_create_info = {
  VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,  // VkStructureType                sType
  nullptr,                                      // const void                    *pNext
  0,                                            // VkSwapchainCreateFlagsKHR      flags
  Vulkan.PresentationSurface,                   // VkSurfaceKHR                   surface
  desired_number_of_images,                     // uint32_t                       minImageCount
  desired_format.format,                        // VkFormat                       imageFormat
  desired_format.colorSpace,                    // VkColorSpaceKHR                imageColorSpace
  desired_extent,                               // VkExtent2D                     imageExtent
  1,                                            // uint32_t                       imageArrayLayers
  desired_usage,                                // VkImageUsageFlags              imageUsage
  VK_SHARING_MODE_EXCLUSIVE,                    // VkSharingMode                  imageSharingMode
  0,                                            // uint32_t                       queueFamilyIndexCount
  nullptr,                                      // const uint32_t                *pQueueFamilyIndices
  desired_transform,                            // VkSurfaceTransformFlagBitsKHR  preTransform
  VK_COMPOSITE_ALPHA_OPAQUE_BIT_KHR,            // VkCompositeAlphaFlagBitsKHR    compositeAlpha
  desired_present_mode,                         // VkPresentModeKHR               presentMode
  VK_TRUE,                                      // VkBool32                       clipped
  old_swap_chain                                // VkSwapchainKHR                 oldSwapchain
};

if( vkCreateSwapchainKHR( Vulkan.Device, &swap_chain_create_info, nullptr, &Vulkan.SwapChain ) != VK_SUCCESS ) {
  printf( "Could not create swap chain!\n" );
  return false;
}
if( old_swap_chain != VK_NULL_HANDLE ) {
  vkDestroySwapchainKHR( Vulkan.Device, old_swap_chain, nullptr );
}

return true;

22.Tutorial02.cpp, function CreateSwapChain()

In this code example, at the beginning we gathered all the necessary data described earlier. Next we create a variable of type VkSwapchainCreateInfoKHR. It consists of the following members:

  • sType – Normal structure type, which here must be a VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR.
  • pNext – Pointer reserved for future use (for some extensions to this extension).
  • flags – Value reserved for future use; currently must be set to zero.
  • surface – A handle of a created surface that represents windowing system (our application’s window).
  • minImageCount – Minimal number of images the application requests for a swap chain (must fit into available constraints).
  • imageFormat – Application-selected format for swap chain images; must be one of the supported surface formats.
  • imageColorSpace – Colorspace for swap chain images; only enumerated values of format-colorspace pairs may be used for imageFormat and imageColorSpace (we can’t use format from one pair and colorspace from another pair).
  • imageExtent – Size (dimensions) of swap chain images defined in pixels; must fit into available constraints.
  • imageArrayLayers – Defines the number of layers in swap chain images (that is, views); typically this value will be one, but if we want to create multiview or stereo (stereoscopic 3D) images, we can set it to some higher value.
  • imageUsage – Defines how application wants to use images; it may contain only values of supported usages; color attachment usage is always supported.
  • imageSharingMode – Describes image-sharing mode when multiple queues are referencing images (I will describe this in more detail later).
  • queueFamilyIndexCount – The number of different queue families from which swap chain images will be referenced; this parameter matters only when VK_SHARING_MODE_CONCURRENT sharing mode is used.
  • pQueueFamilyIndices – An array containing all the indices of queue families that will be referencing swap chain images; must contain at least queueFamilyIndexCount elements and as in queueFamilyIndexCount this parameter matters only when VK_SHARING_MODE_CONCURRENT sharing mode is used.
  • preTransform – Transformations applied to the swap chain image before it can be presented; must be one of the supported values.
  • compositeAlpha – This parameter is used to indicate how the surface (image) should be composited (blended?) with other surfaces on some windowing systems; this value must also be one of the possible values (bits) returned in surface capabilities, but it looks like opaque composition (no blending, alpha ignored) will be always supported (as most of the games will want to use this mode).
  • presentMode – Presentation mode that will be used by a swap chain; only supported mode may be selected.
  • clipped – Connected with ownership of pixels; in general it should be set to VK_TRUE if the application doesn’t want to read back from swap chain images (like ReadPixels()) as it will allow some platforms to use more optimal presentation methods; VK_FALSE is used in some specific scenarios (if I learn more about these scenarios I will write about them).
  • oldSwapchain – If we are recreating a swap chain, this parameter defines an old swap chain that will be replaced by a newly created one.

So what’s the matter with this sharing mode? Images in Vulkan can be referenced by queues. This means that we can create commands that use these images. These commands are stored in command buffers, and these command buffers are submitted to queues. Queues belong to different queue families. And Vulkan requires us to state how many different queue families and which of them are referencing these images through commands submitted with command buffers.

If we want to reference images from many different queue families at a time we can do so. In this case we must provide the “concurrent” sharing mode. But this (probably) requires us to manage image data coherency by ourselves, that is, we must synchronize different queues in such a way that the data in the images is correct and no hazards occur (for example, some queues reading data from images that other queues haven’t finished writing to yet).

We may not specify these queue families and just tell Vulkan that only one queue family (queues from one family) will be referencing an image at a time. This doesn’t mean other queues can’t reference these images. It just means they can’t do it all at once, at the same time. So if we want to reference images from one family and then from another we must specifically tell Vulkan: “My image was used inside this queue family, but from now on another family, this one, will be referencing it.” Such a transition is done using an image memory barrier. When only one queue family uses a given image at a time, use the “exclusive” sharing mode.

If any of these requirements are not fulfilled, undefined behavior will probably occur and we may not rely on the image contents.

In this example we are using only one queue so we don’t have to specify “concurrent” sharing mode, and we leave the related parameters (queueFamilyIndexCount and pQueueFamilyIndices) blank (nulled, or zeroed).
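
For completeness, here is a sketch of how the sharing-related members could be filled if a separate graphics queue family had to reference the swap chain images concurrently with the present queue family. The GraphicsQueueFamilyIndex member is my own assumption; this tutorial only uses the present queue family:

// Sketch: filling the sharing-mode fields when two distinct queue families
// must access swap chain images at the same time (not used in Tutorial02)
uint32_t queue_family_indices[] = {
  Vulkan.GraphicsQueueFamilyIndex,   // hypothetical member, not defined in this tutorial
  Vulkan.PresentQueueFamilyIndex
};

if( Vulkan.GraphicsQueueFamilyIndex != Vulkan.PresentQueueFamilyIndex ) {
  swap_chain_create_info.imageSharingMode      = VK_SHARING_MODE_CONCURRENT;
  swap_chain_create_info.queueFamilyIndexCount = 2;
  swap_chain_create_info.pQueueFamilyIndices   = queue_family_indices;
} else {
  // Only one family at a time: exclusive mode, the sharing parameters are ignored
  swap_chain_create_info.imageSharingMode      = VK_SHARING_MODE_EXCLUSIVE;
  swap_chain_create_info.queueFamilyIndexCount = 0;
  swap_chain_create_info.pQueueFamilyIndices   = nullptr;
}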

So now we can call the vkCreateSwapchainKHR() function to create a swap chain and check whether this operation succeeded. After that (if we are recreating the swap chain, meaning this isn’t the first time we are creating one) we should destroy the previous swap chain. I’ll discuss this later.

Image Presentation

We now have a working swap chain that contains several images. To use these images as render targets, we can get handles to all images created with a swap chain, but we are not allowed to use them just like that. Swap chain images belong to and are owned by the swap chain. This means that the application cannot use these images until it asks for them. This also means that images are created and destroyed by the platform along with a swap chain (not by the application).

So when the application wants to render into a swap chain image or use it in any other way, it must first get access to it by asking a swap chain for it. If the swap chain makes us wait, we have to wait. And after the application finishes using the image it should “return” it by presenting it. If we forget about returning images to a swap chain, we will soon run out of images and nothing will display on the screen.

The application may also request access to more images at once but they must be available. Acquiring access may require waiting. In corner cases, when there are too few images in a swap chain and the application wants to access too many of them, or if we forget about returning images to a swap chain, the application may even wait an infinite amount of time.

Given that there are (usually) at least two images, it may sound strange that we have to wait, but it is quite reasonable. Not all images are available for the application because they are used by the presentation engine. Usually one image is displayed. Additional images may also be required for the presentation engine to work properly. So we can’t use them because it could block the presentation engine in some way. We don’t know its internal mechanisms and algorithms or the requirements of the OS the application is executed on. So the availability of images may depend on many factors: internal implementation, OS, number of created images, number of images the application wants to use at a single time and on the selected presentation mode, which is the most important factor from the perspective of this tutorial.

In immediate mode, one image is always presented. Other images (at least one) are available to the application. When the application posts a presentation request (“returns” an image), the image that was displayed is replaced with the new one. So if two images are created, only one image may be available to the application at a single time. When the application asks for another image, it must “return” the previous one. If it wants two images at a time, it must create a swap chain with more images or it will wait forever. In general, in immediate mode, the application can own “imageCount – 1” images at a time.

In FIFO mode one image is displayed, and the rest are placed in a FIFO queue. The length of this queue is always equal to “imageCount – 1.” At first, all images may be available to the application (because the queue is empty and no image is presented). When the application presents an image (“returns” it to the swap chain), it is appended to the end of the queue. So as soon as the queue fills up, the application has to wait for another image until the displayed image is released during a vertical blanking period. Images are always displayed in the same order they were presented in by the application. When the v-sync signal appears, the first image from the queue replaces the image that was displayed. The previously displayed image (the released one) may become available to the application as soon as it becomes unused (isn’t presented and is not waiting in the queue). If all images are in the queue, the application must wait for the next blanking period to access another image. If rendering takes longer than the refresh period, the application will not have to wait at all. This behavior doesn’t change when there are more images: the internal swap chain queue always has “imageCount – 1” elements.

The last mode available for the time being is MAILBOX. As previously mentioned, this mode is most similar to the “traditional” triple buffering. One image is always displayed. A second image waits in a single-element queue (it always has room for only one element). The rest of the images may be available to the application. When the application presents an image, the new image replaces the one waiting in the queue. The image in the queue gets displayed only during blanking periods, but the application doesn’t need to wait for the next image (when there are more than two images). MAILBOX mode with only two images behaves identically to FIFO mode: the application must wait for the v-sync signal to acquire the next image. But with at least three images it may immediately acquire the image that was replaced by the “presented” image (the one waiting in the queue). That’s why I requested one more image than the minimal number. If MAILBOX mode is available I want to use it in a manner similar to triple buffering (maybe the first thing to do is to check what mode is available and only after that choose the number of swap chain images based on the selected presentation mode).

I hope these examples help you understand why the application must ask for an image if it wants to use any. In Vulkan we can only do what is allowed and required—not less and usually not too much more.

uint32_t image_index;
VkResult result = vkAcquireNextImageKHR( Vulkan.Device, Vulkan.SwapChain, UINT64_MAX, Vulkan.ImageAvailableSemaphore, VK_NULL_HANDLE, &image_index );
switch( result ) {
  case VK_SUCCESS:
  case VK_SUBOPTIMAL_KHR:
    break;
  case VK_ERROR_OUT_OF_DATE_KHR:
    return OnWindowSizeChanged();
  default:
    printf( "Problem occurred during swap chain image acquisition!\n" );
    return false;
}

23.Tutorial02.cpp, function Draw()

To access an image, we must call the vkAcquireNextImageKHR() function. During the call we must specify (apart from the device handle, as in almost all other functions) the swap chain from which we want to use an image, a timeout, a semaphore, and a fence object. On success, the function stores the image index in the variable whose address we provided. Why an index and not the (handle to the) image itself? Such behavior may be convenient (for example, during a “preprocessing” phase when we want to prepare as much of the data needed for rendering as possible so we don’t waste time during typical frame rendering), but I will describe it later. Just remember that we can check what images were created in a swap chain if we want (we just can’t use them until we are allowed to). An array of images is provided upon such a query. And the vkAcquireNextImageKHR() function stores an index into this very array.

We have to specify a timeout because sometimes images may not be immediately available. Trying to use an image before we are allowed to will cause undefined behavior. Specifying a timeout gives the presentation engine time to react. If it needs to wait for the next vertical blanking period it can do so, and we give it time. The function will block until the given time has passed. We can provide the maximal available value, so the function may even block indefinitely. If we provide 0 for the timeout, the function will return immediately: if any image was available at the time of the call it will be provided immediately, and if there was no available image, an error will be returned stating that the image is not yet ready.
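
As a side note, passing 0 as the timeout turns the call into a simple poll. A minimal sketch of such a non-blocking acquisition (not used in this tutorial, which passes UINT64_MAX and is willing to block):

// Sketch: polling for a swap chain image instead of blocking (timeout == 0)
uint32_t image_index;
VkResult result = vkAcquireNextImageKHR( Vulkan.Device, Vulkan.SwapChain, 0,
                                         Vulkan.ImageAvailableSemaphore, VK_NULL_HANDLE, &image_index );
if( result == VK_NOT_READY ) {
  // No image is available right now; do some other work and try again later
} else if( (result == VK_SUCCESS) || (result == VK_SUBOPTIMAL_KHR) ) {
  // image_index may be used for this frame
}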

Once we have our image we can use it however we want. Images are processed or referenced by commands stored in command buffers. We can prepare command buffers earlier (to save as much processing time for rendering as we can) and use or submit them here. Or we can prepare the commands now and submit them when we’re done. In Vulkan, creating command buffers and submitting them to queues is the only way to cause operations to be performed by the device.

When command buffers are submitted to queues, all their commands start being processed. But a queue cannot use an image until it is allowed to, and the semaphore we created earlier is for internal queue synchronization—before the queue starts processing commands that reference a given image, it should wait on this semaphore (until it gets signaled). But this wait doesn’t block an application. There are two synchronization mechanisms for accessing swap chain images: (1) a timeout, which may block an application but doesn’t stop queue processing, and (2) a semaphore, which doesn’t block the application but blocks selected queues.

We now know (theoretically) how to render anything (through command buffers). So let’s imagine that some rendering operations take place inside the command buffer we are submitting. Before this processing starts, we should tell the queue (on which the rendering will occur) to wait. This is all done within one submit operation.

VkPipelineStageFlags wait_dst_stage_mask = VK_PIPELINE_STAGE_TRANSFER_BIT;
VkSubmitInfo submit_info = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,                // VkStructureType              sType
  nullptr,                                      // const void                  *pNext
  1,                                            // uint32_t                     waitSemaphoreCount
  &Vulkan.ImageAvailableSemaphore,              // const VkSemaphore           *pWaitSemaphores
  &wait_dst_stage_mask,                         // const VkPipelineStageFlags  *pWaitDstStageMask
  1,                                            // uint32_t                     commandBufferCount
  &Vulkan.PresentQueueCmdBuffers[image_index],  // const VkCommandBuffer       *pCommandBuffers
  1,                                            // uint32_t                     signalSemaphoreCount
  &Vulkan.RenderingFinishedSemaphore            // const VkSemaphore           *pSignalSemaphores
};

if( vkQueueSubmit( Vulkan.PresentQueue, 1, &submit_info, VK_NULL_HANDLE ) != VK_SUCCESS ) {
  return false;
}

24.Tutorial02.cpp, function Draw()

First we prepare a structure with information about the types of operations we want to submit to the queue. This is done through VkSubmitInfo structure. It contains the following fields:

  • sType – Standard structure type; here it must be set to VK_STRUCTURE_TYPE_SUBMIT_INFO.
  • pNext – Standard pointer reserved for future use.
  • waitSemaphoreCount – Number of semaphores we want the queue to wait on before it starts processing commands from command buffers.
  • pWaitSemaphores – Pointer to an array with semaphore handles on which queue should wait; this array must contain at least waitSemaphoreCount elements.
  • pWaitDstStageMask – Pointer to an array with the same number of elements as the pWaitSemaphores array; it describes the pipeline stage at which each (corresponding) semaphore wait will occur; in our example, the queue may perform some operations before it starts using the image from the swap chain, so there is no reason to block all of the operations; the queue may start processing some drawing commands and will wait only when the pipeline gets to the stage in which the image is used.
  • commandBufferCount – Number of command buffers we are submitting for execution.
  • pCommandBuffers – Pointer to an array with command buffers handles which must contain at least commandBufferCount elements.
  • signalSemaphoreCount – Number of semaphores we want the queue to signal after processing all the submitted command buffers.
  • pSignalSemaphores – Pointer to an array of at least signalSemaphoreCount elements with semaphore handles; these semaphores will be signaled after the queue has finished processing commands submitted within this submit information.

In this example we are telling the queue to wait only on one semaphore, which will be signaled by the presentation engine when the queue can safely start processing commands referencing the swap chain image.

We also submit just one simple command buffer. It was prepared earlier (I will describe how to do it later). It only clears the acquired image. But this is enough for us to see the selected color in our application’s window and to see that the swap chain is working properly.

In the code above, the command buffers are arranged in an array (a vector, to be more precise). To make it easier to submit the proper command buffer—the one that references the currently acquired image—I prepared a separate command buffer for each swap chain image. The index of an image that the vkAcquireNextImageKHR() function provides can be used right here. Using image handles (in similar scenarios) would require creating maps that would translate the handle into a specific command buffer or index. On the other hand, normal numbers can be used to just select a specific array element. This is why this function gives us indices and not image handles.

After we have submitted a command buffer, all the processing starts in the background, on “hardware.” Next, we want to present a rendered image. Presenting means that we want our image to be displayed and that we are “giving it back” to the swap chain. The code to do this might look like this:

VkPresentInfoKHR present_info = {
  VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,           // VkStructureType              sType
  nullptr,                                      // const void                  *pNext
  1,                                            // uint32_t                     waitSemaphoreCount
  &Vulkan.RenderingFinishedSemaphore,           // const VkSemaphore           *pWaitSemaphores
  1,                                            // uint32_t                     swapchainCount
  &Vulkan.SwapChain,                            // const VkSwapchainKHR        *pSwapchains
  &image_index,                                 // const uint32_t              *pImageIndices
  nullptr                                       // VkResult                    *pResults
};
result = vkQueuePresentKHR( Vulkan.PresentQueue, &present_info );

switch( result ) {
  case VK_SUCCESS:
    break;
  case VK_ERROR_OUT_OF_DATE_KHR:
  case VK_SUBOPTIMAL_KHR:
    return OnWindowSizeChanged();
  default:
    printf( "Problem occurred during image presentation!\n" );
    return false;
}

return true;

25.Tutorial02.cpp, function Draw()

An image (or images) is presented by calling the vkQueuePresentKHR() function. It may be perceived as submitting a command buffer with only one operation: presentation.

To present an image we must specify which images should be presented, from how many swap chains, and from which ones. We can present many images from many swap chains at once (that is, to multiple windows), but only one image from any single swap chain can be presented at a time. We provide this information through the VkPresentInfoKHR structure, which contains the following fields:

  • sType – Standard structure type, it must be a VK_STRUCTURE_TYPE_PRESENT_INFO_KHR here.
  • pNext – Parameter reserved for future use.
  • waitSemaphoreCount – The number of semaphores we want the queue to wait on before it presents images.
  • pWaitSemaphores – Pointer to an array with semaphore handles on which the queue should wait; this array must contain at least waitSemaphoreCount elements.
  • swapchainCount – The number of swapchains to which we would like to present images.
  • pSwapchains – An array with swapchainCount elements that contains handles of all the swap chains that we  want to present images to; any single swap chain may only appear once in this array.
  • pImageIndices – An array with swapchainCount elements that contains indices of images that we want to present; each element of this array corresponds to a swap chain in the pSwapchains array; the image index is the index into the array of each swap chain’s images (see the next section).
  • pResults – A pointer to an array of at least swapchainCount elements; this parameter is optional and can be set to null, but if we provide such an array, the result of the presenting operation will be stored in each of its elements, for each swap chain respectively; the single value returned by the whole function is the same as the worst result value from all swap chains.

Now that we have prepared this structure, we can use it to present an image. In this example I’m just presenting a single image from a single swap chain.

Each operation that is performed (or submitted) by calling vkQueue…() functions (this includes presenting) is appended to the end of the queue for processing. Operations are processed in the order in which they were submitted. For a presentation, we are presenting an image after submitting other command buffers. So the present queue will start presenting an image after the processing of all the command buffers is done. This ensures that the image will be presented after we are done using it (rendering into it) and an image with correct contents will be displayed on the screen. But in this example we submit drawing (clearing) operations and a present operation to the same queue: the PresentQueue. We are doing only simple operations that are allowed to be done on a present queue.

If we want to perform drawing operations on a queue different from the one used for presentation, we need to synchronize the queues. This is done, again, with semaphores, which is the reason why we created two semaphores (the second one may not be necessary in this example, as we render and present using the same queue, but I wanted to show how it should be done the correct way).

The first semaphore is for the presentation engine to tell the queue that it can safely use (reference/render into) an image. The second semaphore is for us. It is signaled when the operations on the image (rendering into it) are done. The submit info structure has a field called pSignalSemaphores. It is an array of semaphore handles that will be signaled after processing of all of the submitted command buffers is finished. So we need to tell the second queue to wait on this second semaphore: we store the handle of our second semaphore in the pWaitSemaphores field of the VkPresentInfoKHR structure. And the queue to which we are submitting the present operation will wait, thanks to this second semaphore, until we are done rendering into a given image.
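
To make this more concrete, the sketch below shows how the same frame could be split across two queues. Vulkan.GraphicsQueue and graphics_cmd_buffer are hypothetical names (this tutorial submits everything to the present queue); the presentation code from example 25 stays unchanged because it already waits on RenderingFinishedSemaphore:

// Sketch: rendering on a separate graphics queue, presenting on the present queue
VkPipelineStageFlags wait_dst_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,           // VkStructureType              sType
  nullptr,                                 // const void                  *pNext
  1,                                       // uint32_t                     waitSemaphoreCount
  &Vulkan.ImageAvailableSemaphore,         // const VkSemaphore           *pWaitSemaphores
  &wait_dst_stage_mask,                    // const VkPipelineStageFlags  *pWaitDstStageMask
  1,                                       // uint32_t                     commandBufferCount
  &graphics_cmd_buffer,                    // const VkCommandBuffer       *pCommandBuffers (hypothetical)
  1,                                       // uint32_t                     signalSemaphoreCount
  &Vulkan.RenderingFinishedSemaphore       // const VkSemaphore           *pSignalSemaphores
};
if( vkQueueSubmit( Vulkan.GraphicsQueue, 1, &submit_info, VK_NULL_HANDLE ) != VK_SUCCESS ) {
  return false;
}
// The present queue then waits on RenderingFinishedSemaphore (pWaitSemaphores in
// VkPresentInfoKHR), so the image is not presented before rendering is finished.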

And that’s it. We have displayed our first image using Vulkan!

Checking What Images Were Created in a Swap Chain

Previously I mentioned swap chain’s image indices. Here in this code sample, I show you more specifically what I was talking about.

uint32_t image_count = 0;
if( (vkGetSwapchainImagesKHR( Vulkan.Device, Vulkan.SwapChain, &image_count, nullptr ) != VK_SUCCESS) ||
    (image_count == 0) ) {
  printf( "Could not get the number of swap chain images!\n" );
  return false;
}

std::vector<VkImage> swap_chain_images( image_count );
if( vkGetSwapchainImagesKHR( Vulkan.Device, Vulkan.SwapChain, &image_count, &swap_chain_images[0] ) != VK_SUCCESS ) {
  printf( "Could not get swap chain images!\n" );
  return false;
}

26.

This code sample is a fragment of an imaginary function that checks how many and what images were created inside a swap chain. It is done by a traditional “double call,” this time using the vkGetSwapchainImagesKHR() function. First we call it with the last parameter set to null. This way the number of all images created in a swap chain is stored in the “image_count” variable and we know how much storage we need to prepare for the handles of all images. The second time we call this function, we receive the handles in the array whose address we provided through the last parameter.

Now we know all the images that the swap chain is using. For the vkAcquireNextImageKHR() function and VkPresentInfoKHR structure, the indices I referred to are the indices into this array, an array “returned” by the vkGetSwapchainImagesKHR() function. It is called an array of a swap chain’s presentable images. And if any function, in the case of a swap chain, wants us to provide an index or returns an index, it is the index of an image in this very array.

Recreating a Swap Chain

Previously, I mentioned that sometimes we must recreate a swap chain, and I also said that the old swap chain must be destroyed. The vkAcquireNextImageKHR() and vkQueuePresentKHR() functions return a result that sometimes causes the OnWindowSizeChanged() function to be called. This function recreates the swap chain.

Sometimes a swap chain gets out of date. This means that the properties of the surface, platform, or application window changed in such a way that the current swap chain cannot be used any more. The most obvious (and unfortunately not so good) example is when the window’s size changes. We cannot add images to an existing swap chain, nor can we change the size of its images; the only possibility is to destroy and recreate the swap chain. There are also situations in which we can still use a swap chain, but it may no longer be optimal for the surface it was created for.

These situations are notified by the return codes of the vkAcquireNextImageKHR() and vkQueuePresentKHR() functions.

When the VK_SUBOPTIMAL_KHR value is returned, we can still use the current swap chain for presentation. It will still work, but not optimally (for example, color precision may be worse). It is advised to recreate the swap chain when there is an opportunity. A good example is when we have performed performance-heavy rendering and, after acquiring the image, we are informed that the image is suboptimal. We don’t want to waste all this processing and make the user wait much longer for another frame. We just present the image and recreate the swap chain as soon as there is an opportunity.

When VK_ERROR_OUT_OF_DATE_KHR is returned, we cannot use the current swap chain at all; presenting with it will fail. We must recreate the swap chain as soon as possible.

I have mentioned that changing the window size is the most obvious, but not so good, example of a surface property change after which we should recreate a swap chain. In this situation we should recreate the swap chain, but we may not be notified about it with the mentioned return codes. We should monitor window size changes ourselves using OS-specific code. That’s why the function in our source code is named OnWindowSizeChanged: it is called every time the window’s size changes. And as this function only recreates the swap chain (and command buffers), the same function can be called here.

Recreation is done the same way as creation. There is a structure member in which we provide the swap chain that the new one should replace. But we must explicitly destroy the old swap chain by ourselves after the new one is created.
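
A bare-bones recreation path might look like the sketch below. The vkDeviceWaitIdle() call is my own conservative addition to make sure the old resources are no longer in use before they are replaced; CreateSwapChain() itself passes the old swap chain in the oldSwapchain member and destroys it afterwards (see code example 22):

// Sketch: minimal swap chain recreation path
bool OnWindowSizeChanged() {
  if( Vulkan.Device != VK_NULL_HANDLE ) {
    // Wait until the device stops using the old swap chain and command buffers
    vkDeviceWaitIdle( Vulkan.Device );
  }

  if( !CreateSwapChain() ) {
    return false;
  }
  // Command buffers reference swap chain images, so they must be recreated and re-recorded too
  return CreateCommandBuffers();
}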

Quick Dive into Command Buffers

You now know a lot about swap chains, but there is still one important thing you need to know. To explain it, I will briefly show you how to prepare drawing commands. That one last important thing about swap chains is connected with drawing and preparing command buffers. I will present only information about how to clear images, but it is enough to check whether our swap chain is working as it should.

In the first tutorial, I described queues and queue families. If we want to execute commands on a device we submit them to queues through command buffers. To put it in other words: commands are encapsulated inside command buffers. Submitting such buffers to queues causes devices to start processing commands that were recorded in them. Do you remember OpenGL’s drawing lists? We could prepare lists of commands that cause the geometry to be drawn in a form of a list of, well, drawing commands. The situation in Vulkan is similar, but far more flexible and advanced.

Creating Command Buffer Memory Pool

To store commands, a command buffer needs some storage. To prepare space for commands we create a pool from which the buffer can allocate its memory. We don’t specify the amount of space—it is allocated dynamically when the buffer is built (recorded).

Remember that command buffers can be submitted only to proper queue families and only the types of operations compatible with a given family can be submitted to a given queue. Also, the command buffer itself is not connected with any queue or queue family, but the memory pool from which buffer allocates its memory is. So each command buffer that takes memory from a given pool can only be submitted to a queue from a proper queue family—a family from (inside?) which the memory pool was created. If there are more queues created from a given family, we can submit a command buffer to any one of them; the family index is the most important thing here.

VkCommandPoolCreateInfo cmd_pool_create_info = {
  VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,     // VkStructureType              sType
  nullptr,                                        // const void*                  pNext
  0,                                              // VkCommandPoolCreateFlags     flags
  Vulkan.PresentQueueFamilyIndex                  // uint32_t                     queueFamilyIndex
};

if( vkCreateCommandPool( Vulkan.Device, &cmd_pool_create_info, nullptr, &Vulkan.PresentQueueCmdPool ) != VK_SUCCESS ) {
  printf( "Could not create a command pool!\n" );
  return false;
}

27.Tutorial02.cpp, function CreateCommandBuffers()

To create a pool for command buffer(s) we call a vkCreateCommandPool() function. It requires us to provide (an address of) a variable of structure type VkCommandPoolCreateInfo. It contains the following members:

  • sType – The usual type of structure, which on this occasion must be equal to VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO.
  • pNext – Pointer reserved for future use.
  • flags – Value reserved for future use.
  • queueFamilyIndex – Index of a queue family for which this pool is created.

For our test application, we use only one queue from a presentation family, so we should use its index. Now we can call the vkCreateCommandPool() function and check whether it succeeded. If yes, the handle to the command pool will be stored in a variable we have provided the address of.

Allocating Command Buffers

Next, we need to allocate the command buffer itself. Command buffers are not created in a typical way; they are allocated from pools. Other objects that take their memory from pool objects are also allocated (the pools themselves are created). That’s why there is a separation in the names of the functions vkCreate…() and vkAllocate…().

As described earlier, I allocate more than one command buffer—one for each swap chain image that will be referenced by the drawing commands. So each time we acquire an image from a swap chain we can submit/use the proper command buffer.

uint32_t image_count = 0;
if( (vkGetSwapchainImagesKHR( Vulkan.Device, Vulkan.SwapChain, &image_count, nullptr ) != VK_SUCCESS) ||
    (image_count == 0) ) {
  printf( "Could not get the number of swap chain images!\n" );
  return false;
}

Vulkan.PresentQueueCmdBuffers.resize( image_count );

VkCommandBufferAllocateInfo cmd_buffer_allocate_info = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO, // VkStructureType              sType
  nullptr,                                        // const void*                  pNext
  Vulkan.PresentQueueCmdPool,                     // VkCommandPool                commandPool
  VK_COMMAND_BUFFER_LEVEL_PRIMARY,                // VkCommandBufferLevel         level
  image_count                                     // uint32_t                     bufferCount
};
if( vkAllocateCommandBuffers( Vulkan.Device, &cmd_buffer_allocate_info, &Vulkan.PresentQueueCmdBuffers[0] ) != VK_SUCCESS ) {
  printf( "Could not allocate command buffers!\n" );
  return false;
}

if( !RecordCommandBuffers() ) {
  printf( "Could not record command buffers!\n" );
  return false;
}
return true;

28.Tutorial02.cpp, function CreateCommandBuffers()

First we need to know how many swap chain images were created (a swap chain may create more images than we have specified). This was explained in an earlier section. We call the vkGetSwapchainImagesKHR() function with the last parameter set to null. Right now we don’t need the handles of images, only their total number. After that we resize an array (vector) to hold the proper number of command buffers and allocate them by calling the vkAllocateCommandBuffers() function. It requires us to prepare a structured variable of type VkCommandBufferAllocateInfo, which contains the following fields:

  • sType – Type of a structure, this time equal to VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO.
  • pNext – Normal parameter reserved for future use.
  • commandPool – Command pool from which the buffer will be allocating its memory during commands recording.
  • level – Type (level) of command buffer. There are two levels: primary and secondary. Secondary command buffers may only be referenced (used) from primary command buffers. Because we don’t have any other buffers, we need to create primary buffers here.
  • bufferCount – The number of command buffers we want to create at once.

After calling the vkAllocateCommandBuffers() function, we need to check whether the buffer creations succeeded. If yes, we are done allocating command buffers and we are ready to record some (simple) commands.

Recording Command Buffers

Command recording is the most important operation we will be doing in Vulkan. The recording itself also requires us to provide quite a lot of information: the more complicated the drawing commands, the more information we must provide.

Here is a set of variables required (in this tutorial) to record command buffers:

uint32_t image_count = static_cast<uint32_t>(Vulkan.PresentQueueCmdBuffers.size());

std::vector<VkImage> swap_chain_images( image_count );
if( vkGetSwapchainImagesKHR( Vulkan.Device, Vulkan.SwapChain, &image_count, &swap_chain_images[0] ) != VK_SUCCESS ) {
  printf( "Could not get swap chain images!\n" );
  return false;
}

VkCommandBufferBeginInfo cmd_buffer_begin_info = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,  // VkStructureType                        sType
  nullptr,                                      // const void                            *pNext
  VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT, // VkCommandBufferUsageFlags              flags
  nullptr                                       // const VkCommandBufferInheritanceInfo  *pInheritanceInfo
};

VkClearColorValue clear_color = {
  { 1.0f, 0.8f, 0.4f, 0.0f }
};

VkImageSubresourceRange image_subresource_range = {
  VK_IMAGE_ASPECT_COLOR_BIT,                    // VkImageAspectFlags                     aspectMask
  0,                                            // uint32_t                               baseMipLevel
  1,                                            // uint32_t                               levelCount
  0,                                            // uint32_t                               baseArrayLayer
  1                                             // uint32_t                               layerCount
};

29.Tutorial02.cpp, function RecordCommandBuffers()

First we get the handles of all the swap chain images, which will be used in drawing commands (we will just clear them to one single color but nevertheless we will use them). We already know the number of images, so we don’t have to ask for it again. The handles of images are stored in a vector after calling the vkGetSwapchainImagesKHR() function.

Next, we need to prepare a variable of structured type VkCommandBufferBeginInfo. It contains the information necessary in more typical rendering scenarios (like render passes). We won’t be doing such operations here and that’s why we can set almost all parameters to zeros or nulls. But, for clarity, the structure contains the following fields:

  • sType – Structure type, this time it must be set to VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO.
  • pNext – Pointer reserved for future use, leave it to null.
  • flags – Parameter defining preferred usage of a command buffer.
  • pInheritanceInfo – Parameter pointing to another structure that is used in more typical rendering scenarios.

Command buffers gather commands. To store commands in command buffers, we record them. The above structure provides some necessary information for the driver to prepare for and optimize the recording process.

In Vulkan, command buffers are divided into primary and secondary. Primary command buffers are typical command buffers similar to drawing lists. They are independent, individual “beings” and they (and only they) may be submitted to queues. Secondary command buffers can also store commands (we also record them), but they may only be referenced from within primary command buffers (we can call secondary command buffers from within primary command buffers like calling OpenGL’s drawing lists from another drawing lists). We can’t submit secondary command buffers directly to queues.
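
None of this is needed in the current example, but as a brief, hedged illustration: a secondary buffer would be allocated with the VK_COMMAND_BUFFER_LEVEL_SECONDARY level and then called from a primary command buffer roughly like this (secondary_cmd_buffer is assumed to be an already recorded secondary buffer; error checking omitted):

// Sketch: referencing an already recorded secondary command buffer
// from within a primary command buffer (hypothetical handles)
vkBeginCommandBuffer( primary_cmd_buffer, &cmd_buffer_begin_info );
vkCmdExecuteCommands( primary_cmd_buffer, 1, &secondary_cmd_buffer );
vkEndCommandBuffer( primary_cmd_buffer );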

All of this information will be described in more detail in a forthcoming tutorial.

In this simple example we want to clear our images with one single value. So next we set up a color that will be used for clearing. You can pick any value you like. I used a light orange color.

The last variable in the code above specifies the parts of the image that our operations will be performed on. Our image consists of only one mipmap level and one array layer (no stereoscopic buffers, and so on). We set values in the VkImageSubresourceRange structure accordingly. This structure contains the following fields:

  • aspectMask – Depends on the image format; as we are using images as color render targets (they have a “color” format), we specify the “color” aspect here.
  • baseMipLevel – First mipmap level that will be accessed (modified).
  • levelCount – Number of mipmap levels on which operations will be performed (including the base level).
  • baseArrayLayer – First array layer that will be accessed (modified).
  • layerCount – Number of array layers the operations will be performed on (including the base layer).

We are almost ready to record some buffers.

Image Layouts and Layout Transitions

The last variable required in the above code example (of type VkImageSubresourceRange) specifies the parts of the image that operations will be performed on. In this lesson we only clear an image. But we also need to perform resource transitions. Remember the code when we selected a use for a swap chain image before the swap chain itself was created? Images may be used for different purposes. They may be used as render targets, as textures that can be sampled from inside the shaders, or as a data source for copy/blit operations (data transfers). We must specify different usage flags during image creation for the different types of operations we want to perform with or on images. We can specify more usage flags if we want (if they are supported; “color attachment” usage is always available for swap chains). But image usage specification is not the only thing we need to do. Depending on the type of operation, images may be differently allocated or may have a different layout in memory. Each type of image operation may be connected with a different “image layout.” We can use a general layout that is supported by all operations, but it may not provide the best performance. For specific usages we should always use dedicated layouts.

If we create an image with different usages in mind and we want to perform different operations on it, we must change the image’s current layout before we can perform each type of operation. To do this, we must transition from the current layout to another layout that is compatible with the operations we are about to execute.
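As an informal orientation (not an exhaustive list), common operations pair with dedicated layouts roughly as follows:

// Operation                                    Dedicated layout
// rendering into a color attachment         -> VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
// sampling from within a shader             -> VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
// source of a copy/blit (data transfer)     -> VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL
// destination of a copy/blit or a clear     -> VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL
// presentation by the presentation engine   -> VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
// any operation (fallback, may be slower)   -> VK_IMAGE_LAYOUT_GENERAL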

Each image is (generally) created with an undefined layout, and we must transition it to another layout if we want to use it. Swap chain images, however, are used by the presentation engine in the VK_IMAGE_LAYOUT_PRESENT_SRC_KHR layout. This layout, as the name suggests, is designed for the image to be used (presented) by the presentation engine (that is, displayed on the screen). So if we want to perform some operations on swap chain images, we need to change their layouts to ones compatible with the desired operations. And after we have finished processing the images (that is, rendering into them) we need to transition their layouts back to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR. Otherwise, the presentation engine will not be able to use these images and undefined behavior may occur.

To transition from one layout to another one, image memory barriers are used. With them we can specify the old layout (current) we are transitioning from and the new layout we are transitioning to. The old layout must always be equal to the current or undefined layout. When we specify the old layout as undefined, image contents may be discarded during transition. This allows the driver to perform some optimizations. If we want to preserve image contents we must specify a layout that is equal to the current layout.

The last variable of type VkImageSubresourceRange in the code example above is also used for image transitions. It defines what “parts” of the image are changing their layout and is required when preparing an image memory barrier.

Recording Command Buffers

The last step is to record a command buffer for each swap chain image. We want to clear the image to some arbitrary color. But first we need to change the image layout and change it back after we are done. Here is the code that does that:

for( uint32_t i = 0; i < image_count; ++i ) {
  VkImageMemoryBarrier barrier_from_present_to_clear = {
    VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,     // VkStructureType                        sType
    nullptr,                                    // const void                            *pNext
    VK_ACCESS_MEMORY_READ_BIT,                  // VkAccessFlags                          srcAccessMask
    VK_ACCESS_TRANSFER_WRITE_BIT,               // VkAccessFlags                          dstAccessMask
    VK_IMAGE_LAYOUT_UNDEFINED,                  // VkImageLayout                          oldLayout
    VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,       // VkImageLayout                          newLayout
    Vulkan.PresentQueueFamilyIndex,             // uint32_t                               srcQueueFamilyIndex
    Vulkan.PresentQueueFamilyIndex,             // uint32_t                               dstQueueFamilyIndex
    swap_chain_images[i],                       // VkImage                                image
    image_subresource_range                     // VkImageSubresourceRange                subresourceRange
  };

  VkImageMemoryBarrier barrier_from_clear_to_present = {
    VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,     // VkStructureType                        sType
    nullptr,                                    // const void                            *pNext
    VK_ACCESS_TRANSFER_WRITE_BIT,               // VkAccessFlags                          srcAccessMask
    VK_ACCESS_MEMORY_READ_BIT,                  // VkAccessFlags                          dstAccessMask
    VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,       // VkImageLayout                          oldLayout
    VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,            // VkImageLayout                          newLayout
    Vulkan.PresentQueueFamilyIndex,             // uint32_t                               srcQueueFamilyIndex
    Vulkan.PresentQueueFamilyIndex,             // uint32_t                               dstQueueFamilyIndex
    swap_chain_images[i],                       // VkImage                                image
    image_subresource_range                     // VkImageSubresourceRange                subresourceRange
  };

  vkBeginCommandBuffer( Vulkan.PresentQueueCmdBuffers[i], &cmd_buffer_begin_info );
  vkCmdPipelineBarrier( Vulkan.PresentQueueCmdBuffers[i], VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT, 0, 0, nullptr, 0, nullptr, 1, &barrier_from_present_to_clear );

  vkCmdClearColorImage( Vulkan.PresentQueueCmdBuffers[i], swap_chain_images[i], VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, &clear_color, 1, &image_subresource_range );

  vkCmdPipelineBarrier( Vulkan.PresentQueueCmdBuffers[i], VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, 0, 0, nullptr, 0, nullptr, 1, &barrier_from_clear_to_present );
  if( vkEndCommandBuffer( Vulkan.PresentQueueCmdBuffers[i] ) != VK_SUCCESS ) {
    printf( "Could not record command buffers!\n" );
    return false;
  }
}

return true;

30.Tutorial02.cpp, function RecordCommandBuffers()

This code is placed inside a loop. We are recording a command buffer for each swap chain image; that’s why we needed the number of images. Image handles are also needed here: we must specify them for the image memory barriers and during image clearing. But recall that I said we can’t use swap chain images until we are allowed to, that is, until we acquire an image from the swap chain. That’s true, but we aren’t using them here. We are only preparing commands. The usage itself is performed when we submit the operations (a command buffer) to a queue for execution. Here we are just telling Vulkan: in the future, take this image, do this with it, then that, and after that something more. This way we can prepare as much work as possible before we start the main rendering loop, and we avoid switches, ifs, jumps, and other branches during the real rendering. This scenario won’t be so simple in real life, but I hope the example is clear.
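As a reminder of where that execution actually happens, a condensed sketch of the submission path is shown below. It assumes the semaphores, queue, and command buffers created earlier in this tutorial; a handle name like Vulkan.PresentQueue is used here for illustration and may differ from the source:

// Sketch: the recorded commands touch the image only when the buffer is submitted.
uint32_t image_index;
vkAcquireNextImageKHR( Vulkan.Device, Vulkan.SwapChain, UINT64_MAX,
                       Vulkan.ImageAvailableSemaphore, VK_NULL_HANDLE, &image_index );

VkPipelineStageFlags wait_dst_stage_mask = VK_PIPELINE_STAGE_TRANSFER_BIT;
VkSubmitInfo submit_info = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,                // VkStructureType              sType
  nullptr,                                      // const void                  *pNext
  1,                                            // uint32_t                     waitSemaphoreCount
  &Vulkan.ImageAvailableSemaphore,              // const VkSemaphore           *pWaitSemaphores
  &wait_dst_stage_mask,                         // const VkPipelineStageFlags  *pWaitDstStageMask
  1,                                            // uint32_t                     commandBufferCount
  &Vulkan.PresentQueueCmdBuffers[image_index],  // const VkCommandBuffer       *pCommandBuffers
  1,                                            // uint32_t                     signalSemaphoreCount
  &Vulkan.RenderingFinishedSemaphore            // const VkSemaphore           *pSignalSemaphores
};
vkQueueSubmit( Vulkan.PresentQueue, 1, &submit_info, VK_NULL_HANDLE );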

Returning to the recording code above, we first prepare two image memory barriers. In the case of images, memory barriers can change three things: the way the image’s memory is accessed (the access masks), the image’s layout, and the queue family that owns the image. From the tutorial’s point of view, only the layouts are interesting right now, but we need to properly set all fields. To set up a memory barrier we prepare a variable of type VkImageMemoryBarrier, which contains the following fields:

  • sType – Structure type which here must be set to VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER.
  • pNext – Leave it null, pointer not used right now.
  • srcAccessMask – Types of memory operations done on the image before the barrier.
  • dstAccessMask – Types of memory operations that will take place after the barrier.
  • oldLayout – Layout from which we are transitioning; it should always be equal to the current layout (which in this example, for the first barrier, would be VK_IMAGE_LAYOUT_PRESENT_SRC_KHR). Alternatively, we can specify an undefined layout, which lets the driver perform some optimizations but may discard the contents of the image. Since we don’t need the contents, we use an undefined layout here.
  • newLayout – A layout that is compatible with operations we will be performing after the barrier; we want to do image clears; to do that we need to specify VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL layout. We should always use a specific, dedicated layout.
  • srcQueueFamilyIndex – A queue family index that was referencing the image previously.
  • dstQueueFamilyIndex – A family index from which queues will be referencing images after the barrier (this refers to the swap chain sharing mode I was describing earlier).
  • image – Handle to the image itself.
  • subresourceRange – A structure describing parts of an image we want to perform transitions on; this is that last variable from the previous code example.

Some notes are necessary regarding access masks and family indices. In this example, before the first barrier and after the second barrier, only the presentation engine has access to the image. The presentation engine only reads from the image (it doesn’t modify it), so we set srcAccessMask in the first barrier and dstAccessMask in the second barrier to VK_ACCESS_MEMORY_READ_BIT. This indicates that the memory associated with the image is only read (image contents are not modified before the first barrier and after the second barrier). In our command buffer we will only clear the image. This operation belongs to the so-called “transfer” operations. That is why I’ve set VK_ACCESS_TRANSFER_WRITE_BIT in the dstAccessMask field of the first barrier and in the srcAccessMask field of the second barrier.

I won’t go into more detail about queue family indices here, but if the same queue family is used for both graphics operations and presentation, srcQueueFamilyIndex and dstQueueFamilyIndex will be equal, and the hardware won’t make any modifications regarding image access from the queues. But remember that we have specified that only one queue at a time will access/use the image. So if these queue families are different, we inform the hardware here about the “ownership” change: a different queue will now access the image. And this is all the information you need right now to properly set up barriers.
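To make the “ownership” change concrete, a hypothetical variant of the first barrier for two different families could set its indices like this. Vulkan.GraphicsQueueFamilyIndex is an illustrative name that does not appear in this tutorial, and a complete transfer also requires matching release/acquire barriers submitted to both queues:

// Hypothetical: the image was last used by the present family and will now be used by the graphics family.
barrier_from_present_to_clear.srcQueueFamilyIndex = Vulkan.PresentQueueFamilyIndex;
barrier_from_present_to_clear.dstQueueFamilyIndex = Vulkan.GraphicsQueueFamilyIndex;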

We need to create two barriers: one that changes the layout from the “present source” (or undefined) layout to the “transfer dst” layout. This barrier is used at the beginning of the command buffer, when the presentation engine has just finished using the image and we now want to use and modify it. The second barrier changes the layout back to “present source” when we are done using the image and can give it back to the swap chain. This barrier is set at the end of the command buffer.

Now we are ready to start recording our commands by calling the vkBeginCommandBuffer() function. We provide a handle to a command buffer and the address of a variable of type VkCommandBufferBeginInfo, and we are ready to go. Next we set up a barrier to change the image layout. We call the vkCmdPipelineBarrier() function, which takes quite a few parameters, but in this example the only relevant ones are the first (the command buffer handle) and the last two: the number of barriers in an array and a pointer to the first element of an array of VkImageMemoryBarrier structures. Elements of this array describe images, their parts, and the types of transitions that should occur. After the barrier we can safely perform any operations on the swap chain image that are compatible with the layout we have transitioned to. The general layout is compatible with all operations, but likely at reduced performance.
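For reference, the full parameter list of vkCmdPipelineBarrier() looks like this (the parameters we don’t need above are simply passed as 0 or nullptr):

void vkCmdPipelineBarrier(
  VkCommandBuffer              commandBuffer,
  VkPipelineStageFlags         srcStageMask,
  VkPipelineStageFlags         dstStageMask,
  VkDependencyFlags            dependencyFlags,
  uint32_t                     memoryBarrierCount,
  const VkMemoryBarrier       *pMemoryBarriers,
  uint32_t                     bufferMemoryBarrierCount,
  const VkBufferMemoryBarrier *pBufferMemoryBarriers,
  uint32_t                     imageMemoryBarrierCount,
  const VkImageMemoryBarrier  *pImageMemoryBarriers );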

In the example we are only clearing images, so we call the vkCmdClearColorImage() function. It takes a handle to a command buffer, a handle to an image, the current layout of the image, a pointer to a variable with the clear color value, the number of subresource ranges (the number of elements in the array from the last parameter), and a pointer to an array of VkImageSubresourceRange structures. Elements of the last array specify what parts of the image we want to clear (we don’t have to clear all mipmap levels or array layers of an image if we don’t want to).

And at the end of our recording session we set up another barrier that transitions the image layout back to a “present source” layout. It is the only layout that is compatible with the present operations performed by the presentation engine.

Now we can call the vkEndCommandBuffer() function to signal that we have finished recording the command buffer. If something went wrong during recording, we will be informed about it through the value returned by this function. If there were errors, we cannot use the command buffer and need to record it again. If everything is fine, we can later use the command buffer to tell our device to perform the operations stored in it just by submitting the buffer to a queue.

Tutorial 2 Execution

In this example, if everything went fine, we should see a window filled with the light-orange clear color we specified.

Cleaning Up

Now you know how to create a swap chain, display images in a window and perform simple operations that are executed on a device. We have created command buffers, recorded them, and presented on the screen. Before we close the application, we need to clean up the resources we were using. In this tutorial I have divided cleaning into two functions. The first function clears (destroys) only those resources that should be recreated when the swap chain is recreated (that is, after the size of an application’s window has changed).

if( Vulkan.Device != VK_NULL_HANDLE ) {
  vkDeviceWaitIdle( Vulkan.Device );

  if( (Vulkan.PresentQueueCmdBuffers.size() > 0) && (Vulkan.PresentQueueCmdBuffers[0] != VK_NULL_HANDLE) ) {
    vkFreeCommandBuffers( Vulkan.Device, Vulkan.PresentQueueCmdPool, static_cast<uint32_t>(Vulkan.PresentQueueCmdBuffers.size()), &Vulkan.PresentQueueCmdBuffers[0] );
    Vulkan.PresentQueueCmdBuffers.clear();
  }

  if( Vulkan.PresentQueueCmdPool != VK_NULL_HANDLE ) {
    vkDestroyCommandPool( Vulkan.Device, Vulkan.PresentQueueCmdPool, nullptr );
    Vulkan.PresentQueueCmdPool = VK_NULL_HANDLE;
  }
}

31.Tutorial02.cpp, Clear()

First we must make sure that no operations are still executing on the device’s queues (we can’t destroy a resource that is used by currently processed commands). We can ensure this by calling the vkDeviceWaitIdle() function, which blocks until all operations are finished.

Next we free all the allocated command buffers. In fact this operation is not necessary here. Destroying a command pool implicitly frees all command buffers allocated from a given pool. But I want to show you how to explicitly free command buffers. Next we destroy the command pool itself.

Here is the code that is responsible for destroying all of the resources created in this lesson:

Clear();

if( Vulkan.Device != VK_NULL_HANDLE ) {
  vkDeviceWaitIdle( Vulkan.Device );

  if( Vulkan.ImageAvailableSemaphore != VK_NULL_HANDLE ) {
    vkDestroySemaphore( Vulkan.Device, Vulkan.ImageAvailableSemaphore, nullptr );
  }
  if( Vulkan.RenderingFinishedSemaphore != VK_NULL_HANDLE ) {
    vkDestroySemaphore( Vulkan.Device, Vulkan.RenderingFinishedSemaphore, nullptr );
  }
  if( Vulkan.SwapChain != VK_NULL_HANDLE ) {
    vkDestroySwapchainKHR( Vulkan.Device, Vulkan.SwapChain, nullptr );
  }
  vkDestroyDevice( Vulkan.Device, nullptr );
}

if( Vulkan.PresentationSurface != VK_NULL_HANDLE ) {
  vkDestroySurfaceKHR( Vulkan.Instance, Vulkan.PresentationSurface, nullptr );
}

if( Vulkan.Instance != VK_NULL_HANDLE ) {
  vkDestroyInstance( Vulkan.Instance, nullptr );
}

if( VulkanLibrary ) {
#if defined(VK_USE_PLATFORM_WIN32_KHR)
  FreeLibrary( VulkanLibrary );
#elif defined(VK_USE_PLATFORM_XCB_KHR) || defined(VK_USE_PLATFORM_XLIB_KHR)
  dlclose( VulkanLibrary );
#endif
}

32.Tutorial02.cpp, destructor

First we destroy the semaphores (remember they cannot be destroyed while they are in use, that is, when a queue is waiting on a given semaphore). After that we destroy the swap chain. Images that were created along with it are automatically destroyed, and we don’t need to do it ourselves (we are not even allowed to). Next the device is destroyed. We also need to destroy the surface that represents our application’s window. At the end, the Vulkan instance is destroyed and the graphics driver’s dynamic library is unloaded. Before each step we also check whether a given resource was properly created. We can’t destroy resources that weren’t properly created.

Conclusion

In this tutorial you learned how to display on the screen anything that was created with the Vulkan API. To briefly review the steps: First we enabled the proper instance-level extensions. Next we created the Vulkan representation of an application window, called a surface. Then we chose a device with a queue family that supported presentation and created a logical device (don’t forget about enabling device-level extensions!).

After that we created a swap chain. To do that we first acquired a set of parameters describing our surface and then chose values for proper swap chain creation. Those values had to fit into a surface’s supported constraints.

To draw something on the screen we learned how to create and record command buffers, which also included image layout transitions, for which image memory barriers (pipeline barriers) were used. We cleared the images so we could see the selected color being displayed on screen.

And we also learned how to present a given image on the screen, which included acquiring an image, submitting a command buffer, and the presentation process itself.


Go to: API without Secrets: Introduction to Vulkan* Part 3: First Triangle


Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.
