Release notes for 1.13.0

Release notes for DC/OS 1.13.0, including Open Source attribution and version policy.

DC/OS 1.13.0 was released on May 8, 2019.

Registered DC/OS Enterprise customers can access the DC/OS Enterprise configuration file from the support website. For new customers, contact your sales representative or sales@mesosphere.io before attempting to download and install DC/OS Enterprise.

Release summary

DC/OS is a distributed operating system that enables you to manage resources, application deployment, data services, networking, and security in an on-premise, cloud, or hybrid cluster environment.

This release provides new features and enhancements to improve the user experience, fix reported issues, integrate changes from previous releases, and maintain compatibility and support for other packages, such as Marathon and Metronome, used in DC/OS.

If you have DC/OS deployed in a production environment, see Known issues and limitations to see if any potential operational changes for specific scenarios apply to your environment.

New features and capabilities

DC/OS 1.13 includes new features and capabilities to enhance the installation and deployment experience, simplify cluster administration, increase operational productivity and efficiency, and provide additional monitoring, alerting, logging, and reporting for better visibility into cluster activity.

Highlights of what’s new

Some highlights for this release include:

  • Unified service accounts and authentication architecture
  • Monitoring and metrics for cluster operations
  • Extended support for workloads that take advantage of accelerated processing provided by graphics processing units (GPUs)
  • Improvements to the Universal installer and the upgrade process
  • New features and options for command-line programs
  • New dashboard options for monitoring cluster performance
  • Tighter integration between the Mesosphere Kubernetes Engine (MKE) and Edge-LB load balancing

Features and capabilities that are introduced in DC/OS 1.13 are grouped by functional area or component and include links to view additional documentation, if applicable.

Unified service accounts and authentication architecture

The core of the DC/OS Enterprise identity and access management service (IAM) has been open-sourced and added to DC/OS, replacing DC/OS OpenAuth (dcos-oauth). This architectural change includes adding CockroachDB as the cluster high-availability database for identity and access management.

With this change, DC/OS also now supports unified service accounts. Service accounts allow individual programs and applications to interact with a DC/OS cluster using their own identity. A successful service account login results in authentication proof: the DC/OS authentication token. A valid DC/OS authentication token is required to access DC/OS services and components through the master node Admin Router.

This change also aligns the authentication architectures of DC/OS Enterprise and DC/OS Open Source. The HTTP API for service account management and service authentication is now the same for both DC/OS Enterprise and DC/OS Open Source. For both DC/OS Enterprise and DC/OS Open Source clusters, the DC/OS authentication token is a JSON Web Token (JWT) signed with the RS256 algorithm. Any component in the system can validate this JWT authentication token after consulting the IAM service's JSON Web Key Set (JWKS) endpoint.

Monitoring and metrics for cluster operations

This release extends DC/OS cluster monitoring capabilities and the metrics you can collect and report for DC/OS components. The enhancements to monitoring and metrics provide you with better visibility into cluster operations, activity, and performance through DC/OS itself and as input to Prometheus, Grafana, and other services.

Monitoring service

  • The DC/OS monitoring service (dcos-monitoring) can be configured to use DC/OS storage service (DSS) volumes to store time-series data.

    With this release, you can store the information collected by the DC/OS monitoring service (dcos-monitoring) in the profile-based storage provided by the DC/OS Storage Service. By using the DC/OS Storage Service to store the monitoring data used in Prometheus queries and Grafana dashboards, you can improve the performance and reliability of the Prometheus and Grafana monitoring components.

    When you install the DC/OS monitoring service, you can select the volume size and a volume profile for the file system where you want to store the Prometheus time-series database (tsdb). By specifying a volume managed by the DC/OS Storage Service, you can take advantage of the durability, performance, and flexibility DSS provides for your collected data.

    For more information about working with the DC/OS monitoring service, see DC/OS Monitoring Service. For more information about using the DC/OS storage service, see DC/OS Storage Service.

  • The DC/OS monitoring service enables you to import curated alerting rules.

    With this release, deploying the DC/OS monitoring service enables you to import Mesosphere-provided Prometheus alert rules from a GitHub repository. These predefined alert rules enable you to create meaningful alerts about the condition of the DC/OS cluster, including successful or failed operations and node activity.

    Prometheus alert rules are automatically included as part of the DC/OS monitoring service. Each DC/OS component or framework available for monitoring should have a single rule file that contains all its alert rules. These alert rules are passed to Prometheus using the rule_files configuration parameter and are configured to specify one of the following severity levels:

    • Warning alerts identify issues that require notification, but do not require immediate action. For example, an alert identified as a warning might send email notification to an administrator but not require an immediate response.
    • Critical alerts identify issues that require immediate attention. For example, a critical alert might trigger a pager notification to signal that immediate action is required.
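
    As an illustration, the per-component rule files are wired into Prometheus through a configuration fragment along these lines (the file paths shown here are hypothetical, not the paths used by the dcos-monitoring package):

    ```yaml
    # Illustrative Prometheus configuration: each monitored DC/OS
    # component contributes a single rule file (paths are hypothetical).
    rule_files:
      - /etc/prometheus/rules/mesos.rules.yml
      - /etc/prometheus/rules/marathon.rules.yml
    ```

    Each rule in such a file carries a severity label, which is what distinguishes warning alerts from critical alerts at notification time.
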
  • Automatically create a curated collection of Prometheus-driven Grafana dashboards for DC/OS.

    If you deploy DC/OS monitoring, you can leverage Mesosphere-provided Grafana-based dashboards. By installing and configuring the dcos-monitoring service, you can automatically create dashboards that enable you to quickly visualize the metrics that the dcos-monitoring package is collecting from the DC/OS cluster and DC/OS-hosted applications. For more information about using Grafana dashboards, see the dashboard repository.

Metrics

DC/OS metrics are collected and managed through the Telegraf service. Telegraf provides an agent-based service that runs on each master and agent node in a DC/OS cluster. By default, Telegraf gathers metrics from all of the processes running on the same node, processes them, then sends the collected information to a central metrics database.

With this release, you can use Telegraf to collect and forward information for the following additional DC/OS cluster components:

  • CockroachDB
  • ZooKeeper
  • Exhibitor
  • Marathon
  • Metronome

You can also collect information about the operation and performance of the Telegraf process itself. This information is stored along with other metrics and available for reporting using the DC/OS monitoring service or third-party monitoring services. For information about the Telegraf plugin and the metrics that Telegraf collects about its own performance, see the documentation for the Internal input plugin.

  • New volume and network metrics that are collected by the Mesos input plugin are enabled by default.

    The metrics collection service, dcos-telegraf, can now collect additional metrics for Mesos volumes and network information. For a complete list of the Mesos metrics you can collect and report, see the latest list of metrics.

    In DC/OS 1.13, dcos-telegraf automatically collects Mesos metrics by default. Previously, you were required to manually enable the metrics plugin by updating the agent configuration or by setting the enable_mesos_input_plugin parameter in the config.yaml file to true. With this release, manually enabling this feature is no longer required. Instead, the default value for the parameter is now set to true. You can set the enable_mesos_input_plugin parameter in the config.yaml file to false if you want to disable the automatic collection of Mesos metrics.
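
    For example, to opt out of automatic Mesos metrics collection, the cluster config.yaml would include:

    ```yaml
    # config.yaml excerpt: Mesos metrics collection now defaults to true;
    # set this parameter to false to disable automatic collection.
    enable_mesos_input_plugin: false
    ```
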

  • Expose task-related metrics using the Prometheus format.

    You can expose metrics from tasks that run on Mesos in Prometheus format. When a port configuration belonging to a task is labelled appropriately, the metrics endpoint on that port is polled regularly over the lifetime of the task and metrics collected are added to the Telegraf pipeline.

    For a detailed description of how to configure a task so that its metrics are collected in Prometheus format, see the Prometheus input plugin.
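
    As a sketch, a Marathon app definition might mark a port for Prometheus-format collection with a port label similar to the following (the app ID is illustrative; see the Prometheus input plugin documentation for the exact label names):

    ```json
    {
      "id": "/metrics-demo",
      "cmd": "./run-server",
      "instances": 1,
      "portDefinitions": [
        {
          "port": 0,
          "name": "metrics",
          "labels": { "DCOS_METRICS_FORMAT": "prometheus" }
        }
      ]
    }
    ```
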

  • Add internal metrics for UDP activity to the Telegraf statsd input plugin.

    You can collect and report metrics for the number of incoming messages that have been dropped because of a full queue. This information is provided by the Telegraf statsd input plugin with the internal_statsd_dropped_messages metric.

  • Add process-level metrics for DC/OS agents and masters.

    You can collect and report process-level metrics for agent and master node processes. This information is provided by the Telegraf procstat input plugin. This plugin returns information about CPU and memory usage using the procstat_cpu_usage and procstat_memory_rss metrics.

  • Add metrics for Admin Router instances running on DC/OS master nodes.

    You can collect and report metrics for DC/OS Admin Router using NGINX virtual host metrics. This information is provided by the Telegraf NGINX input plugin and is enabled by default. You can view the NGINX instance metrics using the /nginx/status endpoint on each DC/OS master node.
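
    Outside DC/OS, an equivalent standalone Telegraf configuration fragment for this plugin would look roughly like the following (the URL is shown for illustration):

    ```toml
    # Telegraf NGINX input plugin: poll the Admin Router status endpoint.
    [[inputs.nginx]]
      urls = ["http://localhost/nginx/status"]
    ```
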

  • Add the fault domain region and zone information to metrics.

For more information about collecting metrics and configuring metrics plugins, see the metrics documentation.

Command-line interface

  • Identify the public-facing IP address for public agent nodes through the DC/OS CLI.

    With this release, you can retrieve the public-facing IP addresses for the nodes in a cluster by running the dcos node list command. For more information about using the new command for retrieving public IP addresses, see the dcos node command reference.

    You can look up the public agent IP address using the DC/OS web-based console, command-line interface, or API calls for DC/OS cluster nodes if DC/OS is deployed on a public cloud provider such as AWS, Google Cloud, or Azure. If DC/OS is installed on an internal network (on-premise) or a private cloud, nodes do not typically have separate public and private IP addresses. For nodes on an internal network or private cloud, the public IP address is most often the same as the IP address defined for the server in the DNS namespace.

  • Automatically install the DC/OS Enterprise command-line interface (CLI).

    If you have deployed a DC/OS Enterprise cluster, you can now automatically install the DC/OS Enterprise CLI when you install the base CLI package. Previously, the DC/OS Enterprise CLI could only be installed manually after the successful installation of the base DC/OS CLI.

    For more information about installing the command-line interface (CLI) and CLI plugins, see Installing the CLI and Installing the DC/OS Enterprise CLI.

  • Basic auto-completion using the TAB key.

    You can now use the TAB key to provide automatic completion when typing DC/OS commands. Auto-completion enables you to execute commands in a shell terminal more quickly by attempting to predict the rest of a command or subcommand you are typing. If the suggested text matches the command you intended, you can press the TAB key to accept the suggestion and execute the command.

    For more information about using auto-completion when you are working with the command-line interface (CLI) and CLI plugins, see Enabling autocompletion for the CLI.

  • Dynamic auto-completion of cluster names for dcos cluster attach and dcos cluster remove commands.

    You can now use the TAB key to provide automatic completion for potential cluster names when you run the dcos cluster attach or dcos cluster remove commands.

    For more information about using auto-completion when you are working with the command-line interface (CLI) and CLI plugins, see Enabling autocompletion for the CLI.

  • CLI support for macOS using Homebrew.

    Homebrew is a software package management program you can use to install and configure packages for computers running macOS or Linux operating systems. With this release, you can install the DC/OS command-line interface (CLI) packages using the macOS Homebrew utility. Previously, you were required to download all DC/OS CLI plug-ins directly from the DC/OS cluster. By adding support for the Homebrew package manager, operators and developers can keep their CLI packages up-to-date using the brew command. For example, you can install the core CLI package by running the following command:

    brew install dcos-cli
    

    For more information about installing and using Homebrew, see the Homebrew website or the GitHub repository.

Data services

  • Add a unique version number to Edge-LB pool packages.

    You can run a command to return the version number for the Edge-LB pool package you have installed. Using the version number returned by the edgelb version command, you can verify whether the Edge-LB pool and the Edge-LB API server versions match. The Edge-LB API server and the Edge-LB pool version numbers should always match. For example, if you have the Edge-LB pool package version v1.3.0 installed, the API server version should be v1.3.0 as well.

  • Enable Edge-LB pool instances to be scaled up or down.

    You can scale down the number of Edge-LB pool instances if you do not need all of the pool instances that are configured. To scale down, update the count variable in the Edge-LB pool configuration file to reflect the number of Edge-LB pool instances you need.
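
    For example, scaling a pool from three instances down to two is a matter of lowering count in the pool configuration and reapplying it (the pool name here is illustrative, and a real pool configuration includes additional fields):

    ```json
    {
      "apiVersion": "V2",
      "name": "example-pool",
      "count": 2
    }
    ```
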

UI

  • Support for the independent upgrade of the DC/OS UI.

    You can now install and update the DC/OS UI without having to upgrade the DC/OS cluster. New DC/OS UI releases can be published to the DC/OS catalog and are also available as .dcos files for on-premise customers. This enables you to easily get the latest fixes and capabilities in the DC/OS UI without affecting cluster operations. You can also roll back an update, so you can return to the DC/OS UI version that originally shipped with your version of DC/OS if you need to.

  • Accurate status information for services.

    The DC/OS 1.13 UI now includes a new tab in the Details section of every SDK-based data service. This new tab provides a clear indication of the status and progress of SDK-based services during the service life cycle, including installation and upgrade activity. From the Details tab, you can see information about the specific operational plans that are currently running or have just completed. You can also view the execution of each task so that you can easily track the progress of the plans you have deployed.

    For more information about viewing up-to-date status information for services and operational plans, see the Services documentation.

  • Identify the public-facing IP address for public agent nodes in the DC/OS UI.

    With this release, you can view the public-facing IP addresses for agent nodes in the DC/OS UI. Previously, retrieving the public IP address for a node required writing a custom query. For more information about viewing public IP addresses in the DC/OS UI, see Finding the public IP address.

  • Add support for internationalization and localization (I18N and L10N - Chinese).

    The Mesosphere DC/OS 1.13 UI has been translated into Mandarin Chinese. Mandarin-speaking customers and users can now easily switch the language displayed in the UI and interact with DC/OS operations and functions in English or Chinese. The DC/OS documentation has also been translated into Chinese to support those customers. Support for additional languages can be provided if there is sufficient customer demand.

    For information about changing the language displayed, see the UI documentation.

Installation

  • Multi-region support using the Universal Installer.

    Multi-region deployments enable higher availability for DC/OS clusters, and support for multiple regions is crucial for customers who want to maintain uptime without being susceptible to regional outages. For more information, see the documentation for multi-region deployment.

  • Dynamic masters on the Universal Installer.

    Dynamic masters enable you to create, destroy, and recover master nodes. With this feature, you can use the Universal Installer to scale your DC/OS clusters down or up not only by changing the number of agent nodes (which is already supported), but also the number of master nodes, if you deem it necessary to do so. For more information, see the documentation for replaceable masters.

  • Enable Universal Installer and on-premise DC/OS life cycle management with Ansible.

    The DC/OS Ansible (dcos-ansible) component is a Mesosphere-provided version of Ansible, the open-source provisioning, configuration management, and deployment tool. It enables you to use supported Ansible roles for installing and upgrading DC/OS Open Source and DC/OS Enterprise clusters on the infrastructure you choose. For more information, see the documentation for Ansible.

Job management and scheduling

  • Enhance DC/OS job handling capabilities by adding support for the following:

    • Graphic processing units (GPU) when creating new jobs in the DC/OS UI or with the new DC/OS configuration option metronome_gpu_scheduling_behavior.
    • Jobs running in universal container runtime (UCR) containers.
    • File-based secrets.
    • Hybrid cloud deployments.
    • The IS constraint operator and the @region and @zone attributes.
  • Provide an option to enable or disable offer suppression when agents are idle.

  • Collect metrics for the “root” Metronome process on DC/OS for better observability.

  • Add HTTP and uptime metrics for job management.

  • Set the default value for the --gpu_scheduling_behavior configuration option to restricted to prevent jobs from being started on GPU-enabled agents if the job definition did not explicitly request GPU support.
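
    Assuming the metronome_gpu_scheduling_behavior cluster option mentioned above maps to this flag, the new default corresponds to the following config.yaml setting:

    ```yaml
    # config.yaml excerpt: 'restricted' (the new default) keeps jobs off
    # GPU-enabled agents unless the job definition explicitly requests GPUs.
    metronome_gpu_scheduling_behavior: restricted
    ```
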

Marathon

  • Enable secure computing (seccomp) and a default seccomp profile for UCR containers to prevent security exploits.

  • Replace Marathon-based health and readiness checks with generic DC/OS (Mesos-based) checks.

  • Collect metrics for the “root” Marathon framework on DC/OS for better observability.

  • Automatically replace instances when a DC/OS agent is decommissioned.

  • Set the default value for the --gpu_scheduling_behavior configuration option to restricted to prevent tasks from being started on GPU-enabled agents if the app or pod definition did not explicitly request GPU support.

  • Implement global throttling of Marathon-initiated health checks for better scalability.

  • Suppress offers by default when agents are idle for better scalability.

  • Close connections on slow event consumers to prevent excessive buffering and reduce the load on Marathon.

  • Aligning Ephemeral with Stateful Task Handling

    Marathon 1.8 handles ephemeral instances similarly to how it has handled stateful instances since version 1.0. Previously, Marathon expunged ephemeral instances once all of their tasks ended up in a terminal state, and eventually launched replacements as a result of those instances being expunged. Instances are now only expunged from the state once their goal is set to Decommissioned and all of their tasks are in a terminal state. If their goal is still Running, they are considered for scheduling and used to launch replacement tasks. This change not only merges two previously different code paths; it also simplifies debugging, because users can follow the task incarnations for a given instance throughout Marathon's logs.

    This means that instance IDs are now stable for as long as an instance is meant to be kept running. New instances are created only when replacing unreachable instances or when replacing instances with new versions. Similar to the way task IDs are handled for stateful services, tasks of stateless services now also carry an incarnation count appended to the task ID. The first task created for an instance has incarnation .1, and each subsequent replacement increments that counter. For example:

    service-name.instance-c0caec0a-863a-11e9-915b-c610fee06dff._app.42
    

    The above example denotes the 42nd generation of instance c0caec0a-863a-11e9-915b-c610fee06dff.

    When you kill an instance using the wipe=true flag, its goal is set to Decommissioned and the instance is eventually expunged when all of its tasks are terminal. Note that as long as its tasks are, for example, unreachable, the instance is not expunged until they are reported terminal (if they stay unreachable: GONE, GONE_BY_OPERATOR, or UNKNOWN). When you kill instances without the wipe=true flag, Marathon only issues kill requests to Mesos but keeps the current goal, and therefore launches replacements that are still associated with the existing instance.

Mesos platform and containerization

  • Update the Universal Container Runtime (UCR) to support Docker registry manifest specification v2_schema2 images.

    The DC/OS Universal Container Runtime (UCR) now fully supports Docker images that are formatted using the Docker v2_schema2 specification. UCR also continues to support Docker images that use the v2_schema1 format.

    For more information, see Universal Container Runtime.

  • Add a communication heartbeat to improve resiliency.

    DC/OS clusters now include executor and agent communication channel heartbeats to ensure platform resiliency even if IPFilter is enabled with conntrack, which usually times out a connection every five days.

  • Support zero-downtime for tasks through layer-4 load balancing.

    DC/OS cluster health checks now provide task-readiness information. This information enables zero-downtime for load balancing when services are scaled out. With this feature, load balanced traffic is not redirected to containers until the container health check returns a ‘ready’ status.

  • Add support for CUDA 10 image processing for applications that use graphics processing unit (GPU) resources and are based on the NVIDIA Container Runtime.

    CUDA provides a parallel computing platform that enables you to use GPU resources for general purpose processing. The CUDA platform provides direct access to the GPU virtual instruction set using common programming languages such as C and C++. The NVIDIA Container Runtime is a container runtime that supports CUDA image processing and is compatible with the Open Containers Initiative (OCI) specification.

    With this release, DC/OS adds support for CUDA, NVIDIA Container Runtime containers, and applications that use GPU resources to enable you to build and deploy containers for GPU-accelerated workloads.

Networking

  • Add a new networking API endpoint to retrieve the public-facing IP address for public agent nodes.

    This release introduces a new API endpoint for accessing public-facing IP addresses for the nodes in a cluster. For more information about retrieving and viewing public IP addresses, see Finding the public IP address.

Security

  • Extend the DC/OS authentication architecture to apply to both DC/OS Open Source (OSS) and DC/OS Enterprise clusters.

    You can now create unified service accounts that can be used across DC/OS OSS and DC/OS Enterprise clusters. By extending support for service accounts to all DC/OS clusters, you have the option to install, configure, and manage additional packages, including packages that require a service account when you are running DC/OS Enterprise in strict mode.

    For more information about authentication and managing accounts, see Security and User account management.

  • Support secure computing mode (seccomp) profiles.

    Secure computing mode (seccomp) is a feature provided by the Linux kernel. You can use secure computing mode to restrict the actions allowed within an app or pod container. You can enable secure computing mode using a default profile for Universal Container Runtime (UCR) containers if the operating system you are using supports it.

    With DC/OS, you can use a seccomp profile to deny access to specific system calls by default. The profile defines a default action and the rules for overriding that default action for specific system calls.

    Using a secure computing mode profile is an important option if you need to secure access to containers and operations using the principle of least privilege.

    For more information about secure computing mode and the default secure computing profile, see Secure computing profiles.
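
    A seccomp profile is a JSON document with a default action and per-syscall overrides. The following is a minimal sketch of that structure, not the default profile shipped with DC/OS; the actions and syscall names are illustrative:

    ```json
    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "syscalls": [
        {
          "names": ["read", "write", "open", "close", "exit_group"],
          "action": "SCMP_ACT_ALLOW"
        }
      ]
    }
    ```

    With a default-deny action like this, only the explicitly listed system calls succeed, which is how a profile enforces the principle of least privilege.
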

Storage

  • Update Beta REX-Ray to support NVMe EBS volumes.

    REX-Ray is a container storage orchestration engine that enables persistence for cloud-native workloads. With REX-Ray, you can manage native Docker Volume Driver operations through a command-line interface (CLI).

    Amazon Elastic Block Store (Amazon EBS) provides block-level storage volumes for Amazon Elastic Compute Cloud (EC2) instances. Amazon EBS volumes can be attached to any running EC2 instance hosted in the same Amazon availability zone to provide persistent storage that is independent of the deployed instance. EBS storage volumes can be exposed using NVMe (non-volatile memory express) as a host controller interface and storage protocol. NVMe devices enable you to accelerate the transfer of data between nodes and solid-state drives (SSDs) over a computer's connection gateway.

    With this release, DC/OS updates REX-Ray to support NVMe storage when the DC/OS cluster runs on an Amazon instance. To work with NVMe devices, however, you must provide your own udev rules and nvme-cli package. For more information about using REX-Ray, see the REX-Ray website and GitHub repository.

  • Provide a driver that enables AWS Elastic Block Store (EBS) volumes for the Mesosphere Kubernetes Engine (MKE).

    You can use the AWS EBS Container Storage Interface (CSI) driver to manage storage volumes for the Mesosphere Kubernetes Engine (MKE). This driver enables MKE users to deploy stateful applications running in a DC/OS cluster on an AWS cloud instance.

  • Update support for the Container Storage Interface (CSI) specification.

    With this release, DC/OS supports the Container Storage Interface (CSI) API, version 1 (v1), specification. You can deploy plugins that are compatible with either the Container Storage Interface (CSI) API, v0 or v1, specification to create persistent volumes through local storage resource providers. DC/OS automatically detects the CSI versions that are supported by the plugins you deploy.

Issues fixed in this release

The issues that have been fixed in DC/OS 1.13 are grouped by feature, functional area, or component. Most change descriptions include one or more issue-tracking identifiers enclosed in parentheses for reference.

Admin Router

  • Enable Admin Router to handle long server names (COPS-4286, DCOS-46277).

    This release fixes an issue in Admin Router that prevented it from starting properly for some virtual machine configurations. For example, if you previously used a server name that exceeded the maximum size allowed, the dcos-adminrouter component might be unable to start the server. With this release, the packages/adminrouter/extra/src/nginx.master.conf file has been updated to support a server name hash bucket size of 64 characters.

  • Change the master Admin Router service endpoint /service/<service-name> so that it does not remove the Accept-Encoding header from requests, allowing services to serve compressed responses to user agents (DCOS_OSS-4906).

  • Enable the master Admin Router to expose the DC/OS networking API through the /net endpoint path (DCOS_OSS-1837).

    This API can be used, for example, to return the public IP addresses of cluster nodes through the /net/v1/nodes endpoint.

  • Enable Admin Router to return relative redirects to avoid relying on the Host header (DCOS-47845).

Command-line interface (CLI)

  • Fix the CLI task metrics summary command which was occasionally failing to find metrics (DCOS_OSS-4679).

Diagnostics and logging

  • Enable DC/OS to create consolidated diagnostics bundles by applying a timeout when reading systemd journal entries (DCOS_OSS-5097).

  • Add SELinux details to the DC/OS diagnostics bundle to provide additional information for troubleshooting and analysis (DCOS_OSS-4123).

  • Add external Mesos master and agent logs in the diagnostic bundle to provide additional information for troubleshooting and analysis (DCOS_OSS-4283).

  • Add logging for Docker-GC to the journald system logging facility (COPS-4044).

  • Modify Admin Router to log information to a non-blocking domain socket (DCOS-43956).

    Previously, if the journald logging facility failed to read the socket quickly enough, Admin Router would stop processing requests, causing log messages to be lost and blocking other processing activity.

  • Allow the DC/OS Storage Service (DSS) endpoint for collecting diagnostics to be marked as optional (DCOS_OSS-5031).

    The DC/OS Storage Service (DSS) provides an HTTP endpoint for collecting diagnostics. If you want the DC/OS diagnostics request to succeed when the storage service diagnostics endpoint is not available, you can configure the DC/OS diagnostics HTTP endpoint as optional. By specifying that the diagnostic endpoint is optional, you can ensure that failures to query the endpoint do not cause DC/OS diagnostics reporting to fail.

    If the storage service diagnostics endpoint is optional when you generate a diagnostics report, DC/OS records a log message indicating that the endpoint is unavailable and ignored because it was marked as optional.
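
    The behavior described above can be sketched as follows; the function and endpoint names here are illustrative, not the actual DC/OS diagnostics API:

```python
import logging

def collect_diagnostics(endpoints, fetch):
    """Query each (name, optional) endpoint; a failure of an optional
    endpoint is logged and skipped instead of failing the whole report."""
    report = {}
    for name, optional in endpoints:
        try:
            report[name] = fetch(name)
        except Exception as exc:
            if optional:
                logging.warning(
                    "endpoint %s unavailable and ignored (optional): %s",
                    name, exc)
            else:
                raise
    return report

def fake_fetch(name):
    # Stub fetcher: the optional DSS endpoint is down; others respond.
    if name == "dss":
        raise ConnectionError("storage service not running")
    return {"status": "ok"}

# The DSS failure is logged and ignored; the diagnostics run still succeeds.
result = collect_diagnostics([("mesos", False), ("dss", True)], fake_fetch)
```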

  • Prevent cloud provider access or account keys from being included in diagnostic reports (DCOS-51751).

    With this release, the configuration parameters aws_secret_access_key and exhibitor_azure_account_key are marked as secret and not visible in the user.config.yaml file on cluster nodes. This information is only visible in user.config.full.yaml file. This file has stricter read permissions and is not included in DC/OS Diagnostics bundles.

UI

  • Change the default value for DC/OS UI X-Frame-Options from SAMEORIGIN to DENY. This setting is also now configurable using the adminrouter_x_frame_options configuration parameter (DCOS-49594).
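
    For example, a cluster that must keep the previous behavior could set the documented parameter in its installer config.yaml (value shown for illustration):

```yaml
# config.yaml fragment: override the new DENY default for the DC/OS UI.
adminrouter_x_frame_options: SAMEORIGIN
```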

Installation

  • Allow the DC/OS installer to be used when there is a space in its path (DCOS_OSS-4429).

  • Add a warning to the installer to let the user know if kernel modules required by the DC/OS storage service (DSS) are not loaded (DCOS-49088).

  • Improve the error messages returned if Docker is not running at the start of a DC/OS installation (DCOS-15890).

  • Stop requiring the ssh_user attribute to be set in the config.yaml file when using parts of the deprecated CLI installer (DCOS_OSS-4613).

Job management and scheduling

  • Job scheduling (Metronome) has been improved to handle the restart policy when a job fails. If a job run fails, whether the task is restarted now depends on the restart policy you have defined, for example ON_FAILURE (DCOS_OSS-4636).

Metrics

  • Prefix illegal Prometheus metric names with an underscore (DCOS_OSS-4899).
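
    As a sketch of that rule (this mirrors Prometheus's documented metric-name syntax, not the actual DC/OS implementation):

```python
import re

def sanitize_metric_name(name: str) -> str:
    """Make a metric name legal for Prometheus: names must match
    [a-zA-Z_:][a-zA-Z0-9_:]*."""
    # Replace characters that are illegal anywhere in the name...
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    # ...and prefix names whose first character is only legal later on
    # (for example, a leading digit) with an underscore.
    if re.match(r"[^a-zA-Z_:]", name):
        name = "_" + name
    return name

print(sanitize_metric_name("5xx.responses"))  # → _5xx_responses
```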

Networking

  • Fix an issue that previously caused the dcos-net-setup.py script to fail if the systemd network directory did not exist (DCOS-49711).

  • Add path-based routing to Admin Router to support routing of requests to the DC/OS networking (dcos-net) component (DCOS_OSS-1837).

  • Mark the dcos6 overlay network as disabled if the enable_ipv6 parameter is set to false (DCOS-40539).

  • Enable IPv6 support for layer-4 load balancing (l4lb) by default (DCOS_OSS-1993).

  • Fix a race condition in the layer-4 load balancing (l4lb) network component (DCOS_OSS-4939).

  • Fix IPv6 virtual IP support in the layer-4 load balancing (l4lb) network component (DCOS-50427).

  • Update iptable rules to allow the same port to be used for port mapping and virtual IP addresses (DCOS_OSS-4970).

    DC/OS now allows you to use the same port for traffic routed to virtual IP addresses and to containers that use port mapping (for example, network traffic routed to a container using bridge networking). Previously, if you configured a virtual IP address listening on the same port as the host port specified for port mapping, the iptable rules identified the port conflict and prevented the virtual IP traffic from being routed to its intended destination.

  • Update lashup to check that all master nodes are reachable (DCOS_OSS-4328).

    Lashup is an internal DC/OS building block for distributed control operations. It is not an independent module, but is used in conjunction with other components. This fix helps ensure that Lashup converges, preventing connectivity issues and nodes forming multiple “sub-clusters” within a single DC/OS cluster.

  • Allow agents to store network information in a persistent location (COPS-4124, DCOS-46132, DCOS_OSS-4667).

    A new agent option, --network_cni_root_dir_persist, allows network-related information to be stored in a persistent location under the container work_dir root directory. By persisting this information, the container network interface (CNI) isolator code can perform proper cleanup operations after a node reboot.

    If rebooting a node does not delete old containers and IP/MAC addresses from etcd (which over time can cause pool exhaustion), set the --network_cni_root_dir_persist agent option to true in the config.yaml file. Note that changing this flag requires rebooting the agent node or shutting down all container processes running on the node; because a reboot or shutdown of containers is required, the option defaults to false.

    Before changing this option, you should plan for agent maintenance to minimize any service interruption. If you set this option and reboot a node, you should also unset the CNI_NETNS environment variable after rebooting using the CNI plugin DEL command so that the plugin cleans up as many resources as possible (for example, by releasing IPAM allocations) and returns a successful response.
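
    Assuming the option is exposed under the same name in config.yaml (verify the exact key against your DC/OS version's configuration reference before use), the setting would look like:

```yaml
# Hypothetical config.yaml fragment; the key name is assumed from the
# agent flag --network_cni_root_dir_persist described above.
network_cni_root_dir_persist: true
```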

  • Applications that use Docker containers with a virtual IP address now resolve access to the application using host_IP:port_number instead of container_ip:port_number for backend port mapping (COPS-4087).

  • The distributed layer-4 load-balancer (dcos-l4lb) network component waits to route traffic until an application scale-up operation is complete or the application health check has passed (COPS-3924, DCOS_OSS-1954).

    The dcos-l4lb process does not prevent traffic from being routed if you are scaling down the number of application instances. Network traffic is only suspended if the status of the application is determined to be unhealthy or unknown.

Third-party updates and compatibility

  • Update support for REX-Ray to the most recent stable version (DCOS_OSS-4316, COPS-3961).

  • Upgrade the supported version of the Telegraf metrics plugin to leverage recent bug fixes and feature improvements (DCOS_OSS-4675).

  • Update the supported version of Java to 8u192 to address known critical and high security vulnerabilities (DCOS-43938, DCOS_OSS-4380).

  • Upgrade the supported version of the Erlang/OTP framework to Erlang/OTP 21.3 (DCOS_OSS-4902).

Known issues and limitations

This section covers any known issues or limitations that don’t necessarily affect all customers, but might require changes to your environment to address specific scenarios. The issues are grouped by feature, functional area, or component. Where applicable, issue descriptions include one or more tracking identifiers enclosed in parentheses for reference.

Using separate JSON files for job scheduling

In this release, jobs and job schedules are created in two separate steps. Because of this change, you must structure the job definition in the JSON editor in distinct sections similar to this:

  • job: JSON definition that specifies the job identifier and job configuration details.
  • schedule: JSON definition that specifies the schedule details for the job.

This two-step approach to creating JSON for jobs is different from previous releases in which jobs and schedules could be created in one step. In previous releases, the job could have its schedule embedded in its JSON configuration.

If you have an existing JSON configuration that has an embedded schedule and you want to view or modify that file using the job form JSON editor, you must:

  1. Add the JSON object as the value for the job property in the editor.

    The job must be formatted according to the latest Jobs API specification. This API specification (v1) replaces the previous Jobs API specification (v0).

  2. Copy the schedules: [ scheduleJSON ] from the existing job JSON configuration and add it at the same level after the job property as schedule: scheduleJSON.

    The schedule must be formatted according to the Jobs API Schedule specification. This specification (v1) replaces the previous version (v0).

  3. Verify that the schedule section is not an array.

  4. Remove the schedules property from the job’s JSON configuration settings.
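
The steps above can be sketched as a small helper; the field names follow the example JSON in this section, and this is not an official migration tool:

```python
def split_job_definition(old_job: dict) -> dict:
    """Convert an old-style job definition with an embedded "schedules"
    array into the two-section layout the JSON editor expects."""
    # Step 4: remove the schedules property from the job configuration.
    job = {k: v for k, v in old_job.items() if k != "schedules"}
    new_def = {"job": job}  # step 1: job JSON under the "job" property
    schedules = old_job.get("schedules", [])
    if schedules:
        # Steps 2-3: "schedule" is a single object, not an array.
        new_def["schedule"] = schedules[0]
    return new_def

old = {"id": "test-schedule", "run": {"cmd": "sleep 100"},
       "schedules": [{"id": "test", "cron": "* * * * *"}]}
new = split_job_definition(old)
```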

The following example illustrates the changes required when you have a job definition that includes an embedded schedule.

{
  "id": "test-schedule",
  "labels": {},
  "run": {
    "cpus": 1,
    "mem": 128,
    "disk": 0,
    "gpus": 0,
    "cmd": "sleep 100",
    "env": {},
    "placement": {
      "constraints": []
    },
    "artifacts": [],
    "maxLaunchDelaySeconds": 300,
    "volumes": [],
    "restart": {
      "policy": "NEVER"
    },
    "secrets": {}
  },
  "schedules": [
    {
      "id": "test",
      "cron": "* * * * *",
      "timezone": "UTC",
      "startingDeadlineSeconds": 900,
      "concurrencyPolicy": "ALLOW",
      "enabled": true,
      "nextRunAt": "2019-04-26T16:28:00.000+0000"
    }
  ],
  "activeRuns": [],
  "history": {
    "successCount": 0,
    "failureCount": 0,
    "lastSuccessAt": null,
    "lastFailureAt": null,
    "successfulFinishedRuns": [],
    "failedFinishedRuns": []
  }
}

To add this job definition to the JSON editor, you would modify the existing JSON as follows:

{
  "job": {
    "id": "test-schedule",
    "labels": {},
    "run": {
      "cpus": 1,
      "mem": 128,
      "disk": 0,
      "gpus": 0,
      "cmd": "sleep 100",
      "env": {},
      "placement": {
        "constraints": []
      },
      "artifacts": [],
      "maxLaunchDelaySeconds": 300,
      "volumes": [],
      "restart": { "policy": "NEVER" },
      "secrets": {}
    }
  },
  "schedule": {
    "id": "test",
    "cron": "* * * * *",
    "timezone": "UTC",
    "startingDeadlineSeconds": 900,
    "concurrencyPolicy": "ALLOW",
    "enabled": true,
    "nextRunAt": "2019-04-26T16:28:00.000+0000"
  }
}

Authentication tokens after an upgrade

  • Authentication tokens that are generated by DC/OS Open Authentication (dcos-oauth) before upgrading from DC/OS version 1.12.x to DC/OS version 1.13.x become invalid during the upgrade. To generate a new authentication token for access to DC/OS 1.13.x, log in using valid credentials after completing the upgrade.

Upgrading Marathon orchestration

  • You can only upgrade to Marathon 1.8 from 1.6.x or 1.7.x. To upgrade from an earlier version of Marathon, you must first upgrade to Marathon 1.6.x or 1.7.x.

Restrictions for Marathon application names

  • You should not use restricted keywords in application names.

    You should not add applications with names (identifiers) that end with restart, tasks, or versions. For example, the application names /restart and /foo/restart are invalid and generate errors when you attempt to issue a GET /v2/apps request. If you have any existing apps with restricted names, attempting any operation other than delete will result in an error. You should ensure that application names comply with the validation rules before upgrading Marathon.
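
    A sketch of the naming rule (this mirrors the documented restriction, not Marathon's actual validator):

```python
# App IDs whose final path segment is a restricted keyword are invalid.
RESTRICTED = {"restart", "tasks", "versions"}

def is_valid_app_id(app_id: str) -> bool:
    """Return False if the last segment of the app ID is restricted."""
    return app_id.rstrip("/").split("/")[-1] not in RESTRICTED

print(is_valid_app_id("/foo/restart"))  # → False
print(is_valid_app_id("/foo/web"))      # → True
```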

Deprecated or decommissioned features

  • In DC/OS 1.13, the DC/OS history service has transitioned into the retired state. The history service is scheduled to be decommissioned in DC/OS 1.14. You can find the definitions for each of the feature maturity states documented in the Mesosphere DC/OS Feature Maturity Lifecycle.

  • Some of the configuration parameters previously used to install DC/OS cluster components are no longer valid. The following dcos_generate_config.sh command-line options have been deprecated and decommissioned:

    • --set-superuser-password
    • --offline
    • --cli-telemetry-disabled
    • --validate
    • --preflight
    • --install-prereqs
    • --deploy
    • --postflight

    If you attempt to use an option that is no longer valid, the installation script displays a warning message. You can also identify deprecated options by running the dcos_generate_config.sh script with the --help option. The output for the --help option displays [DEPRECATED] for the options that are no longer used.

    These options will be removed in DC/OS 1.14. If you have scripts or programs that use any of the deprecated options, you should update them.

  • The CLI command dcos node has been replaced by the new command dcos node list.

    Running the dcos node command after installing this release automatically redirects to the output of the dcos node list command. The dcos node list command provides information similar to the output from the dcos node command, but also includes an additional column that indicates the public IP address of each node.

    If you have scripts or programs that use output from the dcos node command, you should test the output provided by the dcos node list command then update your scripts or programs, as needed.

  • Marathon-based HTTP, HTTPS, TCP, and Readiness checks

    Marathon-based HTTP, HTTPS, and TCP health checks have been deprecated since DC/OS 1.9. With this release, Marathon-based readiness checks have also been deprecated.

    If you have not already done so, you should migrate services to use the Mesos Health and Generic checks in place of the Marathon-based checks. As part of this migration, you should keep in mind that you can only specify one Mesos-based Health check and one Mesos-based Generic check.

  • Marathon support for App Container (appc) images is decommissioned in 1.13.

    There has been no active development for AppC images since 2016. Support for AppC images will be removed in DC/OS 1.14.

  • Setting the gpu_scheduling_behavior configuration option to undefined is no longer supported.

    With this release, the default value for the gpu_scheduling_behavior configuration option is restricted. The value undefined is decommissioned. This value will be removed in DC/OS 1.14.

    If you have scripts or programs that set the gpu_scheduling_behavior configuration option to undefined, you should update them, as needed.

  • Marathon no longer supports the api_heavy_events setting.

    With this release, the only response format allowed for /v2/events is light, in accordance with the previously published deprecation plan. If you attempt to start Marathon with the --deprecated_features=api_heavy_events setting specified, the startup operation will fail with an error.

  • Marathon no longer supports Kamon-based metrics and related command-line arguments.

    The following command-line arguments that are related to outdated reporting tools have been removed:

    • --reporter_graphite
    • --reporter_datadog
    • --metrics_averaging_window

    If you specify any of these flags, Marathon will fail to start.

  • Proxying server-sent events (sse) from standby Marathon instances is no longer supported.

    DC/OS no longer allows a standby Marathon instance to proxy /v2/events from the Marathon leader. Previously, it was possible to use the proxy_events flag to force Marathon to proxy the response from /v2/events. This standby redirect functionality and the related flag are no longer valid in 1.13.

  • Marathon no longer supports the save_tasks_to_launch_timeout setting.

    The save_tasks_to_launch_timeout option was deprecated in Marathon 1.5 and using it has had no effect on Marathon operations since that time. If you specify the save_tasks_to_launch_timeout setting, Marathon will fail to start.

Updated components change lists

For access to the logs that track specific changes to components that are included in the DC/OS distribution, see the following links:

Previous releases

To review changes from a recent previous release, see the following links: