Remove the invalid specs from doc/source
The specs directory in Cyborg is no longer updated, and the Cyborg specifications are published at https://specs.openstack.org/openstack/cyborg-specs/, so remove this directory from Cyborg to reduce maintenance costs.

Change-Id: Iebcbf2ebd6da3bc51e85c62f18c547909026c2f0
This commit is contained in:
parent
acbc64f3be
commit
aa2aa69e34
doc/source
index.rst
specs
index.rst
pike/approved
queens/approved
cyborg-fpga-driver-proposal.rst
cyborg-fpga-model-proposal.rst
cyborg-internal-api.rst
cyborg-nova-interaction.rst
cyborg-spdk-driver-proposal.rst
rocky/approved
compute-node.rst
cyborg-agent-driver-api.rst
cyborg-fpga-bitstream-spec.rst
cyborg-fpga-programming-proposal.rst
cyborg-nova-sched.rst
resource-quotas.rst
template.rst
@ -60,7 +60,6 @@ Documentation for Developers
|
||||
contributor/contributing
|
||||
contributor/devstack_setup
|
||||
contributor/driver-development-guide
|
||||
specs/index
|
||||
|
||||
Indices and tables
|
||||
==================
|
||||
|
@ -1,40 +0,0 @@
|
||||
Cyborg Specs
|
||||
============
|
||||
|
||||
Template
|
||||
--------
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
template
|
||||
|
||||
Rocky
|
||||
-----
|
||||
This section has a list of specs for the Rocky release.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:glob:
|
||||
|
||||
rocky/approved/*
|
||||
|
||||
Queens
|
||||
------
|
||||
This section has a list of specs for the Queens release.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:glob:
|
||||
|
||||
queens/approved/*
|
||||
|
||||
Pike
|
||||
----
|
||||
This section has a list of specs for the Pike release.
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:glob:
|
||||
|
||||
pike/approved/*
|
@ -1,166 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Cyborg Agent Proposal
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-agent
|
||||
|
||||
This spec proposes the responsibilities and initial design of the
|
||||
Cyborg Agent.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Cyborg requires an agent on the compute hosts to manage the several
|
||||
responsibilities, including locating accelerators, monitoring their
|
||||
status, and orchestrating driver operations.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
Use of accelerators attached to virtual machine instances in OpenStack
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Cyborg Agent resides on various compute hosts and monitors them for
accelerators. On its first run Cyborg Agent will run the detect
accelerator functions of all its installed drivers. The resulting list
of accelerators available on the host will be reported to the conductor
where it will be stored into the database and listed during API requests.
By default accelerators will be inserted into the database in an inactive
state. It will be up to the operators to manually set an accelerator to
'ready' at which point Cyborg Agent will be responsible for calling the
driver's install function and ensuring that the accelerator is ready for use.
|
||||
|
||||
In order to mirror the current Nova model of using the placement API, each
Agent will send updates on its resources directly to the placement API
endpoint as well as to the conductor for usage aggregation. This should keep
the placement API up to date on accelerators and their usage.
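
The first-run flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (``DummyDriver``, ``FakeConductor``, ``Agent``, ``detect_accelerators``, ``report``, ``mark_ready``) are hypothetical and are not taken from the Cyborg code base, and reporting to the placement API is left out.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    state: str = "inactive"  # accelerators are stored as inactive by default


class DummyDriver:
    """Hypothetical driver; a real one would probe hardware."""

    def __init__(self, names):
        self._names = names
        self.installed = []

    def detect_accelerators(self):
        return [Accelerator(name) for name in self._names]

    def install(self, accelerator):
        self.installed.append(accelerator.name)


class FakeConductor:
    """Stands in for the conductor, which owns the database writes."""

    def __init__(self):
        self.records = {}

    def report(self, accelerators):
        for acc in accelerators:
            self.records[acc.name] = acc


class Agent:
    def __init__(self, drivers, conductor):
        self.drivers = drivers
        self.conductor = conductor
        self._owner = {}  # accelerator name -> driver that detected it

    def first_run(self):
        """Run every driver's detect function and report the result."""
        found = []
        for driver in self.drivers:
            for acc in driver.detect_accelerators():
                self._owner[acc.name] = driver
                found.append(acc)
        self.conductor.report(found)
        return found

    def mark_ready(self, name):
        """Operator sets 'ready'; the agent then calls the driver's install."""
        acc = self.conductor.records[name]
        acc.state = "ready"
        self._owner[name].install(acc)
        return acc


conductor = FakeConductor()
driver = DummyDriver(["fpga0", "crypto0"])
agent = Agent([driver], conductor)
agent.first_run()
agent.mark_ready("fpga0")
```

Keeping the owning driver in a small lookup table is one possible design choice; it lets the agent route the install call without asking every driver again.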
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
There are lots of alternate ways to lay out the communication between the Agent
and the API endpoint or the driver. Almost all of them involve exactly where
we draw the line between the driver, Conductor, and Agent. I've written my
proposal with the goal of having the Agent act mostly as a monitoring tool,
reporting to the cloud operator or other Cyborg components to take action.
A more active role for Cyborg Agent is possible but either requires significant
synchronization with the Conductor or potentially steps on the toes of
operators.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
Cyborg Agent will create new entries in the database for accelerators it
detects. It will also update those entries with the current status of the
accelerator at a high level. More temporary data, like the current usage of
a given accelerator, will be broadcast via a message passing system and won't
be stored.
|
||||
|
||||
Cyborg Agent will retain a local cache of this data with the goal of not losing
|
||||
accelerator state on system interruption or loss of connection.
|
||||
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
TODO once we firm up who's responsible for what.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
Monitoring capability might be useful to an attacker, but without root
|
||||
this is a fairly minor concern.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
Notifying users that their accelerators are ready?
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
Interaction details around adding, removing, and setting up accelerators
are TBD.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Agent heartbeat for updated accelerator performance stats might make
|
||||
scaling to many accelerator hosts a challenge for the Cyborg endpoint
|
||||
and database. Perhaps we should consider doing an active 'load census'
|
||||
before scheduling instances? But that just moves the problem from constant
|
||||
load to issues with a bootstorm.
|
||||
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
By not placing the drivers with the Agent we keep the deployment footprint
|
||||
pretty small. We do add development complexity and security concerns sending
|
||||
them over the wire though.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
TBD
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
<jkilpatr>
|
||||
|
||||
Other contributors:
|
||||
<launchpad-id or None>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Agent implementation
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Cyborg Driver Spec
|
||||
* Cyborg API Spec
|
||||
* Cyborg Conductor Spec
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
CI infrastructure with a set of accelerators, drivers, and hardware will be
|
||||
required for testing the Agent installation and operation regularly.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Little to none. Perhaps an on-compute config file may need to be
documented, but I think it's best to avoid local configuration where possible.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Other Cyborg Specs
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Pike
|
||||
- Introduced
|
@ -1,414 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
===================
|
||||
Cyborg API proposal
|
||||
===================
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-api
|
||||
|
||||
This spec proposes to provide the initial API design for Cyborg.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Cyborg, as a common management framework for dedicated devices (hardware/
software accelerators, high-speed storage, etc.), needs a RESTful API to expose
its basic functionality.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
* As a user I want to be able to spawn a VM with dedicated hardware, so
  that I can utilize the provided hardware.
* As a compute service I need to know how the requested resource should be
  attached to the VM.
* As a scheduler service I'd like to know on which resource provider the
  requested resource can be found.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
In general we want to develop the APIs that support basic life cycle management
|
||||
for Cyborg.
|
||||
|
||||
Life Cycle Management Phases
|
||||
----------------------------
|
||||
|
||||
For Cyborg, LCM phases include the typical create, retrieve, update, and delete
operations. Note that deprovisioning mainly refers to the detach (delete)
operation, which deactivates an acceleration capability but preserves the
resource itself for future use. From a functional point of view, the LCM
includes provision, attach, update, list, and detach. There is no notion of
deprovisioning in the Cyborg API in the sense of decommissioning or
disconnecting an entire accelerator device from the bus.
|
||||
|
||||
Difference between Provision and Attach/Detach
|
||||
----------------------------------------------
|
||||
|
||||
Note that while the APIs support provisioning via CRUD operations,
|
||||
attach/detach are considered different:
|
||||
|
||||
* Provision operations (create) will involve the api->conductor->agent->driver
  workflow, whereas attach/detach (update/delete) could be taken care of at
  the driver layer without involving the aforementioned workflow. This is
  similar to the difference between creating a volume and attaching/detaching
  a volume in Cinder.

* The attach/detach operations in the Cyborg API will mainly involve DB status
  modification.
|
||||
|
||||
Difference between Attach/Detach To VM and Host
|
||||
-----------------------------------------------
|
||||
|
||||
Moreover there are also differences when we attach an accelerator to a VM or
|
||||
a host, similar to Cinder.
|
||||
|
||||
* When the attachment happens to a VM, we are expecting that Nova could call
|
||||
the virt driver to perform the action for the instance. In this case Nova
|
||||
needs to support the acc-attach and acc-detach action.
|
||||
|
||||
* When the attachment happens to a host, we are expecting that Cyborg could
  take care of the action itself via the Cyborg driver. Although currently
  there is the generic driver to accomplish the job, we should consider an
  os-brick-like standalone lib for accelerator attach/detach operations.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
* For attaching an accelerator to a VM, we could let Cyborg perform the action
  itself; however, this risks tight coupling with Nova, from which Cyborg
  would need to get instance-related information.
* For attaching an accelerator to a host, we could consider using Ironic
  drivers; however, this might not fit the standalone accelerator rack
  scenarios where accelerators are not attached to a server at all.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
A new table in the API database will be created::
|
||||
|
||||
    CREATE TABLE accelerators (
        accelerator_id INT NOT NULL,
        device_type STRING NOT NULL,
        acc_type STRING NOT NULL,
        acc_capability STRING NOT NULL,
        vendor_id STRING,
        product_id STRING,
        remotable INT
    );
|
||||
|
||||
Note that there is an ongoing discussion on nested resource
provider new data structures that will impact the Cyborg DB
implementation. The code implementation should be aligned
with the resource provider DB requirements as much as possible.
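
As a rough illustration only, the proposed table can be tried out with Python's built-in ``sqlite3`` module. The sketch below assumes SQLite-friendly column types (``INTEGER``/``TEXT`` in place of ``INT``/``STRING``), and the inserted row uses made-up sample values; neither is part of the spec.

```python
import sqlite3

# Same columns as the proposed table, with SQLite type names (an assumption).
SCHEMA = """
CREATE TABLE accelerators (
    accelerator_id INTEGER NOT NULL,
    device_type TEXT NOT NULL,
    acc_type TEXT NOT NULL,
    acc_capability TEXT NOT NULL,
    vendor_id TEXT,
    product_id TEXT,
    remotable INTEGER
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Sample row with illustrative values only.
conn.execute(
    "INSERT INTO accelerators VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "CRYPTO", "IPSEC", "aes", None, None, 0),
)

rows = conn.execute(
    "SELECT accelerator_id, device_type, remotable FROM accelerators"
).fetchall()
```

The ``NOT NULL`` columns reject incomplete rows, while ``vendor_id`` and ``product_id`` may be left empty, as in the proposed definition.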
|
||||
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
The API changes add resource endpoints to:
|
||||
|
||||
* `GET` a list of all the accelerators
|
||||
* `GET` a single accelerator for a given id
|
||||
* `POST` create a new accelerator resource
|
||||
* `PUT` an update to an existing accelerator spec
|
||||
* `PUT` attach an accelerator to a VM or a host
|
||||
* `DELETE` detach an existing accelerator for a given id
|
||||
|
||||
The following new REST API call will be created:
|
||||
|
||||
'GET /accelerators'
|
||||
*************************
|
||||
|
||||
Return a list of accelerators managed by Cyborg
|
||||
|
||||
Example message body of the response to the GET operation::
|
||||
|
||||
    200 OK
    Content-Type: application/json

    {
        "accelerator": [
            {
                "uuid": "8e45a2ea-5364-4b0d-a252-bf8becaa606e",
                "acc_specs": {
                    "remote": 0,
                    "num": 1,
                    "device_type": "CRYPTO",
                    "acc_capability": {
                        "num": 2,
                        "ipsec": {
                            "aes": {
                                "3des": 50,
                                "num": 1
                            }
                        }
                    }
                }
            },
            {
                "uuid": "eaaf1c04-ced2-40e4-89a2-87edded06d64",
                "acc_specs": {
                    "remote": 0,
                    "num": 1,
                    "device_type": "CRYPTO",
                    "acc_capability": {
                        "num": 2,
                        "ipsec": {
                            "aes": {
                                "3des": 40,
                                "num": 1
                            }
                        }
                    }
                }
            }
        ]
    }
|
||||
|
||||
'GET /accelerators/{uuid}'
|
||||
**************************
|
||||
|
||||
Retrieve information about the accelerator identified by '{uuid}'
|
||||
|
||||
Example GET Request::
|
||||
|
||||
    GET /accelerators/8e45a2ea-5364-4b0d-a252-bf8becaa606e

    200 OK
    Content-Type: application/json

    {
        "uuid": "8e45a2ea-5364-4b0d-a252-bf8becaa606e",
        "acc_specs": {
            "remote": 0,
            "num": 1,
            "device_type": "CRYPTO",
            "acc_capability": {
                "num": 2,
                "ipsec": {
                    "aes": {
                        "3des": 50,
                        "num": 1
                    }
                }
            }
        }
    }
|
||||
|
||||
If the accelerator does not exist a `404 Not Found` must be
|
||||
returned.
|
||||
|
||||
'POST /accelerators/{uuid}'
|
||||
***************************
|
||||
|
||||
Create a new accelerator
|
||||
|
||||
Example POST Request::
|
||||
|
||||
Content-type: application/json
|
||||
|
||||
{
|
||||
"name": "IPSec Card",
|
||||
"uuid": "8e45a2ea-5364-4b0d-a252-bf8becaa606e"
|
||||
}
|
||||
|
||||
The body of the request must match the following JSONSchema document::
|
||||
|
||||
    {
        "type": "object",
        "properties": {
            "name": {
                "type": "string"
            },
            "uuid": {
                "type": "string",
                "format": "uuid"
            }
        },
        "required": [
            "name"
        ],
        "additionalProperties": false
    }
|
||||
|
||||
The response body is empty. The headers include a location header
|
||||
pointing to the created accelerator resource::
|
||||
|
||||
201 Created
|
||||
Location: /accelerators/8e45a2ea-5364-4b0d-a252-bf8becaa606e
|
||||
|
||||
A `409 Conflict` response code will be returned if another accelerator
|
||||
exists with the provided name.
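
The request-body rules above can be illustrated with a small hand-written check. This is a sketch under assumptions, not the Cyborg implementation: ``validate_create_body`` is a hypothetical helper that mirrors the schema (``name`` required, optional ``uuid`` in UUID format, no additional properties) using only the standard library rather than a JSON Schema validator.

```python
import uuid


def validate_create_body(body):
    """Return a list of problems; an empty list means the body is valid."""
    if not isinstance(body, dict):
        return ["body must be a JSON object"]
    errors = []
    # "additionalProperties": false -> only the two known keys are allowed.
    extra = set(body) - {"name", "uuid"}
    if extra:
        errors.append("unexpected properties: " + ", ".join(sorted(extra)))
    # "required": ["name"], and "name" must be a string.
    if "name" not in body:
        errors.append("'name' is required")
    elif not isinstance(body["name"], str):
        errors.append("'name' must be a string")
    # "uuid" is optional, but if present it must be a UUID string.
    if "uuid" in body:
        value = body["uuid"]
        if not isinstance(value, str):
            errors.append("'uuid' must be a string")
        else:
            try:
                uuid.UUID(value)
            except ValueError:
                errors.append("'uuid' is not a valid UUID")
    return errors


ok = validate_create_body(
    {"name": "IPSec Card", "uuid": "8e45a2ea-5364-4b0d-a252-bf8becaa606e"}
)
missing_name = validate_create_body(
    {"uuid": "8e45a2ea-5364-4b0d-a252-bf8becaa606e"}
)
```

A service would typically answer with `400 Bad Request` when the returned list is not empty; that mapping is an assumption here, since the spec lists only `201 Created` and `409 Conflict` for this call.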
|
||||
|
||||
'PUT /accelerators/{uuid}/{acc_spec}'
|
||||
*************************************
|
||||
|
||||
Update the spec for the accelerator identified by `{uuid}`.
|
||||
|
||||
Example::
|
||||
|
||||
    PUT /accelerator/8e45a2ea-5364-4b0d-a252-bf8becaa606e

    Content-type: application/json

    {
        "acc_specs": {
            "remote": 0,
            "num": 1,
            "device_type": "CRYPTO",
            "acc_capability": {
                "num": 2,
                "ipsec": {
                    "aes": {
                        "3des": 50,
                        "num": 1
                    }
                }
            }
        }
    }
|
||||
|
||||
The returned HTTP response code will be one of the following:
|
||||
|
||||
* `200 OK` if the spec is successfully updated
|
||||
* `404 Not Found` if the accelerator identified by `{uuid}` was
|
||||
not found
|
||||
* `400 Bad Request` for bad or invalid syntax
|
||||
* `409 Conflict` if another process updated the same spec.
|
||||
|
||||
|
||||
'PUT /accelerators/{uuid}'
|
||||
**************************
|
||||
|
||||
Attach the accelerator identified by `{uuid}`.
|
||||
|
||||
Example::
|
||||
|
||||
PUT /accelerator/8e45a2ea-5364-4b0d-a252-bf8becaa606e
|
||||
|
||||
Content-type: application/json
|
||||
|
||||
{
|
||||
"name": "IPSec Card",
|
||||
"uuid": "8e45a2ea-5364-4b0d-a252-bf8becaa606e"
|
||||
}
|
||||
|
||||
The returned HTTP response code will be one of the following:
|
||||
|
||||
* `200 OK` if the accelerator is successfully attached
|
||||
* `404 Not Found` if the accelerator identified by `{uuid}` was
|
||||
not found
|
||||
* `400 Bad Request` for bad or invalid syntax
|
||||
* `409 Conflict` if another process attaches the same accelerator.
|
||||
|
||||
|
||||
'DELETE /accelerator/{uuid}'
|
||||
****************************
|
||||
|
||||
Detach the accelerator identified by `{uuid}`.
|
||||
|
||||
The body of the request and the response is empty.
|
||||
|
||||
The returned HTTP response code will be one of the following:
|
||||
|
||||
* `204 No Content` if the request was successful and the accelerator was
|
||||
detached.
|
||||
* `404 Not Found` if the accelerator identified by `{uuid}` was
|
||||
not found.
|
||||
* `409 Conflict` if allocation records exist for any of the
  accelerator resources that would be detached as a result of detaching
  the accelerator.
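
The response codes listed above for the detach call can be sketched as a small decision function. The names used here (``detach_accelerator`` and the ``accelerators`` and ``allocations`` dictionaries) are hypothetical stand-ins for the database lookups and are not part of the spec.

```python
def detach_accelerator(accelerators, allocations, acc_uuid):
    """Return the HTTP status code for a DELETE on one accelerator."""
    if acc_uuid not in accelerators:
        return 404  # Not Found: unknown accelerator
    if allocations.get(acc_uuid):
        return 409  # Conflict: allocation records still exist
    accelerators[acc_uuid]["attached"] = False
    return 204  # No Content: detached, empty response body


accelerators = {
    "8e45a2ea-5364-4b0d-a252-bf8becaa606e": {"attached": True},
    "eaaf1c04-ced2-40e4-89a2-87edded06d64": {"attached": True},
}
# The second accelerator still has one (made-up) allocation record.
allocations = {"eaaf1c04-ced2-40e4-89a2-87edded06d64": ["allocation-1"]}

free = detach_accelerator(
    accelerators, allocations, "8e45a2ea-5364-4b0d-a252-bf8becaa606e"
)
busy = detach_accelerator(
    accelerators, allocations, "eaaf1c04-ced2-40e4-89a2-87edded06d64"
)
unknown = detach_accelerator(accelerators, allocations, "no-such-uuid")
```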
|
||||
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
Developers can use this REST API after it has been implemented.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
zhipengh <huangzhipeng@huawei.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implement the APIs specified in this spec
|
||||
* Proposal to Nova about the new accelerator
|
||||
attach/detach api
|
||||
* Implement the DB specified in this spec
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
* Unit tests will be added to Cyborg API.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
None
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Pike
|
||||
- Introduced
|
@ -1,143 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Cyborg Conductor Proposal
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-agent
|
||||
|
||||
This spec proposes the responsibilities and initial design of the
|
||||
Cyborg Conductor.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Cyborg requires a conductor on the controller hosts to manage the cyborg
|
||||
system state and coalesce database operations.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
Use of accelerators attached to virtual machine instances in OpenStack
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Cyborg Conductor will reside on the control node and will be
responsible for stateful actions taken by Cyborg. It acts as both a cache to
the database and a method of combining reads and writes to the database.
All other Cyborg components will go through the conductor for database
operations.
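
The idea of a conductor that caches reads and combines writes can be sketched as below. This is a toy model under assumptions: the ``Conductor`` class, its ``write``/``read``/``flush`` methods, and the plain dictionary standing in for the database are all hypothetical, and no message bus is involved.

```python
class Conductor:
    """Toy conductor: caches values and batches writes to the database."""

    def __init__(self, database):
        self._db = database   # dict standing in for the real database
        self._cache = {}
        self._pending = {}
        self.db_writes = 0    # number of batched writes sent to the database

    def write(self, key, value):
        self._cache[key] = value
        # A later write to the same key replaces the earlier pending one.
        self._pending[key] = value

    def read(self, key):
        if key not in self._cache:
            self._cache[key] = self._db[key]
        return self._cache[key]

    def flush(self):
        """Send all pending changes to the database as one batch."""
        if self._pending:
            self._db.update(self._pending)
            self.db_writes += 1
            self._pending.clear()


database = {}
conductor = Conductor(database)
conductor.write("fpga0", "inactive")
conductor.write("fpga0", "ready")
conductor.write("crypto0", "inactive")
conductor.flush()
```

Three updates reach the database as a single batch here, which is the kind of saving the proposal is after; how often to flush is a tuning question the spec leaves open.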
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Having each Cyborg Agent instance hit the database on its own is a possible
alternative, and it may even be feasible if the accelerator load monitoring
rate is very low and the vast majority of operations are reads. But since we
intend to store regularly updated metadata about accelerator usage, this model
probably will not scale well.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
Using the conductor 'properly' will result in little or no per instance state
and stateful operations moving through the conductor with the exception of
some local caching where it can be guaranteed to work well.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
N/A
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
Negligible
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
N/A
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
Faster Cyborg operation and less database load.
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Generally positive so long as we don't overload the messaging bus trying
|
||||
to pass things to the Conductor to write out.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
Conductor must be installed and configured on the controllers.
|
||||
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None for API users, internally heavy use of message passing will
|
||||
be required if we want to keep all system state in the controllers.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
jkilpatr
|
||||
|
||||
Other contributors:
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implementation
|
||||
* Integration with API and Agent
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Cyborg API spec
|
||||
* Cyborg Agent spec
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
This component should be possible to fully test using unit tests and functional
|
||||
CI using the dummy driver.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Configuration values that tune the save-out rate and other parameters on the
controller will need to be documented for end users.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Cyborg API Spec
|
||||
Cyborg Agent Spec
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Pike
|
||||
- Introduced
|
@ -1,163 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==============================
|
||||
Cyborg Generic Driver Proposal
|
||||
==============================
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/generic-driver-cyborg
|
||||
|
||||
This spec proposes to provide the initial design for Cyborg's generic driver.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
This blueprint proposes to add a generic driver for openstack-cyborg.
|
||||
The goal is to provide users & operators with a reliable generic
|
||||
implementation that is hardware agnostic and provides basic
|
||||
accelerator functionality.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
* As an admin user and a non-admin user with elevated privileges, I should be
|
||||
able to identify and discover attached accelerator backends.
|
||||
* As an admin user and a non-admin user with elevated privileges, I should be
|
||||
able to view services on each attached backend after the agent has
|
||||
discovered services on each backend.
|
||||
* As an admin user and a non-admin user, I should be able to list and update
|
||||
attached accelerators by driver by querying nova with the Cyborg-API.
|
||||
* As an admin user and a non-admin user with elevated privileges, I should be
|
||||
able to install accelerator generic driver.
|
||||
* As an admin user and a non-admin user with elevated privileges, I should be
|
||||
able to uninstall accelerator generic driver.
|
||||
* As an admin user and a non-admin user with elevated privileges, I should be
|
||||
able to issue attach command to the instance via the driver which gets
|
||||
routed to Nova via the Cyborg API.
|
||||
* As an admin user and a non-admin user with elevated privileges, I should be
|
||||
able to issue detach command to the instance via the driver which gets
|
||||
routed to Nova via the Cyborg API.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
* Cyborg needs a reference implementation that can be used as a model for
|
||||
future driver implementations and that will be referred to as the generic
|
||||
driver implementation
|
||||
* Develop the generic driver implementation that supports CRUD operations for
|
||||
accelerators for single backend and multi backend setup scenarios.
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
* The generic driver will update the central database when any CRUD or
|
||||
attach/detach operations take place
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
This blueprint proposes to add the following APIs:
|
||||
|
||||
* cyborg install-driver <driver_id>
|
||||
* cyborg uninstall-driver <driver_id>
|
||||
* cyborg attach-instance <instance_id>
|
||||
* cyborg detach-instance <instance_id>
|
||||
* cyborg service-list
|
||||
* cyborg driver-list
|
||||
* cyborg update-driver <driver_id>
|
||||
* cyborg discover-services
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
Developers will have access to a reference generic implementation which
|
||||
can be used to build vendor-specific drivers.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Rushil Chugh <rushil.chugh@gmail.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
This change would entail the following:
|
||||
|
||||
* Add a feature to identify and discover attached accelerator backends.
|
||||
* Add a feature to list services running on the backend
|
||||
* Add a feature to attach accelerators to the generic backend.
|
||||
* Add a feature to detach accelerators from the generic backend.
|
||||
* Add a feature to list accelerators attached to the generic backend.
|
||||
* Add a feature to modify accelerators attached to the generic backend.
|
||||
* Defining a reference implementation detailing the flow of requests between
|
||||
the cyborg-api, cyborg-conductor and nova-compute services.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Dependent on Cyborg API and Agent implementations.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
* Unit tests will be added to test the Cyborg generic driver.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
None
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Pike
|
||||
- Introduced
|
@ -1,193 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
===========================
|
||||
Cyborg FPGA Driver Proposal
|
||||
===========================
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-fpga-driver
|
||||
|
||||
This spec proposes to provide the initial design for Cyborg's FPGA driver.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
A Field Programmable Gate Array (FPGA) is an integrated circuit designed to be
configured by a customer or a designer after manufacturing. FPGAs can be
significantly faster for some applications because of their parallel nature
and optimality in terms of the number of gates used for a certain process.
Hence, using FPGAs for application acceleration in the cloud has become
desirable.
|
||||
|
||||
There is a management framework in Cyborg [1]_ for heterogeneous accelerators,
tracking and deploying FPGAs. This spec will add an FPGA driver for Cyborg to
manage specific FPGA devices.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
* When the Cyborg agent starts or periodically checks resources, the Cyborg
  FPGA driver should enumerate the FPGA devices and report the details of all
  available FPGA accelerators on the host, such as BDF (Bus, Device,
  Function), PID (Product ID), VID (Vendor ID), IMAGE_ID, and PF (Physical
  Function)/VF (Virtual Function) type.

* When a user uses empty FPGA regions as accelerators, the Cyborg agent will
  call the driver's program() interface. The Cyborg agent should provide the
  BDF of the PF/VF and the local image path to the driver. More details can be
  found in ref [2]_.

* When there may be more than one vendor's FPGA card on a host, or on
  different hosts in the cluster, the Cyborg agent can discover the vendors
  easily and intelligently through the Cyborg FPGA driver, and call the
  correct driver to execute its operations, such as discover() and program().
|
||||
|
||||
|
||||
Proposed changes
|
||||
================
|
||||
|
||||
In general, the goal is to develop a Cyborg FPGA driver that supports
|
||||
discover/program interfaces for FPGA accelerator framework.
|
||||
|
||||
The driver should include the following functions:
|
||||
1. discover()
|
||||
driver reports devices as following::
|
||||
|
||||
[{
|
||||
"vendor": "0x8086",
|
||||
"product": "bcc0",
|
||||
"pr_num": 1,
|
||||
"devices": "0000:be:00:0",
|
||||
"path": "/sys/class/fpga/intel-fpga-dev.0",
|
||||
"regions": [
|
||||
{"vendor": "0x8086",
|
||||
"product": "bcc1",
|
||||
"regions": 1,
|
||||
"devices": "0000:be:00:1",
|
||||
"path": "/sys/class/fpga/intel-fpga-dev.1"
|
||||
}]
|
||||
}]
|
||||
|
||||
pr_num: partial reconfiguration region numbers.
|
||||
|
||||
2. program(device_path, image)
|
||||
program the image to a PR region specified by device_path.
|
||||
device_path: the sys path of accelerator device.
|
||||
image: The local path of programming image.
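
The two functions above can be pictured as a small driver interface. The sketch below is an assumption-laden illustration rather than the proposed implementation: ``FPGADriver``, ``FakeIntelDriver``, ``find_driver``, and the image path ``/tmp/example.bin`` are hypothetical, and the fake driver only records calls instead of touching hardware.

```python
from abc import ABC, abstractmethod


class FPGADriver(ABC):
    """Hypothetical base class exposing the discover/program interface."""

    VENDOR = None  # PCI vendor id string, e.g. "0x8086"

    @abstractmethod
    def discover(self):
        """Return a list of device dicts shaped like the example above."""

    @abstractmethod
    def program(self, device_path, image):
        """Program the local image file into the region at device_path."""


class FakeIntelDriver(FPGADriver):
    """Stand-in driver that records calls; it does not access any device."""

    VENDOR = "0x8086"

    def __init__(self):
        self.programmed = {}

    def discover(self):
        return [{
            "vendor": self.VENDOR,
            "product": "bcc0",
            "pr_num": 1,
            "devices": "0000:be:00:0",
            "path": "/sys/class/fpga/intel-fpga-dev.0",
            "regions": [],
        }]

    def program(self, device_path, image):
        self.programmed[device_path] = image
        return True


def find_driver(drivers, vendor):
    """Pick the driver whose vendor id matches the discovered device."""
    for driver in drivers:
        if driver.VENDOR == vendor:
            return driver
    raise LookupError("no FPGA driver for vendor " + vendor)


drivers = [FakeIntelDriver()]
device = drivers[0].discover()[0]
chosen = find_driver(drivers, device["vendor"])
chosen.program(device["path"], "/tmp/example.bin")
```

Selecting the driver by vendor id is one way the agent could handle hosts that carry cards from more than one vendor, as described in the use cases.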
|
||||
|
||||
Image Format
|
||||
----------------------------
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
FPGA driver will not touch Data model.
|
||||
The Cyborg Agent can call FPGA driver to update the database
|
||||
during the discover/program operations.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
The related FPGA accelerator APIs are out of scope for this spec.
|
||||
The FPGA management framework for Cyborg [1]_ will alter the proposal.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
Deployers should install the specific FPGA management stack that the driver
|
||||
depends on.
|
||||
|
||||
Please see ref [2]_ for details.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
There will be some developer impact vis-à-vis new functionality that
|
||||
will be available to devs.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Shaohe Feng <shaohe.feng@intel.com>
|
||||
Dolpher Du <dolpher.du@intel.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implement the cyborg-fpga-driver in this spec.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Cyborg API Spec
|
||||
* Cyborg Agent Spec
|
||||
* Cyborg Driver Spec
|
||||
* Cyborg Conductor Spec
|
||||
|
||||
Testing
|
||||
========
|
||||
|
||||
* Unit tests will be added to test Cyborg FPGA driver.
|
||||
* Functional tests will be added to test Cyborg FPGA driver.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Document FPGA driver in the Cyborg project
|
||||
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Queens
|
||||
- Introduced
|
||||
|
||||
References
|
||||
==========
|
||||
.. [1] https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-fpga-modelling
|
||||
.. [2] https://01.org/OPAE
|
@ -1,346 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Cyborg FPGA Model Proposal
|
||||
==========================================
|
||||
|
||||
Blueprint url is not available yet
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-fpga-modelling
|
||||
|
||||
This spec proposes the DB modelling schema for tracking reprogrammable
|
||||
resources
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
A field-programmable gate array (FPGA) is an integrated circuit designed to be
|
||||
configured by a customer or a designer after manufacturing. Their advantage
|
||||
lies in that they are sometimes significantly faster for some applications
|
||||
because of their parallel nature and optimality in terms of the number of gates
|
||||
used for a certain process. Hence, using FPGAs for application acceleration in
the cloud has become desirable. Because Cyborg is a management framework for
heterogeneous accelerators, tracking and deploying FPGAs are much-needed
features.
|
||||
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
When a user requests FPGA resources, the scheduler will use the placement agent [1]_ to
|
||||
select appropriate hosts that have the requested FPGA resources.
|
||||
|
||||
When an FPGA-type resource is allocated to a VM, Cyborg needs to record in the
database exactly which device has been assigned. Likewise, when
the resource is released, Cyborg needs to detach and free that exact
resource.
|
||||
|
||||
When a new device is plugged into the system (host), Cyborg needs to discover
it and store it in the database.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
We need to add two more tables to the Cyborg database: one for tracking all the
deployables and one for arbitrary key-value pairs of deployable-associated
attributes. These tables are named Deployables and Attributes.
|
||||
|
||||
The Deployables table consists of all the common attribute columns as well as
|
||||
a parent_id and a root_id. The parent_id will point to the associated parent
|
||||
deployable and the root_id will point to the associated root deployable.
|
||||
By doing this, we can form a nested tree structure to represent different
|
||||
hierarchies. In addition, there will be a foreign key named accelerator_id
that references the accelerators table. An FPGA that has not had any
bitstream loaded on it will still be tracked as a Deployable, but
no other Deployables will reference it. For instance, a network of
FPGA hierarchies can be formed using deployables in the following scheme::
|
||||
|
||||
                        -------------------
    ------------------->|Deployable - FPGA|<--------------------
    |                   -------------------                    |
    |                            /\                            |
    |            root_id        /  \    parent_id/root_id      |
    |                          /    \                          |
    |          -----------------   -----------------           |
    |          |Deployable - PF|   |Deployable - PF|           |
    |          -----------------   -----------------           |
    |                 /\                                       |
    |                /  \ parent_id                 root_id    |
    |               /    \                                     |
    -----------------   -----------------                      |
    |Deployable - VF|   |Deployable - VF| ----------------------
    -----------------   -----------------
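The nesting described by the scheme can be rebuilt from flat rows using only ``parent_id`` and ``root_id``. The sketch below is illustrative: the row values are made up, and it assumes (the spec does not say) that a root deployable stores its own id in ``root_id`` and has no ``parent_id``.

```python
from collections import defaultdict

# Flat rows as they might come from the Deployables table (made-up values).
rows = [
    {"id": 10, "parent_id": None, "root_id": 10, "type": "FPGA"},
    {"id": 11, "parent_id": 10, "root_id": 10, "type": "PF"},
    {"id": 15, "parent_id": 10, "root_id": 10, "type": "PF"},
    {"id": 12, "parent_id": 11, "root_id": 10, "type": "VF"},
    {"id": 13, "parent_id": 11, "root_id": 10, "type": "VF"},
]


def build_tree(rows, root_id):
    """Return a nested dict for one root deployable and its descendants."""
    children = defaultdict(list)
    by_id = {}
    for row in rows:
        if row["root_id"] != root_id:
            continue  # row belongs to a different root deployable
        by_id[row["id"]] = row
        if row["parent_id"] is not None:
            children[row["parent_id"]].append(row["id"])

    def expand(node_id):
        node = by_id[node_id]
        return {
            "id": node_id,
            "type": node["type"],
            "children": [expand(child) for child in children[node_id]],
        }

    return expand(root_id)


tree = build_tree(rows, 10)
```

Filtering on ``root_id`` first means a whole hierarchy can be fetched with one query, while ``parent_id`` is only needed to arrange the rows into levels.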
|
||||
|
||||
|
||||
The Attributes table consists of key and value columns to represent arbitrary
k-v pairs.
|
||||
|
||||
For instance, bitstream_id and function kpi can be tracked in this table.
|
||||
In addition, a foreign key deployable_id refers to the Deployables table and
|
||||
a parent_attribute_id to form nested structured attribute relationships.
|
||||
|
||||
Cyborg needs to have object classes to represent different types of
|
||||
deployables (e.g. FPGA, Physical Functions, Virtual Functions, etc.).
|
||||
|
||||
The Cyborg Agent needs a new feature to discover the FPGA resources from the FPGA
|
||||
driver and report them to the Cyborg DB through the conductor.
|
||||
|
||||
The Conductor needs to add a couple of sets of APIs for different types of deployable
|
||||
resources.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Alternatively, instead of having a flat table to track arbitrary hierarchies, we
|
||||
can use two different tables in Cyborg database, one for physical functions and
|
||||
one for virtual functions. physical_functions should have a foreign key
|
||||
constraint to reference the id in Accelerators table. In addition,
|
||||
virtual_functions should have a foreign key constraint to reference the id
|
||||
in physical_functions.
|
||||
|
||||
The problems with this design are as follows. First, it can only track up to
|
||||
3 hierarchies of resources. In case we need to add another layer, a lot of
|
||||
migration work will be required. Second, even if we only need to add some new
|
||||
attribute to the existing resource type, we need to create new migration
|
||||
scripts for them. Overall the maintenance work is tedious.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
As discussed in previous sections, two tables will be added: Deployables and
|
||||
Attributes::
|
||||
|
||||
|
||||
    CREATE TABLE Deployables
      (
        id             INTEGER NOT NULL ,    /*Primary Key*/
        parent_id      INTEGER ,             /*Pointer to the parent deployable's primary key*/
        root_id        INTEGER ,             /*Pointer to the root deployable's primary key*/
        name           VARCHAR2 (32 BYTE) ,  /*Name of the deployable*/
        pcie_address   VARCHAR2 (32 BYTE) ,  /*pcie address which can be used for passthrough*/
        uuid           VARCHAR2 (32 BYTE) ,  /*uuid v4 format for the deployable itself*/
        node_id        VARCHAR2 (32 BYTE) ,  /*uuid v4 format to identify which host this deployable is located on*/
        board          VARCHAR2 (16 BYTE) ,  /*Identify the model of the deployable (e.g. KU115)*/
        vendor         VARCHAR2 (16 BYTE) ,  /*Identify the vendor of the deployable (e.g. Xilinx)*/
        version        VARCHAR2 (32 BYTE) ,  /*Identify the version of the deployable (e.g. 1.2a)*/
        type           VARCHAR2 (32) ,       /*Identify the type of the deployable (e.g. FPGA/PF/VF)*/
        assignable     CHAR (1) ,            /*Represent if the deployable can be assigned to users*/
        instance_id    VARCHAR2 (32 BYTE) ,  /*Represent which instance this deployable has been assigned to*/
        availability   INTEGER NOT NULL ,    /*enum type to represent the status of the deployable (e.g. allocated/claimed)*/
        accelerator_id INTEGER NOT NULL      /*foreign key references to the accelerator table*/
      ) ;
    ALTER TABLE Deployables ADD CONSTRAINT Deployables_PK PRIMARY KEY ( id ) ;
    ALTER TABLE Deployables ADD CONSTRAINT Deployables_accelerators_FK FOREIGN KEY ( accelerator_id ) REFERENCES accelerators ( id ) ;
|
||||
|
||||
|
||||
    CREATE TABLE Attributes
      (
        id                  INTEGER NOT NULL ,  /*Primary Key*/
        deployable_id       INTEGER NOT NULL ,  /*foreign key references to the Deployables table*/
        KEY                 CLOB ,              /*Attribute Key*/
        value               CLOB ,              /*Attribute Value*/
        parent_attribute_id INTEGER             /*Pointer to the parent attribute's primary key*/
      ) ;
    ALTER TABLE Attributes ADD CONSTRAINT Attributes_PK PRIMARY KEY ( id ) ;
    ALTER TABLE Attributes ADD CONSTRAINT Attributes_Deployables_FK FOREIGN KEY ( deployable_id ) REFERENCES Deployables ( id ) ON DELETE CASCADE ;
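The effect of the ``ON DELETE CASCADE`` constraint can be tried out with Python's built-in ``sqlite3`` module. This is a rough sketch, not the proposed migration: the schema is cut down to a few columns, SQLite column types replace the Oracle-style ones above, and the sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE deployables (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER,
    root_id   INTEGER,
    name      TEXT,
    type      TEXT
);
CREATE TABLE attributes (
    id                  INTEGER PRIMARY KEY,
    deployable_id       INTEGER NOT NULL
        REFERENCES deployables (id) ON DELETE CASCADE,
    key                 TEXT,
    value               TEXT,
    parent_attribute_id INTEGER
);
""")

# One FPGA deployable with two attached key-value attributes.
conn.execute("INSERT INTO deployables VALUES (10, NULL, 10, 'fpga-0', 'FPGA')")
conn.execute("INSERT INTO attributes VALUES (1, 10, 'bitstream_id', 'abc', NULL)")
conn.execute("INSERT INTO attributes VALUES (2, 10, 'kpi', '50', NULL)")


def attribute_count(connection):
    """Count all rows currently stored in the attributes table."""
    return connection.execute("SELECT COUNT(*) FROM attributes").fetchone()[0]


before = attribute_count(conn)
# Deleting the deployable removes its attributes through the cascade.
conn.execute("DELETE FROM deployables WHERE id = 10")
after = attribute_count(conn)
```

Because attributes are removed together with their deployable, the conductor would not need a separate cleanup call when a device disappears.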
|
||||
|
||||
|
||||
RPC API impact
|
||||
---------------
|
||||
Two sets of conductor APIs need to be added: one set for physical functions and
one set for virtual functions.
|
||||
|
||||
Physical function apis::
|
||||
|
||||
def physical_function_create(context, values)
|
||||
def physical_function_get_all_by_filters(context, filters, sort_key='created_at', sort_dir='desc', limit=None, marker=None, columns_to_join=None)
|
||||
def physical_function_update(context, uuid, values, expected=None)
|
||||
def physical_function_destroy(context, uuid)
|
||||
|
||||
Virtual function apis::
|
||||
|
||||
def virtual_function_create(context, values)
|
||||
def virtual_function_get_all_by_filters(context, filters, sort_key='created_at', sort_dir='desc', limit=None, marker=None, columns_to_join=None)
|
||||
def virtual_function_update(context, uuid, values, expected=None)
|
||||
def virtual_function_destroy(context, uuid)
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
Since these tables are not exposed to users for modifying/adding/deleting,
Cyborg will only add read-only (GET) REST APIs that allow users to query information
related to deployables and their attributes.
|
||||
|
||||
API for retrieving Deployable's information::
|
||||
|
||||
Url: {base_url}/accelerators/deployable/{uuid}
|
||||
Method: GET
|
||||
URL Params:
|
||||
GET: uuid --> get deployable by uuid
|
||||
|
||||
Data Params:
|
||||
None
|
||||
|
||||
Success Response:
|
||||
GET:
|
||||
Code: 200
|
||||
Content: { deployable: {id : 12, parent_id: 11, root_id: 10, ....}}
|
||||
|
||||
Error Response
|
||||
Code: 401 UNAUTHORIZED
|
||||
Content: { error : "Log in" }
|
||||
OR
|
||||
Code: 422 Unprocessable Entity
|
||||
Content: { error : "deployable uuid invalid" }
|
||||
|
||||
Sample Call:
|
||||
To get the deployable with uuid=2864a139-c2cd-4f9f-abf3-44eb3f09b83c
|
||||
$.ajax({
|
||||
url: "/accelerators/deployable/2864a139-c2cd-4f9f-abf3-44eb3f09b83c",
|
||||
dataType: "json",
|
||||
type : "get",
|
||||
success : function(r) {
|
||||
console.log(r);
|
||||
}
|
||||
});
|
||||
|
||||
API for retrieving list of Deployables with filters/attributes::
|
||||
|
||||
Url: {base_url}/accelerators/deployable
|
||||
Method: GET
|
||||
URL Params:
|
||||
None
|
||||
|
||||
Data Params:
|
||||
k-v pairs for filtering
|
||||
|
||||
Success Response:
|
||||
GET:
|
||||
Code: 200
|
||||
Content: { deployables: [{id : 12, parent_id: 11, root_id: 10, ....}]}
|
||||
|
||||
Error Response
|
||||
Code: 401 UNAUTHORIZED
|
||||
Content: { error : "Log in" }
|
||||
OR
|
||||
Code: 422 Unprocessable Entity
|
||||
Content: { error : "deployable uuid invalid" }
|
||||
|
||||
Sample Call:
|
||||
To get a list of FPGAs with no bitstream loaded.
|
||||
$.ajax({
|
||||
url: "/accelerators/deployable",
|
||||
data: {
"bitstream_id": null,
|
||||
"type": "FPGA"
|
||||
},
|
||||
dataType: "json",
|
||||
type : "get",
|
||||
success : function(r) {
|
||||
console.log(r);
|
||||
}
|
||||
});
|
||||
|
||||
API for retrieving Deployable attributes' information::
|
||||
|
||||
Url: {base_url}/accelerators/deployable/{uuid}/attribute/{key}
|
||||
Method: GET
|
||||
URL Params:
|
||||
GET: uuid --> uuid for the associated deployable
|
||||
key --> key for the associated deployable
|
||||
|
||||
Data Params:
|
||||
None
|
||||
|
||||
Success Response:
|
||||
GET:
|
||||
Code: 200
|
||||
Content: { attribute: {key : value}}
|
||||
|
||||
Error Response
|
||||
Code: 401 UNAUTHORIZED
|
||||
Content: { error : "Log in" }
|
||||
OR
|
||||
Code: 422 Unprocessable Entity
Content: { error : "attribute key invalid" }
|
||||
|
||||
Sample Call:
|
||||
To get the value of key=kpi for deployable with id=2864a139-c2cd-4f9f-abf3-44eb3f09b83c
|
||||
$.ajax({
|
||||
url: "/accelerators/deployable/2864a139-c2cd-4f9f-abf3-44eb3f09b83c/attribute/kpi",
|
||||
dataType: "json",
|
||||
type : "get",
|
||||
success : function(r) {
|
||||
console.log(r);
|
||||
}
|
||||
});
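The three GET endpoints above share one URL pattern, which can be sketched with the standard library. The helper name ``deployable_url`` and the base URL used in the usage note are hypothetical, and sending the filter k-v pairs as a query string is an assumption based on how the jQuery sample passes ``data`` on a GET request.

```python
from urllib.parse import quote, urlencode


def deployable_url(base_url, uuid=None, attribute_key=None, filters=None):
    """Build the GET URL for one of the three deployable endpoints."""
    url = base_url.rstrip("/") + "/accelerators/deployable"
    if uuid is not None:
        # Single deployable, optionally narrowed to one attribute key.
        url += "/" + quote(uuid, safe="")
        if attribute_key is not None:
            url += "/attribute/" + quote(attribute_key, safe="")
    elif attribute_key is not None:
        raise ValueError("attribute_key requires a deployable uuid")
    if filters:
        # Sorted so that the generated URL is deterministic.
        url += "?" + urlencode(sorted(filters.items()))
    return url
```

For example, ``deployable_url("http://cyborg.example/v1", filters={"type": "FPGA"})`` yields ``http://cyborg.example/v1/accelerators/deployable?type=FPGA``.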
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
There will be new functionalities available to the dev because of this work.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignee:
|
||||
Li Liu <liliu1@huawei.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
* Create migration scripts to add two more tables to the database
|
||||
* Create models in sqlalchemy as well as related conductor APIs
|
||||
* Create corresponding objects
* Create Conductor APIs to allow resource reporting
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Testing
|
||||
=======
|
||||
* Unit tests will be added to test Cyborg generic driver.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
Document FPGA Modelling in the Cyborg project
|
||||
|
||||
References
|
||||
==========
|
||||
.. [1] https://docs.openstack.org/nova/latest/user/placement.html
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Queens
|
||||
- Introduced
|
@ -1,265 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Cyborg Internal API spec
|
||||
==========================================
|
||||
|
||||
This document loosely specifies the API calls between
the components of Cyborg: Driver, Agent, Conductor, and API endpoint.

These APIs are internal and therefore may change from version to version
without warning or backwards compatibility. This document is kept as a
developer reference to be edited before any internally breaking changes
are made.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Developers writing one component of Cyborg need to know how to talk to another
|
||||
component of Cyborg, hopefully without having to go spelunking in the code
|
||||
of that component.
|
||||
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
Happier Cyborg developers
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Versioning internal APIs
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
A mess
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
A fixed internal API should help keep data models consistent.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
The API changes add resource endpoints to:
|
||||
|
||||
Driver:
|
||||
|
||||
* `POST` start accelerator discovery FROM: Agent
|
||||
* `GET` get a list of discovered accelerators and their properties FROM: Agent
|
||||
|
||||
Agent:
|
||||
|
||||
* `POST` register driver FROM: Driver
|
||||
* `POST` start accelerator discovery across all drivers FROM: Conductor
|
||||
* `GET` get a list of all accelerators across all drivers FROM: Conductor
|
||||
|
||||
Conductor:
|
||||
* `POST` register agent FROM: Agent
|
||||
|
||||
|
||||
The following new REST API calls will be created:
|
||||
|
||||
Driver 'POST /discovery'
|
||||
***************************
|
||||
|
||||
Trigger the discovery and setup process for a specific driver
|
||||
|
||||
.. code-block:: ini
|
||||
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"status":"IN-PROGRESS"
|
||||
}
|
||||
|
||||
Driver 'GET /hardware'
|
||||
**************************
|
||||
|
||||
Gets a list of hardware, not accelerators. Accelerators are
ready-to-use entries available through the public API. Hardware items are
physical devices on nodes that may or may not be ready to use or
even fully supported.
|
||||
|
||||
.. code-block:: ini
|
||||
|
||||
200 OK
|
||||
Content-Type: application/json
|
||||
|
||||
    {
        "hardware":[
        {
            "uuid":"8e45a2ea-5364-4b0d-a252-bf8becaa606e",
            "acc_specs":
            {
                "remote":0,
                "num":1,
                "device_type":"CRYPTO",
                "acc_capability":
                {
                    "num":2,
                    "ipsec":
                    {
                        "aes":
                        {
                            "3des":50,
                            "num":1
                        }
                    }
                }
            },
            "acc_status":
            {
                "setup_required":true,
                "reboot_required":false
            }
        }]
    }
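A caller such as the Agent could use this response to decide which devices still need setup. The sketch below parses a shortened copy of the example body with the standard ``json`` module; the function name ``needs_setup`` is hypothetical and only the ``hardware``, ``uuid``, ``acc_status`` and ``setup_required`` fields from the example are relied on.

```python
import json

# Shortened copy of the example response body (valid JSON).
response_body = """
{
  "hardware": [
    {
      "uuid": "8e45a2ea-5364-4b0d-a252-bf8becaa606e",
      "acc_specs": {"remote": 0, "num": 1, "device_type": "CRYPTO"},
      "acc_status": {"setup_required": true}
    }
  ]
}
"""


def needs_setup(body):
    """Return UUIDs of hardware entries whose acc_status asks for setup."""
    hardware = json.loads(body).get("hardware", [])
    return [
        item["uuid"]
        for item in hardware
        if item.get("acc_status", {}).get("setup_required", False)
    ]


pending = needs_setup(response_body)
```

Entries without an ``acc_status`` block are treated here as not needing setup, which is a choice of this sketch rather than something the spec states.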
|
||||
|
||||
|
||||
Driver 'POST /hello'
|
||||
***************************
|
||||
|
||||
Registers that a driver has been installed on the machine and is ready to use,
as well as its endpoint and hardware support.
|
||||
|
||||
.. code-block:: ini
|
||||
|
||||
Content-Type: application/json
|
||||
|
||||
{
|
||||
"status":"READY",
|
||||
"endpoint":"localhost:1337",
|
||||
"type":"CRYPTO"
|
||||
}
|
||||
|
||||
Agent 'POST /discovery'
|
||||
***************************
|
||||
|
||||
Trigger the discovery and setup process for all registered drivers
|
||||
|
||||
See driver example
|
||||
|
||||
|
||||
Agent 'GET /hardware'
|
||||
***************************
|
||||
|
||||
Get list of hardware across all drivers on the node
|
||||
|
||||
see driver example
|
||||
|
||||
|
||||
Conductor 'POST /hello'
|
||||
***************************
|
||||
|
||||
Registers that an Agent has been installed on the machine and is ready to use.
|
||||
|
||||
.. code-block:: ini
|
||||
|
||||
Content-Type: application/json
|
||||
|
||||
{
"status":"READY",
"endpoint":"compute-whatever:1337"
}
|
||||
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
Care must be taken to secure the internal endpoints from malicious calls.
|
||||
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
N/A
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
This change might have an impact on python-cyborgclient
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
In this model the Agent takes care of wrangling however many drivers are on
|
||||
a compute and the Conductor takes care of wrangling all the agents to present
|
||||
a coherent answer to the API quickly and easily. I don't include
|
||||
API <-> Conductor calls yet because I assume the API will be for the most part
|
||||
working from the database while the Conductor tries to keep that database up to
|
||||
date and takes the occasional setup call.
|
||||
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
In this model we won't really know when we're missing an agent. If one has
|
||||
reported in previously and then goes away we can have an alarm for that. But
|
||||
if an agent never reports in we just have to assume no instance exists by that
|
||||
name. This means making sure the Cyborg Drivers/Agents are installed and
|
||||
running is the responsibility of the deployment tool.
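The alarm for an agent that reported in once and then went away could be driven by a simple timeout check. This is a sketch under assumptions the spec does not make: that the Conductor records a timestamp each time an agent reports, that agents report periodically, and that 60 seconds is an acceptable silence window. The name ``missing_agents`` is hypothetical.

```python
def missing_agents(last_seen, now, timeout=60.0):
    """Return endpoints that reported before but not within `timeout` seconds.

    last_seen maps an agent endpoint to the time (in seconds) of its most
    recent report; agents that never reported are simply absent from it.
    """
    return sorted(
        endpoint
        for endpoint, seen_at in last_seen.items()
        if now - seen_at > timeout
    )
```

An agent that never registered does not appear in ``last_seen`` at all, which matches the limitation described above: only the deployment tool can know that it should exist.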
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
More internal communication in Cyborg
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
|
||||
Primary assignee:
|
||||
jkilpatr
|
||||
|
||||
Other contributors:
|
||||
zhuli
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
N/A
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
N/A
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
N/A
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
N/A
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
N/A
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Queens
|
||||
- Introduced
|
@ -1,187 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=======================
|
||||
Cyborg-Nova interaction
|
||||
=======================
|
||||
|
||||
https://blueprints.launchpad.net/cyborg/+spec/cyborg-nova-interaction
|
||||
|
||||
Cyborg, as a service for managing accelerators of any kind, needs to cooperate
with Nova on two planes. First, Cyborg should be able to inform Nova about the
resources through the placement API [1], so that the scheduler can translate user
requests for particular functionality into the assignment of a specific resource,
using a resource provider which possesses an accelerator. Second, Cyborg should
be able to provide information on how Nova compute can attach a particular
resource to a VM.
|
||||
|
||||
In a nutshell, this blueprint will define how information between Nova and
|
||||
Cyborg will be exchanged.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Currently in OpenStack the use of non-standard accelerator hardware is
|
||||
supported in that features exist across many of the core servers that allow
|
||||
these resources to be allocated, passed through, and eventually used.
|
||||
|
||||
What remains a challenge though is the lack of an integrated workflow; there
is no way to configure many of the accelerator features without significant
manual effort and service disruptions that go against the goals of having
an easy, stable, and flexible cloud.
|
||||
|
||||
Cyborg exists to bring these disjoint efforts together into a more standard
|
||||
workflow. While many components of this workflow already exist, some don't
|
||||
and will need to be written expressly for this goal.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
All possible use cases were briefly described in the backlog Nova spec [2]. Two
main groups of use cases can be distinguished for which accelerators might be
used:
|
||||
|
||||
* Accelerator might be attached to the VM, where workload demands acceleration.
|
||||
That can be achieved by passing whole PCI device, certain host device from
|
||||
``/dev/`` filesystem, passing Virtual Function, etc.
|
||||
* Accelerator might be utilized by infrastructure, like accelerating virtual
  switches (e.g. Open vSwitch), and then utilized via an appropriate service (like
  Neutron for example).
|
||||
|
||||
|
||||
Proposed Workflow
|
||||
=================
|
||||
|
||||
Using a method not relevant to this proposal Cyborg Agent inspects hardware
|
||||
and finds accelerators that it is interested in setting up for use.
|
||||
|
||||
These accelerators are registered into the Cyborg Database and the Cyborg
|
||||
Conductor is now responsible for using the Nova placement API to create
|
||||
corresponding traits and resources.
|
||||
|
||||
One of the primary responsibilities of the Cyborg conductor is to keep the
placement API in sync with reality. For example, if there is a device with
a virtual function or an FPGA with a given program, Cyborg may be tasked with
changing the virtual function on the NIC or the program on the FPGA, at which
point the previously specified traits and resources need to be updated.
Likewise, Cyborg will be monitoring Nova's instances to ensure that
doing this doesn't pull resources out from under an allocated instance.
|
||||
|
||||
At a high level, what we need to be able to do is the following:
|
||||
|
||||
1. Add a PCI device to Nova's whitelist live
|
||||
(config only / needs implementation)
|
||||
2. Add information about this device to the placement API
|
||||
(existing / being worked)
|
||||
3. Hotplug and unplug PCI devices from instances
|
||||
(existing / not sure how well maintained)
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Don't use Cyborg, struggle with bouncing services and grub config changes
|
||||
yourself.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
N/A
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
N/A
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
N/A
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
N/A
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
N/A
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
N/A
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
N/A
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
N/A
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implementation of Cyborg service
|
||||
* Implementation of Cyborg agent
|
||||
* Blueprint for changes in Nova
|
||||
* Implementation of the POC which exposes functionality and interoperability
|
||||
between Cyborg and Nova
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
This design depends on changes which may or may not be accepted in the Nova
project. Other than that, there is ongoing work on nested resource providers:
https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/nested-resource-providers.html
which would be an essential feature in the Placement API, and which will be leveraged
by Cyborg.
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
There would be a need to provide another gate, which would provide an
|
||||
accelerator for tests.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
* Document new nova api for whitelisting
|
||||
* Document developer and user interaction with the workflow
|
||||
* Document placement api standard identifiers
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* [1] https://docs.openstack.org/developer/nova/placement.html
|
||||
* [2] https://review.openstack.org/#/c/318047/
|
||||
* [3] https://github.com/openstack/nova/blob/390c7e420f3880a352c3934b9331774f7afdadcc/nova/compute/resource_tracker.py#L751
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Queens
|
||||
- Introduced
|
@ -1,221 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
===========================
|
||||
Cyborg SPDK Driver Proposal
|
||||
===========================
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-spdk-driver
|
||||
|
||||
This spec proposes to provide the initial design for Cyborg's SPDK driver.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
SPDK is a high performance kit and provides a user space, polled-mode,
|
||||
asynchronous, lockless NVMe driver for storage acceleration on the
|
||||
backend. Our goal is to add a SPDK driver for Cyborg to manage SPDK,
|
||||
and further improve storage performance.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
* When Cinder uses Ceph as its backend, the user should be able to
|
||||
use the Cyborg SPDK driver to discover the SPDK accelerator backend,
|
||||
enumerate the list of the Ceph nodes that have installed the SPDK.
|
||||
* When Cinder directly uses SPDK's BlobStore as its backend, the user
|
||||
should be able to accomplish the same life cycle management operations
|
||||
for SPDK as mentioned above. After enumerating the SPDK, the user can
|
||||
attach (install) SPDK on that node. When the task completes, the user
|
||||
can also detach the SPDK from the node. Last but not least, the user
should be able to update to the latest available SPDK.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
In general, the goal is to develop the Cyborg SPDK driver that supports
|
||||
discover/list/update/attach/detach operations for SPDK framework.
|
||||
|
||||
SPDK framework
|
||||
--------------
|
||||
|
||||
The SPDK framework comprises the following components::
|
||||
|
||||
          +-----------userspace--------+ +---------------+
          | +------+ +------+ +------+ | | +-----------+ |
    +---+ | |DPDK  | |NVMe  | |NVMe  | | | |   Ceph    | |
    | N +-+-+NIC   | |Target| |Driver+-+-+ |NVMe Device| |
    | I | | |Driver| |      | |      | | | +-----------+ |
    | C | | +------+ +------+ +------+ | | +-----------+ |
    +---+ | +------------------------+ | | | Blobstore | |
          | |     DPDK Libraries     | | | |NVMe Device| |
          | +------------------------+ | | +-----------+ |
          +----------------------------+ +---------------+
|
||||
|
||||
BlobStore NVMe Device Format
|
||||
----------------------------
|
||||
|
||||
BlobStore owns the entire NVMe device including metadata management
|
||||
and data management, which defines three basic units of disk space (like
|
||||
logical block, page, cluster). The NVMe device is divided into clusters
|
||||
starting from the first logical block::

    LBA 0                                   LBA N
    +-----------+-----------+-----+-----------+
    | Cluster 0 | Cluster 1 | ... | Cluster N |
    +-----------+-----------+-----+-----------+
|
||||
|
||||
Cluster 0 has a special format which consists of pages. Page 0 is the
first page of Cluster 0 and holds the Super Block, which contains the basic
information of the BlobStore::

    +--------+-------------------+
    | Page 0 | Page 1 ... Page N |
    +--------+-------------------+
    | Super  |  Metadata Region  |
    | Block  |                   |
    +--------+-------------------+
|
||||
|
||||
Each blob is allocated a non-contiguous set of pages. These pages form
|
||||
a linked list.
|
||||
In general, the BlobStore adopts direct operation of bare metal device and
|
||||
avoids the filesystem, which improves efficiency.
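Since clusters start at the first logical block, the cluster that holds a given LBA follows from simple integer arithmetic. The block size (512 bytes) and cluster size (1 MiB) below are assumptions chosen for illustration; this spec does not fix either value, and the function name ``cluster_of`` is hypothetical.

```python
def cluster_of(lba, block_size=512, cluster_size=1024 * 1024):
    """Return (cluster index, byte offset inside that cluster) for an LBA."""
    if cluster_size % block_size:
        raise ValueError("cluster_size must be a multiple of block_size")
    byte_offset = lba * block_size
    # divmod gives the cluster number and the remainder inside the cluster.
    return divmod(byte_offset, cluster_size)
```

With these assumed sizes, each cluster covers 2048 logical blocks, so LBA 2048 is the first block of Cluster 1.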
|
||||
|
||||
Life Cycle Management Phases
|
||||
----------------------------
|
||||
* We should be able to add a check, in the generic driver module, for whether
  the backend node has the SPDK kit. If it does, initialize the DPDK environment
  (such as hugepages).
|
||||
* Import the generic driver module, and then we should be able to
|
||||
discover (probe) the system for SPDK.
|
||||
* Determined by the backend storage scenario, enumerate (list) the optimal
|
||||
SPDK node, returning a boolean value to judge whether the SPDK should be
|
||||
attached.
|
||||
* After the node where SPDK will be running is attached, we can now send a
|
||||
request about the information of namespaces, and then create an I/O queue
|
||||
pair to submit read/write requests to a namespace.
|
||||
* When Ceph is used as the backend: recent Ceph releases (such as Luminous)
  use BlueStore as the storage engine, and BlueStore and BlobStore are very
  similar. We will therefore not be able to use BlobStore to accelerate Ceph,
  but we can use Ioat and the poller to speed up storage.
|
||||
* When SPDK is used as the backend, we should be able to use BlobStore to
|
||||
improve performance.
|
||||
* Whenever the user requests it, we should be able to detach the SPDK device.
|
||||
* Whenever the user requests it, we should be able to update SPDK to the latest
|
||||
stable release.
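The phases above can be pictured as a small state machine. The sketch below is an in-memory stand-in written only to show the ordering of discover, attach, list and detach; the class name, method names and return values are hypothetical and do not describe the actual Cyborg SPDK driver.

```python
class FakeSPDKDriver:
    """In-memory stand-in for the life cycle phases (hypothetical API)."""

    def __init__(self, nodes):
        # nodes maps a node name to True when the SPDK kit is installed there.
        self._nodes = dict(nodes)
        self._attached = set()

    def discover(self):
        """Probe the system: return the nodes that have the SPDK kit."""
        return sorted(name for name, has_spdk in self._nodes.items() if has_spdk)

    def list(self):
        """Return the nodes on which SPDK is currently attached."""
        return sorted(self._attached)

    def attach(self, node):
        """Attach SPDK on a discovered node; return False if it cannot be."""
        if node not in self.discover():
            return False
        self._attached.add(node)
        return True

    def detach(self, node):
        """Detach SPDK from a node; return True if it was attached."""
        if node in self._attached:
            self._attached.remove(node)
            return True
        return False


driver = FakeSPDKDriver({"node-1": True, "node-2": False})
discovered = driver.discover()
attached_ok = driver.attach("node-1")
attached_missing = driver.attach("node-2")
```

A real driver would replace the dictionary lookup with an actual probe of the node and would report each state change to the Cyborg Agent.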
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
* The Cyborg SPDK driver will notify Cyborg Agent to update the database
|
||||
when discover/list/update/attach/detach operations take place.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
This blueprint proposes to add the following APIs:
|
||||
|
||||
* cyborg discover-driver(driver_type)
|
||||
* cyborg driver-list(driver_type)
|
||||
* cyborg install-driver(driver_id, driver_type)
|
||||
* cyborg attach-instance <instance_id>
|
||||
* cyborg detach-instance <instance_id>
|
||||
* cyborg uninstall-driver(driver_id, driver_type)
|
||||
* cyborg update-driver(driver_id, driver_type)
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
The SPDK can provide a user space, polled-mode, asynchronous,
|
||||
lockless NVMe driver for storage acceleration on the backend.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
Deployers can call SPDK from the nodes which have installed SPDK
|
||||
after the driver has been implemented.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
luwei he <heluwei@huawei.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Implement the cyborg-spdk-driver in this spec.
|
||||
* Propose SPDK to py-spdk. The py-spdk is designed as a SPDK client
|
||||
which provides the Python binding.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Cyborg API Spec
|
||||
* Cyborg Agent Spec
|
||||
* Cyborg Driver Spec
|
||||
* Cyborg Conductor Spec
|
||||
|
||||
Testing
|
||||
========
|
||||
|
||||
* Unit tests will be added to test Cyborg SPDK driver.
|
||||
* Functional tests will be added to test Cyborg SPDK driver. For example:
|
||||
checking whether the discover --> list --> attach workflow completes successfully.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Document the SPDK driver in the Cyborg project.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
* Cyborg API Spec
|
||||
* Cyborg Agent Spec
|
||||
* Cyborg Driver Spec
|
||||
* Cyborg Conductor Spec
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release
|
||||
- Description
|
||||
* - Queens
|
||||
- Introduced
|
@ -1,413 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==============================================
|
||||
Cyborg-Nova-Glance Interaction in Compute Node
|
||||
==============================================
|
||||
|
||||
Cyborg is a service for managing accelerators, such as FPGAs, GPUs, etc. For
|
||||
scheduling an instance that needs accelerators, Cyborg needs to work with Nova
|
||||
at three levels:
|
||||
|
||||
* Representation and Discovery: Cyborg shall represent accelerators
|
||||
as resources in Placement. When a device is discovered, Cyborg
|
||||
updates resource inventories in Placement.
|
||||
|
||||
* Instance placement/scheduling: Cyborg may provide a weigher
|
||||
that prioritizes hosts based on available accelerator resources.
|
||||
|
||||
* Attaching accelerators to instances. In the compute node, Cyborg
|
||||
shall define a workflow based on interacting with Nova through a
|
||||
new os-acc library (like os-vif and os-brick).
|
||||
|
||||
The first two aspects are addressed in [#CyborgNovaSched]_. This spec
|
||||
addresses the attachment of accelerators to instances, via os-acc. For
|
||||
FPGAs, Cyborg also needs to interact with Glance for fetching bitstreams.
|
||||
Some aspects of that are covered in [#BitstreamSpec]_. This spec will
|
||||
address the interaction of Cyborg and Glance in the compute node.
|
||||
|
||||
This spec is common to all accelerators, including GPUs, High Precision
|
||||
Time Synchronization (HPTS) cards, etc. Since FPGAs have more aspects
|
||||
to be considered than other devices, some sections may focus on
|
||||
FPGA-specific factors. The spec calls out the FPGA-specific aspects.
|
||||
|
||||
Smart NICs based on FPGAs fall into two categories: those which
|
||||
expose the FPGA explicitly to the host, and those that do not. Cyborg's
|
||||
current scope includes the former. This spec includes such devices,
|
||||
though the Cyborg-Neutron interaction is out of scope.
|
||||
|
||||
The scope of this spec is the Rocky release.
|
||||
|
||||
Terminology
|
||||
===========
|
||||
* Accelerator: The unit that can be assigned to an instance for
|
||||
offloading specific functionality. For non-FPGA devices, it is either the
|
||||
device itself or a virtualized version of it (e.g. vGPUs). For FPGAs, an
|
||||
accelerator is either the entire device, a region within the device or a
|
||||
function.
|
||||
|
||||
* Bitstream: An FPGA image, usually a binary file, possibly with
|
||||
vendor-specific metadata. A bitstream may implement one or more functions.
|
||||
|
||||
* Function: A specific functionality, such as matrix multiplication or video
|
||||
transcoding, usually represented as a string or UUID. This term may be used
|
||||
with multi-function devices, including FPGAs and other fixed function
|
||||
hardware like Intel QuickAssist.
|
||||
|
||||
* Region: A part of the FPGA which can be programmed without disrupting
|
||||
other parts of that FPGA. If an FPGA does not support Partial
|
||||
Reconfiguration, the entire device constitutes one region. A region
|
||||
may implement one or more functions.
|
||||
|
||||
Here is an example diagram for an FPGA with multiple regions, and multiple
|
||||
functions in a region::
|
||||
|
||||
          PCI A    PCI B
            |        |
    +-------|--------|-------------------+
    |       |        |                   |
    |  +----|--------|---+   +--------+  |
    |  | +--|--+ +---|-+ |   |        |  |
    |  | | Fn A| | Fn B| |   |        |  |
    |  | +-----+ +-----+ |   |        |  |
    |  +-----------------+   +--------+  |
    |  Region 1              Region 2    |
    |                                    |
    +------------------------------------+
|
||||
|
||||
Problem description
|
||||
===================
|
||||
Once Nova has picked a compute node for placement of an instance that needs
|
||||
accelerators, the following steps need to happen:
|
||||
|
||||
* Nova compute on that node has to invoke Cyborg Agent for handling the needed
|
||||
accelerators. This needs to happen through a library, named os-acc, patterned
|
||||
after os-vif (Neutron) and os-brick (Cinder).
|
||||
|
||||
* Cyborg Agent may call Glance to fetch a bitstream, either by id or based on
|
||||
tags.
|
||||
|
||||
* Cyborg Agent may need to call into a Cyborg driver to program said bitstream.
|
||||
|
||||
* Cyborg Agent needs to call into a Cyborg driver to prepare a device and/or
|
||||
obtain an attach handle (e.g. PCI BDF) that can be attached to the instance.
|
||||
|
||||
* Cyborg Agent returns enough information to Nova compute via os-acc for the
|
||||
instance to be launched.
|
||||
|
||||
The behavior of each of these steps needs to be specified.
|
||||
|
||||
In addition, the OpenStack Compute API [#ServerConcepts]_ specifies the
|
||||
operations that can be done on an instance. The behavior with respect to
|
||||
accelerators must be defined for each of these operations. That in turn is
|
||||
related to when Nova compute calls os-acc.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
Please see [#CyborgNovaSched]_. We intend to support FPGAaaS with
|
||||
request time programming, and AFaaS (both pre-programmed and
|
||||
orchestrator-programmed scenarios).
|
||||
|
||||
Cyborg will discover accelerator resources whenever the Cyborg agent starts up.
|
||||
PCI hot plug can be supported after the Rocky release.
|
||||
|
||||
Cyborg must support all instance operations mentioned in OpenStack Compute API
|
||||
[#ServerConcepts]_ in Rocky, except booting off a snapshot and live migration.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
OpenStack Server API Behavior
|
||||
-----------------------------
|
||||
The OpenStack Compute API [#ServerConcepts]_ mentions the list of operations
|
||||
that can be performed on an instance. Of these, some will not be supported by
|
||||
Cyborg in Rocky. The list of supported operations (with
|
||||
the intended behaviors) is as follows:
|
||||
|
||||
* When an instance is started, the accelerators requested by that instance’s
|
||||
flavor must be attached to the instance. On termination, those resources are
|
||||
released.
|
||||
|
||||
* When an instance is paused, suspended or locked, the accelerator resources
|
||||
are left intact, and not detached from the instance. So, when the instance is
|
||||
unpaused, resumed or unlocked, there is nothing to do.
|
||||
|
||||
* When an instance is shelved, the accelerator resources are detached. On an
|
||||
unshelve, it is expected that the build operation will go through the
|
||||
scheduler again, so it is equivalent to an instance start.
|
||||
|
||||
* When an instance is deleted, the accelerator resources are detached. On a
|
||||
restore, it is expected that the build operation will go through the
|
||||
scheduler again, so it is equivalent to an instance start.
|
||||
|
||||
* Reboot: The accelerator resources are left intact. It is up to the instance
|
||||
software to rediscover attached resources.
|
||||
|
||||
* Rebuild: Prior to the instance image replacement, all device access must be
|
||||
quiesced, i.e., accesses to devices from that instance must be completed and
|
||||
further accesses must be prohibited. The mechanics of such quiescing are
|
||||
outside the scope of this document. With that precondition, accelerator
|
||||
resources are left attached to the instance during the rebuild.
|
||||
|
||||
* Resize (with change of flavor): It is equivalent to a termination followed by
|
||||
re-scheduling and restart. The accelerator resources are detached on
|
||||
termination, and re-attached when the instance is scheduled again.
|
||||
|
||||
* Cold migration: It is equivalent to a termination followed by re-scheduling
|
||||
and restart. The accelerator resources are detached on termination, and
|
||||
re-attached when the instance is scheduled again.
|
||||
|
||||
* Evacuate: This is a forcible rebuild by the administrator. As the semantics
|
||||
of evacuation are left open even without accelerators, Cyborg’s behavior is
|
||||
also left undefined.
|
||||
|
||||
* Set administrator password, trigger crash dump: These are supported and not
|
||||
no-ops for accelerators.
|
||||
|
||||
The following instance operations are not supported in this release:
|
||||
|
||||
* Booting off a snapshot: The snapshot may have been taken when the attached
|
||||
accelerators were in a particular state. When booting off a previous
|
||||
snapshot, the current configuration and state of accelerators may not match
|
||||
the snapshot. So, this is unsupported.
|
||||
|
||||
* Live migration: Until a mechanism is defined to migrate accelerator state
|
||||
along with the instance, this is unsupported.
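The per-operation behavior listed above can be summarized as a lookup table. This is a sketch for illustration: the operation names are labels invented here rather than Nova API identifiers, and ``accel_actions`` is a hypothetical helper. Evacuate is left out because its behavior is undefined, as are the password and crash dump operations.

```python
from enum import Enum


class AccelAction(Enum):
    ATTACH = "attach"   # Nova compute would call os-acc plug()
    DETACH = "detach"   # Nova compute would call os-acc unplug()


# An empty tuple means the accelerators stay attached and nothing is done.
ACCEL_BEHAVIOR = {
    "start": (AccelAction.ATTACH,),
    "terminate": (AccelAction.DETACH,),
    "pause": (),
    "unpause": (),
    "suspend": (),
    "resume": (),
    "lock": (),
    "unlock": (),
    "shelve": (AccelAction.DETACH,),
    "unshelve": (AccelAction.ATTACH,),
    "delete": (AccelAction.DETACH,),
    "restore": (AccelAction.ATTACH,),
    "reboot": (),
    "rebuild": (),
    "resize": (AccelAction.DETACH, AccelAction.ATTACH),
    "cold_migrate": (AccelAction.DETACH, AccelAction.ATTACH),
}

# Operations this spec declares unsupported in Rocky.
UNSUPPORTED = {"boot_from_snapshot", "live_migrate"}


def accel_actions(operation):
    """Return the ordered accelerator actions for an instance operation."""
    if operation in UNSUPPORTED:
        raise NotImplementedError("%s is not supported with accelerators" % operation)
    return ACCEL_BEHAVIOR[operation]
```

For example, ``accel_actions("resize")`` yields a detach followed by an attach, matching the "termination followed by re-scheduling and restart" description.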
|
||||
|
||||
os-acc Structure
|
||||
----------------
|
||||
Cyborg will develop a new library named os-acc. That library will offer the
|
||||
APIs listed later in this section. Nova Compute calls these APIs if it sees
|
||||
that the requested flavor refers to the CUSTOM_ACCELERATOR resource class, except
|
||||
for the initialize() call, which is called unconditionally. Nova Compute calls
|
||||
these APIs asynchronously, as suggested below::
|
||||
|
||||
    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=1) as executor:
        # <api> stands for one of the os-acc APIs listed below
        future = executor.submit(os_acc.<api>, *args)
        # do other stuff
        try:
            data = future.result()
        except Exception:
            # handle exceptions
            raise
|
||||
|
||||
The APIs of os-acc are as below:
|
||||
|
||||
* initialize()
|
||||
|
||||
* Called once at start of day. Waits for Cyborg Agent to be ready to accept
|
||||
requests, i.e., all devices enumerated and traits published.
|
||||
|
||||
* Returns None on success.
|
||||
|
||||
* Throws ``CyborgAgentUnavailable`` exception if Cyborg Agent cannot be
|
||||
contacted.
|
||||
|
||||
* plug(instance_info, selected_rp, flavor_extra_specs)
|
||||
|
||||
* Parameters are all read-only. Here are their descriptions:
|
||||
|
||||
* instance_info: dictionary containing instance UUID, instance name,
|
||||
project/tenant ID and VM image UUID. The instance name is needed for
|
||||
better logging, the project/tenant ID may be passed to some accelerator
|
||||
policy engine in the future and the VM image UUID may be used to query
|
||||
Glance for metadata about accelerator requirements that may be stored
|
||||
with the VM image.
|
||||
|
||||
* selected_rp: Information about the selected resource provider is
|
||||
passed as a dictionary.
|
||||
|
||||
* flavor_extra_specs: the extra_specs field in the flavor, including
|
||||
resource classes, traits and other fields interpreted by Cyborg.
|
||||
|
||||
* Called by Nova compute when an instance is started, unshelved, or
|
||||
restored and after a resize or cold migration.
|
||||
|
||||
* Called before an instance is built, i.e., before the specification of
|
||||
the instance is created. For libvirt-based hypervisors, this means
|
||||
the call happens before the instance’s domain XML is created.
|
||||
|
||||
* As part of this call, Cyborg Agent may fetch bitstreams from Glance and
|
||||
initiate programming. It may fetch the bitstream specified in the
|
||||
request’s flavor extra specs, if any. If the request refers to a
|
||||
function ID/name, Cyborg Agent would query Glance to find bitstreams
|
||||
that provide the function and match the chosen device, and would then
|
||||
fetch the needed bitstream.
|
||||
|
||||
* As part of this call, Cyborg Agent will locate the Deployable corresponding
|
||||
to the chosen RP, locate the attach handles (e.g. PCI BDF) needed, update
|
||||
its internal data structures in a persistent way, and return the needed
|
||||
information back to Nova.
|
||||
|
||||
* Returns an array, with one entry per requested accelerator, each entry
|
||||
being a dictionary. The dictionary is structured as below for Rocky:
|
||||
|
||||
| { "pci_id": <pci bdf> }
|
||||
|
||||
* unplug(instance_info)
|
||||
|
||||
* Parameters are all read-only. Here are their descriptions:
|
||||
|
||||
* instance_info: dictionary containing instance UUID and instance
|
||||
name. The instance name is needed for better logging.
|
||||
|
||||
* Called when an instance is stopped, shelved, or deleted and before
|
||||
a resize or cold migration.
|
||||
|
||||
* As part of this call, Cyborg Agent will clean up internal resources, call
|
||||
the appropriate Cyborg driver to clean up the device resources and update
|
||||
its data structures persistently.
|
||||
|
||||
* Returns the number of accelerators that were released. Errors may cause
|
||||
exceptions to be thrown.
|
||||
|
||||
Workflows
|
||||
---------
|
||||
The pseudocode for each os-acc API can be expressed as below::
|
||||
|
||||
    def initialize():
        # checks that all devices are discovered and their traits published
        # waits if any discovery operation is ongoing
        return None

    def plug(instance_info, rp, extra_specs):
        validate_params(....)
        glance = glanceclient.Client(...)
        driver = # select Cyborg driver for chosen rp
        rp_deployable = # get deployable for RP
        if extra_specs refers to ``CUSTOM_FPGA_<vendor>_REGION_<uuid>`` and
                extra_specs refers to ``bitstream:<uuid>``:
            bitstream = glance.images.data(image_uuid)
            driver.program(bitstream, rp_deployable, ...)
        if extra_specs refers to ``CUSTOM_FPGA_<vendor>_FUNCTION_<uuid>`` and
                extra_specs refers to function UUID/name:
            region_type_uuid = # fetch from selected RP
            bitstreams = glance.images.list(...)
            # queries Glance by function UUID/name property and region type
            # UUID to get matching bitstreams
            if len(bitstreams) > 1:
                error(...)  # bitstream choice policy is outside Cyborg
            bitstream = # the single matching bitstream
            driver.program(bitstream, rp_deployable, ...)
        pci_bdf = driver.allocate_handle(...)
        # update Cyborg DB with instance_info and BDF usage
        return { "pci_id": pci_bdf }

    def unplug(instance_info):
        bdf_list = # fetch BDF usage from Cyborg DB for instance
        # update Cyborg DB to mark those BDFs as free
        return len(bdf_list)
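The bookkeeping side of plug() and unplug() can be exercised with a runnable toy model. This is only a sketch under stated assumptions: ``FakeAgent``, its ``count`` parameter and the ``"uuid"`` key of ``instance_info`` are hypothetical, a plain dictionary stands in for the Cyborg DB, and bitstream fetching and programming are left out entirely.

```python
class FakeAgent:
    """Toy model of the BDF accounting behind plug() and unplug()."""

    def __init__(self, free_bdfs):
        self._free = list(free_bdfs)
        self._in_use = {}  # instance UUID -> list of PCI BDFs

    def plug(self, instance_info, count=1):
        """Allocate `count` attach handles and record them for the instance."""
        if count > len(self._free):
            raise RuntimeError("not enough free accelerators")
        bdfs = [self._free.pop(0) for _ in range(count)]
        self._in_use.setdefault(instance_info["uuid"], []).extend(bdfs)
        # One dictionary per requested accelerator, as described for Rocky.
        return [{"pci_id": bdf} for bdf in bdfs]

    def unplug(self, instance_info):
        """Release every handle held by the instance; return how many."""
        bdfs = self._in_use.pop(instance_info["uuid"], [])
        self._free.extend(bdfs)
        return len(bdfs)


agent = FakeAgent(["0000:5e:00.0", "0000:5e:00.1"])
instance = {"uuid": "instance-1", "name": "vm-1"}
handles = agent.plug(instance, count=2)
released = agent.unplug(instance)
```

Calling ``unplug`` a second time for the same instance returns 0 in this model, which keeps the operation idempotent.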
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
N/A
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Decide how to associate multiple functions/bitstreams in extra specs
|
||||
with multiple devices in the flavor.
|
||||
|
||||
* Decide specific changes needed in Cyborg conductor, db, agent and drivers.
|
||||
|
||||
* Others: TBD
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Nested Resource Provider support in Nova
|
||||
|
||||
* `Nova Granular Requests
|
||||
<https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/granular-resource-requests.html>`_
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
For each vendor driver supported in this release, we need to integrate the
|
||||
corresponding FPGA type(s) in the CI infrastructure.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The behavior with respect to accelerators during various instance operations
|
||||
(reboot, pause, etc.) must be documented. The procedure to upload a bitstream,
|
||||
including applying Glance properties, must also be documented.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [#CyborgNovaSched] `Cyborg Nova Scheduling Specification
|
||||
<https://review.openstack.org/#/c/554717/>`_
|
||||
|
||||
.. [#BitstreamSpec] `Cyborg bitstream metadata standardization spec
|
||||
<https://review.openstack.org/#/c/558265/>`_
|
||||
|
||||
.. [#ServerConcepts] `OpenStack Server API Concepts
|
||||
<https://docs.openstack.org/api-guide/compute/server_concepts.html>`_
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
This optional section is intended to be updated each time the spec changes, to
describe any new design, API or database schema update. It helps the reader
understand how the spec has evolved over time.
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Rocky
|
||||
- Introduced
|
||||
|
@ -1,222 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Cyborg Agent-Driver API
|
||||
==========================================
|
||||
|
||||
Cyborg agent interacts with each Cyborg driver in the compute node to
|
||||
discover available devices. This spec defines how the agent-driver API
|
||||
is structured.
|
||||
|
||||
No change is proposed to the way the agent discovers the drivers on
|
||||
start or restart.
|
||||
|
||||
This spec is common to all accelerators, including GPUs, High Precision
|
||||
Time Synchronization (HPTS) cards, etc. Since FPGAs have more aspects to
|
||||
be considered than other devices, some sections may focus on FPGA-specific
|
||||
factors. The spec calls out the FPGA-specific aspects.
|
||||
|
||||
The scope of this spec is the Rocky release, but the API has been designed
|
||||
to be extensible for future releases. Accordingly, the spec calls out
|
||||
the Rocky-specific aspects.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
The Cyborg/Nova scheduling spec [#Cyborg_Nova_scheduling_spec]_ specifies that devices are
|
||||
represented using Resource Providers (RPs), Resource Classes (RCs)
|
||||
and traits. The information needed to create them has to come from
|
||||
the Cyborg driver to the Cyborg agent, which in turn needs to
|
||||
push it to the Cyborg Conductor.
|
||||
|
||||
The main challenge is discovering the device topology for FPGAs.
|
||||
An FPGA may have one or more Partial Reconfiguration regions,
|
||||
and those regions may have one or more accelerators nested inside them.
|
||||
Further, it may have local memory that is either partitioned or
|
||||
shared among the regions.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
* Devices of different types (GPUs, FPGAs, HPTS cards, Quick Assist) are
|
||||
present in the same host.
|
||||
|
||||
* FPGAs of different types, possibly from different vendors, are present
|
||||
in the same host.
|
||||
|
||||
* An FPGA may have one or more regions. Each region may have one
|
||||
or more accelerators.
|
||||
|
||||
* In Rocky, we may support only one region per FPGA, and only one
|
||||
accelerator per region.
|
||||
|
||||
* For Rocky, it is proposed that local memory need not be exposed as
|
||||
a resource to orchestration. That is because, since there is only
|
||||
one region per FPGA, an instance attached to that region will be
|
||||
able to access all the memory, no matter how much there is. For
|
||||
non-FPGA devices like GPUs, there does not seem to be a requirement
|
||||
to expose video RAM.
|
||||
|
||||
Cyborg will assume and handle the following component relationships:
|
||||
|
||||
* One product (e.g. Intel PAC Arria 10) may correspond to multiple
|
||||
PCI vendor/device IDs.
|
||||
|
||||
* One PCI vendor/device ID may correspond to different region type IDs.
|
||||
This could be either because there are multiple regions in the same device
|
||||
or because there are different versions/revisions of the same device.
|
||||
|
||||
* But the same region type ID will never show up in products with
|
||||
different PCI IDs.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Today, the Cyborg agent invokes the discover() API for each driver
|
||||
that it finds. The discover() API returns a dictionary indexed by
|
||||
the PCI BDF of a device. The value element in the key-value pair of
|
||||
the dictionary contains the components and characteristics
|
||||
of the device with that BDF.
|
||||
|
||||
We propose to retain the same model, but enhance the dictionary to
|
||||
include enough information to create the resource providers and traits
|
||||
needed to populate Placement. Here are the additional proposed keys
|
||||
in the device dictionary for each PF:
|
||||
|
||||
| ``"type": <enum-string>`` # One of GPU, FPGA, etc.
|
||||
| ``"vendor": <string>``
|
||||
| ``"product": <string>``
|
||||
|
||||
Also, in the ``regions`` entry for each PF, it is proposed to add
|
||||
the following keys:
|
||||
|
||||
| ``"region-type-uuid": <uuid>`` # Optional, default: NULL
|
||||
| ``"bitstream-id": <uuid>`` # Glance/other UUID, optional, default: NULL
|
||||
| ``"function-uuid": <uuid>`` # Optional, default: NULL
|
||||
|
||||
When the agent receives this dictionary for a device, it will do
|
||||
the following:
|
||||
|
||||
* If there is nested RP support, create an RP for the device and each
|
||||
region within.
|
||||
|
||||
* Create a device type trait: ``CUSTOM_<type>_<vendor>_<product>``.
|
||||
Apply it to the device RP (if nRP support exists) or the compute node RP.
|
||||
|
||||
* E.g. CUSTOM_FPGA_INTEL_PAC_ARRIA10.
|
||||
|
||||
* NOTE: The agent will convert all characters to upper case, replace
|
||||
spaces with underscores, and check for conformance to custom trait
|
||||
syntax (see [#Custom_traits]_).
|
||||
|
||||
* Create region type traits for each region, of the form:
|
||||
``CUSTOM_<type>_<vendor>_REGION_<type-uuid>``. Apply them to the
|
||||
corresponding region RP (if nRP support exists) or the compute node RP.
|
||||
|
||||
* E.g. CUSTOM_FPGA_INTEL_REGION_<type-uuid>
|
||||
|
||||
* NOTE: For UUIDs, the agent will convert all hexadecimal digits to upper
|
||||
case, replace hyphens with underscores and validate all characters.
|
||||
|
||||
* Create function type traits for each function in each region, of the form:
|
||||
``CUSTOM_<type>_<vendor>_FUNCTION_<function-uuid>``. Apply them to the
|
||||
corresponding region RP (if nRP support exists) or the compute node RP.
|
||||
|
||||
* E.g. CUSTOM_FPGA_INTEL_FUNCTION_<function-uuid>
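The trait construction rules above can be sketched in a few lines. The helper names are hypothetical, the UUID in the usage lines is made up, and the validation pattern (``CUSTOM_`` followed by upper-case letters, digits and underscores, at most 255 characters) is an assumption about the custom trait syntax rather than something stated in this spec.

```python
import re
import uuid

# Assumed custom trait syntax; check the referenced traits spec before relying on it.
_CUSTOM_TRAIT = re.compile(r"CUSTOM_[A-Z0-9_]+")
_MAX_LEN = 255


def _name(part):
    """Upper-case a type, vendor or product name; spaces become underscores."""
    return part.upper().replace(" ", "_")


def _uuid(value):
    """Validate a UUID, upper-case its hex digits, hyphens become underscores."""
    return str(uuid.UUID(value)).upper().replace("-", "_")


def _checked(trait):
    if len(trait) > _MAX_LEN or not _CUSTOM_TRAIT.fullmatch(trait):
        raise ValueError("not a valid custom trait: %r" % trait)
    return trait


def device_trait(dev_type, vendor, product):
    return _checked("CUSTOM_%s_%s_%s" % (_name(dev_type), _name(vendor), _name(product)))


def region_trait(dev_type, vendor, region_type_uuid):
    return _checked("CUSTOM_%s_%s_REGION_%s"
                    % (_name(dev_type), _name(vendor), _uuid(region_type_uuid)))


def function_trait(dev_type, vendor, function_uuid):
    return _checked("CUSTOM_%s_%s_FUNCTION_%s"
                    % (_name(dev_type), _name(vendor), _uuid(function_uuid)))


dev = device_trait("FPGA", "Intel", "PAC Arria10")
region = region_trait("FPGA", "Intel", "9a17439a-85d0-4c53-a3d3-9f68a0a3d3a1")
```

Names that still contain characters outside the assumed syntax after normalization (a hyphen in a product name, for instance) are rejected rather than silently rewritten.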
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
N/A
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
Add the new fields to the database under Deployables and Attributes.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Need to update unit tests to check for the newly added fields.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [#Cyborg_Nova_scheduling_spec] `Cyborg/Nova Scheduling spec <https://review.openstack.org/#/c/554717>`_
|
||||
|
||||
.. [#Custom_traits] `Custom Traits <https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/resource-provider-traits.html#rest-api-impact>`_
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
This optional section is intended to be updated each time the spec changes, to
describe any new design, API or database schema update. It helps the reader
understand how the spec has evolved over time.
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Rocky
|
||||
- Introduced
|
@ -1,253 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/4.0/legalcode
|
||||
|
||||
====================================================
|
||||
Cyborg FPGA Bitstream metadata spec
|
||||
====================================================
|
||||
|
||||
Blueprint url:
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-fpga-bitstream-metadata-spec
|
||||
|
||||
This spec proposes the FPGA Bitstream metadata specifications for bitstream
|
||||
management.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
A field-programmable gate array (FPGA) is an integrated circuit designed to be
|
||||
configured by a customer or a designer after manufacturing. Their advantage
|
||||
lies in that they are sometimes significantly faster for some applications
|
||||
because of their parallel nature and optimality in terms of the number of
|
||||
gates used for a certain process. Hence, using FPGA for application
|
||||
acceleration in cloud has become desirable. One of the encountered problems is
|
||||
when it comes to bitstream management, it is difficult to map bitstreams to
|
||||
their appropriate FPGA boards or reconfigurable regions. The aim of this
|
||||
proposal is to provide a standardized set of metadata which should be
|
||||
encapsulated together with bitstream storage.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
When a user requests to reprogram an FPGA board with certain functionality in the
|
||||
cloud environment, he or she will need to retrieve a suitable bitstream from
|
||||
the storage. In order to find the suitable one, bitstreams need to be
|
||||
categorized based on some properties defined in metadata.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Each metadata item will be stored as a row in Glance's image_properties table
|
||||
in key-value pair format: column [name] holds the key whereas column [value]
|
||||
holds the value. Note: no database schema change is required. This is a
|
||||
standardization document describing how to use the existing Glance table for FPGA
|
||||
bitstreams.
|
||||
|
||||
Given this, Cyborg will standardize the key convention as follows:
|
||||
|
||||
+--------------+---------+-----------+--------------------------------------+
| name         | value   | nullable  | description                          |
+--------------+---------+-----------+--------------------------------------+
| bs-name      | aes-128 | False     | name of the bitstream (not unique)   |
+--------------+---------+-----------+--------------------------------------+
| bs-uuid      | {uuid}  | False     | The uuid generated during synthesis  |
+--------------+---------+-----------+--------------------------------------+
| vendor       | Xilinx  | False     | Vendor of the card                   |
+--------------+---------+-----------+--------------------------------------+
| board        | KU115   | False     | Board type for this bitstream to load|
+--------------+---------+-----------+--------------------------------------+
| shell_id     | {uuid}  | True      | Required shell bs-uuid for the bs    |
+--------------+---------+-----------+--------------------------------------+
| version      | 1.0     | False     | Device version number                |
+--------------+---------+-----------+--------------------------------------+
| driver       | SDX     | True      | Type of driver for this bitstream    |
+--------------+---------+-----------+--------------------------------------+
| driver_ver   | 1.0     | False     | Driver version                       |
+--------------+---------+-----------+--------------------------------------+
| driver_path  | /path/  | False     | Where to retrieve the driver binary  |
+--------------+---------+-----------+--------------------------------------+
| topology     | {CLOB}  | False     | Function Topology                    |
+--------------+---------+-----------+--------------------------------------+
| description  | desc    | True      | Description                          |
+--------------+---------+-----------+--------------------------------------+
| region_uuid  | {uuid}  | True      | The uuid for target region type      |
+--------------+---------+-----------+--------------------------------------+
| function_uuid| {uuid}  | False     | The uuid for bs function type        |
+--------------+---------+-----------+--------------------------------------+
| function_name| nic-40  | True      | The function name for this bitstream |
+--------------+---------+-----------+--------------------------------------+
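
The key convention in the table can be checked mechanically before a
bitstream is uploaded. The following is a minimal sketch, not part of Cyborg:
the function name ``validate_bitstream_properties`` is hypothetical, and it
assumes the properties are available as a plain dict. Only the key names and
the nullable column are taken from the table.

```python
# Keys marked nullable=False in the table must be present and non-empty.
REQUIRED_KEYS = {
    "bs-name", "bs-uuid", "vendor", "board", "version",
    "driver_ver", "driver_path", "topology", "function_uuid",
}
# Keys marked nullable=True may be omitted.
OPTIONAL_KEYS = {
    "shell_id", "driver", "description", "region_uuid", "function_name",
}


def validate_bitstream_properties(props):
    """Return a list of problems found in a bitstream property dict.

    Hypothetical helper: an empty list means the dict follows the convention.
    """
    problems = []
    for key in sorted(REQUIRED_KEYS):
        if not props.get(key):
            problems.append("missing required key: " + key)
    for key in sorted(set(props) - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append("unknown key: " + key)
    return problems
```

Rejecting unknown keys is a design choice of this sketch; a deployment that
stores additional vendor properties would relax that check.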
|
||||
|
||||
Here are the details regarding some defined keys.
|
||||
|
||||
[shell_id]
|
||||
This field is optional. If loading this PR bitstream requires a shell image,
this field specifies the shell bitstream's uuid. If this field is null, it
means this bitstream is a shell bitstream.
|
||||
|
||||
[driver]
|
||||
This specifies the path to a package of scripts/binaries to be installed in
order to use the loaded bitstream (e.g. insmod a kernel driver, git clone
some remote source code, etc.).
|
||||
|
||||
[region_uuid]
|
||||
This value specifies the type of region that is required to load this
|
||||
bitstream. This type is a uuid generated during the shell bitstream synthesis.
|
||||
|
||||
[function_uuid]
|
||||
This value specifies the type of function for this bitstream. It helps the
upstream scheduler to match traits with the appropriate bitstream.
|
||||
|
||||
[topology]
|
||||
This field describes the topology of function structures after the bitstream
is loaded on the FPGA. In particular, it uses JSON format to show how
physical functions and virtual functions are related to each other. It is
the vendor driver's responsibility to interpret this and prepare the proper
report for Cyborg Agent. For instance::
|
||||
|
||||
    {
        "pf_num": 2,
        "vf_num": 2,
        "pf": [
            {
                "name": "pf_1",
                "capability": "",
                "kpi": "",
                "pci_offset": "0",
                "vf": [
                    {
                        "name": "vf_1",
                        "pci_offset": "1"
                    }
                ]
            },
            {
                "name": "pf_2",
                "capability": "",
                "kpi": "",
                "pci_offset": "2",
                "vf": [
                    {
                        "name": "vf_2",
                        "pci_offset": "3"
                    }
                ]
            }
        ]
    }
|
||||
|
||||
This JSON template guides Cyborg Agent to populate vf/pf/deployable list in
|
||||
Cyborg.
|
||||
|
||||
Given the above JSON topology, Cyborg Driver should be able to interpret the
|
||||
accelerator structure as follows::
|
||||
|
||||
                     =============
                     =Accelerator=
                     =============
                           |
                     ============
                     =Deployable=
                     ============
                          /\
                         /  \
        ===================  ===================
        = Deployable pf_1 =  = Deployable pf_2 =
        ===================  ===================
                 |                    |
                 |                    |
        ===================  ===================
        = Deployable vf_1 =  = Deployable vf_2 =
        ===================  ===================
|
||||
|
||||
Notes:

1. Topology is not mandatory to fill in, as long as the vendor driver can
   figure out what resources to report after the bitstream is loaded.
2. The JSON provided here is only a reference template. It does not have to
   be PCI-centric; it is up to vendors to define it for their products.
3. A root deployable should be created in the graph. In addition, the pfs
   and vfs here are all instances of deployable. Please refer to the DB
   objects specs regarding physical_function and virtual_function.
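
To illustrate how a vendor driver might interpret the reference topology, the
sketch below walks the JSON and lists each deployable with its parent. It is
only an illustration under stated assumptions: the function name
``deployables_from_topology`` and the placeholder name ``"root"`` for the
root deployable are hypothetical, and it assumes the PCI-style ``pf``/``vf``
layout of the reference template.

```python
import json


def deployables_from_topology(topology_json):
    """Return (name, parent_name) pairs, starting with the root deployable."""
    topology = json.loads(topology_json)
    # "root" is a placeholder name for the root deployable in the graph.
    pairs = [("root", None)]
    for pf in topology.get("pf", []):
        # Each physical function is a child of the root deployable.
        pairs.append((pf["name"], "root"))
        for vf in pf.get("vf", []):
            # Each virtual function is a child of its physical function.
            pairs.append((vf["name"], pf["name"]))
    return pairs
```

A vendor whose topology is not PCI-centric would replace the ``pf``/``vf``
traversal with whatever structure its own template defines.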
|
||||
|
||||
|
||||
Finally, all FPGA bitstreams should be TAGGED as "FPGA" in Glance.
This helps distinguish between normal VM images and bitstream images
during filtering.
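
The filtering step can be sketched as follows. This is a hedged example that
works on plain dicts rather than on Glance API objects: the image shape
(``id``, ``tags``, ``properties``) and the function name ``find_bitstreams``
are assumptions made for illustration only.

```python
def find_bitstreams(images, **wanted):
    """Return images tagged "FPGA" whose properties match all wanted pairs."""
    matches = []
    for image in images:
        if "FPGA" not in image.get("tags", []):
            continue  # a normal VM image, not a bitstream
        props = image.get("properties", {})
        if all(props.get(key) == value for key, value in wanted.items()):
            matches.append(image)
    return matches
```

For example, ``find_bitstreams(images, vendor="Xilinx", board="KU115")``
would keep only the tagged images whose ``vendor`` and ``board`` properties
match.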
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
RPC API impact
|
||||
---------------
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
Accelerator vendors should implement the logic in the program() API to
populate the loaded topology.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignee:
|
||||
Li Liu <liliu1@huawei.com>
|
||||
Shaohe Feng <shaohe.feng@intel.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
* Provide example JSON format for bitstream
|
||||
* Provide example implementation of vendor driver
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
None
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Rocky
|
||||
- Introduced
|
@ -1,200 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
====================================================
|
||||
Cyborg FPGA Programming Service Proposal
|
||||
====================================================
|
||||
|
||||
Blueprint url is not available yet
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-fpga-programming-ability
|
||||
|
||||
This spec proposes a Programming Service to be added to Cyborg to allow users
to dynamically change the functions loaded on an FPGA in a cloud environment.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
A field-programmable gate array (FPGA) is an integrated circuit designed to be
|
||||
configured by a customer or a designer after manufacturing. Their advantage
|
||||
lies in that they are sometimes significantly faster for some applications
|
||||
because of their parallel nature and optimality in terms of the number of
|
||||
gates used for a certain process. In addition, FPGAs can be reprogrammed for
different applications. Hence, using FPGAs for application acceleration in
the cloud has become desirable. Since Cyborg is a management framework for
heterogeneous accelerators, tracking, deploying and reprogramming FPGAs are
much needed features. Since the FPGA modelling has already been proposed in
another document, this spec focuses on proposing a Reprogramming Service for
FPGAs in Cyborg.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
In the OpenCL scenario, users load accelerators onto the FPGA for their
applications. When different applications execute in an OpenCL environment,
the accelerators change from time to time. It is not feasible for a lab
admin to log in to each host and change the FPGA configuration manually.
Instead, through the reprogramming service, users can manage the functions
of an FPGA using a set of REST APIs.
|
||||
|
||||
Similarly, during FPGA maintenance, admins need to update/migrate shells and
bitstreams on FPGAs within the data center. The Cyborg Reprogramming Service
will allow them to use the APIs from a centralized console.
|
||||
|
||||
Since this is purely a proposal for programming APIs, it does not focus on
|
||||
what the upstream use case/runtime is. Those details will be in separate
|
||||
specs when needed.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
First of all, Cyborg needs to add extra REST APIs to allow others to invoke
|
||||
the programming service. The REST API should have the following format::
|
||||
|
||||
    Url: {base_url}/fpga/{deployable_uuid}
    Method: POST
    URL Params:
      None

    Data Params:
      glance_bitstream_uuid

    Success Response:
      POST:
        Code: 200
        Body: { "msg" : "bitstream has been loaded successfully"}

    Error Response
      Code: 401 UNAUTHORIZED
      Body: { error : "Log in" }
      OR
      Code: 422 Unprocessable Entity
      Body: { error : "User is not authorized to use the resource" }

    Sample Call:
      To program fpga resource with deployable_uuid=2864a139-c2cd-4f9f-abf3-44eb3f09b83c
      with bitstream with uuid=0b955a5b-f5dd-49d0-8c4f-28729427d303
        $.ajax({
          url: "/fpga/2864a139-c2cd-4f9f-abf3-44eb3f09b83c",
          data: {
            "glance_bitstream_uuid": "0b955a5b-f5dd-49d0-8c4f-28729427d303"
          },
          dataType: "json",
          type : "post",
          success : function(r) {
            console.log(r);
          }
        });
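
The same call can be prepared from Python. The sketch below only builds the
request object and does not send it. The helper name
``build_program_request`` is hypothetical, and encoding the data parameter
as a JSON body is an assumption: the spec names the parameter but does not
fix the encoding. Authentication headers are omitted.

```python
import json
import urllib.request


def build_program_request(base_url, deployable_uuid, glance_bitstream_uuid):
    """Build (but do not send) the POST request for the programming API."""
    # Assumption: the data parameter is sent as a JSON document.
    body = json.dumps({"glance_bitstream_uuid": glance_bitstream_uuid})
    return urllib.request.Request(
        url="{}/fpga/{}".format(base_url.rstrip("/"), deployable_uuid),
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to ``urllib.request.urlopen`` would perform the
call against a running service.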
|
||||
|
||||
Second, implement the service in Cyborg, which performs three tasks:
1. Identify the host location of the requested FPGA/Partial Reconfiguration
(PR) region (e.g. on which host the board is located). 2. Check whether the
user (API caller, OpenStack login user, etc.) has the privilege to use the
given bitstream, FPGA, or host. 3. If the previous checks pass, send the
program notification to the target host with the requested FPGA.
|
||||
|
||||
Third, implement the notification callee in Cyborg Agent. This should be an
RPC call with the following signature::
|
||||
|
||||
    int program_fpga_with_bitstream(deployable_uuid, bitstream_uuid)
|
||||
|
||||
The function takes both deployable_uuid and bitstream_uuid as input. It uses
|
||||
deployable_uuid to identify which specific FPGA/PR region is going to be
|
||||
programmed and uses bitstream_uuid to retrieve bitstream from the bitstream
|
||||
storage service (Glance in the context of OpenStack). In addition, this is a
synchronous call, meaning it waits for the programming task to complete and
then returns a status code as an integer. The return code should be
interpreted as follows:
|
||||
|
||||
+------+--------------------------------------------------------+
| code | meaning                                                |
+------+--------------------------------------------------------+
| 0    | programmed successfully                                |
+------+--------------------------------------------------------+
| 1    | failed with unknown errors                             |
+------+--------------------------------------------------------+
| 2    | invalid deployable_uuid (target fpga not found)        |
+------+--------------------------------------------------------+
| 3    | invalid bitstream_uuid (bitstream cannot be downloaded)|
+------+--------------------------------------------------------+
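
The return codes and the synchronous behaviour can be sketched as below.
This is an illustration, not the agent implementation: the enum member
names and the three injected callables (``find_deployable``,
``download_bitstream``, ``program``) are hypothetical stand-ins for the
agent's lookup, the Glance download and the vendor driver hook. Only the
first two parameters and the integer codes come from the spec.

```python
import enum


class ProgramStatus(enum.IntEnum):
    """Integer return codes from the table above (member names are ours)."""
    SUCCESS = 0
    UNKNOWN_ERROR = 1
    INVALID_DEPLOYABLE = 2
    INVALID_BITSTREAM = 3


def program_fpga_with_bitstream(deployable_uuid, bitstream_uuid,
                                find_deployable, download_bitstream, program):
    """Block until programming finishes, then return a ProgramStatus."""
    target = find_deployable(deployable_uuid)
    if target is None:
        return ProgramStatus.INVALID_DEPLOYABLE
    try:
        bitstream = download_bitstream(bitstream_uuid)
    except Exception:
        return ProgramStatus.INVALID_BITSTREAM
    try:
        program(target, bitstream)  # vendor driver hook
    except Exception:
        return ProgramStatus.UNKNOWN_ERROR
    return ProgramStatus.SUCCESS
```

Because ``IntEnum`` members compare equal to plain integers, a caller that
expects the bare ``int`` from the signature can still use the result
directly.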
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
A REST API will be added to the Cyborg service as discussed previously.
It should not impact any of the existing REST APIs.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
The access to FPGA/PR region and bitstreams should be carefully checked.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
On the Cyborg Agent side, it relies on the program() API implemented by the
vendor.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
Primary assignee:
|
||||
Li Liu <liliu1@huawei.com>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
* Implement the cyborg program service rest api
|
||||
* Implement the cyborg program service
|
||||
* Implement the notification call in Cyborg Agent, which invokes vendor driver
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
The Cyborg-Nova interaction related specs need to be aware of changes to the
accelerators when FPGAs are being reprogrammed.
|
||||
|
||||
References
|
||||
==========
|
||||
None
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Rocky
|
||||
- Introduced
|
@ -1,486 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Cyborg-Nova Interaction for Scheduling
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/cyborg/+spec/cyborg-nova-interaction
|
||||
|
||||
Cyborg provides a general management framework for accelerators, such
|
||||
as FPGAs, GPUs, etc. For scheduling an instance that needs accelerators,
|
||||
Cyborg needs to work with Nova on three levels:
|
||||
|
||||
* Representation and Discovery: Cyborg shall represent accelerators as
|
||||
resources in Placement. When a device is discovered, Cyborg updates
|
||||
resource providers, inventories, traits, etc. in Placement.
|
||||
|
||||
* Instance placement/scheduling: Cyborg may provide a filter and/or weigher
|
||||
that limit or prioritize hosts based on available accelerator resources,
|
||||
but it is expected that Placement itself can handle most requirements.
|
||||
|
||||
* Attaching accelerators to instances. In the compute node, Cyborg shall
|
||||
define a workflow based on interacting with Nova through a new os-acc
|
||||
library (similar to os-vif and os-brick).
|
||||
|
||||
This spec addresses the first two aspects. There is another spec to
|
||||
address the attachment of accelerators to instances [#os-acc]_.
|
||||
Cyborg also needs to handle some aspects for FPGAs without involving
|
||||
Nova, specifically FPGA programming and bitstream management. They
|
||||
will be covered in other specs. This spec is independent of those specs.
|
||||
|
||||
This spec is common to all accelerators, including GPUs, High Precision
|
||||
Time Synchronization (HPTS) cards, etc. Since FPGAs have more aspects to
|
||||
be considered than other devices, some sections may focus on FPGA-specific
|
||||
factors. The spec calls out the FPGA-specific aspects.
|
||||
|
||||
Smart NICs based on FPGAs fall into two categories: those which expose
|
||||
the FPGA explicitly to the host, and those that do not. Cyborg's scope
|
||||
includes the former. This spec includes such devices, though the
|
||||
Cyborg-Neutron interaction is out of scope.
|
||||
|
||||
The scope of this spec is the Rocky release.
|
||||
|
||||
Terminology
|
||||
===========
|
||||
* Accelerator: The unit that can be assigned to an instance for
|
||||
offloading specific functionality. For non-FPGA devices, it is either the
|
||||
device itself or a virtualized version of it (e.g. vGPUs). For FPGAs, an
|
||||
accelerator is either the entire device, a region within the device or a
|
||||
function.
|
||||
|
||||
* Bitstream: An FPGA image, usually a binary file, possibly with
|
||||
vendor-specific metadata. A bitstream may implement one or more functions.
|
||||
|
||||
* Function: A specific functionality, such as matrix multiplication or video
|
||||
transcoding, usually represented as a string or UUID. This term may be used
|
||||
with multi-function devices, including FPGAs and other fixed function
|
||||
hardware like Intel QuickAssist.
|
||||
|
||||
* Region: A part of the FPGA which can be programmed without disrupting
|
||||
other parts of that FPGA. If an FPGA does not support Partial
|
||||
Reconfiguration, the entire device constitutes one region. A region
|
||||
may implement one or more functions.
|
||||
|
||||
Here is an example diagram for an FPGA with multiple regions, and multiple
|
||||
functions in a region::
|
||||
|
||||
          PCI A    PCI B
            |        |
    +-------|--------|-------------------+
    |       |        |                   |
    |  +----|--------|---+   +--------+  |
    |  | +--|--+ +---|-+ |   |        |  |
    |  | | Fn A| | Fn B| |   |        |  |
    |  | +-----+ +-----+ |   |        |  |
    |  +-----------------+   +--------+  |
    |      Region 1           Region 2   |
    |                                    |
    +------------------------------------+
|
||||
|
||||
Problem description
|
||||
===================
|
||||
Cyborg's representation and handling of accelerators needs to be consistent
|
||||
with Nova's Placement API. Specifically, they must be modeled in terms of
|
||||
Resource Providers (RPs), Resource Classes (RCs) and Traits.
|
||||
|
||||
Though PCI Express is entrenched in the data center, some accelerators
|
||||
may be exposed to the host via some other protocol. Even with PCI, the
|
||||
connections between accelerator components and PCI functions
|
||||
may vary across devices. Accordingly, Cyborg should not represent
|
||||
accelerators as PCI functions.
|
||||
|
||||
For instances that need accelerators, we need to define a way for Cyborg
|
||||
to be included seamlessly in the Nova scheduling workflow.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
We need to satisfy the following use cases for the tenant role:
|
||||
|
||||
* Device as a Service (DaaS): The flavor asks for a device.
|
||||
|
||||
* FPGA variation: The flavor asks for a device to which specific
|
||||
bitstream(s) can be applied. There are three variations, the first
|
||||
two of which delegate bitstream programming to Cyborg for secure
|
||||
programming:
|
||||
|
||||
* Request-time Programming: The flavor specifies a bitstream. (Cyborg
|
||||
applies the bitstream before instance bringup. This is similar to
|
||||
AWS flow.)
|
||||
|
||||
* Run-time Programming: The instance may request one or more
|
||||
bitstreams dynamically. (Cyborg receives the request and does
|
||||
the programming.)
|
||||
|
||||
* Direct Programming: The instance directly programs the FPGA
|
||||
region assigned to it, without delegating it to Cyborg. The
|
||||
security questions that this raises need to be addressed in
|
||||
the future. (This is listed only for completeness; this is not
|
||||
going to be addressed in Rocky, or even future releases till
|
||||
the security concerns are fully addressed.)
|
||||
|
||||
* Accelerated Function as a Service (AFaaS): The flavor asks for a
|
||||
function (e.g. ipsec) attached to the instance. The operator may
|
||||
satisfy this use case in two ways:
|
||||
|
||||
* Pre-programmed: Do not allow orchestration to modify any function,
|
||||
for any of these reasons:
|
||||
|
||||
* Only fixed function hardware is available. (E.g. ASICs.)
|
||||
|
||||
* Operational simplicity.
|
||||
|
||||
* Assure tenants of programming security, by doing all programming offline
|
||||
through some audited process.
|
||||
|
||||
* For FPGAs, allow orchestration to program as needed, to maximize
|
||||
flexibility and availability of resources.
|
||||
|
||||
An operator must be able to provide both Device as a Service and Accelerated
|
||||
Function as a Service in the same cluster, to serve all
|
||||
kinds of users: those who are device-agnostic, those using 3rd party
|
||||
bitstreams, and those using their own bitstreams (incl. developers).
|
||||
|
||||
The goal for Cyborg is to provide the mechanisms to enable all these use
|
||||
cases.
|
||||
|
||||
In this spec, we do not consider bitstream developer or device developer
|
||||
roles. Also, we assume that each accelerator device is dedicated to a
|
||||
compute node, rather than shared among several nodes.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Representation
|
||||
--------------
|
||||
|
||||
* Cyborg will represent a generic accelerator for a device type as a
|
||||
custom Resource Class (RC) for that type, of the form
|
||||
CUSTOM_ACCELERATOR_<device-type>. E.g. CUSTOM_ACCELERATOR_GPU,
|
||||
CUSTOM_ACCELERATOR_FPGA, etc. This helps in defining separate quotas
|
||||
for different device types.
|
||||
|
||||
* Device-local memory is the memory available to the device alone,
|
||||
usually in the form of DDR, QDR or High Bandwidth Memory in the
|
||||
PCIe board along with the device. It can also be represented as an
|
||||
RC of the form CUSTOM_ACCELERATOR_MEMORY_<memory-type>. E.g.
|
||||
CUSTOM_ACCELERATOR_MEMORY_DDR. A single PCIe board may have more
|
||||
than one type of memory.
|
||||
|
||||
* In addition, each device/region is represented as a Resource Provider
|
||||
(RP). This enables traits to be applied to it and other RPs/RCs to
|
||||
be contained within it. So, a device RP provides one or more instances
|
||||
of that device type's RC. This depends on nested RP support in
|
||||
Nova [#nRP]_.
|
||||
|
||||
* For FPGAs, both the device and the regions within it will be
|
||||
represented as RPs. This allows the hierarchy within an FPGA
|
||||
to be naturally modelled as an RP hierarchy.
|
||||
|
||||
* Using Nested RPs is the preferred way. But, until Nova
|
||||
supports nested RPs, Cyborg shall associate the
|
||||
RCs and traits (described below) with the compute node RPs. This
|
||||
requires that all devices on a single host must share the same
|
||||
traits. If nested RP support becomes usable after Rocky release,
|
||||
the operator needs to handle the upgrade as below:
|
||||
|
||||
* Terminate all instances using accelerators.
|
||||
|
||||
* Remove all Cyborg traits and inventory on all compute node RPs,
|
||||
perhaps by running a script.
|
||||
|
||||
* Perform the Cyborg upgrade. Post-upgrade, the new agent/driver(s)
|
||||
will create RPs for the devices and publish the traits
|
||||
and inventory.
|
||||
|
||||
* Cyborg will associate a Device Type trait with each device, of the
|
||||
form CUSTOM_<device-type>_<vendor>. E.g. CUSTOM_GPU_AMD or
|
||||
CUSTOM_FPGA_XILINX. This trait is intended to help match the
|
||||
software drivers/libraries in the instance image. This is meant to
|
||||
be used in a flavor when a single driver/library in the instance
|
||||
image can handle most or all of the device types from a vendor.
|
||||
|
||||
* For FPGAs, this trait and others will be applied to the region
|
||||
RPs which are children of the device RPs as well.
|
||||
|
||||
* Cyborg will associate a Device Family trait with each device as
|
||||
needed, of the form CUSTOM_<device-type>_<vendor>_<family>.
|
||||
E.g. CUSTOM_FPGA_INTEL_ARRIA10.
|
||||
This is not a product name, but the name of a device family, used to
|
||||
match software in the instance image with the device family. This is
|
||||
a refinement of the Device Type Trait. It is meant to be used in
|
||||
a flavor when there are different drivers/libraries for different
|
||||
device families. Since it may be tough to forecast whether a new
|
||||
device family will need a new driver/library, it may make sense to
|
||||
associate both these traits with the same device RP.
|
||||
|
||||
* For FPGAs, Cyborg will associate a region type trait with each region
|
||||
(or with the FPGA itself if there is no Partial Reconfiguration
|
||||
support), of the form CUSTOM_FPGA_REGION_<vendor>_<uuid>.
|
||||
E.g. CUSTOM_FPGA_REGION_INTEL_<uuid>. This is needed for Device as a
|
||||
Service with FPGAs.
|
||||
|
||||
* For FPGAs, Cyborg may associate a function type trait with a region
|
||||
when the region gets programmed, of the form
|
||||
CUSTOM_FPGA_FUNCTION_<vendor>_<uuid>. E.g.
|
||||
CUSTOM_FPGA_FUNCTION_INTEL_<gzip-uuid>.
|
||||
This is needed for the AFaaS use case. It is updated when Cyborg
reprograms a region as part of an AFaaS request.
|
||||
|
||||
* For FPGAs, Cyborg should associate a CUSTOM_PROGRAMMABLE trait with
|
||||
every region. This is needed to lay the groundwork for
|
||||
multi-function accelerators in the future. Flavors should ask for
|
||||
this trait, except in the pre-programmed case.
|
||||
|
||||
* For FPGAs, since they may implement a wide variety of functionality,
|
||||
we may also attach a Functionality Trait.
|
||||
E.g. CUSTOM_FPGA_COMPUTE, CUSTOM_FPGA_NETWORK, CUSTOM_FPGA_STORAGE.
|
||||
|
||||
* The Cyborg agent needs to get enough information from the Cyborg driver
|
||||
to create the RPs, RCs and traits. In particular, it needs to get the
|
||||
device type string, region IDs and function IDs from the driver. This
|
||||
requires the driver/agent interface to be enhanced [#drv-api]_.
|
||||
|
||||
* The modeling in Placement represents generic virtual accelerators as
|
||||
resource classes, and devices/regions as RPs. This is PCI-agnostic.
|
||||
However, many FPGA implementations use PCI Express in general, and
|
||||
SR-IOV in particular. In those cases, it is expected that Cyborg will
|
||||
pass PCI VFs to instances via PCI Passthrough, and retain the PCI PF
|
||||
in the host for management.
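
The naming conventions in this section can be captured in a few helper
functions. This is a sketch under assumptions: the helper names are
hypothetical, and the normalization step assumes that Placement restricts
custom names to upper-case letters, digits and underscores, which is why
hyphens (for example in a UUID) are replaced.

```python
import re


def _norm(part):
    """Upper-case a name part and replace characters outside A-Z, 0-9, _."""
    return re.sub(r"[^A-Z0-9_]", "_", part.upper())


def accelerator_rc(device_type):
    """Resource class for a generic accelerator of the given device type."""
    return "CUSTOM_ACCELERATOR_" + _norm(device_type)


def device_type_trait(device_type, vendor):
    """Device Type trait, e.g. CUSTOM_GPU_AMD."""
    return "CUSTOM_{}_{}".format(_norm(device_type), _norm(vendor))


def device_family_trait(device_type, vendor, family):
    """Device Family trait, a refinement of the Device Type trait."""
    return device_type_trait(device_type, vendor) + "_" + _norm(family)


def region_type_trait(vendor, region_uuid):
    """Region type trait for an FPGA region."""
    return "CUSTOM_FPGA_REGION_{}_{}".format(_norm(vendor), _norm(region_uuid))
```

Building the family trait on top of the type trait keeps the two consistent,
which matters when both are associated with the same device RP.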
|
||||
|
||||
Flavors
|
||||
-------
|
||||
For the sake of illustrating how the device representation in Nova
|
||||
can be used, and for completeness, we now show how to define flavors
|
||||
for various use cases. Please see [#flavor]_ for more details.
|
||||
|
||||
* A flavor that needs device access always asks for one or more instances
|
||||
of 'resources:CUSTOM_ACCELERATOR_<device-type>'. In addition, it
|
||||
needs to specify the right traits.
|
||||
|
||||
* Example flavor for DaaS:
|
||||
|
||||
| ``resources:CUSTOM_ACCELERATOR_HPTS=1``
|
||||
| ``trait:CUSTOM_HPTS_ZTE=required``
|
||||
|
||||
NOTE: For FPGAs, the flavor should also include the CUSTOM_PROGRAMMABLE trait.
|
||||
|
||||
* Example flavor for AFaaS Pre-programmed:
|
||||
|
||||
| ``resources:CUSTOM_ACCELERATOR_FPGA=1``
|
||||
| ``trait:CUSTOM_FPGA_INTEL_ARRIA10=required``
|
||||
| ``trait:CUSTOM_FPGA_FUNCTION_INTEL_<gzip-uuid>=required``
|
||||
|
||||
* Example flavor for AFaaS Orchestration-Programmed:
|
||||
|
||||
| ``resources:CUSTOM_ACCELERATOR_FPGA=1``
|
||||
| ``trait:CUSTOM_FPGA_INTEL_ARRIA10=required``
|
||||
| ``trait:CUSTOM_PROGRAMMABLE=required``
|
||||
| ``function:CUSTOM_FPGA_FUNCTION_INTEL_<gzip-uuid>=required``
|
||||
(Not interpreted by Nova.)
|
||||
|
||||
* NOTE: When Nova supports preferred traits, we can use that instead
|
||||
of 'function' keyword in extra specs.
|
||||
|
||||
* NOTE: For Cyborg to fetch the bitstream for this function, it
|
||||
is assumed that the operator has configured the function UUID
|
||||
as a property of the bitstream image in Glance.
|
||||
|
||||
* Another example flavor for AFaaS Orchestration-Programmed which
|
||||
refers to a function by name instead of UUID for ease of use:
|
||||
|
||||
| ``resources:CUSTOM_ACCELERATOR_FPGA=1``
|
||||
| ``trait:CUSTOM_FPGA_INTEL_ARRIA10=required``
|
||||
| ``trait:CUSTOM_PROGRAMMABLE=required``
|
||||
| ``function_name:<string>=required``
|
||||
(Not interpreted by Nova.)
|
||||
|
||||
* NOTE: This assumes the operator has configured the function name
|
||||
as a property of the bitstream image in Glance. The FPGA
|
||||
hardware is not expected to expose function names, and so
|
||||
Cyborg will not represent function names as traits.
|
||||
|
||||
* A flavor may ask for other RCs, such as local memory.
|
||||
|
||||
* A flavor may ask for multiple accelerators, using the granular resource
|
||||
request syntax. Cyborg can tie function and bitstream fields in
|
||||
the extra_specs to resources/traits using an extension of the granular
|
||||
resource request syntax (see References) which is not interpreted by Nova.
|
||||
|
||||
| ``resourcesN: CUSTOM_ACCELERATOR_FPGA=1``
|
||||
| ``traitsN: CUSTOM_FPGA_INTEL_ARRIA10=required``
|
||||
| ``othersN: function:CUSTOM_FPGA_FUNCTION_INTEL_<gzip-uuid>=required``
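
The single-accelerator flavor examples in this section can be generated with
a small helper. This is a hedged sketch: ``afaas_extra_specs`` is a
hypothetical name, the result is a plain dict of extra specs, and the
``function:`` key follows the examples here, which state that it is not
interpreted by Nova.

```python
def afaas_extra_specs(device_type, traits, function_trait=None):
    """Build flavor extra specs that request one accelerator."""
    specs = {"resources:CUSTOM_ACCELERATOR_" + device_type: "1"}
    for trait in traits:
        specs["trait:" + trait] = "required"
    if function_trait is not None:
        # Orchestration-programmed case: ask for a programmable region and
        # pass the function in a key that Nova does not interpret.
        specs["trait:CUSTOM_PROGRAMMABLE"] = "required"
        specs["function:" + function_trait] = "required"
    return specs
```

For the pre-programmed case, the function trait would instead be passed in
``traits`` so that Placement matches it directly.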
|
||||
|
||||
Scheduling workflow
|
||||
--------------------
|
||||
We now look at the scheduling flow when each device implements only
|
||||
one function. Devices with multiple functions are outside the scope for now.
|
||||
|
||||
* A request spec with a flavor comes to Nova conductor/scheduler.
|
||||
|
||||
* Placement API returns the list of RPs which contain the requested
|
||||
resources with matching traits. (With nested RP support, the returned
|
||||
RPs are device/region RPs. Without it, they are compute node RPs.)
|
||||
|
||||
* FPGA-specific: For AFaaS orchestration-programmed use case, Placement
|
||||
will return matching devices but they may not have the requested
|
||||
function. So, Cyborg may provide a weigher which checks the
|
||||
allocation candidates to see which ones have the required function trait,
|
||||
and ranks them higher. This requires no change to Cyborg DB.
|
||||
|
||||
* The request_spec goes to compute node (ignoring Cells for now).
|
||||
|
||||
NOTE: When one device/region implements multiple functions and
|
||||
orchestration-driven programming is desired, the inventory of that
|
||||
device needs to be adjusted.
|
||||
This can be addressed later and is not a priority for Rocky release.
|
||||
See References.
|
||||
|
||||
* Nova compute calls os-acc/Cyborg [#os-acc]_.
|
||||
|
||||
* FPGA-specific: If the request spec asks for a function X in extra specs,
|
||||
but X is not present in the selected region RP, Cyborg should program
|
||||
that region.
|
||||
|
||||
* Cyborg should associate RPs/RCs and PFs/VFs with Deployables in its
|
||||
internal DB. It can use such mappings associating the requested resource
|
||||
(device/function) with some attach handle that can be used to
|
||||
attach the resource to an instance (such as a PCI function).
|
||||
|
||||
NOTE: This flow is PCI-agnostic: no PCI whitelists are involved.
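
The weigher idea in the FPGA-specific step can be sketched as a simple
ranking. This is not a Nova weigher class: ``rank_candidates`` is a
hypothetical function, and each candidate is assumed to be a dict with a
``traits`` collection.

```python
def rank_candidates(candidates, function_trait):
    """Order candidates so those already holding the function trait lead."""
    # sorted() is stable, so candidates keep their relative order within
    # each group; False (trait present) sorts before True (trait absent).
    return sorted(candidates, key=lambda c: function_trait not in c["traits"])
```

Candidates without the trait stay in the list, because Cyborg can still
program the region after scheduling.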
|
||||
|
||||
Handling Multiple Functions Per Device
|
||||
--------------------------------------
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
N/A
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
The following changes are needed in Cyborg.
|
||||
|
||||
* Do not publish PCI functions as resources in Nova. Instead, publish
|
||||
RC/RP info to Nova, and keep RP-PCI mapping internally.
|
||||
|
||||
* Cyborg should associate RPs/RCs and PFs/VFs with Deployables in its
|
||||
internal DB.
|
||||
|
||||
* Driver/agent interface needs to report device/region types so that
|
||||
RCs can be created.
|
||||
|
||||
* Deployables table should track which RP corresponds to each Deployable.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
This change allows tenants to initiate FPGA bitstream programming. To mitigate
|
||||
the security impact, it is proposed that only 2 methods are offered for
|
||||
programming (flavor asks for a bitstream, or the running instance asks for
|
||||
specific bitstreams) and both are handled through Cyborg. There is no direct
|
||||
access from an instance to an FPGA.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Decide specific changes needed in Cyborg conductor, db, agent and drivers.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* `Nested Resource Providers
|
||||
<https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/nested-resource-providers-allocation-candidates.html>`_
|
||||
|
||||
* `Nova Granular Requests
|
||||
<https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/granular-resource-requests.html>`_
|
||||
|
||||
NOTE: the granular requests feature is needed to define a flavor that requests
|
||||
non-identical accelerators, but is not needed for Cyborg development in Rocky.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
For each vendor driver supported in this release, we need to integrate the
|
||||
corresponding FPGA type(s) in the CI infrastructure.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [#os-acc] `Specification for Compute Node <https://review.openstack.org/#/c/566798/>`_
|
||||
|
||||
.. [#nRP] `Nested RPs in Rocky <https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/nested-resource-providers-allocation-candidates.html>`_
|
||||
|
||||
.. [#drv-api] `Specification for Cyborg Agent-Driver API <https://review.openstack.org/#/c/561849/>`_
|
||||
|
||||
.. [#flavor] `Custom Resource Classes in Flavors <https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/custom-resource-classes-in-flavors.html>`_
|
||||
|
||||
.. [#qspec] `Cyborg Nova Queens Spec <https://github.com/openstack/cyborg/blob/master/doc/specs/queens/approved/cyborg-nova-interaction.rst>`_
|
||||
|
||||
.. [#ptg] `Rocky PTG Etherpad for Cyborg Nova Interaction <https://etherpad.openstack.org/p/cyborg-ptg-rocky-nova-cyborg-interaction>`_
|
||||
|
||||
.. [#multifn] `Detailed Cyborg/Nova scheduling <https://etherpad.openstack.org/p/Cyborg-Nova-Multifunction>`_
|
||||
|
||||
.. [#mails] `Openstack-dev email discussion <https://lists.openstack.org/pipermail/openstack-dev/2018-April/128951.html>`_
|
||||
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
Optional section intended to be used each time the spec is updated to describe
|
||||
new design, API or any database schema updated. Useful to let reader know
|
||||
what happened over time.
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Rocky
|
||||
- Introduced
|
@ -1,204 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
================================
|
||||
Quota Usage for Cyborg Resources
|
||||
================================
|
||||
|
||||
Launchpad blueprint:
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/cyborg-resource-quota
|
||||
|
||||
There are multiple ways to slice an OpenStack cloud. Imposing quota on these
|
||||
various slices puts a limitation on the amount of resources that can be
|
||||
consumed which helps to guarantee "fairness" or fair distribution of resources
|
||||
at the creation time. If a particular project needs more resources, the
|
||||
concept of quota gives the ability to increase the resource count on-demand,
|
||||
given that the system constraints are not exceeded.
|
||||
|
||||
|
||||
Problem description
|
||||
===================
|
||||
At present in Cyborg we don't have the concept of Quota on acceleration
|
||||
resources, so users can consume as many resources as they want.
|
||||
Quotas are tied closely to physical resources and billable entities, hence from
|
||||
Cyborg's perspective, it helps to limit the allocation and consumption
|
||||
of a particular kind of resources at a certain value.
|
||||
|
||||
In place of implementing quota like other services, we want to enable
|
||||
the unified limits provided by Keystone to manage our quota limits [1].
|
||||
With unified limits, all limits will be set in Keystone and enforced by
|
||||
oslo.limit. So we decided to implement the quota usage part first.
|
||||
Once the oslo.limit is ready for other services, Cyborg will invoke oslo.limit
|
||||
to get the limit information and perform limit checks.
|
||||
|
||||
This spec aims at the implementation of quota usage in Cyborg. As the
|
||||
oslo.limit is not finished yet, we can directly set the value of limit
|
||||
manually and reserve the function calling oslo.limit with a "pass" inside.
|
||||
|
||||
|
||||
Use cases
|
||||
---------
|
||||
Alice is an admin. She would like to have a feature which will give her
|
||||
details of Cyborg acceleration resource consumption so that she can manage her
|
||||
resources appropriately.
|
||||
|
||||
She might run into the following scenarios:
|
||||
|
||||
* Ability to know current resource consumption.
|
||||
|
||||
* Ability to prohibit overuse by a project.
|
||||
|
||||
* Prevent situations where users in a project get starved because users in
|
||||
other projects consume all the resources. "Quota Management" would help to
|
||||
guarantee "fairness".
|
||||
|
||||
* Prevent DoS-style attacks, abuse or errors by users, which lead to an
|
||||
excessive amount of resource allocation.
|
||||
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
The proposed change introduces a Quota_Usage table which primarily stores
|
||||
the quota usage assigned for each resource in a project, and a Reservation
|
||||
Table to store every modification of resource usage.
|
||||
|
||||
When a new resource allocation request comes, the 'reserved' field in the Quota
|
||||
usages table will be updated. This acceleration resource is being used to set
|
||||
up a VM. For example, the fpga quota hardlimit is 5 and 3 fpgas have
|
||||
already been used, then two new fpga requests come in. Since we have 3 fpgas
|
||||
already used, the 'used' field will be set to 3. Now the 'reserved'
|
||||
field will be set to 2 until the fpga attachment is successful. Once
|
||||
the attachment is done this field will be reset to 0, and the 'used'
|
||||
count will be updated from 3 to 5. So at this moment, hardlimit is 5, used
|
||||
is 5 and reserved is 0. If one more request comes in, this request
|
||||
will be rejected since there is not enough quota available.
|
||||
|
||||
In general,
|
||||
|
||||
Resource quota available = Resource hard_limit - [
|
||||
(Resource reserved + Resources already allocated for project)]
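
As a minimal sketch of the formula above, the worked fpga example can be
checked step by step. The function name ``quota_available`` is hypothetical
and not part of Cyborg:

```python
def quota_available(hard_limit: int, reserved: int, used: int) -> int:
    """Return the remaining quota: hard_limit - (reserved + used)."""
    return hard_limit - (reserved + used)


# Hard limit is 5 and 3 fpgas are already in use.
before_requests = quota_available(hard_limit=5, reserved=0, used=3)
# Two new requests are in flight, so 'reserved' is 2.
while_attaching = quota_available(hard_limit=5, reserved=2, used=3)
# Both attachments succeeded: 'reserved' resets to 0 and 'used' becomes 5.
after_attach = quota_available(hard_limit=5, reserved=0, used=5)

print(before_requests, while_attaching, after_attach)  # prints: 2 0 0
```

With zero quota left after the attachments, any further request has nothing
to reserve against.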
|
||||
|
||||
In this spec, we just focus on the update of quota usage and we will not check
|
||||
if a user has already exceeded their quota limit. The limit management will be
|
||||
set in Keystone in the future and we just need to invoke the oslo.limit.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
At present there is no quota infrastructure in Cyborg.
|
||||
|
||||
Adding a Quota Management layer at the Orchestration layer could be an
|
||||
alternative. However, our approach will give a finer view of resource
|
||||
consumption at the IaaS layer which can be used while provisioning Cyborg
|
||||
resources.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
New quota usages and reservation tables will be added to the Cyborg database to
|
||||
store quota consumption for each resource in a project.
|
||||
|
||||
Quota usages table:
|
||||
|
||||
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| created_at    | datetime     | YES  |     | NULL    |                |
| updated_at    | datetime     | YES  |     | NULL    |                |
| id            | int(11)      | NO   | PRI | NULL    | auto_increment |
| project_id    | varchar(255) | YES  | MUL | NULL    |                |
| resource      | varchar(255) | NO   |     | NULL    |                |
| reserved      | int(11)      | NO   |     | NULL    |                |
| used          | int(11)      | NO   |     | NULL    |                |
+---------------+--------------+------+-----+---------+----------------+
|
||||
|
||||
Quota reservation table:
|
||||
|
||||
+------------+--------------+------+-----+---------+----------------+
| Field      | Type         | Null | Key | Default | Extra          |
+------------+--------------+------+-----+---------+----------------+
| created_at | datetime     | YES  |     | NULL    |                |
| updated_at | datetime     | YES  |     | NULL    |                |
| deleted_at | datetime     | YES  |     | NULL    |                |
| deleted    | tinyint(1)   | YES  |     | NULL    |                |
| id         | int(11)      | NO   | PRI | NULL    | auto_increment |
| uuid       | varchar(36)  | NO   |     | NULL    |                |
| usage_id   | int(11)      | NO   | MUL | NULL    |                |
| project_id | varchar(255) | YES  | MUL | NULL    |                |
| resource   | varchar(255) | YES  |     | NULL    |                |
| delta      | int(11)      | NO   |     | NULL    |                |
| expire     | datetime     | YES  |     | NULL    |                |
+------------+--------------+------+-----+---------+----------------+
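
The two tables above could be prototyped as follows. This is only a sketch
using SQLite from the Python standard library: the table names
(``quota_usages``, ``reservations``), the index name, and the assumption that
``usage_id`` references the usages table are illustrative, and the column
types are SQLite approximations of the MySQL types listed:

```python
import sqlite3

SCHEMA = """
CREATE TABLE quota_usages (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at DATETIME,
    updated_at DATETIME,
    project_id VARCHAR(255),
    resource   VARCHAR(255) NOT NULL,
    reserved   INTEGER NOT NULL,
    used       INTEGER NOT NULL
);
CREATE INDEX ix_quota_usages_project_id ON quota_usages (project_id);

CREATE TABLE reservations (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at DATETIME,
    updated_at DATETIME,
    deleted_at DATETIME,
    deleted    BOOLEAN,
    uuid       VARCHAR(36) NOT NULL,
    -- Assumption: usage_id points at the quota_usages row it modifies.
    usage_id   INTEGER NOT NULL REFERENCES quota_usages (id),
    project_id VARCHAR(255),
    resource   VARCHAR(255),
    delta      INTEGER NOT NULL,
    expire     DATETIME
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Three fpgas in use, two being attached (the example from this spec).
cur = conn.execute(
    "INSERT INTO quota_usages (project_id, resource, reserved, used) "
    "VALUES (?, ?, ?, ?)",
    ("demo", "fpga", 2, 3),
)
usage_id = cur.lastrowid
conn.execute(
    "INSERT INTO reservations (uuid, usage_id, project_id, resource, delta) "
    "VALUES (?, ?, ?, ?, ?)",
    ("00000000-0000-0000-0000-000000000001", usage_id, "demo", "fpga", 2),
)
row = conn.execute(
    "SELECT reserved, used FROM quota_usages WHERE id = ?", (usage_id,)
).fetchone()
```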
|
||||
|
||||
We will also introduce a QuotaEngine class which represents the set of
|
||||
recognized quotas and a DbQuotaDriver class which performs checks for enforcement
|
||||
of quotas and also allows quota information to be obtained.
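
One possible shape for these two classes is sketched below. Only the class
names ``QuotaEngine`` and ``DbQuotaDriver`` come from this spec; the method
names (``reserve``, ``commit``, ``rollback``), the ``OverQuota`` exception,
the in-memory dictionaries standing in for the database, and the manually
supplied limits are all assumptions made for illustration:

```python
import uuid
from dataclasses import dataclass, field


class OverQuota(Exception):
    """Raised when a reservation would exceed the manually set limit."""


@dataclass
class Usage:
    reserved: int = 0
    used: int = 0


@dataclass
class DbQuotaDriver:
    """In-memory stand-in for a DB-backed driver (illustration only)."""
    usages: dict = field(default_factory=dict)        # (project, resource) -> Usage
    reservations: dict = field(default_factory=dict)  # uuid -> (project, resource, delta)

    def reserve(self, project_id, resource, delta, hard_limit):
        usage = self.usages.setdefault((project_id, resource), Usage())
        if usage.reserved + usage.used + delta > hard_limit:
            raise OverQuota(resource)
        usage.reserved += delta
        reservation_id = str(uuid.uuid4())
        self.reservations[reservation_id] = (project_id, resource, delta)
        return reservation_id

    def commit(self, reservation_id):
        # Attachment succeeded: move the delta from 'reserved' to 'used'.
        project_id, resource, delta = self.reservations.pop(reservation_id)
        usage = self.usages[(project_id, resource)]
        usage.reserved -= delta
        usage.used += delta

    def rollback(self, reservation_id):
        # Allocation failed: release the reserved amount.
        project_id, resource, delta = self.reservations.pop(reservation_id)
        self.usages[(project_id, resource)].reserved -= delta


class QuotaEngine:
    """Holds the recognized quotas; limits are set by hand for now."""

    def __init__(self, limits, driver=None):
        self.limits = limits
        self.driver = driver if driver is not None else DbQuotaDriver()

    def reserve(self, project_id, resource, delta=1):
        return self.driver.reserve(
            project_id, resource, delta, self.limits[resource])

    def commit(self, reservation_id):
        self.driver.commit(reservation_id)

    def rollback(self, reservation_id):
        self.driver.rollback(reservation_id)


engine = QuotaEngine(limits={"fpga": 5})
engine.driver.usages[("demo", "fpga")] = Usage(used=3)
rid = engine.reserve("demo", "fpga", delta=2)
engine.commit(rid)
try:
    engine.reserve("demo", "fpga")
    rejected = False
except OverQuota:
    rejected = True
final_usage = engine.driver.usages[("demo", "fpga")]
```

Keeping the limit lookup in the engine means that a later switch to
oslo.limit would, under these assumptions, only touch ``QuotaEngine``.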
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
Not sure if we need to expose GET quota usage before oslo.limit settles down.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
None
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
None
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
None
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
None
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
Xinran WANG
|
||||
|
||||
Other contributors:
|
||||
None
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
* Introduce Quota usages and Reservation tables in the Cyborg database.
|
||||
* Update these two tables during allocation and deallocation of resources.
|
||||
* Reserve a placeholder function which will invoke oslo.limit, with a "pass"
|
||||
inside.
|
||||
* Add a rollback mechanism for when allocation fails.
|
||||
|
||||
Dependencies
|
||||
============
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
* Each commit will be accompanied by unit tests.
|
||||
* Gate functional tests will also be covered.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
None
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
[1] https://review.openstack.org/#/c/540803
|
@ -1,392 +0,0 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Example Spec - The title of your blueprint
|
||||
==========================================
|
||||
|
||||
Include the URL of your launchpad blueprint:
|
||||
|
||||
https://blueprints.launchpad.net/openstack-cyborg/+spec/example
|
||||
|
||||
Introduction paragraph -- why are we doing anything? A single paragraph of
|
||||
prose that operators can understand. The title and this first paragraph
|
||||
should be used as the subject line and body of the commit message
|
||||
respectively.
|
||||
|
||||
Some notes about the cyborg-spec and blueprint process:
|
||||
|
||||
* Not all blueprints need a spec. For more information see
|
||||
https://docs.openstack.org/developer/cyborg/blueprints.html#specs
|
||||
|
||||
* The aim of this document is first to define the problem we need to solve,
|
||||
and second to agree on the overall approach to solve that problem.
|
||||
|
||||
* This is not intended to be extensive documentation for a new feature.
|
||||
For example, there is no need to specify the exact configuration changes,
|
||||
nor the exact details of any DB model changes. But you should still define
|
||||
that such changes are required, and be clear on how that will affect
|
||||
upgrades.
|
||||
|
||||
* You should aim to get your spec approved before writing your code.
|
||||
While you are free to write prototypes and code before getting your spec
|
||||
approved, it's possible that the outcome of the spec review process leads
|
||||
you towards a fundamentally different solution than you first envisaged.
|
||||
|
||||
* But, API changes are held to a much higher level of scrutiny.
|
||||
As soon as an API change merges, we must assume it could be in production
|
||||
somewhere, and as such, we then need to support that API change forever.
|
||||
To avoid getting that wrong, we do want lots of details about API changes
|
||||
upfront.
|
||||
|
||||
Some notes about using this template:
|
||||
|
||||
* Your spec should be in reStructuredText, like this template.
|
||||
|
||||
* Please wrap text at 79 columns.
|
||||
|
||||
* The filename in the git repository should match the launchpad URL, for
|
||||
example a URL of: https://blueprints.launchpad.net/openstack-cyborg/+spec/awesome-thing
|
||||
should be named awesome-thing.rst
|
||||
|
||||
* Please do not delete any of the sections in this template. If you have
|
||||
nothing to say for a whole section, just write: None
|
||||
|
||||
* For help with syntax, see http://sphinx-doc.org/rest.html
|
||||
|
||||
* To test out your formatting, build the docs using tox and see the generated
|
||||
HTML file in doc/build/html/specs/<path_of_your_file>
|
||||
|
||||
* If you would like to provide a diagram with your spec, ascii diagrams are
|
||||
required. http://asciiflow.com/ is a very nice tool to assist with making
|
||||
ascii diagrams. The reason for this is that the tool used to review specs is
|
||||
based purely on plain text. Plain text will allow review to proceed without
|
||||
having to look at additional files which can not be viewed in gerrit. It
|
||||
will also allow inline feedback on the diagram itself.
|
||||
|
||||
* If your specification proposes any changes to the Cyborg REST API such
|
||||
as changing parameters which can be returned or accepted, or even
|
||||
the semantics of what happens when a client calls into the API, then
|
||||
you should add the APIImpact flag to the commit message. Specifications with
|
||||
the APIImpact flag can be found with the following query:
|
||||
|
||||
https://review.openstack.org/#/q/status:open+project:openstack/cyborg+message:apiimpact,n,z
|
||||
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
A detailed description of the problem. What problem is this blueprint
|
||||
addressing?
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
What use cases does this address? What impact on actors does this change have?
|
||||
Ensure you are clear about the actors in each use case: Developer, End User,
|
||||
Deployer etc.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Here is where you cover the change you propose to make in detail. How do you
|
||||
propose to solve this problem?
|
||||
|
||||
If this is one part of a larger effort make it clear where this piece ends. In
|
||||
other words, what's the scope of this effort?
|
||||
|
||||
At this point, if you would like to just get feedback on if the problem and
|
||||
proposed change fit in Cyborg, you can stop here and post this for review to
|
||||
get preliminary feedback. If so please say:
|
||||
Posting to get preliminary feedback on the scope of this spec.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
What other ways could we do this thing? Why aren't we using those? This doesn't
|
||||
have to be a full literature review, but it should demonstrate that thought has
|
||||
been put into why the proposed solution is an appropriate one.
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
Changes which require modifications to the data model often have a wider impact
|
||||
on the system. The community often has strong opinions on how the data model
|
||||
should be evolved, from both a functional and performance perspective. It is
|
||||
therefore important to capture and gain agreement as early as possible on any
|
||||
proposed changes to the data model.
|
||||
|
||||
Questions which need to be addressed by this section include:
|
||||
|
||||
* What new data objects and/or database schema changes is this going to
|
||||
require?
|
||||
|
||||
* What database migrations will accompany this change?
|
||||
|
||||
* How will the initial set of new data objects be generated, for example if you
|
||||
need to take into account existing instances, or modify other existing data
|
||||
describe how that will work.
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
Each API method which is either added or changed should have the following:
|
||||
|
||||
* Specification for the method
|
||||
|
||||
* A description of what the method does suitable for use in
|
||||
user documentation
|
||||
|
||||
* Method type (POST/PUT/GET/DELETE)
|
||||
|
||||
* Normal http response code(s)
|
||||
|
||||
* Expected error http response code(s)
|
||||
|
||||
* A description for each possible error code should be included
|
||||
describing semantic errors which can cause it such as
|
||||
inconsistent parameters supplied to the method, or when an
|
||||
instance is not in an appropriate state for the request to
|
||||
succeed. Errors caused by syntactic problems covered by the JSON
|
||||
schema definition do not need to be included.
|
||||
|
||||
* URL for the resource
|
||||
|
||||
* URL should not include underscores, and use hyphens instead.
|
||||
|
||||
* Parameters which can be passed via the url
|
||||
|
||||
* JSON schema definition for the request body data if allowed
|
||||
|
||||
* Field names should use snake_case style, not CamelCase or MixedCase
|
||||
style.
|
||||
|
||||
* JSON schema definition for the response body data if any
|
||||
|
||||
* Field names should use snake_case style, not CamelCase or MixedCase
|
||||
style.
|
||||
|
||||
* Example use case including typical API samples for both data supplied
|
||||
by the caller and the response
|
||||
|
||||
* Discuss any policy changes, and discuss what things a deployer needs to
|
||||
think about when defining their policy.
|
||||
|
||||
Note that the schema should be defined as restrictively as
|
||||
possible. Parameters which are required should be marked as such and
|
||||
only under exceptional circumstances should additional parameters
|
||||
which are not defined in the schema be permitted (eg
|
||||
additionalProperties should be False).
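
For illustration, a restrictive schema of this kind might look like the
sketch below. The field names (``name``, ``device_type``) and the helper
``schema_problems`` are hypothetical, and the helper checks only ``required``
and ``additionalProperties``; it is not a full JSON Schema validator:

```python
REQUEST_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1, "maxLength": 255},
        "device_type": {"type": "string", "enum": ["FPGA", "GPU"]},
    },
    # Required parameters are marked as such.
    "required": ["name", "device_type"],
    # Parameters not defined in the schema are rejected.
    "additionalProperties": False,
}


def schema_problems(body: dict, schema: dict) -> list:
    """Report missing required keys and keys the schema does not define."""
    problems = [
        f"missing: {key}" for key in schema["required"] if key not in body
    ]
    if schema.get("additionalProperties") is False:
        problems += [
            f"unexpected: {key}"
            for key in body
            if key not in schema["properties"]
        ]
    return problems


ok = schema_problems({"name": "acc0", "device_type": "FPGA"}, REQUEST_SCHEMA)
bad = schema_problems({"name": "acc0", "vendor": "x"}, REQUEST_SCHEMA)
```

Here ``ok`` is empty, while ``bad`` reports the missing ``device_type`` and
the undefined ``vendor`` key.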
|
||||
|
||||
Reuse of existing predefined parameter types such as regexps for
|
||||
passwords and user defined names is highly encouraged.
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
Describe any potential security impact on the system. Some of the items to
|
||||
consider include:
|
||||
|
||||
* Does this change touch sensitive data such as tokens, keys, or user data?
|
||||
|
||||
* Does this change alter the API in a way that may impact security, such as
|
||||
a new way to access sensitive information or a new way to login?
|
||||
|
||||
* Does this change involve cryptography or hashing?
|
||||
|
||||
* Does this change require the use of sudo or any elevated privileges?
|
||||
|
||||
* Does this change involve using or parsing user-provided data? This could
|
||||
be directly at the API level or indirectly such as changes to a cache layer.
|
||||
|
||||
* Can this change enable a resource exhaustion attack, such as allowing a
|
||||
single API interaction to consume significant server resources? Some examples
|
||||
of this include launching subprocesses for each connection, or entity
|
||||
expansion attacks in XML.
|
||||
|
||||
For more detailed guidance, please see the OpenStack Security Guidelines as
|
||||
a reference (https://wiki.openstack.org/wiki/Security/Guidelines). These
|
||||
guidelines are a work in progress and are designed to help you identify
|
||||
security best practices. For further information, feel free to reach out
|
||||
to the OpenStack Security Group at openstack-security@lists.openstack.org.
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
Please specify any changes to notifications. Be that an extra notification,
|
||||
changes to an existing notification, or removing a notification.
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
Aside from the API, are there other ways a user will interact with this
|
||||
feature?
|
||||
|
||||
* Does this change have an impact on python-cyborgclient? What does the user
|
||||
interface there look like?
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
Describe any potential performance impact on the system, for example
|
||||
how often will new code be called, and is there a major change to the calling
|
||||
pattern of existing code.
|
||||
|
||||
Examples of things to consider here include:
|
||||
|
||||
* A periodic task might look like a small addition but if it calls conductor or
|
||||
another service the load is multiplied by the number of nodes in the system.
|
||||
|
||||
* Scheduler filters get called once per host for every instance being created,
|
||||
so any latency they introduce is linear with the size of the system.
|
||||
|
||||
* A small change in a utility function or a commonly used decorator can have a
|
||||
large impact on performance.
|
||||
|
||||
* Calls which result in database queries (whether direct or via conductor)
|
||||
can have a profound impact on performance when called in critical sections of
|
||||
the code.
|
||||
|
||||
* Will the change include any locking, and if so what considerations are there
|
||||
on holding the lock?
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
Discuss things that will affect how you deploy and configure OpenStack
|
||||
that have not already been mentioned, such as:
|
||||
|
||||
* What config options are being added? Should they be more generic than
|
||||
proposed (for example a flag that other hypervisor drivers might want to
|
||||
implement as well)? Are the default values ones which will work well in
|
||||
real deployments?
|
||||
|
||||
* Is this a change that takes immediate effect after it's merged, or is it
|
||||
something that has to be explicitly enabled?
|
||||
|
||||
* If this change is a new binary, how would it be deployed?
|
||||
|
||||
* Please state anything that those doing continuous deployment, or those
|
||||
upgrading from the previous release, need to be aware of. Also describe
|
||||
any plans to deprecate configuration values or features. For example, if we
|
||||
change the directory name that instances are stored in, how do we handle
|
||||
instance directories created before the change landed? Do we move them? Do
|
||||
we have a special case in the code? Do we assume that the operator will
|
||||
recreate all the instances in their cloud?
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
Discuss things that will affect other developers working on OpenStack,
|
||||
such as:
|
||||
|
||||
* If the blueprint proposes a change to the driver API, discussion of how
|
||||
other hypervisors would implement the feature is required.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Who is leading the writing of the code? Or is this a blueprint where you're
|
||||
throwing it out there to see who picks it up?
|
||||
|
||||
If more than one person is working on the implementation, please designate the
|
||||
primary author and contact.
|
||||
|
||||
Primary assignee:
|
||||
<launchpad-id or None>
|
||||
|
||||
Other contributors:
|
||||
<launchpad-id or None>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Work items or tasks -- break the feature up into the things that need to be
|
||||
done to implement it. Those parts might end up being done by different people,
|
||||
but we're mostly trying to understand the timeline for implementation.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
* Include specific references to specs and/or blueprints in cyborg, or in other
|
||||
projects, that this one either depends on or is related to.
|
||||
|
||||
* If this requires functionality of another project that is not currently used
|
||||
by Cyborg, document that fact.
|
||||
|
||||
* Does this feature require any new library dependencies or code otherwise not
|
||||
included in OpenStack? Or does it depend on a specific version of library?
|
||||
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Please discuss the important scenarios needed to test here, as well as
|
||||
specific edge cases we should be ensuring work correctly. For each
|
||||
scenario please specify if this requires specialized hardware, a full
|
||||
OpenStack environment, or can be simulated inside the Cyborg tree.
|
||||
|
||||
Please discuss how the change will be tested. We especially want to know what
|
||||
tempest tests will be added. It is assumed that unit test coverage will be
|
||||
added so that doesn't need to be mentioned explicitly, but discussion of why
|
||||
you think unit tests are sufficient and we don't need to add more tempest
|
||||
tests would need to be included.
|
||||
|
||||
Is this untestable in gate given current limitations (specific hardware /
|
||||
software configurations available)? If so, are there mitigation plans (3rd
|
||||
party testing, gate enhancements, etc).
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Which audiences are affected most by this change, and which documentation
|
||||
titles on docs.openstack.org should be updated because of this change? Don't
|
||||
repeat details discussed above, but reference them here in the context of
|
||||
documentation for multiple audiences. For example, the Operations Guide targets
|
||||
cloud operators, and the End User Guide would need to be updated if the change
|
||||
offers a new feature available through the CLI or dashboard. If a config option
|
||||
changes or is deprecated, note here that the documentation needs to be updated
|
||||
to reflect this specification's change.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Please add any useful references here. You are not required to have any
|
||||
reference. Moreover, this specification should still make sense when your
|
||||
references are unavailable. Examples of what you could include are:
|
||||
|
||||
* Links to mailing list or IRC discussions
|
||||
|
||||
* Links to notes from a summit session
|
||||
|
||||
* Links to relevant research, if appropriate
|
||||
|
||||
* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
|
||||
EC2 docs)
|
||||
|
||||
* Anything else you feel it is worthwhile to refer to
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
Optional section intended to be used each time the spec is updated to describe
|
||||
new design, API or any database schema updated. Useful to let reader understand
|
||||
what has happened over time.
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Pike
|
||||
- Introduced
|