Merge "[spec] Making DB scalable and flexible enough"
commit
1ad1d2762d
379
doc/specs/in-progress/db_refactoring.rst
Normal file
@ -0,0 +1,379 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==============================
Scaling & Refactoring Rally DB
==============================

Many use cases cannot be implemented with the current DB schema. This proposal
describes what should change in the DB and why.
|
||||
Problem description
===================

Three use cases require DB refactoring:

1. Scalable task engine

   * Run benchmarks with billions of iterations
   * Generate distributed load of 10k-100k RPS
   * Generate all reports/aggregates based on that data

2. Multi-scenario load generation

   Running multiple scenarios as part of a single subtask requires changes
   in the way we store subtask results.

3. Task debugging and profiling

   * Store complete validation results in the DB (e.g. which validators were
     run, which passed, which failed and why).
   * Store durations of all steps (validation/task) as well as other
     execution stats needed by the CLI and to generate graphs in reports.
   * Store statuses, durations and errors of context cleanup steps.

The current schema doesn't support these cases.
|
||||
Proposed change
===============

Changes in DB
-------------

Existing DB schema
~~~~~~~~~~~~~~~~~~

    +------------+    +-------------+
    | Task       |    | TaskResult  |
    +------------+    +-------------+
    |            |    |             |
    | id         |    | id          |
    | uuid    <--+----+- task_uuid  |
    |   ^        |    |             |
    +---+--------+    +-------------+

* Task - stores task status, tags, validation log

* TaskResult - stores all information about workloads, including
  configuration, context, sla, results etc.
|
||||
New DB schema
~~~~~~~~~~~~~

    +------------+    +-------------+    +--------------+    +---------------+
    | Task       |    | Subtask     |    | Workload     |    | WorkloadData  |
    +------------+    +-------------+    +--------------+    +---------------+
    |            |    |             |    |              |    |               |
    | id         |    | id     <----+--+ | id     <-----+--+ | id            |
    | uuid    <--+----+- task_uuid  |  +-+- subtask_id  |  +-+- workload_id  |
    |   ^        |    | uuid        |    | uuid         |    | uuid          |
    +---+--------+    +---^---------+    |              |    |               |
        +--------------------------------+- task_uuid   |    |               |
        |                 |              +--------------+    |               |
        +----------------------------------------------------+- task_uuid    |
        |                 |                                  +---------------+
        +-------+---------+
                |
    +--------+  +
    | Tag    |  |
    +--------+  |
    |        |  |
    | id     |  |
    | uuid  -+--+
    | type   |
    | tag    |
    +--------+

* Task - stores information about the task: when it was started/updated/
  finished, its status, description, and so on. It is also used to aggregate
  all subtasks related to this task.

* Subtask - stores information about the subtask: when it was started/
  updated/finished, its status, description, configuration, and aggregated
  information about workloads. Without subtasks we won't be able to track
  information about task execution or run many subtasks in a single task.

* Workload - aggregated information about a specific workload (required
  for reports), as well as information on how these workloads are executed
  (in parallel or serially) and the status of each workload. Without the
  Workload table we won't be able to support multiple workloads per single
  subtask.

* WorkloadData - contains chunks of raw data for future data analysis and
  reporting. This is the complete information, which we don't always need,
  for example when getting an overview of what happened. As we have multiple
  chunks per Workload, we won't be able to store them without creating this
  table.

* Tag - contains tags bound to tasks and subtasks by uuid and type
|
||||
Task table
~~~~~~~~~~

    id : INT, PK
    uuid : UUID

    # Optional
    deployment_uuid : UUID

    # Full input task configuration
    input_task : TEXT

    title : String
    description : TEXT

    # Structure of validation results:
    # [
    #   {
    #     "name": <name>,       # full validator function name,
    #                           # validator plugin name (in the future)
    #     "input": <input>,     # smallest part of
    #     "message": <msg>,     # message with description
    #     "success": <bool>,    # did validator pass
    #     "duration": <float>   # duration of validation process
    #   },
    #   .....
    # ]
    validation_result : TEXT

    # Duration of validation can be used to tune the validation process.
    validation_duration : FLOAT

    # Duration of benchmarking part of task
    task_duration : FLOAT

    # True if all workloads in the task passed their SLA
    pass_sla : BOOL

    # Current status of task
    status : ENUM(init, validating, validation_failed,
                  aborting, soft_aborting, aborted,
                  crashed, validated, running, finished)


Task.status diagram of states
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    INIT -> VALIDATING -> VALIDATION_FAILED
                       -> ABORTING -> ABORTED
                       -> SOFT_ABORTING -> ABORTED
                       -> CRASHED
                       -> VALIDATED -> RUNNING -> FINISHED
                                               -> ABORTING -> ABORTED
                                               -> SOFT_ABORTING -> ABORTED
                                               -> CRASHED
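
The state diagram above can be read as a transition table. The following is a minimal sketch of that reading, not Rally's implementation: the names ``TRANSITIONS``, ``can_transition`` and ``validate_path`` are hypothetical, and the mapping assumes that every branch listed under VALIDATING and RUNNING starts from that state.

```python
# Allowed Task.status transitions, transcribed from the diagram above.
# Assumption: each "-> X" branch starts from the state it is aligned under.
TRANSITIONS = {
    "init": {"validating"},
    "validating": {"validation_failed", "aborting", "soft_aborting",
                   "crashed", "validated"},
    "validated": {"running"},
    "running": {"finished", "aborting", "soft_aborting", "crashed"},
    "aborting": {"aborted"},
    "soft_aborting": {"aborted"},
    # Terminal states have no outgoing transitions.
    "validation_failed": set(),
    "aborted": set(),
    "crashed": set(),
    "finished": set(),
}


def can_transition(current, new):
    """Return True if the diagram allows moving from `current` to `new`."""
    return new in TRANSITIONS.get(current, set())


def validate_path(statuses):
    """Check that a whole sequence of statuses follows the diagram."""
    return all(can_transition(a, b) for a, b in zip(statuses, statuses[1:]))
```

A DB layer could call something like ``can_transition`` before updating ``Task.status``, so that a finished or aborted task is never moved back to a running state.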
|
||||
Subtask table
~~~~~~~~~~~~~

    id : INT, PK
    uuid : UUID
    task_uuid : UUID
    title : String
    description : TEXT

    # Position of Subtask in Input Task
    position : INT

    # Context and SLA could be defined both Subtask-wide and per workload
    context : JSON
    sla : JSON

    run_in_parallel : BOOL
    duration : FLOAT

    # True if all workloads in the subtask passed their SLA
    pass_sla : BOOL

    # Current status of subtask
    status : ENUM(running, finished, crashed)
|
||||
Workload table
~~~~~~~~~~~~~~

    id : INT, PK
    uuid : UUID
    subtask_id : INT
    task_uuid : UUID

    # Unlike Task's and Subtask's title, which is arbitrary,
    # Workload's name defines the scenario being executed
    name : String

    # Scenario plugin docstring
    description : TEXT

    # Position of Workload in Input Task
    position : INT

    runner : JSON
    runner_type : String

    # Context and SLA could be defined both Subtask-wide and per workload
    context : JSON
    sla : JSON

    args : JSON

    # SLA structure that contains all detailed info looks like:
    # [
    #   {
    #     "name": <full_name_of_validator>,
    #     "duration": <duration_of_validation>,
    #     "success": <boolean_pass_or_not>,
    #     "message": <description_of_what_happened>,
    #   }
    # ]
    #
    sla_results : TEXT

    # Context data structure (order matters)
    # [
    #   {
    #     "name": string,
    #     "setup_duration": FLOAT,
    #     "cleanup_duration": FLOAT,
    #     "exception": LIST,      # exception info
    #     "setup_extra": DICT,    # any custom data
    #     "cleanup_extra": DICT   # any custom data
    #
    #   }
    # ]
    context_execution : TEXT

    starttime : TIMESTAMP

    load_duration : FLOAT
    full_duration : FLOAT

    # Shortest and longest iteration duration
    min_duration : FLOAT
    max_duration : FLOAT

    total_iter_count : INT
    failed_iter_count : INT

    # Statistics data structure (order matters)
    # {
    #   "<action_name>": {
    #     "min_duration": FLOAT,
    #     "max_duration": FLOAT,
    #     "median_duration": FLOAT,
    #     "avg_duration": FLOAT,
    #     "percentile90_duration": FLOAT,
    #     "percentile95_duration": FLOAT,
    #     "success_count": INT,
    #     "total_count": INT
    #   },
    #   ...
    # }
    statistics : JSON # Aggregated information about actions

    # As for SLA result
    success : BOOL

    # Profile information collected during the run of scenario
    # This is internal data and format of it can be changed over time
    # _profiling_data : Text
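
To make the ``statistics`` structure above more concrete, here is a rough sketch of how one per-action entry might be computed. Everything in it is an assumption made for illustration: the helper names, the input format (a list of ``(duration, succeeded)`` pairs), the choice to exclude failed runs from the timing values, and the linearly interpolated percentile. The spec itself does not prescribe any of these.

```python
import math
import statistics


def percentile(sorted_values, fraction):
    """Linearly interpolated percentile of a sorted, non-empty list."""
    k = (len(sorted_values) - 1) * fraction
    lower, upper = math.floor(k), math.ceil(k)
    if lower == upper:
        return sorted_values[lower]
    return (sorted_values[lower] * (upper - k)
            + sorted_values[upper] * (k - lower))


def action_statistics(samples):
    """Build one "<action_name>" entry of the `statistics` structure.

    `samples` is a hypothetical list of (duration, succeeded) pairs.
    Assumption: only successful runs contribute to the timing values.
    """
    durations = sorted(d for d, succeeded in samples if succeeded)
    stats = {
        "min_duration": None,
        "max_duration": None,
        "median_duration": None,
        "avg_duration": None,
        "percentile90_duration": None,
        "percentile95_duration": None,
        "success_count": len(durations),
        "total_count": len(samples),
    }
    if durations:
        stats.update(
            min_duration=durations[0],
            max_duration=durations[-1],
            median_duration=statistics.median(durations),
            avg_duration=statistics.mean(durations),
            percentile90_duration=percentile(durations, 0.90),
            percentile95_duration=percentile(durations, 0.95),
        )
    return stats
```

Storing such pre-aggregated values in Workload is what lets reports be built without reading every WorkloadData chunk.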
|
||||
WorkloadData
~~~~~~~~~~~~

    id : INT, PK
    uuid : UUID
    workload_id : INT
    task_uuid : UUID

    # Chunk order is used to sort output data
    chunk_order : INT

    # Number of iterations, can be useful for some algorithms
    iteration_count : INT

    # Number of failed iterations
    iteration_failed : INT

    # Full size of results in bytes
    chunk_size : INT

    # Size of zipped results in bytes
    zipped_chunk_size : INT

    started_at : TIMESTAMP
    finished_at : TIMESTAMP

    # Chunk_data structure
    # [
    #   {
    #     "duration": FLOAT,
    #     "idle_duration": FLOAT,
    #     "timestamp": FLOAT,
    #     "errors": LIST,
    #     "output": {
    #       "complete": LIST,
    #       "additive": LIST,
    #     },
    #     "actions": LIST
    #   },
    #   ...
    # ]
    chunk_data : BLOB # compressed LIST of JSONs
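
The chunk fields above can be illustrated with a small sketch of packing and unpacking one chunk. It is only an illustration under stated assumptions: ``pack_chunk`` and ``unpack_chunks`` are hypothetical helpers, ``zlib`` stands in for whatever compression is eventually chosen, and an iteration is counted as failed when its ``errors`` list is non-empty.

```python
import json
import zlib


def pack_chunk(iterations, chunk_order):
    """Serialize and compress a list of iteration results into one row."""
    raw = json.dumps(iterations).encode("utf-8")
    zipped = zlib.compress(raw)  # assumption: zlib as the "zip" step
    return {
        "chunk_order": chunk_order,
        "iteration_count": len(iterations),
        # Assumption: a non-empty "errors" list marks a failed iteration.
        "iteration_failed": sum(1 for it in iterations if it["errors"]),
        "chunk_size": len(raw),
        "zipped_chunk_size": len(zipped),
        "chunk_data": zipped,
    }


def unpack_chunks(rows):
    """Restore the iterations from several rows, sorted by chunk_order."""
    iterations = []
    for row in sorted(rows, key=lambda r: r["chunk_order"]):
        iterations.extend(json.loads(zlib.decompress(row["chunk_data"])))
    return iterations
```

Keeping ``chunk_size`` and ``zipped_chunk_size`` next to the blob makes it possible to estimate memory and storage needs without decompressing any data.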
|
||||
Tag table
~~~~~~~~~

    id : INT, PK
    uuid : UUID of task or subtask
    type : ENUM(task, subtask)
    tag : TEXT

- (uuid, type, tag) is unique and indexed
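
The uniqueness rule for ``(uuid, type, tag)`` can be tried out with the standard library's ``sqlite3`` module. This is a sketch only: the table name ``tags``, the index name and the helper functions are hypothetical, and a ``CHECK`` constraint stands in for the ENUM type, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags ("
    " id INTEGER PRIMARY KEY,"
    " uuid TEXT NOT NULL,"
    " type TEXT NOT NULL CHECK (type IN ('task', 'subtask')),"
    " tag TEXT NOT NULL)"
)
# One index both enforces uniqueness and serves lookups by (uuid, type).
conn.execute(
    "CREATE UNIQUE INDEX tags_uuid_type_tag ON tags (uuid, type, tag)")


def add_tag(uuid, type_, tag):
    """Insert a tag; return False if a constraint rejects the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO tags (uuid, type, tag) VALUES (?, ?, ?)",
                (uuid, type_, tag))
    except sqlite3.IntegrityError:
        return False
    return True


def tags_for(uuid, type_):
    """Return all tags bound to the given task or subtask."""
    rows = conn.execute(
        "SELECT tag FROM tags WHERE uuid = ? AND type = ? ORDER BY tag",
        (uuid, type_))
    return [row[0] for row in rows]
```

Because ``type`` is part of the key, the same tag text can be attached to a task and to a subtask that happen to share a uuid value without conflict.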
|
||||
Open questions
~~~~~~~~~~~~~~

- We store both SLA configuration (plugin names and config params) and
  SLA results (passed/failed and numeric data). The same is true for context.
  Should we separate 'sla_results' from 'sla' and 'context_execution' from
  'context' in Workload?


Alternatives
------------

None.


Implementation
==============

Assignee(s)
-----------

- boris-42 (?)
- ikhudoshyn

Milestones
----------

Target Milestone for completion: N/A

Work Items
----------

TBD

Dependencies
============

- There should be a smooth transition of the code to work with the new data
  structure