Inference Engine Traces
As described on the Inference Engine Debugging page, debugging or testing the Inference Engine requires “traces”: samples of data simulating what would come from a bus (i.e. the inputs) and, ideally, the desired results produced by the Inference Engine (i.e. the outputs). Traces that include the desired/specified outputs are referred to as “labeled” traces, because the inputs are “labeled” with the desired outputs.
This page describes the format of trace files, and the process/guidelines by which to create a trace file.
There are many examples of traces created according to the format and conventions described in this page already in the onebusaway-nyc repository: https://github.com/camsys/onebusaway-nyc/tree/master/onebusaway-nyc-integration-tests/src/integration-test/resources/traces
To create an integration test from the trace file, see Creating an Integration Test.
Each trace is embodied by a single plain text CSV file, with the columns described below.
Input Columns:
These are the columns that define the inputs to the Inference Engine.
Column | Description | Required | Example | Notes |
---|---|---|---|---|
vid | Vehicle ID, fully qualified with agency ID | Required | MTA NYCT_7564 | |
lat | Latitude | Required | 40.553357 | |
lon | Longitude | Required | -74.117308 | |
operator_id | Numeric employee ID | Required | 123456 | Actual value is irrelevant, can use arbitrary value. |
reported_run_id | Run ID transmitted from the bus. Not agency-qualified. | Optional? | 63-101 | This column is not expected to match run IDs in the bundle directly; it is fuzzy-matched. |
assigned_run_id | Run ID assigned to the operator whose ID was received from the bus. Not agency-qualified. | Optional | B63-101 | This column must match exactly a run in the bundle. |
timestamp | Timestamp, in YYYY-MM-DD HH:MM:SS | Required | 2012-01-11 09:13:52 | This is assumed to be in the same timezone as the location of the bundle. |
dsc | Destination Sign Code, numeric | Optional | 4630 | |
direction_deg | Bearing of the bus, decimal | Optional | 80.28 | North is… 0? |
speed | Speed, in integer MPH | Optional | 35 | |
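
As a rough illustration of the input format, the following Python sketch checks that a trace contains the required input columns from the table above and that each timestamp parses as YYYY-MM-DD HH:MM:SS. The function name `check_trace_inputs` and the sample rows are made up for this example; this is not part of the OBA-NYC tooling, and the real trace loader may be stricter or more lenient.

```python
import csv
import io
from datetime import datetime

# Required input columns, taken from the table above.
REQUIRED_INPUT_COLUMNS = ["vid", "lat", "lon", "operator_id", "timestamp"]
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:MM:SS


def check_trace_inputs(csv_text):
    """Return a list of problems found in the required input columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_INPUT_COLUMNS if c not in header]
    if missing:
        return [f"missing required column: {c}" for c in missing]

    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column in REQUIRED_INPUT_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"line {line_number}: empty {column}")
        timestamp = (row["timestamp"] or "").strip()
        if timestamp:
            try:
                datetime.strptime(timestamp, TIMESTAMP_FORMAT)
            except ValueError:
                problems.append(f"line {line_number}: bad timestamp {timestamp!r}")
    return problems


# Hypothetical two-row trace; the second row has an Excel-style timestamp.
SAMPLE = """vid,lat,lon,operator_id,timestamp,dsc
MTA NYCT_7564,40.553357,-74.117308,123456,2012-01-11 09:13:52,4630
MTA NYCT_7564,40.553400,-74.117100,123456,1/11/2012 9:14,4630
"""

for problem in check_trace_inputs(SAMPLE):
    print(problem)  # line 3: bad timestamp '1/11/2012 9:14'
```

A check like this is mainly useful after a trace has been opened and re-saved in a spreadsheet, which is when the timestamp column tends to change format.
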
Output Columns (aka “labels”):
These are the columns that define the expected outputs of the Inference Engine. Integration tests only test the inferred outputs against those columns that are provided in the trace.
Column | Description | Required | Example | Notes |
---|---|---|---|---|
actual_is_run_formal | Boolean indicator for formal inference: TRUE or FALSE | Required | FALSE | |
actual_run_id | Run ID, not agency-qualified | Required only if actual_is_run_formal = TRUE | B63-101 | |
actual_trip_id | Trip ID, fully agency-qualified | Optional | MTA NYCT_JG_C3-Weekday-SDon-080200_B35_27 | |
actual_block_id | Block ID, fully agency-qualified | Optional | MTA NYCT_JG_C3-Weekday-SDon_E_JG_46920_B35-27 | |
actual_dsc | Destination Sign Code | Optional | 4630 | |
actual_phase | Operational Phase (see Inference Engine page) | Optional | IN_PROGRESS | Also supports prefixes (e.g. “LAYOVER_” meaning any layover phase) and/or multiple values separated by “+” character (e.g. “IN_PROGRESS+LAYOVER_”) |
actual_status | Operational Status (see Inference Engine page) | Optional | default | Also supports multiple values separated by ‘+’ character (e.g. “default+stalled”) |
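
The prefix and “+” conventions for actual_phase and actual_status can be read as a small matching rule. The Python sketch below is one plausible interpretation, not the Inference Engine's actual comparison code. It assumes that a value ending in “_” is a prefix, that “+” separates acceptable alternatives, and that a blank label means the row is not checked for that column. The function name `label_matches` is hypothetical, and concrete phase names such as LAYOVER_DURING are used for illustration only; see the Inference Engine page for the real list.

```python
def label_matches(expected, inferred):
    """Return True if the inferred value satisfies the trace label.

    expected: label from the trace, e.g. "IN_PROGRESS+LAYOVER_"
    inferred: value produced by the Inference Engine, e.g. "LAYOVER_DURING"
    """
    if not expected:
        return True  # assumption: a blank label leaves this row unchecked
    for alternative in expected.split("+"):
        if alternative.endswith("_"):
            # assumption: a trailing underscore marks a prefix label
            if inferred.startswith(alternative):
                return True
        elif inferred == alternative:
            return True
    return False


print(label_matches("IN_PROGRESS+LAYOVER_", "LAYOVER_DURING"))  # True
print(label_matches("IN_PROGRESS", "LAYOVER_DURING"))           # False
print(label_matches("default+stalled", "stalled"))              # True
```
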
Since traces are CSV files, there are any number of ways to create them. They can be synthesized completely from scratch, for example if there are no actual vehicles installed with tracking equipment. More commonly, traces are created because an OBA-NYC system is up and running, but some bug or erroneous behavior in the Inference Engine needs to be investigated or changed.
To generate a trace from an existing OBA-NYC system, the easiest way to start is to create the trace from the database. This usually includes both the input fields and output columns described above. Typically the results that were inferred in actual operation are the starting point for creating the desired/actual results.
TODO: Document which columns from the OBA-NYC databases (obanyc_cclocationreport + obanyc_inferredlocation, or obanyc_reporting) are typically used to populate the input and (initial) output/actual columns.
Given a trace file that has been populated from the OBA-NYC databases, the trace file is typically modified using Microsoft Excel according to the following procedures.
- Ensure that the timestamp column is in the correct format (which it typically will not be after Excel reads the CSV file). Change it to a Custom format, with format string ‘yyyy-mm-dd hh:mm:ss’.
- If the file has column names of ‘inferred_*’ (e.g. ‘inferred_run_id’), change them (e.g. using Find/Replace) to ‘actual_*’ (e.g. ‘actual_run_id’).
- Remove any ‘actual_*’ columns that are not accepted in the trace format described above (e.g. actual_service_date, actual_distance_along_block, actual_distance_along_trip, actual_block_lat, actual_block_lon)
- If missing, add the ‘actual_is_run_formal’ column, as it is required.
- Remove, add, or modify the ‘actual_*’ columns according to what is actually being tested by this particular trace, as discussed below.
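For those who prefer a script to Excel, the mechanical steps above (renaming ‘inferred_*’ columns, dropping unsupported ‘actual_*’ columns, and adding actual_is_run_formal) could be sketched in Python as follows. This is an illustrative sketch, not an existing OBA-NYC utility: the function name `clean_trace` is hypothetical, and defaulting actual_is_run_formal to FALSE is an assumption that must be reviewed for each trace. Because the CSV never passes through Excel, the timestamp column is left untouched.

```python
import csv
import io

# The actual_* columns accepted by the trace format described above.
ACCEPTED_ACTUAL_COLUMNS = {
    "actual_is_run_formal", "actual_run_id", "actual_trip_id",
    "actual_block_id", "actual_dsc", "actual_phase", "actual_status",
}


def clean_trace(csv_text):
    """Return cleaned CSV text following the procedure above."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for name, value in raw.items():
            # Rename inferred_* to actual_*.
            if name.startswith("inferred_"):
                name = "actual_" + name[len("inferred_"):]
            # Drop actual_* columns the trace format does not accept.
            if name.startswith("actual_") and name not in ACCEPTED_ACTUAL_COLUMNS:
                continue
            row[name] = value
        # Assumption: default to FALSE when the required column is absent.
        row.setdefault("actual_is_run_formal", "FALSE")
        rows.append(row)

    out = io.StringIO()
    if rows:
        writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
    return out.getvalue()


# Hypothetical export with one supported and one unsupported inferred column.
EXPORT = """vid,timestamp,inferred_phase,inferred_distance_along_block
MTA NYCT_7564,2012-01-11 09:13:52,IN_PROGRESS,1234.5
"""

print(clean_trace(EXPORT))
```

The final step of the procedure, deciding which ‘actual_*’ columns and values the trace should really contain, still requires manual judgment.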
This SQL query can be used as a prototype for generating a trace file. If executed in a SQL tool (e.g. DbVisualizer) the results can generally then be exported as a CSV in the right format. Obviously the specifics of the WHERE clause need to be adjusted to get the exact records for a trace; this is just an example.
-- Prototype trace query: adjust vehicle_id and the time range in BOTH subqueries.
SELECT
    -- Input columns
    COALESCE(cc.vehicle_id, '') AS vid,
    COALESCE(cc.latitude, '') AS lat,
    COALESCE(cc.longitude, '') AS lon,
    COALESCE(cc.operator_id_designator, '') AS operator_id,
    COALESCE(cc.run_id_designator, '') AS reported_run_id,
    COALESCE(inf.assigned_run_number, '') AS assigned_run_id,
    COALESCE(cc.time_reported, '') AS timestamp,
    COALESCE(cc.dest_sign_code, '') AS dsc,
    COALESCE(cc.direction_deg, '') AS direction_deg,
    COALESCE(cc.speed, '') AS speed,
    -- Placeholder: always an empty string
    COALESCE('', '') AS assigned_block_id,
    -- Output ("actual") columns, seeded from the inferred results
    inf.inference_is_formal AS actual_is_run_formal,
    COALESCE(inf.inferred_run_id, '') AS actual_run_id,
    COALESCE(inf.inferred_trip_id, '') AS actual_trip_id,
    COALESCE(inf.inferred_block_id, '') AS actual_block_id,
    COALESCE(inf.inferred_dest_sign_code, '') AS actual_dsc,
    COALESCE(inf.inferred_phase, '') AS actual_phase,
    COALESCE(inf.inferred_status, '') AS actual_status
FROM
    (
        -- Raw reports received from the bus
        SELECT
            *
        FROM
            obanyc_cclocationreport
        WHERE
            vehicle_id = 423
            AND time_reported >= '2015-01-26'
            AND time_reported <= '2015-01-27'
    ) cc
LEFT OUTER JOIN
    (
        -- Inference results for the same reports
        SELECT
            *
        FROM
            obanyc_inferredlocation
        WHERE
            vehicle_id = 423
            AND time_reported >= '2015-01-26'
            AND time_reported <= '2015-01-27'
    ) inf
ON
    cc.uuid = inf.uuid
Deciding (a) which ‘actual_*’ columns should be in the trace, and (b) the values of those columns, is the most subtle part of this process. It depends on understanding exactly what the trace is attempting to accomplish in terms of constraining the behavior of the Inference Engine in a desired yet feasible manner. As such, it is not possible to thoroughly document all ways in which the output/actual columns would be populated.
Nevertheless, below are some common guidelines for populating the output/actual columns of a trace file, based on experience to date.
- Typically the first 2-3 rows of a trace do not have any actual_* values (except actual_is_run_formal=FALSE). This is to give the Inference Engine time to ‘warm up’ when it starts the trace.
- With the exception of the first 2-3 rows, it is good practice to always specify actual_phase and actual_status.
- It is typical to insert some amount of ‘slop’ in the actual_phase column during transitions to/from LAYOVER_ states, unless the point of the trace is to specifically enforce the timing of those transitions down to the single-update level. This ‘slop’ consists of having the actual_phase be a combined value of DEADHEAD_+LAYOVER_ for a couple of updates surrounding the transition between a deadhead and a layover state (or vice versa). Likewise for LAYOVER_+IN_PROGRESS surrounding the transition between a layover and an in progress state (or vice versa).
- actual_block_id or actual_run_id should only be specified if actual_is_run_formal is TRUE.
- actual_run_id and actual_block_id are typically not both required, as blocks and runs are largely equivalent. Beware, however, of scheduled mid-route and terminal reliefs, during which the run changes but the block does not.
- actual_trip_id is rarely used. Either it is implied by the run or block ID in a formal inference case, or it is overly specific for informal inference (in which case actual_dsc is preferable, see below).
- For informal inference, actual_dsc is typically used to constrain the inference to a certain route and direction (since the traces do not explicitly accommodate route_id or direction).
- actual_dsc (or actual_trip_id) is typically specified only for trace rows with an actual_phase of IN_PROGRESS. The exception would be a trace that specifically enforces when, during a layover, the Inference Engine changes the inferred trip.
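Two of the guidelines above, the warm-up rows and the ‘slop’ around layover transitions, are mechanical enough to sketch in Python. The code below is illustrative only and assumes each trace row is a dict keyed by column name. The function names, the default warm-up of 3 rows, the default window of 2 updates, and phase names such as DEADHEAD_BEFORE are assumptions for this example rather than values taken from the Inference Engine.

```python
def blank_warmup(rows, warmup=3):
    """Clear actual_* labels in the first rows, keeping actual_is_run_formal=FALSE."""
    for row in rows[:warmup]:
        for name in row:
            if name.startswith("actual_"):
                row[name] = ""
        row["actual_is_run_formal"] = "FALSE"
    return rows


def phase_family(phase):
    """Map a phase to its prefix label, e.g. LAYOVER_DURING -> LAYOVER_."""
    for prefix in ("LAYOVER_", "DEADHEAD_"):
        if phase.startswith(prefix):
            return prefix
    return phase


def add_layover_slop(rows, window=2):
    """Widen actual_phase to a combined label around transitions to/from layover."""
    # Work from the original labels so one transition does not affect the next.
    original = [phase_family(row.get("actual_phase", "")) for row in rows]
    for i in range(1, len(rows)):
        before, after = original[i - 1], original[i]
        if not before or not after or before == after:
            continue  # blank label or no transition
        if "LAYOVER_" not in (before, after):
            continue  # only transitions involving a layover get slop
        for j in range(max(0, i - window), min(len(rows), i + window)):
            if original[j] in (before, after):
                rows[j]["actual_phase"] = f"{before}+{after}"
    return rows


# Hypothetical trace: three deadhead updates followed by three layover updates.
trace = [{"actual_phase": p} for p in ["DEADHEAD_BEFORE"] * 3 + ["LAYOVER_BEFORE"] * 3]
add_layover_slop(trace, window=1)
print([row["actual_phase"] for row in trace])
```

With `window=1`, only the last deadhead row and the first layover row become DEADHEAD_+LAYOVER_; the other rows keep their original labels. A trace meant to enforce exact transition timing should skip this step.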
With a trace created, consider adding it to an integration test to continuously verify the intended behaviour. See Creating an Integration Test.