FEATURED AUTHOR

John X. Wang

Senior Principal Functional Safety Engineer
Flex

John X. Wang is Senior Principal Functional Safety Engineer at Flex. Dr. Wang has authored or coauthored numerous books and papers on reliability engineering, risk engineering, engineering decision making under uncertainty, robust design and Six Sigma, lean manufacturing, green electronics manufacturing, cellular manufacturing, and industrial design engineering (inventive problem solving). His book Industrial Design Engineering: Inventive Problem Solving was featured as the ISE Magazine May 2017 Book of the Month.

Biography

John X. Wang is Senior Principal Functional Safety Engineer at Flex, where he is responsible for functional safety activities for products developed by the global engineering team. Dr. Wang previously served as Senior Principal Safety Engineer (Member of the Product Safety Board) at Cobham Mission Systems, where he was twice nominated for the Employee of the Month Award and was nominated for the C.A.R.E. (Certified, Ambassadors, Responsive, Exceptional) Award.

Dr. Wang received his PhD in reliability engineering from the University of Maryland, College Park, MD, in 1995. He then joined GE Transportation as an Engineering Six Sigma Black Belt, leading Design for Reliability (DFR) and Design for Six Sigma (DFSS) while teaching in the GE-Gannon University Graduate Co-Op programs and a National Technological University professional short course, and serving as a member of the IEEE Reliability Society Risk Management Committee. He has worked as a Corporate Master Black Belt at Visteon Corporation, Reliability Engineering Manager at Whirlpool Corporation, and Senior Principal Functional Safety Engineer at Panduit Corp. and Collins Aerospace. In 2009, Dr. Wang received an Individual Achievement Award while working as a Senior Principal Functional Safety Engineer at Raytheon Company. He joined GE Aviation Systems in 2010, where he was awarded the distinguished title of Senior Principal Functional Safety Engineer (CTH - Controlled Title Holder) in 2013. Dr. Wang has also been a Group Engineer (Senior Principal Functional Safety Engineer) leading computer vision programs and robotics development at Danfoss Power Solutions, where his work on the autonomous vehicle DAVIS was celebrated.

A Certified Reliability Engineer certified by the American Society for Quality, Dr. Wang has authored or coauthored numerous books and papers on reliability engineering, risk engineering, engineering decision making under uncertainty, robust design and Six Sigma, lean manufacturing, green electronics manufacturing, cellular manufacturing, and industrial design engineering. He has been affiliated with the Austrian Aerospace Agency/European Space Agency, the Vienna University of Technology, the Swiss Federal Institute of Technology in Zurich, the Paul Scherrer Institute in Switzerland, and Tsinghua University in China.

Having presented various professional short courses and seminars, Dr. Wang has performed joint research with the Delft University of Technology in the Netherlands and the Norwegian Institute of Technology.

Since "knowledge, expertise, and scientific results are well known internationally," Dr. Wang has been invited to present at various national & international engineering events.

Education

    PhD, Reliability Engineering, University of Maryland, College Park (1995)

    Certified Reliability Engineer (American Society for Quality)

Areas of Research / Professional Expertise

    Risk Engineering, Reliability Engineering, Lean Six Sigma, Decision Making Under Uncertainty, Business Communication, Green Electronics Manufacturing, Cellular Manufacturing, and Industrial Design Engineering

Personal Interests

    Poetry; he has authored published poems.


Books

Featured Title
Green Electronics Manufacturing, 1st Edition

Articles

Association for Manufacturing Excellence (AME) Target Online

A Single Line Opens the Way: Inventive Problem Solving


Published: Jan 03, 2017 by Association for Manufacturing Excellence (AME) Target Online
Authors: John X. Wang
Subjects: Engineering - General, Engineering - Industrial & Manufacturing

Poetic thinking is a life-cherishing force, a game-changing accelerator for Inventive Problem Solving in Industrial Design Engineering. One line or a few lines within a poem can upend your way of designing industrial products. It can help you recover and reclaim a way of designing products, break up acceptance of the status quo, or, put more fiercely, a single line can tear down a pre-fabricated understanding of your customers.

EETimes

Achieve robust designs with Six Sigma


Published: May 20, 2010 by EETimes
Authors: John X. Wang

Developing "best-in-class" robust designs is crucial for creating competitive advantages. Customers want their products to be dependable--"plug-and-play." They also expect them to be reliable--"last a long time." Furthermore, customers are cost-sensible; they anticipate that products will be affordable. Becoming robust means seeking win–win solutions for productivity and quality improvement. So far, robust design has been a "road less traveled."

informIT

Achieving Robust Designs with Six Sigma: Dependable, Reliable, and Affordable


Published: Oct 06, 2005 by informIT
Authors: John X. Wang
Subjects: Engineering - General

Developing "best-in-class" robust designs is crucial for creating competitive advantages. Customers want their products to be dependable—"plug-and-play." They also expect them to be reliable—"last a long time." Furthermore, customers are cost-sensible; they anticipate that products will be affordable. Becoming robust means seeking win–win solutions for productivity and quality improvement.

International Journal of General Systems

Complexity as a measure of the difficulty of system diagnosis


Published: Mar 01, 1996 by International Journal of General Systems
Authors: John X. Wang
Subjects: Engineering - General

Complexity as a measure of the difficulty of diagnosis, or troubleshooting, of a system is explored in this paper. The paper finds that system complexity depends on the system structure as well as on the number of components that make up the system. The resulting complexity function captures an intrinsic feature of the system and can be used as a measure of system complexity, which is significant for reliability prediction and allocation.

Reliability Engineering & System Safety

General inspection strategy for fault diagnosis—minimizing the inspection costs


Published: Mar 01, 1995 by Reliability Engineering & System Safety
Authors: Rune Reinertsen, John X. Wang
Subjects: Engineering - General

In this paper, a general inspection strategy for system fault diagnosis is presented. This general strategy provides the optimal inspection procedure when the inspections require unequal effort and the minimum cut set probabilities are unequal.

Reliability and Maintainability Symposium, 1995. Proceedings., Annual

Time-dependent logic for goal-oriented dynamic-system analysis


Published: Jan 19, 1995 by Reliability and Maintainability Symposium, 1995. Proceedings., Annual
Authors: Marvin L. Roush, John X. Wang
Subjects: Engineering - General

This paper highlights an error that can arise in analyzing accident scenarios when time dependence is ignored. A simple solution is provided without the need for more powerful (and hence more complex) dynamic event tree techniques. The solution is a straightforward extension of the current fault-tree/event-tree approach, incorporating a time-dependent algebraic formalism into fault-tree/event-tree analysis.

Reliability Engineering & System Safety

Optimal inspection sequence in fault diagnosis


Published: Mar 01, 1992 by Reliability Engineering & System Safety
Authors: John X. Wang
Subjects: Engineering - General

The average number of inspections needed in fault diagnosis to find the actual minimal cut sets (MCS) causing a system failure depends on the inspection sequence adopted. Inspecting the component whose Fussell-Vesely importance is nearest to 0.5 leads to the discovery of the actual MCS in a minimum number of inspections.

Reliability Engineering & System Safety

Fault tree diagnosis based on Shannon entropy


Published: Feb 01, 1992 by Reliability Engineering & System Safety
Authors: John X. Wang

A fault tree diagnosis methodology that can locate the actual MCS (minimum cut set) in the system in a minimum number of inspections is presented. The result reveals that, contrary to what is suggested by traditional diagnosis methodology based on probabilistic importance, inspection of a basic event whose Fussell-Vesely importance is nearest to 0.5 best distinguishes the MCSs.

Reliability Engineering & System Safety

A practical approach for phased mission analysis


Published: Apr 30, 1989 by Reliability Engineering & System Safety
Authors: Xue Dazhi, John X. Wang

This paper presents a new treatment of phased mission problems. A generalized intersection and union concept is developed, providing a pragmatic tool to guide phased mission analysis. A practical approach for incorporating phased mission analysis into accident sequence quantification, without using the basic event transformation, is presented.


News

COVID-19: Statistics and Physics of Pandemic (PoP) for mitigating Risk and Uncertainty

By: John X. Wang
Subjects: Biomedical Science, Business & Management, Disaster Planning & Recovery, Engineering - General, Healthcare, Life Science, Medicine, Physics, Statistics

  • COVID-19 mortality data is a reliable indicator of the extent of this pandemic.

  • One of the assumptions is that those who have died are tested and the cause of death is captured accurately.

  • However, mortality data comes with at least a three-week time lag from the time of infection (observed latency).

  • Its usefulness as a predictor of the current number of positive COVID-19 cases is limited in that sense.

Three-parameter Generalized Gamma distribution

Mortality data can be useful for epidemiology model fitting and testing once a certain number of deaths have occurred. In this article, COVID-19 mortality is modeled using the three-parameter Generalized Gamma distribution, a flexible family comprising the following four

  1. Weibull

  2. Gamma

  3. Exponential

  4. Lognormal

statistical distributions. However, COVID-19 mortality data is sometimes underreported for a variety of reasons in certain regions, especially remote areas. Reasons include limited access to the facilities and resources needed to perform the relevant tests, especially when death occurs outside of health facilities.

The four distributions of the three-parameter Generalized Gamma family are fitted using SAS. The Akaike Information Criterion (AIC), a goodness-of-fit measure, suggests the Gamma distribution as the best distribution to describe COVID-19 deaths. The Weibull distribution is a very close competitor and is suggested as the best by the other goodness-of-fit measures.
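
As a language-agnostic illustration of this comparison (the article itself uses SAS), the following Python sketch fits the four special cases to simulated time-to-death data and ranks them by AIC; the data and parameter values are invented for illustration only.

# Minimal sketch: fit the four candidate distributions and compare AIC.
# The data are simulated, not actual COVID-19 mortality records.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
days_to_death = rng.gamma(shape=2.0, scale=10.0, size=500)  # hypothetical sample

candidates = {
    "Weibull": stats.weibull_min,
    "Gamma": stats.gamma,
    "Exponential": stats.expon,
    "Lognormal": stats.lognorm,
}

for name, dist in candidates.items():
    params = dist.fit(days_to_death, floc=0)            # fix location at zero
    loglik = dist.logpdf(days_to_death, *params).sum()  # maximized log-likelihood
    k = len(params) - 1                                 # location was not estimated
    aic = 2 * k - 2 * loglik                            # lower AIC = better fit
    print(f"{name:12s} AIC = {aic:.1f}")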

Physics of Pandemic (PoP) for mitigating Risk and Uncertainty

Based on publicly available data, this article gives an indication of the spread of COVID-19 in certain regions, especially remote areas with limited access to the facilities and resources needed to perform the relevant tests. The Gamma distribution, with an increasing but concave hazard rate, best describes statistically how the disease is spreading. These results provide helpful insight into the Physics of Pandemic (PoP) for mitigating risk and uncertainty.

Decision Making under Uncertainty: Analyzing efficacy endpoints in the diagnosis and treatment of COVID-19 with SAS

By: John X. Wang
Subjects: Biomedical Science, Engineering - General, Healthcare, Life Science, Statistics

Survival Analysis

  • Unadjusted (Kaplan Meier & Log-Rank Test)

    • SAS Proc LIFETEST

  • Adjusted (Cox proportional hazards regression model)

    • SAS Proc PHREG

    • Selection of covariates depends on the indication and treatment setting, e.g., type of and/or response to prior therapy

    • Examples of other possible covariates

If you use SAS, for example, the standard procedures for survival analysis are PROC LIFETEST and PROC PHREG (for Cox multivariate analysis).

Time-to-event endpoints, including all-cause mortality and cardiovascular-related mortality, will be analyzed using SAS Proc Lifetest; p-values will be from the log-rank test. For cardiovascular-related mortality, subjects who died for reasons other than cardiovascular (including "indeterminate") will be designated as censored at the time of death. Subjects who discontinue for transplantation (i.e., heart transplantation and combined heart and liver transplantation) or a cardiac mechanical assist device will be handled in the same manner as death (as done in the primary analysis).

Kaplan-Meier survival curves for each treatment group, along with median survival times (if applicable), will be presented. Kaplan-Meier product limit estimators will be generated. The number of subjects at risk, number of events, and number of censored observations through 6, 12, 18, 24, and 30 months will be summarized using the "Method=Life" option in PROC LIFETEST.

Time-to-event endpoints will also be analyzed using a Cox proportional hazards model (PROC PHREG) with treatment, Transthyretin (TTR) genotype (variant and wild-type), and New York Heart Association (NYHA) baseline classification (NYHA Classes I and II combined, and NYHA Class III) as factors.

For the analyses by NYHA baseline classification, the Cox proportional hazard model will include treatment and TTR genotype. For the analyses by TTR genotype, the model will include treatment and NYHA baseline classification. The Kaplan-Meier survival curves will also be generated for the analyses by NYHA baseline classification and TTR genotype.

Sample SAS code for the survival analysis is given below.

PROC LIFETEST DATA=xxx PLOTS=(s) GRAPHICS;  /* Kaplan-Meier estimates and survival plot */
   TIME dur*status(0);                      /* dur = time variable; status=0 marks censoring */
   STRATA trt;                              /* compare treatment groups; log-rank test */
RUN;

PROC PHREG DATA=xxx;                        /* Cox proportional hazards regression */
   MODEL dur*status(0) = trt genotype NYHAbase / TIES=EXACT;
RUN;

SAS Example: An investigator conducted a small safety and efficacy study comparing treatment to placebo with respect to adverse reactions. The data are as follows:

 

                      Treatment   Placebo
adverse reaction             12         4
no adverse reaction          32        40

***********************************************************************
* This is a program that illustrates the use of PROC FREQ in SAS for  *
* performing statistical inference with the odds ratio in a two-way   *
* frequency table.                                                    *
***********************************************************************;

proc format;
value groupfmt 0='placebo' 1='treatment';
value noyesfmt 0='no' 1='yes';
run;

data adverse_reactions;
input group advreact count;
format group groupfmt. advreact noyesfmt.;
cards;
0 0 40
0 1  4
1 0 32
1 1 12
;
run;

proc freq data=adverse_reactions;
tables group*advreact/chisq measures;
exact chisq measures;
weight count;
title "Statistical Inference Using the Odds Ratio";
run;
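
For reference, the sample odds ratio implied by this table is (12 × 40) / (4 × 32) = 480 / 128 = 3.75; the MEASURES option in PROC FREQ reports this estimate together with its confidence limits, and the EXACT statement adds exact inference.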

Conclusions

  • Despite its complexity, "stable" standards exist for efficacy evaluation based on techniques for Decision Making under Uncertainty.

  • The use of efficacy indicators may differ from one indication to another; managing, deriving, and analyzing efficacy endpoints in the diagnosis and treatment of COVID-19 requires a clear understanding of the disease.

  • The use of efficacy endpoints in drug approval may change again with the idea of targeting therapies based on molecular profiling.

Risk Engineering: COVID-19, Efficacy Endpoints, and Safety

By: John X. Wang
Subjects: Biomedical Science, Business & Management, Disaster Planning & Recovery, Emergency Response, Engineering - General, Healthcare, Life Science, Statistics

  • Efficacy is the ability to perform a task to a satisfactory or expected degree.

  • Obviously, a drug (or any medical treatment) should be used only when it will benefit a patient.

  • Benefit takes into account both the drug's ability to produce the desired result (efficacy) and the type and likelihood of adverse effects (safety).

  • Risk engineering is the application of engineering skills and methodologies to the management of risk. It involves hazard identification, risk analysis, risk evaluation and risk treatment.

Standardization of efficacy endpoints

Standardization of efficacy endpoints across clinical trials may facilitate comparative evaluation of vaccines for deployment programs, provided that such comparisons are not confounded by differences in trial design or study populations.

Statistical Considerations and Safety Consideration Provided by FDA

The United States Food and Drug Administration (FDA) issued new guidance for industry related to the Coronavirus Disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). FDA issued the guidance, titled "Development and Licensure of Vaccines to Prevent COVID-19," on June 30, 2020, with the following statistical and safety considerations.

  • Statistical Considerations: To ensure that a widely deployed COVID-19 vaccine is effective, the point estimate for the primary efficacy endpoint in a placebo-controlled efficacy trial should be at least 50%, and the statistical success criterion should be that the lower bound of the appropriately alpha-adjusted confidence interval around the primary efficacy endpoint point estimate is >30% (see the sketch after this list).

  • Safety Consideration: The general safety evaluation of COVID-19 vaccines, including the size of the safety database to support vaccine licensure, should be no different than for other preventive vaccines for infectious diseases.
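
As a rough illustration of the statistical criterion above (not part of the FDA guidance itself), the sketch below assumes a 1:1 randomized trial with equal follow-up in both arms: conditional on the total case count, the number of vaccine-arm cases is binomial with proportion p = IRR/(1 + IRR), where VE = 1 - IRR, so an exact Clopper-Pearson interval for p transforms into a confidence interval for VE. The case counts are hypothetical.

# Hedged sketch: hypothetical counts, 1:1 randomization, equal person-time.
from scipy.stats import beta

cases_vaccine, cases_placebo = 8, 86
n = cases_vaccine + cases_placebo
alpha = 0.05

# Exact (Clopper-Pearson) confidence limits for the binomial proportion p
p_lo = beta.ppf(alpha / 2, cases_vaccine, n - cases_vaccine + 1)
p_hi = beta.ppf(1 - alpha / 2, cases_vaccine + 1, n - cases_vaccine)

def ve(p):
    return 1 - p / (1 - p)  # VE = 1 - IRR, with IRR = p / (1 - p)

print(f"VE point estimate: {ve(cases_vaccine / n):.1%}")   # about 90.7%
print(f"95% CI for VE: ({ve(p_hi):.1%}, {ve(p_lo):.1%})")  # VE decreases in p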

Vaccine Efficacy (VE)

COVID-19 vaccines, like influenza vaccines, may reduce the risk of disease and the severity of symptoms following infection. If this is the case, the endpoint of clinically symptomatic COVID-19 of any severity grade risks including in the efficacy analysis cases of vaccine-induced attenuated COVID-19, defined as mild, residual COVID-19 upon infection in a vaccinated person. This lowers the VE estimate and reduces its precision. Using COVID-19 with signs or symptoms of pneumonia as the primary efficacy endpoint may result in earlier demonstration of VE by limiting the number of vaccine-induced attenuated COVID-19 cases included in the primary efficacy analysis, despite there being fewer overall cases.

All developers should consider the collection of detailed data on clinical signs and symptoms of COVID-19 cases ascertained during a clinical trial. This will allow possible iterations of clinical efficacy case definitions to be assessed for their performance characteristics and suitability for clinical efficacy endpoint trials. Of note, any such assessment should consider the sensitivity and specificity of the diagnostic assay used.

Evaluating the Efficacy of COVID-19 Vaccines

A large number of studies are being conducted to evaluate the efficacy and safety of candidate vaccines against novel coronavirus disease 2019 (COVID-19). Most Phase 3 trials have adopted virologically confirmed symptomatic COVID-19 disease as the primary efficacy endpoint, although laboratory-confirmed SARS-CoV-2 infection is also of interest. In addition, it is important to evaluate the effect of vaccination on disease severity. It is possible that a vaccine is much more effective in preventing severe than mild COVID-19. Thus, it is essential to evaluate the effect of vaccination on severe COVID-19. However, a large sample size is likely required for a trial using a severe COVID-19 endpoint.

Risk Engineering: prevent Enhanced Respiratory Disease (ERD) proactively

It should be noted that data from studies in animal models given certain vaccine constructs against other coronaviruses (SARS-CoV and MERS-CoV) have raised concerns about a theoretical risk of COVID-19 vaccine-associated enhanced respiratory disease (ERD). Risk engineering seeks to prevent ERD proactively.

On the River of Industrial Design Engineering: Flow of Poetic Thinking

By: John X. Wang
Subjects: Engineering - General, Engineering - Industrial & Manufacturing

Flow of Poetic Thinking. Flow of the River. Flow of Engineering Thinking. These analogies are explored in my latest book, "Industrial Design Engineering: Inventive Problem Solving".

During the holiday season, when my family and I had just moved to a new home in Farmington Hills, Michigan, I reflected on the journey from the great rivers of the Danube, Aare/Rhine, and Amazon to our sweet new home.

My poem, “Crossing with Blue Moonlight,” appeared in Poetry Quarterly: The Winter/Summer 2012 edition. Being written in Grand Rapids, Michigan, the poem recollected the journey from the great rivers of Danube, Aare/Rhine, and Amazon, to the great rivers of Mississippi and Grand River, the longest river in Michigan.

Crossing with Blue Moonlight

“Are you drunken

with Sonata Moonlight

on the Beautiful Blue Danube?

Do you remember the blueberry hill?”

I forget

because it’s decades ago.

Can you bring me a piece of melody

and a cup of Jasmine tea?

“Are you drunken

with Aare water

under the moonlit dark blue sky?

Do you remember the Blues Trail?”

I forget

because it’s oceans away.

Can you bring me a piece of melody

and streams of whiskey?

“Are you drunken

with the turbulent maelstrom

over the river of Amazon?

Do you remember the crossroads

stretching under dark blue sky?

Do you remember the peaceful country roads

crossing with blue moonlight?”

Here, Aare water refers to the Aare River in Switzerland, which joins the Rhine near the Black Forest. My wife Lisa and I lived by the river when I worked at the Paul Scherrer Institute in Switzerland. We also lived in Vienna, Austria, by the beautiful Blue Danube, when I was affiliated with the Austrian Aerospace Agency/European Space Agency.

Merry Christmas and Happy New Year! All the best for the holiday season, John

Highly Accelerated Life Testing for Green Electronics Manufacturing

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Materials Science, Nanoscience & Technology, Physics

HALT procedures vary from lab to lab but are typically performed in a manner similar to the procedure summarized below. The HALT procedure is divided into five stages:

  • Stage 1 – Temperature Step Stresses,
  • Stage 2 – Temperature Ramps,
  • Stage 3 – Vibration Step Stresses,
  • Stage 4 – Combined Temperature & Vibration Stresses, and
  • Stage 5 – Temperature Destruct Limits.

Stage 1 is used to determine the HALT Operational Limits for temperature. The goal is not to cause destruction in Stage 1; however, the operational and destruct limits sometimes occur simultaneously. The HALT Destruct Limits for temperature and vibration are typically found in Stages 3 to 5.

Temperature Step Stresses – Stage 1 (Figure 1)


Figure 1 – Stage 1 Temperature Steps

Stage 1 starts with Cold Step Stresses. Testing begins at 10 °C and decreases in 10 °C increments until the lower operating limit is determined or the chamber minimum temperature of -100 °C is reached.

  • The dwell time at each step is defined as the point when stabilization and saturation of the device and its components is achieved, typically 15 to 20 minutes.
  • Functional testing will occur during this stabilization period.
  • The dwell time will be determined from temperature measurements obtained from thermocouples placed on the product.
  • Thermocouple data from individual components that can be a source of heating or cooling are not used to define the dwell time.

The second part of Stage 1 is Hot Step Stresses. Testing starts at 40 °C and increases in 10 °C increments until either the upper operating limit is determined or the chamber maximum temperature of +200 °C is reached.

  • The dwell time will be established using the same procedure as for the Cold Step testing segment.
  • Note that the upper and lower temperatures may be reduced if material limitations (e.g., solder melting or plastic softening) would be exceeded.
  • Also, it is good practice to perform a functional test of the product at room temperature (about 25 °C) before starting a HALT to get a baseline measurement of its performance. (A sketch of the Stage 1 stepping logic follows this list.)
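
The Stage 1 stepping logic can be summarized in a few lines of code. The sketch below is illustrative only, using the nominal values quoted above (10 °C steps, chamber limits of -100 °C and +200 °C); in a real HALT the loop stops as soon as the operating limit is found.

# Illustrative Stage 1 set-point generator (nominal values from the text above).
def temperature_steps(start_c, step_c, chamber_limit_c):
    """Yield temperature set-points from start_c toward chamber_limit_c."""
    t = start_c
    while (step_c < 0 and t >= chamber_limit_c) or (step_c > 0 and t <= chamber_limit_c):
        yield t          # dwell 15-20 minutes and run functional tests here
        t += step_c

cold_steps = list(temperature_steps(10, -10, -100))  # 10, 0, ..., -100 °C
hot_steps = list(temperature_steps(40, +10, +200))   # 40, 50, ..., +200 °C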

Temperature Ramps – Stage 2 (Figure 2)


Figure 2 – Stage 2 Temperature Ramps

  • During this Stage, temperature cycles with rapid transition rates (ramps) will be applied to the product.
  • The chamber air temperature will be changed at 60 °C/minute.
  • The hot and cold temperatures will typically range from 10 °C above the lower operating limit to 10 °C below the upper operating limit.
  • These 10 °C reductions allow for overshooting caused by changing the temperature extremely fast. The dwell time established in Stage 1 will normally be used at each hot and cold temperature. Five cycles are applied.

Vibration Step Stresses – Stage 3 (Figure 3)


Figure 3 – Stage 3 Vibration Steps

  • A broadband vibration spectrum will be applied through the HALT chamber table.
  • The HALT chamber table should apply random vibration energy to 10,000 Hz in 6 DOF (degrees of freedom).
  • Vibration step stresses will start at 10 Grms and increase in 5 Grms steps until the operating limit, the destruct limit, or the chamber maximum vibration level of 60 Grms is reached.
  • At 40 Grms and above, the vibration will be returned to 10 Grms for 1 minute to detect failures that could be hidden by the extreme forces occurring at higher vibration levels.
  • The dwell time at each step will be approximately 15 minutes to accumulate fatigue damage.
  • Grms is measured over a 5 kHz bandwidth.
  • This test is performed at room temperature, approximately 20 to 25 °C.

Combined Temperature & Vibration Stresses – Stage 4 (Figure 4)


Figure 4 – Stage 4 Combined Temperature & Vibration

  • Combined temperature and vibration stresses are applied in Stage 4. During this Stage, the chamber air temperature is changed at 60 °C/minute.
  • The hot and cold temperatures are the same as those used in Stage 2.
  • The dwell time at each hot and cold temperature will be the same as used in Stage 2.
  • The vibration level is fixed during each temperature step; it begins at 10 Grms and increases in 10 Grms steps until the operating limit, the destruct limit, or the chamber maximum vibration level of 60 Grms is reached.

Temperature Destruct Limits – Stage 5 (Figure 5)


Figure 5 – Stage 5 Temperature Destruct

  • The cold temperature destruct limit is found by starting at the lower operating limit (found in Stage 1) and decreasing the temperature in 10 °C increments until either the low temperature destruct limit or the chamber minimum temperature of -100 °C is reached.
  • The hot temperature destruct limit is found by starting at the upper operating limit (found in Stage 1) and increasing the temperature in 10 °C increments until either the hot temperature destruct limit or the chamber maximum temperature of 200 °C is reached.
  • The dwell time established in Stage 1 is typically used; however, dwell times may be reduced if the product stops operating or if failures occur. If the product fails to operate, the temperature will be reduced or increased toward 20 °C to see if the product recovers.
  • If the unit is non-operational after stabilizing at 20 °C, the product will be repaired (if practical) so that the test temperatures can be expanded. If it is not practical to repair the product, Stage 5 will be terminated.

Power On/Off Cycling

  • Powered on/off cycling is recommended at every temperature or vibration step to create additional electrical stresses.
  • These power cycles will be conducted quickly but sufficient time will be allowed so as not to create artificial excessive overloads and failure modes.
  • Powered on/off cycling may not be appropriate for every product as it may create artificial stresses and failure modes, or the product may take too long to power up.

Test Samples

  • The typical number of products tested simultaneously is 1 to 4 as practical based on the cost and size of the products.
  • Additional backup units (not under test) may be needed as spare parts to repair the product and continue testing if a non-repairable failure occurs.

Test Reporting

  • High-quality test reports written by DES will contain, at a minimum, plots similar to those shown in Figures 1 to 5. These plots will include the measured chamber control temperature and vibration and the product response temperatures and vibrations, along with an indication of where each failure or significant event occurred during the HALT.
  • Additionally, plots of test voltages, currents, pressures, or other applicable parameters will be included as applicable.
  • The report will also contain identification of samples, a list of

HASS Testing for Industrial Design Engineering

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Chemical, Engineering - Civil, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Engineering - Mining, Nanoscience & Technology, Physics

What is HASS Testing?

  • HASS is an acronym for Highly Accelerated Stress Screening.

  • HASS is performed during manufacturing on production products or components.

  • It is a screening method used to expose manufacturing defects that would cause a failure in normal field environments including shipping, storage and use.

Two very important concepts to be applied in HASS are:

  1. HALT must be performed before HASS, so that the product design is robust before you enter a HASS screening program.

  2. A Proof of Screen (POS) (sometimes called Safety of Screen) must be used to validate the HASS and prove that sufficient life is left for a normal use lifetime.

During HASS, stresses may be higher than normal operation in order to precipitate defects in a short amount of time; however, the stresses remain within the capability of the design as proven by the HALT.

  • HASS stresses are also typically more aggressive than those used in traditional Environmental Stress Screening (ESS) which makes HASS a more efficient screen than ESS.

    • HASS screens are typically an hour to a few hours whereas ESS screens may take a day to a few days.

  • The types of stresses used for HASS are similar to those used in HALT.

  • HASS uses combined temperature cycling, random vibration and electrical loading/monitoring.

  • HASS screens are performed in the same type of chamber that is used for HALT.

    • The vibration in HASS is randomly applied over a broad frequency range producing energy to 10,000 Hz in 6 degrees of freedom.

What is a Typical HASS Profile?

A typical HASS profile is shown in Figure 1.

  1. The first part of the HASS profile (0 to ≈ 40 minutes) is a precipitation screen used to precipitate latent defects.

    1. The precipitation screen consists of high levels of steady vibration and temperature rates of change of approximately 40 – 60 °C/minute.

  2. The second part of the HASS profile (40 – 75 minutes) is a detection screen used to detect the precipitated defects.

    1. The detection screen consists of lower levels of modulated vibration and slower temperature rates of change of approximately 5 – 10 °C/minute. It is sometimes easier to find intermittent defects during the detection screen because the stress levels are lower and the temperature rate of change is slower. (An illustrative encoding of this profile follows Figure 1.)

Figure 1.  Sample HASS profile
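
One convenient way to think about this profile is as two configuration records, one per screen phase. The sketch below simply encodes the values from the description above; it is illustrative, not a real chamber recipe.

# Illustrative encoding of the two-phase HASS profile described above.
hass_profile = {
    "precipitation": {           # minutes 0-40: precipitate latent defects
        "window_min": (0, 40),
        "vibration": "high, steady",
        "ramp_c_per_min": (40, 60),
    },
    "detection": {               # minutes 40-75: detect precipitated defects
        "window_min": (40, 75),
        "vibration": "lower, modulated",
        "ramp_c_per_min": (5, 10),
    },
}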

Why Use HASS?

  • To produce a rugged, reliable product

  • To regularly audit or screen your production components to check for and improve manufacturing quality

  • To produce a more rugged/reliable product with fewer field failures and lower warranty expenses

What are the Benefits of HASS?

HASS screens are shorter than traditional Environmental Stress Screening (ESS) methods. This results in reduced time and cost to screen production. Since the screens are shorter and more aggressive than ESS:

  • Defects are typically found sooner

  • Problems are corrected faster

  • Fewer defects reach the customers’ hands

Overall product quality and manufacturing process control are improved when using HASS. Reliable products lead to happy customers. Happy customers lead to increased profits/market share!

What Every Engineer Should Know About Highly Accelerated Stress Screening (HASS)

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Chemical, Engineering - Civil, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Engineering - Mining, Environmental Science, Nanoscience & Technology, Physics, Statistics


HASS Testing: Highly Accelerated Stress Screening

Highly Accelerated Stress Screening (HASS) is a technique for production screening that rapidly exposes process or production flaws in products. Its purpose is to expose a product to optimized production screens without affecting product reliability. Unlike HALT, HASS uses nondestructive stresses, combining temperature extremes and rapid temperature change rates with vibration.

  • HASS (Highly Accelerated Stress Screening) is a very effective production screening method used to find manufacturing defects and problems and to monitor processes.

  • The proper use of HASS testing can reduce screening time & costs, improve process control and improve overall quality.

Equipment Capabilities:

  • Temperatures from -100°C to +200°C (-148°F to +392°F).

  • Temperatures can be raised or lowered up to 60°C (140°F) per minute.

  • Vibration input levels up to 60 Grms with six degrees of freedom.

  • Additional loads such as electrical, fluid pressure, etc. can be applied to parts inside the chamber simultaneously with temperature and vibration.

Why HASS Works

  • Early life product failures are often attributable to the inherent variability of manufacturing processes (solder and component changes, etc.).

    • Even a well-designed product can suffer high early failure rates when process-induced failures are not found and fixed before the product reaches the customer.

  • HASS testing (Highly Accelerated Stress Screen) is an accelerated reliability screen that can reveal latent flaws not detected by ESS (Environmental Stress Screening), burn-in and other test methods.

  • HASS testing uses stresses beyond specification, but within the capability of the design as determined by the HALT. The combination of variable thermal stress and simultaneous vibration, in conjunction with product-specific stresses, finds those defects and marginal products that used to show up as "out-of-box" infant failures and increased warranty costs.

  • Because the stresses in HASS are more rigorous than those delivered by traditional approaches, HASS testing substantially accelerates early discovery of manufacturing process issues.

  • Reliability engineers can then correct the variations that would otherwise lead to field failures and virtually eliminate shipment of marginal product.

  • SPC (statistical process control) charts reveal that internal and external process changes, such as parts substitutions, will affect product reliability.

  • By utilizing HASS testing to detect weak links introduced into the product during the manufacturing process, newfound failure modes can be analyzed, understood and eliminated, restoring process integrity.

  • A much more reliable product reaches customers and far fewer warranty claims result.

HASS Testing (Highly Accelerated Stress Screening): a proven test method

  • HASS Testing (Highly Accelerated Stress Screening) is a proven test method developed to find Manufacturing/Production-process induced defects in electronics and electro-mechanical assemblies before those products are released to market.

  • HASS is a powerful testing tool for improving product reliability, reducing warranty costs and increasing customer satisfaction.

After a product has undergone HALT Testing, HASS Testing can be deployed in the production process. Once the product is made robust (through the application of HALT), the next logical step is to monitor the production processes using a technique called Highly Accelerated Stress Screening (HASS).

  • The goal of a HASS is to induce failure modes that can be inherent in or introduced by the production process.

  • HASS has been proven effective in screening out failures that may have gone undetected in the burn-in testing process.

  • A product that passes normal production tests but fails in a HASS would probably have failed early after product release, increasing warranty costs.

HASS (Highly Accelerated Stress Screening) is used to improve the robustness/reliability of a product through a test-fail-fix process in which the applied stresses may be beyond the specified operating limits (OL) determined by HALT. It is applied to 100% of the manufactured units.

There are two parts to HASS Testing:

  1. HASS Development/Proof-of-screen (POS)

  2. HASS Production Screen.

HASS Proof-of-screen (POS)

  • Since HASS levels are more aggressive than conventional screening tools, a POS procedure is used to establish their effectiveness in revealing production-induced defects.

  • A POS is vital to determine that the HASS stresses are capable of revealing production defects, but not so extreme as to remove significant life from the test item.

HASS Production Screen

  • Instituting HASS to screen the product is an excellent way to maintain a high level of robustness; it also reduces the test time required to screen a product, resulting in long-term savings.

  • Ongoing HASS screening assures that any weak components or manufacturing process degradations are quickly detected and corrected.

  • HASS is not intended to be a rigid process that has an endpoint. It is a dynamic process that may need modification or adjustment over the life of the product.

HASS Testing Benefits:

  • Assures manufacturing process and workmanship integrity.

  • Verifies the integrity of mechanical interconnects and component tolerance compatibility.

  • Identifies and precludes the escape of potential early-life product failures.

  • Decreases product infant mortality and increases reliability.

  • Detects and corrects design and process changes.

  • Detects and corrects component variation.

  • Reduces production time and cost.

  • Increases out-of-box quality and field reliability.

  • Decreases field service and warranty costs.

  • Finds manufacturing process problems.

What Every Engineer Should Know About HALT and HASS

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Chemical, Engineering - Civil, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Engineering - Mining, Environmental Science, Nanoscience & Technology, Physics


What is HALT and HASS?

  • Highly Accelerated Life Testing (HALT) is a process that utilizes a stepped stress approach in exposing your product to diverse accelerated stresses to discover the physical limitations of a design and product reliability.

    • Manufacturers can discover their products' failure modes and determine their failure mechanisms.

  • Highly Accelerated Stress Screening (HASS) is a production quality assessment to quickly and efficiently identify any weaknesses that the product may have inadvertently developed during the manufacturing stage.

Both are "Test, Analyze, Verify, and Fix" approaches - with Root Cause Analysis along the way!

Why should you perform HALT/HASS?

  • Highly accelerated life tests find weaknesses and flaws early in the design phase by testing to failure, while highly accelerated stress screening (HASS) catches manufacturing defects on production parts prior to installation without reducing the part's life.

  • HALT also provides valuable data for reliability metrics at the component level. The test results benefit customers, protect the manufacturer's reputation, and prevent costly re-design later in the product development cycle.

What is unique about a test chamber?

  • Unlike other environmental simulation chambers, HALT and HASS chambers offer fast temperature ramp rates (up to 60 °C per minute) and combine thermal, vibration, and shock simulation in a single apparatus.

  • Vibration levels up to 50 Grms can be applied simultaneously in three linear axes (X, Y, and Z) and three rotational axes (pitch, roll, yaw).

How do you specify a HALT/HASS test?

HALT and HASS profiles are composed of several segments defined by the product's intended end-use environment:

Cold Step Stress and Hot Step Stress:

  • Incrementally decreasing or increasing the temperature to identify the product limitations.

    • Select start and end points based on the end-use environment for the product reliability and physical limitations of the components.

Vibration Step Stress:

  • Incrementally increasing the vibration levels while pausing on the way up to see how your product responds.

    • Begin at a set Grms level, dwell for a specified duration, then increase to a higher amplitude and repeat the cycle to initiate failures.

Rapid Thermal Transitions (or Thermal Shock):

  • Subjecting your product to pre-defined maximum and minimum temperatures and rapidly cycling between them.

Combined Environment:

  • Simulating real world conditions where your product will be exposed to multiple random environments simultaneously.

Common HALT/HASS acronyms used to specify test profiles:

  • "Grms" - Vibrational G's in the root mean square, where "G" is acceleration due to gravity.

  • "PSD" - Power Spectral Density - In a random vibration spectrum, it is the measurement of the amplitude and frequency.

  • "LOL" and "LDL" - In the cold temp step stress stage, they are the "Lower Operating Limit" and "Lower Destructive Limit"

  • "UOL" and "UDL" - The "Upper Operating Limit" and "Upper Destruct Limit " occur in the hot temp step stress stage.

What Every Engineer Should Know About Highly Accelerated Life Testing (HALT)

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Chemical, Engineering - Civil, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Engineering - Mining, Nanoscience & Technology, Physics

Assessing the robustness of an electronic product is integral to successful design and performance. HALT is an important testing tool for this purpose, and its effectiveness can be maximized through careful planning prior to testing and detailed execution.

1. DEVELOPING A HALT PLAN

Setting clear expectations and directives for conducting HALT is a multi-step process that starts with bringing the design engineers together to:

  • Develop a test plan based on Physics of Failure (PoF), clearly defining objectives, expected environments, and sample availability.

  • Determine the applicable stresses, such as temperature, vibration, and/or shock.

  • Decide how many devices (known as samples) are available for testing. Generally, one to five samples are used.

  • Select the functional tests to be run during development, such as what the device should be doing, which circuits should be active, and what codes/sensors should be gathering data.

  • Identify which parameters need to be monitored based upon the expected environment.

  • Define what constitutes a failure.

In conjunction with developing the foundational outline, two key areas must be addressed:

APPLICABLE STRESSES

Select the appropriate stresses for testing:

  • Vibration

  • High temperature

  • Low temperature

  • Voltage/frequency margining

  • Power cycling

  • Combined stresses, i.e., temperature and vibration

STEP STRESS APPROACH

For each intended stress, clearly delineate:

  • The starting stress point

  • The amount by which to increment the intended stress in each step

  • The duration of each step

  • The device or equipment limit for that stress, at which the HALT ends (see the parameter sketch after this list)
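
One way to capture these parameters is a small record per stress, as sketched below. The field names are invented for illustration, and the vibration values are taken from the HALT post earlier on this page.

# Illustrative container for the step-stress parameters listed above.
from dataclasses import dataclass

@dataclass
class StepStress:
    stress: str        # e.g., "cold temperature", "vibration"
    start: float       # starting stress point
    increment: float   # amount added (or subtracted) at each step
    dwell_min: float   # duration of each step, in minutes
    limit: float       # device or equipment limit that ends the HALT

vibration = StepStress("vibration", start=10.0, increment=5.0,
                       dwell_min=15.0, limit=60.0)  # Grms values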

2. SETTING UP A HALT

For accurate results, particular attention must be paid to the HALT configuration:

  • Design a vibration fixture to ensure vibrational energy is being transmitted into the product.

  • Design air ducting to ensure thermal energy is being transmitted into the product.

  • Tune chamber for the sample being tested.

  • Determine locations for thermocouples to monitor temperature.

  • Set up all functional test equipment and cabling.

3. CONDUCTING A HALT

HALT is comprehensive and encompasses several testing phases, each with specific parameters to follow.

THERMAL STEP STRESS

Thermal Step Stress testing applies incremental temperature stress levels throughout the product lifecycle in order to identify product failure modes.

To do so:

  • Begin with cold step stress, followed by hot step stress.

  • Initially use 10°C increments, decreasing to 5°C increments as limits are approached.

  • Set the dwell time minimum at 10 minutes plus the time needed to run a functional test. Timing should commence once the temperature has reached its set point.

  • Continue test until technology limits are reached.

  • Apply power cycling, load variations, and frequency variation during the stress test.

FAST THERMAL TRANSITIONS

Fast Thermal Transitions are exactly as the name implies – changing temperatures as quickly as the testing equipment/chamber allows.

To do so:

  • Keep temperature range within 5°C of the operating limits determined during step testing

  • If the sample cannot withstand maximum thermal transitions, decrease the transition rate by 10°C per minute until the limit is found.

  • Continue transitions for 10 minutes, or the time it takes to run a functional test.

  • Apply power cycling, load variations, and frequency variation during the stress test.

VIBRATION STEP STRESS

Vibration Step Stress testing applies incremental vibrational stress levels throughout the product lifecycle in order to identify product failure modes.

To do so:

  • Determine the Grms increments, typically ranging from 3-5 Grms on product.

  • Set the dwell time minimum at 10 minutes plus the time needed to run a functional test. Timing should commence once the vibration level has reached its set point.

  • Continue test until technology limits are reached.

  • Apply power cycling, load variations and frequency variation during vibration stress test.

COMBINED TESTING

Merge testing results and methodologies to further test products.

To do so:

  • Develop a thermal profile using thermal operating limits, dwell times and transitions identified in earlier testing.

  • Apply additional product stresses during vibration stress test.

  • Use a constant vibration level of ≈5 Grms in the first combined runs, then step it in the same increments as those used in the vibration step stress tests.

  • Add tickle vibration (≈5 Grms) at higher Grms levels (>20 Grms) to determine whether failures were precipitated at high G levels but are only detectable at low G levels.

4. POST-HALT

Once HALT is completed, the design engineers' focus becomes determining the root causes of all failures and implementing corrective action. Essentially, a verification HALT needs to be performed to evaluate whether the adjustments fixed the problems.

What Every Engineer Should Know About HALT, HASS, and HART

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Chemical, Engineering - Civil, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Engineering - Mining, Nanoscience & Technology, Physics


Accelerated Life & Stress Testing (HALT/HASS/HART)

Would you like to minimize your warranty claims, reduce field failures, and minimize time to market while increasing customer confidence in the reputation of your products? Performing highly accelerated life testing might be what you need.

  • Highly Accelerated Life Testing (HALT), Highly Accelerated Stress Screening (HASS), and Highly Accelerated Robustness Testing (HART) are three methods utilized for early determination of potential problems your products may encounter during the course of their lifetime.

  • These methods are conceptually similar to fatigue testing, as they attempt to wear out test materials or products at highly accelerated rates.

Highly Accelerated Life Test (HALT)

  • HALT Testing, also known as Highly Accelerated Life Testing, is a technique designed to discover the weak links of a product.

  • Utilizing the two most common testing stimuli, temperature and vibration, it can precipitate failure modes faster than traditional testing approaches.

  • Rather than applying low stress levels over long durations, HALT testing applies high stress levels for short durations, well beyond the expected field environment.

  • It uses an incremental step stress approach, increasing variables until anomalies occur. As HALT testing is designed to precipitate failures, it is not a pass/fail test and requires root cause analysis and corrective action to achieve optimum value from testing.

  • It creates the ability to learn more about the product’s design and material limitations and provides opportunities to continually improve the design before bringing the product out to market.

  • As most items are only as good as their weakest link, the testing should be repeated to systematically improve the overall robustness of the product.

A typical HALT testing program includes five individual tests:

  1. High temperature step stress

  2. Low temperature step stress

  3. Vibration step stress

  4. Rapid thermal cycling

  5. Combined Environment

Each test has a specific goal. This may be examining material degradation (cracking, blistering, warping, melting, etc.) as a result of cold and hot environments, discovering mechanical issues (breaking, cracking and loosening of samples), or better understanding potential electrical issues (fretting, discontinuities and variances in change of electrical performance).

Highly Accelerated Stress Screening (HASS)

  • HALT testing is typically employed during the design or development stages of a product life cycle.

  • HASS testing is slightly different, because its purpose is to speed the discovery of latent defects in manufacturing during the production stages of a product life cycle to reduce associated failures.

  • The HASS process utilizes similar equipment to the HALT process, and the development of a usable HASS profile requires previous HALT results as well as other data and related validation information.

  • In addition to identifying defects in existing products, HASS testing can also determine the impact that alternative components will have on your product design and durability.

Highly Accelerated Robustness Testing

  • Highly Accelerated Robustness Testing (HART) is a type of accelerated life testing which produces the type of damage caused by sunlight and weather within a compressed period of time.

  • Typically, HART will cycle between intense UV and moisture exposure at controlled temperatures to achieve this end.

  • These tests are often utilized to test the designs of equipment intended for outdoor applications to predict how they will withstand weather throughout their lifetimes.

What Every Engineer Should Know About Defense in Depth to Mitigate Coronavirus Social Engineering Risk

By: John X. Wang
Subjects: Business & Management, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology

Defense-in-depth for Industrial Design Engineering

  • Defense-in-depth is an information assurance strategy that provides multiple, redundant defensive measures in case a security control fails or a vulnerability is exploited.

    • It originates from a military strategy by the same name, which seeks to delay the advance of an attack, rather than defeating it with one strong line of defense.

  • Defense-in-depth cybersecurity use cases include end-user security, product design and network security.

  • An opposing principle to defense in depth is known as simplicity-in-security, which operates under the assumption that too many security measures might introduce problems or gaps that attackers can leverage.

Defense in Depth at the Identity Perimeter to Mitigate Coronavirus Social Engineering Risk

  • The rapid acceleration from on-location to remote workforce as part of the Coronavirus Pandemic response opened the door to malicious actors accelerating their phishing and social engineering attacks.

  • Cybercriminals prey on user anxiety by embedding malicious files in COVID-19 themed emails.

    • Remote work layered with user anxiety increases credential theft attack success rates, leaving organizations’ mission-critical applications and data at risk.

Defense-in-depth architecture: Layered security

  • Defense-in-depth security architecture is based on controls that are designed to protect the physical, technical and administrative aspects of your network.

Physical controls

These controls include security measures that prevent physical access to IT systems, such as security guards or locked doors.

Technical controls

Technical controls include security measures that protect network systems or resources using specialized hardware or software, such as a firewall appliance or antivirus program.

Administrative controls

Administrative controls are security measures consisting of policies or procedures directed at an organization’s employees, e.g., instructing users to label sensitive information as “confidential”.

Access measures

Access measures include authentication controls, biometrics, timed access and VPN.

Workstation defenses

Workstation defense measures include antivirus and anti-spam software.

Data protection

Data protection methods include data at rest encryption, hashing, secure data transmission and encrypted backups.

Perimeter defenses

Network perimeter defenses include firewalls, intrusion detection systems and intrusion prevention systems.

Monitoring and prevention

The monitoring and prevention of network attacks involves logging and auditing network activity, vulnerability scanners, sandboxing and security awareness training.

Start with Identity and Zero Trust

  • For years, security professionals have said that the perimeter is shifting away from traditional controls like firewalls and focusing on enforcing user access.

    • As many organizations shift to fully remote work, the Coronavirus-driven move toward a fully remote workforce heightens the urgency of maintaining access governance controls that protect information.

  • Many organizations moved from partial remote workforce to fully remote workforce in the span of a week, or in some cases nearly overnight.

    • This means more devices accessing an organization’s systems and software, but many without the required firewall protections or forced security patch updates done on-premises. Any one of those devices, if compromised by malware, can lead to a system wide attack.

  • To rapidly accelerate security, organizations need to find a way to move towards a Zero Trust model, one that verifies and never trusts.

    • This means knowing all the devices, users, applications, and data across the organization. Then, working towards creating the appropriate controls for each of those categories.

  • For organizations that have a matured cybersecurity posture, identifying people, hardware, and data may be faster since that information is already contained within risk assessments.

    • To accelerate a Zero Trust strategy, organizations can leverage current identity and access controls and add context such as location, time of day, and application to limit user activity. By doing this, organizations can limit the impact of malware installed as part of a social engineering attack.

Embrace Adaptive Multi-Factor Authentication (MFA)

  • After setting contextual controls, organizations using adaptive MFA can apply those controls to modules within applications.

    • MFA acts as the key that unlocks access to applications, but even within that access, organizations need to provide additional layers of access protection.

  • Organizations can use context, such as time of day or location, to trigger inter-application MFA.

For example, if a user is trying to access a payroll module within an application from an anomalous location, adaptive MFA uses that context and requires the user to provide additional authentication information to prove their identity. By forcing this additional authentication, the adaptive MFA ensures that the user is who they say they are, rather than implicitly trusting the user.
  • This additional level of access security prevents malicious actors from leveraging stolen credentials throughout the organization’s Software-as-a-Service (SaaS) applications.

  • Cybercriminals may be able to gain entrance to the application itself, but the additional layer of security around sensitive data and applications that comes from using adaptive MFA means that the organization is adding another “gate” that needs to be unlocked, thus protecting the information by restricting abnormal access.
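
The decision logic behind this kind of contextual step-up can be sketched in a few lines of Python. Everything below, from the baseline locations to the sensitive-module list and the hour thresholds, is a hypothetical illustration rather than any vendor's actual policy engine:

    # Minimal sketch of a contextual ("adaptive") MFA decision.
    # All policy values here are hypothetical examples, not vendor defaults.
    from dataclasses import dataclass

    @dataclass
    class AccessContext:
        user: str
        module: str      # e.g., "payroll"
        country: str     # geolocation of the request
        hour: int        # local hour of day, 0-23

    USUAL_COUNTRY = {"alice": "US"}          # assumed per-user baseline
    SENSITIVE_MODULES = {"payroll", "hr"}    # modules that warrant step-up auth

    def requires_step_up(ctx: AccessContext) -> bool:
        """Return True when the context is anomalous enough to demand extra authentication."""
        anomalous_location = ctx.country != USUAL_COUNTRY.get(ctx.user)
        odd_hours = ctx.hour < 6 or ctx.hour > 22
        return ctx.module in SENSITIVE_MODULES and (anomalous_location or odd_hours)

    print(requires_step_up(AccessContext("alice", "payroll", "RO", hour=3)))  # True

A real deployment would pull these signals from an identity provider and log every step-up decision for later audit.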

Incorporate Data Masking

  • Organizations often assume that encryption acts as an unfailing security technology. However, an incorrect implementation, or an attacker who can break the algorithm, puts the data at risk.

  • Incorporating data masking by applying contextual controls to what information is visible to a user acts as another layer of defense against stolen credential use.

For example, assume a remote worker lives on the west coast of the United States. Incorporating geolocation as part of the user’s access and data visibility would give the user access and visibility into sensitive information as long as the person is in that geographic location. Applying data masking based on geographic location protects sensitive data even if a cyber attacker gains entrance to an application by making the sensitive data “invisible” to them. If a cybercriminal on the east coast of the United States gains entrance to the application with stolen credentials, then the cybercriminal would have access but not visibility to the information.
  • Many organizations may consider data masking a way to “protect from over-the-shoulder” risk when users are in public locations.

    • However, even with the workforce nearly fully remote as a social distancing strategy, data masking can provide a much-needed additional level of defense.
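
A minimal sketch of that idea, with made-up region names and a simple asterisk mask standing in for a real masking engine:

    # Sketch of geolocation-based data masking: the value is shown in full only
    # when the requester is inside the allowed region. Region names are invented.
    def masked(value: str, visible: int = 0) -> str:
        """Replace all but the last `visible` characters with asterisks."""
        if visible <= 0:
            return "*" * len(value)
        return "*" * max(len(value) - visible, 0) + value[-visible:]

    def read_field(value: str, user_region: str, allowed_region: str) -> str:
        # Access may already be granted; visibility is a separate, contextual layer.
        return value if user_region == allowed_region else masked(value, visible=4)

    print(read_field("123-45-6789", "US-West", "US-West"))  # full value
    print(read_field("123-45-6789", "US-East", "US-West"))  # *******6789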

Adding Layers of Defense at the Identity Perimeter

  • Remote work is not new to most organizations. However, requiring all employees to work remotely, and having to do it in a short time frame under stress, is new.

  • Meanwhile, cybercriminals continue to evolve their attacks, preying on users’ anxiety as part of their social engineering strategies.

  • Protecting digital and human health means creating controls that can evolve as situations need. However, to accelerate security in this new work environment, organizations need to find “shortcuts.”

    • Leveraging current access controls then applying context offers one way for organizations to move toward a new, more layered security strategy at the Identity perimeter to mitigate credential theft attack risks.

Risk Engineering for Defense-in-depth information assurance: Use cases

  • Broadly speaking, defense-in-depth use cases can be broken down into user protection scenarios and network security scenarios.

Website protection

  • Defense-in-depth user protection involves a combination of security offerings (e.g., WAF, antivirus, antispam software, etc.) and training to block threats and protect critical data.

  • A vendor providing software to protect end-users from cyberattacks can bundle multiple security offerings in the same product. For example, packaging together antivirus, firewall, anti-spam and privacy controls.

  • As a result, the user’s network is secured against malware and web application attacks (e.g., XSS, CSRF).

Network security

  • An organization sets up a firewall, and in addition, encrypts data flowing through the network, and encrypts data at rest.

    • Even if attackers get past the firewall and steal data, the data is encrypted.

  • An organization sets up a firewall, runs an Intrusion Prevention System (IPS) with trained security operators, and deploys an antivirus program.

    • This provides three layers of security – even if attackers get past the firewall, they can be detected and stopped by the IPS. And if they reach an end-user computer and try to install malware, it can be detected and removed by the antivirus.

What Every Engineer Should Know About Reverse engineering

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Reverse engineering

  • Reverse engineering, also called back engineering, is the process by which a man-made object is deconstructed to reveal its design and architecture, or to extract knowledge from the object;

    • similar to scientific research, the only difference being that scientific research is about a natural phenomenon.

  • For example, the reproduction of another manufacturer's product following detailed examination of its construction or composition.

Reverse engineering is taking apart an object to see how it works in order to duplicate or enhance the object. The practice, taken from older industries, is now frequently used on computer hardware and software.

Software reverse engineering

  • Software reverse engineering involves translating a program's machine code (the string of 0s and 1s that are sent to the logic processor) back into source-code form, expressed in programming language statements.

  • Software reverse engineering is done for several reasons:

    • to retrieve the source code of a program because the source code was lost;

    • to study how the program performs certain operations;

    • to improve the performance of a program;

    • to fix a bug (correct an error in the program when the source code is not available);

    • to identify malicious content in a program, such as a virus;

    • to adapt a program written for use with one microprocessor for use with another.

  • Reverse engineering for the purpose of copying or duplicating programs may constitute a copyright violation. In some cases, the licensed use of software specifically prohibits reverse engineering.

Reverse engineering tools and techniques

  • Someone doing reverse engineering on software may use several tools to disassemble a program.

  • One tool is a hexadecimal dumper, which prints or displays the binary numbers of a program in hexadecimal format (which is easier to read than a binary format).

    • By knowing the bit patterns that represent the processor instructions as well as the instruction lengths, the reverse engineer can identify certain portions of a program to see how they work.

  • Another common tool is the disassembler.

    • The disassembler reads the binary code and then displays each executable instruction in text form; a minimal sketch of programmatic disassembly follows this list.

    • A disassembler cannot tell the difference between an executable instruction and the data used by the program, so a debugger is used alongside it to keep the disassembler from disassembling the data portions of a program.

  • These tools might be used by a cracker to modify code and gain entry to a computer system or cause other harm.
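
As a concrete illustration of what a disassembler does, the sketch below feeds a few raw x86-64 bytes to the Capstone disassembly engine through its Python bindings (pip install capstone); the byte string and the load address are arbitrary examples:

    # Disassembling a handful of raw x86-64 bytes with Capstone.
    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    CODE = b"\x55\x48\x89\xe5\x89\x7d\xfc\x5d\xc3"  # push rbp; mov rbp, rsp; ...

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for insn in md.disasm(CODE, 0x1000):            # 0x1000 is an assumed load address
        print(f"0x{insn.address:x}:\t{insn.mnemonic}\t{insn.op_str}")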

Hardware reverse engineering

  • Hardware reverse engineering involves taking apart a device to see how it works.

    • For example, if a processor manufacturer wants to see how a competitor's processor works, they can purchase a competitor's processor, disassemble it, and then make a processor similar to it.

      • However, this process is illegal in many countries.

  • In general, hardware reverse engineering requires a great deal of expertise and is quite expensive.

3-D images

  • Another type of reverse engineering involves producing 3-D images of manufactured parts when a blueprint is not available in order to remanufacture the part.

  • To reverse engineer a part, the part is measured by a Coordinate Measuring Machine (CMM).

    • As it is measured, a 3-D wire frame image is generated and displayed on a monitor.

    • After the measuring is complete, the wire frame image is dimensioned.

  • In principle, any part whose surfaces are accessible to measurement can be reverse engineered using these methods.

The term forward engineering is sometimes used in contrast to reverse engineering.

Software Reverse Engineering Tools

1. IDA Pro, by Hex-Rays.

  • It’s an interactive disassembler with a built-in scripting language called IDC.

  • It also supports a variety of executables, operating systems, and more.

  • You can use this tool to build diagrams, change the names of markers, and do a whole lot more.

  • Assembler code can be decompiled into C-like pseudocode through the Hex-Rays Decompiler plug-in.

2. CFF Explorer.

  • This one includes a resource editor, a PE and hex editor, a signature scanner, an import editor, an address converter, a disassembler, and a dependency analyzer.



3. API Monitor.

  • It intercepts API function calls and can also display output and input data.

4. WinHex.

  • It is a hex editor that can display the raw bytes of binary files, something a simple text editor can’t do.

5. Hiew.

  • This is a binary file editor focused on working with code.

  • It also features a built-in disassembler.

  • You can use it to view and edit logical as well as physical drives. It also has tools for creating custom plugins.

6. Fiddler.

  • This is a debugging proxy that handles traffic between the computer and a remote server.

  • It can work with both HTTPS and HTTP.

7. Scylla.

  • It enables you to dump a running application process.

  • You can then restore the import table and run the dumped application.

8. Relocation Section Editor.

  • This one helps you modify or remove the relocation table’s values.

9. PEiD.

  • It’s considered one of the best tools for detecting which packer was used on an executable.

How to Use Software Reverse Engineering Tools?

Let’s go over how to use some of the software reverse engineering tools that were mentioned.

1. Using IDA Pro to Open the Researched Executable.



  • After loading a test application into IDA Pro, press ‘OK.’

  • You’ll see that the import table is close to being empty.

  • If you think that the application is packed, you can use PEiD to help detect the packer used.

2. Using PEiD for Packer Information.

  • Load the application and consider running a scan by going to Options and choosing ‘Hardcore scan.’

  • Now, select the folder containing the application you’re working on.

  • This will show you the packer that was used.

3. Using CFF Explorer for Unpacking.

  • Go to the UPX Utility page and simply press the ‘Unpack’ button.

  • Once that’s done, you can upload the application to IDA Pro to restore the assembler code.

  • Load it again into IDA Pro.

  • Agree when asked whether symbols should be downloaded from the symbol server.

  • You’ll get to see code, an import table, and some functions in the application.

  • Now, using IDA Pro, run and debug the application by selecting Debugger > Select Debugger > Local Win32 debugger, then pressing F9.

  • It’s time to deal with how the application detects a debugger. Click NtQueryInformationProcess to get a list of xref functions.

  • Click on it to see the third parameter, which is used for output.

    • If it’s equal to 1, it means the debugger is attached to the application.

    • However, if it’s equal to 0, it means the application doesn’t have a debugger attached.

4. Modifying Executed Statements in Hiew.

  • You’ll need to load the application and switch to the Decode Mode. You can enter the Edit Mode by pressing F3 followed by F2. You can press F9 to save the application.

5. Deleting a Relocation Table’s Value Using the Relocation Section Editor.

  • If a crash occurs after using a Relocation Section Editor, you’ll need to use CFF Explorer.

6. Modifying a Relocation Table’s Value Using CFF Explorer.

  • Simply open the application on CFF Explorer and replace the required value.



7. API Monitor.

  • You can use API Monitor for monitoring a number of functions.

  • You can also go ahead and add functions if you prefer. API Monitor will help you see the parameters that were passed to the said function.

8. Using WinHex for Detection.

  • It’s recommended to detect the binary file’s type before exploration.

  • You can use WinHex to do so.

  • Take note that an MZ signature at offset zero corresponds to PE-format files, which is how you can tell the file is an exe or dll.
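
That check is easy to automate. The sketch below verifies both the MZ magic at offset zero and the PE signature whose file offset is stored at 0x3C; the file path is just an example:

    # Detecting a PE file the way the WinHex tip above describes.
    import struct

    def is_pe_file(path: str) -> bool:
        with open(path, "rb") as f:
            if f.read(2) != b"MZ":                # DOS header magic at offset 0
                return False
            f.seek(0x3C)
            (pe_offset,) = struct.unpack("<I", f.read(4))
            f.seek(pe_offset)
            return f.read(4) == b"PE\x00\x00"     # PE signature

    print(is_pe_file(r"C:\Windows\System32\notepad.exe"))  # True on a Windows host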

9. Using Scylla.

  • You can use Scylla to create a memory dump of a packed app while it is running.

  • You are to open the packed executable file in IDA Pro.

  • Use the ‘pusha’ command for saving general-purpose registers to the stack.

  • You can make an application dump and restore the import table as well by opening Scylla without closing IDA Pro.

  • Remove the Relocation Table if the application crashes.

What Every Engineer Should Know About Anti Tamper

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology

Anti Tamper

  • Anti-tamper software (or tamper-resistant software) is software which makes it harder for an attacker to modify it.

  • The measures involved can be passive such as obfuscation to make reverse engineering difficult or active tamper-detection techniques which aim to make a program malfunction or not operate at all if modified.

  • It is essentially tamper resistance implemented in the software domain.

  • It shares certain aspects but also differs from related technologies like copy protection and trusted hardware, though it is often used in combination with them.

  • Anti-tampering technology typically makes the software somewhat larger and also has a performance impact.

  • There are no provably secure software anti-tampering methods; thus, the field is an arms race between attackers and software anti-tampering technologies.

Obfuscation, checksums and much more

  • When software has been made tamper-proof, it is protected against reverse engineering and modifications. Tamper-proofing is a combination of many techniques.

  • Each aspect of this protection adds an individual defense to the code, or prevents a certain attack method.

  • Most importantly, these transformations protect each other and provide a synergistic effect, so that the combined protection strength is much greater than the sum of the individual protection measures.

Tamper-proofing contains large elements of obfuscation

  • Obfuscated software “cannot” be understood by humans.

  • The way a reverse engineer breaks obfuscated software is well known:

    • The attacker changes one bit and observes the consequences of that change.

      • He keeps doing this until the software is understood.

        • However, tamper-proofing prevents this attack on obfuscation from succeeding.

Tamper-proofing is more than encryption

  • Encryption is very useful and necessary.

  • Encrypted software needs to be decrypted first before an attacker knows what it does.

  • Tamper-proofing uses lots of encryption but adds an extra element: it hides how the decryption works or what encryption key is used.

  • Typically users employ encryption to protect data and tamper-proofing to protect the code or encryption keys.

Tamper-proofing contains elements of checksums and hashcodes

  • These are necessary to detect changes in the protected code. Tamper-proofing also helps hide the checksums and hash codes.
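
A minimal sketch of such a hash-based self-check follows. Real tamper-proofing also hides the reference hash and the checking code itself; here the placeholder reference value sits in plain sight purely for illustration:

    # Integrity self-check: refuse to run if the program file's hash has changed.
    import hashlib, sys

    EXPECTED_SHA256 = "0" * 64  # placeholder: the known-good hash recorded at build time

    def file_hash(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    if file_hash(sys.argv[0]) != EXPECTED_SHA256:
        raise SystemExit("tamper detected: refusing to run")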

A few technologies related to tamper-proofing:

Tamper-proofing compared to virus-checking

  • Virus checking protects the computer as a whole at its perimeter and scans the file system. Tamper-proofing protects a particular application inside the computer.

  • Traditional virus checking is based on a large database of patterns to recognize malicious software.

    • Tamper-proofing does not recognize unwanted software; instead, it detects changes in the protected software or its behavior.

  • Virus checking catches the majority of old-fashioned attacks.

    • Tamper-proofing is an active defense against modern or zero-day attacks.

  • Nevertheless, tamper-proofing and virus checking should support each other.

Tamper-proofing compared to license checking

  • License checking is a function within a program which verifies conditions for using the program.

  • License checking software usually uses tamper-proofing internally to protect itself from being disabled.

Tamper-proofing compared to copy-protection

  • Tamper-proofed software in theory can easily be copied.

    • However the copy is as tamper-proof as the original.

      • If the original would work on one computer only, so would the copy.

  • Copy protection is mostly done with some hardware support.

    • Tamper protection is a good addition to harden the copy protection.

Tamper-proofing compared to trusted hardware modules

  • Both try to provide similar services.

    • However, in reality they don’t compete but rather have synergistic effects.

    • Trusted hardware relies on very complex support software and obviously requires hardware support.

  • Neither can always be assumed to be present or correct.

    • Trusted hardware can add a huge benefit to tamper-proofing, and vice versa, tamper-proofing can augment trusted hardware.

    • Tamper-proofing can (most of the time) be applied without any hardware support.

    • Tamper-proofing cannot make absolute security guarantees, but it can be made as tough as you require.

Tamper-proofing compared to a firewall

  • The firewall is a perimeter feature.

    • It prevents malware from entering, but typically does not protect against new (unknown) or sophisticated attacks.

      • Once the malware is inside the computer, a firewall can only detect anomalies in the communications.

  • Tamper-proofing on the other hand stays active, no matter where an attack originated from.

  • A firewall is a good solution for generic protection.

  • Tamper-proofing protects exactly the software which needs protection.

FAQs

Is tamper-proofing foolproof?

  • No. Mathematics can show where tamper-proofing has its limits.

    • However, for good tamper-proofing, these limits are far beyond any attacker’s capabilities or patience (similarly, with enough computing power, encryption can often be broken).

  • Tamper-proofing is not a single measure.

    • It consists of many different transformations which individually protect each other as well as the software to be protected.

  • Weak protection may suffer from the so called zipper-effect:

    • One measure after the other can be cracked in the right order.

  • Strong protection not only has no visible zipper effect:

    • the protection as a whole is stronger than the sum of the individual measures.

 

Does tamper-proofing have performance impacts?

  • Yes. There is NO software protection which doesn’t.

  • The impact can be minor or may become considerable.

  • Good tamper-proofing tools allow the programmer to carefully direct the tamper-proofed regions, so that the total performance impact can be held in control.

  • A software provider can decide between choosing extreme tamper-protection or providing the highest performance.

Does tamper-proofing make the software larger?

  • Yes, it does. That is, however, usually an advantage and not a disadvantage; it makes the attacker’s job harder.

Can tamper-proofing be used to protect malware?

  • Yes, malware traditionally uses tamper-proofing to hide itself.

  • Similar to encryption, tamper-proofing can be in plain view and nevertheless succeed in protecting the code.

What Every Engineer Should Know About WiFi Penetration Testing Tools

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

  • The internet has become an integral part of our lives today.

  • Social media, online shopping, mobile banking, and online research, among other things, all require an internet connection.

  • That is why you see Wi-Fi hotspots wherever you go: people always need to connect to the internet.

  • Most of these Wi-Fi networks are secured with a password, so you need to know the security key in order to gain access.

  • Wi-Fi security analysis and penetration testing is an integral part of creating a secure network.

  • This brings us to the best wifi penetration testing tools that you can use to ethically test a wireless network and fix its weaknesses.

  • By trying to hack into your own wireless network using these wifi hacking tools, you’ll be able to better understand wifi security vulnerabilities and how to protect yourself from them.

  • In this article we are going to look at the wifi penetration testing tools used by hackers in 2019.

  • If you are completely new to pentesting or need to upgrade your skills, check out my other article on learning pentesting through online tutorials.

  • Using these wireless pentesting tools, you’ll be able to uncover rogue access points, weak Wifi passwords and spot security holes before a hacker does.

  • You can also use these wifi hacking tools to see who is doing what in your network by analyzing their network packets.

Aircrack

  • Aircrack is one of the most popular wifi pentesting tools for cracking both WEP and WPA wifi passwords.

  • It uses one of the smartest algorithms for capturing passwords by first capturing the network packets.

  • Once it has gathered enough packets, it uses them to try and recover the wifi password by implementing an optimized Fluhrer, Mantin and Shamir (FMS) attack.

  • Apart from supporting most of the available wireless adapters, it has a very high success rate in practice.

  • To use this wifi pentest tool effectively for cracking wifi passwords, you’ll need a solid knowledge and understanding of Linux, since it is primarily run from a Linux environment.

Reaver

  • Reaver is also one of the most popular open source wireless network pentesting tools, although it has gone a long time without active development.

  • This wifi hacking tool uses a brute force attack against Wi-Fi Protected Setup (WPS) PINs to recover the passwords of WPA/WPA2 wireless networks.

  • The source code for this wireless pentesting tool was hosted on Google Code, and there is good Reaver usage documentation available that shows how to use it.

  • It’s a great wifi hacking tool even though it has taken many years without updates.

  • You can use it as a great alternative to other wireless penetration testing tools that use brute force attack to crack wifi security keys.

Airsnort

  • Airsnort is a free wifi pentesting tool that is used to crack wifi passwords for WEP networks.

  • It works by gathering network packets, examining them and then using them to compute the encryption key once enough packets have been gathered.

  • This tool is very easy to use and runs on both the Windows and Linux operating systems.

  • Even though it’s a great password cracking tool for WEP networks, it has the same problem as the Reaver tool.

  • The Airsnort source code is still available on Sourceforge.net but it has not been updated in years.

  • It’s a great wifi security tool to try, though, for hacking wifi passwords.

Cain & Abel

  • Cain and Abel is one of the top wireless penetration testing tools for cracking WEP wifi passwords, particularly for the Windows platform.

  • It’s popular because of its ability to crack wifi passwords using various techniques like network packet sniffing, dictionary attacks, brute force attacks and cryptanalysis.

  • This tool can also recover network security keys by analyzing network protocols.

  • Apart from cracking passwords, you can also use this wifi hacking tool to record VoIP conversations, retrieve cached data and get hold of routing protocols for the purpose of ethical hacking.

  • It is an updated tool and is available for all the different versions of the Windows operating system.

Infernal Twin

  • Infernal Twin is an automated wireless penetration testing tool created to help pentesters assess the security of a wifi network.

  • Using this tool you can create an Evil Twin attack, by creating a fake wireless access point to sniff network communications.

  • After creating a fake wifi access point, you can eavesdrop on users with phishing techniques and launch a man-in-the-middle attack targeting a particular user.

  • Because this tool is written in Python, you can install it on various Linux distros and use it for wireless network auditing and pentesting.

  • It enables you to hack wifi passwords for WEP/WPA/WPA2 wireless networks.

Wireshark

  • Wireshark is a free and open source wireless penetration testing tool for analyzing network packets.

  • It enables you to know what is happening in your wireless network by capturing the packets and analyzing them at a micro-level.

  • Because it’s multi-platform it can run on all the popular operating systems including Windows, Linux, Mac, Solaris & FreeBSD.

  • Even though it might not help you recover plaintext passphrases, you can use it to sniff and capture live data on wifi networks, Bluetooth, Ethernet, USB and others.

  • However, to use this tool adequately, you need a deep understanding of network protocols in order to be able to analyze the data obtained.

  • So you first need to study network protocols; there are good network security courses online to get you started.

Wifiphisher

  • Wifiphisher is another great wifi pentesting tool for cracking the password of a wireless network.

  • It functions by creating a fake wireless access point which you can use for red team engagements or wifi security testing.

  • Using this tool you can easily achieve a man-in-the-middle position against wifi clients by launching a targeted wifi association attack.

  • You can then use it to mount victim-customized web phishing attacks against the connected clients in order to capture credentials or infect their stations with malware.

  • So you can use it to launch fast automated phishing attacks on a wifi network to steal passwords.

  • This tool is free and comes pre-installed in the Kali Linux distro, and it is also available for Windows and Mac OSes.

CowPatty

  • CowPatty is an automated command-line wireless penetration testing tool for launching dictionary attacks on WPA/WPA2 wifi networks using Pre-Shared Key (PSK)-based authentication.

  • It can launch an accelerated network attack if a precomputed PMK file is available for the Service Set Identifier (SSID) being assessed.

  • Because this wireless hacking tool runs on a word list containing the passwords to be used in the attack, you are out of luck if the password is not within the word list.

  • Another drawback is the sluggishness of this tool, because each candidate password must be hashed with SHA1 using the SSID as a salt (see the sketch after this list).

  • So it uses the password dictionary to generate the hash for each word contained in the dictionary, salted with the SSID.

  • Thus even though this tool is easy to use, it’s really slow.
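
The root cause of that sluggishness is easy to demonstrate: WPA/WPA2-PSK stretches each candidate passphrase with PBKDF2-HMAC-SHA1 over 4096 rounds, salted with the SSID, to produce the 256-bit Pairwise Master Key. Precomputing PMKs per SSID is exactly the trade-off CowPatty's PMK files make. A short sketch of the derivation (the passphrase and SSID are arbitrary examples):

    # WPA/WPA2-PSK Pairwise Master Key derivation: 4096 PBKDF2 rounds per candidate.
    import hashlib

    def pmk(passphrase: str, ssid: str) -> bytes:
        return hashlib.pbkdf2_hmac("sha1", passphrase.encode(), ssid.encode(), 4096, 32)

    print(pmk("correct horse battery staple", "HomeWifi").hex())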

OmniPeek

  • OmniPeek is a very popular wireless pentesting tool that is used for packet sniffing as well as network packet analyzing.

  • Even though this is a paid tool that only runs on the Windows OS, it has a 30-day trial so you can test the platform before you commit to a paid plan.

  • It works just as well as, and in a similar way to, Wireshark, which I already mentioned above.

  • However, while you can use this tool to capture and analyze wireless network traffic, you’ll need a deep knowledge of network protocols and packets to be able to understand the collected data.

  • One reason it’s very popular as a wireless network hacking tool is that it supports almost all of the network interface cards available on the market.

  • So you are less likely to face network card compatibility issues.

  • Besides, you can also extend the functionality of this wifi pentest tool by using many of the readily available plugins to achieve greater troubleshooting capabilities.

  • You’ll also get expert GUI-based views for faster diagnostics because it has a built in expert system that suggests root cause analysis for hundreds of common network problems.

Conclusion

  • So there you have it: the list of the top wireless hacking tools for pentesting your wifi.

  • While you can use some of these wifi pentesting tools to crack wifi passwords, you can also use some of them to monitor your network traffic.

  • However, these are not the only wifi security tools out there. There are many more wireless hacking tools.

  • But these are the most common ones among ethical hackers, and you can learn how to use them through online penetration testing courses.

  • Also, note that even though you can use these wifi penetration testing tools to gain unauthorized access to a network, hacking into a network might be a criminal offence in your country.

  • So tread with caution if you are going to use these wifi security tools on another network.

  • These wifi pentesting tools are basically used by system admins or programmers working on wifi-based software for monitoring and troubleshooting wifi networks.

Have you used any of the wifi hacking tools in this list before?

What Every Engineer Should Know About Kali Linux for Cybersecurity

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

  • Kali Linux is the most popular OS used by Cyber Security experts all over the world.

  • So if you want to get into the world of Cyber Security and Ethical Hacking, Kali Linux Installation is the first step for you.

  • Kali Linux is a Debian-based Linux distribution aimed at advanced Penetration Testing and Security Auditing.

What is Kali Linux?

Kali Linux, initially released in 2013, is a popular Debian-based Linux distribution operating system mainly used by penetration testers and digital forensics experts.

What are Some of the Features of Kali Linux?

  • Availability of more than 600 tools: Kali Linux has an extensive collection of penetration testing tools. Whichever tool you need—from Aircrack-ng for examining Wi-Fi network security to John the Ripper for cracking passwords, this bundle has everything for your cyber security needs.

  • Completely free: Just like its predecessor, BackTrack, Kali Linux is offered free of charge—forever.

  • Multi-language support: The team at Offensive Security has done an excellent job at offering Kali Linux help in several languages, apart from the traditional English language.

  • Entirely customizable: Do you want to customize Kali Linux to suit your unique preferences? Nothing prevents you from reworking the tool to suit your design needs, even from the kernel.

  • ARMEL and ARMHF support: If you like using an ARM-derived, single-board infrastructure such as BeagleBone Black and Raspberry Pi, Kali Linux will have you covered. The tool supports ARMEL and ARMHF systems, allowing you to carry out hacking without many hassles.

  • Open-source model: Kali Linux is offered as an open-source software, allowing anyone who wants to tweak its source code to do so.

How to Use Kali Linux

With Kali Linux, ethical hackers can assess the computing infrastructure of an organization and discover vulnerabilities to be addressed.

Here are the main steps for carrying out penetration testing on a network and the Kali Linux tools that can be used.

1. Reconnaissance

In this first process, a pen tester collects preliminary information or intelligence on the target, enabling better planning for the actual attack.

Some Kali Linux reconnaissance tools include

  • Recon-ng

  • Nmap

  • Hping3

  • DNSRecon

2. Scanning

In this step, technical tools are utilized to collect more intelligence on the target. For example, a pen tester can use a vulnerability scanner to identify security loopholes in a target network. (A minimal sketch of what a scanner does appears after the list below.)

Some Kali Linux scanning tools include

  • Arp-scan

  • jSQL Injection

  • Cisco-auditing-tool

  • Oscanner

  • WebSploit

  • Nikto
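
As a minimal illustration of what scanning means at the network level (plain Python, not one of the Kali tools listed above), a TCP connect sweep can be sketched as follows; run it only against hosts you are authorized to test:

    # Simplest possible port scanner: try a TCP connect() on each port.
    import socket

    def scan(host: str, ports: range) -> list[int]:
        open_ports = []
        for port in ports:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.settimeout(0.5)
                if s.connect_ex((host, port)) == 0:   # 0 means the connect succeeded
                    open_ports.append(port)
        return open_ports

    print(scan("127.0.0.1", range(20, 1025)))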

3. Gaining access

In this third step, the ethical hacker infiltrates the target network with the intention of extracting some useful data or to use the compromised system to launch more attacks.

Some Kali Linux exploitation tools include

Metasploit Framework

BeEF (Browser Exploitation Framework)

Wireshark

John the Ripper

Aircrack-ng

4. Maintaining access

Just like the name suggests, this phase requires the pen tester to maintain a presence on the target system as long as possible and demonstrate the damage an attacker could cause. It requires tools that allow stealthy behavior and low-profile operations.

Some Kali Linux tools for maintaining access include

  • Powersploit

  • Webshells

  • Weevely

  • Dns2tcp

  • Cryptcat

5. Covering tracks

In this last stage, the hacker removes any sign of past malicious activity on the target network. For example, any alterations made or access privileges escalated are returned to their original statuses.

Some Kali Linux tools for covering tracks include

  • Meterpreter

  • Veil

  • Smbexec

Conclusion

Kali Linux is a useful tool for penetration testing and cyber security work. You should learn the ins and outs of using the tool so that you can sufficiently guard your critical IT infrastructure from malicious attackers.

After mastering use of this tool using a Kali Linux tutorial, you’ll feel comfortable carrying out advanced penetration testing to discover vulnerabilities in your network.

What Every Engineer Should Know About Global Frameworks and Standards for Cybersecurity

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

NIST Cybersecurity Framework

The U.S. Commerce Department’s National Institute of Standards and Technology (NIST) has released version 1.1 of its popular Framework for Improving Critical Infrastructure Cybersecurity, more widely known as the Cybersecurity Framework.

More than ever, organizations must balance a rapidly evolving cyber threat landscape against the need to fulfill business requirements. To help these organizations manage their cybersecurity risk, NIST convened stakeholders to develop a Cybersecurity Framework that addresses threats and supports business. While the primary stakeholders of the Framework are U.S. private-sector owners and operators of critical infrastructure, its user base has grown to include communities and organizations across the globe.

The framework was developed with a focus on industries vital to national and economic security, including energy, banking, communications and the defense industrial base. It has since proven flexible enough to be adopted voluntarily by large and small companies and organizations across all industry sectors, as well as by federal, state and local governments.

FIPS: Federal Information Processing Standards.

FIPS (Federal Information Processing Standards) are a set of standards that describe document processing, encryption algorithms and other information technology standards for use within non-military government agencies and by government contractors and vendors who work with the agencies.

FIPS are standards and guidelines for federal computer systems that are developed by National Institute of Standards and Technology (NIST) in accordance with the Federal Information Security Management Act (FISMA) and approved by the Secretary of Commerce. These standards and guidelines are developed when there are no acceptable industry standards or solutions for a particular government requirement. Although FIPS are developed for use by the federal government, many in the private sector voluntarily use these standards.

RTCA DO-178C

RTCA DO-178C: Software Considerations in Airborne Systems and Equipment Certification is the primary document by which the certification authorities such as FAA, EASA and Transport Canada approve all commercial software-based aerospace systems.

DO-326A

DO-326A: The international standards DO-326A (U.S.) and ED-202A (Europe), titled "Airworthiness Security Process Specification", are the cornerstones of the "DO-326/ED-202 Set", and as of 2019 they are the only Acceptable Means of Compliance (AMC) recognized by the FAA and EASA for aviation cyber-security airworthiness certification. The "DO-326/ED-202 Set" also includes the companion documents DO-356A/ED-203A: "Airworthiness Security Methods and Considerations" and DO-355/ED-204: "Information Security Guidance for Continuing Airworthiness" (U.S. & Europe), as well as ED-201: "Aeronautical Information System Security (AISS) Framework Guidance" and ED-205: "Process Standard for Security Certification / Declaration of Air Traffic Management / Air Navigation Services (ATM/ANS) Ground Systems" (Europe only).

DO-355

DO-355: Security DO-355 Information Security Guidance for Continuing Airworthiness. This document is a resource for civil aviation authorities and the aviation industry when the operation and maintenance of aircraft and the effects of information security threats can affect aircraft safety. It deals with the activities that need to be performed in operation and maintenance of the aircraft related to information security threats.

This document also provides guidance related to operational and commercial effects (i.e., guidance that exceeds the safety-only effects). Thus, it also supports harmonizing security guidance documents among Design Approval Holders (DAH), which is deemed beneficial to DAHs, operators and civil aviation authorities. It is a companion document to DO-326A that supports security in the development and modification part of the airworthiness process.

DO-356A

DO-356A: Airworthiness Security Methods and Considerations.

Scope

Airworthiness security is the protection of the airworthiness of an aircraft from intentional unauthorized electronic interference. This includes the consequences of malware and forged data and of access of other systems to aircraft systems.

This guidance provides methods and considerations for securing airworthiness during the aircraft development life cycle from project initiation until the Aircraft Type Certificate is issued for the aircraft type design. It was developed in the context of DO-326A/ED-202A "Airworthiness Security Process Specification" which addresses type certification considerations during the first three life cycle stages of an aircraft type (Initiation, Development or Acquisition, and Implementation) and DO-355/ED-204, "Information Security Guidance for Continuing Airworthiness" which addresses airworthiness security for continued airworthiness.

It is intended to be used in conjunction with other applicable guidance material, including SAE ARP 4754A/ED-79A, SAE ARP 4761/ED-135, DO-178C/ED-12C, and DO-254/ED-80 and with the advisory material associated with FAA AC 25.1309-1A and EASA AMC 25.1309, in the context of part 25 for Transport Category Airplanes which include an approved passenger seating configuration of more than 19 passenger seats. This guidance is not intended for CFR parts 23, 27, 29, 33.28, and 35.15, normal, utility, acrobatic, and commuter category airplanes, normal category rotorcraft, transport category rotorcraft, engines, and propellers.

Purpose

This document describes guidelines, methods and tools used in performing an airworthiness security process. The guidelines, methods and tools presented are not intended to be exhaustive and can be expected to be updated with additional methods and considerations, including those needed to meet evolving regulatory assumptions. Applicants can propose alternative practices for consideration by the authorities. Practices for airworthiness security are still undergoing evolution and refinement as new features are deployed and the security threat itself evolves.

RTCA/EUROCAE documents on Aeronautical Systems Security will address information security for the overall Aeronautical Information System Security (AISS) of airborne systems with related ground systems and environment. This guidance material is for equipment manufacturers, aircraft manufacturers, and anyone else who is applying for an initial Type Certificate (TC) and, afterwards (e.g., Design Approval Holders (DAH)), a Supplemental Type Certificate (STC), Amended Type Certificate (ATC) or changes to Type Certification for installation and continued airworthiness for aircraft systems; it is derived from understood best practice.

ISO 27001

ISO/IEC 27001: Information Security Management. ISO/IEC 27001 is widely known, providing requirements for an information security management system (ISMS), though there are more than a dozen standards in the ISO/IEC 27000 family. Using them enables organizations of any kind to manage the security of assets such as financial information, intellectual property, employee details or information entrusted by third parties.

ISO 27002

ISO 27002: Information technology — Security techniques — Code of practice for information security controls. ISO/IEC 27002:2013 gives guidelines for organizational information security standards and information security management practices including the selection, implementation and management of controls taking into consideration the organization's information security risk environment(s). It is designed to be used by organizations that intend to:

  • select controls within the process of implementing an Information Security Management System based on ISO/IEC 27001;

  • implement commonly accepted information security controls;

  • develop their own information security management guidelines.

ITIL

ITIL: ITIL is a framework of best practices for delivering IT services.

The IT Infrastructure Library (ITIL) is a library of volumes describing a framework of best practices for delivering IT services. ITIL has gone through several revisions in its history and currently comprises five books, each covering various processes and stages of the IT service lifecycle. ITIL’s systematic approach to IT service management can help businesses manage risk, strengthen customer relations, establish cost-effective practices, and build a stable IT environment that allows for growth, scale and change.

Developed by the British government's Central Computer and Telecommunications Agency (CCTA) during the 1980s, the ITIL first consisted of more than 30 books, developed and released over time, that codified best practices in information technology accumulated from many sources (including vendors' best practices) around the world. IBM, for example, says that its four-volume series on systems-management concepts, A Management System for Information Systems, known as the Yellow Books, provided vital input into the original ITIL books.

GDPR

GDPR: General Data Protection Regulation. The General Data Protection Regulation 2016/679 is a regulation in EU law on data protection and privacy in the European Union and the European Economic Area. It also addresses the transfer of personal data outside the EU and EEA areas.

DFARS

DFARS: Defense Federal Acquisition Regulation Supplement. The Defense Federal Acquisition Regulation Supplement (DFARS) to the Federal Acquisition Regulation (FAR) is administered by the Department of Defense (DoD). The DFARS implements and supplements the FAR. The DFARS contains requirements of law, DoD-wide policies, delegations of FAR authorities, deviations from FAR requirements, and policies/procedures that have a significant effect on the public. The DFARS should be read in conjunction with the primary set of rules in the FAR.

What Every Engineer Should Know About SE Linux

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Security-Enhanced Linux (SELinux) is a Linux kernel security module that provides a mechanism for supporting access control security policies, including mandatory access controls (MAC). SELinux is a set of kernel modifications and user-space tools that have been added to various Linux distributions.

What is SELinux?

Security-Enhanced Linux (SELinux) is a security architecture for Linux® systems that allows administrators to have more control over who can access the system. It was originally developed by the United States National Security Agency (NSA) as a series of patches to the Linux kernel using Linux Security Modules (LSM).

SELinux was released to the open source community in 2000, and was integrated into the upstream Linux kernel in 2003.

How does SELinux work?

SELinux defines access controls for the applications, processes, and files on a system. It uses security policies, which are a set of rules that tell SELinux what can or can’t be accessed, to enforce the access allowed by a policy.

When an application or process, known as a subject, makes a request to access an object, like a file, SELinux checks with an access vector cache (AVC), where permissions are cached for subjects and objects.

If SELinux is unable to make a decision about access based on the cached permissions, it sends the request to the security server. The security server checks the security context of the app or process and of the file. Security context is applied from the SELinux policy database. Permission is then granted or denied. If permission is denied, an "avc: denied" message will be available in /var/log/messages.

How to configure SELinux

There are a number of ways that you can configure SELinux to protect your system. The most common are targeted policy or multi-level security (MLS).

Targeted policy is the default option and covers a range of processes, tasks, and services. MLS can be very complicated and is typically only used by government organizations.

You can tell which mode your system is supposed to be running in by looking at the /etc/sysconfig/selinux file. The file has a section that shows you whether SELinux is in permissive mode, enforcing mode, or disabled, and which policy is supposed to be loaded.

SELinux labeling and type enforcement

Type enforcement and labeling are the most important concepts for SELinux.

SELinux works as a labeling system, which means that all of the files, processes, and ports in a system have an SELinux label associated with them. Labels are a logical way of grouping things together. The kernel manages the labels during boot.

Labels are in the format user:role:type:level (level is optional). User, role, and level are used in more advanced implementations of SELinux, like with MLS. Label type is the most important for targeted policy.

SELinux uses type enforcement to enforce a policy that is defined on the system. Type enforcement is the part of an SELinux policy that defines whether a process running with a certain type can access a file labeled with a certain type.
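
Splitting a label into its fields is mechanical; the sketch below parses a common default label from a targeted-policy system (the example label and helper function are illustrative, not part of any SELinux API):

    # Parse an SELinux label of the form user:role:type:level (level is optional).
    def parse_label(label: str) -> dict:
        user, role, type_, *level = label.split(":", 3)
        return {"user": user, "role": role, "type": type_,
                "level": level[0] if level else None}

    print(parse_label("unconfined_u:object_r:user_home_t:s0"))
    # {'user': 'unconfined_u', 'role': 'object_r', 'type': 'user_home_t', 'level': 's0'}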

Enabling SELinux

If SELinux has been disabled in your environment, you can enable SELinux by editing /etc/selinux/config and setting SELINUX=permissive. Because SELinux was not previously enabled, you don’t want to set it to enforcing right away, since the system will likely have mislabeled files that can keep it from booting.

You can force the system to automatically relabel the filesystem by creating an empty file named .autorelabel in the root directory and then rebooting. If the system has too many errors, you should reboot while in permissive mode in order for the boot to succeed. After everything has been relabeled, set SELinux to enforcing with /etc/selinux/config and reboot, or run setenforce 1.
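
Those steps can be scripted. The sketch below assumes a Red Hat-style /etc/selinux/config layout, must be run as root, and is an illustration rather than a supported administration tool:

    # Switch SELinux to permissive mode and queue a full filesystem relabel.
    import re
    from pathlib import Path

    CONFIG = Path("/etc/selinux/config")

    def set_selinux_mode(mode: str) -> None:
        """Rewrite the SELINUX= line (permissive, enforcing, or disabled)."""
        text = re.sub(r"^SELINUX=\w+", f"SELINUX={mode}", CONFIG.read_text(), flags=re.M)
        CONFIG.write_text(text)

    set_selinux_mode("permissive")      # never jump straight to enforcing
    Path("/.autorelabel").touch()       # relabel everything on the next boot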

If a sysadmin is less familiar with the command line, there are graphic tools available that can be used to manage SELinux.

SELinux provides an additional layer of security for your system that is built into Linux distributions. It should remain on so that it can protect your system if it is ever compromised.

Discretionary access control (DAC) vs. mandatory access control (MAC)

Traditionally, Linux and UNIX systems have used DAC. SELinux is an example of a MAC system for Linux.

With DAC, files and processes have owners. You can have the user own a file, a group own a file, or other, which can be anyone else. Users have the ability to change permissions on their own files.

The root user has full access control with a DAC system. If you have root access, then you can access any other user’s files or do whatever you want on the system.

But on MAC systems like SELinux, there is administratively set policy around access. Even if the DAC settings on your home directory are changed, an SELinux policy in place to prevent another user or process from accessing the directory will keep the system safe.

SELinux policies let you be specific and cover a large number of processes. You can make changes with SELinux to limit access between users, files, directories, and more.

How to handle SELinux errors

When you get an error in SELinux, there is something that needs to be addressed. It is likely one of these four common problems:

  1. The labels are wrong. If your labeling is incorrect you can use the tools to fix the labels.

  2. A policy needs to be fixed. This could mean that you need to inform SELinux about a change you’ve made, or you might need to adjust a policy. You can fix it using booleans or policy modules.

  3. There is a bug in the policy. It could be that a bug exists in the policy that needs to be addressed.

  4. The system has been broken into. Although SELinux can protect your systems in many scenarios, the possibility for a system to be compromised still exists. If you suspect that this is the case, take action immediately.

What are booleans?

Booleans are on/off settings for functions in SELinux. There are hundreds of settings that can turn SELinux capabilities on or off, and many are already predefined. You can find out which booleans have already been set in your system by running getsebool -a.

What Every Engineer Should Know About Hardware Cybersecurity Designs

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology

Secure Boot

The Unified Extensible Firmware Interface (UEFI) specification defines a mechanism called "Secure Boot" for ensuring the integrity of firmware and software running on a platform. Secure Boot establishes a trust relationship between the UEFI BIOS and the software it eventually launches (such as bootloaders, OSes, or UEFI drivers and utilities). After Secure Boot is enabled and configured, only software or firmware signed with approved keys is allowed to execute. Conversely, software signed with blacklisted keys is disallowed from executing. In this way, a system can guard against malicious attacks, rootkits, and unauthorized software updates that could happen prior to the OS launching.

The Secure Boot mechanism relies on public/private key pairs to verify the digital signature of all firmware and software before execution. Before digging into the details of UEFI's Secure Boot, let's begin with a bit of high-level background on digital signatures.

Digital Signatures

The basic idea of digital signatures is to generate a pair of keys:

  • A private key to be kept private and secured by the originator.

  • A public key that can be distributed freely.

The mathematical correlation between this public/private key pair allows for checking the digital signature of a message for authenticity. To do the check, only the public key is necessary, and the message can be verified as having been signed by the private key without ever knowing the private key itself.

One other feature of this public/private key pair is that it is impractical to calculate the private key from the contents of the public key. This feature allows for the distribution of the public key without compromising the private key.

Lastly, a message cannot be signed using the public key. Only the private key is capable of signing the message properly. This is the basic mechanism digital signature technology uses to verify a message's integrity without compromising the details or contents of the private key.
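
The sign/verify round trip described above can be sketched with the third-party Python cryptography package (pip install cryptography); the message here is a stand-in for a real firmware image:

    # Sign with the private key; verify with only the public key.
    from cryptography.hazmat.primitives.asymmetric import rsa, padding
    from cryptography.hazmat.primitives import hashes

    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    message = b"firmware image bytes"   # stand-in for a real firmware blob
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)

    signature = private_key.sign(message, pss, hashes.SHA256())
    public_key.verify(signature, message, pss, hashes.SHA256())  # raises InvalidSignature if tampered
    print("signature verified")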

Secure Boot Details

With this understanding of digital signatures, the UEFI "Secure Boot" technology consists of a collection of keys, categorized as follows:

  • Platform Key (PK)

  • Key Exchange Key (KEK)

  • Whitelist Database (DB)

  • Blacklist Database (DBX)

On a system with Secure Boot enabled and configured, each of these items will contain the public portions of public/private key pairs. The keys are used to authorize various components of the firmware and software.

Platform Key (PK)

The Platform Key (PK) establishes a trust relationship between the platform owner and the firmware (UEFI BIOS) by controlling access to the KEK database. There is a single PK per platform, and the public portion of the PK is installed into the system, typically during production at the OEM. The private portion of the PK is necessary for modifying the KEK database.

Key Exchange Key (KEK)

The Key Exchange Key (KEK) database establishes a trust relationship between the firmware and the OS. The KEK consists of a list of public keys that can be checked against for authorization to modify the whitelist database (DB) or blacklist database (DBX). There can be multiple KEKs per platform. The private portion of a KEK is necessary for modifying the DB or DBX.

Whitelist Database (DB)

The whitelist database (DB) is a list of public keys that are used to check the digital signature of a given firmware or software. To discuss the DB, let's assume the system is booting and is about to execute the bootloader for selecting an OS to boot. The system will check the digital signature of the bootloader using the public keys in the DB, and if this bootloader was signed with a corresponding private key, then the bootloader is allowed to execute. Otherwise, it is blocked as unauthorized.

Blacklist Database (DBX)

Conversely, the blacklist database (DBX) is a list of public keys known to correspond to malicious or unauthorized firmware or software. Any software signed with a corresponding private key from this database will be blocked.

Secure Boot is relatively self-contained. If the handful of signed objects haven’t been tampered with, the platform boots, and secure boot is done. If objects have been changed so the signature is no longer valid, the platform doesn’t boot and a re-installation is indicated.
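
The resulting allow/deny decision can be modeled schematically as below; real firmware verifies digital signatures rather than comparing bare key identifiers, so treat this only as a mental model:

    # Toy model of the Secure Boot decision: DBX (blacklist) always wins.
    def may_execute(signing_key: str, db: set[str], dbx: set[str]) -> bool:
        if signing_key in dbx:       # blacklisted keys are blocked unconditionally
            return False
        return signing_key in db     # otherwise only whitelisted keys may run

    db, dbx = {"oem-key", "os-vendor-key"}, {"revoked-key"}
    print(may_execute("os-vendor-key", db, dbx))  # True
    print(may_execute("revoked-key", db, dbx))    # False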

Trusted Boot: a key strategy for ensuring the trustworthiness of an embedded computing system

Trusted boot is a key strategy for ensuring the trustworthiness of an embedded computing system, beginning with the very first software instruction at system startup, to protect against cyber attacks.

Trust in an embedded module or system means ensuring that the system operates exactly as intended. In the context of the boot process, trust means that an embedded module executes only the intended boot code, operating system, and application code. The only way to guarantee trust in this chain is to ensure that all code – from the very first instruction that a processor executes – is authentic and specifically intended by the system integrator to execute on that processor.

There are various means of establishing initial trust in the boot process, and many of these same techniques are also useful for extending trust to the operating system and application code.

Cryptography in the form of encryption and digital signatures is an essential component for establishing trust and preventing a malicious actor from modifying, adding, or replacing authentic code. While encryption ensures confidentiality to prevent prying eyes from understanding the code, it does not guarantee that the code comes from an authorized source and has not been tampered with in some way.

Protections to ensure software authentication typically require the use of public-key cryptography to create a digital signature for the software. This involves the use of a protected private key to generate the signature, and a public key stored on the module. It does an attacker no good to have the public key since he also needs the private key to generate a valid digital signature.

Trusted boot by itself does not make a secure system. There are attacks that can bypass authentication, particularly if the attacker has physical access to the module. There are other ways to gain access to the information an attacker wants. For example, the attacker may simply wait until the module is powered up and then observe the application code while it is running.

These techniques for bypassing authentication do not diminish the usefulness of trusted boot as a component in a defense-in-depth approach in which different security technologies work together to provide a far more secure system than any one technology could deliver on its own.

In the embedded computing world, Intel, Power Architecture, and Arm are the dominant processor architectures, and each one has technologies that support a trusted-boot process. In addition, trusted boot can be designed using external hardware cryptographic engines and key storage devices. The remainder of this article will survey available commercial options for providing trusted boot.

Trusted eXecution Technology (TXT)

Trusted eXecution Technology (TXT) and Boot Guard are two Intel technologies that create a trusted boot process. Both make use of an external device called a trusted platform module (TPM) -- a dedicated security coprocessor that provides cryptographic services.

TXT makes use of authenticated code modules (ACM) from Intel. An ACM is matched to a specific chipset and is authenticated with a public signing key that is hard-coded in the chipset. During boot, the BIOS ACM measures (computes a cryptographic hash of) various BIOS components and stores the measurements in the TPM. A separate secure initialization (SINIT) ACM does the same for the operating system. Each time a module boots, TXT measures the boot code and determines if any changes have been made. A user-defined launch control policy (LCP) enables the user to fail the boot process or continue to boot as a non-trusted system.

Boot Guard

Boot Guard is a newer Intel technology that works in a complementary fashion to TXT. It introduces an initial boot block (IBB) that executes prior to the BIOS and ensures that the BIOS is trusted before allowing a boot to occur. The OEM signs the IBB and programs the public signing key into one-time programmable fuses that are stored in the processor. On power-up the processor verifies the authenticity of the IBB, and then the IBB verifies the BIOS prior to allowing it to execute. Intel describes Boot Guard as “hardware based boot integrity protection that prevents unauthorized software and malware takeover of boot blocks critical to a system’s function.”

Many of the Power Architecture and Arm processors from NXP support the Trust Architecture, and the two processor families use variations of the same Trust Architecture; they implement trusted boot differently than Intel. For Power Architecture and Arm processors, the trusted boot process requires the application, including the boot code, to be signed using an RSA private signature key. The digital signature and public key are appended to the image and written to nonvolatile memory. A hash of the public key is programmed into one-time programmable fuses in the processor.

Internal Secure Boot Code (ISBC)

When the processor boots, it begins executing from Internal Secure Boot Code (ISBC), stored in non-volatile memory inside the processor. The ISBC authenticates the public key using the stored hash value. The public key then authenticates the signature for the externally stored boot and application code. The external code may contain its own validation data similar to the ISBC. This process can break the external code into many smaller chunks that are validated separately by the previous code chunk. In this way the Trust Architecture can extend to subsequent software modules to establish a chain of trust.
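
A minimal software analogy of this flow might look like the sketch below: the fused hash authenticates the public key, and the authenticated key then verifies the externally stored image. All key material here is generated on the fly for illustration; a real ISBC runs from processor-internal memory against one-time-programmable fuses.

    import hashlib
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # --- At manufacture (illustrative): sign the image, fuse the key hash ---
    signing_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_bytes = signing_key.public_key().public_bytes(
        serialization.Encoding.DER, serialization.PublicFormat.SubjectPublicKeyInfo)
    fused_key_hash = hashlib.sha256(public_bytes).digest()  # one-time-programmable fuses

    boot_image = b"external boot and application code"
    signature = signing_key.sign(boot_image, padding.PKCS1v15(), hashes.SHA256())

    # --- At boot, inside the ISBC (illustrative) ---
    def isbc_verify(image, sig, key_bytes):
        # 1. Authenticate the public key against the fused hash.
        if hashlib.sha256(key_bytes).digest() != fused_key_hash:
            raise RuntimeError("public key does not match fused hash")
        # 2. Use the authenticated key to verify the external image.
        key = serialization.load_der_public_key(key_bytes)
        key.verify(sig, image, padding.PKCS1v15(), hashes.SHA256())  # raises if invalid
        return True

    print(isbc_verify(boot_image, signature, public_bytes))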

NXP QorIQ processors, meanwhile, enable encryption of portions of the image to prevent attackers from stealing the image from flash memory, and allow for alternate images to add resiliency to the secure boot sequence.

Intel and NXP provide mechanisms to implement trusted boot and ensure the authenticity of the boot process starting from the very first line of code. Still, these important steps do not complete the trusted boot effort.

Designers of secure systems must follow these steps with additional security mechanisms to ensure that trusted boot is maintained throughout the rest of the boot process. Designers must also identify and mitigate potential attack vectors, especially in embedded systems that might be exposed to issues of supply chain integrity, physical tampering, and remote cyber attack.

Measured Boot

Measured Boot is more flexible, but it also requires an important additional step: all of those hashes have to be stored in a way that leaves very little chance they can be manipulated, and a very high likelihood that they can be reliably reported to a management station, using a process called Attestation. Since Measured Boot doesn’t stop the platform from booting, the host OS can’t be relied upon to report the hashes.

Platform Configuration Registers (PCRs)

In the case of Measured Boot, the Trusted Platform Module is used to record these hashes. The TPM is a small self-contained security processor that can be attached to a system bus as a simple peripheral. Of the many functions a TPM can provide, one is the facility called Platform Configuration Registers (PCRs), used for storing hashes.
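
PCR semantics are easy to model in software: a PCR is never written directly, only "extended", with the new value computed as the hash of the old value concatenated with the new measurement. The sketch below illustrates that rule; it is a model of the arithmetic, not TPM code.

    import hashlib

    def pcr_extend(pcr, measurement):
        # TPM-style extend: new PCR value = H(old PCR value || new measurement).
        return hashlib.sha256(pcr + measurement).digest()

    pcr0 = bytes(32)  # PCRs reset to all zeros at power-on
    for component in (b"firmware", b"bootloader", b"kernel"):
        pcr0 = pcr_extend(pcr0, hashlib.sha256(component).digest())

    print(pcr0.hex())  # changing any component changes the final value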

Secure Enclaves: a New Approach in Data Security

A new approach to addressing the problem of protecting data in use can be found in hardware-based security in the form of secure enclaves. Secure enclaves allow applications to execute securely and enforced at the hardware level by the CPU itself. All data is encrypted in memory and decrypted only while being used inside the CPU. The data is still completely protected, even if the operating system, hypervisor or root user are compromised. With secure enclaves, data can be fully protected across its full life cycle—at rest, in motion and in use—for the first time.

Advanced secure enclaves offer further security using a process called “attestation” to verify that the CPU is genuine and the application is the correct one and hasn’t been altered. Operating in secure enclaves gives users complete confidence that code is running as intended and that data is completely protected, wherever it is. This approach is gaining traction; for example, it enables sensitive applications including machine learning and artificial intelligence to be run in the cloud. Secure enclaves enable the new concept of confidential computing.

How it works

  • The Secure Enclave runs a dedicated microkernel and undergoes a secure boot process separate from the rest of the device. It receives its system updates independent of the other CPU components.

  • When the device boots, the Secure Enclave generates an ephemeral encryption key and "entangles" it with its UID (a unique ID fused into the hardware) which cannot be accessed by the rest of the CPU. This key is used to encrypt, and verify the authenticity of, the Secure Enclave's portion of the device's memory. Any data written to NAND flash storage by the Secure Enclave is encrypted by combining this entangled ephemeral key with an anti-replay counter to prevent data tampering.

  • Authentication data is sent from biometric sensors to the Secure Enclave over a serial bus. The CPU facilitates this operation, but cannot read the data. The data is processed by the Secure Enclave in its encrypted memory space.

  • If the Secure Enclave verifies the biometric data as authentic, it sends a message to the CPU using a "mailbox" of hardware interrupts. The CPU then permits the user to unlock or make purchases with the device.

Trusted Platform Module (TPM)

A Trusted Platform Module (TPM) is a specialized chip on an endpoint device that stores RSA encryption keys specific to the host system for hardware authentication. Each TPM chip contains an RSA key pair called the Endorsement Key (EK). The pair is maintained inside the chip and cannot be accessed by software.

TPM (Trusted Platform Module) is a computer chip (microcontroller) that can securely store artifacts used to authenticate the platform (your PC or laptop). These artifacts can include passwords, certificates, or encryption keys. A TPM can also be used to store platform measurements that help ensure that the platform remains trustworthy. Authentication (ensuring that the platform can prove that it is what it claims to be) and attestation (a process helping to prove that a platform is trustworthy and has not been breached) are necessary steps to ensure safer computing in all environments. A Trusted Platform Module provides:

  • A random number generator

  • Facilities for the secure generation of cryptographic keys for limited uses.

  • Remote attestation: Creates a nearly unforgeable hash key summary of the hardware and software configuration. The software in charge of hashing the configuration data determines the extent of the summary. This allows a third party to verify that the software has not been changed.

  • Binding: Encrypts data using the TPM bind key, a unique RSA key descended from a storage key.

  • Sealing: Similar to binding, but in addition, specifies the TPM state (for example, particular PCR values) required for the data to be decrypted (unsealed); a software analogy appears in the sketch after this list.
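
The sealing idea in particular lends itself to a software analogy: derive the wrapping key from both a device secret and the platform state, so the data can only be unsealed when the state matches. The sketch below (using the third-party Python cryptography package) is exactly that analogy, not the TPM's actual sealing protocol.

    import hashlib, os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    storage_key = os.urandom(32)  # stands in for the TPM's internal storage key

    def seal(secret, pcr_state):
        # Bind the wrapping key to both the device secret and the platform state.
        key = hashlib.sha256(storage_key + pcr_state).digest()
        nonce = os.urandom(12)
        return nonce, AESGCM(key).encrypt(nonce, secret, None)

    def unseal(nonce, blob, pcr_state):
        # Succeeds only if the platform state matches the state sealed against.
        key = hashlib.sha256(storage_key + pcr_state).digest()
        return AESGCM(key).decrypt(nonce, blob, None)  # raises InvalidTag otherwise

    trusted_state = hashlib.sha256(b"expected boot measurements").digest()
    nonce, blob = seal(b"disk encryption key", trusted_state)
    print(unseal(nonce, blob, trusted_state))  # b'disk encryption key'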

Computer programs can use a TPM to authenticate hardware devices, since each TPM chip has a unique and secret RSA key burned in as it is produced. Pushing the security down to the hardware level provides more protection than a software-only solution.

Trusted modules can be used in computing devices other than PCs, such as mobile phones or network equipment.

The nature of hardware-based cryptography ensures that the information stored in hardware is better protected from external software attacks. A variety of applications storing secrets on a TPM can be developed. These applications make it much harder to access information on computing devices without proper authorization (e.g., if the device was stolen). If the configuration of the platform has changed as a result of unauthorized activities, access to data and secrets can be denied and sealed off using these applications.

However, it is important to understand that TPM cannot control the software that is running on a PC. TPM can store pre-run time configuration parameters, but it is other applications that determine and implement policies associated with this information.

Processes that need to secure secrets, such as digital signing, can be made more secure with a TPM. And mission critical applications requiring greater security, such as secure email or secure document management, can offer a greater level of protection when using a TPM. For example, if at boot time it is determined that a PC is not trustworthy because of unexpected changes in configuration, access to highly secure applications can be blocked until the issue is remedied (if a policy has been set up that requires such action). With a TPM, one can be more certain that artifacts necessary to sign secure email messages have not been affected by software attacks. And, with the use of remote attestation, other platforms in the trusted network can make a determination, to which extent they can trust information from another PC. Attestation or any other TPM functions do not transmit personal information of the user of the platform.

These capabilities can improve security in many areas of computing, including e-commerce, citizen-to-government applications, online banking, confidential government communications and many other fields where greater security is required. Hardware-based security can improve protection for VPN, wireless networks, file encryption (as in Microsoft’s BitLocker) and password/PIN/credential management. The TPM specification is OS-agnostic, and software stacks exist for several operating systems. TPMs use the following cryptographic algorithms:

  • Rivest–Shamir–Adleman (RSA)

  • Secure Hash Algorithm 1 (SHA1)

  • Hash-based Message Authentication Code (HMAC); a short example follows this list
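
As a quick illustration of the last of these, the sketch below computes and checks an HMAC with Python's standard library. SHA-1 is used here only because it is the digest the TPM 1.x family specifies (newer TPMs use SHA-2), and the key is a made-up placeholder.

    import hmac, hashlib

    key = b"shared secret provisioned in the TPM"  # illustrative only
    message = b"platform measurement report"

    tag = hmac.new(key, message, hashlib.sha1).hexdigest()
    # Constant-time comparison guards against timing attacks on the check.
    print(hmac.compare_digest(tag, hmac.new(key, message, hashlib.sha1).hexdigest()))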

The Trusted Computing Group (TCG) is an international de facto standards body of approximately 120 companies engaged in creating specifications that define PC TPMs, trusted modules for other devices, trusted infrastructure requirements, APIs and protocols necessary to operate a trusted environment. After specifications are completed, they are released to the technology community and can be downloaded from the TCG Web Site.

Without standard security procedures and shared specifications, it is not possible for components of the trusted environment to interoperate, and trusted computing applications cannot be implemented to work on all platforms. A proprietary solution cannot ensure global interoperability, and it is not capable of providing a comparable level of assurance due to more limited access to cryptographic and security expertise and reduced availability for a rigorous review process. From the point of view of cryptography, interoperability with the other elements of the platform, other platforms, and the infrastructure requires trusted modules to be able to use the same cryptographic algorithms. Although standard published algorithms may have weaknesses, these algorithms are thoroughly tested and are gradually replaced or improved when vulnerabilities are discovered; this is not true in the case of proprietary algorithms.

Real Time Operating Systems

What is a Real-Time OS?

In general, an operating system (OS) is responsible for managing the hardware resources of a computer and hosting applications that run on the computer. An RTOS performs these tasks, but is also specially designed to run applications with very precise timing and a high degree of reliability. This can be especially important in measurement and automation systems where downtime is costly or a program delay could cause a safety hazard.

To be considered "real-time", an operating system must have a known maximum time for each of the critical operations that it performs (or at least be able to guarantee that maximum most of the time). Some of these operations include OS calls and interrupt handling. Operating systems that can absolutely guarantee a maximum time for these operations are commonly referred to as "hard real-time", while operating systems that can only guarantee a maximum most of the time are referred to as "soft real-time". In practice, these strict categories have limited usefulness - each RTOS solution demonstrates unique performance characteristics and the user should carefully investigate these characteristics.

To fully grasp these concepts, it is helpful to consider an example. Imagine that you are designing an airbag system for a new model of car. In this case, a small error in timing (causing the airbag to deploy too early or too late) could be catastrophic and cause injury. Therefore, a hard real-time system is needed; you need assurance as the system designer that no single operation will exceed certain timing constraints. On the other hand, if you were to design a mobile phone that received streaming video, it may be ok to lose a small amount of data occasionally even though on average it is important to keep up with the video stream. For this application, a soft real-time operating system may suffice.

The main point is that, if programmed correctly, an RTOS can guarantee that a program will run with very consistent timing. Real-time operating systems do this by providing programmers with a high degree of control over how tasks are prioritized, and typically also allow checking to make sure that important deadlines are met.

In contrast to real-time operating systems, the most popular operating systems for personal computer use (such as Windows) are called general-purpose operating systems. While more in-depth technical information on how real-time operating systems differ from general-purpose operating systems is given in a section below, it is important to remember that there are advantages and disadvantages to both types of OS. Operating systems like Windows are designed to maintain user responsiveness with many programs and services running (ensuring "fairness"), while real-time operating systems are designed to run critical applications reliably and with precise timing (paying attention to the programmer's priorities).

Important Terminology and Concepts

  • Determinism: An application (or critical piece of an application) that runs on a hard real-time operating system is referred to as deterministic if its timing can be guaranteed within a certain margin of error.

  • Soft vs Hard Real-Time: An OS that can absolutely guarantee a maximum time for the operations it performs is referred to as hard real-time. In contrast, an OS that can usually perform operations in a certain time is referred to as soft real-time.

  • Jitter: The amount of error in the timing of a task over subsequent iterations of a program or loop is referred to as jitter. Real-time operating systems are optimized to provide a low amount of jitter when programmed correctly; a task will take very close to the same amount of time to execute each time it is run. A simple jitter-measurement sketch follows this list.
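
Jitter is easy to observe empirically. The sketch below runs a fixed-period loop and records how late each iteration fires; on a general-purpose OS the worst-case lateness is effectively unbounded, whereas an RTOS is designed to keep it within a known bound.

    import time

    period = 0.010  # target loop period: 10 ms
    samples = []
    next_deadline = time.perf_counter() + period
    for _ in range(100):
        while time.perf_counter() < next_deadline:  # busy-wait until the deadline
            pass
        samples.append(time.perf_counter() - next_deadline)  # lateness = jitter
        next_deadline += period

    print(f"max jitter: {max(samples) * 1e6:.1f} microseconds")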

Example Real-Time Applications

Real-time operating systems were designed for two general classes of applications: event response and closed-loop control. Event response applications, such as automated visual inspection of assembly line parts, require a response to a stimulus in a certain amount of time. In this visual inspection system, for example, each part must be photographed and analyzed before the assembly line moves.

By carefully programming an application that runs on a hard real-time operating system, designers working on event response applications can guarantee that a response will happen deterministically (within a certain maximum amount of time). Considering the parts inspection example, using a general-purpose OS could result in a part not being inspected in time - therefore delaying the assembly line, forcing the part to be discarded, or shipping a potentially defective part.

In contrast, closed-loop control systems, such as an automotive cruise control system, continuously process feedback data to adjust one or more outputs. Because each output value depends on processing the input data in a fixed amount of time, it is critical that loop deadlines are met in order to assure that the correct outputs are produced. What would happen if a cruise control system failed to determine what the throttle setting should be at a given point in time? Once again, hard real-time operating systems can guarantee that control system input data is processed in a consistent amount of time (with a fixed worst-case maximum).

It should also be noted that many applications that must run for extended periods of time can benefit from the reliability that an RTOS can provide. Because real-time operating systems typically run a minimal set of software rather than many applications and processes at the same time, they are well suited for systems that require 24-7 operation or where down-time is unacceptable or expensive.

Under the Hood: How Real-Time OSs Differ from General-Purpose OSs

Operating systems such as Microsoft Windows and Mac OS can provide an excellent platform for developing and running your non-critical measurement and control applications. However, these operating systems are designed for different use cases than real-time operating systems, and are not the ideal platform for running applications that require precise timing or extended up-time. This section will identify some of the major under-the-hood differences between both types of operating systems, and explain what you can expect when programming a real-time application.

Setting Priorities

When programming an application, most operating systems (of any type) allow the programmer to specify a priority for the overall application and even for different tasks within the application (threads). These priorities serve as a signal to the OS, dictating which operations the designer feels are most important. The goal is that if two or more tasks are ready to run at the same time, the OS will run the task with the higher priority.

In practice, general-purpose operating systems do not always follow these programmed priorities strictly. Because general-purpose operating systems are optimized to run a variety of applications and processes simultaneously, they typically work to make sure that all tasks receive at least some processing time. As a result, low-priority tasks may in some cases have their priority boosted above other higher priority tasks. This ensures some amount of run-time for each task, but means that the designer's wishes are not always followed.

In contrast, real-time operating systems follow the programmer's priorities much more strictly. On most real-time operating systems, if a high priority task is using 100% of the processor, no other lower priority tasks will run until the high priority task finishes. Therefore, real-time system designers must program their applications carefully with priorities in mind. In a typical real-time application, a designer will place time-critical code (e.g. event response or control code) in one section with a very high priority. Other less-important code such as logging to disk or network communication may be combined in a section with a lower priority.
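
On a Linux host, one way to approximate this strict-priority behavior is the POSIX SCHED_FIFO policy, sketched below. This illustrates priority assignment in general, not any particular RTOS API; it is Linux-specific and requires elevated privileges.

    import os

    # Under SCHED_FIFO, a runnable higher-priority task preempts all
    # normal-priority work and runs until it blocks or yields.
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(80))
        print("running under SCHED_FIFO at priority 80")
    except PermissionError:
        print("insufficient privileges for real-time priority")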

Interrupt Latency

Interrupt latency is measured as the amount of time between when a device generates an interrupt and when that device is serviced. While general-purpose operating systems may take a variable amount of time to respond to a given interrupt, real-time operating systems must guarantee that all interrupts will be serviced within a certain maximum amount of time. In other words, the interrupt latency of real-time operating systems must be bounded.

Performance

One common misconception is that real-time operating systems have better performance than other general-purpose operating systems. While real-time operating systems may provide better performance in some cases due to less multitasking between applications and services, this is not a rule. Actual application performance will depend on CPU speed, memory architecture, program characteristics, and more.

Though real-time operating systems may or may not increase the speed of execution, they can provide much more precise and predictable timing characteristics than general-purpose operating systems.

Hardware-Based Encryption

Hardware-based encryption uses a device’s on-board security to perform encryption and decryption. It is self-contained and does not require the help of any additional software, so it is largely insulated from host-side contamination, malicious code infection, and software vulnerabilities.

When a device is used on a host computer, a good hardware-based solution requires no drivers to be loaded, so no interaction with the processes of the host system is required. It also requires minimum configuration and user interaction and does not cause performance degradation.

A hardware-based solution is most advisable when protecting sensitive data on a portable device such as a laptop or a USB flash drive; it is also effective when protecting data at rest. Drives containing sensitive data like that pertaining to financial, healthcare or government fields are better protected through hardware keys that can be effective even if drives are stolen and installed in other computers.

Self-Encrypted Drives (SEDs) are an excellent option for high-security environments. With SEDs, the encryption is performed on the drive itself, where the Disk Encryption Key (DEK) used to encrypt and decrypt is securely stored. The drive controller uses the DEK to automatically encrypt all data written to the drive and decrypt it as it leaves the drive. Nothing, from the encryption keys to the authentication of the user, is exposed in the memory or processor of the host computer, making the system less vulnerable to attacks aimed at the encryption key.

  • Uses a dedicated processor physically located on the encrypted drive

  • Processor contains a random number generator to generate an encryption key, which the user’s password will unlock

  • Increased performance by off-loading encryption from the host system

  • Safeguard keys and critical security parameters within crypto-hardware

  • Authentication takes place on the hardware

  • Cost-effective in medium and larger application environments, easily scalable

  • Encryption is tied to a specific device, so encryption is “always on”

  • Does not require any type of driver installation or software installation on host PC

  • Protects against the most common attacks, such as cold boot attacks, malicious code, brute force attack

Hardware-based encryption offers stronger resilience against some common, not-so-sophisticated attacks. In general, malicious hackers won’t be able to apply brute-force attacks to a hardware-encrypted system, as the crypto module will shut down the system, and possibly erase the protected data, after a certain number of password-cracking attempts. With software-based solutions, however, hackers might be able to locate and possibly reset the counters, as well as copy the encrypted file to different systems for parallel cracking attempts.

Hardware solutions, however, might be impractical due to cost. Hardware encryption is also tied to a particular device and one solution cannot be applied to the entire system and all its parts. Updates are also possible only through device substitution.

Trusted Execution

What is Trusted Execution?

Trusted Execution Technology (TXT) is a set of hardware extensions to processors and chipsets that enhance the digital office platform with security capabilities such as measured launch and protected execution. TXT provides hardware-based mechanisms that help protect against software-based attacks and protect the confidentiality and integrity of data stored or created on the client PC.

TXT provides these mechanisms by enabling an environment where applications can run within their own space—protected from all other software on the system. These capabilities provide the protection mechanisms, rooted in hardware, that are necessary to provide trust in the application's execution environment. In turn, these mechanisms can protect vital data and processes from being compromised by malicious software running on the platform.

How Does Trusted Execution Work?

The Trusted Platform Module (TPM) as specified by the TCG provides many security functions, including special registers (called Platform Configuration Registers – PCRs) which hold various measurements in a shielded location in a manner that prevents spoofing. Measurements consist of:

  • a cryptographic hash using a Secure Hashing Algorithm (SHA);

    • TPM v1.0 specification uses the SHA-1 hashing algorithm.

    • More recent TPM versions (v2.0+) call for SHA-2.

A desired characteristic of a cryptographic hash algorithm is that (for all practical purposes) the hash results (referred to as hash digests, or simply hashes) of any two modules will be identical only if the modules themselves are identical.
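
The property is easy to demonstrate with Python's standard hashlib: identical inputs always produce identical digests, and any change, however small, produces a completely different digest.

    import hashlib

    module_a = b"authentic firmware module"
    module_b = b"authentic firmware module"    # identical content
    module_c = b"authentic firmware module!"   # one byte appended

    print(hashlib.sha256(module_a).hexdigest() == hashlib.sha256(module_b).hexdigest())  # True
    print(hashlib.sha256(module_a).hexdigest() == hashlib.sha256(module_c).hexdigest())  # False
    print(hashlib.sha256(module_c).hexdigest())  # tiny change, entirely different digest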

What Every Engineer Should Know About Cryptography

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Cryptography

  • Cryptography is a method of protecting information and communications through the use of codes so that only those for whom the information is intended can read and process it.

    • The prefix "crypt" means "hidden" or "vault" and the suffix "graphy" stands for "writing."

  • In computer science, cryptography refers to secure information and communication techniques derived from mathematical concepts and a set of rule-based calculations called algorithms to transform messages in ways that are hard to decipher.

    • These deterministic algorithms are used for cryptographic key generation and digital signing and verification to protect data privacy, web browsing on the internet and confidential communications such as credit card transactions and email.

Cryptography techniques

  • Cryptography is closely related to the disciplines of cryptology and cryptanalysis.

    • It includes techniques such as microdots, merging words with images, and other ways to hide information in storage or transit.

      • However, in today's computer-centric world, cryptography is most often associated with scrambling plaintext (ordinary text, sometimes referred to as cleartext) into ciphertext (a process called encryption), then back again (known as decryption).

        • Individuals who practice this field are known as cryptographers.

Modern cryptography concerns itself with the following four objectives:

  1. Confidentiality: the information cannot be understood by anyone for whom it was not intended

  2. Integrity: the information cannot be altered in storage or transit between sender and intended receiver without the alteration being detected

  3. Non-repudiation: the creator/sender of the information cannot deny at a later stage his or her intentions in the creation or transmission of the information

  4. Authentication: the sender and receiver can confirm each other's identity and the origin/destination of the information

Procedures and protocols that meet some or all of the above criteria are known as cryptosystems. Cryptosystems are often thought to refer only to mathematical procedures and computer programs; however, they also include the regulation of human behavior, such as choosing hard-to-guess passwords, logging off unused systems, and not discussing sensitive procedures with outsiders.

Cryptographic algorithms

  • Cryptosystems use a set of procedures known as cryptographic algorithms, or ciphers, to encrypt and decrypt messages to secure communications among computer systems, devices such as smartphones, and applications.

    • A cipher suite uses one algorithm for encryption, another algorithm for message authentication and another for key exchange.

      • This process, embedded in protocols and written in software that runs on operating systems and networked computer systems, involves public and private key generation for data encryption/decryption, digital signing and verification for message authentication, and key exchange.

Types of cryptography

Advanced Encryption Standard (AES)

Single-key or symmetric-key encryption algorithms use a single secret key that the creator/sender uses to encipher data (encryption) and that the receiver uses to decipher it; many of these ciphers operate on fixed-length blocks of bits and are known as block ciphers. Types of symmetric-key cryptography include the Advanced Encryption Standard (AES), a specification established in November 2001 by the National Institute of Standards and Technology as a Federal Information Processing Standard (FIPS 197), to protect sensitive information. The standard is mandated by the U.S. government and widely used in the private sector.

In June 2003, AES was approved by the U.S. government for classified information. It is a royalty-free specification implemented in software and hardware worldwide. AES is the successor to the Data Encryption Standard (DES) and Triple DES (3DES). It uses longer key lengths (128-bit, 192-bit, 256-bit) to prevent brute force and other attacks.
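
As a brief illustration of symmetric encryption, the sketch below uses AES-256 in GCM mode via the third-party Python cryptography package; the same shared key both encrypts and decrypts.

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    key = AESGCM.generate_key(bit_length=256)  # 256-bit AES key, the shared secret
    nonce = os.urandom(12)                     # never reuse a nonce with the same key

    ciphertext = AESGCM(key).encrypt(nonce, b"sensitive information", None)
    plaintext = AESGCM(key).decrypt(nonce, ciphertext, None)
    print(plaintext)  # b'sensitive information'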

Public-key or asymmetric-key encryption algorithms

Public-key or asymmetric-key encryption algorithms use a pair of keys: a public key, which may be shared openly, for encrypting messages, and a private key that only its owner knows (unless it is exposed or they decide to share it) for decrypting that information. The types of public-key cryptography include RSA, used widely on the internet; Elliptic Curve Digital Signature Algorithm (ECDSA) used by Bitcoin; Digital Signature Algorithm (DSA) adopted as a Federal Information Processing Standard for digital signatures by NIST in FIPS 186-4; and Diffie-Hellman key exchange.
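
A minimal asymmetric example, again using the Python cryptography package: anyone holding the public key can encrypt, but only the private-key holder can decrypt.

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()  # shared freely; cannot decrypt

    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    ciphertext = public_key.encrypt(b"for the key holder only", oaep)
    print(private_key.decrypt(ciphertext, oaep))  # b'for the key holder only'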

Types of cryptographic hash functions

To maintain data integrity in cryptography, hash functions, which return a deterministic output from an input value, are used to map data to a fixed data size. Types of cryptographic hash functions (compared in the sketch after this list) include

  1. SHA-1 (Secure Hash Algorithm 1)

  2. SHA-2

  3. SHA-3
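
The sketch below computes a digest of the same input with one member of each family using Python's standard hashlib; note that SHA-1 is shown for completeness only, as it is no longer considered collision-resistant.

    import hashlib

    data = b"message to protect"
    for name in ("sha1", "sha256", "sha3_256"):  # SHA-1, SHA-2, and SHA-3 families
        digest = hashlib.new(name, data).hexdigest()
        print(f"{name:8s} {len(digest) * 4:3d} bits  {digest[:16]}...")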

History of cryptography

Word "cryptography" is derived from the Greek kryptos, meaning hidden

The word "cryptography" is derived from the Greek kryptos, meaning hidden. The origin of cryptography is usually dated from about 2000 B.C., with the Egyptian practice of hieroglyphics. These consisted of complex pictograms, the full meaning of which was only known to an elite few. The first known use of a modern cipher was by Julius Caesar (100 B.C. to 44 B.C.), who did not trust his messengers when communicating with his governors and officers. For this reason, he created a system in which each character in his messages was replaced by a character three positions ahead of it in the Roman alphabet.

Cryptography has turned into a battleground

In recent times, cryptography has turned into a battleground of some of the world's best mathematicians and computer scientists. The ability to securely store and transfer sensitive information has proved a critical factor in success in war and business.

Impact of internet on cryptography

The internet has allowed the spread of powerful programs and, more importantly, the underlying techniques of cryptography, so that today many of the most advanced cryptosystems and ideas are now in the public domain.

Cryptography concerns

  • Attackers can circumvent cryptography, hack into computers that are responsible for data encryption and decryption, and exploit weak implementations, such as the use of default keys.

    • However, cryptography makes it harder for attackers to access messages and data protected by encryption algorithms.

In summary, cryptography is the technique of securing information and communications through the use of codes so that only those for whom the information is intended can understand and process it, thus preventing unauthorized access to information. The prefix “crypt” means “hidden” and the suffix “graphy” means “writing.”

What Every Engineer Should Know About 10 Data Security Solutions

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Data security refers to the process of protecting data from unauthorized access and data corruption throughout its lifecycle. Data security includes data encryption, tokenization, and key management practices that protect data across all applications and platforms.

What is Data Security?

  • Data security refers to the process of protecting data from unauthorized access and data corruption throughout its lifecycle.

    • Data security includes data encryption, tokenization, and key management practices that protect data across all applications and platforms.

Why Data Security?

  • Organizations around the globe are investing heavily in information technology (IT) cyber defense capabilities to protect their critical assets.

    • Whether an enterprise needs to protect a brand, intellectual capital, and customer information or provide controls for critical infrastructure, the means for incident detection and response to protecting organizational interests have three common elements:

      1. people,

      2. processes,

      3. technology.

10 Data Security Solutions

  1. Cloud access security – Protection platform that allows you to move to the cloud securely while protecting data in cloud applications.

  2. Data encryption – Data-centric and tokenization security solutions that protect data across enterprise, cloud, mobile and big data environments.

  3. Hardware security module -- Hardware security module that guards financial data and meets industry security and compliance requirements.

  4. Key management -- Solution that protects data and enables industry regulation compliance.

  5. Enterprise Data Protection – Solution that provides an end-to-end data-centric approach to enterprise data protection.

  6. Payments Security – Solution provides complete point-to-point encryption and tokenization for retail payment transactions, enabling PCI scope reduction.

  7. Big Data, Hadoop and IoT data protection – Solution that protects sensitive data in the Data Lake – including Hadoop, Teradata, Micro Focus Vertica, and other Big Data platforms.

  8. Mobile App Security - Protecting sensitive data in native mobile apps while safeguarding the data end-to-end.

  9. Web Browser Security - Protects sensitive data captured at the browser, from the point the customer enters cardholder or personal data, and keeps it protected through the ecosystem to the trusted host destination.

  10. eMail Security – Solution that provides end-to-end encryption for email and mobile messaging, keeping Personally Identifiable Information and Personal Health Information secure and private.

What Every Engineer Should Know About Vulnerability Management

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Vulnerability management is the "cyclical practice of identifying, classifying, prioritizing, remediating, and mitigating" software vulnerabilities.

What is Vulnerability Management and Scanning?

  • Vulnerability management is the process of identifying, evaluating, treating, and reporting on security vulnerabilities in systems and the software that runs on them.

    • Implemented alongside other security tactics, this is vital for organizations to prioritize possible threats and minimize their "attack surface."

  • Security vulnerabilities, in turn, refer to technological weaknesses that allow attackers to compromise a product and the information it holds.

    • This process needs to be performed continuously in order to keep up with new systems being added to networks, changes that are made to systems, and the discovery of new vulnerabilities over time.

  • Vulnerability management software can help automate this process.

    • They’ll use a vulnerability scanner and sometimes endpoint agents to inventory a variety of systems on a network and find vulnerabilities on them.

      • Once vulnerabilities are identified, the risk they pose needs to be evaluated in different contexts so decisions can be made about how to best treat them.

        • For example, vulnerability validation can be an effective way to contextualize the real severity of a vulnerability.

What is the difference between Vulnerability Management and Vulnerability Assessment?

  • Generally, a Vulnerability Assessment is a portion of the complete Vulnerability Management system.

    • Organizations will likely run multiple Vulnerability Assessments to get more information on their Vulnerability Management action plan.

The vulnerability management process can be broken down into the following four steps:

  1. Identifying Vulnerabilities

  2. Evaluating Vulnerabilities

  3. Treating Vulnerabilities

  4. Reporting Vulnerabilities

Step 1: Identifying Vulnerabilities

At the heart of a typical vulnerability management solution is a vulnerability scanner. The scan consists of four stages:

  1. Scan network-accessible systems by pinging them or sending them TCP/UDP packets

  2. Identify open ports and services running on scanned systems

  3. If possible, remotely log in to systems to gather detailed system information

  4. Correlate system information with known vulnerabilities

Vulnerability scanners are able to identify a variety of systems running on a network, such as laptops and desktops, virtual and physical servers, databases, firewalls, switches, printers, etc. Identified systems are probed for different attributes: operating system, open ports, installed software, user accounts, file system structure, system configurations, and more. This information is then used to associate known vulnerabilities to scanned systems. In order to perform this association, vulnerability scanners will use a vulnerability database that contains a list of publicly known vulnerabilities.
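
As a toy illustration of the first two stages, the sketch below checks which TCP ports on a host accept a connection, using only Python's standard socket module. Real scanners do far more (service fingerprinting, credentialed checks, vulnerability correlation), and scanning should only ever target systems you are authorized to test.

    import socket

    def scan_ports(host, ports, timeout=0.5):
        # Stages 1-2: probe network-accessible services via TCP connect attempts.
        open_ports = []
        for port in ports:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.settimeout(timeout)
                if s.connect_ex((host, port)) == 0:  # 0 means the connection succeeded
                    open_ports.append(port)
        return open_ports

    # Only scan hosts you are authorized to test.
    print(scan_ports("127.0.0.1", [22, 80, 443, 3306, 8080]))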

  • Properly configuring vulnerability scans is an essential component of a vulnerability management solution.

  • However, vulnerability scanners can sometimes disrupt the networks and systems that they scan. If available network bandwidth becomes very limited during an organization’s peak hours, then vulnerability scans should be scheduled to run during off hours.

  • If some systems on a network become unstable or behave erratically when scanned, they might need to be excluded from vulnerability scans, or the scans may need to be fine-tuned to be less disruptive.

    • Adaptive scanning is a new approach to further automating and streamlining vulnerability scans based on changes in a network.

      • For example, when a new system connects to a network for the first time, a vulnerability scanner will scan just that system as soon as possible instead of waiting for a weekly or monthly scan to start scanning that entire network.

  • Vulnerability scanners aren’t the only way to gather system vulnerability data anymore, though.

  • Endpoint agents allow vulnerability management solutions to continuously gather vulnerability data from systems without performing network scans.

    • This helps organizations maintain up-to-date system vulnerability data whether or not, for example, employees’ laptops are connected to the organization’s network or an employee’s home network.

  • Regardless of how a vulnerability management solution gathers this data, it can be used to create reports, metrics, and dashboards for a variety of audiences.

Step 2: Evaluating Vulnerabilities

  • After vulnerabilities are identified, they need to be evaluated so the risks posed by them are dealt with appropriately and in accordance with an organization’s risk management strategy.

  • Vulnerability management solutions will provide different risk ratings and scores for vulnerabilities, such as Common Vulnerability Scoring System (CVSS) scores.

    • These scores are helpful in telling organizations which vulnerabilities they should focus on first, but the true risk posed by any given vulnerability depends on some other factors beyond these out-of-the-box risk ratings and scores.

Here are some examples of additional factors to consider when evaluating vulnerabilities:

  1. Is this vulnerability a true or false positive?

  2. Could someone directly exploit this vulnerability from the Internet?

  3. How difficult is it to exploit this vulnerability?

  4. Is there known, published exploit code for this vulnerability?

  5. What would be the impact to the business if this vulnerability were exploited?

  6. Are there any other security controls in place that reduce the likelihood and/or impact of this vulnerability being exploited?

  7. How old is the vulnerability/how long has it been on the network?

Like any security tool, vulnerability scanners aren’t perfect. Their vulnerability detection false-positive rates, while low, are still greater than zero. Performing vulnerability validation with penetration testing tools and techniques helps weed out false-positives so organizations can focus their attention on dealing with real vulnerabilities. The results of vulnerability validation exercises or full-blown penetration tests can often be an eye-opening experience for organizations that thought they were secure enough or that the vulnerability wasn’t that risky.
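
One way to act on these questions is to adjust the out-of-the-box score with context, as in the sketch below. The findings and the weighting factors are invented for illustration; this is not a standard formula.

    # Hypothetical scan findings; fields mirror the evaluation questions above.
    findings = [
        {"id": "CVE-A", "cvss": 9.8, "internet_facing": True,  "exploit_public": True},
        {"id": "CVE-B", "cvss": 9.8, "internet_facing": False, "exploit_public": False},
        {"id": "CVE-C", "cvss": 6.5, "internet_facing": True,  "exploit_public": True},
    ]

    def risk_score(f):
        # Weight the base CVSS score by exposure and exploit availability.
        score = f["cvss"]
        score *= 1.5 if f["internet_facing"] else 1.0
        score *= 1.3 if f["exploit_public"] else 1.0
        return score

    for f in sorted(findings, key=risk_score, reverse=True):
        print(f["id"], round(risk_score(f), 1))
    # CVE-A outranks CVE-B despite identical CVSS base scores.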

Step 3: Treating Vulnerabilities

Once a vulnerability has been validated and deemed a risk, the next step is prioritizing how to treat that vulnerability with original stakeholders to the business or network. There are different ways to treat vulnerabilities, including:

  • Remediation: Fully fixing or patching a vulnerability so it can’t be exploited. This is the ideal treatment option that organizations strive for.

  • Mitigation: Lessening the likelihood and/or impact of a vulnerability being exploited. This is sometimes necessary when a proper fix or patch isn’t yet available for an identified vulnerability. This option should ideally be used to buy time for an organization to eventually remediate a vulnerability.

  • Acceptance: Taking no action to fix or otherwise lessen the likelihood/impact of a vulnerability being exploited. This is typically justified when a vulnerability is deemed a low risk, and the cost of fixing the vulnerability is substantially greater than the cost incurred by an organization if the vulnerability were to be exploited.

Vulnerability management solutions provide recommended remediation techniques for vulnerabilities. Occasionally a remediation recommendation isn’t the optimal way to remediate a vulnerability; in those cases, the right remediation approach needs to be determined by an organization’s security team, system owners, and system administrators. Remediation can be as simple as applying a readily-available software patch or as complex as replacing a fleet of physical servers across an organization’s network.

When remediation activities are completed, it’s best to run another vulnerability scan to confirm that the vulnerability has been fully resolved.

However, not all vulnerabilities need to be fixed. For example, if an organization’s vulnerability scanner has identified vulnerabilities in Adobe Flash Player on their computers, but they completely disabled Adobe Flash Player from being used in web browsers and other client applications, then those vulnerabilities could be considered sufficiently mitigated by a compensating control.

Step 4: Reporting vulnerabilities

  • Performing regular and continuous vulnerability assessments enables organizations to understand the speed and efficiency of their vulnerability management program over time.

  • Vulnerability management solutions typically have different options for exporting and visualizing vulnerability scan data with a variety of customizable reports and dashboards.

    • Not only does this help IT teams easily understand which remediation techniques will help them fix the most vulnerabilities with the least amount of effort, or help security teams monitor vulnerability trends over time in different parts of their network, but it also helps support organizations’ compliance and regulatory requirements.

Staying Ahead of Attackers through Vulnerability Management

  • Threats and attackers are constantly changing, just as organizations are constantly adding new mobile devices, cloud services, networks, and applications to their environments.

    • With every change comes the risk that a new hole has been opened in your network, allowing attackers to slip in and walk out with your crown jewels.

  • Every time you get a new affiliate partner, employee, client or customer, you open up your organization to new opportunities, but you’re also exposing it to new threats.

    • Protecting your organization from these threats requires a vulnerability management solution that can keep up with and adapt to all of these changes.

      • Without that, attackers will always be one step ahead.

Pen testing not included in vulnerability management

  • Vulnerability management is not a penetration test.

  • Just because a product scans your systems doesn’t mean you have a pen test tool.

    • In fact, the reality is quite the opposite.

      • A vulnerability management scanner is often checking for the presence or absence of a specific condition such as the installation of a specific patch.

  • A pen test tool, on the other hand, will actually attempt to break into the system using predefined exploits.

  • While both types of tests might ultimately deliver the same recommendation, the methods used to arrive at these conclusions are wildly different.

  • If you’re looking for a good pen test, odds are good that you need more than a tool.

  • A pen test should be exhaustive and include physical testing and in-person interviews as well as many other things.

Conclusion

In conclusion, vulnerability management is only one piece of a security program. It’s not going to solve the entire risk management challenge. Vulnerability management is the foundation of a security program. You have to start with a comprehensive understanding of what’s on your network. If you don’t know it’s there, there’s no way you can protect it. You also have to understand the risks for every asset on your network in order to effectively prioritize and remediate.

What Every Engineer Should Know About Public Key Infrastructure (PKI)

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

What is Public Key Infrastructure (PKI)?

Public Key Infrastructure (PKI) is a technology for authenticating users and devices in the digital world.

    • The basic idea is to have one or more trusted parties digitally sign documents certifying that a particular cryptographic key belongs to a particular user or device.

    • The key can then be used as an identity for the user in digital networks.

  • The users and devices that have keys are often just called entities.

    • In general, anything can be associated with a key that it can use as its identity.

    • Besides a user or device, it could be a program, process, manufacturer, component, or something else.

    • The purpose of a PKI is to securely associate a key with an entity.

  • The trusted party signing the document associating the key with the device is called a Certificate Authority (CA).

    • The certificate authority also has a cryptographic key that it uses for signing these documents.

    • These documents are called certificates.

  • In the real world, there are many certificate authorities, and most computers and web browsers trust a hundred or so certificate authorities by default.

  • A public key infrastructure relies on digital signature technology, which uses public key cryptography.

    • The basic idea is that the secret key of each entity is only known by that entity and is used for signing.

    • This key is called the private key.

    • There is another, mathematically related key, called the public key, which is used for verifying signatures but cannot be used to sign.

      • This public key is made available to anyone, and is typically included in the certificate document.

      • Public Key Infrastructure (PKI) is the set of hardware, software, policies, processes, and procedures required to create, manage, distribute, use, store, and revoke digital certificates and public-keys.

  • PKIs are the foundation that enables the use of technologies, such as digital signatures and encryption, across large user populations.

  • PKIs deliver the elements essential for a secure and trusted business environment for e-commerce and the growing Internet of Things (IoT).

  • PKIs help establish the identity of people, devices, and services – enabling controlled access to systems and resources, protection of data, and accountability in transactions.

  • Next generation business applications are becoming more reliant on PKI technology to guarantee high assurance, because evolving business models are becoming more dependent on electronic interaction requiring online authentication and compliance with stricter data security regulations.

The Role of Certificate Authorities (CAs)

  • In order to bind public keys with their associated user (owner of the private key), PKIs use digital certificates.

  • Digital certificates are the credentials that facilitate the verification of identities between users in a transaction.

  • Much as a passport certifies one’s identity as a citizen of a country, the digital certificate establishes the identity of users within the ecosystem.

  • Because digital certificates are used to identify the users to whom encrypted data is sent, or to verify the identity of the signer of information, protecting the authenticity and integrity of the certificate is imperative to maintain the trustworthiness of the system.

  • Certificate authorities (CAs) issue the digital credentials used to certify the identity of users. CAs underpin the security of a PKI and the services they support, and therefore can be the focus of sophisticated targeted attacks.

  • In order to mitigate the risk of attacks against CAs, physical and logical controls as well as hardening mechanisms, such as Hardware Security Modules (HSMs) have become necessary to ensure the integrity of a PKI.

PKI Deployment

  • PKIs provide a framework that enables cryptographic data security technologies such as digital certificates and signatures to be effectively deployed on a mass scale.

  • PKIs support identity management services within and across networks and underpin online authentication inherent in Secure Socket Layer (SSL) and Transport Layer Security (TLS) for protecting internet traffic, as well as document and transaction signing, application code signing, and time-stamping.

  • PKIs support solutions for desktop login, citizen identification, mass transit, mobile banking, and are critically important for device credentialing in the IoT.

  • Device credentialing is becoming increasingly important to impart identities to growing numbers of cloud-based and internet-connected devices that run the gamut from smart phones to medical equipment.

Cryptographic Security

  • Using the principles of asymmetric and symmetric cryptography, PKIs facilitate the establishment of a secure exchange of data between users and devices – ensuring authenticity, confidentiality, and integrity of transactions.

  • Users (also known as “Subscribers” in PKI parlance) can be individual end users, web servers, embedded systems, connected devices, or programs/applications that are executing business processes.

  • Asymmetric cryptography provides the users, devices or services within an ecosystem with a key pair composed of a public and a private key component.

    • A public key is available to anyone in the group for encryption or for verification of a digital signature.

    • The private key on the other hand, must be kept secret and is only used by the entity to which it belongs, typically for tasks such as decryption or for the creation of digital signatures.

The Increasing Importance of PKIs

  • With evolving business models becoming more dependent on electronic transactions and digital documents, and with more Internet-aware devices connected to corporate networks, the role of a PKI is no longer limited to isolated systems such as secure email, smart cards for physical access or encrypted web traffic.

  • PKIs today are expected to support larger numbers of applications, users and devices across complex ecosystems.

  • And with stricter government and industry data security regulations, mainstream operating systems and business applications are becoming more reliant than ever on an organizational PKI to guarantee trust.

How Does PKI Work?

  • PKI (or Public Key Infrastructure) is the framework of encryption and cybersecurity that protects communications between the server (your website) and the client (the users).

  • It works by using two different cryptographic keys:

    • a public key and

    • a private key.

  • The public key is available to any user that connects with the website.

  • The private key is unique to the server and is kept secret; it is never sent over the connection.

  • When communicating, the client uses the public key to encrypt, and the server uses the private key to decrypt.

  • This protects the user’s information from theft or tampering.

How Does PKI Authentication Work?

  • A Public Key Infrastructure requires several different elements for effective use.

  • A Certificate Authority (CA) is used to authenticate the digital identities of the users, which can range from individuals to computer systems to servers.

  • Certificate Authorities guard against falsified identities and manage the life cycle of any given number of digital certificates within the system.

  • Next in line is the Registration Authority (RA), which is authorized by the Certificate Authority to provide digital certificates to users on a case-by-case basis. All of the certificates that are requested, received, and revoked by both the Certificate Authority and the Registration Authority are stored in an encrypted certificate database.

  • Certificate history and information are also kept in what is called a certificate store, which is usually hosted on a specific computer and acts as a storage space for everything relevant to the certificate history, including issued certificates and private encryption keys.

    • Google Wallet is a great example of this.

  • By hosting these elements on a secure framework, a Public Key Infrastructure can protect the identities involved as well as the private information used in situations where digital security is necessary, such as smart card logins, SSL signatures, encrypted documents, and more.

    • SSL, or Secure Sockets Layer, is an encryption-based Internet security protocol.

Does PKI Perform Encryption?

  • Public Key Infrastructure is a complex subject, so you may be wondering if it actually performs encryption. The simple answer is yes, it does.

    • What is PKI if not a one-stop-shop for the encryption of classified information and private identities?

  • PKI performs encryption directly through the keys that it generates.

    • Whether these keys are public or private, they encrypt and decrypt secure data.

What Type of Encryption Does PKI Use?

  • PKI merges the use of both asymmetric and symmetric encryption.

    • Symmetric encryption protects the single session key that is generated upon the initial exchange between parties (the digital handshake, if you will).

      • This secret key must be passed from one party to another in order for all parties involved to decrypt the information that was exchanged.

    • Asymmetric encryption is fairly new to the game and you may know it better as “public key cryptography.”

      • Asymmetric encryption uses two keys to encrypt plain text: a public key and a private key.

  • Both symmetric and asymmetric encryption have their own strengths and best use case scenarios, which is what makes the combination of both so powerful in Public Key Infrastructure.
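
As an illustration of that combination, the sketch below protects bulk data with a fast symmetric session key and then wraps that key with the recipient's public key. It assumes the Python "cryptography" package; real protocols such as TLS negotiate this automatically:

    # Hybrid (symmetric + asymmetric) encryption sketch.
    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives.asymmetric import rsa, padding
    from cryptography.hazmat.primitives import hashes

    recipient_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    recipient_public = recipient_private.public_key()
    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    # 1. Bulk data is protected with a fast symmetric session key.
    session_key = Fernet.generate_key()
    ciphertext = Fernet(session_key).encrypt(b"large payload ...")

    # 2. The session key itself is wrapped with the recipient's public key.
    wrapped_key = recipient_public.encrypt(session_key, oaep)

    # 3. Only the matching private key can unwrap it and read the data.
    unwrapped = recipient_private.decrypt(wrapped_key, oaep)
    assert Fernet(unwrapped).decrypt(ciphertext) == b"large payload ..."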

Digital Certificates

  • PKI functions because of digital certificates.

  • A digital certificate is like a driver's license: it's a form of electronic identification for websites and organizations.

  • Secure connections between two communicating machines are made available through PKI because the identities of the two parties can be verified by way of certificates.

  • So how do devices get these certificates?

    • You can create your own certificates for internal communications.

    • If you would like certificates for a commercial site or something of a larger scale, you can obtain a PKI digital certificate through a trusted third party issuer, called a certificate authority.

  • Much like the state government issuing you a license, certificate authorities vet the organizations seeking certificates and issue one based on their findings.

    • Just as someone trusts the validity of your license based on the authority of the government, devices trust digital certificates based on the authority of the issuing certificate authorities.

    • This process is similar to how code signing works to verify programs and downloads.
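
For internal use, a self-signed certificate can be created in a few lines. A minimal sketch with the Python "cryptography" package follows; the subject name and lifetime are placeholders, and a commercial site would instead submit a certificate signing request to a certificate authority:

    # Self-signed certificate sketch; name and validity are placeholders.
    import datetime
    from cryptography import x509
    from cryptography.x509.oid import NameOID
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa

    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"internal.example.test")])
    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)              # self-signed: issuer == subject
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(datetime.datetime.utcnow())
        .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=365))
        .sign(key, hashes.SHA256())     # signed with our own private key
    )
    print(cert.subject)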

PKI & Digital Certificates

  • PKI functions on asymmetric key methodology: a private key and a public key.

    • The private key can only be accessed by the owner of a digital certificate, and they can choose where the public key goes.

      • A certificate is essentially a way of handing out that public key to users that the owner wants to have it.

  • Private and public PKI keys must work together.

  • A file that is encrypted by the private key can only be decrypted by the public key, and vice versa.

    • If the public key can only decrypt the file that has been encrypted by the private key, being able to decrypt that file assures that the intended receiver and sender took part in the informational transaction.
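
This is exactly the mechanism behind digital signatures. A minimal sketch, again assuming the Python "cryptography" package:

    # Sign with the private key, verify with the public key.
    from cryptography.hazmat.primitives.asymmetric import rsa, padding
    from cryptography.hazmat.primitives import hashes
    from cryptography.exceptions import InvalidSignature

    sender_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    sender_public = sender_private.public_key()

    message = b"funds transfer: $100"
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)

    # Only the sender's private key can produce this signature...
    signature = sender_private.sign(message, pss, hashes.SHA256())

    # ...and anyone with the public key can check it.
    try:
        sender_public.verify(signature, message, pss, hashes.SHA256())
        print("signature valid: the private-key holder sent this message")
    except InvalidSignature:
        print("signature invalid: message or signature was altered")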

What Is the Benefit of Providing a Public Key in the Form of a Certificate?

  • PKI authentication through the use of digital certificates is the most effective way to protect confidential electronic data.

    • These digital certificates are incredibly detailed and unique to each individual user, making them nearly impossible to falsify.

  • Before a user is issued a unique certificate, the details to be incorporated into the certificate undergo a very thorough vetting process that includes PKI authentication and authorization.

    • Certificates are backed by a number of security processes such as time-stamping, registration, validation, and more to ensure the privacy of both the identity and the electronic data affiliated with the certificate.

How Is PKI Used?

  • Public Key Infrastructure is used to protect confidential communication from one party to another.

    • By using a two-key encryption system, PKI secures sensitive electronic information as it is passed back and forth between two parties, and provides each party with a key to encrypt and decrypt the digital data.

7 Popular Ways PKI is Used

PKI security is used in many different ways. The following are a few ways that PKI security can be used:

  1. Securing emails

  2. Securing web communications (such as retail transactions)

  3. Digitally signing software

  4. Digitally signing applications

  5. Encrypting files

  6. Decrypting files

  7. Smart card authentication

Does Using a PKI Infrastructure Guarantee Secure Authentication?

  • Secure authentication can never be absolutely guaranteed, no matter how carefully we build a foundation of encryption and protection.

    • Breaches in security do happen from time to time, which is what makes the Certificate Authority and Registration Authority so vital to the operations.

  • Without a top-performing CA and RA to authenticate and manage public key information, the “web of trust” is virtually nonexistent.

 

Security Limitations of PKI

  • With all of the strengths of a Public Key Infrastructure, there is room for improvement.

    • As it currently stands, PKIs rely heavily on the integrity of the associated Certificate Authority and Registration Authority, which aren’t always functioning at the ideal level of diligence and scrutiny.

    • PKI management mistakes are another weak link that needs to be addressed.

  • Another current security limitation of Public Key Infrastructures today (or rather, a security risk) is the obvious lack of multi-factor authentication on many of the top frameworks.

  • Despite attackers' increasing ability to blow through passwords, PKIs have been slow to counter this threat with additional levels of authorization before entry.

  • Furthermore, the overall usability of Public Key Infrastructure has never been ideal.

    • More often than not, PKIs are so remarkably complicated that users would rather forgo the added PKI authorization in exchange for a more convenient security process.

  • Lastly, PKI technology is known for its inability to easily adapt to the ever-changing advancements of the digital world.

    • Users report being unhappy with their current PKI’s lack of ability to support new applications that are geared toward improvements in security, convenience, and scalability.

Q&A

Does SSL Use PKI?

  • SSL (Secure Sockets Layer) Cryptography relies heavily on PKI security to encrypt and decrypt a public key exchange using both symmetric and asymmetric encryption.

How does PKI work with an SSL? We can sum up the relationship in three phases:

  • First, the web server sends a copy of its unique asymmetric public key to the web browser.

  • The browser responds by generating a symmetric session key and encrypting it with the asymmetric public key that it received from the server.

    • In order to decrypt and utilize the session key, the web server uses the original unique asymmetric private key.

  • Once the digital relationship has been established, the web browser and the web server are able to exchange encrypted information across a secure channel. The Public Key Infrastructure acts as the framework and facilitator for the encryption, decryption, and exchange of information between the two parties.
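
Python's standard ssl module makes this handshake observable. In the minimal sketch below, the host name is a placeholder; the module performs the certificate validation and session-key negotiation described above:

    # Observe the result of a TLS handshake (requires network access).
    import socket
    import ssl

    context = ssl.create_default_context()   # uses the system's trusted CAs
    with socket.create_connection(("example.com", 443)) as sock:
        with context.wrap_socket(sock, server_hostname="example.com") as tls:
            print("negotiated protocol:", tls.version())
            print("symmetric cipher suite:", tls.cipher())
            print("server certificate subject:", tls.getpeercert().get("subject"))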

What Is PKI Authentication?

  • PKI (public key infrastructure) authentication is a framework for two-key asymmetric encryption and decryption of confidential electronic data.

    • By way of digital certificate authorization, management, and authentication, a PKI can secure private data that is exchanged between several parties, which can take the form of people, servers, and systems.

What Every Engineer Should Know About Cyber Security Simulations

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

What Are The Types Of Cyber Security Simulations?

There are two types of cyber security simulations:

  1. Offensive (Red Teaming): Red Teams are offensive in nature and specialize in attacking systems, networks, human resources, or physical assets with the goal of breaking through security defenses.

  2. Defensive (Blue Teaming): Blue Teams play defense and maintain the internal network against all cyber attacks and threats.

Somewhere in the middle is the purple team, which is usually not a separate team, although it can be. When red teams and blue teams come together to share knowledge from each other's disciplines, they enhance the security capabilities of an organization.

  • Advanced cyber security programs will perform simulated red team attacks against the blue team’s defenses in order to test the effectiveness of the network’s security.

  • The purpose of these simulations isn't for the red team to break through or for the blue team to detect an attack.

    • Instead, the goal is for both teams to share information on how to improve the overall security posture of the organization.

What Is A Red Team?

  • A red team consists of security professionals who act as adversaries to overcome cyber security controls.

  • Red teams often consist of independent ethical hackers who evaluate system security in an objective manner.

  • They utilize all the available techniques (discussed below) to find weaknesses in people, processes, and technology to gain unauthorized access to assets.

  • As a result of these simulated attacks, red teams make recommendations and plans on how to strengthen an organization’s security posture.

How Does A Red Team Work?

  • You might be surprised to learn that red teams spend more time planning an attack than they do performing attacks.

  • In fact, red teams deploy a number of methods to gain access to a network.

  • Social engineering attacks, for example, rely on reconnaissance and research to deliver targeted spear phishing campaigns.

  • Likewise, prior to performing a penetration test, packet sniffers and protocol analyzers are used to scan the network and gather as much information about the system as possible.

Typical information gathered during this phase includes:

  • Uncovering operating systems in use (Windows, macOS, or Linux).

  • Identifying the make and model of networking equipment (servers, firewalls, switches, routers, access points, computers, etc.).

  • Understanding physical controls (doors, locks, cameras, security personnel).

  • Learning what ports are open/closed on a firewall to allow/block specific traffic.

  • Creating a map of the network to determine what hosts are running what services along with where traffic is being sent.
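
As a toy illustration of the port-discovery step, the sketch below checks a few common TCP ports on a single host. The target address is a documentation placeholder, real red teams use dedicated tools such as Nmap, and such scans should only ever be run against systems you are authorized to test:

    # Toy TCP port check; run only against authorized targets.
    import socket

    TARGET = "192.0.2.10"               # placeholder documentation address
    COMMON_PORTS = [22, 80, 443, 445, 3389]

    for port in COMMON_PORTS:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.5)
            is_open = s.connect_ex((TARGET, port)) == 0   # 0 means connected
            print(f"port {port}: {'open' if is_open else 'closed/filtered'}")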

Once the red team has a more complete idea of the system they develop a plan of action designed to target vulnerabilities specific to the information they gathered above.

For example, a red team member may know that a server is running Microsoft Windows Server 2012 R2 (a server operating system) and that the default domain policies might still be in use.
Microsoft ships its software in a default state, leaving it up to network administrators to update the policies, which Microsoft recommends doing as soon as possible to harden network security. If the policies are still in their default state, an attacker can work to compromise the relaxed security measures in place.

After vulnerabilities are identified, a red team tries to exploit those weaknesses to gain access into your network. Once an attacker is in your system the typical course of action is to use privilege escalation techniques, whereby the attacker attempts to steal the credentials of an administrator who has greater/full access to the highest levels of critical information.

The Tiger Team

  • In the early days of network security, a tiger team carried out many of the same functions as a red team.

  • The term has evolved over the years; today, a tiger team refers to an elite and highly specialized group hired to take on a particular challenge against the security posture of an organization.

Examples Of Red Team Exercises

Red teams use a variety of methods and tools to exploit weaknesses and vulnerabilities in a network. It’s important to note that red teams will use any means necessary, per the terms of engagement, to break into your system. Depending on the vulnerability they may deploy malware to infect hosts or even bypass physical security controls by cloning access cards.

Examples of red team exercises include:

  • Penetration testing, also known as ethical hacking, is where the tester tries to gain access to a system, often using software tools. For example, 'John the Ripper' is a password-cracking program; it can detect what type of encryption is used and try to bypass it (a toy version of this idea is sketched after this list).

  • Social engineering is where the Red Team attempts to persuade or trick members of staff into disclosing their credentials or allowing access to a restricted area.

  • Phishing entails sending apparently-authentic emails that entice staff members to take certain actions, such as logging into the hacker’s website and entering credentials.

  • Intercepting communications: software tools such as packet sniffers and protocol analyzers can be used to map a network or read messages sent in clear text. The purpose of these tools is to gain information on the system. For example, if an attacker knows a server is running on a Microsoft operating system, they would focus their attacks on exploiting Microsoft vulnerabilities.

  • Card cloning of an employee's security card to gain access to restricted areas, such as a server room.
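
As promised above, here is a toy version of the dictionary-attack idea behind password crackers such as John the Ripper; the "leaked" hash and word list are fabricated for illustration, and real crackers handle salted hash formats and vastly larger candidate sets:

    # Toy dictionary attack: hash each candidate and compare.
    import hashlib

    stolen_hash = hashlib.sha256(b"letmein").hexdigest()   # pretend this leaked
    wordlist = ["password", "123456", "qwerty", "letmein", "dragon"]

    for candidate in wordlist:
        if hashlib.sha256(candidate.encode()).hexdigest() == stolen_hash:
            print("cracked:", candidate)
            break
    else:
        print("no match in wordlist")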

What Is A Blue Team?

  • A blue team consists of security professionals who have an inside-out view of the organization.

  • Their task is to protect the organization’s critical assets against any kind of threat.

  • They are well aware of the business objectives and the organization’s security strategy.

    • Therefore, their task is to strengthen the castle walls so no intruder can compromise the defenses.

How Does A Blue Team Work?

  • The blue team first gathers data, documents exactly what needs to be protected and carries out a risk assessment.

  • They then tighten up access to the system in many ways, including introducing stronger password policies and educating staff to ensure they understand and conform to security procedures.

  • Monitoring tools are often put in place, allowing information regarding access to the systems to be logged and checked for unusual activity.

  • Blue teams will perform regular checks on the system, for example, DNS audits, internal or external network vulnerability scans and capturing sample network traffic for analysis.

  • Blue teams have to establish security measures around key assets of an organization.

  • They start their defensive plan by identifying the critical assets, documenting the importance of these assets to the business, and determining what impact the absence of these assets will have.

  • Blue teams then perform risk assessments by identifying threats against each asset and the weaknesses these threats can exploit.

  • By evaluating the risks and prioritizing them, the blue team develops an action plan to implement controls that can lower the impact or likelihood of threats materializing against assets.

  • Senior management involvement is crucial at this stage as only they can decide to accept a risk or implement mitigating controls against it.

  • The selection of controls is often based on a cost-benefit analysis to ensure security controls deliver maximum value to the business.

For example, a blue team may identify that the company’s network is vulnerable to a DDoS (Distributed Denial of Service) attack. This attack reduces the availability of the network to legitimate users by sending incomplete traffic requests to a server. Each of these requests requires resources to perform an action, which is why the attack severely cripples a network.
The team then calculates the loss in case the threat occurs. Based on cost-benefit analysis and aligning with business objectives, a blue team would consider installing an intrusion detection and prevention system to minimize the risk of DDoS attacks.
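
Much blue-team monitoring reduces to the same pattern: collect events, count them, and alert on outliers. The toy sketch below flags source addresses with many failed logins; the log path, message format, and alert threshold are assumptions, and production teams would rely on SIEM and IDS tooling for this:

    # Toy failed-login monitor; log format and threshold are assumed.
    from collections import Counter

    failures = Counter()
    with open("auth.log") as log:                   # hypothetical log file
        for line in log:
            if "Failed password" in line:           # OpenSSH-style message
                src_ip = line.rstrip().split()[-4]  # field position assumed
                failures[src_ip] += 1

    for ip, count in failures.items():
        if count > 10:                              # arbitrary alert threshold
            print(f"ALERT: {count} failed logins from {ip}")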

Examples Of Blue Team Exercises

Blue teams use a variety of methods and tools as countermeasures to protect a network from cyber attacks. Depending on the situation a blue team may determine that additional firewalls need to be installed to block access to an internal network. Or, the risk to social engineering attacks is so critical that it warrants the cost of implementing security awareness training company-wide.

Examples of blue team exercises include:

  • Performing DNS (Domain Name System) audits to prevent phishing attacks, avoid stale DNS issues, avoid downtime from DNS record deletions, and prevent/reduce DNS and web attacks.

  • Conducting digital footprint analysis to track users' activity and identify any known signatures that might indicate a breach of security.

  • Installing endpoint security software on external devices such as laptops and smartphones.

  • Ensuring firewall access controls are properly configured and that antivirus software is kept up to date.

  • Deploying IDS and IPS software as a detective and preventive security control.

  • Implementing SIEM solutions to log and ingest network activity.

  • Analyzing logs and memory to pick up unusual activity on the system, and identify and pinpoint an attack.

  • Segregating networks and ensuring they are configured correctly.

  • Using vulnerability scanning software on a regular basis.

  • Securing systems by using antivirus or anti-malware software.

  • Embedding security in processes.

What Are The Benefits Of Red And Blue Teams?

  • Implementing a red and blue team strategy allows an organization to benefit from two totally different approaches and skillsets.

  • It also brings a certain amount of competitiveness into the task, which encourages high performance on the part of both teams.

  • The red team is valuable, in that it identifies vulnerabilities, but it can only highlight the current status of the system.

  • On the other hand, the blue team is valuable in that it gives long term protection by ensuring defenses remain strong, and by constant monitoring of the system.

The key advantage, however, is the continual improvement in the security posture of the organization by finding gaps and then filling those gaps with appropriate controls.

How Do Red And Blue Teams Work Together?

  • Communication between the two teams is the most important factor in successful red and blue team exercises.

  • The blue team should stay up to date on the new technologies for improving security and should share these findings with the red team.

  • Likewise, the red team should always be aware of new threats and penetration techniques used by hackers and advise the blue team on prevention techniques.

  • Whether or not the red team informs the blue team of a planned test depends on the goal of the test.

    • For example, if the goal is to simulate a real response scenario to a “legitimate” threat, then you wouldn’t want to tell the blue team about the test.

  • The caveat is that someone in management should be aware of the test, typically the blue team lead.

  • This ensures the response scenario is still tested, but with tighter control when/if the situation is escalated.

  • When the test is complete both teams gather information and report on their findings.

  • The red team advises the blue team if they manage to penetrate defenses, and provide advice on how to block similar attempts in a real scenario.

  • Likewise, the blue team should let the red team know whether or not their monitoring procedures picked up an attempted attack.

  • Both teams should then work together to plan, develop, and implement stronger security controls as needed.

What Is A Purple Team?

  • While red teams and blue teams share common goals, they’re often not politically aligned.

    • For example, red teams who report on vulnerabilities are praised for a job well done.

    • Therefore, they’re not incentivized to help the blue team strengthen their security by sharing information on how they bypassed their security.

  • There’s also no point in “winning” red team tests if you’re not sharing that information with the blue team.

  • Remember, the main purpose of red and blue team exercises is to strengthen the overall security posture of the organization.

  • That’s where the concept of a purple team comes into place.

  • A purple team isn't necessarily a stand-alone team, although it could be.

  • The goal of a purple team is to bring both red and blue teams together while encouraging them to work as a team to share insights and create a strong feedback loop.

  • Management should ensure that the red and blue teams work together and keep each other informed.

  • Enhanced cooperation between both teams through proper resource sharing, reporting, and knowledge sharing is essential for the continual improvement of the security program.

What Every Engineer Should Know About Static Application Security Testing (SAST)

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

What is SAST?

  • Static Application Security Testing (SAST), or static analysis, is a testing methodology that analyzes source code to find security vulnerabilities that make your organization’s applications susceptible to attack.

  • SAST scans an application before the code is compiled.

  • It’s also known as white box testing.

What problems does SAST solve?

  • SAST takes place very early in the software development life cycle (SDLC) as it does not require a working application and can take place without code being executed.

  • It helps developers identify vulnerabilities in the initial stages of development and quickly resolve issues without breaking builds or passing on vulnerabilities to the final release of the application.

  • SAST tools give developers real-time feedback as they code, helping them fix issues before they pass the code to the next phase of the SDLC.

  • This prevents security-related issues from being considered an afterthought.

  • SAST tools also provide graphical representations of the issues found, from source to sink. These help you navigate the code more easily.

  • Some tools point out the exact location of vulnerabilities and highlight the risky code.

  • Tools can also provide in-depth guidance on how to fix issues and the best place in the code to fix them, without requiring deep security domain expertise.

  • Developers can also create the customized reports they need with SAST tools; these reports can be exported offline and tracked using dashboards.

  • Tracking all the security issues reported by the tool in an organized way can help developers remediate these issues promptly and release applications with minimal problems.

  • This process contributes to the creation of a secure SDLC.

  • It’s important to note that SAST tools must be run on the application on a regular basis, such as during daily/monthly builds, every time code is checked in, or during a code release.

Why is SAST an important security activity?

  • Developers dramatically outnumber security staff.

  • It can be challenging for an organization to find the resources to perform code reviews on even a fraction of its applications.

  • A key strength of SAST tools is the ability to analyze 100% of the codebase.

  • Additionally, they are much faster than manual secure code reviews performed by humans.

  • These tools can scan millions of lines of code in a matter of minutes.

  • SAST tools automatically identify critical vulnerabilities—such as buffer overflows, SQL injection, cross-site scripting, and others—with high confidence.

  • Thus, integrating static analysis into the SDLC can yield dramatic results in the overall quality of the code developed.
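
To give a feel for what these tools automate, here is a minimal sketch of a single static-analysis rule: it walks a Python file's syntax tree, without executing the code, and flags calls to eval() or exec(). Real SAST tools ship hundreds of such rules, so this is an illustration rather than a substitute for one:

    # Toy static-analysis rule: flag risky calls without running the code.
    import ast
    import sys

    RISKY_CALLS = {"eval", "exec"}      # illustrative rule set

    source_path = sys.argv[1]
    tree = ast.parse(open(source_path).read(), filename=source_path)

    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in RISKY_CALLS):
            print(f"{source_path}:{node.lineno}: call to {node.func.id}() "
                  "- possible code-injection risk")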

What are the key steps to run SAST effectively?

  • There are six simple steps needed to perform SAST efficiently in organizations that have a very large number of applications built with different languages, frameworks, and platforms.

  1. Finalize the tool.

    • Select a static analysis tool that can perform code reviews of applications written in the programming languages you use.

      • The tool should also be able to comprehend the underlying framework used by your software.

  2. Create the scanning infrastructure, and deploy the tool.

    • This step involves handling the licensing requirements, setting up access control and authorization, and procuring the resources required (e.g., servers and databases) to deploy the tool.

  3. Customize the tool.

    • Fine-tune the tool to suit the needs of the organization.

      • For example, you might configure it to reduce false positives or find additional security vulnerabilities by writing new rules or updating existing ones.

    • Integrate the tool into the build environment, create dashboards for tracking scan results, and build custom reports.

  4. Prioritize and onboard applications.

    • Once the tool is ready, onboard your applications.

    • If you have a large number of applications, prioritize the high-risk applications to scan first.

    • Eventually, all your applications should be onboarded and scanned regularly, with application scans synced with release cycles, daily or monthly builds, or code check-ins.

  5. Analyze scan results.

    • This step involves triaging the results of the scan to remove false positives.

    • Once the set of issues is finalized, they should be tracked and provided to the deployment teams for proper and timely remediation.

  6. Provide governance and training.

    • Proper governance ensures that your development teams are employing the scanning tools properly.

    • The software security touchpoints should be present within the SDLC. SAST should be incorporated as part of your application development and deployment process.

What tools can be used for SAST?

  • Source code analysis tools, also referred to as Static Application Security Testing (SAST) Tools, are designed to analyze source code and/or compiled versions of code to help find security flaws.

  • Some tools are starting to move into the IDE.

  • For the types of problems that can be detected during the software development phase itself, the IDE is a powerful place to employ such tools, as they provide immediate feedback to the developer on issues they might be introducing into the code as they write it.

  • This immediate feedback is very useful, especially when compared to finding vulnerabilities much later in the development cycle.

How is SAST different from DAST?

  • Organizations are paying more attention to application security, owing to the rising number of breaches.

    • They want to identify vulnerabilities in their applications and mitigate risks at an early stage.

  • There are two different types of application security testing: SAST and Dynamic Application Security Testing (DAST).

    • Both testing methodologies identify security flaws in applications, but they do so differently.

Here are some of the key differences between the two testing methodologies:

  • White box vs. black box testing

    • SAST is white box security testing: the tester has access to the underlying framework, design, and implementation, and the application is tested from the inside out. This type of testing represents the developer approach.

    • DAST is black box security testing: the tester has no knowledge of the technologies or frameworks that the application is built on, and the application is tested from the outside in. This type of testing represents the hacker approach.

  • Source code vs. a running application

    • SAST doesn't require a deployed application; it analyzes the source code or binary without executing the application.

    • DAST doesn't require source code or binaries; it analyzes the application by executing it.

  • Finding vulnerabilities earlier vs. later in the SDLC

    • With SAST, the scan can be executed as soon as code is deemed feature-complete, so vulnerabilities are found earlier in the SDLC.

    • With DAST, vulnerabilities can be discovered only after the development cycle is complete, toward the end of the SDLC.

  • Cost of fixing vulnerabilities

    • Since SAST finds vulnerabilities earlier in the SDLC, it's easier, faster, and less expensive to remediate them; findings can often be fixed before the code enters the QA cycle.

    • Since DAST finds vulnerabilities toward the end of the SDLC, remediation is more expensive and often gets pushed into the next cycle; critical vulnerabilities may be fixed as an emergency release.

  • Run-time and environment-related issues

    • Since SAST scans static code, it can't discover run-time or environment-related vulnerabilities.

    • Since DAST uses dynamic analysis on a running application, it is able to find run-time and environment-related vulnerabilities.

  • Software coverage

    • SAST typically supports all kinds of software, such as web applications, web services, and thick clients.

    • DAST typically scans only apps like web applications and web services; it is not useful for other types of software.

SAST and DAST techniques complement each other. Both need to be carried out for comprehensive testing.

What Every Engineer Should Know About Cyber Threat Trees

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Cyber Threat Trees for large system threat cataloging and analysis

  • The implementation of cyber threat countermeasures requires identification of points in a system where redundancy or other modifications are needed.

  • Because large systems have many possible threats that may be interdependent, it is crucial that such threats be cataloged in a manner that allows for efficient representation and ease of analysis to identify the most critical threats.

  • To address this problem, we can model large system threats by conceptually representing them as a Cyber Threat Tree implemented as a directed graph known as a Multiple-Valued Decision Diagram (MDD).

  • The cyber threat tree structure improves upon both the classical fault tree and attack tree structures by expanding the representation of possible system threats.

  • This cyber threat tree model is incorporated into an existing MDD software package to help identify and catalog possible system threats.
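
The sketch below illustrates the multiple-valued idea in miniature: each component takes one of several threat states rather than a binary fail/no-fail value, and a parent node combines its children's states. The states, components, and combination rule here are assumptions for illustration; a real MDD package represents this symbolically and scales to very large systems:

    # Toy multiple-valued threat-state model (radix-3 states).
    SECURE, DEGRADED, COMPROMISED = 0, 1, 2

    def worst_case(*states):
        """Parent threat level = most severe child state (illustrative rule)."""
        return max(states)

    web_server = DEGRADED        # e.g., unpatched but still serving
    database = SECURE
    auth_service = COMPROMISED   # e.g., stolen credentials in use

    system_state = worst_case(web_server, database, auth_service)
    print(["secure", "degraded", "compromised"][system_state])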

Building and Integrating an Information Security Trustworthiness Framework

  • We can introduce a new type of tree: Cyber Threat Trees.

  • They are a superset of Fault and Attack trees since they are based on multiple-valued or radix-p valued algebras over a finite and discrete set of values.

  • For example, when the radix p=2, the cyber threat tree reduces to a fault or attack tree depending on the nature of the disruptive events.

  • Cyber threat trees allow for more complicated interactions to be modeled than classical fault or attack trees.

Cyber Threat Trees with Neuro-Fuzzy based Software Risk Estimation Tool

  • For software threat prediction, various statistical approaches as well as more advanced approaches have been introduced across the different areas where software systems are used.

  • For cyber threats, a trend analysis model based on the Hidden Markov Model (HMM) has been proposed to forecast cyber threat trends.

    • An HMM is a statistical model in which hidden states are inferred from observable events.

  • After comparison with existing techniques, the Neuro-Fuzzy based Software Risk Estimation Tool provides accurate results.

    • MERIT workshop and training programs are conducted for effective training about insider threat awareness.

      • Insider threats are those undesired events that are performed by legitimate users.

  • Threat Analysis and Modeling (TAM) tool is used to identify the threats and evaluate the risks.

    • This process is useful in business applications.

    • To identify the most critical large system threats, the Cyber Threat Tree is implemented as a directed graph known as a Multiple-Valued Decision Diagram (MDD).

      • Multiple Valued Logic function is used to represent the threat states and their interdependence.

Software Reliability Assessment using Neuro Fuzzy System

  • Software Reliability Assessment using Neuro Fuzzy System utilizes a threat representation structure called a Cyber Threat Tree.

  • This idea was motivated by fault trees, which were originally devised at Bell Laboratories.

  • Cyber threat trees have important differences from the fault trees in that many threat events are not statistically independent and that, unlike the fault tree model, we do not model threats as faults.

  • In the fault tree model, a fault either exists or does not; hence, it is based on a binary Boolean logic switching function.

  • Binary decision diagrams (BDDs) and their extended format multi-state BDDs (MBDDs) have been adapted to solve a fault tree model for reliability analysis.

  • This CRC Press News discusses a new family of decision diagrams, Multiple-Valued Decision Diagrams (MDDs) for dependability analysis of fault tolerant systems.

  • Both BDDs and MDDs can be used to find the exact solution for extremely large systems with arbitrary component failure distribution.

  • However, as compared with the BDD approach, the MDD approach has two advantages:

    • it incorporates imperfect fault coverage modeling automatically, and

    • it provides a straightforward and efficient solution to analyzing system safety.

  • The reliability and safety of a fault tolerant computer system called 3P2M are analyzed to illustrate the advantages of the MDD approach.

Summary

  • System security continues to be of increasing importance.

  • To effectively address both natural and intentional threats to large systems, the threats must be cataloged and analyzed.

  • Extremely large and complex systems can have an accordingly large number of threat scenarios.

  • Simply listing the threats and devising countermeasures for each is ineffective and not efficient.

  • This CRC Press News describes a threat cataloging methodology whereby a large number of threats can be efficiently cataloged and analyzed for common features.

  • This allows countermeasures to be formulated that address a large number of threats that share common features.

  • The methodology utilizes Multiple-Valued Logic for describing the state of a large system and a multiple-valued decision diagram (MDD) for the threat catalog and analysis.

What Every Engineer Should Know About Common Attack Pattern Enumeration and Classification (CAPEC)

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

  • Understanding how the adversary operates is essential to effective cyber security.

  • Common Attack Pattern Enumeration and Classification (CAPEC) helps by providing a comprehensive dictionary of known patterns of attack employed by adversaries to exploit known weaknesses in cyber-enabled capabilities.

  • It can be used by analysts, developers, testers, and educators to advance community understanding and enhance defenses.

Examples

CAPEC-66: SQL Injection

  • "This attack exploits target software that constructs SQL statements based on user input. An attacker crafts input strings so that when the target software constructs SQL statements based on the input, the resulting SQL statement performs actions other than those the application intended..."

CAPEC-540: Overread Buffers

  • "An adversary attacks a target by providing input that causes an application to read beyond the boundary of a defined buffer. This typically occurs when a value influencing where to start or stop reading is set to reflect positions outside of the valid memory location of the buffer. This type of attack may result in exposure of sensitive information, a system crash, or arbitrary code execution."

Many security analysts report attack types using the consequence of the incident, the attack pattern, the name of the device being targeted, or a name that was coined at the time and may be reused in the future. Given this inconsistency, a standard for describing attack patterns might help analysts report cybersecurity threats more accurately and consistently.

Cataloging Threats With CAPEC

  • An existing standard that has recently been revised sets forth a common set of names for cyberattack patterns.

  • The Common Attack Pattern Enumeration and Classification (CAPEC) is maintained by the MITRE Corporation, a not-for-profit organization that operates research and development centers sponsored by the federal government.

  • CAPEC is a comprehensive dictionary and classification taxonomy of known attacks that can be used by analysts, developers, testers and educators to advance community understanding and enhance defenses.

  • The objective of the CAPEC effort is to create a publicly available catalog of common attack patterns classified in an intuitive manner, along with a comprehensive schema for describing related attacks and sharing information about them.

CAPEC Hierarchy

  • CAPEC uses graph views, which are basically hierarchical representations of attack patterns.

    • The top of the hierarchy is a set of categories (see Figure 1), under which there are meta-level patterns. These meta-level patterns are parents to standard patterns, which may then be parents to detailed patterns.

  • CAPEC version 2.9 currently provides two views on the CAPEC site: Mechanisms of Attack and Domains of Attack.

    • In the Mechanisms of Attack view, nine categories are shown at the top level, with a total of 503 attack patterns within the entire hierarchy.

Figure 1: Mechanisms of Attack Categories (Source: CAPEC)

  • Shown in Figure 2 is a partial listing of a few expanded branches of the Mechanisms of Attack view hierarchy.

    • Note how the hierarchy follows this format: View -> Category -> Meta -> Standard -> Detailed.

Figure 2: CAPEC hierarchy example (Source: CAPEC)
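
One way to work with such a hierarchy programmatically is as a simple tree. The sketch below models the View -> Category -> Meta -> Standard -> Detailed levels with a few hand-picked entries; it is only an illustration, not the official CAPEC data set, which is published as downloadable XML/CSV on the CAPEC site:

    # Illustrative model of the CAPEC hierarchy; entries are hand-picked.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Pattern:
        capec_id: str
        name: str
        level: str                       # View/Category/Meta/Standard/Detailed
        children: List["Pattern"] = field(default_factory=list)

    view = Pattern("CAPEC-1000", "Mechanisms of Attack", "View", [
        Pattern("CAPEC-152", "Inject Unexpected Items", "Category", [
            Pattern("CAPEC-248", "Command Injection", "Meta", [
                Pattern("CAPEC-66", "SQL Injection", "Standard"),
            ]),
        ]),
    ])

    def walk(node: Pattern, depth: int = 0) -> None:
        print("  " * depth + f"{node.capec_id}: {node.name} ({node.level})")
        for child in node.children:
            walk(child, depth + 1)

    walk(view)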

Consequences, Device Types and Attack Vectors

Below are some examples of names analysts commonly use when reporting attack types that are not attack patterns:

  • Denial-of-service — consequence;

  • Point-of-sale (POS) — targeted device type;

  • Internet of Things (IoT) — targeted device type;

  • Backdoor — consequence or indicator, depending on how it's detected;

  • Malicious documents (attachments) and links — attack vector;

  • Shellshock — specific malware or campaign;

  • Web — assuming anything using HTTP for an attack;

  • Remote code execution — consequence; and

  • Wi-Fi — attack vector.

Many in the security community, including the IBM X-Force team, have lumped some of these examples together in the past under the category of “attack type or pattern.” However, as shown above, these examples are not representative of an attack pattern. Some are consequences, such as DoS, while others describe the type of device that’s being targeted, such as an IoT device.

Looking It Up

When looking up IDs on the CAPEC website, you will notice there’s a presentation filter option on the left side. It defaults to “Basic” and has an option labeled “Complete.” What is shown from either view is based on the data available and applies to the given entry. The headings noted below are from CAPEC-17.

Basic will show you:

  • Summary,

  • Attack Prerequisites,

  • Solutions and Mitigations, and

  • Related Attack Patterns.

Complete will show you everything that Basic has, plus:

  • Typical Severity,

  • Typical Likelihood of Exploit,

  • Methods of Attack,

  • Examples-Instances,

  • Attack Skills or Knowledge Required,

  • Resources Required,

  • Attack Motivation-Consequences,

  • Injection Vector,

  • Payload,

  • Activation Zone,

  • Payload Activation Impact,

  • Related Weaknesses,

  • Related Attack Patterns,

  • Purposes,

  • Impact,

  • Technical Context,

  • References, and

  • Content History.

What Every Engineer Should Know About Cyber Kill Chain

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

  • The term kill chain was originally used as a military concept related to the structure of an attack, consisting of

    • target identification,

    • force dispatch to target,

    • decision and order to attack the target, and

    • finally the destruction of the target.

  • Conversely, the idea of "breaking" an opponent's kill chain is a method of defense or preemptive action.

  • More recently, Lockheed Martin adapted this concept to information security, using it as a method for modeling intrusions on a computer network.

    • Developed by Lockheed Martin, the Cyber Kill Chain framework is part of the Intelligence Driven Defense model for the identification and prevention of cyber intrusion activity.

    • The model identifies what the adversaries must complete in order to achieve their objective.

Neutralizing a Cyber Attack using the Cyber Kill Chain Model

The seven steps of the Cyber Kill Chain enhance visibility into an attack and enrich an analyst’s understanding of an adversary’s tactics, techniques and procedures.

1. Reconnaissance

The attacker gathers information on the target before the actual attack starts.

  • Many security professionals feel that there is nothing that can be done about this stage, but that is far from true.

  • Quite often, cyber attackers collect information on their intended targets by searching internet sites like LinkedIn or Instagram.

  • They may also try to gather intel through techniques such as calling employees, email interactions, or dumpster diving.

This is where secure behaviors can have a big impact.

  • An aware workforce will know they are a target and limit what they publicly share.

  • They will authenticate people on the phone before they share any sensitive information.

  • They safely dispose of and shred sensitive documents.

  • Does this totally neutralize this stage?

    • Absolutely not, but then again, no control fully does.

    • However, this can put a big dent in the attacker’s capabilities to gather information.

  • A properly trained workforce can report suspicious activity, such as odd phone calls probing for more information.

2. Weaponization

  • The cyber attacker does not interact with the intended victim.

  • Instead, they create their attack.

  • For example, the attacker may create an infected Microsoft Office document paired with a customized phishing email, or perhaps they create a new strain of self-replicating malware to be distributed via USB drive.

  • There are few security controls, including security awareness, that may impact or neutralize this stage, unless the cyber attacker does some limited testing on the intended target.

3. Delivery

Transmission of the attack to the intended victim(s).

  • For example, this would be sending the actual phishing email or distributing the infected USB drives at a local coffee shop or cafe.

  • While there is an entire technical industry dedicated to stopping this stage, people also play a critical role.

While people aren’t proficient at remembering lots of new information, they are very good at being adaptable.

  • They generally follow that “this just does not seem right” instinct.

  • In addition, the 2019 Verizon DBIR found passwords and phishing as the two primary attack vectors, both involving people.

  • As such, it is people and not technology that are the first line of defense in detecting and stopping many of these attacks, including new or custom attacks such as CEO Fraud or Spear Phishing.

  • In addition, people can identify and stop attacks that most technologies cannot even filter, such as attacks over the phone.

  • A trained workforce greatly reduces this attack surface area.

4. Exploitation

  • This implies actual ‘detonation’ of the attack, such as the exploit running on the system.

  • Trained people ensure the systems they are running are updated and current.

  • They ensure they have anti-virus running and enabled.

  • They ensure that any sensitive data they are working with is on secured systems, making them far more secure against exploitation.

5. Installation

  • The attacker installs malware on the victim.

  • Not all attacks require malware, such as a CEO fraud attack or harvesting login credentials.

  • However, just like exploitation when malware is involved, a trained and secure workforce can help ensure they are using secure devices that are updated, current, and have anti-virus enabled, which would stop many malware installation attempts.

  • In addition, this is where we begin to go beyond just the “human firewall” and leverage the “human sensor”.

  • A key step in detecting an infected system is to look for abnormal behavior.

  • Who better to detect abnormal behavior than the people using the system every day?

6. Command & Control

  • This implies that once a system is compromised and/or infected, the system has to call home to a Command and Control (C&C) system for the cyber attacker to gain control.

  • This is why threat ‘hunting’ has become so popular.

  • Hunters look for abnormal outbound activities like this.

7. Actions on Objectives

  • Once the cyber attacker establishes access to the organization, they can then execute actions to achieve their objectives.

    • Motivations vary greatly depending on the threat actor.

    • It may include political, financial, or military gain, so it is very difficult to define what those actions will be.

  • Once again, this is where a trained workforce of human sensors embedded throughout your organization can vastly improve

    • your ability to detect and respond to an incident, and

    • your resilience capabilities.

  • In addition, secure behaviors will make it far more difficult for a successful adversary to pivot throughout the organization and achieve their objectives.

    • Behaviors such as the use of strong, unique passwords, authenticating people before sharing sensitive data, or reviewing their last login are some of the many behaviors that make the attacker’s life far more difficult and result in them being far more likely to be detected.

How to Use Cyber Kill Chain Effectively

  • Different security techniques bring forward different approaches to the cyber kill chain – everyone defines the stages slightly differently.

    • Alternative models of the cyber kill chain combine several of the above steps into a C&C stage (command and control, or C2) and others into an ‘Actions on Objective’ stage.

    • Some combine lateral movement and privilege escalation into an exploration stage; others combine intrusion and exploitation into a ‘point of entry’ stage.

  • It's a model often criticized for focusing on perimeter security and for being limited to malware prevention.

    • When combined with advanced analytics and predictive modeling, however, the cyber kill chain becomes critical to data security.

With the above breakdown, the kill chain is structured to reveal the active state of a data breach. Each stage of the kill chain requires specific instrumentation to detect cyber attacks.
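
One simple way to operationalize this is a stage-by-stage control matrix. The sketch below encodes one as a plain data structure; the control choices are illustrative assumptions drawn from the discussion above, not a standard:

    # Illustrative kill chain stage-to-control matrix.
    KILL_CHAIN_CONTROLS = {
        "Reconnaissance":        ["limit public info", "report probing calls"],
        "Weaponization":         ["threat intelligence feeds"],
        "Delivery":              ["email filtering", "phishing-aware staff"],
        "Exploitation":          ["patching", "anti-virus"],
        "Installation":          ["allow-listing", "human sensors"],
        "Command & Control":     ["egress monitoring", "threat hunting"],
        "Actions on Objectives": ["least privilege", "strong unique passwords"],
    }

    for stage, controls in KILL_CHAIN_CONTROLS.items():
        print(f"{stage}: {', '.join(controls)}")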

"Home My Sweet Home" - The Journey is My Home. Happy New Year!

By: John X. Wang
Subjects: Architecture, Engineering - General, Engineering - Industrial & Manufacturing

Surrounding Great Smoky Mountains National Park are gateway towns with attractions and souvenirs for the traveler. After a retreat to nature, we emerged into these towns, where we enjoyed aquariums, rides, museums, and Dollywood, an amusement park and resort. "Home My Sweet Home" - The Journey is My Home. The contrast between Dolly Parton's childhood home and her sprawling Dollywood complex is a reminder that the biggest thrills come when we take large leaps to follow our dreams, no matter how long the road.

Merry Christmas! Imagine to Get on a Self Driving Car Touring on Historic Park Roads

By: John X. Wang
Subjects: Agricultural Science, Architecture, Engineering - Environmental, Environmental Science, Geoscience

Imagine my family driving the 11-mile one-way loop road through Cades Cove in a Self Driving Car; the road takes us through a lush valley surrounded by mountains. We stop to visit historic buildings, a grist mill, and watch wildlife. Then, for a quieter ride, we head to the Roaring Forks motor nature trail with views of rushing streams, old log cabins, another mill, and forested wilderness. Afterwards, we continue with other beautiful drives including the 18-mile Little River Road from the Sugarlands Visitor Center to Townsend, and the Blue Ridge Parkway (outside of the Great Smoky Mountains National Park).

Proceed down the road to the motor trail's entrance near the Rainbow Falls trailhead parking lot. We enjoy leaf-peeping, hiking trails and waterfalls along the way. Merry Christmas!

Merry Christmas from the Mountain Farm Museum at Oconaluftee Valley

By: John X. Wang
Subjects: Agricultural Science, Architecture, Geoscience

Imagine my family getting on a Self Driving Car on the 11-mile one-way loop road through Cades Cove; the road takes us through a lush valley surrounded by mountains. We stop to visit historic buildings, a grist mill, and watch wildlife. Then, for a quieter ride, we head to the Roaring Forks motor nature trail with views of rushing streams, old log cabins, another mill, and forested wilderness. Afterwards, we continue with other beautiful drives including the 18-mile Little River Road from the Sugarlands Visitor Center to Townsend, and the Blue Ridge Parkway (outside of the Great Smoky Mountains National Park).

Proceed down the road to the motor trail's entrance near the Rainbow Falls trailhead parking lot. We enjoy leaf-peeping, hiking trails and waterfalls along the way. Merry Christmas!

Risk Engineering Approach for Penetration Testing

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

pen test (penetration testing)

  • Penetration testing, also called pen testing or ethical hacking, is the practice of testing a computer system, network or web application to find security vulnerabilities that an attacker could exploit.

  • Penetration testing can be automated with software applications or performed manually. Either way, the process involves gathering information about the target before the test, identifying possible entry points, attempting to break in (either virtually or for real), and reporting back the findings.

  • The main objective of penetration testing is to identify security weaknesses.

  • Penetration testing can also be used to test an organization's security policy, its adherence to compliance requirements, its employees' security awareness and the organization's ability to identify and respond to security incidents.

  • Typically, the information about security weaknesses that are identified or exploited through pen testing is aggregated and provided to the organization's IT and network system managers, enabling them to make strategic decisions and prioritize remediation efforts.

  • Penetration tests are also sometimes called white hat attacks because in a pen test, the good guys are attempting to break in.

A penetration test is a simulated cyber attack against your computer system to check for exploitable vulnerabilities. In the context of web application security, penetration testing is commonly used to augment a web application firewall (WAF).

Pen testing can involve the attempted breaching of any number of application systems (e.g., application programming interfaces (APIs), frontend/backend servers) to uncover vulnerabilities, such as unsanitized inputs that are susceptible to code injection attacks.

Insights provided by the penetration test can be used to fine-tune your WAF security policies and patch detected vulnerabilities.

Penetration testing stages

The pen testing process can be broken down into five stages.

1. Planning and reconnaissance

The first stage involves:

 

  • Defining the scope and goals of a test, including the systems to be addressed and the testing methods to be used.

  • Gathering intelligence (e.g., network and domain names, mail server) to better understand how a target works and its potential vulnerabilities.

2. Scanning

The next step is to understand how the target application will respond to various intrusion attempts. This is typically done using:

  • Static analysis – Inspecting an application’s code to estimate the way it behaves while running. These tools can scan the entirety of the code in a single pass.

  • Dynamic analysis – Inspecting an application’s code in a running state. This is a more practical way of scanning, as it provides a real-time view into an application’s performance.

3. Gaining Access

This stage uses web application attacks, such as cross-site scripting, SQL injection and backdoors, to uncover a target's vulnerabilities. Testers then try to exploit these vulnerabilities, typically by escalating privileges, stealing data, intercepting traffic, etc., to understand the damage they can cause.

4. Maintaining access

The goal of this stage is to see if the vulnerability can be used to achieve a persistent presence in the exploited system, long enough for a bad actor to gain in-depth access. The idea is to imitate advanced persistent threats, which often remain in a system for months in order to steal an organization's most sensitive data.

5. Analysis

The results of the penetration test are then compiled into a report detailing:

  • Specific vulnerabilities that were exploited

  • Sensitive data that was accessed

  • The amount of time the pen tester was able to remain in the system undetected

This information is analyzed by security personnel to help configure an enterprise’s WAF settings and other application security solutions to patch vulnerabilities and protect against future attacks.

What are the types of pen tests?

  • White box pen test - In a white box test, the hacker will be provided with some information ahead of time regarding the target company’s security info.

  • Black box pen test - Also known as a ‘blind’ test, this is one where the hacker is given no background information besides the name of the target company.

  • Covert pen test - Also known as a ‘double-blind’ pen test, this is a situation where almost no one in the company is aware that the pen test is happening, including the IT and security professionals who will be responding to the attack. For covert tests, it is especially important for the hacker to have the scope and other details of the test in writing beforehand to avoid any problems with law enforcement.

  • External pen test - In an external test, the ethical hacker goes up against the company’s external-facing technology, such as their website and external network servers. In some cases, the hacker may not even be allowed to enter the company’s building. This can mean conducting the attack from a remote location or carrying out the test from a truck or van parked nearby.

  • Internal pen test - In an internal test, the ethical hacker performs the test from the company’s internal network. This kind of test is useful in determining how much damage a disgruntled employee can cause from behind the company’s firewall.

How is a typical pen test carried out?

  • Pen tests start with a phase of reconnaissance, during which an ethical hacker spends time gathering data and information that they will use to plan their simulated attack.

  • After that, the focus becomes gaining and maintaining access to the target system, which requires a broad set of tools.

  • Tools for attack include software designed to produce

    • brute-force attacks or

    • SQL injections.

  • There is also hardware specifically designed for pen testing, such as

    • small inconspicuous boxes that can be plugged into a computer on the network to provide the hacker with remote access to that network.

  • In addition, an ethical hacker may use social engineering techniques to find vulnerabilities.

    • For example, sending phishing emails to company employees, or even disguising themselves as delivery people to gain physical access to the building.

The hacker wraps up the test by covering their tracks; this means removing any embedded hardware and doing everything else they can to avoid detection and leave the target system exactly how they found it.

Penetration Testing Methodology

  • Once the threats and vulnerabilities have been evaluated, the penetration testing should address the risks identified throughout the environment.

  • The penetration testing should be appropriate for the complexity and size of an organization.

  • All locations of sensitive data; all key applications that store, process or transmit such data; all key network connections; and all key access points should be included.

  • The penetration testing should attempt to exploit security vulnerabilities and weaknesses throughout the environment, attempting to penetrate both at the network level and key applications.

  • The goal of penetration testing is to determine if unauthorized access to key systems and files can be achieved.

  • If access is achieved, the vulnerability should be corrected and the penetration testing re-performed until the test is clean and no longer allows unauthorized access or other malicious activity.

Application Penetration Testing

Identifies application layer flaws such as Cross Site Request Forgery, Cross Site Scripting, Injection Flaws, Weak Session Management, Insecure Direct Object References and more.

Network Penetration Testing

Focuses on identifying network and system level flaws including Misconfigurations, Product-specific vulnerabilities, Wireless Network Vulnerabilities, Rogue Services, and Weak Passwords and Protocols.

Physical Penetration Testing

Also known as physical intrusion testing, this testing reveals opportunities to compromise physical barriers such as locks, sensors, cameras, mantraps and more.

IoT/Device Penetration Testing

Aims to uncover hardware and software level flaws with Internet of Things devices, including Weak Passwords, Insecure Protocols, APIs, or Communication Channels, Misconfigurations, and more.

Risk Engineering Approach for Penetration Testing

The Risk Engineering Approach for Penetration Testing typically involves the following six steps:

  1. Information Gathering — the stage of reconnaissance against the target.

  2. Threat Modeling — identifying and categorizing assets, threats, and threat communities.

  3. Vulnerability Analysis — discovering flaws in systems and applications using a set of tools, both commercially available and internally developed.

  4. Exploitation — simulating a real-world attack to document any vulnerabilities.

  5. Post-Exploitation — determining the value of compromise, considering data or network sensitivity.

  6. Reporting — outlining the findings with suggestions for prioritizing fixes; ideally, this means walking through the results with stakeholders.

How Can You Exploit Vulnerabilities?

  • Penetration testing can either be done in-house by your own experts using pen testing tools, or you can outsource to a penetration testing services provider.

  • A penetration test starts with the security professional enumerating the target network to find vulnerable systems and/or accounts.

  • This means scanning each system on the network for open ports that have services running on them.

  • It is extremely rare that an entire network has every service configured correctly, properly password protected, and fully patched.

  • Once the penetration tester has a good understanding of the network and the vulnerabilities that are present, he/she will use a penetration testing tool to exploit a vulnerability in order to gain unauthorized access.
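The "scanning each system for open ports" step reduces, at its simplest, to attempting TCP handshakes. Below is a minimal sketch using only the Python standard library; the target host and port range are placeholders, and such a scan should only ever be run against systems you are authorized to test:

```python
import socket

def scan(host, ports, timeout=0.5):
    """Toy TCP connect scan: a port is 'open' if the handshake completes."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:   # 0 means connected
                open_ports.append(port)
    return open_ports

# Placeholder target: scan your own lab machine, never an unauthorized host.
print(scan("127.0.0.1", range(1, 1025)))
```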

Security professionals do not just target systems, however. Often, a pen tester targets users on a network through phishing emails, pre-text calling, or onsite social engineering.

How Do You Test the "User Risk" to Your IT Security Chain?

Your users present an additional risk factor as well. Attacking a network via human error or compromised credentials is nothing new. If the continuous cybersecurity attacks and data breaches have taught us anything, it’s that the easiest way for a hacker to enter a network and steal data or funds is still through network users.

Compromised credentials are the top attack vector across reported data breaches year after year, a trend documented by the Verizon Data Breach Investigations Report. Part of a penetration test’s job is to address this security threat caused by user error. A pen tester will attempt brute-force password guessing of discovered accounts to gain access to systems and applications. While compromising one machine can lead to a breach, in a real-life scenario an attacker will typically use lateral movement to eventually land on a critical asset.
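At its core, brute-force password guessing is a loop over candidate credentials. The hedged sketch below shows the offline variant, testing a wordlist against a captured salted hash with Python's hashlib; the salt, hash, and wordlist are invented for illustration:

```python
import hashlib

def crack(target_hash, salt, wordlist):
    """Toy dictionary attack against a salted SHA-256 password hash."""
    for candidate in wordlist:
        digest = hashlib.sha256((salt + candidate).encode()).hexdigest()
        if digest == target_hash:
            return candidate
    return None

salt = "s4lt"                                           # hypothetical salt
target = hashlib.sha256((salt + "winter2024").encode()).hexdigest()
print(crack(target, salt, ["123456", "password", "winter2024"]))
# -> 'winter2024'
```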

Another common way to test the security of your network users is through a simulated phishing attack. Phishing attacks use personalized communication methods to convince the target to do something that’s not in their best interest. For example, a phishing attack might convince a user that it’s time for a "mandatory password reset" and to click on an embedded email link. Whether clicking on the malicious link drops malware or it simply gives the attacker the door they need to steal credentials for future use, a phishing attack is one of the easiest ways to exploit network users. If you are looking to test your users’ awareness around phishing attacks, make sure that the penetration testing tool you use has these capabilities.

What Does Risk Engineering Approach for Penetration Testing Mean to a Business?

A penetration test is a crucial component to network security. Through these tests a business can identify:

  • Security vulnerabilities before a hacker does

  • Gaps in information security compliance

  • The response time of their information security team, i.e. how long it takes the team to realize that there is a breach and mitigate the impact

  • The potential real-world effect of a data breach or cybersecurity attack

  • Actionable remediation guidance

Through penetration testing, security professionals can effectively find and test the security of multi-tier network architectures, custom applications, web services, and other IT components. These penetration testing tools and services help you gain fast insight into the areas of highest risk so that you may effectively plan security budgets and projects. Thoroughly testing the entirety of a business's IT infrastructure is imperative to taking the precautions needed to secure vital data from cybersecurity hackers, while simultaneously improving the response time of an IT department in the event of an attack.

Attack Tree and Attack Net for Penetration Testing

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

Penetration testing involves teams who conduct technical and process hacks. Web application penetration testing, for example, involves the enlistment of hackers who see how and where they can accomplish an infiltration. Within the Software Development Life Cycle (SDLC), penetration testing is vital to discovering vulnerabilities and gives teams across an organization an accurate measurement of the organization’s security posture. The cybersecurity posture of an organization refers to its overall strength in defending its attack surface against outside threats.

Penetration testers will try to break into an application – whether in testing or production. Once completed, penetration testing provides the information and documentation to prove to regulatory bodies that an enterprise is taking steps to achieve a secure environment. Every pen test is different, shaped by the individuals performing it, their approaches, mindsets, and capabilities, plus the tools involved.

Why Are Penetration Tests Important?

Penetration tests not only send a message that an organization is doing what it can to ensure the security of its private and confidential data; they also help DevSecOps teams to better understand the dynamics of hacks, including how bad actors can compromise an IT ecosystem’s attack surface.

Pen testing services help security teams to identify areas for improvement and prioritize threat mitigation strategies. Penetration testing can yield surprising results and can help organizations to better understand the different attack vectors that can compromise data. For example, within a web application security testing exercise, pen testers will find as many ways as possible to attack the various parts of the application. This provides SDLC teams with a vulnerability perspective that reflects the attacker’s point of view – think like a hacker, if you will.

Penetration Attack Tree Model Oriented to Attack Resistance Test

Security testing and penetration testing are guided by the threat model. A good threat model is a blueprint for a penetration test. Additionally, relevant Tactics, Techniques and Procedures (TTPs) and/or targeting data available from threat intelligence must be included in testing activities.

  • Attack model is the foundation for organizing and implementing attacks against the target system in Attack Resistance Test.

  • By redefining the node of the attack tree model and describing the relation of the attack tree nodes, we build a penetration attack tree model which can describe, organize, classify, manage and schedule the attacks for Attack Resistance Test.

  • We can design a penetration attack system whose attack scheme is the instance of the model application.
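One way to make the redefined nodes and node relations concrete is a small AND/OR attack tree. In the hedged Python sketch below (the attack steps, gates, and feasibility flags are invented for illustration), an OR node succeeds if any child attack succeeds, while an AND node requires all of its children:

```python
class AttackNode:
    """Attack-tree node: a leaf attack step, or an AND/OR combination."""
    def __init__(self, name, gate=None, children=(), feasible=False):
        self.name, self.gate = name, gate        # gate: "AND", "OR", or None
        self.children, self.feasible = list(children), feasible

    def achievable(self):
        if not self.children:                    # leaf: did the step work?
            return self.feasible
        results = [child.achievable() for child in self.children]
        return all(results) if self.gate == "AND" else any(results)

# Hypothetical goal: read the customer database.
root = AttackNode("read customer DB", gate="OR", children=[
    AttackNode("SQL injection on web app", feasible=True),
    AttackNode("steal DB credentials", gate="AND", children=[
        AttackNode("phish an admin", feasible=True),
        AttackNode("bypass MFA", feasible=False),
    ]),
])
print(root.achievable())   # -> True, via the SQL-injection branch
```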

Alternative to Attack Tree: Attack Net Penetration Testing

The modeling of penetration testing as a Petri net is surprisingly useful. It retains key advantages of the flaw hypothesis and attack tree approaches while providing some new benefits. Penetration testing is a critical step in the development of any secure product or system. While many current businesses define penetration testing as the application of automated network vulnerability scanners to an operational site, true penetration testing is much more than that. Penetration testing stresses not only the operation, but also the implementation and design of a product or system.

The development of penetration testing is a combination of art and science. The effectiveness of penetration testing depends on the skill and experience of the testers. Penetration testers need firm grounding in the first principles of information security, but they also need an almost encyclopedic knowledge of product or system trivia that have little apparent relationship to principles. Penetration testing also requires a special kind of insight that cannot be systematized.

In spite of this, there are widely used process models for penetration testing. Penetration testers who follow these models are more effective in their use of resources. Penetration testing process models are structured around some paradigm that organizes the discovery of potential attacks on the live system. The attack net model described here is a process model for penetration testing that uses the Petri net as its paradigm. Surprisingly, this approach provides increased structure to flaw-generation activities without restricting the free range of inquiry. This technique is particularly useful for organizing penetration testing by means of distributed or cooperative attacks. It also has the nice property of easily depicting both refinement of specific attacks and attack alternatives, in a manner similar to attack trees.

The attack net approach to penetration testing is a departure from both the flaw hypothesis model and the attack tree model; however, it retains the essential benefits of both. Any penetration testing process is unavoidably dependent upon the flaw hypothesis process model. Any valid penetration testing process model will retain many of its features and so does the attack net model.

Nevertheless, the attack net penetration testing process brings more discipline to the brainstorming activity without restricting the free range of ideas in any way. Attack nets also provide the alternatives and refinement of the attack tree approach.

Attack nets provide a graphical means of showing how a collection of flaws may be combined to achieve a significant system penetration. This is important since an attack net can make full use of hypothetical flaws. Attack nets can model more sophisticated attacks that may combine several flaws, none of which is a threat by itself. The ability to use discovered transitions (i.e. security relevant commands) to connect subnets allows penetration teams to communicate easily about the cumulative effects of several minor flaws.

The separation of penetration test commands or events from the attack states or objectives also increases the descriptive power of this approach. The basic notion of an initial security relevant state, the hostile test input, and the resulting security state is captured by the minimal Petri net representation.

In addition to specifying composition or refinement, attack nets can also model choices. The use of disjunctive transitions allows the movement of some tokens while other places are empty, thus modeling vulnerabilities that could be exploited in several ways or alternative attacks on a single goal.
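A minimal token game shows why the Petri-net view is convenient: places hold tokens representing attained security states, and a transition (a security-relevant command) fires when all of its input places are marked. The places and transitions below are hypothetical:

```python
# Minimal attack-net (Petri net) sketch. A transition fires when every
# input place holds a token; firing moves tokens to its output places.
marking = {"outside": 1, "creds_stolen": 0, "user_shell": 0, "root": 0}

transitions = [
    ("phish_user",       ["outside"],      ["creds_stolen"]),
    ("login_with_creds", ["creds_stolen"], ["user_shell"]),
    ("kernel_exploit",   ["user_shell"],   ["root"]),
]

progressed = True
while progressed:
    progressed = False
    for name, inputs, outputs in transitions:
        if all(marking[place] > 0 for place in inputs):
            for place in inputs:
                marking[place] -= 1
            for place in outputs:
                marking[place] += 1
            print("fired:", name)
            progressed = True

print(marking)   # the token ends at 'root': the composed attack succeeds
```

Adding a second transition with the same output place models the disjunctive case described above: alternative attacks on a single goal.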

Summary

This threat assessment’s results led to significant enhancements to this environment’s infrastructure security controls, modified key operational processes and triggered penetration testing activities to determine the presence and magnitude of potential flaws in specific components. Beyond this, it enabled an informed decision on risk management at the executive level regarding the threat and attack vector.

Happy Cyber Monday! What Every Engineer Should Know About Security Risk Analysis Techniques with a Table

By: John X. Wang
Subjects: Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology

| Analysis Process | Technique | Brief Description |
| --- | --- | --- |
| Assets identification | Assets categorization | Identify each asset according to predefined categories |
| Assets valuation | Asset valuation table | Assign each asset a value for disclosure, integrity, and denial of service |
| Threats identification | Threats categorization | Identify each threat according to predefined categories |
| Threats identification | Threat trees | Identify threats by decomposing general threat categories into specific threats |
| Vulnerabilities identification | Vulnerabilities checklists | List the assets in a table and check each for vulnerabilities |
| Vulnerabilities identification | Attack trees | Identify vulnerabilities and describe the security of the system |
| Vulnerabilities identification | Abuse case model | Identify and model the vulnerabilities of the system |
| Vulnerabilities identification | Survivable Network Analysis | Identify softspot components (essential and compromisable) and provide recommendations |
| Risk assessment | Impact valuation table | Assign a value of low, medium, or high according to attacks and impact |
| Security measures identification | Checklists | Based on what can be done for the problem, given the controls known so far |
| Security measures identification | Security measures principles | List of principles to help with the selection of appropriate security measures |

What Every Engineer Should Know About Quantitative Threat Modeling Methods to Support Decision Making Under Uncertainty

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

Quantitative Threat Modeling Method

To enhance the security of mobile devices, enterprises are developing and adopting mobile device management systems. However, if a mobile device management system is exploited, mobile devices and the data they contain will be compromised. Therefore, it is important to perform extensive threat modeling to develop realistic and meaningful security requirements and functionalities.

One step involved in the security engineering process is threat modeling. Threat modeling involves understanding the complexity of the system and identifying all possible threats, regardless of whether or not they can be exploited. Proper identification of threats and appropriate selection of countermeasures reduces the ability of attackers to misuse the system. Introduced during the HotSoS conference in Pittsburgh, PA in April 2016 by Bradley Potteiger, Goncalo Martins, and Xenofon Koutsoukos, the Quantitative Threat Modeling Method is a quantitative, integrated threat modeling approach that merges software and attack centric threat modeling techniques.

  • The STRIDE threat model is composed of a system model representing the physical and network infrastructure layout, as well as a component model illustrating component specific threats.

  • Component attack trees allow for modeling specific component contained attack vectors, while system attack graphs illustrate multi-component, multi-step attack vectors across the system.

  • The Common Vulnerability Scoring System (CVSS) is leveraged to provide a standardized method of quantifying the low level vulnerabilities in the attack trees.

This hybrid method consists of attack trees, STRIDE, and CVSS methods applied in synergy. It aims to address a few pressing issues with threat modeling for cyber-physical systems that have complex interdependencies among their components.

The central step of the Quantitative Threat Modeling Method (Quantitative TMM) is to build component attack trees for the five threat categories of STRIDE. This activity shows the dependencies among attack categories and low-level component attributes. After that, the CVSS method is applied and scores are calculated for the components in the tree.

An additional goal for the method is to generate attack ports for individual components. These attack ports (effectively root nodes for the component attack trees) illustrate activities that can pass risk to the connected components. The scoring assists with the process of performing a system risk assessment. If an attack port is dependent on a component root node with a high-risk score, that attack port also has a high-risk score and has a high probability of being executed. The opposite is also true.
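A hedged sketch of that scoring idea: each component's attack tree yields a root risk score (the CVSS-like numbers below are invented), and an attack port inherits the score of the component root it depends on, marking high-probability entry points into connected components:

```python
# Hypothetical root scores derived from per-component attack trees.
component_root_score = {"web_app": 9.1, "message_bus": 4.2, "plc": 7.8}

# Attack ports: (port name, component root the port depends on).
attack_ports = [("http_api", "web_app"),
                ("telemetry", "message_bus"),
                ("fieldbus", "plc")]

for port, root in attack_ports:
    score = component_root_score[root]
    level = "high" if score >= 7.0 else "medium" if score >= 4.0 else "low"
    print(f"{port}: inherits {score} from {root} -> {level}-risk entry point")
```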

  • This method can be applied to identify all possible threats against a mobile device management system by analyzing and identifying threat agents, assets, and adverse actions.

  • It can also be used for developing security requirements such as a protection profile and design a secure system.

  • Contains built-in prioritization of threat mitigation

  • Has automated components

  • Has consistent results when repeated

Threat Modeling method based on Attacking Path Analysis (T-MAP)

Yue Chen, Barry Boehm, and Luke Sheppard developed another quantitative threat modeling method, the Threat Modeling method based on Attacking Path Analysis (T-MAP), which quantifies security threats by calculating the total severity weights of relevant attacking paths for Commercial Off The Shelf (COTS) systems.

  • Compared to existing approaches, T-MAP is sensitive to an organization's business value priorities and IT environment.

  • It distills the technical details of thousands of relevant software vulnerabilities into management-friendly numbers at a high level.

  • This method systematically establishes the traceability and consistency from management-level organizational value propositions to technical-level security threats and corresponding mitigation strategies.

  • T-MAP could provide promising strength in prioritizing and estimating security investment effectiveness, as well as in evaluating the security performance of COTS systems.

  • T-MAP can help system designers evaluate the security performance of COTS systems and analyze the effectiveness of security practices.

  • This model can be implemented using UML class diagrams, access class diagrams, vulnerability class diagrams, target asset class diagrams and affected Value class diagrams.

Happy Cyber Monday! Play the Extended Security Cards with Hybrid Threat Modeling Method (hTMM)

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Gaming, Homeland Security, Information Technology, Web, Web2

Hybrid Threat Modeling Method (hTMM)

The Hybrid Threat Modeling Method (hTMM) was developed by the Software Engineering Institute (SEI) in 2018. It consists of a combination of

  • SQUARE (Security Quality Requirements Engineering Method),

  • Security Cards, and

  • PnG activities.

The targeted characteristics of the method include

  • no false positives, no overlooked threats,

  • a consistent result regardless of who is doing the threat modeling, and

  • cost effectiveness.

The main steps of the method are

  • Identify the system to be threat-modeled.

  • Apply Security Cards based on developer suggestions.

  • Remove unlikely PnGs (i.e., there are no realistic attack vectors).

  • Summarize the results using tool support.

  • Continue with a formal risk-assessment method.

Initial steps for the Hybrid Threat Modeling Method

The initial steps in the hTMM are as follows:

  1. Identify the system you will be modeling. Execute Steps 1-3 of SQUARE or a similar security requirements method, e.g.

    1. Agree on definitions.

    2. Identify a business goal for the system, assets, and security goals.

    3. Gather as many artifacts as feasible.

  2. Create a large initial set of possible threats by applying Security Cards in the following way:

    1. Distribute the Security Cards to participants either in advance or at the start of the activity. Include representatives of at least the three following groups of stakeholders:

      • system users/purchasers,

      • system engineers/developers, and

      • cybersecurity experts.

You may find that within each of those categories, there are multiple distinct perspectives that must be represented. Other relevant stakeholders can be included, as well.

    2. Have the participants look over the cards along all four dimensions: human impact, adversary's motivations, adversary's resources, and adversary's methods. To familiarize themselves with the type of information on the cards, have participants read at least one card from each dimension, front and back.

    3. Use the cards to support a brainstorming session. Consider each dimension independently, and sort the cards within that dimension in order of how relevant and risky each is for the system overall. Discuss as a team what orderings are identified. It's important to be inclusive, so do not exclude ideas that seem unlikely or illogical at this point in time.

Record the data

As you conduct your brainstorming exercise, record the following:

  1. If your system were compromised, what assets, both human and system, could be impacted?

  2. Who are the personae non gratae who might reasonably attack your system and why? What are their names/job titles/roles? Describe them in some detail:

    1. What are their goals?

    2. What resources and skills might the PnG have?

  3. In what ways could the system be attacked?

    • For each attack vector, have you identified a PnG (or could you add a PnG) capable of utilizing that vector?

Analyze the collected data

  1. After the data in Step 2 has been collected, you will have enough information to prune the listed attacks, removing those whose PnGs are unlikely and those for which no realistic attack vectors can be identified. Once this is done, for the remaining attacks:

    • Itemize their misuse cases. This expands on HOW the adversary attacks the system. The misuse cases provide the supporting detailed information on how the attack takes place.

  2. Summarize the results from the above steps, utilizing tool support, as follows:

    1. Actor (PnG): Who or what instigates the attack?

    2. Purpose: What is the actor's goal or intent?

    3. Target: What asset is the target?

    4. Action: What action does the actor perform or attempt to perform? Here you should consider both the resources and the skills of the actor. You will also be describing HOW the actor might attack your system and its expansion into misuse cases.

    5. Result of the action: What happens as a result of the action? What assets are compromised? What goal has the actor achieved?

    6. Impact: What is the severity of the result (high, medium, or low)?

    7. Threat type: (e.g., denial of service, spoofing)
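As a hedged sketch of the "tool support" the method calls for, each pruned attack can be captured as one structured record with exactly these seven fields; the example values are invented:

```python
from dataclasses import dataclass

@dataclass
class ThreatRecord:
    actor: str        # the PnG instigating the attack
    purpose: str      # the actor's goal or intent
    target: str       # the asset under attack
    action: str       # how the attack is performed (expands into misuse cases)
    result: str       # what is compromised or achieved
    impact: str       # "high" | "medium" | "low"
    threat_type: str  # e.g., "spoofing", "denial of service"

record = ThreatRecord(
    actor="disgruntled insider",
    purpose="exfiltrate customer data for resale",
    target="customer database",
    action="abuse of an over-privileged service account",
    result="bulk read of PII tables",
    impact="high",
    threat_type="information disclosure",
)
print(record)
```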

After the preceding steps are done, you can continue with a formal risk assessment method, using these results, and the additional steps of a security requirements method (such as SQUARE), perhaps tailoring the method to eliminate steps you have already accounted for in the threat modeling exercise.

Steps 1 and 5 are activities that precede and follow the bulk of the threat modeling work. We felt it was necessary to include these, to understand where hTMM fits into lifecycle activities, specifically security requirements engineering.

Summary

In summary, the Hybrid Threat Modeling Method (hTMM) combines SQUARE, Security Cards, and PnG activities into a single process that aims for no false positives, no overlooked threats, consistent results regardless of who performs the modeling, and cost effectiveness.

Happy Cyber Monday! Consider Playing Security Cards during the Holiday Season

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

Security Cards: A Threat Brainstorming Toolkit (credit: https://securitycards.cs.washington.edu)

The purpose of the Security Cards is to facilitate broad exploration of potential security and privacy threats to a system with the “security mindset”.

Security Cards identify unusual and complex attacks. They are not a formal method but, rather, a kind of brainstorming technique. With the help of a deck of cards, analysts can answer questions about an attack, such as:

  • Who might attack?

  • Why might the system be attacked?

  • What assets are of interest?

  • How can these attacks be implemented?

This method uses a deck of 42 cards to facilitate threat-discovery activities:

  • Human Impact (9 cards),

  • Adversary's Motivations (13 cards),

  • Adversary's Resources (11 cards), and

  • Adversary's Methods (9 cards).

The Security Cards encourage you to think broadly and creatively about computer security threats by exploring with 42 cards along 4 dimensions (suits).
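The combinatorial nature of the deck is easy to see in code. The hedged sketch below enumerates cross-dimension pairings (the card titles are placeholders, not the actual card text); with 9 x 13 x 11 x 9 cards, the full deck offers 11,583 four-suit combinations to brainstorm over:

```python
from itertools import product

# Placeholder card titles -- the real deck has 9/13/11/9 cards per suit.
deck = {
    "human_impact":        ["emotional wellbeing", "financial loss"],
    "adversary_motives":   ["money", "politics"],
    "adversary_resources": ["insider access", "botnet"],
    "adversary_methods":   ["phishing", "physical theft"],
}

# Each combination is one brainstorming prompt spanning all four suits.
for combo in product(*deck.values()):
    print(dict(zip(deck.keys(), combo)))
```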

Usage of the Security Cards

The Security Cards can be used for a wide range of purposes and in a wide range of contexts. For example,

  • the cards could be used by junior engineers to learn about security threats,

  • by professional software and hardware developers for training and to surface threats in system design, and

  • by project teams to communicate about potential security threats with management and others.

Step-by-Step Activities

These detailed activities provide step-by-step suggestions for using the Security Cards in an interactive workshop/training context. The activities can be used "as is" or adapted and extended as needed. While the activities are phrased as engineering training plans, they can be used in other contexts as well.

Sorting by Threat Importance (combining with 5 Whys)

  • Have participants consider a specific system.

  • With that system in mind, ask participants to consider each dimension independently and sort the cards within that dimension in order of how relevant and risky each card is for the system overall.

  • Within each dimension, what orderings are identified?

  • Is there more than one reasonable ordering?

  • For each ranking, ask "5 Whys?" to probe the reasoning behind it.

In evaluations, teams of participants using Security Cards exhibited higher effectiveness, finding almost all types of threats. Incorporating the 5 Whys helps to prevent false positives.

Multi-Dimensions of Threat Discovery

  • Have participants consider a specific system.

  • With that system in mind and using the entire card deck, have participants explore card combinations from different dimensions to surface possible threats to the system.

  • Which combinations of cards surface critical threats?

  • Which surface surprising threats?

  • Which threats are most relevant overall?

The Security Cards approach to threat modeling emphasizes creativity and brainstorming over more structured approaches, such as checklists, to help users identify unusual or more sophisticated attacks. The method is suitable for purposes ranging from fundamental learning about security threats to aiding professionals in system design.

What Every Engineer Should Know About “Persona non Grata” Threat Modeling?

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Forensics & Criminal Justice, Homeland Security, Information Technology

Persona non Grata Threat Actor Profile

Persona non Grata is a war-gaming method for developing profiles of threat actors who could potentially target the organization. This method provides high-level insights into system design and the types of TTPs that a threat actor may use to attempt an attack or intrusion on a network.

For this CRC Press News, we are using an example of a “crown jewels” database and discussing how a security team might apply the threat model examples to the protection of that database. PASTA looked at the various aspects of the database, including

  • the technology stack, business risks associated with uncertainties caused by events that may occur during the protection of the database, and

  • how the intelligence cycle provides a constant, and constantly evolving, security posture in place to proactively search for threats.

The Persona non Grata method provided an overview of just

  • who might be interested in stealing the database files,

  • what they might want to do with those files, and

  • the potential sophistication levels of the threat actors.

These profiles help build an understanding of the primary threats to the organization and the tools those threats may employ to stage a potential attack.

The PnG approach

  • A persona non grata (PnG) represents an archetypal user who behaves in unwanted, possibly nefarious ways.

  • However, like ordinary personas, PnGs have specific goals that they wish to achieve and specific actions that they may take to achieve their goals.

  • Modeling PnG can therefore help us think about the ways in which a system might be vulnerable to abuse, and use this information to specify appropriate mitigating requirements.

  • The PnG approach makes threat modeling more tractable by asking users to focus on attackers, their motivations, and abilities.

PnG Design principle

  • Makes the problem more tractable by giving modelers a specific focus (here: attackers, motivations, and abilities)

  • Once attackers are modeled, the process moves on to targets and likely attack mechanisms

PnG was the most focused TMM, showing the most consistent behavior across teams.

What is a persona?

Personas are detailed descriptions of imaginary people constructed out of well-understood, highly specified data about real people.

What is a Persona non Grata

A persona non grata is an unacceptable or unwelcome person.

Developing a PnG: could you develop one for an unwanted and malicious intruder (credit: https://www.infoq.com/articles/personae-non-gratae)?

  1. Motivations: What are the PnG’s motivations?

Monetary gain?

Revenge?

Recognition?

“LoLs” (laughs)?

  2. Goals: How will the PnG fulfill their motivation? That is,

    what do they want to do, and

    how do they plan to get away with it?

  3. Skills: What abilities do they have to achieve their goal?

    What other assets do they have, e.g., access to infrastructure, or

    relationships to those who have the skills?

  4. Misuse cases: What are the misuse cases the PnG can follow to achieve their goals?
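A hedged sketch of a PnG captured as a structured profile with the four fields above; the example attacker and details are invented:

```python
from dataclasses import dataclass, field

@dataclass
class PersonaNonGrata:
    name: str
    motivations: list = field(default_factory=list)
    goals: list = field(default_factory=list)
    skills: list = field(default_factory=list)
    misuse_cases: list = field(default_factory=list)

png = PersonaNonGrata(
    name="Eve, ex-employee turned competitor",
    motivations=["revenge", "monetary gain"],
    goals=["copy the customer list before access is revoked"],
    skills=["knows the internal tools", "retains a forgotten VPN credential"],
    misuse_cases=["log in after hours and bulk-export CRM records"],
)
print(png)
```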

PnG Merging Process

  • Step 1: Discover domain-specific concepts

  • Step 2: Identify attack targets

  • Step 3: Visually display attack mechanisms

  • Step 4: Merge individual threats into new PnGs

  • Step 5: Check for redundancy

Features of Persona non Grata

  • Uses profiles of potential bad-guy attackers: analysis derives from anticipating what they would do given their defined goals and skills.

  • Helps identify relevant mitigation techniques

  • Direct contribution to risk management

  • Provides consistent results when used continuously

  • Detects only some subsets of threats

As a threat modeling method, Persona non Grata (PnG) focuses on the motivations and skills of human attackers.

  • It characterizes users as archetypes that can misuse the system and forces analysts to view the system from an unintended-use point of view

  • Tends to detect only a certain subset of threat types

  • This technique fits well into agile approaches, which incorporate personas.

We could create a persona non grata with more specific attack strategies to expose vulnerability points of the product.

Cybersecurity Risk Engineering: Why Engineers Need to Look Beyond OCTAVE?

By: John X. Wang
Subjects: Biomedical Science, Business & Management, Computer Game Development, Computer Science & Engineering, Disaster Planning & Recovery , Emergency Response, Engineering - Chemical, Engineering - Civil, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Engineering - Mining, Healthcare, Homeland Security, Information Technology, Public Administration & Public Policy, Web, Web2

Operationally Critical Threat, Asset, and Vulnerability Evaluation (OCTAVE)

  • The Operationally Critical Threat, Asset, and Vulnerability Evaluation (OCTAVE) is a framework for identifying and managing information security risks.

  • It defines a comprehensive evaluation method that allows an organization to identify the information assets that are important to the mission of the organization, the threats to those assets, and the vulnerabilities that may expose those assets to the threats.

  • By putting together the information assets, threats, and vulnerabilities, the organization can begin to understand what information is at risk.

  • With this understanding, the organization can design and implement a protection strategy to reduce the overall risk exposure of its information assets.

As a security framework for determining risk level and planning defenses against cyber assaults, the framework defines a methodology to

  • help organizations minimize exposure to likely threats,

  • determine the likely consequences of an attack and deal with attacks that succeed.

OCTAVE is designed to leverage the experience and expertise of people within the organization.

  1. The first step is to construct profiles of threats based on the relative risk that they pose.

  2. The process goes on to conduct a vulnerability assessment specific to the organization.

OCTAVE phases

OCTAVE defines three phases:

  • Phase 1: Build Asset-Based Threat Profiles

  • Phase 2: Identify Infrastructure Vulnerabilities

  • Phase 3: Develop Security Strategy and Plans

The framework has gone through several evolutionary phases since its introduction; however, the basic principles and goals have remained the same.

The Structure of OCTAVE

The OCTAVE method is based on eight processes that are broken into three phases. In higher education organizations, it is usually preceded by an exploratory phase (known as Phase Zero) to determine the criteria that will be used during the application of the OCTAVE method.

OCTAVE versus the rest

OCTAVE has two variants: OCTAVE-S and OCTAVE Allegro.

  • OCTAVE-S, a simplified methodology for smaller organizations that have flat hierarchical structures, and

  • OCTAVE Allegro, a more comprehensive version for large organizations or those with multilevel structures.

OCTAVE-S has fewer processes while still adhering to the overall OCTAVE philosophy, thus simplifying application for smaller organizations. OCTAVE Allegro is a later variant that focuses on protecting information-based critical assets.

OCTAVE risk assessment method

  • With the OCTAVE risk assessment method, integration of the organization’s infosec policies and unique business needs becomes possible.

  • OCTAVE helps organizations tap into operational experience and intelligence to define risks in a business context.

  • OCTAVE risk assessment leverages organizational know-how of the business process for planning information security.

    • When outsourcing to external agencies, organizations invariably detach themselves from decision-making, leaving that responsibility to experts who are not accountable in the long run, resulting in poor understanding of the nature of the enterprise’s security posture.

      • Thus, institutionalized improvement never takes place.

    • On the other hand, with OCTAVE risk assessment, a core analysis team is required to be formed from among the organization’s employees, effectively enlisting their active participation in the decision-making process.

Using the OCTAVE method for risk assessment,

  • the core analysis team conducts workshops to gather information from different tiers of the organization for identifying critical assets.

  • Workshops can be conducted using structured business communication.

  • Several iterations of brainstorming sessions are held to leverage collective business acumen and experience.

Risk assessment under the OCTAVE method

  • OCTAVE is self-directed and follows the “most critical assets” approach to risk analysis to prioritize areas of improvement.

  • It follows the premise of Pareto’s law (the 80-20 principle), which states that 80% of effects come from 20% of the causes.

The OCTAVE risk assessment method is divided into three phases:

  • Organizational view,

  • Technological view, and

  • Risk analysis.

Organizational view: Threat profiles based on assets

The OCTAVE risk assessment method focuses on speed, since for most businesses, time is money. Targeted workshops yield information on the fundamental, business-critical information assets with a high degree of concurrency.

This phase has the following processes.

  • Once OCTAVE establishes assets, areas of concern, which typically have a source and outcome, are defined.

    • Security requirements to tackle these problems must conform to CIA (confidentiality, integrity and availability) precepts.

  • Organizational vulnerabilities are then identified by comparing current protection strategies against previously established requirements.

    • This process is repeated, once each for the senior management, operational management and staff.

  • The final process is the creation of a threat profile based on the above findings.

    • This gives a consolidated view of all threats, which is then mapped onto a threat tree, structured to give in-depth insight into the source and outcome of threats under the categories of asset, access, actor, motive and outcome.
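A hedged sketch of one branch of such a threat tree, recorded under the five categories named above; the concrete values are illustrative:

```python
# One hypothetical branch of an OCTAVE threat profile tree.
threat_profile = {
    "asset":   "student records database",
    "access":  "network",             # vs. physical
    "actor":   "external attacker",   # vs. insider
    "motive":  "deliberate",          # vs. accidental
    "outcome": "disclosure",          # vs. modification / loss / interruption
}

for category, value in threat_profile.items():
    print(f"{category:>8}: {value}")
```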

Technology analysis: Infrastructural vulnerabilities

  • This phase involves identifying key infrastructural components for critical assets, and the technological vulnerabilities for key components.

  • The two steps here are identification and evaluation, wherein the different methods through which compromises may occur are analyzed.

Risk analysis: Planning and strategy

  • The concluding phase of the OCTAVE risk assessment method involves measurement and classification of individual risks as high, medium or low.

  • Then, a protection strategy in terms of policies and procedures is developed.

  • This is followed by a mitigation plan geared towards assets and an action plan defining short-term measures for dealing with breaches.

How OCTAVE Works

  • OCTAVE is a flexible and self-directed risk assessment methodology.

  • A small team of people from the operational (or business) units and the IT department work together to address the security needs of the organization.

  • The team draws on the knowledge of many employees to define the current state of security, identify risks to critical assets, and set a security strategy.

  • It can be tailored for most organizations.

Unlike most other risk assessment methods, the OCTAVE approach is driven by operational risk and security practices, not technology. It is designed to allow an organization to:

  • Direct and manage information security risk assessments for themselves

  • Make the best decisions based on their unique risks

  • Focus on protecting key information assets

  • Effectively communicate key security information

The main advantage that OCTAVE gives an organization is that it can be implemented in parts. Since it is exhaustive, organizations can choose to implement the portions of the workflow that they find appropriate.

Comprehensive consolidation of the threat profiles is one of the core strengths of the OCTAVE risk assessment method. This provides the key intelligence for threat mitigation under most scenarios.

OCTAVE is more flexible. Probability analyses are "optional," the only requirement being thoroughness; analysis teams are directed to consider a variety of factors that can influence probability, as well as to explicitly determine the exact numerical thresholds for "high," "medium" and "low" probabilities.

Unlike standards such as ISO 27005, OCTAVE does not require focus on all assets, thus saving time and keeping the scope relevant to the business context. OCTAVE risk assessment has been recognized as the preferred methodology for HIPAA compliance, making it relevant to companies that have outsourcing relationships with firms regulated under Health Insurance Portability and Accountability Act (HIPAA).

In summary, this method is most useful when creating a risk-aware corporate culture. The method is highly customizable to an organization’s specific security objectives and risk environment.

Cybersecurity Risk Engineering: Why Engineers Need to Look Beyond OCTAVE

Compared with the Risk Engineering approach, OCTAVE has the following drawbacks:

  • its complexity,

    • though OCTAVE threat modeling provides a robust, asset-centric view and organizational risk awareness, the documentation can become voluminous.

    • OCTAVE lacks scalability – as technological systems add users, applications, and functionality, a manual process can quickly become unmanageable.

  • it does not produce a detailed quantitative analysis of security exposure.

    • OCTAVE rates risks, likelihoods and impacts on a three-point scale.

      • Additionally, OCTAVE implementation requires teams to specifically define for themselves what each of the three points on each respective scale actually mean.

    • this represents a weakness when compared to the Risk Engineering approach, which explicitly tells those who use it what each rating means.

Cybersecurity Risk Engineering: Why Engineers Need to Look Beyond CVSS Scores

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

The Common Vulnerability Scoring System (CVSS) captures the principal characteristics of a vulnerability and produces a numerical severity score. CVSS originated with the U.S. National Infrastructure Advisory Council (NIAC) and is maintained by the Forum of Incident Response and Security Teams (FIRST) with support and contributions from the CVSS Special Interest Group. CVSS provides users with a common and standardized scoring system across different cyber and cyber-physical platforms. A CVSS score can be computed with a calculator that is available online.

A CVSS score is derived from values assigned by an analyst for each metric. The metrics are explained extensively in the documentation. The CVSS method is often used in combination with other threat-modeling methods. It provides a way to capture the principal characteristics of a vulnerability and produce a numerical score (ranging from 0-10, with 10 being the most severe) depicting its severity. The score can then be translated into a qualitative representation (such as low, medium, high, and critical) to help organizations properly assess and prioritize their vulnerability management processes.

CVSS Scores

  • The Common Vulnerability Scoring System (CVSS), a free and industry-standard way of ranking the severity of vulnerabilities, is important for anyone in the cybersecurity industry to understand, both for knowing when to rely on it and when to seek out more information.

  • A vulnerability is typically given a base score in CVSS, which is a rating from zero to 10 that gives an idea of how easy it is to exploit a vulnerability and how damaging it can be. Some vulnerabilities are also given temporal and environmental scores which modify the base score, but many are not.

  • Threat intelligence can provide much more detailed information about how vulnerabilities are actually being exploited “in the wild.” This can result in vastly different rankings from a CVSS base score.

The Common Vulnerability Scoring System is a way of assigning severity rankings to computer system vulnerabilities, ranging from zero (least severe) to 10 (most severe). According to the Forum of Incident Response and Security Teams (FIRST), CVSS is valuable for three main reasons:

  1. It provides a standardized vulnerability score across the industry, helping critical information flow more effectively between sections within an organization and between organizations.

  2. The formula for determining the score is public and freely distributed, providing transparency.

  3. It helps prioritize risk — CVSS rankings provide both a general score and more specific metrics.

For these reasons, it is helpful for everyone in the business to understand how CVSS scores are calculated. But it is also important to recognize the limitations of the system and know when to rely on it and when to get more information from threat intelligence sources.

The CVSS scoring system is now in its third iteration — CVSSv3. A CVSSv3 score has three values for ranking a vulnerability:

  1. A base score, which gives an idea of how easy it is to exploit the vulnerability and how much damage an exploit targeting that vulnerability could inflict;

  2. a temporal score, which ranks how aware people are of the vulnerability, what remedial steps are being taken, and whether threat actors are targeting it; and,

  3. an environmental score, which provides a more customized metric specific to an organization or work environment.

Each of these three scores is derived from formulas that include different subsets of metrics.

Base Score

This number represents a ranking of some of the qualities inherent to a vulnerability, which will not change over time or be dependent on the environment the vulnerability appears in. The base score comes from two subequations which themselves are each made up of a handful of metrics:

  • the Exploitability Subscore, and

  • the Impact Subscore.

Exploitability Subscore

The Exploitability Subscore is based on the qualities of the vulnerable component itself — their scores define how vulnerable the thing itself is to attack. The higher the combined score, the easier it is to exploit that vulnerability. Each metric here is ranked according to values specific to itself, not according to a numerical score.

  • The Attack Vector (AV) metric describes how easy it is for an attacker to actually access the vulnerability. The score is higher the more remote an attacker can be: a vulnerability that requires the attacker to be physically present receives the lowest AV score, one that requires local access scores higher, one exploitable from an adjacent network scores higher still, and one exploitable remotely across the network receives the highest score.

  • The Attack Complexity (AC) metric describes what conditions must exist that an attacker cannot control to exploit the vulnerability. A low score means there are no special conditions and an attacker can repeatedly exploit a vulnerability. A high score means an attacker might need to, for example, gather more information on a specific target before succeeding.

  • The Privileges Required (PR) metric describes what level of privileges an attacker must have before they can exploit a vulnerability — none required (the highest score); low privileges, meaning the attack might only affect settings and files at a basic user level; or high privileges required, meaning the attacker will need to have administrative privileges or something similar to meaningfully exploit the vulnerability.

  • The User Interaction (UI) metric describes whether the attacker will need another user to participate in the attack for it to succeed. This is defined as a binary metric for the purposes of scoring — either it’s required or it isn’t.

Impact Subscore

The Impact Subscore defines how significantly certain properties of the vulnerable component will be affected if it is successfully exploited. The first and most significant measure of impact is the Authorization Scope, or just Scope (S), of the vulnerability. The scope metric gives an idea of how badly an exploited vulnerability can impact other components or resources beyond the privileges directly associated with it. In a sense, this is a measure of the potential a vulnerability has to “break out of prison” and compromise other systems. This is also a binary metric — either an exploited vulnerability can only affect resources at the same level of privilege, or it allows an attacker to reach beyond the authorization privileges of the vulnerable component and impact other systems. When a scope change does not occur, the Impact metric reflects the following three values:

  • The Confidentiality (C) metric, in a way, is another measure of how much authority an exploited vulnerability provides. Such an exploit might result in no loss of confidentiality; a low degree, where some indirect access to restricted information is possible; or a high degree of loss that can lead to further serious tampering of sensitive information, like access to an administrator’s username and password.

  • The Integrity (I) metric reflects how much data corruption an exploited vulnerability makes possible. The score can be none; low, where some data can be modified but does not have significant consequences; or high, where there is a complete loss of protection over all data, or the data that is able to be modified would significantly impact the function of the vulnerable component.

  • The Availability (A) metric is a measurement of the loss of availability to the resources or services of the affected component. This metric is also scored as having no impact; a low impact, where there is some reduced access or interruption; and high, where there is either a complete loss of access or a persistent, serious disruption.

To sum up, a base CVSSv3 score is derived from a formula that takes into account the Exploitability Subscore, which is a measure of how easy it is for a vulnerability to be exploited, and the Impact Subscore, which is a measure of how significantly the vulnerable component will be affected if the vulnerability is successfully exploited. But this base score is really only a hypothetical measurement. It gives the dimensions of a blank canvas and tells you how much it’ll cost, but it doesn’t say what the artist will actually paint. A base CVSSv3 score can also be modified by an additional temporal score and an environmental score. Not every vulnerability assigned a base score will also have temporal and environmental scores calculated.
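Since the formula is public, the base-score arithmetic can be reproduced directly. Below is a minimal Python sketch of the CVSS v3.1 base equation using the metric weights published by FIRST (only the base metrics are handled, not temporal or environmental modifiers):

```python
import math

# CVSS v3.1 metric weights, per the FIRST specification.
AV   = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}
AC   = {"L": 0.77, "H": 0.44}
PR_U = {"N": 0.85, "L": 0.62, "H": 0.27}   # scope unchanged
PR_C = {"N": 0.85, "L": 0.68, "H": 0.50}   # scope changed
UI   = {"N": 0.85, "R": 0.62}
CIA  = {"H": 0.56, "L": 0.22, "N": 0.0}

def roundup(x):
    """Spec behavior: round up to one decimal place."""
    return math.ceil(x * 10) / 10

def base_score(av, ac, pr, ui, scope, c, i, a):
    pr_weight = (PR_C if scope == "C" else PR_U)[pr]
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if scope == "C":
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total if scope == "C" else total, 10))

# AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H -> 9.8 (a critical score)
print(base_score("N", "L", "N", "N", "U", "H", "H", "H"))
```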

Temporal Score

This score, which is derived from three metrics, gives a better idea of how threat actors are actually exploiting a vulnerability, as well as what remediations are available.

Exploit Code Maturity

The Exploit Code Maturity (E) metric reflects how likely it is that a vulnerability will actually be exploited, based on what code or exploit kits have been discovered “in the wild.” This metric can either be assigned a rank of “undefined,” or be given one of four increasingly severe scores:

  • unproven, meaning there is no existence of any known exploits;

  • proof of concept, meaning some code exists but it is not practical to use in an attack;

  • functional, which means that working code exists; or

  • high, which means that either no exploit is required or the code available is consistently effective and can be delivered autonomously.

Remediation Level

The Remediation Level (RL) metric measures how easy the vulnerability is to fix. In a sense, it is the counterpoint to the Exploit Code Maturity metric. It can also be either undefined or measured in four degrees:

  • a remedial solution is unavailable;

  • an unofficial workaround exists;

  • there is a temporary fix; or

  • an official fix — a complete solution offered by the vendor — is available.

Report Confidence

The Report Confidence (RC) metric defines how confidently it can be said that a vulnerability exists. Vulnerabilities may be identified by third parties but not recognized by the component’s official vendor, or a vulnerability may be recognized but its cause unknown. This metric can either be undefined, or given one of three rankings: unknown, meaning there are some uncertain reports about the vulnerability; reasonable, meaning some major details have been shared and the vulnerability is reproducible but the root cause may remain unknown; or confirmed, where a vulnerability’s cause is known and it is able to be consistently reproduced.

Environmental Score

This score is derived from two subscores:

  • a Security Requirements Subscore, which is defined by the three components of the Impact score (Confidentiality, Integrity, and Availability) as measured within a specific environment, and

  • a Modified Base Score, which reevaluates the metrics defining the base score according to the specific environment of an organization.

Security metric

The security metric is either not defined, or given one of three scores:

  • low, meaning the loss of confidentiality, integrity, or availability caused by the vulnerability being exploited will not have a major effect on an organization or its employees or customers;

  • medium, where the effect will be significant; or

  • high, where the effect will be catastrophic.

Modified base scores

The modified base scores are evaluated in the same way as before, but the specific circumstances of the environment in which the vulnerability exists are taken into account.

Finally, a vulnerability is assigned a CVSS base score between 0.0 and 10.0:

  • a score of 0.0 represents no risk;

  • 0.1 – 3.9 represents low risk;

  • 4.0 – 6.9, medium;

  • 7.0 – 8.9, high; and

  • 9.0 – 10.0 is a critical risk score.

In summary, values are assigned to each metric, and the base score is then calculated from the exploitability and impact subscores. It ranges from 0 to 10, where 10 means the highest severity.
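
Because the banding is mechanical, it is easy to express in code. A minimal Python sketch (the function name `severity` is ours, not part of any CVSS library):

```python
def severity(score: float) -> str:
    """Map a CVSS v3 score (0.0-10.0) to its qualitative severity band."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

print(severity(7.5))  # High
```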

Cybersecurity Risk Engineering: Why Engineers Need to Look Beyond CVSS Scores

  • CVSS scores can provide a great starting point for evaluating the impact of a particular vulnerability.

    • The base score provides a metric that’s reasonably accurate and easy to understand — provided you know what information the score is conveying.

    • However, many vulnerabilities will only be given a base CVSS score, unmodified by a temporal score or an environmental score, meaning the severity ranking of the score is really only assessing the vulnerability’s impact hypothetically.

  • It’s like ranking diseases based on how deadly and easily transmitted they are:

    • An all-time top 10 list might have the Black Plague and the Spanish Flu right near the top; however, they’re not really diseases you need to worry about today.

That’s where threat intelligence comes in.

  • Sound threat intelligence should not simply provide more information in the form of scores and statistics,

    • but also a deeper understanding of how and why threat actors are targeting certain vulnerabilities and ignoring others.

  • That often includes metrics that resemble the temporal and environmental scores only sometimes provided by CVSS. Cybersecurity Risk Engineering, for example, also ranks vulnerabilities based on

    • patterns in exploit sharing, and

    • numbers of links to malware.

  • This information often comes from sources that are difficult to access, like forums on the dark web. The result is a more realistic severity ranking that is often drastically different from the base score provided by CVSS.

Industrial Design Engineering: Inventive Problem Solving for Threat-Based Privacy Design

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

Privacy is becoming a key issue in today's Industrial Design Engineering. It is of the utmost importance that privacy be integrated into the software development lifecycle as early as possible. LINDDUN is a privacy threat analysis methodology that supports analysts in eliciting privacy requirements. LINDDUN is an explicit mirroring of STRIDE-per-element threat modeling. Its name stands for the following violations of privacy properties:

  • Linkability

  • Identifiability

  • Non-Repudiation

  • Detectability

  • Disclosure of information

  • Content Unawareness

  • Policy and consent Noncompliance

LINDDUN is presented as a complete approach to threat modeling with a process, threats, and requirements discovery method. It may be reasonable to use the LINDDUN threats or a derivative as a tool for privacy threat enumeration in the four-stage framework, snapping it either in place of or next to STRIDE.

  • The LINDDUN threats as a tool can be used for privacy threat inventory.

  • The LINDDUN methodology helps analysts to systematically consider privacy issues and select privacy-enhancing technologies accordingly.

LINDDUN methodology consists of six steps:

  1. Define the DFD

  2. Map privacy threats to DFD elements

  3. Identify threat scenarios

  4. Prioritize threats

  5. Elicit mitigation strategies

  6. Select corresponding PETs (privacy-enhancing technologies)

Specifically,

  • First, a data flow diagram (DFD) is created, which is a structured graphical representation of the system using 4 major types of building blocks: entities, data stores, data flows, and processes.

  • Each DFD element type is associated with a number of privacy threat categories (7 high-level privacy threat categories were identified: Linkability, Identifiability, Non-repudiation, Detectability, Disclosure of information, content Unawareness, and policy and consent Non-compliance).

  • To identify the threats that are applicable to the analyzed system, for each building block the threats of the corresponding threat categories have to be examined.

  • The LINDDUN methodology aids the analyst by providing a set of threat trees which describe the most common attack paths for each possible combination of a threat type and a DFD element type.

  • Based on these trees, the analyst will document the identified threats using Misuse Case scenarios to describe the possible attacks in detail.

  • The threats then need to be prioritized according to their risk.

  • However, LINDDUN does not explicitly provide risk analysis support.

  • The elicited threats can then be translated into privacy requirements.

  • Finally, LINDDUN provides a list of privacy solutions to mitigate the elicited threats.

In summary, LINDDUN starts with a DFD of the system that defines the system's data flows, data stores, processes, and external entities. By systematically iterating over all model elements and analyzing them from the point of view of threat categories, LINDDUN users identify a threat's applicability to the system and build threat trees.
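
Step 2 of the methodology, mapping privacy threats to DFD elements, is essentially a lookup table. The Python sketch below shows the shape of that mapping; the entries are illustrative, and the authoritative applicability matrix should be taken from the published LINDDUN threat trees.

```python
# Illustrative LINDDUN step 2: map DFD element types to the privacy threat
# categories that apply to them. Consult the official LINDDUN threat trees
# for the authoritative applicability matrix.
LINDDUN_MAP = {
    "entity":     {"Linkability", "Identifiability", "Unawareness"},
    "data_flow":  {"Linkability", "Identifiability", "Non-repudiation",
                   "Detectability", "Disclosure of information",
                   "Non-compliance"},
    "data_store": {"Linkability", "Identifiability", "Non-repudiation",
                   "Detectability", "Disclosure of information",
                   "Non-compliance"},
    "process":    {"Linkability", "Identifiability", "Non-repudiation",
                   "Detectability", "Disclosure of information",
                   "Non-compliance"},
}

def threats_for(dfd_elements):
    """Yield (element name, applicable threat) pairs the analyst must examine."""
    for name, element_type in dfd_elements:
        for threat in sorted(LINDDUN_MAP[element_type]):
            yield name, threat

dfd = [("patient", "entity"), ("health records", "data_store")]
for element, threat in threats_for(dfd):
    print(f"Examine {threat} for '{element}'")
```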

What Every Engineer Should Know About DREAD and Penetration Testing

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

DREAD as a risk assessment model

Application Threat Modeling using DREAD and STRIDE is an approach for analyzing the security of an application. It is a structured approach that enables you to

  • identify,
  • classify,
  • rate,
  • compare, and
  • prioritize the security risks associated with an application.

Application Threat Modeling should be considered separate from Risk Assessment; although similar, Application Threat Modeling is a more calculated approach. Introducing Application Threat Modeling into the SDLC process has advantages for the security of the entire project. Most importantly, performing security assessments following the threat modeling approach gives the reviewer a comprehensive overview of the application. This CRC Press News is specially focused on the DREAD methodology.

Quantitative vs. Qualitative Risk Analysis

Quantitative risk analysis

Quantitative risk analysis is about assigning monetary values to risk components. It’s composed of:

  1. Assessing the value of the asset (AV).

  2. Calculating the single loss expectancy (SLE), where SLE = AV x EF. EF is the exposure factor (expressed as a percentage value).

  3. Calculating the annualized loss expectancy (ALE), where ALE = SLE x ARO. ARO is the annual rate of occurrence.

The countermeasure should not cost more annually than the ALE. This is basically how cost/benefit analysis works.
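
In code, the whole quantitative chain is only a few lines. A minimal Python sketch (variable and function names are ours):

```python
def single_loss_expectancy(asset_value: float, exposure_factor: float) -> float:
    """SLE = AV x EF, with EF expressed as a fraction (e.g. 0.3 = 30%)."""
    return asset_value * exposure_factor

def annualized_loss_expectancy(sle: float, aro: float) -> float:
    """ALE = SLE x ARO, where ARO is the annual rate of occurrence."""
    return sle * aro

# A $200,000 asset, 30% exposed, with an incident expected every 2 years:
sle = single_loss_expectancy(200_000, 0.30)   # 60,000
ale = annualized_loss_expectancy(sle, 0.5)    # 30,000
print(f"Spend no more than ${ale:,.0f}/year on the countermeasure.")
```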

Qualitative risk analysis

Qualitative risk analysis is opinion based. It uses rating values to evaluate the risk level. The DREAD model can be used to perform qualitative risk analysis.

Qualitative Risk Analysis with the DREAD Model

DREAD is part of a system for risk-assessing computer security threats. It provides a mnemonic for rating the risk of security threats across five categories.

The categories are:

  • Damage – how bad would an attack be?
  • Reproducibility – how easy is it to reproduce the attack?
  • Exploitability – how much work is it to launch the attack?
  • Affected users – how many people will be impacted?
  • Discoverability – how easy is it to discover the threat?

The DREAD name comes from the initials of the five categories listed when it was initially proposed for threat modeling.

When a given threat is assessed using DREAD, each category is given a rating. In the original formulation each category is rated from 1 to 10, and the sum of all ratings for a given issue can be used to prioritize among different issues.

The simplified variant used here rates the threat by answering the aforementioned questions and assigning a value of high, medium, or low to every item. The rating values represent severity and are expressed as numbers (3 = high, 2 = medium, 1 = low).

The risk rating is obtained by adding rating values for all items and comparing the results with the following table:

  • High: 12 – 15

  • Medium: 8 – 11

  • Low: 5 – 7

Case Analysis with the DREAD Model

An exemplary vulnerability in web applications is provided below to better understand how DREAD works in practice. Please keep in mind that DREAD is not limited to web application vulnerabilities.

Cross-site request forgery in the admin panel allows us to add a new user and delete an existing user or all users. Let’s analyze the ratings for the items in the DREAD model.

  • Damage potential: 2

  • Reproducibility: 2

  • Exploitability: 3

  • Affected users: 3

  • Discoverability: 3

Let’s add all ratings to get the risk rating. The sum is 13 (risk rating: high).

Explanation:

  • The admin has to visit the attacker’s website so that the vulnerability is exploited. That’s why the reproducibility is medium.
  • The attacker can delete all users, making the system unavailable for them. Thus the rating for affected users is high.
  • Deleting all users doesn’t delete all data in the system. That’s why the impact on integrity is partial. Finally, there is no impact on the confidentiality of the system, provided the added user doesn’t have read permissions by default. Thus the rating for damage potential is medium.
  • The vulnerability can be easily discovered (no CSRF token, no authorization password) and exploited. That’s why the ratings for discoverability and exploitability are high.
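
The arithmetic of this example is easy to script. Below is a minimal Python sketch of the simplified 3/2/1 DREAD rating and the banding table above (the function name `dread_rating` is ours):

```python
DREAD_ITEMS = ("damage", "reproducibility", "exploitability",
               "affected_users", "discoverability")

def dread_rating(**ratings):
    """Sum five 1-3 ratings (1=low, 2=medium, 3=high) and band the result."""
    assert set(ratings) == set(DREAD_ITEMS)
    total = sum(ratings.values())
    if total >= 12:
        band = "High"
    elif total >= 8:
        band = "Medium"
    else:
        band = "Low"
    return total, band

# The CSRF-in-admin-panel example from the text:
print(dread_rating(damage=2, reproducibility=2, exploitability=3,
                   affected_users=3, discoverability=3))  # (13, 'High')
```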

Why Should Penetration Testers Care?

Assume that an engineer found some serious access control issues in a Web Application.

  • The issues were such that one user could perform actions on behalf of other users.
  • Now the engineer is in a meeting in front of the client explaining the issues, and the client (not being technical) couldn’t comprehend the severity of the bugs.
  • The engineer had no solid ground apart from saying that these bugs are in the OWASP Top 10.
  • The peer engineers argued over whether the issues were S1, S2, or S3 (S1 being most severe):
    • the engineer argued they were S1,
    • others disagreed.
  • The issue could have been resolved had the engineer done Application Threat Modeling using DREAD and STRIDE:
    • the engineer could have said, “I calculated the risk/severity using DREAD”; that is a solid foundation that hardly anyone argues with, especially management people, who love standards and frameworks, and rightly so.
  • If the severity of vulnerabilities is measured just by the “name” of the vulnerability, there is the potential to exaggerate or underrate the risk each one poses.
    • For example, if a vulnerability allows an attacker to identify usernames in an application, in itself that is not a major risk;
    • however, suppose it is a banking application that locks out users after 5 wrong password tries. An attacker could write a script to spray legitimate-username/wrong-password combinations at the application, and it would block all users from accessing banking services, or any other high-traffic service.

Now that is a major risk, and its accurate severity can only be calculated if we apply a methodological approach like DREAD.

Procedure

To perform Application Threat Risk Modeling, use

  • the OWASP testing framework to identify,
  • the STRIDE methodology to classify, and
  • the DREAD methodology to rate, compare, and prioritize risks based on severity.

Following are the steps involved in Application Threat Risk Modeling:

Decompose the application

  • The first step of threat modeling is to understand how the application interacts with internal and external entities; identify entry points, privilege boundaries, the access control matrix, and the technology stacks being used.
  • In the OWASP testing methodology this step is called the information gathering phase, where you gather maximum information about the target.

Identify

  • Implementation of the OWASP testing framework results in identifying vulnerabilities in the application; this is commonly known as Application Penetration Testing.
  • In penetration testing, testers use the tools and techniques of attackers to find as many vulnerabilities in the application as possible.

Classify Threats

  • After the vulnerabilities are identified, the STRIDE methodology is used to classify them.
  • During security engagements it is vital to back up your claims about vulnerabilities with a solid foundation like a framework or standard.
    • For example, if the engineer finds a vulnerability and classifies it according to STRIDE, peer engineers are less likely to argue about the classification.

Rate, Compare and Prioritize Threats

  • The DREAD methodology is used to rate, compare, and prioritize the severity of risk presented by each threat that is classified using STRIDE.
  • DREAD Risk = (Damage + Reproducibility + Exploitability + Affected Users + Discoverability) / 5. The calculation always produces a number between 0 and 10; the higher the number, the more serious the risk.

Following is a customized mathematical approach to implementing the DREAD methodology:

Damage Potential

  • If a threat exploit occurs, how much damage will be caused?



    • 0 = Nothing
    • 5 = Information disclosure that could be used in combination with other vulnerabilities
    • 8 = Individual/employer non-sensitive user data is compromised.
    • 9 = Administrative non-sensitive data is compromised.
    • 10 = Complete system or data destruction.
    • 10 = Application unavailability.

Reproducibility

  • How easy is it to reproduce the threat exploit?
  • 0 = Very hard or impossible, even for administrators of the application.
  • 5 = Complex steps are required for an authorized user.
  • 7.5 = Easy steps for an authenticated user.
  • 10 = Just a web browser and the address bar is sufficient, without authentication.

Exploitability

  • What is needed to exploit this threat?
    • 2.5 = Advanced programming and networking knowledge, with custom or advanced attack tools.
    • 5 = An exploit exists in public, usable with available attack tools.
    • 9 = A web application proxy tool.
    • 10 = Just a web browser.

Affected Users

  • How many users will be affected?
  • 0 = None
  • 2.5 = Individual/employer data that is already compromised.
  • 6 = Some users with individual or employer privileges, but not all.
  • 8 = Administrative users.
  • 10 = All users.

Discoverability

  • How easy is it to discover this threat?
  • 0 = Very hard; requires source code or administrative access.
  • 5 = Can figure it out by monitoring and manipulating HTTP requests.
  • 8 = Details of faults like this are already in the public domain and can be easily discovered using a search engine.
  • 10 = The information is visible in the web browser address bar or in a form.
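
Under this customized scheme each item is scored on the 0–10 scales above and the five scores are averaged, per the DREAD Risk formula given earlier. A minimal Python sketch (names ours):

```python
def dread_risk(damage, reproducibility, exploitability,
               affected_users, discoverability):
    """Customized DREAD: average of five 0-10 scores, itself 0-10."""
    scores = (damage, reproducibility, exploitability,
              affected_users, discoverability)
    if not all(0 <= s <= 10 for s in scores):
        raise ValueError("each DREAD item is scored from 0 to 10")
    return sum(scores) / 5

# Unauthenticated, browser-only exploit that takes the application down:
print(dread_risk(10, 10, 10, 10, 8))  # 9.6 -- treat as very serious
```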

The DREAD methodology can be customized to cater to the needs of your application. During consultancy engagements, the customized scheme should be approved by the client before the security assessment starts, so that the results produced by DREAD cannot be challenged afterward.

Conclusions

  • While we can mitigate the risk of an attack, we can never wholly eliminate risk from any complex application.
  • The truth in the world of security is that we recognize the nearness of threats and we deal with our risks.
  • Threat analysis enables us to examine and impart security throughout our work.
  • In our threat analysis, we have:
    • identified sensitive system assets,
    • identified ways that an attacker could compromise the system and gain access to the sensitive assets, and,
    • prioritized these threats by categorizing them as “High”, “Medium”, or “Low” risk.
  • Mitigating Threats: your web design should mitigate all the threats that your model exposes.
    • However, in some cases, mitigation might not be practical.
      • For example, consider an attack that potentially affects very few users and is unlikely to result in loss of data or system usability.
      • If mitigating such a threat requires several months of additional effort, you might reasonably choose to spend that time on additional testing instead.
      • Nevertheless, remember that eventually a malicious user is likely to find the vulnerability and mount an attack, and then the software will require a patch for the problem.

Risk of cybersecurity: “he who defends everything defends nothing.”

By: John X. Wang
Subjects: Business & Management, Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

Providing useful, actionable understanding of the organization’s attacker population has been a significant weakness of traditional threat modeling methodologies. However, developing an intelligent security policy requires an understanding of the organization’s adversaries that goes beyond applying simple, pre-defined labels. Attempting to develop a proactive security policy without truly knowing the adversary is a futile exercise. History demonstrates that it is impossible for security teams to effectively distribute their finite resources across the organization’s entire attack surface. Frederick the Great in the late 1700s succinctly summarized the crisis of cybersecurity: “he who defends everything defends nothing.”

What is a threat?

A threat refers to any method that unapproved parties can use to gain access to sensitive information, networks and applications. Some of these threats may take the form of

  • computer viruses,

  • botnets,

  • application attacks, and

  • phishing scams,

among others.

These are a few common threats companies should plan for by using threat modeling techniques:

Malware

Malware, short for malicious software, is a category of cybersecurity threats that includes threats such as

  • computer viruses,

  • spyware, and

  • adware.

It’s one of the most common threats to target both businesses and individuals.

Companies can use threat modeling to ensure that their firewalls are adequately prepared, that zero-day vulnerabilities are minimized and that new exploits or malware signatures are documented. Proper planning, along with antivirus and other security software, will ensure networks are not compromised by malware.

DDoS attacks

DDoS (distributed denial of service) attacks are a method of bombarding websites and web applications with enormous traffic requests that overload the servers they are hosted on. These attacks are powered by thousands of bots and are indistinguishable from legitimate users attempting to access the site.

Companies can model their defense and response plans to prevent this from happening. Businesses can use DDoS protection software, load balancing software and network monitoring software to improve their ability to

  • discover DDoS attacks early,

  • balance workloads properly, and

  • restrict traffic access by malicious visitors.



Phishing

Phishing is a method of obtaining user information through fraudulent communications targeted directly at people. It’s often accomplished through emails that are disguised as coming from a legitimate source but deliver the target’s information back to the hacker.

Phishing can enable hackers to gain access to sensitive information or privileged applications. Businesses can prevent this type of cybercrime through

  • the use of email security software for filtering and identification,

  • along with security awareness training to ensure employees can identify fraudulent communications.

What is threat modeling from a Risk Engineering perspective?

Threat modeling is a way to plan and optimize network security operations. Security teams

  • lay out their goals,

  • identify vulnerabilities, and

  • outline defense plans to prevent and remediate cybersecurity threats.

These are a few components of threat modeling that can be used to improve security operations and effectiveness:

Secure design

Secure design is necessary during application development to ensure the identification and prevention of vulnerabilities. Code analysis and security testing during all stages of development can help to ensure bugs, flaws and other vulnerabilities are minimized. Engineers can

  • analyze their code for known flaws during development or dynamically as an application runs, and

  • perform penetration tests after development.

  • The resulting data is used to plan for future attack mitigation and to implement updates related to new threats.

Threat intelligence

It is important to keep an up-to-date database of threats and vulnerabilities to ensure applications, endpoints and networks are prepared to defend against emerging threats. These databases may consist of public information, reside in proprietary threat intelligence software, or be built in-house.

Asset identification

It’s important to keep IT and software assets properly documented at all times. Without proper tracking and documentation, these assets may possess known flaws that go unidentified. New assets, even potentially dangerous third-party assets, may be accessing networks without security teams’ knowledge.

Mitigation capabilities

Mitigation capabilities refer to a security team’s ability to detect and resolve attacks as they emerge. This may mean the identification of malicious traffic and removal of malware, or it could simply refer to contacting your managed security services provider. Either way, mitigation is essential to effective planning so that teams are aware of their ability to combat threats with their existing resources.

Risk assessment

After application code is determined to be safe and endpoints are properly implemented, companies can assess the overall risk of their various IT components. Components may be scored and ranked or simply identified as “at risk.” Either way, they will be identified and secured in order of importance.

Mapping and modeling

These methods are combined to build visual workflows and security operations plans with the goal of resolving existing issues and planning for future threats. This type of threat modeling is based on a multi-angle approach and requires threats be planned for from every potential angle.

Threat models that are missing one component of proper planning measures may leave assets susceptible to attacks. Proper implementation will lead to faster threat mitigation in real-world scenarios and simplify the operational processes associated with detection, mitigation and analysis.

Visual, Agile, and Simple Threat (VAST)

VAST is an acronym for Visual, Agile, and Simple Threat modeling. The methodology provides actionable outputs for the unique needs of various stakeholders, such as application architects, developers, and cybersecurity personnel. It provides a unique application and infrastructure visualization scheme such that the creation and use of threat models does not require specific security subject matter expertise.

Visual Representation using Process Flow Diagram

To deal with the limitations of DFD-based threat modeling, Process Flow Diagrams (PFDs) were introduced in 2011 as a tool to allow Agile software development teams to create threat models based on the application design process. They were specifically designed to illustrate how an attacker thinks.

  • Attackers do not analyze data flows. Rather, they try to figure out how they can move through an application, which DFD-based threat modeling did not support.

  • Their analysis emphasizes how ordinary use cases can be abused to access assets or other targeted goals.

  • The VAST methodology uses PFDs for the visual representation of the application.

Threat models based on PFDs view the application from the perspective of user interactions. The steps for PFD-based threat modeling are as follows (a small illustrative sketch of the resulting structure appears after the list):

  1. Design the application’s use cases.

  2. Define the communication protocols by which individuals move between use cases.

  3. Include the various technical controls, such as forms and cookies.
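
One way to picture the artifact these steps produce is as a small data structure: use cases as nodes, and the protocols and technical controls attached to each transition between them. The following Python sketch is illustrative only; all names are ours, not part of any VAST tooling.

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    name: str                       # e.g. "log in", "view statement"

@dataclass
class Transition:
    """An edge in the process flow diagram: how a user moves between use cases."""
    source: UseCase
    target: UseCase
    protocol: str                   # e.g. "HTTPS"
    controls: list = field(default_factory=list)  # e.g. cookies, CSRF tokens

login = UseCase("log in")
statement = UseCase("view statement")
flow = Transition(login, statement, "HTTPS", ["session cookie", "CSRF token"])
print(f"{flow.source.name} -> {flow.target.name} over {flow.protocol}")
```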

PFD-based threat modeling has the following advantages:

  • PFD-based threat models are easy to understand and don’t require any security expertise.

  • They create a process map showing how individuals move through an application, making it easy to understand the application from an attacker’s point of view.

Risk Engineering: 3 Pillars for Scalable Threat Modeling Methodologies

1. Automation

Threat models are limited by the number of resource hours an application evaluation consumes. Conducting a thorough threat evaluation of a single application using manual processes could take several hours. Then multiply that by every application in an enterprise, and by several re-evaluations and updates required for ongoing post-deployment threat modeling.

 

Automated threat modeling eliminates the repetitive portion of threat modeling, taking the time needed to update a model from hours to minutes. This allows a threat modeling process to be ongoing – threats can be evaluated during design, implementation, and post-deployment on a regular basis. It also allows threat modeling to be scaled to encompass the entire enterprise, ensuring that threats are identified, evaluated, and prioritized throughout.

Oftentimes, key stakeholders worry that threat modeling is too challenging to produce actionable results.

2. Integration

A threat modeling process must integrate with the tools used throughout the SDLC to provide consistent results for evaluation. These tools may include those targeted to support the Agile framework for software development, which emphasizes adaptive planning and continuous improvement.

With an Agile SDLC, large projects are broken down into short-term goals, completed in two-week sprints. For threat modeling methodologies to support Agile DevOps, the threat model itself must be Agile, supporting the short-term sprint structure and employing threat modeling in an environment of continuous improvement and updates.

VAST was created with the principles of Agile DevOps in mind, specifically to support scalability and sustainability.

3. Collaboration

An enterprise-wide threat modeling system requires buy-in from key stakeholders, including software developers, systems architects, security managers, and senior executives throughout the organization.

Scalable threat modeling requires these stakeholders to collaborate – using a combined view of different skill sets and functional knowledge to evaluate threats and prioritize mitigation. Without collaboration, an enterprise-wide view is impossible to achieve. On the other hand, collaboration helps a company scale threat modeling activities to cover all stages of the SDLC and respond to new threats with a deeper understanding of the risks posed to the organization as a whole.

VAST threat modeling works best for enterprises that need to automate and scale threat modeling across the entire DevOps portfolio, and are looking for the process that will complement an Agile framework of continuous delivery. Integration with Agile, as well as other production tools in use by the team forms the foundation for a collaborative, comprehensive threat modeling process that leverages the strengths and skills of key stakeholders throughout the organization.

VAST Threat Modeling (Enterprise Focused)

The Visual, Agile, and Simple Threat modeling (VAST) methodology was conceived after reviewing the shortcomings and implementation challenges inherent in the other threat modeling methodologies. The founding principle is that, in order to be effective, threat modeling must scale across the infrastructure and entire DevOps portfolio, integrate seamlessly into an Agile environment and provide actionable, accurate, and consistent outputs for developers, security teams, and senior executives alike.

 

A fundamental difference of the VAST threat modeling methodology is its practical approach. Recognizing that the security concerns of development teams are distinct from those of an infrastructure team, this methodology calls for two types of threat models.

VAST: Application Threat Models

Application threat models for development teams are created with process flow diagrams (PFD). Process flow diagrams map the features and communications of an application in much the same way as developers and architects think about the application during an SDLC design session.

VAST: Operational Threat Models

Operational threat models are designed for the infrastructure teams. Though more similar to traditional DFDs than application threat models are, they present data flow information from the perspective of an attacker, not a data packet. By relying on PFDs rather than DFDs, VAST threat models do not require extensive systems expertise.

Uniquely addressing both developer and infrastructure team concerns allows organizations to incorporate threat modeling as a part of their DevOps lifecycle with different outputs for various key stakeholders.

  • The most significant difference of the VAST threat modeling methodology, however, is its ability to allow organizations to scale across thousands of threat models.

  • The pillars of a scalable threat modeling practice – automation, integration, and collaboration – are foundational to VAST threat modeling.

  • As the organization matures and new threats arise, these pillars help to develop a sustainable self-service threat modeling practice driven by the DevOps teams rather than the security team.

Summary

VAST (Visual, Agile and Simple Threat modeling) is a malleable and scalable modeling process for security planning throughout the software development lifecycle. It’s based on three pillars:

  • automation,

  • integration and

  • collaboration.

The model focuses on actionable outputs and the unique needs of developers, security personnel and executives.

VAST can be used for both operational and application threat modeling and uses workflow diagrams to illustrate

  • threats,

  • assets,

  • vulnerabilities, and

  • remediation tools

in an understandable way. It’s also designed to mirror the existing operational processes of Agile software development teams.

There is no silver bullet for security operations planning, and different modeling methods may suit some businesses better than others. It’s important to understand your existing development, IT management and security operations processes before settling on a modeling format.

The fundamental value of the method is the scalability and usability that allow it to be adopted in large organizations throughout the entire infrastructure to produce actionable and reliable results for different stakeholders.

Recognizing differences in operations and concerns among development and infrastructure teams, VAST requires creating two types of models:

  1. application threat models: use process flow diagrams, representing the architectural point of view.

  2. operational threat models: created with an attacker point of view in mind based on DFDs.

This approach allows for the integration of VAST into the organization’s development and DevOps lifecycles.

What Every Engineer Should Know About Threat Risk and Trike

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

Trike is an open source threat modeling methodology with a distinct threat rating component. It delves beyond threat modeling and into "attack graph[ing]," requiring extensive parsing and detail.

For threat rating purposes, however, it is much simpler. In the world of Trike, every attack falls into one of two attack types: elevations of privilege or denials of service. (This solves the cross-correlation problems presented by the more simplistic -- and more redundant -- STRIDE, as discussed in the earlier CRC Press News.)

With this in mind, Trike assesses the risk of these attacks impacting assets through four specific actions, which comprise the acronym CRUD:

  • Creating

  • Reading

  • Updating

  • Deleting

(The creators of Trike acknowledge that a fifth and "exotic" action, "invoking," is possible -- and thereby assessable -- in an environment that "is specifically intended to move around code and execute that same code as part of the core function of the system." But they spend no more than a paragraph discussing and quickly dismissing it because of the rarity of the situation.)

A threat rating chart using Trike will list all possible assets connected with the system being assessed; external assets are included as well. For each asset, the risk of either attack type is assessed on a five-point scale for each CRUD action.

Trike also takes human actors into account -- such as an administrator, an account holder or an anonymous user or reader. Separately, actors are rated on a five-point scale for the risk they are assumed to present to a system based on trust. A trusted administrator with full privileges, for example, would get a rating of one, whereas an anonymous user with the most limited access would get a high rating. They are also rated on a three-point scale – always, sometimes, never – for each of the CRUD actions they have access to perform upon each asset. Five-point scales are used throughout the threat modeling aspects of Trike, as well (for instance, for assessing weaknesses).

Trike Threat Modeling (Acceptable Risk Focused)

Trike threat modeling is a unique, open source threat modeling process focused on satisfying the security auditing process from a cyber risk management perspective. It provides a risk-based approach with distinct implementation and risk modeling processes. The foundation of the Trike threat modeling methodology is a “requirements model.” The requirements model ensures the assigned level of risk for each asset is “acceptable” to the various stakeholders.

With the requirements model in place, the next step in Trike threat modeling is to create a data flow diagram (DFD). System engineers created data flow diagrams in the 1970s to communicate how a system moves, stores and manipulates data. Traditionally they contained only four elements: data stores, processes, data flows, and interactors.

The concept of trust boundaries was added in the early 2000s to adapt data flow diagrams to threat modeling. In the Trike threat modeling methodology, DFDs are used to illustrate data flow in an implementation model and the actions users can perform within a system state.

The implementation model is then analyzed to produce a Trike threat model. As threats are enumerated, appropriate risk values are assigned to them from which the user then creates attack graphs. Users then assign mitigating controls as required to address prioritized threats and the associated risks. Finally, users develop a risk model from the completed threat model based on assets, roles, actions and threat exposure.

However, because Trike threat modeling requires a person to hold a view of the entire system to conduct an attack surface analysis, it can be challenging to scale to larger systems.

Trike was created as a security audit framework that uses threat modeling as a technique. It looks at threat modeling from a risk-management and defensive perspective.

As with many other methods, Trike starts with defining a system. The analyst builds a requirement model by enumerating and understanding the system's actors, assets, intended actions, and rules. This step creates an actor-asset-action matrix in which the columns represent assets and the rows represent actors.

Each cell of the matrix is divided into four parts, one for each action of CRUD (creating, reading, updating, and deleting). In these cells, the analyst assigns one of three values: allowed action, disallowed action, or action with rules. A rule tree is attached to each cell.
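
The actor-asset-action matrix is straightforward to prototype. In this hedged Python sketch, each cell assigns one of Trike's three values to every CRUD action; the actors, assets, and rule representation are illustrative, not drawn from the Trike specification.

```python
# Hypothetical actor-asset-action matrix for a Trike requirements model.
# Each cell assigns 'allowed', 'disallowed', or 'rule' to every CRUD action.
ACTIONS = ("create", "read", "update", "delete")

matrix = {
    # (actor, asset): {action: value}
    ("administrator", "user account"): dict.fromkeys(ACTIONS, "allowed"),
    ("account holder", "user account"): {
        "create": "disallowed",
        "read": "rule",        # e.g. only their own account (rule tree attached)
        "update": "rule",
        "delete": "disallowed",
    },
    ("anonymous user", "user account"): dict.fromkeys(ACTIONS, "disallowed"),
}

for (actor, asset), cell in matrix.items():
    summary = ", ".join(f"{action}={value}" for action, value in cell.items())
    print(f"{actor} / {asset}: {summary}")
```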

After defining requirements, a data flow diagram (DFD) is built. Each element is mapped to a selection of actors and assets. Iterating through the DFD, the analyst identifies threats, which fall into one of two categories: elevations of privilege or denials of service. Each discovered threat becomes a root node in an attack tree.

To assess the risk of attacks that may affect assets through CRUD, Trike uses a five-point scale for each action, based on its probability. Actors are rated on five-point scales for the risks they are assumed to present to the asset (a lower number means higher risk). Also, actors are evaluated on a three-point scale (always, sometimes, never) for each action they may perform on each asset.

Trike – A Conceptual Framework for Threat Modeling

Trike is a unified conceptual framework for security auditing from a risk management perspective through the generation of threat models in a reliable, repeatable manner. A security auditing team can use it to completely and accurately describe the security characteristics of a system, from its high-level architecture to its low-level implementation details.

Trike also enables communication among security team members and between security teams and other stakeholders by providing a consistent conceptual framework. This document describes the current version of the methodology (currently under heavy development) in sufficient detail to allow its use. In addition to detail on the threat model itself (including automatic threat generation and attack graphs), we cover the two models used in its generation, namely the requirements model and the implementation model, along with notes on risk analysis and work flows. The final version of this paper will include a fully worked example for the entire process. Trike is distinguished from other threat modeling methodologies by the high levels of automation possible within the system, the defensive perspective of the system, and the degree of formalism present in the methodology. Portions of this methodology are currently experimental; as they have not been fully tested against real systems, care should be exercised when using them.

The focus of the Trike methodology is using threat models as a risk-management tool. Within this framework, threat models are used to satisfy the security auditing process. Threat models are based on a “requirements model.” The requirements model establishes the stakeholder-defined “acceptable” level of risk assigned to each asset class. Analysis of the requirements model yields a threat model from which threats are enumerated and assigned risk values. The completed threat model is used to construct a risk model based on asset, roles, actions, and calculated risk exposure.

Summary

Trike is a threat modeling framework with similarities to the Microsoft threat modeling processes. However, Trike differs because it uses a risk-based approach with distinct implementation, threat, and risk models, instead of using the STRIDE/DREAD aggregated threat model (attacks, threats, and weaknesses). Trike’s goals are:

  • With assistance from the system stakeholders, to ensure that the risk this system entails to each asset is acceptable to all stakeholders.

  • Be able to tell whether we have done this.

  • Communicate what we’ve done and its effects to the stakeholders.

  • Empower stakeholders to understand and reduce the risks to them and other stakeholders implied by their actions within their domains.

What Every Engineer Should Know About PASTA based on Decision Making Under Uncertainty

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security

How to Protect Critical Business Assets

The Process for Attack Simulation and Threat Analysis (PASTA) is a risk-centric methodology. It provides a seven-step process for aligning business objectives and technical requirements, taking into account compliance issues and business analysis. The intent of the method is to provide a dynamic threat

  • identification,

  • enumeration, and

  • scoring

process. Once the threat model is completed, security subject matter experts develop a detailed analysis of the identified threats. Finally, appropriate security controls can be enumerated. This methodology is intended to provide an attacker-centric view of the application and infrastructure from which defenders can develop an asset-centric mitigation strategy. The main goals of the PASTA process are:

1. Improving visibility of cyber-threat risks: by providing risk management and information security with a holistic view of the company’s assets and their risk exposure from the perspective of attackers/threat actors.

2. Extending the organization’s protection domains: the compliance domain is considered as a factor in documenting security requirements; however, PASTA focuses beyond the traditional compliance-driven security domains by focusing on cyber threats, as today compliance-driven security controls can be bypassed by advanced and emerging threats.

3. Leveraging existing application security processes: PASTA stages and activities leverage data and processes used today for traditional security compliance assessments, such as vulnerability assessment, security tests/pen testing, and secure code analysis, while widening the focus to threats and attacks.

4. Integrating with the SDLC: by providing an application threat modeling process that organizations can follow to address security issues from the inception of the software development lifecycle through production delivery.

5. Increasing the maturity of the organization in software security: by evolving from vulnerability assessments to threat and attack analysis as the drivers for determining the risk mitigation strategy.

What is PASTA?

PASTA is the Process for Attack Simulation & Threat Analysis and is a risk centric threat modeling methodology aimed at identifying viable threat patterns against an application or system environment. Built around the idea of addressing likely attack patterns to high impact use cases, this approach integrates extremely well into a process of risk management.

 

Stage I - Define the Objectives

  • Identify business objectives and ensure an appropriate level of security requirements to support the business goals for the application while meeting compliance with security standards.

  • Identify preliminary security and compliance risks and their business impacts to the application.

Stage I sets the tone of importance around use cases, defining technical and business objectives for risk analysis (including analysis of the likelihood of threats and assessments of technical and business impacts).

Stage II - Define the Technical Scope

  • Define the technical scope/boundaries of threat modeling as they depend on the various technologies, software and hardware, components and services used by the application.

  • Categorize any architectural and technologies/components whose function is to provide security controls (e.g. authentication, encryption) and security features (e.g. protection of CIA).

  • Assert that the technical details of the application architecture are documented and contain all the necessary technical details for secure design of the application and for conducting the risk analysis of the application architecture.

Stage II defines technical scope of application components.

Stage III - Decompose the Application

  • Decompose the application into essential elements of the application architecture (e.g. users, servers, data assets) that can be further analyzed for attack simulation and threat analysis from both the attacker and the defender perspective.

Stage III maps what’s important to what’s in scope using Data Flow Diagrams (DFDs) with

  • decomposition of the application into basic elements such as users, roles, data stores, data flows, functions, security controls, and trust boundaries.

  • decomposition into functional components and analysis of the security controls that protect the functionality provided by each component.

Stage IV - Analyze the Threats

  • Enumerate the possible threats targeting the application as an asset.

  • Identify the most probable attack scenarios based upon threat agent models, security event monitoring, fraud mapping and threat intelligence reports.

  • The final goal is to analyze the threat and attack scenarios that are most probable and need to be prioritized later for attack simulation.

Stage IV correlates relevant threat patterns with

  • Definition of the threat landscape and identification of the specific threat agents targeting the application in scope for the analysis.

  • Analysis of internal and external threat agents.

  • Analysis of the threat agents’ capabilities, motivations, and opportunities.

  • Estimation of the probability that each threat agent will be realized in attacks against the application/product in scope.

    • NOTE: you can use standard threat risk rating methodologies, such as the OWASP Risk Rating Methodology or DREAD.

  • Mapping of the threats to the assets (functions and data) of each of the data and functional components of the application previously analyzed.

Stage V - Vulnerabilities & Weaknesses Analysis

  • The main goal of this stage of the methodology is to map the vulnerabilities identified for the different assets, including the application as well as the application infrastructure, to the threats and attack scenarios identified in the previous threat analysis stage.

  • Formal methods that map threats to several generic types of vulnerabilities such as threat trees will be used to identify which ones can be used for attacking the application assets.

  • Once these vulnerabilities are identified, they will be enumerated and scored using standard vulnerability enumeration (CVE, CWE) and scoring methods (CVSS, CWSS).

Stage V is phase 1 of the “proof” stages, proving viability with

  • Analysis of previously identified implementation vulnerabilities and application security control weaknesses, such as design flaws and security control gaps identified during Stage III, that might expose the application’s assets, data, and functions to potential threats.

  • Correlation of these vulnerabilities with the threats previously identified in Stage IV.

  • Risk calculation of the severity of vulnerabilities and weaknesses, factoring in the exposure of assets to potential threats.

  • Prioritization of security tests for specific types of vulnerabilities and security control weaknesses, based upon the re-assessed risk severity.

Stage VI - Model the Attacks

  • The goal of this stage is to analyze how the application and the application context, including the user agents, the application itself, and the application environment, can be attacked by exploiting vulnerabilities through different attack libraries and attack vectors.

  • Formal methods for the attack analysis used at this stage include attack surface analysis, attack trees and attack libraries-patterns.

  • The ultimate outcome of this stage is to map attacks to vulnerabilities and document how these vulnerabilities can be exploited by different attack vectors.

Stage VI is phase 2 of the “proof” stages, modeling attacks with

  • Analysis of how the various threats analyzed in Stage IV can be realized in attacks that will produce a negative impact to the organization.

  • The analysis of the attacks relies on the analysis of the chain of events leading to an observed security incident whose root causes are analyzed “post mortem.”

  • This analysis leads to the identification of the attack tools and techniques used by the attackers and the description of the various events that characterize the course of action of the attack, so that these attacks can be simulated using security tests.

  • The objective of these security tests is to determine the likelihood of exploits and to identify countermeasures to prevent and detect these attacks.

  • These test cases will factor in the specific threat agents, the attacking tools, and the attack techniques analyzed during Stage IV and consider the presence of the vulnerabilities and weaknesses that were previously identified in Stage V.

  • The goal of these test cases is to simulate realistic attack scenarios and determine if exploits are possible and identify countermeasures to prevent and detect them.

Stage VII - Risk Analysis & Management

  • The goal of this stage is to analyze the risk of each attack scenario that was previously simulated and tested and identify both the technical and the business impacts.

  • After risks have been analyzed, they need to be managed to reduce the impact to acceptable levels by following a risk management strategy that is in alignment with the business objectives and the risk mitigation objectives defined in Stage I.

Stage VII provides rationale for countermeasure development based upon residual risk.
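
PASTA does not prescribe a single risk formula, so the sketch below uses the classic risk = likelihood x impact calculation, discounted by countermeasure effectiveness, purely to illustrate the Stage VII residual-risk reasoning; the formula and the example weights are our assumptions, not part of PASTA.

```python
def residual_risk(likelihood: float, impact: float,
                  countermeasure_effectiveness: float) -> float:
    """Illustrative Stage VII arithmetic: inherent risk = likelihood x impact,
    scaled down by how effective the planned countermeasure is believed to be.
    All inputs are fractions in [0, 1]; the output is in [0, 1]."""
    inherent = likelihood * impact
    return inherent * (1 - countermeasure_effectiveness)

# An attack scenario judged 60% likely with severe (0.9) business impact,
# mitigated by a control believed to stop 80% of attempts:
print(round(residual_risk(0.6, 0.9, 0.8), 3))  # 0.108 -- compare to risk appetite
```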

As discussed in What Every Engineer Should Know About Decision Making Under Uncertainty, risk is about uncertainty or, more importantly, the effect of uncertainty on the achievement of objectives. The Process for Attack Simulation and Threat Analysis is a relatively new application threat modeling methodology. PASTA threat modeling provides a seven-step process for risk analysis that is platform insensitive. The goal of the PASTA methodology is to align business objectives with technical requirements while taking into account business impact analysis and compliance requirements. The output provides threat management, enumeration, and scoring.

The PASTA threat modeling methodology combines an attacker-centric perspective on potential threats with risk and impact analysis. The outputs are asset-centric. Also, the risk and business impact analysis of the method elevates threat modeling from a “software development only” exercise to a strategic business exercise by involving key decision makers in the process.

PASTA threat modeling works best for organizations that wish to align threat modeling with strategic objectives, because it incorporates business impact analysis as an integral part of the process and expands cybersecurity responsibilities beyond the IT department. This alignment can sometimes be a weakness of the PASTA threat modeling methodology. Depending on the technological literacy of key stakeholders throughout the organization, adopting the PASTA methodology can require many additional hours of training and education.

Why PASTA?

PASTA aims to bring business objectives and technical requirements together. It uses a variety of design and elicitation tools in different stages. This method elevates the threat-modeling process to a strategic level by involving key decision makers and requiring security input from operations, governance, architecture, and development. Widely regarded as a risk-centric framework, PASTA employs an attacker-centric perspective to produce an asset-centric output in the form of threat enumeration and scoring.

The purpose of PASTA (Process for Attack Simulation and Threat Analysis) is to provide a process for simulating attacks on applications, analyzing the cyber threats that originate them, and mitigating the cybercrime risks that these attacks and threats pose to organizations. PASTA consists of a seven-stage process for simulating attacks and analyzing threats to an application environment with the objective of minimizing risk and the associated impact to the business. By following this process, businesses can determine the adequate level of countermeasures to deploy to mitigate the risk from cyber threats and attacks to applications.

  • PASTA allows architects to understand how vulnerabilities in the application affect threat mitigation, identify trust boundaries and the classification of data assets, and identify vulnerabilities and apply countermeasures via proper design.

  • PASTA helps developers understand which components of the application are vulnerable and then learn how to mitigate vulnerabilities.

  • Security testers can use security requirements derived through the methodology as well to create positive and negative test cases.

  • Project managers can prioritize remediation of security defects according to risks.

  • Business managers can determine which business objectives have impact on security while information risk officers can make strategic risk management decisions by mitigating technical risks while considering costs of countermeasures versus costs associated with business impact as risk mitigation factors.

A Risk Based Threat Modeling

Understanding and exercising a broad scope of real-world attack patterns better depict the viability of threats. Combined with a risk-based approach that centers on developing countermeasures commensurate to the value of the assets being protected, PASTA (Process for Attack Simulation and Threat Analysis) allows for a linear threat model to achieve both technical sophistication and accuracy and a marketable message around risk mitigation strategy. This can be achieved by realizing three key attributes as part of its methodology: topicality, substantiation, and probabilistic analysis. These attributes will be exemplified in the step-by-step coverage of the PASTA methodology in this chapter.

For any security process to be successful, it needs to be repeatable, measurable, yield results, and invite more stakeholders than those found in security and compliance. The risk-based threat model detailed in this chapter provides a linear methodology that encompasses all of these characteristics. Its multistep process is combined with a multifaceted focus on various stakeholders. In lieu of IT, information security, and business groups maintaining disaccord over security deliverables, a risk-based threat modeling approach unifies disparate goals over a linear workflow that is comprehensive yet simple to use.

What Every Engineer Should Know About Threat Model and STRIDE

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Homeland Security, Information Technology, Web, Web2

We live in a world that makes heavy use of information. We use it to determine whether we need a jacket or umbrella for the day. Businesses use it to determine whether they made a profit for the week. Governments use it to determine things like incoming tax revenue. In fact, you could say that most of us rely on it. Of course, there are those that will try to exploit that information for personal gain, perhaps through ransom or through sale to the highest bidder. It's a sad fact, but true nonetheless. So, it will come as no surprise that there are also those out there who are working to determine the significance of these threats, since the repercussions of software failure are costly and, at times, catastrophic. This can be seen today in a wide variety of incidents,

  • from data leak incidents caused by misconfigured AWS S3 buckets

  • to Facebook data breach incidents due to lax API limitations

  • to the Equifax incident due to the use of an old Apache Struts version with a known critical vulnerability.

Application Security advocates encourage developers and engineers to adopt security practices as early in the Software Development Life Cycle (SDLC) as possible. One such security practice is Threat Modeling.

What is a Threat Model?

Threat modeling is a process by which potential threats, such as structural vulnerabilities, can be identified, enumerated, and prioritized – all from a hypothetical attacker’s point of view. The purpose of threat modeling is to provide defenders with a systematic analysis of the probable attacker’s profile, the most likely attack vectors, and the assets most desired by an attacker.

A threat model, or threat risk model, is a process that reviews the security of any web-based system, identifies problem areas, and determines the risk associated with each area. There are five steps in the process:

  • Identify Security Objectives - This step determines the overall goals the organization has in regard to its security.

  • Survey the System - This step determines the components of the system, the routes through which data travels, and trust boundaries (connections made to outside networks).

  • Decompose the System - This step determines the components of the system that have an effect on security, like the login module.

  • Identify Threats - This step enumerates any potential outside threats that the system has. This generally focuses on those that are known. (How do you identify those that aren't?)

  • Identify Vulnerabilities - This step looks at the identified threats and determines whether the system is weak in these areas.



What is STRIDE?

There are various ways and methodologies of doing threat models. This CRC Press News provides a high-level introduction to one methodology, called STRIDE. STRIDE is an acronym that stands for six categories of security risks: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege. Each category of risk aims to address one aspect of security.

STRIDE is a model of threats for identifying computer security threats. It provides a mnemonic for security threats in six categories. The threats are:

  • Spoofing

  • Tampering

  • Repudiation

  • Information disclosure (privacy breach or data leak)

  • Denial of service

  • Elevation of privilege

STRIDE was initially created as part of the process of threat modeling. STRIDE is a model of threats, used to help reason about and find threats to a system. It is used in conjunction with a model of the target system that can be constructed in parallel, including a full breakdown of processes, data stores, data flows, and trust boundaries.

Today it is often used by security experts to help answer the question

  • "what can go wrong in this system we're working on?"

Each threat is a violation of a desirable property for a system:

Desired property / Threat / Definition / Example:

  • Authentication / Spoofing: Impersonating something or someone else. Example: pretending to be any of billg, microsoft.com, or ntdll.dll.

  • Integrity / Tampering: Modifying data or code. Example: modifying a DLL on disk or DVD, or a packet as it traverses the LAN.

  • Non-repudiation / Repudiation: Claiming to have not performed an action. Example: "I didn't send that email," "I didn't modify that file," "I certainly didn't visit that web site, dear!"

  • Confidentiality / Information Disclosure: Exposing information to someone not authorized to see it. Example: allowing someone to read the Windows source code; publishing a list of customers to a web site.

  • Availability / Denial of Service: Denying or degrading service to users. Example: crashing Windows or a web site, sending a packet and absorbing seconds of CPU time, or routing packets into a black hole.

  • Authorization / Elevation of Privilege: Gaining capabilities without proper authorization. Example: allowing a remote internet user to run commands is the classic example, but going from a limited user to admin is also EoP.

Let's dive into each of these categories.

Spoofing

Spoofing refers to the act of posing as someone else (i.e., spoofing a user) or claiming a false identity (i.e., spoofing a process). This category is concerned with authenticity. Most security systems rely on the identification and authentication of users. Spoofing attacks consist of using another user's credentials without their knowledge. Typical spoofing threats target weak authentication mechanisms, for instance those using simple passwords, such as a four-digit number, or those using personal information that can be easily found, such as date or place of birth.

Examples:

  • One user spoofs the identity of another user by brute-forcing username/password credentials.

  • A malicious, phishing host is set up in an attempt to trick users into divulging their credentials.

You would typically mitigate these risks with proper authentication.
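To make "proper authentication" concrete, here is a minimal Python sketch (standard library only; the function names are illustrative, not from any cited source) of salted, slow password hashing with constant-time verification, two of the usual defenses against the brute-forcing and credential-theft examples above:

```python
import hashlib
import os
import secrets

def hash_password(password, salt=None):
    """Derive a slow, salted hash so a stolen credential store resists brute force."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify_password(password, salt, expected):
    """Compare digests in constant time to avoid leaking information via timing."""
    _, digest = hash_password(password, salt)
    return secrets.compare_digest(digest, expected)

# Usage: store (salt, digest) at registration; verify at login.
salt, digest = hash_password("correct horse battery staple")
assert verify_password("correct horse battery staple", salt, digest)
assert not verify_password("1234", salt, digest)
```

Multi-factor authentication and login rate limiting would complement this; the sketch covers only credential storage and checking.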

Tampering

Tampering refers to malicious modification of data or processes. Tampering may occur on data in transit, on data at rest, or on processes. This category is concerned with integrity. Only authorized users should be able to modify a system or the data it uses. If an attacker is able to tamper with either, the consequences can affect the usage of the system itself, for instance if the attacker can add or remove functional elements, or its purpose, for instance if important data is destroyed or modified.

Examples:

  • A user performs bit-flipping attacks on data in transit.

  • A user modifies data at rest/on disk.

  • A user performs injection attacks on the application.

You would typically mitigate these risks with:

  • Proper validation of users' inputs and proper encoding of outputs.

  • Prepared SQL statements or stored procedures to mitigate SQL injections (see the sketch after this list).

  • Integration with static security code analysis tools to identify security bugs.

  • Integration with software composition analysis tools (e.g., snyk, npm audit, BlackDuck, etc.) to identify third-party libraries/dependencies with known security vulnerabilities.
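As a quick illustration of the prepared-statement point, here is a minimal, self-contained Python sketch using the standard library's sqlite3 module (the table and data are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user(conn, name):
    # Vulnerable alternative (never do this): splicing input into query text,
    #   conn.execute(f"SELECT * FROM users WHERE name = '{name}'")
    # Safe: the driver binds the value, so input stays data, never SQL.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

print(find_user(conn, "alice"))               # [('alice', 'admin')]
print(find_user(conn, "alice' OR '1'='1"))    # []  (the injection is inert)
```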

Repudiation

Repudiation refers to the ability to deny that an action or an event has occurred. Repudiation is unusual because it is a threat when viewed from a security perspective yet a desirable property of some privacy systems; this is a useful demonstration of the tension that security design analysis must sometimes grapple with. This category is concerned with non-repudiation. Attackers often want to hide their malicious activity to avoid being detected and blocked. They might therefore try to repudiate actions they have performed, for instance by erasing them from the logs or by spoofing the credentials of another user.

Examples:

  • A user denies performing a destructive action (e.g. deleting all records from a database).

  • Attackers commonly erase or truncate log files as a technique for hiding their tracks.

  • Administrators are unable to determine whether a container has started to behave suspiciously or erratically.

You would typically mitigate these risks with proper audit logging.
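As a minimal sketch of what "proper audit logging" can look like in practice (Python standard library only; the handler setup and function names are illustrative):

```python
import getpass
import logging

# Append-only, timestamped audit records. In production, also ship them to a
# separate, write-protected log host so attackers cannot erase their tracks.
logging.basicConfig(
    filename="audit.log",
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)
audit = logging.getLogger("audit")

def delete_records(table):
    # Record who did what, to which object, before performing the action.
    audit.info("user=%s action=delete_all table=%s", getpass.getuser(), table)
    # ... the actual deletion would happen here ...

delete_records("orders")
```

The key design point is in the first comment: logs kept only on a host the attacker controls are exactly the logs the Repudiation examples above describe being truncated.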

Information Disclosure

Information Disclosure refers to data leaks or data breaches. This could occur on data in transit, on data at rest, or even in a process. This category is concerned with confidentiality. Many systems contain confidential information, and attackers often aim at getting hold of it. There have been numerous examples of data breaches in recent years.

Examples:

  • A user is able to eavesdrop, sniff, or read traffic in clear-text.

  • A user is able to read data on disk in clear-text.

  • A user attacks an application protected by TLS but is able to steal X.509 (SSL/TLS certificate) decryption keys and other sensitive information. Yes, this happened.

  • A user is able to read sensitive data in a database.

You would typically mitigate these risks by:

  • Implementing proper encryption.

  • Avoiding self-signed certificates; use a valid, trusted Certificate Authority (CA) and verify certificates on every connection (see the sketch after this list).
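Here is a minimal Python sketch of the client side of that advice, using only the standard library: ssl.create_default_context() turns on certificate and hostname verification against the system trust store (the host name is just an example):

```python
import socket
import ssl

# Verification is on by default: untrusted or mismatched certificates raise errors.
context = ssl.create_default_context()

def fetch_https(host, path="/"):
    with socket.create_connection((host, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            tls.sendall(request.encode())
            chunks = []
            while True:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
            return b"".join(chunks)

print(fetch_https("example.com")[:120])
```

The common anti-pattern is disabling verification to silence certificate errors; that converts an Information Disclosure control into a man-in-the-middle opportunity.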

Denial of Service

Denial of Service refers to causing a service or a network resource to be unavailable to its intended users. This category is concerned with availability. A system is usually deployed for a particular purpose, whether it is a banking application or an integrated media management system in a car. In some cases, attackers have an interest in preventing regular users from accessing the system, for instance as a way to blackmail and extort money from the owner of the system (e.g., with ransomware).

Examples:

  • A user performs a SYN flood attack.

  • The storage (i.e. disk, drive) becomes too full.

  • A Kubernetes dashboard is left exposed on the Internet, allowing anyone to deploy containers on your company's infrastructure to mine cryptocurrency and starve your legitimate applications of CPU. Yes, that happened too.

Mitigating this class of security risks is tricky because solutions depend heavily on the situation; one generic control, application-level rate limiting, is sketched after the examples below.

  • For the Kubernetes example, you would mitigate resource consumption with resource quotas.

  • For a storage example, you would mitigate this with proper log rotation and monitoring/alerting when disk is nearing capacity.
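Rate limiting is not one of the two examples above, but it is a common application-level DoS control. Here is a minimal token-bucket sketch in Python (the parameters are illustrative):

```python
import time

class TokenBucket:
    """Allow short bursts but cap the sustained request rate per client."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed load instead of letting one client starve the rest

bucket = TokenBucket(rate=5, capacity=10)  # ~5 requests/second, bursts of 10
accepted = sum(bucket.allow() for _ in range(100))
print(f"accepted {accepted} of 100 back-to-back requests")
```

In a real service, you would keep one bucket per client identity (API key, IP address) so a single noisy client cannot consume everyone's capacity.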

Elevation of Privileges

Elevation of Privileges refers to gaining access that one should not have. This category is concerned with authorization. Once a user is identified on a system, they usually have some privileges, i.e., they are authorized to perform some actions, but not necessarily all of them. An attacker might therefore try to acquire additional privileges, for instance by spoofing a user with higher privileges or by tampering with the system to change their own privileges.

Examples:

  • A user takes advantage of a Buffer Overflow to gain root-level privileges on a system.

  • A user with limited to no permissions to Kubernetes can elevate their privileges by sending a specially crafted request to a container with the Kubernetes API server's TLS credentials. Yes, this was possible.

Mitigating these risks would require a few things:

  • A proper authorization mechanism, e.g., role-based access control (sketched below).

  • Security static code analysis to ensure your code has little to no security bugs.

  • Composition analysis (aka dependency checking/scanning), such as snyk or npm audit, to ensure that you are not relying on known-vulnerable third-party dependencies.

  • Generally practicing the principle of least privilege, such as running your web server as a non-root user.
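To ground the role-based access control bullet, here is a minimal, deny-by-default RBAC sketch in Python (the roles, permissions, and function names are invented for illustration):

```python
import functools

ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "delete"},
}

def require(permission):
    """Deny by default: only roles explicitly granted the permission pass."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user, *args, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(user["role"], set()):
                raise PermissionError(f"{user['name']} lacks '{permission}'")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator

@require("delete")
def purge_records(user, table):
    return f"{user['name']} purged {table}"

print(purge_records({"name": "root", "role": "admin"}, "logs"))
# purge_records({"name": "guest", "role": "viewer"}, "logs")  # PermissionError
```

Checks like this belong on the server side of every privileged operation; client-side checks alone are exactly what an Elevation of Privilege attacker bypasses.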

Summary

To assess the security of a system, we must look at all the possible threats. The STRIDE model is a useful tool to help us classify them. STRIDE is a threat modeling methodology that helps you systematically examine and address gaps in the security posture of your applications. STRIDE is an acronym that stands for:

  • Spoofing Identity - This is a threat where one user takes on the identity of another. For example, an attacker takes on the identity of an administrator.

  • Tampering with Data - This is a threat where information in the system is changed by an attacker. For example, an attacker changes an account balance.

  • Repudiation - This is a threat where an attacker deletes or changes transaction or login information in an attempt to deny that the events ever took place. For example, deleting a purchase transaction so the item isn't charged to you.

  • Information Disclosure - This is a threat where sensitive information is stolen and sold for profit. For example, information on the latest widget is stolen and offered to a competitor for profit.

  • Denial of Service - This is a threat where the resources of a system are overwhelmed and processing stops for everyone. For example, a disgruntled attacker could have automated servers continually log into a system, tying up all connections so legitimate users can't get in.

  • Elevation of Privilege - This is a threat similar to spoofing, but instead of taking on the identity of another user, the attacker elevates their own security level to that of an administrator.

Threat Modeling using Attack Trees

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology, Web, Web2

What Are Attack Trees?

A common practice for studying the risk to a business is based on risk engineering and management principles; i.e., security resources are applied to the vulnerabilities that pose the greatest risk to the business. Several processes for identifying and prioritizing risk have been proposed in the literature. One of the most effective is threat modeling. Traditional threat modeling, as taught in academia, involved mostly mathematical and theoretical concepts, while industry descriptions leaned on computer-security marketing jargon, making the subject very hard to understand or analyze. This CRC Press News presents a practical, high-level guide to help engineers understand the concepts of threat modeling. Attack Trees are conceptual diagrams of threats on systems and possible attacks to realize those threats.

Few people truly understand computer security, as illustrated by computer-security company marketing literature that touts "hacker proof software," "triple-DES security," and the like. In truth, "unbreakable" security is broken all the time, often in ways its designers never imagined. Seemingly strong cryptography gets broken, too. Attacks once thought to be beyond the ability of mortal men become commonplace. And as newspapers report security bug after security bug, it becomes increasingly clear that the term "security" doesn't have meaning unless you also know things like

  • "Secure from whom?" or

  • "Secure for how long?"

Clearly, what we need is a way to model threats against computer systems. If we can understand all the different ways in which a system can be attacked, we can likely design countermeasures to thwart those attacks. And if we can understand who the attackers are -- not to mention their abilities, motivations, and goals -- maybe we can install the proper countermeasures to deal with the real threats.

Attack trees are conceptual diagrams showing how an asset, or target, might be attacked. Attack trees have been used in a variety of applications. In the field of information technology, they have been used to describe threats on computer systems and possible attacks to realize those threats. However, their use is not restricted to the analysis of conventional information systems.

  • They are widely used in the fields of defense and aerospace for the analysis of threats against tamper resistant electronics systems (e.g., avionics on military aircraft).

  • Attack trees are increasingly being applied to computer control systems (especially those relating to the electric power grid). Attack trees have also been used to understand threats to physical systems.

Attack Trees: Threat Modeling Diagrams Explained

With the severity of data breaches and cybercrime escalating, it is now more important than ever to protect the confidential information your business processes. Organizations use attack tree diagrams to better understand their attack surface - the points in technical systems and applications that are vulnerable to cyberattacks. Within the realm of IT risk management, companies visualize security threats in attack tree diagrams to better understand and mitigate risk.

In an attack tree, the root node is the primary target of the attack against a technical system; it has no parent node. Leaf nodes make up the parts and passageways that can lead to a data breach.

  • Attack trees are useful tools for IT asset risk management.

  • They can be used to help network security professionals to gain a more comprehensive understanding of specific cyberattacks, and how cyber criminals infiltrate IT systems.

  • Attack trees are also practical for conducting risk audit analysis, helping information security managers to get to the root cause of cyberattacks and prescribe strategies to remove threats.

Attack trees are hierarchical, graphical diagrams that show how low level hostile activities interact and combine to achieve an adversary's objectives - usually with negative consequences for the victim of the attack.

Similar to many other types of trees (e.g., decision trees), the diagrams are usually drawn inverted, with the root node at the top of the tree and branches descending from the root.

  • The top or root node represents the attacker's overall goal.

  • The nodes at the lowest levels of the tree (leaf nodes) represent the activities performed by the attacker.

  • Nodes between the leaf nodes and the root node depict intermediate states or attacker sub-goals.

  • Although the attacker may cause harm or damage (and the victim suffer impacts) at any level of the tree, the impacts usually increase at higher levels of the tree.

Attack Tree Example

Goal: Gain unauthorized physical access to building

Attack:

OR 1. Unlock door with key
       OR 1. Steal key
          2. Social engineering
                OR 1. Borrow key
                   2. Convince locksmith to unlock door
   2. Pick lock
   3. Break window
   4. Follow authorized individual into building
       OR  1. Act like you belong and follow someone else
           2. Befriend someone authorized outside the building
           3. Appear in need of assistance (such as carrying a large box)
       AND 4. Wear appropriate clothing for the location
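Because an attack tree is just an AND/OR hierarchy, it is straightforward to represent and evaluate in code. Here is a minimal Python sketch of the example above (boolean leaf feasibility is a simplification; real analyses often annotate leaves with cost, probability, or required skill instead):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An attack-tree node. A leaf carries the analyst's feasibility judgment;
    an inner node combines children with AND (all required) or OR (any one)."""
    name: str
    gate: str = "LEAF"        # "LEAF", "OR", or "AND"
    feasible: bool = False
    children: list = field(default_factory=list)

def attainable(node):
    if node.gate == "LEAF":
        return node.feasible
    results = [attainable(child) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)

goal = Node("Gain unauthorized physical access", "OR", children=[
    Node("Unlock door with key", "OR", children=[
        Node("Steal key"),
        Node("Convince locksmith to unlock door"),
    ]),
    Node("Follow authorized individual", "AND", children=[
        Node("Act like you belong", feasible=True),
        Node("Wear appropriate clothing", feasible=True),
    ]),
])
print(attainable(goal))  # True: the tailgating AND-branch is fully feasible
```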

Discussion

Attack Trees could be used to analyze problems in many different domains including but not limited to

  • Oil/gas pipelines,

  • Chemical Plants,

  • Information Technology,

  • Infrastructure, and

  • Facilities.

However, applying Attack Trees to analyze problems we are familiar with may be overkill. For instance, attacks which happen frequently (such as house break-ins) are well understood and intuitive.

  • Attack Trees are typically applied for architecture risk analysis and hence may describe attacks for specific protocols that appear in the architecture.

  • An Attack Tree for requirements engineering might start with the risks identified during a preliminary risk analysis and be refined by the analysis of the concept of operations.

While the generation of an Attack Tree can be done incrementally and refined by multiple contributors, there is no guarantee of completeness. Often, attacker-specific information is a best guess. An Attack Tree can be quite detailed, and that detail increases the cost of both creation and maintenance, particularly for a large system. On the other hand, widely applicable attack trees could be shared and hence refined by a relatively large collection of experts. However, it should be noted that Attack Trees do not necessarily represent all possible attacks, and the unrepresented attacks may become more prevalent as engineers deploy larger and more complex systems.

Threat Modeling: Applying Cellular Manufacturing to Mitigate Risk and Uncertainty for Network Security

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology, Web, Web2

Threat modeling is a procedure for optimizing network security by identifying objectives and vulnerabilities, and then defining countermeasures to prevent, or mitigate the effects of, threats to the system. In this context, a threat is a potential or actual adverse event that may be malicious (such as a denial-of-service attack) or incidental (such as the failure of a storage device), and that can compromise the assets of an enterprise.

The key to threat modeling is to determine where the most effort should be applied to keep a system secure. This is a variable that changes as new factors develop and become known, applications are added, removed, or upgraded, and user requirements evolve. Threat modeling is an iterative process that consists of defining enterprise assets, identifying what each application does with respect to these assets, creating a security profile for each application, identifying potential threats, prioritizing potential threats, and documenting adverse events and the actions taken in each case.

Threat modeling works to identify, communicate, and understand threats and mitigations within the context of protecting something of value.

Threat modeling can be applied to a wide range of things, including software, applications, systems, networks, distributed systems, things in the Internet of Things, business processes, etc. There are very few technical products which cannot be threat modeled; the exercise is more or less rewarding depending on how much the product communicates or interacts with the world. Threat modeling can be done at any stage of development, preferably early, so that the findings can inform the design.

What

Most of the time, a threat model includes:

  • A description / design / model of what you’re worried about

  • A list of assumptions that can be checked or challenged in the future as the threat landscape changes

  • A list of potential threats to the system

  • A list of actions to be taken for each threat

  • A way of validating the model and threats, and verification of success of actions taken

The motto is: the sooner the better, but never too late.
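For teams that track threat models as living documents, the elements listed above map naturally onto a simple record. A minimal Python sketch (the field names and example content are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class ThreatModel:
    """One record per system under analysis, mirroring the list above."""
    description: str                 # what you're worried about
    assumptions: list = field(default_factory=list)
    threats: list = field(default_factory=list)
    actions: dict = field(default_factory=dict)   # threat -> planned action
    validated: bool = False          # flipped once actions are verified

model = ThreatModel("Payment API")
model.assumptions.append("TLS terminates at the load balancer")
model.threats.append("Spoofing: stolen API keys")
model.actions["Spoofing: stolen API keys"] = "Rotate keys; require mutual TLS"
```

Revisiting the assumptions list as the threat landscape changes is what keeps the model honest.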

Why

The inclusion of threat modeling in the Software Development Life Cycle (SDLC) can help

  • Build a secure design

  • Invest resources efficiently; appropriately prioritize security, development, and other tasks

  • Bring security and development together to collaborate on a shared understanding, informing development of the system

  • Identify threats and compliance requirements, and evaluate their risk

  • Define and build required controls

  • Balance risks, controls, and usability

  • Identify where building a control is unnecessary, based on acceptable risk

  • Document threats and mitigations

  • Ensure business requirements (or goals) are adequately protected in the face of a malicious actor, accidents, or other causes of impact

  • Identify security test cases/scenarios to test the security requirements

Four Questions

Most threat model methodologies answer one or more of the following questions in the technical steps which they follow:

    • What are we building?

As a starting point, you need to define the scope of the Threat Model. To do that, you need to understand the application you are building; examples of helpful techniques are:

  • Architecture diagrams

  • Dataflow transitions

  • Data classifications

You will also need to gather people from different roles with sufficient technical and risk awareness to agree on the framework to be used during the Threat Modeling exercise.

    • What can go wrong?

This is a "research" activity in which you want to find the main threats that apply to your application. There are many ways to approach the question, including brainstorming or using a structure to help think it through. Structures that can help include STRIDE, Kill Chains, CAPEC, and others.

    • What are we going to do about that?

In this phase you turn your findings into specific actions.

    • Did we do a good enough job?

Finally, carry out a retrospective activity over the work you have done to check quality, feasibility, progress, and/or planning.

Applying Cellular Manufacturing to Mitigate Risk and Uncertainty for Network Security

The technical steps in threat modeling involve answering four questions: What are we working on? What can go wrong? What will we do with the findings? Did we do a good job? The work to answer these questions is embedded in applying Cellular Manufacturing to mitigate risk and uncertainty for network security, and it fits development processes ranging from incredibly informal Kanban with Post-its on the wall to strictly structured waterfalls.

The effort, work, and time frames spent on threat modeling relate to the process in which engineering happens and products/services are delivered. The idea that threat modeling is waterfall or 'heavyweight' is based on threat modeling approaches from the early 2000s. Modern threat modeling building blocks fit well into Agile and are in wide use.

When to threat model

When the system changes, you need to consider the security impact of those changes. Sometimes those impacts are not obvious.

Threat modeling integrates into Agile by asking "what are we working on, now, in this sprint/spike/feature?"; trying to answer this can be an important aspect of managing security debt, but trying to address it per sprint can be overwhelming. When the answer is that the system's architecture isn't changing, no new processes or dataflows are being introduced, and there are no changes to the data structures being transmitted, then it is unlikely that the answers to "what can go wrong" will change. When one or more of those changes, then it is useful to examine what can go wrong as part of the current work package, to understand the design trade-offs you can make, and to understand what you are going to address in this sprint and in the next one. The question of "did we do a good job?" is split: "did we address these threats?" is part of sprint delivery or merging, while the broader question is an occasional saw-sharpening task.

After a security incident, going back and checking the threat models can be an important process.

Threat modeling: engagement versus review

Threat modeling at a whiteboard can be a fluid exchange of ideas between diverse participants. Using the whiteboard to construct a model that participants can rapidly change based on identified threats is a high-return activity. The models created there (or elsewhere) can be meticulously transferred to a high-quality archival representation designed for review and presentation. Those models are useful for documenting what’s been decided and sharing those decisions widely within an organization. These two activities are both threat modeling, yet quite different.

Threat modeling methodologies

Conceptually, a threat modeling practice flows from a methodology. Numerous threat modeling methodologies are available for implementation. Typically, threat modeling has been implemented using one of three approaches independently: asset-centric, attacker-centric, or software-centric. Based on the volume of published online content, the four methodologies discussed below are the most well known.

STRIDE methodology

The STRIDE approach to threat modeling was introduced in 1999 at Microsoft, providing a mnemonic for developers to find 'threats to our products'. STRIDE, Patterns and Practices, and Asset/entry point were among the threat modeling approaches developed and published by Microsoft. References to "the" Microsoft methodology commonly mean STRIDE and Data Flow Diagrams. The threats are:

  • Spoofing

  • Tampering

  • Repudiation

  • Information disclosure (privacy breach or data leak)

  • Denial of service

  • Elevation of privilege

P.A.S.T.A.

The Process for Attack Simulation and Threat Analysis (PASTA) is a seven-step, risk-centric methodology for aligning business objectives and technical requirements, taking into account compliance issues and business analysis. The intent of the method is to provide a dynamic threat identification, enumeration, and scoring process. Once the threat model is completed, security subject matter experts develop a detailed analysis of the identified threats. Finally, appropriate security controls can be enumerated. This methodology is intended to provide an attacker-centric view of the application and infrastructure from which defenders can develop an asset-centric mitigation strategy.

Trike

The focus of the Trike methodology is using threat models as a risk-management tool. Within this framework, threat models are used to satisfy the security auditing process. Threat models are based on a “requirements model.” The requirements model establishes the stakeholder-defined “acceptable” level of risk assigned to each asset class. Analysis of the requirements model yields a threat model from which threats are enumerated and assigned risk values. The completed threat model is used to construct a risk model based on asset, roles, actions, and calculated risk exposure.

VAST

VAST is an acronym for Visual, Agile, and Simple Threat modeling. The underlying principle of this methodology is the necessity of scaling the threat modeling process across the infrastructure and entire SDLC, and integrating it seamlessly into an Agile software development methodology. The methodology seeks to provide actionable outputs for the unique needs of various stakeholders: application architects and developers, cybersecurity personnel, and senior executives. The methodology provides a unique application and infrastructure visualization scheme such that the creation and use of threat models do not require specific security subject matter expertise.

Defining Safety Requirements During New Product Development (NPD)

By: John X. Wang
Subjects: Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Ergonomics & Human Factors, Occupational Health & Safety

Based on Industrial Design Engineering: Inventive Problem Solving, a systematic method for integrating human safety early in the design process has been developed. Called Industrial Risk Engineering Design (IRED), this method facilitates the generation of safety requirements through past-experience analysis and design-choice analysis all along the design process. Design parameters thus result simultaneously from technical and safety functional requirements. This CRC Press News deals with the problem of defining safety objectives and flow-down requirements early in the product design process. It highlights the mechanism offered by IRED for generating non-technical design objectives when preparing the requirements and constraints list. It shows that there are different typologies of safety objectives depending on the evolution of the product; in fact, there is a specific type of safety objective/requirement to be taken into account at each design stage. Finally, the applicability of the method is demonstrated through application to a water faucet case study and to mechanical person-machine interfaces.

The question that we tried to answer in this CRC Press News is:

  • what is a safety requirement?

Consideration of this question is based on the definition of design objectives (functional requirements and constraints).

Based on IRED, we can convert the defined risks into functional safety requirements, integrated into the specification document and taken into account in the design stages. These requirements are enhanced and specified throughout the design process through risk analysis related to the design choices, and are considered to evolve simultaneously with the product development. Safety requirements are defined throughout design and added to the specification documents. Consequently, the specification document is enhanced with the possible undesirable events and serves for verification and validation. Integrating the safety functional requirements into the technical product design, all along the product development, constitutes the risk reduction process. These operations of design synthesis, analysis, and risk and safety requirements identification correspond to the conceptual model of the IRED approach.

As the typology of the design parameters' characteristics depends on the design stage, the potential risks depend on the considered design stage as well. This observation leads us to consider that there is a mapping process between design and risk describing the compatibility between the design and the human characteristics. Therefore, the design process communicates with a "risk process" which is divided similarly into three steps according to the abstraction level of the solution. Thus, we distinguish the Human-Principle Interaction (HPI), the Human-System Interaction (HSI), and the Human-Machine Interaction (HMI). The HPI corresponds to the interaction between the human and the design solution in the conceptual design stage. The HSI corresponds to the interaction between the human and the design solution in the embodiment design stage. Finally, the HMI corresponds to the interaction between the human and the design solution in the detailed design stage. This "risk process" describes the safety requirements generation and the risks identification processes.

The Human-Principle Interaction (HPI)

The HPI corresponds to the conceptual design stage. From the design point of view, at this level the overall functional requirements are decomposed into less abstract sub-requirements, and one or more working principles are selected. Therefore, the potential interaction with the human could be related either to the environment of the product or to the chosen working principle. From this interaction, safety requirements related to the environment and to the solution's principle will result.

The Human-System Interaction (HSI)

The HSI corresponds to the embodiment design stage. The conceptual design working principles are structured, and the occupied as well as available spaces are defined. In addition, the way in which the product will function is also specified. The dangerous zones and the intervention zones of the user are defined. The user location is related to functional and physical structures. At this step, the interaction with the human is related either to the human activity or to the nature of the structuring parameters.

The Human-Machine Interaction (HMI)

The HMI corresponds to the detail design stage. From the design viewpoint, the product components, layouts, etc. are defined. Notice that, traditionally, risks are analyzed and corrective actions implemented at the end of this stage. In the IRED approach, potential accidents and ergonomics are handled systematically. Here, less important risks related to components, final forms, etc. are studied. At this stage, potential interaction mainly involves the design technical choices.

Ecosystem Requirements (ESRs)

We call the elements in interaction with the product in a given lifecycle situation "ecosystems". They may correspond to physical, human, environmental, etc., components. This context outlines safety requirements related to the use context of the product. These requirements are deduced from risks that have arisen in the field (experience feedback) due to the use of the same or similar products. At the associated design phase, the product is described through its desirable functionalities (mainly technical), design constraints, characteristics, and super-systems. The context of ESRs completes these technical requirements with others ensuring the minimization of risks generated by the super-systems. This kind of risk exists totally independently of the design choices. In this regard, this type of safety requirement is expressed by an infinitive verb connecting two or more super-systems that belong to the considered use situation. In this context, safety requirements are input requirements and are specific to the overall design goals. Here, safety requires the designer to take specific actions that can be well specified.

System Requirements (SRs)

We call the product's structure and architecture the "system". This structure is based on the organization and structuring of the solutions' principles. This context describes the safety requirements leading to the identification of the product's structure. From the human point of view, this structure induces a procedure resulting from the product's functioning mode. A procedure is a set of activities cooperating in a chronological way in order to reach a specific goal; we therefore consider a procedure as the succession in time and space of the multiple tasks that a user must perform.

At this stage, design consists in allocating functions between the solution and the human in an ordered way. Function allocation is directly affected by the working principle chosen at the previous phase, which defines the nature of the activity (automated or manual) as well as the degree and frequency of human intervention. This sets constraints that point out the better product structure. Here, human safety is characterized by the human's spatial position, the timing of his activity, and his anthropometric data. In addition, the nature of the human activity is bounded by his physical effort limitations; these limitations will influence the product's functioning mode as well as dimensioning and material choices.

At this design stage, human characteristics are input constraints defined at the beginning of the design process. These constraints may result from either experience feedback or standards. The main characteristic of this context is to describe the spatial and temporal separation between the product and the human. Besides these constraints, this context contains system safety requirements, which consist of the specification of input constraints according to the physical design choices.

Sub-System Requirements (SSRs)

We call the product's components that allow finishing the product at the detail design level the "sub-system". This context describes safety requirements involving the component choices. A large number of these components constitute the human-machine interface. The human-machine interface describes the interaction between the user and the product in the use phase; it results from the nature of the human activity. More precisely, this context describes the safety requirements that lead to the choice of the product's final components according to the required human characteristics (vulnerability and ergonomics). The difference with the previous types of interaction is that here, safety requirements have minor effects on the global product safety. At this level, safety is expressed as functional requirements induced by the previous levels; these requirements result from the risk remaining from the previous levels.

Safety Objectives Typology

System safety objectives are defined during design; they hold for a specific design and are a consequence of the design choices. Risks related to the choices of design parameters are converted into safety requirements to be taken into account in the following design stages. System safety objectives are functional requirements. This type of functional requirement is created during the design process when some constraints are not satisfied; more precisely, unsatisfied constraints are converted into safety requirements to be taken into account in the design stages. In fact, unsatisfied constraints are specified as design progresses and necessitate design parameters to address them.

Case Study 1: Dual Knob Faucet

The most common injuries from using domestic hot water are skin burns. Accidents mainly affect children and older people because of their limited mobility. Hot water can reach 60°C, exceeding the legal safety temperature of 38°C; above 50°C, hot water causes serious burns.

The technical functional requirements of a dual-knob faucet are:

  • FR1: Control the temperature of water;

  • FR2: Control the flow of water.

Definition of Water Faucet Input Safety Objectives

This requirement is then integrated into the requirements list and is measured as follows:

Functional requirements and measures:

  • FR1: Control the temperature of water; 0 ≤ Tm ≤ 60°C

  • FR2: Control the flow of water; 1.5 ≤ Pm ≤ 4 bars

  • SR3: Maintain a safe temperature; 0 ≤ Tm ≤ 38°C

Tm and Pm are, respectively, the temperature and the pressure of the outlet water. In addition, limited mobility is an ergonomic problem. This risk generates a "safety constraint", "Take into account the mobility of the users", to be considered during the embodiment design stage. Consideration of experience feedback has led to a new FR, noted SR3.
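The measured ranges above lend themselves to a direct acceptance check. A minimal Python sketch (the values and structure are illustrative, not from the original study):

```python
# Ranges from the requirements list above (Tm in degrees C, Pm in bars).
REQUIREMENTS = {
    "FR1": lambda tm, pm: 0 <= tm <= 60,   # control the temperature of water
    "FR2": lambda tm, pm: 1.5 <= pm <= 4,  # control the flow of water
    "SR3": lambda tm, pm: 0 <= tm <= 38,   # maintain a safe temperature
}

def check(tm, pm):
    """Return, for each requirement, whether the measured state satisfies it."""
    return {name: ok(tm, pm) for name, ok in REQUIREMENTS.items()}

print(check(tm=45.0, pm=2.0))
# {'FR1': True, 'FR2': True, 'SR3': False} -> scald risk: SR3 is violated
```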

Definition of the Water Faucet System Safety Requirement

Here, we consider that the conceptual design of the water faucet is validated. The system safety requirement thus results from design parameter analysis. The analysis of the design parameters shows that the outlet water temperature may reach 60°C and thus exceed the legal safety temperature. If the conceptual design is validated, this risk is converted into a system safety requirement at the embodiment design stage.

Physical domain: accident when Tm > 38°C -> burning.

Functional domain (system requirements): SR2.1: Minimize the intervention of the user to assess the outlet water temperature.

The limited consideration of experience at the conceptual design (HPI) stage has led to this new "system FR" in the embodiment design stage (HSI). In addition, considering the interaction between the temperature-limiting device and FR1 (control the temperature), the designer's task is to decouple the functional requirements. This could be accomplished by selecting alternative DPs. In the absence of SR3 (maintain a safe temperature), the solution would be driven to a lower triangular design matrix: it would be better to first control the temperature and then control the flow. The presence of SR3, to limit the temperature, could remove the risk of burning. The presence of a coupling between FR1 and SR3 indicates an expected interaction between the temperature control and the device that maintains a safe temperature.
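In axiomatic-design terms, the coupling discussion above can be checked mechanically on the FR-DP design matrix. A minimal Python sketch (the boolean matrix below is illustrative, not taken from the original analysis):

```python
def classify_design(matrix):
    """Classify a square FR-DP design matrix (True where a DP affects an FR):
    uncoupled (diagonal), decoupled (triangular), or coupled."""
    n = len(matrix)
    off_diag = [(i, j) for i in range(n) for j in range(n) if i != j and matrix[i][j]]
    if not off_diag:
        return "uncoupled"
    if all(i > j for i, j in off_diag):
        return "decoupled"  # solvable if the DPs are fixed in order
    return "coupled"        # FRs interact; iterate or choose alternative DPs

# Rows/columns: FR1 control temperature, FR2 control flow, SR3 safe temperature.
# True at [0][2] and [2][0] models the interaction between temperature control
# and the temperature-limiting device discussed above.
matrix = [[True,  False, True],
          [False, True,  False],
          [True,  False, True]]
print(classify_design(matrix))  # coupled -> decouple by selecting alternative DPs
```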

Case Study 2: Alpine Ski Bindings As Special Human-Machine Interfaces

Mechanical human-machine interfaces, such as alpine ski bindings, hand power tools, or vehicle steering columns, need to transmit control loads from the user to the machine, while the potential to transmit injurious loads to the user should be avoided. The top FR is to transmit control loads, and the top SR is to filter injurious loads. Steering columns filter injurious loads by collapsing under impact in a collision; the collision and normal driving loads are different enough that there is no mistaking one for the other, and there is no inadvertent collapsing of steering columns. It is known from experience, however, that ski bindings suffer from inadvertent release, i.e., mistaking non-injurious loads for injurious loads. In the conceptual design stage (HPI), in the context of "Ecosystem Requirements", one or two safety requirements can be defined.

  • SR1 is “to avoid transmission of injurious loads”.

This is common to all such mechanical interfaces. In the case where SR1 is satisfied by a release system, whereby control might be lost (such as a conventional releasable ski binding, explosive bolts, or an ejection seat), then

  • SR2 would be “to avoid inadvertent release”

At this point, the design is similar in some ways to the previous case study on the faucet. It is necessary to separate FR1 and FR2. At the HSI stage, two sub-systems could be envisioned based on the magnitude of the loads, provided that there is a clear difference between the control and injurious loads. Experience shows, however, that high loads, even potentially injurious loads, can be sustained without injury for short durations. If the binding releases in these situations, then loss of control and serious injury from collisions can result. In the HSI stage, this calls for a method to systematically discriminate between actual injurious situations and non-injurious, high-level, short-duration load spikes.

Two system-level approaches have been developed to avoid inadvertent release. One is impulse-based and has been implemented at the detailed level electrically: it tests that the load is of sufficient duration to approach injury potential before release. The other is work-based and has been implemented at the detailed stage using preloaded springs, which transmit control loads below the preload without significant displacement until the preload has been exceeded. It ensures that work is done on the mechanism at sub-injurious loads, absorbing energy that would otherwise have caused injury or release. The preloaded spring mechanism filters injurious loads while faithfully transmitting control loads and absorbs energy that could cause inadvertent release.

Conclusion

This CRC Press News deals with a method for the definition of safety objectives early in the design process. Based on Industrial Design Engineering: Inventive Problem Solving, the Industrial Risk Engineering Design (IRED) method gives the typology of safety objectives at each stage of design. When the design process is started, safety objectives are contextual requirements and ergonomics constraints. It has been shown that safety requirements generated during design are functional requirements; these requirements are the specification of safety constraints initially defined in design. Like technical objectives, safety objectives consist of input and system objectives and are described in terms of functional requirements and constraints. The application of the method to the water faucet and the ski bindings has been presented.

What Risks might be involved on the road of being Hamlet’s ally?

By: John X. Wang
Subjects: Engineering - General, Engineering - Industrial & Manufacturing

It is hard to imagine a world without Shakespeare. In Shakespeare's "The Tragedy of Hamlet, Prince of Denmark", the action we expect to see from Hamlet himself is continually postponed. Hamlet tries to obtain more certain knowledge about what he is doing, although absolute certainty is impossible. Hamlet, a tragic hero, demonstrates his tragic flaw in his indecision. Like an engineering project, literature often confronts contradiction, uncertainty, and risk. Let's evaluate the risks involved on the road of being Hamlet's ally using the Automotive Safety Integrity Level (ASIL).

Automotive Safety Integrity Level (ASIL)

According to motor vehicle safety data from the BTS (Bureau of Transportation Statistics), more than 6 million crashes involving motor vehicles are reported every year on average. Automotive Safety Integrity Level (ASIL) is a risk classification scheme defined by ISO 26262, Functional Safety for Road Vehicles; it is an adaptation of the Safety Integrity Level used in IEC 61508 for the automotive industry. This classification helps define the safety requirements necessary to be in line with the ISO 26262 standard. The ASIL is established by performing a risk analysis of a potential hazard, looking at the Severity, Exposure, and Controllability of the vehicle operating scenario. The safety goal for that hazard in turn carries the ASIL requirements.

Given a malfunction of a defined function at the vehicle level (e.g., an anti-lock braking system), a hazard and risk analysis follows to determine the risk of harm/injury to people and of damage to property. There are four ASILs identified by the standard: ASIL A, ASIL B, ASIL C, ASIL D. ASIL D dictates the highest integrity requirements on the product and ASIL A the lowest. Hazards that are identified as QM do not dictate any safety requirements.

  • Systems like airbags, anti-lock brakes, and power steering require an ASIL-D grade, the highest rigor applied to safety assurance, because the risks associated with their failure are the highest.

  • Headlights and brake lights generally would be ASIL-B.

  • Cruise control would generally be ASIL-C.

  • Components like rear lights require only an ASIL-A grade.

For ISO 26262, the risk analysis is based on the exposure, severity, and controllability of the hazard and the resulting risk, and it determines the ASIL, i.e., the level of risk reduction needed to achieve a tolerable risk. For example, consider a windshield wiper system: the safety analysis will determine the effects that loss of the wiper function can have on the visibility of the driver. The ASIL gives guidance for choosing adequate methods for reaching a certain level of integrity of the product. This guidance is meant to complement current safety practices: current automobiles are manufactured to a high safety level, and ISO 26262 is meant to standardize certain practices throughout the industry.
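For readers who want to see how Severity, Exposure, and Controllability combine, here is a small Python sketch using the additive shortcut that reproduces the ISO 26262-3 risk graph (treat it as an illustration and confirm classifications against the standard itself):

```python
def asil(s, e, c):
    """Map Severity (S1-S3), Exposure (E1-E4), and Controllability (C1-C3)
    class indices to an ASIL. Summing the indices reproduces the standard's
    table: 10 -> D, 9 -> C, 8 -> B, 7 -> A, anything lower -> QM."""
    assert 1 <= s <= 3 and 1 <= e <= 4 and 1 <= c <= 3
    return {10: "ASIL D", 9: "ASIL C", 8: "ASIL B", 7: "ASIL A"}.get(s + e + c, "QM")

print(asil(3, 4, 3))  # worst case on all three axes -> ASIL D (airbag-like)
print(asil(3, 4, 2))  # slightly more controllable -> ASIL C
print(asil(1, 2, 2))  # low severity, rare, controllable -> QM
```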

The Plot of Hamlet

The Tragedy of Hamlet, Prince of Denmark, often shortened to Hamlet, is a tragedy written by William Shakespeare sometime between 1599 and 1602. Set in Denmark, the play depicts Prince Hamlet and his revenge against his uncle, Claudius, who has murdered Hamlet's father in order to seize his throne and marry Hamlet's mother.

Events before the start of Hamlet set the stage for tragedy. When the king of Denmark, Prince Hamlet's father, suddenly dies, Hamlet's mother, Gertrude, marries his uncle Claudius, who becomes the new king.

A spirit who claims to be the ghost of Hamlet's father describes his murder at the hands of Claudius and demands that Hamlet avenge the killing. When the councilor Polonius learns from his daughter, Ophelia, that Hamlet has visited her in an apparently distracted state, Polonius attributes the prince's condition to lovesickness, and he sets a trap for Hamlet using Ophelia as bait.

To confirm Claudius's guilt, Hamlet arranges for a play that mimics the murder; Claudius's reaction is that of a guilty man. Hamlet, now free to act, mistakenly kills Polonius, thinking he is Claudius. Claudius sends Hamlet away as part of a deadly plot.

After Polonius's death, Ophelia goes mad and later drowns. Hamlet, who has returned safely to confront the king, agrees to a fencing match with Ophelia’s brother, Laertes, who secretly poisons his own rapier. At the match, Claudius prepares poisoned wine for Hamlet, which Gertrude unknowingly drinks; as she dies, she accuses Claudius, whom Hamlet kills. Then first Laertes and then Hamlet die, both victims of Laertes's rapier.

What ASIL might be involved on the road of being Hamlet’s ally?

Hamlet is undoubtedly an engaging and fascinating person. He is a witty, highly intelligent young man with an offbeat sense of humor. He's definitely someone you could imagine hanging out with. But as Hamlet is such a complicated character, there are downsides to getting too close to him. For one thing, he appears to be mad, or at the very least, pretending to be. This means that you're never quite sure where you stand with...

To answer the question:

  • What ASIL might be involved on the road of being Hamlet’s ally?

We need to look at the Severity, Exposure, and Controllability, the three elements of risk on the road of being Hamlet's ally. Hamlet is positive that Claudius killed his father and that his mother is somehow complicit. He will risk doing the 'wrong' thing if it means easing his grief and psychological suffering. He will not rest until he proves this and avenges his father's murder. However, Claudius is the king, and he is very powerful. How would you answer the question:

  • What are the Severity, Exposure and Controllability, three elements of risk on the road of being Hamlet's ally?

Risk assessment is the focus of “Industrial Design Engineering: Inventive Problem Solving”.

A book dedicated “To the beautiful Sonny Wang Kindergarten where engineering dreams start”

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Dr. John X. Wang's latest book Industrial Design Engineering: Inventive Problem Solving was featured as ISE Magazine May 2017 Book of the Month. The book was dedicated "To the beautiful Sonny Wang Kindergarten where engineering dreams start".

Based on the book's Section "7.7.1 Kindergarten classrooms: Where engineering dreams start"

Poetic thinking is about finding inspiration from different facets of life in aid of the production of an industrial engineering design, as well as helping the functional design achieve an emotional impact as an engineer's best inspiration. Let's start with the story about the Mother's Best Flower. Lisa always expects flowers from our kids, especially on Mother's Day. However, I believe the best flower for any mother is her kids living an impactful life that inspires others to live with more passion, more love, and renewed vision.

My inspiration starts from my son Sonny's story, beginning on last year's Mother's Day. My wife Lisa checked the door of our home many times, hoping the flowers from Sonny would magically appear by the door. Unfortunately, both Lisa and I knew we could not receive flowers, because Sonny had died tragically in a fire accident one and a half months before. However, it was also on that day we received the good news of his church missionary work. That good news reminded me of a childhood story about the best flowers for a mom.

Many years ago, a young man was going away for his mission, dangerous and noble. Before leaving his mom, he said, “Mom, I will send my horse with the best flowers to you if I cannot return to you, to my sweet home.”

The mom sealed his mouth with her hand, and said, “No, son. If I can hear the good news of your accomplished mission, if the horse can carry the good news back to our sweet home, that is the best flowers for me.”

Magically, we have always been receiving good news of Sonny's Mission on Mother's Day:

  • Three years ago, we received the news that Sonny accomplished his trip to Wisconsin as a Church’s youth director.

  • Two years ago, we received the news that Sonny accomplished his trip to Nebraska to help the Native American community there.

  • Last year, after Sonny passed away, we received the news about Sonny Wang’s Kindergarten from the Mission Starfish, a church in Haiti.

It was on Mother’s Day last year, on the bank of the beautiful Grand River, Lisa and I read a letter about Mission Starfish. With the help of money from Sonny’s life insurance policy the Mission was able to build the Sonny Wang Kindergarten, established in Haiti to honor our Sonny who extended his love far beyond his church and community. As shown by the attached photo (Courtesy of Mission Starfish Haiti), in Haiti, the beautiful Sonny Wang Kindergarten has classroom space for 135 kids and worship space for over 100 people!

According to research, engineering dreams start in kindergarten classrooms. Concepts about the world, the beginning of engineering, begin at birth. Young children, particularly kindergarten-aged children, have inquiring minds and are natural engineers. Kids enter kindergarten classrooms with curiosity and the ability to explore, and they want to experiment and touch. These make them enthusiastic about learning about our world. They wonder about

  • How things work,

  • Why things change, and

  • What happens if ….

According to research at the University of Maryland, where Sonny attended the Center for Young Children (CYC) many years ago, learning about engineering builds on this period in kindergarten children's development. Engineering offers children the opportunity to do what comes naturally:

  • Observe,

  • Ask questions (what, how, and 5 whys …),

  • Manipulate objects,

  • Communicate their thinking through actions, words, drawings, or constructions, and

  • Build things together (group technology).

It's in kindergarten classrooms that kids get the early idea: engineering is a way of doing. Engineering is solving problems, using a variety of materials, designing and creating, and building things that work.

On the bank of the river, I recalled it was here that Sonny and I talked about:

  • Never too old to learn, and

  • Never too young to learn.

Sonny, were you talking about future kindergarten classrooms, where many engineering dreams would start?

The book’s Section “7.7.2 The starfish and continuous improvement: Every action, no matter how small, can make a difference”

Today the Sonny Wang Kindergarten building comes to our world as Creole-based digital tools are entering classrooms to improve science, technology, engineering, and mathematics (STEM) education in Haiti. Historically, Haitian children have been educated exclusively in French, a language in which most of the population is not fluent. Using Creole for Haitian education will give Haitian children quality access to STEM education.

While Haitian children feel most comfortable with Creole, the use of French in Haiti’s classrooms has been a national education policy. School exams as well as national assessment tests are mostly conducted in French, rather than Creole. STEM course materials have been available exclusively in French, too. In Haiti’s classrooms, most children do not like to ask or answer questions because they are constantly struggling to translate from Creole into French or from French into Creole.

The use of French creates problems for teachers as well. Haiti's teachers prefer to teach in Creole because that is also the language with which they feel most comfortable. They like to make jokes when they teach; that humor is essential for good teaching: it wakes the students up, keeps them alert, and makes them feel relaxed.

Now the work of pro-Creole educators both in Haiti and in the Haitian Diaspora starts to show the key benefits of a Creole-based education at all levels of the education system. Earlier this year, Haiti adopted a new educational policy that will allow students to be educated in Creole, which is as capable of conveying complicated intellectual concepts as any other Indo-European tongue.

Creole-based digital tools meet crucial needs in Haiti by introducing modern techniques for interactive pedagogy while helping to develop digital resources in Creole. Digital tools including STAR, Mathlets, and PhET have been translated into Creole and provide proof of concept of Creole as a necessary ingredient for active learning in Haiti.

The initiative of using Creole-based digital tools will have a profound impact on the way people think about teaching STEM in mother tongues, and serve as a very important model for similar initiatives around the globe. Across large swaths of Africa and the Americas, indigenous languages continue to face systematic marginalization. This new initiative provides a guide for these populations to empower their children with engineering tools to mitigate risk and uncertainty in STEM education.

On the bank of the beautiful Grand River, Lisa and I read about Sonny Wang’s kindergarten classrooms, where beautiful engineering dreams will start. Treating dreaming as a creative tool, a catalyst for productivity and problem solving, the new kindergarten will show how the free flow of thoughts works as a design method, and how daring to dream leads to final creative output.

For industrial design engineering, a continual improvement process is an ongoing effort to improve products, services, or processes. Sonny lived an impactful life with a renewed vision reflected by the Mission Starfish:

  • Every action, no matter how small, can make a difference.

Here is the Mission Starfish Story as told to me:

A young boy walked along the beach and found thousands of starfish washed up because of a terrible storm. As he came to each starfish, he would pick it up and throw it back into the ocean. People watched him with amusement and said, “Boy, why are you doing this? You can’t save all these starfish. There are too many. You can’t really make a difference!” But the boy bent down, picked up another starfish, threw it as far as he could into the ocean, and said, “Well, I made a difference to that starfish.”

In conclusion, let me issue a call to action: for each of you, you never know what day will be your last, so live an impactful life that will send the good news of your mission to your mom. That is the best flower you can send to your mom, just as Sonny sent the best flower to his mom Lisa.

Back to the Future Tomorrow: Integrated Modular Automobile Electronics (IMAE) Will Look Like Integrated Modular Avionics

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Ah, the dream of the self-driving car! Here we see a vintage 1957 H. Miller advertisement illustration of a family enjoying a ride in their autonomous auto. In response to customer demands for more and more applications, features and services, and to economic pressures, automakers are consolidating non-safety related and safety-related components on a single platform in their vehicles.

The avionics ARINC 653 and the automotive AUTOSAR OS are the two most broadly adopted industry standards for Real-Time Operating Systems (RTOS). They stem from domains with originally very different requirements: while avionics engineers have to cope with strict regulations regarding functional safety, automotive engineers generally care a lot about per-unit costs in mass production. However, recent trends in automotive, like steer-by-wire and ECU consolidation, are increasing the demand for functional safety measures in this domain as well.

The market opportunity for driver-assisted, driver-piloted, and fully driverless cars is projected to reach $42 billion by 2025. As a result, advanced “IMA-like” automotive solutions are being explored, focused on developing and maturing next-generation transformational IMA technologies. Five future IMA dual-use focus areas that have the potential to be transformational enablers for affordability, technology refresh, and new capabilities in both the automotive and avionics IMA markets are:

  • heterogeneous multicore processing,

  • scalable autonomy and data fusion software components,

  • hypervisor-enabled mixed-criticality software infrastructure,

  • unified QoS networking, and

  • Model-Based Design (MBD).

The recent exponential increase in the number and complexity of in-vehicle electronics has transformed the automobile. What was once primarily an assembly of mechanical components has become a system integrating mechanical and electronic components, with the electronic components representing both a substantial portion of the added value and a disproportionate share of the headaches. With a century of experience behind them, automakers have mastered the building of the mechanical part of the car, down to the constant improvement and refinement of details.

The in-vehicle electronics, that is, the head unit with its complex infotainment system and the dozens of Electronic Control Units (ECUs), are another matter, however. Not only are these systems evolving rapidly, but consumer demand for new applications and services is straining automakers’ ability to deliver.

Thus, automakers must continuously venture into uncharted territory as they seek to satisfy consumer demand while building better cars. Of course, car makers strive to provide all these new features without breaking the bank. Along with the availability of new generations of more powerful processors, this last requirement, minimizing the cost passed on to customers, is driving consolidation of multiple in-vehicle systems onto one board. A design that eliminates one $50 module per vehicle translates into a substantial sum when multiplied by 5 million vehicles: $250 million in savings.

This consolidation creates its own problems, however. Not least of these is that many in-vehicle systems are safety-related, while others are consumer applications that are impossible to prove safe. All these disparate systems may need to run on the same CPU. To this we can add that, with the advent of the connected car, any system in a vehicle will be potentially accessible from outside the vehicle; while this will open many new possibilities, such as M2M-enabled over-the-air (OTA) software and firmware updates, it also creates new security and safety vulnerabilities.

Connected, but isolated

Paradoxically, it is the very interconnections of today’s electronics systems that require us to ensure the isolation of their components. More powerful processors allow us to combine safety-related components with an in-vehicle infotainment system on the same board. Some of these components are in turn connected to the cloud, and even periodically subject to M2M-enabled updates. These advancements make today’s systems vulnerable to interference from within in the form of wayward processes, and from without, in the form of malicious attacks against inadequately protected software. Isolating safety-related components and ensuring detection and recovery are both more difficult and more critical than they ever were.

The problem, then, is how to design and validate a system that incorporates components such as a sophisticated 3D display running consumer-grade applications that is unlikely to require safety certifications, and components such as a blind spot detection module whose dependability and freedom from undesired interference must be rigorously engineered and proven.

The automotive industry’s transition from an ownership model to a usage-based model—driven by the “sharing economy,” autonomous driving, and mobility-as-a-service platforms—is forcing the automotive manufacturing industry to make extensive changes in automobile architectures and the automotive business supply chain. Future automobiles will have electronics systems that closely resemble commercial aircraft systems, which optimize safety, security, reliability, and availability. These next-generation automotive systems will virtualize and centralize many electro-mechanical control systems that are currently distributed throughout today’s vehicles. In addition, these systems will be based on open standards that are derived from a wide range of transportation industry solutions.

Aircraft Lead the Way into the Future

These technology trends started over 30 years ago, when Boeing was designing the Boeing 777 wide-body aircraft. Reducing Size, Weight, and Power (SWaP) requirements is a guiding principle in new aircraft designs, where reducing the size, lowering the weight, and reducing the power consumption of components are essential design goals.

Prior to the Boeing 777, aircraft systems traditionally had a federated architecture, where every supplier provided its own electronics enclosure with unique power and connectivity requirements, and unique reliability and redundancy capabilities. All the disparate electronic subsystems were then interconnected into a “federated” environment on the aircraft. This earlier model is similar to the electronics architecture in automobiles today. Although federated avionics systems were easy to manufacture and service, outfitting an increasingly more capable aircraft with ever more federated systems continued to increase SWaP requirements, reducing efficiency with every new capability. This architecture was tolerable in an era of very inexpensive fuel, but as fuel costs rose, it was clear that the approach needed to change.

Boeing contracted Honeywell to create a common avionics cabinet, the Airplane Information Management System (AIMS), for the Boeing 777. This electronics cabinet hosted multiple functions on the 777 that typically had been federated systems, altering the architecture of the airborne electronics for the 777. Honeywell supplied most of the software in this common computing environment for the Boeing 777, and the resulting specification for this environment was the ARINC 653 time and space separation application executive (APEX). ARINC 653 is now the standard for this aircraft systems architecture, called Integrated Modular Avionics (IMA).

The next major commercial design innovation for Boeing was the Boeing 787 Dreamliner. The Boeing 787’s Common Core System (CCS) was based on the ARINC 653 specification, which enabled a diverse ecosystem of hosted-function suppliers to deliver software that executed on this common virtualization platform. This new platform changed the business environment for Boeing suppliers, and a new role-based business standard, RTCA DO-297, emerged that defined business roles—IMA platform suppliers, IMA applications suppliers, and IMA systems integrators—as well as processes and workflow for those roles. This IMA role-based supplier business methodology evolved into the current global standard for the development of all large commercial aircraft today.

Standards Light the Way for Technological Advancement and Balanced Competition

Over the last three decades, the aviation industry has developed multiple standards that have improved the technology used in modern airframes as well as the business efficiency of the supply chain. Use of standards will provide similar technological advancement and business efficiencies in the automotive industry as well.

Below are some example standards that have emerged in the aviation industry. Some of these aviation standards may directly apply or influence similar standards in the automotive industry.

ARINC 653 and DO-297 in Commercial Aviation

The ARINC 653 technical standard and the DO-297 role-based business standard provide fundamental benefits and efficiencies for modern commercial aircraft:

  • The common compute environment enables a level playing field for the entire electronic systems supply chain and for competitors.

  • The SWaP footprint is dramatically reduced: Boeing stated that by using the IMA approach it was able to shave 2,000 pounds off the avionics suite.

  • The complexity of federated compute/function boxes distributed all over the aircraft is removed.

  • System availability and reliability increase due to the redundancy of the Boeing 787 CCS.

In addition, ARINC 653 time and space separation architectures enable applications with different levels of safety criticality to share the common compute platform, thereby optimizing the use of computer resources and allowing the Human–Machine Interface (HMI), flight controls, and aircraft systems to safely share the common compute resource.

Finally, ARINC 653 was designed for safety certification, which decreased aircraft programs’ certification risk. This technical and business foundation enabled the creation of a competitive aircraft supply chain, where most new capabilities and upgrades have multiple suppliers bidding on the new opportunities.

A time- and space-partitioning standard for automotive electronics that enables applications with different safety levels to share a common compute platform could pave the way toward easier development of automotive electronics systems, with reduced SWaP requirements to improve automobile fuel efficiencies. This partitioned architecture could also provide for greater safety and security attributes.
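
To make the partitioning idea concrete, here is a minimal sketch in Python (illustrative only: a real ARINC 653 executive is a certified RTOS component, and the partition names and time budgets below are hypothetical). An executive cycles through a fixed major frame, so each partition gets its guaranteed window no matter how the others behave.

    import time

    # Hypothetical partitions and fixed time budgets (seconds) within one
    # major frame, in the spirit of an ARINC 653 schedule.
    MAJOR_FRAME = [
        ("flight_controls", 0.020),   # safety-critical partition
        ("displays",        0.015),
        ("infotainment",    0.015),   # non-critical partition
    ]

    def run_partition(name, budget):
        """Run one partition until its time window expires."""
        deadline = time.monotonic() + budget
        while time.monotonic() < deadline:
            pass  # the partition's applications would execute here

    def executive(frames=3):
        """Cycle through the major frame; a misbehaving partition cannot
        overrun its window, so it cannot starve the others."""
        for _ in range(frames):
            for name, budget in MAJOR_FRAME:
                run_partition(name, budget)
                print(f"{name}: {budget * 1000:.0f} ms window elapsed")

    executive()

Space partitioning, the other half of the standard, is enforced in a real system by giving each partition its own protected memory region; that part is not modeled here.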

The FACE Approach and SAE AS6518 in Manned and Unmanned Military Avionics

The U.S. military also had a federated-systems problem, which exacerbated a slow evolution of capabilities and in turn denied military personnel the modern capabilities required to fight in modern, sophisticated conflict scenarios. With this in mind, six years ago the U.S. Army, U.S. Navy, and U.S. Air Force, along with nearly all of the U.S. defense suppliers, created a managed consortium to develop a new technical and business standard called the Future Airborne Capability Environment (FACE™), managed by The Open Group®.

The FACE Consortium based its technology foundation on ARINC 653, but also extended its specification to include the POSIX® (UNIX) standard, so that many mission systems and other, typically non-safety-certified applications could also be included in this specification. The FACE technical standard uses a layered software architecture, where application, communications/transport, I/O, and platform-specific capabilities are layered on top of the operating system standard, enabling any FACE component at any layer to be easily assembled with a mix of other FACE components.

The FACE software layer architecture leverages both commercial and military standards, and the technical specification includes over 100 proven industry standards. These teams also created a common data model, flexible enough to include all modern military aircraft systems, and aligned this model with the Department of Defense (DoD) Unmanned Aircraft System (UAS) Control Segment (UCS) Architecture data model, which is now an SAE standard (SAE AS6518). This standardization has created a very powerful foundation for all future manned and unmanned aircraft for the U.S. military and coalition partners.

Finally, the FACE business team modeled the FACE business specification after RTCA DO-297, but expanded this with a Contract Guide and a FACE conformance certification process that can validate FACE components, which can be referenced in a FACE Web-based applications store. These FACE certification activities and Web stores are just now coming online.

Similar to the FACE initiative, an industry-wide consortium of automotive companies could be formed to drive the definition and adoption of a standard specification for a layered software architecture for consolidated automotive electronics on common, shared platforms. This specification could streamline the innovation, development, deployment, and maintenance of next-generation automotive electronics.

OpenGL and OpenGL SC in Both Aircraft and Automotive HMIs

The complex graphics systems found in today’s aircraft have standardized on the Khronos Group’s OpenGL and OpenGL SC (Safety-Critical) specifications. This standardization has enabled the market to create safety-critical cockpit graphics, as well as a supply chain for OpenGL graphics drivers and tools that have full commercial off-the-shelf (COTS) RTCA DO-178C safety certification evidence for a variety of graphics devices. Companies such as ANSYS, CoreAVI, Ensco, and Presagis have sophisticated design tools, test tools, and simulation tools that provide a clear path to DO-178C and DO-254 safety certification with their COTS product lines.

Visualization of the state of a vehicle, along with its real-time IoT sensor environment, will enable the driver or operator of a car to deliver the most efficient, safe, and secure experience to future automotive users. Both aircraft and next-generation automotive dashboards are already sharing common OpenGL design tools and safety-certified platform components.

AUTOSAR in Electronic Automotive Systems

Over the last decade, the automotive industry has built the Automotive Open System Architecture (AUTOSAR) set of standards that specifies basic software modules, application interfaces, and a common development methodology based on a standardized exchange format. The AUTOSAR layered software architecture is designed to enable electronic components from different suppliers to be used in multiple platforms (vehicles), enabling the move to all-electric cars and vehicles with higher software content that can be more easily upgraded over the service life of the vehicle. AUTOSAR aims to improve cost-efficiency without compromising quality or safety.

Even with AUTOSAR and other standards from SAE and other organizations, the overall technical and business model for the vast majority of today’s automobiles is still a federated environment in which suppliers define the requirements for their systems, and the Original Equipment Manufacturer (OEM) or systems integrator designs automobiles within these constraints. This technical and business architecture is the reason that many cars today have scores of processors distributed throughout the vehicle, increasing the complexity of wiring harnesses and other support systems.

This federated architecture was designed to optimize supplier integration; it was not designed to increase safety or meet stringent safety and security certification requirements, nor was it designed to directly reduce vehicle complexity or SWaP requirements. The automotive industry could well benefit from an expansion of the AUTOSAR standard to an ARINC 653–like specification for Integrated Modular Automobile Electronics (IMAE) that virtualizes many of the current federated systems. This specification could help reduce SWaP requirements, reduce development and testing costs, and improve the efficiencies of the industry’s supply chain. And like the partitioning in ARINC 653, this module separation holds the promise to create safer and more secure automotive platforms.

Summary: Standards Light the Way

Although the technologies mentioned in this CRC Press News contain some of the world’s most valuable and competitive intellectual property, they are based on open standards—the ARINC, AUTOSAR, The Open Group, RTCA, and SAE standards mentioned in this CRC Press News are all readily available to any interested party. Some of these technologies may be open source, many are proprietary, but all of these products rely on the market surrounding an open standard. The use of open standards drives innovation and reduces business friction.

Aircraft Are Leading IoT Data Generation Platforms

Aircraft cockpits are not the only consumers of aircraft flight data. Today’s aircraft are complex IoT platforms that generate terabytes of data per aircraft per day—data used not only in the cockpit by pilots but also by airline supply chains to optimize performance, safety, and operations. Boeing 787s interconnect nearly every system in the airplane, from the engines to the flaps to the landing gear. The Pratt & Whitney Geared Turbofan jet engines have over 5,000 sensors that can generate 10 GB of data per second per engine (about 80 Gb/s each, or roughly 160 Gb/s for a typical twin-engine airliner such as the Airbus A320neo or Boeing 737 MAX). Pratt & Whitney estimates that with these systems generating data, their needs for data streaming will reach 12 PB each year. (By comparison, an instrumented Formula 1 car produces around 1.2 GB/s.)

Why this high level of IoT data capture? This real-time IoT intelligence can reveal trend and fault patterns on specific aircraft systems, classes of aircraft systems, or entire fleets of aircraft that can drive immediate actions for optimizing flight and ground operations, or queue up parts and Maintenance and Repair Operations (MRO) teams that can drive higher levels of aircraft availability. This increases aircraft efficiency and serviceability, reduces operational interruptions, and has the potential to reduce major performance and maintenance issues. Everyone wins:

  • airlines obtain greater operational performance with higher fuel efficiency,

  • aircraft systems manufacturers gain valuable insight for creating even more powerful, reliable, safe, and efficient systems for future insertion, and

  • passengers enjoy greater comfort, safety, and entertainment while aboard these modern aircraft.

The collection and integration of this IoT data is fully autonomous, driving higher IoT sampling rates for the entire industry. Analysis of this data is a service that is delivered as a scaled business model, from very low-cost to more comprehensive levels that drive fleet specific intelligence, optimization, and management.

The investment in technology for commercial aircraft is shaped by four key drivers:

  • Safety: All commercial aircraft accidents have a global audience. Higher rates of safety directly influence higher rates of passenger airline traffic.

  • Efficiency: Minimizing operations expenses, with fuel being one of the highest expenses, defines profitability.

  • Multi-vendor supply chain: Aircraft manufacturers need a strong, consistent ecosystem of suppliers to keep technology innovation high to remain both competitive and efficient.

  • MRO and aircraft availability: Canceled flights due to equipment issues have a large financial impact. Planes are measured by flight time and percent utilization.

Future automobiles will need a similar set of IoT data and intelligence to survive the far higher demands of mobility-as-a-service. This will increase by a factor of ten for autonomous platforms, where IoT sensing will not only enable safer, faster, and more reliable travel, but also help protect against accidents and unexpected liabilities. In addition, this IoT intelligence will allow automotive fleets to enjoy airline-like MRO benefits such as improved fuel efficiency, and to enable an accurate predictive maintenance service environment.

How Can the Automotive Industry Leverage These Lessons?

How people use automobiles is changing at a rapid pace. Personal automobiles have defined affluence for over a century. Commercial vehicles have defined success in a wide range of businesses, from long haul to delivery services (such as DHL, FedEx, and UPS) to bus lines to taxis. But the rise of autonomous vehicles, along with new sharing economy businesses, is changing the character and responsibilities of autonomous ground and air vehicles. A large proportion of future vehicles will not be considered “personal” property. Many transportation systems will be autonomous and rented “for the moment” of usage, instead of by the day, week, or month of possession. The responsibility for safety and security will be transferred from the human driver to the vehicle manufacturer and the service operator; this will force automobile designers and manufacturers to build vehicles of the future with many of the strict safety and security design constraints that aircraft manufacturers have used for decades, allowing the future service operators to deploy these vehicles into new markets with confidence.

Back to the Future Tomorrow

If we were living in the 1980s, this business transformation would be daunting and fraught with peril. But in today’s world, the requisite business and technology framework already exists, has been proven with an incrementally improving safety record, and is captured in a wide range of comprehensive open standards. These standards will underpin the convergence of manned and unmanned, airborne and ground systems vehicles over the next decade, and deliver new levels of safety, security, reliability, and serviceability for a new era of vehicles to serve the expanding uses of modern travel.

The automobile industry is moving in a direction that closely reflects the requirements of the commercial airline industry. Although improvements in vehicle quality, Maintenance, Repair and Operations (MRO) and fuel efficiency have had a relatively modest impact on automotive sales over the last 30 years, the impact of the sharing economy and 24/7 autonomous, commercial use of automobiles and transportation services will be far greater. As autonomous vehicles become prevalent, and as vehicles increasingly deliver mobility as a service, safety and security concerns will escalate in importance; percent utilization will also become a concern and a key factor in value and in related quality metrics and operating expenses. These increases in operational quality must be based on open standards and COTS tooling to maintain an escalating cadence for innovation and reliability. With shared personal mobility companies such as Uber, Lyft, and others, the architecture and design of cars will migrate in the direction of aircraft and airlines, with similar supply chain concerns and operations metrics.

Future electronics systems in automobiles will look a lot like the airborne electronics in commercial aircraft. The automotive platform changes are being driven by three significant trends:

  • Consumer value is derived from many cross-functional use cases, such as Advanced Driver Assist Systems (ADAS).

  • Vehicles are connecting with expanding Internet of Things (IoT) environments.

  • The automotive economy is advancing into a digitized, usage based model, with commercial enterprises such as Uber, Lyft, Carma, Didi Chuxing, Enterprise CarShare, Getaround, Maven, Turo, and Zipcar leading this global expansion.

This paradigm shift places a greater emphasis on safety, security, and reliability, and brings the automotive industry closer to the commercial aviation industry. This new automotive business model is based on the transportation of goods and passengers with vehicles that are primarily owned and operated by commercial companies rather than private individuals. Operating expense (OPEX) savings will motivate the entire industry to create new architectures that ensure safety, security, and platform consolidation efficiencies.

A microkernel OS may be able to provide a full and rich set of OS features to support consumer demands while at the same time ensuring that the system meets its safety requirements. The trusted code in a microkernel OS is simple and small, with a well-tested and short execution path that is granted system-level privileges. In short, a microkernel OS is inherently appropriate for safety-related systems.
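
To illustrate the isolation argument, here is a minimal sketch in Python (the process names and messages are hypothetical) of the message-passing style that a microkernel OS encourages: the safety-related component runs in its own address space and interacts with the consumer-grade side only through a narrow, inspectable channel.

    from multiprocessing import Process, Queue

    def blind_spot_monitor(out_q):
        """Safety-related service in its own process (address space);
        it only ever emits small, well-defined messages."""
        for distance_m in (12.0, 6.5, 1.8):   # simulated sensor readings
            if distance_m < 2.0:
                out_q.put({"event": "BLIND_SPOT_WARNING", "distance_m": distance_m})
        out_q.put({"event": "SHUTDOWN"})

    def infotainment(in_q):
        """Consumer-grade component: a crash here cannot corrupt the
        monitor's memory, because the two share nothing but the queue."""
        while True:
            msg = in_q.get()
            if msg["event"] == "SHUTDOWN":
                break
            print(f"Display alert: {msg}")

    if __name__ == "__main__":
        q = Queue()
        procs = [Process(target=blind_spot_monitor, args=(q,)),
                 Process(target=infotainment, args=(q,))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()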

Autonomous Driving Computing Architecture

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Information Technology, Nanoscience & Technology

Existing computing solutions for Level 4 autonomous driving often consume thousands of Watts, dissipate enormous amounts of heat, and cost tens of thousands of dollars. These power, heat, and cost barriers make autonomous driving technologies difficult to transfer to the general public. With inventive problem solving, an autonomous driving computing architecture and software stack can be developed that is modular, secure, dynamic, high-performance, and energy-efficient. For example, a simulated system on an ARM mobile SoC consumes 11 W on average and is able to drive a mobile vehicle at 5 miles per hour. With more computing resources, the simulated system would be able to process more data and would eventually satisfy the needs of a production-level autonomous driving system.

Computer Architecture Design Exploration

Here, we attempt to develop some initial understandings of the following questions:

  • what computing units are best suited for what kinds of workloads;

  • whether, in the extreme, a mobile processor would be sufficient to perform the tasks in autonomous driving; and

  • how to design an efficient computing platform for autonomous driving.

Robot Operating System (ROS) for autonomous driving computing stack

The aforementioned autonomous driving computing stack is built from ROS nodes and provides several benefits (a minimal node sketch follows the list):

  • modular: more ROS nodes can be added if more functions are required

  • secure: ROS nodes provide a good isolation mechanism to prevent nodes from impacting each other

  • highly dynamic: the run-time layer can schedule tasks for max throughput, lowest latency, or lowest energy consumption

  • high performance: each heterogeneous computing unit is used for the most suitable task to achieve highest performance

  • energy-efficient: can use the most energy-efficient computing unit for each task, for example, a DSP for feature extraction.
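
As a concrete illustration of the modularity benefit, here is a minimal ROS 1 publisher node in Python (the topic name, message type, and rate are hypothetical); adding a new function to the stack amounts to adding another node like this one.

    import rospy
    from std_msgs.msg import String

    def feature_extraction_node():
        # Each autonomous driving task is wrapped in its own ROS node;
        # nodes are isolated processes that communicate over topics.
        rospy.init_node("feature_extraction")
        pub = rospy.Publisher("features", String, queue_size=10)
        rate = rospy.Rate(10)   # hypothetical 10 Hz processing rate
        while not rospy.is_shutdown():
            pub.publish("feature_vector_placeholder")
            rate.sleep()

    if __name__ == "__main__":
        try:
            feature_extraction_node()
        except rospy.ROSInterruptException:
            pass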

Design of Autonomous Driving Computing Platform

The reason why we could deliver high performance on an ARM mobile SoC is that we can utilize the heterogeneous computing resources of the system and use the best suited computing unit for each task so as to achieve best possible performance and energy efficiency. However, there is a downside as well:

  • we could not fit all the tasks into such a system, for example,

    • object tracking,

    • lane change prediction,

    • cross-road traffic prediction, etc.

In addition, the autonomous driving system needs the capability to upload raw sensor data and processed data to the cloud; however, the amount of data is so large that it would consume all of the available network bandwidth.

The aforementioned functions (object tracking, lane change prediction, cross-road traffic prediction, data uploading, etc.) are not needed all the time. For example,

  • the object tracking task is triggered by the object recognition task;

  • the traffic prediction task is, in turn, triggered by the object tracking task; and

  • the data uploading task is not needed all the time either, since uploading data in batches usually improves throughput and reduces bandwidth usage.

If we designed an ASIC chip for each of these tasks, it would be a waste of chip area; thus, an FPGA would be a perfect fit for these tasks. We could have one FPGA chip in the system and have these tasks time-share the FPGA. It has been demonstrated that, using partial reconfiguration techniques, an FPGA soft core can be changed within less than a few milliseconds, making real-time time-sharing possible.
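
A minimal sketch of the time-sharing idea in Python (the reconfiguration latency, task names, and costs are hypothetical): a single FPGA region is reconfigured on demand, and the reconfiguration cost is paid only when the requested task differs from the one currently loaded.

    RECONFIG_MS = 4.0   # assumed partial-reconfiguration latency

    class FPGASlot:
        """One FPGA region time-shared by several accelerator bitstreams."""
        def __init__(self):
            self.loaded = None

        def run(self, task, cost_ms):
            elapsed = 0.0
            if self.loaded != task:
                elapsed += RECONFIG_MS      # swap in the task's bitstream
                self.loaded = task
            elapsed += cost_ms              # execute the accelerator
            print(f"{task}: {elapsed:.1f} ms")
            return elapsed

    slot = FPGASlot()
    # These tasks are triggered sporadically, so one region suffices.
    slot.run("object_tracking", 8.0)       # pays the reconfiguration once
    slot.run("object_tracking", 8.0)       # already loaded: no penalty
    slot.run("traffic_prediction", 12.0)   # swap triggered by the tracker
    slot.run("data_upload", 5.0)           # infrequent, batched task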

With respect to the computing stack for autonomous driving, at the level of the computing platform layer, an SoC architecture consists of

  • an I/O subsystem that interacts with the front-end sensors;

  • a DSP to pre-process the image stream to extract features;

  • a GPU to perform object recognition and some other deep learning tasks;

  • a multi-core CPU for planning, control, and interaction tasks;

  • an FPGA that can be dynamically reconfigured and time-shared for data compression and uploading, object tracking, and traffic prediction, etc.

These computing and I/O components communicate through shared memory.

  • On top of the computing platform layer, we could have a run-time layer to map different workloads to the heterogeneous computing units through OpenCL, and to schedule different tasks at runtime with a run-time execution engine (a dispatch sketch follows this list).

  • On top of the Run-Time Layer, we have an Operating Systems Layer utilizing Robot Operating System (ROS) design principles, which is a distributed system consisting of multiple ROS nodes, each encapsulating a task in autonomous driving.
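
Here is a minimal sketch in Python of what the run-time layer's mapping policy might look like (the task-to-unit affinities follow the DoE results discussed later; all names are illustrative, and a real implementation would enqueue OpenCL kernels on the chosen devices rather than print).

    # Illustrative affinity table: which computing unit suits which task.
    AFFINITY = {
        "feature_extraction": "DSP",    # short vector-processing pipelines
        "optical_flow":       "DSP",
        "object_recognition": "GPU",    # convolution-heavy DNN work
        "object_tracking":    "FPGA",   # sporadic, time-shared task
        "path_planning":      "CPU",    # control-heavy, sequential
        "obstacle_avoidance": "CPU",
    }

    def dispatch(task, payload):
        """Map a workload to its preferred heterogeneous unit,
        falling back to the CPU for unknown tasks."""
        unit = AFFINITY.get(task, "CPU")
        print(f"dispatch {task} -> {unit} ({len(payload)} items)")
        return unit

    for t in ("feature_extraction", "object_recognition", "path_planning"):
        dispatch(t, payload=[0] * 100)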

Autonomous Driving on Mobile Processor

Let’s explore the edges of the envelope and understand how well an autonomous driving system could perform on the aforementioned ARM mobile SoC. A vision-based autonomous driving system can be implemented on this mobile SoC. Here, we can utilize

  • DSP for sensor data processing tasks, such as feature extraction and optical flow, which is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.

  • GPU for deep learning tasks, such as object recognition;

  • two CPU threads for localization tasks to localize the vehicle in real time;

    • one CPU thread for real-time path planning;

    • another CPU thread for obstacle avoidance.

      Note that multiple CPU threads can run on the same CPU core if a CPU core is not fully utilized.

Surprisingly, through virtual testing, it turns out that the performance was quite impressive when simulating this system’s implementation on the ARM mobile SoC with Hardware-in-the-Loop (HIL) testing. The localization pipeline is able to process 25 images per second, almost keeping up with image generation at 30 images per second. The deep learning pipeline is capable of performing 2 to 3 object recognition tasks per second. The planning and control pipeline is designed to plan a path within 6 ms. When running this full system, the SoC consumes 11 W on average. With this system, we would be able to drive the vehicle at around 5 miles per hour without any loss of localization, quite a remarkable feat considering that this ran on a mobile SoC. With more computing resources, the system should be capable of processing more data and allowing the vehicle to move at a higher speed, eventually satisfying the needs of a production-level autonomous driving system.

Matching Workloads to Computing Units

For matching workloads to computing units, we would need to understand which computing units are best fitted to convolution and feature extraction workloads, which are the most computation-intensive workloads in autonomous driving scenarios. We could conduct a Design of Experiments (DoE) on an off-the-shelf ARM mobile SoC consisting of a four-core CPU, a GPU, and a DSP. To study the performance and energy consumption of this heterogeneous platform, we implement the DoE to optimize feature extraction and convolution tasks on the CPU, GPU, and DSP based on measured chip-level energy consumption.

First, we can implement a convolution layer, which is commonly used, and is the most computation-intensive stage, in object recognition and object tracking tasks. Let’s summarize the performance and energy consumption results:

  • when running on the CPU, each convolution takes about 8 ms to complete, consuming 20 mJ;

  • when running on the DSP, each convolution takes 5 ms to complete, consuming 7.5 mJ;

  • when running on a GPU, each convolution takes only 2 ms to complete, consuming only 4.5 mJ.

These results confirm that GPU is the most efficient computing unit for convolution tasks, both in performance and in energy consumption.

Next, we implemented feature extraction, which generates feature points for the localization stage; this is the most computationally expensive task in the localization pipeline. Let’s summarize the performance and energy consumption results:

  • when running on a CPU, each feature extraction task takes about 20 ms to complete, consuming 50 mJ;

  • when running on a GPU, each feature extraction task takes 10 ms to complete, consuming 22.5 mJ;

  • when running on a DSP, each feature extraction task takes only 4 ms to complete, consuming only 6 mJ.

These results confirm that DSP is the most efficient computing unit for feature processing tasks, both in performance and in energy consumption. Note that we did not simulate the implementations of other tasks in autonomous driving, such as localization, planning, obstacle avoidance etc. on GPUs and DSPs as these tasks are control heavy and would not efficiently execute on GPUs and DSPs.

Existing Implementations

In the Computer Architecture Design Exploration section, to understand how chip makers attempt to solve these problems, we looked at the existing autonomous driving computation solutions provided by different chip makers. To understand the main points in autonomous driving computing platforms, let’s look at an existing computation hardware implementation of a Level 4 autonomous car from a leading autonomous driving company.

Existing Processing Solutions

Let’s examine some existing computing solutions targeted for autonomous driving.

GPU-Based Solutions

When performing the DoE-based simulation, the Nvidia PX platform was the leading GPU-based solution for autonomous driving.

  • Each PX 2 consists of two Tegra SoCs and two Pascal graphics processors.

  • Each GPU has its own dedicated memory, as well as specialized instructions for Deep Neural Network acceleration.

To deliver high throughput, each Tegra connects directly to the Pascal GPU using a PCI-E Gen 2 x4 bus (total bandwidth: 4.0 GB/s). In addition, the dual CPU-GPU cluster is connected over Gigabit Ethernet, delivering 70 Gigabits per second. With optimized I/O architecture and DNN acceleration, each PX2 is able to perform 24 trillion deep learning calculations every second. This means that, when running AlexNet deep learning workloads, it is capable of processing 2,800 images/s.

DSP-Based Solutions

Texas Instruments’ TDA provides a DSP-based solution for autonomous driving.

  • A TDA2x SoC consists of two floating point C66x DSP cores and four fully programmable Vision Accelerators, which are designed for vision processing functions.

  • The Vision Accelerators provide eight-fold acceleration on vision tasks compared to an ARM Cortex-A15 CPU, while consuming less power.

  • Similarly, the CEVA-XM4 is another DSP-based autonomous driving computing solution. It is designed for computer vision tasks on video streams. The main benefit of using the CEVA-XM4 is energy efficiency: it requires less than 30 mW to process a 1080p video at 30 frames per second.

FPGA-Based Solutions

Altera’s Cyclone V SoC is one FPGA-based autonomous driving solution which has been used in Audi products.

  • Altera’s FPGAs are optimized for sensor fusion, combining data from multiple sensors in the vehicle for highly reliable object detection.

  • Similarly, Zynq UltraScale MPSoC is also designed for autonomous driving tasks.

When running Convolution Neural Network tasks, it achieves 14 images/sec/Watt, which outperforms the Tesla K40 GPU (4 images/sec/Watt). Also, for object tracking tasks, it reaches 60 fps in a live 1080p video stream.

ASIC-Based Solutions

MobilEye EyeQ5 is a leading ASIC-based solution for autonomous driving. EyeQ5 features heterogeneous, fully programmable accelerators; each of the four accelerator types in the chip is optimized for its own family of algorithms, including computer vision, signal processing, and machine learning tasks.

This diversity of accelerator architectures enables applications to save both computational time and energy by using the most suitable core for every task. To enable system expansion with multiple EyeQ5 devices, EyeQ5 implements two PCI-E ports for inter-processor communication.

Vision-Based Autonomous Driving

LiDAR is capable of producing over a million data points per second with a range up to 200 meters. However, it is very costly (a high-end LiDAR sensor costs tens of thousands of dollars). Thus, let’s explore an affordable yet promising alternative: vision-based autonomous driving.

LiDAR vs. Vision Localization

The localization method in LiDAR-based systems heavily utilizes a particle filter, while vision-based localization utilizes visual odometry techniques. These two different approaches are required to handle the vastly different types of sensor data.

  • The point clouds generated by LiDAR provide a “shape description” of the environment; however, it is hard to differentiate individual points.

    • By using a particle filter, the system compares a specific observed shape against the known map to reduce uncertainty.

  • In contrast, for vision-based localization, the observations are processed through a full pipeline of image processing to extract salient points and the salient points’ descriptions, which is known as feature detection and descriptor generation.

    • This allows us to uniquely identify each point and apply these salient points to directly compute the current position.

Vision-Based Localization Pipeline

In detail, vision-based localization undergoes the following simplified pipeline:

  1. By triangulating stereo image pairs, we first obtain a disparity map, which can be used to derive depth information for each point (a worked example follows this list).

  2. By matching salient features between successive stereo image frames, we can establish correlations between feature points in different frames and then estimate the motion between the past two frames.

  3. By comparing the salient features against those in the known map, we can also derive the current position of the vehicle.
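
To make step 1 concrete: stereo triangulation recovers depth from disparity as Z = f × B / d, where f is the focal length in pixels, B is the stereo baseline, and d is the disparity. A small sketch with hypothetical camera parameters:

    # Depth from disparity: Z = f * B / d
    f_px = 700.0        # focal length in pixels (hypothetical)
    baseline_m = 0.12   # distance between the stereo cameras, meters

    for disparity_px in (40.0, 10.0, 2.0):
        depth_m = f_px * baseline_m / disparity_px
        print(f"disparity {disparity_px:5.1f} px -> depth {depth_m:6.2f} m")
    # 40 px -> 2.10 m; 10 px -> 8.40 m; 2 px -> 42.00 m. Small disparities
    # correspond to distant points, so depth error grows with range.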

Impact on Computing

Compared to a LiDAR-based approach, a vision-based approach introduces several highly parallel data processing stages, including

  • feature extraction,

  • disparity map generation,

  • optical flow,

  • feature match,

  • Gaussian Blur, etc.

These sensor data processing stages heavily utilize vector computations and each task usually has a short processing pipeline, which means that these workloads are best suited for DSPs. In contrast, a LiDAR-based approach heavily utilizes the Iterative Closest Point (ICP) algorithm, which is an iterative process that is hard to parallelize, and thus more efficiently executed on a sequential CPU.

Tasks in Autonomous Driving

Autonomous Driving is a highly complex system that consists of many different tasks. In order to achieve autonomous operation in urban situations with unpredictable traffic, several real-time systems must interoperate, including

  • sensor processing,

  • perception,

  • localization,

  • planning, and

  • control.

Note that existing successful implementations of autonomous driving are mostly LiDAR-based: they rely heavily on LiDAR for

  • mapping,

  • localization, and

  • obstacle avoidance,

while other sensors are used for peripheral functions.

Sensing

Normally, an autonomous vehicle is equipped with several major sensors. Indeed, since each type of sensor presents advantages and drawbacks, the data from multiple sensors must be combined for increased reliability and safety. The sensors can include the following:

GPS and Inertial Measurement Unit (IMU)

The GPS/IMU system helps the autonomous vehicle localize itself by reporting both inertial updates and a global position estimate at a high rate.

  • GPS is a fairly accurate localization sensor; however, its update rate is slow, at only about 10 Hz, and it is thus not capable of providing real-time updates.

  • Conversely, an IMU’s accuracy degrades with time, and thus cannot be relied upon to provide reliable position updates over long periods of time.

    • However, an IMU can provide updates more frequently, at 200 Hz or higher, satisfying the real-time requirement. Assuming a vehicle traveling at 60 miles per hour (about 26.8 m/s), the distance traveled between two position updates is 26.8/200 ≈ 0.13 meters, which means that the worst-case localization error between updates is less than 0.2 meters.

By combining both GPS and IMU, we can provide accurate and real-time updates for vehicle localization (a minimal fusion sketch follows the list below). Nonetheless, we cannot rely on this combination alone for localization, for three reasons:

  • its accuracy is only about one meter;

  • the GPS signal has multipath problems, meaning that the signal may bounce off buildings, introducing more noise;

  • GPS requires an unobstructed view of the sky and would thus not work in environments such as tunnels.
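
A minimal sketch of the GPS/IMU combination in Python (the rates, bias, and blending gain are hypothetical): the IMU dead-reckons position at a high rate, and each slow GPS fix pulls the estimate back toward the global position to cancel the accumulated drift.

    def fuse(imu_velocities, gps_fixes, dt=0.005, alpha=0.2):
        """Dead-reckon at 200 Hz from IMU velocity and blend in a
        10 Hz GPS fix (every 20th step) with a complementary gain."""
        x = 0.0
        for i, v in enumerate(imu_velocities):
            x += v * dt                        # high-rate inertial update
            if i % 20 == 0 and i // 20 < len(gps_fixes):
                x = (1 - alpha) * x + alpha * gps_fixes[i // 20]  # drift correction
        return x

    # Vehicle at a constant 26.8 m/s (60 mph); the IMU is slightly biased.
    imu = [26.8 * 1.01] * 200                    # 1 second of 200 Hz samples
    gps = [26.8 * (i * 0.1) for i in range(10)]  # 10 Hz position fixes
    print(f"fused position after 1 s: {fuse(imu, gps):.2f} m (true: 26.80 m)")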

LiDAR

LiDAR is used for

  • mapping,

  • localization, and

  • obstacle avoidance.

It works by bouncing a laser beam off surfaces and measuring the reflection time to determine distance. Due to its high accuracy, it is used as the main sensor in most autonomous vehicle implementations. LiDAR can be used to

  • produce high-definition maps,

  • localize a moving vehicle against high-definition maps,

  • detect obstacles ahead, etc.

Normally, a LiDAR unit, such as the Velodyne 64-beam laser, rotates at 10 Hz and takes about 1.3 million readings per second. There are two main problems with LiDAR:

  • when there are many suspended particles in the air, such as rain drops and dust, the measurements may be extremely noisy.

  • a 64-beam LiDAR unit is quite costly.

Camera

Cameras are mostly used for object recognition and object tracking tasks such as

  • lane detection,

  • traffic light detection, and

  • pedestrian detection, etc.

To enhance autonomous vehicle safety, existing implementations usually mount eight or more 1080p cameras around the car, such that we can use cameras to detect, recognize, and track objects in front of, behind, and on both sides of the vehicle. These cameras usually run at 60 Hz, and, when combined, would generate around 1.8 GB of raw data per second.

Radar and Sonar

The radar and sonar system is mostly used as the last line of defense in obstacle avoidance. The data generated by radar and sonar indicates the distance to the nearest object in the vehicle’s path. Once we detect an object close ahead with a danger of collision, the autonomous vehicle should apply the brakes or turn to avoid the obstacle. Therefore, the data generated by radar and sonar does not require much processing and is usually fed directly to the control processor, bypassing the main computation pipeline, to implement such “urgent” functions as swerving, applying the brakes, or pre-tensioning the seatbelts.

Perception

After getting sensor data, we feed the data into the perception stage to understand the vehicle’s environment. The three main tasks in autonomous driving perception are

  • localization,

  • object detection, and

  • object tracking.

Localization

Localization is a sensor-fusion process in which GPS/IMU and LiDAR data can be used to generate a high-resolution infrared reflectance ground map. To localize a moving vehicle relative to these maps, we could apply a particle filter method to correlate the LiDAR measurements with the map (a minimal sketch follows). The particle filter method has been demonstrated to achieve real-time localization with 10-centimeter accuracy and to be effective in urban environments. However, the high cost of LiDAR could limit its wide application.
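
A minimal one-dimensional particle filter sketch in Python (the map, noise levels, and measurement model are all hypothetical), showing the predict-weight-resample loop used to localize against a known map:

    import random

    def particle_filter_step(particles, control, measurement, expected):
        """One predict-weight-resample cycle. particles: candidate positions;
        control: odometry delta; expected(p): predicted reading from the map."""
        # 1. Predict: move every particle by the control input plus motion noise.
        particles = [p + control + random.gauss(0, 0.1) for p in particles]
        # 2. Weight: particles whose predicted reading matches the actual
        #    measurement get higher weight.
        weights = [1.0 / (1e-6 + abs(expected(p) - measurement)) for p in particles]
        # 3. Resample: draw a new particle set in proportion to the weights.
        return random.choices(particles, weights=weights, k=len(particles))

    expected = lambda pos: 2.0 * pos      # toy map: reading grows with position
    particles = [random.uniform(0, 10) for _ in range(500)]
    for step in range(20):                # vehicle advances 0.5 m per step
        particles = particle_filter_step(particles, control=0.5,
                                         measurement=expected(0.5 * (step + 1)))
    print(f"estimate: {sum(particles) / len(particles):.2f} (true position: 10.00)")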

Object Detection

In recent years, however, we have seen the rapid development of vision-based Deep Learning technology, which achieves significant object detection and tracking accuracy. A Convolution Neural Network (CNN) is a type of Deep Neural Network (DNN) that is widely used in object recognition tasks. A general CNN evaluation pipeline usually consists of the following layers:

  • The Convolution Layer which contains different filters to extract different features from the input image.

    • Each filter contains a set of “learnable” parameters that will be derived after the training stage.

  • The Activation Layer which decides whether to activate the target neuron or not.

  • The Pooling Layer which reduces the spatial size of the representation to reduce the number of parameters and consequently the computation in the network.

  • The Fully Connected Layer, where neurons have full connections to all activations in the previous layer.

The convolution layer is often the most computation-intensive layer in a CNN.
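
A minimal sketch of this four-stage pipeline in Python, assuming PyTorch is available (the layer sizes are arbitrary illustrations, not a production network):

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        """Convolution -> activation -> pooling -> fully connected,
        mirroring the pipeline described above."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # learnable filters
            self.act = nn.ReLU()                                    # activation layer
            self.pool = nn.MaxPool2d(2)                             # spatial reduction
            self.fc = nn.Linear(16 * 16 * 16, num_classes)          # full connections

        def forward(self, x):
            x = self.pool(self.act(self.conv(x)))
            return self.fc(torch.flatten(x, 1))

    # One 3-channel 32x32 image; the convolution dominates the compute.
    logits = TinyCNN()(torch.randn(1, 3, 32, 32))
    print(logits.shape)   # torch.Size([1, 10])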

Object Tracking

Object tracking refers to the automatic estimation of the trajectory of an object as it moves.

  • After the object to track is identified using object recognition techniques, the goal of object tracking is to automatically follow the object’s trajectory in subsequent frames.

  • This technology can be used to track nearby moving vehicles as well as people crossing the road to ensure that the current vehicle does not collide with these moving objects.

  • In recent years, deep learning techniques have demonstrated advantages in object tracking compared to conventional computer vision techniques.

  • Specifically, by using auxiliary natural images, a stacked Auto-Encoder can be trained offline to learn generic image features that are more robust against variations in viewpoints and vehicle positions.

  • Then, the offline trained model can be applied for online tracking.

Decision Making under Uncertainty

Based on the understanding of the vehicle’s environment, the decision stage can generate a safe and efficient action plan in real-time. The tasks in the decision stage mostly involve probabilistic processes and Markov chains.

Prediction

One of the main challenges for human drivers when navigating through traffic is to cope with the possible actions of other drivers which directly influence their own driving strategy.

  • This is especially true when there are multiple lanes on the road or when the vehicle is at a traffic change point.

  • To make sure that the vehicle travels safely in these environments, the decision unit generates predictions of nearby vehicles, and decides on an action plan based on these predictions.

  • To predict actions of other vehicles, one can generate a stochastic model of the reachable position sets of the other traffic participants, and associate these reachable sets with probability distributions.

Path Planning

Planning the path of an autonomous, agile vehicle in a dynamic environment is a very complex problem, especially when the vehicle is required to use its full maneuvering capabilities.

  • A brute force approach would be to search all possible paths and utilize a cost function to identify the best path.

    • However, the brute force approach would require enormous computation resources and may be unable to deliver navigation plans in real-time.

  • In order to circumvent the computational complexity of deterministic, complete algorithms, probabilistic planners have been utilized to provide effective real-time path planning (a minimal sketch follows this list).
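
As a minimal sketch of a probabilistic planner, here is a Rapidly-exploring Random Tree (RRT) in Python on a toy 2-D map (the obstacle, step size, and bounds are hypothetical): instead of enumerating every path, the planner grows a tree toward random samples until it reaches the goal.

    import math
    import random

    OBSTACLES = [((4.0, 4.0), 1.5)]          # hypothetical circle: center, radius
    START, GOAL, STEP = (0.0, 0.0), (9.0, 9.0), 0.8

    def collision_free(p):
        return all(math.dist(p, c) > r for c, r in OBSTACLES)

    def rrt(max_iters=2000):
        """Grow a tree toward random samples; return the node count on success."""
        tree = {START: None}                 # node -> parent
        for _ in range(max_iters):
            sample = (random.uniform(0, 10), random.uniform(0, 10))
            nearest = min(tree, key=lambda n: math.dist(n, sample))
            theta = math.atan2(sample[1] - nearest[1], sample[0] - nearest[0])
            new = (nearest[0] + STEP * math.cos(theta),
                   nearest[1] + STEP * math.sin(theta))
            if collision_free(new):
                tree[new] = nearest
                if math.dist(new, GOAL) < STEP:   # close enough to the goal
                    return len(tree)
        return None                           # no plan within the budget

    print(f"tree nodes expanded before reaching the goal: {rrt()}")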

Obstacle Avoidance

As safety is the paramount concern in autonomous driving, at least two levels of obstacle avoidance mechanisms need to be deployed to ensure that the vehicle will not collide with obstacles.

  • The first level is proactive, and is based on traffic predictions.

    • At runtime, the traffic prediction mechanism generates measures like time to collision or predicted minimum distance, and based on this information, the obstacle avoidance mechanism is triggered to perform local path re-planning (see the time-to-collision sketch after this list).

  • If the proactive mechanism fails, the second-level reactive mechanism, using radar data, will take over.

    • Once the radar detects an obstacle, it will override the current control to avoid the obstacles.
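
A minimal sketch of the reactive trigger in Python (the braking threshold is hypothetical): time to collision is simply range divided by closing speed, and the brake override fires when it drops below the threshold.

    TTC_BRAKE_S = 2.0   # hypothetical braking threshold, seconds

    def reactive_check(range_m, closing_speed_mps):
        """Last line of defense: radar/sonar range and closing speed map
        directly to a brake decision, bypassing the main pipeline."""
        if closing_speed_mps <= 0:
            return "no_action"                # the object is not approaching
        ttc = range_m / closing_speed_mps     # time to collision
        return "brake_override" if ttc < TTC_BRAKE_S else "no_action"

    print(reactive_check(range_m=30.0, closing_speed_mps=10.0))   # TTC 3.0 s -> no_action
    print(reactive_check(range_m=15.0, closing_speed_mps=10.0))   # TTC 1.5 s -> brake_override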

Discussions

An autonomous vehicle must be capable of sensing its environment and safely navigating without human input. Indeed, the US Department of Transportation's National Highway Traffic Safety Administration (NHTSA) has formally defined five different levels of autonomous driving:

  • Level 0: the driver completely controls the vehicle at all times; the vehicle is not autonomous at all.

  • Level 1: semi-autonomous; most functions are controlled by the driver, however some functions such as braking can be done automatically by the vehicle.

  • Level 2: the driver is disengaged from physically operating the vehicle by having no contact with the steering wheel and foot pedals. This means that at least two functions, cruise control and lane-centering, are automated.

  • Level 3: there is still a driver who may completely shift safety-critical functions to the vehicle and is not required to monitor the situation as closely as for the lower levels.

  • Level 4: the vehicle performs all safety-critical functions for the entire trip, and the driver is not expected to control the vehicle at any time since this vehicle would control all functions from start to stop, including all parking functions.

Levels 3 and 4 autonomous vehicles must sense their surroundings by using multiple sensors, including LiDAR, GPS, IMU, cameras, etc. Based on the sensor inputs, they need to be able to

  • localize themselves, and in real-time,

  • make decisions about how to navigate within the perceived environment.

Due to the enormous amount of sensor data and the high complexity of the computation pipeline, autonomous driving places extremely high demands in terms of

  • computing power, and

  • electrical power consumption.

Existing designs often require equipping an autonomous car with multiple servers, each with multiple high-end CPUs and GPUs. These designs come with several problems:

  • First, the costs are extremely high, thus making autonomy unaffordable to the general public.

  • Second, power supply and heat dissipation become a problem as this setup consumes thousands of Watts, consequently placing high demands on the vehicle’s power system.

In this CRC Press News, we have explored computer architecture techniques for autonomous driving.

  • First, we have introduced the tasks involved in current LiDAR-based autonomous driving.

  • Second, we have explored how vision-based autonomous driving, a rising paradigm for autonomous driving, is different from the LiDAR-based counterpart.

  • Then, we have examined existing system implementations for autonomous driving.

  • Next, considering different computing resources, including CPU, GPU, FPGA, and DSP, we have further explored the most suitable computing resource for each task.

Based on the simulation results of running autonomous driving tasks on a heterogeneous ARM mobile SoC, we have explored a system architecture for autonomous driving, which is

  • modular,

  • secure,

  • dynamic,

  • energy-efficient, and

  • capable of delivering high levels of computing performance.

Summary

In summary, in this CRC Press News, we have described the computing tasks involved in autonomous driving and examined existing autonomous driving computing platform implementations. To enable autonomous driving, the computing stack needs to simultaneously provide

  • high performance,

  • low power consumption,

  • low thermal dissipation, and

  • low cost.

We have also discussed possible approaches to design computing platforms that will meet these needs.

What Every Engineer Should Know About Designing AI-Enabled Systems with SOTIF (Safety Of The Intended Functionality)

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

  • An Autonomous Vehicle (AV) needs more intelligence when it is on the move.

  • This intelligence is not just an algorithm driven by multiple sensor inputs alone; it needs to be highly situationally aware while accounting for the current vehicle dynamics.

  • This requires a lot of situational, scenario-based complex computation, plus communication with multiple Electronic Control Units (ECUs) within the vehicle. The real challenge is that the computation needs to happen in real time so the vehicle can react accordingly to avoid unreasonable risks.

  • Most of the external scenario is sensed 360°, typically using multiple cameras, LiDAR, radar, etc., and different Deep Learning or Deep Neural Network (DNN) algorithms perform the perception.

  • Many Advanced Driver Assistance Systems (ADAS) give alerts and assistance for a better driving experience. In the case of L4+ autonomous vehicles, the DNN algorithms might also assist Adaptive Cruise Control (ACC) functions and help the vehicle maneuver smoothly or guide it to a safe zone in an undesirable situation.

  • Today, even though ADAS systems are designed to comply with the ISO 26262 standard covering electrical and electronic (E/E) safety, and the respective software components are verified and validated thoroughly, there are still chances of system failure due to “unknown, unsafe” scenarios.

  • Autonomous vehicles can be exposed to unknown, unsafe situations in the real-world environment even when the ADAS hardware and software are operating correctly.

  • Globally, the automotive industry has witnessed accidents and other adverse events related to misbehavior of autonomous driving technology.

Examples

  • An AI-enabled system failure could be due to the AI algorithm not interpreting the real-time scenario correctly.

  • Inadequacy of the AI-enabled system for perception and safe driving operation.

    • This could be caused by an inadequate algorithm or an inadequate training dataset that fails to address different operating conditions, road scenarios, and driving regions.

  • Human mishandling or negligence of the ADAS function or AI system.

    • Driver negligence and inattention to the alerts and warnings provided (visual or audible).

    • The driver misusing the configuration settings (settings for the ADAS system or an automated function), causing a malfunction of the AI system.

  • Dynamic changes in the road scene surroundings.

    • For example, a detected static object suddenly becomes a dynamic object.

    • Unusual objects, holographic image movement, etc.

  • External sensor instability, out-of-calibration conditions, or external sensor failure (possibly intermittent).

    • Image sensor calibration or refresh problems.

    • Intermittent image sensor failure.

  • Sensor performance limitations.

SOTIF mindset to overcome engineering defects:

  • In the latest ECU systems, as code density and algorithmic intelligence increase, the probability of system failure could also increase due to software failures rather than electrical and electronic (E/E) system errors, because the latest systems are heavily packed with AI or DNN algorithms.

  • There are also possibilities where sophisticated AI-enabled systems are further fed with driver biometric information alongside multiple L3+ functions.

  • In these kinds of systems, the “scenario-to-system reaction time” is crucial.

  • SOTIF (Safety Of The Intended Functionality), ISO/PAS 21448, helps avoid unreasonable risks and hazards resulting from functional deficiencies of the actual intended function or from system misuse by humans.

  • SOTIF can be applied to ADAS systems or any other emergency systems whose behavior could lead to a safety hazard that may not be due to system failure.

  • So while designing an AI-enabled system for a specific functionality, the design should ensure that the intended functionality is guaranteed and that the system is not compromised by performance limitations or any human misuse.

    • Developing intelligent systems with SOTIF considerations will therefore enable the design of systems that are “situationally aware”.

  • While designing intelligent systems, the designers should ensure the system (which could comprise semiconductor devices + AI algorithms + application software + sensor interfaces + HMI controls, etc.) has no functional or performance limitations with respect to the functionality for which it is designed.

  • To guard against performance limitations, the device, the algorithm, the training dataset, the hardware platform, and the Key Performance Indicators (KPIs) must be viable and sufficient to deliver the “intended functionality without any functional deficiency”, with the required performance and system reaction time guaranteed in all possible scenarios.

For example, consider the systems below.

Example 1:

  • A system takes multi-sensor input, performs signal processing followed by deep neural network inference, and updates the result on the user HMI panel or Digital Instrument Cluster panel with visual and audio alerts.

Example 2:

  • A system takes multiple camera sensors (with higher frame rate and resolution) and performs complex image processing followed by classification using a deep neural network algorithm.

  • Another algorithm (or functional block) uses the classified result plus vehicle parameters, performs multiple iterations of estimation, and generates control parameters for other ECUs and for the HMI panel.

In these kinds of intelligent systems, the designer should ensure that the algorithm, the silicon device, the interface (on-chip/off-chip), or the platform deployed does not create any unintended system behavior because of performance limitations.

  • SOTIF can be applied and analyzed for various hazard event models covering potential failures that could affect the “intended functionality”, or failures due to human misuse or mishandling of user settings or parameters.

  • This methodology enables designers to identify all possible situations that fall under “unknown, unsafe” scenarios and helps determine proper system time budgeting to avoid performance-limitation issues (a minimal time-budgeting sketch follows this list).

  • With SOTIF-based guidelines and environmental scenarios taken into account, designers can identify and evaluate scenarios and trigger events, enabling the design of an “intelligent, situationally aware” system without over-engineering.

  • For the SOTIF verification and validation approach, the combination of “intended functionality and reasonably foreseeable misuse (i.e., human misuse)” can be taken into account while identifying the hazard events.

  • Moreover, if the intelligent system involves functions or algorithms with non-predictive behavior, the validation and automation complexity increases.

  • However, new AI-powered intelligent-sensing imaging technology is gaining momentum.

    • When these types of sensors are deployed in an ADAS, “sensor performance and accuracy” become even more vital with respect to SOTIF, where the end-to-end system performance and intended functionality need to be guaranteed in the integrated validation testing.

  • So a combination of “real-time road scenario vs. misuse vs. system intended function” will help in deriving and testing the system behavior by appropriately generating the hazard events.

  • Typically, for the Area 1 and Area 4 scenarios (the known-safe and unknown-safe areas in SOTIF terms), the normal intended functionality can be deployed and verified.

  • For Area 2 (known-unsafe scenarios), the independent system behavior (with intended functionality), other random trigger events with associated road scenarios, possible misuse of system functions, and driver negligence of system alerts can be simulated and validated at the system level.

    • Also, depending on the functionality of the intelligent system, the deployed algorithm and the end-to-end response time can be verified to check for performance limitations and to identify functional improvements.

  • For Area 3, which covers “unknown, unsafe” scenarios, enduring scenarios or trigger events can be generated synthetically in the road environment (e.g., highway, city drive, traffic situations, school zone, hospital zone, men-at-work, service-lane crossing) and evaluated qualitatively and quantitatively to assess the system behavior, as black-box testing using a Hardware-in-the-Loop (HIL) setup. This can help in residual scenario testing.
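
As a toy illustration of the time-budgeting idea referenced above, the following minimal Python sketch checks whether a hypothetical perception-to-actuation pipeline stays within a scenario-to-system reaction time budget. All stage names and latency numbers are invented for illustration; real budgets come from the SOTIF analysis itself.

```python
# Hypothetical scenario-to-system reaction time budget check (illustrative only).
REACTION_BUDGET_MS = 100.0  # assumed overall reaction-time budget

# Assumed worst-case latency contribution of each pipeline stage, in ms
stage_latencies_ms = {
    "sensor_acquisition": 15.0,
    "dnn_perception": 45.0,
    "scenario_classification": 10.0,
    "control_decision": 12.0,
    "actuation_command": 8.0,
}

total_ms = sum(stage_latencies_ms.values())
margin_ms = REACTION_BUDGET_MS - total_ms

for stage, latency in stage_latencies_ms.items():
    print(f"{stage:24s} {latency:6.1f} ms")
print(f"{'total':24s} {total_ms:6.1f} ms (margin {margin_ms:+.1f} ms)")

if margin_ms < 0:
    # A violated budget flags a potential performance-limitation hazard.
    print("Budget violated: candidate 'unknown, unsafe' performance scenario")
```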

Summary:

  • While designing an intelligent system for an ADAS or Autonomous Drive system, having the vehicle understand the real-time road environment and become “aware and intelligent” is a real challenge.

  • The SOTIF ISO/PAS 21448 will have substantial focus and impact on the Autonomous Vehicles.

  • Intelligent systems with artificial intelligence, or systems performing extensive perception, will see a real benefit in safety validation when adopting SOTIF considerations.

  • Also, validating a system for reasonably foreseeable misuse with road scenarios will enable building robust systems that use real-time data from V2X.

  • So ISO/PAS 21448-based verification and validation, combined with the latest Hardware-in-the-Loop (HIL) testing, shall enable the design and validation of more sophisticated and intelligent systems.

Design Constraints for an autonomous driving system

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Occupational Health & Safety, Statistics

Industrial Design Engineering Project: DIY an autonomous driving system

According to National Highway Traffic Safety Administration, “Today, our country is on the verge of one of the most exciting and important innovations in transportation history— the development of Automated Driving Systems (ADSs), commonly referred to as automated or self-driving vehicles.” (see attached Figure).

Performance Constraints:

  • Specification: a tail latency lower than 100 ms at a frame rate higher than 10 frames per second.

GPU-, FPGA-, and ASIC-accelerated autonomous driving systems can significantly reduce the processing tail latency, by factors of 169×, 10×, and 93×, respectively. We identify six classes of constraints, including performance, predictability, storage, thermal, power, and others, and have arrived at some unique observations about autonomous driving systems. For instance, we discover it is critical for the system to be able to finish the end-to-end processing at a latency less than 100 ms and a frame rate higher than 10 frames per second, to react fast enough to the constantly changing traffic conditions. To provide better safety, autonomous driving systems should be able to react faster than human drivers, which suggests the latency for processing traffic conditions should be within 100 ms.

In addition to processing latency, autonomous driving systems also need to frequently update their “understanding” to keep up with the continuously changing real-time traffic condition. In other words, the frame rate needs to be high, in case the real-time traffic condition changes drastically between two neighboring frames. To react quickly to the constantly changing traffic condition, the system should be able to react faster than human reaction time, which suggests a frequency of once every 100 ms.

To capture the stringent performance predictability requirements of such a mission-critical real-time system, tail latency (i.e., high quantiles of the latency distribution) should be used to quantify the performance of the system.

  • Frame rate: The frame rate determines how fast the real-time sensor data can be fed into the process engine.

  • Processing latency: The processing latency of recognizing scenes and making operational decisions determines how fast the system can react to the captured sensor data.

Performance Constraints: the autonomous driving system should be able to process current traffic conditions within a latency of 100 ms, at a frequency of at least once every 100 ms.

Predictability Constraints

Specifically, the predictability of the processing latency is critical for the autonomous driving system to react quickly and reliably to the real-time traffic conditions. To capture the non-determinism of large-scale distributed systems, tail latency, defined as the high quantiles of the latency distribution (e.g., 95th-, 99th-percentile latency), is often used to evaluate the performance of such systems instead of mean latency.

The localization algorithm has large performance variability, which makes it challenging for the autonomous driving system to react to the real-time traffic. As a result, tail latency, at high quantiles like the 99.99th percentile or even the worst-case latency, should be used to evaluate the performance of such systems to reflect the stringent predictability requirements. We have empirically demonstrated why tail latency should be used.

Predictability Constraints: Due to the large performance variability of autonomous driving systems, tail latency (e.g., 99th-, 99.99th- percentile latency) should be used as the metric to evaluate the performance, in order to capture the stringent predictability requirement.
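
To make the tail-latency metric concrete, the short sketch below computes the mean and several high quantiles from a set of latency samples. The synthetic data here merely stands in for measured end-to-end latencies; only the percentile computation itself reflects the text.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
# Synthetic stand-in for measured end-to-end latencies (ms): mostly fast
# frames plus a heavy tail, mimicking large performance variability.
latencies_ms = np.concatenate([
    rng.normal(60.0, 5.0, 9_900),    # typical frames
    rng.normal(400.0, 80.0, 100),    # rare slow frames dominating the tail
])

print(f"mean    : {latencies_ms.mean():7.1f} ms")
for q in (95, 99, 99.99):
    print(f"p{q:<6}: {np.percentile(latencies_ms, q):7.1f} ms")
# A 100 ms constraint can look satisfied on the mean while the tail violates it.
```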

For the latency of each algorithmic component on a multi-core CPU system in the end-to-end autonomous driving system, the latency contributed by each of the object detection engine (DET), object tracking engine (TRA), and localization engine (LOC) already exceeds the latency requirement of the end-to-end system. These three components dominate the end-to-end latency and are thereby the computational bottlenecks that prevent us from meeting the design constraints.

The limited on-chip memory on FPGAs is not sufficient to hold all the network architecture configurations, so the networks are executed layer by layer. For each layer, the memory controller initiates the data transfer, and the layer definition is used by the header decoder unit (HdrDc_unit) to configure the layer. Each Processing Element (PE), consisting primarily of multiply-accumulate (MAC) units instantiated by the Digital Signal Processors (DSPs) on the fabric, performs the necessary computation on the data stored in the Weight Buffer and Input Buffer and writes the output to the Output Buffer. To hide the data transfer latency, we implement double buffering for all buffers to prefetch the needed data in advance while executing the current layer. Overall, we are able to achieve a utilization higher than 80% on the available Adaptive Logic Modules (ALMs) and DSPs.
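
The double-buffering (ping-pong) schedule described above can be sketched in plain Python. In hardware, the prefetch and the computation overlap in time; the sequential loop below only illustrates the schedule, and the layer names and helper functions are invented for the example.

```python
# Illustrative ping-pong (double) buffering: prefetch the data for layer i+1
# while the PE array computes on layer i.
layers = ["conv1", "conv2", "conv3", "fc"]   # hypothetical layer sequence
buffers = [None, None]                       # two on-chip buffer slots

def transfer(layer):
    # Stand-in for the memory controller moving weights/inputs on-chip.
    return f"weights+inputs({layer})"

def compute(data):
    # Stand-in for the PE array's MAC computation on the buffered data.
    print(f"computing on {data}")

buffers[0] = transfer(layers[0])             # prefetch the first layer
for i, layer in enumerate(layers):
    cur, nxt = i % 2, (i + 1) % 2
    if i + 1 < len(layers):
        buffers[nxt] = transfer(layers[i + 1])  # prefetch the next layer...
    compute(buffers[cur])                       # ...while computing the current
```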

As the input images stream into the image buffer, they are filtered by the mask window. The feature detector extracts the features of interest and stores them in the feature point buffer. They are then consumed by the rotate unit to rotate the corresponding coordinates, and the generated feature descriptors are stored in the descriptor buffer. To optimize the performance of our design, we have implemented all the complex trigonometric functions with Lookup Tables (LUTs) to avoid the extensive use of multipliers and dividers, which reduces the latency by a factor of 1.5×. Because of the simplicity of this design, we can achieve a high clock rate and thereby low latency.
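
The LUT technique mentioned above can be illustrated with a minimal Python sketch: the sine function is precomputed into a table and evaluated by index lookup instead of runtime arithmetic. The table size and quantization here are arbitrary choices for the example, not the values used in the actual design.

```python
import math

LUT_SIZE = 1024  # arbitrary table resolution for the example
SIN_LUT = [math.sin(2 * math.pi * i / LUT_SIZE) for i in range(LUT_SIZE)]

def sin_lut(angle_rad: float) -> float:
    """Approximate sin() by table lookup; in hardware the index arithmetic
    reduces to cheap fixed-point operations instead of multipliers/dividers."""
    idx = int(angle_rad / (2 * math.pi) * LUT_SIZE) % LUT_SIZE
    return SIN_LUT[idx]

print(sin_lut(math.pi / 6), math.sin(math.pi / 6))  # ~0.5 vs. exact 0.5
```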

FPGA Implementation: by synthesizing on real systems, we have demonstrated that our Feature Extraction (FE) implementation can execute at a frequency of 250 MHz on the Stratix V development board. By implementing complex trigonometric functions with LUTs, we improve the performance of FE by a factor of 1.5×.

ASIC Implementation: we are able to achieve a 4× reduction in latency by replacing complex trigonometric function computations with LUTs.

Mean Latency

It is impractical to run either DET or TRA on multi-core CPU systems, as the latency of each individual component is already significantly higher than the end-to-end system latency constraint (i.e., 100 ms). This is because both of these components use Deep Neural Network (DNN)-based algorithms, which demand a large amount of computing capacity that conventional multi-core CPUs do not offer. On the contrary, GPUs provide significantly lower mean latency across all three workloads, benefiting from the massive parallel processing power provided by the large number of processors. Although FPGAs achieve significant latency reduction compared to CPUs, their mean latencies for DET (i.e., 369.6 ms) and TRA (i.e., 536.0 ms) are still too high to meet the latency constraint of 100 ms. This is largely due to the limited number of DSPs available on the fabric. To support these complex DNN-based algorithms, a large number of DSPs on FPGAs is needed to provide significant compute power, which can be achieved by more advanced FPGAs (e.g., the Xilinx VC709 FPGA board). As expected, ASICs can achieve significant latency reduction, where the mean latency for executing TRA is as low as 1.8 ms. Note that the reason DET runs slower on ASICs than on GPUs is the limited clock frequency of 200 MHz at which this particular design can operate, which does not preclude similar designs with higher clock frequencies from outperforming GPUs.

Finding 1. Multicore CPUs are not viable candidates for running object detection (DET) and object tracking (TRA), which are composed of complex DNN-based algorithms that demand large amounts of computational resources. The limited number of DSPs becomes the main bottleneck preventing FPGAs from meeting the performance constraints.

Tail Latency

Although multicore CPUs can execute the localization algorithm within the performance constraints on average (i.e., mean latency), they suffer from high tail latency across all three workloads. Due to this large performance variability, tail latency should be used to evaluate the performance of autonomous driving systems to meet the performance predictability constraints. The other computing platforms do not experience any significant increase from mean latency to tail latency, which is highly preferable for such mission-critical real-time applications.
Finding 2. Due to the large performance variability of the localization algorithm, tail latency should be used to evaluate the performance of autonomous driving systems to meet the real-time constraints, whereas conventional metrics like mean latency can easily lead to misleading conclusions.

End-to-End Performance

The end-to-end latency is determined by the slower of the two parallel paths, LOC versus DET + TRA, because they are executed in parallel. For end-to-end performance, we have observed that certain configurations (e.g., LOC on CPUs, DET and TRA on GPUs) can meet the performance constraint of 100 ms latency when mean latency is considered, but are no longer viable when considering the tail latency. This again confirms our observation that tail latency should be used when evaluating autonomous driving systems. In addition, none of the viable designs includes multi-core CPUs, due to their inherent non-determinism and unpredictability. With acceleration, we are able to reduce the end-to-end tail latency from 9.1 s (i.e., on multi-core CPUs) to 16.1 ms to meet the real-time processing constraints.
Finding 3. Accelerator-based design is a viable approach to build autonomous driving systems, and accelerator platforms with high performance predictability (e.g., ASICs) are preferable to meet the real-time processing constraints.
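
Since LOC runs in parallel with the DET → TRA chain, the end-to-end latency is max(LOC, DET + TRA). The sketch below evaluates a few hypothetical accelerator assignments against the 100 ms constraint; the tail-latency numbers are invented for illustration, not measured results.

```python
# End-to-end latency of the parallel pipeline: max(LOC, DET + TRA).
# (LOC, DET, TRA) tail latencies in ms per hypothetical configuration.
configs = {
    "all multi-core CPU": (9100.0, 4100.0, 5000.0),
    "LOC CPU, rest GPU":  (9100.0,   25.0,   30.0),
    "all ASIC":           (   5.0,    8.0,    1.8),
}

CONSTRAINT_MS = 100.0
for name, (loc, det, tra) in configs.items():
    e2e = max(loc, det + tra)  # slower of the two parallel paths
    verdict = "meets" if e2e <= CONSTRAINT_MS else "violates"
    print(f"{name:20s} end-to-end {e2e:7.1f} ms -> {verdict} the 100 ms constraint")
```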

Scalability Analysis

Besides the processing latency, the performance predictability of autonomous driving systems is also determined by the functional aspects: the accuracy of making the correct operational decisions. Increasing camera resolution can significantly boost the accuracy of autonomous driving systems, sometimes by as much as 10%. For example, doubling the input resolution can improve the accuracy of VGG16, a DNN-based state-of-the-art object detection algorithm, from 80.3% to 87.4%. Therefore, we have investigated the system scalability of our accelerator-based autonomous driving systems in supporting future higher-resolution cameras. We modify the resolution of the benchmarks to study this question and obtain the end-to-end latency as a function of the input camera resolution for various accelerator-based configurations. We observe that although some of the ASIC- and GPU-accelerated systems can still meet the real-time performance constraints at Full HD resolution (1080p), none of these configurations can sustain them at Quad HD (QHD). Computational capability still remains the bottleneck that prevents us from benefiting from the higher system accuracy enabled by higher-resolution cameras. As a result, GPU-, FPGA-, and ASIC-accelerated systems can reduce the tail latency of related algorithms by 169×, 10×, and 93×, respectively.

Year 2020, shall we start to see AI-based Autonomous Vehicles on the road?

Summary of Findings to Achieve These

  • We find that while power-hungry accelerators like GPUs can predictably deliver the computation at low latency, their high power consumption, further magnified by the cooling load to meet the thermal constraints, can significantly degrade the driving range and fuel efficiency of the vehicle.

  • We also demonstrate that computational capability remains the bottleneck that prevents us from benefiting from the higher system accuracy enabled by higher resolution cameras.

What Every Engineer Should Know About UL 4600 “Standard for Safety for the Evaluation of Autonomous Products”

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

  • The proposed UL 4600 standard focuses on ensuring that a comprehensive safety case is in place, including safety claims, argumentation, and evidence.

  • It is intended to cover computer-based system aspects of autonomous operation.

  • It is specifically designed to build upon the strengths of existing standards such as ISO 26262, and evolving standards such as ISO 21448-SOTIF.

  • It is not a competing standard to those and other standards being developed.

  • UL 4600 permits claiming appropriate credit for conforming to those standards while ensuring autonomy-specific gaps are filled.

Play well with others

  • UL 4600 “Standard for Safety for the Evaluation of Autonomous Products” was not developed in a vacuum.

  • It arrives with a full awareness of existing safety standards.

  • On the safety standards landscape, the industry already has ISO 26262 (Functional Safety) and ISO 21448 (SOTIF), both designed for vehicles with a human driver responsible for safe operation.

  • In contrast, UL 4600 deals head-on with full autonomy.

  • The standard explains that “complete removal of humans from performing aspects (including supervision) of autonomous item operation brings with it numerous additional concerns.”

  • UL 4600 was developed to address these “additional concerns.”

Philosophical vs. Prescriptive

  • On one hand, SOTIF is taking a more philosophical approach to define “unknown, unsafe” scenarios for ADAS and AVs.

  • UL 4600, on the other hand, prefers a goal-based approach that prescribes topics to be addressed in creating a safety case.

  • Machine learning is another area where the two groups take varying approaches.

  • Many members working within the SOTIF feel more comfortable limiting the scope of SOTIF development to lower-level autonomous vehicles.

  • UL 4600’s authors intend to specifically cover validation of any machine learning based functionality and other autonomy functions used in life-critical applications.

  • At this point, however, neither UL 4600 nor SOTIF is recognized as an autonomy standard by the engineering community at large.

  • UL 4600 is complementary.

    • UL 4600 was designed in the interest of SOTIF.

    • UL 4600, for example, addresses validation of AI trained methods, something that is not covered in ISO 26262 or SOTIF.

  • SOTIF calls out the need for methods and processes, while UL 4600 is a methodology and procedure for getting it done.

  • With SOTIF, the industry is trying to understand the limits of the technology, so they can better define the operating domains.

Areas of specific emphasis

Areas of specific emphasis include safety practices for machine learning-based approaches, functionality for which complete requirements are not available, addressing “unknown unknowns” in safety argumentation, and ensuring that adequate fault mitigation capabilities are present in systems that do not have oversight by human drivers.

UL 4600 vs. ISO 26262 and ISO/PAS 21448 (SOTIF)

  • UL 4600 is a relative newcomer to automotive safety standards development.

  • ISO 26262 already exists, while ISO/PAS 21448 (safety of the intended functionality, or “SOTIF”) is well into development.

The most frequently asked question about UL 4600 is why the automotive world needs yet another standard.

Answer: the UL 4600 group is closely in touch with leaders in ISO 26262 and ISO/PAS 21448, and resolving potential overlap is an ongoing activity.

  • Developing safety standards based on the assumption that systems will have no responsible human driver is what separates UL 4600 from other standards.

  • In contrast, “existing standards such as ISO 26262 and ISO/PAS 21448 were envisioned for vehicles that ultimately have a human driver responsible for safe operation of the vehicle.”

    • The technology in robocars and other autonomous systems exceeds the scope of these and other traditional safety standards.

    • Those standards are necessary, but not sufficient.

  • While ISO 26262 and SOTIF provide safety as a “target” to shoot for, UL 4600 offers “the center of the target.”

    • For instance, UL 4600 will expect from automotive designers a lot of details, such as, “if you are doing X, don’t forget to do Y.”

    • Other standards show “how to get to safety,” but UL 4600 prescribes “where you end up with your system.”

Why Do We Need Another Standard?

  • Current safety standards provide essential guidance for designing safe vehicles.

    • However, existing standards such as ISO 26262 and ISO/PAS 21448 were envisioned for vehicles that ultimately have a human driver responsible for safe operation of the vehicle.

    • With existing standards, safety is typically achieved via following a specified design process, together with the imposition of specific technical requirements and validation methods.

    • Higher degrees of risk result in more rigorous engineering requirements to ensure appropriate risk mitigation.

  • Rather than require a particular technical approach, UL 4600 concentrates on ensuring that a valid safety case is created.

      • A safety case includes three elements: goals, argumentation, and evidence (a minimal data-structure sketch follows this list).

      • Goals describe what it means to be safe in a specific context, such as generic system-level safety goals (e.g., don’t hit pedestrians) and element safety requirements (e.g., ensure a computing chip produces correct computational results despite potential transient hardware faults).

      • Arguments are a written explanation as to why a goal is achieved (e.g., vehicle-level argumentation that the system can detect and avoid pedestrians, including ones that are unusual or appear in the roadway from behind obstacles, within the limits of physics and subject to the vehicle displaying appropriate defensive driving behavior).

      • Evidence supports that the arguments are valid, typically based on analysis, simulations, and test results (e.g., for a computing chip mathematical analysis of error correction codes combined with the results of fault injection experiments).
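
A safety case's goal/argument/evidence structure can be represented with simple data types. The following Python sketch is a hypothetical illustration of that structure, not an artifact defined by UL 4600.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    description: str  # e.g., analysis, simulation, or test results

@dataclass
class Argument:
    rationale: str                                  # why the goal is achieved
    evidence: list[Evidence] = field(default_factory=list)

@dataclass
class Goal:
    statement: str                                  # what "safe" means here
    arguments: list[Argument] = field(default_factory=list)

# Hypothetical example mirroring the pedestrian goal in the text.
pedestrian_goal = Goal(
    statement="Don't hit pedestrians",
    arguments=[Argument(
        rationale="The system detects and avoids pedestrians, including "
                  "occluded ones, within the limits of physics",
        evidence=[Evidence("closed-course pedestrian scenario test results"),
                  Evidence("simulation coverage of occluded-pedestrian cases")],
    )],
)
```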

Goal based and technology-agnostic

  • The key to the UL 4600 approach is that it is goal based and technology-agnostic.

  • That means UL 4600 requires explaining why the self-driving car is safe without requiring the use of any specific design approach or specific technology use.

    • For example, using LIDAR is not required. Rather, the safety case has to credibly argue that relevant objects will be successfully detected and classified with whatever sensors are installed within the limits of the intended operational design domain.

    • Similarly, there is no fixed limit on the number of road testing miles that must be accumulated before deployment.

      • Rather, the safety case must argue that an acceptably robust combination of analysis, simulation, closed course testing, and safe public road testing have been performed to ensure an appropriate level of system safety for the initial vehicle and each software update.

Summary

  • UL 4600 is specifically designed to work well with existing automotive safety standards such as ISO 26262 and ISO/PAS 21448.

    • However, it is generic enough that it can also play well with other standards as autonomy becomes adopted into other domains.

    • The overarching safety case for the whole system can take inputs from both traditional safety activities and new approaches required to validate novel technical approaches such as machine learning.

  • Compliance with UL 4600 permits claiming credit for other safety standards such as ISO 26262, ISO/PAS 21448, IEC 61508, MIL-STD-882, etc., as well as security standards, where such conformity is demonstrated.

    • UL 4600 makes it very clear that it’s not the only safety standard AV designers need.

      • You also need good engineering methods such as those discussed in other standards, including IEC 61508, ISO 26262 and ISO/PAS 21448 (SOTIF).

Functional Safety of Autonomous Travel: Seeing the Road Through the LIDAR Lens

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Nanoscience & Technology, Occupational Health & Safety

As a surveying method that measures distance to a target by illuminating the target with pulsed laser light and measuring the reflected pulses with a sensor, Light Detection and Ranging (LIDAR) helps to assure Road Safety for Autonomous Travel.

Successful perception algorithms also tend to be probabilistic. For example, the evidence grid framework accumulates diffuse evidence from individual, uncertain sensor readings into increasingly confident and detailed maps of a robot's surroundings. This approach yields a probability that an object is present, but never complete confidence. Furthermore, these algorithms are based on prior models of sensor physics (e.g., multipath returns) and noise (e.g., Gaussian noise on LIDAR-reported ranges) which are themselves probabilistic and sensitive to small changes in environmental conditions. LIDAR, often glossed as “light radar”, is a crucial enabling technology for self-driving cars. The sensors provide a three-dimensional point cloud of a car's surroundings, and the concept helped teams win the DARPA Urban Challenge back in 2007. LIDAR systems have been standard on self-driving cars ever since.

Why LIDAR is essential for self-driving cars

If the safety requirement is to “detect all pedestrians within 10 meters”, this is not completely specifiable, because it is unclear what the complete set of necessary and sufficient conditions is for identifying what a pedestrian is. On the other hand, the safety requirement “detect obstacles within 10 meters” could be precisely specified and implemented in software, because obstacle detection can be performed with an appropriate combination of sensors (e.g., LIDAR, RADAR, etc.) and signal processing. Today, most self-driving cars rely on a trio of sensor types: cameras, radar, and LIDAR. Each has its own strengths and weaknesses. Cameras capture high-resolution color images; however, they can't measure distances with any precision, and they're even worse at estimating the velocity of distant objects.

Radar can measure both distance and velocity, and automotive radars have become a lot more affordable in recent years. Radar is good when you're close to the vehicle. However, because radar uses radio waves, it is not good at mapping fine details at large distances.

LIDAR offers the best of both worlds.

  • Like radar, LIDAR scanners can measure distances with high accuracy.

  • Some LIDAR sensors can even measure velocity.

  • LIDAR also offers higher resolution than radar.

  • That makes LIDAR better at detecting smaller objects and at figuring out whether an object on the side of the road is a pedestrian, a motorcycle, or a stray pile of garbage.

  • And unlike cameras, LIDAR works about as well in any lighting condition.

In order to let a car run autonomously, it first has to sense the external environment/surroundings, process the data, and act by making meaningful decisions. In this sense-process-act chain, the sensing of the external environment is taken care of by sensors like camera, radar, and LIDAR, referred to as surround sensors. Apart from surround sensors, other sensors like vehicle odometry sensors and actuators are also important to feed information to the decision-making block. For example, the steering wheel angle and wheel speed are important data for a car to make the right decision along with surrounding information. So broadly we would divide sensors into the following three categories:

  • Surround sensors: These are mounted on the external/internal surface of the car and useful to provide surrounding information.
    Example: Camera, radar, LIDAR, ultrasonic, infrared camera, IMU, GPS and digital map etc.

  • Vehicle odometry sensors: These sensors capture the information about vehicle motion. Example: wheel speed, acceleration, yaw rate, steering wheel angle etc.

  • Actuators: These are the devices which translate the human/machine actions.
    Example: Brake Torque, Engine Torque, restraint actuators, wheel spring etc.

The three big factors that distinguish LIDAR sensors

Car makers have been using different sensors, mainly LIDAR, radar, camera, and ultrasonic, for safety features like ACC (Adaptive Cruise Control), LKA (Lane Keep Assist), blind spot detection, and forward collision warning, and very recently for active safety features like AEB (Auto-Emergency Braking) as well. In the recent past, the industry has seen the usage of more sensors/information, like satellite information, vehicle-to-vehicle and vehicle-to-infrastructure communication (V2V and V2X), and LIDAR, to improve the robustness of these safety features. There is significant overlap in the information provided by these sensors.

At the same time, their degree of reliability varies. For example, radar and camera can both identify the distance of an object. However, the reliability of information from a radar sensor is higher compared to a camera. Autonomous driving systems need to provide the highest degree of reliability and would require a good overlap of information from different sensors to make a confident decision.

The basic idea of LIDAR is simple: a sensor sends out laser beams in various directions and waits for them to bounce back. Because light travels at a known speed, the round-trip time gives a precise estimate of the distance.
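
Because light travels at a known speed c, the round-trip time t maps directly to distance: d = c·t/2. A one-line check, with an assumed round-trip time of 667 ns:

```python
C = 299_792_458.0                  # speed of light, m/s
round_trip_s = 667e-9              # assumed round-trip time: 667 ns
distance_m = C * round_trip_s / 2  # halve it: the pulse travels out and back
print(f"{distance_m:.1f} m")       # ~100 m
```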

While the basic idea is simple, the details get complicated fast. Every LIDAR maker has to make three basic decisions:

  • how to point the laser in different directions,

  • how to measure the round-trip time, and

  • what frequency of light to use.

We'll look at each of these in turn.

Beam-steering technology: Most leading LIDAR sensors use one of four methods to direct laser beams in different directions:

  • Spinning LIDAR. This approach has the advantage of 360-degree coverage, but critics question whether spinning LIDAR can be made cheap and reliable enough for mass-market use.

  • Mechanical scanning LIDAR uses a mirror to redirect a single laser in different directions. Some LIDAR companies in this category use a technology called a micro-electro-mechanical system (MEMS) to drive the mirror.

  • Optical phased array LIDAR uses a row of emitters that can change the direction of a laser beam by adjusting the relative phase of the signal from one transmitter to the next.

  • Flash LIDAR illuminates the entire field with a single flash. Current flash LIDAR technologies use a single wide-angle laser. This can make it difficult to reach long ranges since any given point gets only a small fraction of the source laser's light. Multi-laser flash systems would have an array of thousands or millions of lasers—each pointed in a different direction.

Distance measurement

LIDAR measures how long light takes to travel to an object and bounce back. There are three basic ways to do this:

Time-of-flight LIDAR sends out a short pulse and measures how long it takes to detect the return flash.

Frequency-modulated continuous-wave (FMCW) LIDAR

Frequency-modulated continuous-wave (FMCW) LIDAR sends out a continuous beam whose frequency changes steadily over time. The beam is split into two, with one half being sent out into the world and then reunited with the other half after it bounces back. Because the source beam has a steadily changing frequency, the difference in travel distance between the beams translates to slightly different beam frequencies. This produces an interference pattern with a beat frequency that is a function of the round-trip time (and therefore of the round-trip distance). This might seem like a needlessly complicated way to measure how far a laser beam travels, but it has a couple of big advantages. FMCW LIDAR is resistant to interference from other LIDAR units or from the Sun. FMCW LIDAR can also use Doppler shifts to measure the velocity of objects as well as their distance.
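
Under the usual linear-chirp model, the beat frequency f_b relates to range by R = c·f_b·T / (2·B), where B is the chirp bandwidth and T the chirp duration. The sketch below applies this relation with assumed parameters; none of the numbers describe a specific sensor.

```python
C = 299_792_458.0      # speed of light, m/s

# Assumed FMCW chirp parameters (illustrative only)
BANDWIDTH_HZ = 1.0e9   # B: 1 GHz frequency sweep
CHIRP_TIME_S = 10e-6   # T: 10 microsecond chirp

def fmcw_range_m(beat_hz: float) -> float:
    """Range from beat frequency: R = c * f_b * T / (2 * B)."""
    return C * beat_hz * CHIRP_TIME_S / (2 * BANDWIDTH_HZ)

print(f"{fmcw_range_m(6.67e6):.1f} m")  # a 6.67 MHz beat -> ~10 m
```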

Amplitude Modulated Continuous Wave (AMCW) LIDAR

Amplitude Modulated Continuous Wave (AMCW) LIDAR can be seen as a compromise between the other two options. Like a basic time-of-flight system, AMCW LIDARs send out a signal and then measure how long it takes for that signal to bounce back. But whereas time-of-flight systems send out a single pulse, AMCW systems send out a more complex pattern (like a pseudo-random stream of digitally encoded ones and zeros, for example). Supporters say this makes AMCW LIDAR more resistant to interference than simple time-of-flight systems.

Laser wavelength

The LIDARs featured in this article use one of three wavelengths:

  • 850 nanometers,

  • 905 nanometers, or

  • 1550 nanometers.

This choice matters for two main reasons. One is eye safety. The fluid in the human eye is transparent to light at 850 and 905nm, allowing the light to reach the retina at the back of the eye. If the laser is too powerful, it can cause permanent eye damage.

On the other hand, the eye is opaque to 1550nm light, allowing 1550nm LIDAR to operate at much higher power levels without causing retina damage. Higher power levels can translate to a longer range.

So why doesn't everyone use 1550nm lasers for LIDAR? Detectors for 850 and 905nm light can be built using cheap, ubiquitous silicon technologies. Building a LIDAR based on 1550nm lasers, in contrast, requires the use of exotic, expensive materials like indium gallium arsenide.

And while 1550nm lasers can operate at higher power levels without a risk to human eyes, those higher-power levels can still cause other problems. And, of course, higher-power lasers consume more energy, lowering a vehicle's range and energy efficiency.

In summary, as a detection system that works on the principle of radar but uses light from a laser instead, LIDAR is essential to assure Road Safety for Autonomous Travel.

What Every Engineer Should Know About SOTIF: fine tuning highly automated vehicle and automated vehicle safety

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

  • The number of highly automated and self-driving cars is increasing rapidly, and their technology is speeding ahead of regulation.

    • However, strong public concerns about their safety have prompted the sector to increase safety standards beyond the item-based ISO 26262 to achieve safety goals compliance.

  • The Safety of the Intended Functionality (SOTIF) will likely become the highly automated vehicle (HAV) and automated vehicle (AV) standard in the coming years as the industry moves towards ultimate safety completeness.

This article explores what this means for the automotive industry.

Playing it safe

  • The US Department of Transportation has just eased federal oversight on HAVs and AVs, giving the nod to OEMs and tech companies like Aurora and Waymo to further their tests on public roads.

  • Also, Europe’s Transport Commission announced earlier this year that it aims to make the continent a world leader in autonomous vehicle systems, with a €450m investment in road and telecoms networks to help catch up with the Chinese and American markets.

  • However, one big question remains – what are the next safety standard steps to fully comply with the Safety Goals (SG) beyond ISO 26262 and E/E systems functionality?

    • Perhaps SOTIF is the long-awaited answer?

  • The SOTIF initiative aims to provide guidance for how to apply safety requirement completeness for AV artificial intelligence systems out on the streets.

  • This will be achieved by developing a more integral standard that mainly involves the sensing systems in charge of sorting all possible dangerous situations – even without any fault in the sensing system itself.

  • The cause of an accident might be that the processing algorithm takes a hazardous decision about the environment,

    • so the SOTIF initiative aims at providing guidance to manage such a violation of a SG.

  • In this regard, SOTIF is a more holistic standard that transcends – or complements – ISO 26262 E/E system standardization.

Why SOTIF?

  • SOTIF originated as a sub-working group within ISO 26262’s second revision.

  • It is also the result of inadequate safety requirement fine-tuning and of improper item definitions:

 

“Our position is that ISO 26262 needs to be complemented to explicitly prescribe activities, e.g. refinement verification, and corresponding work products for refinement, on every existing level of the reference lifecycle.”

  • In addition to the ISO 26262 second revision that will be released later this year, and all the other established functional safety standards like IEC 61508, IEC 61511, the EN 5012X series, DO-254, etc., a new regulatory framework is emerging.

The framework is SOTIF

  • For safety, there are compliance requirements to be met.

  • There is more documentation for ISO 26262 compliance, and for random faults there is extra work that has to be done.

    • However, for systematic failure analysis, you just need a slightly more inclusive mindset than you have for functional verification.

  • Further, SOTIF’s new safety standard approach was crucially triggered with HAV and AV interaction with the road in mind.

  • There are a number of unknown and unsafe conditions out there that affect how an AV responds, which the SOTIF standard is going to address.

  • And what’s more important is that its introduction will pose an innovative take on safety standardization.

From System Failure to System Complexity

  • SOTIF has been released under the ISO designation PAS 21448.

  • It covers the validation and the verification of systems with complex sensing and algorithms, whose limitations in performance could cause safety hazards in the absence of a malfunction.

  • It focuses on verifying and demonstrating the safety of a system, not only from its functional verification perspective but also from its sheer complexity. And this is the true value of the new standard.

  • SOTIF is a gargantuan yet complementary effort to deliver safety completeness.

  • The complexity problem can then be mastered by introducing safety requirements in enough steps that each step can be verified with regard to safety correctness and completeness.

  • However, many questions still remain as to the safety challenges that the industry faces to apply automated driving systems on the street:

    • from how the new and more complex artificial intelligence will adapt and respond to the road environment and offer a safe vehicle personality,

    • to how OEMs will deal with the recurrent system updates as a car gains mileage?

  • For now, SOTIF is still the beginning of a new and exciting part of the automotive industry’s next big development.

    • It begins to answer some of the key questions facing the automotive industry as autonomy increases.

ISO PAS 21448

  • The new PAS 21448 standard revolves around the topic of Safety of the Intended Functionality (SOTIF).

    • The idea is to assess whether an application’s intended function may involve certain hazards.

  • This goes above and beyond protecting a system from malfunctions, as has been looked at until now under functional safety (ISO 26262).

    • A number of additional assessment methods are required for this new task.

  • With complex development projects involving modern assistance systems for automated driving, evaluating desired functions takes a considerable amount of time.

  • PAS 21448:2019 takes this need into account.

    • The standard lays down an evaluation framework for identifying and assessing any possible hazards for road users posed by target functions.

  • SOTIF (PAS 21448) focuses on the nature and characteristics of functions.

  • Until now, ISO standard 26262 only dealt with avoiding malfunctions.

    • This approach is no longer sufficient for advanced driver assistance systems technology used in autonomous driving.

  • SOTIF establishes a methodology for evaluating target functions, also providing an appendix offering guidelines and examples of suitable procedures.

What Every Engineer Should Know About Fail Silent

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Nanoscience & Technology

To develop “Industrial Design Engineering: Inventive Problem Solving”, we need to know about

  • Fail-silent: the system recognizes that it is receiving the wrong information due to a fault, so the ongoing operation moves to degraded mode.

Qualitative Analysis – From Fail Safe to Fail Silent

Different applications have different safe state conditions:

  • In some cases the system architect prefers a hard stop, such as a reset or fail-safe pin activation.

  • In other scenarios a soft stop or a degraded mode may be preferred, as this allows application continuity.

Example. Battery Management Systems (BMS) are a perfect example of the second case, and this is the main driver for enabling fail-silent architectures.

Fail-silent mode is a new software configurability offered at the hardware level, providing a flexible safety behavior that is adapted to the multiple safety goals of the application. Reset and fail-safe activation are configurable and safe, meaning the right level of dependability at the system level can be selected.

In the BMS example above, this specific solution enables the system to work in degraded mode, even after the failure, with the right level of availability to continue managing the car’s energy in the E/E architecture.

In safety-relevant systems of today’s vehicles, the most frequent reaction to a fault is to deactivate or reset the faulty function. This is referred to as fail-silent. It is easy to implement this type of solution, and it is effective for achieving a safe state and maintaining it. However, E/E systems in the vehicle are increasingly assuming other functions that must remain available in case of a fault, e.g. when a micro-controller fails. This behavior is referred to as fail operational and in the following as fault tolerant. In the future, the demand for fault tolerant systems will increase substantially in manufactured automobiles. One example: in some of today’s heavy SUVs, it is necessary to keep steering assist systems active to assure that drivers can handle steering safely.

Modular Safety Concept for Fail-Silent Systems

Functional Safety Engineers (FSEs) use a modular concept to efficiently tailor the various safety mechanisms to a specific project. Here they make a rough distinction between measures for micro-controller integrity, measures for functional monitoring, and comprehensive measures. Measures for establishing the integrity of micro-controllers are selected according to the highest Automotive Safety Integrity Level (ASIL) of the software that is used. They are independent of the function to be performed, and they are determined by the required diagnostic coverage (DC) for a specific ASIL.

Example. Micro-controller manufacturers often set specific requirements based on their safety analyses. For example, a DC for ASIL D requires Built-In Self-Tests (BITs) that are executed periodically by the software. Generally, starting with ASIL B the probability of occurrence of so called Single Event Upsets (SEUs) must be considered. Micro-controllers in lock-step mode and memory with Error Detection and Correction Codes (ECC RAM, ECC ROM) offer effective protection against SEUs. Both safety mechanisms are realized in the hardware, are nearly transparent to software development, and are therefore very efficient solutions.

The developer normally implements additional mechanisms in the application to perform functional monitoring. They include monitoring tasks for sensors and actuators, as well as limiters and program flow monitoring (logic monitoring). Program flow monitoring can be achieved with an AUTOSAR watchdog, for example. Functional monitoring and microcontroller integrity are defined and implemented according to the specific project. However, there are also mechanisms that are used in nearly every safety-related ECU and are independent of functionality and ASIL. Almost every ECU with ASIL software also executes QM software. To ensure coexistence according to ISO 26262, memory separation and monitoring of time constraints are needed.

  • Memory partitioning is realized by an AUTOSAR operating system with Scalability Class 3 (SC3) that controls a memory protection unit (MPU) with the required ASIL.

  • The watchdog usually handles monitoring of time requirements by deadline monitoring.

  • As soon as safety-related data is exchanged between more than one ECU, communication protection comes into play. AUTOSAR offers an effective safety mechanism for this purpose in the form of end-to-end protection (E2E).
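
The idea behind end-to-end protection, guarding safety data against corruption, repetition, and loss on its way between ECUs, can be sketched as a counter-plus-CRC wrapper. The following Python sketch is a simplified illustration, not the AUTOSAR E2E profile itself.

```python
import zlib

def e2e_protect(payload: bytes, counter: int) -> bytes:
    """Append an alive counter and a CRC so the receiver can detect
    corruption, repetition, and loss (simplified E2E-style wrapper)."""
    body = bytes([counter & 0xFF]) + payload
    return body + zlib.crc32(body).to_bytes(4, "big")

def e2e_check(frame: bytes, last_counter: int) -> bytes:
    body, crc = frame[:-4], frame[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != crc:
        raise ValueError("CRC mismatch: corrupted frame")
    counter = body[0]
    if counter == last_counter:
        raise ValueError("repeated frame")
    if (counter - last_counter) % 256 > 1:
        raise ValueError("lost frame(s)")
    return body[1:]  # payload survived all checks

frame = e2e_protect(b"\x12\x34", counter=7)
print(e2e_check(frame, last_counter=6))  # prints the original payload
```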

Fail-Safe architecture

Should an automated system fail, manufacturers have two methods to meet the ‘Fail Safe’ requirements that guarantee the continued safe operation of the vehicle:

  • In the safety-relevant systems of today’s vehicles, the most common response to a failure is to deactivate or reset the faulty function - this is known as Fail-Silent.

  • Fail Silent system development is well covered by ISO 26262. While easy to implement, it is effective in achieving a safe state and preserving it. 

  • Systems with Fail Operational behavior, on the other hand, maintain full or degraded functionality after a malfunction has been detected.

In higher level automated driving systems it is no longer sufficient to simply deactivate a function to reach a safe state. The safe state has to ensure continued, if reduced, power and functionality. Thus, systems with conditional automation L3 are generally Fail Operational systems, implemented as 1oo2D (one-out-of-two with diagnostics).

In these highly automated systems functional operations must continue in case of a fault - a behavior referred to as Fail Operational with Fault Tolerant capability.

When designing such a system, it is important that the architecture is taken into account for the quantification of the effect of random hardware failures. Usually, a safety system is considered as a serial system of three subsystems:

  • sensor(s),

  • logic solver(s), and

  • final element(s).

That is, the system is able to perform its safety function if and only if all these subsystems are able to perform their respective safety sub-functions.

  • Fail-silent systems enter a state that does not interfere with other safety related systems in case of a failure.

  • Fail-silent means that, after one or several failures, the system switches into a state where it does not interfere with other components.

This CRC Press News uses the following terms as types of behavior:

  • Elements (i.e. systems, HW- or SW-elements) showing fail-safe behavior enter an active or passive safe state, and cease to perform their functions, in case of a certain amount of failures.

  • Elements (i.e. systems, HW- or SW-elements) showing fail-silent behavior enter a safe state with no interference with other elements, and cease to perform their functions, in case of a certain amount of failures.

  • Elements (i.e. systems, HW- or SW-elements) showing fail-operational behavior continue to perform a defined set of their intended functions to a defined extent, with a defined performance, for a defined time in case of a certain amount of failures. For example, the term fail-operational system denotes a system that shows fail-operational behavior in case of a defined number of failures.

Fail-silent strategy for bus communication

For safety-critical redundant systems, deterministic timing is of paramount importance. Protocols like the Time Triggered Protocol (TTP) and ARINC 664-7 solve this issue by introducing bus synchronization and pre-determined time slots for the bus messages.

  • This way, each system connected to the bus knows when to send and when to receive signals. This prevents signals from going missing due to high bus load, a failure mode that is difficult to handle in fault-tolerant systems.

  • An additional advantage is that the “age” of signals used for voting is well known, making voter designs much easier.

The automotive FlexRay protocol is also a time-triggered protocol. Another interesting aspect of fault tolerance is the concept of bus guards. All bus systems with a single physical layer (e.g., one twisted shielded pair of cable) are prone to the “babbling idiot” problem, the failure mode where one communication node blocks bus access by continuously sending messages. To handle this issue, the TTP bus uses bus guards to decouple nodes that would disturb the communication on the bus they are connected to. This is a classic fail-silent strategy for bus communication.
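
The bus-guard idea can be illustrated with a tiny time-slot check: a node may transmit only inside its pre-assigned slot, and anything outside it is cut off as a potential babbling idiot. The slot layout and node names below are invented for the example.

```python
# Illustrative TDMA-style bus guard: block transmissions outside a node's slot.
SLOT_LEN_MS = 2.5
SLOT_OF_NODE = {"brake_ecu": 0, "steer_ecu": 1, "radar_ecu": 2, "spare": 3}
CYCLE_SLOTS = 4

def guard_allows(node: str, t_ms: float) -> bool:
    """True if the node's transmission at time t_ms falls inside its own slot."""
    current_slot = int(t_ms / SLOT_LEN_MS) % CYCLE_SLOTS
    return SLOT_OF_NODE[node] == current_slot

print(guard_allows("brake_ecu", 1.0))   # True: slot 0 covers 0.0-2.5 ms
print(guard_allows("brake_ecu", 3.0))   # False: the guard decouples the node
```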

Functional safety requirements

Functional safety is related to minimizing the hazards resulting from a faulty system. The faults in a system may occur because of hardware/software errors, permanent/transient errors, or random/systematic errors. The following are the possible reactions when an error occurs:

  • Fail-dangerous: Possibly causes a hazard in the case of a failure

  • Fail-inconsistent: Provided results will be noticeably inconsistent in the case of a failure

  • Fail-stop: Completely stops itself in the case of a failure

  • Fail-safe: Returns to or stays in a safe state in the case of a failure

  • Fail-operational: Continues to work correctly in the case of a failure

  • Fail-silent: Will not disturb anyone in the case of a failure

  • Fail-indicate: Indicates to its environment that it has failed

The implementation of functional safety in a system typically means “mapping” the first three types of reactions above into any of the last four reactions, which ensures that minimal hazards result from the system failure.
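
This mapping can be pictured as a small reaction policy: given a detected error class, the system selects a fail-safe, fail-silent, fail-operational, or fail-indicate response so that no dangerous reaction is left unmapped. The error classes and policy entries below are hypothetical.

```python
from enum import Enum

class Reaction(Enum):
    FAIL_SAFE = "enter or stay in a safe state"
    FAIL_SILENT = "stop output, do not disturb other systems"
    FAIL_OPERATIONAL = "continue in degraded mode"
    FAIL_INDICATE = "signal the failure to the environment"

# Hypothetical policy mapping detected error classes to safe reactions.
POLICY = {
    "sensor_implausible": Reaction.FAIL_SILENT,              # drop own output
    "mcu_lockstep_mismatch": Reaction.FAIL_SAFE,             # hard stop/reset
    "single_battery_cell_fault": Reaction.FAIL_OPERATIONAL,  # degraded mode
    "watchdog_deadline_miss": Reaction.FAIL_INDICATE,
}

def react(error: str) -> Reaction:
    # Default to the safe state for any unanticipated error class.
    return POLICY.get(error, Reaction.FAIL_SAFE)

print(react("sensor_implausible").value)
```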

With “Industrial Design Engineering: Inventive Problem Solving”, safety architectures and system design aim to enable full redundancy to facilitate higher levels of autonomous driving and fault tolerance in the case of failure.

Radar and Functional Safety technology for Advanced Driving Assistance Systems (ADAS)

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - Industrial & Manufacturing, Ergonomics & Human Factors, Information Technology, Nanoscience & Technology, Occupational Health & Safety

This CRC Press News will describe advanced development in 77 GHz radar technology, enabling smaller and better collision avoidance systems. Then new developments in functional safety chipset solutions, MCU and analog, will be explained. The combination of these technologies forms a comprehensive safety solution for advanced driver assistance or autonomous driving. The development of these technologies is driven by the automotive market and can be redeployed to many other types of mobile machines.

Radar systems will become more and more prevalent in cars in the near future. They offer a number of comfort and safety applications.

  • Short range radar from a few centimeters up to 30 meters can be used for

    • blind spot detection,

    • backing aid or parking slot measurement to guide the car to self-park.

  • Long range radar up to 250 meters can be used to enable an adaptive cruise control aligned with the speed of the preceding car.

More critical functions can be enabled such as

  • collision warning,

  • emergency braking,

  • pre-crash sensing that could trigger seat-belt tensiometer, or

  • other active and passive safety features.

With these latter functions, it is obvious that the electronic control system needs to reach the highest functional safety level, as the system will eventually steer or brake the car without driver intervention.

More and more applications and a steadily increasing market penetration are showing the success of radar-based driver assistance systems. While in recent years most of those systems focused on advancing driver comfort, today many safety applications are offered. As those systems can directly influence the vehicle dynamics, functional safety in terms of normative requirements, such as ISO 26262, is gaining more interest. A built-in self-test is able to monitor multiple receiver paths by measuring the amplitude and phase imbalance among all channels.

The advances in this radar technology development can be leveraged in other applications such as mobile industrial machines, cranes, and factory safety equipment, where an area needs to be closely protected. Coupling radar with machine vision can also create a powerful combination, with both technologies supplementing each other to create more accurate and reliable systems. Radar works through rain, fog, and dirt when vision does not. Radar also extends further in distance and even works without a direct line of sight. A system combining vision and radar with a smart sensor fusion algorithm could leverage the benefits of both sensing technologies.

The performance of a radar system may be affected by failures of the system components and by environmental influences that can lead to a critical state. However, in contrast to the handling of E/E architecture failures, ISO 26262 gives no explicit requirements for avoidance or mitigation of environmental influences on the sensor detection performance.

77 GHz Radar Technology

In a collision warning system, a 77 GHz transmitter emits a signal that is reflected from objects ahead and captured by multiple receivers integrated throughout the vehicle. The transmitter emits a frequency modulated continuous wave (FMCW) signal, meaning that the frequency varies up and down over a fixed period of time, typically following a triangle wave. Since radio waves propagate at the constant speed of light, distance can be calculated by measuring the frequency difference between the transmitted and received waves, knowing the frequency slope over time. Speed measurement uses the Doppler effect, i.e., the difference between the observed reflected signal frequency and the emitted frequency.
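
To make this calculation concrete, the following sketch (Python, with assumed chirp parameters that are illustrative rather than taken from any particular radar product) combines the up-ramp and down-ramp beat frequencies of a triangular FMCW waveform to recover range and relative speed:

```python
# Minimal FMCW range/speed sketch with assumed (hypothetical) chirp parameters.
C = 3.0e8          # speed of light, m/s
F_CARRIER = 77e9   # carrier frequency, Hz
BANDWIDTH = 500e6  # assumed sweep bandwidth, Hz
T_SWEEP = 1e-3     # assumed sweep (ramp) duration, s

def range_and_speed(f_beat_up, f_beat_down):
    """Combine up-ramp and down-ramp beat frequencies (Hz).

    For a triangular FMCW waveform, the range-induced beat frequency
    appears in both ramps, while the Doppler shift adds on one ramp and
    subtracts on the other, so the two effects can be separated.
    """
    f_range = (f_beat_up + f_beat_down) / 2.0    # range component
    f_doppler = (f_beat_down - f_beat_up) / 2.0  # Doppler component
    slope = BANDWIDTH / T_SWEEP                  # chirp slope, Hz/s
    distance = C * f_range / (2.0 * slope)       # R = c * f_r / (2 * S)
    speed = C * f_doppler / (2.0 * F_CARRIER)    # v = c * f_d / (2 * f_c)
    return distance, speed

# Example: beat frequencies of 66 kHz (up ramp) and 68 kHz (down ramp)
d, v = range_and_speed(66e3, 68e3)
print(f"range ~ {d:.1f} m, closing speed ~ {v:.2f} m/s")
```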

Radar systems are not new. What is new is that car makers want to include them in mid-range cars within a few years, so the system has to be really low cost and high quality. This is a big shift from specialized and costly radar systems to standard car equipment. The challenge is then to reduce cost while actually improving quality and defects per million.

Radar sensors use a limited frequency bandwidth and a limited measurement time to sense an environment which exhibits a very broad range of complexity, dynamics and parasitic effects:

  • The temperature can differ drastically between the cold start and after a multi-hour ride in summer

  • The environment fluctuates within short time

  • The environmental complexity differs drastically between city and highway traffic

  • The typical behavior of car drivers differs noticeably from country to country

  • Radar waves are attenuated with the fourth power of the distance to an object

  • The reflection coefficient of a target object differs by a factor of more than 100 between a person / motorbike and a lorry or in a multi-storey car park

  • Radar wave propagation is disturbed by dirt, heavy rain or snow

  • Parasitic Doppler frequency shifts due to rotating fans, vibrating parts, …

  • Road infrastructure like guard rails or tunnel walls reflects radar waves, causing multipath propagation

  • Signals of different noise sources are superimposed on the actual measurement signal

  • The separation and object discrimination capability of a radar device is limited and thus may lead to the misinterpretation or wrong clustering of distributed targets

These effects are added to sensor-internal non-ideal effects like limited Voltage-Controlled Oscillator (VCO) phase noise or limited isolation between the transmitting and receiving paths.

Functional Safety Microcontroller

A micro-controller is used to control the RF radar transmitter and to process the data coming from the receiver. Given the critical safety nature of the application, a functional safety MCU is used. The challenge for safety engineers is to architect their system in a way that prevents dangerous failures or at least sufficiently controls them when they occur.

Dangerous failures may arise from:

  • Random hardware failures

  • Systematic hardware failures

  • Systematic software failures

The functional safety standard IEC 61508 and its automotive adaptation ISO 26262 are applied to ensure that electronic systems in general industry and automotive applications are acceptably safe.

  • The IEC 61508 document defines four general Safety Integrity Levels (SILs) with SIL 4 denoting the most stringent safety level.

  • The ISO document defines four Automotive Safety Integrity Levels (ASILs) with ASIL D denoting the most stringent safety level.

Each level corresponds to a range of target likelihood of failures of a safety function.

There is no direct correlation between the SIL and ASIL levels. ISO 26262 takes the safety process and requirements to a deeper level. From the beginning of the design process, evidence must be collected to show that the product has been developed according to regulation standards. Any potential deviations that have been identified must be documented to ensure that adequate mitigation is in place. There are different ways to implement safe MCUs.

  • The traditional technique is to use two separate MCUs to duplicate the software on physically different controllers. The same software can be run identically on each MCU and the results compared. If they are the same, all is good; if not, the system knows there is an error and either resolves it and/or puts the system into a safe state (this compare-and-react pattern is sketched after the list).

  • Another option is that one MCU only runs safe software and monitors the other MCU, which is running the application software.
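
As an illustration of the first option, here is a minimal Python model (a conceptual sketch, not production MCU code; the control law is a placeholder) of the compare-and-react pattern:

```python
# Simplified model of a dual-MCU "compare and react" safety pattern.
# Both channels run the same control law; any disagreement is treated
# as a detected fault and the system is driven to a safe state.

def control_law(sensor_value):
    # Placeholder control computation (hypothetical).
    return 2 * sensor_value + 1

def safe_state():
    print("Mismatch detected: commanding safe state (e.g., disable actuator)")

def dual_channel_step(sensor_a, sensor_b):
    out_a = control_law(sensor_a)   # computed on MCU A
    out_b = control_law(sensor_b)   # computed on MCU B
    if out_a == out_b:
        return out_a                # agreement: output is used
    safe_state()                    # disagreement: fail-safe reaction
    return None

print(dual_channel_step(10, 10))    # normal case -> 21
print(dual_channel_step(10, 11))    # injected fault -> safe state
```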

Decreased S/N

The most important challenge in radar signal processing, when evaluating the signal spectra in the frequency domain, is the selection of the best threshold level in the presence of thermal noise fluctuations and clutter effects. Moving the threshold level too high above the thermal noise floor reduces the target detection probability, especially for weak target reflections that are only a few dB above the noise level. On the other hand, when setting the threshold level too close to the noise floor, random noise peaks may trigger false alarms by surpassing the threshold level. With the Neyman-Pearson criterion, a decision rule is constructed that has a maximum probability of detection while not allowing the probability of false alarm to exceed a certain value. In [FPR], a relation between probability of detection, Signal-to-Noise Ratio (SNR) and false alarm rate for a sinusoidal signal can be derived with the following assumptions:

  • There are only thermal, Gaussian noise fluctuations in the radar signal.

  • There are no other radar interferers or environmental clutter present.

  • Radar signal processing using the polar coordinates with Rayleigh distributed noise yields valid results.

  • The target detection probability is a function of the reflected signal strength and the threshold level.

  • The false alarm rate is a function of noise statistics and the threshold level.

  • A 1 MHz victim receiver bandwidth (i.e., 10^6 noise pulses per second that may cause false alarms) is assumed.

  • Ideal transmitter and receiver components (no non-linearity, VCO phase noise, receiver noise figure,…) are assumed.

  • An ideal target (not distributed, i.e. all reflected energy in a single sinusoidal waveform) is assumed.

Monte Carlo Simulations reveal that for a lower false alarm rate the SNR has to be higher and that this holds also true for the probability of detection. Broadband interference in the radar signal spectrum decreases the SNR.
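
The following Monte Carlo sketch (Python with NumPy; the noise level, false alarm target and SNR values are illustrative assumptions) estimates the detection probability of a sinusoid in Gaussian noise for a threshold chosen from a desired false alarm probability, using the Rayleigh envelope relation described above:

```python
# Monte Carlo sketch of detection vs. false alarm for a sinusoid in
# Gaussian noise (envelope detection, Rayleigh-distributed noise).
# Parameter values are illustrative assumptions, not from the text.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                      # noise standard deviation per I/Q channel
p_fa_target = 1e-4               # desired false alarm probability per cell
# Rayleigh envelope: P_fa = exp(-T^2 / (2 sigma^2))  =>  solve for T
threshold = sigma * np.sqrt(-2.0 * np.log(p_fa_target))

n_trials = 200_000
noise = rng.normal(0, sigma, (n_trials, 2))          # I and Q noise samples
for snr_db in (8, 10, 12, 14):
    amp = sigma * np.sqrt(2 * 10 ** (snr_db / 10))   # SNR = A^2 / (2 sigma^2)
    sig = noise + np.array([amp, 0.0])               # sinusoid on I channel
    p_d = np.mean(np.hypot(sig[:, 0], sig[:, 1]) > threshold)
    print(f"SNR {snr_db:2d} dB -> estimated P_d ~ {p_d:.3f}")

p_fa = np.mean(np.hypot(noise[:, 0], noise[:, 1]) > threshold)
print(f"estimated P_fa ~ {p_fa:.1e} (target {p_fa_target:.0e})")
```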

Functional Safety Companion Device

To support a total system solution for functional safety applications, a class of companion power System Basis Chips (SBCs), combining both a safety monitor role for the MCU and power supply generation, is needed.

These SBC devices provide power to MCUs and other system loads and optimize energy consumption through low-power saving modes. They also typically integrate physical layer interfaces and a serial peripheral interface to allow control and diagnostics with the MCU. The combination of the MCU and analog system basis chip, designed as a Safety Element out of Context (SEooC), facilitates the assessment of the safety of a system. This architecture enables the number of components at the system level to be reduced, addresses the functional safety requirements and increases reliability. Four safety measures, with the watchdog interaction sketched after this list, are implemented to secure the interaction between the MCU and SBC:

  • uninterrupted supply

  • fail-safe inputs to monitor critical signals

  • fail-safe outputs to drive fail-safe state and

  • watchdog for advanced clock monitoring
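
The watchdog interaction can be pictured with a small model. The sketch below is conceptual only; real SBCs define their own challenge-response transform, register interface and timing windows. A correct answer must arrive inside the window, otherwise the SBC would assert its fail-safe outputs:

```python
# Conceptual model of a challenge-response (question/answer) watchdog
# between an MCU and a safety companion SBC. The timing window and the
# answer function are illustrative assumptions.
import time

WINDOW_OPEN_S, WINDOW_CLOSE_S = 0.010, 0.050  # answer accepted in [10, 50] ms

def expected_answer(challenge: int) -> int:
    # Hypothetical transform both sides agree on (real SBCs define their own).
    return (~challenge) & 0xFFFF

def sbc_check(challenge, answer, elapsed_s):
    """Return True if the MCU answer is correct and inside the window."""
    if not (WINDOW_OPEN_S <= elapsed_s <= WINDOW_CLOSE_S):
        return False                      # too early or too late
    return answer == expected_answer(challenge)

# MCU side: compute the answer and reply within the window.
challenge = 0x1234
t0 = time.monotonic()
time.sleep(0.02)                          # simulated MCU processing time
answer = expected_answer(challenge)
ok = sbc_check(challenge, answer, time.monotonic() - t0)
print("watchdog OK" if ok else "fail-safe outputs asserted")
```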

System versus chipset compliance

Functional safety compliance is achieved at system level, which is the responsibility of the system designer. The MCU and SBC chip set are designed independently of their final application, which can be a car braking system, an Advanced Driver Assistance System or a moving crane. The chip set is thus developed by treating it as a Safety Element out of Context (SEooC). An SEooC is a safety-related element which is not developed in the context of a particular vehicle function or end application, following the Industrial Design Engineering guideline for developing SEooC components from the ISO 26262 specification.

Functional Safety of Multi-Core Processing

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology, Nanoscience & Technology

Functional safety is a formal discipline that analyses a system to identify:

  • Safety Function Requirements - what safety functions are performed by the system

  • Safety Integrity Requirements - what degree of certainty is necessary that the safety function will be performed without a dangerous failure

The objective of Functional Safety is freedom from unacceptable risk of physical injury or of damage to the health of people either directly or indirectly (through damage to property or to the environment).

The complexity and electrification of automobiles increase steadily, which raises the safety requirements on the E/E components and controllers involved and on their connections and interactions. As a fundamental safety-relevant standard, the E-Gas monitoring concept was introduced in the '90s. Recently, a new version of this E-Gas monitoring concept has been published, which includes the application of multi-core processors.

Each architecture (Power, X86, ARM) is available in a number of configurations, enabling vendors to meet a range of computing and power requirements within their product lines. In addition, some products offer built-in accelerators that may or may not be of value for a given application.

In order to fulfill safety standards/regulations such as ISO 26262, there is a need to implement redundant calculations which can be compared to each other during run-time in order to increase the reliability of the systems and finally deliver products which are state of the art. Implementing redundancy concepts on one chip is of course subject to Common Mode Failure (CMF), because the hardware is shared; however, calculations have shown that many failures happening in software can be detected and handled by redundancy concepts in multi-core systems.

The Power and ARM architectures are based on a Reduced Instruction Set Computer (RISC) design, while the X86 processors are Complex Instruction Set Computers (CISC). This is really a simplification, since Power and ARM A-core processors use some complex operations, so they can't be considered a pure RISC design. In comparison, Power and ARM have a simpler instruction set than x86.

Automotive OEMs are requiring semiconductor suppliers to develop ADAS system-on-chips (SoCs) and modules in compliance to the ISO 26262 Functional Safety standard. Safety critical systems rely on SoCs to meet Automotive Safety and Integrity Levels (ASIL) specific to each application. Using certified IP also helps SoC designers reduce supply chain risk and accelerate the requirements specification, design, implementation, integration, verification, validation and configuration of their SoC-level functional safety.

The processor IP plays an important role in the ISO 26262 certification/compliance process. All three architectures (Power, X86, ARM) contain protected Intellectual Property (IP). For high-integrity use, knowledge of internal processor information is required to understand where resource contention exists that can affect deterministic operation. For Power and X86, this IP is considered to be a trade secret and access to it is dependent on the supplier. ARM IP is also protected but can be purchased. Recent high-end ARM cores are approaching the performance of Power cores (e6500), particularly if the e6500 accelerators are disabled for high-integrity use.

CISC processors, such as X86, perform multiple operations with a single instruction. Compared to a RISC processor, additional hardware is included in each core to accomplish the complex instruction. As a result the compiled instructions are more compact, requiring less code memory for a given function. In general, the additional transistors required to support CISC will also require additional power. That said, Intel has invested heavily in reduced transistor feature size which mitigates some of the power consumption differences.

Pure RISC processors (Power, ARM) have a minimal set of instructions that operate in a single clock cycle. This simplicity reduces the number of transistors required to support the instruction set; however, more instructions are required to perform a given function compared to a CISC design. As a result, code space memory size will be higher for a pure RISC application compared to a similar CISC application. The RISC architectures lend themselves to safety-critical applications, because processor characterization of deterministic behavior is easier to accomplish with a simpler design. With fewer transistors, RISC processors tend to be more power-efficient than CISC processors. ARM-based products have been extensively used in battery-powered or otherwise low-powered devices such as smartphones.

Summary

Multicore architectures seem to be an interesting option for safety-related applications. PC and server-class processor computing performance is achieved by integrating multiple processing cores into a single package. The multiple cores share some level of resources that affect determinism while generally providing higher performance. Functional safety certification/compliance has become more complicated as a result of MCP development because of reduced determinism and potential contention.

For Autonomous Vehicles, product certification/compliance using MCPs requires showing evidence of meeting the functional safety requirements set forth by ISO 26262. Compared to SCPs, MCP integrators are more reliant on supplier-provided MCP information to understand internal resource contention (interference channels) and how the MCP arbitrates contention.

Multicore parametrization/programming/debugging is challenging. For computing-intensive applications such as Advanced Driver-Assistance Systems (ADAS), multiprocessor and multi-core processor systems are increasingly being used in embedded systems. The CRC Press News titled “Multi-Core Processor Functional Safety Considerations for Autonomous Vehicles” shows how requirements for a multi-core system design can be implemented in compliance with safety standards such as ISO 26262. Accordingly, safety-relevant system-level requirements have to be broken down to the subordinate system components (e.g., controllers) as well as the hardware and software architecture. In safety-critical application domains, there are also requirements and limitations that must be taken into account during partitioning, mapping, and distribution processes. For the distribution of software components to the individual processing cores, various algorithms are commonly used, since a manual distribution under consideration of all constraints is error-prone, time-consuming, and ineffective. The CRC Press News titled “Multi-Core Processor Functional Safety Considerations for Autonomous Vehicles” finally presents a way to determine and identify requirement types along with methods and tools that are supported by A Notional MCP Software Architecture for Autonomous Vehicles.

Multi-Core Processor Functional Safety Considerations for Autonomous Vehicles

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology, Nanoscience & Technology

Introduction

Autonomous Vehicles require safety-critical processing of information. Power Architecture processors have dominated safety-critical processing since the late 1990s, when major processing suppliers exited the MIL-qualified and/or aviation-certified markets. Since that time, four trends have emerged:

  • Military and commercial safety certification has become more rigorous

  • Server/desktop architectures have focused on performance at the expense of determinism

  • System-on-Chip (SoC) architectures are offered, with multiple processing cores (multicore) in a single package to increase performance over single-core processors

  • The industrial automation industry is increasing safety requirements for autonomous manufacturing, and the automotive industry is offering driver assistance, including autonomous operation, creating a large market for relatively low-power, high-integrity processing

Although the Power Architecture will remain a viable safety-critical processor technology for some time, new industrial processing products and architectures are poised to enter (or re-enter) the markets of safety-critical processing. Safety-critical processing markets of different industries require the following similar capabilities:

  • Longer product availability lifetimes (5-15 years) than consumer/server-grade processors

  • Low power draw

  • Extended temperature operation

  • High safety integrity

This CRC Press News introduces the microprocessor industry support and certification issues. High-level activities to bring safety critical products using Multi-Core Processors (MCPs) are also discussed, based on current industrial MCP development, with all cores operational, supporting Design Assurance Level (DAL)/Safety Integrity Level (SIL)/Automotive Safety Integrity Level (ASIL).

Industrial suppliers are currently at a crossroads where new processing products are offered in SoC and MCP packages and alternative computing architectures are challenging the Power architecture. The initial investment to provide certifiable products employing MCPs can easily cost more than $10M, which makes processor choice important. Thus, the modern industry is faced with a decision to continue with the Power architecture products, which have reliably served the safety-critical industries for nearly two decades, or specify potential competitors such as x86 and recent A-core ARM MCPs.

  • The focus of this CRC Press News is on medium power (15-45W), medium-high performance processors that provide that performance by using multiple cores.

  • This CRC Press News doesn’t pick a winner as there probably isn’t a single solution for all industrial applications, and at this time there remain many activities underway by MCP suppliers that will affect trade studies.

  • However, this CRC Press News does describe the hurdles in bringing a new computing architecture to markets in industrial products, and describes alternative MCP supplier’s interest and plans for the industrial markets.

MCP Functional Safety Certification Technical Issues

Faced with little opportunity to increase the processing power of Single-Core Processors (SCPs), processor suppliers increase performance by providing multiple cores in a single package. When software applications share resources, potential resource contention exists that must be resolved for deterministic safety-critical systems. This condition is called an interference channel. “Virtual Machine” software partitioning solved application-to-application resource contention on SCPs in the late 1990s. Today the use of software partitioning in safety-critical systems is widespread, and this integrity technology is extended to MCPs.

MCP architectures complicate interference mitigation strategies by adding another layer of potential contention (core-to-core) in the architecture. In this case, the MCP supplier provides contention arbitration mechanisms, not the integrator. This means that for safety-critical MCP applications, the integrator must have more intimate knowledge of machine-level arbitration than in the case of SCPs, and that knowledge must be provided by the MCP supplier.

Take the QorIQ T2080 internal architecture as an MCP example to demonstrate interference channels.

  • The T2080 MCP includes four Power e6500 cores that can be dual-threaded as Thread 1 and Thread 2 (T1 and T2).

  • Each of the four e6500 cores has dedicated data and instruction caches.

  • If dual-threading is enabled, an interference channel exists for threads sharing L1 cache memory on the same e6500 core.

  • If dual-threading is not enabled, no L1 cache interference channels exist which simplifies the interference mitigation strategy.

  • However, the L2 cache is shared by the four cores, so an interference channel exists there.

Other areas of potential contention are external memory access through the shared memory controller and Input/Output channels. The Coherency Fabric provides resource access arbitration mechanisms.
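
The observations above can be condensed into a small configuration check. The sketch below uses a simplified, non-exhaustive channel list derived from this discussion, not from vendor documentation; it enumerates the interference channels an integrator would need to analyze for a T2080-like device:

```python
# Simplified enumeration of interference channels for a T2080-like MCP,
# following the discussion above (not an exhaustive or vendor-verified list).

def interference_channels(num_cores: int, dual_threading: bool):
    channels = []
    if dual_threading:
        # Two threads on one core share that core's L1 caches.
        channels += [f"core{c}: T1/T2 share L1 cache" for c in range(num_cores)]
    # All cores contend for the shared L2 cache, memory controller and I/O.
    channels.append("all cores: shared L2 cache")
    channels.append("all cores: shared memory controller (external memory)")
    channels.append("all cores: shared I/O via coherency fabric")
    return channels

for ch in interference_channels(num_cores=4, dual_threading=False):
    print(ch)
```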

Example: A Notional MCP Software Architecture for Autonomous Vehicles

Let’s discuss a notional MCP software architecture. Three core configurations can be implemented on four cores or more in an MCP package. Application software executes in Virtual Machines (VMs) that are enabled in two ways,

  • virtualized by a hypervisor, which configures and initializes the VMs then recedes into the background, only to respond to exceptions, and

  • by partitioning provided by Guest Operating Systems (GOSs).

The hypervisor and GOS products are typically provided by a 3rd-party supplier, and other “foundation software” such as Board Support Packages (BSP) and software drivers are typically developed by the integrator.

  • A hypervisor or virtual machine monitor (VMM) is computer software, firmware or hardware that creates and runs virtual machines.

  • A computer on which a hypervisor runs one or more virtual machines is called a host machine, and each virtual machine is called a guest machine.

In this notional MCP software architecture, the hypervisor and low-layer BSP and driver software are common to all cores; however, the integrator can select a Guest Operating System (GOS) that minimizes impact to existing application software, which can greatly reduce application software porting costs.

Other software architecture examples may include an OS that integrates hypervisor and GOS together. While there are advantages to these types of products, they may not support GOS products from other suppliers.

For Autonomous Vehicles’ applications, three possible core configurations, summarized in a sketch after this list, are:

  • The Core 0 GOS components support ISO 26262 partitions.

  • The other cores rely on the hypervisor to provide partitioning.

  • The processing space within each partition is statically allocated processing, memory and I/O resources, and is called a Virtual Machine (VM), since the application software running within a partition has no knowledge of any other application executing on the same core.

  • The Core 1 has a single partition with a General Purpose GOS, such as Linux, as an example.

  • Core 2 configuration has a separate GOS in each of the partitioned VMs.
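
One minimal way to picture this notional architecture is as a static allocation table. The sketch below is hypothetical throughout: the VM names, memory sizes and time slices are invented solely to illustrate how the three core configurations differ:

```python
# Hypothetical static allocation table for the notional MCP software
# architecture described above; all names and resource budgets are
# invented for illustration only.
NOTIONAL_CONFIG = {
    "core0": {  # GOS provides ISO 26262 partitioning itself
        "partitioning": "safety GOS (ISO 26262 partitions)",
        "vms": [
            {"name": "brake_monitor", "ram_mb": 64, "time_slice_ms": 5},
            {"name": "sensor_votes",  "ram_mb": 32, "time_slice_ms": 5},
        ],
    },
    "core1": {  # single partition with a general-purpose GOS
        "partitioning": "hypervisor (single partition)",
        "vms": [
            {"name": "linux_infotainment", "ram_mb": 512, "time_slice_ms": 20},
        ],
    },
    "core2": {  # hypervisor-partitioned VMs, each with its own GOS
        "partitioning": "hypervisor (multiple partitions)",
        "vms": [
            {"name": "vision_rtos", "ram_mb": 256, "time_slice_ms": 10},
            {"name": "logging_gos", "ram_mb": 128, "time_slice_ms": 10},
        ],
    },
}

for core, cfg in NOTIONAL_CONFIG.items():
    names = ", ".join(vm["name"] for vm in cfg["vms"])
    print(f"{core}: {cfg['partitioning']} -> {names}")
```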

While partitioning improves software modularity and reduces coupling (both good things), the real motivation for partitioning is reducing certification costs of application software through hard separation and guaranteed availability of required resources (memory, space, time, and I/O resources).

SoCs and MCPs are much more complex than partitioned single-core devices due to increased contention for shared resources such as cache memory and system memory, I/O and internal communications. Complex SoCs require additional safety engineering rigor, and MCPs have other additional considerations for safety-critical applications.

Multi-Core Processor Functional Safety Considerations for Autonomous Vehicles

Current automotive processors demonstrate very high levels of safety integrity, availability and reliability to comply with the ISO 26262 functional safety standard. However, with the advent of autonomous vehicles and the expected reduction of mechanical controls in favor of electronic subsystems that execute ever more complex software, there will be a growing need to improve on these functionalities, especially to reduce system downtime and achieve fail-functional capabilities.

PC and server-class processor computing performance is achieved by integrating multiple processing cores into a single package. The multiple cores share some level of resources that affect determinism while generally providing higher performance.

Compared to Single-Core Processors (SCPs), MCP integrators are more reliant on supplier-provided MCP information to understand internal resource contention (interference channels) and how the MCP arbitrates contention.

As an example, ARM and X86 are currently providing products to the automotive market from two diverse directions,

  • X86 from desktop/server markets where processors are selected on computing performance; and

  • ARM from the embedded/battery-powered market, where low power consumption is paramount.

ARM and x86 currently support the automotive industry which has increasing need of high-integrity operation, longer product availability periods, and extended temperatures, all desirable for autonomous vehicles’ applications. In addition, the automotive industry has developed safety guidance including specific guidelines for application of Functional Safety, enabling state-of-the-art semiconductors technologies for Autonomous Vehicles.

What Every Engineer Should Know About Safety Of The Intended Functionality (SOTIF) and ISO/PAS 21448

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

  • The fast development of autonomous driving technology is prompting regulators to rethink safety standards.

  • Applying a SOTIF approach, ISO/PAS 21448:2019 is widely expected to become the first universal standard for (highly) autonomous vehicles.

How do ISO 26262 and ISO/PAS 21448 relate?

  • So far, ISO 26262 has been the primary standard in automotive development.

    • That standard focuses on functional safety, aiming to eliminate the chance of electrical/electronic (E/E) system malfunctions.

      • In the context of autonomous systems, however, malfunctions aren’t the most important factor you’ll need to worry about.

  • An autonomous car’s situational awareness is the result of combining data from a variety of complex sensor systems, lasers, lidars, cameras, radars, etc.

  • All this data is processed and interpreted by algorithms driven by Machine Learning and Artificial Intelligence.

    • Extreme or unforeseen situations may confuse these algorithms, resulting in unsafe behavior.

  • In self-driving cars, risks may stem from a wide set of factors:

    • a misuse of the function by the driver (that can be reasonably expected),

    • performance limitations of sensor or other systems,

    • and even unforeseen changes in the vehicle’s environment (including extreme weather conditions).

  • A new set of regulations, ISO/PAS 21448:2019, is being devised to account for edge cases that may give rise to safety hazards that do not result from any system failures.

    • Rather than focusing on failures, this new standard covers hazardous behavior in the absence of faults: any unintended consequences that result from the technological shortcomings of the system by design.

  • Put simply, ISO/PAS 21448 complements ISO 26262.

    • It uses the same vocabulary, but extends it with autonomous-specific terms, and its scope is complementary to ISO 26262.

  • They are different standards, and it is their combination that helps autonomous developers avoid hazardous situations – both in the presence and in the absence of malfunctions and unintended use cases.

Safety of the Intended Functionality

  • The keyword of paramount importance that ISO/PAS 21448 is built around is SOTIF, or the Safety of the Intended Functionality. SOTIF can be defined as follows:

The absence of unreasonable risk due to hazards

  • resulting from functional insufficiencies of the intended functionality, or

  • from reasonably foreseeable misuse by persons

is referred to as the Safety Of The Intended Functionality (SOTIF).
  • ISO/PAS 21448 provides guidance on the design, verification, and validation measures that developers can apply in order to achieve the SOTIF in their autonomous mobility products.

  • It helps developers attain safety requirement completeness to ensure safety even when the system is used in unknown or unsafe conditions, including the reasonably foreseeable misuse of autonomous vehicles.

  • The standard, however, does not cover feature abuse (e.g. cases where the system is intentionally altered).

What ISO/PAS 21448:2019 means for automotive developers

  • For developers of autonomous driving technologies, this new standard means a new approach to systematic failure analysis.

  • Rather than focusing solely on malfunctions, ISO/PAS 21448 takes the complexity approach, requiring developers to account for any potential hazards resulting from the sheer complexity of the technologies covered by the standard.

    • Many see that as the future of safety standardization.

  • What this means for developers, in practical terms, is increased focus on testing strategies and a need to apply statistical analysis in their safety validation efforts.

    • Virtual simulation, that is, simulating a vast variety of road conditions to verify the intended and safe functioning of their autonomous technologies, is becoming a fundamental strategy.

In summary, ISO/PAS 21448 is the first step of regulatory efforts to ensure the safety of AVs, and it is definitely an important and long-awaited initiative.

Industrial Design Engineering Project: DIY an autonomous spraying system

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Information Technology, Nanoscience & Technology, Physics

Autonomous systems for precise spraying

Advances in different technologies, such as global navigation satellite systems, geographic information systems, high-resolution vision systems, innovative sensors and embedded computing systems, are finding direct application in agriculture. These advances allow researchers and engineers to automate and robotize agricultural tasks despite the inherent difficulties of the natural, semi-structured environment in which these tasks are performed. Following this current trend, this article aims to describe the development and assessment of a robotized patch spraying system that was devised for site-specific herbicide application in agricultural crops and is capable of working in groups or fleets of autonomous robots. The robotized patch sprayer consisted of an autonomous mobile robot based on a commercial agricultural vehicle chassis and a direct-injection spraying boom that was tailor-made to interact with the mobile robot. Diverse sources (on-board and remote sensors) can supply the weed data for the treatment. The main features of both the mobile robot and the sprayer are presented along with the controller that harmonized the behavior of both main subsystems. Laboratory characterization and field tests demonstrated that the system was reliable and accurate enough to treat over 99.5% of the detected weeds, while treatment of weed-free crop was insignificant (approximately 0.5% with respect to the total weed patch area), achieving significant herbicide savings. A robotized patch sprayer would be featured by:

  • Smart herbicide application system based on a direct-injection sprayer.

  • System allows an important herbicide saving using a weed detection system.

  • Robotized sprayer has been tested in laboratory and field experiment.

  • The resolution for applying the weed treatment is approximately 0.5 m.

  • System has been used experimentally in real wheat crops.

On-line trajectory planning for autonomous spraying vehicles

In this CRC Press News, a new potential application of on-line trajectory planning for autonomous sprayers is discussed. The current generation of these vehicles uses automatic controllers to maintain the height of the spraying booms above the crop. However, such systems are typically based on ultrasonic sensors mounted directly on the booms, which limits the response of the controller to changes in the terrain, resulting in a sub-optimal spraying process. To overcome these limitations, it is possible to use 3D maps of the terrain ahead of the spraying booms based on laser range-finder measurements combined with GPS-based localization. Potential boom trajectory planning solutions which utilize the 3D maps are considered, and their accuracy and real-time suitability are explored based on data collected from field tests. The point optimization and interpolation technique presents a practical solution demonstrating satisfactory performance under real-time constraints.

Spraying strategy optimization with genetic algorithm for autonomous air-assisted sprayer in helio greenhouses

Owing to narrow spaces and high planting density, many advanced crop protection machines are not suitable for helio greenhouses. Under these conditions, a novel autonomous air-assisted sprayer is needed to realize automatic spraying and to improve droplet deposition uniformity. For such a sprayer design concept, focusing on precise air-assisted spraying control, a spraying optimization strategy could be developed to obtain uniform deposition of droplets on crops.

  • Firstly, the autonomous sprayer and its spraying sub-system can be developed, and the simplified airflow model can be constructed and validated via spraying experiments.

  • Furthermore, based on this model, the relationship between the droplet deposition area and the spraying mechanism posture can be derived; and

  • An offline optimal spraying strategy based on a Genetic Algorithm (GA) is established.

The algorithm can be implemented by encoding 4 key parameters to control the spraying mechanism’s behavior, and the Coefficient of Variation (CV) of droplet deposition on various parts of the crops can be computed as the fitness value in algorithm iterations. Illustrative optimization processes could be simulated to show the characteristics of the uniform spraying solutions developed. With optimized parameters pre-set in the autonomous sprayer, real spraying tests can be carried out in a helio greenhouse. Simulation results and real test data can be used to validate the effectiveness and convenience of the potential optimal spraying strategy. With this spraying strategy optimization, optimized spraying behavior can be planned for different crops with different planting patterns in helio greenhouses.
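
A minimal sketch of such a GA is shown below. The deposition model is a stand-in stub and the four parameters are only hypothetical examples; a real implementation would compute the CV from the validated airflow model:

```python
# Minimal Genetic Algorithm sketch for the spraying strategy described
# above: 4 encoded parameters, fitness = Coefficient of Variation (CV)
# of droplet deposition. The deposition model below is a stand-in stub.
import random
import statistics

random.seed(1)
N_PARAMS = 4                       # e.g., fan speed, nozzle angle, boom
                                   # height, travel speed (hypothetical)
BOUNDS = [(0.0, 1.0)] * N_PARAMS   # normalized parameter ranges

def deposition_cv(params):
    """Stub: pretend deposition on 10 crop zones, return its CV."""
    deposits = [1.0 + 0.5 * abs(p - 0.5) * (i % 3)   # arbitrary stand-in
                for i, p in enumerate(params * 3)][:10]
    return statistics.pstdev(deposits) / statistics.mean(deposits)

def mutate(ind, rate=0.2):
    return [min(hi, max(lo, g + random.gauss(0, 0.1)))
            if random.random() < rate else g
            for g, (lo, hi) in zip(ind, BOUNDS)]

def crossover(a, b):
    cut = random.randrange(1, N_PARAMS)
    return a[:cut] + b[cut:]

pop = [[random.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(30)]
for gen in range(50):
    pop.sort(key=deposition_cv)        # lower CV = more uniform = better
    elite = pop[:10]                   # keep the best strategies
    pop = elite + [mutate(crossover(random.choice(elite),
                                    random.choice(elite)))
                   for _ in range(20)]

best = min(pop, key=deposition_cv)
print("best parameters:", [round(g, 3) for g in best],
      "CV:", round(deposition_cv(best), 4))
```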

Autonomous-guided orchard sprayer using overhead guidance rail

Since the application of chemicals in confined spaces under the canopy of an orchard is hazardous work, there is a need to develop an autonomous guidance system for an orchard sprayer. Such an autonomous guidance system could steer the vehicle by tracking an overhead guidance rail, which could be installed on an existing frame structure. The autonomous guidance system would consist of

  • a microprocessor,

  • an inclinometer,

  • two interface circuits of actuators for steering and ground speed control; and

  • a fuzzy control algorithm.

In addition, overhead guidance rails for both straight and curved paths could be devised, and a trolley could be designed to move smoothly along the overhead guidance rails.

Industrial Design Engineering Project: DIY an autonomous spraying system

The aim of agricultural robotics is to enable automated operation of different farming processes by developing robust and autonomous agricultural vehicles. These intelligent machines would perform tasks like

  • ploughing,

  • spraying or

  • harvesting

autonomously with minimal intervention from a human user. This Industrial Design Engineering Project would enable autonomy for horizontal boom sprayers. The modern generation of these vehicles features adjustable spraying booms which can be automatically controlled to maintain a constant distance from the crop. This would be a critical process, as the height of the boom affects the amount and distribution of the sprayed substance. The boom control systems so designed would rely on boom-mounted ultrasonic sensors for measuring the height and level of the booms. The ultrasonic sensors, whilst inexpensive, are relatively slow and provide noisy information for only a small patch of the terrain immediately below the spraying boom. This would result in a sub-optimal spraying process and would also restrict the maximum speed of the sprayer, since only a reactive control strategy is possible. Through the Industrial Design Engineering Project, we can investigate a control system based on alternative sensing technology employing Laser Range-Finders (LRF) and predictive terrain modeling enabling a longer “look-ahead”. The core component of the autonomous spraying system is a local 3D map of the terrain, reconstructed from a scanning laser rangefinder and precise pose information provided by GPS and IMU sensors. With such an approach the terrain can be sensed in advance, so that the trajectory planner and controller would have more time to adjust the height of the booms. Such an inventive problem solving approach would not only improve the control accuracy but could also enable new applications such as terrain-based vehicle steering or variable-rate spraying, leading towards the development of fully autonomous spraying vehicles. We could potentially extend the Industrial Design Engineering Project by including on-line trajectory planning for the boom controller.
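
The look-ahead idea can be sketched in a few lines. In the example below the terrain profile, speed, actuator delay and clearance values are invented for illustration; the point is that the boom is commanded for where it will be once the actuator responds:

```python
# Sketch of look-ahead boom height control from a terrain profile,
# illustrating the idea described above. Terrain data, speeds and
# actuator limits are invented for the example.
import numpy as np

# Terrain height (m) sampled every 0.5 m along the travel direction,
# e.g., reconstructed from laser range-finder + GPS/IMU data.
xs = np.arange(0, 20, 0.5)
terrain = 0.2 * np.sin(xs / 3.0)     # stand-in terrain profile

SPEED = 4.0          # vehicle speed, m/s
ACT_DELAY = 0.5      # actuator response time, s
CLEARANCE = 0.5      # desired boom height above the crop, m

def boom_setpoint(vehicle_x):
    """Target boom height using terrain at the look-ahead point.

    The boom is commanded for where it will be once the actuator has
    responded, rather than for the terrain directly below it now.
    """
    look_ahead_x = vehicle_x + SPEED * ACT_DELAY
    ground = np.interp(look_ahead_x, xs, terrain)
    return ground + CLEARANCE

for x in (0.0, 2.0, 4.0):
    print(f"x = {x:4.1f} m -> boom setpoint {boom_setpoint(x):.2f} m")
```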

A Tale of Autonomous Driving towards Functional Safety

By: John X. Wang
Subjects: Engineering - Industrial & Manufacturing

Within the automobile industry, functional safety as a process is based on the guidelines specified by ISO 26262, an international safety standard for automotive systems.

The ISO 26262 standard defines functional safety as the

“absence of unreasonable risk due to hazards caused by malfunctioning behavior of electrical/electronic systems”.

For ISO 26262 compliance, a functional safety consultant identifies and assesses hazards (safety risks). These hazards are then categorized based on their criticality under the Automotive Safety Integrity Level (ASIL) scheme of ISO 26262. Such a clear classification of hazards helps to:

  • Establish various safety requirements to mitigate the risks to acceptable levels

  • Smoothly manage and track these safety requirements

  • Ensure that standardized safety procedures have been followed in the delivered product.

Automotive Safety Integrity Level (ASIL)

Automotive Safety Integrity Level (ASIL), specified under ISO 26262, is a risk classification scheme for defining the safety requirements. Under ISO 26262, ASILs are assigned by performing a risk analysis of a potential hazard by looking at the risk parameters

  • Severity,

  • Exposure and

  • Controllability

of the vehicle operating scenario.

ASIL and Safety Criticality of Automotive Components

The safety lifecycle of any automotive component within the ISO 26262 standard starts with the definition of the system and its safety-criticality at the vehicle level.

This is done through hazard analysis and risk assessment for the corresponding automotive component (hardware/ software), necessary for the determination of the Automotive Safety Integrity Level (ASIL).

Hence, determination of the ASIL forms the very first phase of automotive system development. Here, all potential hazard scenarios are evaluated for a particular automotive component, the occurrence of which can be critical for vehicle safety.

For example,

  • an unexpected inflation of an airbag or

  • failures of brakes

are potential safety hazards that should be assessed and managed in advance. This step is followed by identifying the safety goals for each component, which are then classified according to either the QM or ASIL levels, under the ISO 26262 standard.

Safety goals are basically the level of safety required by an automotive component to function normally without posing any threats to the vehicle.

Example. For a car door,

  • The safety goal could be to have it opened or to have it closed, depending on which action is safe under a particular condition.

  • During instances of fire inside the vehicle or a flood, the safety goal would be to have the car door opened as quickly as possible so that the passengers can escape.

  • On the contrary, while the vehicle is moving fast, the safety goal related to the door will be to remain closed; accidental opening of the door of a moving car could lead to greater risks.

Determining the ISO 26262 ASIL for an Automotive Application

There are four ASILs identified by the ISO 26262 standard: ASIL A, ASIL B, ASIL C, ASIL D.

ASIL D represents the highest degree of automotive hazard and ASIL A the lowest. There is another level called QM (for Quality Management level) that represents hazards that do not dictate any safety requirements.

 

For any particular failure of a defined function at the vehicle level, a hazard analysis and risk assessment (HARA) helps to identify the intensity of the risk of harm to people and property. Once this classification is completed, it helps in identifying the processes and the level of risk reduction needed to achieve a tolerable risk. Safety goal definition as per ASIL is performed for both hardware and software processes within automotive design to ensure the highest levels of functional safety. These safety levels are determined based on 3 important parameters:

Severity of failure + Probability of exposure + Controllability = ASIL

Severity (S)

Severity (S) defines the seriousness or intensity of the damage or consequences to the life of people (passengers and road users) and property due to safety goal infringement. The order of severity is:

  • S1 for light and moderate injuries;

  • S2 for severe and life-threatening injuries (survival probable), and

  • S3 for life-threatening injuries (survival uncertain) and fatal injuries.

Exposure (E)

Exposure (E) is the measure of the probability of the vehicle being in a hazardous or risky situation that can cause harm to people and property. Various levels of exposure, such as

  • E1: very low probability,

  • E2: low probability,

  • E3: medium probability,

  • E4: high probability

are assigned to the automotive component being evaluated.

Controllability (C)

Controllability (C) determines the extent to which the driver of the vehicle can control the vehicle if a safety goal is breached due to failure or malfunctioning of the automotive component being evaluated. The order of controllability is defined as C1 < C2 < C3 (C1 for easy to control, C3 for difficult to control).

The ISO 26262 ASIL Allocation

The ASIL levels (ASIL A, B, C, and D) are assigned based on an allocation table defined by the ISO 26262 standard. Let us try to understand the determination of ASIL values for various components based on the S, E and C parameters. A few observations from the ASIL allocation table, which are condensed into a small lookup sketch after this list:

  • A combination of S3, E4 and C3 (the extremes of the 3 parameters) refers to a highly hazardous situation. Hence the component being evaluated is identified to be ASIL D, which means it is prone to severely life-threatening events in case of a malfunction and calls for the most stringent levels of safety measures.

  • On the contrary, a combination of S1, E1 and C1 (the lowest levels of the 3 parameters in terms of safety-criticality) calls for QM levels, which means the component is not hazardous and does not emphasize safety requirements to be managed under the ISO 26262.

  • Similarly, combinations of medium levels define intermediate ASILs: S2, E4 and C3 gives ASIL C, while S2, E3 and C2 gives ASIL A.
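
A widely quoted shorthand for the allocation table is to add the S, E and C indices; the sketch below uses it and reproduces the combinations discussed above, though the normative table in ISO 26262-3 remains the authority:

```python
# Shorthand for the ISO 26262 ASIL allocation table: add the S, E and C
# indices; totals of 6 or less map to QM, then 7->A, 8->B, 9->C, 10->D.
# This reproduces the combinations discussed in the text, but the
# normative table in ISO 26262-3 is the authoritative reference.

def asil(s: int, e: int, c: int) -> str:
    assert 1 <= s <= 3 and 1 <= e <= 4 and 1 <= c <= 3
    total = s + e + c
    return {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}.get(total, "QM")

print(asil(3, 4, 3))  # extremes of the 3 parameters -> ASIL D
print(asil(1, 1, 1))  # lowest levels               -> QM
print(asil(2, 4, 3))  # medium-level combination    -> ASIL C
print(asil(2, 3, 2))  # medium-level combination    -> ASIL A
```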

The intensity of the hazard thus depends on the ASIL levels of the components under consideration. Allocation of ASIL helps in identifying how much threat the malfunctioning of a particular component can cause under various situations.

Under the framework of the ISO 26262 ASIL and functional safety, the safety goals are more critical than the functionality of the automotive component. Let us take the example of charging a vehicle battery to understand this statement.

The safety goals associated with a battery are a more critical consideration to be evaluated as per ASIL than the battery function itself, as shown in the table below. Overcharging of the battery at a speed below 10 km/h (about 6.2 mph) is not as serious a situation as overcharging at very high speeds, where the possibilities of overheating and consequent fire could also be high:

Vehicle Condition | Cause of malfunction | Possible hazard | ASIL

Running speed below 10 km/h (6.2 mph) | Charging of battery pack beyond allowable energy storage | Overcharging may lead to a thermal event | A

Running speed 10 to 50 km/h (6.2 to 31.1 mph) | Charging of battery pack beyond allowable energy storage | Overcharging may lead to a thermal event | B

Running speed above 50 km/h (31.1 mph) | Charging of battery pack beyond allowable energy storage | Overcharging may lead to a thermal event | C

Thus, ASIL determination forms a very critical process in the development of highly reliable and functionally safe automotive applications. Today, when car designs have become increasingly complex with a huge number of ECUs, sensors and actuators, the need to ensure functional safety at every stage of product development and commissioning has become even more important.

This is why modern-day automotive manufacturers are very particular about meeting the highest automotive safety standards in accordance with the ISO 26262 standard and ASIL levels.

Automotive Functional Safety Best-Practices

The modern-day vehicle has evolved from a predominantly mechanical machine into an electronically controlled system. Today, a car contains hundreds of automotive Electronic Control Units (ECUs) and millions of lines of software code.

With the passage of time and the lightning pace of technical advancement, the number of ECUs within automobiles is also increasing. This increase in the complexity and functionality of ECU-based car design is driven by the need for a comfortable driving experience along with safety and pollution control.

The automotive ECUs power many of the advanced functions and features available in modern cars, including advanced driver assistance (ADAS), telematics, passive safety systems and engine management, to name a few.

Functions such as

  • adaptive cruise control,

  • crash protection systems,

  • active body control and

  • Electronic Stability Program (ESP)

are increasing in complexity and taking an ever more active role in controlling the car. These functions are realized by systems of sensors, actuators and interconnected electronic control units. The systems must be designed to function under a variety of operating conditions and must adhere to a number of mechanical, hardware and software constraints. In order to be able to mitigate the emerging product and production risks associated with such systems as well as ensuring the high level of quality required of automotive systems, significant improvements to engineering processes are necessary.

In this article, we describe our experiences in adapting companies' development processes to conform to safety standards and to cope with the challenges mentioned above. We detail key success factors in overcoming these challenges and provide practical examples from working with global OEMs and tier-one suppliers on implementing safety standards in E/E development. 

Functional safety: selected standards, architectures, and analysis

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Ergonomics & Human Factors, Information Technology, Nanoscience & Technology, Occupational Health & Safety, Statistics

Two functional safety standards: IEC 61508 and ISO 26262

The race to self-driving cars is making the news almost daily. This new market has been an incredibly fast driver for the evolution of SoC development for automotive applications. Advanced Driver Assistance Systems (ADAS), the precursor of fully autonomous vehicles, have led to an exponential increase in the amount and complexity of electronics in cars. Modern luxury cars are reported to have up to 90 Electronic Control Units (ECUs) implementing several of the advanced features, such as

  • adaptive cruise control,

  • collision avoidance system,

  • automatic parking.

ADAS applications require environment recognition based on processing of data from radar, Light Detection and Ranging (LIDAR), and camera and sensor fusion, which is very computationally intensive and requires the support of advanced process nodes to meet the performance/watt demands. Consequently, the automotive industry is also witnessing a migration to advanced technologies, which can present a bigger challenge for safety engineering, for example,

  • process variation,

  • electrostatic discharge,

  • electromigration.

Safety becomes a fundamental requirement in the automotive systems to guarantee a tolerable level of risk. Safety can be defined by referring to two existing safety standards:

  • IEC 61508, a functional safety standard for the general electronics market developed by the International Electrotechnical Commission (IEC), and

  • ISO 26262, a functional safety standard for automobiles developed by the International Organization for Standardization (ISO).

IEC 61508 and ISO 26262 have rapidly been affirmed as the guidelines for Industrial Design Engineers. Compliance with these requirements has been traditionally addressed by car manufacturers and system suppliers. However, with the increasing complexity, the industry is taking a divide-and-conquer approach, and all participants of the supply chain are now called to support and enable functional safety and reliability standards. These metrics are becoming an integral part of the semiconductor design flow.

Functional safety architectures

Two of the most common architectures implemented to detect random faults are single-channel and dual-channel systems.

Single-channel systems

Single-channel systems are typically the simplest to implement and use existing processing, memory and data communication paths in the system. However, the reliability and diagnostic ability of most implementations are limited by the fact that the diagnostic functions run on the same data/power/clock lines as the main system. Simplicity has a price, and the safety and performance levels of such systems are typically limited to SIL/ASIL 2 or below and Category 2 (Cat. 2) systems with Performance Level c (PL c) or below.

Dual-channel architectures

Dual-channel architectures provide two completely independent data/logic processing and communication, voltage and clock paths throughout the system. Not only is there independent redundancy, but it’s possible to execute and compare any necessary safety functions on both channels. It’s extremely unlikely that the same error will occur on both. If the results between the channels don’t match and an error is detected on one of them, both channels can be brought to a safe state.

Dual-channel architectures are more expensive to implement than single-channel architectures and more complex to design; however, they can achieve higher SILs/PLs. SIL 3 and Cat. 3/Cat. 4 PL d and PL e systems typically use this approach. A diagnostic system could include the following:

  • Inter-MCU communication

  • Software error diagnosis

  • Power supply voltage diagnosis

  • Other circuits diagnosis

Functional Safety Analysis

Failure Mode Effects and Diagnostic Analysis

A Failure Modes and Effects Analysis (FMEA) is a systematic way to identify and evaluate the effects of different component failure modes, to determine what could eliminate or reduce the chance of failure, and to document the system under consideration. An FMEDA (Failure Modes, Effects and Diagnostic Analysis) is an FMEA extension. It combines standard FMEA techniques with extensions to identify online diagnostic techniques and the failure modes relevant to safety instrumented system design.

FMEDA can be applied to the following; a short sketch after this list illustrates the key metric calculations:

  • carry out a detailed Failure Modes, Effects and Diagnostics Analysis (FMEDA) according to IEC 61508 / ISO 26262, with traceable failure rates and a traceable distribution of the failure rates over the different failure modes

  • calculate safety metrics such as the Safe Failure Fraction (SFF) or average Probability of Failure on Demand (PFDavg) according to the requirements of IEC 61508, or the Single Point Fault Metric (SPFM), Latent Fault Metric (LFM) and hardware failure rate according to the requirements of ISO 26262

  • generate a complete FMEDA report for each analyzed element or subsystem

  • concentrate on the analysis work by offloading the development team from searching and selecting activities for failure rates and failure modes.
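
To illustrate the metric calculations named above, here is a small sketch with invented FIT values; the formulas follow the commonly stated definitions, and the normative ones in IEC 61508 and ISO 26262 take precedence:

```python
# Sketch of FMEDA metric calculations with invented FIT values
# (1 FIT = 1 failure per 10^9 hours). The formulas follow the usual
# definitions; consult IEC 61508 / ISO 26262 for the normative ones.

# IEC 61508 view of an element's failure rate (FIT):
lam_safe = 400.0            # safe failures
lam_dd = 450.0              # dangerous failures, detected by diagnostics
lam_du = 50.0               # dangerous failures, undetected

sff = (lam_safe + lam_dd) / (lam_safe + lam_dd + lam_du)
print(f"Safe Failure Fraction (SFF) = {sff:.1%}")

# ISO 26262 view (FIT): single-point/residual faults violate a safety
# goal directly; the remainder are multiple-point or safe faults.
lam_total = 900.0
lam_spf_rf = 9.0            # single-point + residual faults

spfm = 1.0 - lam_spf_rf / lam_total
print(f"Single Point Fault Metric (SPFM) = {spfm:.1%}")
```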

Fault Tree Analysis

Fault Tree Analysis (FTA) or Markov Modeling are techniques to evaluate the safety of a given system based on its architecture. The safety engineers will review the product architecture and model qualitatively and quantitatively the effect of independent and dependent failures of the units that compose the system, and determine quantitatively the likelihood of satisfying the safety goals (ISO) or providing satisfactory safety functions (IEC). The deliverable of this task is the FTA report for the safety goals (ISO) or safety functions (IEC) specified in the Safety Requirements Specifications (SRS).

Fault Insertion Test

Typically an FMEDA is supported by means of fault insertion tests, where specific component failures are simulated to confirm the existence of assumed diagnostics and to determine the exact behavior in situations where that behavior is not obvious from the design.

The outcome of Fault Insertion Testing is a Fault Insertion Test specification that meets all IEC 61508 requirements for such a document.

In summary, the Fault Insertion Test is an analysis tool which helps and supports hardware designers of safety-related systems. It allows tracking of critical failure modes, which can be the basis for fault insertion tests.

Mechanical FMEA

The detailed mechanical Failure Modes, Effects and Diagnostic Analysis (FMEDA) is a technique used to evaluate the safety of a given mechanical product based on the detailed mechanical drawings of the assembly.

The results of an FMEDA are a set of failure rates that can be used to determine the probability of failure, and the Safe Failure Fraction (SFF) needed for a given SIL. Typically the mechanical FMEDA is supported by a design/construction FMEA to prove robustness against systematic design faults.

The outcome of this service is a detailed mechanical FMEDA / FMEA report that meets all IEC requirements for such a document.

 

Dependent Failures and Common Cause Failure (CCF)

Dependent failures and Common Cause Failures (CCF) are the most important factor for limiting the achievable IEC or ISO Target Failure Measure in redundant systems. The Common Cause Failure Analysis (CCA) is an advanced technique evaluating the behavior of redundant subsystems under expected Common Cause Initiators (CCI). It can be determined if sufficient logical and physical independence measures are in place to combat the expected dependent failures and CCI.

The safety engineers will review the product architecture and evaluate the measures against dependent failures and CCI and estimate the resulting impacts.

The deliverables of the Dependent Failures and Common Cause Failure (CCF) analysis are a list of safety measures to strengthen independence and sets of impact analyses for redundant subsystems.

Software HAZard Analysis (S/W HAZAN)

The Software HAZard Analysis (S/W HAZAN) is an advanced technique evaluating the behavior of critical S/W functions under expected fault conditions. Given the expected fault conditions it can be determined if sufficient protection measures are in place to combat these fault conditions. The list of protection measures helps in creating a checklist for integration testing.

The safety engineers will review the S/W architecture and source code structure, and collect arguments for detection and containment of potential systematic problems.

The deliverable of this task is the S/W HAZAN report, listing all run-time safety integrity measures that must be implemented, and a list of Fault Injection Tests.
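
The flavor of such a fault injection test can be sketched in a few lines: substitute a faulty input for a critical function and confirm that the protection measure contains it. The plausibility check, sensor read, and limit below are hypothetical.

    # Minimal fault-injection sketch: force a fault into a critical path and
    # check that the protection measure reacts. All names and limits are hypothetical.

    def read_speed_sensor():
        return 1200.0  # rpm; in a real system this reads hardware

    def speed_with_plausibility_check(read=read_speed_sensor, max_rpm=6000.0):
        value = read()
        if not (0.0 <= value <= max_rpm):
            return 0.0, True   # substitute a safe value and raise the fault flag
        return value, False

    # Inject an implausible reading in place of the real sensor
    value, fault = speed_with_plausibility_check(read=lambda: float("inf"))
    assert fault and value == 0.0
    print("fault contained:", fault)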

Timing Analysis

Though the hardware architectural metrics described so far do not include timing constraints, it is easy to see that a complete evaluation of the safety mechanisms must involve timing performance. The system must be able to detect faults and transition to a safe state within a specific time, the Fault Tolerant Time Interval (FTTI); otherwise, the fault can become a system-level hazard. Timing analysis also involves the Diagnostic Test Interval (DTI), the part of the FTTI allotted to detecting the fault (ISO 26262). As a reference example, the DTI for fault detection in a CPU can be around 10 ms, while around 100 ms might be allocated for the FTTI of the whole system.
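
A timing budget check against these quantities is simple arithmetic, sketched below with the reference numbers from the paragraph above; the 60 ms and 95 ms reaction times are hypothetical.

    # Fault-reaction timing budget check against the FTTI.

    FTTI_MS = 100.0  # fault tolerant time interval for the whole system
    DTI_MS = 10.0    # diagnostic test interval allotted to fault detection

    def within_ftti(detection_ms, reaction_ms, ftti_ms=FTTI_MS):
        """The fault must be detected and the safe state reached inside the FTTI."""
        return detection_ms + reaction_ms <= ftti_ms

    # In the worst case, the fault appears just after a diagnostic ran, so
    # detection can take up to a full DTI.
    print(within_ftti(detection_ms=DTI_MS, reaction_ms=60.0))  # True: budget met
    print(within_ftti(detection_ms=DTI_MS, reaction_ms=95.0))  # False: hazard possible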

Summary

The role of safety-related systems has grown steadily in recent years, and industry must deal with the related standards. The aim of a Safety Requirements Specification (SRS) is to reduce the risk from a hazardous state of the system under control to a tolerable level. This CRC Press News discusses functional safety assessment based on the evaluation of the Safe Failure Fraction (SFF) for a complex system. Because such an evaluation is made according to the IEC 61508 and ISO 26262 standards, this CRC Press News has focused on some related ambiguities. Using the approach presented here, it is possible to study the SFF of different complex systems in compliance with the qualitative requirements described in the aforementioned standards.

Resolvers Vs Encoders: Motor Control Design for Functional Safety

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Environmental Science

A resolver is an angular position sensor that is commonly used in harsh, rugged environments. A fully Electric Vehicle (EV) may use multiple resolvers for a variety of control systems that perform rotary motion and additional resolvers may be required to create system redundancy for safety. A Resolver-to-Digital Converter (RDC) interface processes the analog output of the resolver sensor and communicates it in digital format to the engine control unit (ECU) in an EV. When designing an RDC interface, it is important to select the right RDC architecture to ensure that the circuit operates consistently under stringent conditions (such as vehicle acceleration).

Resolvers for Motor Control

Designing a differentiated motor drive is a complex task. Often these drives are single-processor designs that combine the constraints of real-time embedded design, such as limited memory size and processing time, with the complications that motors bring: electrical noise and faults. When you add functional safety and certification requirements, the new design, test, and documentation deliverables require a significant amount of additional effort.

Today's systems are also more complex and more dependent on the electronic control of motoring operations that need to meet strict industry functional safety standards. Whether it is the motor controlling the power steering assist in a car, controlling the lift and doors of an elevator, or directly connected to the drum of a front load washing machine without belts or gears, functional safety in motor operation is fundamentally important. A motor system designed with functional safety will have a lower level of risk from improper operation. When a failure does occur, whether it is a random or systematic fault, the functionally safe design will detect the fault and respond to minimize its impact.

International functional safety standards are defined to ensure that functional safety techniques are detailed for a specific industry sector and that these techniques are consistently applied. IEC 61508 is a basic safety standard which is the basis of all IEC and some ISO functional safety standards. It is used as a basis for sector-specific standards but, where these do not yet exist, it is also intended for direct use. Some standards that refer to IEC 61508 include: EN 50128 — railway, IEC 60601 — medical equipment, IEC 61511 — process industry, ISO 13849/IEC 62061 — industrial machinery, IEC 60880 — nuclear power industry, and EN 50156 — furnaces.

Automotive designers must comply with ISO 26262 safety requirements to support Quality-Managed (QM) and ASIL-A to ASIL-D applications such as steering, braking, transmission, electric vehicle battery management, and advanced driver-assistance systems (ADAS). TI is a member of U.S. and international working groups for ISO 26262.

Designers for household appliances strive to meet IEC 60730, and/or related standards UL 1998 and IEC 60335, supporting Class A to Class C.

A typical motor control system processes feedback from motor rotor sensors, as well as voltage and current measurements from the inverter (taken strategically and deterministically), and uses this data as inputs to the torque, speed, and position control loops, finally generating an appropriate Pulse-Width Modulation (PWM) output to the inverter. These closed loops are standard and depend on a great number of components, both hardware and software.

When measuring the inverter voltages and currents, designers must know if the Analog-to-Digital Converter (ADC) is both functional and producing correct results. A common technique connects a PWM output to an ADC input through a filter. The full-scale ADC range can then be tested. Some TI microcontrollers even integrate a Digital-to-Analog Converter (DAC) to serve this purpose. One method to gain safety coverage is to have multiple ADCs converting the same control signals. This allows a comparison to occur on the actual signal used in the control process.
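
The dual-ADC cross-check can be sketched as below. This is conceptual only: the converter stubs, the tolerance, and the fault reaction are hypothetical rather than any particular microcontroller's API.

    # Sketch of a dual-ADC plausibility check: two converters sample the same
    # control signal and a comparison flags divergence. Values are hypothetical.

    TOLERANCE_COUNTS = 8  # maximum allowed disagreement between converters

    def adc_cross_check(sample_a, sample_b, tol=TOLERANCE_COUNTS):
        """Return True when both converters agree within tolerance."""
        return abs(sample_a - sample_b) <= tol

    def read_phase_current(adc_a, adc_b):
        a, b = adc_a(), adc_b()
        if not adc_cross_check(a, b):
            raise RuntimeError("ADC mismatch: trigger the safe state")
        return (a + b) // 2  # the agreed value feeds the control loop

    # Stubs standing in for hardware reads
    print(read_phase_current(lambda: 2048, lambda: 2051))  # 2049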

The next step is processing these signals. In one example implementation, two CPUs execute the same code while comparison logic guarantees that each software instruction is executed exactly the same on both CPUs and notifies the system immediately if the results do not match. Also, every local Flash and RAM access by these CPUs is checked by a Single-bit Error Correcting and Double-bit Error Detecting (SECDED) Error Correction Code (ECC) controller. To extend coverage further, both the CPU and memory have hardware BIST (Built-In Self-Test) to verify functionality at start-up. Embedded diagnostics also include self-test capability to ensure proper operation before the start of safety-critical operation.

With the processing now complete, the next step is to output appropriate PWMs to the inverter. These outputs can be verified by connecting them to input captures. To get more system coverage, a designer can connect the motor phases to the input captures, using appropriate signal conditioning, to verify that the transitions are within expectations.

Functional safety is of great importance for Electric and Hybrid Electric Vehicles (EV/HEVs). One way to improve the functional safety of EV/HEVs is to develop reliable and robust fault diagnosis and fault tolerant control systems so that, once a component failure is detected, effective remedial actions are taken to avoid system failures. Robust fault diagnosis is a necessary precursor to fault tolerant control strategies. Here, we use a V-diagram to present a systematic process for conducting automotive fault diagnosis, in connection with fault tolerant control development. We also introduce a diagnostic approach based on structural analysis that can form the basis of fault tolerant control. The structural analysis approach makes it possible to evaluate a system's analytic redundancy from its structural model, using the mathematical model of the system in matrix form, from which it is possible to derive a set of analytic redundancy relations for fault detection and isolation. Structural analysis is useful in the early stages of diagnostic design because specific knowledge of the system parameters in numerical form is not required. We demonstrate the application of this methodology, and its broad usefulness, by carrying out the design of a diagnostic strategy for an electric vehicle equipped with a Permanent Magnet Synchronous Machine (PMSM) drive system. The EV case study focuses on sensor faults, but the methodology is applicable to any component faults.

Resolvers Vs Encoders for Motion Control

Environmental Protection

Encoders and resolvers can be packaged in basically the same ways, so there is no advantage to one or the other as far as environmental protection.

Operating Environment

The resolver is sturdy and possibly more suitable for a severe environment application where extreme continuous temperatures or very high vibration might be encountered. However, an encoder can be very robust as well, so the only real advantage here would be the ability of the resolver to withstand higher continuous temperatures. If your application requires the feedback device to run at over 125 degrees Celsius, you should probably go with the resolver.

Complexity

Most motion control systems can work with resolvers or encoders, so there is no real difference in wires and interconnects. However, resolvers are analog devices, and they require a converter to format the measurement for processing by a digital computer. This conversion is done by a Resolver-to-Digital Converter (RDC) or by a DSP with suitable input filtering circuitry. The application must carefully consider a number of parameters and select fixed components in order to successfully utilize a resolver in any given application. At a minimum, a resolver application must consider reference frequency, bandwidth, maximum tracking rate, number of bits in the conversion result, input filtering, AC coupling of the reference, phase compensation of the signal and reference, and offset adjustment. All of these issues will affect the measurement process and the overall accuracy of the control loop.

As an example, something like the Analog Devices AD2S80A can be used. Converters of this type are tracking converters, implemented using a type two servo. A type two servo is a closed loop control system characterized as having zero error for constant velocity or stationary inputs. Conversely, this type of system will exhibit errors in all other situations, and the magnitude of these errors must be controlled through optimized “tuning” of the converter. The fact that the converter itself has dynamics becomes an important part of the system design. Being type two, the converter can introduce up to 180 degrees of phase lag into the system. For a 12 bit converter using a 400 Hz reference, the RDC bandwidth (−3 dB point) will be less than 100 Hz. Using the same reference, a 14 bit converter will have a bandwidth of 66 Hz, and a 16 bit converter will have a bandwidth of 53 Hz. A 100 Hz −3 dB bandwidth means that there will be approximately 3 dB of peaking and 45° of phase shift at 40 Hz. As many servos attempt to close position loops near these frequencies, an added 45° phase shift would be undesirable. It should also be noted that, although the RDC tracking rate may not be exceeded, a system with difficult load dynamics could well prove unstable when the RDC dynamics are introduced. The situation only worsens when 14 bit or 16 bit converters are utilized.

Another feature of the RDC is that the maximum slew, or tracking rate, is limited by the resolver reference, or carrier, frequency. For example, the Analog Devices AD2S80A RDC, using a 400 Hz reference, will have a tracking limit of 1500 rpm. This value can be increased to 18,750 rpm using a 5 kHz reference.

In contrast, encoders can be much simpler. When an encoder is used in a control system, there are only two issues to consider: the desired resolution and the maximum rpm at which the encoder will need to work. There are also no encoder dynamics to deal with, as the measurement process is fundamentally digital. An encoder provides data with guaranteed signal separation and symmetry at up to the rated speed, and that is all there is. Many encoders are now available with resolutions up to 25 bits, and some are capable of allowing the motors to run at 10,000 rpm with close to this resolution. As a result, an encoder based application will generally have a wider dynamic range capability.
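
As a back-of-envelope check of the RDC figures above, the sketch below assumes the maximum tracking rate scales linearly with the reference (carrier) frequency, which is consistent with the AD2S80A numbers quoted; for real design values consult the converter data sheet.

    # Rough RDC tracking-limit scaling with reference frequency (illustrative).

    def tracking_limit_rpm(ref_hz, rpm_at_400hz=1500.0):
        """Assume the tracking limit scales linearly with the reference."""
        return rpm_at_400hz * (ref_hz / 400.0)

    for ref in (400.0, 5000.0):
        print(f"{ref:6.0f} Hz reference -> ~{tracking_limit_rpm(ref):,.0f} rpm")
    # 400 Hz -> ~1,500 rpm; 5,000 Hz -> ~18,750 rpm, matching the text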

Data Format

Resolvers are “absolute” measurement systems. They provide a unique Sin/Cos voltage at every point in a 360 degree rotation. An encoder can be absolute or incremental, and the output will be digital. Absolute encoders usually have a serial data bus, with which the drive must be able to communicate in order to read out the position measurement. This requires the designer to verify that the encoder and the drive are compatible. Incremental encoders are ubiquitous, and every drive made can interface to them.

Accuracy

An encoder will be more accurate than a resolver and its associated conversion process.

Flexibility

If at some point in the future more performance or positioning accuracy becomes necessary, it will be much easier to upgrade an encoder based system. An encoder application will require only the feedback element to be replaced, plus a software change in the drive so that the new line count is accounted for. A resolver change will require reconsideration of the associated converter, supply voltage and frequency, filtering, etc., and this may not be possible unless the drive has been developed with this in mind.

SUMMARY

Resolvers provide absolute position information and are capable of operating in relatively high temperature and shock environments because they are similar in construction to the motor itself. However, they are inflexible in their application and must be specifically “tuned” to meet the drive system requirements. Encoders, on the other hand, can be absolute or incremental, simplify the design task, are more accurate, allow a wider dynamic range, and are more flexible should changes be necessary in the future.

Functional Safety in Green Electronics Manufacturing

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Materials Science, Nanoscience & Technology, Occupational Health & Safety, Statistics

Murphy:
“Whatever can go wrong will go wrong, and at the worst
possible time, in the worst possible way.”

Integrated circuits (ICs) are at the root of all modern safety systems in Green Electronics Manufacturing for Creating Environmental Sensible Products. Integrated circuits supply the logic and either control the sensors or, to a growing extent, are the sensors. Integrated circuits drive the final elements to achieve a safe state and they are the platform on which the software runs. The level of integration possible within semiconductors can simplify the system-level implementation at the cost of the added complexity within the IC itself. This level of integration gives improvements in system reliability due to part count reduction and offers opportunities for increased diagnostic coverage with lower diagnostic test intervals—all at a cost that makes safety affordable. It could be argued that this level of integration is a bad thing because of the added complexity. However, with the price of complexity in the integrated circuits can come a major simplification at the module and system levels. Surprisingly, while there are functional safety standards that address process control, machinery, elevators, variable speed drives, and toxic gas sensors, there is no functional safety standard dedicated to integrated circuits. Instead, bits and pieces of the requirements and knowledge are spread around IEC 61508 and other Level B and C standards. This article gives guidance on interpreting the existing functional safety standards for semiconductors.

Introduction

The manufacturer confirms compliance with the following directives:
• Machinery Directive – 2006/42/EC
• Noise Emission Standards – 2000/14/EC
• Electromagnetic Compatibility Standards – 2004/108/EC

Typically, integrated circuits are developed to either IEC 61508 or ISO 26262. In addition, there are sometimes additional requirements in the level two and level three standards. Development and assessment to the functional safety standards are what give the confidence that these sometimes complex integrated circuits are sufficiently safe. When IEC 61508 was written, it was targeted at bespoke systems, as opposed to open market, mass produced integrated circuits. This article will review and comment on the known functional safety requirements for integrated circuits. While the article concentrates on IEC 61508 and its application in industrial sectors, much of the material is relevant to applications such as automotive, avionics, and medical.

Functional Safety

Machinery Directive 2006/42/EC
  • All machinery put into circulation in the European Economic Area has to satisfy the requirements of the Machinery Directive.
  • Some key principles:
    • The machinery has to be built in such a way that no person is exposed to a hazard during
      • Operation
      • Mounting/Service
      • Predictable misuse
    • Principles for risk reduction
      • Elimination or minimization of risks as far as possible
      • Take protective measures against risks which cannot be eliminated
      • Informing users of remaining risks
    • Controls need to be designed in such a way, that hazard situations are avoided, specifically considering:
      • Predictable usage scenarios and external influences
      • Defects in hardware or software
      • Errors in logic
      • Predictable misuse

Functional safety is the part of safety that deals with confidence that a system will carry out its safety related task when required to do so. Functional safety is different from other passive forms of safety such as electrical safety, mechanical safety, or intrinsic safety.

Functional safety is an active form of safety; for example, it gives confidence that a motor will shut down quickly enough to prevent harm to an operator who opens a guard door or that a robot will operate at a reduced speed and force when a human is nearby.

Standards

Harmonized standards for Machinery Directive 2006/42/EC
EN ISO 13849 and other safety standards as a part of the machinery directive
Presumption of conformity: “Where products have been designed in accordance with suitable harmonized Standards, market surveillance authorities are obliged to assume that the essential requirements of the applicable directive(s) which are covered by these Standards have been fulfilled. The products are deemed to comply with the directives (presumption of conformity).”

The key functional safety standard is IEC 61508. The first revision of this standard was published in 1998, with revision two published in 2010 and work beginning in 2017 to update to revision three, with a probable completion date of 2022. Since the first edition of IEC 61508 was published in 1998, the basic IEC 61508 standard has been adapted to suit fields such as automotive (ISO 26262), process control (IEC 61511), PLCs (IEC 61131-6), machinery (IEC 62061), variable speed drives (IEC 61800-5-2), and many other areas. These other standards help interpret the very broad scope of IEC 61508 for these more limited fields.

Some functional safety standards, such as ISO 13849 and DO-178/DO-254, have not been derived from IEC 61508. Nevertheless, anybody familiar with IEC 61508 reading these standards would not be too surprised by the contents.

Within a safety system, it is the safety functions that perform the key functional safety activities when the system is running. A safety function defines an operation that must be carried out to achieve or maintain safety. A typical safety function contains an input subsystem, a logic subsystem, and an output subsystem. Typically, this means that a potentially unsafe state is sensed, and something makes a decision on the sensed values and, if deemed potentially hazardous, instructs an output subsystem to take the system to a defined safe state.

The time between the unsafe state arising and the achievement of a safe state is critical. A safety function might, for instance, consist of a sensor to detect that a guard on a machine is open, a PLC to process the data, and a variable speed drive with a safe torque off input that kills a motor before a hand inserted in a machine can reach the moving parts.

Safety Integrity Levels

However, there is also a certain risk for the manufacturer: an accident which could have been avoided by following EN ISO 13849-1 can turn into a problem for the manufacturer and the responsible persons involved. Essential for the clarification of liability is the state of technology, which is no longer represented by EN 954-1, based on deterministic approaches and the choice of the control architecture, but by the more broadly defined EN ISO 13849.

SIL stands for safety integrity level and is a means to express the required risk reduction needed to reduce the risk to an acceptable level. According to IEC 61508, the safety integrity levels are 1, 2, 3, and 4, with an order of magnitude increase in safety as you go from one level to the next. SIL 4 is not seen in machinery and factory automation, where generally no more than one person is exposed to a hazard at a time. It is rather reserved for applications like nuclear and rail, where hundreds or even thousands of people can be hurt. There are also other functional safety standards, such as the automotive standard ISO 26262, which uses ASILs (Automotive Safety Integrity Levels) A, B, C, and D, and ISO 13849, whose performance levels a, b, c, d, and e can be mapped to the SIL 1 to SIL 3 scale.
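
For orientation, the sketch below maps a PFH value to its SIL band per IEC 61508 (high demand/continuous mode). It is deliberately simplified: a real SIL claim also depends on the safe failure fraction, hardware fault tolerance, and systematic capability.

    # PFH-to-SIL band lookup per IEC 61508-1 (high demand / continuous mode).

    PFH_BANDS = [  # (SIL, lower bound inclusive, upper bound exclusive), per hour
        (4, 1e-9, 1e-8),
        (3, 1e-8, 1e-7),
        (2, 1e-7, 1e-6),
        (1, 1e-6, 1e-5),
    ]

    def sil_from_pfh(pfh):
        for sil, lo, hi in PFH_BANDS:
            if lo <= pfh < hi:
                return sil
        return None  # outside the tabulated bands

    print(sil_from_pfh(5e-8))  # 3
    print(sil_from_pfh(3e-7))  # 2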

The author is not convinced that a claim of greater than SIL 3 is possible for a single IC. However, it is noted that the tables in Annex F of IEC 61508-2:2010 show a SIL 4 column.

The Three Key Requirements

Required risk minimisation and Performance Level:
Severity of injury:
  • S1 slight (usually reversible injury)
  • S2 serious (usually irreversible injury which may include death)
Frequency and / or duration of exposure to danger:
  • F1 rarely up to not very frequently and / or the time of exposure to danger is short
  • F2 frequently up to continuously and / or the time of exposure to danger is long
Probability of avoiding the exposure [P]
  • P1 possible under certain conditions
  • P2 rarely possible

Functional safety imposes three key requirements on the development of ICs. These requirements are explored in the following sections.

Requirement 1—Follow a Rigorous Development Process

IEC 61508 is a full lifecycle model. It covers all the phases from safety concept, to requirements capture, to maintenance, and, eventually, to the disposal of the item. Not all of these phases are relevant to an integrated circuit and training and experience are required to identify those that are. IEC 61508 offers a V model for an ASIC and, along with the review, audit, and other requirements in IEC 61508, it represents a system that, while it cannot guarantee safety, has been shown to generate safe systems and ICs in the past.

Most IC manufacturers already have rigorous new product development standards because of the high cost of changing a faulty integrated circuit. A set of masks alone for a low geometry process can cost over $500k. This, and the long lead times, already force integrated circuit designers to implement a rigorous development process with good verification and validation stages. One of the big differences for functional safety is that safety must not only be achieved but also demonstrated, so even the best IC manufacturers will be required to add a safety process on top of their normal development process to ensure that the correct evidence of compliance is created and archived.

Faults introduced by the development process are referred to as systematic faults. These are faults that can only be fixed through a design change. These faults can include faults related to requirements capture, insufficient EMC robustness, and insufficient testing.

Annex F of IEC 61508-2:2010 lists a set of dedicated measures that the experts on the IEC committee deemed suitable for use in developing integrated circuits. Table F.2 applies to FPGAs and CPLDs, while table F.1 applies to digital ASICs. The measures are given as R (recommended) or HR (highly recommended), depending on the SIL, and, in some cases, alternative techniques are offered. Very few of the requirements should be of much surprise to an IC supplier with a good development process, but the requirement for 99% stuck-at fault coverage for SIL 3 is challenging, especially for small digital or mixed-signal parts where a lot of the circuitry is at the periphery of the block. The requirements in revision two of the standard are only for digital ICs, but many can also be applied to analog or mixed-signal ICs (the next revision of ISO 26262 will contain similar tables and has versions for analog and mixed-signal integrated circuits).

In addition to tables F.1 and F.2, there is some introductory text that also gives insights. For instance, this introductory text contains an allowance to use proven in use tools, and it offers a suggestion of 18 months of use across projects of similar complexity as being reasonable. This means that the full tool requirements from IEC 61508-3 need not apply.

A proven in use claim may be available to module/systems designers if they have successfully used an IC in the past and know the application and the failure rate from the field. This claim is much harder for integrated circuit designers or manufacturers to make as they generally do not have enough knowledge of the final application or what percentage of the failing units from the field are returned to them for analysis.

Software

All software errors are systematic because software does not age. Therefore any on-chip software should consider the requirements of IEC 61508-3. Typically, on-chip software might include a kernel/bootloader on a microcontroller/DSP. However, in some cases the microcontroller/DSP could contain a small microcontroller preprogrammed by the IC manufacturer to implement a block of logic instead of using a state machine. This preprogrammed microcontroller software would also need to meet the requirements of IEC 61508-3. Application level software is typically the responsibility of the module/system designer as opposed to the IC manufacturer, but the IC supplier may need to provide tools such as compilers or low level drivers. If those tools are used in the development of safety related application software then the IC manufacturer would need to supply enough information for the end user to meet the tool requirements in IEC 61508-3:2010 clause 7.4.4.

The author has also programmed in C and in many other programming languages. He has done a limited amount of Verilog programming. Verilog and its sister VHDL are examples of two HDLs (hardware description languages) used to design digital integrated circuits. The question as to whether an HDL is software is an interesting one, but for now following IEC 61508-2:2010 Annex F is sufficient. In practice, the author has found that if Annex F is followed, then, in combination with the other requirements of IEC 61508 (the life cycle phases, etc.), it doesn’t really matter whether HDL is considered software or not, as the developer still ends up doing all the required tasks. A related interesting standard is IEC 62566, which deals with safety functions for the nuclear industry developed using an HDL.

Requirement 2—Be Inherently Reliable

IEC 61508 imposes reliability requirements in the form of a PFH (average frequency of dangerous failure per hour) or PFD (probability of failure on demand). These limits are tied to an adult’s risk of dying from natural causes and the idea that going to work or about your daily business should not significantly increase this risk. The maximum PFH for a SIL 3 safety function is 10⁻⁷/h, a dangerous failure rate of approximately once per 1000 years. Expressed in FIT (failures in time, i.e., failures per billion hours of operation), this is 100 FIT.

Given that a typical safety function has an input block, a logic block, and an actuator block, and that the PFH budget must be allocated across all three blocks, it is entirely possible that the PFH budget for a given IC can be in the single digits (<10 FIT). Redundant architectures can be used to allow higher numbers, so that two items of 100 FIT each can give confidence equivalent to one item with a reliability of 10 FIT, limited by CCF (Common Cause Failure) concerns. However, redundancy consumes a lot of space and energy, and adds to the cost.
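
The arithmetic behind these statements is worth making explicit. In the sketch below, the 35/15/50 split of the SIL 3 budget across the blocks is a hypothetical allocation, and the 10% beta factor is likewise only an assumed value.

    # SIL 3 PFH budget split across a safety function, in FIT (1 FIT = 1e-9/h).

    SIL3_BUDGET_FIT = 100.0  # 1e-7/h expressed in FIT

    allocation = {"sensor": 0.35, "logic": 0.15, "actuator": 0.50}
    budget = {blk: SIL3_BUDGET_FIT * share for blk, share in allocation.items()}
    print(budget)  # the logic block gets only 15 FIT: single-digit IC targets

    # Redundancy helps, but common cause failures set a floor: with a beta
    # factor of 10%, two 100 FIT channels still share about 10 FIT of CCF.
    beta, lam_channel = 0.10, 100.0
    print(f"CCF floor: {beta * lam_channel:.0f} FIT")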

IC manufacturers such as Analog Devices supply reliability information for all their released ICs on sites such as analog.com/reliabilitydata, based on accelerated life testing. This is sometimes frowned upon because the reliability evaluation is done in a lab under artificial conditions. Instead, the use of industry-wide standards such as SN 29500 or IEC 62380 is recommended. These standards, however, have a number of issues:

  • They predict reliability at the 99% confidence level, and IEC 61508 only requires data at the 70% confidence level, so the standards are pessimistic.

  • They mix random and systematic failure modes. These are meant to be dealt with differently under IEC 61508.

  • They are not frequently updated.

  • They make no allowance for the quality differences between suppliers.

What standards such as SN 29500 do demonstrate is how reliable on-chip transistors really are. If two ICs of 500k transistors each are used to implement a safety function they would have a FIT of 70 each for a total system FIT of 140. However, if the two ICs are replaced by one IC of one million transistors, the FIT for that one IC is only 80, which is a reduction of over 40%.

Soft errors are often neglected within ICs. Soft errors differ from traditional reliability predictions in that they disappear once the power is cycled. They are caused by neutron particles from space or alpha particles from the packaging material striking on-chip RAM cells or flip-flops (FFs) and changing the stored value. ECC (double bit error detection and single bit error correction) can be used to detect and seamlessly correct errors in RAMs, but at a cost of reduced speed and higher on-chip area. Parity adds less overhead but leaves the system designer to solve the error recovery issue. If parity or ECC techniques are not used, the soft error rate can exceed the traditional hard error rate by up to a factor of 1000 (IEC 61508 offers a figure of 1000 FIT/MB for RAMs). The techniques available to address soft errors in the FFs used to implement logic circuits are not as satisfactory, but watchdog timers, time redundancy in calculations, and other techniques can help.
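
The scale of the problem is easy to estimate from the figure just quoted. In the sketch below, the RAM size and the assumed ECC effectiveness are hypothetical.

    # Rough soft-error contribution for on-chip RAM, using the 1000 FIT/MB
    # figure offered by IEC 61508. RAM size and ECC coverage are assumptions.

    SOFT_ERROR_FIT_PER_MB = 1000.0

    def ram_soft_error_fit(ram_mb, ecc_coverage=0.0):
        """FIT remaining dangerous after ECC handles a fraction of the events."""
        return SOFT_ERROR_FIT_PER_MB * ram_mb * (1.0 - ecc_coverage)

    print(ram_soft_error_fit(0.5))                     # 500 FIT, unprotected
    print(ram_soft_error_fit(0.5, ecc_coverage=0.99))  # 5 FIT with SECDED ECC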

Requirement 3—Be Fault Tolerant

No matter how reliable the product, bad things will sometimes still happen. Fault tolerance accepts this reality and then addresses it. Fault tolerance has two main elements. One is the use of redundancy and the other is the use of diagnostics. Both accept that failures will occur no matter how good the reliability of the ICs or the development process used to develop the IC.

Redundancy can be identical or diverse, and it can be on-chip or off-chip. Annex E of IEC 61508-2:2010 offers a set of techniques to demonstrate that sufficient measures have been taken to support claims for on-chip redundancy in digital circuits using nondiverse redundancy. Annex E appears to have been targeted at dual lock-step microcontrollers and no guidance is given for on-chip independence for

  • Analog and mixed-signal integrated circuits

  • Between an item and its on-chip diagnostics

  • Digital circuits employing diverse redundancy

However, in some cases Annex E can be intelligently interpreted for these cases. An interesting item within Annex E is the βIC calculation, which is a measure of on-chip common cause failures. It allows a judgment of sufficient separation provided the sources of common cause failure represent a β of less than 25%, which is high in comparison to the 1%, 5%, or 10% found in the tables of IEC 61508-6:2010.

Diagnostics are an area in which integrated circuits can really shine. On-chip diagnostics can

  • Be designed to suit the expected failure modes of the on-chip blocks

  • Add no PCB space due to the limited requirement for external pins

  • Operate at a high rate (minimum diagnostic test interval)

  • Obviate the need for redundant components to implement diagnostics by comparison

This means that on-chip diagnostics can minimize the system cost and area. Generally, the diagnostics are diverse (a different implementation) from the item they monitor on-chip, so it is unlikely they will fail in the same way and at the same time as the item they are monitoring. When they do, it is likely that they would have had the same issues (often related to EMC, power supply issues, and overtemperature) even if the diagnostics were implemented in a separate chip. While the standard does not contain the requirement, there are concerns related to using on-chip power supply monitors and watchdog circuits, which are diagnostics of last resort. Some external assessors will insist on such diagnostics being off-chip.

Generally, the diagnostics on simpler integrated circuits will be controlled by a remote microcontroller/DSP with measurements done on-chip but the results shipped off-chip for processing.

IEC 61508 requires minimum levels of diagnostic coverage given as SFF (safe failure fraction), which considers safe and dangerous failures and is related to, but different from, DC (diagnostic coverage), which neglects safe failures. The success of the implemented diagnostics can be measured using a quantified FMEA or FMEDA. However, the diagnostics implemented within an IC can also cover components external to the IC, and items within the IC can be covered by system-level diagnostics. When an IC developer performs the FMEDA, the assumptions must be stated, because the IC developer doesn’t generally know the details of the final application. In ISO 26262 terminology, this is known as an SEooC (Safety Element out of Context). For end users to make use of the IC-level FMEDA, they must satisfy themselves that the assumptions still hold for their system.

 

While Table A.1 (and indeed Tables A.2 to A.14) of IEC 61508-2:2010 gives good guidance on the IC faults that should be considered when analyzing an IC, an even better discussion of the topic is given in Annex H of IEC 60730:2010.

Development Options for an Integrated Circuit

Classification of the Mean Time To dangerous Failure (MTTFd) values of an IC device according to the standard:
Total period: 3 up to 100 years:
  • 3 up to 10 years = low
  • 10 up to 30 years = medium
  • 30 up to 100 years = high
MTTFd values exceeding 100 years are capped at 100 years in the safety calculation
MTTFd is a measure of the quality of the component

There are several options for developing integrated circuits to be used in functionally safe systems. There is no requirement in the standard to only use compliant integrated circuits, but rather the requirement is that the module or system designers satisfy themselves that the chosen integrated circuit is suitable for use in their system.

The available options include

  • Developing fully in compliance to IEC 61508 with an external assessment and safety manual

  • Developing in compliance to IEC 61508 without external assessment and with a safety manual

  • Developing to the semiconductor companies’ standard development process but publishing a safety data sheet

  • Developing to the semiconductor companies’ standard process

Note: for parts not developed to IEC 61508, the safety manual may be called a safety data sheet or similar, to avoid confusion with parts developed in compliance and supplied with a safety manual.

Option 1 is the most expensive option for semiconductor manufacturers, but also potentially the most beneficial to module or system designers. Having such a component, where the application shown in the safety concept for the integrated circuit matches that of the system, reduces the risk of running into problems with the external assessment of the module or system. The extra design effort for a SIL 2 safety function can be on the order of 20% or more. The extra effort would probably be higher, except that semiconductor manufacturers typically already apply a rigorous development process even without functional safety.

Option 2 saves the cost of external assessment but otherwise the impact is the same. This option can be suitable where customers are going to get the module/system externally certified anyway and the integrated circuit is a significant part of that system.

Option 3 is most suitable for already released integrated circuits where the provision of the safety data sheet can give the module or system designer access to extra information that they need for the safety design at the higher levels. This includes information such as details of the actual development process used, FIT data for the integrated circuit, details of any diagnostics, and evidence of ISO 9001 certification for the manufacturing sites.

Option 4 will, however, remain the most common way to develop integrated circuits. Use of such components to develop safety modules or systems will require additional components and expense for the module/system design, because the components will not have sufficient diagnostics, requiring a dual-channel architecture with comparison as opposed to a single-channel architecture. Without a safety data sheet, the module/system designer will also need to make conservative assumptions and treat the integrated circuit as a black box.

In addition, semiconductor companies need to develop their own interpretations of the standards and the author’s own company has developed internal documents ADI61508 and ADI26262 for this purpose. ADI61508 takes the seven parts of IEC 61508:2010 and interprets the requirements in terms of an integrated circuit development.

A SIL 2/3 Development

Sometimes an integrated circuit can be developed to all the systematic requirements for SIL 3. This means all of the relevant items from table F.1 of IEC 61508-2:2010 for SIL 3 are observed and all of the design reviews and other analyses are done to a SIL 3 level. However, the hardware metrics may only be good enough for SIL 2. Such a circuit could be identified as SIL 2/3 or, more typically, SIL M/N, where M represents the maximum SIL that can be claimed in terms of the hardware metrics and N the maximum SIL that can be claimed in terms of the systematic requirements. Two SIL 2/3 integrated circuits can be used to implement a SIL 3 module or system, because having two SIL 2 items in parallel upgrades the combination to SIL 3 in terms of hardware metrics, while each item is already at SIL 3 in terms of the systematic requirements. If instead the integrated circuits were only SIL 2/2, putting two such integrated circuits in parallel would still not make the combination SIL 3, as it would be SIL 3/2 at best.
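
The combination rule described here can be captured in a few lines. The sketch below is an illustrative reading of the text for a simple 1oo2 arrangement, not a normative calculation.

    # SIL M/N combination for two identical ICs in parallel (1oo2).

    def combined_sil(hw_sil, sys_sil):
        hw_combined = min(hw_sil + 1, 4)  # redundancy lifts the hardware metrics claim
        return min(hw_combined, sys_sil)  # systematic capability caps the result

    print(combined_sil(2, 3))  # 3: two SIL 2/3 ICs can support a SIL 3 function
    print(combined_sil(2, 2))  # 2: two SIL 2/2 ICs remain limited to SIL 2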

Applying the Hardware Metrics to an Integrated Circuit

Except in cases where almost the entire safety function is implemented by an integrated circuit, it is very hard to specify SFF, DC, or PFH limits for a semiconductor. Taking SFF as an example: while the SFF is required to be greater than 99% for SIL 3, this applies to the entire safety function rather than to the integrated circuit. If the integrated circuit comes in at 98%, it can still be used to implement a SIL 3 safety function, but other parts of the system will need to achieve a higher coverage to compensate. The safety manual or safety data sheet for the integrated circuit needs to publish the λDD, λDU, and λ values for use in the system-level FMEDA.
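
The sketch below shows how published IC-level lambdas might roll up into a system-level FMEDA, using hypothetical values in which an IC at 98% SFF is compensated by a higher-coverage remainder of the loop.

    # System-level FMEDA roll-up from element lambdas, in FIT. Values are hypothetical.

    def fmeda_rollup(elements):
        """elements: iterable of (lambda_safe, lambda_dd, lambda_du)."""
        tot_s = sum(e[0] for e in elements)
        tot_dd = sum(e[1] for e in elements)
        tot_du = sum(e[2] for e in elements)
        sff = (tot_s + tot_dd) / (tot_s + tot_dd + tot_du)
        dc = tot_dd / (tot_dd + tot_du)  # DC neglects safe failures
        return sff, dc

    ic = (50.0, 48.0, 2.0)    # the IC alone: SFF = 98/100 = 98%
    rest = (60.0, 89.7, 0.3)  # high-coverage remainder of the safety function
    sff, dc = fmeda_rollup([ic, rest])
    print(f"system SFF = {sff:.2%}, DC = {dc:.2%}")  # SFF back above 99%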

Ideally, the IC requirements would be derived from a system-level analysis, but often this is not the case, and the development is effectively an SEooC (see ISO 26262), a safety element out of context. In the case of an SEooC, the IC developer needs to make assumptions about how the IC will be used in systems. The system or module designer must then compare these assumptions to their real system to see if the functional safety of the IC is sufficient for their system. These assumptions can decide whether a diagnostic is implemented on the IC or at the system level, and so impact IC-level features and capabilities.

Security

A system cannot be safe unless it is also secure. Presently, the only guidance in IEC 61508 or ISO 26262 related to security is to refer the reader to the IEC 62443 series. However, IEC 62443 appears to be targeted more at larger components, such as entire PLCs, than at individual ICs. The good news is that most of the requirements in the functional safety standards to eliminate systematic faults also apply to security. The lack of any references is interesting because, in some cases, hardware can supply a hardware root of trust and features like a PUF (physically unclonable function), which are important for safety and security.

Conclusions

Safety-critical automotive applications have stringent demands for functional safety and reliability. Traditionally, functional safety requirements have been managed by car manufacturers and system providers. However, with the increasing complexity of the electronics involved, the responsibility for addressing functional safety is now propagating through the supply chain to semiconductor companies and design tool providers. This CRC Press News introduces some basic concepts of functional safety analysis and optimization and shows the bridge to the traditional design flow. Considerations are presented on how design methodologies are capturing and addressing the new safety metrics.

The existing IEC 61508 covers everything from developing an integrated circuit to an oil refinery. While there are dedicated sector specific standards for such areas as machinery and process control, and, while there is some guidance in IEC 61508 revision two for integrated circuits, there is no standard specific to integrated circuits. The lack of specific requirements leaves the requirements open to interpretation and therefore conflicts can arise between the expectations of multiple customers and external assessors.

This means that sectors will be inclined to make sector-specific requirements for integrated circuits in their higher level standards. Such requirements can already be seen in standards such as EN 50402, but most especially in the 2016 draft of ISO 26262, where a new part, part 11, deals specifically with integrated circuits.

It is expected that revision 3 of IEC 61508, due to be published sometime around 2021, will expand and clarify the guidance on integrated circuits. The author is lucky to be part of IEC TC65/SC65A MT 61508-1/2 and MT 61508-3, and so will get a chance to participate in such endeavors. Perhaps a future revision might have a part 8 dedicated just to semiconductors, so that there is consistency across the sectors, allowing integrated circuits to be developed that meet the requirements of all the sectors.

Even then, it is unlikely that the standard will contain everything that an IC manufacturer needs to design an IC with functional safety requirements. Requirements related to security, EMC, etc., will still need to be derived from systems application knowledge in Green Electronics Manufacturing for Creating Environmental Sensible Products.

Safety architectures of hybrid electric and electric vehicles

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Environmental Science, Information Technology, Nanoscience & Technology, Occupational Health & Safety

Chapter 3 of Dr. John X. Wang’s most recent book, “Industrial Design Engineering: Inventive Problem Solving,” covers:

  • Safety, Reliability, and Risk Management

Vehicle Electrification is a global automotive market trend that self-driving cars and trucks are adopting to help centralize control and seamlessly exchange data and information across systems, to reduce hazards, decrease emissions, and optimize traffic. The growing focus on the electrification of vehicles is supporting the evolution of safety electronics demand for both cars and trucks, and is helping to enhance the user experience by integrating smartphones and electronic devices into

  • high-end infotainment,

  • Advanced Driver Assistance Systems (ADAS),

  • digital clusters, and

  • telematics applications.

Several top manufacturers are working not only with embedded electronic technologies to increase safety and reduce vehicle weight, but also on alternative propulsion technologies like flexible fuel, natural gas engines, or Hybrid Electric Vehicles (HEVs) and all-Electric Vehicles (EVs), integrating efficient functional systems to achieve higher fuel efficiency. This is boosting the demand for

  • safety solutions,

  • electrification of engine mechanism,

  • propulsion technologies, and

  • infotainment innovations.

This accelerated electrification of vehicles, together with the public adoption of the connected vehicle concept and the integration of advanced safety features, is driving up the demand for reliable and robust E/E systems. The global automotive safety electronics market is expected to reach around $40 billion by 2023, with a 12% Compound Annual Growth Rate (CAGR) between 2017 and 2023, according to the Automotive Safety Electronics Market - Global Outlook and Forecast 2018-2023 report from Research and Markets.

Driven by government regulations, the extended applications for E/E systems embedded across HEV/EV segments are being engineered to comply with the highest ISO 26262 Automotive Safety Integrity Level (ASIL-D), guaranteeing safe state activation when something out of the ordinary happens, which is especially critical in autonomous vehicles. All these E/E systems require a safety Microcontroller Unit (MCU) and a reliable, safe source of power connected to the battery of the vehicle: the System Basis Chip (SBC). MCUs and SBCs together are the backbone of embedded architectures that include independent hardware monitoring, simplifying Electronic Control Unit (ECU) design.

Autonomous vehicles demand advanced safe and secure architectures (with dedicated quantitative and qualitative safety analyses) to size the risk, improve system robustness, and predict system behavior after failure through configurable fail-safe or fail-silent behaviors. The electrification of vehicles trend requires reliable E/E systems capable of making decisions and acting like a human driver, or close enough, combining functional safety and electric control systems to decide and act in applications like parking brake, steering, powertrain, anti-lock braking, or transmission systems.

Functional safety is key to ensuring that products operate safely and, even if they fail, are still capable of entering a controlled safe operation mode. Say you want to make a left turn using your electric power steering and the control unit malfunctions. With functional safety and enough redundancy, the car will give you degraded steering assistance to move it to a safe place.

Think about the modern car. It’s more complex than ever, with increasing electronics and millions of lines of code running it. As our car becomes more automated, the complexity will continue to rise.

It makes functional safety even more important to automakers. They can’t choose to ignore it.

Today, vehicles operate with a traditional fail-safe engine control unit architecture. This detects the fault and transitions the system to a safe state, but in the end the driver is still able to take back control of the vehicle.

Gradually, as electronic systems evolve to Levels 4 and 5, the dependence on the driver diminishes as the vehicle has sufficient redundancy and diversity to continue full operation despite the detection of a fault.

System failure prevention: from fail-safe system architectures

In a fail-safe architecture, the power supply delivers power to the microcontroller and the other peripherals and monitors for over- and under-voltage. It is also in charge of sensing and evaluating safe MCU operation through the watchdog and hardware error monitoring functions. If a fault is detected, the system goes into a safe state (driven by the safety power supply), which guarantees that the function is maintained in a known and defined state (not uncontrolled).
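
The watchdog relationship can be pictured with a small simulation. The sketch below is conceptual: the timeout, the kick interface, and the safe-state action are hypothetical stand-ins for what the safety power supply or SBC does in hardware.

    # Conceptual watchdog monitor: the supervisor expects periodic "kicks"
    # from the MCU and drives the safe state when they stop arriving.

    import time

    class WatchdogMonitor:
        def __init__(self, timeout_s=0.05):
            self.timeout_s = timeout_s
            self.last_kick = time.monotonic()
            self.safe_state = False

        def kick(self):  # called by healthy MCU software
            self.last_kick = time.monotonic()

        def poll(self):  # runs in the independent supervisor
            if time.monotonic() - self.last_kick > self.timeout_s:
                self.safe_state = True  # e.g., cut the PWM, open the safety path
            return self.safe_state

    wdt = WatchdogMonitor()
    wdt.kick()
    print(wdt.poll())  # False: the MCU is alive
    time.sleep(0.06)   # the MCU hangs past the timeout
    print(wdt.poll())  # True: the supervisor drives the safe state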

To fail-operational system architectures: How do they work?

As vehicles move beyond the first levels of automation, new fail-operational system architectures are required to add more functionality to the vehicle. Fail-operational systems guarantee the full or degraded operation of a function even if a failure occurs. In this instance, the target applications are characterized as needing

  • high-performance,

  • high level of safety integrity, and

  • high level of availability.

Fault detection and reaction are controlled by independent hardware, since a fail-operational system includes a minimum of two fail-silent units. To remove common cause failures, even the supply is ensured by redundant and independent batteries.

Depending on the SAE level targeted by the car maker, the backup function may be used for several seconds to several minutes.

  • For Level 3 automation, the driver is informed by the system that there is a failure and must take back control of the vehicle.

  • Starting at Level 4, the driver is no longer informed of a fault, so the robot (car) will most likely park the vehicle in a safe area for the occupants of the vehicle and the other road users.

Safety architectures and system design aim to enable full redundancy to facilitate higher levels of autonomous driving and fault tolerance in the case of failure.

Design for functional safety of hybrid electric and electric vehicles

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Environmental Science, Ergonomics & Human Factors, Nanoscience & Technology, Public Administration & Public Policy

With the introduction of electronic control units to automotive vehicles, system complexity has increased. With this change in complexity, new standards have been created to ensure safety at the system level for these vehicles. Furthermore, vehicles have become increasingly complex with the push for electrification of automotive vehicles, which has resulted in the creation of hybrid electric and battery electric vehicles.

Hybrid electric and electric vehicles today have more computing power than a fighter jet from the 1960s. There has been a huge influx of electronics integrated throughout the vehicle, including the engine, power train, chassis, body systems, comfort systems, active safety, and Advanced Driver-Assistance Systems (ADAS). The value of all this technology is upwards of 30% of a hybrid electric or electric vehicle's total production cost.

Electrification of automotive vehicles has become much more common in recent years, specifically with the development of hybrid electric vehicles, such as the Chevrolet Volt, and fully electric vehicles, like the Tesla Model S. Hybrid electric vehicles are the stepping stone to fully electric vehicles and have a wide array of powertrain configurations that the vehicles can use. They start off with a conventional powertrain:

  • an internal combustion engine

  • a transmission

and add high voltage components to provide power to an electric motor that can supply torque to the wheels of the vehicle.

The two major kinds of powertrains are

  • series-hybrid vehicles

  • parallel-hybrid vehicles

Parallel-hybrid vehicles use both the internal combustion engine and the electric motor to supply torque to the wheels of the vehicle. The use of both electric and internal combustion torque sources creates three major benefits for the vehicle.

  • First, it has the ability to supply a large amount of torque directly to the wheels because the vehicle has multiple torque sources.

  • The second is that it has the ability to run in a conventional mode in case the battery depletes before the driver can charge the battery.

  • Third, plug-in parallel-hybrid electric vehicles have an all-electric range that does not require petroleum to be used.

Series-hybrid vehicles still have both an internal combustion engine and an electric motor, but:

  • the electric motor is the only component that provides torque to the wheels, and

  • the internal combustion engine acts as a generator to charge the battery, enabling the vehicle to go a farther distance.

This powertrain is the most similar to an all-electric vehicle because only the motor provides torque to the wheels, just with the added benefit of being able to charge the battery during use rather than having to charge once it is parked.

The problem is that, unlike a mechanical system, an electronic system's reliability cannot be easily evaluated, and this fact has already led to nationwide recalls and other major problems. Many of these events were linked to

  • faulty electronics in components or sub-systems that did not have proper quality or resistance levels,

  • an electronic design that didn’t integrate all possible failure modes, or

  • control units that are unable to communicate with each other.

Complexity is creating challenges for safety engineers – which is why almost all hybrid electric and electric vehicles development teams are building their vehicles around the ISO 26262 standard.

Both kinds of hybrid vehicles have different placements for the electric motor and can use multiple electric motors in the powertrain.

  • The first is position one, where the electric motor is mounted directly on the engine. It is generally used to enable stop-start operation, allowing the internal combustion engine to shut down when the vehicle stops and then be re-started by the electric motor. These motors are generally used in conjunction with at least one other motor somewhere else in the vehicle.

  • The next placement is position two, which is between the engine and the transmission of the vehicle. This position is much more common.

  • A position three, or P3, motor is placed after the transmission along the driveshaft.

ISO 26262 defines the entire production process with the goal of minimizing the risk of malfunction in electronic safety-related systems. It also details requirements for supporting processes, including software tools used in the design, test, and manufacture of semiconductors. And while it includes a section (part 5) on hardware development, the standard originally had no specific guidelines for semiconductors themselves.

ISO 26262 makes the complex electrical systems that have created huge unseen costs much safer and more reliable. It is the industry's attempt at establishing best practices for designing reliable and safe automotive electronics systems. The standard requires that car makers perform an evaluation of the vehicle design to create an Automotive Safety Integrity Level (ASIL) rating that describes the failure impact based on exposure, controllability, and severity. Car companies use this evaluation to design a vehicle's electronic system architecture. The architecture's requirements will be shared along the supply chain and may have an impact on component selection. ISO 26262 facilitates in-depth discussion between the major stakeholders in the production process, including equipment partners, component companies, and car makers.

So, in ISO 26262 as published in 2018, you can find an entirely new section (part 11) for semiconductor and silicon IP suppliers that supports the existing part 5. The section includes tips, recommendations, and examples for creating ISO 26262 compliant ICs and IP, and includes information on failure rates, transient faults, and diagnostic coverage. While it doesn't cover every aspect of design and test for high quality and reliability, it offers important guidelines. Some aspects of creating high quality and high-reliability ICs and IP, such as improving test and diagnosis of the analog components, are still up to the savvy IC design team.

For example, switches are not the most expensive component within automotive electronics, but they are critical to the success of any project because of their role within electronic units. ISO 26262 doesn’t provide straightforward guidance as to which switches must be used in specific situations, which can be frustrating for some engineers. Rather, the standard provides guidance on the necessary reliability or self-controllability needs based on the ASIL safety level. Below are the five most common areas engineers must keep in mind while designing hybrid electric and electric vehicles:

  • There can be no misinterpretation of turn signals. This means that car makers should look into switches with a clear mechanical position or electrical signal.

  • Electrical systems must be able to self-detect failures. This is why selecting double-throw switches – which enable mechanics to check and see if a line is open, broken or neither – or switches with differential impedance, is important in electrified handles or latch systems.

  • Reliability is paramount under ISO 26262. Hybrid electric and electric vehicle development teams should incorporate redundancy over the switches to facilitate monitoring on two parallel circuits, which typically requires double-pole switches. If you need high intensity within your system, a double switch solution with haptic on one side and without on the other side will work well. Haptic technology recreates the sense of touch by applying forces, vibrations, or motions to the user.

  • Robust design is important in today’s vehicles, so predictable ageing is part of the standard. Engineers need to understand the implications of each component choice within their electronic unit design. When evaluating switches, look for vendors that are able to perform the necessary testing that guarantees the performance of their products.

  • ISO 26262’s emphasis on durability and reliability makes long-life and changeover switches the most common solution for any switch decision that needs to be made. Tact, snap and pushbutton switches make sense in specific instances but, depending upon the design, keeping things simple is often the best option for engineers.

Following ISO 26262 guidelines as well as the tips above, product designers will be in a prime position to avoid costly recalls and ensure the quality of their vehicles.

Inventive Problem Solving: decouple software and hardware development of autonomous driving

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Ergonomics & Human Factors, Information Technology, Occupational Health & Safety

Optimization

An autonomous car is a self-driving vehicle that has the capability to perceive the surrounding environment and navigate itself without human intervention. For autonomous driving, complex autonomous driving algorithms, including

  • perception,

  • localization,

  • planning,

  • and control,

are required with many heterogeneous sensors, actuators, and computers. To manage the complexity of the driving algorithms and the heterogeneity of the system components, engineers apply a distributed system architecture to the autonomous driving system; this requires a development process and a system platform for the distributed system of an autonomous car. The development process provides the guidelines to design and develop the distributed system of an autonomous vehicle. For the heterogeneous computing system of the distributed system, a system platform provides a common development environment by minimizing the dependence between the software and the computing hardware. A Controller Area Network (CAN) can be applied as the main network of the software platform to optimize the network bandwidth, fault tolerance, and system performance.
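
As a minimal illustration of this decomposition (every name, type, and return value below is a hypothetical placeholder, not taken from any particular platform), the four algorithm stages can be sketched as a pipeline whose stages could later be distributed across separate computing nodes:

```python
# Minimal sketch of the perception -> localization -> planning -> control
# decomposition; all names and return values are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class SensorFrame:
    camera: object   # raw image (e.g. a numpy array)
    lidar: object    # point cloud

def perceive(frame: SensorFrame) -> list:
    """Detect objects of interest in the raw sensor data."""
    return []  # placeholder: list of detected obstacles

def localize(frame: SensorFrame) -> tuple:
    """Estimate the vehicle pose from sensor data."""
    return (0.0, 0.0, 0.0)  # placeholder: (x, y, heading)

def plan(obstacles: list, pose: tuple) -> list:
    """Compute a collision-free trajectory."""
    return [pose]  # placeholder: list of waypoints

def control(trajectory: list) -> dict:
    """Turn the planned trajectory into actuator commands."""
    return {"steer": 0.0, "throttle": 0.0, "brake": 0.0}

def step(frame: SensorFrame) -> dict:
    # In a distributed system, each stage could run on its own computing
    # node, exchanging messages over the vehicle network (e.g. CAN).
    obstacles = perceive(frame)
    pose = localize(frame)
    trajectory = plan(obstacles, pose)
    return control(trajectory)

print(step(SensorFrame(camera=None, lidar=None)))
```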

Safety

However, future mobility is not only about optimization, but also about safety and the reduction of accidents. With vehicles’ increasing electrification and automation giving rise to more or less purely electronic systems – so-called X-by-Wire systems – their Electric/Electronic (E/E) architecture has to meet novel requirements, above all with regard to dependability. This means the underlying architecture must guarantee that safety-critical vehicle functionality is always available.

For example, in the case of Steer-by-Wire, it must be ensured that steering remains possible until the car has reached a full stop, even in the case of a failure.

Advantage of this new architecture:

  • Automatic compensation for the failure of a Steer-by-Wire function by another, still-operating component.

This so-called fail-operational behavior requires generic failure handling mechanisms, which can inherently be supported by a fail-safe approach.
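
As a toy illustration of this fail-operational pattern (the two-channel arrangement and single health flag below are hypothetical simplifications; real systems use far richer monitoring and voting), a redundant function can switch over to a still-operating component when the active one fails:

```python
# Hypothetical fail-operational switchover between two redundant
# Steer-by-Wire channels: if the active channel fails, the standby
# channel takes over so steering remains available.

class SteeringChannel:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def actuate(self, angle_deg: float) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed")
        return f"{self.name}: steering to {angle_deg:.1f} deg"

def steer(primary: SteeringChannel, backup: SteeringChannel,
          angle_deg: float) -> str:
    # Generic failure handling: try the active channel and fall back
    # to the redundant one, keeping the function operational.
    try:
        return primary.actuate(angle_deg)
    except RuntimeError:
        return backup.actuate(angle_deg)

a, b = SteeringChannel("channel-A"), SteeringChannel("channel-B")
print(steer(a, b, 5.0))   # channel-A handles the request
a.healthy = False         # inject a failure into the active channel
print(steer(a, b, 5.0))   # channel-B transparently takes over
```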

The R&D of a complete software architecture for autonomous vehicles would include:

  • the development of a high-level multiple-vehicle graphical console,

  • the implementation of the vehicles’ low-level critical software,

  • the integration of the necessary software to create the vehicles’ operating system,

  • the configuration and building of the vehicles’ operating system kernel,

  • the implementation of device drivers at the kernel level, specifically a complete Controller Area Network (CAN) subsystem for the Linux kernel featuring a socket interface, protocol stack, and several device drivers (a minimal user-space sketch of this socket interface follows this list).
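
The socket interface mentioned above is what Linux exposes today as SocketCAN. As a minimal sketch, assuming a virtual CAN interface named vcan0 has been brought up, one classic CAN frame can be sent from user space using only the Python standard library (Linux, Python 3.3+):

```python
# Sending one classic CAN frame through the Linux SocketCAN interface
# using only the Python standard library (Linux, Python 3.3+).
# Assumes a virtual CAN interface is up, e.g.:
#   sudo ip link add dev vcan0 type vcan && sudo ip link set up vcan0

import socket
import struct

can_id = 0x123                  # 11-bit arbitration ID (hypothetical)
data = b"\x01\x02\x03\x04"      # up to 8 payload bytes for classic CAN

# struct can_frame layout: ID (u32), DLC (u8), 3 pad bytes, 8 data bytes
frame = struct.pack("=IB3x8s", can_id, len(data), data.ljust(8, b"\x00"))

s = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
s.bind(("vcan0",))
s.send(frame)
print("sent", frame.hex())
s.close()
```

Receiving works symmetrically: s.recv(16) returns one frame that can be unpacked with the same struct layout.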

High-performance central computing units are replacing outmoded distributed computing architectures. A revolutionary new architecture is needed that can take advantage of what has become the state of the art in consumer electronics:

  • internet connectivity,

  • cloud computing,

  • swarm intelligence, and

  • over-the-air feature updates.

At the top of the forthcoming hierarchical software architecture will be the central computing platforms for vehicle domains such as infotainment, autonomous driving, and body control, including a communications server that links the central platform with electronic control units, sensors, and actuators.

Next down in the hierarchy are the integrated electronics control units, standard ECUs similar to those used today for electronic stability control or engine control, in which OEM-specific functions are integrated.

Further down the hierarchy are commodity control units, off-the-shelf parts with a range of standard functionalities (such as window lift or other body control functions). These standard ECUs will be free of any carmaker-specific functions or software code.

Inventive Problem Solving: benefits of decoupling software and hardware development

  • Flexibility: Carmakers can more easily update vehicles with new features after sale.

  • Far easier to develop software on known, standard operating systems/hardware platforms

  • Faster deployment by eliminating redundant validation and testing of reused software

  • Quicker deployment of diverse implementations of functionalities—e.g., fault-tolerant versions

  • Improved engineering productivity

  • Minimized risk of bringing updated features to market

  • Proven application software can be reused

These result in a significant reduction in the amount of software that must be developed and validated.

Model Based Software-in-the-Loop Testing of Closed-Loop Automotive Software

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology, Mathematics, Statistics

In the automotive industry, major innovations are nowadays driven by software. Over the past years, functionality realized by software grew from basic headlight control to advanced systems, such as Active Brake Assist (ABA) and Electronic Stability Control (ESC), interacting with multiple sensors, actuators, and other systems. Since software may directly influence driving behavior by controlling actuators, e.g., brakes, software failures may result in harm to passengers and to the environment. To increase the safety of software-controlled systems in the automotive domain, ISO 26262 recommends extensive testing for safety-critical software. In particular, testing safety-critical software using the final hardware is highly recommended by ISO 26262. However, removing a software failure in late development phases entails high costs. In order to detect and remove software faults as early as possible in the development process and thus decrease costs, software can be tested at the model level.

Modeling Test Scenarios

To create a test scenario, a scene has to be replicated virtually using high-fidelity models and simulation tools. A test scenario is implemented in the simulated environment using model elements such as the vehicle, vehicle environment sensors – radar, lidar, GPS, HD maps, etc. – the road, road traffic, various traffic participants – pedestrians, bicyclists, signs – and their expected behaviors. Once a test scenario has been modeled, the real benefit is that it can be translated into countless different test cases by managing parameter variations, as the sketch below illustrates.
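
A minimal sketch of that idea, with entirely hypothetical parameter names and values: one modeled scenario is expanded into a grid of concrete test cases by sweeping its parameters.

```python
# Expanding one modeled scenario into many concrete test cases by
# sweeping its parameters; all parameter names and values are
# illustrative only, not from any scenario standard.

from itertools import product

scenario = "cut-in on highway"
speeds_kph = [60, 90, 120]     # ego vehicle speed
gaps_m = [10, 20, 40]          # gap to the cutting-in vehicle
weather = ["dry", "rain", "fog"]

test_cases = [
    {"scenario": scenario, "ego_speed_kph": v, "gap_m": g, "weather": w}
    for v, g, w in product(speeds_kph, gaps_m, weather)
]

print(len(test_cases), "test cases from one scenario")  # 27
print(test_cases[0])
```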

Engineers who need to simulate many different test scenarios do not have to specify each one manually. Many scenario descriptions can be imported using resources and standards such as OpenDRIVE, OpenSCENARIO, and the Open Simulation Interface (OSI), among others. Engineers can also use their own captured sensor data from on-road vehicle testing. The test scenarios thus defined can then be used to validate various autonomous driving algorithms.

Software in the Loop (SIL)

Software in the Loop (SIL) is the testing of any software/firmware/algorithm/control system in which a piece of software – simulating a piece of hardware, a physical component, or a physical system, possibly including its response or other characteristics – sits in your communication stream, whether that stream is open-ended (feed-forward only) or closed with feedback. In short, SIL is any testing where a simulated piece of actual product hardware (software in place of hardware) is part of your test.

SIL versus Hardware in the Loop (HIL) testing

HIL or ‘Hardware in the Loop’ testing is by its very nature a resource-hungry solution to testing, requiring multi-skilled teams able to set up and configure both the execution platform and the I/Os as well as the modeling environment.

While filling an important part of the overall test process, HIL testing may take longer to surface significant issues than a SIL test scenario can, since SIL does not require test hardware to be available.

SIL testing and simulation can thus be a useful technique for software proving at earlier stages of the design. SIL has the following features:

  • SIL simulation represents the integration of compiled production source code into a mathematical model simulation, providing engineers with a practical, virtual simulation environment for the development and testing of detailed control strategies for large and complex systems.

  • With SIL, engineers can use a PC to directly and iteratively test and modify their source code, by directly connecting software to a digital plant model substituting for costlier systems, prototypes, or test benches (see the closed-loop sketch after this list). SIL makes it possible to test software prior to the initialization of the hardware prototyping phase, significantly accelerating the development cycle.

  • SIL enables the earliest detection of system-level defects or bugs, significantly reducing the costs of later-stage troubleshooting, when the number and complexity of component interactions is greater. SIL provides an excellent complement to traditional HIL simulation, while helping to accelerate time-to-market and ensuring more efficient software development.
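
A minimal closed-loop sketch of the idea, assuming a deliberately simple first-order plant model and a proportional controller standing in for the compiled production code (in a real SIL setup the controller step would call the generated code through a wrapper):

```python
# Minimal closed-loop SIL sketch: a controller step function (standing
# in for compiled production code) runs against a simple plant model.
# The first-order plant, gain, and tolerance are illustrative only.

def controller_step(setpoint: float, measurement: float) -> float:
    """Proportional controller; in a real SIL setup this would call
    the compiled production code through a wrapper."""
    kp = 10.0
    return kp * (setpoint - measurement)

def plant_step(state: float, u: float, dt: float) -> float:
    """Digital plant model, here a first-order system dx/dt = -x + u."""
    return state + dt * (-state + u)

state, setpoint, dt = 0.0, 1.0, 0.01
for _ in range(1000):               # 10 s of simulated time
    u = controller_step(setpoint, state)
    state = plant_step(state, u, dt)

assert abs(state - setpoint) < 0.2, "controller failed to track setpoint"
print(f"final plant output: {state:.3f}")
```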

SIL Simulation with a Top Model

You can test code generated from a model by running a top-model SIL simulation. With this approach:

  • You test code generated from the top model, which uses the standalone code interface.

  • You configure the model to load test vectors or stimulus inputs from the MATLAB workspace.

  • You can easily switch the top model between the normal and SIL simulation modes.

Software-in-the-Loop Modeling and Simulation

SIL is the inclusion of compiled production software code into a simulation model.

  • Software-in-the-loop M&S can be viewed as Simulation-based Software Evaluation.

  • A software system can be executed under simulated input conditions for the purpose of evaluating how well the software system functions under such input conditions.

  • For example, the software used to display operational information on a handheld computer can be executed under simulated input data (e.g., video, voice, images, text) received from many different sources for the purpose of evaluating how well the software satisfies its requirements.

  • Software-in-the-loop M&S is a cost-effective method for evaluating a complex, mission-critical software system before it is used in the real world.

Strategic Use of HIL and SIL

HIL and SIL simulations have long been used to test electronic control units (ECUs) and software. Now they have a new application field: calibration and parameterization of a vehicle stability controller, using simulation. A virtual calibration procedure like this requires far more precise models and new approaches to optimizing vehicle dynamics, and also raises a lot of development process issues.

In SIL, the actual Production Software Code is incorporated into the mathematical simulation that contains the models of the Physical System. This is done to permit inclusion of software functionality for which no model(s) exists, or to enable faster simulation runs. SIL

  • enables the inclusion of control algorithm functionality for which no model exists

  • increases simulation speed by including compiled code in place of interpretive models

  • verifies that code generated from a model will function identically to the model

  • guarantees that an algorithm in the modeling environment will function identically to that same algorithm executing in a production controller

SIL for early-stage testing of AUTOSAR software component

In embedded software, early-stage testing of source code is important since it may reduce future development cost. However, in the automotive domain, conventional software testing methods depend on the actual system and hardware, such as a Hardware-in-the-Loop simulator and the vehicle itself, so early-stage testing of automotive software is still immature. The testing of automotive software components is therefore delayed until they are developed enough to run on the actual hardware. Implementing best Inventive Problem Solving (IPS) practices for embedded software testing, we can develop a method to dynamically analyze and test AUTOSAR software components by SIL simulation without the target system hardware. It provides a rapid prototyping environment to validate the behavior of automotive software and helps to improve the quality of software components through early error detection.

A Tribute to Dr. Nikola Tesla: Best Inventive Problem Solving (IPS) Practices for Embedded Software Testing

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Mathematics

As the author of the book My Inventions: The Autobiography of Nikola Tesla, Dr. Nikola Tesla once said,

“Of all things, I liked books best.”

Dr. Nikola Tesla, one of the world's greatest electrical inventors and designers, was found dead on the night of January 7, 1943 in his suite at the Hotel New Yorker.

Shall the Safety Engineering Community respond to these challenges by radically re-thinking the architecture of the HIL test platform and defining a next generation approach with inventive problem solving? Such a new architecture could introduce the concept of a HIL-Bus to integrate the functionality of multiple existing HIL sub-systems and meet the needs of a function-centric test environment. Dr. Nikola Tesla said:

“If you want to find the secrets of the universe, think in terms of energy, frequency and vibration.”

Hardware-In-the-Loop (HIL) simulation is a technique used in the development and test of complex real-time embedded systems – thinking in terms of energy, frequency, and vibration. HIL simulation provides an effective platform by adding the complexity of the plant under control to the test platform.

Today, HIL simulation is a standard component in the vehicle development process as a method for testing electronic control unit (ECU) software. HIL simulation is used for all aspects of development, naturally including safety-relevant functions and systems. This applies to all test tasks (from function testing to release tests, testing a single ECU or an ECU network, and so on) and also to different vehicle domains: the drivetrain, vehicle dynamics, driver assistance systems, interior/comfort systems, and infotainment are all tested by HIL. Both the system modeling and control work can thus be tested and validated quickly and safely in a realistic environment using a HIL setup. Hence, the HIL testing methodology is the most time-efficient and cost-effective way to bring research and development of new systems and control algorithms one step closer to real applications and production.

Several commercial HIL tool chains for vehicles, such as LABCAR, dSPACE, VeHIL, and Autonomie, are available. These kinds of tools can be utilized not only to test functionalities but also to optimize the structure/parameters of one or even a network of control units. Consequently, they can improve the design success rate while reducing the risk of development errors.

The previous ECU-centric system architecture with its point-to-point communication infrastructure is no longer an efficient design approach. Automotive electronic systems are becoming truly distributed to achieve increased system functionality by tightly coupling ECUs. These changes encourage a significantly revised approach to automotive system test. This shift in system design is reflected in the evolution of ISO 26262, derived from IEC 61508, with its focus on functional safety assurance. It has a huge impact on automotive test platforms, HIL simulation, and test platform providers.

Which sections of ISO 26262 apply to such a new architecture of the HIL test platform?

The standard requires the following phases for software development: (The numbers correspond to the relevant section in the ISO 26262-6 standard.)

6. Specification of software safety requirements

7. Software architectural design

8. Software unit design and implementation

9. Software unit testing

10. Software integration and testing

11. Verification of software safety requirements

Each of these phases requires specification of software tools used in development and testing.

For Simulation & Test of Autonomous Driving in Real-Time

Apply such a new architecture to ADAS HIL:

  • Verification & validation of automotive components (e.g. ECU)

  • Real-time model execution (Vehicle, Road, Traffic, Driver, Environment)

  • Standardized & modular HiL test system family: smartTEST L, M, S or XS

  • Measurement technology based on NI PXIe & NI CompactRIO

  • Control & evaluation of vehicle busses (CAN, LIN, FlexRay)

  • Signal conditioning with NI SLSC (Switch, Load & Signal Conditioning) or AEROspice / SMARTbrick

  • Active Breakout Panel (BOP) with Power BOP, Signal BOP, Video BOP, …

  • Power Distribution Unit (PDU) with safety system, circuit breaker, …

  • Software based on NI LabVIEW, NI TestStand, NI VeriStand & NI DIAdem

  • Fault Insertion Units (FIUs) compliant with ISO 26262 – Functional Safety

  • Customer-specific applications etc.

ISO/DIS 26262-8(E) calls for the qualification of software development tools that are to be used in the development of safety-related items under an ISO 26262 compliant process.

Code Verification

ISO 26262 provides several options for verifying software designs and implementations. The approach described in this CRC Press News allows for a limited trace review to detect unintended functionality in the generated code, such as code not traced to a block or signal. The approach enables automatic generation of a trace matrix for this purpose. Alternatively, you can compare model coverage to code coverage using Simulink Coverage during software-in-the-loop (SIL) testing.

Alternatively, you can check MISRA compliance with this approach. Use of MISRA checking and code coverage analysis is especially helpful if your project includes a mixture of generated and hand-coded software. For added rigor, you can select a sound static analysis tool to prove the absence of run-time errors such as divide-by-zero.

ISO 26262 highly recommends back-to-back testing for ASILs C and D. It notes the importance of testing in a representative target hardware environment, and stresses the need to be aware of differences between the test and hardware environments:

Differences between the test environment and the target environment can arise in the source code or object code, for example, due to different bit widths of data words and address words of the processors.

However, as every engineer should know, there are many sources of potential numeric differences across platforms, especially for floating-point data. Some begin as minor but accumulate and grow, especially in feedback control systems. Thus, ISO 26262 lists various in-the-loop methods for back-to-back testing (a minimal tolerance-based comparison is sketched after the list below). Software unit testing can be executed in different environments, for example:

  • model-in-the-loop tests;

  • software-in-the-loop tests;

  • processor-in-the-loop tests; and

  • hardware-in-the-loop tests.
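
A minimal sketch of such a back-to-back comparison, assuming two stand-in "environments" (a 64-bit reference computation and a 32-bit target-like computation); the point is to compare outputs within a floating-point tolerance rather than bit-exactly:

```python
# Back-to-back comparison sketch: the same stimulus is fed to a 64-bit
# reference computation and a 32-bit target-like computation, and the
# outputs are compared within a floating-point tolerance rather than
# bit-exactly. Both "environments" are stand-ins for MIL/SIL/PIL/HIL runs.

import numpy as np

def model_output(x: np.ndarray) -> np.ndarray:
    return 0.5 * x + 1.0                              # 64-bit reference

def target_output(x: np.ndarray) -> np.ndarray:
    x32 = x.astype(np.float32)                        # 32-bit "target"
    return (np.float32(0.5) * x32 + np.float32(1.0)).astype(np.float64)

stimulus = np.linspace(-100.0, 100.0, 1001)
ref = model_output(stimulus)
tgt = target_output(stimulus)

# Tolerate small numeric differences from word width, not logic errors.
assert np.allclose(ref, tgt, rtol=1e-5, atol=1e-5), "back-to-back mismatch"
print("max abs difference:", np.max(np.abs(ref - tgt)))
```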

Summary

Typically, overall process compliance and considerations are taken into account for the robust design of the product; however, we must be sure not to forget the verification side of product development. Many functional safety standards specifically call out requirements and steps to be completed for test in addition to design. At the same time, these considerations must be taken into account in all variations of verification that take place throughout the development cycle.

This CRC Press News summarizes best Inventive Problem Solving (IPS) practices for Embedded Software Testing of Safety Compliant Systems. Dr. Nikola Tesla said:

“Invention is the most important product of man’s creative brain. The ultimate purpose is the complete mastery of mind over the material world, the harnessing of human nature to human needs.”

Will Autonomous Cars Look Like Unmanned Aerial Vehicles (UAVs)?

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Information Technology, Nanoscience & Technology

The automotive industry’s transition from an ownership model to a usage-based model—driven by the “sharing economy,” autonomous driving, and mobility-as-a-service platforms—is forcing the automotive manufacturing industry to make extensive changes in automobile architectures and the automotive business supply chain. Future automobiles will have electronics systems that closely resemble commercial aircraft systems.

These next-generation automotive systems will virtualize and centralize many electro-mechanical control systems that are currently distributed throughout today’s vehicles. In addition, these systems will be based on open standards that are derived from a wide range of transportation industry solutions. As the autonomous car evolves, automakers face a complex question:

  • How to enable self-driving cars to process massive amounts of data and then come to logical and safe conclusions about it.

Today, most automakers accomplish that with a distributed form of data processing. That is, they place intelligence at the sensors. More recently, though, that’s begun to change. Many engineers now favor a more centralized form of data processing, in which simple sensors send raw unprocessed data to a powerful central processor, which does all the “thinking.”

How to define distributed and centralized autonomous vehicle architectures?

Here, the definitions are limited to sensor fusion.

  • Distributed meant that every sensor node knew what every other node was doing.

  • Centralized meant that there was only one central point that collected all the information and created the sensor fusion map.

Is there a solution that’s a hybrid of those two electrical architectures? If so, how does that work?

The hybrid concept is a middle solution. There’s a central unit that works at a higher abstraction level. And there are domains. The domains can work geographically, for example, in the front and back of the car. Or they can be based on cameras and sensors.

What’s been the primary solution to date?

Up to now, systems were distributed because there was no real centralized solution in the market. Today, however, because of the computing capabilities of specialized electronic circuits like the Graphics Processing Unit (GPU), it’s entirely possible to build a centralized architecture.

By adopting the latest developments from Integrated Modular Avionics (IMA), the automotive industry could reduce system lifecycle costs and gain new integrated platform capabilities which support incremental modernization and simplify upgrades and modifications for automotive platforms. A specific set of architecture design patterns and computational models, used in IMA, enables the design of less complex integrated systems which can collect and process all system sensor data in (hard) real-time, supports seamless sensor data fusion for Integrated Vehicle Health Management (IVHM), and enables the integration of critical and non-critical functions. Accompanied by robust systems engineering and RTCA DO-254 / DO-178C DAL A/B design assurance, novel integrated architectures for autonomous cars can be designed to fit robust modular form factors. Such integrated architectures can be extended with COTS computing and sensor fusion LRUs used for IMA applications.

What are the advantages and disadvantages of using a distributed architecture?

  • The advantage might be that you don’t have to bring in a huge amount of data. You don’t have the problem of carrying data in a secure and efficient way from the edge to the center. And you can effectively put things together in the most cost efficient way.

  • The negative aspect is that you have to distribute the information simultaneously and synchronize it across all the nodes. And this becomes practically impossible when you exceed three or four nodes.

What are the advantages and disadvantages of a centralized architecture?

  • You get the best possible information. If you don’t touch the data, don’t modify it, don’t filter it at the edge, then you get the maximum possible information.

  • The disadvantage is that your center becomes a monster. It’s huge. You have to move data from as many as 12 cameras with four megapixels each, so you’re moving gigabytes. And you have to move radar data, so you’re moving gigabytes again. You end up having this huge amount of data that comes in at a high frequency rate, and it has to be processed. Your machine at the center is non-scalable, and when you don’t scale, you can’t offer capabilities for the long term, which will be needed in automotive.

As we move closer to actual vehicle autonomy, is one or the other starting to emerge as a leader?

It becomes clear now that there needs to be a centralized function for the planning phase – planning means path-finding, maneuvering and motion trajectory. It would not be the end-to-end (centralized architecture) that Nvidia wants to have. We would still need to have intelligent sensors that can reduce the bandwidth and optimize the cost somewhere between the edge and the center.

As described before, future electronics systems in automobiles will look a lot like the airborne electronics in commercial aircraft. The automotive platform changes are being driven by three significant trends:

  • Consumer value is derived from many cross-functional use cases, such as advanced driver assist systems (ADAS).

  • Vehicles are connecting with expanding Internet of Things (IoT) environments.

  • The automotive economy is advancing into a digitized, usage-based model.

This paradigm shift places a greater emphasis on safety, security, and reliability, and brings the automotive industry closer to the commercial aviation industry. This new automotive business model is based on the transportation of goods and passengers with vehicles that are primarily owned and operated by commercial companies rather than private individuals. Operating expense (OPEX) savings will motivate the entire industry to create new architectures that ensure safety, security, and platform consolidation efficiencies.

Would the hybrid architecture be the future?

In the future, hybrid solution would be the path because there is always the need to process close to the sensor, whether it’s for cameras, or antennas for radar, or cloud point analysis. At the same time, there will always be a need for a centralized place where all the local maps will be brought together to complete the centralized model.

A time- and space-partitioning standard for automotive electronics that enables applications with different safety levels to share a common compute platform could pave the way toward easier development of automotive electronics systems, with reduced Size, Weight, and Power (SWaP) requirements to improve automobile fuel efficiencies. This partitioned architecture could also provide for greater safety and security attributes.

What does that mean for the future of automotive sensors?

There’s so much you can do to make the sensor better and more useful for SAE Level 3, Level 4 and Level 5 automated vehicles. Visualization of the state of a vehicle, along with its real-time IoT sensor environment, will enable the driver or the operator of a car to immediately deliver the most efficient, safe, and secure experience to future automotive users. Both aircraft and next-generation automotive dashboards are already sharing common design OpenGL tools and safety-certified platform components.

Wouldn’t it be in the automaker’s best interest to go with a distributed system?

That would be right. That way, a lot of the development work could be offloaded to the suppliers. The question is, does the OEM want that? How does the OEM control a completely distributed system? They don’t. It puts them totally in the hands of the Tier One, with no chance of controlling it themselves.

The problem is it’s very difficult to control a distributed system. In order to make it work, you need to agree on languages, formats, protocols, and networking. It’s super tough. If the OEM could force their suppliers to do all that, they’d have a good life. However it is questionable whether they can force all of the Tier Ones to do the same type of modeling, the same type of mapping, the same type of algorithms. The Tier Ones need to compete, and to do that, they have to offer differences.

An industry-wide consortium of automotive companies could be formed to drive the definition and adoption of a standard specification for a layered software architecture for consolidated automotive electronics on common, shared platforms. This specification could streamline the innovation, development, deployment, and maintenance of next-generation automotive electronics.

As we approach Level 5, will an Integrated Architecture be necessary?

Let’s hope so. Integrated Modular Avionics (IMA) are real-time computer network airborne systems. This network consists of a number of computing modules capable of supporting numerous applications of differing criticality levels. The market opportunity for driver-assisted, driver-piloted, and fully driverless cars is projected to reach 42 billion by 2025. As a result, advanced “IMA-like” automotive solutions are being explored, focused on developing and maturing next-generation transformational IMA technologies. Five future IMA dual-use focus areas that have the potential to be transformational enablers for affordability, technology refresh, and new capabilities in both the automotive and avionics IMA markets are:

  • heterogeneous manycore processing,

  • scalable autonomy and data fusion software components,

  • hypervisor enabled mixed criticality software infrastructure,

  • unified QoS networking, and

  • Model Based Design (MBD).

Over the last decade, the automotive industry has built the Automotive Open System Architecture (AUTOSAR) set of standards that specifies basic software modules, application interfaces, and a common development methodology based on a standardized exchange format. The AUTOSAR layered software architecture is designed to enable electronic components from different suppliers to be used in multiple platforms (vehicles), enabling the move to all-electric cars and vehicles with higher software content that can be more easily upgraded over the service life of the vehicle. AUTOSAR aims to improve cost-efficiency without compromising safety.

Even with AUTOSAR and other standards from SAE and other organizations, the overall technical and business model for the vast majority of today’s automobiles is still a federated environment in which suppliers define the requirements for their systems, and the Original Equipment Manufacturer (OEM) or systems integrator designs automobiles within these constraints. This technical and business architecture is the reason that many cars today have scores of processors distributed throughout the vehicle, increasing the complexity of wiring harnesses and other support systems. This federated architecture was designed to optimize supplier integrations; it was not designed to increase safety or meet stringent safety and security certification requirements, nor was it designed to directly reduce vehicle complexity or SWaP requirements. The automotive industry could well benefit from an expansion of the AUTOSAR standard to an ARINC 653–like specification for integrated modular automobile electronics that virtualizes many of the current federated systems. This specification could help reduce SWaP requirements, reduce development and testing costs, and improve the efficiencies of the industry’s supply chain. And like the partitioning in Integrated Modular Avionics (IMA) based on ARINC 653, the specification for an Integrated Architecture holds the promise to create a safe and robust Autonomous Vehicle.

Industrial Design Engineering Project: DIY an autonomous driving system

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Materials Science, Nanoscience & Technology, Physics, Polymer Science

Optimizing the architecture of an autonomous driving system is particularly challenging for a number of reasons. As the Technology Readiness Levels (TRLs) of self-driving vehicles increase, it is necessary to investigate the Electrical/Electronic (E/E) system architectures for autonomous driving, beyond proof-of-concept prototypes. This CRC Press News presents the principal components needed in a functional architecture for autonomous driving, along with reasoning for how they should be distributed across the architecture. A functional architecture integrating all the concepts and reasoning is also presented, including the following considerations:

  • Algorithms: The system has to make “correct” operational decisions at all times to avoid accidents, and advanced machine learning, computer vision, and robotic processing algorithms are used to deliver the required high precision. These algorithms are compute intensive.

  • Constraints: The system must be able to react to traffic conditions in real-time, which means processing must always finish under strict deadlines (about 100ms in this work).

  • Overall power consumption: The system must operate within a power budget to avoid negatively impacting driving range and fuel efficiency.

Algorithms: Kano model for concept design

The Kano model categorizes user needs into three different groups:

  • unspoken basics: needs that the user takes for granted; hence, they render low customer value.

  • spoken performances: needs that have been explicitly demanded but have not been available to the user before.

  • unspoken excitement: the last category is a need that the user does not know he or she has, but which becomes a positive surprise when fulfilled.

This method was used after the user needs were defined; it is integrated into the specification of requirements. The architectural components are divided into categories pertaining to

  • perception,

  • decision and control, and

  • vehicle platform manipulation.

The architecture itself is divided into two layers comprising

  • the vehicle platform and

  • a cognitive driving intelligence.

The distribution of components among the architectural layers considers two extremes:

  • one where the vehicle platform is as “dumb” as possible, and

  • the other, where the vehicle platform can be treated as an autonomous system with limited intelligence.

For concept design, a clean split between the driving intelligence and the vehicle platform is recommended. The architecture description includes identification of stakeholder concerns, which are grouped under the business and engineering categories. It also needs to include explicit components for world modeling, semantic understanding, and vehicle platform abstraction unique to the vehicle architecture. There are several defined levels of automation:

  • With level 2 being ‘partial automation’ in which the automated system controls steering and acceleration/deceleration under limited driving conditions.

  • At level 3 the automated system handles all driving tasks under limited conditions (with a human driver taking over outside of that).

  • By level 5 the system is fully automated.

For concept design, let’s target Highly Autonomous Vehicles (HAVs), operating at levels 3-5. Moreover, they use vision-based systems with cameras and radar for sensing surroundings, rather than the much more expensive LIDAR.

Captured sensor data is fed to an object detector and a localizer (for identifying vehicle location at decimeter level) in parallel. The object detector identifies objects of interest and passes these to an object tracker to associate the objects with their movement trajectory over time. The object movement information from the object tracker and the vehicle location information from the localizer are combined and projected onto the same 3D coordinate space by a sensor fusion engine such as a Kalman filter.

The fused information is used by the motion planning engine to assign path trajectories (e.g. lane change, or setting vehicle velocity). The mission planning engine calculates the operating motions needed to realize planned paths and determines a routing path from source to destination.

For each of the six main algorithmic components (object detection, object tracking, localization, fusion, motion planning, and mission planning), let’s identify and select state-of-the-art algorithms.

Object detection

For object detection, a Deep Neural Network (DNN)-based detection algorithm is used (a minimal usage sketch follows the list below). Object detection is focused on the four most important categories for autonomous driving:

  • vehicles,

  • bicycles,

  • traffic signs, and

  • pedestrians.
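
As a hedged sketch of running such a DNN detector, the example below uses torchvision's Faster R-CNN pretrained on COCO, whose label set happens to include cars, bicycles, people, and stop signs; production AV stacks train detectors on driving-specific data, and downloading the pretrained weights requires network access:

```python
# Running a pretrained DNN detector with torchvision's Faster R-CNN
# (trained on COCO; the random tensor stands in for a camera frame).

import torch
from torchvision.models import detection

model = detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)            # stand-in for a camera frame
with torch.no_grad():
    predictions = model([image])[0]        # dict of boxes, labels, scores

keep = predictions["scores"] > 0.8         # keep confident detections
print(predictions["boxes"][keep], predictions["labels"][keep])
```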

Object tracking

A DNN is also used for object tracking. Each tracker follows a single object, so a pool of trackers can be used to track multiple objects.

Localization

For localization Simultaneous Localization and Mapping (SLAM) is used, which gives high accuracy and can also localize the vehicle regardless of viewpoint.

Fusion

No reference is given for the fusion engine component; however, it is comparatively simple and perhaps was custom built for the project. It combines the coordinates of tracked objects from the DNN trackers with the vehicle position from SLAM and maps them onto the same 3D coordinate space to be sent to the motion planning engine.

Motion & Mission planning

Autoware is the world's first "all-in-one" open-source software for self-driving vehicles. Both the motion and mission planning components come from the Autoware open-source autonomous driving framework. Motion planning is done using graph-based search to find minimum-cost paths when the vehicle is in large open spaces like parking lots or rural areas; in more structured areas, ‘conformal lattices with spatial and temporal information’ are used to adapt the motion plan to the environment. Mission planning uses a rule-based approach combining traffic rules and the driving-area condition, following routes generated by navigation systems such as Google Maps. Unlike the other components, which execute continuously, mission planning is only executed once unless the vehicle deviates from planned routes.
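
To make "graph-based search for minimum-cost paths" concrete, here is a toy sketch (not Autoware's implementation) that runs Dijkstra's algorithm over a small occupancy grid:

```python
# Toy graph-based search (Dijkstra) for a minimum-cost path on a small
# occupancy grid; illustrative only, with uniform step costs.

import heapq

grid = [  # 0 = free, 1 = obstacle
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

def shortest_path(start, goal):
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    pq = [(0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            break
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1  # uniform step cost; real planners use richer costs
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    path, node = [], goal
    while node != start:  # walk back from goal to start
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]

print(shortest_path((0, 0), (3, 3)))
```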

Theory of Constraints: detailed design

How quickly an autonomous system can react to traffic conditions is determined by the frame rate (how fast we can feed real-time sensor data into the processing engine) and the processing latency (how long it takes to make operational decisions based on the captured sensor data). The fastest possible action by a human driver takes 100-150ms (600-850ms is more typical). The target for an autonomous driving system is set at 100ms.

An autonomous driving system should be able to process current traffic conditions within a latency of 100ms at a frequency of at least once every 100ms.

The system also needs extremely predictable performance, which means that long latencies in the tail are unacceptable. Thus the 99th, or even 99.99th, percentile latency should be used to evaluate performance.
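
A small sketch of why tail percentiles matter, using synthetic latency samples (the distribution and spike pattern are invented for illustration): the median looks comfortable while the 99.99th percentile blows the 100 ms budget.

```python
# Mean latency can hide rare long frames, so tail percentiles are
# checked against the 100 ms budget. The samples are synthetic.

import numpy as np

rng = np.random.default_rng(0)
latencies_ms = rng.normal(55.0, 8.0, size=100_000)   # typical frames
latencies_ms[::1000] += 60.0                         # rare slow frames

for p in (50, 99, 99.99):
    print(f"p{p}: {np.percentile(latencies_ms, p):6.1f} ms")

budget_ms = 100.0
print("meets budget at p99.99:",
      np.percentile(latencies_ms, 99.99) <= budget_ms)
```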

All those processors kick out a lot of heat. We need to keep the temperature within the operating range of the system, and we also need to avoid overheating the cabin. This necessitates additional cooling infrastructure.

Overall power consumption: optimization

Finally, we need to keep an eye on the overall power consumption. A power-hungry system can degrade vehicle fuel efficiency by as much as 11.5%. The processors themselves contribute about half of the additional power consumption; the rest is consumed by the cooling overhead and the cost of storing tens of terabytes of map information.

CPU-only baseline system

With the CPU-only baseline system (16-core Intel Xeon at 3.2GHz), we can clearly see that three of the key components – object detection, tracking, and localization – do not meet the 100ms target individually, let alone when combined.

Beyond CPUs

Looking further into these three troublesome components, we can see that DNN execution is the culprit in detecting and tracking, and feature extraction takes most of the time in localization.

…conventional multicore CPU systems are not suitable to meet all the design constraints, particularly the real-time processing requirement. Therefore, we port the critical algorithmic components to alternative hardware acceleration platforms, and investigate the viability of accelerator-based designs.

GPUs

The bottleneck algorithmic components are ported to GPUs using existing machine learning software libraries. For object detection, the DNN is implemented using the cuDNN library from Nvidia. For object tracking, the DNN is ported using Caffe (which in turn uses cuDNN), and SLAM is ported to GPUs using the OpenCV library.

FPGA

FPGA optimized versions of DNN and feature extraction are built using the Altera Stratix V platform.

Previously published ASIC implementations are used for the DNNs, and a feature extraction ASIC is designed in Verilog and synthesized using the ARM Artisan IBM SOI 45nm library.

Comparing the CPU, GPU, FPGA, and ASIC implementations for latency and power across the three components leads to the following observations.

We can immediately rule out CPUs and FPGAs for the detection and tracking components, because they don’t meet the tail latency target. Note also that specialized hardware platforms such as FPGAs and ASICs offer significantly higher energy efficiency.

The perfect blend?

We can use different combinations of CPU, GPU, FPGA, and ASIC for the three different components, in an attempt to balance latency and power consumption.

GPU-based object detection, and ASICs for tracking and localization

  • Latency: Focusing just on latency for the moment, the lowest tail latency of all comes when using GPU-based object detection, and ASICs for tracking and localization.

  • Power consumption: If we look at power consumption though, and target a maximum of 5% driving range reduction, then we can see that this particular combination is slightly over our power budget.

all-ASIC or ASIC + FPGA-based localization

All-ASIC, or ASIC with FPGA-based localization, are the only combinations that fit within the 5% range-reduction budget and also (just!) meet the latency budget.

Higher-resolution cameras can significantly boost the accuracy of autonomous driving systems. We can continue to look at end-to-end latency as a function of input camera resolution. Some of the ASIC- and GPU-accelerated systems can still meet the real-time performance constraints at Full HD resolution (1080p), but none of them can sustain Quad HD: computational capability remains the bottleneck preventing us from benefiting from higher-resolution cameras.

Summary

Engineering design has been used as the research method, which focuses on creating solutions intended for practical application. The architecture has been refined and applied over a five-year period to the construction of prototype autonomous vehicles in three different categories, with both academic and industrial stakeholders. In this CRC Press News, we show the following:

  • Low latency: GPU-, FPGA-, and ASIC-accelerated systems can reduce the tail latency of the localization, object detection, and object tracking algorithms by 169x, 10x, and 93x respectively.

  • Low latency + low power consumption: While power-hungry accelerators like GPUs can predictably deliver the computation at low latency, their high power consumption, further magnified by the cooling load needed to meet the thermal constraints, can significantly degrade the driving range and fuel efficiency of the vehicle.

  • Objective: maximize expected utility value (the driving range and fuel efficiency of the vehicle).

Note: this is a companion project to Industrial Design Engineering Project: DIY Smart Home Robot using Arduino.

Dynamical Movement Primitives (DMPs) for Safety Engineering

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Engineering - Mining, Mathematics, Occupational Health & Safety, Public Administration & Public Policy, Statistics

I have traveled to mines in Wyoming. Today, though autonomy is a hot topic across several industries as self-driving cars and heavy trucks hold the promise of changing the way we transport ourselves and our goods, it’s old news in a way for mining. Safety has been a key aspect of the mining industry even from its earliest stages, but the importance with which it is regarded has become far greater in recent times. Currently the biggest Compound Annual Growth Rate (CAGR) in mining electronics revenue can be attributed to safety applications. However, with a growing amount of electronics content making up a Bill Of Materials (BOM), there is now a necessity to switch from the long-established best-practices approach to well-defined universal guidelines. As a result, industry protagonists have joined forces to develop a standard with far-reaching implications.

The word “safety” is subject to various interpretations. However, when applied to modern mining vehicle design it can generally be categorized using the following structure:

Passive safety

Assuming that an accident is effectively inevitable, the aim of passive safety mechanisms is to minimize the severity of that accident. The passive safety elements found within a vehicle include seatbelts, crumple zones, etc.

Active safety

The systems concerned with active safety aim to avoid accidents altogether, in addition to minimizing the effects of an accident if one occurs. Seatbelt pre-tensioning, airbag deployment, predictive emergency braking, anti-lock braking systems, and traction control are all examples of this.

Functional safety

This focuses on ensuring that all of the electrical and electronic systems – including power supplies, sensors, communication networks, actuators, etc. – in an active safety-related system function correctly.

Complex motions for robots are frequently generated by switching among a collection of individual movement primitives. Regarding safety, the robot needs to be able to detect and react to collisions, whereas controllability enables a human to easily interact with the robot by, for instance, grabbing it and guiding it to the right location. To assure safety, the concept of Dynamical Movement Primitives (DMPs) has become popular for the modeling of motion, commonly applied to robots. This CRC Press News explores a framework that allows a robot operator to adjust DMPs in an intuitive way. From this, an architecture can be developed which consists of four main components – a perception system, task planning, a motion planner, and control systems – that allow autonomous operation of backhoe machines. Based on this, a motion planning system based on Learning from Demonstration, using DMPs as the control policy, can be implemented, allowing backhoe machines to perform operations in an autonomous manner.

A DMP is a non-linear differential equation that encodes movements, which are used to learn tasks on backhoe machines. The non-linear dynamic equation encodes the basic behavioral pattern, and its formulation is based on attractor theory for learning movement primitives. The purpose of this control policy is therefore to reach the goal state with a particular trajectory shape, independent of the initial state. Moreover, DMPs are a compact representation of high-dimensional planning policies and must have the following properties (a minimal numerical sketch follows the list):

  • The convergence to the goal state must be guaranteed.

  • The DMP formulation must generate any desired smooth trajectory.

  • DMPs have to be temporally and spatially invariant.

  • The formulation must be robust against perturbations due to the inherent attractor dynamic.
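
As a minimal numerical sketch of a one-degree-of-freedom DMP (a common textbook formulation, not the specific equations of any backhoe system), the transformation system below is a critically damped spring-damper driven by a phase variable; the learned forcing term is left at zero, so the trajectory simply converges to the goal, which is exactly the guaranteed-convergence property listed above. Learning from Demonstration would fit the forcing term so the motion reproduces a taught shape:

```python
# Minimal one-DoF DMP: a critically damped transformation system driven
# by a phase (canonical) system, integrated with Euler steps. The
# learned forcing term f(s) is left at zero, so the trajectory simply
# converges to the goal g.

alpha_z, beta_z = 25.0, 25.0 / 4.0   # critically damped spring-damper
alpha_s = 3.0                        # phase decay rate
tau, dt = 1.0, 0.001                 # duration scaling, Euler step

y, z, s = 0.0, 0.0, 1.0              # start position, scaled velocity, phase
g = 1.0                              # goal state

def forcing(s: float) -> float:
    return 0.0                       # placeholder for the learned term

for _ in range(int(2.0 / dt)):       # integrate for 2 s
    z_dot = (alpha_z * (beta_z * (g - y) - z) + forcing(s)) / tau
    y_dot = z / tau
    s_dot = -alpha_s * s / tau
    z, y, s = z + z_dot * dt, y + y_dot * dt, s + s_dot * dt

print(f"final position: {y:.4f} (goal {g})")  # converges to the goal
```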

This is an architecture of supervised learning for the generation of motor skills in an intelligent excavator agent, robust to changes in target positions. The approach could be implemented for other kinds of operations, such as digging a foundation or leveling a mound of soil, with the machine modeled here. The autonomous truck-loading architecture could be extended to complete automation of earthmoving operations in backhoe machines, and this technology could be easily extended to other kinds of similar machines. For successful application of these techniques to constraint-based planning and control for the safe operation of autonomous vehicles, it is important to develop high-level task planning that makes decisions about the type of truck loading under uncertainty, based on the perception system; this would choose the most appropriate DMP for the task.

Year 2020, shall we start to see AI-based Autonomous Vehicles on the road?

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Ergonomics & Human Factors, Information Technology, Mathematics, Nanoscience & Technology, Statistics

Autonomous driving technology is a complex system, potentially consisting of the following three major subsystems:

  • algorithms, including sensing, perception, and decision; The algorithm subsystem extracts meaningful information from sensor raw data to understand its environment and make decisions about its actions.

  • client, including the Robotics Operating System (ROS) and hardware platform; The client subsystem integrates these algorithms to meet real-time and reliability requirements. (For example, if the sensor camera generates data at 60 Hz, the client subsystem needs to make sure that the longest stage of the processing pipeline takes less than 16 milliseconds (ms) to complete.)

  • cloud platform, including data storage, simulation, High-Definition (HD) mapping, and deep learning model training. The cloud platform provides offline computing and storage capabilities for autonomous cars. Using the cloud platform, we are able to test new algorithms and update the HD map—plus, train better recognition, tracking, and decision models.

Autonomous driving algorithms

The algorithms component consists of: sensing and extracting meaningful information from sensor raw data; perception, to localize the vehicle and understand the current environment; and decision, to take actions to reliably and safely reach destinations.

Sensing

Normally, an autonomous vehicle consists of several major sensors. Since each type of sensor presents advantages and drawbacks, the data from multiple sensors must be combined. The sensor types can include the following:

A. GPS/IMU

The GPS/IMU system helps the AV localize itself by reporting both inertial updates and a global position estimate at a high rate, e.g., 200 Hz. While GPS is a fairly accurate localization sensor, at only 10 Hz, its update rate is too slow to provide real-time updates. Now, though an IMU’s accuracy degrades with time, and thus cannot be relied upon to provide accurate position updates over long periods, it can provide updates more frequently—at, or higher than, 200 Hz. This should satisfy the real-time requirement. By combining GPS and IMU, we can provide accurate and real-time updates for vehicle localization.

B. LIDAR

LIDAR is used for mapping, localization, and obstacle avoidance. It works by bouncing a beam off surfaces and measuring the reflection time to determine distance. Due to its high accuracy, it is used as the main sensor in most AV implementations. LIDAR can be used to produce HD maps, to localize a moving vehicle against HD maps, to detect obstacles ahead, and so on. Normally, a LIDAR unit, such as the Velodyne 64-beam laser, rotates at 10 Hz and takes about 1.3 million readings per second.

C. Cameras

Cameras are mostly used for object recognition and object tracking tasks, such as lane detection, traffic light detection, pedestrian detection, and more. To enhance AV safety, existing implementations usually mount eight or more cameras around the car, such that we can use cameras to detect, recognize, and track objects in front, behind, and on both sides of the vehicle. These cameras usually run at 60 Hz, and, when combined, generate around 1.8 GB of raw data per second.

D. Radar and Sonar

The radar and sonar system is used for the last line of defense in obstacle avoidance. The data generated by radar and sonar shows the distance from the nearest object in front of the vehicle’s path. When we detect that an object is not far ahead and that there may be danger of a collision, the AV should apply the brakes or turn to avoid the obstacle. Therefore, the data generated by radar and sonar does not require much processing and is usually fed directly to the control processor—not through the main computation pipeline—to implement such urgent functions as swerving, applying the brakes, or pre-tensioning the seatbelts.

Perception

Next, we feed the sensor data to the perception subsystem to understand the vehicle’s environment. The three main tasks in autonomous driving perception are localization, object detection, and object tracking.

1. Localization

GPS/IMU localization

While GPS/IMU can be used for localization, GPS provides fairly accurate localization results but with a slow update rate; IMU provides a fast update with less accurate results. We can use Kalman filtering to combine the advantages of the two and provide accurate and real-time position updates. The IMU propagates the vehicle’s position every 5 ms, but the error accumulates as time progresses. Fortunately, every 100 ms, we get a GPS update, which helps us correct the IMU error. By running this propagation and update model, we can use GPS/IMU to generate fast and accurate localization results.
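
A one-dimensional sketch of this propagate/update pattern (all noise values, the constant-velocity motion, and the simulation length are invented for illustration): the IMU-driven estimate drifts between fixes, and each 100 ms GPS measurement pulls it back.

```python
# 1-D sketch of the propagate/update pattern: a noisy IMU-derived
# velocity propagates the position every 5 ms, and every 100 ms a GPS
# fix corrects the accumulated drift.

import numpy as np

rng = np.random.default_rng(1)
dt = 0.005                      # 5 ms IMU period
true_v = 10.0                   # true velocity, m/s

x, P = 0.0, 1.0                 # position estimate and its variance
Q, R = 0.02, 1.0                # process (IMU) and GPS noise variances

true_x = 0.0
for step in range(1, 2001):     # 10 s of driving
    true_x += true_v * dt
    v_meas = true_v + rng.normal(0.0, 0.5)   # noisy IMU velocity
    x += v_meas * dt            # propagate: drift accumulates
    P += Q
    if step % 20 == 0:          # every 100 ms a GPS fix arrives
        z = true_x + rng.normal(0.0, np.sqrt(R))
        K = P / (P + R)         # Kalman gain
        x += K * (z - x)        # update: correct the drifted estimate
        P *= 1.0 - K

print(f"true: {true_x:.2f} m, estimate: {x:.2f} m, variance: {P:.3f}")
```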

Nonetheless, we cannot solely rely on this combination for localization for three reasons:

  • It has an accuracy of only about one meter;

  • The GPS signal has multipath problems, meaning that the signal may bounce off buildings and introduce more noise;

  • GPS requires an unobstructed view of the sky and thus does not work in closed environments such as tunnels.

Vision-based localization

Cameras can be used for localization, too. Vision-based localization follows this simplified pipeline:

  • by triangulating stereo image pairs, we first obtain a disparity map that can be used to derive depth information for each point (see the sketch after this list);

  • by matching salient features between successive stereo image frames to establish correlations between feature points in different frames, we can then estimate the motion between the past two frames; and

  • we compare the salient features against those in the known map to derive the current position of the vehicle.
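
As a rough illustration of the first, depth-recovery step, the sketch below uses OpenCV's semi-global block matcher on a stereo pair. The image file names, focal length, and baseline are hypothetical placeholders.

```python
import cv2
import numpy as np

# Stereo triangulation sketch: disparity from a rectified stereo pair,
# then depth per pixel from the camera geometry.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder images
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # SGBM output is fixed-point

fx, baseline_m = 700.0, 0.54          # assumed focal length (px) and baseline (m)
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = fx * baseline_m / disparity[valid]    # depth by triangulation
```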

However, since a vision-based localization approach is very sensitive to lighting conditions, this approach alone would not be reliable.

LIDAR for localization

Therefore, LIDAR is usually the main sensor used for localization, relying heavily on a particle filter. The point clouds generated by LIDAR provide a “shape description” of the environment, but it is hard to differentiate individual points. By using a particle filter, the system compares a specific observed shape against the known map to reduce uncertainty.

To localize a moving vehicle relative to these maps, we apply a particle filter method to correlate the LIDAR measurements with the map. The particle filter method has been demonstrated to achieve real-time localization with 10-centimeter accuracy and is effective in urban environments. However, LIDAR has its own problem: when there are many suspended particles in the air, such as raindrops or dust, the measurements may be extremely noisy.
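
The following toy sketch shows the particle-filter loop in one dimension: propagate the particles with the vehicle's motion, weight them by how well the observed LIDAR "shape" matches the known map, and resample. The map-matching function is a hypothetical stand-in for a real scan matcher against the HD map.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
particles = rng.uniform(0.0, 100.0, size=N)    # candidate positions

def map_match_score(poses, scan):
    """Hypothetical likelihood of the scan at each pose given the map."""
    return np.exp(-0.5 * (poses - scan) ** 2)  # placeholder Gaussian match

def pf_step(motion, scan):
    global particles
    # 1) propagate each particle with the measured motion plus noise
    particles = particles + motion + rng.normal(0.0, 0.2, size=N)
    # 2) weight particles by agreement with the map, then normalize
    w = map_match_score(particles, scan)
    w /= w.sum()
    # 3) resample: particles that explain the scan survive, so uncertainty shrinks
    particles = rng.choice(particles, size=N, p=w)

pf_step(motion=1.0, scan=51.0)
estimate = particles.mean()                    # fused pose estimate
```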

Sensor-fusion process

Therefore, to achieve reliable and accurate localization, we need a sensor-fusion process to combine the advantages of all sensors.

2. Object recognition

Since LIDAR provides accurate depth information, it was originally used to perform object detection and tracking tasks in AVs. In recent years, however, we have seen the rapid development of deep learning technology, which achieves significant object detection and tracking accuracy.

A Convolutional Neural Network (CNN) is a type of deep neural network that is widely used in object recognition tasks. A general CNN evaluation pipeline usually consists of the following layers:

  • the convolution layer uses different filters to extract different features from the input image. Each filter contains a set of “learnable” parameters that will be derived after the training stage;

  • the activation layer decides whether to activate the target neuron or not;

  • the pooling layer reduces the spatial size of the representation to reduce the number of parameters and, consequently, the computation in the network; and

  • last, the fully connected layer connects all neurons to all activations in the previous layer (see the sketch after this list).
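
The sketch below assembles these four layer types into a minimal PyTorch model; the channel counts, image size, and class count are arbitrary placeholders rather than a production detector.

```python
import torch
import torch.nn as nn

# Minimal CNN mirroring the pipeline above.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: learnable filters
    nn.ReLU(),                                   # activation: gates each neuron
    nn.MaxPool2d(2),                             # pooling: halves the spatial size
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # fully connected: class scores
)

scores = model(torch.randn(1, 3, 32, 32))        # one dummy 32x32 RGB image
print(scores.shape)                              # torch.Size([1, 10])
```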

Once an object is identified using a CNN, next comes the automatic estimation of the trajectory of that object as it moves—or, object tracking.

3. Object tracking

Object tracking technology can be used to track nearby moving vehicles, as well as people crossing the road, to ensure the current vehicle does not collide with moving objects. In recent years, deep learning techniques have demonstrated advantages in object tracking compared to conventional computer vision techniques. By using auxiliary natural images, a stacked autoencoder can be trained offline to learn generic image features that are more robust against variations in viewpoints and vehicle positions. Then, the offline-trained model can be applied for online tracking.
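
As a simplified sketch of that offline step, the code below trains a single-layer autoencoder (a real stacked autoencoder would train several such layers greedily) on stand-in image patches; the learned encoder would then serve as the feature extractor for online tracking. All sizes are illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(1024, 256), nn.ReLU())
decoder = nn.Sequential(nn.Linear(256, 1024), nn.Sigmoid())
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)
loss_fn = nn.MSELoss()

patches = torch.rand(512, 1024)          # stand-in for 32x32 natural-image patches
for _ in range(100):                     # offline training: learn to reconstruct
    recon = decoder(encoder(patches))
    loss = loss_fn(recon, patches)
    opt.zero_grad()
    loss.backward()
    opt.step()

features = encoder(patches[:1])          # online: generic features feed the tracker
```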

Decision

In the decision stage, action prediction, path planning, and obstacle avoidance mechanisms are combined to generate an effective action plan in real time.

1. Action prediction

One of the main challenges for human drivers when navigating through traffic is to cope with the possible actions of other drivers, which directly influence their own driving strategy. This is especially true when there are multiple lanes on the road or at a traffic change point. To make sure that the AV travels safely in these environments, the decision unit generates predictions of nearby vehicles and then decides on an action plan based on these predictions.

To predict actions of other vehicles, one can generate a stochastic model of the reachable position sets of the other traffic participants, and associate these reachable sets with probability distributions.
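
A simple way to approximate such a reachable set is Monte Carlo sampling, as in the sketch below: propagate many hypothetical motions of a nearby vehicle over a short horizon and read probabilities off the resulting empirical distribution. The dynamics and noise levels are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
horizon_s, dt, n = 2.0, 0.1, 5000

pos = np.zeros(n)                        # nearby vehicle starts at s = 0 m
vel = rng.normal(10.0, 1.0, size=n)      # uncertain initial speed, ~10 m/s

for _ in range(int(horizon_s / dt)):
    accel = rng.normal(0.0, 0.5, size=n) # unknown driver intent modeled as noise
    vel += accel * dt
    pos += vel * dt

# Empirical reachable-set distribution after 2 s:
p_beyond_25m = np.mean(pos > 25.0)       # probability mass past a point of interest
lo, hi = np.percentile(pos, [5, 95])     # 90% reachable interval
```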

2. Path planning

Planning the path of an autonomous, responsive vehicle in a dynamic environment is a complex problem, especially when the vehicle is required to use its full maneuvering capabilities. One approach would be to use deterministic, complete algorithms—search all possible paths and utilize a cost function to identify the best path. However, this requires enormous computational resources and may be unable to deliver real-time navigation plans. To circumvent this computational complexity and provide effective real-time path planning, probabilistic planners have been utilized.
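
One widely used probabilistic planner is the Rapidly-exploring Random Tree (RRT); the toy 2D sketch below grows a tree by random sampling instead of searching all paths. The obstacle test and workspace bounds are hypothetical.

```python
import math
import random

random.seed(0)
start, goal, step = (0.0, 0.0), (9.0, 9.0), 0.5

def collision_free(p):
    """Hypothetical obstacle test: a disc obstacle centered at (5, 5)."""
    return math.hypot(p[0] - 5.0, p[1] - 5.0) > 1.5

nodes, parent, path = [start], {0: None}, None
for _ in range(5000):
    sample = (random.uniform(0, 10), random.uniform(0, 10))
    i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
    d = math.dist(nodes[i], sample)
    if d == 0.0:
        continue
    # extend the nearest tree node one small step toward the sample
    new = (nodes[i][0] + step * (sample[0] - nodes[i][0]) / d,
           nodes[i][1] + step * (sample[1] - nodes[i][1]) / d)
    if collision_free(new):
        parent[len(nodes)] = i
        nodes.append(new)
        if math.dist(new, goal) < step:   # goal reached: walk back for the path
            path, k = [], len(nodes) - 1
            while k is not None:
                path.append(nodes[k])
                k = parent[k]
            path.reverse()
            break
```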

3. Obstacle avoidance

Since safety is of paramount concern in autonomous driving, we should employ at least two levels of obstacle avoidance mechanisms to ensure that the vehicle will not collide with obstacles. The first level is proactive and based on traffic predictions. The traffic prediction mechanism generates measures like time-to-collision or predicted minimum distance. Based on these measures, the obstacle avoidance mechanism is triggered to perform local path re-planning. If the proactive mechanism fails, the second-level reactive mechanism, using radar data, takes over. Once the radar detects an obstacle ahead of the path, it overrides the current controls to avoid the obstacle.
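
The sketch below shows how the two levels might key off a time-to-collision measure; the thresholds are illustrative, not calibrated values from any real system.

```python
def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """Seconds until contact if nothing changes; infinite when the gap opens."""
    return gap_m / closing_speed_mps if closing_speed_mps > 0 else float("inf")

def avoidance_action(gap_m: float, closing_speed_mps: float) -> str:
    ttc = time_to_collision(gap_m, closing_speed_mps)
    if ttc < 0.7:            # level 2: reactive, radar-triggered control override
        return "emergency_brake"
    if ttc < 3.0:            # level 1: proactive local-path re-planning
        return "replan_local_path"
    return "continue"

assert avoidance_action(gap_m=12.0, closing_speed_mps=5.0) == "replan_local_path"
assert avoidance_action(gap_m=3.0, closing_speed_mps=10.0) == "emergency_brake"
```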

The client system

The client system integrates the above-mentioned algorithms together to meet real-time and reliability requirements. There are three challenges to overcome:

  • the system needs to make sure that the processing pipeline is fast enough to consume the enormous amount of sensor data generated;

  • if a part of the system fails, it needs to be robust enough to recover from the failure; and

  • the system needs to perform all the computations under energy and resource constraints.

Robot Operating System (ROS)

The Robot Operating System (ROS) is a widely used, powerful distributed computing framework tailored for robotics applications.

Each robotic task, such as localization, is hosted in an ROS node. These nodes communicate with each other through topics and services. ROS is a suitable operating system for autonomous driving, except that it suffers from a few problems:

  • Reliability: ROS has a single master and no monitor to recover failed nodes.

  • Performance: when sending out broadcast messages, it duplicates the message multiple times, leading to performance degradation.

  • Security: it has no authentication and encryption mechanisms.

Although ROS 2.0 promised to fix these problems, it has not been extensively tested, and many features are not yet available.

Therefore, in order to use ROS in autonomous driving, we need to solve these problems first.

A. Reliability

The current ROS implementation has only one master node, so when the master node crashes, the whole system crashes. This does not meet the safety requirements for autonomous driving. To fix this problem, we implement a ZooKeeper-like mechanism in ROS. As shown in Figure 7, the design incorporates a main master node and a backup master node. In the case of main node failure, the backup node would take over, making sure the system continues to run without hiccups. In addition, the ZooKeeper mechanism monitors and restarts any failed nodes, making sure the whole ROS system stays reliable.
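
As a sketch of the idea, ZooKeeper's standard leader-election recipe (here via the Python kazoo client) lets a main and a backup master race for leadership; if the leader dies, the other takes over automatically. The connection string, paths, identifiers, and the master loop body are hypothetical.

```python
import time
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ZooKeeper ensemble
zk.start()

def serve_as_master():
    # Hypothetical stand-in for running the ROS master and monitoring nodes;
    # kazoo keeps us leader until this function returns or the process dies.
    print("acting as ROS master; monitoring and restarting failed nodes")
    while True:
        time.sleep(1)

# Both the main and the backup master processes run this; only one is leader
# at a time, and the survivor wins the next election on failure.
election = zk.Election("/ros/master-election", identifier="master-node-1")
election.run(serve_as_master)
```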

B. Performance

Performance is another problem with the current ROS implementation, because ROS nodes communicate with one another frequently and it is imperative that this communication be efficient. First, when local nodes communicate with each other, messages go through the loopback mechanism, and each pass through the loopback pipeline introduces a 20-microsecond overhead. To eliminate this local communication overhead, we can use a shared-memory mechanism so that the message does not have to go through the TCP/IP stack to reach the destination node. Second, when an ROS node broadcasts a message, the message gets copied multiple times, consuming significant bandwidth in the system. Switching to a multicast mechanism greatly improves the throughput of the system.
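
The sketch below illustrates the shared-memory half of this fix with Python's standard library: the sender writes the message into a named block once, and a local receiver maps the same block by name instead of pushing the bytes through loopback TCP/IP. The block name and payload are illustrative.

```python
import numpy as np
from multiprocessing import shared_memory

msg = np.arange(1024, dtype=np.float32)               # the "message" payload

# Sender: allocate a named block and copy the message in once.
shm = shared_memory.SharedMemory(create=True, size=msg.nbytes, name="ros_msg_demo")
np.ndarray(msg.shape, dtype=msg.dtype, buffer=shm.buf)[:] = msg

# Receiver (normally another local process): attach by name, zero-copy read.
peer = shared_memory.SharedMemory(name="ros_msg_demo")
view = np.ndarray(msg.shape, dtype=msg.dtype, buffer=peer.buf)
assert view[10] == 10.0

peer.close()
shm.close()
shm.unlink()                                          # sender frees the block
```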

C. Security

Security is the most critical concern for an ROS. Imagine two scenarios: in the first, an ROS node is hijacked and made to allocate memory continuously until the system runs out of memory, starts killing other ROS nodes, and the attacker successfully crashes the system. In the second scenario—since, by default, ROS messages are not encrypted—a hacker can easily eavesdrop on messages between nodes and mount man-in-the-middle attacks.

To fix the first security problem, we can use Linux containers (LXC) to restrict the amount of resources used by each node and to provide a sandbox mechanism that protects the nodes from one another, effectively preventing resource leakage. To fix the second problem, we can encrypt messages in transit, preventing them from being eavesdropped.
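
The text proposes LXC containers for this; purely to illustrate the same resource-capping idea at the process level, the sketch below uses a per-process address-space limit (Unix only) so that a runaway node exhausts its own budget rather than the whole system's.

```python
import resource

ONE_GIB = 1 << 30
# Cap this node process at 1 GiB of address space (illustrative figure).
resource.setrlimit(resource.RLIMIT_AS, (ONE_GIB, ONE_GIB))

try:
    hog = bytearray(2 * ONE_GIB)     # a runaway allocation now fails here...
except MemoryError:
    print("allocation denied; other nodes are unaffected")   # ...not system-wide
```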

Hardware platform

To understand the challenges in designing a hardware platform for autonomous driving, let us examine the computing platform implementation from a leading autonomous driving company. It consists of two compute boxes, each equipped with an Intel Xeon E5 processor and four to eight Nvidia Tesla K80 GPU accelerators. The second compute box performs exactly the same tasks and is used for reliability—if the first box fails, the second box can immediately take over.

In the worst case, if both boxes run at their peak, they draw more than 5,000 W of power and generate an enormous amount of heat. Each box costs $20K to $30K, making this solution unaffordable for average consumers.

The power, heat dissipation, and cost requirements of this design prevent autonomous driving from reaching the general public (so far). To explore the edges of the envelope and understand how well an autonomous driving system could perform on an ARM mobile SoC, we can implement a simplified, vision-based autonomous driving system on an ARM-based mobile SoC with peak power consumption of 15 W.

Surprisingly, the performance is not bad at all: the localization pipeline is able to process 25 images per second, almost keeping up with image generation at 30 images per second. The deep learning pipeline is able to perform two to three object recognition tasks per second. The planning and control pipeline is able to plan a path within 6 ms. With this system, we are able to drive the vehicle at around five miles per hour without any loss of localization.

Cloud platform

Autonomous vehicles are mobile systems and therefore need a cloud platform to provide support. The two main functions provided by the cloud are distributed computing and distributed storage. This system has several applications, including simulation, which is used to verify new algorithms; High-Definition (HD) map production; and deep learning model training. To build such a platform, we use Spark for distributed computing, OpenCL for heterogeneous computing, and Alluxio for in-memory storage. By integrating these three, we can deliver a reliable, low-latency, and high-throughput autonomous driving cloud.

Simulation

The first application of a cloud platform system is simulation. Whenever we develop a new algorithm, we need to test it thoroughly before we can deploy it on real cars (where the cost would be enormous and the turn-around time too long).

Therefore, we can test the system on simulators, such as replaying data through ROS nodes. However, if we were to test the new algorithm on a single machine, either it would take too long or we wouldn't have enough test coverage. Spark can be used to manage distributed computing nodes, and on each node, we can run an ROS replay instance. In one autonomous driving object recognition test set, it took three hours to run on a single server; by using the distributed system, scaled to eight machines, the test finished in 25 minutes.
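
A sketch of this setup with PySpark follows; run_ros_replay() is a hypothetical wrapper around an ROS replay of one recorded segment plus the algorithm under test, and the segment paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("av-replay").getOrCreate()

segments = [f"hdfs:///datasets/drive_{i:03d}.bag" for i in range(96)]  # placeholders

def run_ros_replay(segment_path):
    # Hypothetical: launch an ROS replay of this segment, run the new
    # algorithm against it, and return pass/fail plus metrics.
    return {"segment": segment_path, "passed": True}

results = (spark.sparkContext
           .parallelize(segments, numSlices=8)   # spread segments over workers
           .map(run_ros_replay)
           .collect())
print(sum(r["passed"] for r in results), "of", len(results), "segments passed")
```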

HD map production

HD map production is a complex process that involves many stages, including raw data processing, point cloud production, point cloud alignment, 2D reflectance map generation, HD map labeling, as well as the final map generation.

Using Spark, we can connect all these stages together in one Spark job. A great advantage is that Spark provides an in-memory computing mechanism, so we do not have to store the intermediate data on hard disk, which greatly improves the performance of the map production process.
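
A sketch of such a chained job is shown below; each stage function is a hypothetical stand-in for the real processing, and only the final map is written to disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hd-map-pipeline").getOrCreate()

# Hypothetical stage implementations, named after the stages above.
def parse_raw_record(line):        return {"raw": line}
def build_point_cloud(rec):        return {**rec, "cloud": "..."}
def align_point_cloud(rec):        return {**rec, "aligned": True}
def render_reflectance_tile(rec):  return {**rec, "tile": "..."}
def apply_labels(rec):             return {**rec, "labels": []}

raw = spark.sparkContext.textFile("hdfs:///maps/raw_sensor_logs")  # placeholder path
hd_map = (raw.map(parse_raw_record)           # raw data processing
             .map(build_point_cloud)          # point cloud production
             .map(align_point_cloud)          # point cloud alignment
             .map(render_reflectance_tile)    # 2D reflectance map generation
             .map(apply_labels))              # HD map labeling
hd_map.saveAsTextFile("hdfs:///maps/hd_map_out")   # only the final map hits disk
```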

Deep learning model training

As we use different deep learning models in autonomous driving, it is imperative to provide updates that will continuously improve the effectiveness and efficiency of these models. However, since the amount of raw data generated is enormous, we would not be able to achieve fast model training using single servers.

To approach this problem, we can develop a highly scalable distributed deep learning system using Spark and Paddle (a deep learning platform recently open-sourced by Baidu).

In the Spark driver, we manage a Spark context and a Paddle context, and on each node, the Spark executor hosts a Paddle trainer instance. On top of that, we can use Alluxio as a parameter server for this system. Using this system, we have achieved linear performance scaling as we add more resources, showing that the system is highly scalable.

How Do We Put Them Together, Potentially in an AUTOSAR Context?

Now that we have a good understanding of the main components of an autonomous vehicle, let’s review a scenario of how they all work together.

Scenario

The car has stopped at an intersection in front of the red light.

Mission

The car should move forward when the traffic light turns green without violating any traffic laws or hurting other beings.

1. Sensors

The car’s sensors take in raw information about the environment. It does not know what this information means yet — at least not until it gets to the perception stage.

2. V2X technology

The traffic light communicates to the car that it has just turned green. Other surrounding cars communicate their position in the environment.

3. Perception Stage

The vehicle turns the raw information coming in from the sensors into actual meaning. The camera information reveals that the light has just turned green and that a pedestrian is crossing into the street in front of the vehicle.

4. Planning Stage

The vehicle combines the sensing information processed during the perception stage with the incoming V2X information to determine how to behave. The car’s policy is to generally move when the light turns green; however, it has an overriding policy that it should not run over pedestrians. What should the car do in this scenario? The car decides that, based on the combination of environmental information and the general policies of how it should operate, it should not move.

5. Control Stage

The car must translate its decision to not move into an action. In this case, the action (or rather, inaction) is to stay still and keep the brakes applied.

6. Actuators

The car keeps the brakes applied, as the result of its decision-making process described above.

AUTOSAR is a consortium of OEMs and component suppliers that supports standardization of the software infrastructure needed to integrate and run a vehicle's software. The adoption and use of AUTOSAR is OEM-specific. In the AUTOSAR context, the functional components in the attached Figure represent AUTOSAR software components, and the interfaces between components can be specified through AUTOSAR's standardized interface definitions.

As you can see, the technology behind the autonomous vehicle is not extremely difficult to understand when boiled down into major concepts. Autonomous driving (and artificial intelligence in general) is not one technology; it is an integration of many technologies. It demands innovations in algorithms, system integration, and cloud platforms. With Inventive Problem Solving techniques in Industrial Design Engineering, people anticipate that by 2020 (next year), we will start to see AI-based Autonomous Vehicles on the road.

Industrial Design Engineering: Inventive Problem Solving for ISO 26262 – The Second Edition

By: John X. Wang
Subjects: Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Occupational Health & Safety

The Second Edition: what’s in it

A new version of the automotive functional safety standard arrived this year. This CRC Press News reviews the main updates and looks at how the standard will combine with the incoming SOTIF autonomous driving standard.

Carmakers are steadily integrating higher volumes of automated driving features into road vehicles, making functional safety a top priority for their whole industry. To address that, the automotive giants and the companies that supply their electrical and/or electronic (E/E) systems follow the ISO 26262 standard, established in 2011 by the International Organization for Standardization. Until now, ISO 26262 has addressed many aspects of functional safety for passenger vehicles with a maximum gross weight of 3,500 kg. An update to the standard has just been published this year. It includes a lot of upgrades, eliminates the weight limit, and thereby expands its coverage to other vehicle categories, including

  • heavier road cars,

  • trucks, buses, and

  • motorcycles.

Notably, the second edition also includes guidelines on the design and use of semiconductors in an automotive functional safety context.

ISO 26262 defines the functional requirements for risk assessment and for confidence in the use of software tools. This is described in Clause 11 of ISO 26262, Part 8, which sets down the requirements applying to tool qualification documentation for all software used in automotive design, manufacturing, and in-system operation.

This is already set out and required in the current edition of the standard. What, though, are the significant incoming changes?

What is new and significant in the second edition?

The second edition of ISO 26262 includes several important changes and additions with implications for software vendors. These include:


  • Updates to the PMHF (Probabilistic Metric for Hardware Failure) equation and verification of safety analysis.

  • Extending the process to ensure confidence in the use of software tools to include vendor validation.

  • Guidelines on the application of ISO 26262 to semiconductors in the standard’s new Part 11.

ISO 26262 delivers a minimum set of requirements to fulfill functional safety aspects, but it does not – and cannot – cover all safety aspects of a product. It is the responsibility of the system supplier to ensure that the product meets the highest safety, reliability, and performance metrics. Ensuring software tool safety compliance is part of that process.

Confidence in the use of software tools

From an ISO 26262 perspective, software tools used to create components for automotive systems must be qualified to do their job in a functional safety design environment. Proof is delivered by way of a certified qualification report.

All the software tool qualification and classification requirements are described in Part 8 of the standard. It specifically defines a set of ‘tool confidence levels’ (TCL1-3). These classify the confidence requirements.

  • TCL is thus a measure of the possibility of the software tool being responsible for a failure in a component, and of the ability of the tool environment to detect that problem. TCL1 represents the highest confidence (and thus requires the least qualification effort); TCL3 represents the lowest.

For a software tool used in the development of an automotive system, ISO 26262 describes four qualification methods for achieving a certain confidence level; however, not all of them are required. Different methods are recommended according to the design's targeted Automotive Safety Integrity Level (ASIL).

  • For example, if the component is aiming for the base ASIL-A or ASIL-B levels, methods 1a and 1b are “highly recommended” (++), and methods 1c and 1d are “recommended” (+).

The software tool qualification for TCL 2 provides only one “highly recommended” (++) method: verification (1c); the development-process method (1d) for qualification has been demoted to “recommended” (+).

The four methods are just a part of the tool qualification process. The software tool qualification report is an executive summary of the classification and validation process, the results, recommendations, project-specific process measures, and detailed information about the use of the tool. Software development tools classified as TCL1 are suitable for use on components targeting ASIL-D, the most stringent of the four ASIL levels.

New ISO 26262 Part 11 on semiconductors

Semiconductor companies that are Tier-2 automotive suppliers must meet many tough requirements set by their OEM and Tier-1 clients. They must show evidence that the development of ICs and systems delivered to those customers follow – or have followed – appropriate design, verification, and validation flows that use qualified software tools. ISO 26262 supports this by describing the requirements for tool qualification.


The second edition of ISO 26262 includes a new part (Part 11) that gives guidelines on the application of ISO 26262 to semiconductors.

  • Part 11 provides a comprehensive overview of functional-safety related items for the development of a semiconductor product.

  • It includes general descriptions of a semiconductor component and its development and possible partitioning.

  • It moves on to address important items related to ISO 26262, including sections about hardware faults, errors, and failure modes.

  • It also addresses intellectual property (IP), specifically with regard to any ISO 26262-related IP with one or more allocated safety requirements.

The Second Edition: what isn’t in it

What remains missing from ISO 26262:2018, however, is detail on how to handle the development of autonomous vehicles. This missing topic is addressed in a new standard to follow the second ISO 26262 release, ISO/PAS 21448. It is more commonly referred to as SOTIF, standing for ‘Safety of the Intended Functionality’.

The considerations ultimately addressed by ISO 26262 and SOTIF touch on all parts of the automotive supply chain.

  • The design automation software, for example, is required to address the quality and reliability of the components in an automotive product environment.

Robust Designs, Inventive Problem Solving, and Safety of the Intended Functionality (SOTIF)

In 2014, SAE International established a common terminology for automated driving in the J3016 standard, which describes six levels of automated driving. Greater degrees of Autonomous Driving (AD), also known as driverless driving or self-driving, are gradually introduced from Level 2 onward.

Autonomous driving is at an early stage. The cars we see on the road today are typically Level-2 vehicles. The industry is only now taking its first steps toward the commercial launch of Level-3 autonomous vehicles. While Level-3 vehicles are a reality, they still face legal and regulatory challenges that hamper the implementation. Vehicles capable of meeting the definitions of ‘high’ and ‘full’ autonomous driving— Level 4 and Level 5 —are for the future.

ISO 26262 remains the foundation for providing safe systems, safe hardware, and safe software. These aim to ensure independent, safe operation in the event of a failure. The ISO 26262 standard establishes state-of-the-art processes and architecture, clearly setting rules that allow a system to be safe.

The SOTIF standard, still currently in development, will provide guidelines for Level-0, Level-1, and Level-2 autonomous drive (AD) vehicles. Even these levels of autonomy still have the world’s AD experts struggling to define how to make a system safe.

They face a conundrum: Autonomous vehicles must be safe even when they do not fail. So the SOTIF standard is being drafted to provide guidance that assures an autonomous vehicle functions and acts safely during normal operation. Topics covered in SOTIF will therefore include:

  • Detail on advanced concepts of the AD architecture;

  • How to evaluate SOTIF hazards that are different from ISO 26262 hazards;

  • How to identify and evaluate scenarios and trigger events;

  • How to reduce SOTIF related risks;

  • How to verify and validate SOTIF related risks; and

  • The criteria to meet before releasing an autonomous vehicle.

This will comprise the foundations of an AD methodology. Next comes the implementation. Autonomous verification and validation must pass many tests, from simulation to full vehicle, including factors that encompass the entire 4D environment, such as weather, road conditions, surrounding landscape, object texture, and possible driver misuse.

SOTIF will provide many methods and guidelines for the inclusion of environmental scenarios for use during advance concept analysis and, later on, validation. The SOTIF committee would like to guide users through the documentation of different scenarios, the safety analysis of those scenarios, the verification of both safety scenarios and various trigger events, and the validation of the vehicle in the environment with applied safe systems. These factors will be paramount to compliance with the upcoming standard on AD.

These advanced concepts, evaluations, and tests will go well beyond previous development processes. With that in mind, our reliance on test platforms, software tools, digital-twin simulations, or hardware in the loop, is set to become more important than ever.

So when developing your AD system, you absolutely must be confident that your team has access to the best-in-class software tools.

Functional safety of hybrid electric and electric vehicles

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Environmental Science, Materials Science, Statistics

Functional safety of hybrid electric and electric vehicles has attracted a great deal of attention in automobile industries worldwide. Torque security is one of the main hazards that should be considered for the functional safety of electrified vehicles. Over the past decades, a significant number of accidents have been reported to be caused by unintended acceleration resulting from torque security problems. This CRC Press News discusses torque security problems in electric vehicles using tools like the Failure Modes and Effects Analysis (FMEA) approach. The fault scenarios that can potentially result in loss of torque security in electrified vehicles can be evaluated by Monte Carlo Simulation.

The complexity of both hardware and software has increased significantly in hybrid electric and electric vehicles over the past decade. This is apparent even in the compact passenger car market segment, where the number of Electronic Control Units (ECUs) has nearly tripled. In today's luxury vehicles, software can reach 100 million lines of code, and this figure is only projected to increase. Without preventive measures, the risk of safety-critical system malfunction becomes unacceptably high. The functional safety standard ISO 26262, published in its second edition last year (2018), provides crucial safety-related requirements for vehicles. The Hazard Analysis and Risk Assessment (HARA) is the key analysis used to identify potential risks and develop the highest-level safety requirements to mitigate these identified risks. Throughout development, various safety analyses are required, including qualitative and quantitative analyses.

As described above, torque security is an important element of hybrid electric and electric vehicle development. For the Drive-by-Wire systems that are commonly used in modern vehicles, pedal position signals are critical for ensuring correct calculation of the torque request to the rest of the powertrain. Based on Monte Carlo Simulation, a model-based fault diagnostic scheme for electrified vehicle powertrain torque security problems can be developed, with special focus on pedal position sensor faults. Implementing the design of the diagnostic strategy, a fault tolerant control strategy can be established to mitigate the effect of the faults. The diagnostic and fault tolerant control scheme can then be tested and validated through Model-in-the-Loop (MIL) and Hardware-in-the-Loop (HIL) simulations.

As an obvious advantage, this model-based approach is consistent with the ISO 26262 functional safety standard. By defining and implementing functional safety requirements, including item and function definition and Hazard Analysis and Risk Assessment, the design of a model-based diagnostic and Fault Tolerant Control (FTC) system can lead to systematic problem solving for hybrid electric and electric vehicles' functional safety architecture design.

Robust Designs, Inventive Problem Solving, and Safety of the Intended Functionality (SOTIF)

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Ergonomics & Human Factors, Information Technology, Nanoscience & Technology, Occupational Health & Safety, Public Administration & Public Policy, Statistics

Robust Designs and Safety of the Intended Functionality (SOTIF)

Achieving Robust Designs is the foundation of developing Dependable, Reliable, and Affordable Autonomous Vehicles satisfying safety requirements. The first part of ISO 26262 includes the Hazard Analysis and Risk Assessment, which evaluates the potential risks due to malfunctions in the item to define top-level safety requirements: the safety goals. The subsequent parts of ISO 26262 provide requirements and guidance to avoid and control the random hardware and systematic faults that could violate the safety goals. However, for some systems that rely on environment sensors, there can be safety violations even in a system free from faults, if the sensor or the processing algorithm makes a hazardous decision about the environment. There is a need to provide guidance to manage these violations. Depending on the functional concept, several aspects of the intended functionality may be safety-related, for example:

  • The ability of the function to correctly comprehend the situation and behave safely

  • Whether the robustness of the function is sufficient with regard to signal noise

A new version of the road functional safety standard, the Second Edition of ISO 26262, arrived last year (2018). It included a lot of upgrades, eliminated the weight limit (a maximum gross weight of 3,500 kg), and thus expanded its coverage to other vehicle categories besides passenger vehicles, including heavier road cars, trucks, buses, and motorcycles. Notably, the second edition also includes guidelines on the design and use of semiconductors in an automotive functional safety context.

What remained missing from ISO 26262:2018, however, is detail on how to handle the development of autonomous vehicles. This missing topic will be addressed in a new standard to follow the second ISO 26262 release, ISO/PAS 21448. It is more commonly referred to as SOTIF, standing for ‘Safety of the Intended Functionality’.

The new world of autonomous vehicles poses many challenges to road safety. A robust system begins with robust sensors. When arguing safety for an autonomous road vehicle, it is very hard to show that the sensing capability is sufficient for all possible scenarios that might occur. Even for today's manually driven road vehicles equipped with Advanced Driver Assistance Systems (ADAS), it is far from trivial to argue that the sensor systems are sufficiently capable of enabling safe behavior. Here, we argue that the transition from ADAS to Automated Driving Systems (ADS) enables new solution patterns for the safety argumentation that depends on the sensor systems. A key factor is that the ADS itself can compensate for a lower sensor capability, for example by lowering the speed or increasing the distances. The robust design strategy allocates safety requirements to the sensors to determine their own capability. This capability is then to be balanced by the tactical decisions of the ADS-equipped road vehicle. The SOTIF standard, still currently in development, will provide guidelines for Level-0, Level-1, and Level-2 autonomous drive (AD) vehicles. Even these levels of autonomy still have the world's AD experts struggling to define the safety goals related to SOTIF.

Define the safety goals related to SOTIF

The considerations ultimately addressed by future updates of ISO 26262 and SOTIF will touch on all parts of the automotive supply chain. The design automation software, for example, will be required to address the quality and reliability of the components in an automotive product environment.

Autonomous vehicles must be safe even when they do not fail but interact with the application/usage environment incorrectly. So the SOTIF standard is being drafted to provide guidance that assures an autonomous vehicle functions and acts safely during normal operation by assessing the intended functions' interaction with the application/usage environment. Limiting or disabling certain ADAS or autonomous functions when a sensor is faulty, unavailable or cannot do its job, is viewed as standard practice. Scenarios to be considered might include accident damage, ice build-up on a front-mounted radar, snow obscuring road lane markings, or a dead insect on the windscreen obscuring a camera. All can be handled by a combination of sensor diagnostics and processing intelligence.

The unique safety demands of Autonomous Vehicles (AV) will undoubtedly be a challenge for road safety, but the emergence of new international standards is setting the direction the development of AV will have to take. Topics covered in SOTIF will therefore include:

  • Detail on advanced concepts of the AV architecture;
  • How to evaluate SOTIF hazards that are different from ISO 26262 hazards;

  • How to identify and evaluate scenarios and trigger events;

  • How to reduce SOTIF related risks;

  • How to verify and validate SOTIF related risks; and

  • The criteria to meet before releasing an autonomous vehicle.

As functional safety will play a large part in ensuring robust autonomous systems, standards such as ISO 26262 will need to address autonomy. In the first edition of the standard there was very little specific content related to autonomy. The perspective of the First Edition of ISO 26262 (ISO 26262: 2011) was constrained by the Vienna Convention requirement for the driver to maintain control of the vehicle at all times and with an assumption that electronic systems could therefore “fail silent” in the case of a malfunction.

Now with Edition 2 of the standard (ISO 26262: 2018), has this changed? One area of improvement incorporated in the new edition is related to “fail operational” systems, as some control systems may require a degree of availability in order to maintain safe operation. The standard now considers how to design systems that can continue operation in the presence of failures. Another development area, for inclusion in a later revision of the standard, is around “safety of the intended functionality” – how factors such as sensor performance can be addressed; for example, a false-positive detection of an obstacle by a forward-looking radar. For a function such as Autonomous Emergency Braking (AEB) we want to avoid an un-demanded brake application. One potential cause of this event is that the radar sensor reports the presence of an object that isn’t another vehicle; instead, it is a metal plate on the road during construction works. The challenge is in how we ensure that the sensor correctly discriminates between targets that should cause brake application, and those that should not.

Despite the significant improvements over the First Edition, the Second Edition of ISO 26262 is still firmly grounded in the constraints of a traditional vehicle. As such, there is further work to be done before it fully addresses the unique requirements brought about by autonomy. This includes hazard analysis and availability. Hazard analysis considers driver “controllability”, which needs reinterpreting for a highly automated function. To assure functional safety, hazard analysis needs to consider whether an average driver will be able to maintain control or take some action to mitigate the effects of a failure if one occurs. For a highly automated function, the driver may not be able to take action within a reasonable period of time. As such, a different approach to hazard analysis may be required. Furthermore, additional consideration must focus on the architectures and concepts for assuring the availability of autonomous systems. As vehicles become fully autonomous, this requirement will stretch from an extended period of time to the whole of an arbitrary vehicle journey. This will comprise the foundations of an AV methodology. Next comes the implementation. Autonomous verification and validation must pass many tests, from simulation to full vehicle, including factors that encompass the entire 4D environment, such as weather, road conditions, surrounding landscape, object texture, and possible driver misuse.

Coupled with these changes is the potential shift in liabilities – autonomous systems are being publicized as removing driver error, cited as being the most common cause of traffic accidents. If such a system fails, however, to whom does this responsibility shift? Some manufacturers are already suggesting they might assume liability in the event of a highly automated system failing – the practicality of this will require further consideration – whilst others are taking a more cautious view. In either case, this only underlines the need to have a high degree of assurance and resilience in the systems that deliver highly automated driving functions. SOTIF would provide many methods and guidelines for the inclusion of environmental scenarios for use during advance concept analysis and, later on, validation. SOTIF would guide users through the documentation of different scenarios, the safety analysis of those scenarios, the verification of both safety scenarios and various trigger events, and the validation of the vehicle in the environment with applied safe systems. These factors will be paramount to compliance with the upcoming standard on AV.

Summary: the reliance on inventive problem solving

In summary, we are on the road to making fully autonomous vehicles a reality, and while ISO 26262 sets out the basis on which such systems will be developed, there is more work to do to extend its concepts to deal with such vehicles’ unique safety requirements. In the meantime, expert guidance and adaptation to existing standards is required to cover the development and testing of these systems. Along with SOTIF, advanced concepts, evaluations, and tests will go well beyond previous development processes. With that in mind, the reliance on inventive problem solving, test platforms, software tools, digital-twin simulations, or hardware in the loop, is set to become more important than ever.

Emerging technique to engineer a safe and robust Autonomous Vehicle

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Ergonomics & Human Factors, Information Technology, Mathematics, Occupational Health & Safety, Public Administration & Public Policy, Statistics

With the increasing deployment of Autonomous Vehicles (AVs) on the road and the goal of moving towards full autonomy in the near future, the reliability and safety of these systems are of high concern. There have already been several reports of incidents involving AVs, including fatal crashes.

An Autonomous Vehicle (AV) depends on sensors like RADAR and camera for perception of the environment, path planning, and control. With increasing autonomy and interaction with a complex environment, there have been growing concerns regarding the safety and reliability of AVs. The safe operation of an AV depends not only on the proper functioning of the sensors, actuators, and mechanical components, but also on the proper operation of the autonomous control software and its interactions with other components, the human driver, and the environment. Faults in the controller may become activated either by environmental conditions or by human errors, and propagate into the system, resulting in safety hazards.

This CRC Press News explores the prospect of integrating Risk Engineering, Engineering Decision Making under Uncertainty, Engineering Robust Design with Six Sigma, and Systems-Theoretic Process Analysis (STPA), with potential application to developing a fault injection framework for assessing the resilience of an AV under different environmental conditions and faults affecting sensor data. To increase the coverage of unsafe scenarios during testing, we can develop a strategic software fault-injection approach where the triggers for injecting the faults are derived from the unsafe scenarios identified during the high-level hazard analysis of the system.

As a potential benefit, the strategic fault injection approach could increase the hazard coverage compared to random fault injection and thus help with more effective simulation of safety-critical faults and testing of AVs. In addition, this CRC Press News provides insights on the performance of an AV's safety mechanisms and their ability to detect and recover from faulty inputs in a timely manner.
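
A minimal sketch of the trigger-driven injection idea follows; the scenario predicates, fault models, and vehicle-state fields are all hypothetical placeholders for what a real STPA-derived hazard list would supply.

```python
import numpy as np

rng = np.random.default_rng(7)

TRIGGERS = {
    # unsafe scenario (from the hazard analysis) -> fault model to inject
    "curve_at_speed": lambda frame: frame + rng.normal(0, 25, frame.shape),  # noise burst
    "lead_vehicle_cut_in": lambda frame: np.zeros_like(frame),               # dropped frame
}

def scenario_active(state):
    """Hypothetical predicates over simulator state identifying unsafe scenarios."""
    if state["speed_mps"] > 20 and abs(state["road_curvature"]) > 0.02:
        return "curve_at_speed"
    if state["lead_gap_m"] < 8:
        return "lead_vehicle_cut_in"
    return None

def maybe_inject(state, camera_frame):
    """Pass sensor data through unchanged unless a hazard trigger fires."""
    scenario = scenario_active(state)
    return TRIGGERS[scenario](camera_frame) if scenario else camera_frame

frame = rng.uniform(0, 255, (4, 4))      # toy camera frame
faulty = maybe_inject(
    {"speed_mps": 25, "road_curvature": 0.03, "lead_gap_m": 30}, frame
)
```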

Autonomous vehicles are one of the most complex software-intensive Cyber Physical Systems (CPS). In addition to the basic car mechanisms (e.g. gas/brake system, steering system), they are equipped with driving assistance mechanisms such as Adaptive Cruise Control (ACC), Lane Keeping Assist System (LKAS), and Assisted Lane Change. AVs use smart sensors (e.g. camera, RADAR, LIDAR) and machine learning (ML) algorithms for perception of the surrounding environment, path finding, and navigation. For example, LKAS uses computer vision and ML to process camera data, locate the right/left lane markers, and adjust the steering angle to keep the vehicle inside the lanes.

The implementation of AVs involves an increase in the number and depth of system interactions in comparison to user-driven cars. There is a corresponding need to address the system safety implications of autonomy. Traditional hazard analysis techniques are not designed to identify hazardous states caused by system interactions. An emerging technique based on the integration of Risk Engineering, Engineering Decision Making under Uncertainty, Engineering Robust Design with Six Sigma, and Systems-Theoretic Process Analysis (STPA) allows for inclusion of system-level causal factors by focusing on component interactions. Such an emerging technique can be applied to a Lane Keeping Assist (LKA) system, resulting in the identification of design constraints and requirements needed to engineer a safe and robust Autonomous Vehicle.

Year 2019: moving towards achieving Autonomous Vehicle’s Functional Safety

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Chemical, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Ergonomics & Human Factors, Information Technology, Materials Science, Mathematics, Nanoscience & Technology, Occupational Health & Safety, Public Administration & Public Policy, Statistics

Moving towards achieving the Autonomous Vehicle's functional safety in year 2019 and beyond, electronic systems are becoming more common and sophisticated. Because of this, safety and verification are a critical part of every automotive chip design, as well as of the IP integrated into the silicon. Compliance with ISO 26262, which describes automotive functional safety requirements and levels, has become mandatory for products in this domain. System and chip designers must comply with the standard so that their end products can be certified.

The electronics content of cars is increasing rapidly. Ten years ago, perhaps 20% of the value of a car was made up of electronics. Today it is more like 35%, and in ten years' time electronics could represent half of a car's value – as well as enabling 90% of its innovative features. You only have to look at a Tesla, with its innovative electric powertrain, evolving autonomous driving capabilities, and iPad-like controls to see where we're heading.

More electronics means more Systems-on-Chip (SoCs) to design and verify, and greater competition to supply this rapidly growing market. New product development teams can design and verify SoCs; however, serving the automotive market introduces constraints that IC vendors attempting to migrate from, for example, the mobile-phone sector may not be familiar with. To compete, new entrants will either have to learn very quickly or get some help from a trusted IP vendor.

Functional safety is critical for the SoCs that are the technology backbone for Autonomous Vehicles, Advanced Driver Assistance Systems (ADAS), infotainment equipment, and other in-car systems. However, meeting the various safety standards can be time-consuming and labor-intensive, involving voluminous amounts of data that change as the standards evolve.

One key challenge in developing Autonomous Vehicles is incorporating functional safety for those SoCs in the safety-critical path. The industry has worked long and hard to understand safety and reduce risk, in part through the development of the Automotive Safety Integrity Levels (ASIL) defined in the ISO 26262 standard. These combine the probability of exposure to a hazard, the extent to which it is controllable by a driver, and the severity of a failure to control such a hazard, into four categories, A through D. Of these, ASIL D represents the integrity level necessary in the most safety-critical circumstances.

What does functional safety for SoCs involve?

Functional safety is the concept that a system will remain dependable and function as intended even in the face of an unplanned or unexpected occurrence. If a system is functionally safe, then it is assumed that the system is able to avoid unacceptable risk of physical injury or damage.

For SoCs, there are two foundational requirements of a functionally safe system:

  • Redundancy provides multiple processing paths that limit the risk that any one error will disrupt the system

  • Checkers monitor the systems and trigger error response and recovery features when needed

As SoCs move into smaller process nodes they become more susceptible to errors. For example, phenomena including radiation sources, magnetic fields, and internal wear can all be disruptive to an advanced-node SoC. To assure that an SoC is functionally safe, a designer would typically need to establish a functional verification environment where errors (faults) could be injected into the system. Redundant logic would vote on the correct data to eliminate errors and maintain continuous operation. Checkers would monitor the erroneous data within specified time periods and apply error corrections.

Following certain methodologies can make it more efficient for designers to ensure an Autonomous Vehicle will behave as anticipated, even if something unplanned or unexpected occurs. A set of design and verification technologies that automate fault tolerance, fault injection, and result analysis for intellectual property (IP), SoC, and system designs can reduce ISO 26262 compliance efforts by up to 50 percent.

Managing Autonomous Vehicle safety is a holistic process – everything has to work together correctly for the system to offer the expected levels of safety protection. This means that foundational components such as embedded processors must meet the requirements of the specified ASIL. For ASIL D, this includes a system-level requirement of fewer than 1% single points of failure. In practice, this means that a processor going into an ASIL D certifiable chip must implement Cyclic Redundancy Check (CRC) or Error Checking and Correction (ECC) on caches and closely coupled memories, include a watchdog timer, and operate in lockstep with a redundant core. In a lockstep implementation, two cores run the same code and include a mechanism for comparing the outputs of the two cores and flagging any discrepancies. Extensive safety documentation is also required to demonstrate that risks have been clearly identified and assessed; these documents can then become a key part of the ISO 26262 certification process.
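
Purely as a conceptual illustration of the lockstep principle (not real silicon behavior), the sketch below runs the same computation on two redundant "cores", injects a hypothetical single-event upset into one, and shows the comparator flagging the discrepancy.

```python
def core_compute(inputs, fault_bit=None):
    """The duplicated computation; fault_bit models a single-event upset."""
    acc = 0
    for v in inputs:
        acc = (acc + v) & 0xFFFFFFFF
    if fault_bit is not None:
        acc ^= 1 << fault_bit            # injected bit flip
    return acc

inputs = list(range(100))
primary = core_compute(inputs)
shadow = core_compute(inputs, fault_bit=13)   # redundant core hit by a fault

if primary != shadow:
    # In hardware, this raises a lockstep-error signal and triggers recovery.
    print(f"lockstep mismatch: 0x{primary:08X} vs 0x{shadow:08X}")
```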

Additionally, a self-checking safety monitor can be introduced to ensure lockstep operation; it can delay the activity of one of the redundant cores relative to the other while still comparing results in the correct program-counter order, to avoid potential issues related to glitches that affect both cores at once (e.g., a signal transient). There is also hardware stack protection to check for overflow and underflow of reserved stack space – to prevent data corruption and program crashes – and a watchdog timer to help recover from deadlocks and enable countermeasures against tampering.

Furthermore, a functional safety solution based on multi-purpose built-in self-test and repair infrastructure for SoCs can be developed. This solution allows building a hierarchical network and managing it in multiple in-field test and repair modes.

As the opportunity represented by the development of Autonomous Vehicles grows, and the rate at which it innovates accelerates, competition to provide the key SoCs can only intensify. Although well-designed, carefully verified hardware is critical to achieving ISO 26262 certification, what will really set competitors apart in the automotive market will be how quickly they can meet evolving market requirements and bring a differentiated solution to the market.

Moving towards achieving Autonomous Vehicle’s functional safety in year 2019 and beyond, ensuring that an automotive SoC is functionally safe also gives drivers and passengers confidence in their vehicles. Integrating safety verification into the functional verification flow can be an effective way to speed up the process and manage the effort of complying with standards such as ISO 26262. Using functional verification and fault simulation technologies can also minimize your safety verification effort. With these methodologies and technologies, companies can spend more time creating safe and unique automotive designs.

Functional Safety: Risk Mitigation of Deep Neural Network for Autonomous Driving

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Environmental Science, Nanoscience & Technology, Occupational Health & Safety, Public Administration & Public Policy, Statistics

Machine learning (ML) applications generate a continuous stream of success stories from various domains. ML enables many novel applications, including in safety-critical contexts. However, functional safety standards such as ISO 26262 have not evolved to cover ML. Which parts of ISO 26262 represent the most critical gaps between safety engineering and ML development?

Machine Learning – where is it used in Autonomous Driving (AD)?

In recent years, ML and its powerful models, like Deep Neural Networks (DNNs), have been widely used in a number of industries. Today, we have vehicles on the road whose braking, accelerating (adaptive cruise control, automatic emergency braking and intelligent traffic sign recognition) and steering decisions (lane keeping assistance and traffic jam assistance) are based on sensing, the logic of which has been developed via ML. The driver is still responsible and can override the system actuation – laterally or longitudinally.

The sensor fusion and decision algorithms, whether for acceleration, braking, or steering, are still deterministic and model-based; however, this will change in the next generation of autonomous systems (SAE Level 4).

ML steps can be divided into four independent categories:

  • sensing (e.g. image recognition by camera)

  • mapping (some mapping definitions are based on crowd-sourced data that position the vehicle according to detected reference objects (e.g. traffic lights, commercial posts, etc.))

  • processing (fusion algorithms for combining radar, lidar, map, GPS, camera and v2x data)

  • decisions (autonomous driving algorithms are trained on real-world driving patterns and a road book)

Machine Learning: its main problems for safety

There are several characteristics of ML that can impact safety or safety assessment:

  • Non-transparency. All types of ML models contain knowledge in an encoded form, but this encoding is harder to interpret for humans in some types than in others. Neural network models are considered non-transparent, and significant research effort has been devoted to making them more transparent.

  • Example. For example, a sensing system can define various road edges even in an image where structure and color are almost identical, as it computes the entire scene using ML. To mitigate the risk due to false positives, most OEMs will include mapping and positioning to assure high confidence. This non-transparency is an obstacle to safety assurance and lowers confidence that the model is operating as intended. Another problem of non-transparency arises when one has to explain the root causes of accidents.

  • Error rate. An ML model typically does not operate perfectly and exhibits an error rate. Thus, the absolute “correctness” of an ML component, even with respect to test data, is seldom achieved, and it must be assumed that it will periodically fail. Furthermore, many detection systems have debatable confidence levels. For example, lane detection systems may not detect washed-out lanes or may make wrong detections with tar lanes (especially in dense city areas). Additionally, even if the estimate of the error rate is accurate, it may not reflect the error rate that the system actually experiences in operation on a finite set of inputs, because the true error rate is defined over an infinite set of samples.

  • Training based. ML models are created by training them on a set of data (traffic signs, lanes, vehicles and people). These data are labeled using a subset of possible inputs that could be encountered operationally. Therefore, the training set is necessarily incomplete, and there is no guarantee that it is even representative of the space of possible inputs. In addition, learning may over-fit a model by capturing details incidental to the training set, rather than general to all inputs. Another factor is that, even if the training set is representative, it may under-represent the safety-critical cases because these are often rarer in the input space.

  • Instability. DNN models, which are more-powerful ML models, are typically trained with optimization algorithms, which may have various optima – consequently, one may get different results with the same training set. This characteristic makes it difficult to debug models or reuse parts of previous safety assessments.

Having laid these characteristics out, one can define potential impact areas in the current ISO 26262 standard.

ISO 26262 Impacts and Recommendations

Based on Decision Making Under Uncertainty, we can list the potential impacts of ML on the current standard and point to some recommendations and improvements:

  • Hazard Analysis and Risk Assessment (HARA). With ML-based functions, it is more difficult to estimate risk and controllability.
    Example. The extended use of AD can create behavioral changes – meaning that the driver’s assumed controllability may be delayed or not present, for example, if the vehicle tracks the wrong lane marking and smoothly leaves the desired lane into opposite traffic.
    Recommendations for ISO 26262: The definition of hazards should be broadened to include harm potentially caused by complex behavioral interactions between humans and the vehicle that are not due to a system malfunction.

  • Failure Mode and Effects Analysis (FMEA). ML faults have their own characteristics. Many fault types and failure modes have been cataloged for neural networks; however, their failure modes have to be very carefully revisited.
    Recommendations for ISO 26262: It is recommended to use fault detection tools and techniques that take into account the unique features of neural networks.

  • Level of ML usage. As mentioned in the introduction, ML is currently widely used for sensing; however, in the future, ML could be used to implement an entire software system, including its architecture, using an end-to-end approach.
    Example. We can train a DNN to produce appropriate steering commands directly from raw sensor data, side-stepping typical Autonomous Vehicle (AV) architectural components such as lane detection, path planning, etc. An end-to-end approach deeply challenges the assumptions underlying ISO 26262. Another challenge with an end-to-end approach is that, in some cases, the size of the training set needs to be exponentially larger than when a programmed architecture is used.
    Recommendations for ISO 26262: Although using an end-to-end approach has shown some recent successes with autonomous driving, researchers recommend that an end-to-end use of ML should not be encouraged by ISO 26262, due to its incompatibility with the assumptions about stable hierarchical architectures of components.

  • Required software techniques. Part 6 of ISO 26262 deals with product development at the software level and specifies 75 software development techniques.
    Example. After evaluating all of these techniques for efficiency in ML assessments, it turned out that only 30% of the highly recommended techniques (++) could be used without any adaptation in ML software assessments for ASIL C and D. More than 40% were completely inapplicable, while approximately 20% could be used with adaptation to ML.
    Recommendations for ISO 26262: Many techniques are specifically biased toward the assumption that code is implemented using an imperative programming language. In order to remove this bias, it is recommended that the requirements be expressed in terms of the intent and maturity of the techniques based on Capability Maturity Model (CMM), rather than their specific details.

Sensing and Fusion Maturity

It is recommended to make a clear distinction between system malfunction (i.e., a failure that is recognized) and normal operation (where false-positive detections can happen). It is recommended to reduce sensing failures by implementing an end-to-end mapping and localization engine for full autonomy, which delivers online crowd-sourced data that stabilize vehicles by ensuring that they follow paths even when there are no lanes or when lanes are washed out or misleading. This is definitely a risk engineering consideration, as the vehicle will receive data (reference objects) that enhance its self-positioning to within a few centimeters.

On top of that, the detection capability is quite mature – the current generation of the ML system has been continuously built up with training data. Not only image texture is taken into account: a holistic path-prediction model helps the algorithm to identify actual objects, lanes and edges. One can roughly analyze how accurate the entire system is, or what happens if Long-Term Evolution (LTE)/Global System for Mobile communications (GSM) signal corruption forces the system into a limited mode – a case that definitely should be within the scope of FMEA. (A minimal sketch of such a degraded-mode watchdog follows.)
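As a minimal sketch of that limited-mode behavior (all class names, modes, and the timeout value are assumptions invented for illustration, not from any OEM design):

```python
# Minimal sketch: drop to a degraded localization mode when the
# connectivity link (e.g., LTE/GSM map-correction updates) goes stale.
# All names, modes, and timeouts are illustrative assumptions.
import time
from enum import Enum, auto

class LocalizationMode(Enum):
    FULL = auto()      # crowd-sourced map corrections available
    LIMITED = auto()   # onboard sensors only; wider safety margins apply

class ConnectivityWatchdog:
    def __init__(self, timeout_s: float = 2.0):
        self.timeout_s = timeout_s
        self.last_update = time.monotonic()

    def on_map_update(self) -> None:
        # Called whenever a fresh map correction arrives over LTE/GSM.
        self.last_update = time.monotonic()

    def mode(self) -> LocalizationMode:
        stale = (time.monotonic() - self.last_update) > self.timeout_s
        return LocalizationMode.LIMITED if stale else LocalizationMode.FULL
```

An FMEA can then treat “stale map data” as an explicit failure mode with a defined system response, rather than as an unanalyzed corner case.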

Decision-making under Uncertainty model

The idea of Decision Making Under Uncertainty is not to base the decision-making model on an accident-free assumption. Instead, a vehicle’s decision-making should compute paths and behaviors such that the vehicle will never be responsible for causing an accident. Additionally, algorithms can be developed based on the model to respond to the mistakes of other drivers and so secure the highest safety level for the vehicle’s occupants.
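One published formalization of this “never be the responsible party” idea is Mobileye’s Responsibility-Sensitive Safety (RSS). The sketch below implements RSS’s safe longitudinal following distance; the parameter values are illustrative assumptions, not calibrated figures.

```python
# Minimal sketch of the RSS safe longitudinal distance: the rear vehicle keeps
# enough gap that, even if the lead vehicle brakes as hard as possible, the
# rear vehicle can react after a delay and still avoid causing a collision.
# Parameter values below are illustrative assumptions only.

def rss_safe_distance(v_rear: float, v_front: float,
                      rho: float = 1.0,          # rear vehicle response time [s]
                      a_max_accel: float = 3.0,  # worst-case rear acceleration during rho [m/s^2]
                      b_min_brake: float = 4.0,  # braking the rear vehicle guarantees [m/s^2]
                      b_max_brake: float = 8.0   # hardest braking the front vehicle may apply [m/s^2]
                      ) -> float:
    v_rear_after_rho = v_rear + rho * a_max_accel
    d = (v_rear * rho
         + 0.5 * a_max_accel * rho ** 2
         + v_rear_after_rho ** 2 / (2 * b_min_brake)
         - v_front ** 2 / (2 * b_max_brake))
    return max(d, 0.0)

if __name__ == "__main__":
    # Both vehicles at 25 m/s (90 km/h): required gap under these assumptions.
    print(f"safe gap: {rss_safe_distance(25.0, 25.0):.1f} m")  # about 85 m
```

Planning is then constrained so that the computed gap is always maintained; if every agent honored such constraints, the vehicle would by construction not be the cause of a rear-end collision.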

Summary

Machine learning (ML) applications generate a continuous stream of success stories from various domains. ML enables many novel applications, including in safety-critical contexts. However, functional safety standards such as ISO 26262 have not yet evolved to cover ML.

Deep Neural Networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been developed to execute DNN algorithms with high performance and energy efficiency. Recently, they have been deployed in Green Electronics Manufacturing (potentially for business-critical or industrial applications) and in safety-critical systems such as self-driving cars. Soft errors caused by high-energy particles have been increasing in hardware systems, and these can lead to catastrophic failures in DNN systems. Traditional methods for building resilient systems, e.g., Triple Modular Redundancy (TMR), are agnostic of the DNN algorithm and of the DNN accelerator’s architecture. Hence, these traditional resilience approaches incur high overheads, which makes them challenging to deploy. By evaluating the resilience characteristics of DNN systems with robust design methods based on Six Sigma, we can identify how error resilience depends on the data types, values, data reuse, and types of layers in the design.

Based on these insights, we can formulate guidelines for designing resilient DNN systems and develop efficient DNN protection techniques to mitigate soft errors. These techniques can significantly reduce the rate of Silent Data Corruption (SDC) in DNN systems with acceptable performance and area overheads. Risk mitigation techniques include (a sketch of the software-level technique follows the list):

  • Data type choice (Programming)

  • Symptom-based Error Detection (Software)

  • Selective Latch Hardening (Hardware)

  • Algorithmic Error Resilience (Fault Tolerant Design)
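As a minimal sketch of the symptom-based error detection bullet above (published DNN-resilience work observes that soft-error-corrupted activations tend to fall far outside the value ranges seen in fault-free profiling; the bounds and names here are assumptions for illustration):

```python
# Minimal sketch: symptom-based soft-error detection for one DNN layer.
# Fault-free profiling yields per-layer activation ranges; at inference time,
# values far outside those ranges are treated as soft-error symptoms and
# clamped (or flagged for re-execution). Bounds/names are illustrative.
import numpy as np

class RangeChecker:
    def __init__(self, low: float, high: float):
        # Bounds assumed to come from fault-free profiling of this layer.
        self.low, self.high = low, high

    def check_and_clamp(self, activations: np.ndarray):
        symptom = bool((activations < self.low).any() or
                       (activations > self.high).any())
        # Clamping bounds the corrupted value's magnitude, making it less
        # likely to flip the final classification (i.e., reducing SDC).
        return np.clip(activations, self.low, self.high), symptom

checker = RangeChecker(low=-8.0, high=8.0)
corrupted = np.array([0.3, 1.2, 6.5e7, -0.4])  # one bit-flipped activation
safe, detected = checker.check_and_clamp(corrupted)
print(detected, safe)  # True, with the outlier clamped to 8.0
```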

Year 2019, the Robots Are Coming with Fault Tolerant System Architecture

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Mathematics, Nanoscience & Technology, Statistics

Autonomous vehicles use sensors to perceive objects and other traffic participants. The sensor data feeds into a computer and the computer runs software to understand the environment and to make driving decisions based on this understanding. This software is very complex and contains computer vision, 3D perception, machine learning, localization, decision making, trajectory planning, and control algorithms. These algorithms need a software framework to run in.

Software is playing a key role in the automotive industry, and the Robot Operating System (ROS) has its place for specific activities. ROS is used in the automotive industry as a component to be adapted and deployed in cars. However, its use varies according to a set of parameters, and its reliability depends on those values and usage models.

ROS is used by the majority of companies developing applications for autonomous vehicles and by almost all academic labs. ROS is analogous to iOS’s or Android’s SDK in the mobile stack, providing a software framework for developers to build applications on top of. ROS has found widespread adoption because it is open source, has a strong community behind it, has seen a decade of application development, and provides a rich ecosystem beyond the software framework itself: sensor drivers, algorithms, visualization, simulation, build tools and much more. ROS is excellent for rapid prototyping and development but is not designed to run in safety- and security-critical applications. As the industry shifts from development to productization and from testing to real-world applications, the highest level of robustness and reliability is needed.

Robotics software has chronically faced problems in industry and academia due to the lack of standardization, interoperability and reuse of software libraries. The most relevant problems that prevented the robotics community from producing a healthy software ecosystem were:

  • lack of code reuse;

  • higher needs of integration of components; and

  • finding the appropriate trade-off between efficiency and robustness.

As a solution, free and open source software (FOSS) initiatives such as the Robot Operating System (ROS) were promoted. In particular, ROS provides operating-system-like tools and packaging tools. ROS defines different entities including nodes, message topics and services. Nodes are processes or software modules that can communicate with other nodes by passing simple messages (or data structures) by means of publisher/subscriber mechanisms on top of TCP or UDP. In ROS, a service is modeled as a pair of messages, one for the request and another for the reply. ROS has several client libraries implemented in different languages such as C++, Python, Octave or Java for creating ROS applications. Its major advantage is code reuse and sharing. ROS has been successfully used in different kinds of robots, such as autonomous guided vehicles, and in the automotive industry. For example, ROS can support the Co-Pilot system in highly automated vehicles: the driver should take over control within a certain time constraint when the system requests it; otherwise the system pulls the car over safely. ROS is also used to build Collision Avoidance systems for Autonomous Driving tasks. (A minimal publisher/subscriber sketch appears after the list below.) ROS is interesting for autonomous cars because:

  • There is a lot of code for autonomous cars already created. Autonomous cars require algorithms that are able to build a map, localize the robot using lidars or GPS, plan paths along maps, avoid obstacles, process point clouds or camera data to extract information, etc. All kinds of algorithms required for the navigation of wheeled robots are almost directly applicable to autonomous cars. Hence, since those algorithms have already been created in ROS, self-driving cars can use them off-the-shelf.

  • Visualization tools are already available. ROS provides a suite of graphical tools that allow easy recording and visualization of data captured by the sensors, and represent the status of the vehicle in a comprehensible manner. It also provides a simple way to create any additional visualizations required for particular needs. This is tremendously useful when developing the control software and trying to debug the code.

  • It is relatively simple to start an autonomous car project with ROS onboard. You can start right now with a simple wheeled robot equipped with a pair of wheels, a camera, a laser scanner, and the ROS navigation stack, and you are set up in a few hours. That could serve as a basis to understand how the whole thing works. Then you can move to more professional setups, like for example, buying a car that is already prepared for autonomous car experiments, with full ROS support.
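A minimal example of the node/topic model described above, using the rospy client library (the topic name and message payload are illustrative assumptions):

```python
#!/usr/bin/env python
# Minimal ROS 1 publisher/subscriber sketch using rospy.
# Topic name and payload are illustrative assumptions.
import rospy
from std_msgs.msg import String

def talker():
    pub = rospy.Publisher('vehicle_status', String, queue_size=10)
    rospy.init_node('status_talker')
    rate = rospy.Rate(1)  # publish at 1 Hz
    while not rospy.is_shutdown():
        pub.publish(String(data='lane_keeping: active'))
        rate.sleep()

def listener():
    rospy.init_node('status_listener')
    rospy.Subscriber('vehicle_status', String,
                     lambda msg: rospy.loginfo('got: %s', msg.data))
    rospy.spin()  # hand control to ROS until shutdown

if __name__ == '__main__':
    try:
        talker()  # run listener() as a second node in another process
    except rospy.ROSInterruptException:
        pass
```

Note that both nodes rely on a running roscore for name resolution – which is exactly the single point of failure discussed below.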

ROS, a framework for robotics applications, provides dynamic middleware with publisher/subscriber communication and a remote-procedure-call mechanism. It was initially developed in academia, but recently industrial users have also engaged in preparing ROS for use in products. It does not yet provide guaranteed timing behavior or dependable communication. At present, ROS presents two important drawbacks for autonomous vehicles:

  • Single point of failure. All ROS applications rely on a software component called the roscore. That component, provided by ROS itself, is in charge of coordinating all the different parts of the ROS application. If that component fails, the whole ROS system goes down. This implies that it does not matter how well your ROS application has been constructed: if roscore dies, your application dies.

  • ROS is not secure. The current version of ROS does not implement any security mechanism to prevent third parties from getting onto the ROS network and reading the communication between nodes. This implies that anybody with access to the car’s network can get into the ROS messaging and hijack the car’s behavior.

The ROS community plans to address this in the future. Due to the benefits of reusability and productivity, the component-based approach has become the primary technology in service robot software frameworks such as ROS. However, all the existing frameworks, including ROS, are very limited in fault tolerance support, even though fault tolerance is crucial for the commercial success of service robots. Based on Dr. John X. Wang’s book titled “Engineering Robust Design with Six Sigma”, we can develop a rule-based fault tolerant framework together with widely used, representative fault tolerance measures. Since most faults in the components and applications of service robot systems follow common patterns, we can equip the framework with the required fault tolerant functions. System integrators construct fault tolerant applications from non-fault-aware components by declaring fault handling rules in configuration descriptors and/or adding simple helper components, considering the constraints of the components and the operating environment. Much more consistency in system reliability can be obtained with less effort from the system developer.

For implementation, we can build XML rule files defining the rules for probing and determining the fault conditions of each component, the contamination cases arising from a faulty component, and the possible recovery and safety methods. The rule files are established by a system integrator, and the fault manager in the framework controls the fault tolerance process according to the rules. A Dynamic Fault Tree can then be applied to evaluate the effectiveness of the Robot Operating System (ROS) based fault-tolerant architecture for Autonomous Vehicles. A hypothetical rule file is sketched below.
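A hypothetical example of such a rule file (every element name, component name, and threshold here is invented for illustration; the text above describes the approach, not this exact schema):

```xml
<!-- Hypothetical fault-handling rule file; the schema and all values are
     invented for illustration only. -->
<fault_rules component="lidar_driver">
  <probe type="heartbeat" period_ms="100" miss_limit="3"/>
  <fault condition="heartbeat_timeout">
    <contamination>point_cloud_topic</contamination>  <!-- downstream data now suspect -->
    <recovery order="1">restart_node</recovery>
    <recovery order="2">switch_to_backup_sensor</recovery>
    <safety_action>reduce_speed_and_pull_over</safety_action>
  </fault>
</fault_rules>
```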

Year 2019, the Robots Are Coming with Fault Tolerant System Architecture.

Happy New Year! What Every Engineer Should Know about ISO 26262:2018

By: John X. Wang
Subjects: Computer Game Development, Computer Science & Engineering, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Information Technology

Safety Element out of Context (SEooC)

  • ISO 26262 was originally written around an implicit expectation that chips are built from scratch entirely within one organization, and this is a dated assumption.

  • There was also not enough guidance for IP-based design or design distributed across multiple companies or sites.

  • The workaround for an IP supplier has been to use the Safety Element out of Context (SEooC) mechanism. However, this depends heavily on human interpretation – by the component vendor of what may be relevant to the integrator, and vice versa – with little guidance from the 2011 version of the standard.

  • The new Part 11 of the updated standard provides more detailed and useful examples for engineers in the semiconductor and semiconductor IP industry.

Fail operational

  • In 2011, the common view was that if something went wrong in that automation, a light would flash, or a beeper would beep and the driver would take back control.

  • However, technology is moving a lot faster than that expectation, to the point it may be safer for a backup system to take over control.

  • The industry has coined a new term, “fail operational”, to address this;

    • the system doesn’t just fail silently,

    • it becomes fault-tolerant.

  • When approaching autonomous driving, if the system fails, it must have a backup that responds faster than depending on the driver. (A minimal sketch of this pattern follows.)
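A minimal sketch of the fail-operational pattern – a supervisor that latches into a degraded backup controller instead of failing silently (all class names, commands, and the health-check logic are illustrative assumptions):

```python
# Minimal sketch: fail-operational supervision. If the primary controller's
# health check fails, control latches to a degraded backup (e.g., a
# minimal-risk maneuver) instead of failing silently. Names are assumptions.

class PrimaryController:
    """Full-capability controller (stubbed for this sketch)."""
    def __init__(self):
        self.fault = False
    def healthy(self) -> bool:
        return not self.fault
    def command(self, sensors: dict) -> dict:
        return {"steer": -0.5 * sensors.get("lane_offset", 0.0), "brake": 0.0}

class BackupController:
    """Degraded controller: hold the lane and brake to a safe stop."""
    def command(self, sensors: dict) -> dict:
        return {"steer": 0.0, "brake": 0.3}

class FailOperationalSupervisor:
    def __init__(self, primary: PrimaryController, backup: BackupController):
        self.primary, self.backup = primary, backup
        self.degraded = False

    def step(self, sensors: dict) -> dict:
        # Latch into degraded mode on the first detected fault; this is also
        # the place to alert the driver and log the event for later analysis.
        if not self.degraded and not self.primary.healthy():
            self.degraded = True
        active = self.backup if self.degraded else self.primary
        return active.command(sensors)

sup = FailOperationalSupervisor(PrimaryController(), BackupController())
sup.primary.fault = True               # inject a fault
print(sup.step({"lane_offset": 0.2}))  # backup command: controlled stop
```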

Safety Of The Intended Functionality (SOTIF)

  • Increased system autonomy inserts a new set of safety concerns:

    • What happens when the system causes a safety issue even though there are no systematic errors and no random errors (such as those caused by cosmic rays)?

    • In other words, what if the intended functionality of the system results in a safety issue?

  • These types of issues created by new automation technologies are not handled in the ISO 26262:2018 second edition, which continues to focus on the resilience of hardware and software to safety-related risks and on the processes used to create the hardware and software.

  • These system safety concerns will instead be addressed in a new standard, commonly called SOTIF for Safety Of The Intended Functionality (ISO/PAS 21448:2019).

Cybersecurity

  • Cybersecurity issues were not addressed in 2011 edition.

  • How should Cybersecurity be factored into safety analysis?

  • Part 2, “Management of functional safety,” of the updated standard takes a step towards this by requiring that a design organization create and maintain “effective communication channels” between functional safety, Cybersecurity and other organizations relevant to functional safety.

  • It doesn’t spell out what exactly organizations need to do to prove they have met this expectation.

  • IP integrators generally set the bar for suppliers to meet what they consider to be necessary and sufficient.

Intellectual Property (IP)

  • Part 11 of the 2nd Edition is going to be helpful to semiconductor and IP design teams, and also to the fabs.

  • There’s a big section on dealing with IP, from the perspectives of the IP developer and the integrator and how these two should interact, including discussion on integrating the IP as a SEooC.

  • If the IP integrator determines that the fulfillment of safety requirements is not possible with the supplied IP, a change request to the supplier can be raised.

  • In other words, even though an IP supplier may think its IP is finished, if the system integrator needs something additional to meet its safety objective, the IP supplier would still be on the hook to provide it.

Year 2019, we need to answer the following questions for Self Driving Cars

Industrial Design Engineering: Inventive Problem Solving for ISO 26262 – The Second Edition

By: John X. Wang
Subjects: Computer Science & Engineering, Energy & Clean Technology, Engineering - Electrical, Engineering - Environmental, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Environmental Science, Information Technology, Materials Science, Mathematics, Occupational Health & Safety, Statistics

The Second Edition: what’s in it

A new version of the automotive safety standard arrived this year. This CRC Press News reviews the main updates and looks at how the standard combines with the incoming SOTIF autonomous driving standard.

Carmakers are steadily integrating higher volumes of automated driving features into road vehicles, making functional safety a top priority for their whole industry. To address that, the automotive giants and the companies that supply their electrical and/or electronic (E/E) systems follow the ISO 26262 standard, established in 2011 by the International Organization for Standardization. Until now, ISO 26262 has addressed many aspects of functional safety for passenger vehicles with a maximum gross weight of 3,500 kg. An update to the standard has just been published this year. It includes many upgrades and eliminates the weight limit, thereby expanding its coverage to other vehicle categories, including

  • heavier road cars,

  • trucks, buses, and

  • motorcycles.

Notably, the second edition also includes guidelines on the design and use of semiconductors in an automotive functional safety context.

ISO 26262 defines the functional requirements for the risk assessment of software tools and for confidence in their use. This is described in Clause 11 of ISO 26262, Part 8. It sets down the requirements applying to tool qualification documentation for all software used in automotive design, manufacturing, and in-system operation.

This is already set out and required in the current edition of the standard. What, though, are the significant incoming changes?

What is new and significant in the second edition?

The second edition of ISO 26262 includes several important changes and additions with implications for software vendors. These include:

 

  • Updates to the PMHF (Probabilistic Metric for Hardware Failure) equation and verification of safety analysis.

  • Extending the process to ensure confidence in the use of software tools to include vendor validation.

  • Guidelines on the application of ISO 26262 to semiconductors in the standard’s new Part 11.

ISO 26262 delivers a minimum set of requirements to fulfill functional safety aspects, but it does not – and cannot – cover all safety aspects of a product. It is the responsibility of the system supplier to ensure that the product meets the highest safety, reliability, and performance metrics. Ensuring software tool safety compliance is part of that process.

Confidence in the use of software tools

From an ISO 26262 perspective, software tools used to create components for automotive systems must be qualified to do their job in a functional safety design environment. Proof is delivered by way of a certified qualification report.

All the software tool qualification and classification requirements are described in Part 8 of the standard. It specifically defines a set of ‘tool confidence levels’ (TCL1–3), which classify the confidence requirements.

  • TCL is thus a measure of the possibility that the software tool is responsible for a failure in a component, and of the tool’s ability to detect that problem. TCL1 denotes the highest confidence; TCL3 the lowest.

For a software tool used in the development of an automotive system, ISO 26262 describes four qualification methods for achieving a certain confidence level; not all of them are required. Different methods are recommended according to the design’s targeted Automotive Safety Integrity Level (ASIL).

  • For example, if the component is aiming for the base ASIL-A or ASIL-B levels, methods 1a and 1b are “highly recommended” (++), and methods 1c and 1d are “recommended” (+).

The software tool qualification for TCL2 provides only one “highly recommended” (++) method: verification (1c). The development process method (1d) for qualification has been demoted to “recommended” (+).

The four methods are just a part of the tool qualification process. The software tool qualification report is an executive summary of the classification and validation process, the results, recommendations, project-specific process measures, and detailed information about the use of the tool. Software development tools classified as TCL1 are suitable for use on components targeting ASIL-D, the most stringent of the four ASIL levels.

New ISO 26262 Part 11 on semiconductors

Semiconductor companies that are Tier-2 automotive suppliers must meet many tough requirements set by their OEM and Tier-1 clients. They must show evidence that the development of ICs and systems delivered to those customers follow – or have followed – appropriate design, verification, and validation flows that use qualified software tools. ISO 26262 supports this by describing the requirements for tool qualification.

 

The second edition of ISO 26262 includes a new part (Part 11) that gives guidelines on the application of ISO 26262 to semiconductors.

  • Part 11 provides a comprehensive overview of functional-safety related items for the development of a semiconductor product.

  • It includes general descriptions of a semiconductor component and its development and possible partitioning.

  • It moves on to address important items related to ISO 26262, including sections about hardware faults, errors, and failure modes.

  • It also addresses intellectual property (IP), specifically with regard to any ISO 26262-related IP with one or more allocated safety requirements.

The Second Edition: what isn’t in it

What remains missing from ISO 26262:2018, however, is detail on how to handle the development of autonomous vehicles. This missing topic is addressed in a new standard to follow the second ISO 26262 release, ISO/PAS 21448. It is more commonly referred to as SOTIF, standing for ‘Safety of the Intended Functionality’.

The considerations ultimately addressed by ISO 26262 and SOTIF touch on all parts of the automotive supply chain.

  • The design automation software, for example, is required to address the quality and reliability of the components in an automotive product environment.

Robust Designs, Inventive Problem Solving, and Safety of the Intended Functionality (SOTIF)

In 2014, SAE International established a common terminology for automated driving in the J3016 standard. It describes six levels of automated driving. Greater degrees of Autonomous Driving (AD), also known as driverless or self-driving operation, are gradually introduced from Level 2 onward.

Autonomous driving is at an early stage. The cars we see on the road today are typically Level-2 vehicles. The industry is only now taking its first steps toward the commercial launch of Level-3 autonomous vehicles. While Level-3 vehicles are a reality, they still face legal and regulatory challenges that hamper their implementation. Vehicles capable of meeting the definitions of ‘high’ and ‘full’ autonomous driving – Level 4 and Level 5 – remain in the future.

ISO 26262 remains the foundation for providing safe systems, safe hardware, and safe software. These aim to ensure independent, safe operation in the event of a failure. The ISO 26262 standard establishes state-of-the-art processes and architecture, clearly setting rules that allow a system to be safe.

The SOTIF standard, still currently in development, will provide guidelines for Level-0, Level-1, and Level-2 autonomous drive (AD) vehicles. Even at these levels of autonomy, the world’s AD experts are still struggling to define how to make a system safe.

They face a conundrum: Autonomous vehicles must be safe even when they do not fail. So the SOTIF standard is being drafted to provide guidance that assures an autonomous vehicle functions and acts safely during normal operation. Topics covered in SOTIF will therefore include:

  • Detail on advanced concepts of the AD architecture

  • How to evaluate SOTIF hazards that are different from ISO 26262 hazards;

  • How to identify and evaluate scenarios and trigger events;

  • How to reduce SOTIF related risks;

  • How to verify and validate SOTIF related risks; and

  • The criteria to meet before releasing an autonomous vehicle.

This will comprise the foundations of an AD methodology. Next comes the implementation. Autonomous verification and validation must pass many tests, from simulation to the full vehicle, which include factors encompassing the entire 4D environment, such as weather, road conditions, the surrounding landscape, object texture, and possible driver misuse.

SOTIF will provide many methods and guidelines for the inclusion of environmental scenarios for use during advanced concept analysis and, later on, validation. The SOTIF committee would like to guide users through the documentation of different scenarios, the safety analysis of those scenarios, the verification of both safety scenarios and various trigger events, and the validation of the vehicle in the environment with applied safe systems. These factors will be paramount to compliance with the upcoming standard on AD.

These advanced concepts, evaluations, and tests will go well beyond previous development processes. With that in mind, our reliance on test platforms, software tools, digital-twin simulations, or hardware in the loop, is set to become more important than ever.

So when developing your AD system, you must be confident that your team has access to best-in-class software tools.

Autonomous Travel: 3 Steps to Integrate QFD, TRIZ and Taguchi DOE for Functional Safety

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Chemical, Engineering - Civil, Engineering - Electrical, Engineering - General, Engineering - Industrial & Manufacturing, Engineering - Mechanical, Engineering - Mining, Occupational Health & Safety

A safety-critical system has to satisfy the performance-related requirements and the safety-related requirements simultaneously. Conceptually, design processes should consider both of them together. Dr. John X. Wang’s most recent book, “Industrial Design Engineering: Inventive Problem Solving”, revealed that safety-related functions must be resolved simultaneously with the development of performance-related functions, particularly in the case of safety-critical systems.

Since success-domain and failure-domain analyses are essential for investigating performance-related and safety-related requirements, respectively, the book integrated Axiomatic Design (AD), Fault Tree Analysis (FTA), and the Theory of Inventive Problem Solving (TRIZ) to resolve the challenge of developing safety-related and performance-related functions simultaneously, including the following:

  • A design evolution framework that uses feedback from AD to identify functional couplings, the TRIZ methodology to explore uncoupling solutions, and FTA to improve reliability in a systematic way.

It was shown that several iterations of the AD–TRIZ–FTA loop result in an optimized design that can be tested against the desired performance and safety criteria. With applications to all contemporary industries, including the automotive industry, the book presented related case studies to:

  • Support automotive safety engineering with features allowing the creation of consistent and complete work products, and

  • Simplify and automate workflow steps from early analysis through system development to software development.

More precisely, the related case studies provide design insights that support:

  • Model creation and reuse

  • Analysis and documentation

  • Configuration and code generation.

Having been featured as the ISE Magazine May 2017 Book of the Month, Dr. Wang’s book presented a tool chain supporting the application of a safety engineering workflow aligned with the automotive safety standard ISO 26262. In particular, the tool chain focuses on the following:

  • Support for property checking and model correction

  • Support for fault tree generation and FMEA (Failure Modes and Effects Analysis) table generation.

Based on the case study of hybrid electric vehicle development, Dr. Wang’s book demonstrated that the abilities above are able to strongly support FTA (Fault Tree Analysis) and FMEA for functional safety assessment.

Specifically, the book presented an innovative step-by-step method based on TRIZ tools used according to the general approach suggested by FMEA. The aim of the method is to build an improved risk management model for design and to enhance the capability of anticipating problems and technical solutions, reducing failure occurrence. The method adopts tools used to model the system, such as function and Su-Field models and resource evaluation, as well as tools dedicated to problem solving, such as the standard solutions. The resulting method allows a better definition of the system’s decomposition and functioning and provides a sharper definition of the events and failures that can potentially occur in the system, which standard FMEA does not provide. Moreover, the high importance given to resources from the beginning of the method is extremely effective for understanding system evolution and for generating resource-based solutions to problems dealing with product risk. The overall method has been developed so that technicians are not required to have high-level expertise in TRIZ tools. To evaluate the method, it was tested with students with basic TRIZ education, and some applications to industrial case studies were performed.

Autonomous Vehicles (AVs), including Autonomous Trucks, are the future evolution of Land Transportation Systems (LTs). AVs promise an improvement in road safety. However, safety requirements remain a big challenge for their development due to a lack of insight into how the safety requirements of Autonomous Land Transportation Systems (ALTs) will evolve. A previous CRC Press News titled “Engineering Robust Autonomous Truck Designs with Six Sigma” explores an analysis method for the evolution of LTs safety requirements toward ALTs. The Automotive Safety Integrity Level (ASIL) metric is used to ensure safety criticality with the following 3 steps:

  1. QFD (Quality Function Deployment) has been successfully applied by Toyota in vehicle design since the 1970s. QFD effectively establishes the relationship between the Voice of the Customer (VOC) and technical parameters with the House of Quality, and also deploys in cascade to all levels of product realization within the enterprise. QFD sets up the direction and target of the product design; however, QFD may not be able to find out how to break through the bottleneck and create an optimized solution.

  2. TRIZ (the Theory of Inventive Problem Solving) was developed by Genrich S. Altshuller based on the study of 2.5 million patents and 1,500 man-years of research. TRIZ can dramatically improve the creativity of engineers through systematic skills. It can be used at all stages of the product life cycle to improve the quality of the product, expand the market, protect intellectual property, and research new products. TRIZ can resolve “bottleneck” issues with its problem/conflict-solving skills, and also create new product designs.

  3. DOE (Taguchi Design of Experiments) can strengthen the parameter selection process, which TRIZ may not effectively provide. DOE can be applied to reach the best parameter combination for the product. Taguchi DOE concentrates on parameter design. With parameter design, the test plan can be developed and the influence of noise factors can be reduced based on the signal-to-noise (S/N) ratio, thus improving the robustness of the product. (A worked S/N ratio computation follows this list.)
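As a worked illustration of the S/N ratio in step 3 (the measurement values are invented for illustration; the formulas are the standard Taguchi definitions):

```python
# Worked sketch: Taguchi signal-to-noise (S/N) ratios for parameter design.
# Measurement values are invented for illustration; higher S/N indicates a
# parameter setting that is more robust against noise factors.
import math

def sn_larger_is_better(y):   # e.g., strength, efficiency
    return -10 * math.log10(sum(1 / v ** 2 for v in y) / len(y))

def sn_smaller_is_better(y):  # e.g., wear, braking distance
    return -10 * math.log10(sum(v ** 2 for v in y) / len(y))

def sn_nominal_is_best(y):    # e.g., a target clearance or voltage
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / (n - 1)
    return 10 * math.log10(mean ** 2 / var)

# Two candidate parameter settings measured under varying noise conditions:
setting_a = [9.8, 10.1, 10.0, 9.9]   # tight scatter around the 10.0 target
setting_b = [8.5, 11.6, 9.2, 10.9]   # same rough mean, much more scatter
print(f"S/N (nominal-is-best), A: {sn_nominal_is_best(setting_a):.1f} dB")  # ~37.7
print(f"S/N (nominal-is-best), B: {sn_nominal_is_best(setting_b):.1f} dB")  # ~16.9
```

Setting A’s higher S/N marks it as the more robust parameter combination – exactly the selection the parameter-design stage makes.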

Featured as the ISE Magazine May 2017 Book of the Month, Dr. John X. Wang’s book titled Industrial Design Engineering: Inventive Problem Solving can be applied to integrate QFD, TRIZ and Taguchi DOE to effectively improve the product design process and achieve robust product design for customer satisfaction, assuring Functional Safety.


A Risk Engineering Approach to Functional Safety

By: John X. Wang
Subjects: Computer Science & Engineering, Engineering - Electrical, Engineering - General, Information Technology, Statistics

Compliance standards, especially those that involve relatively new functional safety elements, will likely add requirements to the development process. For example, the increasing complexity and abundance of automotive electronic systems led to the creation of a new functional safety standard called ISO 26262. Similar regulations abound in other industries: DO-178B/C for aerospace and IEC 60601 for medical devices.

Common to all of these safety standards is a risk engineering approach to determine the criticality and potential hazards associated with key system functions. The primary goal of these standards is to prevent the failure of a system or device that could cause injury, harm or death. If a failure is unavoidable, then the system should be fail-safe. Looking at the different integrity levels in the DO-178C "Software Considerations in Airborne Systems and Equipment Certification" development guidance and the ISO 26262 "Road vehicles – Functional safety" standard:

  • What do they mean for code coverage?

  • Why are they not equivalent?

  • What’s the difference between a SIL and a DAL?

  • How do they affect code coverage, the percentage of code that is covered by automated tests?

How do you find the integrity level?

Most safety standards use the concept of an integrity level, which is assigned to a system or a function. This level is based on an initial analysis of the consequences of the software going wrong. Both standards give clear guidance on how to identify your integrity level.

  • For example, DO-178C has Software Levels, which are assigned based on the outcome of "anomalous behavior" of a software component – from Level A for a "Catastrophic" outcome down to Level E for "No Safety Effect".

  • ISO 26262 has the ASIL (Automotive Safety Integrity Level), based on the severity, exposure, and controllability of hazardous events affecting the vehicle.

For comparison, ASILs range from D for the highest severity and most probable exposure down to A for the least. So the underlying concept is similar – find out how severe the effect would be and how likely it is that your software can go wrong. The difference is in the scales and criteria used. (A sketch of the ASIL determination pattern follows.)
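For illustration, ISO 26262 Part 3 determines the ASIL from three classifications – Severity (S1–S3), Exposure (E1–E4) and Controllability (C1–C3) – and the published table follows a simple additive pattern, sketched below (a compact reading of that table; consult the standard for the normative mapping):

```python
# Sketch: ASIL determination from Severity (S1-S3), Exposure (E1-E4) and
# Controllability (C1-C3). The ISO 26262 Part 3 table follows an additive
# pattern; consult the standard itself for the normative mapping.

def asil(severity: int, exposure: int, controllability: int) -> str:
    assert 1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3
    total = severity + exposure + controllability
    # total <= 6 -> QM (quality management only); 7 -> A; 8 -> B; 9 -> C; 10 -> D
    return {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}.get(total, "QM")

print(asil(3, 4, 3))  # worst case on every axis -> ASIL D
print(asil(2, 2, 2))  # moderate on every axis  -> QM
```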

What does integrity level mean?

The allocated integrity level is linked to a set of processes to follow when developing a software system – the higher the level, the more rigorous and stringent these processes are.

Example. Let’s take code coverage requirements as an example. The purpose of code coverage testing is to determine how much of the software’s code has been exercised by requirements-based test cases. It can be a powerful tool for locating code that doesn’t trace to a requirement, or limitations in your testing. The more severe the consequences of the software going wrong, the more evidence is needed that the test cases can find potential problems in the code.

Table 1 shows the code coverage requirements for DO-178C, from Annex A, Table A-7 of the standard. At Level C you only need to demonstrate that your tests cover all the statements in your software. At Level A, however, you need three types of code coverage, including the most stringent, Modified Condition/Decision Coverage (MC/DC), for which every condition must be shown to independently affect the decision’s outcome. Further, you would need to run these tests and demonstrate that the testing has been performed with independence – by someone not directly involved in the development of the code. In software testing, MC/DC is a code coverage criterion that requires every condition in a decision to be tested independently to reach full coverage. (A worked MC/DC example follows Table 1.)

Table 1 Code Coverage Requirements in DO-178C (Annex A, Table A-7)

            MC/DC*               Decision Coverage    Statement Coverage
Level A     With Independence    With Independence    With Independence
Level B     -                    With Independence    With Independence
Level C     -                    -                    Required
Level D     -                    -                    -
Level E     -                    -                    -

* Modified Condition/Decision Coverage
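To make the MC/DC requirement concrete, here is a worked sketch for the decision `(a and b) or c` (an invented example; for n conditions, MC/DC needs at least n + 1 tests, so four tests suffice here):

```python
# Worked sketch: a minimal MC/DC test set for the decision (a and b) or c.
# Each condition is shown to independently flip the decision's outcome while
# the other conditions are held fixed. Invented example for illustration.

def decision(a: bool, b: bool, c: bool) -> bool:
    return (a and b) or c

tests = [
    (True,  True,  False),  # baseline: decision is True
    (False, True,  False),  # differs from test 1 only in 'a' -> outcome flips
    (True,  False, False),  # differs from test 1 only in 'b' -> outcome flips
    (True,  False, True),   # differs from test 3 only in 'c' -> outcome flips
]
for a, b, c in tests:
    print(f"a={a!s:5} b={b!s:5} c={c!s:5} -> {decision(a, b, c)}")
```

Statement coverage would be satisfied by any single test; decision coverage needs one True and one False outcome; MC/DC needs the per-condition pairs shown above – which is why the Level A obligation is so much heavier than Level C’s.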

Let’s compare this with the code coverage requirements in ISO 26262. Table 2 summarizes the requirements from Tables 12 and 15 of Part 6 of ISO 26262. There are two different sets: one at the unit level and one at the architectural level. Techniques are either highly recommended (++) or recommended (+). There is no requirement for independence. So, whilst the highest ASIL still requires MC/DC, the requirement for how that is achieved is different. Also, code coverage must be applied at multiple abstraction levels.