Master's thesis proposal.

  1. #LyX 2.3 created this file. For more info see http://www.lyx.org/
  2. \lyxformat 544
  3. \begin_document
  4. \begin_header
  5. \save_transient_properties true
  6. \origin unavailable
  7. \textclass article
  8. \begin_preamble
  9. \usepackage{inputenc}
  10. \usepackage{natbib}
  11. \usepackage{hyperref}
  12. \usepackage{graphicx}
  13. \usepackage[colorinlistoftodos]{todonotes}
  14. \usepackage{parskip}
  15. \setlength{\parskip}{10pt}
  16. \usepackage{tikz}
  17. \usetikzlibrary{arrows, decorations.markings}
  18. \usepackage{chngcntr}
  19. \counterwithout{figure}{section}
  20. \end_preamble
  21. \use_default_options true
  22. \maintain_unincluded_children false
  23. \language english
  24. \language_package default
  25. \inputencoding auto
  26. \fontencoding global
  27. \font_roman "default" "default"
  28. \font_sans "default" "default"
  29. \font_typewriter "default" "default"
  30. \font_math "auto" "auto"
  31. \font_default_family default
  32. \use_non_tex_fonts false
  33. \font_sc false
  34. \font_osf false
  35. \font_sf_scale 100 100
  36. \font_tt_scale 100 100
  37. \use_microtype false
  38. \use_dash_ligatures true
  39. \graphics default
  40. \default_output_format default
  41. \output_sync 0
  42. \bibtex_command default
  43. \index_command default
  44. \paperfontsize default
  45. \spacing single
  46. \use_hyperref false
  47. \papersize default
  48. \use_geometry false
  49. \use_package amsmath 1
  50. \use_package amssymb 1
  51. \use_package cancel 1
  52. \use_package esint 1
  53. \use_package mathdots 1
  54. \use_package mathtools 1
  55. \use_package mhchem 1
  56. \use_package stackrel 1
  57. \use_package stmaryrd 1
  58. \use_package undertilde 1
  59. \cite_engine basic
  60. \cite_engine_type default
  61. \biblio_style plain
  62. \use_bibtopic false
  63. \use_indices false
  64. \paperorientation portrait
  65. \suppress_date false
  66. \justification true
  67. \use_refstyle 1
  68. \use_minted 0
  69. \index Index
  70. \shortcut idx
  71. \color #008000
  72. \end_index
  73. \secnumdepth 2
  74. \tocdepth 2
  75. \paragraph_separation indent
  76. \paragraph_indentation default
  77. \is_math_indent 0
  78. \math_numbering_side default
  79. \quotes_style english
  80. \dynamic_quotes 0
  81. \papercolumns 1
  82. \papersides 1
  83. \paperpagestyle default
  84. \tracking_changes false
  85. \output_changes false
  86. \html_math_output 0
  87. \html_css_as_file 0
  88. \html_be_strict false
  89. \end_header
  90. \begin_body
  91. \begin_layout Standard
  92. \begin_inset ERT
  93. status open
  94. \begin_layout Plain Layout
  95. \backslash
  96. begin{titlepage}
  97. \end_layout
  98. \end_inset
  99. \end_layout
  100. \begin_layout Standard
  101. \begin_inset ERT
  102. status open
  103. \begin_layout Plain Layout
  104. \backslash
  105. centering
  106. \end_layout
  107. \end_inset
  108. \end_layout
  109. \begin_layout Standard
  110. \shape smallcaps
  111. \size largest
  112. Master's thesis project proposal
  113. \begin_inset Newline newline
  114. \end_inset
  115. \end_layout
  116. \begin_layout Standard
  117. \begin_inset VSpace 0.5cm
  118. \end_inset
  119. \end_layout
  120. \begin_layout Standard
  121. \series bold
  122. \size huge
  123. Tracing in Distributed Systems
  124. \end_layout
  125. \begin_layout Standard
  126. \begin_inset VSpace 2cm
  127. \end_inset
  128. \end_layout
  129. \begin_layout Standard
  130. \size larger
  131. Martins Eglitis eglitis@student.chalmers.se
  132. \end_layout
  133. \begin_layout Standard
  134. \begin_inset VSpace 1.5cm
  135. \end_inset
  136. \end_layout
  137. \begin_layout Standard
  138. \size large
  139. Relevant completed courses:
  140. \end_layout
  141. \begin_layout Itemize
  142. EDA093, Operating Systems
  143. \end_layout
  144. \begin_layout Itemize
  145. EDA387, Computer Networks
  146. \end_layout
  147. \begin_layout Itemize
  148. EDA263, Computer Security
  149. \end_layout
  150. \begin_layout Itemize
  151. EDA491, Network Security
  152. \end_layout
  153. \begin_layout Standard
  154. \begin_inset VSpace vfill
  155. \end_inset
  156. \end_layout
  157. \begin_layout Standard
  158. \begin_inset VSpace vfill
  159. \end_inset
  160. \end_layout
  161. \begin_layout Standard
  162. \size large
  163. \begin_inset ERT
  164. status open
  165. \begin_layout Plain Layout
  166. \backslash
  167. today
  168. \end_layout
  169. \end_inset
  170. \begin_inset Newline newline
  171. \end_inset
  172. \end_layout
  173. \begin_layout Standard
  174. \begin_inset ERT
  175. status open
  176. \begin_layout Plain Layout
  177. \backslash
  178. end{titlepage}
  179. \end_layout
  180. \end_inset
  181. \end_layout
  182. \begin_layout Section
  183. Introduction
  184. \end_layout
  185. \begin_layout Standard
  186. Distributed systems are ubiquitous, providing daily services such as network
  187. applications (web search, online shopping, gaming), communications (networks,
  188. sensors), transportation, and many more.
  189. Although there is no single definition of a distributed system, it can be
  190. viewed as a system that logically or functionally distributes the
  191. workload of a goal over multiple processing units
  192. \begin_inset CommandInset citation
  193. LatexCommand cite
  194. key "ghoshDistributedSystemsAlgorithmic2015"
  195. literal "false"
  196. \end_inset
  197. .
  198. \end_layout
  199. \begin_layout Standard
  200. In many situations, the number of actual computers that serve a single goal
  201. is tremendous.
  202. For example, there are thousands of servers involved in serving a web request
  203. when using the Google search engine
  204. \begin_inset CommandInset citation
  205. LatexCommand cite
  206. key "sigelmanDapperLargeScaleDistributed"
  207. literal "false"
  208. \end_inset
  209. .
  210. Besides the user-defined goal, other tangential goals are serviced, such
  211. as collecting statistics, checking grammar, or showing tailored ads.
  212. Horizontally scaled systems, consisting of a myriad of less powerful servers,
  213. are well-suited for networking-related workloads
  214. \begin_inset CommandInset citation
  215. LatexCommand cite
  216. key "barrosoWebSearchPlanet2003"
  217. literal "false"
  218. \end_inset
  219. .
  220. \end_layout
  221. \begin_layout Standard
  222. The execution trace often leads outside the boundaries of a single entity
  223. - services can be managed by different teams in different countries, using
  224. various programming languages and frameworks
  225. \begin_inset CommandInset citation
  226. LatexCommand cite
  227. key "sigelmanDapperLargeScaleDistributed"
  228. literal "false"
  229. \end_inset
  230. .
  231. Yet when an error occurs somewhere in the execution path, it is crucial
  232. to point to the problematic area.
  233. The culprit is the massive scale, which is currently a burden to state-of-the-art
  234. monitoring systems, as they can only observe such unwanted events happening
  235. \begin_inset CommandInset citation
  236. LatexCommand cite
  237. key "alvaroOKWHYTRACING"
  238. literal "false"
  239. \end_inset
  240. .
  241. For example, such systems can observe spikes in latency but do not pinpoint
  242. the actual root cause of the problem.
  243. Even if the problem has been localized, further questions arise: how to relate
  244. the record in the log file to other log files? How to make sure the record
  245. refers to the same service? How to match records efficiently across thousands
  246. of servers?
  247. \end_layout
  248. \begin_layout Standard
  249. This project will focus on researching the field and implementing the findings
  250. by building a modern tracing system for a distributed wireless solution
  251. used by Cisco.
  252. The system should be capable of collecting data from the controller, the
  253. access point, and other deployed devices such as authentication servers
  254. and analytics services.
  255. Among others, use cases of such systems include troubleshooting errors,
  256. detecting anomalies, discovering latency problems, exploring dependencies, and
  257. validating functionality.
  258. \end_layout
  259. \begin_layout Section
  260. Context
  261. \end_layout
  262. \begin_layout Standard
  263. One of the earlier works in the field of tracing in distributed systems
  264. presents Dapper
  265. \begin_inset CommandInset citation
  266. LatexCommand cite
  267. key "sigelmanDapperLargeScaleDistributed"
  268. literal "false"
  269. \end_inset
  270. .
  271. Google has been using the closed-source tracer in their production environments
  272. for at least 2 years, thus proving the maturity of the solution.
  273. The initial requirements for Dapper were: 1) Low overhead - some applications
  274. are very sensitive to network data increase or latency.
  275. 2) Application-level transparency - teams and developers are not keen on changing
  276. their codebase on demand; therefore, the tracing has to be implemented at
  277. lower levels, for example, in common libraries (threading, control-flow,
  278. RPC).
  279. 3) Scalability - Dapper has to be able to support existing and new services
  280. for at least 5 years.
  281. The requirements for the distributed wireless tracing solution at Cisco
  282. are closely similar to those listed by Google, except for scale, which
  283. is limited for security reasons.
  284. This makes Dapper a very appealing and useful research platform
  285. for this project.
  286. Zipkin
  287. \begin_inset CommandInset citation
  288. LatexCommand cite
  289. key "OpenZipkinDistributedTracing"
  290. literal "false"
  291. \end_inset
  292. , an open-source project very similar to Dapper, will be used instead as
  293. a drop-in replacement.
  294. \end_layout
  295. \begin_layout Standard
  296. One of the disadvantages of annotation-based schemas (Dapper or X-Trace
  297. \begin_inset CommandInset citation
  298. LatexCommand cite
  299. key "AWSXRayDistributed"
  300. literal "false"
  301. \end_inset
  302. ) is the need to modify the underlying instrumentation, for example,
  303. the common low-level libraries the services are using.
  304. If the scope of the project is limited, which is true in this case, the
  305. chances of applying it across every service are low.
  306. The other approach is to use so-called black-box schemas (Project5
  307. \begin_inset CommandInset citation
  308. LatexCommand cite
  309. key "inproceedings"
  310. literal "false"
  311. \end_inset
  312. or Sherlock
  313. \begin_inset CommandInset citation
  314. LatexCommand cite
  315. key "arbezzanoGianarbSherlock2019"
  316. literal "false"
  317. \end_inset
  318. ).
  319. Unfortunately, active work on these projects ended several years ago
  320. (the Sherlock repository was archived 2 years ago).
  321. The downsides of black-box schemas are decreased accuracy and large overhead
  322. due to the statistical regression techniques used.
  323. However, there is one major advantage - no code modifications are required
  324. at any level, which might be useful when direct access to a service or
  325. instrumentation is blocked.
  326. \end_layout
  327. \begin_layout Section
  328. Goals and challenges
  329. \end_layout
  330. \begin_layout Standard
  331. The main goals for the project are:
  332. \end_layout
  333. \begin_layout Itemize
  334. Research the state of the art in distributed tracing.
  335. \end_layout
  336. \begin_layout Itemize
  337. Define the tracing data model.
  338. For example, sampling rates, entry points, system components, and integration
  339. with other observability methods.
  340. \end_layout
  341. \begin_layout Itemize
  342. Write low-overhead collector code and encapsulate it in libraries or applications.
  343. \end_layout
  344. \begin_layout Itemize
  345. Integrate with a tracing lookup tool such as Zipkin, X-Ray, or AppDynamics.
  346. \end_layout
  347. \begin_layout Itemize
  348. Evaluate the code based on relevant metrics.
  349. For example, network data overhead and latency, system resource usage (CPU,
  350. memory, storage), scalability, ease of implementation and transparency.
  351. \end_layout
  352. \begin_layout Standard
  353. Some of the challenges are:
  354. \end_layout
  355. \begin_layout Itemize
  356. Due to the nature of the internship and lack of security clearance, some
  357. parts of the system might not be accessible.
  358. \end_layout
  359. \begin_layout Itemize
  360. Enterprise networking products can have a large codebase and are usually
  361. written in
  362. \begin_inset Quotes eld
  363. \end_inset
  364. low-level
  365. \begin_inset Quotes erd
  366. \end_inset
  367. languages such as C/C++.
  368. This makes the learning curve steep.
  369. \end_layout
  370. \begin_layout Itemize
  371. Existing products might not be homogeneous, and implementing code will thus
  372. require more individual adjustments across the instrumentation libraries.
  373. \end_layout
  374. \begin_layout Itemize
  375. All non-trivial software is known to potentially contain bugs that introduce
  376. security vulnerabilities or unwanted program behavior.
  377. All these factors can impact the speed and quality of development.
  378. \end_layout
  379. \begin_layout Section
  380. Approach
  381. \end_layout
  382. \begin_layout Standard
  383. The first part is to research the field of distributed tracing in depth.
  384. This includes reading academic papers as well as understanding the capabilities
  385. of tooling available (tracing systems, frameworks like OpenTelemetry
  386. \begin_inset CommandInset citation
  387. LatexCommand cite
  388. key "OpenTelemetry"
  389. literal "false"
  390. \end_inset
  391. ), understanding techniques (different data models, collection methods),
  392. problems that can be solved (tracing, security audits, pattern checking),
  393. advantages and disadvantages, etc.
  394. \end_layout
  395. \begin_layout Standard
  396. The second part is to apply the acquired knowledge in building the actual
  397. tracing system at Cisco.
  398. Depending on the findings from the first part, it could mean adjusting
  399. an existing tracing system, building one from scratch, or combining the two.
  400. Working closely with the software engineering team will be necessary to
  401. flatten the learning curve of enterprise subsystems, study the C/C++ libraries
  402. used, find the optimal data structures and algorithms, implement the collector,
  403. and finally integrate with a lookup tool.
  404. \end_layout
  405. \begin_layout Standard
  406. The last part consists of evaluating the tracing system.
  407. The tracer has to be evaluated both quantitatively and qualitatively.
  408. Depending on the quality of the deliverable, two test environments are
  409. possible - either testing or production.
  410. The deliverable will be deployed in the production environment only if
  411. it is approved at a higher level of the organization, as it may appear in instrumentation
  412. throughout the company.
  413. Approval usually requires having all the required functionality, documentation,
  414. code reviews, heavy testing, benchmarking, etc.
  415. The results for some quantitative metrics such as latency, network data
  416. overhead, and resource usage will be collected, analyzed, and compared against
  417. other published results such as those reported for Dapper
  418. \begin_inset CommandInset citation
  419. LatexCommand cite
  420. key "sigelmanDapperLargeScaleDistributed"
  421. literal "false"
  422. \end_inset
  423. .
  424. Qualitative metrics such as application-level transparency and ease of
  425. implementation will be investigated by surveying different teams and developers
  426. within Cisco.
  427. \begin_inset CommandInset bibtex
  428. LatexCommand bibtex
  429. btprint "btPrintCited"
  430. bibfiles "chalmers-tracing-in-distributed-systems"
  431. options "plain"
  432. \end_inset
  433. \end_layout
  434. \end_body
  435. \end_document