tag:blogger.com,1999:blog-53003907596478957922024-03-19T00:34:45.804-07:00FutureXskillsMy Empty Mindhttp://www.blogger.com/profile/10861700932904107913noreply@blogger.comBlogger9125tag:blogger.com,1999:blog-5300390759647895792.post-86694230908110667662024-03-19T00:31:00.000-07:002024-03-19T00:34:15.010-07:00NVIDIA Unveils Blackwell: Revolutionizing AI with Next-Gen Chips<p><span style="font-family: helvetica; font-size: large;">In the heart of Silicon Valley, amidst the vibrant tech scene, NVIDIA made waves once again. On Monday, at the company's annual GPU Technology Conference (GTC) in San Jose, CEO Jensen Huang took the stage to unveil NVIDIA's latest triumph: the Blackwell graphics processing unit (GPU). This groundbreaking innovation promises to redefine the landscape of artificial intelligence (AI) and accelerate the pace of technological advancement across industries.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">The announcement comes at a pivotal moment, with the world still reeling from the AI boom initiated by OpenAI's ChatGPT back in 2022. NVIDIA has been at the forefront of this revolution, and the Blackwell GPU solidifies its position as the leading provider of AI hardware solutions.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Named in honor of David Harold Blackwell, a pioneering mathematician, the Blackwell GPU represents a leap forward in AI computing. 
Boasting six transformative technologies, it promises to unlock breakthroughs in data processing, engineering simulation, electronic design automation, computer-aided drug design, quantum computing, and generative AI.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">At the core of the Blackwell GPU architecture lies its unprecedented power. With 208 billion transistors packed into a custom-built 4NP TSMC process, Blackwell sets a new standard for performance. Its second-generation transformer engine, fueled by micro-tensor scaling support and advanced dynamic range management algorithms, doubles the compute and model sizes, paving the way for more intricate AI models.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">But the innovation doesn't stop there. The fifth-generation NVLink, with its groundbreaking 1.8TB/s bidirectional throughput per GPU, ensures seamless high-speed communication among up to 576 GPUs. This level of connectivity enables the deployment of multitrillion-parameter AI models with unprecedented efficiency.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Moreover, Blackwell is not just about raw power; it's about resilience and security. The inclusion of a dedicated RAS engine ensures reliability, availability, and serviceability, while advanced confidential computing capabilities protect AI models and customer data without compromising performance.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">The implications of Blackwell's arrival are profound. 
It promises to accelerate AI research and development across diverse domains, from healthcare to finance to autonomous vehicles. With its unparalleled performance and energy efficiency, Blackwell will democratize AI, empowering organizations of all sizes to harness the transformative potential of AI.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">The response from industry leaders has been overwhelmingly positive. Companies like Amazon, Google, Microsoft, and Meta have already expressed their intention to adopt Blackwell-powered solutions. Sundar Pichai, CEO of Alphabet and Google, emphasized the importance of investing in infrastructure to accelerate AI development, while Andy Jassy, president and CEO of Amazon, highlighted the longstanding partnership between AWS and NVIDIA in pushing the boundaries of AI in the cloud.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">As NVIDIA charts the course for the future of AI with Blackwell, the possibilities seem limitless. From powering trillion-parameter language models to enabling breakthroughs in scientific research, Blackwell represents a new era of computing—one where AI transcends boundaries and transforms industries.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">In the fast-paced world of technology, one thing is clear: with Blackwell, NVIDIA is not just shaping the future; it's defining it. 
And as we embark on this journey of innovation and discovery, one can't help but wonder: what incredible feats will AI achieve next?</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Learn about Vector Databases and Large Language Models (LLMs) <a href="https://youtu.be/AlR_I9--Gwo?si=d4fjQZaKWgzh-ZON" target="_blank">here</a>.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p>FutureXskillshttp://www.blogger.com/profile/03954397681463869119noreply@blogger.com0tag:blogger.com,1999:blog-5300390759647895792.post-56185667966993437492024-03-18T23:59:00.000-07:002024-03-19T00:00:58.294-07:00How to make money on Udemy. Our journey to 50,000 USD revenue. Essential Tips for Instructors<div style="text-align: left;"><span style="font-family: helvetica; font-size: large;"><br /></span></div><div style="text-align: left;"><span style="font-family: helvetica;"><span style="font-size: x-large;">Watch the full video on <a href="https://youtu.be/EtF6g0QGpas" target="_blank">YouTube</a><br /></span><br /></span></div><div style="text-align: left;"><h2 style="text-align: left;"><span style="font-family: helvetica; font-size: large;">Introduction:</span></h2><span style="font-family: helvetica; font-size: large;"><br /></span></div><div style="text-align: left;"><span style="font-family: helvetica; font-size: large;">In recent years, online learning platforms like Udemy have revolutionized education, offering opportunities for both learners and instructors. For aspiring instructors, Udemy provides a platform to share knowledge and generate income. 
In this comprehensive guide, we'll explore proven strategies and essential tips for maximizing your earnings on Udemy, drawing on the lessons we learned on our own journey to 50,000 USD in revenue.<br /><br /></span></div><div style="text-align: left;"><h2 style="text-align: left;"><span style="font-family: helvetica; font-size: large;">Building Your Foundation:</span></h2><span style="font-family: helvetica; font-size: large;"><ul style="text-align: left;"><li><span style="font-family: helvetica; font-size: large;">Establish expertise in a specific area and foster a genuine interest in teaching.</span></li><li><span style="font-family: helvetica; font-size: large;">Conduct thorough research and develop a structured approach to course creation.</span></li></ul><br /></span></div><div style="text-align: left;"><h2 style="text-align: left;"><span style="font-family: helvetica; font-size: large;">Crafting Engaging Content:</span></h2><span style="font-family: helvetica; font-size: large;"><ul style="text-align: left;"><li><span style="font-family: helvetica; font-size: large;">Simplify complex topics to cater to beginners' needs.</span></li><li><span style="font-family: helvetica; font-size: large;">Provide clarity and step-by-step instructions, especially for technical subjects like coding.</span></li><li><span style="font-family: helvetica; font-size: large;">Opt for clear and concise slides, dedicating one point per slide to maintain audience engagement.</span></li></ul><br /></span></div><div style="text-align: left;"><h2 style="text-align: left;"><span style="font-family: helvetica; font-size: large;">Maximizing Visibility:</span></h2><span style="font-family: helvetica; font-size: large;"><ul style="text-align: left;"><li><span style="font-family: helvetica; font-size: large;">Target relevant keywords in course titles and descriptions for improved searchability.</span></li><li><span style="font-family: helvetica; font-size: large;">Leverage social media platforms, collaborate with influencers, and write informative blogs to promote your courses.</span></li><li><span style="font-family: helvetica; font-size: large;">Launch a free introductory course to attract students and capture their contact information for future promotions.</span></li></ul><br /></span></div><div style="text-align: left;"><h2 style="text-align: left;"><span style="font-family: helvetica; font-size: large;">Leveraging Udemy's Revenue Models:</span></h2><span style="font-family: helvetica; font-size: large;"><ul style="text-align: left;"><li><span style="font-family: helvetica; font-size: large;">Understand Udemy's revenue models, including revenue share for direct course sales and Udemy for Business.</span></li><li><span style="font-family: helvetica; font-size: large;">Promote your courses strategically to maximize revenue share and attract corporate clients.</span></li></ul><br /></span></div><div style="text-align: left;"><h2 style="text-align: left;"><span style="font-family: helvetica; font-size: large;">Engaging with Students:</span></h2><span style="font-family: helvetica; font-size: large;"><ul style="text-align: left;"><li><span style="font-family: helvetica; font-size: large;">Continuously update course content and provide supplementary materials to enhance student engagement.</span></li><li><span style="font-family: helvetica; font-size: large;">Encourage word-of-mouth promotion by delivering high-quality content and fostering a positive learning experience.</span></li></ul><br /></span></div><div style="text-align: left;"><h2 style="text-align: left;"><span style="font-family: helvetica; font-size: large;">Conclusion:</span></h2><span style="font-family: helvetica; font-size: large;">Embarking on a journey as a Udemy instructor offers immense potential for financial success and personal fulfillment. By following the strategies outlined in this guide, you can unlock your earning potential and establish yourself as a successful Udemy instructor. Remember, success on Udemy requires dedication, innovation, and a commitment to delivering value to your students. 
Start your journey today and pave the way for a prosperous future in online education.</span></div><div style="text-align: left;"><span style="font-family: helvetica; font-size: large;"><br /></span></div><div style="text-align: left;"><span style="font-family: helvetica; font-size: x-large;">Watch the full video on </span><a href="https://youtu.be/EtF6g0QGpas" style="font-family: helvetica; font-size: x-large;" target="_blank">YouTube</a><br style="font-family: helvetica; font-size: x-large;" /></div>FutureXskillshttp://www.blogger.com/profile/03954397681463869119noreply@blogger.com0tag:blogger.com,1999:blog-5300390759647895792.post-91911725308414313712024-03-18T17:01:00.000-07:002024-03-18T17:07:23.233-07:00Understanding Vector Databases and Large Language Models (LLMs)<p><span style="font-family: helvetica; font-size: large;"></span></p><p><span style="font-family: helvetica; font-size: large;">For a practical demonstration, check out our YouTube video highlighting vector databases in action: Click <a href="https://youtu.be/AlR_I9--Gwo?si=g5brAJindumQ4CnV" target="_blank">here</a></span></p><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgif_QTBVEc8b6-HFjGJnGadn2HDpYnPodUvINQjgXDfqVY40_C1Jq93il6MGxLCPkXHmjJLyl_b78F42RbUyiPqHckRvYKAQOhXqepsw4Ua7lf73WbE_YIuG2vRUnBsUJ5BsQ87ewZOmLfmysRtYxDVTVSPkOwEH8kcpsRU2SifNDFRQnFtHCVfMj11O9y" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: helvetica; font-size: large;"><img data-original-height="1222" data-original-width="2188" height="179" src="https://blogger.googleusercontent.com/img/a/AVvXsEgif_QTBVEc8b6-HFjGJnGadn2HDpYnPodUvINQjgXDfqVY40_C1Jq93il6MGxLCPkXHmjJLyl_b78F42RbUyiPqHckRvYKAQOhXqepsw4Ua7lf73WbE_YIuG2vRUnBsUJ5BsQ87ewZOmLfmysRtYxDVTVSPkOwEH8kcpsRU2SifNDFRQnFtHCVfMj11O9y=w320-h179" width="320" /></span></a></div><span style="font-family: helvetica; font-size: large;"><br /><br /></span></div><p><span 
style="font-family: helvetica; font-size: large;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-family: helvetica; font-size: large;"><br /></span></div><span style="font-family: helvetica; font-size: large;"><br /><br /></span><p></p><p><span style="font-family: helvetica; font-size: large;">In the vast landscape of machine learning and natural language processing (NLP), vectors serve as the fundamental building blocks for representing and understanding data. A vector, in its simplest form, is a one-dimensional container that holds data, typically of the same type, allowing for efficient indexing and retrieval. In the context of NLP, vectors play a crucial role in transforming human language into machine-readable numerical values, paving the way for advanced techniques and models to analyze and generate text.</span></p><p><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi2XS5NEPI1vTX2YRy9La-BQ6Ni6LRh1CJrb4wkoM55poTcLoA_E6GCl1F72EkJtfAzwT2IADJ3iqT04Xx4IGhtY26xOfzRgSZAoochkUIzeVbkhY-i7rHiEH0bw3OQYBR3rHlx-XDExhumy0ZlkM3lxcnKihNqE0qECuDpnf0u10m4uzzp_1ddD27NZwwa" style="margin-left: 1em; margin-right: 1em; text-align: center;"><span style="font-family: helvetica; font-size: large;"><img data-original-height="1292" data-original-width="2436" height="170" src="https://blogger.googleusercontent.com/img/a/AVvXsEi2XS5NEPI1vTX2YRy9La-BQ6Ni6LRh1CJrb4wkoM55poTcLoA_E6GCl1F72EkJtfAzwT2IADJ3iqT04Xx4IGhtY26xOfzRgSZAoochkUIzeVbkhY-i7rHiEH0bw3OQYBR3rHlx-XDExhumy0ZlkM3lxcnKihNqE0qECuDpnf0u10m4uzzp_1ddD27NZwwa=w320-h170" width="320" /></span></a></p><p><span style="font-family: helvetica; font-size: large;">At the heart of this transformation lie techniques like bag of words models and term frequency-inverse document frequency (TF-IDF) models, which create sparse matrices based on the frequency of unique features or words within a corpus. 
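</span></p><p><span style="font-family: helvetica; font-size: large;">To make the sparse-matrix idea concrete, here is a toy bag-of-words sketch in plain Python. The corpus and every name in it are illustrative, not any particular library's API:</span></p>

```python
# Toy bag-of-words model: each document becomes a row of word counts.
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the cat ran"]

# The vocabulary is the sorted set of unique words in the corpus.
vocab = sorted({word for doc in corpus for word in doc.split()})

def bag_of_words(doc):
    """Map a document to a count vector over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bag_of_words(doc) for doc in corpus]
print(vocab)      # ['cat', 'dog', 'ran', 'sat', 'the']
print(matrix[0])  # [1, 0, 0, 1, 1] -> "the cat sat"
```

<p><span style="font-family: helvetica; font-size: large;">Once the vocabulary grows to thousands of words, most entries in each row are zero, which is exactly the sparsity at issue.</span></p><p><span style="font-family: helvetica; font-size: large;">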
While effective, these methods have limitations in capturing nuanced semantic relationships and context due to their sparse nature.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Enter word embedding, a revolutionary technique in NLP that represents words as dense vectors in a high-dimensional space. Unlike sparse matrices, word embeddings encode semantic relationships between words, allowing models to understand similarities and differences more effectively. For instance, in a word embedding model, the vectors for words like "king" and "queen" lie closer together than either does to an unrelated word like "apple," reflecting their semantic similarity.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Taking this concept further, sentence embedding extends word embedding to entire sentences, representing them as fixed-length vectors. This enables models to understand the meaning and context of entire sentences, facilitating tasks like semantic search and document ranking. 
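</span></p><p><span style="font-family: helvetica; font-size: large;">The "king"/"queen" intuition can be illustrated with cosine similarity over made-up three-dimensional vectors. The numbers below are invented for illustration; real embeddings have hundreds of dimensions and come from trained models such as word2vec:</span></p>

```python
# Cosine similarity over toy word vectors. The vectors are invented
# for illustration; a trained embedding model would supply real ones.
import math

vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.12],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(vectors["king"], vectors["queen"]), 3))  # close to 1.0
print(round(cosine(vectors["king"], vectors["apple"]), 3))  # much lower
```

<p><span style="font-family: helvetica; font-size: large;">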
By storing these high-dimensional vector embeddings in specialized databases, known as vector databases, efficient retrieval and manipulation of textual data become possible.</span></p><p><span style="font-family: helvetica; font-size: large;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-family: helvetica; font-size: large;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjB7NG4tbo99y-GkiVjbq2gATCrMnJ0TSpF4g6LGufslDs6Y5fxeViSXzi-oJQnZBX44y0L2PXL6rFvg-UW0_P1bSMYyI1KVec1u1XTB80Bn-LllpEjfWsKRjFtmcAU_ri7cwkxjR6QDE2AouE5o8RyPdd6D6sHJBmthaCryKbiGJT9TiDzpJ7ciukw7x90" style="margin-left: 1em; margin-right: 1em;"><img data-original-height="1372" data-original-width="2510" height="175" src="https://blogger.googleusercontent.com/img/a/AVvXsEjB7NG4tbo99y-GkiVjbq2gATCrMnJ0TSpF4g6LGufslDs6Y5fxeViSXzi-oJQnZBX44y0L2PXL6rFvg-UW0_P1bSMYyI1KVec1u1XTB80Bn-LllpEjfWsKRjFtmcAU_ri7cwkxjR6QDE2AouE5o8RyPdd6D6sHJBmthaCryKbiGJT9TiDzpJ7ciukw7x90=w320-h175" width="320" /></a></span></div><span style="font-family: helvetica; font-size: large;"><br /><br /></span><p></p><p><span style="font-family: helvetica; font-size: large;">Vector databases leverage advanced indexing techniques to map high-dimensional vectors to specific data points, enabling rapid search algorithms for efficient retrieval. This capability is particularly beneficial in the realm of large language models (LLMs), where the ability to efficiently search through vast collections of text data is paramount.</span></p><p><span id="docs-internal-guid-9805a1e7-7fff-f628-a9eb-b9dc39a6d96f" style="font-family: helvetica; font-size: large;"> </span></p><p><span style="font-family: helvetica; font-size: large;">In LLMs, such as GPT-4, input text is processed one word at a time, with the model predicting the next word in the sequence. 
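</span></p><p><span style="font-family: helvetica; font-size: large;">The lookup at the heart of such retrieval can be sketched as a brute-force nearest-neighbor search. The store and query below are hypothetical; real vector databases replace the linear scan with approximate indexes (such as HNSW) to stay fast at scale:</span></p>

```python
# Brute-force nearest-neighbor search over a hypothetical embedding store.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical store mapping item ids to embeddings.
store = {
    "doc_cats":  [0.90, 0.10, 0.00],
    "doc_dogs":  [0.80, 0.30, 0.10],
    "doc_stock": [0.00, 0.20, 0.90],
}

def top_k(query, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(store[item], query), reverse=True)
    return ranked[:k]

print(top_k([0.85, 0.20, 0.05]))  # the two pet documents, not the finance one
```

<p><span style="font-family: helvetica; font-size: large;">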
Vector databases play a crucial role in enhancing the model's capabilities by enabling quick retrieval of similar words or phrases during the prediction process, thereby improving the generation of coherent and contextually relevant text.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Moreover, vector databases contribute to the long-term memory of LLMs by providing a structured framework for storing and accessing information. By organizing data into vectors and employing efficient indexing techniques, these databases allow LLMs to retain and recall previously encountered information, augmenting the model's ability to generate coherent text across different sessions and interactions.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Additionally, vector databases play a vital role in optimizing performance and resource utilization in LLM architectures. By implementing caching mechanisms for frequently accessed vectors, these databases expedite the retrieval process, improving overall performance and response times.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Incorporating vector databases into models and MLOps workflows is essential for ensuring optimal performance, especially at increased scale. This may involve reassessing data pipelines to enable real-time or near-real-time predictions, fraud detection, recommendations, and search results.</span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">In conclusion, vector databases are indispensable tools in the arsenal of large language models and NLP applications. 
By efficiently storing and retrieving high-dimensional vector embeddings, these databases empower models to understand context, retain information, and optimize performance across various tasks and applications. As the field continues to evolve, vector databases will play a central role in unlocking the full potential of natural language understanding and generation. </span></p><p><span style="font-family: helvetica; font-size: large;"><br /></span></p><p><span style="font-family: helvetica; font-size: large;">Vector databases offer a range of advantages and disadvantages in the realm of data management and retrieval:</span></p><p><span style="font-family: helvetica; font-size: large;"><b>Advantages:</b></span></p><ul style="text-align: left;"><li><span style="font-family: helvetica; font-size: large;">Enables semantic search using Approximate Nearest Neighbor (ANN) distance measures.</span></li><li><span style="font-family: helvetica; font-size: large;">Supports bulk data loading for efficient processing of large datasets.</span></li><li><span style="font-family: helvetica; font-size: large;">Utilizes indexing for vectors, enabling semantic searches with low latency.</span></li><li><span style="font-family: helvetica; font-size: large;">Facilitates efficient data retrieval.</span></li><li><span style="font-family: helvetica; font-size: large;">Offers scalability, providing clustering and fault tolerance for redundancy.</span></li></ul><p><span style="font-family: helvetica; font-size: large;"><b>Disadvantages:</b></span></p><ul style="text-align: left;"><li><span style="font-family: helvetica; font-size: large;">Traditional queries such as joins and aggregations are not fully supported.</span></li><li><span style="font-family: helvetica; font-size: large;">Limited availability of built-in functions for data and string manipulation.</span></li><li><span style="font-family: helvetica; font-size: large;">Transactional support may be lacking for high levels of ACID compliance.</span></li><li><span style="font-family: helvetica; font-size: large;">Insert latency may occur when processing large datasets due to index processing.</span></li><li><span style="font-family: helvetica; font-size: large;">Memory-intensive operations are required, as indexes must be reloaded into memory for searching, potentially requiring GPU usage for low latency.</span></li></ul><span style="font-family: helvetica; font-size: x-large;">For a practical demonstration, check out our YouTube video highlighting vector databases in action: Click </span><a href="https://youtu.be/AlR_I9--Gwo?si=g5brAJindumQ4CnV" style="font-family: helvetica; font-size: x-large;" target="_blank">here</a>FutureXskillshttp://www.blogger.com/profile/03954397681463869119noreply@blogger.com0tag:blogger.com,1999:blog-5300390759647895792.post-52835429012269838722020-09-22T04:41:00.000-07:002023-08-06T09:47:51.729-07:00 Apache NiFi Core Concept and ArchitectureSanjeev 
Krishnahttp://www.blogger.com/profile/04552092403672413687noreply@blogger.com0tag:blogger.com,1999:blog-5300390759647895792.post-76911099177220142582020-09-21T12:44:00.015-07:002020-09-26T06:42:59.758-07:00What is Apache NiFi?<p> <span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;">Apache NiFi is open-source software for automating and managing the flow of data between different systems</span>.<span style="font-size: large;"> It provides a web-based UI for creating, monitoring, and controlling data flows. Processors in NiFi are highly configurable and can also transform data at runtime.</span></p><p><span style="font-size: large;"><br /></span></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;">NiFi helps in ingesting data from different source systems into a data lake and from the data lake into other target systems. 
The data lake can be Amazon S3, a Hadoop cluster, or any other storage.<o:p></o:p></span></p><p></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"><b><span style="background: white; color: black; font-family: &quot;Times New Roman&quot;,serif; font-size: 16pt; mso-fareast-font-family: &quot;Times New Roman&quot;;"><o:p> </o:p></span></b></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"><b><span style="background-color: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;">Some of the key benefits of Apache NiFi:</span></b></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"></p><ol><li><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt; text-indent: -0.25in;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt; text-indent: -0.25in;"><b>Guaranteed delivery of data</b>: NiFi offers guaranteed delivery of data with the help of its content repository</span><span style="font-size: 16pt; text-indent: -0.25in;"> and write-ahead log.<br /><br /></span></span></li><li><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;"><b>Visualize your data flow</b>: NiFi helps in building visual data flows, which are easy to understand and develop.<br /><br /><o:p></o:p></span></li><li><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;"><b>Integration with other data processing tools</b>: It can integrate with other data processing tools like Spark and Kafka.<br /><b style="font-size: 16pt;"><br /></b></span></li><li><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;"><b style="font-size: 16pt;">Facilitates a back pressure mechanism</b><span style="font-size: 16pt;">: Queues link two processors and buffer data to make it available to the downstream processor. If for any reason the downstream job does not consume data as fast as it is produced into the queue, the queue can apply back pressure on the upstream processors to throttle incoming data.<br /><b style="font-size: 16pt;"><br /></b></span></span></li><li><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;"><span style="font-size: 16pt;"><b style="font-size: 16pt;">Data flow can be prioritized</b><span style="font-size: 16pt;">: Data in the queue can be prioritized before being fetched by the downstream processor. Priority can be oldest first, newest first, largest first, or some other custom rule.<br /><br /></span></span></span></li><li><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;"><b>Gives an option to decide latency vs. throughput</b>: In some scenarios you may want the lowest latency, i.e., data is processed as soon as it arrives; in others you may want higher throughput and be willing to sacrifice some latency by allowing a one- or two-second delay. We can make these latency-versus-throughput decisions while configuring processors.<br /><br /><o:p></o:p></span></li><li><span style="background-color: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt; text-indent: -0.25in;"><b>Data provenance</b>: It allows us to trace data and its movement through different processors, which helps us troubleshoot and optimize the data flow.<br /><br /></span></li><li><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;">It gives an option to start and stop different data flow components separately.</span></li></ol><div><span style="font-family: Times New Roman, serif;"><span style="font-size: 21.3333px;"><br /></span></span></div><div style="text-indent: -24px;"></div><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;">Apart from these features, NiFi also provides content encryption. NiFi offers secure exchange of data through protocols with encryption such as two-way SSL</span>, <span style="font-size: large;">shared keys, or other mechanisms.</span><p class="MsoNormal"><span style="background: white; font-size: 13pt; letter-spacing: -0.15pt; line-height: 18.5467px;"><o:p><br /></o:p></span></p><p class="MsoNormal"><span style="background: white; font-size: 13pt; letter-spacing: -0.15pt; line-height: 18.5467px;"><o:p><br /></o:p></span></p><p align="center" class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in; mso-outline-level: 2; text-align: center;"><b><span style="color: #00b0f0; font-family: &quot;Times New Roman&quot;,serif; font-size: 24pt; mso-fareast-font-family: &quot;Times New Roman&quot;;">NiFi Setup and Installation<o:p></o:p></span></b></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"><span style="background: white; font-family: &quot;Times New Roman&quot;, serif; font-size: 16pt;">NiFi is typically configured on an edge node. 
However, it is not mandatory to set it up on any particular node; it can be configured on any node. You just need to provide the location of the Hadoop configuration files in order to work with HDFS and other Hadoop-based components. For high availability it can be configured on multiple nodes as well.</span></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"><span style="background: white; font-family: "Times New Roman", serif; font-size: 16pt;"><br /></span></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"><span style="background: white; font-family: "Times New Roman", serif; font-size: 16pt;">In order to work with HDFS-related processors in NiFi we need a running Hadoop cluster. In the NiFi processor config we need to pass the hdfs-site.xml and core-site.xml file paths from the Hadoop installation.</span></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;">To work on NiFi integration with Spark or Kafka, we first need to set up a Hadoop cluster and then install NiFi, or we can install NiFi in an existing Hadoop cluster and integrate it with the existing tools. </span></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in;"><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;"><br /></span></p><p class="MsoNormal" style="line-height: 24pt; margin-bottom: 0in; text-align: center;"><span style="background: white; font-family: "Times New Roman", serif; font-size: 16pt;"><b><span style="color: #2fdeea;">Installation of NiFi on GCP DataProc or Amazon EMR cluster </span></b><o:p></o:p></span></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;">GCP DataProc and Amazon EMR come with Hadoop, Spark and other tools preinstalled. 
We can leverage these clusters and install NiFi on them.</span></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;">We first need to create and launch a DataProc cluster with any number of data nodes, based on your data processing requirements.</span></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;"><br /></span></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;"><i><b>Steps to Install NiFi on a GCP DataProc or Amazon EMR Cluster:-</b></i></span></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;">1. Log in to the master node through SSH and download the NiFi tar.gz using the wget command from the Apache NiFi download page </span></p><p><a href="https://nifi.apache.org/download.html" target="_blank"><span style="font-family: Times New Roman, serif;"><span style="font-size: 21.3333px;">https://nifi.apache.org/download.html</span></span><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;"> </span></a></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhC8zUPYAarJ077mW4sI_SCmjScQQ1ZTkEcNevoYKhESH3TgY9JL0G6YE5w1Zpb7x6nILG0OuStPUDWnAZhGLQGLWja1kyet7cZjOXsatqJhqwygLKuFjbdLUB1DClSdkrc6oFgnsvf3QZW/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="499" data-original-width="1019" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhC8zUPYAarJ077mW4sI_SCmjScQQ1ZTkEcNevoYKhESH3TgY9JL0G6YE5w1Zpb7x6nILG0OuStPUDWnAZhGLQGLWja1kyet7cZjOXsatqJhqwygLKuFjbdLUB1DClSdkrc6oFgnsvf3QZW/w619-h314/image.png" width="619" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p></p><p class="MsoNormal" 
style="background: white; line-height: normal; margin-left: .25in; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; letter-spacing: -0.1pt;">command to download the tar file:-</span></p><p class="MsoNormal" style="background: white; line-height: normal; margin-left: .25in; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; letter-spacing: -0.1pt;">
wget </span><span style="box-sizing: border-box; outline-offset: -2px; outline: -webkit-focus-ring-color auto 5px;">http://apachemirror.wuchna.com/nifi/1.12.0/nifi-1.12.0-bin.tar.gz</span><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; letter-spacing: -0.1pt;"><o:p></o:p></span></p><p class="MsoNormal" style="background: white; line-height: normal; margin-left: .25in; mso-margin-bottom-alt: auto; mso-margin-top-alt: auto;"><span style="box-sizing: border-box; outline-offset: -2px; outline: -webkit-focus-ring-color auto 5px;"><br /></span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPBVyKbknX8wndR53sHalLGheICVYh0LnbkZQVfI1ErFHKodBZXxaQgncDX-a3SEQKOrLzAGaHj40xU1cnbln4s1uqnppbU4D3x99OeS45xHyb1xItGfakaG0irGh-5j-7QknuHsmXk-E4/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="276" data-original-width="902" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPBVyKbknX8wndR53sHalLGheICVYh0LnbkZQVfI1ErFHKodBZXxaQgncDX-a3SEQKOrLzAGaHj40xU1cnbln4s1uqnppbU4D3x99OeS45xHyb1xItGfakaG0irGh-5j-7QknuHsmXk-E4/w640-h196/image.png" width="640" /></a></div><br /><br /><p></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;">2. 
Extract the archive using the tar xzf command.</span><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;"></span></p><div class="separator" style="clear: both; text-align: center;"><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9ff90RPkppPc57uLXoK_91TTAsQ-rptEHVt52jBDay8d_ZCJT8wolMsOhqYEx7SF5A3n7KofEmG__LL_nC3gr__qLbiW3Dp7M2PaTKb_utq4FQwGyhB47ZGrI9uRFSuaDVle7R2F205r0/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="22" data-original-width="915" height="16" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9ff90RPkppPc57uLXoK_91TTAsQ-rptEHVt52jBDay8d_ZCJT8wolMsOhqYEx7SF5A3n7KofEmG__LL_nC3gr__qLbiW3Dp7M2PaTKb_utq4FQwGyhB47ZGrI9uRFSuaDVle7R2F205r0/w618-h16/image.png" width="618" /></a></span></div><p></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;">3. Update the bash profile and add the NiFi path using the following commands </span></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;"><span> a. </span>vi ~/.bash_profile</span></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;"><span> b. 
A</span>dd the following lines as shown in the screenshot</span></p><p><span style="font-family: Times New Roman, serif;"><span style="font-size: 21.3333px;"><span> </span><span> </span>export NIFI_HOME=/home/futurexskill7/nifi-1.12.0/</span></span></p><p><span style="font-size: 21.3333px;"><span style="background-color: white; font-family: Times New Roman, serif;"></span></span></p><p><span style="font-family: Times New Roman, serif;"><span style="font-size: 21.3333px;"><span> </span><span> </span>export PATH=$PATH:$NIFI_HOME/bin</span></span></p><p><span style="background-color: white; font-family: Times New Roman, serif; font-size: 21.3333px;"><span> c. </span>source ~/.bash_profile</span></p><p><span style="font-family: Times New Roman, serif;"><span style="background-color: white; font-size: 21.3333px;"><span> d</span>. Verify the new NiFi path is set by running the "echo $PATH" command </span></span></p><p><span style="font-family: Times New Roman, serif;"><span style="background-color: white; font-size: 21.3333px;"><br /></span></span></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;">4. Run the command "nifi.sh start"</span></p><p><span style="background-color: white;"><span style="font-family: Times New Roman, serif;"><span style="font-size: 21.3333px;">5. Check whether NiFi is running by running the command "nifi.sh status"</span></span></span></p><p><span style="font-family: Times New Roman, serif;"><span style="background-color: white; font-size: 21.3333px;">6. 
Once you start NiFi, a logs folder will be created containing the log file.</span></span></p><p><span style="font-family: Times New Roman, serif;"><span style="background-color: white; font-size: 21.3333px;">You can check the log file here:-</span></span></p><p><span style="background-color: white; font-size: 21.3333px;"><span style="font-family: Times New Roman, serif;"><span> </span>/nifi-1.12.0/logs/</span></span><span style="font-family: Times New Roman, serif;"><span style="font-size: 21.3333px;">nifi-app.log</span></span></p><p><span style="font-family: Times New Roman, serif;"><span style="font-size: 21.3333px;"></span></span></p><div class="separator" style="clear: both; text-align: center;"><span style="font-family: Times New Roman, serif;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwcbhL1yEy0fj6gDdYLSV03c6q5r56gDj4vKzhQFzH5IYBLDMzIsAP0QvEJzFMye5D6vb9L4kSETCe_l7qd-iuPu8kd2wcvwbHFdgfLI2wOj6taoUVJyiKOz-PvBpJbLkaE_g1HrCqIUuk/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="76" data-original-width="872" height="56" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwcbhL1yEy0fj6gDdYLSV03c6q5r56gDj4vKzhQFzH5IYBLDMzIsAP0QvEJzFMye5D6vb9L4kSETCe_l7qd-iuPu8kd2wcvwbHFdgfLI2wOj6taoUVJyiKOz-PvBpJbLkaE_g1HrCqIUuk/w640-h56/image.png" width="640" /></a></span></div><span style="font-family: Times New Roman, serif;"><br /><br /></span><p></p><p><span style="background-color: white; font-family: "Times New Roman", serif; font-size: 21.3333px;">NiFi provides a web UI which runs on port 8080. 
In order to access the web UI from outside the cluster or from your local machine, we need to open the port in the firewall rules for the GCP or AWS instance where NiFi is installed.</span></p><p><br /></p><div style="text-align: right;"><span style="font-size: 21.3333px;"><br /></span></div>FutureXskillshttp://www.blogger.com/profile/15942079652867152508noreply@blogger.com1tag:blogger.com,1999:blog-5300390759647895792.post-34468498033875379452020-09-02T00:27:00.007-07:002020-09-02T02:37:48.259-07:00What is an RDD and Why Spark needs it?<p> </p><p><br /></p><div style="line-height: 24pt; margin: 0in;"><span style="background: white; color: black; font-family: "century" , serif; font-size: 16pt;"><span style="font-size: 21.3333px;">Resilient Distributed Dataset (</span>RDD) is the core of Apache Spark. It is the fundamental data structure on top of which all the Spark components reside. It can also be understood as a distributed collection of records which resides in memory*. In a Spark cluster multiple nodes work together on a job; each node works on some portion of the data. 
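The partition-per-node idea can be sketched in plain Python (an analogy only, not Spark code): records are split into partitions, each "node" processes its own partition, and a final step combines the partial results.

```python
def partition(records, num_nodes):
    """Distribute records round-robin across num_nodes partitions (toy model)."""
    parts = [[] for _ in range(num_nodes)]
    for i, rec in enumerate(records):
        parts[i % num_nodes].append(rec)
    return parts

parts = partition(list(range(10)), 3)
# Each "node" computes a partial sum over its own partition...
partials = [sum(p) for p in parts]
# ...and the partial results are combined, like a reduce step.
total = sum(partials)
print(parts, total)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]] 45
```

In real Spark the partitions live on different machines and the combine step may involve a network shuffle, but the division of work is the same.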
For computation, the distributed chunks of a dataset, which primarily reside in HDFS or another distributed storage, move to the RAM* of each node for a short time; this distributed data at that point is collectively known as an RDD.</span><span style="color: black; font-family: "century" , serif; font-size: 13.5pt;"><o:p></o:p></span></div><o:p></o:p><br /><div style="line-height: 24pt; margin: 0in;"><span style="font-size: medium;"><i><span face="" style="background-color: white; color: #292929; letter-spacing: -0.0666667px;">Click here to checkout our Udemy course<b> </b></span><span face="" style="background-color: #fcff01; color: #292929; font-weight: bold; letter-spacing: -0.0666667px;"><a href="https://www.udemy.com/course/spark-scala-coding-best-practices-data-pipeline/?referralCode=DBA026944F73C2D356CF" rel="nofollow" style="color: black; text-decoration-line: none;" target="_blank">Spark Scala Coding Framework and BestPractices</a> </span></i></span></div><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><span style="background: white; color: black; font-family: "century" , serif; font-size: 16pt;">This is similar to the class and object concept: while writing code you create an object of a class in a text file, but the object is actually materialized when the code is executed and occupies some heap memory in the execution engine.</span><span style="color: black; font-family: "century" , serif; font-size: 13.5pt;"><o:p></o:p></span></div><o:p></o:p><br /><div style="line-height: 24pt; margin: 0in;"><br /></div><h2 style="line-height: 24pt; margin: 0in 0in 0.0001pt; text-align: center;"><b style="text-indent: -0.25in;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;">When is an RDD Materialized?</span></span></b></h2><div class="MsoNormal" style="line-height: normal; margin: 0in 0in 0in 0.5in; mso-outline-level: 3; text-indent: -0.25in;"><b><span style="color: #00b0f0; 
font-family: "georgia" , serif; font-size: 24pt;"><br /></span></b></div><div style="line-height: 24pt; margin: 0in 0in 0.0001pt;"><span style="background: white; color: black; font-family: "century" , serif; font-size: 16pt;"><span>The above process where RDDs gets materialized, happens only when an <i>"</i></span><i><b>Action</b>"</i> <span>is called on RDD. You can keep on deriving one RDD from another through "</span><b style="mso-bidi-font-weight: normal;"><i>Transformation</i>"</b> <span>but Spark won’t materialize the RDD (i.e. data won’t be fetched into RAM). For all the Transaction on an RDD a graph will be created, by which Spark keeps all the information of RDD dependency and transformation operation to be applied on it to create a new RDD. This graph is called DAG.</span></span></div><o:p></o:p><br /><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><span style="background: white; color: black; font-family: "century" , serif; font-size: 16pt;">Spark keeps on adding the transformation and resulting RDD information into DAG until it finds an action call on any subsequent RDD.<o:p></o:p></span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFdBS4VP9kG_YEPq_2CqUcTg50KFoPzNXHH_owQ08O1KgQDBRAlETrdwkAo8seijlnJYbpnSLHG844FT0Bw80ty8E3LNMle3ZOqWLpXgymACgCywomXiDEylcBKyJmvoTsRRsVypqb5eY/s1600/Apache+Spark+DAG+technologyintrend.com.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="772" data-original-width="373" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFdBS4VP9kG_YEPq_2CqUcTg50KFoPzNXHH_owQ08O1KgQDBRAlETrdwkAo8seijlnJYbpnSLHG844FT0Bw80ty8E3LNMle3ZOqWLpXgymACgCywomXiDEylcBKyJmvoTsRRsVypqb5eY/s640/Apache+Spark+DAG+technologyintrend.com.jpg" width="308" /></a></div><h4 style="line-height: 24pt; margin: 0in; text-align: center;"><span style="background: 
white; color: black; font-family: "georgia" , serif;"><span style="font-size: xx-small;">Image: - Apache Spark DAG</span></span></h4><div style="line-height: 24pt; margin: 0in;"><span style="background: white; color: black; font-family: "century" , serif; font-size: 16pt;">Once an action is found on an RDD, the DAG is submitted to the DAG scheduler, which further divides the job into multiple stages and executes the DAG to populate the data into the RDDs and apply the predefined transformations.</span><span style="color: black; font-family: "century" , serif; font-size: 16pt;"><o:p></o:p></span></div><o:p></o:p><br /><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><br /></div><h2 style="line-height: 24pt; margin: 0in 0in 0.0001pt; text-align: center;"><b style="mso-bidi-font-weight: normal;"><span style="background: white; color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;">Why is an RDD materialized just for a short time?</span></span></b></h2><o:p></o:p><br /><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><span style="background: white; color: black; font-family: "century" , serif; font-size: 16pt;">Since multiple jobs can run in Spark at the same time, it is not efficient to always keep a materialized RDD in memory. For that reason Spark uses a concept called lazy evaluation, which means that until an action is called on an RDD, it won't get materialized. Once an action is called upon an RDD, it will be materialized as per the transformations defined in the DAG. 
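This build-up of deferred transformations can be mimicked in plain Python with generators (an analogy, not Spark code): defining the "transformation" does no work, and the data is only touched when the result is actually consumed, which plays the role of the action.

```python
touched = []          # records which elements were actually processed

def trace(x):
    touched.append(x)
    return x

data = range(3)
# "Transformation": builds a lazy pipeline, nothing is computed yet.
doubled = (trace(v) * 2 for v in data)
assert touched == []  # lazy: no element has been touched so far

# "Action": materializes the pipeline and pulls data through it.
result = list(doubled)
print(result, touched)  # [0, 2, 4] [0, 1, 2]
```

Like an RDD, the generator here is just a description of work; only the final `list(...)` call (the "action") forces the elements to flow through it.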
Once materialized and the <b>Action </b>is performed, the RDDs are flushed from memory.</span><span style="color: black; font-family: "century" , serif; font-size: 16pt;"><o:p></o:p></span></div><o:p></o:p><br /><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><span style="background: white; color: black; font-family: "century" , serif; font-size: 16pt;">The next time you call an action on the same RDD, the DAG will get executed again. If any RDD is referenced multiple times in the DAG, or is computed multiple times, you can cache or persist those RDDs; this avoids re-computation of the same RDD.</span><span style="color: black; font-family: "century" , serif; font-size: 16pt;"><o:p></o:p></span></div><o:p></o:p><br /><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><br /></div><h2 style="line-height: 24pt; margin: 0in 0in 0.0001pt; text-align: center;"><b style="mso-bidi-font-weight: normal;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;">What is the difference in the way Spark and Map Reduce process the data?</span></span></b></h2><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><span style="color: black; font-family: "century" , serif; font-size: 16pt;">Though the underlying concept of mapping and reducing the data is the same in Spark and Map Reduce, there are multiple differences in the way MR and Spark handle the data. 
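The cache/persist idea described above, compute once and reuse many times, can be illustrated with memoization in plain Python (an analogy, not the Spark API):

```python
from functools import lru_cache

computations = []  # tracks how many times the "RDD" is actually computed

@lru_cache(maxsize=None)
def materialize(x):
    computations.append(x)   # stands in for an expensive recomputation
    return x * x

materialize(4)   # first "action": the value is computed
materialize(4)   # second "action": served from cache, no recomputation
print(materialize(4), len(computations))  # 16 1
```

Without the cache decorator, every call would recompute the value, just as an uncached RDD is recomputed on every action.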
The key difference which makes Spark faster is that it doesn’t store the intermediate results of stages on the hard disk; rather, Spark keeps them in memory.<o:p></o:p></span><br /><span style="color: black; font-family: "century" , serif; font-size: 16pt;"><br /></span></div><h3 style="line-height: 24pt; margin: 0in 0in 0.0001pt;"><span style="font-weight: normal;"><span style="color: black; font-family: "century" , serif; font-size: 16pt;"><o:p> </o:p></span><span style="color: black; font-family: "century" , serif; font-size: 16pt;">Map Reduce Vs Spark Way of Processing data: -</span></span></h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8RnruGLHx9__3c77CFLIzIJot5AhO1RqvyYuYdhnzn_DDJ1FDYh2JNGg20Zw7urLqOjj00fNrz4tYEQPUN6Zb6mvVkezW8nakzDr-bf9t7TyC_NHYSlXnrEWXcDR96jvE2BFDP37Af5g/s1600/Map+Reduce+Vs+Spark+Processing.JPG" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="670" data-original-width="1280" height="334" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8RnruGLHx9__3c77CFLIzIJot5AhO1RqvyYuYdhnzn_DDJ1FDYh2JNGg20Zw7urLqOjj00fNrz4tYEQPUN6Zb6mvVkezW8nakzDr-bf9t7TyC_NHYSlXnrEWXcDR96jvE2BFDP37Af5g/s640/Map+Reduce+Vs+Spark+Processing.JPG" width="640" /></a></div><h4 style="line-height: 24pt; margin: 0in; text-align: center;"><span style="color: black; font-family: "georgia" , serif;"><span style="font-size: xx-small;">Image:- Map Reduce Vs Spark Way of Processing data</span></span></h4><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><span style="color: black; font-family: "century" , serif; font-size: 16pt;">Unlike MR, Spark keeps the intermediate result in memory, where it acts as the input for the next step. 
However, if at any point the available memory in the cluster is less than the memory required to hold the resulting RDD or DataFrame, the data is spilled over and written to disk. So RDD data can reside both in RAM and on the hard disk.</span></div><div style="line-height: 24pt; margin: 0in;"><br /></div><div style="line-height: 24pt; margin: 0in;"><span style="color: black; font-family: "century" , serif; font-size: 16pt;">Now, as per the definition, RDD stands for Resilient Distributed Dataset. Each term has a meaning, defined below: -<o:p></o:p></span></div><div style="line-height: 24pt; margin: 0in;"><br /></div><h3 style="line-height: 24pt; margin: 0in 0in 0.0001pt;"><b style="mso-bidi-font-weight: normal;"><i style="mso-bidi-font-style: normal;"><span style="color: black; font-family: "century" , serif; font-size: 16pt;">Resilient</span></i></b><span style="color: black; font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">Resilient means “able to recover quickly”. RDDs are resilient because they can recover if any of their partitions are lost. 
An RDD can be recomputed if lost, based on the lineage graph called the DAG</span>.<o:p></o:p></span></h3><div style="line-height: 24pt; margin: 0in;"><br /></div><h3 style="line-height: 24pt; margin: 0in 0in 0.0001pt;"><b style="mso-bidi-font-weight: normal;"><i style="mso-bidi-font-style: normal;"><span style="color: black; font-family: "century" , serif; font-size: 16pt;">Distributed</span></i></b><span style="color: black; font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">As mentioned in the beginning, data resides in the memory of multiple nodes in a distributed manner when the RDD is materialized.</span><o:p></o:p></span></h3><div style="line-height: 24pt; margin: 0in;"><br /></div><h3 style="line-height: 24pt; margin: 0in 0in 0.0001pt;"><b style="mso-bidi-font-weight: normal;"><i style="mso-bidi-font-style: normal;"><span style="color: black; font-family: "century" , serif; font-size: 16pt;">Dataset</span></i></b><span style="color: black; font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">An RDD is a collection of the distributed datasets</span>.<o:p></o:p></span></h3><div style="line-height: 24pt; margin: 0in;"><br /><br /></div><div style="line-height: 24pt; margin: 0in;"><h2 style="text-align: center;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;">Features of an RDD</span><span style="font-size: 24pt;"><o:p></o:p></span></span></h2><div><span style="color: #00b0f0; font-family: "georgia" , serif; font-size: 24pt;"><br /></span></div></div><div style="margin: 0in;"><span style="background: white; font-family: "century" , serif; font-size: 16pt;">An RDD has the following key features: -<o:p></o:p></span></div><h3 style="margin: 0in 0in 0.0001pt 0.5in; text-indent: -0.25in;"><span style="font-family: "century" , serif; font-size: 16pt;">1.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> 
</span></span><!--[endif]--><b><span style="background: white; font-family: "century" , serif; font-size: 16pt;">Fault Tolerance</span></b><span style="background: white; font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">An RDD can recover easily if it is lost, or if any of its partitions is lost, based on the DAG</span>.</span></h3><div style="margin: 0in 0in 0in 0.5in;"><br /></div><h3 style="margin: 0in 0in 0.0001pt 0.5in; text-indent: -0.25in;"><span style="font-family: "century" , serif; font-size: 16pt;">2.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> </span></span><!--[endif]--><b><span style="background: white; font-family: "century" , serif; font-size: 16pt;">Immutable</span></b><span style="background: white; font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">Once an RDD is created it cannot be modified. This makes RDDs consistent and safe to access across multiple nodes. 
If you need to modify it, you have to create a new RDD from the existing one.</span></span></h3><div class="MsoListParagraph"><br /></div><div style="margin: 0in 0in 0in 0.5in;"><br /></div><h3 style="margin: 0in 0in 0.0001pt 0.5in; text-indent: -0.25in;"><span style="font-family: "century" , serif; font-size: 16pt;">3.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> </span></span><!--[endif]--><b><span style="background: white; font-family: "century" , serif; font-size: 16pt;">Lazy Evaluation</span></b><span style="background: white; font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">An RDD is materialized only when an action is called; otherwise Spark keeps adding the transformation and the resulting RDD information into a lineage graph called the DAG.</span></span></h3><div style="margin: 0in 0in 0in 0.5in;"><br /></div><h3 style="margin: 0in 0in 0.0001pt 0.5in; text-indent: -0.25in;"><span style="font-family: "century" , serif; font-size: 16pt;">4.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> </span></span><!--[endif]--><b><span style="font-family: "century" , serif; font-size: 16pt;">In Memory Processing</span></b><span style="font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">Data is processed in memory. Intermediate results of each stage are stored in memory until there is a memory shortage and spill-over happens. 
In case of spill-over, data is written to disk.</span><o:p></o:p></span></h3><div style="margin: 0in 0in 0in 0.5in;"><br /></div><h3 style="margin: 0in 0in 0.0001pt 0.5in; text-indent: -0.25in;"><span style="font-family: "century" , serif; font-size: 16pt;">5.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> </span></span><!--[endif]--><b><span style="font-family: "century" , serif; font-size: 16pt;">Partitioned</span></b><span style="font-family: "century" , serif;"><span style="font-size: 16pt;">: - <span style="font-weight: normal;">Data is logically partitioned inside an RDD to achieve parallel processing. R</span></span><span style="font-size: 21.3333px; font-weight: normal;">e-partitioning of</span><span style="font-size: 16pt;"><span style="font-weight: normal;"> an RDD can also be done to tune performance.</span><o:p></o:p></span></span></h3><div style="margin: 0in 0in 0in 0.5in;"><br /></div><h3 style="margin: 0in 0in 0.0001pt 0.5in; text-indent: -0.25in;"><span style="font-family: "century" , serif; font-size: 16pt;">6.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> </span></span><!--[endif]--><b><span style="font-family: "century" , serif; font-size: 16pt;">Location Stickiness</span></b><span style="font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">While materializing the RDD, the DAG scheduler places each RDD partition on the node which is closest to the data. This means in most cases a node will work on the portion of data which is present on it. 
This reduces movement of data over the network and shuffling of data.</span><o:p></o:p></span></h3><div class="MsoListParagraph"><br /></div><div style="margin: 0in 0in 0in 0.5in;"><br /></div><h3 style="margin: 0in 0in 0.0001pt 0.5in; text-indent: -0.25in;"><span style="font-family: "century" , serif; font-size: 16pt;">7.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> </span></span><!--[endif]--><b><span style="font-family: "century" , serif; font-size: 16pt;">Persistence</span></b><span style="font-family: "century" , serif; font-size: 16pt;">: - <span style="font-weight: normal;">You can persist an RDD: if the same RDD is used multiple times, then to avoid re-computation you can save the RDD in cache or on hard disk.</span><o:p></o:p></span></h3><div style="margin: 0in;"></div><div class="MsoNormal"><br /><h2 style="text-align: center;"><span style="color: #00b0f0; font-family: "georgia" , serif; font-size: 24pt;">How to create an RDD?<o:p></o:p></span></h2><h2><span style="font-family: "century" , serif; font-size: 16pt; font-weight: normal;">Following are the three ways to create an RDD: -<o:p></o:p></span></h2><h2 style="margin-left: 0.5in; mso-list: l0 level1 lfo1; text-indent: -0.25in;"><!--[if !supportLists]--><span style="font-family: "century" , serif; font-size: 16pt; font-weight: normal;">1.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> </span></span><!--[endif]--><span style="font-family: "century" , serif; font-size: 16pt; font-weight: normal;">Load a dataset which is present in a file, in a table, or in any other external storage.</span> </h2><h2 style="margin-left: 0.5in; mso-list: l0 level1 lfo1; text-indent: -0.25in;"><!--[if !supportLists]--><span style="font-family: "century" , serif; font-size: 16pt; font-weight: normal;">2.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> 
</span></span><!--[endif]--><span style="font-family: "century" , serif; font-size: 16pt; font-weight: normal;">Parallelize a collection: - You can pass a list or collection to parallelize() and get an RDD<o:p></o:p></span></h2><h2 style="margin-left: 0.5in; mso-list: l0 level1 lfo1; text-indent: -0.25in;"><!--[if !supportLists]--><span style="font-family: "century" , serif; font-size: 16pt; font-weight: normal;">3.<span style="font-family: "times new roman"; font-size: 7pt; font-stretch: normal; line-height: normal;"> </span></span><!--[endif]--><span style="font-family: "century" , serif; font-size: 16pt; font-weight: normal;">Transform an existing RDD into a new one.<o:p></o:p></span></h2><br /><h2><span style="font-family: "century" , serif; font-size: 16pt; font-weight: normal;">All three ways to create an RDD will be explained in detail in the next post.<o:p></o:p></span></h2><br /><span style="background-color: white; color: #292929; font-size: large; letter-spacing: -0.0666667px;"><i>Click here to checkout our Udemy course</i></span><span style="background-color: white; color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"> </span><span style="background-color: #fcff01; color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"><i><a href="https://www.udemy.com/course/spark-scala-coding-best-practices-data-pipeline/?referralCode=DBA026944F73C2D356CF" rel="nofollow" target="_blank">Spark Scala Coding Framework and BestPractices</a> </i></span></div><div class="MsoNormal"><span style="background-color: #fcff01; color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"><br /></span></div><div class="MsoNormal"><span style="background-color: #fcff01; color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"><br /></span></div><div class="MsoNormal"><span style="background-color: #fcff01; color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"><br 
/></span></div>FutureXskillshttp://www.blogger.com/profile/15942079652867152508noreply@blogger.com0tag:blogger.com,1999:blog-5300390759647895792.post-59256320113782744082020-09-01T10:49:00.002-07:002020-09-02T02:36:53.838-07:00Deployment modes and Job submission in Apache Spark<p> </p><div class="MsoNormal" style="background-color: white; margin: 0px; outline: 0px; padding: 0px; transition: all 0.2s ease 0s;"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0.0001pt;"><h3><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; margin-bottom: 0.0001pt; text-align: justify;"><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; margin-bottom: 0.0001pt;"><div style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; margin: 0in 0in 0.0001pt;"><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif; font-size: 16pt; font-weight: normal;"><br /></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: 
initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif; font-size: 16pt; font-weight: normal;">Spark is a scheduling, monitoring, and distribution engine; it can also act as a resource manager for its jobs. When Spark runs jobs by itself using its own cluster manager, i</span><span style="color: #2e2e2e; font-family: georgia, serif; font-size: 21.3333px; font-weight: 400;">t is called Standalone mode</span><span style="color: #2e2e2e; font-family: georgia, serif; font-size: 16pt; font-weight: normal;">; it can also run its jobs on top of other cluster/resource managers like Mesos or YARN. </span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif; font-size: 16pt; font-weight: normal;"><br /></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif; font-size: 16pt; font-weight: normal;"><i style="color: black; font-family: "Times New Roman"; font-size: large; text-align: left;"><span style="color: #292929; font-family: Domine, Arial, Helvetica, sans-serif; letter-spacing: -0.0666667px;">Click here to check out our Udemy course to learn<b> </b></span><span style="background-color: #fcff01; color: #292929; font-family: Domine, Arial, Helvetica, sans-serif; font-weight: bold; letter-spacing: -0.0666667px;"><a 
href="https://www.udemy.com/course/spark-scala-coding-best-practices-data-pipeline/?referralCode=DBA026944F73C2D356CF" rel="nofollow" style="color: black; text-decoration-line: none;" target="_blank">Spark Scala Coding Framework and BestPractices</a> </span></i></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><div style="text-align: left;"><span style="color: #292929; font-family: Domine, Arial, Helvetica, sans-serif; font-size: medium;"><span style="letter-spacing: -0.0666667px;"><i><br /></i></span></span></div><span style="color: #2e2e2e; font-family: "georgia" , serif; font-size: 16pt; font-weight: normal;">Submitting a job to Spark can be done in various ways. In addition to the cluster and client modes of execution, there is also a local mode for submitting a Spark job. Before we start running our job, we must understand these modes of execution.<o:p></o:p></span></div><br /><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif; font-size: 16pt;"><span style="font-weight: normal;"><br /></span></span></div></div></div></div></div></h3><h3 style="line-height: 24pt; margin: 0in 0in 0.0001pt; text-align: center;"><b style="text-indent: -0.25in;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;"><i>How Spark supports different Cluster Managers?</i></span></span></b></h3><div><b style="text-indent: -0.25in;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;"><br /></span></span></b></div><div class="MsoNormal" 
style="color: #2e2e2e; line-height: normal;"><div style="text-align: justify;"><span style="font-family: "georgia" , serif; font-size: 16pt;">The SparkContext object, created in the driver program, coordinates a Spark application. It can connect to several types of cluster managers, enabling Spark to run on top of cluster manager frameworks like YARN or Mesos. SparkContext coordinates the independent sets of processes that execute in parallel across the cluster. <o:p></o:p></span></div></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="color: #2e2e2e; margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td><img alt="Spark cluster components, Spark Driver and Workers, Spark Deployment modes, Spark Tutorials" src="https://spark.apache.org/docs/latest/img/cluster-overview.png" style="margin-left: auto; margin-right: auto;" title="Spark cluster components" /></td></tr><tr><td class="tr-caption" style="font-size: 12.8px;">source: <a href="https://spark.apache.org/docs/latest/cluster-overview.html">https://spark.apache.org/docs/latest/cluster-overview.html</a><span style="font-family: "georgia" , serif; font-size: 18pt; text-align: left;"> </span><br /><span style="font-family: "georgia" , serif; font-size: 18pt; text-align: left;"><br /></span></td></tr></tbody></table><div class="MsoNormal" style="color: #2e2e2e; line-height: normal; margin-bottom: 0.0001pt;"><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; margin-bottom: 0.0001pt;"><span style="font-family: "georgia" , serif; font-size: 16pt;">A Spark job can be launched in three different ways: -</span><span style="font-family: "times new roman" , serif; font-size: 13.5pt;"><o:p></o:p></span></div><div class="MsoNormal" style="background-attachment: initial; 
background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; margin-bottom: 0.0001pt; text-indent: -0.25in;"><span style="font-family: "georgia" , serif; font-size: 16pt;"> 1.</span><span style="font-family: "times new roman" , serif; font-size: 7pt;"> </span><span style="font-family: "georgia" , serif; font-size: 16pt;">Local (also known as pseudo-cluster mode)</span><span style="font-family: "times new roman" , serif; font-size: 13.5pt;"><o:p></o:p></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; margin-bottom: 0.0001pt; text-indent: -0.25in;"><span style="font-family: "georgia" , serif; font-size: 16pt;"> 2.</span><span style="font-family: "times new roman" , serif; font-size: 7pt;"> </span><span style="font-family: "georgia" , serif; font-size: 16pt;">Standalone (Cluster with Spark default Cluster manager)</span><span style="font-family: "times new roman" , serif; font-size: 13.5pt;"><o:p></o:p></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; margin-bottom: 0.0001pt; text-indent: -0.25in;"><span style="font-family: "georgia" , serif; font-size: 16pt;"> 3.</span><span style="font-family: "times new roman" , serif; font-size: 7pt;"> </span><span style="font-family: "georgia" 
, serif; font-size: 16pt;">On top of other Cluster Manager (Cluster with Yarn, Mesos or Kubernetes as Cluster Manager)</span><span style="font-family: "times new roman" , serif; font-size: 13.5pt;"><o:p></o:p></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><br /></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;"><br /></span></div><h2 style="line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">Local:-</span></h2><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">Local mode is a pseudo-cluster mode generally used for testing and demonstration. 
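For illustration, a minimal local-mode launch might look like this (a sketch; the script name my_app.py and the thread count are hypothetical, and a Spark installation is assumed to be on the PATH):

```shell
# Run the job locally with 4 worker threads; the driver and the
# executors all live inside a single process on this machine.
spark-submit --master local[4] my_app.py

# local[*] would instead use as many threads as the machine has cores.
```
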
In this mode, all Spark components run on just one single node.<o:p></o:p></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><br /></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;"><br /></span></div><h2 style="line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">Standalone: - </span></h2><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">In Standalone mode, the Spark cluster manager, i.e. the default cluster manager provided in the Apache Spark distribution, is used for resource and cluster management of Spark jobs. 
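As a sketch of how such a cluster is brought up (the host name is a placeholder; recent Spark releases ship these scripts in the sbin directory, and older releases name the worker script start-slave.sh):

```shell
# On the designated master node: start the standalone Master.
# It logs a URL of the form spark://<host>:7077 for workers to join.
$SPARK_HOME/sbin/start-master.sh

# On every worker node: start a Worker and register it with the Master.
$SPARK_HOME/sbin/start-worker.sh spark://master-host:7077
```
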
It has a standalone Master for resource management and standalone Workers for the tasks.<o:p></o:p></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><br /></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">Please don't get confused here: </span><span style="font-family: "georgia" , serif; font-size: 16pt;"><i>Standalone mode doesn't mean a single-node Spark deployment.</i> It is also a cluster deployment of Spark; what we need to understand is that in Standalone mode the cluster is managed by Spark itself.</span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><br /></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;"><br /></span></div><h2 style="line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">On top of other Cluster Managers -</span></h2><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span 
style="font-family: "georgia" , serif; font-size: 16pt;">Apache Spark can also run on other cluster managers like YARN, Kubernetes or </span><span style="font-family: georgia, serif; font-size: 21.3333px;">Mesos</span><span style="font-family: georgia, serif; font-size: 16pt;">. However, the most popular cluster manager used in industry for Spark is YARN, because of its good compatibility with HDFS and the other benefits it brings, like data locality and dynamic allocation.</span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><br /></div><o:p></o:p><o:p></o:p><br /><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">The command used to submit a Spark job is the same in Standalone and the other cluster modes.</span><br /><span style="font-family: "georgia" , serif; font-size: 16pt;"><br /></span><br /><div class="MsoNormal" style="color: black; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif;">For Python applications, in place of a JAR we need to pass our .py file as <application-jar>, and add Python dependencies like modules, .zip, .egg or .py files in --py-files.<o:p></o:p></span></div><div class="MsoNormal" style="color: black; line-height: normal;"></div><span style="font-family: "georgia" , serif; font-size: 16pt;"></span><br /><div class="MsoNormal" style="color: black; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif;">Click to see <span style="color: #2e2e2e; text-decoration-line: none;"><a 
href="https://spark.apache.org/docs/latest/configuration.html">#other spark properties options</a></span></span><br /><span style="color: #2e2e2e; font-family: "georgia" , serif;"><br /></span></div></div></div><table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="border-collapse: collapse; border: none; color: #2e2e2e;"><tbody><tr><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><b>Scala Spark<o:p></o:p></b></div></td><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><b>PySpark</b><o:p></o:p></div></td></tr><tr><td style="background: rgb(242, 242, 242); border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;">spark-submit \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> --class <main-class> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> --master <master-url> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> --deploy-mode <deploy-mode> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> --conf <key>=<value> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> ... 
# other spark properties options<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> <application-jar> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> [application-arguments]<o:p></o:p></div></td><td style="background: rgb(242, 242, 242); border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;">spark-submit \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> --master <master-url> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> --deploy-mode <deploy-mode> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> --conf <key>=<value> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> ... # other Spark properties options<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> --py-files <python-modules-jars> \<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> my_application.py<o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"> [application-arguments]<o:p></o:p></div></td></tr></tbody></table><div style="color: #2e2e2e;"><div style="text-align: center;"><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; text-align: left;"><span style="font-family: "georgia" , serif;"><span style="font-size: x-small;">Table 1: Spark-submit command in Scala and Python</span></span><o:p></o:p></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; 
background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; text-align: left;"><br /><span style="font-family: "georgia" , serif; font-size: 21.3333px;">When you submit a job in Spark, the application jar (the job code) is distributed to all worker nodes, along with any additional jar files (if mentioned).</span></div></div></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal; margin-bottom: 0.0001pt;"><h2 style="color: #2e2e2e;"></h2><h2 style="line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;"></span></h2><h2 style="line-height: 24pt; margin: 0in 0in 0.0001pt; text-align: center;"><b style="text-indent: -0.25in;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;"><br /></span></span></b></h2><h3 style="line-height: 24pt; margin: 0in 0in 0.0001pt; text-align: center;"><span style="font-weight: normal; text-indent: -0.25in;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;"><i>How to submit a Spark Job in Standalone Cluster vs Cluster managed by other Cluster Managers?</i></span></span></span></h3><div><b style="text-indent: -0.25in;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;"><br /></span></span></b></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;"><span>The answer to this question is simple. 
You need to use the "--master" option shown in the above spark submit command and pass the master url of the cluster e.g.</span><o:p></o:p></span></div><table border="0" cellpadding="0" cellspacing="0" class="MsoNormalTable" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-collapse: collapse;"><tbody><tr><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">Mode<o:p></o:p></span></div></td><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">Value of “--master”<o:p></o:p></span></div></td></tr><tr><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">For Standalone deployment mode<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">--master spark://HOST:PORT<o:p></o:p></span></div></td></tr><tr><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">For 
Mesos<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">--master mesos://HOST:PORT</span><span style="font-family: "times new roman" , serif; font-size: 12pt;"><o:p></o:p></span></div></td></tr><tr><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">For Yarn<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">--master yarn<o:p></o:p></span></div></td></tr><tr><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">Local<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">--master local[*] :: * = number of threads<o:p></o:p></span></div></td></tr></tbody></table><div style="text-align: justify;"><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; 
background-size: initial; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif;"><span style="font-size: x-small;">Table 2: Spark-submit --master for different Spark deployment modes</span></span><o:p></o:p></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="color: #2e2e2e; font-family: "georgia" , serif; font-size: 16pt;"><o:p></o:p></span></div><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><o:p></o:p></div></div><div style="text-align: justify;"><br /></div><div><div class="MsoNormal" style="color: #2e2e2e; line-height: normal;"><br /></div><div class="MsoNormal" style="color: #2e2e2e; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">By now we have talked a lot about the cluster deployment mode;<b><i> now we need to understand the application "--deploy-mode"</i></b>. The deployment modes discussed above are cluster deployment modes and are different from the "--deploy-mode" mentioned in the spark-submit</span><span style="font-family: "georgia" , serif; font-size: 21.3333px;">(table 1)</span><span style="font-family: "georgia" , serif; font-size: 16pt;"> command. --deploy-mode is the application (or driver) deploy mode, which tells Spark how to run the job on the cluster (as already mentioned, the cluster can be Standalone, YARN, or Mesos). 
For an application to run on the cluster there are two --deploy-modes: one is client and the other is cluster mode.</span></div><div class="MsoNormal" style="color: #2e2e2e; line-height: normal;"><br /></div><h2 style="color: #2e2e2e; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;"><br /></span></h2><h3 style="line-height: 24pt; margin: 0in 0in 0.0001pt;"><b style="text-indent: -0.25in;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: large;">Spark Deploy Modes for Application:-</span><span style="font-size: x-large;"> </span></span></b></h3><div><b style="text-indent: -0.25in;"><span style="color: #00b0f0; font-family: "georgia" , serif;"><span style="font-size: x-large;"><br /></span></span></b></div><div class="MsoNormal" style="color: #2e2e2e; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;"><b>Client Mode: -</b> The driver runs on the machine where the job is submitted.</span><br /><span style="font-family: "georgia" , serif; font-size: 16pt;"><br /></span><span style="font-family: "georgia" , serif; font-size: 16pt;"><b>Cluster Mode: -</b> The driver runs inside the cluster. 
In this case the Resource Manager/Master decides on which node the driver will run.</span></div><div class="MsoNormal" style="color: #2e2e2e; line-height: normal;"><br /></div><h2 style="color: #2e2e2e; line-height: normal;"><span style="font-weight: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">Now the question arises</span> -</span></h2><div><h3 style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><b><i><span style="color: #00b0f0; font-family: "times new roman" , serif; font-size: 18pt;">"How to submit a job in Cluster or Client mode and which one is better?"</span></i></b></h3></div><div class="MsoNormal" style="color: #2e2e2e; line-height: normal;"><o:p></o:p></div><div style="color: #2e2e2e;"><br /></div><h2 style="color: #2e2e2e; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;"><i>How to submit:-</i></span></h2><div class="MsoNormal" style="color: #2e2e2e; line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">The spark-submit command is already shown above. 
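For instance, the two deploy modes might be exercised like this (a sketch; the class, jar, and host names are hypothetical):

```shell
# Cluster mode on YARN: the driver itself is launched inside the cluster,
# so the submitting machine can disconnect after submission.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  my-app.jar

# Client mode on a standalone cluster: the driver stays on the
# submitting machine for as long as the job runs.
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode client \
  my-app.jar
```
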
For the deploy mode, we just need to pass "--deploy-mode client" for client mode and "--deploy-mode cluster" for cluster mode.</span></div></div></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0.0001pt;"><span style="color: black; font-family: "georgia" , serif; font-size: 18pt;"><br /></span></div></div><div class="MsoNormal" style="background-color: white; margin: 0px; outline: 0px; padding: 0px; transition: all 0.2s ease 0s;"><div style="color: #2e2e2e;"><table border="1" cellpadding="0" cellspacing="0" class="MsoTableGrid" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-collapse: collapse; border: none;"><tbody><tr style="mso-yfti-firstrow: yes; mso-yfti-irow: 0;"><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><b style="mso-bidi-font-weight: normal;">Client<o:p></o:p></b></div></td><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div align="center" class="MsoNormal" style="line-height: normal; margin-bottom: 0in; text-align: center;"><b style="mso-bidi-font-weight: normal;">Cluster<o:p></o:p></b></div></td></tr><tr style="mso-yfti-irow: 1;"><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 14px;">The job fails if the driver is disconnected</span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; 
margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 14px;">After submitting the job, the client can disconnect.</span></div></td></tr><tr style="mso-yfti-irow: 2;"><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">The driver runs on the machine where the job is submitted.</span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">The driver runs inside the cluster. The Resource Manager or Master decides on which node the driver will run.</span></div></td></tr><tr style="mso-yfti-irow: 3;"><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">Can be used to work with Spark in an interactive manner. 
Performing actions on an RDD or DataFrame (like count) and capturing them in logs becomes easy.<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">Cannot be used to work with Spark in an interactive manner.<o:p></o:p></span></div></td></tr><tr style="mso-yfti-irow: 4;"><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">Jars can be accessed from the client<span style="mso-spacerun: yes;"> </span>machine.<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">Since the driver runs on a different machine than the client, the jars present on the local machine won’t work. 
Those jars should be made available to all nodes, either by placing them on each node or by passing them via --jars or --py-files during spark-submit.<o:p></o:p></span></div></td></tr><tr style="mso-yfti-irow: 5;"><td colspan="2" style="border: 1pt solid; padding: 0in 5.4pt; width: 467.5pt;" valign="top" width="623"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><b style="mso-bidi-font-weight: normal;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">YARN:-</span></b><span face="" style="color: #1d1f22; font-size: 10.5pt;"><o:p></o:p></span></div></td></tr><tr style="mso-yfti-irow: 6;"><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">The Spark driver does not run on the YARN cluster; only the executors run inside the YARN cluster.<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><br /></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">The Spark driver and executors both run on the YARN cluster.<o:p></o:p></span></div></td></tr><tr style="mso-yfti-irow: 7; mso-yfti-lastrow: yes;"><td style="border: 1pt solid; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">The local dir used by the driver is spark.local.dir and for the executors it is the YARN config </span><code style="border-radius: 3px;"><span style="border: 1pt none; color: #444444; font-family: "lucida console"; font-size: 9pt; padding: 0in;">yarn.nodemanager.local-dirs<span style="float: 
none;">.</span></span></code><span face="" style="color: #1d1f22; font-size: 10.5pt;"><o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 233.75pt;" valign="top" width="312"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #1d1f22; font-size: 10.5pt;">The local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config </span><code style="border-radius: 3px;"><span style="border: 1pt none; color: #444444; font-family: "lucida console"; font-size: 9pt; padding: 0in;">yarn.nodemanager.local-dirs</span></code><span face="" style="color: #1d1f22; font-size: 10.5pt;"><span style="float: none;">)</span><o:p></o:p></span></div></td></tr></tbody></table><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif;"><span style="font-size: x-small;">Table 3: Spark Client Vs Cluster Mode</span></span><o:p></o:p></div><div style="text-align: center;"><br /></div><div><span style="font-family: "georgia" , serif; font-size: 16pt; line-height: 22.8267px;">Here are some examples of submitting a job in different modes:-</span></div><table border="0" cellpadding="0" cellspacing="0" class="MsoNormalTable" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; border-collapse: collapse;"><tbody><tr><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in; width: 99.85pt;" valign="top" width="133"><div class="MsoNormal" style="line-height: normal; margin-bottom: 
0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">Mode<o:p></o:p></span></div></td><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in 5.4pt; width: 192.15pt;" valign="top" width="256"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">Scala<o:p></o:p></span></div></td><td style="background: rgb(156, 194, 229); border: 1pt solid; padding: 0in 5.4pt; width: 175pt;" valign="top" width="233"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-family: "times new roman" , serif; font-size: 12pt;">PySpark<o:p></o:p></span></div></td></tr><tr><td style="border: 1pt solid; padding: 0in; width: 99.85pt;" valign="top" width="133"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">Local<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 192.15pt;" valign="top" width="256"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">./bin/spark-submit \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --class main_class \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --master local[8] \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> /path/to/examples.jar<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 175pt;" valign="top" width="233"><div 
class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">./bin/spark-submit \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --master local[8] \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> my_job.py<o:p></o:p></span></div></td></tr><tr><td style="border: 1pt solid; padding: 0in; width: 99.85pt;" valign="top" width="133"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">Spark Standalone: -<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><br /></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 192.15pt;" valign="top" width="256"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">./bin/spark-submit \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --class main_class \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --master spark://<ip-address>:7077 \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --deploy-mode cluster \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --supervise \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" 
style="color: #333333; font-size: 10.5pt;"> --executor-memory 10G \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --total-executor-cores 100 \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> /path/to/examples.jar<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 175pt;" valign="top" width="233"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">./bin/spark-submit \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">--master spark://<ip-add>:7077 \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --deploy-mode cluster \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --supervise \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --executor-memory 10G \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --total-executor-cores 100 \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> 
my_job.py<o:p></o:p></span></div></td></tr><tr><td style="border: 1pt solid; padding: 0in; width: 99.85pt;" valign="top" width="133"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">YARN cluster mode (use --deploy-mode client for client mode)<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 192.15pt;" valign="top" width="256"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;">./bin/spark-submit \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --class main_class \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --master yarn \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --deploy-mode cluster \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --executor-memory 10G \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --num-executors 50 \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> /path/to/examples.jar<o:p></o:p></span></div></td><td style="border-bottom: 1pt solid; border-left: none; border-right: 1pt solid; border-top: none; padding: 0in 5.4pt; width: 175pt;" valign="top" width="233"><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; 
font-size: 10.5pt;">./bin/spark-submit \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --master yarn \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --deploy-mode cluster \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --executor-memory 10G \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> --num-executors 50 \<o:p></o:p></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span face="" style="color: #333333; font-size: 10.5pt;"> my_job.py<o:p></o:p></span></div></td></tr></tbody></table><span style="font-family: "georgia" , "times new roman" , serif; font-size: large;"><span style="color: #333333;"></span></span></div><div style="color: #2e2e2e;"><div class="MsoNormal" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; line-height: normal;"><span style="font-family: "georgia" , serif;"><span style="font-size: x-small;">Table 4: Spark submit examples for different modes</span></span><o:p></o:p></div><br /><h3 style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: black; text-align: center;"><i><span style="color: #00b0f0;"><span style="font-size: 
x-large;">Client or Cluster mode? Which one is better?</span></span></i></h3><div><o:p></o:p><o:p></o:p></div><div class="MsoNormal" style="line-height: normal;"><span style="font-family: "georgia" , serif; font-size: 16pt;">Unlike cluster mode, in client mode the job will fail if the client machine is disconnected. Client mode is a good choice if you want to work with Spark interactively, or if you don’t want to use up cluster resources for the driver daemon; in that case, make sure your client machine has sufficient RAM. When dealing with huge data sets and calling actions on RDDs or DataFrames, you also need to make sure you have sufficient resources available on the client. We have seen many customers using client mode. Neither mode is inherently better than the other; choose the deploy mode that best suits your requirements.</span><br /><span style="font-family: "georgia" , serif; font-size: 16pt;"><br /></span></div><span style="color: #292929; font-size: large; letter-spacing: -0.0666667px;"><i>Click here to check out our Udemy course</i></span><span style="color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"> </span><span style="background-color: #fcff01; color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"><a href="https://www.udemy.com/course/spark-scala-coding-best-practices-data-pipeline/?referralCode=DBA026944F73C2D356CF" rel="nofollow" target="_blank">Spark Scala Coding Framework and BestPractices</a> </span></div><div style="color: #2e2e2e;"><span style="background-color: #fcff01; color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"><br /></span></div><div style="color: #2e2e2e;"><span style="background-color: #fcff01; color: #292929; font-size: x-large; letter-spacing: -0.0666667px;"><br /></span></div><div style="color: #2e2e2e;"><span style="background-color: #fcff01; color: #292929; font-size: x-large; 
letter-spacing: -0.0666667px;"><br /></span></div></div>FutureXskillshttp://www.blogger.com/profile/15942079652867152508noreply@blogger.com0tag:blogger.com,1999:blog-5300390759647895792.post-45638708087771980342020-08-29T12:52:00.012-07:002020-09-02T02:38:55.069-07:00Capture bad records while loading csv in spark Dataframe<div style="text-align: left;"><p style="text-align: justify;"><span style="font-size: large;">Loading a CSV file and capturing all the bad records is a very common requirement in ETL projects. </span><span style="font-size: large;">Most of the relational database loaders, like SQL Loader or nzload, provide this feature, but when it comes to Hadoop and Spark (2.2.0) there is no direct solution for this.</span></p><div class="MsoNormal" style="text-align: justify;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEht3n_kzLfr_G2i7uA90VgrtC4tg2RXr6OToiK2DGO21p3Lub_6ioRynQtYuSnXjZbsVEQZqiMv1MQXZUTdMwQL225NvmSMGu0-alZAgYdj6nxk8L22MMexZSKdiz3RidnhUJ6WKMljGTc/s1600/Capture+bad+records+while+loading+a+csv+file+in+Spark+DataFrame+through+spark.read.csv%2528%2529.jpg" style="margin-left: 1em; margin-right: 1em;"><span id="goog_1686425456"></span><img alt="pySpark - Capture bad records while loading csv in Spark Data Frame" border="0" data-original-height="526" data-original-width="1282" height="262" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEht3n_kzLfr_G2i7uA90VgrtC4tg2RXr6OToiK2DGO21p3Lub_6ioRynQtYuSnXjZbsVEQZqiMv1MQXZUTdMwQL225NvmSMGu0-alZAgYdj6nxk8L22MMexZSKdiz3RidnhUJ6WKMljGTc/s640/Capture+bad+records+while+loading+a+csv+file+in+Spark+DataFrame+through+spark.read.csv%2528%2529.jpg" title="Capture bad records while loading csv in Spark" width="640" /><span id="goog_1686425457"></span></a></div></div><div class="MsoNormal" style="text-align: justify;"><span style="font-size: large;">However, a solution to this problem is available in<span face="" 
style="background: white; color: #404040; line-height: 25.68px;"> </span><a href="https://docs.databricks.com/release-notes/runtime/3.0.html" style="box-sizing: border-box;"><span class="doc"><span face="" style="background: white; color: #00697b; line-height: 25.68px;"><span style="box-sizing: border-box;">Databricks Runtime 3.0</span></span></span></a>, where you just need to provide a bad records path and all the bad records will get saved there.</span><o:p></o:p></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="color: #333333; font-family: "consolas";"><span style="font-size: large;"><br /></span></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-size: large;"><span style="color: #333333; font-family: "consolas";">df</span><span style="color: #404040; font-family: "consolas";"> <b>=</b> </span><span style="color: #333333; font-family: "consolas";">spark</span><b><span style="color: #404040; font-family: "consolas";">.</span></b><span style="color: #333333; font-family: "consolas";">read</span><span style="color: #404040; font-family: "consolas";"><o:p></o:p></span></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-size: large;"><span style="color: #404040; font-family: "consolas";"> <b>.</b></span><span style="color: #333333; font-family: "consolas";">option</span><span style="color: #404040; font-family: "consolas";">(</span><span style="color: #dd1144; font-family: "consolas";">"badRecordsPath"</span><span style="color: #404040; font-family: "consolas";">, </span><span style="color: #dd1144; font-family: "consolas";">"/data/badRecPath"</span><span style="color: #404040; font-family: "consolas";">)<o:p></o:p></span></span></div><div class="MsoNormal" style="line-height: normal; margin-bottom: 0in;"><span style="font-size: large;"><span style="color: #404040; font-family: "consolas";"> <b>.</b></span><span 
style="color: #333333; font-family: "consolas";">parquet</span><span style="color: #404040; font-family: "consolas";">(</span><span style="color: #dd1144; font-family: "consolas";">"/input/parquetFile"</span><span style="color: #404040; font-family: "consolas";">)<o:p></o:p></span></span></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><span style="font-size: large;">However, in previous Spark releases this method doesn’t work. We can achieve this in two ways:-</span></div><div class="MsoNormal"></div><ol><li><span style="font-size: large;">Read the input file as an RDD and then use the RDD transformation methods to filter the bad records</span></li><li><span style="font-size: large;">Use spark.read.csv()</span></li></ol><br /><div class="MsoListParagraphCxSpLast" style="mso-list: l0 level1 lfo1; text-indent: -0.25in;"><o:p></o:p></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><i style="font-size: large;"><span style="background-color: white; color: #292929; font-family: Domine, Arial, Helvetica, sans-serif; letter-spacing: -0.0666667px;">Click here to check out our Udemy course to learn<b> </b></span><span style="background-color: #fcff01; color: #292929; font-family: Domine, Arial, Helvetica, sans-serif; font-weight: bold; letter-spacing: -0.0666667px;"><a href="https://www.udemy.com/course/spark-scala-coding-best-practices-data-pipeline/?referralCode=DBA026944F73C2D356CF" rel="nofollow" style="color: black; text-decoration-line: none;" target="_blank">Spark Scala Coding Framework and BestPractices</a> </span></i></div><div class="MsoNormal"><i style="font-size: large;"><span style="background-color: #fcff01; color: #292929; font-family: Domine, Arial, Helvetica, sans-serif; font-weight: bold; 
letter-spacing: -0.0666667px;"><br /></span></i></div><div class="MsoNormal"><span style="font-size: large;">In this article we will see how we can capture bad records through spark.read.csv(). In order to load a file and capture bad records, we need to perform the following steps:-</span></div><div class="MsoNormal"><span style="font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; text-indent: -0.25in;"><span style="font-size: large;"><br /></span></span></div><div class="MsoNormal"></div><ol><li><span style="font-size: large;">Create a schema (StructType) for the input file with an extra column of string type (say bad_record) for corrupt records.</span></li><li><span style="font-size: large;">Call spark.read.csv() with all the required parameters and pass the bad record column name (the extra column created in step 1) as the parameter <span style="color: #1f497d; text-indent: -0.25in;">columnNameOfCorruptRecord.</span></span></li><li><span style="font-size: large;">Filter all the records where “bad_record” is not null and save them as a temp file.</span></li><li><span style="font-size: large;">Read the temporary file as csv (spark.read.csv) and pass the <span style="color: #1f497d; text-indent: -0.25in;"> </span><span style="color: #1f497d; text-indent: -0.25in;">same schema as above (step 1).</span></span></li><li><span style="font-size: large;">From the bad data-frame, select “bad_record”.</span></li></ol><br /><div class="MsoListParagraphCxSpLast" style="text-indent: -0.25in;"><o:p></o:p></div><div class="MsoNormal"><span style="color: #1f497d;"><br /></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">Step 5 will give you a data-frame having all the bad records.<o:p></o:p></span></span></div><div class="MsoNormal"><br /><span style="font-size: large;"><u><b>Code:-</b></u></span></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><span style="color: 
#1f497d;"><span style="font-size: large;"><o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">#####################Create Schema#####################################<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">>>> from pyspark.sql.types import StructType, StructField, IntegerType, StringType<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">>>> customSchema = StructType( [ <o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;"> StructField("order_number", IntegerType(), True),<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;"> StructField("total", StringType(), True),\<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;"> <span style="background: yellow; mso-highlight: yellow;">StructField("bad_record", StringType(), True)\</span><o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;"> ]<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;"> )<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">“bad_record” here is the bad records column.<o:p></o:p></span></span></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">#################Call spark.read.csv()####################<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">>>> orders_df = spark.read \<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">... .format('com.databricks.spark.csv') \<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">... 
.option("badRecordsPath", "/test/data/bad/")\<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">... .option("mode","PERMISSIVE")\<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">... .option("columnNameOfCorruptRecord", "bad_record")\<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">... .options(header='false', delimiter='|') \<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">... .load('/test/data/test.csv', schema = customSchema)<o:p></o:p></span></span></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">After calling spark.read.csv, if a record doesn’t satisfy the schema then null will be assigned to all the columns and a concatenated value of all columns will be assigned to the bad records column.<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">>>> orders_df.show()<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">+------------+------+-----------+<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">|order_number| total <span style="background: yellow; mso-highlight: yellow;">| bad_record|</span><o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">+------------+------+-----------+<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">| 1| 1000| 
null|<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">| 2| 4000| null|<o:p></o:p></span></span></div><div class="MsoNormal"><span style="color: #1f497d;"><span style="font-size: large;">| null| null| A|30|3000|<o:p></o:p></span></span></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><br /></div><div class="MsoNormal"><span style="font-size: large;">NOTE:-</span><br /><span style="font-size: large; text-indent: -0.25in;">Corrupt record columns are generated at run time when </span><span style="font-size: large; text-indent: -0.25in;">DataFrames are instantiated and data is actually fetched (by calling any action).</span><br /><span style="font-size: large; text-indent: -0.25in;">The output of the corrupt column depends on the other columns which are a part of the RDD in that particular action call.</span><br /><span style="font-size: large;">If the error-causing column is not a part of the action call, then bad_record won’t show any bad record.</span><br /><span style="font-size: large;">If you want to overcome this issue and want the bad_record to persist, then follow steps 3, 4 and 5 or use caching.</span></div><div><br /></div><div><div style="text-align: left;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #292929; letter-spacing: -0.05pt; line-height: 107%;"><span style="font-size: large;"><span style="background-color: white;"><i>Click here to check out our Udemy course to learn more about </i></span><span style="background-color: #fcff01;"><i><a href="https://www.udemy.com/course/spark-scala-coding-best-practices-data-pipeline/?referralCode=DBA026944F73C2D356CF" rel="nofollow" target="_blank">Spark Scala Coding Framework and BestPractices</a> </i></span></span></span></div><div style="text-align: 
left;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #292929; letter-spacing: -0.05pt; line-height: 107%;"><span style="font-size: large;"><span style="background-color: #fcff01;"><br /></span></span></span></div><div style="text-align: left;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #292929; letter-spacing: -0.05pt; line-height: 107%;"><span style="font-size: large;"><span style="background-color: #fcff01;"><br /></span></span></span></div><div style="text-align: left;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #292929; letter-spacing: -0.05pt; line-height: 107%;"><span style="font-size: large;"><span style="background-color: #fcff01;"><br /></span></span></span></div><div style="text-align: left;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #292929; letter-spacing: -0.05pt; line-height: 107%;"><span style="font-size: large;"><span style="background-color: #fcff01;"><br /></span></span></span></div><div style="text-align: left;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #292929; letter-spacing: -0.05pt; line-height: 107%;"><span style="font-size: large;"><span style="background-color: #fcff01;"><br 
/></span></span></span></div><div style="text-align: left;"><span style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; background-position: initial; background-repeat: initial; background-size: initial; color: #292929; letter-spacing: -0.05pt; line-height: 107%;"><span style="font-size: large;"><span style="background-color: #fcff01;"><br /></span></span></span></div><p class="MsoNormal"><o:p></o:p></p></div><div style="font-family: "times new roman"; margin: 0px;"></div></div>FutureXskillshttp://www.blogger.com/profile/15942079652867152508noreply@blogger.com0tag:blogger.com,1999:blog-5300390759647895792.post-42896744801978099092020-08-29T12:12:00.004-07:002020-08-29T14:22:32.725-07:00Structured Streaming Data storage in Hive Table<p><span face="" style="background-color: white; color: #292929; font-size: 21px; letter-spacing: -0.003em;">In this post we talk about how you can read data from files using Spark Structured Streaming and store the output in a Hive table</span></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTJTvIlBfBj0yrz4T6U5m9XouWOrqzpQOgh6mdJaEzgaYcQVbfNRvSniBMYiqCweu0I8xgsro60UrncpUMEkSG5PCzl6pg9Gt5WqYa3Au0dW5E01e8qigp3taoLOYfJzab1lBC58vLnit-/s1034/stream-hive.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="444" data-original-width="1034" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTJTvIlBfBj0yrz4T6U5m9XouWOrqzpQOgh6mdJaEzgaYcQVbfNRvSniBMYiqCweu0I8xgsro60UrncpUMEkSG5PCzl6pg9Gt5WqYa3Au0dW5E01e8qigp3taoLOYfJzab1lBC58vLnit-/s640/stream-hive.png" width="640" /></a></div><br /><span face="" style="background-color: white; color: #292929; font-size: 21px; letter-spacing: -0.003em;"><br /></span><p></p><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" 
data-selectable-paragraph="" id="0117" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;">Build a Streaming App</p><pre style="background: rgb(242, 242, 242); box-sizing: inherit; color: rgba(0, 0, 0, 0.8); margin-bottom: 0px; margin-top: 56px; overflow-x: auto; padding: 20px;"><span style="box-sizing: inherit; color: #292929; display: block; font-family: Menlo, Monaco, 'Courier New', Courier, monospace; font-size: 16px; letter-spacing: -0.022em; line-height: 1.18; white-space: pre-wrap;">import org.apache.spark.SparkConf<br style="box-sizing: inherit;" />import org.apache.spark.sql.SparkSession<br style="box-sizing: inherit;" />import org.apache.spark.sql.streaming.OutputMode<br style="box-sizing: inherit;" />import org.apache.spark.sql.types.{StringType, StructField, StructType}<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />object StructuredStreamingSaveToHive {<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />  def main(args: Array[String]): Unit = {<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />    println("Structured Streaming Demo")<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />    val conf = new SparkConf().setAppName("Spark Structured Streaming").setMaster("local[*]")<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />    val spark = SparkSession.builder.config(conf).getOrCreate()<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />    println("Spark Session created")<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />    val schema = StructType(Array(StructField("empId", StringType), StructField("empName", StringType)))<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />    // Read CSV files as they arrive in C:\inputDir (create this directory first and keep it empty)<br style="box-sizing: inherit;" />    val streamDF = spark.readStream.option("header", "true").schema(schema).csv("C:\\inputDir")<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />    // Append each micro-batch as CSV files under the "hivelocation" output path<br style="box-sizing: inherit;" />    val query = streamDF.writeStream.outputMode(OutputMode.Append()).format("csv")<br style="box-sizing: inherit;" />      .option("path", "hivelocation").option("checkpointLocation", "location1").start()<br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" />    query.awaitTermination()<br style="box-sizing: inherit;" />  }<br style="box-sizing: inherit;" />}</span></pre><p 
class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="eda5" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;">pom.xml</p><pre class="jb jc jd je jf fz jg jh" style="background: rgb(242, 242, 242); box-sizing: inherit; color: rgba(0, 0, 0, 0.8); margin-bottom: 0px; margin-top: 56px; overflow-x: auto; padding: 20px;"><span class="cw ji jj bi jk b cm jl jm r jn" data-selectable-paragraph="" id="e416" style="box-sizing: inherit; color: #292929; display: block; font-family: Menlo, Monaco, "Courier New", Courier, monospace; font-size: 16px; letter-spacing: -0.022em; line-height: 1.18; margin-bottom: -0.09em; margin-top: -0.09em; white-space: pre-wrap;"><?xml version="1.0" encoding="UTF-8"?><br style="box-sizing: inherit;" /><project xmlns="http://maven.apache.org/POM/4.0.0"<br style="box-sizing: inherit;" /> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"<br style="box-sizing: inherit;" /> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><br style="box-sizing: inherit;" /> <modelVersion>4.0.0</modelVersion><br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" /> <groupId>org.example</groupId><br style="box-sizing: inherit;" /> <artifactId>FuturexMiscSparkScala</artifactId><br style="box-sizing: inherit;" /> <version>1.0-SNAPSHOT</version><br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" /> <dependencies><br style="box-sizing: inherit;" /> <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core --><br style="box-sizing: inherit;" /> <dependency><br style="box-sizing: inherit;" /> <groupId>org.apache.spark</groupId><br style="box-sizing: inherit;" /> <artifactId>spark-core_2.11</artifactId><br style="box-sizing: inherit;" /> <version>2.4.3</version><br style="box-sizing: 
inherit;" /> </dependency><br style="box-sizing: inherit;" /> <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql --><br style="box-sizing: inherit;" /> <dependency><br style="box-sizing: inherit;" /> <groupId>org.apache.spark</groupId><br style="box-sizing: inherit;" /> <artifactId>spark-sql_2.11</artifactId><br style="box-sizing: inherit;" /> <version>2.4.3</version><br style="box-sizing: inherit;" /> </dependency><br style="box-sizing: inherit;" /><br style="box-sizing: inherit;" /> <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive --><br style="box-sizing: inherit;" /> <dependency><br style="box-sizing: inherit;" /> <groupId>org.apache.spark</groupId><br style="box-sizing: inherit;" /> <artifactId>spark-hive_2.11</artifactId><br style="box-sizing: inherit;" /> <version>2.4.3</version><br style="box-sizing: inherit;" /> <scope>compile</scope><br style="box-sizing: inherit;" /> </dependency><br style="box-sizing: inherit;" /> </dependencies><br style="box-sizing: inherit;" /></project></span></pre><ol style="background-color: white; box-sizing: inherit; color: rgba(0, 0, 0, 0.8); list-style: none none; margin: 0px; padding: 0px;"><li class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja jp jq jr cw" data-selectable-paragraph="" id="5701" style="box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 2em; padding-left: 0px;">Keep the C:\\inputDir directory initially empty.</li><li class="id ie bi if b ig js ii ij ik jt im in io ju iq ir is jv iu iv iw jw iy iz ja jp jq jr cw" data-selectable-paragraph="" id="acfd" style="box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">Start the program and it will be waiting to 
stream.</li><li class="id ie bi if b ig js ii ij ik jt im in io ju iq ir is jv iu iv iw jw iy iz ja jp jq jr cw" data-selectable-paragraph="" id="2128" style="box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">Then copy the files (file1, file2, file3 below) into the C:\inputDir directory one file at a time and watch the output appear in the “hivelocation” directory under your project root folder.</li></ol><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="fea3" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;"><span class="if jx" style="box-sizing: inherit; font-weight: 700;">file1.txt</span></p><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="b44d" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;">empId,empName<br style="box-sizing: inherit;" />1,Chris<br style="box-sizing: inherit;" />2,Neil</p><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="490b" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;"><span class="if jx" style="box-sizing: inherit; font-weight: 700;">file2.txt</span></p><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="6d1a" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;">empId,empName<br style="box-sizing: inherit;" />3,John<br style="box-sizing: inherit;" />4,Paul</p><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="5ce7" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;"><span class="if jx" style="box-sizing: inherit; font-weight: 700;">file3.txt</span></p><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="d352" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;">empId,empName<br style="box-sizing: inherit;" />5,Kathy<br style="box-sizing: inherit;" />6,Ana</p><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="11f3" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;">You can create a Hive table pointing to the “hivelocation” directory and see the data getting populated incrementally.</p><p class="id ie bi if b ig ih ii ij ik il im in io ip iq ir is it iu iv iw ix iy iz ja gf cw" data-selectable-paragraph="" id="82e9" style="background-color: white; box-sizing: inherit; color: #292929; font-size: 21px; letter-spacing: -0.003em; line-height: 32px; margin: 2em 0px -0.46em; word-break: break-word;">To learn more about Spark Scala Coding Framework and Best Practices, check out our Udemy course <a class="bw dm jy jz ka kb" href="https://www.udemy.com/course/spark-scala-coding-best-practices-data-pipeline/?referralCode=DBA026944F73C2D356CF"
rel="noopener nofollow" style="-webkit-tap-highlight-color: transparent; background-image: url("data:image/svg+xml;utf8,<svg preserveAspectRatio=\"none\" viewBox=\"0 0 1 1\" xmlns=\"http://www.w3.org/2000/svg\"><line x1=\"0\" y1=\"0\" x2=\"1\" y2=\"1\" stroke=\"rgba(41, 41, 41, 1)\" /></svg>"); background-position: 0px 50%; background-repeat: repeat-x; background-size: 1px 1px; box-sizing: inherit; text-decoration-line: none;" target="_blank">https://www.udemy.com/course/spark-scala-coding-best-practices-data-pipeline/?referralCode=DBA026944F73C2D356CF</a></p>FutureXskillshttp://www.blogger.com/profile/03954397681463869119noreply@blogger.com0
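The post mentions creating a Hive table over the streaming job's output directory. A minimal sketch of that DDL follows; note that the table name "employees" is made up for illustration, and the LOCATION path is an assumption about where the job's relative "hivelocation" output actually lands on your cluster, so adjust both to your environment.

```sql
-- Sketch only: "employees" is a hypothetical table name, and the LOCATION
-- assumes the job's "hivelocation" output directory resolves to this path.
-- Columns mirror the schema used in the Scala job above.
CREATE EXTERNAL TABLE IF NOT EXISTS employees (
  empId   STRING,
  empName STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/hivelocation';
```

Because the sink appends a new CSV part file per micro-batch into the same directory, re-running a simple SELECT against this external table after each batch shows the newly arrived rows without any reload step.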