Sqoop_Import
---------Sqoop : Check the database connection by listing databases---------------------------
sqoop list-databases --connect jdbc:mysql://quickstart.cloudera:3306 --username root --password cloudera
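Hard-coding the password is fine for this sandbox VM; as an alternative sketch, Sqoop can prompt for the password at the console with -P instead of --password:
sqoop list-databases --connect jdbc:mysql://quickstart.cloudera:3306 --username root -P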
---------Sqoop : Check the tables in the retail_db database------------------------------------
sqoop list-tables --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera
---------Sqoop : Check the number of records in the orders table using sqoop eval -------------
sqoop eval --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --query "select count(*) from orders"
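As a cross-check, the same count can be read directly from MySQL (assuming the mysql client is available, as it is on the Cloudera quickstart VM):
mysql -uroot -pcloudera -e "select count(*) from retail_db.orders"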
----------------------------------Import------------------------------------------------------
Q : Import the orders table from MySQL into HDFS with all default parameters.
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --table orders
The job runs with 4 mappers by default, so it should have created 4 part files in the destination directory. Let's check:
hdfs dfs -ls /user/cloudera/orders/part* | wc -l
Let's check whether the number of lines in the files equals the number of records in the table.
hdfs dfs -cat /user/cloudera/orders/part* | wc -l
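The line count should match the table count; re-running the eval query from above gives the number to compare against:
sqoop eval --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --query "select count(*) from orders"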
Q : Import the orders table from retail_db into HDFS in textfile format with only one mapper, so a single output file is produced. (Make sure to clean up the HDFS directory before running the job.)
Delete the directory before running the job: hdfs dfs -rm -r /user/cloudera/orders
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --table orders -m 1
Let's check that only one file was created in the target directory.
hdfs dfs -ls /user/cloudera/orders/part* | wc -l
Verify the number of records in the file.
hdfs dfs -cat /user/cloudera/orders/part* | wc -l
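To spot-check the content as well as the count, print the first few records (Sqoop's default text output is comma-delimited):
hdfs dfs -cat /user/cloudera/orders/part* | head -5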
Q : Import the orders table from retail_db. Import only the first 30000 orders and save the output in the HDFS directory /user/cloudera/orders.
Clean up the target directory: hdfs dfs -rm -r /user/cloudera/orders
# --table and --query cannot be used together. A free-form query must include the literal $CONDITIONS token in its WHERE clause and must also specify --split-by (or run with -m 1).
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --query "select * from orders where order_id < 30001 and \$CONDITIONS" --target-dir /user/cloudera/orders --split-by order_id
Another way to achieve the same result (here --delete-target-dir replaces the manual cleanup):
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --table orders --where "order_id < 30001" --target-dir /user/cloudera/orders --split-by order_id --delete-target-dir
Output check - number of records moved to HDFS:
hdfs dfs -cat /user/cloudera/orders/part* | wc -l
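Since order_id is the first column and the default field delimiter is a comma, the highest imported id can be verified with a small shell sketch (it should print 30000):
hdfs dfs -cat /user/cloudera/orders/part* | cut -d',' -f1 | sort -n | tail -1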
Q : Load the rest of the records from the orders table into HDFS. (Use sqoop append mode.)
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --table orders --where "order_id > 30000" --append --target-dir '/user/cloudera/orders'
Check that the total record counts in HDFS and in the MySQL database match.
hdfs dfs -cat /user/cloudera/orders/part* | wc -l
mysql -uroot -pcloudera -e "select count(*) from retail_db.orders"
Q : Import only selected columns (order_id, order_date, order_status) of the orders table into HDFS.
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --columns 'order_id,order_date,order_status' --table orders --delete-target-dir
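With no --target-dir given, the import lands in the default /user/cloudera/orders directory; as a quick sanity check, the first records should show exactly three comma-separated fields:
hdfs dfs -cat /user/cloudera/orders/part* | head -3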
Q : Import the first 30000 records of the orders table (columns order_id, order_date, order_status) from retail_db into Parquet files.
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --username root --password cloudera --table orders --columns 'order_id,order_date,order_status' --where 'order_id < 30001' --delete-target-dir --as-parquetfile --target-dir /user/cloudera/orders
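Parquet is a binary format, so hdfs dfs -cat is not human-readable here. Listing the directory should show .parquet files; if the parquet-tools utility happens to be installed (an assumption about the environment), it can print a few rows:
hdfs dfs -ls /user/cloudera/orders
# parquet-tools head <path-to-a-.parquet-file>   # only if parquet-tools is available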